Introduction

This guide presents all SDx™ development environment features related to performance analysis of a design, and it is structured to support the actual performance improvement effort. Dedicated sections cover the main sources of performance bottlenecks in the SDAccel™ environment: the accelerator, PCIe® bus transfers, and host code. Each section guides the developer from recognizing bottlenecks to solution approaches that increase overall system performance.

Note: Performance optimization assumes, as a starting point, a working design intended for performance improvement. If erroneous behavior is encountered, look for guidance in the SDAccel Environment Debugging Guide.

Similarly, the general concepts regarding coding of host code or accelerator kernels are not explained here; these concepts are introduced in the SDAccel Environment Programmers Guide.

SDAccel Execution Model

In the SDAccel framework, an application program is split between a host application and hardware accelerated kernels with a communication channel between them. The host application, written in C/C++ and using API abstractions like OpenCL, runs on an x86 server while hardware accelerated kernels run within the Xilinx FPGA. The API calls, managed by the Xilinx Runtime (XRT), are used to communicate with the hardware accelerators. Communication between the host x86 machine and the accelerator board, including control and data transfers, occurs across the PCIe bus. While control information is transferred between specific memory locations in hardware, global memory is used to transfer data between the host application and the kernels. Global memory is accessible by both the host processor and hardware accelerators, while host memory is only accessible by the host application.

For instance, in a typical application, the host will first transfer data, to be operated on by the kernel, from host memory into global memory. The kernel would subsequently operate on the data, storing results back to the global memory. Upon kernel completion, the host would transfer the results back into the host memory. Data transfers between the host and global memory introduce latency which can be costly to the overall acceleration. To achieve acceleration in a real system, the benefits achieved by hardware acceleration kernels must outweigh the extra latency of the data transfers. The general structure of this acceleration platform is shown in the following figure.
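
As a rough illustration of this trade-off, the following sketch models the end-to-end accelerated run time as transfer time plus kernel time and compares it with a CPU-only run. The function names, the single PCIe bandwidth figure, and the numbers in the usage note are hypothetical, and the model assumes that transfers do not overlap with computation.

```python
def accelerated_time(cpu_time_s, kernel_speedup, bytes_in, bytes_out, pcie_bytes_per_s):
    """End-to-end time with offload: host-to-device and device-to-host
    transfers plus kernel execution (no overlap assumed)."""
    transfer_s = (bytes_in + bytes_out) / pcie_bytes_per_s
    kernel_s = cpu_time_s / kernel_speedup
    return transfer_s + kernel_s


def system_speedup(cpu_time_s, kernel_speedup, bytes_in, bytes_out, pcie_bytes_per_s):
    """Overall speedup relative to the CPU-only run; below 1.0 means a slowdown."""
    total_s = accelerated_time(cpu_time_s, kernel_speedup,
                               bytes_in, bytes_out, pcie_bytes_per_s)
    return cpu_time_s / total_s
```

With these assumed numbers, a kernel that is 10x faster than a 1 s CPU function still loses overall if it must move 4 GB in each direction over an 8 GB/s link (about 1.1 s in total), while moving 0.4 GB in each direction yields roughly a 5x system speedup.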

The FPGA hardware platform, on the right-hand side, contains the hardware accelerated kernels and global memory, along with the DMA for memory transfers. Kernels can have one or more global memory interfaces and are programmable. The SDAccel execution model can be broken down into these steps:

1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
2. The host application sets up the kernel with its input parameters.
3. The host application triggers the execution of the kernel function on the FPGA.
4. The kernel performs the required computation while reading data from global memory, as necessary.
5. The kernel writes data back to global memory and notifies the host that it has completed its task.
6. The host application reads data back from global memory into the host memory and continues processing as needed.
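
The sequence of steps above can be mimicked with a toy model in plain Python. This is only a sketch of the data flow: the `Device` class, its method names, and `scale_kernel` are hypothetical stand-ins, not the OpenCL or XRT API, and the "kernel" is an ordinary Python function.

```python
class Device:
    """Toy stand-in for the accelerator card: global memory plus one kernel."""

    def __init__(self, kernel):
        self.global_memory = {}  # buffer name -> list of values
        self.kernel = kernel     # Python function standing in for FPGA logic
        self.args = None

    def write_buffer(self, name, host_data):
        # Host writes input data into global memory (copy, as over PCIe).
        self.global_memory[name] = list(host_data)

    def set_args(self, **args):
        # Host sets up the kernel with its input parameters.
        self.args = args

    def run(self):
        # Host triggers the kernel; the kernel reads from global memory,
        # writes its result back to global memory, and reports completion.
        src = self.global_memory[self.args["src"]]
        self.global_memory[self.args["dst"]] = self.kernel(src, self.args["scale"])
        return "done"

    def read_buffer(self, name):
        # Host reads results back from global memory into host memory.
        return list(self.global_memory[name])


def scale_kernel(values, scale):
    """Hypothetical kernel: multiply every element by a constant."""
    return [v * scale for v in values]


def offload(host_input, scale):
    """Run the full write / set up / execute / read-back sequence once."""
    dev = Device(scale_kernel)
    dev.write_buffer("in", host_input)
    dev.set_args(src="in", dst="out", scale=scale)
    if dev.run() != "done":
        raise RuntimeError("kernel did not complete")
    return dev.read_buffer("out")
```

Note that the host never touches the kernel's working data directly: everything passes through `global_memory`, which is why the two copies in `write_buffer` and `read_buffer` are the transfer cost discussed earlier.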

The FPGA can accommodate multiple kernel instances at one time; these can be instances of different kernels or multiple instances of the same kernel. XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Build Process

The SDAccel environment offers all of the features of a standard software development environment:

- Optimized compiler for host applications
- Cross-compilers for the FPGA
- Robust debugging environment to help identify and resolve issues in the code
- Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software and the hardware elements of the project. As shown in the following figure, the host application is built through one process using the standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.

Host application build process using GCC:
- Each host application source file is compiled to an object file (.o).
- The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).

The FPGA build process is highlighted in the following figure:
- Each kernel is independently compiled to a Xilinx object (.xo) file.
  - C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
  - RTL kernels are compiled using the package_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process.
- The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
  - When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
  - When the build target is the system (actual hardware), xocc generates the FPGA binary for the device leveraging the Vivado Design Suite to run synthesis and implementation.
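
The two-stage FPGA build described above (compile each kernel to `.xo`, then link to `.xclbin`) can be sketched as a pair of helpers that assemble xocc command lines. The helper names and the platform and kernel names in the usage note are hypothetical. The flags used (`-c`, `-l`, `-t`, `--platform`, `-k`, `--nk`, `-o`) are believed to match the xocc command-line options, but confirm them against `xocc --help` for your release.

```python
def xocc_compile_cmd(kernel, source, target, platform):
    """Command line to compile one kernel source into a Xilinx object (.xo) file."""
    return ["xocc", "-c", "-t", target, "--platform", platform,
            "-k", kernel, "-o", f"{kernel}.xo", source]


def xocc_link_cmd(xo_files, output, target, platform, instances=None):
    """Command line to link .xo files with the platform into an FPGA binary.

    `instances` maps a kernel name to the number of instances to create,
    which is specified at link time.
    """
    cmd = ["xocc", "-l", "-t", target, "--platform", platform]
    for kernel, count in (instances or {}).items():
        cmd += ["--nk", f"{kernel}:{count}"]
    return cmd + ["-o", output] + list(xo_files)
```

For example, `xocc_link_cmd(["vadd.xo"], "vadd.xclbin", "hw", "my_platform", {"vadd": 2})` produces a link command requesting two instances of a hypothetical `vadd` kernel. The returned list can be passed to `subprocess.run` on a machine where xocc is installed.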

Note: The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.

Build Targets

The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of the FPGA binary generated by the build process.

The SDAccel tool provides three build targets. Two are emulation targets used for debug and validation purposes, and the third is the default hardware target used to generate the actual FPGA binary:

- Software Emulation (sw_emu): Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system.
- Hardware Emulation (hw_emu): The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
- System (hw): The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.
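
When the host executable is launched, the runtime needs to know whether it is talking to an emulation model or to real hardware. The sketch below prepares the process environment for each build target. It assumes that the runtime selects emulation through the `XCL_EMULATION_MODE` environment variable set to the target name, which is not stated in this guide and should be checked against your SDAccel release. The helper name is hypothetical.

```python
import os

EMULATION_TARGETS = ("sw_emu", "hw_emu")
ALL_TARGETS = EMULATION_TARGETS + ("hw",)


def run_environment(target, base_env=None):
    """Return a copy of the environment suitable for running the host
    application against an FPGA binary built for `target`."""
    if target not in ALL_TARGETS:
        raise ValueError(f"unknown build target: {target!r}")
    env = dict(os.environ if base_env is None else base_env)
    if target in EMULATION_TARGETS:
        # Assumption: emulation is selected by this variable.
        env["XCL_EMULATION_MODE"] = target
    else:
        # Hardware runs must not inherit a stale emulation setting.
        env.pop("XCL_EMULATION_MODE", None)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the host executable.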

SDAccel Optimization Flow Overview

The SDAccel environment is a complete software development environment for creating, compiling, and optimizing C/C++/OpenCL applications to be accelerated on Xilinx FPGAs. It includes three recommended flows for optimizing an application, each described in one of the sections that follow: baselining functionalities and performance, optimizing data movement, and optimizing kernel computation.

Baselining Functionalities and Performance

Before starting any optimization efforts, it is important to understand the performance of your application. This is achieved by baselining the application in terms of functionalities and performance.

Identify Bottlenecks

The first step is to identify the bottlenecks of the current application running on your existing platform. The most effective way is to run the application with profiling tools, like valgrind, callgrind, and GNU gprof. The profiling data generated by these tools show the call graph with the number of calls to all functions and their execution time. The functions that consume the most execution time are good candidates to be offloaded and accelerated onto FPGAs.
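
Once a profiler has reported per-function execution times, ranking them and applying Amdahl's law gives a quick upper bound on what offloading a function can achieve. The sketch below assumes the profile has already been reduced to a mapping of function name to seconds; the function names and the numbers in the usage note are hypothetical.

```python
def rank_hotspots(profile):
    """Return (function, fraction_of_total_time) pairs, largest first.

    `profile` maps a function name to its measured execution time in seconds.
    """
    total = sum(profile.values())
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return [(name, seconds / total) for name, seconds in ranked]


def amdahl_speedup(fraction, kernel_speedup):
    """Overall application speedup when a function that takes `fraction`
    of the run time is accelerated by `kernel_speedup` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)
```

For a profile such as `{"parse": 1.0, "filter": 8.0, "report": 1.0}`, `filter` accounts for 80% of the run time. Accelerating it 10x gives about a 3.6x application speedup, and no amount of acceleration of that one function can exceed 5x, which is a useful sanity check before investing in a kernel.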

Convert Target Functions

After the target functions are selected, convert them to OpenCL C kernels or C/C++ kernels without any optimization. The application code calling these kernels will also need to be converted to use OpenCL APIs for data movement and task scheduling.

TIP: Keep everything as simple as possible and minimize changes to the existing code, so you can quickly generate a working design on the FPGA and get the baseline performance and resource numbers.

Run Software and Hardware Emulation

Next, run software and hardware emulation to verify functional correctness and generate profiling data on the host code and the kernels. Analyze the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baseline performance estimates, such as timing, interval, and latency, and the resource utilization, such as DSP and block RAM usage.

Build and Run the Application

The last step in baselining is building and running the application on an FPGA acceleration card. Analyze the reports from the system compilation and the profiling data from application execution to see the actual performance and resource utilization.

TIP: Save all the reports during baselining, so you can reference and compare the results during optimization.

Optimizing Data Movement

In the OpenCL API, all data is first transferred from the host memory to the global memory on the device and then from the global memory to the kernel for computation. The computation results are written back from the kernel to the global memory and then from the global memory to the host memory. A key factor in determining strategies for kernel optimization is understanding how data can be efficiently moved around.

Note: Before optimizing computation, first optimize the data movement in the application.

During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement.

Note: Xilinx recommends that, for this optimization step, you modify the host code and kernels so that they contain data transfer code only.

The goal is to maximize the system-level data throughput by maximizing PCIe bandwidth usage and DDR bandwidth usage. Achieving this goal usually takes multiple iterations of software emulation, hardware emulation, and execution on FPGAs.
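
To track progress across these iterations, it helps to turn the measured transfer sizes and times into achieved bandwidth and compare that with the theoretical peak of each link. The sketch below is one way to do this. The function names, the measurement tuple layout, and the peak figures in the usage note are assumptions for illustration, not values taken from any report format.

```python
def achieved_bandwidth(bytes_moved, seconds):
    """Measured bandwidth of one transfer, in bytes per second."""
    return bytes_moved / seconds


def utilization(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of the link's theoretical peak that was actually used."""
    return achieved_bandwidth(bytes_moved, seconds) / peak_bytes_per_s


def limiting_link(measurements):
    """Name of the link with the lowest achieved bandwidth.

    `measurements` maps a link name to (bytes_moved, seconds, peak_bytes_per_s).
    """
    return min(measurements,
               key=lambda name: achieved_bandwidth(*measurements[name][:2]))
```

For instance, with `{"pcie": (1e9, 0.25, 8e9), "ddr": (1e9, 0.0625, 1.92e10)}` the PCIe transfer reaches 4 GB/s (50% of its assumed peak) and is the limiting link, so it is the one to optimize first.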

Optimizing Kernel Computation

One of the key benefits of an FPGA is that you can create custom logic for your specific application. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. The key metric during this step is the initiation interval (II), the number of clock cycles between the start of successive loop iterations or function invocations. A low II is generally achieved by expanding the processing code to match the data path with techniques such as function pipelining, loop unrolling, array partitioning, and dataflow. The SDAccel environment produces various compilation reports and profiling data during hardware emulation and system runs to assist your optimization effort. Refer to SDAccel Profiling and Optimization Features for details on the compilation and profiling reports.
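
The effect of the initiation interval on loop latency can be estimated with the usual pipelining formula: the first iteration completes after the iteration latency, and each later iteration starts II cycles after the previous one. The sketch below applies that formula. The function names and the example figures are illustrative, and real reports also include loop entry and exit overhead that this model ignores.

```python
def sequential_loop_cycles(trip_count, iteration_latency):
    """Cycles for a loop whose iterations run strictly one after another."""
    return trip_count * iteration_latency


def pipelined_loop_cycles(trip_count, initiation_interval, iteration_latency):
    """Cycles for a pipelined loop: one full iteration latency, plus one
    initiation interval for each additional iteration."""
    if trip_count == 0:
        return 0
    return iteration_latency + initiation_interval * (trip_count - 1)
```

For a 1024-iteration loop with a 4-cycle iteration latency, the unpipelined estimate is 4096 cycles, pipelining with II=2 gives 2050 cycles, and II=1 gives 1027 cycles, which shows why reaching II=1 is usually the first target.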