SDAccel Compilation Flow and Execution Model

The SDAccel™ development environment is a heterogeneous system architecture platform for accelerating compute-intensive tasks using Xilinx® FPGA devices. The SDAccel environment contains a host x86 machine that is connected to one or more Xilinx FPGA devices through a PCIe® bus, as shown below.

Figure: SDAccel Architecture

Programming Model

The SDAccel environment supports heterogeneous computing using the industry-standard OpenCL framework (https://www.khronos.org/opencl/). The host program executes on the host CPU and offloads compute-intensive tasks to Xilinx FPGA devices using the OpenCL programming paradigm. Unlike a CPU (or GPU), the FPGA can be thought of as a blank canvas: it has no predefined instruction set or fixed word size, and it can run the same or different operations in parallel to greatly improve the performance of your applications.

Device Topology

In the SDAccel environment, devices are one or more FPGAs connected to a host x86 machine through a PCIe bus. The FPGA contains a programmable region that implements and executes kernels. The FPGA platform contains one or more global memory banks. Data transfers from the host machine to kernels, and from kernels back to the host, happen through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connections from the global memory banks to those memory interfaces are configurable, and are determined by the kernel compilation options.

The programmable logic of Xilinx devices can implement multiple kernels at the same time, allowing for significant task parallelism. A single kernel can also be instantiated multiple times. The number of instances of a kernel is programmable and determined by the kernel compilation options.

Figure: SDAccel Architecture

The diagram above illustrates the flexible connections from the host application to multiple kernels through the global memory banks. The FPGA board device shown above contains four DDR memory banks. The programmable logic of the FPGA is running two kernels, Kernel A and Kernel B. Each kernel has two memory interfaces, one for reading data and another for writing. Also note that there are two instances of Kernel A, for a total of three simultaneous kernel instances on the FPGA.

In the diagram, the first instance of Kernel A: CU1 uses a single memory interface for reading and writing. Kernel B and the second instance of Kernel A: CU2 use different memory interfaces for reading and writing, with Kernel B essentially passing data directly to Kernel A: CU2 through the global memory.
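The connections described above can be written down as a small data structure. The following is a minimal sketch in plain C++, not an SDAccel API: the struct and function names are hypothetical, and the bank numbers are assumed for illustration because the text does not say which DDR bank each interface uses. Kernel A: CU1 is modeled as reading and writing the same bank.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one compute unit (kernel instance) and the
// global memory banks its read and write paths are connected to.
struct ComputeUnit {
    std::string kernel;    // kernel name, e.g. "Kernel A"
    std::string instance;  // instance name, e.g. "CU1"
    int read_bank;         // DDR bank the kernel reads from (assumed index)
    int write_bank;        // DDR bank the kernel writes to (assumed index)
};

// Topology loosely following the diagram description.
// All bank indices are assumptions made for this sketch.
inline std::vector<ComputeUnit> example_topology() {
    return {
        {"Kernel A", "CU1", 0, 0},  // reads and writes the same bank
        {"Kernel B", "CU1", 1, 2},  // reads bank 1, writes bank 2
        {"Kernel A", "CU2", 2, 3},  // reads what Kernel B wrote to bank 2
    };
}

// True when the producer writes to the bank the consumer reads from,
// so data can pass between them through global memory.
inline bool passes_data(const ComputeUnit& producer,
                        const ComputeUnit& consumer) {
    return producer.write_bank == consumer.read_bank;
}
```

With this layout, passes_data(topology[1], topology[2]) is true, which mirrors Kernel B handing its output to Kernel A: CU2 without a round trip through the host.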

Note: To achieve the best performance, the connections between global memory banks and kernel interfaces should be carefully defined, as discussed in Connecting Kernel Ports to Global Memory.

SDAccel Build Process

The SDAccel environment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software elements and the hardware elements of the project. As shown in the following figure, the host application is built through one process using the standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.

Figure: Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).
  2. FPGA build process is highlighted in the following figure:
    • Each kernel is independently compiled to a Xilinx object (.xo) file.
      • C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using the package_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process.
    • The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
      • When the build target is the system (actual hardware), xocc generates the FPGA binary for the device leveraging the Vivado Design Suite to run synthesis and implementation.

Figure: FPGA Build Process



Note: The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.
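The two-step FPGA build described above (compile each kernel to a .xo file, then link into a .xclbin) can be sketched as a pair of command lines. The helper below only assembles strings; it does not run xocc. The option names (-c, -l, -t, --platform, -k, -o, --nk) are given from memory of the xocc command-line reference and should be checked against the tool version in use. The function names, platform name, and file names are hypothetical.

```cpp
#include <string>

// Compile step: one kernel source file -> one Xilinx object (.xo) file.
// "target" is one of sw_emu, hw_emu, hw.
inline std::string xocc_compile_cmd(const std::string& target,
                                    const std::string& platform,
                                    const std::string& kernel,
                                    const std::string& source) {
    return "xocc -c -t " + target + " --platform " + platform +
           " -k " + kernel + " -o " + kernel + ".xo " + source;
}

// Link step: kernel .xo file -> FPGA binary (.xclbin).
// The number of kernel instances is chosen here, at link time.
inline std::string xocc_link_cmd(const std::string& target,
                                 const std::string& platform,
                                 const std::string& kernel,
                                 int instances,
                                 const std::string& xclbin) {
    return "xocc -l -t " + target + " --platform " + platform +
           " --nk " + kernel + ":" + std::to_string(instances) +
           " -o " + xclbin + " " + kernel + ".xo";
}
```

For example, xocc_link_cmd("hw", "my_platform", "krnl_idct", 2, "krnl_idct.xclbin") produces a link command that requests two instances of the kernel, which is the kind of architectural decision the link step is responsible for.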

Build Targets

The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of FPGA binary generated by the build process.

The SDAccel tool provides three different build targets: two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary.

Software Emulation (sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with application, and verifying the behavior of the system.
Hardware Emulation (hw_emu)
The kernel code is compiled into a hardware model (RTL), which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System (hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.

SDAccel Execution Model

In the SDAccel framework, an application program is split between a host application and hardware accelerated kernels with a communication channel between them. The host application, written in C/C++ and using API abstractions like OpenCL, runs on an x86 server while hardware accelerated kernels run within the Xilinx FPGA. The API calls, managed by the Xilinx Runtime (XRT), are used to communicate with the hardware accelerators. Communication between the host x86 machine and the accelerator board, including control and data transfers, occurs across the PCIe bus. While control information is transferred between specific memory locations in hardware, global memory is used to transfer data between the host application and the kernels. Global memory is accessible by both the host processor and hardware accelerators, while host memory is only accessible by the host application.

For instance, in a typical application, the host will first transfer data, to be operated on by the kernel, from host memory into global memory. The kernel would subsequently operate on the data, storing results back to the global memory. Upon kernel completion, the host would transfer the results back into the host memory. Data transfers between the host and global memory introduce latency which can be costly to the overall acceleration. To achieve acceleration in a real system, the benefits achieved by hardware acceleration kernels must outweigh the extra latency of the data transfers. The general structure of this acceleration platform is shown in the following figure.
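The trade-off between kernel speed and transfer latency can be made concrete with a little arithmetic. This is an illustrative sketch with made-up timings; the function names are hypothetical, and the model assumes that transfers and computation do not overlap.

```cpp
#include <cassert>

// Total time of the accelerated path in milliseconds:
// host-to-device transfer + kernel execution + device-to-host transfer.
inline double accelerated_time_ms(double h2d_ms, double kernel_ms,
                                  double d2h_ms) {
    return h2d_ms + kernel_ms + d2h_ms;
}

// Speedup relative to running the same task on the CPU alone.
// A value above 1.0 means acceleration pays off despite the transfers.
inline double speedup(double cpu_ms, double h2d_ms, double kernel_ms,
                      double d2h_ms) {
    const double total = accelerated_time_ms(h2d_ms, kernel_ms, d2h_ms);
    assert(total > 0.0);  // guard against division by zero
    return cpu_ms / total;
}
```

With a 100 ms CPU run, a 20 ms kernel, and 10 ms for each transfer, the speedup is 100 / 40 = 2.5. If the CPU run takes only 30 ms, the same kernel and transfers give 30 / 40 = 0.75, a slowdown.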

Figure: Architecture of an SDAccel Application



The FPGA hardware platform, on the right-hand side, contains the hardware-accelerated kernels and global memory, along with the DMA for memory transfers. Kernels are programmable and can have one or more global memory interfaces. The SDAccel execution model can be broken down into these steps:

  1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
  2. The host application sets up the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading data from global memory, as necessary.
  5. The kernel writes data back to global memory and notifies the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory and continues processing as needed.
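The six steps above can be traced in a plain C++ model. This sketch makes no OpenCL or XRT calls: the DeviceModel struct, the scaling "kernel", and all function names are hypothetical stand-ins chosen only to show the order of operations. In a real host program each step is carried out through OpenCL API calls.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the device: global memory buffers plus kernel arguments.
struct DeviceModel {
    std::vector<int> global_in;   // global memory read by the kernel
    std::vector<int> global_out;  // global memory written by the kernel
    int scale = 1;                // scalar kernel argument
    bool done = false;            // completion flag visible to the host
};

// Steps 4 and 5: the kernel reads global memory, computes, writes the
// results back to global memory, and signals completion.
inline void run_kernel(DeviceModel& dev) {
    dev.global_out.resize(dev.global_in.size());
    for (std::size_t i = 0; i < dev.global_in.size(); ++i) {
        dev.global_out[i] = dev.global_in[i] * dev.scale;
    }
    dev.done = true;
}

// Host side of the flow; returns the results copied back to host memory.
inline std::vector<int> host_flow(const std::vector<int>& host_data,
                                  int scale) {
    DeviceModel dev;
    dev.global_in = host_data;  // step 1: write input into global memory
    dev.scale = scale;          // step 2: set the kernel input parameters
    run_kernel(dev);            // step 3: trigger kernel execution
    std::vector<int> result;
    if (dev.done) {
        result = dev.global_out;  // step 6: read results into host memory
    }
    return result;
}
```

Calling host_flow({1, 2, 3}, 10) returns {10, 20, 30}. The two copies in steps 1 and 6 are the data transfers whose latency the kernel must outweigh.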

The FPGA can accommodate multiple kernel instances at one time; this can occur between different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Emulation Flows

The SDAccel environment development flow can be divided into two distinct steps. The first step is to compile the host and kernel code to generate the executables. The second step is to run the executables in a heterogeneous system composed of the host CPU and the SDAccel environment accelerator platform. However, the kernel compilation process is long and can take several hours, depending on the size of the kernels and the architecture of the target FPGA.

To save time and shorten the debug cycle before committing to a full kernel compilation, the SDAccel environment provides two other build targets for testing purposes: software emulation and hardware emulation. These emulation targets compile and execute significantly faster, and do not require the actual FPGA board. The emulation flows abstract the FPGA board, and its connection to the host machine, into software models that validate the combined functionality of the host and kernel code and provide some performance estimates early in the design process. These figures are only estimates, but they can greatly help in debugging and in identifying performance bottlenecks. Refer to the SDAccel Environment Debugging Guide for more information on debugging using software and hardware emulation flows.
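A host program sometimes needs to know whether it is running against an emulation model or the real board, for example to shrink the data set during emulation. The sketch below assumes the runtime selects the emulation flow through the XCL_EMULATION_MODE environment variable with the values sw_emu and hw_emu; that variable is stated here from general knowledge of the SDAccel runtime, not from this section. Treating an unset variable as a hardware run is an assumption of the sketch, and the function names are hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Maps the value of XCL_EMULATION_MODE to a build target name.
// A null pointer (variable not set) is treated as a hardware run.
inline std::string target_from_env_value(const char* value) {
    if (value == nullptr) {
        return "hw";
    }
    const std::string mode(value);
    if (mode == "sw_emu" || mode == "hw_emu") {
        return mode;
    }
    return "unknown";  // set, but not to a recognized emulation mode
}

// Reads the variable from the environment of the current process.
inline std::string current_target() {
    return target_from_env_value(std::getenv("XCL_EMULATION_MODE"));
}
```

Separating the pure mapping from the getenv call keeps the logic easy to check without changing the process environment.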

Software Emulation Flow

Compilation of the software emulation target is the fastest. It is mainly used for checking functional correctness when the host and kernel code are running together. The xocc compiler applies only the minimum transformation needed to run the kernel code in conjunction with the host code, so this flow helps check functional correctness at a very early stage, well before the final binary is created. The software emulation flow can be used for algorithm refinement, for debugging functional issues, and for letting developers iterate quickly through the code to make improvements.

Hardware Emulation Flow

In the hardware emulation flow, the xocc compiler generates a model of the kernel in a hardware description language (RTL Verilog). The hardware emulation flow helps check the functional correctness of the RTL synthesized from the C, C++, or OpenCL kernel code, before the final binary is created. If the system does not work as expected, the hardware emulation flow offers strong debug capability through the waveform viewer.

Table 1. Comparison of Emulation Flows with Hardware Execution

| Software Emulation | Hardware Emulation | Hardware Execution |
| --- | --- | --- |
| Host application runs with a C/C++ or OpenCL model of the kernels. | Host application runs with a simulated RTL model of the kernels. | Host application runs with actual hardware implementation of the kernels. |
| Confirm functional correctness of the system. | Test the host / kernel integration, get performance estimates. | Confirm that the system runs correctly and with desired performance. |
| Fastest turnaround time. | Best debug capabilities, moderate compilation time. | Final FPGA implementation and run provides accurate performance result with long build time. |

SDAccel Example Designs

SDAccel Examples on GitHub

Xilinx provides many examples for programming with the SDAccel environment in the GitHub repository to help beginning users get familiar with the coding style of host and kernel code, and for more experienced users to use as a source of coding examples. All examples are available with host code, kernel code, and Makefile associated with the compilation flow and runtime flow. The following is one such example to get a basic understanding of the file structure of a standard example.

Inverse Discrete Cosine Transform (IDCT) Example

The IDCT example design demonstrates the key coding organization required for the SDAccel environment.

The Readme.md file discusses in detail how to run this example in both emulation and FPGA execution flows using the provided Makefile.

Inside the ./src directory, you can find host code idct.cpp, and kernel code krnl_idct.cpp.

In the following chapters you will learn the basic required knowledge to program the host code and kernel code for the SDAccel environment. During this process you might refer to the above design as an example.

Organization of this Guide

The remaining chapters are organized as follows:

  • Programming the Host Application: Describes writing host code in C or C++ using the OpenCL API targeted for Xilinx FPGA devices. This chapter assumes the user has prior knowledge of OpenCL. It discusses coding practices to follow when writing an effective host application interfacing with acceleration kernels running on Xilinx FPGA devices.
  • Programming C/C++ Kernels: Describes different elements of writing high-performance, compute-intensive kernel code to implement on an FPGA device.
  • Configuring the System Architecture: Discusses integrating and connecting the host application to one or more kernel instances during the linking process.