Introduction to Debugging in SDAccel

This document is intended to introduce the debugging capabilities of the SDAccel™ environment. The goal is to provide detailed instructions on how to analyze any failure encountered within the SDAccel flow. If no tool problem is encountered and the behavior of the design is deemed functionally correct, look for answers in the SDAccel Environment Profiling and Optimization Guide to determine if the performance of the design can be further improved.

SDAccel Execution Model

In the SDAccel framework, an application program is split between a host application and hardware accelerated kernels with a communication channel between them. The host application, written in C/C++ and using API abstractions like OpenCL, runs on an x86 server while hardware accelerated kernels run within the Xilinx FPGA. The API calls, managed by the Xilinx Runtime (XRT), are used to communicate with the hardware accelerators. Communication between the host x86 machine and the accelerator board, including control and data transfers, occurs across the PCIe bus. While control information is transferred between specific memory locations in hardware, global memory is used to transfer data between the host application and the kernels. Global memory is accessible by both the host processor and hardware accelerators, while host memory is only accessible by the host application.

For instance, in a typical application, the host will first transfer data, to be operated on by the kernel, from host memory into global memory. The kernel would subsequently operate on the data, storing results back to the global memory. Upon kernel completion, the host would transfer the results back into the host memory. Data transfers between the host and global memory introduce latency which can be costly to the overall acceleration. To achieve acceleration in a real system, the benefits achieved by hardware acceleration kernels must outweigh the extra latency of the data transfers. The general structure of this acceleration platform is shown in the following figure.

Figure: Architecture of an SDAccel Application



The FPGA hardware platform, on the right-hand side, contains the hardware accelerated kernels, global memory along with the DMA for memory transfers. Kernels can have one or more global memory interfaces and are programmable. The SDAccel execution model can be broken down into these steps:

  1. The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
  2. The host application sets up the kernel with its input parameters.
  3. The host application triggers the execution of the kernel function on the FPGA.
  4. The kernel performs the required computation while reading data from global memory, as necessary.
  5. The kernel writes data back to global memory and notifies the host that it has completed its task.
  6. The host application reads data back from global memory into the host memory and continues processing as needed.

The FPGA can accommodate multiple kernel instances at one time; this can occur between different types of kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.

SDAccel Build Process

The SDAccel environment offers all of the features of a standard software development environment:

  • Optimized compiler for host applications
  • Cross-compilers for the FPGA
  • Robust debugging environment to help identify and resolve issues in the code
  • Performance profilers to identify bottlenecks and optimize the code

Within this environment, the build process uses a standard compilation and linking process for both the software elements, and the hardware elements of the project. As shown in the following figure, the host application is built through one process using standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.

Figure: Software/Hardware Build Process



  1. Host application build process using GCC:
    • Each host application source file is compiled to an object file (.o).
    • The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).
  2. FPGA build process is highlighted in the following figure:
    • Each kernel is independently compiled to a Xilinx object (.xo) file.
      • C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
      • RTL kernels are compiled using the package_xo utility. The RTL kernel wizard in the SDAccel environment can be used to simplify this process.
    • The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
      • When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
      • When the build target is the system (actual hardware), xocc generates the FPGA binary for the device leveraging the Vivado Design Suite to run synthesis and implementation.

Figure: FPGA Build Process



Note: The xocc compiler automatically uses the Vivado HLS and Vivado Design Suite tools to build the kernels to run on the FPGA platform. It uses these tools with predefined settings which have proven to provide good quality of results. Using the SDAccel environment and the xocc compiler does not require knowledge of these tools; however, hardware-savvy developers can fully leverage these tools and use all their available features to implement kernels.

Build Targets

The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of FPGA binary generated by the build process.

The SDAccel tool provides three different build targets, two emulation targets used for debug and validation purposes, and the default hardware target used to generate the actual FPGA binary:

Software Emulation (sw_emu)
Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with application, and verifying the behavior of the system.
Hardware Emulation (hw_emu)
The kernel code is compiled into a hardware model (RTL) which is run in a dedicated simulator. This build and run loop takes longer but provides a detailed, cycle-accurate, view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
System (hw)
The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA device, resulting in a binary that will run on the actual FPGA.

SDAccel Debug Flow Overview

This section presents the general debug flow of the SDAccel environment by detailing the general steps of a proven development process. This process allows you to focus rapidly on potential errors in the design. This sets the baseline for developers indicating where to start if an error occurs in their adopted development steps.

The debug flow described here assumes that an SDAccel platform board is installed and the initial setup checks have passed. It is possible to configure the SDAccel environment to work with custom hardware platforms that require a platform shell which defines the foundational components of the board.

The SDAccel environment provides application-level debug features which allow the host code, the kernel code, and the interactions between them to be efficiently debugged. The recommended application-level debugging flow consists of three levels of debugging: software emulation, hardware emulation, and hardware execution.

This three-tiered approach allows debugging of the host and kernel code and their interactions at different levels of abstraction. Each of the execution models described below is supported through the SDAccel IDE as well as through a batch flow using basic compile time and runtime setup options.

Software Emulation

Purpose
Algorithm verification
Execution Model
During software emulation, all processes are running pure C/C++ models. OpenCL kernel models are transformed to execute concurrently.

Figure: Software Emulation

Verify that both the host and kernel code are functionally correct by running software emulation. Because software emulation compiles and executes quickly, spend time here to iterate through the code until the host and kernel code function correctly. Both hardware emulation and hardware execution take more time to compile and execute.

Hardware Emulation

Purpose
RTL debugging, finding protocol violations.
Execution Model
During hardware emulation the host code is executed concurrently with a simulation of the RTL model of the kernel, directly imported, or created through Vivado HLS from the C/C++/OpenCL kernel code.

Figure: Hardware Emulation

Verify the host code and the kernel hardware implementation is correct by running hardware emulation on a data set. Hardware emulation performs detailed verification using an accurate model of the hardware (RTL) together with the host code C/OpenCL model. The hardware emulation flow invokes the hardware simulator in the SDAccel environment to test the functionality of the logic that is to be executed on the FPGA compute fabric. The interface between the models is represented by a transaction-level model (TLM) to limit impact of interface model on the overall execution time. The execution time for hardware emulation is longer than for software emulation.

TIP: Xilinx recommends that you use small data sets for debug and validation.
During the hardware emulation stage you can optionally modify the kernel code to improve performance. Iterate in hardware emulation until the functionality is correct and the estimated kernel performance is sufficient. See SDAccel Environment Profiling and Optimization Guide for more information.

Hardware Execution

Purpose
Final verification of the complete system, finding protocol violations (hardware hangs), and debugging system performance.
Execution Model
During hardware execution, the actual hardware platform is used to execute the kernels. The difference between this debug configuration and the final compilation of the kernel code is the inclusion of special hardware logic in the platform, such as ILA and VIO debug cores, and AXI performance monitors for debug purposes.

Figure: Hardware Execution

At this stage, a system image (xclbin) is compiled and executed on the actual hardware platform. Refer to the SDAccel Environment User Guide for more information on generating the xclbin file. At this point, the kernels are confirmed to be executing correctly on the actual FPGA hardware, and your focus can shift from debugging to performance tuning. See the SDAccel Environment Profiling and Optimization Guide.

Nevertheless, the hardware execution model might not be functional due to protocol issues, or issues with the hardware configuration. Towards that end, the SDAccel environment provides specific hardware debug capabilities which include ChipScope™ debug cores (such as System ILAs), which can be viewed in Vivado hardware manager, with waveform analysis, kernel activity reports, and memory access analysis to localize these critical hardware issues.

IMPORTANT: Debugging the kernel on the platform hardware requires additional logic to be incorporated into the overall hardware model. This means that if hardware debugging is enabled, there is some impact on resource use of the FPGA, as well as some impact on the kernel performance.