Profiling and Optimizing the Kernel

Three distinct areas must be considered when optimizing an algorithm in SDAccel™:

Host optimization
Kernel optimization
PCIe® bandwidth optimization

Most application developers are familiar with host code optimization. This usually requires the programmer to study algorithmic complexity, overall system performance, and data locality. Many methodology guides and software tools are available to help developers identify performance bottlenecks. These same techniques can be applied to designs targeted for acceleration with SDAccel.

Consequently, as a first step, programmers should optimize overall program performance independently of the final target. However, SDAccel projects differ from general-purpose software in that part of the core compute algorithm is offloaded onto the FPGA. The developer must therefore be aware of algorithm concurrency, data transfers, and the fact that programmable hardware is being targeted.

Generally, the programmer must identify the section of the algorithm to be accelerated. The ratio of computation to the data that must be transferred to and from the accelerator should be high enough that the system bus does not become an unnecessary bottleneck.
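As a rough, back-of-the-envelope check, the time needed to move a kernel's data across the bus can be compared with the time the kernel spends computing on it. The following C++ sketch assumes that the bus bandwidth and the kernel's sustained operation rate are known or estimated; the function names are hypothetical, are not part of any SDAccel API, and the model ignores latency and any overlap of transfers with computation.

```cpp
#include <cassert>
#include <cstdint>

// Estimated seconds to move `bytes` over a link that sustains
// `bytes_per_sec` (an assumed figure, not a measured SDAccel value).
double transfer_seconds(std::uint64_t bytes, double bytes_per_sec) {
    assert(bytes_per_sec > 0.0);
    return static_cast<double>(bytes) / bytes_per_sec;
}

// Estimated seconds for the kernel to perform `ops` operations at an
// assumed sustained rate of `ops_per_sec`.
double compute_seconds(std::uint64_t ops, double ops_per_sec) {
    assert(ops_per_sec > 0.0);
    return static_cast<double>(ops) / ops_per_sec;
}

// Ratio of kernel compute time to bus transfer time. Values well above 1
// suggest the bus is unlikely to be the bottleneck; values below 1 mean
// the transfers alone take longer than the computation they feed.
double compute_to_transfer_ratio(std::uint64_t bytes_moved, std::uint64_t ops,
                                 double bus_bytes_per_sec,
                                 double kernel_ops_per_sec) {
    assert(bytes_moved > 0);
    return compute_seconds(ops, kernel_ops_per_sec) /
           transfer_seconds(bytes_moved, bus_bytes_per_sec);
}
```

For instance, adding two vectors of N single-precision floats moves 12·N bytes (two inputs and one output) for only N additions, so its ratio is low. A dense n×n matrix multiply performs on the order of n³ multiply-adds on 3n² elements, so its ratio grows with n and it is a far better candidate for offload.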

Similarly, the host must use the accelerator efficiently. The host code must therefore be optimized to streamline data transfers and kernel execution, and, where possible, to perform any additional pre- and post-processing.
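The benefit of overlapping data transfers with kernel execution can be estimated with an idealized pipeline model. The sketch below assumes the input is split into equal chunks and that the host-to-device write, the kernel run, and the device-to-host read of different chunks can proceed concurrently without contention; the `StageTimes` struct and both functions are hypothetical illustrations, not SDAccel APIs.

```cpp
#include <algorithm>
#include <cassert>

// Idealized per-chunk stage times in seconds (hypothetical struct).
struct StageTimes {
    double write;    // host-to-device transfer of one chunk
    double compute;  // kernel execution on one chunk
    double read;     // device-to-host transfer of one chunk
};

// The host waits for each chunk to finish all three stages before
// starting the next one, so nothing overlaps.
double sequential_time(const StageTimes& t, int chunks) {
    assert(chunks > 0);
    return chunks * (t.write + t.compute + t.read);
}

// Classic pipeline bound: the first chunk pays the full latency, after
// which one chunk completes every "slowest stage" interval.
// Assumes perfect overlap between stages working on different chunks.
double pipelined_time(const StageTimes& t, int chunks) {
    assert(chunks > 0);
    double slowest = std::max({t.write, t.compute, t.read});
    return (t.write + t.compute + t.read) + (chunks - 1) * slowest;
}
```

With per-chunk times of 2, 3, and 1 time units and four chunks, the sequential schedule takes 24 units while the idealized pipeline takes 15, because after the first chunk one chunk completes every 3 units (the slowest stage). In OpenCL host code, this kind of overlap is typically obtained by issuing non-blocking commands such as `clEnqueueWriteBuffer`, `clEnqueueNDRangeKernel` (or `clEnqueueTask`), and `clEnqueueReadBuffer`, chained through event wait lists, on an out-of-order command queue or on multiple queues.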

SDAccel supports these optimization efforts by generating reports that let you analyze the host code and the hardware kernels in detail. The reports are generated automatically when you build the project and are listed in the Reports view of the SDx IDE. To open a report, double-click it in the list.

The following figures show the three main reports: the HLS Report, the Application Timeline, and the Profile Summary. To access these reports from the SDx IDE, make sure the Reports view is visible. This view is typically below the Project Explorer view.

TIP: You can use the Window > Show View > Other menu command to display the Reports view if it is not displayed. See Working with SDx for more information.

The HLS Report provides details about the high-level synthesis (HLS) process. This step translates the C/C++ model into a hardware description language (HDL) description that implements the functionality on the FPGA. The report lets the programmer dive directly into the hardware implementation and optimize the kernel.

Figure: HLS Report Window

The Application Timeline provides a graphical representation of the OpenCL® API calls made during execution. It lets the programmer see which operations are performed at which times across the entire run of the application, making it easier to identify issues with kernel synchronization and efficient concurrent execution.

Finally, the Profile Summary provides annotated details about overall application performance. SDAccel collects data during program execution and groups it into categories. The Profile Summary lets the programmer drill down into the actual data transfer and kernel execution numbers and statistics.

Note: The Profile Summary also contains Profile Rule Checks (PRCs). PRCs show, at a high level, how the current performance numbers compare with commonly achieved reference numbers.

Figure: Profile Summary Window
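The data transfer figures in the Profile Summary are easier to judge once they are converted into an achieved rate and compared with what the link can deliver, in the same spirit as the PRCs' comparison against reference numbers. The following minimal sketch assumes you have a total byte count and a total transfer time for a buffer or kernel, plus a peak rate for your platform's PCIe link; the function names are hypothetical, and the peak figure is an assumption that must come from your own platform's specifications.

```cpp
#include <cassert>
#include <cstdint>

// Achieved transfer rate in MB/s (1 MB = 1e6 bytes), computed from the
// total bytes moved and the total time spent moving them.
double transfer_rate_mbps(std::uint64_t total_bytes, double total_seconds) {
    assert(total_seconds > 0.0);
    return static_cast<double>(total_bytes) / total_seconds / 1.0e6;
}

// Percentage of an assumed peak link rate that the achieved rate reaches.
double utilization_percent(double achieved_mbps, double peak_mbps) {
    assert(peak_mbps > 0.0);
    return 100.0 * achieved_mbps / peak_mbps;
}
```

For example, 400 MB transferred in 0.5 s corresponds to 800 MB/s, which is 10% of an assumed 8,000 MB/s peak. A low percentage like this may indicate, for example, many small transfers or transfers that are not overlapped with computation.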

More details on each viewer, the profiling and optimization methodology, common optimization steps, and coding guidelines can be found in the SDAccel Environment Profiling and Optimization Guide (UG1207).