Profiling and Optimizing the Kernel
There are three distinct areas to be considered when performing algorithm optimization in SDAccel™:
- Host optimization
- Kernel optimization
- PCIe® bandwidth optimization
Most application developers are familiar with host code optimization. This usually requires the programmer to study algorithmic complexity, overall system performance, and data locality. Many methodology guides and software tools are available to help the developer identify performance bottlenecks. The same techniques can be applied to a design that is to be accelerated with SDAccel.
Consequently, as a first step, programmers should optimize their overall program performance independently of the final target. The main difference between SDAccel and general-purpose software, however, is that in SDAccel projects part of the core compute algorithm is pushed onto the FPGA. This implies that the developer must be aware of algorithm concurrency, data transfers, and the fact that the target is programmable hardware.
Generally, the programmer must identify the section of the algorithm to be accelerated. The ratio of computation to the data transfers that the accelerator requires should be high enough that the system bus does not become a bottleneck.
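The ratio described above can be estimated with a short back-of-the-envelope calculation before any code is moved to the FPGA. The following Python sketch is only an illustration: the function name `offload_speedup`, its parameters, and all numbers are hypothetical and are not part of SDAccel. It assumes the worst case in which data transfers and kernel execution do not overlap.

```python
def offload_speedup(host_compute_s, kernel_compute_s, bytes_moved, bus_bytes_per_s):
    """Estimate the benefit of moving one code section to the accelerator.

    Assumes transfers and kernel execution run back to back (no overlap).
    Returns (speedup over the host-only version, share of time spent on the bus).
    """
    transfer_s = bytes_moved / bus_bytes_per_s
    accelerated_s = transfer_s + kernel_compute_s
    return host_compute_s / accelerated_s, transfer_s / accelerated_s


# Hypothetical figures: 2 s on the host, 0.25 s in the kernel,
# 2 GB moved in total over a bus that sustains 8 GB/s.
speedup, transfer_share = offload_speedup(
    host_compute_s=2.0,
    kernel_compute_s=0.25,
    bytes_moved=2 * 10**9,
    bus_bytes_per_s=8 * 10**9,
)
print(f"speedup: {speedup:.2f}x, transfer share: {transfer_share:.0%}")
```

With these assumed numbers, half of the accelerated run time is spent moving data. A transfer share that approaches 100% suggests that the selected section does too little computation per byte transferred to be a good acceleration candidate.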
Similarly, the host needs to utilize the accelerator efficiently. This implies that the host code must be optimized to facilitate data transfers and kernel execution, and to perform additional pre- and post-processing where possible.
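One common way for the host to keep the accelerator busy is to split the data into chunks and transfer the next chunk while the kernel processes the current one. The sketch below models the effect of this overlap under simplifying assumptions: every chunk has the same transfer and kernel time, results are not written back, and enough buffers are available. The function names and numbers are hypothetical, and the model is not taken from the SDAccel documentation.

```python
def serial_time(chunks, transfer_s, kernel_s):
    """Total time when each chunk is transferred and then processed, one after another."""
    return chunks * (transfer_s + kernel_s)


def overlapped_time(chunks, transfer_s, kernel_s):
    """Total time when the transfer of chunk i+1 overlaps the kernel run on chunk i.

    Two-stage pipeline: the first transfer and the last kernel run cannot be
    hidden; every step in between costs the slower of the two stages.
    """
    return transfer_s + (chunks - 1) * max(transfer_s, kernel_s) + kernel_s


# Hypothetical workload: 8 chunks, 0.5 s transfer and 1.0 s kernel time per chunk.
print(serial_time(8, 0.5, 1.0))      # 12.0
print(overlapped_time(8, 0.5, 1.0))  # 8.5
```

In this model the overlapped schedule is limited by the slower stage, so overlapping helps most when transfer and kernel times are of similar size.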
SDAccel supports your efforts to optimize these areas by generating reports that help you analyze the host code and the hardware kernels. The reports are generated automatically when you build the project and are listed in the Report view of the SDx IDE. To open a listed report, double-click it.
Figure: HLS Report Window
The Application Timeline provides a graphical representation of the OpenCL® interface calls during execution. It lets the programmer see which operations are performed at what time across the complete application run, which helps identify issues with kernel synchronization and with the efficiency of concurrent execution.
Figure: Application Timeline Window
Finally, the Profile Summary provides annotated details regarding the overall application performance. SDAccel collects all data produced during execution of the program and groups it into categories. The Profile Summary enables the programmer to drill down to actual Data Transfer and Kernel Execution numbers and statistics.
Figure: Profile Summary Window
More details on each viewer, as well as the profiling and optimization methodology, common optimization steps, and coding guidelines, can be found in the SDAccel Environment Profiling and Optimization Guide (UG1207).