October 30, 2018
Editor’s Note: This content is contributed by Khang Dao, Senior Director, Platform Hardware Engineering at Xilinx
On the second day at XDF 2018, Peter Frey, Xilinx Principal Software Product Application Engineer, provided an overview of the methods for accelerating an SDAccel design to help users get the most computation and acceleration from their designs.
<Software/Hardware Build Process>
The SDAccel environment is designed to provide a simplified development experience for FPGA-based software acceleration platforms. SDAccel includes a compiler, linker, libraries, and dynamically reconfigurable accelerators optimized for different applications that can be swapped in and out on the fly. Applications can have multiple kernels, and kernels can be updated at runtime without disrupting the interface between the server CPU and the FPGA. SDAccel supports kernel models from RTL to C/C++ to standard OpenCL. It lets developers abstract away the hardware platform and compiles their code into kernels that run on the FPGA acceleration board. The SDAccel development environment enables up to 25X better performance per Watt for application acceleration with FPGAs.
To maximize the benefits of SDAccel, Xilinx provides various optimization techniques. Each topic below guides developers from recognizing bottlenecks to applying solutions that increase overall system performance.
Host Code Optimization
<Overview of Host Code Optimization>
The host code uses the OpenCL API to schedule the individual compute unit executions and the data transfers to and from the FPGA board. As a result, you need to think about concurrent execution through the OpenCL command queue(s). Typical areas of concern in host code optimization are the concurrency enabled by the OpenCL command queue(s), buffer management for data exchange between the host and kernels, general software pipelining on the FPGA, and synchronization between host and kernels.
Kernel Code Optimization
<Key Techniques to Develop High Performance C Kernel>
C/C++ and OpenCL kernels are mapped to the FPGA as accelerator kernels. However, to create an efficient implementation, the HLS compiler needs implementation details supplied through pragma directives or OpenCL attributes. Roughly, these optimizations fall into three categories: exploiting computational efficiency, memory mapping of large arrays, and interface optimizations.
When considering computational efficiency, it is important to understand that, compared to GPUs or CPUs, FPGAs can implement a much more flexible and customizable processing unit. It starts with the fact that an FPGA is not restricted to the standard C/C++ data types. Performance can improve considerably if the programmer makes efficient use of arbitrary precision types, such as 6-bit integers or custom fixed-point types. On the next level, loops do not have to be executed sequentially as in traditional implementations. With no or very limited overhead, loops can be implemented fully in parallel (unrolled) or can exploit temporal parallelism in a pipeline. On a yet higher level of abstraction, if we think about computational tasks (such as different functions), the implementer has the additional choice of creating a DATAFLOW implementation, which allows each of these tasks to execute independently, synchronized only via data dependencies.
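The loop-level and task-level options described above can be sketched in HLS-style C++. This is a minimal illustration, not code from the talk: the function names, the problem size `N`, and the unroll factor are assumptions chosen for the example, and whether the tool reaches the requested initiation interval depends on the target device and memory ports. Arbitrary precision types are left out because they need the vendor's headers. A standard C++ compiler ignores the `HLS` pragmas, so the same source can be checked in software first.

```cpp
#include <cstddef>

constexpr std::size_t N = 64;  // hypothetical fixed problem size

// Stage 1: pipelined loop. The pragma asks the tool to start a new
// iteration every clock cycle (initiation interval of 1).
void scale(const int in[N], int out[N], int factor) {
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * factor;
    }
}

// Stage 2: partially unrolled loop. The tool creates four copies of the
// loop body; how many actually run in parallel depends on data
// dependencies and available memory ports.
void sum(const int in[N], int &result) {
    int acc = 0;
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS UNROLL factor=4
        acc += in[i];
    }
    result = acc;
}

// Top level: DATAFLOW asks the tool to run both stages as concurrent
// tasks that are synchronized only through the intermediate buffer.
void scale_and_sum(const int in[N], int factor, int &result) {
#pragma HLS DATAFLOW
    int scaled[N];
    scale(in, scaled, factor);
    sum(scaled, result);
}
```

Calling `scale_and_sum` with the inputs 0 through 63 and a factor of 2 yields 4032 in software; the pragmas change how the hardware is built, not the result.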
Arrays are used in many algorithm descriptions. However, a hardware implementation allows these objects to be implemented in various ways, each with different performance characteristics. This makes managing array implementation an important part of kernel optimization.
Finally, the kernels in SDAccel communicate with the host via memory-mapped AXI bus interfaces. Configuring the interface model appropriately is therefore another kernel-related optimization area.
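As one hedged example of an array implementation choice, the sketch below keeps a small sliding window in an array and asks the tool to split it into individual registers. The function name, `TAPS`, and `LEN` are assumptions made for this illustration. Without partitioning, the array would typically map to a memory with at most two ports, which limits how many elements can be accessed per cycle.

```cpp
#include <cstddef>

constexpr std::size_t TAPS = 4;  // hypothetical window length
constexpr std::size_t LEN = 8;   // hypothetical input length

// Sum of the most recent TAPS samples for every input position.
void moving_sum(const int in[LEN], int out[LEN]) {
    int window[TAPS] = {0, 0, 0, 0};
    // Complete partitioning turns each element into its own register,
    // so all elements can be read and written in the same cycle.
#pragma HLS ARRAY_PARTITION variable=window complete dim=1
    for (std::size_t i = 0; i < LEN; ++i) {
#pragma HLS PIPELINE II=1
        int acc = 0;
        // Shift the window by one position and accumulate the old samples.
        for (std::size_t t = TAPS - 1; t > 0; --t) {
            window[t] = window[t - 1];
            acc += window[t];
        }
        window[0] = in[i];
        acc += window[0];
        out[i] = acc;
    }
}
```

Other partitioning styles (for example cyclic or block) trade resource usage against access parallelism; which one fits depends on the access pattern of the algorithm.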
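A minimal sketch of how interface configuration might look for a C/C++ kernel is shown below. The kernel `vadd`, its arguments, and the bundle names `gmem0`, `gmem1`, and `control` are assumptions for illustration. The idea is that pointer arguments are mapped to AXI master ports, and that placing the two inputs in different bundles gives them separate ports so they can be read concurrently.

```cpp
// Hypothetical vector-add kernel; port and bundle names are illustrative.
extern "C" void vadd(const int *a, const int *b, int *out, int n) {
    // Pointer arguments access global memory through AXI master ports.
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
    // Addresses, scalars, and kernel control go through an AXI4-Lite port.
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=out bundle=control
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```

Whether separate bundles actually help depends on how the platform connects those ports to memory banks, so this is a starting point to measure rather than a guaranteed gain.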
Topological Optimization
Topological optimizations are concerned with the highest level of parallelism in a model implementation. On this level, it is possible to take full advantage of the generic fabric the FPGA offers, because individual kernel implementations can be replicated to increase computational parallelism. For example, if a kernel implements a specific function A, the implementer can decide to create 4 implementations of A (A_1, A_2, A_3, and A_4) so that the complete program can call the 4 implementations in parallel. If A also denotes the runtime of a single call and the four calls are independent, the total runtime for all calls ideally drops from 4A to A.
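The 4A-to-A arithmetic can be written down as a small model. This is an idealized sketch, not a measurement: the function name and parameters are hypothetical, and it assumes independent calls, identical runtime per call, no contention for memory bandwidth, and a host that keeps every compute unit busy.

```cpp
#include <cstdint>

// Idealized wall-clock estimate: calls are issued in rounds, and each
// round keeps up to `compute_units` copies of the kernel busy.
// `compute_units` must be greater than zero.
constexpr std::uint64_t total_runtime(std::uint64_t calls,
                                      std::uint64_t compute_units,
                                      std::uint64_t runtime_per_call) {
    // Ceiling division gives the number of rounds needed.
    return ((calls + compute_units - 1) / compute_units) * runtime_per_call;
}
```

With 4 calls and a per-call runtime of A, one compute unit gives 4A and four compute units give A. A fifth call would need a second round, so the benefit is largest when the number of calls is a multiple of the number of compute units.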
Implementation Optimization
In most cases, this level of optimization is not required. It was presented for completeness and to show that SDAccel can control even the lowest-level implementation decisions. Today, the most typical implementation decision guided on this level is related to Super Logic Region (SLR) assignments.
These techniques and tools help provide insight into what portions of hardware and software to optimize and important design factors to consider. By following the advice and best practices described, the user can more quickly reach their performance goals and successfully deploy their accelerated application.