Optimizing the Performance
Host Optimization
This section focuses on optimization of the host program, which uses the OpenCL™ API to schedule the individual compute unit executions, and data transfers to and from the FPGA board. For optimizing data transfer and compute calls, you need to think about concurrent execution of tasks through the OpenCL command queue(s). This section discusses common pitfalls, and how to recognize and address them.
Reducing Overhead of Kernel Enqueuing
The OpenCL-based execution model supports data parallel and task parallel programming models. An OpenCL host generally needs to call different kernels multiple times. These calls are enqueued in a command queue, either in a defined sequence or in an out-of-order command queue. Then, depending on the availability of compute resources and task data, they are scheduled for execution on the device.
Kernel calls can be enqueued for execution on a command queue using clEnqueueTask. The dispatching process is executed on the host processor. The dispatcher invokes kernel execution after transferring the kernel arguments to the accelerator running on the device. The dispatcher uses a low-level Xilinx® Runtime (XRT) library for transferring kernel arguments and issuing trigger commands for starting the compute. The overhead of dispatching the commands and arguments to the accelerator can be between 30 µs and 60 µs, depending on the number of arguments set for the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed, and minimizing calls to clEnqueueTask.
Ideally, you should finish all the compute in a single call to clEnqueueTask. You can minimize the calls to clEnqueueTask by batching your data and invoking the kernel one time, with a loop wrapped around the original implementation, to avoid the overhead of multiple enqueue calls. Batching can also improve data transfer performance between the host and accelerator, because fewer large data packets are transferred rather than many small ones. For more information on reducing overhead on kernel execution, see Kernel Execution. For example, the following kernel processes a single block of 256 values per call:
#define SIZE 256
extern "C" {
    void add(int *a, int *b, int inc) {
        int buff_a[SIZE];
        for (int i = 0; i < SIZE; i++) {
            buff_a[i] = a[i];
        }
        for (int i = 0; i < SIZE; i++) {
            b[i] = buff_a[i] + inc;
        }
    }
}
By adding a num_batches argument, as shown below, the kernel can process multiple inputs of size 256 in a single call and avoid the overhead of multiple clEnqueueTask calls. The host application changes to allocate data and buffers in chunks of SIZE * num_batches, essentially batching the memory allocation and the transfer of data between the host and the device global memory.
#define SIZE 256
extern "C" {
    void add(int *a, int *b, int inc, int num_batches) {
        int buff_a[SIZE];
        for (int j = 0; j < num_batches; j++) {
            for (int i = 0; i < SIZE; i++) {
                buff_a[i] = a[j * SIZE + i];
            }
            for (int i = 0; i < SIZE; i++) {
                b[j * SIZE + i] = buff_a[i] + inc;
            }
        }
    }
}
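A corresponding host-side sketch is shown below. This is not taken from a specific example; it is a minimal illustration with assumed names (context, queue, kernel_add, host_a, host_b, err), showing how the buffers are sized for SIZE * num_batches elements so that a single clEnqueueTask call processes all batches:
#define SIZE 256
int num_batches = 64;                               // hypothetical batch count
size_t bytes = (size_t)SIZE * num_batches * sizeof(int);

// Allocate one large device buffer per argument instead of one buffer per batch.
cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, &err);
cl_mem d_b = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

// Transfer the complete batched data set with a single write.
clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, bytes, host_a, 0, NULL, NULL);

// Set the kernel arguments, including the batch count, and enqueue the kernel once.
int inc = 1;
clSetKernelArg(kernel_add, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel_add, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(kernel_add, 2, sizeof(int), &inc);
clSetKernelArg(kernel_add, 3, sizeof(int), &num_batches);
clEnqueueTask(queue, kernel_add, 0, NULL, NULL);

// Read all results back with a single transfer.
clEnqueueReadBuffer(queue, d_b, CL_TRUE, 0, bytes, host_b, 0, NULL, NULL);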
Optimizing Data Movement
In the OpenCL execution model, all data is first transferred from the host main memory to the global device memory, and then from the global device memory to the kernel for computation. The computation results are written back from the kernel to the global device memory, and lastly from the global device memory to the host main memory. A key factor in determining strategies for kernel data movement optimization is understanding how data can be moved efficiently between the different levels of memory, maximizing the use of bandwidth on all the memory interfaces.
During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement. You should focus on modifying the data transfer logic in the host and kernel code during this optimization step. The goal is to maximize the system level data throughput by maximizing data transfer bandwidth and device global memory bandwidth usage. It usually takes multiple iterations of running software emulation, hardware emulation, as well as execution in hardware to achieve optimum performance.
Overlapping Data Transfers with Kernel Computation
Applications, such as database analytics, have a much larger data set than can be stored in the available global device memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications.
An example can be found in the vadd kernel from the overlap example in the host category of the Vitis Accelerated Examples on GitHub. This example demonstrates techniques to overlap host (CPU) and FPGA computation in the application. In this example, the kernel processes two arrays by adding them together and writing the result to an output buffer. From the host perspective, there are four tasks to perform in this example:
- Write buffer a (Wa)
- Write buffer b (Wb)
- Execute the vadd kernel
- Read buffer c (Rc)
Using a simple in-order command queue without data transfer optimization, the overall execution timeline trace should look similar to the one shown below:
Using an out-of-order command queue, data transfer and kernel execution can overlap as illustrated in the figure below. In the host code for this example, double buffering is used for all buffers so that the kernel can process one set of buffers while the host can operate on the other set of buffers.
The OpenCL event object provides an easy method to set up complex operation dependencies and synchronize host threads and device operations. Events are OpenCL objects that track the status of operations. Event objects are created by kernel execution commands; by read, write, and copy commands on memory objects; or as user events created using clCreateUserEvent.
You can ensure an operation has completed by querying the events returned by these commands. The arrows in the figure below show how event triggering can be set up to achieve optimal performance.
In the example, the host code (host.cpp) enqueues the four tasks in a loop to process the complete data set. It also sets up event synchronization between the different tasks to ensure that data dependencies are met for each task. The double buffering is set up by passing different memory object values to the clEnqueueMigrateMemObjects API. The event synchronization is achieved by having each API call wait on the events of other operations, as well as trigger its own event when the call completes.
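The following sketch is not the example source, but a simplified version of the pattern, assuming an out-of-order queue q, a kernel krnl, and two sets of buffers buf_a[2], buf_b[2], buf_c[2]. Each transfer waits only on the events of the buffer set it is about to reuse:
std::vector<cl::Event> write_events(2), kernel_events(2), read_events(2);

for (int i = 0; i < num_iterations; i++) {
    int flag = i % 2;   // alternate between the two buffer sets

    // Host-to-device transfer: before overwriting this buffer set, wait for the
    // kernel run and result read that used it two iterations ago.
    std::vector<cl::Event> wait_reuse;
    if (i >= 2) {
        wait_reuse.push_back(kernel_events[flag]);
        wait_reuse.push_back(read_events[flag]);
    }
    q.enqueueMigrateMemObjects({buf_a[flag], buf_b[flag]}, 0 /* host to device */,
                               &wait_reuse, &write_events[flag]);

    // Kernel execution depends on the write of this buffer set.
    krnl.setArg(0, buf_a[flag]);
    krnl.setArg(1, buf_b[flag]);
    krnl.setArg(2, buf_c[flag]);
    std::vector<cl::Event> wait_write{write_events[flag]};
    q.enqueueTask(krnl, &wait_write, &kernel_events[flag]);

    // Device-to-host transfer of the result depends on the kernel completion.
    std::vector<cl::Event> wait_compute{kernel_events[flag]};
    q.enqueueMigrateMemObjects({buf_c[flag]}, CL_MIGRATE_MEM_OBJECT_HOST,
                               &wait_compute, &read_events[flag]);
}
q.finish();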
The Application Timeline view below clearly shows that the data transfer time is completely hidden, while the compute unit vadd_1 is running constantly.
Buffer Memory Segmentation
Allocation and deallocation of memory buffers can lead to memory segmentation in the DDR controllers. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel.
This issue occurs most often when multiple pthreads for different compute units are used and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it might seem the processes are sleeping.
Each buffer allocated by the runtime must be contiguous in device memory. When many buffers of different sizes are allocated and deallocated, it might take time for a large enough contiguous region to become free. This can be resolved by allocating the device buffers up front and reusing them between different enqueues of a kernel.
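A minimal sketch of this pattern is shown below (hypothetical names, not taken from a specific example). The device buffers are created once and only their contents change between enqueues:
// Create the device buffers once, sized for the largest transfer required.
cl::Buffer d_in (context, CL_MEM_READ_ONLY,  max_bytes);
cl::Buffer d_out(context, CL_MEM_WRITE_ONLY, max_bytes);
krnl.setArg(0, d_in);
krnl.setArg(1, d_out);

for (int i = 0; i < num_runs; i++) {
    // Reuse the same cl::Buffer objects for every enqueue instead of allocating
    // and releasing new device buffers, which can fragment the DDR space.
    q.enqueueWriteBuffer(d_in, CL_TRUE, 0, max_bytes, host_in[i]);
    q.enqueueTask(krnl);
    q.enqueueReadBuffer(d_out, CL_TRUE, 0, max_bytes, host_out[i]);
}
q.finish();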
Compute Unit Scheduling
Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels.
Multiple In-Order Command Queues
The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required.
The following code, based on host.cpp of the concurrent_kernel_execution_c example, sets up multiple in-order command queues and enqueues commands into each queue:
OCL_CHECK(err,
          cl::CommandQueue ordered_queue1(
              context, device, CL_QUEUE_PROFILING_ENABLE, &err));
OCL_CHECK(err,
          cl::CommandQueue ordered_queue2(
              context, device, CL_QUEUE_PROFILING_ENABLE, &err));
...
printf("[Ordered Queue 1]: Enqueueing scale kernel\n");
OCL_CHECK(
    err,
    err = ordered_queue1.enqueueTask(
        kernel_mscale, nullptr, &kernel_events[0]));
set_callback(kernel_events[0], "scale");
...
// The addition kernel depends on the results of the scale kernel. Because both
// are enqueued into the same in-order queue, the queue guarantees that the
// addition kernel starts only after the scale kernel has completed.
printf("[Ordered Queue 1]: Enqueueing addition kernel (Depends on scale)\n");
OCL_CHECK(
    err,
    err = ordered_queue1.enqueueTask(
        kernel_madd, nullptr, &kernel_events[1]));
set_callback(kernel_events[1], "addition");
...
// The matrix multiplication kernel does not depend on the previous kernels, so
// it is enqueued into a separate in-order queue and can be scheduled in
// parallel with them.
printf("[Ordered Queue 2]: Enqueueing matrix multiplication kernel\n");
OCL_CHECK(
    err,
    err = ordered_queue2.enqueueTask(
        kernel_mmult, nullptr, &kernel_events[2]));
set_callback(kernel_events[2], "matrix multiplication");
Single Out-of-Order Command Queue
The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must manually define event dependencies and synchronizations as required.
The following is code extracted from host.cpp of the concurrent_kernel_execution_c example that sets up a single out-of-order command queue and enqueues commands as needed:
OCL_CHECK(
err,
cl::CommandQueue ooo_queue(context,
device,
CL_QUEUE_PROFILING_ENABLE |
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
&err));
...
printf("[OOO Queue]: Enqueueing scale kernel\n");
OCL_CHECK(
err,
err = ooo_queue.enqueueTask(
kernel_mscale,nullptr, &ooo_events[0]));
set_callback(ooo_events[0], "scale");
...
// This is an out of order queue, events can be executed in any order. Since
// this call depends on the results of the previous call we must pass the
// event object from the previous call to this kernel's event wait list.
printf("[OOO Queue]: Enqueueing addition kernel (Depends on scale)\n");
kernel_wait_events.resize(0);
kernel_wait_events.push_back(ooo_events[0]);
OCL_CHECK(err,
err = ooo_queue.enqueueTask(
kernel_madd,
&kernel_wait_events, // Event from previous call
&ooo_events[1]));
set_callback(ooo_events[1], "addition");
// This call does not depend on previous calls so we are passing nullptr
// into the event wait list. The runtime should schedule this kernel in
// parallel to the previous calls.
printf("[OOO Queue]: Enqueueing matrix multiplication kernel\n");
OCL_CHECK(err,
err = ooo_queue.enqueueTask(
kernel_mmult,
nullptr,
&ooo_events[2]));
set_callback(ooo_events[2], "matrix multiplication");
The Application Timeline view shows that the compute unit mmult_1 is running in parallel with the compute units mscale_1 and madd_1, using both the multiple in-order queues and the single out-of-order queue methods.
Kernel Optimization
One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput versus power consumption. The following guidelines help manage the design complexity and achieve the desired design goals.
Optimizing Kernel Computation
The goal of kernel optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. The key metric is the initiation interval (II), which is the number of clock cycles before the kernel can accept new input data. Optimizing the II is generally achieved by expanding the processing code to match the data path, with techniques such as function pipelining, loop unrolling, array partitioning, and dataflow.
Interface Attributes (Detailed Kernel Trace)
The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.
Most interesting with respect to performance are the fields:
- Burst Length
- Describes how many packages are sent within one transaction.
- Burst Size
- Describes the number of bytes being transferred as part of one package.
Given a burst length of 1 and just 4 bytes per package, it will require many individual AXI transactions to transfer any reasonable amount of data.
Small burst lengths, as well as burst sizes considerably less than 512 bits, are therefore good opportunities to optimize interface performance.
Using Burst Data Transfers
Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller. Also, check the HLS report for bursting information.
If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:
In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are actually implemented rather differently (shorter transaction time). Hovering over these transactions, you would see that the AXI interconnect has packed the 16 x 4-byte transfers into a single transaction of 1 x 64 bytes. This uses the AXI4 bandwidth even more effectively. The next section focuses on this optimization technique in more detail.
Burst inference is heavily dependent on coding style and access pattern. However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:
void kernel(T in[1024], T out[1024]) {
    T tmpIn[1024];
    T tmpOut[1024];
    read(in, tmpIn);          // burst read from the AXI interface into local memory
    process(tmpIn, tmpOut);   // computation on local memory only
    write(tmpOut, out);       // burst write from local memory to the AXI interface
}
In short, the function read is responsible for reading from the AXI input into an internal variable (tmpIn). The computation is implemented by the function process, working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output. For more information on bursting, see the Vitis High-Level Synthesis User Guide (UG1399).
The isolation of the read and write functions from the computation results in:
- Simple control structures (loops) in the read/write functions, which makes burst detection simpler (see the sketch after this list).
- The isolation of the computational function away from the AXI interfaces, which simplifies potential kernel optimization. See Kernel Optimization for more information.
- The internal variables are mapped to on-chip memory, which allows faster access compared to AXI transactions. Acceleration platforms supported in the Vitis core development kit can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
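As referenced in the first item above, a possible shape of the read function is sketched below (a simplified illustration, assuming the data type T and sizes from the snippet above). A single pipelined loop over consecutive addresses gives the tool a simple pattern from which a burst read can be inferred:
void read(T in[1024], T tmpIn[1024]) {
READ_LOOP:
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        // Consecutive, unconditional accesses to the interface enable burst inference.
        tmpIn[i] = in[i];
    }
}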
Using Full AXI Data Width
The user data width between the kernel and the memory controller can be configured by the Vitis compiler based on the data types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types that map to the full data width of the memory controller. The memory controller in all supported acceleration cards supports a 512-bit user interface, which can be mapped to the C/C++ arbitrary precision data type ap_int<512> or to OpenCL vector data types such as int16.
As described in Memory Interface Width Considerations, the default is for Vitis HLS to automatically re-size the kernel interface ports up to 512-bits to improve burst access. As shown on the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit package size (Burst Size 64 bytes).
This example shows good interface configuration as it maximizes AXI data width as well as actual burst transactions.
Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.
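The following sketch (a hypothetical kernel, not one of the examples) shows an interface declared with ap_uint<512> so that each AXI transfer carries 64 bytes, here treated as sixteen 32-bit words:
#include <ap_int.h>

extern "C" void wide_add(const ap_uint<512> *in, ap_uint<512> *out, int size) {
#pragma HLS INTERFACE m_axi port=in  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<512> v = in[i];
        ap_uint<512> r;
        // Process the sixteen 32-bit words packed into each 512-bit beat.
        for (int w = 0; w < 16; w++) {
#pragma HLS UNROLL
            r.range(32 * w + 31, 32 * w) = v.range(32 * w + 31, 32 * w) + 1;
        }
        out[i] = r;
    }
}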
Optimizing Kernel to Kernel Communication
Support for hardware accelerator pipelines that communicate through streams is one of the major advantages of FPGAs and FPGA-based SoCs; such pipelines have long been used in DSP and image processing applications, as well as in communication systems. As described in Streaming Data Transfers between Kernels (K2K), AXI4-Stream interfaces can be used to stream data from one kernel to another without having to use the external memory, which greatly improves the overall system latency.
Kernel ports involved in streaming are defined within the kernel and are not addressed by the host program. There is no need to send data back to global memory before it is forwarded to the next kernel for processing. The connections between the kernels are defined directly during the v++ linking process, as described in Specifying Streaming Connections between Compute Units.
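As a simplified illustration (hypothetical kernel and port names, not taken from the examples), a streaming port can be declared with hls::stream and an axis interface, and the connection to the consuming kernel is then made at link time:
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<32, 0, 0, 0> pkt;   // 32-bit data word with AXI4-Stream side-band signals

// Producer kernel: forwards results over an AXI4-Stream port instead of global memory.
extern "C" void producer(const int *in, hls::stream<pkt> &out, int size) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem
#pragma HLS INTERFACE axis  port=out
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        pkt p;
        p.data = in[i];
        p.keep = -1;                 // all bytes valid
        p.last = (i == size - 1);    // mark the end of the stream
        out.write(p);
    }
}
The stream connection between this port and the input port of a matching consumer kernel would then be specified with the connectivity.sc option during v++ linking, for example (placeholder names):
[connectivity]
sc=producer_1.out:consumer_1.in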
Optimizing Memory Architecture
Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:
void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
          ) {
    // ...
    ap_uint<16> inMem[256][4];
    ap_uint<16> outMem[256];

    // ... Preprocess input to local memory

    for (int j = 0; j < 256; j++) {
#pragma HLS PIPELINE OFF
        ap_uint<16> sum = 0;
        for (int i = 0; i < 4; i++) {
            sum += inMem[j][i];
        }
        outMem[j] = sum;
    }

    // ... Postprocess: write local memory to output
}
This code adds the four values associated with the inner dimension of the two dimensional input array. If implemented without any additional modifications, it results in the following estimates:
The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This is observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.
However, this improvement is largely because of the process using both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project:
Two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result, as this completely blocks access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
For more information, see pragma HLS array_partition.
This results in four array reads, all executed on different memories using a single port:
Using a total of 256 * 4 cycles = 1024 cycles for loop 2.
Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the following pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
For more information, see pragma HLS array_reshape.
This results in the same latency as with array partitioning, but with a single memory using a single port:
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making it the preferred solution.
void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
          ) {
    // ...
    ap_uint<16> inMem[256][4];
    ap_uint<16> outMem[256];
#pragma HLS array_reshape variable=inMem complete dim=2

    // ... Preprocess input to local memory

    for (int j = 0; j < 256; j++) {
#pragma HLS PIPELINE OFF
        ap_uint<16> sum = 0;
        for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL
            sum += inMem[j][i];
        }
        outMem[j] = sum;
    }

    // ... Postprocess: write local memory to output
}
Optimizing Computational Parallelism
By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism.
Coding Data Parallelism
To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize the computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the Vitis technology might not be able to apply the desired transformation due to the structure of the source code.
It is quite common that some computational parallelism is not reflected in the source code to begin with. In this case, it needs to be added. A typical example is a kernel that is described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in Task Parallelism.
A 512-bit interface can be created using OpenCL vector data types such as int16, or the C/C++ arbitrary precision data type ap_int<512>. These vector types can also be used as a powerful way to model data parallelism within a kernel, with up to 16 data paths operating in parallel in the case of int16. Refer to the Median Filter Example in the vision category of the Xilinx Getting Started Examples on GitHub for the recommended method of using vectors.
Loop Parallelism
Loops are the basic C/C++/OpenCL API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:
for(int i = 0; i<255; i++) {
out[i] = in[i]+in[i+1];
}
out[255] = in[255];
This code iterates over an array of values and adds consecutive values, except for the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total of 510 cycles for the loop. This can be analyzed in detail through the Schedule Viewer in the HLS Project:
This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:
The key numbers here are the latency and the total LUT usage. For example, depending on the configuration, you could see a latency of 511 cycles and a total LUT usage of 47; these values vary based on the implementation choices. While this implementation requires very little area, it results in significant latency.
Unrolling Loops
Unrolling a loop enables the full parallelism of the model to be used. To perform this, mark a loop to be unrolled and the tool will create the implementation with the most parallelism possible. To mark a loop to unroll, an OpenCL loop can be marked with the UNROLL attribute:
__attribute__((opencl_unroll_hint))
Or a C/C++ loop can use the unroll pragma:
#pragma HLS UNROLL
For more information, see Loop Unrolling.
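For illustration, applying the pragma to the loop from the previous section gives the following sketch:
for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
    out[i] = in[i] + in[i+1];
}
out[255] = in[255];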
When applied to this specific example, the Schedule Viewer in the HLS Project will be:
The following figure shows the estimated performance:
Therefore, the total latency is considerably improved to 127 cycles, and, as expected, the computational hardware is increased to 4845 LUTs to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface used for the variable out. The Vitis core development kit uses dual-port memory by default for an array. This implies that at most two values can be written to the memory per cycle. Thus, to see a fully parallel implementation, you must specify that the variable out should be kept in registers, as in this example:
#pragma HLS array_partition variable=out complete dim=0
For more information, see pragma HLS array_partition.
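Putting the two optimizations together, the loop and the partitioning pragma might be placed as in the following sketch (a hypothetical wrapper function with int data is assumed):
void top(int in[256], int out[256]) {
#pragma HLS array_partition variable=out complete dim=0
    for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
        out[i] = in[i] + in[i+1];
    }
    out[255] = in[255];
}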
The results of this transformation can be observed in the following Schedule Viewer:
The associated estimates are:
Accordingly, this code can be implemented as a combinatorial function requiring only a fraction of a clock cycle to complete.
Pipelining Loops
Pipelining loops allows you to overlap iterations of a loop in time, as discussed in Loop Pipelining. Allowing loop iterations to operate concurrently is often a good approach, as resources can be shared between iterations (less resource utilization), while requiring less execution time compared to loops that are not pipelined.
Pipelining is enabled in C/C++ through the pragma HLS pipeline:
#pragma HLS PIPELINE
While the OpenCL API uses the xcl_pipeline_loop attribute:
__attribute__((xcl_pipeline_loop))
__attribute__((xcl_pipeline_workitems))
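Applied to the same loop as before, the sketch is:
for (int i = 0; i < 255; i++) {
#pragma HLS PIPELINE
    out[i] = in[i] + in[i+1];
}
out[255] = in[255];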
In this example, the Schedule Viewer in the HLS Project produces the following information:
With the overall estimates being:
Because each iteration of the loop consumes only two cycles of latency, there can only be a single iteration overlap. This enables the total latency to be cut in half compared to the original, resulting in 257 cycles of total latency. However, this reduction in latency is achieved using fewer resources compared to unrolling.
In most cases, loop pipelining by itself can improve overall performance. Yet, the effectiveness of the pipelining depends on the structure of the loop. Some common limitations are:
- Resources with limited availability such as memory ports or process channels can limit the overlap of the iterations (Initiation Interval).
- Loop-carry dependencies, such as those created by variable conditions computed in one iteration affecting the next, might increase the II of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to remove these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array, or by breaking the dependencies altogether.
Task Parallelism
Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks.
See the following example:
void run (ap_uint<16> in[16384],
          ap_uint<16> out[16384]
          ) {
    ap_uint<16> tmp[128];
    for (int i = 0; i < 128; i++) {
        processA(&(in[i*128]), tmp);
        processB(tmp, &(out[i*128]));
    }
}
When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given the combined latency of processA and processB, the loop latency is 278 cycles per iteration, and the total latency can be estimated as:
The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, task parallelism is performed by adding the DATAFLOW pragma into the for-loop:
#pragma HLS DATAFLOW
For OpenCL API code, add the attribute before the for-loop:
__attribute__ ((xcl_dataflow))
Refer to Dataflow Optimization, HLS Pragmas, and OpenCL Attributes for more details on this topic.
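For the run() function above, the pragma is placed inside the loop body so that the two process functions of each iteration execute as a dataflow pipeline (sketch):
void run (ap_uint<16> in[16384],
          ap_uint<16> out[16384]
          ) {
    ap_uint<16> tmp[128];
    for (int i = 0; i < 128; i++) {
#pragma HLS DATAFLOW
        processA(&(in[i*128]), tmp);
        processB(tmp, &(out[i*128]));
    }
}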
As illustrated by the estimates in the HLS report, applying the transformation will considerably improve the overall performance effectively using a double (ping-pong) buffer scheme between the tasks:
The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:
(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles
Using task parallelism is a powerful method to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. It is often necessary to look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Finally, the Vitis core development kit provides the Detailed Kernel Trace, which illustrates concurrent execution.
For this Detailed Kernel Trace, the tool displays the start of the dataflow loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until the completion of processA before it can start its first iteration. While processB completes the first iteration of the loop, processA already begins operating on the second iteration, and so on.
A more abstract representation of this information is presented in Timeline Trace for the host and device activity.
Optimizing Device Resources
Data Width
One of the most important aspects for performance, if not the most important, is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL code might use only large data types such as integers, even at the ports of the design. However, as the algorithm is mapped to a fully configurable implementation, smaller data types such as 10-bit or 12-bit values might often suffice. It is beneficial to check the size of the basic operations in the HLS Synthesis report during optimization.
In general, when the Vitis core development kit maps an algorithm onto the FPGA, more processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the Vitis core development kit generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
In the following figure, the basic operations and their bit-width are reported.
Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time.
Fixed Point Arithmetic
Some applications use floating-point computation only because they were optimized for other hardware architectures. Using fixed-point arithmetic for applications such as deep learning can significantly improve power efficiency and save area while keeping the same level of accuracy.
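As a simplified illustration (hypothetical code), the HLS arbitrary precision fixed-point type can replace float where the dynamic range allows; the operators are overloaded, so the arithmetic itself is unchanged:
#include <ap_fixed.h>

// 16-bit fixed-point type with 6 integer bits and 10 fractional bits.
typedef ap_fixed<16, 6> fixed_t;

// Multiply-accumulate written with fixed-point instead of float.
fixed_t mac(const fixed_t a[8], const fixed_t b[8]) {
    fixed_t acc = 0;
    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL
        acc += a[i] * b[i];
    }
    return acc;
}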
Macro Operations
It is sometimes advantageous to think about larger computational elements. The tool will operate on such a block of source code independently of the remaining source code, effectively mapping it onto the FPGA without consideration of the surrounding operations. When applied, the Vitis technology keeps the operational boundaries, effectively creating macro operations for the specific code. This uses the following principles:
- Operational locality to the mapping process
- Reduction in complexity for the heuristics
This might create vastly different results when applied. In C/C++, macro operations are created with the help of #pragma HLS inline off. In the OpenCL API, the same kind of macro operation can be generated by not specifying the following attribute when defining a function:
__attribute__((always_inline))
For more information, see pragma HLS inline.
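For example (a minimal sketch), marking a helper function with the pragma keeps it as a distinct operation in the generated hardware:
int filter_tap(int a, int b, int c) {
#pragma HLS inline off
    // Kept as its own operational unit (macro operation) rather than being
    // dissolved into the surrounding logic.
    return (a + 2 * b + c) >> 2;
}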
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions, and will typically have better performance compared to the corresponding functions (without the native_ prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the Vitis technology, these native_ built-in functions use the equivalent functions in the Vitis HLS tool Math library, which are already optimized for Xilinx FPGAs in terms of area and performance.
Xilinx recommends using the native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
Exploring Kernel Optimizations Using Vitis HLS
All kernel optimizations using OpenCL or C/C++ can be performed from within the Vitis core development kit. The primary performance optimizations, such as those discussed in this section (pipelining function and loops, applying dataflow to enable greater concurrency between functions and loops, unrolling loops, etc.), are performed by the Vitis HLS tool.
The Vitis core development kit automatically calls the HLS tool. However, to use the GUI analysis capabilities, you must launch the HLS tool directly from within the Vitis technology. Using the HLS tool in standalone mode, as discussed in Compiling Kernels with Vitis HLS, enables the following enhancements to the optimization methodology:
- The ability to focus solely on the kernel optimization because there is no requirement to execute emulation.
- The ability to create multiple solutions, compare their results, and explore the solution space to find the most optimal design.
- The ability to use the interactive Analysis Perspective to analyze the design performance.
To open the HLS tool in standalone mode, from the Assistant window, right-click the hardware function object, and select Open HLS Project, as shown in the following figure.
Topological Optimization
This section focuses on the topological optimization. It looks at the attributes related to the rough layout and implementation of multiple compute units and their impact on performance.
Multiple Compute Units
Depending on available resources on the target device, multiple compute units of the same kernel (or different kernels) can be created to run in parallel, which improves the system processing time and throughput. For more details, see Creating Multiple Instances of a Kernel.
Using Multiple DDR Banks
Acceleration cards supported in the Vitis technology provide one, two, or four DDR banks, and up to 80 GB/s of raw DDR bandwidth. For kernels moving large amounts of data between the FPGA and the DDR, Xilinx® recommends that you direct the Vitis compiler and runtime library to use multiple DDR banks.
In addition to DDR banks, the host application can access PLRAM to transfer data directly to a kernel. This feature is enabled using the connectivity.sp option in a configuration file specified with the v++ --config option. Refer to Mapping Kernel Ports to Memory for more information on implementing this optimization, and to Memory Mapped Interfaces for information on data transfer to the global memory banks.
To take advantage of multiple DDR banks, you need to assign CL memory buffers to different banks in the host code, as well as configure the xclbin file to match the bank assignment on the v++ command line.
The following block diagram shows the Global Memory Two Banks (C) example in Vitis Examples on GitHub. This example connects the input pointer interface of the kernel to DDR bank 0, and the output pointer interface to DDR bank 1.
Assigning DDR Bank in Host Code
During the Vitis tool flow, the kernel port to memory bank connectivity can be established using the --connectivity.sp switch, as described in Mapping Kernel Ports to Memory. The xclbin generated by v++ contains the information about the kernel port to memory connectivity so that XRT can allocate buffers appropriately. When a buffer is created in the host code, XRT automatically assigns the buffer to memory based on the kernel xclbin, and manages the buffers internally. If a single kernel port is connected to multiple memory banks, XRT always starts from the lowest numbered bank.
In most cases, this approach is sufficient. However, in some specific cases you might need to manually assign the buffer location (or special property) in the host code. For this purpose, the Xilinx OpenCL vendor extension provides a buffer extension called CL_MEM_EXT_PTR_XILINX to specifically manage bank assignment in the host code. The following code example shows the required header file and code for assigning the input and output buffers to DDR bank 0 and bank 1:
#include <CL/cl_ext.h>
…
int main(int argc, char** argv)
{
…
    cl_mem_ext_ptr_t inExt, outExt;       // Declaring two extensions for both buffers
    inExt.flags  = 0 | XCL_MEM_TOPOLOGY;  // Specify Bank0 memory for the input buffer
    outExt.flags = 1 | XCL_MEM_TOPOLOGY;  // Specify Bank1 memory for the output buffer
    inExt.obj = 0; outExt.obj = 0;        // Setting obj and param to zero
    inExt.param = 0; outExt.param = 0;

    int err;
    // Allocate buffer in Bank0 of global memory for the input image using the Xilinx extension
    cl_mem buffer_inImage = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX,
                                           image_size_bytes, &inExt, &err);
    if (err != CL_SUCCESS){
        std::cout << "Error: Failed to allocate device memory" << std::endl;
        return EXIT_FAILURE;
    }

    // Allocate buffer in Bank1 of global memory for the output image using the Xilinx extension
    cl_mem buffer_outImage = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_EXT_PTR_XILINX,
                                            image_size_bytes, &outExt, &err);
    if (err != CL_SUCCESS){
        std::cout << "Error: Failed to allocate device memory" << std::endl;
        return EXIT_FAILURE;
    }
…
}
The extension pointer cl_mem_ext_ptr_t is a struct as defined below:
typedef struct{
unsigned flags;
void *obj;
void *param;
} cl_mem_ext_ptr_t;
- Valid values for flags are:
  - XCL_MEM_DDR_BANK0
  - XCL_MEM_DDR_BANK1
  - XCL_MEM_DDR_BANK2
  - XCL_MEM_DDR_BANK3
  - <id> | XCL_MEM_TOPOLOGY
  Note: The <id> is determined by looking at the Memory Configuration section in the xxx.xclbin.info file generated next to the xxx.xclbin file. In the xxx.xclbin.info file, the global memory (DDR, HBM, PLRAM, etc.) is listed with an index representing the <id>.
- obj is the pointer to the associated host memory allocated for the CL memory buffer, only if the CL_MEM_USE_HOST_PTR flag is passed to the clCreateBuffer API; otherwise, set it to NULL.
- param is reserved for future use. Always assign it to 0 or NULL.
Here are some specific cases where you might want to use the extension pointer:
- P2P Buffer
- For an explanation and example, refer to https://xilinx.github.io/XRT/master/html/p2p.html.
- Host-Memory Buffer
- For an explanation and example, refer to https://xilinx.github.io/XRT/master/html/sb.html.
- Allocating the host buffer to a specific bank when the kernel port is connected to multiple banks
- For example, DDR[0:1]. This use case is described in detail in the Using Multiple DDR Banks lab of the Vitis Optimizing Accelerated FPGA Applications: Bloom Filter Example tutorial.
Example of Allocating the Host Buffer to a Specific Bank
An example of the third case listed above, where you might need to use cl_mem_ext_ptr_t, is when the host and kernel are both accessing the DDR banks simultaneously, and you would like to split the data so that the kernel and the host access memory banks in a ping-pong fashion. When the host is writing or reading one memory bank, the kernel is writing or reading another bank, so that these host and kernel accesses do not compete and impact performance. For this scenario, you must manage the buffer allocation yourself.
The kernel ports in the xclbin are connected to DDR bank1 and bank2, and read the data from these banks alternately. The connectivity is established during linking by the Vitis compiler using the --connectivity.sp switch:
[connectivity]
sp=runOnfpga_1.input_words:DDR[1:2]
From the host code, you can send the input_words data to DDR banks 1 and 2 alternately. Two Xilinx extension pointer (cl_mem_ext_ptr_t) objects are created, as shown in the example code below. The object flags determine which DDR bank each buffer is assigned to for the kernel to access. The kernel argument can then be set to input_words[0] and input_words[1] for consecutive kernel enqueues.
#include <CL/cl_ext.h>
…
int main(int argc, char** argv)
{
cl_mem_ext_ptr_t buffer_words_ext[2];
buffer_words_ext[0].flags = 1 | XCL_MEM_TOPOLOGY; // DDR[1]
buffer_words_ext[0].param = 0;
buffer_words_ext[0].obj = input_doc_words;
buffer_words_ext[1].flags = 2 | XCL_MEM_TOPOLOGY; // DDR[2]
buffer_words_ext[1].param = 0;
buffer_words_ext[1].obj = input_doc_words;
…
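A sketch of how these extension pointers might then be used follows (a hypothetical continuation; the kernel handle, sizes, queue, and error handling are assumed). Each buffer is created with CL_MEM_EXT_PTR_XILINX so that the flags field selects the bank, and consecutive enqueues alternate between the two buffers:
    // Create one buffer in DDR[1] and one in DDR[2], both backed by the host
    // pointer carried in the obj field (requires CL_MEM_USE_HOST_PTR).
    cl_mem input_words[2];
    for (int i = 0; i < 2; i++) {
        input_words[i] = clCreateBuffer(
            context,
            CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR | CL_MEM_EXT_PTR_XILINX,
            size_in_bytes, &buffer_words_ext[i], &err);
    }

    // Alternate the kernel argument between the two banks on consecutive enqueues
    // so host and kernel accesses ping-pong between DDR[1] and DDR[2].
    for (int iter = 0; iter < num_iterations; iter++) {
        int flag = iter % 2;
        clSetKernelArg(kernel_runOnfpga, 0, sizeof(cl_mem), &input_words[flag]);
        clEnqueueTask(queue, kernel_runOnfpga, 0, NULL, NULL);
    }
…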
Assigning Global Memory for Kernel Code
Creating Multiple AXI Interfaces
OpenCL kernels, C/C++ kernels, and RTL kernels have different methods for assigning function parameters to AXI interfaces.
- For OpenCL kernels, the --max_memory_ports option is required to generate one AXI4 interface for each global pointer on the kernel argument. The AXI4 interface name is based on the order of the global pointers on the argument list.
The following code is taken from the example gmem_2banks_ocl in the ocl_kernels category of the Vitis Accel Examples on GitHub:
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void apply_watermark(__global const TYPE * __restrict input,
                     __global TYPE * __restrict output,
                     int width,
                     int height) {
    ...
}
In this example, the first global pointer input is assigned the AXI4 interface name M_AXI_GMEM0, and the second global pointer output is assigned the name M_AXI_GMEM1.
- For C/C++ kernels, multiple AXI4 interfaces are generated by specifying different "bundle" names in the HLS INTERFACE pragma for different global pointers. Refer to Kernel Interfaces for more information.
The following is a code snippet from the gmem_2banks example that assigns the input pointer to the bundle gmem0 and the output pointer to the bundle gmem1. The bundle name can be any valid C string, and the generated AXI4 interface name will be M_AXI_<bundle_name>. For this example, the input pointer will have the AXI4 interface name M_AXI_gmem0, and the output pointer will have M_AXI_gmem1. Refer to pragma HLS interface for more information.
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
- For RTL kernels, the port names are generated during the import process by the RTL Kernel Wizard. The default names proposed by the RTL Kernel Wizard are m00_axi and m01_axi. If not changed, these names have to be used when assigning a DDR bank through the connectivity.sp option in the configuration file. Refer to Mapping Kernel Ports to Memory for more information.
Assigning AXI Interfaces to DDR Banks
The following is an example configuration file that specifies the connectivity.sp option, and the v++ command line that connects the input pointer (M_AXI_GMEM0) to DDR bank 0 and the output pointer (M_AXI_GMEM1) to DDR bank 1:
The config_sp.cfg file:
[connectivity]
sp=apply_watermark_1.m_axi_gmem0:DDR[0]
sp=apply_watermark_1.m_axi_gmem1:DDR[1]
The v++ command line:
v++ apply_watermark --config config_sp.cfg
You can use the Device Hardware Transaction view to observe the actual DDR Bank communication, and to analyze DDR usage.
Assigning AXI Interfaces to PLRAM
Some platforms support PLRAM. In these cases, use the same --connectivity.sp option as described in Assigning AXI Interfaces to DDR Banks, but use the name PLRAM[id]. Valid names supported by specific platforms can be found in the Memory Configuration section of the xclbin.info file generated alongside the xclbin.
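For example, a configuration file entry of the following form assigns a kernel interface to a PLRAM instance (kernel and port names are placeholders):
[connectivity]
sp=my_kernel_1.m_axi_gmem0:PLRAM[0]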
Kernel SLR and DDR Memory Assignments
Kernel compute unit (CU) instance and DDR memory resource floorplanning are key to meeting the quality of results of your design in terms of frequency and resources. Floorplanning involves explicitly allocating CUs (kernel instances) to SLRs and mapping CUs to DDR memory resources. When floorplanning, both the CU resource usage and the DDR memory bandwidth requirements need to be considered.
The largest Xilinx FPGAs are made up of multiple stacked silicon dies. Each stack is referred to as a super logic region (SLR) and has a fixed amount of resources and memory, including DDR interfaces. The device SLR resources available for custom logic can be found in the Vitis Software Platform Release Notes, or can be displayed using the platforminfo utility described in platforminfo Utility.
You can use the actual kernel resource utilization values to help distribute CUs across SLRs to reduce congestion in any one SLR. The system estimate report lists the number of resources (LUTs, Flip-Flops, BRAMs, etc.) used by the kernels early in the design cycle. The report can be generated during hardware emulation and system compilation through the command line or GUI and is described in System Estimate Report.
Use this information along with the available SLR resources to help assign CUs to SLRs such that no one SLR is over-utilized. The less congestion in an SLR, the better the tools can map the design to the FPGA resources and meet your performance target. For mapping memory resources and CUs, see Mapping Kernel Ports to Memory and Assigning Compute Units to SLRs.
After allocating your CUs to SLRs, map any CU master AXI port(s) to DDR memory resources. Xilinx recommends connecting to a DDR memory resource in the same SLR as the CU. This reduces competition for the limited SLR-crossing connection resources. In addition, connections between SLRs use super long line (SLL) routing resources, which incurs a greater delay than a standard intra-SLR routing.
It might be necessary to cross an SLR boundary to connect to a DDR resource in a different SLR. However, if both the connectivity.sp and the connectivity.slr directives are explicitly defined, the tools automatically add the additional crossing logic to minimize the effect of the SLL delay and facilitate better timing closure.
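For example, the following configuration file entries (placeholder names) assign a CU to an SLR with connectivity.slr and map its memory interface with connectivity.sp:
[connectivity]
slr=vadd_1:SLR1
sp=vadd_1.m_axi_gmem:DDR[1]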
Guidelines for Kernels that Access Multiple Memory Banks
The DDR memory resources are distributed across the super logic regions (SLRs) of the platform. Because the number of connections available for crossing between SLRs is limited, the general guidance is to place a kernel in the same SLR as the DDR memory resource with which it has the most connections. This reduces competition for SLR-crossing connections and avoids consuming extra logic resources associated with SLR crossing.
As shown in the previous figure, when a kernel has a single AXI interface that maps only a single memory bank, the platforminfo utility described in platforminfo Utility lists the SLR that is associated with the memory bank of the kernel, and therefore the SLR where the kernel would best be placed. In this scenario, the design tools might automatically place the kernel in that SLR without the need for extra input; however, you might need to provide an explicit SLR assignment for some of the kernels under the following conditions:
- The design contains a large number of kernels accessing the same memory bank.
- A kernel requires specialized logic resources that are not available in the SLR of the memory bank.
When a kernel has multiple AXI interfaces and all of the interfaces of the kernel access the same memory bank, it can be treated in a very similar way to the kernel with a single AXI interface, and the kernel should reside in the same SLR as the memory bank that its AXI interfaces are mapping.
When a kernel has multiple AXI interfaces to multiple memory banks in different SLRs, the recommendation is to place the kernel in the SLR that has the majority of the memory banks accessed by the kernel (shown in the figure above). This minimizes the number of SLR crossings required by this kernel, which leaves more SLR crossing resources available for other kernels in your design to reach their memory banks.
When the kernel is mapping memory banks from different SLRs, explicitly specify the SLR assignment as described in Kernel SLR and DDR Memory Assignments.
As shown in the previous figure, when a platform contains more than two SLRs, it is possible that the kernel might map a memory bank that is not in the immediately adjacent SLR to its most commonly mapped memory bank. When this scenario arises, memory accesses to the distant memory bank must cross more than one SLR boundary and incur additional SLR-crossing resource costs. To avoid such costs it might be better to place the kernel in an intermediate SLR where it only requires less expensive crossings into the adjacent SLRs.