Optimizing the Performance
Host Optimization
This section focuses on optimization of the host program, which uses the OpenCL™ API to schedule the individual compute unit executions, and data transfers to and from the FPGA board. As a result, you need to think about concurrent execution of tasks through the OpenCL command queue(s). This section discusses common pitfalls, and how to recognize and address them.
Reducing Overhead of Kernel Enqueuing
The OpenCL API execution model supports data parallel and task parallel programming models. Kernels are usually enqueued by the OpenCL runtime multiple times and then scheduled to be executed on the device. You must send the command to start the kernel in one of two ways:
- Using the clEnqueueNDRange API for the data parallel case
- Using clEnqueueTask for the task parallel case
The dispatching process is executed on the host processor, and the kernel commands and arguments need to be sent to the accelerator, over the PCIe® bus in the case of an Alveo card for instance. In the Xilinx Runtime (XRT) library, the overhead of dispatching the command and arguments to the accelerator can be between 30 µs and 60 µs, depending on the number of arguments of the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed.
For the data parallel case, Xilinx recommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size as shown in the following code snippet:
size_t global = 1;
size_t local = 1;
clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
                       &global, &local, 2, write_events.data(),
                       &kernel_events[0]);
For the task parallel case, Xilinx recommends that you minimize the number of calls to clEnqueueTask. Ideally, you should finish all the workload in a single call to clEnqueueTask. For more information on reducing overhead on kernel execution, see Kernel Execution.
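For illustration, a minimal sketch of a single clEnqueueTask call covering the complete workload is shown below; the command_queue, kernel, and write_events names are assumptions carried over from the previous snippet, not code from a shipped example:
cl_event task_event;
cl_int err;
// Enqueue the kernel once for the entire workload. The wait list ensures the
// task only starts after the input buffer writes have completed.
err = clEnqueueTask(command_queue, kernel,
                    (cl_uint)write_events.size(), write_events.data(),
                    &task_event);
if (err != CL_SUCCESS) {
    printf("Error: clEnqueueTask failed (%d)\n", err);
}
// Wait for the kernel to finish before reading the results back.
clWaitForEvents(1, &task_event);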
Optimizing Data Movement
In the OpenCL API, all data is transferred from the host memory to the global memory on the device first and then from the global memory to the kernel for computation. The computation results are written back from the kernel to the global memory and lastly from the global memory to the host memory. A key factor in determining strategies for kernel optimization is understanding how data can be efficiently moved around.
During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement. Xilinx recommends that you modify the host code and kernels with data transfer code only for this optimization step. The goal is to maximize the system level data throughput by maximizing PCIe bandwidth usage and DDR bandwidth usage. It usually takes multiple iterations of running software emulation, hardware emulation, as well as execution on FPGAs to achieve this goal.
Overlapping Data Transfers with Kernel Computation
Applications, such as database analytics, have a much larger data set than can be stored in the available memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications.
Below is the vadd kernel from the overlap example in the host category of Vitis Accelerated Examples on GitHub.
#define BUFFER_SIZE 256
#define DATA_SIZE 1024

// TRIPCOUNT identifier
const unsigned int c_len = DATA_SIZE / BUFFER_SIZE;
const unsigned int c_size = BUFFER_SIZE;

extern "C" {
void vadd(int *c, int *a, int *b, const int elements) {
    int arrayA[BUFFER_SIZE];
    int arrayB[BUFFER_SIZE];
    for (int i = 0; i < elements; i += BUFFER_SIZE) {
#pragma HLS LOOP_TRIPCOUNT min=c_len max=c_len
        int size = BUFFER_SIZE;
        if (i + size > elements)
            size = elements - i;
    readA:
        for (int j = 0; j < size; j++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
            arrayA[j] = a[i + j];
        }
    readB:
        for (int j = 0; j < size; j++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
            arrayB[j] = b[i + j];
        }
    vadd_writeC:
        for (int j = 0; j < size; j++) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=c_size max=c_size
            c[i + j] = arrayA[j] + arrayB[j];
        }
    }
}
}
From the host perspective, there are four tasks to perform in this example:
- Write buffer a (Wa)
- Write buffer b (Wb)
- Execute vadd kernel
- Read buffer c (Rc)
Using an out-of-order command queue, data transfer and kernel execution can overlap as illustrated in the figure below. In the host code for this example, double buffering is used for all buffers so that the kernel can process one set of buffers while the host can operate on the other set of buffers.
The OpenCL event object provides an easy method to set up complex operation dependencies and synchronize host threads and device operations. Events are OpenCL objects that track the status of operations. Event objects are created by kernel execution commands; read, write, and copy commands on memory objects; or user events created using clCreateUserEvent. You can ensure an operation has completed by querying the events returned by these commands. The arrows in the figure below show how event triggering can be set up to achieve optimal performance.
The host code enqueues the four tasks in a loop to process the complete data set. It also sets up event synchronization between the different tasks to ensure that data dependencies are met for each task. The double buffering is set up by passing different memory objects to the clEnqueueMigrateMemObjects API. The event synchronization is achieved by having each API call wait for other events as well as trigger its own event when the call completes.
// THIS PAIR OF EVENTS WILL BE USED TO TRACK WHEN A KERNEL IS FINISHED WITH
// THE INPUT BUFFERS. ONCE THE KERNEL IS FINISHED PROCESSING THE DATA, A NEW
// SET OF ELEMENTS WILL BE WRITTEN INTO THE BUFFER.
vector<cl::Event> kernel_events(2);
vector<cl::Event> read_events(2);
cl::Buffer buffer_a[2], buffer_b[2], buffer_c[2];
for (size_t iteration_idx = 0; iteration_idx < num_iterations; iteration_idx++) {
int flag = iteration_idx % 2;
if (iteration_idx >= 2) {
OCL_CHECK(err, err = read_events[flag].wait());
}
// Allocate Buffer in Global Memory
// Buffers are allocated using CL_MEM_USE_HOST_PTR for efficient memory and
// Device-to-host communication
std::cout << "Creating Buffers..." << std::endl;
OCL_CHECK(err,
buffer_a[flag] =
cl::Buffer(context,
CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
bytes_per_iteration,
&A[iteration_idx * elements_per_iteration],
&err));
OCL_CHECK(err,
buffer_b[flag] =
cl::Buffer(context,
CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
bytes_per_iteration,
&B[iteration_idx * elements_per_iteration],
&err));
OCL_CHECK(err,
buffer_c[flag] = cl::Buffer(
context,
CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
bytes_per_iteration,
&device_result[iteration_idx * elements_per_iteration],
&err));
vector<cl::Event> write_event(1);
OCL_CHECK(err, err = krnl_vadd.setArg(0, buffer_c[flag]));
OCL_CHECK(err, err = krnl_vadd.setArg(1, buffer_a[flag]));
OCL_CHECK(err, err = krnl_vadd.setArg(2, buffer_b[flag]));
OCL_CHECK(err, err = krnl_vadd.setArg(3, int(elements_per_iteration)));
// Copy input data to device global memory
std::cout << "Copying data (Host to Device)..." << std::endl;
// Because we are passing the write_event, it returns an event object
// that identifies this particular command and can be used to query
// or queue a wait for this particular command to complete.
OCL_CHECK(
err,
err = q.enqueueMigrateMemObjects({buffer_a[flag], buffer_b[flag]},
0 /*0 means from host*/,
NULL,
&write_event[0]));
set_callback(write_event[0], "ooo_queue");
printf("Enqueueing NDRange kernel.\n");
// This event needs to wait for the write buffer operations to complete
// before executing. We are sending the write_events into its wait list to
// ensure that the order of operations is correct.
//Launch the Kernel
std::vector<cl::Event> waitList;
waitList.push_back(write_event[0]);
OCL_CHECK(err,
err = q.enqueueNDRangeKernel(
krnl_vadd, 0, 1, 1, &waitList, &kernel_events[flag]));
set_callback(kernel_events[flag], "ooo_queue");
// Copy Result from Device Global Memory to Host Local Memory
std::cout << "Getting Results (Device to Host)..." << std::endl;
std::vector<cl::Event> eventList;
eventList.push_back(kernel_events[flag]);
// This operation only needs to wait for the kernel call. This call will
// potentially overlap the next kernel call as well as the next read
// operations
OCL_CHECK(err,
err = q.enqueueMigrateMemObjects({buffer_c[flag]},
CL_MIGRATE_MEM_OBJECT_HOST,
&eventList,
&read_events[flag]));
set_callback(read_events[flag], "ooo_queue");
OCL_CHECK(err, err = read_events[flag].wait());
}
The Application Timeline view below clearly shows that the data transfer time is completely hidden, while the compute unit vadd_1 is running constantly.
Buffer Memory Segmentation
Allocation and deallocation of memory buffers can lead to memory segmentation in the DDR controllers. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel.
This issue occurs most often when multiple pthreads for different compute units are used and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it might seem the processes are sleeping.
Each buffer allocated by the runtime must be contiguous in device memory. When many buffers of different sizes are allocated and deallocated, it can take time for a sufficiently large contiguous region to be freed. This can be resolved by allocating the device buffers up front and reusing them between different enqueues of a kernel.
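A minimal sketch of this approach using the OpenCL C++ API is shown below; the buffer names, sizes, and kernel are illustrative assumptions rather than code from a specific example:
// Create the device buffers once, before the processing loop.
cl::Buffer buffer_in(context, CL_MEM_READ_ONLY, bytes_per_iteration);
cl::Buffer buffer_out(context, CL_MEM_WRITE_ONLY, bytes_per_iteration);
for (size_t i = 0; i < num_iterations; i++) {
    // Reuse the same buffers on every enqueue: only the contents change, so
    // the runtime never has to search for a new contiguous region of device memory.
    q.enqueueWriteBuffer(buffer_in, CL_TRUE, 0, bytes_per_iteration,
                         &A[i * elements_per_iteration]);
    krnl.setArg(0, buffer_out);
    krnl.setArg(1, buffer_in);
    q.enqueueTask(krnl);
    q.enqueueReadBuffer(buffer_out, CL_TRUE, 0, bytes_per_iteration,
                        &result[i * elements_per_iteration]);
}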
For more details on optimizing memory performance, see Reading and Writing by Burst.
Compute Unit Scheduling
Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels.
Multiple In-Order Command Queues
The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required.
The following is code extracted from host.cpp of the concurrent_kernel_execution_c example that sets up multiple in-order command queues and enqueues commands into each queue:
OCL_CHECK(err,
cl::CommandQueue ordered_queue1(
context, device, CL_QUEUE_PROFILING_ENABLE, &err));
OCL_CHECK(err,
cl::CommandQueue ordered_queue2(
context, device, CL_QUEUE_PROFILING_ENABLE, &err));
...
printf("[Ordered Queue 1]: Enqueueing scale kernel\n");
OCL_CHECK(
err,
err = ordered_queue1.enqueueNDRangeKernel(
kernel_mscale, offset, global, local, nullptr, &kernel_events[0]));
set_callback(kernel_events[0], "scale");
...
printf("[Ordered Queue 1]: Enqueueing addition kernel\n");
OCL_CHECK(
err,
err = ordered_queue1.enqueueNDRangeKernel(
kernel_madd, offset, global, local, nullptr, &kernel_events[1]));
set_callback(kernel_events[1], "addition");
...
printf("[Ordered Queue 2]: Enqueueing matrix multiplication kernel\n");
OCL_CHECK(
err,
err = ordered_queue2.enqueueNDRangeKernel(
kernel_mmult, offset, global, local, nullptr, &kernel_events[2]));
set_callback(kernel_events[2], "matrix multiplication");
Single Out-of-Order Command Queue
The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must manually define event dependencies and synchronizations as required.
The following is code extracted from host.cpp of the concurrent_kernel_execution_c example that sets up a single out-of-order command queue and enqueues commands as needed:
OCL_CHECK(
err,
cl::CommandQueue ooo_queue(context,
device,
CL_QUEUE_PROFILING_ENABLE |
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
&err));
...
printf("[OOO Queue]: Enqueueing scale kernel\n");
OCL_CHECK(
err,
err = ooo_queue.enqueueNDRangeKernel(
kernel_mscale, offset, global, local, nullptr, &ooo_events[0]));
set_callback(ooo_events[0], "scale");
...
// This is an out of order queue, events can be executed in any order. Since
// this call depends on the results of the previous call we must pass the
// event object from the previous call to this kernel's event wait list.
printf("[OOO Queue]: Enqueueing addition kernel (Depends on scale)\n");
kernel_wait_events.resize(0);
kernel_wait_events.push_back(ooo_events[0]);
OCL_CHECK(err,
err = ooo_queue.enqueueNDRangeKernel(
kernel_madd,
offset,
global,
local,
&kernel_wait_events, // Event from previous call
&ooo_events[1]));
set_callback(ooo_events[1], "addition");
...
// This call does not depend on previous calls so we are passing nullptr
// into the event wait list. The runtime should schedule this kernel in
// parallel to the previous calls.
printf("[OOO Queue]: Enqueueing matrix multiplication kernel\n");
OCL_CHECK(err,
err = ooo_queue.enqueueNDRangeKernel(
kernel_mmult,
offset,
global,
local,
nullptr, // Does not depend on previous call
&ooo_events[2]));
set_callback(ooo_events[2], "matrix multiplication");
The Application Timeline view (as shown in the following figure) shows that the compute unit mmult_1 is running in parallel with the compute units mscale_1 and madd_1, using both the multiple in-order queue and single out-of-order queue methods.
Kernel Optimization
One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput versus power consumption. The following guidelines help manage the design complexity and achieve the desired design goals.
Optimizing Kernel Computation
The goal of kernel optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. The key metric is the initiation interval (II), or the number of clock cycles before the kernel can accept new input data. Optimizing the II is generally achieved by expanding the processing code to match the data path with techniques such as function pipelining, loop unrolling, array partitioning, data flowing, etc. For more information on kernel optimization, see Linking the Kernels.
Interface Attributes (Detailed Kernel Trace)
The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.
Most interesting with respect to performance are the fields:
- Burst Length
- Describes how many packages are sent within one transaction.
- Burst Size
- Describes the number of bytes being transferred as part of one package.
Given a burst length of 1 and just 4 bytes per package, it will require many individual AXI transactions to transfer any reasonable amount of data.
Small burst lengths, as well as burst sizes considerably less than 512 bits, are therefore good opportunities to optimize interface performance.
Using Burst Data Transfers
Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller.
If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:
In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are implemented rather differently (shorter transaction time). If you hover over these transactions, you can see that the AXI interconnect has packed the 16 x 4-byte transactions into a single 1 x 64-byte transaction. This effectively uses the AXI4 bandwidth, which is even more favorable. The next section focuses on this optimization technique in more detail.
Burst inference is heavily dependent on coding style and access pattern. However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:
void kernel(T in[1024], T out[1024]) {
    T tmpIn[1024];
    T tmpOut[1024];
    read(in, tmpIn);
    process(tmpIn, tmpOut);
    write(tmpOut, out);
}
In short, the function read is responsible for reading from the AXI input to an internal variable (tmpIn). The computation is implemented by the function process working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output.
The isolation of the read and write function from the computation results in:
- Simple control structures (loops) in the read/write function which makes burst detection simpler.
- The isolation of the computational function away from the AXI interfaces simplifies potential kernel optimization. See Kernel Optimization for more information.
- The internal variables are mapped to on-chip memory, which allow faster access compared to AXI transactions. Acceleration platforms supported in the Vitis core development kit can have as much as 10 MB on-chip memories that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
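A possible shape of the read and write helpers is sketched below; the fixed size of 1024 and the type T are placeholders carried over from the snippet above, not a complete kernel:
void read(T in[1024], T tmpIn[1024]) {
    // A simple pipelined copy loop with a sequential access pattern is easy
    // for the tool to infer as a single burst read on the AXI interface.
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        tmpIn[i] = in[i];
    }
}

void write(T tmpOut[1024], T out[1024]) {
    // The same structure applies to the burst write back to global memory.
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        out[i] = tmpOut[i];
    }
}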
Using Full AXI Data Width
The user data width between the kernel and the memory controller can be
configured by the Vitis compiler based on the data
types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types map to the full data width
on the memory controller. The memory controller in all supported acceleration cards
supports 512-bit user interface, which can be mapped to OpenCL
vector data types, such as int16
or C/C++ arbitrary
precision data type ap_int<512>
.
As shown in the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit package size (Burst Size 64 bytes).
This example shows good interface configuration as it maximizes AXI data width as well as actual burst transactions.
Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.
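As an illustration, a C/C++ kernel interface can be widened to the full 512 bits using ap_int<512>; the kernel name, bundle names, and loop body below are assumptions for this sketch, not taken from a shipped example:
#include "ap_int.h"

extern "C" {
void wide_copy(const ap_int<512>* in, ap_int<512>* out, int num_words) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
    // Each AXI beat now carries 512 bits (64 bytes) instead of 32 bits.
    for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];
    }
}
}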
Setting Data Width Using OpenCL Attributes
The OpenCL API provides attributes to support a more automatic approach to increasing AXI data width usage. Changing the interface data types, as stated above, is supported in the API as well, but requires the same code changes to the algorithm as in C/C++ to accommodate the larger input vector.
To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm:
Examine the combined functionality in the following case:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
size_t idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
In this case, the hard-coded interface is a 32-bit wide data path (int *c, int *a, int *b), which drastically limits the memory throughput if implemented directly. However, automatic widening and transformation is applied, based on the values of the three attributes.
__attribute__((vec_type_hint(int)))
- Declares that int is the main type used for computation and memory transfer (32 bits). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example, the factor would be 16 = 512 bits / 32 bits. This implies that, in theory, 16 values could be processed if vectorization can be applied.
__attribute__((reqd_work_group_size(X, Y, Z)))
- Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, therefore defining the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.
  The actual vectorization factor to be applied is the greatest common divisor of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.
  The quotient of the maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.
__attribute__((xcl_zero_global_work_offset))
- Instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information with regard to the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).
It should be noted that the application of these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually behaves nicely, but can interact poorly in rare situations.
For example:
- For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
  - The resulting access requires a lot of multiplexers and creates a difficult problem for the scheduler (which might severely increase memory usage and compilation time). Xilinx recommends using partitioning factors that are powers of two (as the vectorization factor is always a power of two).
- If the loop being vectorized has an unrelated resource constraint, the scheduler complains about the II not being met:
  - This is not necessarily correlated with a loss of performance (usually it is still performing better) because the II is computed on the unrolled loop (which therefore has a multiplied throughput for each iteration).
  - The scheduler informs you of the possible resource constraints, and resolving those will further improve the performance.
  - Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in a non-vectorizable way).
Reducing Kernel to Kernel Communication Latency with OpenCL Pipes
The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency. For more information, see Pipe Functions on Version 2.0 of the OpenCL C Specification from Khronos Group.
In the Vitis IDE, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not supported. The depth of a pipe must be specified by using the OpenCL attribute xcl_reqd_pipe_depth in the pipe declaration. For more information, see xcl_reqd_pipe_depth.
As specified in xcl_reqd_pipe_depth, the valid depth values are: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, and 32768.
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
Pipes must be accessed using the read_pipe_block() and write_pipe_block() functions in blocking mode; accessing pipes with the non-blocking read_pipe() and write_pipe() functions is not supported. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions.
The following are the signatures of the supported blocking functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types:
int read_pipe_block (pipe gentype p, gentype *ptr)
int write_pipe_block (pipe gentype p, const gentype *ptr)
The following dataflow/dataflow_pipes_ocl example from Xilinx Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
__attribute__((xcl_pipeline_loop))
mem_rd: for (int i = 0 ; i < size ; i++)
{
//blocking Write command to pipe P0
write_pipe_block(p0, &input[i]);
}
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
__attribute__((xcl_pipeline_loop))
execute: for(int i = 0 ; i < size ; i++)
{
int input_data, output_data;
//blocking read command to Pipe P0
read_pipe_block(p0, &input_data);
output_data = input_data + inc;
//blocking write command to Pipe P1
write_pipe_block(p1, &output_data);
}
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
__attribute__((xcl_pipeline_loop))
mem_wr: for (int i = 0 ; i < size ; i++)
{
//blocking read command to Pipe P1
read_pipe_block(p1, &output[i]);
}
}
The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.
Optimizing Computational Parallelism
By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. On the other hand, the OpenCL API does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism.
Coding Data Parallelism
To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize the computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the Vitis technology might not be able to apply the desired transformation due to the structure of the source code.
It is quite common that some computational parallelism is not reflected in the source code to begin with. In this case, it will need to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in Using Full AXI Data Width. A 512-bit interface can be created using OpenCL vector data types such as int16 or the C/C++ arbitrary precision data type ap_int<512>.
Note: Refer to the Median Filter Example in the vision category of the Xilinx Getting Started Examples on GitHub for the recommended method to use vectors such as int16.
Loop Parallelism
Loops are the basic C/C++/OpenCL API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:
for(int i = 0; i<255; i++) {
out[i] = in[i]+in[i+1];
}
out[255] = in[255];
This code iterates over an array of values and adds consecutive values, except the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total of 510 cycles. This can be analyzed in detail through the Schedule Viewer in the HLS Project:
This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:
The key numbers here are the latency and the total LUT usage. For example, depending on the configuration, you could get a latency of 511 cycles and a total LUT usage of 47; these values vary based on the implementation choices. While this implementation requires very little area, it results in significant latency.
Unrolling Loops
Unrolling a loop enables the full parallelism of the model to be used. To perform this, mark a loop to be unrolled and the tool will create the implementation with the most parallelism possible. To mark a loop to unroll, an OpenCL loop can be marked with the UNROLL attribute:
__attribute__((opencl_unroll_hint))
Or a C/C++ loop can use the unroll pragma:
#pragma HLS UNROLL
For more information, see Loop Unrolling.
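For reference, the C/C++ loop with the pragma inserted would be (a sketch):
for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
    out[i] = in[i] + in[i + 1];
}
out[255] = in[255];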
When applied to this specific example, the Schedule Viewer in the HLS Project will be:
The following figure shows the estimated performance:
As a result, the total latency was considerably improved to 127 cycles and, as expected, the computational hardware was increased to 4845 LUTs to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface used for the variable out. The Vitis core development kit uses dual-port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Thus, to see a fully parallel implementation, you must specify that the variable out should be kept in registers, as in this example:
#pragma HLS array_partition variable= out complete dim= 0
For more information, see pragma HLS array_partition.
The results of this transformation can be observed in the following Schedule Viewer:
The associated estimates are:
Accordingly, this code can be implemented as a combinatorial function requiring only a fraction of a clock cycle to complete.
Pipelining Loops
Pipelining loops allows you to overlap iterations of a loop in time, as discussed in Loop Pipelining. Allowing loop iterations to operate concurrently is often a good approach, as resources can be shared between iterations (less resource utilization than full unrolling), while requiring less execution time compared to a loop that is not pipelined.
Pipelining is enabled in C/C++ through the pragma HLS pipeline:
#pragma HLS PIPELINE
While the OpenCL API uses the xcl_pipeline_loop attribute for loops:
__attribute__((xcl_pipeline_loop))
or the xcl_pipeline_workitems attribute for pipelining work items:
__attribute__((xcl_pipeline_workitems))
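For comparison, the same loop pipelined in C/C++ could be written as the following sketch:
for (int i = 0; i < 255; i++) {
#pragma HLS PIPELINE II=1
    out[i] = in[i] + in[i + 1];
}
out[255] = in[255];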
In this example, the Schedule Viewer in the HLS Project produces the following information:
With the overall estimates being:
Because each iteration of a loop consumes only two cycles of latency, there can only be a single iteration overlap. This enables the total latency to be cut into half compared to the original, resulting in 257 cycles of total latency. However, this reduction in latency was achieved using fewer resources when compared to unrolling.
In most cases, loop pipelining by itself can improve overall performance. Yet, the effectiveness of the pipelining depends on the structure of the loop. Some common limitations are:
- Resources with limited availability such as memory ports or process channels can limit the overlap of the iterations (Initiation Interval).
- Loop-carry dependencies, such as those created by variable conditions computed in one iteration affecting the next, might increase the II of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to remove these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array, or breaking the dependencies all together.
Task Parallelism
Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks.
See the following example:
void run (ap_uint<16> in[1024],
ap_uint<16> out[1024]
) {
ap_uint<16> tmp[128];
for(int i = 0; i<8; i++) {
processA(&(in[i*128]), tmp);
processB(tmp, &(out[i*128]));
}
}
When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given the combined latency of processA and processB, the loop latency is set to 278, and the total latency can be estimated as:
The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, task parallelism is performed by adding the DATAFLOW
pragma into the for-loop:
#pragma HLS DATAFLOW
For OpenCL API code, add the attribute before the for-loop:
__attribute__ ((xcl_dataflow))
Refer to Dataflow Optimization, HLS Pragmas, and OpenCL Attributes for more details on this topic.
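Applied to the run function above, a C/C++ sketch could look as follows; note that, as an assumption of the canonical dataflow form, the intermediate buffer is declared inside the dataflow region so the tool can turn it into a ping-pong channel:
void run(ap_uint<16> in[1024], ap_uint<16> out[1024]) {
    for (int i = 0; i < 8; i++) {
#pragma HLS DATAFLOW
        // tmp becomes the ping-pong buffer between the two tasks.
        ap_uint<16> tmp[128];
        processA(&(in[i * 128]), tmp);
        processB(tmp, &(out[i * 128]));
    }
}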
As illustrated by the estimates in the HLS report, applying the transformation will considerably improve the overall performance effectively using a double (ping pong) buffer scheme between the tasks:
The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:
(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles
Using task parallelism is a powerful method to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. It is often necessary to look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Finally, the Vitis core development kit provides the Detailed Kernel Trace, which illustrates concurrent execution.
For this Detailed Kernel Trace, the tool displays the start of the dataflow loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until processA completes before starting its first iteration. However, while processB completes the first iteration of the loop, processA begins operating on the second iteration, and so on.
A more abstract representation of this information is presented in Application Timeline for the host and device activity.
Optimizing Compute Units
Data Width
One of the most important aspects for performance, if not the most important, is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL API code might only use large data types such as integers, even at the ports of the design. However, as the algorithm is mapped to a fully configurable implementation, smaller data types such as 10-bit or 12-bit might often suffice. It is beneficial to check the size of basic operations in the HLS Synthesis report during optimization.
In general, when the Vitis core development kit maps an algorithm onto the FPGA, more processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the Vitis core development kit generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
In the following figure, the basic operations and their bit-width are reported.
Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time.
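For example, the C/C++ arbitrary precision types let you state the true width of a value explicitly; the 12-bit width below is purely an assumption for illustration:
#include "ap_int.h"

// If the data is known to fit in 12 bits, declaring it as such produces a
// much smaller adder than the default 32-bit int would.
typedef ap_int<12> sample_t;

sample_t add_samples(sample_t a, sample_t b) {
    return a + b;
}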
Fixed Point Arithmetic
Some applications use floating-point computation only because they were developed for another hardware architecture. Using fixed-point arithmetic for applications such as deep learning can significantly improve power efficiency and reduce area while keeping the same level of accuracy.
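As a sketch, a floating-point multiply-accumulate could be re-expressed with the HLS ap_fixed type; the 16-bit total width and 6 integer bits below are assumptions that must be validated against the accuracy requirements of the application:
#include "ap_fixed.h"

// 16-bit fixed-point value: 6 integer bits, 10 fractional bits.
typedef ap_fixed<16, 6> coeff_t;

coeff_t mac(coeff_t acc, coeff_t a, coeff_t b) {
    // Same arithmetic as the floating-point version, implemented with small
    // integer multipliers and adders instead of floating-point cores.
    return acc + a * b;
}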
Macro Operations
It is sometimes advantageous to think about larger computational elements. The tool will operate on the source code independently of the remaining source code, effectively mapping the algorithm onto the FPGA without consideration of surrounding operations. When applied, the Vitis technology keeps operational boundaries, effectively creating macro operations for specific code. This uses the following principles:
- Operational locality to the mapping process
- Reduction in complexity for the heuristics
This might create vastly different results when applied. In C/C++, macro operations are created with the help of #pragma HLS inline off. While in the OpenCL API, the same kind of macro operation can be generated by not specifying the following attribute when defining a function:
__attribute__((always_inline))
For more information, see pragma HLS inline.
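For instance, a small helper can be preserved as its own macro operation with the pragma; the function below is illustrative only:
int scale_and_offset(int sample, int gain, int offset) {
#pragma HLS INLINE off
    // Keeping this function out-of-line preserves it as a distinct
    // operational unit during synthesis rather than dissolving it into the
    // surrounding logic.
    return sample * gain + offset;
}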
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and will typically have better performance compared to the corresponding functions (without the native_ prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the Vitis technology, these native_ built-in functions use the equivalent functions in the Vitis HLS tool Math library, which are already optimized for Xilinx FPGAs in terms of area and performance.
Xilinx recommends using the native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
Optimizing Memory Architecture
Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:
void run (ap_uint<16> in[256][4],
ap_uint<16> out[256]
) {
...
ap_uint<16> inMem[256][4];
ap_uint<16> outMem[256];
... Preprocess input to local memory
for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
ap_uint<16> sum = 0;
for( int i = 0; i<4; i++) {
sum += inMem[j][i];
}
outMem[j] = sum;
}
... Postprocess write local memory to output
}
This code adds the four values associated with the inner dimension of the two dimensional input array. If implemented without any additional modifications, it results in the following estimates:
The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This is observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.
However, this improvement is largely because of the process using both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project:
Two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result as this completely blocks the access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
For more information, see pragma HLS array_partition.
This results in four array reads, all executed on different memories using a single port:
Using a total of 256 * 4 cycles = 1024 cycles for loop 2.
Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
For more information, see pragma HLS array_reshape.
This results in the same latency as with the array partitioning, but with a single memory using a single port:
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making this the preferred solution.
void run (ap_uint<16> in[256][4],
ap_uint<16> out[256]
) {
...
ap_uint<16> inMem[256][4];
ap_uint<16> outMem[256];
#pragma HLS array_reshape variable=inMem complete dim=2
... Preprocess input to local memory
for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
ap_uint<16> sum = 0;
for( int i = 0; i<4; i++) {
#pragma HLS UNROLL
sum += inMem[j][i];
}
outMem[j] = sum;
}
... Postprocess write local memory to output
}
Kernel SLR and DDR Memory Assignments
Kernel compute unit (CU) instance and DDR memory resource floorplanning are keys to meeting quality of results of your design in terms of frequency and resources. Floorplanning involves explicitly allocating CUs (a kernel instance) to SLRs and mapping CUs to DDR memory resources. When floorplanning, both CU resource usage and DDR memory bandwidth requirements need to be considered.
The largest Xilinx FPGAs are made up of multiple stacked silicon dies. Each stack is referred to as a super logic region (SLR) and has a fixed amount of resources and memory, including DDR interfaces. The available device SLR resources that can be used for custom logic can be found in the Vitis 2020.1 Software Platform Release Notes, or can be displayed using the platforminfo utility described in platforminfo Utility.
You can use the actual kernel resource utilization values to help distribute CUs across SLRs to reduce congestion in any one SLR. The system estimate report lists the number of resources (LUTs, Flip-Flops, BRAMs, etc.) used by the kernels early in the design cycle. The report can be generated during hardware emulation and system compilation through the command line or GUI and is described in System Estimate Report.
Use this information along with the available SLR resources to help assign CUs to SLRs such that no one SLR is over-utilized. The less congestion in an SLR, the better the tools can map the design to the FPGA resources and meet your performance target. For mapping memory resources and CUs, see Mapping Kernel Ports to Global Memory and Assigning Compute Units to SLRs.
After allocating your CUs to SLRs, map any CU master AXI ports to DDR memory resources. Xilinx recommends connecting to a DDR memory resource in the same SLR as the CU. This reduces competition for the limited SLR-crossing connection resources. In addition, connections between SLRs use super long line (SLL) routing resources, which incur a greater delay than standard intra-SLR routing.
It might be necessary to cross an SLR region to connect to a DDR resource in a different SLR. However, if both the connectivity.sp and the connectivity.slr directives are explicitly defined, the tools automatically add the additional crossing logic to minimize the effect of the SLL delay and facilitate better timing closure.
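For example, a configuration file passed to v++ --config could define both directives together; the CU and interface names below (krnl_vadd_1, m_axi_gmem) and the chosen DDR bank and SLR are placeholders for illustration:
[connectivity]
sp=krnl_vadd_1.m_axi_gmem:DDR[3]
slr=krnl_vadd_1:SLR0
Here the CU is pinned to SLR0 while its AXI master connects to a DDR bank in another SLR, letting the tools insert the appropriate SLR-crossing logic.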
Guidelines for Kernels that Access Multiple Memory Banks
The DDR memory resources are distributed across the super logic regions (SLRs) of the platform. Because the number of connections available for crossing between SLRs is limited, the general guidance is to place a kernel in the same SLR as the DDR memory resource with which it has the most connections. This reduces competition for SLR-crossing connections and avoids consuming extra logic resources associated with SLR crossing.
As shown in the previous figure, when a kernel has a single AXI interface that maps only a single memory bank, the platforminfo utility described in platforminfo Utility lists the SLR that is associated with the memory bank of the kernel, and therefore the SLR where the kernel would be best placed. In this scenario, the design tools might automatically place the kernel in that SLR without the need for extra input; however, you might need to provide an explicit SLR assignment for some of the kernels under the following conditions:
- If the design contains a large number of kernels accessing the same memory bank.
- A kernel requires some specialized logic resources that are not available in the SLR of the memory bank.
When a kernel has multiple AXI interfaces and all of the interfaces of the kernel access the same memory bank, it can be treated in a very similar way to the kernel with a single AXI interface, and the kernel should reside in the same SLR as the memory bank that its AXI interfaces are mapping.
When a kernel has multiple AXI interfaces to multiple memory banks in different SLRs, the recommendation is to place the kernel in the SLR that has the majority of the memory banks accessed by the kernel (shown in the figure above). This minimizes the number of SLR crossings required by this kernel, which leaves more SLR-crossing resources available for other kernels in your design to reach their memory banks.
When the kernel is mapping memory banks from different SLRs, explicitly specify the SLR assignment as described in Kernel SLR and DDR Memory Assignments.
As shown in the previous figure, when a platform contains more than two SLRs, it is possible that the kernel might map a memory bank that is not in the SLR immediately adjacent to its most commonly mapped memory bank. When this scenario arises, memory accesses to the distant memory bank must cross more than one SLR boundary and incur additional SLR-crossing resource costs. To avoid such costs, it might be better to place the kernel in an intermediate SLR where it only requires less expensive crossings into the adjacent SLRs.
Exploring Kernel Optimizations Using Vitis HLS
All kernel optimizations using OpenCL or C/C++ can be performed from within the Vitis core development kit. The primary performance optimizations, such as those discussed in this section (pipelining function and loops, applying dataflow to enable greater concurrency between functions and loops, unrolling loops, etc.), are performed by the Vitis HLS tool.
The Vitis core development kit automatically calls the HLS tool. However, to use the GUI analysis capabilities, you must launch the HLS tool directly from within the Vitis technology. Using the HLS tool in standalone mode, as discussed in Compiling Kernels with Vitis HLS, enables the following enhancements to the optimization methodology:
- The ability to focus solely on the kernel optimization, because there is no requirement to execute emulation.
- The ability to create multiple solutions, compare their results, and explore the solution space to find the most optimal design.
- The ability to use the interactive Analysis Perspective to analyze the design performance.
To open the HLS tool in standalone mode, from the Assistant window, right-click the hardware function object, and select Open HLS Project, as shown in the following figure.
Topological Optimization
This section focuses on the topological optimization. It looks at the attributes related to the rough layout and implementation of multiple compute units and their impact on performance.
Multiple Compute Units
Depending on available resources on the target device, multiple compute units of the same kernel (or different kernels) can be created to run in parallel, which improves the system processing time and throughput. For more details, see Creating Multiple Instances of a Kernel.
Using Multiple DDR Banks
Acceleration cards supported in the Vitis technology provide one, two, or four DDR banks, and up to 80 GB/s of raw DDR bandwidth. For kernels moving large amounts of data between the FPGA and the DDR, Xilinx® recommends that you direct the Vitis compiler and runtime library to use multiple DDR banks.
In addition to DDR banks, the host application can access PLRAM to transfer data directly to a kernel. This feature is enabled using the connectivity.sp option in a configuration file specified with the v++ --config option. Refer to Mapping Kernel Ports to Global Memory for more information on implementing this optimization, and to Memory Mapped Interfaces for details on data transfer to the global memory banks.
To take advantage of multiple DDR banks, you need to assign CL memory buffers to different banks in the host code, as well as configure the xclbin file to match the bank assignment on the v++ command line.
The following block diagram shows the Global Memory Two Banks (C) example in Vitis Examples on GitHub. This example connects the input pointer interface of the kernel to DDR bank 0, and the output pointer interface to DDR bank 1.
Assigning DDR Bank in Host Code
Bank assignment in host code is supported by a Xilinx vendor extension. The following code snippet shows the required header file, as well as the assignment of the input and output buffers to DDR bank 0 and bank 1, respectively:
#include <CL/cl_ext.h>
…
int main(int argc, char** argv)
{
…
cl_mem_ext_ptr_t inExt, outExt; // Declaring two extensions for both buffers
inExt.flags = 0|XCL_MEM_TOPOLOGY; // Specify Bank0 Memory for input memory
outExt.flags = 1|XCL_MEM_TOPOLOGY; // Specify Bank1 Memory for output Memory
inExt.obj = 0 ; outExt.obj = 0; // Setting Obj and Param to Zero
inExt.param = 0 ; outExt.param = 0;
int err;
//Allocate Buffer in Bank0 of Global Memory for Input Image using Xilinx Extension
cl_mem buffer_inImage = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX,
image_size_bytes, &inExt, &err);
if (err != CL_SUCCESS){
std::cout << "Error: Failed to allocate device Memory" << std::endl;
return EXIT_FAILURE;
}
//Allocate Buffer in Bank1 of Global Memory for Output Image using Xilinx Extension
cl_mem buffer_outImage = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_EXT_PTR_XILINX,
image_size_bytes, &outExt, &err);
if (err != CL_SUCCESS){
std::cout << "Error: Failed to allocate device Memory" << std::endl;
return EXIT_FAILURE;
}
…
}
cl_mem_ext_ptr_t is a struct as defined below:
typedef struct{
unsigned flags;
void *obj;
void *param;
} cl_mem_ext_ptr_t;
- Valid values for flags are:
  - XCL_MEM_DDR_BANK0
  - XCL_MEM_DDR_BANK1
  - XCL_MEM_DDR_BANK2
  - XCL_MEM_DDR_BANK3
  - <id> | XCL_MEM_TOPOLOGY
  Note: The <id> is determined by looking at the Memory Configuration section in the xxx.xclbin.info file generated next to the xxx.xclbin file. In the xxx.xclbin.info file, the global memory (DDR, PLRAM, etc.) is listed with an index representing the <id>.
- obj is the pointer to the associated host memory allocated for the CL memory buffer only if the CL_MEM_USE_HOST_PTR flag is passed to the clCreateBuffer API; otherwise, set it to NULL.
- param is reserved for future use. Always assign it to 0 or NULL.
Assigning Global Memory for Kernel Code
Creating Multiple AXI Interfaces
OpenCL kernels, C/C++ kernels, and RTL kernels have different methods for assigning function parameters to AXI interfaces.
- For OpenCL kernels, the --max_memory_ports option is required to generate one AXI4 interface for each global pointer on the kernel argument. The AXI4 interface name is based on the order of the global pointers on the argument list.
  The following code is taken from the example gmem_2banks_ocl in the ocl_kernels category of the Vitis Accel Examples on GitHub:
  __kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
  void apply_watermark(__global const TYPE * __restrict input, __global TYPE * __restrict output, int width, int height) { ... }
  In this example, the first global pointer input is assigned the AXI4 name M_AXI_GMEM0, and the second global pointer output is assigned the name M_AXI_GMEM1.
- For C/C++ kernels, multiple AXI4 interfaces are generated by specifying different “bundle” names in the HLS INTERFACE pragma for different global pointers. Refer to Kernel Interfaces for more information.
  The following is a code snippet from the gmem_2banks example that assigns the input pointer to the bundle gmem0 and the output pointer to the bundle gmem1. The bundle name can be any valid C string, and the generated AXI4 interface name will be M_AXI_<bundle_name>. For this example, the input pointer will have the AXI4 interface name M_AXI_gmem0, and the output pointer will have M_AXI_gmem1. Refer to pragma HLS interface for more information.
  #pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
  #pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
- For RTL kernels, the port names are generated during the import process by the RTL kernel wizard. The default names proposed by the RTL kernel wizard are m00_axi and m01_axi. If not changed, these names have to be used when assigning a DDR bank through the connectivity.sp option in the configuration file. Refer to Mapping Kernel Ports to Global Memory for more information.
Assigning AXI Interfaces to DDR Banks
The following is an example configuration file that specifies the connectivity.sp option, and the v++ command line, connecting the input pointer (M_AXI_GMEM0) to DDR bank 0 and the output pointer (M_AXI_GMEM1) to DDR bank 1:
The config_sp.txt file:
[connectivity]
sp=apply_watermark_1.m_axi_gmem0:DDR[0]
sp=apply_watermark_1.m_axi_gmem1:DDR[1]
The v++ command line:
v++ apply_watermark --config config_sp.txt
You can use the Device Hardware Transaction view to observe the actual DDR Bank communication, and to analyze DDR usage.
Assigning AXI Interfaces to PLRAM
Some platforms support PLRAMs. In these cases, use the same --connectivity.sp option as described in Assigning AXI Interfaces to DDR Banks, but use the name PLRAM[id]. Valid names supported by specific platforms can be found in the Memory Configuration section of the xclbin.info file generated alongside the xclbin.
Assigning Kernels to SLR Regions
Assigning ports to global memory banks requires the kernel to be physically routed on the FPGA, to connect to the assigned DDR, HBM, or block RAM. Currently, large FPGAs use stacked silicon devices with several super logic regions (SLRs). By default, the Vitis core development kit will place the compute units in the same SLR as the target platform. This is not always desirable, especially when the kernel connects to specific memory banks in a different SLR region. In this case, you will want to manually assign the kernel instance, or CU into the same SLR as the global memory. For more information, see Mapping Kernel Ports to Global Memory.
You can assign the CU instance to an SLR using the connectivity.slr
option described in Assigning Compute Units to SLRs.
You can display the available SLRs of the target platform using the platforminfo command described in platforminfo Utility.