C/C++ Kernels
In the Vitis™ core development kit, the kernel code is generally the compute-intensive part of the algorithm that is meant to be accelerated on the FPGA. The Vitis core development kit supports kernel code written in C/C++, OpenCL™, and RTL. This chapter focuses on the coding style for C/C++ based kernels.
Generally, off-the-shelf software cannot be efficiently converted into accelerated hardware on an FPGA. Even if the software program can be automatically converted (or synthesized) into hardware, achieving acceptable quality of results (QoR) will require additional work such as rewriting elements of the software to help Vitis HLS achieve the desired performance goals. To help, you need to understand the best practices for writing good software for execution on the FPGA as discussed in Design Principles for Software Programmers in the Vitis High-Level Synthesis User Guide (UG1399).
The kernel function must be declared with extern "C" linkage in the header file, or the whole function in the kernel source code must be wrapped, as follows:
extern "C" {
void kernel_function(int *in, int *out, int size);
}
Process Execution Modes
As discussed in Kernel Properties, XRT-managed kernels have two types of execution modes. These modes are determined by the block protocol assigned to the kernel by Vitis HLS during kernel compilation. The block protocol can be specified using #pragma HLS INTERFACE. The modes and the block protocols that enable them are listed below:
- Pipeline: Enabled by the default block protocol of ap_ctrl_chain. Lets kernels overlap in execution, with a single kernel finishing the execution of one task while starting the execution of the next.
- Sequential: Serial access mode enabled by ap_ctrl_hs. Requires a kernel to complete the execution of one task before starting the next.
For more information on how XRT supports these execution modes, refer to Supported Kernel Execution Models.
Pipeline Execution
If a kernel can accept more data while it is still operating on data from previous transactions, XRT can send the next batch of data as described in Temporal Data Parallelism: Host-to-Kernel Dataflow. Pipeline mode lets the kernel overlap multiple kernel runs, which improves the overall throughput.
To support pipeline mode, the kernel has to use the ap_ctrl_chain protocol, which is the default protocol used by Vitis HLS. This protocol can also be enabled explicitly by assigning the #pragma HLS INTERFACE to the function return, as shown in the following example.
void kernel_name( int *inputs,
... )// Other input or Output ports
{
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
For pipeline execution to be successful, the kernel should have a longer latency for the queue of the kernel, or else there might be insufficient time for the kernel to process each batch of data, and you would not see the benefit of the pipeline. If a pipelined kernel is unable to process data in a pipelined manner, it reverts to sequential execution.
For legacy reasons, XRT-managed kernels also support a pure sequential mode, which can be configured using the ap_ctrl_hs block protocol for the function return in the #pragma HLS INTERFACE.
Never-Ending Mode
By default, Vitis HLS generates a kernel with synchronization controlled by the host application. The host controls and monitors the start and end of the kernel. However, in some cases the kernel does not need to be controlled by the host, such as when it processes a continuous data stream. These kernels can use the auto_restart signal of the ap_ctrl_chain block protocol, as described in Streaming Data in User-Managed Never-Ending Kernels. This is considered a user-managed kernel because the user sets the ap_start bit and auto_restart bit to start kernel execution, but beyond its initial start it is largely not controlled by software.
Data Types
Because it is faster to write and verify code using native C data types such as int, float, or double, it is common practice to use these data types when coding for the first time. However, the code is implemented in hardware, and all the operator sizes used in the hardware depend on the data types used in the accelerator code. The default native C/C++ data types can result in larger and slower hardware resources that can limit the performance of the kernel. Instead, consider using bit-accurate data types to ensure the code is optimized for implementation in hardware. Using bit-accurate, or arbitrary precision, data types results in hardware operators that are smaller and faster. This allows more logic to be placed into the programmable logic and also allows the logic to execute at higher clock frequencies while using less power.
Consider using bit-accurate data types instead of native C/C++ data types in your code.
In the following sections, the two most common arbitrary precision data types (arbitrary precision integer type and arbitrary precision fixed-point type) supported by the Vitis compiler are discussed.
Arbitrary Precision Integer Types
Arbitrary precision integer data types are defined by ap_int or ap_uint, for signed and unsigned integers respectively, inside the header file ap_int.h. To use an arbitrary precision integer data type:
- Add the header file ap_int.h to the source code.
- Change the bit types to ap_int<N> or ap_uint<N>, where N is a bit-size from 1 to 1024.
The following example shows how the header file is added and how two variables are declared as a 9-bit signed integer and a 10-bit unsigned integer.
#include "ap_int.h"
ap_int<9> var1; // 9-bit signed integer
ap_uint<10> var2; // 10-bit unsigned integer
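To get a feel for what such narrow types can hold, the following plain C++ sketch models the value range of an N-bit integer without depending on ap_int.h. The helper names wrap_unsigned and wrap_signed are hypothetical, and the sketch assumes that an out-of-range value simply keeps its low N bits (two's complement wrap-around); treat it as an illustration of bit widths rather than a description of the library's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: keep only the low `bits` bits of `value`.
// Limited to 62 bits here so every shift fits in a 64-bit integer.
inline std::uint64_t wrap_unsigned(std::int64_t value, int bits) {
    assert(bits >= 1 && bits <= 62);
    const std::uint64_t mask = (std::uint64_t{1} << bits) - 1;
    return static_cast<std::uint64_t>(value) & mask;
}

// Hypothetical helper: interpret the low `bits` bits as a two's complement number.
inline std::int64_t wrap_signed(std::int64_t value, int bits) {
    const std::uint64_t low = wrap_unsigned(value, bits);
    const std::uint64_t sign_bit = std::uint64_t{1} << (bits - 1);
    const std::int64_t modulus = std::int64_t{1} << bits;
    return (low & sign_bit) ? static_cast<std::int64_t>(low) - modulus
                            : static_cast<std::int64_t>(low);
}
```

Under this model a 9-bit signed value spans -256 to 255 and a 10-bit unsigned value spans 0 to 1023, so wrap_signed(256, 9) yields -256 and wrap_unsigned(1024, 10) yields 0.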
Arbitrary Precision Fixed-Point Data Types
Some existing applications use floating point data types as they are written for other hardware architectures. However, fixed-point data types are a useful replacement for floating point types which require many clock cycles to complete. When choosing to implement floating-point versus fixed-point arithmetic for your application and accelerators, carefully evaluate trade-offs in power, cost, productivity, and precision.
As discussed in Reduce Power and Cost by Converting from Floating Point to Fixed Point (WP491), using fixed-point arithmetic instead of floating point can increase power efficiency and lower the total power required. Unless the entire range of the floating-point type is required, a fixed-point type can often deliver the same accuracy with smaller and faster hardware.
Fixed-point data types model the data as integer and fraction bits. The fixed-point data type requires the ap_fixed.h header, and supports both a signed and an unsigned form as follows:
- Header file: ap_fixed.h
- Signed fixed point: ap_fixed<W,I,Q,O,N>
- Unsigned fixed point: ap_ufixed<W,I,Q,O,N>
  - W = Total width < 1024 bits
  - I = Integer bit width. The value of I must be less than or equal to the width (W). The number of bits to represent the fractional part is W minus I. Only a constant integer expression can be used to specify the integer width.
  - Q = Quantization mode. Only predefined enumerated values can be used to specify Q. The accepted values are:
    - AP_RND: Rounding to plus infinity.
    - AP_RND_ZERO: Rounding to zero.
    - AP_RND_MIN_INF: Rounding to minus infinity.
    - AP_RND_INF: Rounding to infinity.
    - AP_RND_CONV: Convergent rounding.
    - AP_TRN: Truncation. This is the default value when Q is not specified.
    - AP_TRN_ZERO: Truncation to zero.
  - O = Overflow mode. Only predefined enumerated values can be used to specify O. The accepted values are:
    - AP_SAT: Saturation.
    - AP_SAT_ZERO: Saturation to zero.
    - AP_SAT_SYM: Symmetrical saturation.
    - AP_WRAP: Wrap-around. This is the default value when O is not specified.
    - AP_WRAP_SM: Sign magnitude wrap-around.
  - N = The number of saturation bits in the overflow WRAP modes. Only a constant integer expression can be used as the parameter value. The default value is zero.
In the example code below, the ap_fixed type is used to define a signed 18-bit variable with 6 bits representing the integer value above the binary point and, by implication, 12 bits representing the fractional value below the binary point. The quantization mode is set to round to plus infinity (AP_RND). Because the overflow mode and saturation bits are not specified, the defaults AP_WRAP and 0 are used.
#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND> my_type;
...
When performing calculations where the variables have different numbers of bits (W), or different precision (I), the binary point is automatically aligned. For more information on using fixed-point data types, see C++ Arbitrary Precision Fixed-Point Types in the Vitis High-Level Synthesis User Guide (UG1399).
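The difference between the default truncation (AP_TRN) and rounding to plus infinity (AP_RND) can be sketched in plain C++ for the 12 fractional bits of the ap_fixed<18,6> example above. The functions quantize_trn and quantize_rnd are hypothetical names, not part of ap_fixed.h. The sketch assumes that truncation drops the bits below the least significant bit and that AP_RND adds half a least significant bit before truncating. It deliberately ignores overflow handling.

```cpp
#include <cassert>
#include <cmath>

constexpr int kFracBits = 12;                                   // W - I = 18 - 6
constexpr double kScale = static_cast<double>(1 << kFracBits);  // 4096 steps per unit

// Hypothetical model of AP_TRN: discard everything below one LSB
// (equivalent to rounding toward minus infinity).
inline double quantize_trn(double x) {
    assert(std::isfinite(x));
    return std::floor(x * kScale) / kScale;
}

// Hypothetical model of AP_RND: add half an LSB, then discard the rest,
// so ties move toward plus infinity.
inline double quantize_rnd(double x) {
    assert(std::isfinite(x));
    return std::floor(x * kScale + 0.5) / kScale;
}
```

For a value that lies 1.5 least significant bits above 1.0, quantize_trn keeps one of those bits while quantize_rnd moves up to two, which is the kind of half-bit difference the Q parameter controls.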
Interfaces
Two types of data transfer occur between the host machine and the kernels on the FPGA. Data pointers are transferred between the host CPU and the accelerator through global memory banks. Scalar data is passed directly from the host to the kernel.
The time it takes to transfer data to/from the kernel can also influence the application architecture with respect to throughput goals. Due to the high overhead for data transfer, it is important to think about overlapping the computation with the communication (data movement) that is present in your application. Refer to Designing Efficient Kernels in the Vitis High-Level Synthesis User Guide (UG1399).
The Vitis HLS tool, which is part of the Vitis core development kit, automatically assigns interface ports for the parameters of your C/C++ kernel function. These port assignments are made during the v++ compilation process. The following sections provide additional details of these interface ports, and your ability to manually assign them, or override the default assignments using the INTERFACE pragma. If there are no user-defined INTERFACE pragmas in the code, then the following interface protocols are assigned by the Vitis tool:
- AXI4 Master interfaces (m_axi) are assigned to pointer arguments of the C/C++ function.
- AXI4-Lite interfaces (s_axilite) are assigned to scalar arguments, control signals for arrays, global variables, and the return value of the software function.
- Vitis HLS automatically infers burst transactions to aggregate memory accesses to maximize the throughput bandwidth and/or minimize the latency. For more information on burst transfers, refer to Optimizing Burst Transfers in the Vitis High-Level Synthesis User Guide (UG1399).
- When hls::stream is used to define a parameter type, the Vitis HLS tool infers an axis streaming interface.
Memory Mapped Interfaces
Memory mapped interfaces are inferred from pointer parameters. They allow kernels to read and write data in global memory, which is the memory that is shared between kernels and the host application. Therefore, memory mapped interfaces are a convenient way of sharing data across different elements of the accelerated application, but interfaces are only allowed for sequential and pipelined kernel execution modes as described in Kernel Properties.
To customize the default interfaces assigned by the Vitis tools during compilation, you can use the INTERFACE pragma. For optimal performance, Xilinx recommends performing burst transfers, if possible up to the AXI protocol limit of 4 KB per transfer.
Kernel Interfaces
void cnn( int *pixel, // Input pixel
int *weights, // Input Weight Matrix
int *out, // Output pixel
... // Other input or Output ports
In the example above, the kernel function has three pointer parameters: pixel, weights, and out. By default, the Vitis compiler maps these three parameters to the same AXI4 interface (m_axi).
The default interface mapping inferred by the compiler is equivalent to the following INTERFACE pragmas:
#pragma HLS INTERFACE m_axi port=pixel offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=weights offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
The bundle keyword on the INTERFACE pragma defines the name of the port. The system compiler creates a port for each unique bundle name, resulting in a compiled kernel object (XO) file that has a single AXI interface, m_axi_gmem. When the same bundle name is used for different interfaces, these interfaces are mapped to the same port.
The gmem name is short for global memory; however, it is not a keyword and is just used for consistency. You can assign your own names for the bundles.
Sharing ports helps save FPGA resources by eliminating AXI interfaces, but it can limit the performance of the kernel because all the memory transfers have to go through a single port. The m_axi port has independent READ and WRITE channels, so with a single m_axi port, you can do reads and writes simultaneously. However, the bandwidth and throughput of the kernel can be increased by creating multiple ports, using different bundle names, to connect to multiple memory banks. There are many options for configuring the INTERFACE, as described in pragma HLS interface. Some reasons to manually define an INTERFACE pragma in your code could include:
- Specifying the bundle for the INTERFACE pragma to separate AXI signals into separate bundles.
- Specifying the interface width to deviate from default int = 64 bytes (512-bits).
- Specifying AXI properties for burst transactions.
void cnn( int *pixel, // Input pixel
int *weights, // Input Weight Matrix
int *out, // Output pixel
... // Other input or Output ports
#pragma HLS INTERFACE m_axi port=pixel offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=weights offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
In the example above, two bundle names create two distinct ports: gmem and gmem1. The kernel accesses pixel and out data through the gmem port, while weights data is accessed through the gmem1 port. As a result, the kernel is able to make parallel accesses to pixel and weights, potentially improving the throughput of the kernel.
Specify bundle= names using all lowercase characters, so you can assign them to a specific memory bank using the connectivity.sp option.
The INTERFACE pragma is used during v++ compilation, resulting in a compiled kernel object (XO) file with two separate AXI interfaces, m_axi_gmem and m_axi_gmem1, that can be connected to global memory as needed. During system compiler linking, the separate interfaces can be mapped to different global memory banks using the connectivity.sp option in a configuration file, as described in Mapping Kernel Ports to Memory.
Memory Interface Width Considerations
The maximum data width from the global memory to and from the kernel is 512 bits. To maximize the data transfer rate, it is recommended that you use this full data width. By default in the Vitis kernel flow, the Vitis HLS tool automatically re-sizes the kernel interface ports up to 512-bits to improve burst access. For more information, refer to Automatic Port Width Resizing in the Vitis High-Level Synthesis User Guide (UG1399).
Automatic port width resizing has benefits and costs that you should weigh before relying on it:
- Improves the read latency from memory, because the tool reads a big vector instead of a single element of the data type size.
- Adds resources, because it needs to buffer the big vector and shift the data to the data path size.
- Supports only standard C data types, and does not support types such as ap_int, ap_uint, struct, or array.
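The width figures in this section translate into simple arithmetic, sketched below in plain C++. The function names are hypothetical. The 512-bit port width and the 4 KB AXI burst limit are the values quoted in this chapter, and the sketch assumes the element size divides the port width evenly.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kPortWidthBits = 512;    // maximum global memory data width
constexpr std::size_t kBurstLimitBytes = 4096; // AXI limit of 4 KB per transfer

// Hypothetical helper: how many elements of a given size fit in one 512-bit word.
inline std::size_t elements_per_word(std::size_t element_bits) {
    assert(element_bits > 0 && kPortWidthBits % element_bits == 0);
    return kPortWidthBits / element_bits;
}

// Hypothetical helper: how many full-width words make up one maximum-size burst.
inline std::size_t words_per_max_burst() {
    return kBurstLimitBytes / (kPortWidthBits / 8);
}
```

With 32-bit int data, one full-width word carries 16 elements, and a 4 KB burst corresponds to 64 such words.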
Burst Accesses to Global Memory
Accessing the global memory bank interface from the kernel has a large latency, so global memory transfer should be done in burst. For more information on burst transfers, refer to Optimizing Burst Transfers in the Vitis High-Level Synthesis User Guide (UG1399).
To infer the burst, the following pipelined loop coding style is recommended.
hls::stream<datatype_t> str;
INPUT_READ: for(int i=0; i<INPUT_SIZE; i++) {
#pragma HLS PIPELINE
str.write(inp[i]); // Reading from Input interface
}
In the code example, a pipelined for loop reads data from the input memory interface and writes it to an internal hls::stream variable. This coding style reads from the global memory bank in bursts.
It is a recommended coding style to implement the for loop operation in the example above inside a separate function, and apply the dataflow optimization, as discussed in Dataflow Optimization. The code example below shows how this would look, letting the compiler establish dataflow between the read, execute, and write functions:
void top_function(datatype_t * m_in,  // Memory data Input
                  datatype_t * m_out, // Memory data Output
                  int inp1,           // Other Input
                  int inp2) {         // Other Input
#pragma HLS DATAFLOW
    hls::stream<datatype_t> in_var1;  // Internal streams that transfer data
    hls::stream<datatype_t> out_var1; // through the dataflow region
    // Read function contains a pipelined for loop to infer burst
    read_function(m_in, inp1, in_var1);
    // Core compute function
    execute_function(in_var1, out_var1, inp1, inp2);
    // Write function contains a pipelined for loop to infer burst
    write_function(out_var1, m_out);
}
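The read, execute, and write split can be tried out as ordinary software before it is synthesized. The sketch below is a plain C++ model, not synthesizable kernel code: std::queue stands in for hls::stream, datatype_t is assumed to be int, inp1 is assumed to be the element count, inp2 is assumed to be an increment, and write_function takes an extra size argument that the example above does not show. The function bodies are hypothetical placeholders.

```cpp
#include <cassert>
#include <queue>

using datatype_t = int;                    // assumption: element type
using stream_t = std::queue<datatype_t>;   // software stand-in for hls::stream<datatype_t>

// Reads `size` elements in order; in a kernel this loop would carry
// "#pragma HLS PIPELINE" so that the sequential reads can become a burst.
void read_function(const datatype_t *m_in, int size, stream_t &out) {
    assert(size >= 0);
    for (int i = 0; i < size; i++) out.push(m_in[i]);
}

// Placeholder compute stage: adds `inc` to every element.
void execute_function(stream_t &in, stream_t &out, int size, int inc) {
    for (int i = 0; i < size; i++) {
        out.push(in.front() + inc);
        in.pop();
    }
}

// Writes `size` elements in order; also a candidate for a pipelined burst loop.
void write_function(stream_t &in, datatype_t *m_out, int size) {
    for (int i = 0; i < size; i++) {
        m_out[i] = in.front();
        in.pop();
    }
}

// The three stages only communicate through the two streams, in the forward direction.
void top_function(const datatype_t *m_in, datatype_t *m_out, int inp1, int inp2) {
    stream_t in_var1;
    stream_t out_var1;
    read_function(m_in, inp1, in_var1);
    execute_function(in_var1, out_var1, inp1, inp2);
    write_function(out_var1, m_out, inp1);
}
```

In software the three calls run one after another; the point of keeping each stage in its own function with streams in between is that the compiler is then free to overlap them in hardware.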
Scalar Inputs
Scalar inputs are typically control variables that are directly loaded from the host machine. They can be thought of as programming data or parameters under which the main kernel computation takes place. These kernel inputs are write-only from the host side. In the following function, the scalar parameters are width and height.
void process_image(int *input, int *output, int width, int height) {
The scalar arguments are assigned a default INTERFACE pragma, which is inferred by the tool.
#pragma HLS INTERFACE s_axilite port=width bundle=control
#pragma HLS INTERFACE s_axilite port=height bundle=control
In this example, there are two scalar inputs that specify the image width and height. These data inputs come to the kernel directly from the host machine and not through global memory banks. The pragmas shown are not added to the code by the tool.
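For readers who prefer to spell the interfaces out, a complete kernel with explicit pragmas could look like the sketch below. The function body (inverting 8-bit pixel values) is a hypothetical placeholder, and the gmem bundle for the two pointers is an assumption carried over from the memory mapped examples earlier in this chapter. A standard C++ compiler ignores the HLS pragmas, typically with a warning, so the function can also be exercised in software.

```cpp
#include <cassert>

extern "C" void process_image(int *input, int *output, int width, int height) {
// Pointer arguments go through global memory (assumed bundle name: gmem).
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem
// Scalar arguments and the function return share one AXI4-Lite bundle.
#pragma HLS INTERFACE s_axilite port=width bundle=control
#pragma HLS INTERFACE s_axilite port=height bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    assert(width >= 0 && height >= 0);
    const int pixels = width * height;
    invert: for (int i = 0; i < pixels; i++) {
        output[i] = 255 - input[i]; // placeholder computation
    }
}
```

The width and height values arrive through the control bundle and only steer the loop bound; the pixel data itself never passes through that interface.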
The bundle= name should be the same for all scalar data inputs and the function return. In the preceding example, bundle=control is used for all scalar inputs.
Streaming Interfaces
If the data is accessed sequentially, a streaming interface can be used. This interface enables direct streaming of data from the host to kernel, and from the kernel to host without the need to migrate the data through the global memory as an intermediate step. The streaming interface can also be used between two kernels where one kernel is streaming data as a producer to another kernel acting as a consumer. This transfer also occurs directly and without making use of global memory. For more information, refer to Streaming Data Transfers.
Loops
Loops are an important aspect for a high performance accelerator. Generally, loops are either pipelined or unrolled to take advantage of the highly distributed and parallel FPGA architecture to provide a performance boost compared to running on a CPU.
By default, loops are neither pipelined nor unrolled. Each iteration of the loop takes at least one clock cycle to execute in hardware. Thinking from the hardware perspective, there is an implicit wait until clock for the loop body. The next iteration of a loop only starts when the previous iteration is finished.
Loop Pipelining
By default, every iteration of a loop only starts when the previous iteration has finished. In the loop example below, a single iteration of the loop adds two variables and stores the result in a third variable. Assume that in hardware this loop takes three cycles to finish one iteration. Also, assume that the loop variable len is 20, that is, the vadd loop runs for 20 iterations in the kernel. Therefore, it requires a total of 60 clock cycles (20 iterations * 3 cycles) to complete all the operations of this loop.
vadd: for(int i = 0; i < len; i++) {
c[i] = a[i] + b[i];
}
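It is good practice to label loops, as in the example above (vadd:…). This practice helps with debugging when working in the Vitis core development kit. Note that the labels generate warnings during compilation, which can be safely ignored.
With the PIPELINE pragma applied, the same loop is written as follows:
vadd: for(int i = 0; i < len; i++) {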
#pragma HLS PIPELINE
c[i] = a[i] + b[i];
}
In the example above, it is assumed that every iteration of the loop takes three cycles: read, add, and write. Without pipelining, each successive iteration of the loop starts in every third cycle. With pipelining the loop can start subsequent iterations of the loop in fewer than three cycles, such as in every second cycle, or in every cycle.
The number of cycles it takes to start the next iteration of a loop is called the initiation interval (II) of the pipelined loop. So II = 2 means each successive iteration of the loop starts every two cycles. An II = 1 is the ideal case, where each iteration of the loop starts in the very next cycle. When you use pragma HLS PIPELINE, the compiler always tries to achieve II = 1 performance.
The following figure illustrates the difference in execution between pipelined and non-pipelined loops. In this figure, (A) shows the default sequential operation where there are three clock cycles between each input read (II = 3), and it requires eight clock cycles before the last output write is performed.
If there are data dependencies inside a loop, as discussed in Loop Dependencies, it might not be possible to achieve II = 1, and a larger initiation interval might be the result.
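As a rough way to quantify the benefit, the cycle count of the vadd loop can be estimated with the small C++ sketch below. It uses an idealized model that is an assumption of this example, not a tool report: the first iteration takes the full iteration latency, every later iteration starts II cycles after the previous one, and loop entry and exit overhead is ignored. The function names are hypothetical.

```cpp
#include <cassert>

// Idealized cycle count without pipelining: iterations run back to back.
inline int sequential_cycles(int iterations, int iteration_latency) {
    assert(iterations >= 1 && iteration_latency >= 1);
    return iterations * iteration_latency;
}

// Idealized cycle count with pipelining: one full iteration, plus one
// initiation interval for each additional iteration.
inline int pipelined_cycles(int iterations, int iteration_latency, int ii) {
    assert(iterations >= 1 && iteration_latency >= 1 && ii >= 1);
    return iteration_latency + (iterations - 1) * ii;
}
```

For 20 iterations of three cycles each, this model gives 60 cycles without pipelining, 41 cycles at II = 2, and 22 cycles at II = 1. At II = 3 it reproduces the non-pipelined 60 cycles, which matches the description of the default behavior.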
Loop Unrolling
vadd: for(int i = 0; i < 20; i++) {
#pragma HLS UNROLL
c[i] = a[i] + b[i];
}
In the preceding example, you can see that pragma HLS UNROLL has been inserted into the body of the loop to instruct the compiler to unroll the loop completely. All 20 iterations of the loop are executed in parallel if that is permitted by any data dependency.
Partially Unrolled Loop
To completely unroll a loop, the loop must have a constant bound (20 in the example above). However, partial unrolling is possible for loops with a variable bound. A partially unrolled loop means that only a certain number of loop iterations can be executed in parallel.
array_sum:for(int i=0;i<4;i++){
#pragma HLS UNROLL factor=2
sum += arr[i];
}
In the above example, the UNROLL pragma is given a factor of 2. This is the equivalent of manually duplicating the loop body and running the two loops concurrently for half as many iterations. The following code shows how this would be written. This transformation allows two iterations of the above loop to execute in parallel.
array_sum_unrolled:for(int i=0;i<4;i+=2){
// Manual unroll by a factor 2
sum += arr[i];
sum += arr[i+1];
}
Just like data dependencies inside a loop impact the initiation interval of a pipelined loop, an unrolled loop performs operations in parallel only if data dependencies allow it. If operations in one iteration of the loop require the result from a previous iteration, they cannot execute in parallel, but execute as soon as the data from one iteration is available to the next.
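In the array_sum example, both unrolled statements update the same sum variable, so the second addition depends on the first. One general way to loosen such a dependency, shown here as a sketch rather than as something this guide prescribes, is to give each unrolled copy its own accumulator and combine the partial results after the loop. The function name array_sum_split is hypothetical, and the sketch assumes an even element count. Whether the tool needs this help for integer addition is not stated in the source.

```cpp
#include <cassert>

// Sums `n` elements using two independent accumulators, one per unrolled copy.
// Assumes `n` is even so that arr[i + 1] is always in range.
int array_sum_split(const int *arr, int n) {
    assert(n >= 0 && n % 2 == 0);
    int sum_even = 0; // accumulates arr[0], arr[2], ...
    int sum_odd = 0;  // accumulates arr[1], arr[3], ...
    for (int i = 0; i < n; i += 2) {
        sum_even += arr[i];     // no dependency on sum_odd
        sum_odd += arr[i + 1];  // no dependency on sum_even
    }
    return sum_even + sum_odd;  // single combine step after the loop
}
```

Because integer addition can be reordered without changing the result, the split version returns the same sum as the original loop.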
PIPELINE loops first, and then UNROLL loops with small loop bodies and limited iterations to improve performance further.
Loop Dependencies
Data dependencies in loops can impact the results of loop pipelining or unrolling. These loop dependencies can be within a single iteration of a loop or between different iterations of a loop. The straightforward method to understand loop dependencies is to examine an extreme example. In the following code example, the result of the loop is used as the loop continuation or exit condition. Each iteration of the loop must finish before the next can start.
Minim_Loop: while (a != b) {
if (a > b)
a -= b;
else
b -= a;
}
This loop cannot be pipelined. The next iteration of the loop cannot begin until the previous iteration ends.
Dealing with various types of dependencies with the Vitis compiler is an extensive topic requiring a detailed understanding of the high-level synthesis procedures underlying the compiler. For more information, refer to the Vitis High-Level Synthesis User Guide (UG1399).
Nested Loops
Coding with nested loops is a common practice. Understanding how loops are pipelined in a nested loop structure is key to achieving the desired performance.
If the HLS PIPELINE pragma is applied to a loop nested inside another loop, the v++ compiler attempts to flatten the loops to create a single loop, and applies the PIPELINE pragma to the constructed loop. Loop flattening helps improve the performance of the kernel.
The compiler is able to flatten the following types of nested loops:
- Perfect nested loop:
- Only the inner loop has a loop body.
- There is no logic or operations specified between the loop declarations.
- All the loop bounds are constant.
- Semi-perfect nested loop:
- Only the inner loop has a loop body.
- There is no logic or operations specified between the loop declarations.
- The inner loop bound must be a constant, but the outer loop bound can be a variable.
The following code example illustrates the structure of a perfect nested loop:
ROW_LOOP: for(int i=0; i< MAX_HEIGHT; i++) {
COL_LOOP: for(int j=0; j< MAX_WIDTH; j++) {
#pragma HLS PIPELINE
// Main computation per pixel
}
}
The above example shows a nested loop structure with two loops that performs some computation on incoming pixel data. In most cases, you want to process a pixel in every cycle, hence, PIPELINE is applied to the nested loop body structure. The compiler is able to flatten the nested loop structure in the example because it is a perfect nested loop.
The nested loop in the preceding example contains no logic between the ROW_LOOP and COL_LOOP declarations; all of the processing logic is inside COL_LOOP. Also, both loops have a fixed number of iterations. These two criteria help the v++ compiler flatten the loops and apply the PIPELINE constraint.
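To make the effect of flattening concrete, the sketch below writes a perfect nested loop next to a hand-flattened equivalent in plain C++. This only illustrates the idea; it is not a description of how the compiler performs the transformation. The loop bounds, the doubling computation, and the function names are assumptions made for the example.

```cpp
#include <cassert>

constexpr int MAX_HEIGHT = 4; // assumed bounds for illustration
constexpr int MAX_WIDTH = 8;

// Perfect nested loop: constant bounds, all logic inside the inner loop.
void scale_nested(const int *in, int *out) {
    assert(in != nullptr && out != nullptr);
    for (int i = 0; i < MAX_HEIGHT; i++) {
        for (int j = 0; j < MAX_WIDTH; j++) {
            out[i * MAX_WIDTH + j] = 2 * in[i * MAX_WIDTH + j];
        }
    }
}

// Hand-flattened form: a single loop over all pixels that tracks the
// row and column counters itself, so one pipeline covers the whole nest.
void scale_flattened(const int *in, int *out) {
    assert(in != nullptr && out != nullptr);
    int i = 0;
    int j = 0;
    for (int k = 0; k < MAX_HEIGHT * MAX_WIDTH; k++) {
        out[i * MAX_WIDTH + j] = 2 * in[i * MAX_WIDTH + j];
        if (++j == MAX_WIDTH) { // end of a row: wrap the column, advance the row
            j = 0;
            ++i;
        }
    }
}
```

Both functions visit the pixels in the same order and produce the same output. The flattened form has no gap between the end of one row and the start of the next, which is what allows a pipeline to keep accepting a new pixel every cycle across row boundaries.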
Sequential Loops
void adder(unsigned int *in, unsigned int *out, int inc, int size) {
unsigned int in_internal[MAX_SIZE];
unsigned int out_internal[MAX_SIZE];
mem_rd: for (int i = 0 ; i < size ; i++){
#pragma HLS PIPELINE
// Reading from the input vector "in" and saving to internal variable
in_internal[i] = in[i];
}
compute: for (int i=0; i<size; i++) {
#pragma HLS PIPELINE
out_internal[i] = in_internal[i] + inc;
}
mem_wr: for(int i=0; i<size; i++) {
#pragma HLS PIPELINE
out[i] = out_internal[i];
}
}
In the previous example, three sequential loops are shown: mem_rd, compute, and mem_wr.
- The mem_rd loop reads input vector data from the memory interface and stores it in internal storage.
- The main compute loop reads from the internal storage, performs an increment operation, and saves the result to another internal storage.
- The mem_wr loop writes the data back to memory from the internal storage.
This code example is using two separate loops for reading and writing from/to the memory input/output interfaces to infer burst read/write.
By default, these loops are executed sequentially without any overlap. First, the mem_rd loop finishes reading all the input data before the compute loop starts its operation. Similarly, the compute loop finishes processing the data before the mem_wr loop starts to write the data. However, the execution of these loops can be overlapped, allowing the compute (or mem_wr) loop to start as soon as there is enough data available to feed its operation, before the mem_rd (or compute) loop has finished processing its data.
The loop execution can be overlapped using dataflow optimization as described in Dataflow Optimization.
Dataflow Optimization
Dataflow optimization is a powerful technique to improve the kernel performance by enabling task-level pipelining and parallelism inside the kernel. It allows the v++ compiler to schedule multiple functions of the kernel to run concurrently to achieve higher throughput and lower latency. This is also known as task-level parallelism. To help, you need to understand the best practices for writing good software for execution on the FPGA as discussed in Optimizing for Throughput in the Vitis High-Level Synthesis User Guide (UG1399).
The following figure shows a conceptual view of dataflow pipelining. The default behavior is to execute and complete func_A, then func_B, and finally func_C. With the pragma HLS dataflow enabled, the compiler can schedule each function to execute as soon as data is available. In this example, the original top function has a latency and interval of eight clock cycles. With the dataflow optimization, the interval is reduced to only three clock cycles.
Dataflow Coding Example
In the dataflow coding example you should notice the following:
- The pragma HLS dataflow is applied to instruct the compiler to enable dataflow optimization. This is not a data mover, which deals with interfacing between the PS and PL, but instead addresses how the data flows through the accelerator.
- The stream class is used as a data transferring channel between each of the functions in the dataflow region.
TIP: The stream class infers a first-in first-out (FIFO) memory circuit in the programmable logic. This memory circuit, which acts as a queue in software programming, provides data-level synchronization between the functions and achieves better performance.
void compute_kernel(ap_int<256> *inx, ap_int<256> *outx, DTYPE alpha) {
hls::stream<unsigned int>inFifo;
#pragma HLS STREAM variable=inFifo depth=32
hls::stream<unsigned int>outFifo;
#pragma HLS STREAM variable=outFifo depth=32
#pragma HLS DATAFLOW
read_data(inx, inFifo);
// Do computation with the acquired data
compute(inFifo, outFifo, alpha);
write_data(outx, outFifo);
return;
}
Canonical Forms of Dataflow Optimization
- Functions: The canonical form coding guideline for dataflow inside a function specifies:
  - Use only the following types of variables inside the dataflow region:
    - Local non-static scalar/array/pointer variables.
    - Local static hls::stream variables.
  - Function calls transfer data only in the forward direction.
  - An array or hls::stream should have only one producer function and one consumer function.
  - The function arguments (variables coming from outside the dataflow region) should only be read, or written, not both. If performing both read and write on the same function argument, then the read should happen before the write.
  - The local variables (those that are transferring data in the forward direction) should be written before being read.
  The following code example illustrates the canonical form for dataflow within a function. Note that the first function (func1) reads the inputs and the last function (func3) writes the outputs. Also note that one function creates output values that are passed to the next function as input parameters.
  void dataflow(Input0, Input1, Output0, Output1) {
      UserDataType C0, C1, C2;
  #pragma HLS DATAFLOW
      func1(read Input0, read Input1, write C0, write C1);
      func2(read C0, read C1, write C2);
      func3(read C2, write Output0, write Output1);
  }
- Loop: The canonical form coding guideline for dataflow inside a loop body includes the coding guidelines for a function defined above, and also specifies the following for the loop variable:
  - Initial value 0.
  - The loop condition is formed by a comparison of the loop variable with a numerical constant or variable that does not vary inside the loop body.
  - Increment by 1.
  The following code example illustrates the canonical form for dataflow within a loop.
  void dataflow(Input0, Input1, Output0, Output1) {
      UserDataType C0, C1, C2;
      for (int i = 0; i < N; ++i) {
  #pragma HLS DATAFLOW
          func1(read Input0, read Input1, write C0, write C1);
          func2(read C0, read C1, write C2);
          func3(read C2, write Output0, write Output1);
      }
  }
Troubleshooting Dataflow
The following behaviors can prevent the Vitis compiler from performing dataflow optimizations:
- Single producer-consumer violations.
- Bypassing tasks.
- Feedback between tasks.
- Conditional execution of tasks.
- Loops with multiple exit conditions or conditions defined within the loop.
If any of the above conditions occur inside the dataflow region, you might need to re-architect your code to successfully achieve dataflow optimization.
Array Configuration
The Vitis compiler maps large arrays to the block RAM memory in the PL region. Each block RAM has a maximum of two access points or ports. This can limit the performance of the application because the elements of an array cannot all be accessed in parallel when implemented in hardware.
Depending on the performance requirements, you might need to access some or all of the elements of an array in the same clock cycle. To achieve this, the pragma HLS array_partition can be used to instruct the compiler to split the elements of an array and map it to smaller arrays, or to individual registers. The compiler provides three types of array partitioning, as shown in the following figure. The three types of partitioning are:
- block: The original array is split into equally sized blocks of consecutive elements of the original array.
- cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.
- complete: The array is split into its individual elements. This corresponds to resolving a memory into individual registers. This is the default for the ARRAY_PARTITION pragma.
For block and cyclic partitioning, the factor option specifies the number of arrays that are created. In the preceding figure, a factor of 2 is used to split the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the later arrays will have fewer elements.
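The mapping of elements to the smaller arrays can be modeled in software. The sketch below is not tool output; it assumes that block partitioning uses a block size of the element count divided by the factor, rounded up, and that cyclic partitioning deals elements out round-robin. The names `Slot`, `block_slot`, and `cyclic_slot` are hypothetical.

```cpp
#include <cassert>

// Where an element of the original array lands after partitioning.
struct Slot {
    int bank;    // index of the smaller array
    int offset;  // position inside that smaller array
};

// Block partitioning: consecutive runs of elements go to the same bank.
// Assumption: block size is ceil(n / factor).
Slot block_slot(int i, int n, int factor) {
    assert(n > 0 && factor > 0 && i >= 0 && i < n);
    int per_bank = (n + factor - 1) / factor;
    return Slot{i / per_bank, i % per_bank};
}

// Cyclic partitioning: neighboring elements go to different banks.
Slot cyclic_slot(int i, int factor) {
    assert(factor > 0 && i >= 0);
    return Slot{i % factor, i / factor};
}
```

For an 8-element array with a factor of 2, element 5 falls in bank 1 under both schemes, but at offset 1 for block and offset 2 for cyclic. With 10 elements and a factor of 4, this model gives block sizes of 3, 3, 3, and 1.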
The dimension option is used to specify which dimension is partitioned. The following figure shows how the dimension option is used to partition the following example code in three different ways:
void foo (...) {
// my_array[dim=1][dim=2][dim=3]
// The following three pragma results are shown in the figure below
// #pragma HLS ARRAY_PARTITION variable=my_array dim=3 <block|cyclic> factor=2
// #pragma HLS ARRAY_PARTITION variable=my_array dim=1 <block|cyclic> factor=2
// #pragma HLS ARRAY_PARTITION variable=my_array dim=0 complete
int my_array[10][6][4];
...
}
The examples in the figure demonstrate how partitioning dimension 3 results in four separate arrays and partitioning dimension 1 results in 10 separate arrays. If 0 is specified as the dimension, all dimensions are partitioned.
The Importance of Careful Partitioning
A complete partition of the array maps all the array elements to individual registers. This helps improve kernel performance because all of these registers can be accessed concurrently in the same cycle.
Choosing a Specific Dimension to Partition
int A[64][64];
int B[64][64];
int C[64][64]; // result matrix, declaration added for completeness
ROW_WISE: for (int i = 0; i < 64; i++) {
  COL_WISE: for (int j = 0; j < 64; j++) {
#pragma HLS PIPELINE
    int result = 0;
    COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
      result += A[i][k] * B[k][j];
    }
    C[i][j] = result;
  }
}
In this code, the ROW_WISE and COL_WISE loops are flattened together and COMPUTE_LOOP is fully unrolled. To concurrently execute each iteration (k) of the COMPUTE_LOOP, the code must access each column of matrix A and each row of matrix B in parallel. Therefore, matrix A should be split in the second dimension, and matrix B should be split in the first dimension.
#pragma HLS ARRAY_PARTITION variable=A dim=2 complete
#pragma HLS ARRAY_PARTITION variable=B dim=1 complete
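Putting the loop nest and the two pragmas together gives a sketch such as the following. The function name `matmul_partitioned`, the `DIM` constant, and the copy-in loops are assumptions added so that the fragment is self-contained; the partition pragmas and the compute loops follow the text above.

```cpp
constexpr int DIM = 64;

void matmul_partitioned(int A_in[DIM][DIM], int B_in[DIM][DIM], int C[DIM][DIM]) {
    int A[DIM][DIM];
    int B[DIM][DIM];
    // A is read along k in its second dimension, B along k in its first.
#pragma HLS ARRAY_PARTITION variable=A dim=2 complete
#pragma HLS ARRAY_PARTITION variable=B dim=1 complete

    // Copy the inputs into the partitioned local arrays.
    COPY_ROWS: for (int i = 0; i < DIM; i++) {
        COPY_COLS: for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE
            A[i][j] = A_in[i][j];
            B[i][j] = B_in[i][j];
        }
    }

    ROW_WISE: for (int i = 0; i < DIM; i++) {
        COL_WISE: for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE
            int result = 0;
            // Fully unrolled under the pipeline: all 64 products of one
            // output element are computed together.
            COMPUTE_LOOP: for (int k = 0; k < DIM; k++) {
                result += A[i][k] * B[k][j];
            }
            C[i][j] = result;
        }
    }
}
```

Only the dimension indexed by k is partitioned in each array, which avoids the resource cost of completely partitioning both 64x64 arrays.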
Choosing between Cyclic and Block Partitions
Here the same matrix multiplication algorithm is used to demonstrate choosing between cyclic and block partitioning and determining the appropriate factor, by understanding the array access pattern of the underlying algorithm.
int A[64 * 64];
int B[64 * 64];
int C[64 * 64]; // result matrix, declaration added for completeness
#pragma HLS ARRAY_PARTITION variable=A dim=1 cyclic factor=64
#pragma HLS ARRAY_PARTITION variable=B dim=1 block factor=64
ROW_WISE: for (int i = 0; i < 64; i++) {
  COL_WISE: for (int j = 0; j < 64; j++) {
#pragma HLS PIPELINE
    int result = 0;
    COMPUTE_LOOP: for (int k = 0; k < 64; k++) {
      result += A[i * 64 + k] * B[k * 64 + j];
    }
    C[i * 64 + j] = result;
  }
}
In this version of the code, A and B are now one-dimensional arrays. To access each column of matrix A and each row of matrix B in parallel, cyclic and block partitions are used as shown in the above example. To access each column of matrix A in parallel, cyclic partitioning is applied with the factor specified as the row size, in this case 64. Similarly, to access each row of matrix B in parallel, block partitioning is applied with the factor specified as the column size, or 64.
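The choice can be checked with a small software model that counts how many distinct partitions the 64 reads of one unrolled COMPUTE_LOOP touch. This is a sketch for reasoning about access patterns, not tool output; `Scheme`, `bank_of`, `distinct_banks_A`, and `distinct_banks_B` are hypothetical names, and the bank formulas assume a 4096-element array with a factor of 64.

```cpp
#include <set>

constexpr int SIZE = 64;  // rows and columns of each matrix

enum class Scheme { Block, Cyclic };

// Bank holding flat index idx when a SIZE*SIZE array is split with factor SIZE.
int bank_of(int idx, Scheme s) {
    return (s == Scheme::Cyclic) ? (idx % SIZE) : (idx / SIZE);
}

// Distinct banks touched by A[i * 64 + k] for k = 0..63.
int distinct_banks_A(int i, Scheme s) {
    std::set<int> banks;
    for (int k = 0; k < SIZE; k++) {
        banks.insert(bank_of(i * SIZE + k, s));
    }
    return static_cast<int>(banks.size());
}

// Distinct banks touched by B[k * 64 + j] for k = 0..63.
int distinct_banks_B(int j, Scheme s) {
    std::set<int> banks;
    for (int k = 0; k < SIZE; k++) {
        banks.insert(bank_of(k * SIZE + j, s));
    }
    return static_cast<int>(banks.size());
}
```

In this model, cyclic partitioning spreads the reads of A over 64 banks while block partitioning would put them all in one bank; for B the situation is reversed, which matches the pragmas chosen above.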
Minimizing Array Accesses with Caching
Because arrays are mapped to block RAM with a limited number of access ports, repeated array accesses can limit the performance of the accelerator. You should have a good understanding of the array access pattern of the algorithm, and limit the array accesses by locally caching the data to improve the performance of the kernel.
In the following example, each loop iteration makes three accesses to the array mem[N] to create a summed result.
#include "array_mem_bottleneck.h"
dout_t array_mem_bottleneck(din_t mem[N]) {
dout_t sum=0;
int i;
SUM_LOOP:for(i=2;i<N;++i)
sum += mem[i] + mem[i-1] + mem[i-2];
return sum;
}
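The code can be rewritten as follows so that each loop iteration performs only one read of mem, with the two previously read values held in local variables.
#include "array_mem_perform.h"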
dout_t array_mem_perform(din_t mem[N]) {
din_t tmp0, tmp1, tmp2;
dout_t sum=0;
int i;
tmp0 = mem[0];
tmp1 = mem[1];
SUM_LOOP:for (i = 2; i < N; i++) {
tmp2 = mem[i];
sum += tmp2 + tmp1 + tmp0;
tmp0 = tmp1;
tmp1 = tmp2;
}
return sum;
}
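The same caching idea applies to other sliding-window computations. As a hedged illustration, the sketch below applies it to a three-tap filter; the function name `fir3_cached`, the length `NSAMP`, and the coefficients are assumptions and do not come from the original example.

```cpp
constexpr int NSAMP = 8;  // hypothetical number of samples

// out[i] = c0*in[i] + c1*in[i-1] + c2*in[i-2], with samples before in[0]
// treated as zero. The two previous samples are cached in local variables,
// so in[] is read only once per iteration.
void fir3_cached(const int in[NSAMP], int out[NSAMP], int c0, int c1, int c2) {
    int prev1 = 0;  // cached in[i-1]
    int prev2 = 0;  // cached in[i-2]
    FIR_LOOP: for (int i = 0; i < NSAMP; i++) {
#pragma HLS PIPELINE
        int cur = in[i];  // the only array read in this iteration
        out[i] = c0 * cur + c1 * prev1 + c2 * prev2;
        prev2 = prev1;    // shift the cached window by one sample
        prev1 = cur;
    }
}
```

Because each iteration needs only one read port on the input array, the memory is no longer the limiting factor for the loop's initiation interval.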
Function Inlining
C code generally consists of several functions. By default, each function is compiled, and optimized separately by the Vitis compiler. A unique hardware module will be generated for the function body and reused as needed.
From a performance perspective, it is generally better to inline the function, or dissolve the function hierarchy. This helps the Vitis compiler optimize more globally across the function boundary. For example, if a function is called inside a pipelined loop, inlining the function lets the compiler optimize more aggressively and results in better pipeline performance of the loop (a lower initiation interval, or II).
foo_sub (p, q) {
#pragma HLS INLINE
....
...
}
However, if the function body is very large and is called several times inside the main kernel function, inlining the function might cause capacity issues by consuming too many resources. In such cases, you might choose not to inline the function, and let the v++ compiler optimize it separately in its local context.
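The two cases can be sketched side by side. In the sketch below, a small helper called from a pipelined loop is inlined, while a second helper is kept as its own module with `#pragma HLS INLINE off`. The function names, the `BLK` size, and the arithmetic are hypothetical, and the second helper is small here only to keep the example short.

```cpp
constexpr int BLK = 4;  // hypothetical block size

// Small helper called from a pipelined loop: inlined so that the compiler
// can optimize it together with the loop body.
static int scale_and_clip(int x) {
#pragma HLS INLINE
    int y = x * 3;
    return (y > 100) ? 100 : y;
}

// Stand-in for a large function: kept as a separate module so that its
// hardware is not duplicated at every call site.
static int checksum_block(const int data[BLK]) {
#pragma HLS INLINE off
    int acc = 0;
    for (int i = 0; i < BLK; i++) {
        acc += data[i];
    }
    return acc;
}

void inline_demo(const int in[BLK], int out[BLK], int *sum) {
    SCALE_LOOP: for (int i = 0; i < BLK; i++) {
#pragma HLS PIPELINE
        out[i] = scale_and_clip(in[i]);
    }
    *sum = checksum_block(out);
}
```

Whether a given function is better inlined depends on its size and how often it is called, so the resource and timing reports should guide the final choice.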
Streaming Data in User-Managed Never-Ending Kernels
In a typical XRT-managed application, the host manages the start and
stop of a kernel using an XRT Run object from the xrt::run
class as described in Executing Kernels on the Device. In user-managed kernels, the start and stop mechanism
is different, but the kernel is still software controlled from the host application
using register read and write calls. These API calls are translated into low-level
calls by XRT to start and stop the kernel, consuming a small amount of time to start or
stop the kernel on the hardware and incurring some overhead each time the kernel is
started. This overhead can be significant when the data is input to the kernel from
either Ethernet or SerDes, which work at a line-rate of 1000 GB/s.
However, data-driven streaming kernels, such as the kernel described here, eliminate these low-level API calls and let the kernel react to the flow of data from the Ethernet or SerDes at much higher data rates. The user-managed kernel described here is started once and automatically restarted as needed, and so is called a never-ending kernel. These kernels are executed in a purely data-driven mode with streaming data coming from and going to the I/O pins (Ethernet, SerDes) of the FPGA, or streamed from or to a different kernel (kernel-to-kernel streaming).
Because never-ending kernels are data-driven, with the operation of the kernel
dependent on the data stream, they do not need to be user-managed by the host
application beyond the initial start, thus avoiding the overhead of repeated API calls
from the host program. These kernels require the ap_ctrl_chain
protocol specified by Vitis HLS, using the auto_restart bit to
keep ap_start high and to run the kernel continuously
once started, or until auto_restart is reset. The host
application must manually enable the auto_restart bit
as explained in Enabling Auto-Restart of User-Managed Kernels. The kernel
requirements are detailed in the following section.
Kernel Coding Guidelines
Never-ending kernels have the following coding requirements for top-level function arguments:
- The kernel implements the ap_ctrl_chain block control protocol to enable the auto_restart bit.
- The kernel supports AXI4-Stream interfaces (axis) and has no AXI4 memory mapped (m_axi) interface, and interacts with other kernels only through streams.
Modeling designs that use the streaming paradigm can be difficult in C. The approach of using pointers to perform multiple read and/or write accesses can introduce issues because there are implications for the type qualifier. Vitis HLS provides a C++ template class hls::stream<ap_axis<N>> for modeling streaming data structures. On the hardware, the hls::stream is implemented as an axis interface.
template <typename T, size_t WUser, size_t WId, size_t WDest> struct axis { .. };
Where:
- T is the stream data type
- WUser is the width of the TUSER signal
- WId is the width of the TID signal
- WDest is the width of the TDEST signal
When the stream data type (T) is a simple integer type, there are two predefined types of AXI4-Stream implementations available:
- A signed implementation of the AXI4-Stream class:
hls::axis<ap_int<WData>, WUser, WId, WDest>
- An unsigned implementation:
hls::axis<ap_uint<WData>, WUser, WId, WDest>
TVALID, TREADY, and TLAST are required control signals for the AXI4-Stream protocol. Side-channel signals TKEEP, TSTRB, TUSER, TID, and TDEST are special signals that can be used to pass around additional bookkeeping data. The values specified for the template parameters WUser, WId, and WDest define the use of side-channel signals in the interface as explained in AXI4-Stream Interfaces in the Vitis High-Level Synthesis User Guide (UG1399).
The following example shows a programming model for a data-driven, never-ending kernel using AXI4-Stream:
#include "ap_axi_sdata.h"
#include "ap_int.h"
#include "hls_stream.h"

// Data width in bits. Assumed to be 32 to match the packet type;
// DWIDTH is not defined in the original excerpt.
#define DWIDTH 32

typedef ap_axis<DWIDTH, 0, 0, 0> pkt;

extern "C" {
void krnl_stream_vdatamover(hls::stream<pkt> &in,
                            hls::stream<pkt> &out // Internal Stream
                            ) {
#pragma HLS interface ap_ctrl_chain port=return
  bool eos = false;
vdatamover:
  do {
    // Read one packet from the input stream
    pkt t1 = in.read();

    // Packet for output
    pkt t_out;

    // Read the data from the input packet
    ap_uint<DWIDTH> in1 = t1.data;

    // Data mover: pass the data through unchanged
    ap_uint<DWIDTH> tmpOut = in1;

    // Set the data and side-channel signals of the output packet
    t_out.data = tmpOut;
    t_out.last = t1.last;
    t_out.keep = -1; // Enabling all bytes

    // Write the packet to the output stream
    out.write(t_out);

    // Stop after the packet that carries TLAST
    if (t1.last) {
      eos = true;
    }
  } while (eos == false);
}
}
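The control flow of this loop can be exercised on the host without the Vitis HLS headers by substituting simple stand-ins for the stream and packet types. The sketch below is only a software model: `Packet`, `FifoModel`, `datamover_model`, and `move_words` are hypothetical names, and `FifoModel` is a queue wrapper, not the `hls::stream` class. Unlike a hardware stream, it cannot block, so the input must be filled before the loop runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for a 32-bit AXI4-Stream packet.
struct Packet {
    std::uint32_t data;
    std::uint8_t keep;  // one bit per valid byte
    bool last;          // models TLAST
};

// Hypothetical stand-in for a stream: a plain FIFO with read and write.
class FifoModel {
public:
    void write(const Packet &p) { q_.push(p); }
    Packet read() {
        assert(!q_.empty());  // this model cannot block like a hardware stream
        Packet p = q_.front();
        q_.pop();
        return p;
    }
    bool empty() const { return q_.empty(); }
private:
    std::queue<Packet> q_;
};

// Same structure as the kernel loop: forward packets until TLAST is seen.
void datamover_model(FifoModel &in, FifoModel &out) {
    bool eos = false;
    do {
        Packet t1 = in.read();
        Packet t_out{t1.data, 0xF, t1.last};  // all four bytes valid
        out.write(t_out);
        eos = t1.last;
    } while (!eos);
}

// Helper: send a non-empty list of words as one transfer and collect the output.
std::vector<std::uint32_t> move_words(const std::vector<std::uint32_t> &words) {
    FifoModel in, out;
    for (std::size_t i = 0; i < words.size(); ++i) {
        in.write(Packet{words[i], 0xF, i + 1 == words.size()});
    }
    datamover_model(in, out);
    std::vector<std::uint32_t> result;
    while (!out.empty()) {
        result.push_back(out.read().data);
    }
    return result;
}
```

Each call to `datamover_model` corresponds to one kernel run, ending at the packet that carries TLAST.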
Because the auto_restart bit restarts the kernel continuously, it effectively places a while(1) loop around the kernel. Therefore, you should not explicitly specify a while(1) loop in your source code, to prevent non-deterministic behavior.
Summary
As discussed in earlier topics, several important aspects of coding the kernel for FPGA acceleration using C/C++ include the following points:
- Consider using arbitrary precision data types, ap_int and ap_fixed.
- Understand kernel interfaces to determine scalar and memory interfaces. Use the bundle switch with different names if separate DDR memory banks will be specified in the linking stage.
- Use burst read and write coding style from and to the memory interface.
- Consider exploiting the full width of DDR banks during the data transfer when selecting the width of memory data inputs and outputs.
- Get the greatest performance boost using pipelining and dataflow.
- Write perfect or semi-perfect nested loop structures so that the v++ compiler can flatten the loops and apply pipelining effectively.
- Unroll loops with a small number of iterations and low operation count inside the loop body.
- Understand the array access pattern and apply complete partition to specific dimensions, or apply block or cyclic partitioning instead of a complete partition of the whole array.
- Minimize the array access by using local cache to improve kernel performance.
- Consider inlining the function, specifically inside the pipelined region. Functions inside the dataflow should not be inlined.