Enabling DATAFLOW in C/C++ Kernels
The following loop-level DATAFLOW example is taken from the dataflow category of the Xilinx On-boarding Examples on GitHub. The top-level function adder consists of three loops under a dataflow pragma. The tool automatically pipelines each of the three loops to maximize the throughput of the individual loop.
#include <hls_stream.h>

void adder(unsigned int *in, unsigned int *out, int inc, int size)
{
    // FIFOs that connect the three loops
    hls::stream<unsigned int> inStream;
    hls::stream<unsigned int> outStream;
#pragma HLS STREAM variable=inStream depth=32
#pragma HLS STREAM variable=outStream depth=32
#pragma HLS dataflow

    // Read the input from memory into inStream
    mem_rd: for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
        inStream << in[i];
    }

    // Add inc to each element and pass the result to outStream
    execute: for (int j = 0; j < size; j++) {
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
        outStream << (inStream.read() + inc);
    }

    // Write the results from outStream back to memory
    mem_wr: for (int k = 0; k < size; k++) {
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
        out[k] = outStream.read();
    }
}
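To check the functional behavior of the three stages on a host without the HLS headers, the kernel can be mirrored in plain C++. The sketch below is an illustration only: SimStream and adder_model are hypothetical names that do not appear in the Xilinx example, and SimStream is an unbounded FIFO that stands in for hls::stream without modeling its depth or timing.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for hls::stream: an unbounded FIFO for host-side checks.
template <typename T>
class SimStream {
public:
    void write(const T &v) { fifo_.push(v); }

    T read() {
        // A read from an empty FIFO would indicate a bug in the model.
        assert(!fifo_.empty() && "read from empty stream");
        T v = fifo_.front();
        fifo_.pop();
        return v;
    }

private:
    std::queue<T> fifo_;
};

// Software model of the kernel: the same three stages, run one after another.
std::vector<unsigned int> adder_model(const std::vector<unsigned int> &in, int inc)
{
    SimStream<unsigned int> inStream;
    SimStream<unsigned int> outStream;
    std::vector<unsigned int> out(in.size());

    for (std::size_t i = 0; i < in.size(); i++)   // mem_rd
        inStream.write(in[i]);

    for (std::size_t j = 0; j < in.size(); j++)   // execute
        outStream.write(inStream.read() + inc);

    for (std::size_t k = 0; k < in.size(); k++)   // mem_wr
        out[k] = outStream.read();

    return out;
}
```

For example, adder_model({1, 2, 3}, 5) returns {6, 7, 8}. The model runs the stages sequentially, so it only confirms what the kernel computes, not how the stages overlap in hardware.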
Below is the latency estimate when the kernel is compiled with the dataflow pragma removed:
Below is the latency estimate when the kernel is compiled with the dataflow pragma:
As the latency estimate report shows, SDAccel generates a separate process for each loop, each with a latency of about 4000 clock cycles. Because the scheduler can run the three processes concurrently, the latency of the top-level module adder is also about 4000 clock cycles. Without DATAFLOW the three loops execute one after another, so enabling it reduces the overall latency of the kernel top level to roughly one third.
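The one-third figure can be reproduced with a toy cycle count. The sketch below is an idealized model, not the tool's scheduler: it assumes every stage handles one element per clock cycle, that a value written to a FIFO becomes readable in the next cycle, and it ignores memory latency and pipeline depth. The function names sequential_cycles and dataflow_cycles are hypothetical.

```cpp
#include <cstddef>

// Three loops run back to back, one element per cycle each.
std::size_t sequential_cycles(std::size_t n)
{
    return 3 * n;
}

// Three loops run concurrently, connected by two FIFOs of the given depth.
std::size_t dataflow_cycles(std::size_t n, std::size_t depth)
{
    std::size_t read = 0, computed = 0, written = 0;  // elements finished per stage
    std::size_t cycles = 0;

    while (written < n) {
        // FIFO occupancy at the start of the cycle
        std::size_t in_fifo = read - computed;
        std::size_t out_fifo = computed - written;

        // Each stage fires at most once per cycle, based on start-of-cycle state.
        bool wr = out_fifo > 0;                     // mem_wr needs data
        bool ex = in_fifo > 0 && out_fifo < depth;  // execute needs data and space
        bool rd = read < n && in_fifo < depth;      // mem_rd needs space

        if (wr) written++;
        if (ex) computed++;
        if (rd) read++;
        cycles++;
    }
    return cycles;
}
```

With the trip count of 4096 and FIFO depth of 32 used in the example, sequential_cycles(4096) gives 12288 and dataflow_cycles(4096, 32) gives 4098: the overlapped version costs one pass over the data plus two cycles to fill the pipeline of stages. The actual latency values come from the HLS report, which also accounts for effects this model leaves out.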