Enabling DATAFLOW in C/C++ Kernels

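The following loop-level DATAFLOW example comes from the dataflow category of the Xilinx On-boarding Example GitHub repository. The top-level function adder consists of three loops under a dataflow pragma: mem_rd reads the input from memory into a stream, execute adds inc to each element, and mem_wr writes the results back to memory. The tool automatically pipelines each of the three loops to get the maximum throughput from that loop.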
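#include <hls_stream.h>

void adder(unsigned int *in, unsigned int *out, int inc, int size)
{
    // FIFO channels connecting the three dataflow stages
    hls::stream<unsigned int> inStream;
    hls::stream<unsigned int> outStream;
#pragma HLS STREAM variable=inStream  depth=32
#pragma HLS STREAM variable=outStream depth=32

    // Let the three loops below run as concurrent processes
#pragma HLS dataflow

    // Stage 1: read the input from memory into inStream
    mem_rd: for (int i = 0 ; i < size ; i++){
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
        inStream << in[i];
    }

    // Stage 2: add inc to each element and forward it to outStream
    execute: for (int j = 0 ; j < size ; j++){
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
        outStream << (inStream.read() + inc);
    }

    // Stage 3: write the results from outStream back to memory
    mem_wr: for (int k = 0 ; k < size ; k++) {
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
        out[k] = outStream.read();
    }
}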

Below is the latency estimate when the kernel is compiled with the dataflow pragma removed:

Below is the latency estimate when the kernel is compiled with the dataflow pragma:

As the latency estimate report shows, SDAccel generates a separate process for each loop, each with a latency of ~4000 clock cycles. With DATAFLOW enabled, the scheduler can run the three processes concurrently, so the latency of the top-level module adder is also ~4000 clock cycles. Without DATAFLOW the three loops run back to back, for about 3 x ~4000 = ~12000 clock cycles, so enabling DATAFLOW reduces the overall latency of the kernel top level to roughly 1/3.
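
The 1/3 figure can be reproduced with a small software model. The sketch below is plain C++, not HLS code and not a tool report. It assumes each stage handles at most one element per simulated clock cycle, and it uses std::queue as a stand-in for hls::stream. The function names sequential_cycles and dataflow_cycles are hypothetical and exist only for this illustration. Real reports also include memory access latency and pipeline depth, so actual numbers will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Modeled cycles when the three loops run one after another (no DATAFLOW).
// Assumption: each loop iteration costs exactly one cycle.
std::size_t sequential_cycles(std::size_t size) {
    return 3 * size;  // mem_rd, then execute, then mem_wr
}

// Modeled cycles when the three loops run concurrently, linked by FIFOs
// that hold at most `depth` elements each. Also fills `out` with in[i] + inc.
std::size_t dataflow_cycles(const std::vector<unsigned>& in,
                            std::vector<unsigned>& out,
                            unsigned inc, std::size_t depth) {
    assert(depth > 0);
    const std::size_t size = in.size();
    out.assign(size, 0);

    std::queue<unsigned> inStream;   // stand-in for hls::stream
    std::queue<unsigned> outStream;  // stand-in for hls::stream
    std::size_t rd = 0;              // elements read by mem_rd
    std::size_t wr = 0;              // elements written by mem_wr
    std::size_t cycles = 0;

    while (wr < size) {
        ++cycles;
        // Stages are evaluated consumer-first, so a value pushed in this
        // cycle becomes visible to the next stage only in the next cycle.
        if (!outStream.empty()) {                             // mem_wr
            out[wr++] = outStream.front();
            outStream.pop();
        }
        if (!inStream.empty() && outStream.size() < depth) {  // execute
            outStream.push(inStream.front() + inc);
            inStream.pop();
        }
        if (rd < size && inStream.size() < depth) {           // mem_rd
            inStream.push(in[rd++]);
        }
    }
    return cycles;
}
```

Under these assumptions, 4096 elements take 3 x 4096 = 12288 modeled cycles when the loops run sequentially and 4096 + 2 = 4098 modeled cycles when the stages overlap. The two extra cycles are the time the first element needs to travel through the two FIFOs.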