Pipelining Loops

Although loop unrolling exposes concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. This is necessary for maximizing kernel throughput and performance. Even in an unrolled case, loop control dependencies can lead to sequential behavior. The sequential behavior of operations results in idle hardware and a loss of performance.

Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0 specification for loop pipelining. The Xilinx attribute for loop pipelining is xcl_pipeline_loop. By default, the SDAccel™ compiler automatically applies this attribute on the innermost loop with trip count more than 64 or its parent loop when its trip count is less than or equal 64.

To understand the effect of loop pipelining on performance, consider the following code example:
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int* result) 
{
    int tmp = 0;

    LOOP_I: for (int i=0; i < 32; i++) {
        tmp += a[i] * b[i];
    }
    result[0] = tmp;
}
During the kernel compilation, the compiler automatically adds xcl_pipelie_loop attribute to LOOP_I, which conceptually looks like the code snippet below:
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int* result) 
{
    int tmp = 0;

    __attribute__((xcl_pipeline_loop))
    LOOP_I: for (int i=0; i < 32; i++) {
        tmp += a[i] * b[i];
    }
    result[0] = tmp;
}
Adding this attribute exposes the pipeline nature of the design to the compiler. In turn, the compiler then adds appropriate pipelining to improve the performance of the design. The figure below shows a timing diagram of the vector accumulator before and after exposing loop pipelining. The diagram on top is sequential in nature and was confirmed with the Application Timeline for the un-pipelined kernel. The diagram below shows the improved timing of the pipelined version. Notice how the different operations are kept busy throughout the loop iterations. Similar to loop unrolling, this exploits the vast hardware resources available on an FPGA.