Unrolling Loops

Loop unrolling is the first optimization technique available in the SDAccel™ compiler. The purpose of the loop unroll optimization is to expose concurrency to the compiler. This is an official attribute in the OpenCL 2.0 specification.

For example, starting with the code:

#define LENGTH 64
__kernel void __attribute__ ((reqd_work_group_size(1, 1, 1)))
krnl_vmult(
     __global int* a,
     __global int* b,
     __global int* c)
{
    local int bufa[LENGTH];
    local int bufb[LENGTH];
    local int bufc[LENGTH];

    for(int i = 0; i < LENGTH; i++) {
        bufa[i] = a[i];
        bufb[i] = b[i];
    }

    for(int i = 0; i < LENGTH; i++) {
        bufc[i] = bufa[i] * bufb[i];
    }

    for(int i = 0; i < LENGTH; i++) {
        c[i] = bufc[i];
    }
    return;
}

This kernel multiplies two integer vectors, a and b. The length of the vectors is 64. Since we want to isolate the performance of the for loop, we first read the two vectors into local. Also, a third local memory is used to store the output vector, c, so all data in the for loop uses local memories. Once the loop is completed, the entire output vector is written back to DDR.

By default, the SDAccel compiler automatically applies unroll attribute on the loop with trip count 64 and less and unroll those loops completely. For these loops, you might choose to do pipeline (instead of the automatic unrolling) by explicitly providing #pragma nounroll before the loop and pipeline using the xcl_pipeline_loop attribute.

Below is the latency and area estimate after running xocc with --report estimate option with the baseline design without any loop unroll attribute. Note that in this particular example, the actual computation only uses one 32-bit integer multiplier (4 DSPs) and takes 64 cycles to complete.

The performance of the vector multiplier can be improved by using the opencl_unroll_hint attribute with an unroll factor of 2:

__attribute__((opencl_unroll_hint(2)))
for(int i = 0; i < LENGTH; i++) {
    bufc[i] = bufa[i] * bufb[i];
}

The code above tells SDAccel to unroll the loop by a factor of two. Conceptually the compiler transforms the loop above to the code below:

for(int i = 0; i < LENGTH; i+=2) {
    bufc[i] = bufa[i] * bufb[i];
    bufc[i+1] = bufa[i+1] * bufb[i+1];
}

This results in LENGTH/2 or 32 loop iterations for the compute unit to complete the operation. By enabling SDAccel to reduce the loop iteration count, the programmer has exposed more concurrency to the compiler. This newly exposed concurrency reduces latency and improves performance, but also consumes more FPGA fabric resources. Below is the latency and area estimate for the unroll factor of 2. Note that 2x DSPs are used for the compute unit and the computation is reduced by 32 clock cycles.

Another variety of this attribute is to unroll the loop completely. The syntax for the fully unrolled version of the vector multiplier example is as shown below:
__attribute__((opencl_unroll_hint))
for(int i = 0; i < LENGTH; i++) {
    bufc[i] = bufa[i] * bufb[i];
}

Due to resource constraints, the full unrolling is appropriate for loops of small or medium length. Large loops may require too many resources to implement on the FPGA device. For larger loops, it is recommended to use loop pipeline (see the Piplining Loops section of this guide).