xcl_pipeline_loop
Description
You can pipeline a loop to reduce latency and maximize kernel throughput and performance.
Although unrolling loops increases concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. Even in an unrolled case, loop control dependencies can lead to sequential behavior. The sequential behavior of operations results in idle hardware and a loss of performance.
Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0 specification for loop pipelining: xcl_pipeline_loop.
By default, the XOCC compiler automatically pipelines loops with a trip count greater than 64 and unrolls loops with a trip count less than 64. These defaults should provide good results. However, you can choose to pipeline a loop (instead of letting the compiler unroll it automatically) by explicitly specifying the nounroll attribute and the xcl_pipeline_loop attribute before the loop.
Syntax
Place the attribute in the OpenCL source before the loop definition:
__attribute__((xcl_pipeline_loop))
Examples
The following example pipelines LOOP_1 of function vaccum to improve performance:
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vaccum(__global const int* a, __global const int* b, __global int* result)
{
  int tmp = 0;
  __attribute__((xcl_pipeline_loop))
  LOOP_1: for (int i=0; i < 32; i++) {
    tmp += a[i] * b[i];
  }
  result[0] = tmp;
}
See Also
- pragma HLS pipeline
- SDAccel Environment Optimization Guide (UG1207)
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)