Pipelining Work Items

Work item pipelining is the extension of loop pipelining to the kernel work group. The attribute that enables this optimization is xcl_pipeline_workitems. The following kernel is an example to which work item pipelining can be applied:
__kernel __attribute__ ((reqd_work_group_size(8, 8, 1)))
void madd(__global int* a, __global int* b, __global int* output)
{
 int rank = get_local_size(0); 
 __local unsigned int bufa[64];
 __local unsigned int bufb[64];

 int x = get_local_id(0);
 int y = get_local_id(1);
 bufa[x*rank + y] = a[x*rank + y];
 bufb[x*rank + y] = b[x*rank + y];
 barrier(CLK_LOCAL_MEM_FENCE);

 int index = get_local_id(1)*rank + get_local_id(0);
 output[index] = bufa[index] + bufb[index];
}

To handle the reqd_work_group_size attribute, SDAccel automatically inserts a loop nest over the dimensions of the ND range. For this example, the local work size is specified as (8, 8, 1). Without the pipelining attribute, the loop nest added by SDAccel has the execution profile of an un-pipelined loop: each work item runs to completion before the next one starts.

The work item pipeline attribute can be added to the code as follows:
__kernel __attribute__ ((reqd_work_group_size(8, 8, 1)))
void madd(__global int* a, __global int* b, __global int* output)
{
 int rank = get_local_size(0); 
 __local unsigned int bufa[64];
 __local unsigned int bufb[64];

  /* Pipeline the local-memory loads across work items */
  __attribute__((xcl_pipeline_workitems)) {
   int x = get_local_id(0);
   int y = get_local_id(1);
   bufa[x*rank + y] = a[x*rank + y];
   bufb[x*rank + y] = b[x*rank + y];
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  /* Pipeline the additions across work items */
  __attribute__((xcl_pipeline_workitems)) {
   int index = get_local_id(1)*rank + get_local_id(0);
   output[index] = bufa[index] + bufb[index];
  }
}