One-Dimensional NDRange
Figure: One-Dimensional Work Size
The preceding figure illustrates an example of one-dimensional NDRange with global size = (4096, 1, 1) and local size = (512, 1, 1). This allows the computation to be broken down into eight work-groups, each with 512 work-items.
Now consider a simple vector adder kernel written with a work size of (1, 1, 1):
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vadd(__global const int* a, __global const int* b, __global int* c) {
int i;
or (i=0; i < 4096; i++) {
c[i] = a[i] + b[i];
}
}
In this example, the kernel is written in sequential C style. The length of the data is 4096, and the function iterates over the data using an explicit loop. In OpenCL C, however, it is better to write the kernel as shown below:
__kernel __attribute__ ((reqd_work_group_size(512, 1, 1)))
void vadd(__global const int* a, __global const int* b, __global int* c) {
int i = get_global_id(0);
c[i] = a[i] + b[i];
}
This produces the NDRange and work group sizes shown above. Because this example allows the OpenCL compiler and runtime to control the iteration over the 4096 data items, it allows a simpler coding style and enables the compiler to make better optimization decisions to parallelize the operations. The call to get_global_id(0) provides the current location in the NDRange and is analogous to the index of a for loop. This is a simple example but is extensible to other larger work sizes. When using SDAccel, it is sometimes useful to think of the above code as transformed into the following form by the SDAccel compiler:
__kernel void vadd(global const int* a, global const int* b, global int* c) {
localid_t id;
for (id[0] = 0; id[0] < 512; id[0]++) {
for (id[1] = 0; id[1] < 1; id[1]++) {
for (id[2] = 0; id[2] < 1; id[2]++) {
c[id[0]] = a[id[0]] + b[id[0]];
}
}
}
}
Note that the code written within the kernel is surrounded by three nested loops to traverse the entire work-group size. These three for loops are conceptually introduced by SDAccel into the kernel to handle the three-dimensional space of the NDRange. The SDAccel compiler exploits NDRange parallelism by pipelining and vectorizing these conceptual loops.
The conceptual loop nest introduced by SDAccel can have either variable or fixed loop bounds. By setting the reqd_work_group_size attribute, the programmer is setting the loop boundaries on this loop nest. Fixed boundaries allow the kernel compiler to optimize the size of local memory in the compute unit and to provide latency estimates. If the work size is not specified, SDAccel might assume a large size for private memory, which can hinder the number of compute units that can be instantiated in the FPGA fabric.