OpenCL C Example

The following example gives an overview of what to consider when mapping an OpenCL kernel onto the FPGA programmable logic.


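#define LENGTH 64

// Single work-item kernel: c[i] = a[i] * b[i] for LENGTH elements.
__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vmult(__global const int* a, __global const int* b, __global int* c) {
    // On-chip copies of the input and output vectors.
    __local int bufa[LENGTH];
    __local int bufb[LENGTH];
    __local int bufc[LENGTH];

    // Burst-read both inputs from global memory, then wait for both copies.
    event_t evt[2];
    evt[0] = async_work_group_copy(bufa, a, LENGTH, 0);
    evt[1] = async_work_group_copy(bufb, b, LENGTH, 0);
    wait_group_events(2, evt);

    // Element-wise multiplication on the local buffers.
    for (int i = 0; i < LENGTH; i++) {
        bufc[i] = bufa[i] * bufb[i];
    }

    // Make the results in bufc visible before writing them back.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Burst-write the result to global memory and wait for completion.
    event_t e = async_work_group_copy(c, bufc, LENGTH, 0);
    wait_group_events(1, &e);
}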

SDAccel uses a number of FPGA resources to implement the functionality in this kernel, including the following:

  • Loops - These are common elements in kernel functionality and are implemented using a variety of FPGA resources including LUTs and flip-flops. These loops can be unrolled and pipelined based on resource and performance requirements (refer to Unrolling Loops and Pipelining Loops for more information). How loops are implemented can have a major impact on overall kernel performance.
  • Arrays - The arrays bufa, bufb, and bufc are typically implemented in block RAMs (BRAMs), the local memories distributed across the FPGA fabric.
  • Operators - The multiplication of each pair of vector elements can be performed by either LUTs or DSP48 blocks. The same is true for other common operators such as addition, subtraction, and comparison. To improve performance, the loop can be partially or fully unrolled, so that many multiplications are performed concurrently.
  • Communication - The high-speed communication between this kernel and the rest of the device is implemented using LUTs and flip-flops. This includes the interconnect and the memory controller, which service the calls to async_work_group_copy using high-bandwidth burst data transfers.
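The effect of partial unrolling on the multiplication loop can be sketched by writing the unrolled loop body out by hand. The sketch below is plain C (the loop bodies are also valid OpenCL C) and is only an illustration: the unroll factor of 4 and the function names vmult_rolled and vmult_unrolled4 are hypothetical and do not come from SDAccel. In a real kernel, unrolling would normally be requested from the compiler rather than written manually, for example with the OpenCL 2.0 loop attribute __attribute__((opencl_unroll_hint(4))).

```c
#define LENGTH 64
#define UNROLL 4   /* hypothetical unroll factor, chosen for illustration */

#if LENGTH % UNROLL != 0
#error "This sketch assumes LENGTH is a multiple of UNROLL"
#endif

/* Rolled form: one multiplication per iteration, as in the kernel loop. */
void vmult_rolled(const int *bufa, const int *bufb, int *bufc) {
    for (int i = 0; i < LENGTH; i++) {
        bufc[i] = bufa[i] * bufb[i];
    }
}

/* Partially unrolled form: four independent multiplications per iteration.
 * Because the four statements have no data dependencies on each other,
 * a hardware implementation is free to give each one its own multiplier
 * and evaluate them concurrently, at the cost of more LUTs or DSP48 blocks. */
void vmult_unrolled4(const int *bufa, const int *bufb, int *bufc) {
    for (int i = 0; i < LENGTH; i += UNROLL) {
        bufc[i]     = bufa[i]     * bufb[i];
        bufc[i + 1] = bufa[i + 1] * bufb[i + 1];
        bufc[i + 2] = bufa[i + 2] * bufb[i + 2];
        bufc[i + 3] = bufa[i + 3] * bufb[i + 3];
    }
}
```

Both functions compute the same result; the unrolled version runs 16 iterations instead of 64. Whether the four multiplications actually execute in the same cycle also depends on how many elements of bufa, bufb, and bufc can be accessed at once, so the achievable speedup is tied to how the arrays are implemented as well as to the number of multipliers.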