Reducing Overhead of Kernel Enqueuing

The OpenCL™ execution model supports data parallel and task parallel programming models. Kernels are usually enqueued by the OpenCL Runtime multiple times and then scheduled to be executed on the device. You must send the command to start the kernel in one of two ways:

Implicitly, using the clEnqueueNDRangeKernel API for the data parallel case
Explicitly, using clEnqueueTask for the task parallel case

The dispatching process runs on the host processor, and the actual commands and kernel arguments must be sent to the FPGA over the PCIe link. In the current OpenCL Runtime Library of the SDAccel Environment, the overhead of dispatching the command and arguments to the FPGA is between 30 µs and 60 µs, depending on the number of kernel arguments. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed.
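To get a feel for how this overhead scales with the number of enqueues, the following is a minimal sketch of the arithmetic. The helper name dispatch_overhead_us and the two constants are hypothetical; they encode only the 30 µs to 60 µs range quoted above and assume the cost is paid once per enqueue.

```cpp
#include <cstdint>

// Per-enqueue dispatch overhead range quoted in the text, in microseconds.
constexpr double kMinDispatchUs = 30.0;
constexpr double kMaxDispatchUs = 60.0;

// Hypothetical helper: total host-side dispatch overhead, in microseconds,
// for `num_enqueues` kernel launches that each cost `per_call_us`.
constexpr double dispatch_overhead_us(std::uint64_t num_enqueues,
                                      double per_call_us) {
    return static_cast<double>(num_enqueues) * per_call_us;
}
```

Under this model, 1,000 enqueues spend roughly 30 ms to 60 ms on dispatch alone, while a single enqueue that does the same work spends 30 µs to 60 µs.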

For the data parallel case, Xilinx recommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size as shown in the code snippet below:

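size_t global = 1;  // total number of work-items
size_t local = 1;   // work-items per work-group; equal to global, so one work-group
// Enqueue the kernel once; it waits on the 2 events in write_events.
clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
                       &global, &local, 2, write_events.data(),
                       &kernel_events[0]);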

For the task parallel case, Xilinx recommends that you minimize the number of calls to clEnqueueTask. Ideally, you should finish the entire workload in a single call to clEnqueueTask.
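The effect of batching can be sketched without a device. In the example below, MockQueue and enqueue_task are hypothetical stand-ins, not part of the OpenCL API: the mock only counts dispatches, which is the quantity the per-enqueue overhead scales with. The "kernel" doubles each element, an operation chosen purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a command queue: it only counts host dispatches.
struct MockQueue {
    std::size_t dispatches = 0;
};

// Stand-in for a kernel launched with clEnqueueTask: one dispatch processes
// `count` elements starting at `data` (here, doubling each one).
void enqueue_task(MockQueue& q, int* data, std::size_t count) {
    ++q.dispatches;
    for (std::size_t i = 0; i < count; ++i) {
        data[i] *= 2;
    }
}

// One dispatch per element: the overhead is paid data.size() times.
void run_per_element(MockQueue& q, std::vector<int>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        enqueue_task(q, &data[i], 1);
    }
}

// One dispatch for the whole buffer: the loop moves inside the kernel.
void run_batched(MockQueue& q, std::vector<int>& data) {
    enqueue_task(q, data.data(), data.size());
}
```

Both functions produce the same result, but run_batched dispatches once regardless of the buffer size. In real host code this corresponds to passing the element count as a kernel argument and looping inside the kernel.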