Reducing Overhead of Kernel Enqueuing
The OpenCL™ execution model supports data parallel and task parallel programming models.
Kernels are usually enqueued by the OpenCL Runtime multiple times and then scheduled to
be executed on the device. You must send the command to start the kernel in one of two
ways:
- Using the clEnqueueNDRangeKernel API for the data parallel case
- Using clEnqueueTask for the task parallel case
For the data parallel case, Xilinx recommends that you carefully choose the global and local
work sizes for your host code and kernel so that the global work size is a small multiple of
the local work size. Ideally, the global work size is the same as the local work size, as shown
in the code snippet below:
size_t global = 1;
size_t local = 1;
// Enqueue a one-dimensional NDRange that waits on the two events in write_events.
clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
                       &global, &local, 2, write_events.data(),
                       &kernel_events[0]);
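One way to sanity-check a chosen pair of work sizes on the host is to compute how many work-groups the launch is split into. The sketch below is only an illustration: work_group_count is a hypothetical helper, not part of the OpenCL API, and it assumes a one-dimensional launch under the OpenCL 1.x rule that the local work size must evenly divide the global work size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL API): number of work-groups that a
// one-dimensional NDRange launch is divided into.
std::size_t work_group_count(std::size_t global, std::size_t local) {
    // Assumption: OpenCL 1.x rules, where both sizes are non-zero and the
    // local work size evenly divides the global work size.
    assert(global > 0 && local > 0 && global % local == 0);
    return global / local;
}
```

With global = 1 and local = 1, as in the snippet above, the helper returns 1, which is the ideal case described in the text. A result that grows large suggests the global work size is no longer a small multiple of the local work size.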
For the task parallel case, Xilinx recommends that you minimize the number of calls to clEnqueueTask. Ideally, you should finish the entire workload in a single call to clEnqueueTask.
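A common way to reach a single call is to move the loop over the data into the kernel itself, rather than starting the kernel once per element from the host. The following is a minimal sketch of that idea modelled in plain C++ with no OpenCL calls. The names scale_all and run_single_launch are hypothetical, and the scaling operation is only a placeholder workload.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel body, modelled as a plain C++ function. The loop over
// all elements is inside the kernel, so one launch covers the whole workload.
void scale_all(const int* in, int* out, std::size_t n, int factor) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}

// Hypothetical host-side stand-in for one clEnqueueTask call.
// Returns the number of kernel launches it issued, which is always 1.
std::size_t run_single_launch(const std::vector<int>& in,
                              std::vector<int>& out, int factor) {
    out.resize(in.size());
    scale_all(in.data(), out.data(), in.size(), factor);
    return 1;
}
```

In this model the number of launches stays at one regardless of the input length, whereas a host loop that starts a one-element kernel per item would pay the enqueue overhead once per element.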