Creating the Xilinx OpenCL Compute Unit Binary Container

The main difference between targeting an OpenCL® application to a CPU/GPU and targeting an FPGA is the source of the compiled kernel. Both CPUs and GPUs have a fixed computing architecture onto which a compiler maps the kernel code. Therefore, OpenCL programs targeted for either kind of device invoke just-in-time compilation of kernel source files from the host code. The API for invoking just-in-time kernel compilation is as follows:

clCreateProgramWithSource(…)

IMPORTANT!: The clCreateProgramWithSource() function is not supported for kernels targeting an FPGA.

In contrast to a CPU or a GPU, consider an FPGA as a blank computing canvas onto which a compiler generates an optimized computing architecture for each kernel in the system. This inherent flexibility of the FPGA allows the developer to explore kernel optimizations and compute unit combinations that are beyond what is possible with a fixed architecture. The only drawback to this flexibility is that generating a kernel-specific optimized compute architecture takes longer than is acceptable for just-in-time compilation. The OpenCL standard addresses this fundamental difference between devices by allowing for an offline compilation flow.

The SDAccel™ Environment uses this offline compilation flow to generate kernel binaries. To maximize efficiency in the host program and allow the simultaneous instantiation of kernels that cooperate in the computation of a portion of an application, Xilinx® has defined the Xilinx OpenCL Compute Unit Binary format, .xclbin. The xclbin file is a binary library of kernel compute units that will be loaded together into an OpenCL context for a specific device. This format can hold either programming files for the FPGA or shared libraries for the processor. It also contains library descriptive metadata used by the Xilinx OpenCL runtime library during program execution.
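Because the host program receives the xclbin as an opaque file, a quick sanity check before handing it to the runtime can catch a wrong or truncated file early. The sketch below assumes the container uses the xclbin2 layout, whose header begins with an 8-byte magic field holding "xclbin2" followed by a NUL byte. The function and macro names are hypothetical, and the check inspects only that magic field, not the programming data or the metadata that follow it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: xclbin2-format containers start with the 8-byte magic
 * "xclbin2\0". This is a sanity check only; it does not validate the
 * rest of the container. Names here are hypothetical. */
#define XCLBIN_MAGIC "xclbin2"
#define XCLBIN_MAGIC_LEN 8 /* includes the terminating '\0' */

bool looks_like_xclbin(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < XCLBIN_MAGIC_LEN)
        return false;
    /* Compare all 8 bytes, including the trailing NUL of the magic. */
    return memcmp(buf, XCLBIN_MAGIC, XCLBIN_MAGIC_LEN) == 0;
}
```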

This lets you generate libraries of kernels that the host program can load and execute. The OpenCL APIs for supporting kernels generated in an offline compilation flow are as follows:

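// An offline binary container may be loaded with
// clCreateProgramWithBinary(). The context, device, and the
// xclbin contents (binary1, binary1_size) are assumed to have
// been obtained earlier by the host program.
const unsigned char *binaries[] = { binary1 };
size_t lengths[] = { binary1_size };
cl_int err;

cl_program p = clCreateProgramWithBinary(context, 1, &device, lengths,
                                         binaries, NULL, &err);
err = clBuildProgram(p, 1, &device, NULL, NULL, NULL);
// Check err == CL_SUCCESS after each call.

// Use the program:
// create queues, create buffers,
// transfer data to device input buffers,
// run kernels,
// read back output buffer from device...

// When done, release the program
clReleaseProgram(p);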
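Before clCreateProgramWithBinary() can be called, the host program must read the xclbin file into memory. A minimal sketch using only standard C I/O follows; load_binary_file() is a hypothetical helper, not part of the OpenCL or Xilinx runtime APIs.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads an entire file (for example, an .xclbin)
 * into a heap buffer. On success, returns the buffer and stores its
 * size in *size_out; the caller must free() the buffer.
 * Returns NULL on any error. */
unsigned char *load_binary_file(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Determine the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0 || fseek(f, 0, SEEK_SET) != 0) { fclose(f); return NULL; }

    /* malloc(0) may return NULL, so always allocate at least one byte. */
    unsigned char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (buf == NULL) { fclose(f); return NULL; }

    if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    *size_out = (size_t)len;
    return buf;
}
```

The returned buffer and size would then be passed as binaries[0] and lengths[0] to clCreateProgramWithBinary(), and the buffer can be freed once the program has been created.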

Only after clReleaseProgram() has been called can clCreateProgramWithBinary() be called again to create a new program from another binary container.

IMPORTANT!: With dynamic memory topology in version 5.0 and later, the OpenCL runtime does not allow loading a new binary container after the application is done with the first one if any cl_mem object from the first run still exists. This means an application cannot load a new binary container in the same process unless it releases all cl_mem objects with clReleaseMemObject() before calling clCreateProgramWithBinary() again.
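The two ordering rules above (release the program, then release every cl_mem, before loading another container) can be expressed as a small bookkeeping model. The sketch below is hypothetical and is not part of any Xilinx API: it only counts live programs and live cl_mem objects that the host program has created and released. As a simplifying assumption, it treats the first load in a process as unrestricted by existing buffers.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping model of the loading rule; not a Xilinx API. */
typedef struct {
    bool container_loaded_before; /* has any xclbin been loaded in this process? */
    int live_programs;            /* programs not yet passed to clReleaseProgram() */
    int live_mem_objects;         /* cl_mem objects not yet passed to clReleaseMemObject() */
} runtime_model;

/* True if another clCreateProgramWithBinary() call would be permitted. */
bool may_load_container(const runtime_model *m)
{
    if (m->live_programs > 0)
        return false; /* the previous program has not been released */
    if (m->container_loaded_before && m->live_mem_objects > 0)
        return false; /* cl_mem objects from the previous run still exist */
    return true;
}

/* Records a load if the rule allows it; returns false otherwise. */
bool model_load_container(runtime_model *m)
{
    if (!may_load_container(m))
        return false;
    m->container_loaded_before = true;
    m->live_programs++;
    return true;
}

void model_release_program(runtime_model *m)
{
    if (m->live_programs > 0)
        m->live_programs--;
}

void model_create_buffer(runtime_model *m) { m->live_mem_objects++; }

void model_release_buffer(runtime_model *m)
{
    if (m->live_mem_objects > 0)
        m->live_mem_objects--;
}
```

In this model, releasing the program alone is not enough: the second load is refused until the buffer created during the first run has also been released.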

The library metadata included in the xclbin file is automatically generated by the SDAccel Environment and does not require user intervention. This data is composed of compute unit descriptive information that is automatically generated during compute unit synthesis and used by the runtime to understand the contents of an xclbin file.

The xclbin file is created by the Xilinx OpenCL Compiler (xocc) command line utility, whose interface is modeled after gcc and which has two separate operating modes:

Compilation or build of kernel accelerator functions (described in C/C++/OpenCL language) into Xilinx object (.xo) files. This is the -c/--compile mode of xocc.
Linking several .xo files together with a platform to create the binary container needed by the host code. This is the -l/--link mode of xocc.

xocc can be used standalone (ideally from scripts or a build system such as make) and is also fully supported by the SDx IDE. See the for more information.

IMPORTANT!: With the current version of SDAccel, xocc must be invoked twice: once for compilation and once for linking; it cannot be used to perform both compilation and linking in a single invocation.
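To make the two separate invocations concrete, the sketch below formats both command lines as a C-based build driver might before passing them to system(). Only the -c and -l modes come from the text above; the -t, --platform, -k, and -o spellings, the hw target, and the names my_platform, vadd, and vadd.cl are assumptions for illustration, so check xocc --help for your release.

```c
#include <stdio.h>

/* Formats the two separate xocc invocations: compile (-c) one kernel
 * source into a .xo object, then link (-l) that object with a platform
 * into an .xclbin container. Returns 0 on success, -1 if either buffer
 * is too small. Flag spellings other than -c/-l are assumptions. */
int format_xocc_commands(const char *platform, const char *kernel,
                         const char *src, const char *xclbin,
                         char *compile_cmd, size_t compile_len,
                         char *link_cmd, size_t link_len)
{
    int n = snprintf(compile_cmd, compile_len,
                     "xocc -c -t hw --platform %s -k %s -o %s.xo %s",
                     platform, kernel, kernel, src);
    if (n < 0 || (size_t)n >= compile_len)
        return -1; /* output was truncated */

    n = snprintf(link_cmd, link_len,
                 "xocc -l -t hw --platform %s -o %s %s.xo",
                 platform, xclbin, kernel);
    if (n < 0 || (size_t)n >= link_len)
        return -1; /* output was truncated */
    return 0;
}
```

Keeping the compile and link steps as separate commands also lets a build system such as make rebuild only the .xo files whose sources changed before relinking the container.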