Building the Device Binary

The kernel code is written in C, C++, OpenCL™ C, or RTL, and is built by compiling the kernel code into a Xilinx® object (XO) file, and linking the XO files into a Xilinx binary (.xclbin) file, as shown in the following figure.

Figure 1: Device Build Process

The process, as outlined above, has two steps:

  1. Build the Xilinx object files from the kernel source code.
    • For C, C++, or OpenCL kernels, the v++ -c command compiles the source code into Xilinx object (XO) files. Multiple kernels are compiled into separate XO files.
    • For RTL kernels, the Vivado IP packager produces the XO file to be used for linking. Refer to RTL Kernels for more information.
    • You can also create kernel object (XO) files working directly in the Vitis™ HLS tool. Refer to Compiling Kernels with Vitis HLS for more information.
  2. After compilation, the v++ -l command links one or multiple kernel objects (XO), together with the hardware platform XSA file, to produce the Xilinx binary .xclbin file.
TIP: The v++ command can be used from the command line, in scripts, or in a build system such as make, and can also be used through the Vitis IDE as discussed in Using the Vitis IDE.
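
For example, a minimal two-step hardware build of a hypothetical vadd kernel could look like the following sketch, assuming the same platform used in the examples later in this section:
# Step 1: Compile the kernel source into a Xilinx object (XO) file
v++ -c -t hw --platform xilinx_u200_xdma_201830_2 -k vadd \
-I'./src' -o'vadd.hw.xo' ./src/vadd.cpp
# Step 2: Link the XO file with the platform to produce the device binary
v++ -l -t hw --platform xilinx_u200_xdma_201830_2 \
-o'vadd.hw.xclbin' vadd.hw.xo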

Compiling Kernels with the Vitis Compiler

IMPORTANT: Set up the command shell or window as described in Setting Up the Vitis Environment prior to running the tools.
The first stage in building the xclbin file is to compile the kernel code using the Xilinx Vitis compiler. There are multiple v++ options that need to be used to correctly compile your kernel. The following is an example command line to compile the vadd kernel:
v++ -t sw_emu --platform xilinx_u200_xdma_201830_2 -c -k vadd \
-I'./src' -o'vadd.sw_emu.xo' ./src/vadd.cpp

The various arguments used are described below. Note that some of the arguments are required.

  • -t <arg>: Specifies the build target, as discussed in Build Targets. Software emulation (sw_emu) is used as an example. Optional. The default is hw.
  • --platform <arg>: Specifies the accelerator platform for the build. This is required because runtime features and the target platform are linked as part of the FPGA binary. To compile a kernel for an embedded processor application, specify an embedded processor platform: --platform $PLATFORM_REPO_PATHS/zcu102_base/zcu102_base.xpfm.
  • -c: Compile the kernel. Required. The kernel must be compiled (-c) and linked (-l) in two separate steps.
  • -k <arg>: Name of the kernel associated with the source files.
  • -o'<output>.xo': Specify the Xilinx object (XO) file output by the compiler. Optional.
  • <source_file>: Specify source files for the kernel. Multiple source files can be specified. Required.

The above list is a sample of the extensive options available. Refer to Vitis Compiler Command for details of the various command line options. Refer to Output Directories of the v++ Command to get an understanding of the location of various output files.

Compiling Kernels with Vitis HLS

The use model described for the Vitis core development kit is a top-down approach, starting with C/C++ or OpenCL code, and working toward compiled kernels. However, you can also directly develop the kernel to produce a Xilinx object (XO) file to be paired for linking using v++ to produce the .xclbin. This approach can be used for C/C++ kernels using the Vitis HLS tool, which is the focus of this section, or RTL kernels using the Vivado Design Suite. Refer to RTL Kernels for more information.

The approach of developing the kernel directly, either in RTL or C/C++, to produce an XO file, is sometimes referred to as the bottom-up flow. This allows you to validate kernel performance and perform optimizations within Vitis HLS, and export the Xilinx object file for use in the Vitis application acceleration development flow. Refer to the Vitis HLS Flow for more information on using that tool.

Figure 2: Vitis HLS Bottom-Up Flow

The benefits of the Vitis HLS bottom-up flow can include:

  • Developing, validating, and optimizing the kernel separately from the main application.
  • Enabling a team approach to design, with collaboration on host program and kernel development.
  • Preserving specific kernel optimizations in the XO file.
  • Using and reusing a collection of XO files like a library.

Creating Kernels in Vitis HLS

Generating kernels from C/C++ code for use in the Vitis core development kit follows the standard Vitis HLS process. However, because the kernel is required to operate in the Vitis software platform, the standard kernel requirements must be satisfied (see Kernel Properties). Most importantly, the interfaces must be modeled as AXI memory interfaces, except for scalar parameters, which are mapped to an AXI4-Lite interface. Vitis HLS automatically defines the interface ports to meet the standard kernel requirements when using the Vitis bottom-up flow as described here.
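
For reference, the following minimal kernel sketch shows what these interfaces look like when written explicitly; the function and argument names are placeholders, and in the Vitis kernel flow Vitis HLS infers equivalent interfaces even without the pragmas:
extern "C" {
void vadd(const int *in1, const int *in2, int *out, int size) {
// Pointer arguments are modeled as AXI memory-mapped (m_axi) interfaces
#pragma HLS INTERFACE m_axi port=in1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=in2 offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
// Scalar arguments and the kernel control interface map to AXI4-Lite
#pragma HLS INTERFACE s_axilite port=size
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < size; i++) {
        out[i] = in1[i] + in2[i];
    }
}
}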

The process for creating and compiling your HLS kernel is outlined briefly below. You should refer to Creating a New Vitis HLS Project in the Vitis HLS Flow documentation for a more complete description of this process.

  1. Launch Vitis HLS to open the integrated design environment (IDE), and specify File > New Project.
  2. In the New Vitis HLS Project wizard, specify the Project name, define the Location for the project, and click Next.
  3. In the Add/Remove Files page, click Add Files to add the kernel source code to the project. Select Top Function to define the kernel function by clicking the Browse button, and click Next when done.
  4. You can specify a C-based simulation test bench if you have one available, by clicking Add Files, or skip this by clicking Next.
    TIP: As discussed in the Vitis HLS documentation, the use of a test bench is strongly recommended.
  5. In the Solution Configuration page, you must specify the Clock Period for the kernel.

  6. Choose the target platform by clicking the browse button in the Part Selection field to open the Device Selection dialog box. Select Boards, and select the target platform for your compiled kernel. Click OK to select the platform and return to the Solution Configuration page.
  7. In the Solution Configuration page, select the Vitis Kernel Flow Target drop-down menu under Flow Target, and click Finish to complete the process and create your HLS kernel project.
    IMPORTANT: You must select the Vitis Kernel Flow Target to generate the Xilinx object (XO) file from the project.

When the HLS project has been created, you can run C Synthesis to compile the kernel code. Refer to the Vitis HLS documentation for a complete description of the HLS tool flow.

After synthesis is completed, the kernel can be exported as an XO file for use in the Vitis core development kit. The export command is available through the Solution > Export RTL command from the main menu.

Specify the file location, and the kernel is exported as a Xilinx object (XO) file.

The XO file can be used as an input file during the v++ linking process. Refer to Linking the Kernels for more information. You can also add it to an application project in the Vitis IDE, as discussed in Creating a Vitis IDE Project.

However, keep in mind that HLS kernels, created in the bottom-up flow described here, have certain limitations when used in the Vitis application acceleration development flow. Software emulation is not supported for applications using HLS kernels, because duplicated header file dependencies can create issues. GDB debug is not supported in the hardware emulation flow for HLS kernels, or RTL kernels.

Vitis HLS Script for Creating Kernels

If you run HLS synthesis through Tcl scripts, you can edit the following script to create HLS kernels as previously described:

# Define variables for your HLS kernel:
set projName <proj_name>
set krnlName <kernel_name>
set krnlFile <kernel_source_code>
set krnlTB <kernel_test_bench>
set krnlPlatform <target_part>
set path <path_to_project>

#Script to create and output HLS kernel
open_project $projName
set_top $krnlName
add_files $krnlFile
add_files -tb $krnlTB
open_solution "solution1"
set_part $krnlPlatform
create_clock -period 10 -name default
config_flow -target vitis
csim_design
csynth_design
cosim_design
export_design -flow impl -format xo -output "./hlsKernel/hlsKernel.xo"

Run the HLS kernel script by using the following command after setting up your environment as discussed in Setting Up the Vitis Environment.

vitis_hls -f <hls_kernel_script>.tcl

Linking the Kernels

TIP: Set up the command shell or window as described in Setting Up the Vitis Environment prior to running the tools.

The kernel compilation process results in a Xilinx object (XO) file whether the kernel is written in C/C++, OpenCL C, or RTL. During the linking stage, XO files from different kernels are linked with the platform to create the FPGA binary container file (.xclbin) used by the host program.

Similar to compiling, linking requires several options. The following is an example command line to link the vadd kernel binary:
v++ -t sw_emu --platform xilinx_u200_xdma_201830_2 --link vadd.sw_emu.xo \
-o'vadd.sw_emu.xclbin' --config ./connectivity.cfg

This command contains the following arguments:

  • -t <arg>: Specifies the build target. Software emulation (sw_emu) is used as an example. When linking, you must use the same -t and --platform arguments as specified when the input (XO) file was compiled.
  • --platform <arg>: Specifies the platform to link the kernels with. To link the kernels for an embedded processor application, you simply specify an embedded processor platform: --platform $PLATFORM_REPO_PATHS/zcu102_base/zcu102_base.xpfm
  • --link: Link the kernels and platform into an FPGA binary file (xclbin).
  • <input>.xo: Input object file. Multiple object files can be specified to build into the .xclbin.
  • -o'<output>.xclbin': Specify the output file name. The output file of the link stage is an .xclbin file. The default output name is a.xclbin.
  • --config ./connectivity.cfg: Specify a configuration file that is used to provide v++ command options for a variety of uses. Refer to Vitis Compiler Command for more information on the --config option.
TIP: Refer to Output Directories of the v++ Command to get an understanding of the location of various output files.

Beyond simply linking the Xilinx object (XO) files, the linking process is also where important architectural details are determined. In particular, this is where the number of compute units (CUs) to instantiate into hardware is specified, connections from kernel ports to global memory are assigned, and CUs are assigned to SLRs. The following sections discuss some of these build options.
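
As a preview of the sections that follow, a single hypothetical config file can combine these architectural choices; the kernel, argument, and bank names below are placeholders:
[connectivity]
# Create two compute units from the vadd kernel
nk=vadd:2:vadd_1.vadd_2
# Connect each compute unit's argument to its own global memory bank
sp=vadd_1.in1:DDR[0]
sp=vadd_2.in1:DDR[1]
# Assign each compute unit to an SLR
slr=vadd_1:SLR0
slr=vadd_2:SLR1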

Creating Multiple Instances of a Kernel

By default, the linker builds a single hardware instance from a kernel. If the host program executes the same kernel multiple times, for instance due to data processing requirements, it must execute the kernel on the hardware accelerator sequentially. This can impact overall application performance. However, you can customize the kernel linking stage to instantiate multiple hardware compute units (CUs) from a single kernel. This can improve performance because the host program can make multiple overlapping kernel calls, executing kernels concurrently on separate compute units.

Multiple CUs of a kernel can be created by using the connectivity.nk option in the v++ config file during linking. Edit a config file to include the needed options, and specify it in the v++ command line with the --config option, as described in Vitis Compiler Command.

For example, for the vadd kernel, two hardware instances can be implemented in the config file as follows:
[connectivity]
#nk=<kernel name>:<number>:<cu_name>.<cu_name>...
nk=vadd:2

Where:

<kernel_name>
Specifies the name of the kernel to instantiate multiple times.
<number>
The number of kernel instances, or CUs, to implement in hardware.
<cu_name>.<cu_name>...
Specifies the instance names for the specified number of instances. This is optional; when not specified, the instance names default to <kernel_name>_1, <kernel_name>_2, and so on.
Then the config file is specified on the v++ command line:
v++ --config vadd_config.cfg ...

In the vadd example above, the result is two instances of the vadd kernel, named vadd_1 and vadd_2.

TIP: You can check the results by using the xclbinutil command to examine the contents of the xclbin file. Refer to xclbinutil Utility.
The following example results in three CUs of the vadd kernel, named vadd_X, vadd_Y, and vadd_Z in the xclbin binary file:
[connectivity]
nk=vadd:3:vadd_X.vadd_Y.vadd_Z

Mapping Kernel Ports to Memory

The link phase is when the memory ports of the kernels are connected to memory resources which include DDR, HBM, and PLRAM. By default, when the xclbin file is produced during the v++ linking process, all kernel memory interfaces are connected to the same global memory bank (or gmem). As a result, only one kernel interface can transfer data to/from the memory bank at one time, limiting the performance of the application due to memory access.

While the Vitis compiler can automatically connect CUs to global memory resources, you can also manually specify which global memory bank each kernel argument (or interface) is connected to. Proper configuration of kernel to memory connectivity is important to maximize bandwidth, optimize data transfers, and improve overall performance. Even if there is only one compute unit in the device, mapping its input and output arguments to different global memory banks can improve performance by enabling simultaneous accesses to input and output data.

IMPORTANT: Up to 15 kernel interfaces can be connected to a single global memory bank. Therefore, if there are more than 15 memory interfaces, you must explicitly perform the memory mapping as described here, using the --connectivity.sp option to distribute connections across different memory banks.

The following example is based on the Kernel Interfaces example code. Start by assigning the kernel arguments to separate bundles to increase the available interface ports, then assign the arguments to separate memory banks:

  1. In C/C++ kernels, assign arguments to separate bundles in the kernel code prior to compiling them:
    void cnn( int *pixel, // Input pixel
      int *weights, // Input Weight Matrix
      int *out, // Output pixel
      ... // Other input or Output ports
    #pragma HLS INTERFACE m_axi port=pixel offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi port=weights offset=slave bundle=gmem1
    #pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
    

    Note that the memory interface inputs pixel and weights are assigned different bundle names in the example above, while out is bundled with pixel. This creates two separate interface ports.

    IMPORTANT: You must specify bundle= names using all lowercase characters to be able to assign them to specific memory banks using the --connectivity.sp option.
  2. Edit a config file to include the --connectivity.sp option, and specify it in the v++ command line with the --config option, as described in Vitis Compiler Command.
    For example, for the cnn kernel shown above, the connectivity.sp option in the config file would be as follows:
    [connectivity]
    #sp=<compute_unit_name>.<argument>:<bank name> 
    sp=cnn_1.pixel:DDR[0]          
    sp=cnn_1.weights:DDR[1]
    sp=cnn_1.out:DDR[2]
    

    Where:

    • <compute_unit_name> is an instance name of the CU as determined by the connectivity.nk option, described in Creating Multiple Instances of a Kernel, or is simply <kernel_name>_1 if multiple CUs are not specified.
    • <argument> is the name of the kernel argument. Alternatively, you can specify the name of the kernel interface as defined by the HLS INTERFACE pragma for C/C++ kernels, including m_axi_ and the bundle name. In the cnn kernel above, the ports would be m_axi_gmem and m_axi_gmem1.
      TIP: For RTL kernels, the interface is specified by the interface name defined in the kernel.xml file.
    • <bank_name> is denoted as DDR[0], DDR[1], DDR[2], and DDR[3] for a platform with four DDR banks. You can also specify the memory as a contiguous range of banks, such as DDR[0:2], in which case XRT will assign the memory bank at run time.

      Some platforms also provide support for PLRAM, HBM, HP or MIG memory, in which case you would use PLRAM[0], HBM[0], HP[0] or MIG[0]. You can use the platforminfo utility to get information on the global memory banks available in a specified platform. Refer to platforminfo Utility for more information.

      In platforms that include both DDR and HBM memory banks, kernels must use separate AXI interfaces to access the different memories. DDR and PLRAM access can be shared from a single port.

    • IMPORTANT: Customized bank assignments might also need to be reflected in the host code in some cases, as described in Assigning DDR Bank in Host Code.

Connecting Directly to Host Memory

The PCIe® Slave-Bridge IP is provided on some data center platforms to let kernels access host memory directly. Configuring the device binary to connect to host memory requires changing the kernel port connectivity using the --connectivity.sp option, as shown below. It also requires changes to the accelerator card setup and to your host application, as described in Host-Memory Access in the XRT documentation.

[connectivity]
## Syntax
##sp=<cu_name>.<interface_name>:HOST[0]
sp=cnn_1.m_axi_gmem:HOST[0]

In the syntax above, the CU name and interface name are specified in the same way as for other --connectivity.sp mappings, but the bank name is hard-coded to HOST[0].

HBM Configuration and Use

Some algorithms are memory bound, limited by the 77 GB/s bandwidth available on DDR-based Alveo cards. For those applications there are HBM (High Bandwidth Memory) based Alveo cards, providing up to 460 GB/s memory bandwidth. For the Alveo implementation, two 16-layer HBM stacks (HBM2 specification) are incorporated into the FPGA package and connected into the FPGA fabric with an interposer. A high-level diagram of the two HBM stacks is as follows.

Figure 3: High-Level Diagram of Two HBM Stacks

This implementation provides:

  • 8 GB of HBM memory
  • 32 HBM segments of 256 MB each, called pseudo channels (PCs)
  • An independent AXI channel per PC for communication with the FPGA, through a segmented crossbar switch
  • One two-channel memory controller for every two PCs
  • 14.375 GB/s maximum theoretical bandwidth per PC
  • 460 GB/s (32 × 14.375 GB/s) maximum theoretical bandwidth for the HBM subsystem

Although each PC has a theoretical max performance of 14.375 GB/s, this is less than the theoretical max of 19.25 GB/s for a DDR channel. To get better than DDR performance, designs must efficiently use multiple AXI masters into the HBM subsystem. The programmable logic has 32 HBM AXI interfaces that can access any memory location in any of the PCs on either of the HBM stacks through a built-in switch providing access to the full 8 GB memory space. For more detailed information on the HBM, refer to AXI High Bandwidth Controller LogiCORE IP Product Guide (PG276).

Note: Because of the complexity and flexibility of the built-in switch, there are many combinations that result in congestion at a particular memory location or in the switch itself. Interleaved read and write transactions cause a drop in efficiency with respect to read-only or write-only due to memory controller timing parameters (bus turnaround). Write transactions that span both HBM stacks will also experience degraded performance, and should be avoided. It is important to plan memory accesses so that kernels access limited memory where possible, and to isolate the memory accesses for different kernels into different HBM PCs.

Connection to the HBM is managed by the HBM Memory Subsystem (HMSS) IP, which enables all HBM PCs, and automatically connects the XDMA to the HBM for host access to global memory. When used with the Vitis compiler, the HMSS is automatically customized to activate only the necessary memory controllers and ports as specified by the --connectivity.sp option to connect both the user kernels and the XDMA to those memory controllers for optimal bandwidth and latency. Refer to the Using HBM Tutorial for additional information and examples.

In the following config file example, the kernel input ports in1 and in2 are connected to HBM PCs 0 and 1, respectively, and the output buffer out is written to HBM PCs 3 and 4. Each HBM PC is 256 MB, giving this kernel access to a total of 1 GB of memory.

[connectivity]
sp=krnl.in1:HBM[0]
sp=krnl.in2:HBM[1]
sp=krnl.out:HBM[3:4]
Note: In the config file, only the mapping to the HBM pseudo channel is defined, and each AXI interface should only access a contiguous subset of the available 32 HBM PCs. The HMSS chooses the appropriate HBM port to access memory and to maximize bandwidth and minimize latency.

The HBM ports are located in the bottom SLR of the device. The HMSS automatically handles the placement and timing complexities of AXI interfaces crossing super logic regions (SLR) in SSI technology devices. By default, without specifying the --connectivity.sp or --connectivity.slr options on v++, all kernel AXI interfaces access HBM[0] and all kernels are assigned to SLR0. However, you can specify the SLR assignments of kernels using the --connectivity.slr option. Refer to Assigning Compute Units to SLRs for more information.
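
For example, a hypothetical kernel whose input buffer is larger than a single 256 MB PC could be mapped to a contiguous range of PCs and kept in the SLR containing the HBM ports; the kernel and port names below are placeholders:
[connectivity]
# Map the input to a 1 GB range of pseudo channels (HBM[0] through HBM[3])
sp=krnl_1.in1:HBM[0:3]
sp=krnl_1.out:HBM[4]
# Keep the compute unit in the bottom SLR, where the HBM ports are located
slr=krnl_1:SLR0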

Random Access and the RAMA IP

HBM performs well in applications where sequential data access is required. However, for applications requiring random data access, performance can vary significantly depending on the application requirements (for example, the ratio of read and write operations, minimum transaction size, and size of the memory space being addressed). In these cases, the addition of the Random Access Memory Attachment (RAMA) IP to the target platform can significantly improve random memory access efficiency in cases where the required memory exceeds the 256 MB limit of a single HBM PC. Refer to RAMA LogiCORE IP Product Guide (PG310) for more information.

TIP: To effectively use the RAMA IP in your application, the kernel should access memory from multiple HBM PCs and should use a single static ID on the AXI transaction ID ports (AxID), or slowly changing (pseudo-static) AXI transaction IDs. If these conditions are not met, the thread creation used in the RAMA IP to improve performance has little effect, and consumes programmable logic resources for no purpose.

Add the RAMA IP to the target platform during the system linking process using the following v++ command option to specify a Tcl script to define the ports of interest:

v++ -l --advanced.param compiler.userPreSysLinkOverlayTcl=<path_to>/user_tcl_file.tcl
Within this user-specified Tcl script, an API is provided to let you configure the HMSS resource:
hbm_memory_subsystem::ra_master_interface <Endpoint AXI master interface> [get_bd_cells hmss_0]

The following example has two AXI master ports (M00_AXI and M01_AXI) for random access:

hbm_memory_subsystem::ra_master_interface [get_bd_intf_pins dummy/M00_AXI] [get_bd_cells hmss_0]
hbm_memory_subsystem::ra_master_interface [get_bd_intf_pins dummy/M01_AXI] [get_bd_cells hmss_0]
validate_bd_design -force

It is important to end the Tcl script with the validate_bd_design command as shown above to allow the information to be collected correctly by the HBM subsystem, and the block design to be updated.

PLRAM Configuration and Use

Alveo accelerator cards contain HBM DRAM and DDR DRAM memory resources. Some accelerator cards also provide internal FPGA PLRAM (UltraRAM and block RAM) as an additional memory resource. Supporting platforms typically contain instances of PLRAM in each SLR. The size and type of each PLRAM can be configured on the target platform before kernels or compute units are linked into the system.

You can use a Tcl script to configure the PLRAM before system linking occurs. The use of the Tcl script can be enabled on the v++ command line as follows:
v++ -l --advanced.param compiler.userPreSysLinkOverlayTcl=<path_to>/user_tcl_file.tcl
Within this user-specified Tcl script, an API is provided to let you configure the PLRAM instance or memory resource:
sdx_memory_subsystem::update_plram_specification <memory_subsystem_bdcell> <plram_resource> <plram_specification>

The <plram_specification> is a Tcl dictionary consisting of the following entries (entries below are the default values for each instance in the platform):

 { 
	SIZE 128K # Up to 4M 
	AXI_DATA_WIDTH 512 # Up to 512
	SLR_ASSIGNMENT SLR0 # SLR0 / SLR1 / SLR2 
	READ_LATENCY 1 # To optimise timing path 
	MEMORY_PRIMITIVE BRAM # BRAM or URAM 
}

In the example below, PLRAM_MEM00 is changed to be 2 MB in size and composed of UltraRAM; PLRAM_MEM01 is changed to be 4 MB in size and composed of UltraRAM. PLRAM_MEM00 and PLRAM_MEM01 correspond to the --connectivity.sp memory resources PLRAM[0] and PLRAM[1].

# Setup PLRAM 
sdx_memory_subsystem::update_plram_specification \
  [get_bd_cells /memory_subsystem] PLRAM_MEM00 \
  { SIZE 2M AXI_DATA_WIDTH 512 SLR_ASSIGNMENT SLR0 READ_LATENCY 10 MEMORY_PRIMITIVE URAM }

sdx_memory_subsystem::update_plram_specification \
  [get_bd_cells /memory_subsystem] PLRAM_MEM01 \
  { SIZE 4M AXI_DATA_WIDTH 512 SLR_ASSIGNMENT SLR0 READ_LATENCY 10 MEMORY_PRIMITIVE URAM }

validate_bd_design -force
save_bd_design

The READ_LATENCY is an important attribute, because it sets the number of pipeline stages between memories cascaded in depth. This varies by design, and affects the timing QoR of the platform and the eventual kernel clock rate. In the example above for PLRAM_MEM01:

  • 4 MB of memory are required in total.
  • Each UltraRAM is 32 KB (64 bits wide). 4 MB ÷ 32 KB → 128 UltraRAMs in total.
  • Each PLRAM instance is 512 bits wide → 8 UltraRAMs are required in width.
  • 128 total UltraRAMs with 8 UltraRAMs in width → 16 UltraRAMs in depth.
  • A good rule of thumb is to pick a read latency of depth/2 + 2 → in this case, READ_LATENCY = 10.

This allows a pipeline on every second UltraRAM, resulting in the following:

  • Good timing performance between UltraRAMs.
  • Placement flexibility; not all UltraRAMs need to be placed in the same UltraRAM column for cascade.

Specifying Streaming Connections between Compute Units

The Vitis core development kit supports streaming data transfer between two kernels, allowing data to move directly from one kernel to another without having to transmit back through global memory. However, the process has to be implemented in the kernel code itself, as described in Streaming Data in User-Managed Never-Ending Kernels, and also specified during the kernel build process.

The streaming data ports of kernels can be connected during v++ linking using the --connectivity.sc option. This option can be specified at the command line, or from a config file that is specified using the --config option, as described in Vitis Compiler Command.

To connect the streaming output port of a producer kernel to the streaming input port of a consumer kernel, set up the connection in the v++ config file using the connectivity.stream_connect option as follows:

[connectivity]
#stream_connect=<cu_name>.<output_port>:<cu_name>.<input_port>:[<fifo_depth>]
stream_connect=vadd_1.stream_out:vadd_2.stream_in

Where:

  • <cu_name> is an instance name of the CU as determined by the connectivity.nk option, described in Creating Multiple Instances of a Kernel.
  • <output_port> or <input_port> is the streaming port defined in the producer or consumer kernel.
  • [:<fifo_depth>] inserts a FIFO of the specified depth between the two streaming ports to prevent stalls. The value is specified as an integer, as shown in the example below.
TIP: If the port-width of the output and input ports do not match, the Vitis compiler will automatically insert a data-width converter between the two ports as part of the build process.
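
For example, the following entry makes the same connection shown above, but with an explicit 64-deep FIFO between the two ports:
[connectivity]
stream_connect=vadd_1.stream_out:vadd_2.stream_in:64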

Assigning Compute Units to SLRs

Currently, Xilinx devices on Data Center accelerator cards use stacked silicon consisting of several Super Logic Regions (SLRs) to provide device resources, including global memory. When assigning ports to global memory banks, as described in Mapping Kernel Ports to Memory, it is best for performance that the CU instance is placed in the same SLR as the global memory it is connected to. In such cases, you can manually assign the kernel instance, or CU, to that SLR.

IMPORTANT: If your kernel is too large to fit into a single SLR, the Vitis compiler automatically places the logic across multiple SLRs. In this case, you should not assign the SLR, as doing so could result in an error during implementation.

A CU can be assigned to an SLR using the connectivity.slr option in a config file. The syntax of the connectivity.slr option in the config file is as follows:

[connectivity]
#slr=<compute_unit_name>:<slr_ID>
slr=vadd_1:SLR2
slr=vadd_2:SLR3

Where:

  • <compute_unit_name> is an instance name of the CU as determined by the connectivity.nk option, described in Creating Multiple Instances of a Kernel, or is simply <kernel_name>_1 if multiple CUs are not specified.
  • <slr_ID> is the SLR number to which the CU is assigned, in the form SLR0, SLR1,...

The assignment of a CU to an SLR must be specified for each CU separately, but is not required. If an assigned CU is connected to global memory located in another SLR, the tool will automatically insert SLR crossing registers to help with timing closure. In the absence of an SLR assignment, the v++ linker is free to assign the CU to any SLR.

After editing the config file to include the SLR assignments, you can use it during the v++ linking process by specifying the config file using the --config option:
v++ -l --config config_slr.cfg ...

Managing Clock Frequencies

IMPORTANT: The --clock options described here are only supported by platform shells with fixed status clocks as described in Identifying Platform Clocks. On older platform shells without fixed clocks, kernels operate at the default operating frequency of the platform.

In embedded processor platforms, and also in newer Data Center accelerator cards, the device binary can connect kernels to the platform using different clock frequencies. Each kernel, or each unique instance of a kernel, can connect to a specified clock frequency, or to multiple clocks, and different kernels can use different clock frequencies generated by the platform.

During the compilation process (v++ -c), you can specify a kernel frequency using the --hls.clock option. This compiles the kernel targeting the specified frequency, and lets the Vitis HLS tool validate the kernel logic at that frequency. However, this is only a target for compilation; it provides optimization and feedback, but the actual kernel clock is determined when the device binary is linked.

During the linking process, when the kernels are connected to the platform to build the device binary, you can specify the clock frequency for the kernels using the --clock Options of the v++ command. Therefore the process for managing clock frequencies is as follows:

  1. Compile the HLS code at a specified frequency using the Vitis compiler:
    v++ -c -k <krnl_name> --hls.clock <freqHz>:<krnl_name>
    TIP: The frequency must be specified in Hz (for example, 250000000 Hz is 250 MHz).
  2. During linking, specify the clock frequency or clock ID for each clock signal in a kernel with the following command:
    v++ -l ... --clock.freqHz <freqHz>:kernelName.clk_name

You can specify the --clock option using either a clock ID from the platform shell, or by specifying a frequency for the kernel clock. When specifying the clock ID, the kernel frequency is defined by the frequency of that clock ID on the platform. When specifying the kernel frequency, the platform attempts to create the specified frequency by scaling one of the available fixed platform clocks. In some cases, the clock frequency can only be achieved in some approximation, and you can specify the --clock.tolerance or --clock.default_tolerance to indicate an acceptable range. If the available fixed clock cannot be scaled within the acceptable tolerance, a warning is issued and the kernel is connected to the default clock.
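
As a sketch of this two-step flow, the following commands target 250 MHz for a hypothetical vadd kernel, assuming a single compute unit named vadd_1 and that the kernel clock port uses the usual HLS name ap_clk:
# Compile the kernel, validating the logic against a 250 MHz (250000000 Hz) target
v++ -c -k vadd --hls.clock 250000000:vadd -o'vadd.xo' ./src/vadd.cpp
# Link, requesting 250 MHz for the kernel clock
v++ -l ... --clock.freqHz 250000000:vadd_1.ap_clk -o'vadd.xclbin' vadd.xo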

Identifying Platform Clocks

The handling of clocks in accelerator cards has evolved to support multiple platform clocks and clock frequencies. Improved clock handling is dependent on a fixed-status platform clock that is available to drive additional kernel frequencies. You can determine if the target platform shell has fixed clocks by using the platforminfo -v command. For example, the following command returns verbose information related to a newer shell for the U200 platform:
platforminfo -v -p xilinx_u200_gen3x16_xdma_1_202110_1 -o pfmClocks.txt
The information reported in the output file includes the following clock details:
=================
Clock Information
=================
...
...
  Clock Index:         2
    Frequency:         50.000000
    Name:              ii_level0_wire_ulp_m_aclk_ctrl_00
    Pretty Name:       PL 2
    Inst Ref:          ii_level0_wire
    Comp Ref:          ii_level0_wire
    Period:            20.000000
    Normalized Period: .020000
    Status:            fixed
...

In the example above you can see that Clock Index: 2 is a fixed status clock. In fact, although not shown here, this platform shell provides three fixed status clocks which can be used to drive multiple clock frequencies in linked kernels. You can use the --clock.xxx options to drive kernel clocks as explained above.

However, on older platform shells, such as the xilinx_u200_xdma_201830_2, there are no fixed platform clocks to drive clock frequencies in linked kernels. This can be determined by reviewing the Clock Information reported by the following command:
platforminfo -v -p xilinx_u200_xdma_201830_2 -o pfmClocks.txt

On legacy platforms, without fixed-status clocks, you can use the v++ --kernel_frequency option to specify the clock frequency as described in Vitis Compiler General Options.
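
For example, on such a legacy platform the kernel clock can be requested at link time as follows (the 300 MHz value is only illustrative):
v++ -l -t hw --platform xilinx_u200_xdma_201830_2 --kernel_frequency 300 \
-o'vadd.hw.xclbin' vadd.hw.xo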

Managing Vivado Synthesis and Implementation Results

TIP: This topic requires an understanding of the Vivado Design Suite tools and design methodology as described in UltraFast Design Methodology Guide for Xilinx FPGAs and SoCs (UG949).

In most cases, the Vitis environment completely abstracts away the underlying process of synthesis and implementation of the programmable logic region, as the CUs are linked with the hardware platform and the FPGA binary (xclbin) is generated. This removes the application developer from the typical hardware development process, and the need to manage constraints such as logic placement and routing delays. The Vitis tool automates much of the FPGA implementation process.

However, in some cases you might want to exercise some control over the synthesis and implementation processes deployed by the Vitis compiler, especially when large designs are being implemented. Towards this end, the Vitis tool offers some control through specific options that can be specified in a v++ configuration file, or from the command line. The following are some of the ways you can interact with and control the Vivado synthesis and implementation results.

  • Using the --vivado options to manage the Vivado tool.
  • Using multiple implementation strategies to achieve timing closure on challenging designs.
  • Using the --to_step and --from_step options to run the compilation or linking process to a specific step, perform some manual intervention on the design, and resume from that step.
  • Interactively editing the Vivado project, and using the results for generating the FPGA binary.

Using the -vivado and -advanced Options

Using the --vivado option, as described in --vivado Options, and the --advanced option as described in --advanced Options, you can perform a number of interventions on the standard Vivado synthesis or implementation.

  1. Pass Tcl scripts with custom design constraints or scripted operations.

    You can create Tcl scripts to assign XDC design constraints to objects in the design, and pass these Tcl scripts to the Vivado tools using the PRE and POST Tcl script properties of the synthesis and implementation steps. For more information on Tcl scripting, refer to the Vivado Design Suite User Guide: Using Tcl Scripting (UG894). While there is only one synthesis step, there are a number of implementation steps as described in the Vivado Design Suite User Guide: Implementation (UG904). You can assign Tcl scripts for the Vivado tool to run before the step (PRE), or after the step (POST). The specific steps you can assign Tcl scripts to include the following: SYNTH_DESIGN, INIT_DESIGN, OPT_DESIGN, PLACE_DESIGN, ROUTE_DESIGN, WRITE_BITSTREAM.

    TIP: There are also some optional steps that can be enabled using the --vivado.prop run.impl_1.steps.phys_opt_design.is_enabled=1 option. When enabled, these steps can also have Tcl PRE and POST scripts.

    An example of the Tcl PRE and POST script assignments follow:

    --vivado.prop run.impl_1.STEPS.PLACE_DESIGN.TCL.PRE=/…/xxx.tcl

    In the preceding example a script has been assigned to run before the PLACE_DESIGN step. The command line is broken down as follows:

    • --vivado is the v++ command-line option to specify directives for the Vivado tools.
    • prop keyword to indicate you are passing a property setting.
    • run. keyword to indicate that you are passing a run property.
    • impl_1. indicates the name of the run.
    • STEPS.PLACE_DESIGN.TCL.PRE indicates the run property you are specifying.
    • /…/xxx.tcl indicates the property value, in this case the path to the Tcl script.
    TIP: Both the --advanced and --vivado options can be specified on the v++ command line, or in a configuration file specified by the --config option. The example above shows the command line use, and the following example shows the config file usage. Refer to Vitis Compiler Configuration File for more information.
  2. Setting properties on run, file, and fileset design objects.
    This is very similar to passing Tcl scripts as described above, but in this case you are passing values to different properties on multiple design objects. For example, to use a specific implementation strategy such as Performance_Explore and disable global buffer insertion during placement, you can define the properties as shown below:
    [vivado]
    prop=run.impl_1.STEPS.OPT_DESIGN.ARGS.DIRECTIVE=Explore
    prop=run.impl_1.STEPS.PLACE_DESIGN.ARGS.DIRECTIVE=Explore
    prop=run.impl_1.{STEPS.PLACE_DESIGN.ARGS.MORE OPTIONS}={-no_bufg_opt}
    prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.IS_ENABLED=true
    prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.ARGS.DIRECTIVE=Explore
    prop=run.impl_1.STEPS.ROUTE_DESIGN.ARGS.DIRECTIVE=Explore

    In the example above, the Explore value is assigned to the STEPS.XXX.DIRECTIVE property of various steps of the implementation run. Note the syntax for defining these properties is:

    <object>.<instance>.<property>=<value>

    Where:

    • <object> can be a design run, a file, or a fileset object.
    • <instance> indicates a specific instance of the object.
    • <property> specifies the property to assign.
    • <value> defines the value of the property.

    In this example, the object is a run, the instance is the default implementation run, impl_1, and the property is an argument of the different steps; in this case, the DIRECTIVE, IS_ENABLED, and {MORE OPTIONS} arguments. Refer to --vivado Options for more information on the command syntax.

  3. Enabling optional steps in the Vivado implementation process.

    The build process runs Vivado synthesis and implementation to generate the device binary. Some of the implementation steps are enabled and run as part of the default build process, and others can be optionally enabled at your discretion.

    Optional steps can be listed using the --list_steps command, and include: vpl.impl.power_opt_design, vpl.impl.post_place_power_opt_design, vpl.impl.phys_opt_design, and vpl.impl.post_route_phys_opt_design.

    An optional step can be enabled using the --vivado.prop option. For example, to enable PHYS_OPT_DESIGN step, use the following config file content:

    [vivado]
    prop=run.impl_1.steps.phys_opt_design.is_enabled=1
    

    When an optional step is enabled as shown above, the step can be specified as part of the --from_step/--to_step options as described below in Running --to_step or --from_step, or a Tcl script can be enabled to run before or after the step as described in --linkhook Options.

  4. Passing parameters to the tool to control processing.
    The --vivado option also allows you to pass parameters to the Vivado tools. The parameters are used to configure the tool features or behavior prior to launching the tool. The syntax for specifying a parameter uses the following form:
    --vivado.param <object>.<parameter>=<value>

    The keyword param indicates that you are passing a parameter for the Vivado tools, rather than a property for a design object. You must also define the <object> it applies to, the <parameter> that you are specifying, and the <value> to assign it.

    In the following example, project indicates the current Vivado project, writeIntermediateCheckpoints is the parameter being passed, and the value 1 enables this Boolean parameter.

    --vivado.param project.writeIntermediateCheckpoints=1
  5. Managing the reports generated during synthesis and implementation.
    IMPORTANT: You must also specify --save-temps on the v++ command line when customizing the reports generated by the Vivado tool to preserve the temporary files created during synthesis and implementation, including any generated reports.

    You might also want to generate or save more than the standard reports provided by the Vivado tools when run as part of the Vitis tools build process. You can customize the reports generated using the --advanced.misc option as follows:

    [advanced]
    misc=report=type report_utilization name synth_report_utilization_summary steps {synth_design} runs {__KERNEL__} options {}
    misc=report=type report_timing_summary name impl_report_timing_summary_init_design_summary steps {init_design} runs {impl_1} options {-max_paths 10} 
    misc=report=type report_utilization name impl_report_utilization_init_design_summary steps {init_design} runs {impl_1} options {} 
    misc=report=type report_control_sets name impl_report_control_sets_place_design_summary steps {place_design} runs {impl_1} options {-verbose} 
    misc=report=type report_utilization name impl_report_utilization_place_design_summary steps {place_design} runs {impl_1} options {} 
    misc=report=type report_io name impl_report_io_place_design_summary steps {place_design} runs {impl_1} options {} 
    misc=report=type report_bus_skew name impl_report_bus_skew_route_design_summary steps {route_design} runs {impl_1} options {-warn_on_violation} 
    misc=report=type report_clock_utilization name impl_report_clock_utilization_route_design_summary steps {route_design} runs {impl_1} options {} 
    

    The syntax of the command line is explained using the following example:

    misc=report=type report_bus_skew name impl_report_bus_skew_route_design_summary steps {route_design} runs {impl_1} options {-warn_on_violation} 
    
    misc=report=
    Specifies the --advanced.misc option as described in --advanced Options, and defines the report configuration for the Vivado tool. The rest of the command line is specified in name/value pairs, reflecting the options of the create_report_config Tcl command as described in Vivado Design Suite Tcl Command Reference Guide (UG835).
    type report_bus_skew
    Relates to the -report_type argument, and specifies the type of the report as the report_bus_skew. Most of the report_* Tcl commands can be specified as the report type.
    name impl_report_bus_skew_route_design_summary
    Relates to the -report_name argument, and specifies the name of the report. Note this is not the file name of the report, and generally this option can be skipped as the report names will be auto-generated by the tool.
    steps {route_design}
    Relates to the -steps option, and specifies the synthesis and implementation steps that the report applies to. The report can be specified for use with multiple steps to have the report regenerated at each step, in which case the name of the report will be automatically defined.
    runs {impl_1}
    Relates to the -runs option, and specifies the name of the design runs to apply the report to.
    options {-warn_on_violation}
    Specifies various options of the report_* Tcl command to be used when generating the report. In this example, the -warn_on_violation option is a feature of the report_bus_skew command.
    IMPORTANT: There is no error checking to ensure the specified options are correct and applicable to the report type specified. If you indicate options that are incorrect the report will return an error when it is run.

Running Multiple Implementation Strategies for Timing Closure

For challenging designs, it can take multiple iterations of Vivado implementation using multiple different strategies to achieve timing closure. This topic shows you how to launch multiple implementation strategies at the same time in the hardware build (-t hw), and how to identify and use successful runs to generate the device binary and complete the build.

As explained in --vivado Options, the --vivado.impl.strategies option enables you to specify multiple strategies to run in a single build pass. The command line would look as follows:

v++ --link -s -g -t hw --platform xilinx_zcu102_base_202010_1 -I . \
--vivado.impl.strategies "Performance_Explore,Area_Explore" -o kernel.xclbin hello.xo

In the example above, the Performance_Explore and Area_Explore strategies are run simultaneously in the Vivado build to see which returns the best results. You can specify ALL to have all available strategies run within the tool.

You can also specify this option in a configuration file in the following form:

#Vivado Implementation Strategies
[vivado]
impl.strategies=Performance_Explore,Area_Explore

The Vitis compiler automatically picks the results of the first completed run that meets timing to proceed with the build process and generate the device binary. However, you can also direct the tool to wait for all runs to complete and pick the best results from the completed runs before proceeding. This requires the compiler.multiStrategiesWaitOnAllRuns parameter, specified with the --advanced.param option as shown:

[advanced]
param=compiler.multiStrategiesWaitOnAllRuns=1

compiler.multiStrategiesWaitOnAllRuns=0 is the default behavior. Setting the parameter to 1 directs v++ to wait for all runs to complete, and to collect their report files, before proceeding.

As discussed in Link Summary: Multiple Strategies and Timing Reports, Vitis analyzer displays the implementation results for all strategies that have been allowed to run to completion. This includes an overview of the implementation results, as well as a Timing Summary report. You can use this feature to review the different strategies and results.

You can also manually review the results of all implementation strategies after they have completed. Then, use the results of any of the implementation runs by using the --reuse_impl option as described in Using -to_step and Launching Vivado Interactively.

Using -to_step and Launching Vivado Interactively

The Vitis compiler lets you stop the build process after completing a specified step (--to_step), manually intervene in the design or files in some way, and then continue the build by specifying a step the build should resume from (--from_step). The --from_step option directs the Vitis compiler to resume compilation from the step where --to_step left off, or from some earlier step in the process. The --to_step and --from_step options are described in Vitis Compiler General Options.

IMPORTANT: The --to_step and --from_step options are sequential build options that require you to use the same project directory when launching v++ --link --from_step as you specified when using v++ --link --to_step.

The Vitis compiler also provides a --list_steps option to list the available steps for the compilation or linking processes of a specific build target. For example, the list of steps for the link process of the hardware build can be found by:

v++ --list_steps --target hw --link

This command returns a number of steps, both default steps and optional steps that the Vitis compiler goes through during the linking process of the hardware build. Some of the default steps include: system_link, vpl, vpl.create_project, vpl.create_bd, vpl.generate_target, vpl.synth, vpl.impl.opt_design, vpl.impl.place_design, vpl.impl.route_design, and vpl.impl.write_bitstream.

Optional steps include: vpl.impl.power_opt_design, vpl.impl.post_place_power_opt_design, vpl.impl.phys_opt_design, and vpl.impl.post_route_phys_opt_design.
TIP: An optional step must be enabled before specifying it with --from_step or --to_step as previously described in Using the -vivado and -advanced Options.
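
As a sketch, after a build has been stopped with --to_step (as shown in the next topic), it can be resumed from the same project directory with a command like the following, where the step name, platform, and input files are placeholders:
v++ --target hw --link --from_step vpl.synth --save-temps --platform <PLATFORM_NAME> <XO_FILES>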

Launching the Vivado IDE for Interactive Design

For example, with the --to_step option, you can run the build process through Vivado synthesis and then start the Vivado IDE on the project to manually place and route the design. To do this, use the following command syntax:

v++ --target hw --link --to_step vpl.synth --save-temps --platform <PLATFORM_NAME> <XO_FILES>
TIP: As shown in the example above, you must also specify --save-temps when using --to_step to preserve any temporary files created by the build process.

This command specifies the link process of the hardware build, runs the build through the synthesis step, and saves the temporary files produced by the build process.

You can launch the Vivado tool directly on the project built by the Vitis compiler using the --interactive command. This opens the Vivado project found at <temp_dir>/link/vivado/vpl/prj in your build directory, letting you interactively edit the design:

v++ --target hw --link --interactive --save-temps --platform <PLATFORM_NAME> <XO_FILES>

When invoking the Vivado IDE in this mode, you can open the synthesis or implementation runs to manage and modify the project. You can change the run details as needed to close timing and try different approaches to implementation. You can save the results to a design checkpoint (DCP), or generate the project bitstream (.bit) to use in the Vitis environment to generate the device binary.

After saving the DCP from within the Vivado IDE, close the tool and return to the Vitis environment. Use the --reuse_impl option to apply a previously implemented DCP file in the v++ command line to generate the xclbin.

IMPORTANT: The --reuse_impl option is an incremental build option that requires you to apply the same project directory when resuming the Vitis compiler with --reuse_impl that you specified when using --to_step to start the build.

The following command completes the linking process by using the specified DCP file from the Vivado tool to create the project.xclbin from the input files.

v++ --link --platform <PLATFORM_NAME> -o'project.xclbin' project.xo --reuse_impl ./_x/link/vivado/routed.dcp
You can also use a bitstream file generated by the Vivado tool to create the project.xclbin:
v++ --link --platform <PLATFORM_NAME> -o'project.xclbin' project.xo --reuse_bit ./_x/link/vivado/project.bit

Additional Vivado Options

Some additional switches that can be used in the v++ command line or config file include the following:

  • --export_script/--custom_script: Edit and use Tcl scripts to modify the compilation or linking process.
  • --remote_ip_cache: Specify a remote IP cache directory for Vivado synthesis, as shown in the example below.
  • --no_ip_cache: Turn off the IP cache for Vivado synthesis. This causes all IP to be re-synthesized as part of the build process, scrubbing out cached data.
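
For example, pointing builds at a shared IP cache directory (the path below is a placeholder) avoids re-synthesizing unchanged IP in subsequent builds:
v++ --link -t hw --platform <PLATFORM_NAME> --remote_ip_cache /path/to/ip_cache \
-o'project.xclbin' project.xo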

Controlling Report Generation

The v++ -R option (or --report_level) controls the level of information to report during compilation or linking for hardware emulation and system targets. Builds that generate fewer reports will typically run more quickly.

The command line option is as follows:

$ v++ -R <report_level>

Where <report_level> is one of the following options:

  • -R0: Minimal reports and no intermediate design checkpoints (DCP).
  • -R1: Includes R0 reports plus:
    • Identifies design characteristics to review for each kernel (report_failfast).
    • Identifies design characteristics to review for the full post-optimization design.
    • Saves post-optimization design checkpoint (DCP) file for later examination or use in the Vivado Design Suite.
    TIP: report_failfast is a utility that highlights potential device usage challenges, clock constraint problems, and potentially unreachable target frequencies (MHz).
  • -R2: Includes R1 reports plus:
    • Includes all standard reports from the Vivado tools, including saved DCPs after each implementation step.
    • Design characteristics to review for each SLR after placement.
  • -Restimate: Forces Vitis HLS to generate a System Estimate report, as described in System Estimate Report.
    TIP: This option is useful for the software emulation build (-t sw_emu).
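
For example, the following hardware link keeps the fuller set of reports and design checkpoints described for -R2; the platform and input file names are placeholders:
v++ --link -t hw -R2 --platform <PLATFORM_NAME> <XO_FILES> -o'project.xclbin'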