Introduction to Vitis HLS
Vitis™ HLS is a high-level synthesis tool that allows C, C++, and OpenCL™ functions to become hardwired onto the device logic fabric and RAM/DSP blocks. Vitis HLS implements hardware kernels in the Vitis application acceleration development flow and uses C/C++ code for developing RTL IP for Xilinx® device designs in the Vivado® Design Suite.
In the Vitis application acceleration flow, the Vitis HLS tool automates many of the code modifications required to implement and optimize the C/C++ code in programmable logic and to achieve low latency and high throughput. The foundation of Vitis HLS in the application acceleration flow is the inference of the pragmas required to produce the right interface for your function arguments and to pipeline loops and functions within your code. Vitis HLS also supports customization of your code to implement different interface standards or specific optimizations to achieve your design objectives.
Following is the Vitis HLS design flow:
- Compile, simulate, and debug the C/C++ algorithm.
- View reports to analyze and optimize the design.
- Synthesize the C algorithm into an RTL design.
- Verify the RTL implementation using RTL co-simulation.
- Package the RTL implementation into a compiled object file (.xo), or export it as an RTL IP.
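The first step of this flow relies on a C test bench that checks the function against expected results. The following is a minimal sketch, not a prescribed template: `add_offset` and `run_test` are hypothetical names invented for illustration. In an actual Vitis HLS project the check would sit in `main()`, which returns 0 to signal a pass to C simulation and RTL co-simulation.

```c
/* Hypothetical function to be synthesized (illustrative only). */
int add_offset(int value, int offset) {
    return value + offset;
}

/* Self-checking test: returns 0 on success, or the number of mismatches.
   In an HLS test bench this value would be returned from main(). */
int run_test(void) {
    /* Hand-written golden results for inputs -2..2 with offset 10. */
    const int golden[5] = {8, 9, 10, 11, 12};
    int errors = 0;

    for (int i = 0; i < 5; i++) {
        int value = i - 2;
        if (add_offset(value, 10) != golden[i]) {
            errors++;
        }
    }
    return errors;
}
```

Because the same test bench is reused for RTL co-simulation, keeping it self-checking avoids comparing outputs by hand after each synthesis run.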
Basics of High-Level Synthesis
The Xilinx Vitis HLS tool synthesizes a C or C++ function into RTL code for acceleration in programmable logic. Vitis HLS is tightly integrated with the Vitis core development kit and the application acceleration design flow.
Some benefits of using a high-level synthesis (HLS) design methodology include:
- Developing and validating algorithms at the C level, so that you design at a level of abstraction above the hardware implementation details.
- Using C-simulation to validate the design, and iterate more quickly than with traditional RTL design.
- Controlling the C-synthesis process using optimization pragmas to create high-performance implementations.
- Creating multiple design solutions from the C source code and pragmas to explore the design space, and find an optimal solution.
- Quickly recompiling the C source to target different platforms and hardware devices.
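As a minimal sketch of pragma-driven optimization, the following function applies the PIPELINE pragma to a loop. The names `vec_scale` and `N` are hypothetical and chosen only for illustration. A standard C compiler ignores the pragma (possibly with a warning), so the same source still runs in C simulation.

```c
#define N 8 /* hypothetical array length, for illustration only */

/* Hypothetical top-level function: multiplies each input element by gain. */
void vec_scale(const int in[N], int gain, int out[N]) {
    for (int i = 0; i < N; i++) {
/* Requests that the loop start a new iteration every clock cycle.
   The tool reports the initiation interval it actually achieves. */
#pragma HLS PIPELINE II=1
        out[i] = in[i] * gain;
    }
}
```

Removing or changing the pragma and re-running synthesis is one way to create the alternative design solutions mentioned in the list above.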
HLS includes the following stages:
- Scheduling determines which operations occur during each clock cycle based on:
  - When an operation’s dependencies have been satisfied or are available.
  - The length of the clock cycle or clock frequency.
  - The time it takes for the operation to complete, as defined by the target device.
  - The available resource allocation.
  - Incorporation of any user-specified optimization directives.
  TIP: More operations can be completed in a single clock cycle for longer clock periods, or if a faster device is targeted, and all operations might complete in one clock cycle. However, for shorter clock periods, or when slower devices are targeted, HLS automatically schedules operations over more clock cycles. Some operations might need to be implemented as multi-cycle resources.
- Binding assigns hardware resources to implement each scheduled operation, and maps operators (such as addition, multiplication, and shift) to specific RTL implementations. For example, a mult operation can be implemented in RTL as a combinational or pipelined multiplier.
- Control logic extraction creates a finite state machine (FSM) that sequences the operations in the RTL design according to the defined schedule.
Scheduling and Binding Example
The following figure shows an example of the scheduling and binding phases for this code example:
int foo(char x, char a, char b, char c) {
char y;
y = x*a+b+c;
return y;
}
In the scheduling phase of this example, high-level synthesis schedules the following operations to occur during each clock cycle:
- First clock cycle: Multiplication and the first addition
- Second clock cycle: Second addition, if the result of the first addition is available in the second clock cycle, and output generation
The first cycle reads the x, a, and b data ports. The second cycle reads data port c and generates output y.
In the final hardware implementation, high-level synthesis implements the arguments to the top-level function as input and output (I/O) ports. In this example, the arguments are simple data ports. Because each input variable is a char type, the input data ports are all 8 bits wide. The function return is a 32-bit int data type, and the output data port is 32 bits wide.
In the initial binding phase of this example, high-level synthesis implements the multiplier operation using a combinational multiplier (Mul) and implements both add operations using a combinational adder/subtractor (AddSub).
In the target binding phase, high-level synthesis implements both the multiplier and one of the addition operations using a DSP module resource. Some applications use many binary multipliers and accumulators that are best implemented in dedicated DSP resources. The DSP module is a computational block available in the FPGA architecture that balances high performance with efficient implementation.
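The scheduling decision described in this example can be illustrated with a toy model. The sketch below is not how Vitis HLS is implemented: `op_t`, `schedule_asap`, and `schedule_foo` are hypothetical names, and the 5 ns and 3 ns operator delays are assumed values rather than real device data. It places each operation as early as possible and chains operations within a clock cycle as long as they fit in the clock period. It also assumes every operation fits within one clock period, so multi-cycle operations are not modeled.

```c
/* One operation in a toy dataflow graph (hypothetical structure). */
typedef struct {
    const char *name;
    double delay_ns;   /* assumed combinational delay, not device data */
    int dep[2];        /* indices of producer operations, -1 if unused */
    int cycle;         /* result: clock cycle the operation lands in */
    double finish_ns;  /* result: finish time within that cycle */
} op_t;

/* As-soon-as-possible scheduling with chaining. Operations must be
   listed so that producers come before their consumers. */
void schedule_asap(op_t ops[], int n, double clock_ns) {
    for (int i = 0; i < n; i++) {
        int cycle = 0;
        double start = 0.0;
        for (int d = 0; d < 2; d++) {
            int p = ops[i].dep[d];
            if (p < 0) continue;
            if (ops[p].cycle > cycle) {
                /* A later producer dominates: start after it finishes. */
                cycle = ops[p].cycle;
                start = ops[p].finish_ns;
            } else if (ops[p].cycle == cycle && ops[p].finish_ns > start) {
                start = ops[p].finish_ns;
            }
        }
        /* If the operation does not fit in the remaining time, register
           its inputs and move it to the start of the next cycle. */
        if (start + ops[i].delay_ns > clock_ns) {
            cycle++;
            start = 0.0;
        }
        ops[i].cycle = cycle;
        ops[i].finish_ns = start + ops[i].delay_ns;
    }
}

/* Builds the graph for y = x*a + b + c and returns the cycle in which
   the second addition is scheduled. */
int schedule_foo(double clock_ns, op_t ops[3]) {
    ops[0] = (op_t){"mul",  5.0, {-1, -1}, 0, 0.0};
    ops[1] = (op_t){"add1", 3.0, { 0, -1}, 0, 0.0};
    ops[2] = (op_t){"add2", 3.0, { 1, -1}, 0, 0.0};
    schedule_asap(ops, 3, clock_ns);
    return ops[2].cycle;
}
```

With an assumed 10 ns clock period, the multiplication and first addition share cycle 0 and the second addition moves to cycle 1, which mirrors the schedule described for this example. With a 20 ns period, all three operations fit in cycle 0.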
Extracting Control Logic and Implementing I/O Ports Example
The following figure shows the extraction of control logic and implementation of I/O ports for this code example:
void foo(int in[3], char a, char b, char c, int out[3]) {
int x,y;
for(int i = 0; i < 3; i++) {
x = in[i];
y = a*x + b + c;
out[i] = y;
}
}
This code example performs the same operations as the previous example. However, it performs the operations inside a for-loop, and two of the function arguments are arrays. The resulting design executes the logic inside the for-loop three times when the code is scheduled. High-level synthesis automatically extracts the control logic from the C code and creates an FSM in the RTL design to sequence these operations. Top-level function arguments become ports in the final RTL design. Each scalar argument of type char maps to a standard 8-bit data bus port. Array arguments, such as in and out, contain an entire collection of data.
In high-level synthesis, arrays are synthesized into block RAM by default, but other options are possible, such as FIFOs, distributed RAM, and individual registers. When using arrays as arguments in the top-level function, high-level synthesis assumes that the block RAM is outside the top-level function and automatically creates ports to access a block RAM outside the design, such as data ports, address ports, and any required chip-enable or write-enable signals.
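As a sketch of how one of the alternative array implementations could be requested, the following variant of the example asks for FIFO ports on both arrays using the INTERFACE pragma. The function name `foo_fifo` is hypothetical. The exact pragma syntax differs between tool releases (some releases use the form `mode=ap_fifo`), so treat the spelling below as an assumption and check the pragma reference for your version. A FIFO interface only suits arrays that are accessed in sequential order, as they are here.

```c
/* Hypothetical variant of foo that requests FIFO ports for the arrays. */
void foo_fifo(int in[3], char a, char b, char c, int out[3]) {
/* Assumed syntax; newer releases may expect mode=ap_fifo instead. */
#pragma HLS INTERFACE ap_fifo port=in
#pragma HLS INTERFACE ap_fifo port=out
    for (int i = 0; i < 3; i++) {
        /* in is read and out is written strictly in index order. */
        out[i] = a * in[i] + b + c;
    }
}
```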
The FSM controls when the registers store data and controls the state of any I/O control signals. The FSM starts in the state C0. On the next clock, it enters state C1, then state C2, and then state C3. It passes through states C1, C2, and C3 a total of three times before returning to state C0.
The resulting state sequence is C0, {C1, C2, C3}, {C1, C2, C3}, {C1, C2, C3}, followed by a return to C0.
The design requires the addition of b and c only one time. High-level synthesis moves the operation outside the for-loop and into state C0. Each time the design enters state C3, it reuses the result of the addition.
The design reads the data from in and stores the data in x. The FSM generates the address for the first element in state C1. In addition, in state C1, an adder increments to keep track of how many times the design must iterate around states C1, C2, and C3. In state C2, the block RAM returns the data for in and stores it as variable x.
High-level synthesis reads the data from port a with other values to perform the calculation and generates the first y output. The FSM ensures that the correct address and control signals are generated to store this value outside the block. The design then returns to state C1 to read the next value from the array/block RAM in. This process continues until all outputs are written. The design then returns to state C0 to read the next values of b and c to start the process again.
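The state sequence described above can be traced with a small software model. This is a sketch for building intuition, not the RTL that high-level synthesis generates: `foo_fsm`, `state_t`, and the internal variable names are hypothetical. The model treats each state visit as one clock cycle and returns the number of cycles spent on one transaction.

```c
typedef enum { C0, C1, C2, C3 } state_t;

/* Software model of the FSM for the for-loop example. Returns the number
   of state visits (modeled clock cycles) for one transaction. */
int foo_fsm(const int in[3], char a, char b, char c, int out[3]) {
    state_t state = C0;
    int cycles = 0;
    int count = 0;   /* iteration counter, incremented in C1 */
    int addr = 0;    /* address presented to the in and out memories */
    int sum_bc = 0;  /* b + c, computed once in C0 and reused in C3 */
    int x = 0;       /* register holding the value read from in */
    int done = 0;

    while (!done) {
        cycles++;
        switch (state) {
        case C0:  /* hoisted addition of b and c */
            sum_bc = b + c;
            state = C1;
            break;
        case C1:  /* generate the read address and bump the counter */
            addr = count;
            count++;
            state = C2;
            break;
        case C2:  /* memory returns in[addr], stored in x */
            x = in[addr];
            state = C3;
            break;
        case C3:  /* compute y and write it to out[addr] */
            out[addr] = a * x + sum_bc;
            if (count < 3) {
                state = C1;
            } else {
                done = 1;  /* next state would be C0 again */
            }
            break;
        }
    }
    return cycles;
}
```

In this model one transaction visits ten states: C0 once, then C1, C2, and C3 three times each.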
Performance Metrics Example
The following figure shows the complete cycle-by-cycle execution for the code in the previous example, including the states for each clock cycle, read operations, computation operations, and write operations.
The following are performance metrics for this example:
- Latency: It takes the function 9 clock cycles to output all values.
  Note: When the output is an array, the latency is measured to the last array value output.
- Initiation Interval (II): The II is 10, which means it takes 10 clock cycles before the function can initiate a new set of input reads and start to process the next set of input data.
  Note: The time to perform one complete execution of a function is referred to as one transaction. In this example, it takes 11 clock cycles before the function can accept data for the next transaction.
- Loop iteration latency: The latency of each loop iteration is 3 clock cycles.
- Loop II: The interval is 3.
- Loop latency: The latency is 9 clock cycles.
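The three loop metrics above are related by simple arithmetic. The helper below is a sketch using a commonly cited relation, with the hypothetical name `loop_latency`: the first iteration takes the full iteration latency, and each further iteration adds one loop II. Actual tool reports can differ by a cycle or so for loop entry and exit, so treat the result as an estimate.

```c
/* Estimated latency, in clock cycles, of a loop that starts a new
   iteration every loop_ii cycles. Returns 0 for an empty loop. */
int loop_latency(int trip_count, int iteration_latency, int loop_ii) {
    if (trip_count <= 0) {
        return 0;
    }
    return (trip_count - 1) * loop_ii + iteration_latency;
}
```

For this example, `loop_latency(3, 3, 3)` gives 9, matching the loop latency listed above. If the loop II could be reduced to 1 with the same iteration latency, the same relation would give 5 clock cycles.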
Tutorials and Examples
To help you get started quickly with Vitis HLS, you can find tutorials and example applications at the following locations:
- Vitis HLS Tiny Tutorials (https://github.com/Xilinx/HLS-Tiny-Tutorials/tree/master)
- Hosts many small code examples to demonstrate good design practices, coding guidelines, design patterns for common applications, and most importantly, optimization techniques to maximize application performance. All examples include a README.md file and a run_hls.tcl script to help you use the example code.
- Vitis Accel Examples Repository (https://github.com/Xilinx/Vitis_Accel_Examples)
- Contains examples to showcase various features of the Vitis tools and platforms. This repository illustrates specific scenarios related to host code and kernel programming for the Vitis application acceleration development flow, by providing small working examples. The kernel code in these examples can be directly compiled in Vitis HLS.
- Vitis Application Acceleration Development Flow Tutorials (https://github.com/Xilinx/Vitis-Tutorials)
- Provides tutorials that teach specific concepts regarding the tool flow and application development, including the use of Vitis HLS as a standalone application and in the Vitis bottom-up design flow.