Streaming Data Transfers
Streaming Data Between the Host and Kernel (H2K)
The Vitis core development kit provides a programming model that supports the direct streaming of data from host-to-kernel and kernel-to-host, without the need to migrate data through global memory as an intermediate step. This programming model uses minimal storage compared to the larger and slower global memory bank, and thus significantly improves both performance and power.
Using streaming data transfers brings the following advantages:
- The host application does not need to know the size of the data coming from the kernel.
- Data residing in host memory can be transferred to the kernel as soon as it is needed.
- Processed data can be transferred from the kernel back to the host program when it is required.
Host-to-kernel and kernel-to-host streaming are only supported in PCIe-based platforms, such as the Alveo Data Center accelerator cards. This feature is also only available on specific target platforms, such as the QDMA platform for the Alveo Data Center accelerator cards. However, kernel-to-kernel streaming data transfer is supported for both PCIe-based and embedded platforms. If your platform is not configured to support streaming, your application will not run.
Host Coding Guidelines
Xilinx provides new OpenCL™ APIs for streaming operation as extension APIs.
clCreateStream()
- Creates a read or write stream.
clReleaseStream()
- Frees the created stream and its associated memory.
clWriteStream()
- Writes data to stream.
clReadStream()
- Gets data from stream.
clPollStreams()
- Polls for any stream on the device to finish. Required only for non-blocking stream operation.
The typical API flow is described below:
- Create the required number of read/write streams with clCreateStream.
  - Streams should be attached directly to the OpenCL device object because they do not use any command queue. A stream itself is a command queue that only passes data in one direction: either the kernel reading data from the host, or the kernel writing data to the host.
  - An appropriate flag should be used to denote the stream as XCL_STREAM_WRITE_ONLY or XCL_STREAM_READ_ONLY, where read and write are from the perspective of the kernel code.
  - To specify how the stream is connected to the device, a Xilinx extension pointer object (cl_mem_ext_ptr_t) is used to identify the kernel, and the kernel argument the stream is associated with.
  IMPORTANT: If the streaming kernel has multiple compute units, the host code needs to use a unique cl_kernel object for each compute unit. The host code must use clCreateKernel with <kernel_name>:{compute_unit_name} to get each compute unit, create streams for it, and enqueue it individually.
  In the following code example, a host-to-kernel stream (h2k_stream) and a kernel-to-host stream (k2h_stream) are created and associated with a cl_kernel object and the specified kernel arguments.
  #include <CL/cl_ext_xilinx.h> // Required for Xilinx extension pointer
  cl_int ret;                   // Receives the error code of each call
  // Device connection specification of the stream through extension pointer
  cl_mem_ext_ptr_t ext;  // Extension pointer
  ext.param = kernel;    // The .param should be set to kernel (cl_kernel type)
  ext.obj = nullptr;
  // The .flags should be used to denote the kernel argument
  // Create the host-to-kernel stream for argument 3 of kernel
  // (READ_ONLY from the kernel's perspective: the kernel reads what the host writes)
  ext.flags = 3;
  cl_stream h2k_stream = clCreateStream(device_id, XCL_STREAM_READ_ONLY, CL_STREAM, &ext, &ret);
  // Create the kernel-to-host stream for argument 4 of kernel
  // (WRITE_ONLY from the kernel's perspective: the kernel writes what the host reads)
  ext.flags = 4;
  cl_stream k2h_stream = clCreateStream(device_id, XCL_STREAM_WRITE_ONLY, CL_STREAM, &ext, &ret);
- Set the remaining non-streaming kernel arguments and enqueue the kernel. The following code block shows setting typical kernel arguments (non-stream arguments, such as buffers and/or scalars) and enqueuing the kernel:
  // Set kernel non-stream argument (if any)
  clSetKernelArg(kernel, 0,...,...);
  clSetKernelArg(kernel, 1,...,...);
  clSetKernelArg(kernel, 2,...,...);
  // Argument 3 and 4 are not set as those are already specified during
  // the clCreateStream through the extension pointer
  // Schedule kernel enqueue
  clEnqueueTask(commands, kernel, ...);
- Initiate read and write transfers with the clReadStream and clWriteStream commands.
  - Note the use of the cl_stream_xfer_req attribute structure associated with each read and write request.
  - The .flags field is used to denote the transfer mechanism.
    - CL_STREAM_EOT: Currently, a successful stream transfer depends on identifying the end of the transfer by an End of Transfer signal. This flag is mandatory in the current release.
    - CL_STREAM_NONBLOCKING: By default, read and write transfers are blocking. For a non-blocking transfer, CL_STREAM_NONBLOCKING has to be set.
  - The .priv_data field is used to specify a string (as a name for tagging purposes) associated with the transfer. This helps identify the completion of a specific transfer when polling for stream completion. It is required when using the non-blocking version of the API.
  In the following code block, the stream read and write transfers are executed with the non-blocking approach.
  // Initiate the READ transfer
  cl_stream_xfer_req rd_req {0};
  rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
  rd_req.priv_data = (void*)"read"; // You can think of this as tagging the
                                    // transfer with a name
  clReadStream(k2h_stream, host_read_ptr, max_read_size, &rd_req, &ret);
  // Initiate the WRITE transfer
  cl_stream_xfer_req wr_req {0};
  wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
  wr_req.priv_data = (void*)"write";
  clWriteStream(h2k_stream, host_write_ptr, write_size, &wr_req, &ret);
- Poll all the streams for completion. For non-blocking transfers, a polling API is provided to ensure the read/write transfers are completed. For the blocking version of the API, polling is not required.
  - The polling results are stored in the cl_streams_poll_req_completions array, which can be used to verify and check the results of the stream events.
  - clPollStreams is a blocking API. It returns execution to the host code as soon as it receives notification that all stream requests have completed, or when the specified timeout expires.
  // Checking the request completion
  cl_streams_poll_req_completions poll_req[2] {0, 0}; // 2 Requests
  auto num_compl = 2;
  // Blocking API, waits for 2 poll request completions or 5000 ms, whichever occurs first
  clPollStreams(device_id, poll_req, 2, 2, &num_compl, 5000, &ret);
- Read and use the stream data in host.
  - After the poll request completes successfully, the host can read the data from the host pointer.
  - The host can also check the size of the data transferred to the host. For this purpose, the host needs to find the correct poll request by matching priv_data, and then fetch nbytes (the number of bytes transferred) from the cl_streams_poll_req_completions structure.
  for (auto i=0; i<2; ++i) {
    if(rd_req.priv_data == poll_req[i].priv_data) { // Identifying the read transfer
      // Getting read size, data size from kernel is unknown
      ssize_t result_size=poll_req[i].nbytes;
    }
  }
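The completion lookup in the last step can be factored into a small helper. The sketch below is an illustration only: Completion is a hypothetical stand-in that carries just the two fields the text mentions (priv_data and nbytes), so the example compiles without the Xilinx runtime headers, and the function name find_transfer_size is also an assumption. Real host code would work on the cl_streams_poll_req_completions array directly. As in the loop above, tags are matched by pointer identity, not by string contents.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cl_streams_poll_req_completions; it models only
// the two fields used in the text (priv_data and nbytes).
struct Completion {
    void* priv_data;  // tag that was attached to the transfer request
    long  nbytes;     // number of bytes actually transferred
};

// Returns the byte count of the completion whose tag pointer equals `tag`,
// or -1 when no completion carries that tag.
// Tags are compared by pointer identity, not by string contents.
long find_transfer_size(const Completion* completions, std::size_t count,
                        const void* tag) {
    assert(completions != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (completions[i].priv_data == tag) {
            return completions[i].nbytes;
        }
    }
    return -1;
}
```

Returning -1 for a missing tag lets the caller distinguish "transfer not completed within the timeout" from a completed transfer of zero bytes.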
The header file containing function prototype and argument description is available in the Xilinx Runtime GitHub repository.
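When a streaming kernel has multiple compute units, the <kernel_name>:{compute_unit_name} string passed to clCreateKernel can be assembled with a small helper. This is a minimal sketch; the helper name cu_kernel_name is hypothetical and not part of any Xilinx API.

```cpp
#include <cassert>
#include <string>

// Builds the "<kernel_name>:{compute_unit_name}" string described in the text
// for selecting one compute unit with clCreateKernel.
// The helper name is hypothetical; it is not part of any Xilinx API.
std::string cu_kernel_name(const std::string& kernel_name,
                           const std::string& compute_unit_name) {
    assert(!kernel_name.empty() && !compute_unit_name.empty());
    return kernel_name + ":{" + compute_unit_name + "}";
}
```

For a kernel named vadd with a compute unit named vadd_1 (both names invented for this example), the helper yields vadd:{vadd_1}; the c_str() of the result would then be passed as the kernel name argument of clCreateKernel.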
Kernel Coding Guidelines
The basic guidelines to develop stream-based C kernel are as follows:
- Use hls::stream with the qdma_axis<D,0,0,0> data type. The qdma_axis data type needs the header file ap_axi_sdata.h.
- When hls::stream is used to define a parameter data type, the Vitis HLS tool infers an axis streaming interface.
- The qdma_axis<D,0,0,0> is a special class used for data transfer between host and kernel when using the streaming platform. This is only used in the streaming kernel interface interacting with the host, not with another kernel. The template parameter <D> denotes data width. The remaining three parameters should be set to 0 (not to be used in the current release).
- The following code block shows a simple kernel interface with one input stream and one output stream.
  #include "ap_axi_sdata.h"
  #include "hls_stream.h"
  // qdma_axis is the HLS class for stream data transfer between host and kernel for streaming platform
  // It contains "data" and two sideband signals (last and keep) exposed to the user via class member function.
  typedef qdma_axis<64,0,0,0> datap;
  void kernel_top (
      hls::stream<datap> &input,
      hls::stream<datap> &output,
      ..... , // Other Inputs/Outputs if any
  ) {
    ...
  }
  TIP: Because the data type is defined as hls::stream, the Vitis HLS tool infers axis interfaces. The following INTERFACE pragmas are shown as an example, but are not added to the code.
  #pragma HLS INTERFACE axis port=input
  #pragma HLS INTERFACE axis port=output
- The qdma_axis data type contains three variables, which should be used inside the kernel code:
  - data
    - Internally, the qdma_axis data type contains an ap_int<D> that should be accessed by the .get_data() and .set_data() methods.
    - D must be 8, 16, 32, 64, 128, 256, or 512 bits wide.
  - last
    - The last variable is used to indicate the last value of an incoming or outgoing stream. When reading from the input stream, last is used to detect the end of the stream. Similarly, when the kernel writes to an output stream transferred to the host, last must be set to indicate the end of the stream.
    - get_last/set_last: Accesses and sets the last variable used to denote the last data in the stream.
  - keep
    - In some special situations, the keep signal can be used to truncate the last data to a smaller number of bytes. However, keep should not be used on any data other than the last data of the stream. Therefore, in most cases, you should set keep to -1 for all outgoing data from the kernel.
    - get_keep/set_keep: Accesses/sets the keep variable.
    - For all the data before the last data, keep must be set to -1 to denote that all bytes of the data are valid.
    - For the last data, the kernel has the flexibility to send fewer bytes. For example, for a four-byte data transfer, the kernel can truncate the last data by sending one byte, two bytes, or three bytes using the following set_keep() function.
      - If the last data is one byte: .set_keep(1)
      - If the last data is two bytes: .set_keep(3)
      - If the last data is three bytes: .set_keep(7)
      - If the last data is all four bytes (similar to all non-last data): .set_keep(-1)
- The following code block shows how the stream input is read. Note the use of get_last() to determine the last data.
  // Stream Read
  // Using "last" flag to determine the end of input-stream
  // when kernel does not know the length of the input data
  hls::stream<ap_uint<64> > internal_stream;
  while(true) {
    datap temp = input.read(); // "input" -> Input stream
    internal_stream << temp.get_data(); // Getting data from the stream
    if(temp.get_last()) // Getting last signal to determine the EOT (end of transfer).
      break;
  }
- The following code block shows how the stream output is written. set_keep is set to -1 for all data (general case). The kernel also uses set_last() to specify the last data of the stream.
  IMPORTANT: For the proper functionality of the host and kernel system, the last bit must be set on the final data of the stream.
  // Stream Write
  for(int j = 0; j <....; j++) {
    datap t;
    t.set_data(...);
    t.set_keep(-1); // keep flag -1, all bytes are valid
    if(... ) // check if this is the last data to be written
      t.set_last(1); // Setting last data of the stream
    else
      t.set_last(0);
    output.write(t); // output stream from the kernel
  }
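The keep values listed above for a truncated last word follow a simple pattern: for n valid bytes, the lowest n bits of keep are set. The sketch below models that rule, together with the role of last, in plain C++ so it can be run on any host. It does not use qdma_axis or hls::stream; the names Word, keep_mask, pack, and unpack are hypothetical, and the model assumes a four-byte word with byte 0 stored in the low 8 bits and tracked by bit 0 of keep.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain C++ model of one 4-byte stream word; it is NOT the qdma_axis class.
struct Word {
    std::uint32_t data;  // payload, first byte in the low 8 bits
    std::uint8_t  keep;  // one bit per valid byte (4 bits used)
    bool          last;  // true only on the final word of the stream
};

constexpr std::size_t kWordBytes = 4;

// keep mask for `valid_bytes` valid bytes: the lowest `valid_bytes` bits set.
// 1 -> 0x1, 2 -> 0x3, 3 -> 0x7, 4 -> 0xF (what -1 becomes in a 4-bit field).
std::uint8_t keep_mask(std::size_t valid_bytes) {
    assert(valid_bytes >= 1 && valid_bytes <= kWordBytes);
    return static_cast<std::uint8_t>((1u << valid_bytes) - 1u);
}

// Splits a byte buffer into words: every word before the last keeps all
// bytes, the last word keeps only the bytes that remain and sets `last`.
std::vector<Word> pack(const std::vector<std::uint8_t>& bytes) {
    std::vector<Word> words;
    for (std::size_t off = 0; off < bytes.size(); off += kWordBytes) {
        std::size_t n = bytes.size() - off;
        if (n > kWordBytes) n = kWordBytes;
        Word w{0, keep_mask(n), off + kWordBytes >= bytes.size()};
        for (std::size_t b = 0; b < n; ++b) {
            w.data |= static_cast<std::uint32_t>(bytes[off + b]) << (8 * b);
        }
        words.push_back(w);
    }
    return words;
}

// Reads words until `last` is seen, keeping only the bytes marked valid.
std::vector<std::uint8_t> unpack(const std::vector<Word>& words) {
    std::vector<std::uint8_t> bytes;
    for (const Word& w : words) {
        for (std::size_t b = 0; b < kWordBytes; ++b) {
            if (w.keep & (1u << b)) {
                bytes.push_back(static_cast<std::uint8_t>(w.data >> (8 * b)));
            }
        }
        if (w.last) break;
    }
    return bytes;
}
```

Packing a six-byte buffer with this model produces two words: the first with all four keep bits set and last cleared, the second with keep equal to 3 and last set, which mirrors the two-byte case in the list above.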
Streaming Data Transfers Between Kernels (K2K)
The Vitis core development kit also supports streaming data transfer between two kernels. Consider the situation where one kernel is performing some part of the computation, and the second kernel completes the operation after receiving the output data from the first kernel. With kernel-to-kernel streaming support, data can move directly from one kernel to another without having to transmit back through the global memory. This results in a significant performance improvement.
Host Coding Guidelines
The kernel ports involved in kernel-to-kernel streaming do not require setup using clSetKernelArg from the host code. All kernel arguments not involved in the streaming connection should be set up using clSetKernelArg as described in Setting Kernel Arguments. However, kernel ports involved in streaming are defined within the kernel itself, and are not addressed by the host program.
Streaming Kernel Coding Guidelines
In a kernel, the streaming interface directly sending or receiving data to another kernel streaming interface is defined by hls::stream with the ap_axiu<D,0,0,0> data type. The ap_axiu<D,0,0,0> data type requires the use of the ap_axi_sdata.h header file.
Streaming between the host and a kernel uses the qdma_axis data type instead. Both the ap_axiu and qdma_axis data types are defined inside the ap_axi_sdata.h header file that is distributed with the Vitis software platform installation.
The following example shows the streaming interfaces of the producer and consumer kernels.
// Producer kernel - provides output as a data stream
// The example kernel code does not show any other inputs or outputs.
void kernel1 (.... , hls::stream<ap_axiu<32, 0, 0, 0> >& stream_out) {
  for(int i = 0; i < ...; i++) {
    int a = ...... ;          // Internally generated data
    ap_axiu<32, 0, 0, 0> v;   // Temporary storage for ap_axiu
    v.data = a;               // Writing the data
    stream_out.write(v);      // Writing to the output stream.
  }
}
// Consumer kernel - reads data stream as input
// The example kernel code does not show any other inputs or outputs.
void kernel2 (hls::stream<ap_axiu<32, 0, 0, 0> >& stream_in, .... ) {
  for(int i = 0; i < ....; i++) {
    ap_axiu<32, 0, 0, 0> v = stream_in.read(); // Reading the input stream
    int a = v.data;                            // Extract the data
    // Do further processing
  }
}
Because the hls::stream data type is defined, the Vitis HLS tool infers axis interfaces. The following INTERFACE pragmas are shown as an example, but are not added to the code.
#pragma HLS INTERFACE axis port=stream_out
#pragma HLS INTERFACE axis port=stream_in
The streaming connection from kernel1 to kernel2 must be defined during the kernel linking process as described in Specify Streaming Connections between Compute Units.
For more information on mapping streaming connections, refer to Building and Running the Application.
Free-Running Kernel
The Vitis core development kit provides support for one or more free-running kernels. Free-running kernels have no control signal ports, and cannot be started or stopped. The no-control signal feature of the free-running kernel results in the following characteristics:
- The free-running kernel has no memory input or output port, and therefore it interacts with the host or other kernels (other kernels can be regular kernel or another free running kernel) only through streams.
- When the FPGA is programmed by the binary container (xclbin), the free-running kernel starts running on the FPGA, and therefore it does not need the clEnqueueTask command from the host code.
- The kernel works on the stream data as soon as it starts receiving from the host or other kernels, and it stalls when the data is not available.
- The free-running kernel needs a special interface pragma ap_ctrl_none inside the kernel body.
Host Coding for Free-Running Kernels
If the free-running kernel interacts with the host, the host code should manage the stream operation with clCreateStream/clReadStream/clWriteStream as discussed in Host Coding Guidelines of Streaming Data Between the Host and Kernel (H2K). As the free-running kernel has no other types of inputs or outputs, such as memory ports or control ports, there is no need to specify clSetKernelArg. clEnqueueTask is not used because the kernel works on the stream data as soon as it starts receiving from the host or other kernels, and it stalls when the data is not available.
Coding Guidelines for Free-Running Kernels
As mentioned previously, the free-running kernel only contains hls::stream inputs and outputs. The recommended coding guidelines include:
- Use hls::stream<ap_axiu<D,0,0,0> > if the port is interacting with a stream port of another kernel.
- Use hls::stream<qdma_axis<D,0,0,0> > if the port is interacting with the host.
- Using the hls::stream data type for the function parameter causes Vitis HLS to infer an AXI4-Stream port (axis) for the interface.
- The free-running kernel must also specify the following special INTERFACE pragma.
  #pragma HLS interface ap_ctrl_none port=return
The kernel must not use #pragma HLS interface s_axilite or #pragma HLS interface m_axi (as there should not be any memory or control port).
The following code example shows a free-running kernel with one input and one output communicating with another kernel. The while(1) loop structure contains the substance of the kernel code, which repeats as long as the kernel runs.
void kernel_top(hls::stream<ap_axiu<32, 0, 0, 0> >& input,
                hls::stream<ap_axiu<32, 0, 0, 0> >& output) {
#pragma HLS interface ap_ctrl_none port=return // Special pragma for free-running kernel
#pragma HLS DATAFLOW // The kernel is using DATAFLOW optimization
  while(1) {
    ...
  }
}