Streaming Data Transfers

Streaming Data Between the Host and Kernel (H2K)

The Vitis core development kit provides a programming model that supports the direct streaming of data from host-to-kernel and kernel-to-host, without the need to migrate data through global memory as an intermediate step. Because it uses minimal storage rather than the larger and slower global memory bank, this programming model significantly improves performance and reduces power consumption.

Using streaming data transfers brings the following advantages:

  • The host application does not need to know the size of the data coming from the kernel.
  • Data residing in host memory can be transferred to the kernel as soon as it is needed.
  • Processed data can be transferred from the kernel back to the host program when it is required.

    Host-to-kernel and kernel-to-host streaming are only supported in PCIe-based platforms, such as the Alveo Data Center accelerator cards. This feature is also only available on specific target platforms, such as the QDMA platform for the Alveo Data Center accelerator cards. However, kernel-to-kernel streaming data transfer is supported for both PCIe-based and embedded platforms. If your platform is not configured to support streaming, your application will not run.

Host Coding Guidelines

Xilinx provides new OpenCL™ APIs for streaming operation as extension APIs.

clCreateStream()
Creates a read or write stream.
clReleaseStream()
Frees the created stream and its associated memory.
clWriteStream()
Writes data to stream.
clReadStream()
Gets data from stream.
clPollStreams()
Polls for any stream on the device to finish. Required only for non-blocking stream operation.

The typical API flow is described below:

  • Create the required number of read/write streams with clCreateStream.
    • Streams should be directly attached to the OpenCL device object because they do not use a command queue. A stream itself acts as a command queue that passes data in one direction only: either the kernel reads data from the host, or the kernel writes data to the host.
    • Use the appropriate flag to mark the stream as XCL_STREAM_WRITE_ONLY or XCL_STREAM_READ_ONLY, where read and write are from the perspective of the kernel code.
    • To specify how the stream is connected to the device, a Xilinx extension pointer object (cl_mem_ext_ptr_t) is used to identify the kernel, and the kernel argument the stream is associated with.

      IMPORTANT: If the streaming kernel has multiple compute units, the host code needs to use a unique cl_kernel object for each compute unit. The host code must call clCreateKernel with <kernel_name>:{compute_unit_name} to get each compute unit, then create streams for each compute unit and enqueue each one individually.

      In the following code example, a host-to-kernel stream (h2k_stream) and a kernel-to-host stream (k2h_stream) are created and associated with a cl_kernel object and specific kernel arguments.

      #include <CL/cl_ext_xilinx.h> // Required for Xilinx extension pointer
       
      // Device connection specification of the stream through extension pointer
      cl_mem_ext_ptr_t  ext;  // Extension pointer
      ext.param = kernel;     // The .param should be set to the kernel (cl_kernel type)
      ext.obj = nullptr;
       
      // The .flags field denotes the kernel argument index
      // Host-to-kernel stream for argument 3 of the kernel
      // (the kernel reads from it, hence XCL_STREAM_READ_ONLY)
      ext.flags = 3;
      cl_stream h2k_stream = clCreateStream(device_id, XCL_STREAM_READ_ONLY, CL_STREAM, &ext, &ret);
       
      // Kernel-to-host stream for argument 4 of the kernel
      // (the kernel writes to it, hence XCL_STREAM_WRITE_ONLY)
      ext.flags = 4;
      cl_stream k2h_stream = clCreateStream(device_id, XCL_STREAM_WRITE_ONLY, CL_STREAM, &ext, &ret);
  • Set the remaining non-streaming kernel arguments and enqueue the kernel. The following code block shows how to set typical non-stream kernel arguments (buffers and/or scalars) and enqueue the kernel:
    // Set kernel non-stream argument (if any)
    clSetKernelArg(kernel, 0,...,...);
    clSetKernelArg(kernel, 1,...,...);
    clSetKernelArg(kernel, 2,...,...);
    // Arguments 3 and 4 are not set here because they were already specified
    // in clCreateStream through the extension pointer
     
    // Schedule kernel enqueue
    clEnqueueTask(commands, kernel, ...);
  • Initiate read and write transfers with the clReadStream and clWriteStream commands.
    • Note the use of the cl_stream_xfer_req attribute structure associated with each read and write request.
    • The .flags field denotes the transfer mechanism.
      CL_STREAM_EOT
      A stream transfer currently completes successfully only when its end is identified by an End of Transfer (EOT) signal. This flag is mandatory in the current release.
      CL_STREAM_NONBLOCKING
      By default, read and write transfers are blocking. For a non-blocking transfer, CL_STREAM_NONBLOCKING must be set.
    • The .priv_data field specifies a string that tags the transfer with a name. The tag helps identify the completion of a specific transfer when polling the streams. It is required when using the non-blocking version of the API.

      In the following code block, the stream read and write transfers are executed with the non-blocking approach.

      // Initiate the READ transfer
      cl_stream_xfer_req rd_req {0};
       
      rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      rd_req.priv_data = (void*)"read"; // Tags the transfer with a name
       
      clReadStream(k2h_stream, host_read_ptr, max_read_size, &rd_req, &ret);
       
      // Initiate the WRITE transfer
      cl_stream_xfer_req wr_req {0};
       
      wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      wr_req.priv_data = (void*)"write";
       
      clWriteStream(h2k_stream, host_write_ptr, write_size, &wr_req , &ret);
  • Poll all the streams for completion. For the non-blocking transfer, a polling API is provided to ensure the read/write transfers are completed. For the blocking version of the API, polling is not required.
    • The polling results are stored in the cl_streams_poll_req_completions array, which can be used in verifying and checking the stream events result.
    • clPollStreams is a blocking API. It returns control to the host code as soon as it is notified that all requested stream transfers have completed, or when the specified timeout expires.
      // Checking the request completion
      cl_streams_poll_req_completions poll_req[2] {0, 0}; // 2 requests
       
      auto num_compl = 2;
      clPollStreams(device_id, poll_req, 2, 2, &num_compl, 5000, &ret);
      // Blocking API: waits for 2 poll request completions or 5000 ms,
      // whichever occurs first
  • Read and use the stream data in host.
    • After the poll request completes successfully, the host can read the data from the host pointer.
    • The host can also check the size of the data transferred to it. To do so, the host finds the correct poll request by matching priv_data and then fetches nbytes (the number of bytes transferred) from the cl_streams_poll_req_completions structure.
      for (auto i=0; i<2; ++i) {
          // Identify the read transfer by its tag
          if (rd_req.priv_data == poll_req[i].priv_data) {
              // Get the read size; the data size from the kernel is not known in advance
              ssize_t result_size = poll_req[i].nbytes;
          }
      }
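
The lookup in the last step can be factored into a small helper. The following is a minimal host-only sketch, not the XRT API: PollCompletion is a hypothetical stand-in that models only the two fields named above (priv_data and nbytes), and find_transfer_size is an illustrative name. The real cl_streams_poll_req_completions layout is defined in the XRT header.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cl_streams_poll_req_completions. Only the two
// fields discussed in the text are modeled; the real layout is in the XRT header.
struct PollCompletion {
    void*       priv_data;  // tag attached to the transfer request
    std::size_t nbytes;     // number of bytes actually transferred
};

// Returns the byte count of the completion whose tag matches `tag`,
// or -1 if no completion in the array carries that tag.
// The comparison is on the pointer value, so the same pointer must be used
// when issuing the request and when looking up its completion.
long find_transfer_size(const PollCompletion* completions, int count,
                        const void* tag) {
    assert(completions != nullptr || count == 0);
    for (int i = 0; i < count; ++i) {
        if (completions[i].priv_data == tag) {
            return static_cast<long>(completions[i].nbytes);
        }
    }
    return -1;
}
```

Returning -1 for a missing tag lets the caller distinguish a transfer that has not completed (for example, after a timeout) from one that completed with zero bytes.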

The header file containing function prototype and argument description is available in the Xilinx Runtime GitHub repository.
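
For kernels with multiple compute units, the note in the API flow above requires one cl_kernel object per compute unit, obtained with a name of the form <kernel_name>:{compute_unit_name}. The sketch below only builds that name string; cu_kernel_name is a hypothetical helper, and the kernel and compute unit names used with it are placeholders.

```cpp
#include <cassert>
#include <string>

// Builds the "<kernel_name>:{compute_unit_name}" string described in the text.
// Both names are supplied by the caller; nothing here queries the device.
std::string cu_kernel_name(const std::string& kernel_name,
                           const std::string& cu_name) {
    assert(!kernel_name.empty() && !cu_name.empty());
    return kernel_name + ":{" + cu_name + "}";
}
```

The resulting string, for example cu_kernel_name("krnl", "krnl_1"), would be passed as the kernel name to clCreateKernel, once per compute unit.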

Kernel Coding Guidelines

The basic guidelines for developing a stream-based C kernel are as follows:

  • Use hls::stream with the qdma_axis<D,0,0,0> data type. The qdma_axis data type needs the header file ap_axi_sdata.h.
  • When hls::stream is used to define a parameter data type, the Vitis HLS tool infers an axis streaming interface.
  • qdma_axis<D,0,0,0> is a special class used for data transfer between the host and the kernel on a streaming platform. It is used only in streaming kernel interfaces that interact with the host, not with another kernel. The template parameter <D> denotes the data width. The remaining three parameters should be set to 0 (they are not used in the current release).
  • The following code block shows a simple kernel interface with one input stream and one output stream.
    #include "ap_axi_sdata.h"
    #include "hls_stream.h"
     
    //qdma_axis is the HLS class for stream data transfer between host and kernel for streaming platform
    //It contains "data" and two sideband signals (last and keep) exposed to the user via class member function. 
    typedef qdma_axis<64,0,0,0> datap;
     
    void kernel_top (
                 hls::stream<datap> &input,
                 hls::stream<datap> &output,
                 ..... , // Other Inputs/Outputs if any                   
                 )
    {
    
       ...
    }
    TIP: Because the data type is defined as hls::stream, the Vitis HLS tool infers axis interfaces. The following INTERFACE pragmas are shown only as an example and do not need to be added to the code.
    #pragma HLS INTERFACE axis port=input
    #pragma HLS INTERFACE axis port=output
  • The qdma_axis data type contains three variables, which should be used inside the kernel code:
    data
    Internally, the qdma_axis data type contains an ap_int<D> that should be accessed with the .get_data() and .set_data() methods.
    • D must be 8, 16, 32, 64, 128, 256, or 512 bits wide.
    last
    The last variable indicates the last value of an incoming or outgoing stream. When reading from the input stream, last is used to detect the end of the stream. Similarly, when the kernel writes to an output stream transferred to the host, last must be set to indicate the end of the stream.
    • get_last/set_last: Accesses and sets the last variable used to denote the last data in the stream.
    keep
    In some special situations, the keep signal can be used to truncate the last data to fewer bytes. However, keep should not be used to truncate any data other than the last data of the stream. Therefore, in most cases, you should set keep to -1 for all outgoing data from the kernel.
    • get_keep/set_keep: Accesses and sets the keep variable.
    • For all data before the last data, keep must be set to -1 to denote that all bytes of the data are valid.
    • For the last data, the kernel has the flexibility to send fewer bytes. For example, for a four-byte data transfer, the kernel can truncate the last data to one, two, or three bytes using the following set_keep() calls.
      • If the last data is one byte → .set_keep(1)
      • If the last data is two bytes → .set_keep(3)
      • If the last data is three bytes → .set_keep(7)
      • If the last data is all four bytes (similar to all non-last data) → .set_keep(-1)
  • The following code block shows how the stream input is read. Note the use of get_last() to detect the last data.
    // Stream Read
    // Using "last" flag to determine the end of input-stream
    // when kernel does not know the length of the input data
    hls::stream<ap_uint<64> > internal_stream;
    while (true) {
        datap temp = input.read();          // "input" -> Input stream
        internal_stream << temp.get_data(); // Getting data from the stream
        if (temp.get_last())                // Getting last signal to determine the EOT (end of transfer)
            break;
    }
  • The following code block shows how the stream output is written. set_keep() sets keep to -1 for all data (the general case). The kernel also uses set_last() to mark the last data of the stream.
    IMPORTANT: For the host and kernel system to function properly, the kernel must set the last bit on the final data of the stream.
    // Stream Write
    for(int j = 0; j <....; j++) {
          datap t;
          t.set_data(...);
          t.set_keep(-1);        // keep flag -1 , all bytes are valid
          if(... )               // check if this is the last data to be written
             t.set_last(1);      // Setting last data of the stream
          else
             t.set_last(0);
          output.write(t);  	 // output stream from the kernel
    }
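
The keep and last rules above can be checked with a small software model. This is a sketch under stated assumptions, not HLS code: Beat is a hypothetical stand-in for one four-byte qdma_axis transfer, std::queue stands in for the stream, and keep_for, send_bytes, and receive_byte_count are illustrative names. The model assumes keep bit n marks byte n of the data word as valid, which matches the set_keep(1), set_keep(3), and set_keep(7) values listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

const int kWidthBytes = 4;  // model of a 32-bit wide stream

// Hypothetical stand-in for one qdma_axis<32,0,0,0> transfer.
struct Beat {
    std::uint32_t data;  // payload word
    int keep;            // -1: all bytes valid; otherwise one bit per valid byte
    bool last;           // true only on the final transfer of the stream
};

// keep value for a transfer carrying `valid_bytes` valid bytes.
int keep_for(int valid_bytes) {
    assert(valid_bytes >= 1 && valid_bytes <= kWidthBytes);
    return (valid_bytes == kWidthBytes) ? -1 : (1 << valid_bytes) - 1;
}

// Producer model: packs the payload into beats, keep = -1 on every beat
// except a truncated final one, and last set only on the final beat.
void send_bytes(const std::vector<std::uint8_t>& payload, std::queue<Beat>& out) {
    assert(!payload.empty());
    const std::size_t width = kWidthBytes;
    for (std::size_t pos = 0; pos < payload.size(); pos += width) {
        const std::size_t remaining = payload.size() - pos;
        const std::size_t n = remaining < width ? remaining : width;
        std::uint32_t word = 0;
        for (std::size_t b = 0; b < n; ++b) {
            word |= static_cast<std::uint32_t>(payload[pos + b]) << (8 * b);
        }
        const bool is_last = (pos + n == payload.size());
        Beat beat;
        beat.data = word;
        beat.keep = is_last ? keep_for(static_cast<int>(n)) : -1;
        beat.last = is_last;
        out.push(beat);
    }
}

// Consumer model: reads until last is seen and counts the valid bytes.
std::size_t receive_byte_count(std::queue<Beat>& in) {
    std::size_t total = 0;
    while (true) {
        assert(!in.empty());  // a hardware stream read would block here instead
        const Beat beat = in.front();
        in.pop();
        if (beat.keep == -1) {
            total += kWidthBytes;
        } else {
            for (int mask = beat.keep; mask != 0; mask >>= 1) total += mask & 1;
        }
        if (beat.last) break;
    }
    return total;
}
```

In this model, a six-byte payload produces two beats: the first with keep set to -1 and last cleared, the second with keep set to 3 and last set.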

Streaming Data Transfers Between Kernels (K2K)

The Vitis core development kit also supports streaming data transfer between two kernels. Consider the situation where one kernel is performing some part of the computation, and the second kernel completes the operation after receiving the output data from the first kernel. With kernel-to-kernel streaming support, data can move directly from one kernel to another without having to transmit back through the global memory. This results in a significant performance improvement.

IMPORTANT: This feature is only available on specific target platforms, such as the QDMA platform for the Alveo Data Center accelerator cards. If your platform is not configured to support streaming, your application will not run.

Host Coding Guidelines

The kernel ports involved in kernel-to-kernel streaming do not require setup using the clSetKernelArg from the host code. All kernel arguments not involved in the streaming connection should be set up using clSetKernelArg as described in Setting Kernel Arguments. However, kernel ports involved in streaming will be defined within the kernel itself, and are not addressed by the host program.

Streaming Kernel Coding Guidelines

In a kernel, the streaming interface directly sending or receiving data to another kernel streaming interface is defined by hls::stream with the ap_axiu<D,0,0,0> data type. The ap_axiu<D,0,0,0> data type requires the use of the ap_axi_sdata.h header file.

IMPORTANT: Host-to-kernel and kernel-to-host streaming (see Streaming Data Between the Host and Kernel (H2K)) requires the use of the qdma_axis data type. Both the ap_axiu and qdma_axis data types are defined inside the ap_axi_sdata.h header file that is distributed with the Vitis software platform installation.

The following example shows the streaming interfaces of the producer and consumer kernels.

// Producer kernel - provides output as a data stream
// The example kernel code does not show any other inputs or outputs.

void kernel1 (.... , hls::stream<ap_axiu<32, 0, 0, 0> >& stream_out) {
      
  for(int i = 0; i < ...; i++) {
    int a = ...... ;         // Internally generated data
    ap_axiu<32, 0, 0, 0> v;  // temporary storage for ap_axiu
    v.data = a;              // Writing the data
    stream_out.write(v);         // Writing to the output stream.
  }
}
 
// Consumer kernel - reads data stream as input
// The example kernel code does not show any other inputs or outputs.

void kernel2 (hls::stream<ap_axiu<32, 0, 0, 0> >& stream_in, .... ) {
 
  for(int i = 0; i < ....; i++) {
    ap_axiu<32, 0, 0, 0> v = stream_in.read(); // Reading the input stream
    int a = v.data; // Extract the data
          
    // Do further processing
  }
}

Because the hls::stream data type is used for these ports, the Vitis HLS tool infers axis interfaces. The following INTERFACE pragmas are shown only as an example and do not need to be added to the code.

#pragma HLS INTERFACE axis port=stream_out
#pragma HLS INTERFACE axis port=stream_in
TIP: These example kernels show the definition of the streaming input/output ports in the kernel signature, and the handling of the input/output stream in the kernel code. The connection of kernel1 to kernel2 must be defined during the kernel linking process as described in Specify Streaming Connections between Compute Units.

For more information on mapping streaming connections, refer to Building and Running the Application.
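
The producer and consumer pattern above can be exercised on the host with a simple software model. The sketch below is an illustration only: StreamModel and AxiWord are hypothetical stand-ins for hls::stream and ap_axiu<32,0,0,0> that model just write(), read(), and the data field, and the count parameter replaces the loop bounds that the example kernels leave unspecified.

```cpp
#include <cassert>
#include <queue>

// Hypothetical host-side stand-in for hls::stream.
template <typename T>
class StreamModel {
public:
    void write(const T& value) { fifo_.push(value); }
    T read() {
        assert(!fifo_.empty());  // hardware would stall here instead of failing
        T value = fifo_.front();
        fifo_.pop();
        return value;
    }
    bool empty() const { return fifo_.empty(); }
private:
    std::queue<T> fifo_;
};

// Hypothetical stand-in for ap_axiu<32,0,0,0>; only the data field is modeled.
struct AxiWord {
    int data;
};

// Producer model: writes `count` values (0, 1, 2, ...) to the output stream.
void producer_model(int count, StreamModel<AxiWord>& stream_out) {
    for (int i = 0; i < count; i++) {
        AxiWord v;
        v.data = i;            // stands in for internally generated data
        stream_out.write(v);
    }
}

// Consumer model: reads exactly `count` values and returns their sum.
int consumer_model(StreamModel<AxiWord>& stream_in, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        AxiWord v = stream_in.read();
        sum += v.data;         // stands in for further processing
    }
    return sum;
}
```

Because neither example kernel uses a last flag, both loops in this model must agree on the number of transfers. If the consumer expects more values than the producer writes, the read in the model trips the assertion, which corresponds to a stalled read in hardware.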

Free-Running Kernel

The Vitis core development kit provides support for one or more free-running kernels. Free-running kernels have no control signal ports, and cannot be started or stopped. The no-control signal feature of the free-running kernel results in the following characteristics:

  • The free-running kernel has no memory input or output port, and therefore it interacts with the host or other kernels (which can be regular kernels or other free-running kernels) only through streams.
  • When the FPGA is programmed by the binary container (xclbin), the free-running kernel starts running on the FPGA, and therefore it does not need the clEnqueueTask command from the host code.
  • The kernel works on the stream data as soon as it starts receiving from the host or other kernels, and it stalls when the data is not available.
  • The free-running kernel needs a special interface pragma ap_ctrl_none inside the kernel body.

Host Coding for Free-Running Kernels

If the free-running kernel interacts with the host, the host code should manage the stream operation by clCreateStream/clReadStream/clWriteStream as discussed in Host Coding Guidelines of Streaming Data Between the Host and Kernel (H2K). As the free-running kernel has no other types of inputs or outputs, such as memory ports or control ports, there is no need to specify clSetKernelArg. The clEnqueueTask is not used because the kernel works on the stream data as soon as it starts receiving from the host or other kernels, and it stalls when the data is not available.

Coding Guidelines for Free-Running Kernels

As mentioned previously, the free-running kernel only contains hls::stream inputs and outputs. The recommended coding guidelines include:

  • Use hls::stream<ap_axiu<D,0,0,0> > if the port interacts with a stream port of another kernel.
  • Use hls::stream<qdma_axis<D,0,0,0> > if the port is interacting with the host.
  • Using the hls::stream data type for a function parameter causes Vitis HLS to infer an AXI4-Stream port (axis) for the interface.
  • The free-running kernel must also specify the following special INTERFACE pragma.
    #pragma HLS interface ap_ctrl_none port=return
IMPORTANT: The kernel interface should not have any #pragma HLS interface s_axilite or #pragma HLS interface m_axi (as there should not be any memory or control port).

The following code example shows a free-running kernel with one input and one output communicating with another kernel. The while(1) loop structure contains the substance of the kernel code, which repeats as long as the kernel runs.

void kernel_top(hls::stream<ap_axiu<32, 0, 0, 0> >& input, 
   hls::stream<ap_axiu<32, 0, 0, 0> >& output) {
#pragma HLS interface ap_ctrl_none port=return  // Special pragma for free-running kernel
 
#pragma HLS DATAFLOW // The kernel is using DATAFLOW optimization
	while(1) {
		...
	}
}
TIP: The example shows the definition of the streaming input/output ports in a free-running kernel. However, the streaming connection from the free-running kernel to or from another kernel must be defined during the kernel linking process as described in Specify Streaming Connections between Compute Units.
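
The behavior of a free-running kernel, which processes data as soon as it arrives and stalls when none is available, can be illustrated with a host-side model. This is a sketch under stated assumptions rather than synthesizable code: std::queue stands in for the stream ports, free_running_step models one pass through the while(1) body, and the "add one" processing is a placeholder chosen for illustration.

```cpp
#include <cassert>
#include <queue>

// Models one pass through the while(1) body of a free-running kernel.
// Returns false when no input is available, which models the stall.
bool free_running_step(std::queue<int>& input, std::queue<int>& output) {
    if (input.empty()) {
        return false;        // hardware would simply wait for data here
    }
    const int value = input.front();
    input.pop();
    output.push(value + 1);  // placeholder processing
    return true;
}

// Runs steps until the model stalls; returns how many values were processed.
int run_until_stall(std::queue<int>& input, std::queue<int>& output) {
    int processed = 0;
    while (free_running_step(input, output)) {
        ++processed;
    }
    assert(input.empty());   // the model only stalls once the input is drained
    return processed;
}
```

Nothing in the model starts or stops the kernel: work happens whenever data is present, which mirrors the absence of control ports described above.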