SDAccel Streaming Platform

Streaming Data Transfers Between Host and Kernel

Starting with the SDAccel™ 2019.1 release, SDAccel provides a programming model that supports streaming data directly from host to kernel and from kernel to host without going through global memory. This feature is an addition to the existing host-to-kernel and kernel-to-host data transfers through global memory. Streams offer the following advantages:

  • The host application does not necessarily need to know the size of the data coming from the kernel.
  • Data residing in host memory can be transferred to the kernel as soon as it is needed. Similarly, the processed data can be transferred back when it is required.

Because this programming model uses minimal storage instead of the larger and slower global memory bank, it improves performance and reduces power consumption.

Host Coding Guidelines

Xilinx® provides new OpenCL™ APIs for streaming operation as extension APIs.

clCreateStream()
Creates a read or write stream.
clReleaseStream()
Frees the created stream and its associated memory.
clWriteStream()
Writes data to stream.
clReadStream()
Gets data from stream.
clPollStreams()
Polls for any stream on the device to finish. Required only for non-blocking stream operation.

The typical API flow is described below:

  • Create the required number of read/write streams with clCreateStream.
    • Streams are attached directly to the OpenCL device object because they do not use a command queue. A stream itself acts as a command queue that passes data in only one direction, either from host to kernel or from kernel to host.
    • An appropriate flag should be used to denote a stream write or read operation (from the host perspective).
    • To specify how the stream is connected to the device, a predefined extension pointer (cl_mem_ext_ptr_t) should be used to denote the kernel and the kernel argument that the stream is associated with.

      In the code block below, a Read Stream (named read_stream) and a Write Stream (named write_stream) are created.

      #include <CL/cl_ext_xilinx.h> // Required for Xilinx Extension
       
      // Device connection specification of the stream through extension pointer
      cl_mem_ext_ptr_t  ext;  // Extension pointer
      ext.param = kernel;     // The .param should be set to kernel (cl_kernel type)
      ext.obj = nullptr;
       
      // The .flags field should be used to denote the kernel argument
      // Create write stream for argument 3 of kernel
      ext.flags = 3;
      cl_stream write_stream = clCreateStream(device_id, CL_STREAM_WRITE_ONLY, CL_STREAM, &ext, &ret);
       
      // Create read stream for argument 4 of kernel
      ext.flags = 4;
      cl_stream read_stream = clCreateStream(device_id, CL_STREAM_READ_ONLY, CL_STREAM, &ext,&ret);
  • Set the remaining non-stream kernel arguments and enqueue the kernel. The following code block shows how the non-stream arguments (buffers and/or scalars) are typically set and how the kernel is enqueued.
    // Set kernel non-stream argument (if any)
    clSetKernelArg(kernel, 0,...,...);
    clSetKernelArg(kernel, 1,...,...);
    clSetKernelArg(kernel, 2,...,...);
    // Arguments 3 and 4 are not set because they were already specified
    // through the extension pointer in clCreateStream
     
    // Schedule kernel enqueue
    clEnqueueTask(commands, kernel, ...);
  • Initiate the read and write transfers with clReadStream and clWriteStream.
    • Each read and write request is described by a cl_stream_xfer_req attribute structure.
    • The .flags field denotes the transfer mechanism.
      CL_STREAM_EOT
      Currently, a successful stream transfer depends on identifying the end of the transfer by an End of Transfer signal. This flag is mandatory in the current release.
      CL_STREAM_NONBLOCKING
      By default, read and write transfers are blocking. For a non-blocking transfer, CL_STREAM_NONBLOCKING must be set.
    • The .priv_data field specifies a string (a name used as a tag) associated with the transfer. The tag helps identify the completion of a specific transfer when polling for stream completion. It is required when using the non-blocking version of the API.

      In the following code block, the stream read and write transfers are executed with the non-blocking approach.

      // Initiate the READ transfer
      cl_stream_xfer_req rd_req {0};
       
      rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      rd_req.priv_data = (void*)"read"; // Think of this as tagging the transfer with a name
       
      clReadStream(read_stream, host_read_ptr, max_read_size, &rd_req, &ret);
       
      // Initiating the WRITE transfer
      cl_stream_xfer_req wr_req {0};
       
      wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      wr_req.priv_data = (void*)"write";
       
      clWriteStream(write_stream, host_write_ptr, write_size, &wr_req , &ret);
  • Poll all the streams for completion. For the non-blocking transfer, a polling API is provided to ensure the read/write transfers are completed. For the blocking version of the API, polling is not required.
    • Provide an array of cl_streams_poll_req_completions with one entry per outstanding request; the completions are reported through this array.
    • clPollStreams is a blocking API. It returns control to the host code as soon as it receives notification that all stream requests have completed, or when the specified timeout expires.
      // Checking the request completion
      cl_streams_poll_req_completions poll_req[2] {0, 0}; // 2 Requests
       
      auto num_compl = 2;
      // Blocking API, waits for 2 poll request completions or 5000 ms,
      // whichever occurs first
      clPollStreams(device_id, poll_req, 2, 2, &num_compl, 5000, &ret);
  • Read and use the stream data on the host.
    • After the poll request completes successfully, the host can read the data from the host pointer.
    • The host can also check the size of the data transferred to it. To do so, the host finds the correct poll request by matching priv_data and then fetches nbytes (the number of bytes transferred) from the cl_streams_poll_req_completions structure.
      for (auto i=0; i<2; ++i) {
          // Identifying the read transfer
          if(rd_req.priv_data == poll_req[i].priv_data) {
              // Getting read size, data size from kernel is unknown
              ssize_t result_size=poll_req[i].nbytes;
          }
      }
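
The matching step above can be sketched in isolation. The following is a minimal, self-contained C++ model, not the Xilinx runtime API: completion is a hypothetical stand-in that mirrors only the two fields named above (priv_data and nbytes), and find_nbytes is an illustrative helper. It assumes, as in the loop above, that the tag is matched by pointer identity rather than by string content, so the comparison must use the same pointer that was stored in the request.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the two fields of
// cl_streams_poll_req_completions used in the text; not the real type.
struct completion {
    void*     priv_data;  // tag copied from the transfer request
    long long nbytes;     // bytes actually transferred (the text uses ssize_t)
};

// Returns the byte count of the completion whose tag equals `tag`
// (pointer comparison, as in the loop above), or -1 if none matches.
long long find_nbytes(const completion* done, std::size_t count, const void* tag) {
    assert(done != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (done[i].priv_data == tag) {
            return done[i].nbytes;
        }
    }
    return -1;
}
```

Because the comparison is on the pointer, a second buffer holding the same characters would not match; keeping the request structure (or its tag pointer) alive until polling finishes avoids that pitfall.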

The header file containing the function prototypes and argument descriptions is available in the Xilinx Runtime GitHub repository.

IMPORTANT: If the streaming kernel has multiple CUs, the host code needs to use a unique cl_kernel object for each CU. The host code must use clCreateKernel with <kernel_name>:{compute_unit_name} to get each CU, create streams for each of them, and enqueue them individually.
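
As an illustration of the naming convention in the note above, the following sketch only builds the <kernel_name>:{compute_unit_name} strings; it does not call the OpenCL runtime. The helper names cu_kernel_name and cu_kernel_names are hypothetical, and any kernel or CU names passed to them are placeholders.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the "<kernel_name>:{compute_unit_name}" string described above.
std::string cu_kernel_name(const std::string& kernel_name,
                           const std::string& cu_name) {
    assert(!kernel_name.empty() && !cu_name.empty());
    return kernel_name + ":{" + cu_name + "}";
}

// Builds one such string per compute unit, in the order given.
std::vector<std::string> cu_kernel_names(const std::string& kernel_name,
                                         const std::vector<std::string>& cu_names) {
    std::vector<std::string> names;
    names.reserve(cu_names.size());
    for (const std::string& cu : cu_names) {
        names.push_back(cu_kernel_name(kernel_name, cu));
    }
    return names;
}
```

Each resulting string would then be passed as the kernel name to clCreateKernel, giving one cl_kernel object per CU, with streams created and the kernel enqueued separately for each object.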

Kernel Coding Guidelines

The basic guidelines for developing a stream-based C kernel are as follows:

  • Use hls::stream with the qdma_axis<D,0,0,0> data type. The qdma_axis data type needs the header file ap_axi_sdata.h.
  • The qdma_axis<D,0,0,0> is a special class used for data transfer between host and kernel when using the streaming platform. It is used only for streaming kernel interfaces that interact with the host, not with another kernel. The template parameter <D> denotes the data width. The remaining three parameters should be set to 0 (they are not used in the current release).
  • The following code block shows a simple kernel interface with one input stream and one output stream.
    #include "ap_axi_sdata.h"
    #include "hls_stream.h"
     
    //qdma_axis is the HLS class for stream data transfer between host and kernel for streaming platform
    //It contains "data" and two sideband signals (last and keep) exposed to the user via class member function. 
    typedef qdma_axis<64,0,0,0> datap;
     
    void kernel_top (
                 hls::stream<datap> &input,
                 hls::stream<datap> &output,
                 ..... , // Other Inputs/Outputs if any                   
                 )
    {
        #pragma HLS INTERFACE axis port=input
        #pragma HLS INTERFACE axis port=output
    }
  • The qdma_axis data type contains three variables that should be used inside the kernel code:
    data
    Internally, qdma_axis contains an ap_uint<D> that should be accessed through the .get_data() and .set_data() methods.
    • D must be 8, 16, 32, 64, 128, 256, or 512 bits wide.
    last
    The last variable indicates the last value of an incoming or outgoing stream. When reading from the input stream, last is used to detect the end of the stream. Similarly, when the kernel writes to an output stream that is transferred to the host, last must be set to indicate the end of the stream.
    • get_last/set_last: Accesses/sets the last variable used to denote the last data in the stream.
    keep
    In some special situations, the keep signal can be used to truncate the last data to fewer bytes. However, keep should not be used to truncate any data other than the last data of the stream. In most cases, therefore, you should set keep to -1 for all outgoing data from the kernel.
    • get_keep/set_keep: Accesses/sets the keep variable.
    • For all the data before the last data, keep must be set to -1 to denote that all bytes of the data are valid.
    • For the last data, the kernel has the flexibility to send fewer bytes. For example, for a four-byte data word, the kernel can truncate the last data to one, two, or three bytes by using the set_keep() function as follows.
      • If the last data is one byte => .set_keep(1)
      • If the last data is two bytes => .set_keep(3)
      • If the last data is three bytes => .set_keep(7)
      • If the last data is all four bytes (similar to all non-last data) => .set_keep(-1)
  • The following code block shows how the stream input is read. Note the use of get_last() to detect the last data.
    // Stream Read
    // Using "last" flag to determine the end of input-stream
    // when kernel does not know the length of the input data
    hls::stream<ap_uint<64> >   internal_stream;
    while(true) {
        datap temp = input.read();           // "input" -> Input stream
        internal_stream << temp.get_data();  // Getting data from the stream
        if(temp.get_last())                  // Getting last signal to determine the EOT (end of transfer)
            break;
    }
  • The following code block shows how the stream output is written. set_keep(-1) is applied to all data (the general case), and the kernel uses set_last() to mark the last data of the stream.
    IMPORTANT: For the host and kernel system to function properly, the last bit must be set correctly.
    // Stream Write
    for(int j = 0; j <....; j++) {
          datap t;
          t.set_data(...);
          t.set_keep(-1);        // keep flag -1 , all bytes are valid
          if(... )               // check if this is the last data to be written
             t.set_last(1);      // Setting last data of the stream
          else
             t.set_last(0);
          output.write(t);       // output stream from the kernel
    }
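
The interaction of last and keep described above can be modeled in plain C++ without the HLS headers. The sketch below is a software model only: beat is a hypothetical stand-in for one stream word (it is not the qdma_axis class), WIDTH is fixed at four bytes to match the set_keep() example, and keep_mask, packetize, and reassemble are illustrative helper names. It assumes that bit i of keep marks byte i as valid, so that a full word carries the mask 0xF, which is what -1 becomes when truncated to four keep bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Data width in bytes for this model (four-byte words, as in the example above).
constexpr std::size_t WIDTH = 4;

// Hypothetical software model of one stream word; NOT the qdma_axis class.
struct beat {
    std::uint8_t  data[WIDTH];  // payload bytes
    std::uint32_t keep;         // one bit per valid byte, bit 0 = first byte
    bool          last;         // true only on the final word
};

// Mask with the lowest `valid_bytes` bits set: 1 -> 0x1, 2 -> 0x3, 3 -> 0x7.
std::uint32_t keep_mask(std::size_t valid_bytes) {
    assert(valid_bytes >= 1 && valid_bytes <= WIDTH);
    return (1u << valid_bytes) - 1u;
}

// Splits a byte buffer into words. Only the final word may be partial,
// and only the final word has `last` set.
std::vector<beat> packetize(const std::vector<std::uint8_t>& bytes) {
    std::vector<beat> out;
    for (std::size_t off = 0; off < bytes.size(); off += WIDTH) {
        const std::size_t remaining = bytes.size() - off;
        const std::size_t n = remaining < WIDTH ? remaining : WIDTH;
        beat b{};
        for (std::size_t i = 0; i < n; ++i) {
            b.data[i] = bytes[off + i];
        }
        b.keep = keep_mask(n);
        b.last = (off + n == bytes.size());
        out.push_back(b);
    }
    return out;
}

// Reads words until `last` is seen, keeping only the bytes flagged by keep.
std::vector<std::uint8_t> reassemble(const std::vector<beat>& beats) {
    std::vector<std::uint8_t> bytes;
    for (const beat& b : beats) {
        for (std::size_t i = 0; i < WIDTH; ++i) {
            if (b.keep & (1u << i)) {
                bytes.push_back(b.data[i]);
            }
        }
        if (b.last) {
            break;  // end of transfer
        }
    }
    return bytes;
}
```

In this model an 11-byte buffer becomes three words: two full words followed by a final word with three valid bytes and last set. An empty buffer produces no words at all, so a real kernel would need its own convention for signalling an empty transfer.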

Streaming Data Transfers Between the Kernels

The SDAccel environment also supports streaming data transfer between two kernels. Consider the situation where one kernel performs part of the computation and a second kernel performs the rest after receiving the output data from the first kernel. Before the SDx™ 2019.1 release, the only way to transfer data from one kernel to another was through global memory. With kernel-to-kernel streaming support, data can move directly from one kernel to another without passing through global memory, which improves performance.

Host Coding Guidelines

There is only one consideration from the host coding perspective for kernel-to-kernel streaming data transfer: the kernel ports involved in kernel-to-kernel data transfer do not need clSetKernelArg calls in the host code. The host code should use clSetKernelArg only for the other kernel arguments, which interact directly with the host.

Kernel Coding Guidelines

A kernel streaming interface that sends data directly to, or receives data directly from, another kernel's streaming interface should be defined as an hls::stream with the ap_axiu<D,0,0,0> data type. The ap_axiu<D,0,0,0> data type needs the header file ap_axi_sdata.h.

IMPORTANT: Xilinx requires using the qdma_axis data type for host-to-kernel and kernel-to-host streaming, as described in the previous section. The ap_axiu data type, on the other hand, should be used for kernel-to-kernel streaming data transfer. Both data types are defined in the ap_axi_sdata.h file distributed with the SDAccel release.
The following example shows the streaming interfaces of the producer and consumer kernels.
// Producer kernel
// Producing stream output to another kernel on the FPGA
// The below code segment ignores all other inputs and outputs, if any

void kernel1 (.... , hls::stream<ap_axiu<32, 0, 0, 0> >& stream_out) {
#pragma HLS interface axis port=stream_out

    for(int i = 0; i < ...; i++) {
        int a = ...... ;         // Internally generated data
        ap_axiu<32, 0, 0, 0> v;  // temporary storage for ap_axiu
        v.data = a;              // Writing the data
        stream_out.write(v);     // Writing to the output stream
    }
}
 
// Consumer kernel
// Consuming stream input from another kernel on the FPGA
// The below code segment ignores all other inputs and outputs, if any
void kernel2 (hls::stream<ap_axiu<32, 0, 0, 0> >& stream_in, .... ) {
#pragma HLS interface axis port=stream_in

    for(int i = 0; i < ....; i++) {
        ap_axiu<32, 0, 0, 0> v = stream_in.read(); // Reading from the input stream
        int a = v.data; // Extract the data

        // Do further processing
    }
}
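
The producer and consumer pattern above can be exercised on a workstation with a small software model. The sketch below is not HLS code: word is a hypothetical stand-in that models only the .data member of ap_axiu<32, 0, 0, 0>, std::queue stands in for the FIFO behavior of hls::stream, and the squared values are arbitrary placeholder data. It illustrates one consequence of the fixed loop bounds used in the example: both sides have to agree on how many words are exchanged.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-ins: `word` models only the .data member of
// ap_axiu<32, 0, 0, 0>, and std::queue models a FIFO stream.
struct word {
    std::int32_t data;
};
using stream = std::queue<word>;

// Producer: writes `count` internally generated values to the stream.
void producer(stream& stream_out, int count) {
    for (int i = 0; i < count; ++i) {
        word v;
        v.data = i * i;      // placeholder for internally generated data
        stream_out.push(v);  // corresponds to stream_out.write(v)
    }
}

// Consumer: reads exactly `count` values and returns their sum.
long long consumer(stream& stream_in, int count) {
    long long sum = 0;
    for (int i = 0; i < count; ++i) {
        assert(!stream_in.empty());  // a real stream read would block instead
        word v = stream_in.front();  // corresponds to stream_in.read()
        stream_in.pop();
        sum += v.data;               // placeholder for further processing
    }
    return sum;
}
```

In this sequential model the producer runs to completion before the consumer starts, whereas the two kernels run concurrently on the device, so the model checks only data ordering and counts, not timing.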

Linking the Kernels

Finally, connect the streaming output port of the producer kernel to the streaming input port of the consumer kernel with the --sc switch during the xocc link (-l) stage.

# Syntax: xocc -l --sc <Source streaming port>:<Destination streaming port>
xocc -l --sc <kernel1 instance name>.stream_out:<kernel2 instance name>.stream_in