OpenCL Programming

OpenCL Host Application

In the Vitis™ core development kit, host code is written in C or C++ language using the Xilinx® runtime (XRT) API or industry standard OpenCL™ API. The XRT native API is described on the XRT site at https://xilinx.github.io/XRT/2020.2/html/xrt_native_apis.html. The Vitis core development kit supports the OpenCL 1.2 API as described at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf. XRT extensions to OpenCL are described at https://xilinx.github.io/XRT/2020.2/html/opencl_extension.html.

TIP: The code examples shown in this text use the OpenCL C language API.

In general, the structure of the host code can be divided into three sections:

Setting up the environment.
Core command execution including executing one or more kernels.
Post processing and release of resources.

TIP: The Vitis core development kit supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. For details and installation instructions, refer to OpenCL Installable Client Driver Loader.

Note: For multithreading the host program, exercise caution when calling a fork() system call from a Vitis core development kit application. The fork() does not duplicate all the runtime threads. Hence, the child process cannot run as a complete application in the Vitis core development kit. It is advisable to use the posix_spawn() system call to launch another process from the Vitis software platform application.

Setting Up the OpenCL Environment

The host code in the Vitis core development kit follows the OpenCL programming paradigm. To setup the runtime environment properly, the host application needs to initialize the standard OpenCL structures: target platform, devices, context, command queue, and program.

TIP: The host code examples and API commands used in this document follow the OpenCL C API. However, XRT also supports the OpenCL C++ wrapper API, and many of the Vitis Examples are written using the C++ API. For more information on this C++ wrapper API, refer to https://www.khronos.org/registry/OpenCL/specs/opencl-cplusplus-1.2.pdf.

Platform

Upon initialization, the host application needs to identify a platform composed of one or more Xilinx devices. The following code fragment shows a common method of identifying a Xilinx platform.

cl_platform_id platform_id;         // platform id

err = clGetPlatformIDs(16, platforms, &platform_count);
    
// Find Xilinx Platform
for (unsigned int iplat=0; iplat<platform_count; iplat++) {
  err = clGetPlatformInfo(platforms[iplat], 
    CL_PLATFORM_VENDOR, 
    1000, 
    (void *)cl_platform_vendor,
    NULL);

  if (strcmp(cl_platform_vendor, "Xilinx") == 0) { 
  // Xilinx Platform found
  platform_id = platforms[iplat];
  }
}

The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Then, clGetPlatformInfo is used to identify Xilinx device based platforms by matching cl_platform_vendor with the string "Xilinx".

Note: Though it is not explicitly shown in the preceding code, or in other host code examples used throughout this chapter, it is always a good coding practice to use error checking after each of the OpenCL API calls. This can help debugging and improve productivity when you are debugging the host and kernel code in the emulation flow, or during hardware execution. The following code fragment is an error checking code example for the clGetPlatformIDs command.

err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
  printf("Error: Failed to find an OpenCL platform!\n");
  printf("Test failed\n");
  exit(1);
}

Devices

After a Xilinx platform is found, the application needs to identify the corresponding Xilinx devices.

The following code demonstrates finding all the Xilinx devices, with an upper limit of 16, by using API clGetDeviceIDs.

cl_device_id devices[16];  // compute device id
char cl_device_name[1001];
    
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 
  16, devices, &num_devices);
    
printf("INFO: Found %d devices\n", num_devices);
    
//iterate all devices to select the target device.
for (uint i=0; i<num_devices; i++) {
  err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, 1024, cl_device_name, 0);
  printf("CL_DEVICE_NAME %s\n", cl_device_name);
}

IMPORTANT: The clGetDeviceIDs API is called with the platform_id and CL_DEVICE_TYPE_ACCELERATOR to receive all the available Xilinx devices.

Sub-Devices

In the Vitis core development kit, sometimes devices contain multiple kernel instances of a single kernel or of different kernels. While the OpenCL API clCreateSubDevices allows the host code to divide a device into multiple sub-devices, the Vitis core development kit supports equally divided sub-devices (using CL_DEVICE_PARTITION_EQUALLY ), each containing one kernel instance.

The following example shows:

Sub-devices created by equal partition to execute one kernel instance per sub-device.
Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
The API related to kernel execution (and corresponding buffer related) code is not shown for the sake of simplicity, but would be described inside the function run_cu.

cl_uint num_devices = 0;
  cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY,1,0};
  
  // Get the number of sub-devices
  clCreateSubDevices(device,props,0,nullptr,&num_devices);  
  
  // Container to hold the sub-devices
  std::vector<cl_device_id> devices(num_devices);  

  // Second call of clCreateSubDevices    
  // We get sub-device handles in devices.data()
  clCreateSubDevices(device,props,num_devices,devices.data(),nullptr); 

  // Iterating over sub-devices
  std::for_each(devices.begin(),devices.end(),[kernel](cl_device_id sdev) {
      
	  // Context for sub-device
      auto context = clCreateContext(0,1,&sdev,nullptr,nullptr,&err);  
      
	  // Command-queue for sub-device
      auto queue = clCreateCommandQueue(context,sdev,
      CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&err); 
      
      // Execute the kernel on the sub-device using local context and 
	queue run_cu(context,queue,kernel); // Function not shown 
  });

IMPORTANT: As shown in the example, you must create a separate context for each sub-device. Though OpenCL supports a context that can hold multiple devices and sub-devices, XRT requires each device and sub-device to have a separate context.

Context

The clCreateContext API is used to create a context that contains a Xilinx device that will communicate with the host machine.

context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

In the code example, the clCreateContext API is used to create a context that contains one Xilinx device. Xilinx recommends creating only one context per device or sub-device. However, the host program should use multiple contexts if sub-devices are used with one context for each sub-device.

Command Queues

The clCreateCommandQueue API creates one or more command queues for each device. The FPGA can contain multiple kernels, which can be either the same or different kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:

Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. XRT dispatches kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
Multiple in-order command queue: Each kernel execution is requested from different in-order command queues. In such cases, XRT dispatches kernels from the different command queues, improving performance by running them concurrently on the device.

Note: For improved performance, Xilinx recommends using a single out-of-order command queue and manage event dependencies and synchronizations explicitly, instead of using multiple command queues.

The following is an example of standard API calls to create in-order and out-of-order command queues.

// Out-of-order Command queue
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// In-order Command Queue
commands = clCreateCommandQueue(context, device_id, 0, &err);

Program

The host and kernel code are compiled separately to create separate executable files: the host program executable and the FPGA binary (.xclbin). When the host application runs, it must load the .xclbin file using the clCreateProgramWithBinary API.

The following code example shows how the standard OpenCL API is used to build the program from the .xclbin file.

unsigned char *kernelbinary;
char *xclbin = argv[1];

printf("INFO: loading xclbin %s\n", xclbin);
 
int size=load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size; 

cl_program program = clCreateProgramWithBinary(context, 1, &device_id, 
					 &size_var,(const unsigned char **) &kernelbinary, 
					 &status, &err);

// Function 
int load_file_to_memory(const char *filename, char **result)
{
  uint size = 0;
  FILE *f = fopen(filename, "rb");
  if (f == NULL) {
    *result = NULL;
    return -1; // -1 means file opening fail
  }
  fseek(f, 0, SEEK_END);
  size = ftell(f);
  fseek(f, 0, SEEK_SET);
  *result = (char *)malloc(size+1);
  if (size != fread(*result, sizeof(char), size, f)) {
    free(*result);
    return -2; // -2 means file reading fail
  }
  fclose(f);
  (*result)[size] = 0;
  return size;
}

The example performs the following steps:

The kernel binary file, .xclbin, is passed in from the command line argument, argv[1].
TIP: Passing the .xclbin through a command line argument is one approach. You can also hardcode the kernel binary file in the host program, define it with an environment variable, read it from a custom initialization file, or another suitable mechanism.
The load_file_to_memory function is used to load the file contents in the host machine memory space.
The clCreateProgramWithBinary API is used to complete the program creation process in the specified context and device.

Executing Commands in the FPGA

Once the OpenCL environment is initialized, the host application is ready to issue commands to the device and interact with the kernels. These commands include:

Setting up the kernels.
Buffer transfer to/from the FPGA.
Kernel execution on FPGA.
Event synchronization.

Setting Up Kernels

After setting up the runtime environment, such as identifying devices, creating the context, command queue, and program, the host application should identify the kernels that will execute on the device, and set up the kernel arguments.

The OpenCL API clCreateKernel should be used to access the kernels contained within the .xclbin file (the "program"). The cl_kernel object identifies a kernel in the program loaded into the FPGA that can be run by the host application. The following code example identifies two kernels defined in the loaded program.

kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);            
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err);  // etc

Setting Kernel Arguments

In the Vitis software platform, two types of arguments can be set for kernel objects:

Scalar arguments are used for small data transfer, such as constant or configuration type data. These are write-only arguments from the host application perspective, meaning they are inputs to the kernel.
Memory buffer arguments are used for large data transfer. The value is a pointer to a memory object created with the context associated with the program and kernel objects. These can be inputs to, or outputs from the kernel.

Kernel arguments can be set using the clSetKernelArg command, as shown in the following example for setting kernel arguments for two scalar and two buffer arguments.

// Create memory buffers
cl_mem dev_buf1 = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, size, &host_mem_ptr1, NULL);
cl_mem dev_buf2 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, &host_mem_ptr2, NULL);

int err = 0;
// Setup scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width); 
cl_uint scalar_arg_image_height = 2160; 
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height); 
    
// Setup buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_buf1);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_buf2);

IMPORTANT: Although OpenCL allows setting kernel arguments any time before enqueuing the kernel, you should set kernel arguments as early as possible. XRT will error out if you try to migrate a buffer before XRT knows where to put it on the device. Therefore, set the kernel arguments before performing any enqueue operation (for example, clEnqueueMigrateMemObjects) on any buffer.

For all kernel buffer arguments you must allocate the buffer on the device global memories. However, sometimes the content of the buffer is not required before the start of the kernel execution. For example, the output buffer content will only be populated during the kernel execution, and hence it is not important prior to kernel execution. In this case, you should specify clEnqueueMigrateMemObject with the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED flag so that migration of the buffer will not involve the DMA operation between the host and the device, thus improving performance.

Buffer Allocation on the Device

By default, when kernels are linked to the platform the memory interfaces from all the kernels are connected to a single default global memory bank. As a result, only a single compute unit (CU) can transfer data to and from the global memory bank at one time, limiting the overall performance of the application.

If the device contains only one global memory bank, then this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the memory interface connection for a kernel during linking. The method for performing this is discussed in detail in Mapping Kernel Ports to Memory. Overall performance is improved by using separate memory banks for different kernels or compute units, enabling multiple kernel memory interfaces to concurrently read and write data.

IMPORTANT: XRT must detect the kernel's memory connection to send data from the host program to the correct memory location for the kernel. XRT will automatically find the buffer location from the kernel binary files if clSetKernelArgs is used before any enqueue operation on the buffer, such as clEnqueueMigrateMemObject.

Buffer Creation and Data Transfer

Interactions between the host program and hardware kernels rely on creating buffers and transferring data to and from the memory in the device. This process makes use of functions like clCreateBuffer and clEnqueueMigrateMemObjects.

IMPORTANT: A single buffer cannot be bigger than 4 GB, yet to maximize throughput from the host to global memory, Xilinx also recommends keeping the buffer size at least 2 MB if possible.

There are two methods for allocating memory buffers, and transferring data:

Letting XRT Allocate Buffers
Using Host Pointer Buffers

In the case where XRT allocates the buffer, use enqueueMapBuffer to capture the buffer handle. In the second case, allocate the buffer directly with CL_MEM_USE_HOST_PTR, so you do not need to capture the handle.

There are a number of coding practices you can adopt to maximize performance and fine-grain control. The OpenCL API supports additional commands for reading and writing buffers. For example, you can use clEnqueueWriteBuffer and clEnqueueReadBuffer commands in place of clEnqueueMigrateMemObjects. However, some of these commands have different effects that must be understood when using them. For example, clEnqueueReadBufferRect can read a rectangular region of a buffer object to the host application, but it does not transfer the data from the device global memory to the host. You must first use clEnqueueReadBuffer to transfer the data from the device global memory, and then use clEnqueueReadBufferRect to read the desired rectangular portion into the host application.

Letting XRT Allocate Buffers

On data center platforms, it is more efficient to allocate memory aligned on 4k page boundaries. On embedded platforms it is more efficient to perform contiguous memory allocation. In either case, you can let the XRT allocate host memory when creating the buffers. This is done by using the CL_MEM_ALLOC_HOST_PTR flag when creating the buffers, and then mapping the allocated memory to user-space pointers using clEnqueueMapBuffer. With this approach, it is not necessary to create a host space pointer aligned to the 4K boundary.

The clEnqueueMapBuffer API maps the specified buffer and returns a pointer created by XRT to this mapped region. Then, fill the host side pointer with your data, followed by clEnqueueMigrateMemObject to transfer the data to and from the device. The following code example uses this style:

// Two cl_mem buffer, for read and write by kernel
cl_mem dev_mem_read_ptr = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
    				 sizeof(int) * number_of_words, NULL, NULL); 

cl_mem dev_mem_write_ptr = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY,
    				 sizeof(int) * number_of_words, NULL, NULL); 


cl::Buffer in1_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,  sizeof(int) * DATA_SIZE, NULL, &err);

// Setting arguments
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_read_ptr); 
clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_mem_write_ptr); 

// Get Host side pointer of the cl_mem buffer object
auto host_write_ptr = clEnqueueMapBuffer(queue,dev_mem_read_ptr,true,CL_MAP_WRITE,0,bytes,0,nullptr,nullptr,&err);
auto host_read_ptr = clEnqueueMapBuffer(queue,dev_mem_write_ptr,true,CL_MAP_READ,0,bytes,0,nullptr,nullptr,&err);

// Fill up the host_write_ptr to send the data to the FPGA

for(int i=0; i< MAX; i++) {
    host_write_ptr[i] = <.... > 
}

// Migrate
cl_mem mems[2] = {host_write_ptr,host_read_ptr};
clEnqueueMigrateMemObjects(queue,2,mems,0,0,nullptr,&migrate_event));

// Schedule the kernel
clEnqueueTask(queue,kernel,1,&migrate_event,&enqueue_event);

// Migrate data back to host
clEnqueueMigrateMemObjects(queue, 1, &dev_mem_write_ptr, 
                           CL_MIGRATE_MEM_OBJECT_HOST,1,&enqueue_event, &data_read_event);
     
clWaitForEvents(1,&data_read_event);

// Now use the data from the host_read_ptr

To work with an example using clEnqueueMapBuffer, refer to Data Transfer (C) in the Vitis Examples GitHub repository.

Using Host Pointer Buffers

IMPORTANT: Using CL_MEM_USE_HOST_PTR is not recommended for embedded platforms. Embedded platforms require contiguous memory allocation and should use the CL_MEM_ALLOC_HOST_PTR method, as described in Letting XRT Allocate Buffers.

There are two main parts of a cl_mem object: host side pointer and device side pointer. Before the kernel starts its operation, the device side pointer is implicitly allocated on the device side memory (for example, on a specific location inside the device global memory) and the buffer becomes a resident on the device. Using clEnqueueMigrateMemObjects this allocation and data transfer occur upfront, much ahead of the kernel execution. This especially helps to enable software pipelining if the host is executing the same kernel multiple times, because data transfer for the next transaction can happen when kernel is still operating on the previous data set, and thus hide the data transfer latency of successive kernel executions.

The OpenCL framework provides a number of APIs for transferring data between the host and the device. Typically, data movement APIs, such as clEnqueueWriteBuffer and clEnqueueReadBuffer, implicitly migrate memory objects to the device after they are enqueued. They do not guarantee when the data is transferred, and this makes it difficult for the host application to synchronize the movement of memory objects with the computation performed on the data.

Xilinx recommends using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve the performance. Using this API, memory migration can be explicitly performed ahead of the dependent commands. This allows the host application to preemptively change the association of a memory object, through regular command queue scheduling, to prepare for another upcoming command. This also permits an application to overlap the placement of memory objects with other unrelated operations before these memory objects are needed, potentially hiding or reducing data transfer latencies. After the event associated with clEnqueueMigrateMemObjects has been marked complete, the host program knows the memory objects have been successfully migrated.

TIP: Another advantage of clEnqueueMigrateMemObjects is that it can migrate multiple memory objects in a single API call. This reduces the overhead of scheduling and calling functions to transfer data for more than one memory object.

The following code shows the use of clEnqueueMigrateMemObjects:

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
      
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}

cl_mem dev_mem_ptr = clCreateBuffer(context,  
    				 CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
    				 sizeof(int) * number_of_words, host_mem_ptr, NULL); 

clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr); 

err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0, 
	  NULL, NULL);

Allocating Page-Aligned Host Memory

XRT allocates memory space in 4K boundary for internal memory management. If the host memory pointer is not aligned to a page boundary, XRT performs extra memcpy to make it aligned. Hence you should align the host memory pointer with the 4K boundary to save the extra memory copy operation.

The following is an example of how posix_memalign is used instead of malloc for the host memory space pointer.

int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));
// Aligning memory in 4K boundary
posix_memalign(&host_mem_ptr,4096,MAX_LENGTH*sizeof(int)); 
 
// Fill the memory input       
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}

cl_mem dev_mem_ptr = clCreateBuffer(context, 
				     CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,  
  			       sizeof(int) * number_of_words, host_mem_ptr, NULL); 

err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0, 
      NULL, NULL);

Sub-Buffers

Though not very common, using sub-buffers can be very useful in specific situations. The following sections discuss the scenarios where using sub-buffers can be beneficial.

Reading a Specific Portion from the Device Buffer

Consider a kernel that produces different amounts of data depending on the input to the kernel. For example, a compression engine where the output size varies depending on the input data pattern and similarity. The host can still read the whole output buffer by using clEnqueueMigrateMemObjects, but that is a suboptimal approach as more than the required memory transfer would occur. Ideally the host program should only read the exact amount of data that the kernel has written.

One technique is to have the kernel write the amount of the output data at the start of writing the output data. The host application can use clEnqueueReadBuffer two times, first to read the amount of data being returned, and second to read exact amount of data returned by the kernel based on the information from the first read.

clEnqueueReadBuffer(command_queue,device_write_ptr, CL_FALSE, 0, sizeof(int) * 1, 
                    &kernel_write_size, 0, nullptr, &size_read_event);
clEnqueueReadBuffer(command_queue,device_write_ptr, CL_FALSE, DATA_READ_OFFSET, 
                    kernel_write_size, host_ptr, 1, &size_read_event, &data_read_event);

With clEnqueueMigrateMemObject, which is recommended over clEnqueueReadBuffer or clEnqueueWriteBuffer, you can adopt a similar approach by using sub-buffers. This is shown in the following code sample.

TIP: The code sample shows only partial commands to demonstrate the concept.

//Create a small sub-buffer to read the quantity of data
cl_buffer_region buffer_info_1={0,1*sizeof(int)}; 
cl_mem size_info = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY, 
      CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_1, &err);

// Map the sub-buffer into the host space
auto size_info_host_ptr = clEnqueueMapBuffer(queue, size_info,,,, );

// Read only the sub-buffer portion
clEnqueueMigrateMemObjects(queue, 1, &size_info, CL_MIGRATE_MEM_OBJECT_HOST,,,);
                          
// Retrive size information from the already mapped size_info_host_ptr
kernel_write_size = ........... 

// Create sub-buffer to read the required amount of data     
cl_buffer_region buffer_info_2={DATA_READ_OFFSET, kernel_write_size};
cl_mem  buffer_seg = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY, 
      CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_2,&err);

// Map the subbuffer into the host space
auto read_mem_host_ptr = clEnqueueMapBuffer(queue, buffer_seg,,,);

// Migrate the subbuffer
clEnqueueMigrateMemObjects(queue, 1, &buffer_seg, CL_MIGRATE_MEM_OBJECT_HOST,,,);

// Now use the read data from already mapped read_mem_host_ptr

Device Buffer Shared by Multiple Memory Ports or Multiple Kernels

Sometimes memory ports of kernels only require small amounts of data. However, managing small sized buffers, transferring small amounts of data, may have potential performance issues for your application. Alternatively, your host program can create a larger size buffer, divided into smaller sub-buffers. Each sub-buffer is assigned as a kernel argument as discussed in Setting Kernel Arguments, for each of those memory ports requiring small amounts of data.

Once sub-buffers are created they are used in the host code similar to regular buffers. This can improve performance as XRT handles a large buffer in a single transaction, instead of several small buffers and multiple transactions.

Kernel Execution

Often the compute intensive task required by the host application can be defined inside a single kernel, and the kernel is executed only once to work on the entire data range. Because there is an overhead associated with multiple kernel executions, invoking a single monolithic kernel can improve performance. Though the kernel is executed only one time, and works on the entire range of the data, the parallelism is achieved on the FPGA inside the kernel hardware. If properly coded, the kernel is capable of achieving parallelism by various techniques such as instruction-level parallelism (loop pipeline) and function-level parallelism (dataflow). These different kernel coding techniques are discussed in C/C++ Kernels.

When the kernel is compiled to a single hardware instance (or CU) on the FPGA, the simplest method of executing the kernel is using clEnqueueTask as shown below.

err = clEnqueueTask(commands, kernel, 0, NULL, NULL);

XRT schedules the workload, or the data passed through OpenCL buffers from the kernel arguments, and schedules the kernel tasks to run on the accelerator on the Xilinx FPGA.

IMPORTANT: Though using clEnqueueNDRangeKernel is supported (only for OpenCL kernel), Xilinx recommends using clEnqueueTask.

However, sometimes using a single clEnqueueTask to run the kernel is not always feasible due to various reasons. For example, the kernel code can become too big and complex to optimize if it attempts to perform all compute intensive tasks in a single execution. Sometimes multiple kernels can be designed performing different tasks on the FPGA in parallel, requiring multiple enqueue commands. Or the host application can be receiving data over time, and not all the data can be processed at one time. Therefore, depending on the situation and application, you may need to break the data and the task of the kernel into multiple clEnqueueTask commands. In this case, an out-of-order command queue, or an in-order command queue can determine how the kernel tasks are processed as explained in Command Queues. In addition, multiple kernel tasks can be implemented as blocking events, or non-blocking events as described in Event Synchronization. These can all affect the performance of the design.

The following topics discuss various methods you can use to run a kernel, run multiple kernels, or run multiple instances of the same kernel on the accelerator.

Task Parallelism Using Different Kernels

Sometimes the compute intensive task required by the host application can be broken into multiple, different kernels designed to perform different tasks on the FPGA in parallel. By using multiple clEnqueueTask commands in an out-of-order command queue, for example, you can have multiple kernels performing different tasks, running in parallel. This enables the task parallelism on the FPGA.

Spatial Data Parallelism: Increase Number of Compute Units

Sometimes the compute intensive task required by the host application can process the data across multiple hardware instances of the same kernel, or compute units (CUs) to achieve data parallelism on the FPGA. If a single kernel has been compiled into multiple CUs, the clEnqueueTask command can be called multiple times in an out-of-order command queue, to enable data parallelism. Each call of clEnqueueTask would schedule a workload of data in different CUs, working in parallel.

Temporal Data Parallelism: Host-to-Kernel Dataflow

Sometimes, the data processed by a compute unit passes from one stage of processing in the kernel, to the next stage of processing. In this case, the first stage of the kernel may be free to begin processing a new set of data. In essence, like a factory assembly line, the kernel can accept new data while the original data moves down the line.

To understand this approach, assume a kernel has only one CU on the FPGA, and the host application enqueues the kernel multiple times with different sets of data. As shown in Using Host Pointer Buffers, the host application can migrate data to the device global memory ahead of the kernel execution, thus hiding the data transfer latency by the kernel execution, enabling software pipelining.

However, by default, a kernel can only start processing a new set of data only when it has finished processing the current set of data. Although clEnqueueMigrateMemObject hides the data transfer time, multiple kernel executions still remain sequential.

By enabling host-to-kernel dataflow, it is possible to further improve the performance of the accelerator by restarting the kernel with a new set of data while the kernel is still processing the previous set of data. As discussed in Enabling Host-to-Kernel Dataflow, the kernel must implement the ap_ctrl_chain interface, and must be written to permit processing data in stages. In this case, XRT restarts the kernel as soon as it is able to accept new data, thus overlapping multiple kernel executions. However, the host program must keep the command queue filled with requests so that the kernel can restart as soon as it is ready to accept new data.

The following is a conceptual diagram for host-to-kernel dataflow.

The longer the kernel takes to process a set of data from start to finish, the greater the opportunity to use host-to-kernel dataflow to improve performance. Rather than waiting until the kernel has finished processing one set of data, simply wait until the kernel is ready to begin processing the next set of data. This allows temporal parallelism, where different stages of the same kernel processes a different set of data from multiple clEnqueueTask commands, in a pipelined manner.

For advanced designs, you can effectively use both the spatial parallelism with multiple CUs to process data, combined with temporal parallelism using host-to-kernel dataflow, overlapping kernel executions on each compute unit.

IMPORTANT: Embedded processor platforms do not support the host-to-kernel dataflow feature.

Enabling Host-to-Kernel Dataflow

If a kernel is capable of accepting more data while it is still operating on data from the previous transactions, XRT can send the next batch of data. The kernel then works on multiple data sets in parallel at different stages of the algorithm, thus improving performance. To support host-to-kernel dataflow, the kernel has to implement the ap_ctrl_chain protocol using the pragma HLS interface for the function return:

void kernel_name( int *inputs,
                  ...         )// Other input or Output ports
{
#pragma HLS INTERFACE  .....   // Other interface pragmas
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control

IMPORTANT: To take advantage of the host-to-kernel dataflow, the kernel must also be written to process data in stages, such as pipelined at the loop-level as discussed in Loop Pipelining, or pipelined at the task-level as discussed in Dataflow Optimization.

Symmetrical and Asymmetrical Compute Units

As discussed in Creating Multiple Instances of a Kernel, multiple compute units (CUs) of a single kernel can be instantiated on the FPGA during the kernel linking process. CUs can be considered symmetrical or asymmetrical with regard to other CUs of the same kernel.

Symmetrical: CUs are considered symmetrical when they have exactly the same connectivity.sp options, and therefore have identical connections to global memory. As a result, the Xilinx Runtime can use them interchangeably. A call to clEnqueueTask can result in the invocation of any instance in a group of symmetrical CUs.
Asymmetrical: CUs are considered asymmetrical when they do not have exactly the same connectivity.sp options, and therefore do not have identical connections to global memory. Using the same setup of input and output buffers, it is not possible for XRT to execute asymmetrical CUs interchangeably.

Kernel Handle and Compute Units

The first time clSetKernelArg is called for a given kernel object, XRT identifies the group of symmetrical CUs for subsequent executions of the kernel. When clEnqueueTask is called for that kernel, any of the symmetrical CUs in that group can be used to process the task.

If all CUs for a given kernel are symmetrical, a single kernel object is sufficient to access any of the CUs. However, if there are asymmetrical CUs, the host application will need to create a unique kernel object for each group of asymmetrical CUs. In this case, the call to clEnqueueTask must specify the kernel object to use for the task, and any matching CU for that kernel can be used by XRT.

Creating Kernel Objects for Specific Compute Units

For creating kernels associated with specific compute units, the clCreateKernel command supports specifying the CUs at the time the kernel object is created by the host program. The syntax of this command is shown below:

// Create kernel object only for a specific compute unit 
cl_kernel kernelA = clCreateKernel(program,"<kernel_name>:{compute_unit_name}",&err);
// Create a kernel object for two specific compute units 
cl_kernel kernelB = clCreateKernel(program, "<kernel_name>:{CU1,CU2}", &err);

IMPORTANT: As discussed in Creating Multiple Instances of a Kernel, the number of CUs is specified by the connectivity.nk option in a config file used by the v++ command during linking. Therefore, whatever is specified in the host program, to create or enqueue kernel objects, must match the options specified by the config file used during linking.

In this case, the Xilinx Runtime identifies the kernel handles (kernelA, kernelB) for specific CUs, or group of CUs, when the kernel is created. This lets you control which kernel configuration, or specific CU instance is used, when using clEnqueueTask from within the host program. This can be useful in the case of asymmetrical CUs, or to perform load and priority management of CUs.

Using Compute Unit Name to Get Handle of All Asymmetrical Compute Units

If a kernel instantiates multiple CUs that are not symmetrical, the clCreateKernel command can be specified with CU names to create different CU groups. In this case, the host program can reference a specific CU group by using the cl_kernel handle returned by clCreateKernel.

In the following example, the kernel mykernel has five CUs: K1, K2, K3, K4, and K5. The K1, K2, and K3 CUs are a symmetrical group, having symmetrical connection on the device. Similarly, CUs K4 and K5 form a second symmetrical CU group. The following code segment shows how to address a specific CU group using cl_kernel handles.

// Kernel handle for Symmetrical compute unit group 1: K1,K2,K3
cl_kernel kernelA = clCreateKernel(program,"mykernel:{K1,K2,K3}",&err);

for(i=0; i<3; i++) {
  // Creating buffers for the kernel_handle1
  .....
  // Setting kernel arguments for kernel_handle1
  .....
  // Enqueue buffers for the kernel_handle1
  .....
  // Possible candidates of the executions K1,K2 or K3
  clEnqueueTask(commands, kernelA, 0, NULL, NULL); 
  //
}

// Kernel handle for Symmetrical compute unit group 1: K4, K5
cl_kernel kernelB = clCreateKernel(program,"mykernel:{K4,K5}",&err);

for(int i=0; i<2; i++) {
  // Creating buffers for the kernel_handle2
  .....
  // Setting kernel arguments for kernel_handle2
  .....
  // Enqueue buffers for the kernel_handle2
  .....
  // Possible candidates of the executions K4 or K5
  clEnqueueTask(commands, kernelB, 0, NULL, NULL);
}

Event Synchronization

All OpenCL enqueue-based API calls are asynchronous. These commands will return immediately after the command is enqueued in the command queue. To pause the host program to wait for results, or resolve any dependencies among the commands, an API call such as clFinish or clWaitForEvents can be used to block execution of the host program.

The following code shows examples for clFinish and clWaitForEvents.

err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue); 

// Create event, read memory from device, wait for read to complete, verify results
cl_event readevent;
// host memory for output vector
int host_mem_output_ptr[MAX_LENGTH]; 
//Enqueue ReadBuffer, with associated event object 
clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words, 
  host_mem_output_ptr, 0, NULL, &readevent );
// Wait for clEnqueueReadBuffer event to finish
clWaitForEvents(1, &readevent); 
// After read is complete, verify results
...

Note how the commands have been used in the example above:

The clFinish API has been explicitly used to block the host execution until the kernel execution is finished. This is necessary otherwise the host can attempt to read back from the FPGA buffer too early and may read garbage data.
The data transfer from FPGA memory to the local host machine is done through clEnqueueReadBuffer. Here the last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command, and can be used to query the event, or wait for this particular command to complete. The clWaitForEvents command specifies a single event (the readevent), and waits to ensure the data transfer is finished before verifying the data.

Post-Processing and FPGA Cleanup

At the end of the host code, all the allocated resources should be released by using proper release functions. If the resources are not properly released, the Vitis core development kit might not able to generate a correct performance related profile and analysis report.

clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID); 
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);

OpenCL Kernel Development

The following OpenCL™ kernel discussion is based on the information provided in the C/C++ Kernels topic. The same programming techniques for accelerating the performance of a kernel apply to both C/C++ and OpenCL kernels. However, the OpenCL kernel uses the __attribute syntax in place of pragmas. For details of the available attributes, refer to OpenCL Attributes.

The following code examples show some of the elements of an OpenCL kernel for the Vitis™ application acceleration development flow. This is not intended to be a primer on OpenCL or kernel development, but to merely highlight some of the key difference between OpenCL and C/C++ kernels.

Kernel Signature

In C/C++ kernels, the kernel is identified on the Vitis compiler command line using the v++ --kernel option. However, in OpenCL code, the __kernel keyword identifies a kernel in the code. You can have multiple kernels defined in a single .cl file, and the Vitis compiler will compile all of the kernels, unless you specify the --kernel option to identify which kernel to compile.

__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void apply_watermark(__global const TYPE * __restrict input, 
   __global TYPE * __restrict output, int width, int height) {
{
   ...
}

TIP: The complete code for the kernel function above, apply_watermark, can be found in the Global Memory Two Banks (CL) example in the Vitis Accel Examples GitHub repository.

In the example above, you can see the watermark kernel has two pointer type arguments: input and output, and has two scalar type int arguments: width and height.

In C/C++ kernels, these arguments would need to be identified with the HLS INTERFACE pragmas. However, in the OpenCL kernel, the Vitis compiler, and Vitis HLS recognize the kernel arguments, and compile them as needed: pointer arguments into m_axi interfaces, and scalar arguments into s_axilite interfaces.

Kernel Optimizations

Because the kernel is running in programmable logic on the target platform, optimizing your task to the environment is an important element of application design. Most of the optimization techniques discussed in C/C++ Kernels can be applied to OpenCL kernels. Instead of applying the HLS pragmas used for C/C++ kernels, you will use the __attribute__ keyword described in OpenCL Attributes. Following is an example:

// Process the whole image 
__attribute__((xcl_pipeline_loop))
image_traverse: for (uint idx = 0, x = 0 , y = 0  ; idx < size ; ++idx, x+= DATA_SIZE)
{
   ...
}

The example above specifies that the for loop, image_traverse, should be pipelined to improve the performance of the kernel. The target II in this case is 1. For more information, refer to xcl_pipeline_loop.

In the following code example, the watermark function uses the opencl_unroll_hint attribute to let the Vitis compiler unroll the loop to reduce latency and improve performance. However, in this case the __attribute__ is only a suggestion that the compiler can ignore if needed. For details, refer to opencl_unroll_hint.

//Unrolling below loop to process all 16 pixels concurrently
__attribute__((opencl_unroll_hint))
watermark: for ( int i = 0 ; i < DATA_SIZE ; i++)
{
   ...
}

For more information, review the OpenCL Attributes topics to see what specific optimizations are supported for OpenCL kernels, and review the C/C++ Kernels content to see how these optimizations can be applied in your kernel design.

Setting Data Width in OpenCL Kernels

For OpenCL kernels, the API provides attributes to support incrementing AXI data width usage. To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm:

vec_type_hint
reqd_work_group_size
xcl_zero_global_work_offset

Examine the combined functionality on the following case:

__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
    size_t idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}

In this case, the hard coded interface is a 32-bit wide data path (int *c, int* a, int *b), which drastically limits the memory throughput if implemented directly. However, the automatic widening and transformation is applied, based on the values of the three attributes.

__attribute__((vec_type_hint(int)))

Declares that int is the main type used for computation and memory transfer (32-bit). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32-bit. This implies that in theory, 16 values could be processed if vectorization can be applied.

__attribute__((reqd_work_group_size(X, Y, Z)))

Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items therefore defining the maximum possible vectorization factor which would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1=64.

The actual vectorization factor to be applied will be the greatest common divider of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.

The quotient of maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.

__attribute__((xcl_zero_global_work_offset))

The __attribute__((xcl_zero_global_work_offset)) instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information with regard to alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).

It should be noted, that the application of these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of local arrays in which data is stored. This usually behaves nicely, but can interact poorly in rare situations.

For example:

For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor.
- The resulting access requires a lot of multiplexers and will create a difficult issue for the scheduler (might severely increase memory usage and compilation time). Xilinx recommends using partitioning factors that are powers of two (as the vectorization factor is always a power of two).
If the loop being vectorized has an unrelated resource constraint, the scheduler complains about II not being met.
- This is not necessarily correlated with a loss of performance (usually it is still performing better) because the II is computed on the unrolled loop (which has therefore a multiplied throughput for each iteration).
- The scheduler informs you of the possible resources constraints and resolving those will further improve the performance.
- Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in non-vectorizable method).

Reducing Kernel to Kernel Communication Latency in OpenCL Kernels

The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency. For more information, see Pipe Functions on Version 2.0 of the OpenCL C Specification from Khronos Group.

In the Vitis IDE, pipes must be statically defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not supported. The depth of a pipe must be specified by using the OpenCL attribute xcl_reqd_pipe_depth in the pipe declaration. For more information, see xcl_reqd_pipe_depth.

As specified in xcl_reqd_pipe_depth, the valid depth values are as follows: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768.

A given pipe can have one and only one producer and consumer in different kernels.

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));

Pipes can be accessed using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode.

TIP: Reading or writing to pipes using the non-blocking read_pipe() or write_pipe() functions is not supported.

The status of pipes can be queried using OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions.

The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.

int read_pipe_block (pipe gentype p, gentype *ptr) 
int write_pipe_block (pipe gentype p, const gentype *ptr)

The following “dataflow/dataflow_pipes_ocl” from Xilinx Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using blocking read_pipe_block() and write_pipe_block() functions:

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
    __attribute__((xcl_pipeline_loop)) 
    mem_rd: for (int i = 0 ; i < size ; i++)
    {
        //blocking Write command to pipe P0
        write_pipe_block(p0, &input[i]);
    }
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result 
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
    __attribute__((xcl_pipeline_loop))
    execute: for(int i = 0 ; i < size ;  i++)
    {
        int input_data, output_data;
        //blocking read command to Pipe P0
        read_pipe_block(p0, &input_data);
        output_data = input_data + inc;
        //blocking write command to Pipe P1
        write_pipe_block(p1, &output_data);
    }
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
    __attribute__((xcl_pipeline_loop))
    mem_wr: for (int i = 0 ; i < size ; i++)
    {
        //blocking read command to Pipe P1
        read_pipe_block(p1, &output[i]);
    }
}

The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.

OpenCL Attributes

This section describes OpenCL™ attributes that can be added to source code to assist system optimization by the Vitis core development kit, and Vitis HLS tool synthesis.

The Vitis core development kit provides OpenCL attributes to optimize your code for data movement and kernel performance. The goal of data movement optimization is to maximize the system level data throughput by maximizing interface bandwidth usage and DDR bandwidth usage. The goal of kernel computation optimization is to create processing logic that can consume all the data as soon as they arrive at kernel interfaces. This is generally achieved by expanding the processing code to match the data path with techniques, such as function inlining and pipelining, loop unrolling, array partitioning, dataflowing, and so on.

The following table includes the OpenCL attributes are specified by type.

Table 1. OpenCL Attributes by Type
Type	Attributes
Kernel Optimization	reqd_work_group_size vec_type_hint work_group_size_hint xcl_latency xcl_max_work_group_size xcl_zero_global_work_offset
Function Inlining	always_inline
Task-level Pipeline	xcl_dataflow xcl_reqd_pipe_depth
Pipeline	xcl_pipeline_loop xcl_pipeline_workitems
Loop Optimization	opencl_unroll_hint xcl_loop_tripcount xcl_pipeline_loop
Array Optimization	xcl_array_partition xcl_array_reshape Note: Array variables only accept a single array optimization attribute.

TIP: The Vitis compiler also supports many of the standard attributes supported by gcc, such as:

ALWAYS_INLINE
NOINLINE
UNROLL
NOUNROLL

always_inline

Description

The ALWAYS_INLINE attribute indicates that a function must be inlined. This attribute is a standard feature of GCC, and a standard feature of the Vitis compiler.

TIP: The NOINLINE attribute is also a standard feature of GCC, and is also supported by the Vitis compiler.

This attribute enables a compiler optimization to have a function inlined into the calling function. The inlined function is dissolved and no longer appears as a separate level of hierarchy in the RTL.

In some cases, inlining a function allows operations within the function to be shared and optimized more effectively with surrounding operations in the calling function. However, an inlined function can no longer be shared with other functions, so the logic might be duplicated between the inlined function and a separate instance of the function which can be more broadly shared. While this can improve performance, this will also increase the area required for implementing the RTL.

For OpenCL kernels, the Vitis compiler uses its own rules to inline or not inline a function. To directly control inlining functions, use the ALWAYS_INLINE or NOINLINE attributes.

By default, inlining is only performed on the next level of function hierarchy, not sub-functions.

IMPORTANT: When used with the XCL_DATAFLOW attribute, the compiler will ignore the ALWAYS_INLINE attribute and not inline the function.

Syntax

Place the attribute in the OpenCL API source before the function definition to always have it inlined whenever the function is called.

__attribute__((always_inline))

Examples

This example adds the ALWAYS_INLINE attribute to function foo:

__attribute__((always_inline))
  void foo ( a, b, c, d ) {
  ...
}

This example prevents the inlining of the function foo:

__attribute__((noinline))
  void foo ( a, b, c, d ) {
  ...
}

opencl_unroll_hint

Description

IMPORTANT: This is a compiler hint which the compiler can ignore.

Loop unrolling is an optimization technique available in the Vitis compiler. The purpose of the loop unroll optimization is to expose concurrency to the compiler. This newly exposed concurrency reduces latency and improves performance, but also consumes more FPGA fabric resources.

The OPENCL_UNROLL_HINT attribute is part of the OpenCL Specification, and specifies that loops (for, while, do) can be unrolled by the Vitis compiler. See Loop Unrolling for more information.

The OPENCL_UNROLL_HINT attribute qualifier must appear immediately before the loop to be affected. You can use this attribute to specify full unrolling of the loop, partial unrolling by a specified amount, or to disable unrolling of the loop.

Syntax

Place the attribute in the OpenCL source before the loop definition:

__attribute__((opencl_unroll_hint(<n>)))

Where:

<n> is an optional loop unrolling factor and must be a positive integer, or compile time constant expression. An unroll factor of 1 disables unrolling.
TIP: If <n> is not specified, the compiler automatically determines the unrolling factor for the loop.

Examples

The following example unrolls the for loop by a factor of 2. This results in two parallel loop iterations instead of four sequential iterations for the compute unit to complete the operation.

__attribute__((opencl_unroll_hint(2)))
for(int i = 0; i < LENGTH; i++) {
bufc[i] = bufa[i] * bufb[i];
}

Conceptually the compiler transforms the loop above to the following code.

for(int i = 0; i < LENGTH; i+=2) {
bufc[i] = bufa[i] * bufb[i];
bufc[i+1] = bufa[i+1] * bufb[i+1];
}

reqd_work_group_size

Description

When OpenCL API kernels are submitted for execution on an OpenCL device, they execute within an index space, called an ND range, which can have 1, 2, or 3 dimensions. This is called the global size in the OpenCL API. The work-group size defines the amount of the ND range that can be processed by a single invocation of a kernel compute unit (CU). The work-group size is also called the local size in the OpenCL API. The OpenCL compiler can determine the work-group size based on the properties of the kernel and selected device. After the work-group size (local size) is determined, the ND range (global size) is divided automatically into work-groups, and the work-groups are scheduled for execution on the device.

Although the OpenCL compiler can define the work-group size, the specification of the REQD_WORK_GROUP_SIZE attribute on the kernel to define the work-group size is highly recommended for FPGA implementations of the kernel. The attribute is recommended for performance optimization during the generation of the custom logic for a kernel.

TIP: In the case of an FPGA implementation, the specification of the REQD_WORK_GROUP_SIZE attribute is highly recommended as it can be used for performance optimization during the generation of the custom logic for a kernel.

OpenCL kernel functions are executed exactly one time for each point in the ND range index space. This unit of work for each point in the ND range is called a work-item. Work-items are organized into work-groups, which are the unit of work scheduled onto compute units. The optional REQD_WORK_GROUP_SIZE attribute defines the work-group size of a compute unit that must be used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel.

__attribute__((reqd_work_group_size(<X>, <Y>, <Z>)))

Where:

<X>, <Y>, <Z>: Specifies the ND range of the kernel. This represents each dimension of a three dimensional matrix specifying the size of the work-group for the kernel.

Examples

The following OpenCL C kernel code shows a vector addition design where two arrays of data are summed into a third array. The required size of the work-group is 16x1x1. This kernel will execute 16 times to produce a valid result.

#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void 
vadd(__global int* a,
__global int* b,
__global int* c)
{
int idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}

vec_type_hint

Description

IMPORTANT: This is a compiler hint which the compiler can ignore.

The optional __attribute__((vec_type_hint(<type>))) is part of the OpenCL Language Specification, and hints to the OpenCL compiler representing the computational width of the kernel, providing a basis for calculating processor bandwidth usage when the compiler is looking to auto-vectorize the code.

By default, the kernel is assumed to have the __attribute__((vec_type_hint(int))) qualifier. This lets you specify a different vectorization type.

Implicit in autovectorization is the assumption that any libraries called from the kernel must be re-compilable at runtime to handle cases where the compiler decides to merge or separate work items. This means that these libraries can never be hard-coded binaries or that hard-coded binaries must be accompanied either by source or some re-targetable intermediate representation. This might be a code security question for some.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel.

__attribute__((vec_type_hint(<type>)))

Where:

<type>: is one of the built-in vector types listed in the following table, or the constituent scalar element types.
Note: When not specified, the kernel is assumed to have an INT type.

Table 2. Vector Types
Type	Description
char<n>	A vector of <n> 8-bit signed two’s complement integer values.
uchar<n>	A vector of <n> 8-bit unsigned integer values.
short<n>	A vector of <n> 16-bit signed two’s complement integer values.
ushort<n>	A vector of <n> 16-bit unsigned integer values.
int<n>	A vector of <n> 32-bit signed two’s complement integer values.
uint<n>	A vector of <n> 32-bit unsigned integer values.
long<n>	A vector of <n> 64-bit signed two’s complement integer values.
ulong<n>	A vector of <n> 64-bit unsigned integer values.
float<n>	A vector of <n> 32-bit floating-point values.
double<n>	A vector of <n> 64-bit floating-point values.

Note: <n> is assumed to be 1 when not specified. The vector data type names defined above where <n> is any value other than 2, 3, 4, 8 and 16, are also reserved. Therefore, < n> can only be specified as 2,3,4,8, and 16.

Examples

The following example autovectorizes assuming double-wide integer as the basic computation width.

#include <clc.h>
// For VHLS OpenCL C kernels, the full work group is synthesized
__attribute__((vec_type_hint(double)))
__attribute__ ((reqd_work_group_size(16, 1, 1)))
__kernel void 
...

work_group_size_hint

Description

IMPORTANT: This is a compiler hint, which the compiler might ignore.

The work-group size in the OpenCL API standard defines the size of the ND range space that can be handled by a single invocation of a kernel compute unit. When OpenCL kernels are submitted for execution on an OpenCL device, they execute within an index space, called an ND range, which can have 1, 2, or 3 dimensions.

OpenCL kernel functions are executed exactly one time for each point in the ND range index space. This unit of work for each point in the ND range is called a work-item. Unlike for loops in C, where loop iterations are executed sequentially and in-order, an OpenCL runtime and device is free to execute work-items in parallel and in any order.

Work-items are organized into work-groups, which are the unit of work scheduled onto compute units. The optional WORK_GROUP_SIZE_HINT attribute is part of the OpenCL Language Specification, and is a hint to the compiler that indicates the work-group size value most likely to be specified by the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code according to the expected value.

TIP: In the case of an FPGA implementation, the specification of the REQD_WORK_GROUP_SIZE attribute, instead of the WORK_GROUP_SIZE_HINT is highly recommended because it can be used for performance optimization during the generation of the custom logic for a kernel.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel:

__attribute__((work_group_size_hint(<X>, <Y>, <Z>)))

Where:

<X>, <Y>, <Z>: Specifies the ND range of the kernel. This represents each dimension of a three dimensional matrix specifying the size of the work-group for the kernel.

Examples

The following example is a hint to the compiler that the kernel will most likely be executed with a work-group size of 1.

__attribute__((work_group_size_hint(1, 1, 1)))
__kernel void
...

xcl_array_partition

Description

IMPORTANT: Array variables only accept one attribute. While XCL_ARRAY_PARTITION does support multi-dimensional arrays, you can only reshape one dimension of the array with a single attribute.

An advantage of using the FPGA over other compute devices for OpenCL programs is the ability for the application programmer to customize the memory architecture all throughout the system and into the compute unit. By default, the Vitis compiler generates a memory architecture within the compute unit that maximizes local and private memory bandwidth based on static code analysis of the kernel code. Further optimization of these memories is possible based on attributes in the kernel source code, which can be used to specify physical layouts and implementations of local and private memories. The attribute in the Vitis compiler to control the physical layout of memories in a compute unit is array_partition.

For one-dimensional arrays, the XCL_ARRAY_PARTITION attribute implements an array declared within kernel code as multiple physical memories instead of a single physical memory. The selection of which partitioning scheme to use depends on the specific application and its performance goals. The array partitioning schemes available in the Vitis compiler are cyclic, block, and complete.

Syntax

Place the attribute with the definition of the array variable.

__attribute__((xcl_array_partition(<type>, <factor>, 
<dimension>)))

Where:

<type>: Specifies one of the following partition types:
- cyclic: Cyclic partitioning is the implementation of an array as a set of smaller physical memories that can be accessed simultaneously by the logic in the compute unit. The array is partitioned cyclically by putting one element into each memory before coming back to the first memory to repeat the cycle until the array is fully partitioned.
- block: Block partitioning is the physical implementation of an array as a set of smaller memories that can be accessed simultaneously by the logic inside the compute unit. In this case, each memory block is filled with elements from the array before moving on to the next memory.
- complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. The default <type> is complete.
<factor>: For cyclic type partitioning, the <factor> specifies how many physical memories to partition the original array into in the kernel code. For block type partitioning, the <factor> specifies the number of elements from the original array to store in each physical memory.
IMPORTANT: For complete type partitioning, the <factor>> is not specified.
<dimension>: Specifies which array dimension to partition. Specified as an integer from 1 to <N>. Vitis core development kit supports arrays of N dimensions and can partition the array on any single dimension.

Example 1

For example, consider the following array declaration.

int buffer[16];

The integer array, named buffer, stores 16 values that are 32-bits wide each. Cyclic partitioning can be applied to this array with the following declaration.

int buffer[16] __attribute__((xcl_array_partition(cyclic,4,1)));

In this example, the cyclic <partition_type> attribute tells the Vitis compiler to distribute the contents of the array among four physical memories. This attribute increases the immediate memory bandwidth for operations accessing the array buffer by a factor of four.

All arrays inside a compute unit in the context of the Vitis core development kit are capable of sustaining a maximum of two concurrent accesses. By dividing the original array in the code into four physical memories, the resulting compute unit can sustain a maximum of eight concurrent accesses to the array buffer.

Example 2

Using the same integer array as found in Example 1, block partitioning can be applied to the array with the following declaration.

int buffer[16] __attribute__((xcl_array_partition(block,4,1)));

Because the size of the block is four, the Vitis compiler will generate four physical memories, sequentially filling each memory with data from the array.

Example 3

Using the same integer array as found in Example 1, complete partitioning can be applied to the array with the following declaration.

int buffer[16] __attribute__((xcl_array_partition(complete, 1)));

In this example, the array is completely partitioned into distributed RAM, or 16 independent registers in the programmable logic of the kernel. Because complete is the default, the same effect can also be accomplished with the following declaration.

int buffer[16] __attribute__((xcl_array_partition));

While this creates an implementation with the highest possible memory bandwidth, it is not suited to all applications. The way in which data is accessed by the kernel code through either constant or data dependent indexes affects the amount of supporting logic that the Vitis compiler has to build around each register to ensure functional equivalence with the usage in the original code. As a general best practice guideline for the Vitis core development kit, the complete partitioning attribute is best suited for arrays in which at least one dimension of the array is accessed through the use of constant indexes.

xcl_array_reshape

Description

IMPORTANT: Array variables only accept one attribute. While the XCL_ARRAY_RESHAPE attribute does support multi-dimensional arrays, you can only reshape one dimension of the array with a single attribute.

This attribute combines array partitioning with vertical array mapping.

The XCL_ARRAY_RESHAPE attribute combines the effect of XCL_ARRAY_PARTITION, breaking an array into smaller arrays, and concatenating elements of arrays by increasing bit-widths. This reduces the number of block RAM consumed while providing parallel access to the data. This attribute creates a new array with fewer elements but with greater bit-width, allowing more data to be accessed in a single clock cycle.

Given the following code:

void foo (...) {
int array1[N] __attribute__((xcl_array_reshape(block, 2, 1)));
int array2[N] __attribute__((xcl_array_reshape(cycle, 2, 1)));
int array3[N] __attribute__((xcl_array_reshape(complete, 1)));
...
}

The ARRAY_RESHAPE attribute transforms the arrays into the form shown in the following figure.

Syntax

Place the attribute with the definition of the array variable.

__attribute__((xcl_array_reshape(<type>,<factor>, 
<dimension>)))

Where:

<type>: Specifies one of the following partition types:
- cyclic: Cyclic partitioning is the implementation of an array as a set of smaller physical memories that can be accessed simultaneously by the logic in the compute unit. The array is partitioned cyclically by putting one element into each memory before coming back to the first memory to repeat the cycle until the array is fully partitioned.
- block: Block partitioning is the physical implementation of an array as a set of smaller memories that can be accessed simultaneously by the logic inside the compute unit. In this case, each memory block is filled with elements from the array before moving on to the next memory.
- complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. The default <type> is complete.
<factor>: For cyclic type partitioning, the <factor> specifies how many physical memories to partition the original array into in the kernel code. For Block type partitioning, the <factor> specifies the number of elements from the original array to store in each physical memory.
IMPORTANT: For complete type partitioning, the <factor> should not be specified.
<dimension>: Specifies which array dimension to partition. Specified as an integer from 1 to <N>. The Vitis core development kit supports arrays of <N> dimensions and can partition the array on any single dimension.

Example 1

Reshapes (partition and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array with five elements using block mapping.

int AB[17] __attribute__((xcl_array_reshape(block,4,1)));

TIP: A <factor> of 4 indicates that the array should be divided into four. As a result, the 17 elements are reshaped into an array of five elements, with four times the bit-width. In this case, the last element, AB[17], is mapped to the lower eight bits of the fifth element, and the rest of the fifth element is empty.

Example 2

Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which dimension 2 has twice the bit-width:

int AB[6][4] __attribute__((xcl_array_reshape(block,2,2)));

Example 3

Reshapes the three-dimensional 8-bit array, AB[4][2][2] in function foo, into a new single element array (a register), 128-bits wide (4×2×2×8):

int AB[4][2][2] __attribute__((xcl_array_reshape(complete,0)));

TIP: A <dimension> of 0 means to reshape all dimensions of the array.

xcl_dataflow

Description

Enables task-level pipelining, allowing functions and loops to overlap in their operation, increasing the concurrency of the RTL implementation, and increasing the overall throughput of the design.

All operations are performed sequentially in a C description. In the absence of any directives that limit resources, such as pragma HLS allocation, the Vitis HLS tool seeks to minimize latency and improve concurrency. However, data dependencies can limit this. For example, functions or loops that access arrays must finish all read/write accesses to the arrays before they complete. This prevents the next function or loop that consumes the data from starting operation. The dataflow optimization enables the operations in a function or loop to start operation before the previous function or loop completes all its operations.

When dataflow optimization is specified, the HLS tool analyzes the dataflow between sequential functions or loops and creates channels (based on ping-pong RAMs or FIFOs) that allow consumer functions or loops to start operation before the producer functions or loops have completed. This allows functions or loops to operate in parallel, which decreases latency and improves the throughput of the RTL.

If no initiation interval (number of cycles between the start of one function or loop and the next) is specified, the HLS tool attempts to minimize the initiation interval and start operation as soon as data is available.

TIP: The HLS tool provides dataflow configuration settings. The config_dataflow command specifies the default memory channel and FIFO depth used in dataflow optimization.

For the DATAFLOW optimization to work, the data must flow through the design from one task to the next. The following coding styles prevent the HLS tool from performing the DATAFLOW optimization:

Single-producer-consumer violations
Bypassing tasks
Feedback between tasks
Conditional execution of tasks
Loops with multiple exit conditions

IMPORTANT: If any of these coding styles are present, the HLS tool issues a message and does not perform DATAFLOW optimization.

Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop contains additional tasks that might benefit from the DATAFLOW optimization, you must apply the optimization to the loop, the sub-function, or inline the sub-function.

Syntax

Assign the XCL_DATAFLOW attribute before the function definition or the loop definition:

__attribute__((xcl_dataflow))

Examples

Specifies dataflow optimization within function foo.

__attribute__((xcl_dataflow))
void foo ( a, b, c, d ) {
...
}

xcl_latency

Description

The XCL_LATENCY attribute specifies a minimum, or maximum latency value, or both, for the completion of functions, loops, and regions. Latency is defined as the number of clock cycles required to produce an output. Function or region latency is the number of clock cycles required for the code to compute all output values, and return. Loop latency is the number of cycles to execute all iterations of the loop. See Performance Metrics Example of Vitis High-Level Synthesis User Guide (UG1399).

The Vitis HLS tool always tries to minimize latency in the design. When the XCL_LATENCY attribute is specified, the tool behavior is as follows:

When latency is greater than the minimum, or less than the maximum: The constraint is satisfied. No further optimizations are performed.
When latency is less than the minimum: If the HLS tool can achieve less than the minimum specified latency, it extends the latency to the specified value, potentially increasing sharing.
When latency is greater than the maximum: If the HLS tool cannot schedule within the maximum limit, it increases effort to achieve the specified constraint. If it still fails to meet the maximum latency, it issues a warning, and produces a design with the smallest achievable latency in excess of the maximum.

TIP: You can also use the XCL_LATENCY attribute to limit the efforts of the tool to find a optimum solution. Specifying latency constraints for scopes within the code: loops, functions, or regions, reduces the possible solutions within that scope and improves tool runtime. For more information, refer to Improving Synthesis Runtime and Capacity of Vitis High-Level Synthesis User Guide (UG1399).

Syntax

Assign the XCL_LATENCY attribute before the body of the function, loop, or region:

__attribute__((xcl_latency(min, max)))

Where:

<min>: Specifies the minimum latency for the function, loop, or region of code.
<max>: Specifies the maximum latency for the function, loop, or region of code.

Examples

The for loop in the test function is specified to have a minimum latency of 4 and a maximum latency of 8.

__kernel void test(__global float *A, __global float *B, __global float *C, int id) 
{
  for (unsigned int i = 0; i < id; i++)
__attribute__((xcl_latency(4, 12))) {
   C[id] = A[id] * B[id];
 }
}

xcl_loop_tripcount

Description

The XCL_LOOP_TRIPCOUNT attribute can be applied to a loop to manually specify the total number of iterations performed by the loop.

IMPORTANT: The XCL_LOOP_TRIPCOUNT attribute is for analysis only, and does not impact the results of synthesis.

The Vivado High-Level Synthesis (HLS) reports the total latency of each loop, which is the number of clock cycles to execute all iterations of the loop. The loop latency is therefore a function of the number of loop iterations, or tripcount.

The tripcount can be a constant value. It can depend on the value of variables used in the loop expression (for example, x<y), or depend on control statements used inside the loop. In some cases, the HLS tool cannot determine the tripcount, and the latency is unknown. This includes cases in which the variables used to determine the tripcount are:

Input arguments, or
Variables calculated by dynamic operation.

In cases where the loop latency is unknown or cannot be calculated, the XCL_LOOP_TRIPCOUNT attribute lets you specify minimum, maximum, and average iterations for a loop. This lets the tool analyze how the loop latency contributes to the total design latency in the reports, and helps you determine appropriate optimizations for the design.

Syntax

Place the attribute in the OpenCL source before the loop declaration.

__attribute__((xcl_loop_tripcount(<min>, <max>, <average>)))

Where:

<min>: Specifies the minimum number of loop iterations.
<max>: Specifies the maximum number of loop iterations.
<avg>: Specifies the average number of loop iterations.

Examples

In this example, the WHILE loop in function f is specified to have a minimum tripcount of 2, a maximum tripcount of 64, and an average tripcount of 33.

__kernel void f(__global int *a) {
unsigned i = 0;
__attribute__((xcl_loop_tripcount(2, 64, 33)))
  while(i < 64) {
    a[i] = i;
    i++;
  }
}

xcl_max_work_group_size

Description

Use this attribute instead of REQD_WORK_GROUP_SIZE when you need to specify a larger kernel than the 4K size.

Extends the default maximum work group size supported in the Vitis core development kit by the reqd_work_group_size attribute. Vitis core development kit supports work size larger than 4096 with the XCL_MAX_WORK_GROUP_SIZE attribute.

Note: The actual workgroup size limit is dependent on the Xilinx device selected for the platform.

Syntax

Place this attribute before the kernel definition, or before the primary function specified for the kernel:

__attribute__((xcl_max_work_group_size(<X>, <Y>, <Z>)))

Where:

<X>, <Y>, <Z>: Specifies the ND range of the kernel. This represents each dimension of a three dimensional matrix specifying the size of the work-group for the kernel.

Examples

Below is the kernel source code for an un-optimized adder. No attributes were specified for this design, other than the work size equal to the size of the matrices (for example, 64x64). That is, iterating over an entire workgroup will fully add the input matrices, a and b, and output the result. All three are global integer pointers, which means each value in the matrices is four bytes, and is stored in off-chip DDR global memory.

#define RANK 64
__kernel __attribute__ ((reqd_work_group_size(RANK, RANK, 1)))
void madd(__global int* a, __global int* b, __global int* output) {
int index = get_local_id(1)*get_local_size(0) + get_local_id(0);
output[index] = a[index] + b[index];
}

This local work size of (64, 64, 1) is the same as the global work size. This setting creates a total work size of 4096.

Note: This is the largest work size that Vitis core development kit supports with the standard OpenCL attribute REQD_WORK_GROUP_SIZE. Vitis core development kit supports work size larger than 4096 with the Xilinx attribute xcl_max_work_group_size.

Any matrix larger than 64x64 would need to only use one dimension to define the work size. That is, a 128x128 matrix could be operated on by a kernel with a work size of (128, 1, 1), where each invocation operates on an entire row or column of data.

xcl_pipeline_loop

Description

You can pipeline a loop to improve latency and maximize kernel throughput and performance.

Although unrolling loops increases concurrency, it does not address the issue of keeping all elements in a kernel data path busy at all times. Even in an unrolled case, loop control dependencies can lead to sequential behavior. The sequential behavior of operations results in idle hardware and a loss of performance.

Xilinx addresses this issue by introducing a vendor extension on top of the OpenCL 2.0 API specification for loop pipelining using the XCL_PIPELINE_LOOP attribute.

By default, the v++ compiler automatically pipelines loops with a trip count more than 64, or unrolls loops with a trip count less than 64. This should provide good results. However, you can choose to pipeline loops (instead of the automatic unrolling) by explicitly specifying the NOUNROLL attribute and XCL_PIPELINE_LOOP attribute before the loop.

Syntax

Place the attribute in the OpenCL source before the loop definition:

__attribute__((xcl_pipeline_loop(<II_number>)))

Where:

<II_number>: Specifies the desired initiation interval (II) for the pipeline. The Vitis HLS tool tries to meet this request; however, based on data dependencies, the loop might have a larger initiation interval. When the II is not specified, the default is 1.

Examples

The following example specifies an II target of 3 for the for loop in the specified function:

__kernel void f(__global int *a) {
  __attribute__((xcl_pipeline_loop(3)))
  for (unsigned i = 0; i < 64; ++i)
    a[i] = i;
}

xcl_pipeline_workitems

Description

Pipeline a work item to improve latency and throughput. Work item pipelining is the extension of loop pipelining to the kernel work group. This is necessary for maximizing kernel throughput and performance.

Syntax

Place the attribute in the OpenCL API source before the elements to pipeline:

__attribute__((xcl_pipeline_workitems))

Example 1

To handle the reqd_work_group_size attribute in the following example, Vitis technology automatically inserts a loop nest to handle the three-dimensional characteristics of the ND range (3,1,1). As a result of the added loop nest, the execution profile of this kernel is like an unpipelined loop. Adding the XCL_PIPELINE_WORKITEMS attribute adds concurrency and improves the throughput of the code.

kernel
__attribute__ ((reqd_work_group_size(3,1,1)))
void foo(...)
{
...
__attribute__((xcl_pipeline_workitems)) {
int tid = get_global_id(0);
op_Read(tid);
op_Compute(tid);
op_Write(tid);
}
...
}

Example 2

The following example adds the work-item pipeline to the appropriate elements of the kernel:

__kernel __attribute__ ((reqd_work_group_size(8, 8, 1)))
void madd(__global int* a, __global int* b, __global int* output)
{
int rank = get_local_size(0);
__local unsigned int bufa[64];
__local unsigned int bufb[64];
__attribute__((xcl_pipeline_workitems)) {
int x = get_local_id(0);
int y = get_local_id(1);
bufa[x*rank + y] = a[x*rank + y];
bufb[x*rank + y] = b[x*rank + y];
}
barrier(CLK_LOCAL_MEM_FENCE);
__attribute__((xcl_pipeline_workitems)) {
int index = get_local_id(1)*rank + get_local_id(0);
output[index] = bufa[index] + bufb[index];
}
}

xcl_reqd_pipe_depth

Description

IMPORTANT: Pipes must be declared in lower case alphanumerics. printf() is also not supported with variables used in pipes.

The OpenCL framework 2.0 specification introduces a new memory object called pipe. A pipe stores data organized as a FIFO. Pipes can be used to stream data from one kernel to another inside the FPGA without using the external memory, which greatly improves the overall system latency.

In the Vitis core development kit, pipes must be statically defined outside of all kernel functions. The depth of a pipe must be specified by using the XCL_REQD_PIPE_DEPTH attribute in the pipe declaration:

pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));

Pipes can only be accessed using standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using Xilinx-extended read_pipe_block() and write_pipe_block() functions in blocking mode.

IMPORTANT: Vitis HLS only supports blocking mode for reading and writing pipes. The non-block read_pipe/write_block functions are not supported.

IMPORTANT: A given pipe can have one and only one producer and consumer in different kernels.

Pipe objects are not accessible from the host CPU. The status of pipes can be queried using OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. For more details on these built-in functions, see The OpenCL C Specification from Khronos OpenCL Working Group.

Syntax

This attribute must be assigned at the declaration of the pipe object:

pipe int <id> __attribute__((xcl_reqd_pipe_depth(<n>)));

Where:

<id>: Specifies an identifier for the pipe, which must consist of lower-case alphanumerics. For example, <infifo1> not <inFifo1>.
<n>: Specifies the depth of the pipe. Valid depth values are 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768.

Examples

The following is the dataflow_pipes_ocl example from Xilinx GitHub that use pipes to pass data from one processing stage to the next using blocking read_pipe_block() and write_pipe_block() functions:

pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
__attribute__((xcl_pipeline_loop))
mem_rd: for (int i = 0 ; i < size ; i++)
{
//blocking Write command to pipe P0
write_pipe_block(p0, &input[i]);
}
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
__attribute__((xcl_pipeline_loop))
execute: for(int i = 0 ; i < size ; i++)
{
int input_data, output_data;
//blocking read command to Pipe P0
read_pipe_block(p0, &input_data);
output_data = input_data + inc;
//blocking write command to Pipe P1
write_pipe_block(p1, &output_data);
}
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to 
// Global Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
__attribute__((xcl_pipeline_loop))
mem_wr: for (int i = 0 ; i < size ; i++)
{
//blocking read command to Pipe P1
read_pipe_block(p1, &output[i]);
}
}

xcl_zero_global_work_offset

Description

If you use clEnqueueNDRangeKernel with the global_work_offset set to NULL or all zeros, use this attribute to tell the compiler that the global_work_offset is always zero.

This attribute can improve memory performance when you have memory accesses like:

A[get_global_id(x)] = ...;

Note: You can specify REQD_WORK_GROUP_SIZE, VEC_TYPE_HINT, and XCL_ZERO_GLOBAL_WORK_OFFSET together to maximize performance.

Syntax

Place this attribute before the kernel definition or the primary function specified for the kernel.

__kernel __attribute__((xcl_zero_global_work_offset))
void test (__global short *input, __global short *output, __constant short *constants) { }

OpenCL Programming

OpenCL Host Application

Setting Up the OpenCL Environment

Platform

Devices

Sub-Devices

Context

Command Queues

Program

Executing Commands in the FPGA

Setting Up Kernels

Setting Kernel Arguments

Buffer Allocation on the Device

Buffer Creation and Data Transfer

Letting XRT Allocate Buffers

Using Host Pointer Buffers

Allocating Page-Aligned Host Memory

Sub-Buffers

Reading a Specific Portion from the Device Buffer

Device Buffer Shared by Multiple Memory Ports or Multiple Kernels

Kernel Execution

Task Parallelism Using Different Kernels

Spatial Data Parallelism: Increase Number of Compute Units

Temporal Data Parallelism: Host-to-Kernel Dataflow

Enabling Host-to-Kernel Dataflow

Symmetrical and Asymmetrical Compute Units

Kernel Handle and Compute Units

Creating Kernel Objects for Specific Compute Units

Using Compute Unit Name to Get Handle of All Asymmetrical Compute Units

Event Synchronization

Post-Processing and FPGA Cleanup

OpenCL Kernel Development

Kernel Signature

Kernel Optimizations

Setting Data Width in OpenCL Kernels

Reducing Kernel to Kernel Communication Latency in OpenCL Kernels

OpenCL Attributes

always_inline

Description

Syntax

Examples

See Also

opencl_unroll_hint

Description

Syntax

Examples

See Also

reqd_work_group_size

Description

Syntax

Examples

See Also

vec_type_hint

Description

Syntax

Examples

See Also

work_group_size_hint

Description

Syntax

Examples

See Also

xcl_array_partition

Description

Syntax

Example 1

Example 2

Example 3

See Also

xcl_array_reshape

Description

Syntax

Example 1

Example 2

Example 3

See Also

xcl_dataflow

Description

Syntax

Examples