Programming the Host Application

In the SDAccel™ environment, host code is written in C or C++ using the industry-standard OpenCL™ API. The SDAccel environment provides an OpenCL 1.2 embedded profile conformant runtime API.

IMPORTANT: The SDAccel environment supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. Refer to OpenCL Installable Client Driver Loader for details and installation instructions.

The SDAccel environment consists of a host x86 CPU and compute devices running on a Xilinx® FPGA.

In general, the structure of the host code can be divided into three sections:
  1. Setting up the environment.
  2. Core command execution including executing one or more kernels.
  3. Post processing and FPGA release.
The following sections discuss each of the above topics in detail.
Note: When multithreading the host program, exercise caution when calling the fork() system call from an SDAccel environment application. The fork() call does not duplicate all of the runtime threads, so the child process cannot run as a complete application in the SDAccel environment. Use the posix_spawn() system call instead to launch another process from the SDAccel environment application.
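As a minimal sketch of the recommended approach, the following example launches a child process with posix_spawn() and waits for it to finish. The helper name spawn_and_wait and the spawned program (/bin/true, a stand-in for the second application you want to launch) are illustrative assumptions, not part of the SDAccel runtime:

```c
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Launch an external program with posix_spawn() instead of fork(),
   wait for it, and return its exit status (or -1 on failure). */
int spawn_and_wait(const char *path)
{
    pid_t pid;
    char *child_argv[] = { (char *)path, NULL };

    if (posix_spawn(&pid, path, NULL, NULL, child_argv, environ) != 0)
        return -1;                      /* spawn itself failed */

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)   /* reap the child */
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_and_wait("/bin/true") launches the program, blocks until it exits, and returns its exit code, without duplicating the parent's runtime threads the way fork() would.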

Setting Up the OpenCL Environment

The host code in the SDAccel environment follows the OpenCL programming paradigm. To set up the environment properly, the host application needs to identify the standard OpenCL objects: platform, devices, context, command queue, and program.

TIP: The host code examples and API commands used in this document follow the OpenCL C API. The IDCT example referred to in SDAccel Example Designs is also written with the C API. However, the SDAccel runtime environment also supports the OpenCL C++ wrapper API, and many of the examples in the GitHub repository are written using the C++ API. Refer to https://www.khronos.org/registry/OpenCL/specs/opencl-cplusplus-1.2.pdf for more information on this C++ wrapper API.

Platform

The host code should first identify the platform composed of one or more Xilinx FPGA devices. The following host code segment is the standard way to identify the Xilinx device-based platform.
cl_platform_id platform_id;         // platform id
cl_platform_id platforms[16];       // list of available platforms
cl_uint platform_count;
char cl_platform_vendor[1001];
cl_int err;

err = clGetPlatformIDs(16, platforms, &platform_count);

// Find Xilinx Platform
for (unsigned int iplat = 0; iplat < platform_count; iplat++) {
  err = clGetPlatformInfo(platforms[iplat],
    CL_PLATFORM_VENDOR,
    1000,
    (void *)cl_platform_vendor,
    NULL);

  if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
    // Xilinx Platform found
    platform_id = platforms[iplat];
  }
}

The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Thereafter, clGetPlatformInfo is used to identify the Xilinx device based platform by matching cl_platform_vendor with the string "Xilinx".

Note: Though it is not explicitly shown in the preceding code, or in other host code examples throughout this chapter, it is good coding practice to check for errors after each OpenCL API call. Error checking helps debugging and improves productivity when you are debugging the host and kernel code in the emulation flow, or during hardware execution. Below is an error-checking code example for the clGetPlatformIDs command:
err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
  printf("Error: Failed to find an OpenCL platform!\n");
  printf("Test failed\n");
  exit(1);
}
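To avoid repeating this boilerplate after every call, the check can be wrapped in a small helper. OCL_CHECK below is a hypothetical convenience macro, not part of the OpenCL API; CL_SUCCESS is defined locally only so the sketch compiles without the OpenCL headers (its value is 0 per the OpenCL specification):

```c
#include <stdio.h>
#include <stdlib.h>

#ifndef CL_SUCCESS
#define CL_SUCCESS 0   /* 0 per the OpenCL specification */
#endif

/* Hypothetical helper: check an OpenCL return code and abort with a
   message naming the failing call. Not part of the OpenCL API. */
#define OCL_CHECK(err, call_name)                               \
  do {                                                          \
    if ((err) != CL_SUCCESS) {                                  \
      printf("Error: %s failed with code %d\n",                 \
             (call_name), (int)(err));                          \
      printf("Test failed\n");                                  \
      exit(1);                                                  \
    }                                                           \
  } while (0)
```

With such a macro in place, every API call can be followed by a single line, for example: OCL_CHECK(err, "clGetPlatformIDs");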

Devices

After the platform detection, the Xilinx FPGA devices attached to the platform are identified. The SDAccel environment supports one or more Xilinx FPGA devices working together.

The following code demonstrates finding all the Xilinx devices (with an upper limit of 16) by using the clGetDeviceIDs API and printing their names.
cl_device_id devices[16];  // compute device id
char cl_device_name[1024];
cl_uint num_devices;

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR,
  16, devices, &num_devices);

printf("INFO: Found %d devices\n", num_devices);

// Iterate over all devices to select the target device.
for (cl_uint i = 0; i < num_devices; i++) {
  err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(cl_device_name),
    cl_device_name, 0);
  printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
IMPORTANT: The clGetDeviceIDs API is called with the device_type CL_DEVICE_TYPE_ACCELERATOR to get all the available Xilinx devices.

Sub-devices

In the SDAccel environment, a device sometimes contains multiple kernel instances, either of a single kernel or of different kernels. The OpenCL API clCreateSubDevices allows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, the SDAccel environment supports only equally divided sub-devices, each containing a single kernel instance.

The following example shows:
  1. Creating the sub-devices by equal partition, so that each sub-device executes one kernel instance.
  2. Iterating over the sub-device list, and using a separate context and command queue to execute the kernel on each sub-device.
  3. For simplicity, the kernel execution API (and corresponding buffer-related code) is not shown; it would be contained inside the run_cu function.
  cl_uint num_devices = 0;
  cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY, 1, 0};

  // First call: get the number of sub-devices
  clCreateSubDevices(device, props, 0, nullptr, &num_devices);

  // Container to hold the sub-devices
  std::vector<cl_device_id> devices(num_devices);

  // Second call of clCreateSubDevices
  // We get sub-device handles in devices.data()
  clCreateSubDevices(device, props, num_devices, devices.data(), nullptr);

  // Iterating over sub-devices
  std::for_each(devices.begin(), devices.end(), [kernel](cl_device_id sdev) {
      cl_int err;

      // Context for sub-device
      auto context = clCreateContext(0, 1, &sdev, nullptr, nullptr, &err);

      // Command queue for sub-device
      auto queue = clCreateCommandQueue(context, sdev,
          CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

      // Execute the kernel on the sub-device using the local context and queue
      run_cu(context, queue, kernel); // Function not shown
  });

Currently, if a kernel has multiple hardware instances (which can be specified during the kernel compilation phase), the SDAccel environment execution model assumes that all those hardware instances have the same global memory connectivity. If they do not, you must use sub-devices to allocate a separate cl_kernel for each of those hardware instances.

Context

The OpenCL context creation process is straightforward. The API clCreateContext is used to create a context that contains one or more Xilinx devices that will communicate with the host machine.
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

In the code example above, the clCreateContext API is used to create a context that contains one Xilinx device. A host program can create only one context per device. However, the host program must use multiple contexts if sub-devices are used: one context for each sub-device.

Command Queues

One or more command queues for each device are created using the clCreateCommandQueue API. The FPGA device can contain multiple kernels. When developing the host application, there are two main programming approaches to execute kernels on a device:

  1. Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. The SDAccel runtime environment dispatches those kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
  2. Multiple in-order command queues: Each kernel execution is requested from a different in-order command queue. In such cases, the SDAccel runtime environment dispatches kernels from any of the command queues, improving performance by running them concurrently on the FPGA.
The following is an example of standard API calls to create in-order and out-of-order command queues.
// Out-of-order Command queue
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// In-order Command Queue
commands = clCreateCommandQueue(context, device_id, 0, &err);

Program

As described in the SDAccel Build Process, the host and kernel code are compiled separately to create separate executable files: the host application (.exe) and the FPGA binary (.xclbin). When the host application is executed, it must load the .xclbin file using the clCreateProgramWithBinary API.

The following code example shows how the standard OpenCL API is used to build the program from the .xclbin file:
unsigned char *kernelbinary;
char *xclbin = argv[1];
cl_int status;
cl_int err;

printf("INFO: loading xclbin %s\n", xclbin);

int size = load_file_to_memory(xclbin, (char **) &kernelbinary);
if (size < 0) {
  printf("Error: failed to load kernel binary %s\n", xclbin);
  exit(1);
}
size_t size_var = size;

cl_program program = clCreateProgramWithBinary(context, 1, &device_id, &size_var,
                     (const unsigned char **) &kernelbinary, &status, &err);

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Function to read a file into host memory
int load_file_to_memory(const char *filename, char **result)
{
  unsigned int size = 0;
  FILE *f = fopen(filename, "rb");
  if (f == NULL) {
    *result = NULL;
    return -1; // -1 means file opening fail
  }
  fseek(f, 0, SEEK_END);
  size = ftell(f);
  fseek(f, 0, SEEK_SET);
  *result = (char *)malloc(size+1);
  if (size != fread(*result, sizeof(char), size, f)) {
    free(*result);
    fclose(f);
    return -2; // -2 means file reading fail
  }
  fclose(f);
  (*result)[size] = 0;
  return size;
}
The above example performs the following steps:
  1. The kernel binary file, .xclbin, is passed in from the command line argument, argv[1].
    TIP: Passing the .xclbin through a command line argument is specific to this example. You can also hardcode the kernel binary file in the application.
  2. The load_file_to_memory function is used to load the file contents in the host machine memory space.
  3. The API clCreateProgramWithBinary and clBuildProgram are used to complete the program creation process.

Executing Commands in the FPGA Device

Once the OpenCL environment is initialized, the host application is ready to issue commands to the device and interact with the kernels. Such commands include:
  1. Memory data transfer to and from the FPGA device.
  2. Kernel execution on FPGA.
  3. Event synchronization.

Buffer Transfer to/from the FPGA Device

Interactions between the host application and kernels rely on transferring data to and from the global memory in the device. The simplest way to send data back and forth between the host and the FPGA is to use the clCreateBuffer, clEnqueueWriteBuffer, and clEnqueueReadBuffer commands. The following code example demonstrates this:

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}
cl_mem dev_mem_ptr = clCreateBuffer(context,  CL_MEM_READ_WRITE, 
                     sizeof(int) * number_of_words, NULL, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
      sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
IMPORTANT: A single buffer cannot be bigger than 4 GB.
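Because of this limit, it can be worth validating the requested size before calling clCreateBuffer. The helper below is a hypothetical sketch; the name buffer_size_ok is illustrative and not part of the OpenCL or SDAccel APIs:

```c
#include <stdint.h>

/* Hypothetical guard: return nonzero when nbytes fits within the
   4 GB single-buffer limit noted above. */
static int buffer_size_ok(uint64_t nbytes)
{
    const uint64_t max_buffer_bytes = 4ULL * 1024 * 1024 * 1024; /* 4 GB */
    return nbytes <= max_buffer_bytes;
}
```

For example, checking buffer_size_ok(sizeof(int) * number_of_words) before creating the buffer lets the host application fail gracefully with a clear message instead of relying on a runtime error from clCreateBuffer.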

For the majority of applications, the example code above is sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and gain fine-grained control.

Using clEnqueueMigrateMemObjects

Another consideration when transferring data is to use clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve performance. Typically, memory objects are implicitly migrated to a device for enqueued kernels. Using this API call instead performs the data transfer ahead of kernel execution, reducing latency, particularly when a kernel is called multiple times.

The following code example is modified to use clEnqueueMigrateMemObjects:

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
      
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                     CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                     sizeof(int) * number_of_words, &d_bank0_ext, NULL);

// clEnqueueMigrateMemObjects expects a pointer to a list of cl_mem objects
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0,
      NULL, NULL);

Using posix_memalign for Host Memory Space

The SDAccel runtime allocates memory on 4K boundaries for internal memory management. If the host memory pointer is not aligned to a 4K boundary, the runtime performs an extra memcpy to align it. While this does not significantly impact performance, you should align the host memory pointer to a 4K boundary to match the SDAccel runtime memory management.

The following is an example of how posix_memalign is used instead of malloc for the host memory space pointer.

int *host_mem_ptr; // instead of (int*) malloc(MAX_LENGTH*sizeof(int));
// Align host memory on a 4K boundary
posix_memalign((void **)&host_mem_ptr, 4096, MAX_LENGTH*sizeof(int));

// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}

cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj = host_mem_ptr;
d_bank0_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                     CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                     sizeof(int) * number_of_words, &d_bank0_ext, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0,
      NULL, NULL);
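A quick way to confirm that a host pointer meets the 4K requirement is a small alignment check. is_4k_aligned below is a hypothetical helper for illustration, not part of the OpenCL or SDAccel APIs:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero when ptr sits on a 4K (4096-byte)
   boundary, the alignment the SDAccel runtime expects for host buffers. */
static int is_4k_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % 4096) == 0;
}
```

Pointers returned by posix_memalign(..., 4096, ...) pass this check, while a pointer from plain malloc generally will not, which is when the runtime falls back to the extra memcpy described above.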

Enhanced Buffer Allocation

By default, all the memory interfaces from all the kernels are connected to a single global memory bank when kernels are linked. As a result, only one memory interface can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA device contains only one global memory bank, this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This improves overall performance by enabling multiple kernel memory interfaces to read and write data concurrently from separate global memory banks. This topic is discussed in greater detail in Customization of DDR Bank to Kernel Connection.

When kernel ports are mapped to memory banks other than the default one, it is necessary to use the enhanced buffer allocation pattern when creating the OpenCL buffers.

The enhanced buffer allocation pattern uses a Xilinx vendor extension, the cl_mem_ext_ptr_t pointer, to tell the Xilinx runtime in which global memory bank the buffer should be allocated.

The cl_mem_ext_ptr_t type is a struct as defined below:

typedef struct {
  unsigned flags;
  void *obj;
  void *param;
} cl_mem_ext_ptr_t;

Use the explicit bank name method to operate cl_mem_ext_ptr_t for enhanced buffer allocation.

Explicit Bank Name Method

In this approach, the struct field flags denotes the DDR bank (XCL_MEM_DDR_BANK1, XCL_MEM_DDR_BANK2, etc.). The struct field param should not be used and must be set to NULL.

The following code example uses cl_mem_ext_ptr_t to assign the device buffer to DDR Bank 2.

int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
  host_mem_ptr[i] = <... >   
}

cl_mem_ext_ptr_t d_bank2_ext;
d_bank2_ext.flags = XCL_MEM_DDR_BANK2;
d_bank2_ext.obj = NULL;
d_bank2_ext.param = 0;

cl_mem dev_mem_ptr = clCreateBuffer(context,
                     CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                     sizeof(int) * number_of_words, &d_bank2_ext, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
      sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
IMPORTANT: Starting with the 2018.3 release, the new method of specifying a bank is:
var_ext.flags = <some integer> | XCL_MEM_TOPOLOGY
where the integers 0, 1, 2, and 3 stand for the different DDR banks. However, the older naming style of XCL_MEM_DDR_BANK0, etc. still works for existing platforms.
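The 2018.3-style flags word can be composed as shown in this sketch. The helper name bank_flags is hypothetical, and the local XCL_MEM_TOPOLOGY definition is only a placeholder so the sketch compiles standalone; in a real host program the macro comes from the Xilinx-provided headers, which you should always use instead:

```c
/* Placeholder definition so this sketch compiles without the Xilinx
   headers; in a real host program, include the SDAccel-provided
   extension header and use its definition instead. */
#ifndef XCL_MEM_TOPOLOGY
#define XCL_MEM_TOPOLOGY (1u << 31)
#endif

/* Hypothetical helper: build the flags word selecting DDR bank
   bank_index using the 2018.3-style XCL_MEM_TOPOLOGY encoding. */
static unsigned bank_flags(unsigned bank_index)
{
    return bank_index | XCL_MEM_TOPOLOGY;
}
```

With this helper, assigning a buffer to DDR bank 2 becomes, for example, d_bank2_ext.flags = bank_flags(2);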

Kernel Setup and Execution

This section focuses on how a typical host application performs the following kernel related tasks in the SDAccel environment:
  1. Identifying the kernels.
  2. Setting kernel arguments.
  3. Executing kernels on the FPGA.

Identifying the Kernels

At the beginning, the individual kernels present in the .xclbin file should be mapped to the kernel handles (denoted by cl_kernel type) in the host code. This is done by the clCreateKernel command with the kernel name as an argument:
kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);            
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err);  // etc

Setting Kernel Arguments

In the SDAccel environment framework, two types of kernel arguments can be set.
  1. Scalar arguments are used for small data transfers, such as constant or configuration-type data. These are write-only arguments.
  2. Buffer arguments are used for large data transfers, as discussed in Buffer Transfer to/from the FPGA Device.
The kernel arguments can be set using the clSetKernelArg command, as shown below. The following example sets kernel arguments for two scalar arguments and three buffer arguments.
cl_int err = 0;
// Setting up scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);
cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);

// Setting up buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_mem_ptr0);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_mem_ptr1);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &dev_mem_ptr2);

Enqueuing the Kernels

A kernel is enqueued to run on the FPGA by either the clEnqueueTask or clEnqueueNDRangeKernel command. Xilinx recommends using clEnqueueTask to execute the kernel over the entire range of the input data set, using the maximum number of work group items:
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
TIP: clEnqueueTask is the same as calling clEnqueueNDRangeKernel with work_dim set to 1, global_work_offset set to NULL, global_work_size[0] set to 1, and local_work_size[0] set to 1.

Like all enqueue commands, clEnqueueTask and clEnqueueNDRangeKernel are asynchronous. The host code continues executing without waiting for the kernel computation to complete on the FPGA device. This allows the host program to enqueue more kernels, either the same kernel multiple times over different sets of data, or different kernels. After finishing its work, the kernel writes the result data to the global memory bank. This data is read back to the host memory space using the clEnqueueReadBuffer or clEnqueueMigrateMemObjects command.

Event Synchronization

All OpenCL clEnqueueXXX API calls are asynchronous; these commands return immediately after the command is enqueued in the command queue. To resolve dependencies among the commands, API calls such as clWaitForEvents or clFinish can be used to pause or block execution of the host program.

Example usage of clWaitForEvents and clFinish commands are shown below:
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
 
// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue); 
  
// Read back the results from the device to verify the output
cl_event readevent;
int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector
   
clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words, 
  host_mem_output_ptr, 0, NULL, &readevent );

clWaitForEvents(1, &readevent); // Wait for clEnqueueReadBuffer event to finish
   
// Check Results
// Compare Golden values with host_mem_output_ptr 
Note how the synchronization APIs are used in the above example.
  1. The clFinish API explicitly blocks host execution until kernel execution is finished. This is necessary; otherwise, the host can attempt to read back from the FPGA buffer too early and may read garbage data.
  2. The data transfer from FPGA memory to the local host machine is done through clEnqueueReadBuffer. The last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command, and can be used to query the event or to wait for this particular command to complete. clWaitForEvents takes that one event and waits until the data transfer is finished, before the data is checked against the host-side memory.

Post Processing and FPGA Cleanup

At the final stage of the host program, it is good practice to confirm the FPGA functionality by comparing the output data from the FPGA with golden data. This action greatly helps identify and debug any issues with the kernel.
bool failed = false;
for (i = 0; i < N; i++) {
  if (Res[i] != GOLD_RES[i]) {
    failed = true;
  }
}
At the end of the host code, all the allocated resources should be released using the proper release functions. The SDAccel environment may not be able to generate a correct performance-related profile and analysis report if resources are not properly released.
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID); 
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);

Summary

As discussed in earlier topics, the recommended coding style for the host application in the SDAccel environment includes the following points:

  1. Add error checking after each OpenCL API call for debugging purposes, if required.
  2. In the SDAccel environment, one or more kernels are separately pre-compiled to the .xclbin file. The API clCreateProgramWithBinary is used to build the program from the kernel binary.
  3. Use cl_mem_ext_ptr_t to match the kernel memory interfaces to the DDR bank connections that were used to build the kernel binary.
  4. Transfer data back and forth from the host code to the FPGAs by using clEnqueueMigrateMemObjects.
  5. Use posix_memalign to align the host memory pointer at 4K boundary.
  6. Use the out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA.
  7. Execute the whole workload with clEnqueueTask, rather than splitting the workload with clEnqueueNDRangeKernel.
  8. Use synchronization commands to resolve dependencies of the asynchronous OpenCL API calls.