Host Application
In the Vitis™ core development kit, host code is written in C or C++ language using the Xilinx® runtime (XRT) API or industry standard OpenCL™ API. The XRT native API is described on the XRT site at https://xilinx.github.io/XRT/2020.2/html/xrt_native_apis.html. The Vitis core development kit supports the OpenCL 1.2 API as described at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf. XRT extensions to OpenCL are described at https://xilinx.github.io/XRT/2020.2/html/opencl_extension.html.
In general, the structure of the host code can be divided into three sections:
- Setting up the environment.
- Core command execution including executing one or more kernels.
- Post processing and release of resources.
cl_khr_icd
). This extension allows multiple implementations of OpenCL to co-exist on the same system. For details and
installation instructions, refer to OpenCL Installable Client Driver Loader.fork()
system call from
a Vitis core development kit application. The
fork()
does not duplicate all the runtime threads.
Hence, the child process cannot run as a complete application in the Vitis core development kit. It is advisable to use the
posix_spawn()
system call to launch another process
from the Vitis software platform application.Setting Up the Runtime Environment
The host code in the Vitis core development kit follows the OpenCL programming paradigm. To setup the runtime environment properly, the host application needs to initialize the standard OpenCL structures: target platform, devices, context, command queue, and program.
Platform
Upon initialization, the host application needs to identify a platform composed of one or more Xilinx devices. The following code fragment shows a common method of identifying a Xilinx platform.
cl_platform_id platform_id; // platform id
err = clGetPlatformIDs(16, platforms, &platform_count);
// Find Xilinx Platform
for (unsigned int iplat=0; iplat<platform_count; iplat++) {
err = clGetPlatformInfo(platforms[iplat],
CL_PLATFORM_VENDOR,
1000,
(void *)cl_platform_vendor,
NULL);
if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
// Xilinx Platform found
platform_id = platforms[iplat];
}
}
The OpenCL API call clGetPlatformIDs
is used to
discover the set of available OpenCL platforms for a
given system. Then, clGetPlatformInfo
is used to
identify Xilinx device based platforms by matching
cl_platform_vendor
with the string "Xilinx"
.
clGetPlatformIDs
command.err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
printf("Error: Failed to find an OpenCL platform!\n");
printf("Test failed\n");
exit(1);
}
Devices
After a Xilinx platform is found, the application needs to identify the corresponding Xilinx devices.
The following code demonstrates finding all the Xilinx devices, with an upper limit of 16, by using API clGetDeviceIDs
.
cl_device_id devices[16]; // compute device id
char cl_device_name[1001];
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR,
16, devices, &num_devices);
printf("INFO: Found %d devices\n", num_devices);
//iterate all devices to select the target device.
for (uint i=0; i<num_devices; i++) {
err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, 1024, cl_device_name, 0);
printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
clGetDeviceIDs
API is called with the platform_id
and CL_DEVICE_TYPE_ACCELERATOR
to receive all the available Xilinx devices.Sub-Devices
In the Vitis core development kit, sometimes
devices contain multiple kernel instances of a single kernel or of different kernels.
While the OpenCL API clCreateSubDevices
allows the
host code to divide a device into multiple sub-devices, the Vitis core development kit supports equally divided sub-devices (using
CL_DEVICE_PARTITION_EQUALLY
), each containing one
kernel instance.
The following example shows:
- Sub-devices created by equal partition to execute one kernel instance per sub-device.
- Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
- The API related to kernel execution (and corresponding buffer related) code is not shown for the sake of simplicity, but would be described inside the function
run_cu
.
cl_uint num_devices = 0;
cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY,1,0};
// Get the number of sub-devices
clCreateSubDevices(device,props,0,nullptr,&num_devices);
// Container to hold the sub-devices
std::vector<cl_device_id> devices(num_devices);
// Second call of clCreateSubDevices
// We get sub-device handles in devices.data()
clCreateSubDevices(device,props,num_devices,devices.data(),nullptr);
// Iterating over sub-devices
std::for_each(devices.begin(),devices.end(),[kernel](cl_device_id sdev) {
// Context for sub-device
auto context = clCreateContext(0,1,&sdev,nullptr,nullptr,&err);
// Command-queue for sub-device
auto queue = clCreateCommandQueue(context,sdev,
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&err);
// Execute the kernel on the sub-device using local context and
queue run_cu(context,queue,kernel); // Function not shown
});
Context
The clCreateContext
API is used to
create a context that contains a Xilinx device that
will communicate with the host machine.
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
In the code example, the clCreateContext
API
is used to create a context that contains one Xilinx
device. Xilinx recommends creating only one context per device or
sub-device. However, the host program should use multiple contexts if sub-devices are
used with one context for each sub-device.
Command Queues
The clCreateCommandQueue
API
creates one or more command queues for each device. The FPGA can contain multiple
kernels, which can be either the same or different kernels. When developing the host
application, there are two main programming approaches to execute kernels on a device:
- Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. XRT dispatches kernels as soon as possible, in any order, allowing concurrent kernel execution on the FPGA.
- Multiple in-order command queue: Each kernel execution will be requested from different in-order command queues. In such cases, XRT dispatches kernels from the different command queues, improving performance by running them concurrently on the device.
The following is an example of standard API calls to create in-order and out-of-order command queues.
// Out-of-order Command queue
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
// In-order Command Queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
Program
The host and kernel code are compiled separately to create separate executable
files: the host program executable and the FPGA binary (.xclbin). When the host application runs, it must load the .xclbin file using the clCreateProgramWithBinary
API.
The following code example shows how the standard OpenCL API is used to build the program from the .xclbin file.
unsigned char *kernelbinary;
char *xclbin = argv[1];
printf("INFO: loading xclbin %s\n", xclbin);
int size=load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size;
cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
&size_var,(const unsigned char **) &kernelbinary,
&status, &err);
// Function
int load_file_to_memory(const char *filename, char **result)
{
uint size = 0;
FILE *f = fopen(filename, "rb");
if (f == NULL) {
*result = NULL;
return -1; // -1 means file opening fail
}
fseek(f, 0, SEEK_END);
size = ftell(f);
fseek(f, 0, SEEK_SET);
*result = (char *)malloc(size+1);
if (size != fread(*result, sizeof(char), size, f)) {
free(*result);
return -2; // -2 means file reading fail
}
fclose(f);
(*result)[size] = 0;
return size;
}
The example performs the following steps:
- The kernel binary file, .xclbin, is passed in from the
command line argument,
argv[1]
.TIP: Passing the .xclbin through a command line argument is one approach. You can also hardcode the kernel binary file in the host program, define it with an environment variable, read it from a custom initialization file, or another suitable mechanism. - The
load_file_to_memory
function is used to load the file contents in the host machine memory space. - The
clCreateProgramWithBinary
API is used to complete the program creation process in the specified context and device.
Executing Commands in the FPGA
Once the OpenCL environment is initialized, the host application is ready to issue commands to the device and interact with the kernels. These commands include:
- Setting up the kernels.
- Buffer transfer to/from the FPGA.
- Kernel execution on FPGA.
- Event synchronization.
Setting Up Kernels
After setting up the runtime environment, such as identifying devices, creating the context, command queue, and program, the host application should identify the kernels that will execute on the device, and set up the kernel arguments.
The OpenCL API clCreateKernel
should be used to access the kernels contained within
the .xclbin file (the "program"). The cl_kernel
object identifies a kernel in the program loaded into
the FPGA that can be run by the host application. The following code example identifies two
kernels defined in the loaded program.
kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err); // etc
Setting Kernel Arguments
In the Vitis software platform, two types of arguments can be set for kernel objects:
- Scalar arguments are used for small data transfer, such as constant or configuration type data. These are write-only arguments from the host application perspective, meaning they are inputs to the kernel.
- Memory buffer arguments are used for large data transfer. The value is a pointer to a memory object created with the context associated with the program and kernel objects. These can be inputs to, or outputs from the kernel.
Kernel arguments can be set using the clSetKernelArg
command, as
shown in the following example for setting kernel arguments for two scalar and two
buffer arguments.
// Create memory buffers
cl_mem dev_buf1 = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, size, &host_mem_ptr1, NULL);
cl_mem dev_buf2 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, &host_mem_ptr2, NULL);
int err = 0;
// Setup scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);
cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);
// Setup buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_buf1);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_buf2);
clEnqueueMigrateMemObjects
) on
any buffer. For all kernel buffer arguments you must allocate the buffer on the
device global memories. However, sometimes the content of the buffer is not required
before the start of the kernel execution. For example, the output buffer content will
only be populated during the kernel execution, and hence it is not important prior to
kernel execution. In this case, you should specify clEnqueueMigrateMemObject
with the CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED
flag so that migration of the
buffer will not involve the DMA operation between the host and the device, thus
improving performance.
Buffer Allocation on the Device
By default, when kernels are linked to the platform the memory interfaces from all the kernels are connected to a single default global memory bank. As a result, only a single compute unit (CU) can transfer data to and from the global memory bank at one time, limiting the overall performance of the application.
If the device contains only one global memory bank, then this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the memory interface connection for a kernel during linking. The method for performing this is discussed in detail in Mapping Kernel Ports to Memory. Overall performance is improved by using separate memory banks for different kernels or compute units, enabling multiple kernel memory interfaces to concurrently read and write data.
clSetKernelArgs
is used before any
enqueue operation on the buffer, such as clEnqueueMigrateMemObject
.Buffer Creation and Data Transfer
Interactions between the host program and hardware kernels rely on creating
buffers and transferring data to and from the memory in the device. This process makes use of
functions like clCreateBuffer
and clEnqueueMigrateMemObjects
.
There are two methods for allocating memory buffers, and transferring data:
In the case where XRT allocates the buffer, use enqueueMapBuffer
to capture the buffer handle. In the second case, allocate the
buffer directly with CL_MEM_USE_HOST_PTR
, so you do not need
to capture the handle.
There are a number of coding practices you can adopt to maximize performance
and fine-grain control. The OpenCL API supports additional
commands for reading and writing buffers. For example, you can use clEnqueueWriteBuffer
and clEnqueueReadBuffer
commands in place of
clEnqueueMigrateMemObjects
. However, some of these
commands have different effects that must be understood when using them. For example, clEnqueueReadBufferRect
can read a rectangular region of a buffer object to
the host application, but it does not transfer the data from the device global memory to the
host. You must first use clEnqueueReadBuffer
to transfer the
data from the device global memory, and then use clEnqueueReadBufferRect
to read the desired rectangular portion into the host
application.
Letting XRT Allocate Buffers
On data center platforms, it is more efficient to allocate memory aligned
on 4k page boundaries. On embedded platforms it is more efficient to perform contiguous
memory allocation. In either case, you can let the XRT allocate host memory when
creating the buffers. This is done by using the CL_MEM_ALLOC_HOST_PTR
flag when creating the buffers, and then mapping the
allocated memory to user-space pointers using clEnqueueMapBuffer
. With this approach, it is not necessary to
create a host space pointer aligned to the 4K boundary.
The clEnqueueMapBuffer
API maps the
specified buffer and returns a pointer created by XRT to this mapped region. Then, fill
the host side pointer with your data, followed by clEnqueueMigrateMemObject
to transfer the data to and from the device. The
following code example uses this style:
// Two cl_mem buffer, for read and write by kernel
cl_mem dev_mem_read_ptr = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
sizeof(int) * number_of_words, NULL, NULL);
cl_mem dev_mem_write_ptr = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY,
sizeof(int) * number_of_words, NULL, NULL);
cl::Buffer in1_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, sizeof(int) * DATA_SIZE, NULL, &err);
// Setting arguments
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_read_ptr);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_mem_write_ptr);
// Get Host side pointer of the cl_mem buffer object
auto host_write_ptr = clEnqueueMapBuffer(queue,dev_mem_read_ptr,true,CL_MAP_WRITE,0,bytes,0,nullptr,nullptr,&err);
auto host_read_ptr = clEnqueueMapBuffer(queue,dev_mem_write_ptr,true,CL_MAP_READ,0,bytes,0,nullptr,nullptr,&err);
// Fill up the host_write_ptr to send the data to the FPGA
for(int i=0; i< MAX; i++) {
host_write_ptr[i] = <.... >
}
// Migrate
cl_mem mems[2] = {host_write_ptr,host_read_ptr};
clEnqueueMigrateMemObjects(queue,2,mems,0,0,nullptr,&migrate_event));
// Schedule the kernel
clEnqueueTask(queue,kernel,1,&migrate_event,&enqueue_event);
// Migrate data back to host
clEnqueueMigrateMemObjects(queue, 1, &dev_mem_write_ptr,
CL_MIGRATE_MEM_OBJECT_HOST,1,&enqueue_event, &data_read_event);
clWaitForEvents(1,&data_read_event);
// Now use the data from the host_read_ptr
To work with an example using clEnqueueMapBuffer
, refer to Data Transfer (C) in the Vitis Examples GitHub repository.
Using Host Pointer Buffers
CL_MEM_USE_HOST_PTR
is not recommended for embedded platforms.
Embedded platforms require contiguous memory allocation and should use the CL_MEM_ALLOC_HOST_PTR
method, as described in Letting XRT Allocate Buffers.There are two main parts of a cl_mem
object:
host side pointer and device side pointer. Before the kernel starts its operation, the device
side pointer is implicitly allocated on the device side memory (for example, on a specific
location inside the device global memory) and the buffer becomes a resident on the device.
Using clEnqueueMigrateMemObjects
this allocation and data
transfer occur upfront, much ahead of the kernel execution. This especially helps to enable
software pipelining if the host is executing the same
kernel multiple times, because data transfer for the next transaction can happen when kernel
is still operating on the previous data set, and thus hide the data transfer latency of
successive kernel executions.
The OpenCL framework provides a number of
APIs for transferring data between the host and the device. Typically, data movement APIs,
such as clEnqueueWriteBuffer
and clEnqueueReadBuffer
, implicitly migrate memory objects to the device after they
are enqueued. They do not guarantee when the data is transferred, and this makes it difficult
for the host application to synchronize the movement of memory objects with the computation
performed on the data.
Xilinx recommends using clEnqueueMigrateMemObjects
instead of
clEnqueueWriteBuffer
or clEnqueueReadBuffer
to improve the performance. Using this API, memory migration
can be explicitly performed ahead of the dependent commands. This allows the host application
to preemptively change the association of a memory object, through regular command queue
scheduling, to prepare for another upcoming command. This also permits an application to
overlap the placement of memory objects with other unrelated operations before these memory
objects are needed, potentially hiding or reducing data transfer latencies. After the event
associated with clEnqueueMigrateMemObjects
has been marked
complete, the host program knows the memory objects have been successfully migrated.
clEnqueueMigrateMemObjects
is that it can migrate multiple memory
objects in a single API call. This reduces the overhead of scheduling and calling functions to
transfer data for more than one memory object.The following code shows the use of clEnqueueMigrateMemObjects
:
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
host_mem_ptr[i] = <... >
}
cl_mem dev_mem_ptr = clCreateBuffer(context,
CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
sizeof(int) * number_of_words, host_mem_ptr, NULL);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr);
err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0,
NULL, NULL);
Allocating Page-Aligned Host Memory
XRT allocates memory space in 4K boundary for internal memory management. If
the host memory pointer is not aligned to a page boundary, XRT performs extra memcpy
to make it aligned. Hence you should align the host memory
pointer with the 4K boundary to save the extra memory copy operation.
The following is an example of how posix_memalign
is used instead of malloc
for the
host memory space pointer.
int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));
// Aligning memory in 4K boundary
posix_memalign(&host_mem_ptr,4096,MAX_LENGTH*sizeof(int));
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
host_mem_ptr[i] = <... >
}
cl_mem dev_mem_ptr = clCreateBuffer(context,
CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
sizeof(int) * number_of_words, host_mem_ptr, NULL);
err = clEnqueueMigrateMemObjects(commands, 1, dev_mem_ptr, 0, 0,
NULL, NULL);
Sub-Buffers
Though not very common, using sub-buffers can be very useful in specific situations. The following sections discuss the scenarios where using sub-buffers can be beneficial.
Reading a Specific Portion from the Device Buffer
Consider a kernel that produces different amounts of data depending on the input
to the kernel. For example, a compression engine where the output size varies depending
on the input data pattern and similarity. The host can still read the whole output
buffer by using clEnqueueMigrateMemObjects
, but that is
a suboptimal approach as more than the required memory transfer would occur. Ideally the
host program should only read the exact amount of data that the kernel has written.
clEnqueueReadBuffer
two times, first to read the amount of
data being returned, and second to read exact amount of data returned by the kernel
based on the information from the first
read.clEnqueueReadBuffer(command_queue,device_write_ptr, CL_FALSE, 0, sizeof(int) * 1,
&kernel_write_size, 0, nullptr, &size_read_event);
clEnqueueReadBuffer(command_queue,device_write_ptr, CL_FALSE, DATA_READ_OFFSET,
kernel_write_size, host_ptr, 1, &size_read_event, &data_read_event);
clEnqueueMigrateMemObject
, which is
recommended over clEnqueueReadBuffer
or clEnqueueWriteBuffer
, you can adopt a similar approach by
using sub-buffers. This is shown in the following code sample. //Create a small sub-buffer to read the quantity of data
cl_buffer_region buffer_info_1={0,1*sizeof(int)};
cl_mem size_info = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY,
CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_1, &err);
// Map the sub-buffer into the host space
auto size_info_host_ptr = clEnqueueMapBuffer(queue, size_info,,,, );
// Read only the sub-buffer portion
clEnqueueMigrateMemObjects(queue, 1, &size_info, CL_MIGRATE_MEM_OBJECT_HOST,,,);
// Retrive size information from the already mapped size_info_host_ptr
kernel_write_size = ...........
// Create sub-buffer to read the required amount of data
cl_buffer_region buffer_info_2={DATA_READ_OFFSET, kernel_write_size};
cl_mem buffer_seg = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY,
CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_2,&err);
// Map the subbuffer into the host space
auto read_mem_host_ptr = clEnqueueMapBuffer(queue, buffer_seg,,,);
// Migrate the subbuffer
clEnqueueMigrateMemObjects(queue, 1, &buffer_seg, CL_MIGRATE_MEM_OBJECT_HOST,,,);
// Now use the read data from already mapped read_mem_host_ptr
Device Buffer Shared by Multiple Memory Ports or Multiple Kernels
Sometimes memory ports of kernels only require small amounts of data. However, managing small sized buffers, transferring small amounts of data, may have potential performance issues for your application. Alternatively, your host program can create a larger size buffer, divided into smaller sub-buffers. Each sub-buffer is assigned as a kernel argument as discussed in Setting Kernel Arguments, for each of those memory ports requiring small amounts of data.
Once sub-buffers are created they are used in the host code similar to regular buffers. This can improve performance as XRT handles a large buffer in a single transaction, instead of several small buffers and multiple transactions.
Kernel Execution
Often the compute intensive task required by the host application can be defined inside a single kernel, and the kernel is executed only once to work on the entire data range. Because there is an overhead associated with multiple kernel executions, invoking a single monolithic kernel can improve performance. Though the kernel is executed only one time, and works on the entire range of the data, the parallelism is achieved on the FPGA inside the kernel hardware. If properly coded, the kernel is capable of achieving parallelism by various techniques such as instruction-level parallelism (loop pipeline) and function-level parallelism (dataflow). These different kernel coding techniques are discussed in C/C++ Kernels.
When the kernel is compiled to a single hardware instance (or CU) on the FPGA, the
simplest method of executing the kernel is using clEnqueueTask
as shown below.
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
XRT schedules the workload, or the data passed through OpenCL buffers from the kernel arguments, and schedules the kernel tasks to run on the accelerator on the Xilinx FPGA.
clEnqueueNDRangeKernel
is supported (only
for OpenCL kernel), Xilinx recommends using
clEnqueueTask
.However, sometimes using a single clEnqueueTask to run the kernel is not
always feasible due to various reasons. For example, the kernel
code can become too big and complex to optimize if it attempts
to perform all compute intensive tasks in a single execution.
Sometimes multiple kernels can be designed performing different
tasks on the FPGA in parallel, requiring multiple enqueue
commands. Or the host application can be receiving data over
time, and not all the data can be processed at one time.
Therefore, depending on the situation and application, you may
need to break the data and the task of the kernel into multiple
clEnqueueTask
commands. In this case, an out-of-order command queue, or an
in-order command queue can determine how the kernel tasks are
processed as explained in Command Queues. In addition, multiple
kernel tasks can be implemented as blocking events, or
non-blocking events as described in Event Synchronization. These can all affect
the performance of the design.
The following topics discuss various methods you can use to run a kernel, run multiple kernels, or run multiple instances of the same kernel on the accelerator.
Task Parallelism Using Different Kernels
Sometimes the compute intensive task required by the host application can be
broken into multiple, different kernels designed to perform different tasks on the FPGA
in parallel. By using multiple clEnqueueTask
commands
in an out-of-order command queue, for example, you can have multiple kernels performing
different tasks, running in parallel. This enables the task parallelism on the FPGA.
Spatial Data Parallelism: Increase Number of Compute Units
Sometimes the compute intensive task required by the host application can
process the data across multiple hardware instances of the same kernel, or compute units
(CUs) to achieve data parallelism on the FPGA. If a single kernel has been compiled into
multiple CUs, the clEnqueueTask
command can be called
multiple times in an out-of-order command queue, to enable data parallelism. Each call
of clEnqueueTask
would schedule a workload of data in
different CUs, working in parallel.
Temporal Data Parallelism: Host-to-Kernel Dataflow
Sometimes, the data processed by a compute unit passes from one stage of processing in the kernel, to the next stage of processing. In this case, the first stage of the kernel may be free to begin processing a new set of data. In essence, like a factory assembly line, the kernel can accept new data while the original data moves down the line.
To understand this approach, assume a kernel has only one CU on the FPGA, and the host application enqueues the kernel multiple times with different sets of data. As shown in Using Host Pointer Buffers, the host application can migrate data to the device global memory ahead of the kernel execution, thus hiding the data transfer latency by the kernel execution, enabling software pipelining.
However, by default, a kernel can only start processing a new set of data only
when it has finished processing the current set of data. Although clEnqueueMigrateMemObject
hides the data
transfer time, multiple kernel executions still remain sequential.
By enabling host-to-kernel dataflow, it is possible to further improve the
performance of the accelerator by restarting the kernel with a new set of
data while the kernel is still processing the previous set of data. As
discussed in Enabling Host-to-Kernel Dataflow, the kernel
must implement the ap_ctrl_chain
interface, and must be
written to permit processing data in stages. In this case, XRT restarts the
kernel as soon as it is able to accept new data, thus overlapping multiple
kernel executions. However, the host program must keep the command queue
filled with requests so that the kernel can restart as soon as it is ready
to accept new data.
The following is a conceptual diagram for host-to-kernel dataflow.
The longer the kernel takes to process a set of data from start to finish, the
greater the opportunity to use host-to-kernel dataflow to improve
performance. Rather than waiting until the kernel has finished processing
one set of data, simply wait until the kernel is ready to begin processing
the next set of data. This allows temporal
parallelism, where different stages of the same kernel
processes a different set of data from multiple clEnqueueTask
commands, in a pipelined manner.
For advanced designs, you can effectively use both the spatial parallelism with multiple CUs to process data, combined with temporal parallelism using host-to-kernel dataflow, overlapping kernel executions on each compute unit.
Enabling Host-to-Kernel Dataflow
If a kernel is capable of accepting more data while it is still operating on
data from the previous transactions, XRT can send the next batch of data. The kernel
then works on multiple data sets in parallel at different stages of the algorithm, thus
improving performance. To support host-to-kernel dataflow, the kernel has to implement
the ap_ctrl_chain
protocol using the pragma HLS
interface for the
function return:
void kernel_name( int *inputs,
... )// Other input or Output ports
{
#pragma HLS INTERFACE ..... // Other interface pragmas
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
Symmetrical and Asymmetrical Compute Units
As discussed in Creating Multiple Instances of a Kernel, multiple compute units (CUs) of a single kernel can be instantiated on the FPGA during the kernel linking process. CUs can be considered symmetrical or asymmetrical with regard to other CUs of the same kernel.
- Symmetrical
- CUs are considered symmetrical when they have exactly the same
connectivity.sp
options, and therefore have identical connections to global memory. As a result, the Xilinx Runtime can use them interchangeably. A call toclEnqueueTask
can result in the invocation of any instance in a group of symmetrical CUs. - Asymmetrical
- CUs are considered asymmetrical when they do not have exactly
the same
connectivity.sp
options, and therefore do not have identical connections to global memory. Using the same setup of input and output buffers, it is not possible for XRT to execute asymmetrical CUs interchangeably.
Kernel Handle and Compute Units
The first time clSetKernelArg
is called for a
given kernel object, XRT identifies the group of symmetrical CUs for subsequent
executions of the kernel. When clEnqueueTask
is called
for that kernel, any of the symmetrical CUs in that group can be used to process the
task.
If all CUs for a given kernel are symmetrical, a single kernel object is
sufficient to access any of the CUs. However, if there are
asymmetrical CUs, the host application will need to create a
unique kernel object for each group of asymmetrical CUs. In this
case, the call to clEnqueueTask
must specify the kernel
object to use for the task, and any matching CU for that kernel
can be used by XRT.
Creating Kernel Objects for Specific Compute Units
For creating kernels associated with
specific compute units, the clCreateKernel
command supports
specifying the CUs at the time the kernel object is
created by the host program. The syntax of this
command is shown below:
// Create kernel object only for a specific compute unit
cl_kernel kernelA = clCreateKernel(program,"<kernel_name>:{compute_unit_name}",&err);
// Create a kernel object for two specific compute units
cl_kernel kernelB = clCreateKernel(program, "<kernel_name>:{CU1,CU2}", &err);
connectivity.nk
option
in a config file used by the v++
command during
linking. Therefore, whatever is specified in the
host program, to create or enqueue kernel objects,
must match the options specified by the config file
used during linking.In this case, the Xilinx Runtime identifies the kernel
handles (kernelA
,
kernelB
) for
specific CUs, or group of CUs, when the kernel is
created. This lets you control which kernel
configuration, or specific CU instance is used, when
using clEnqueueTask
from within the host
program. This can be useful in the case of
asymmetrical CUs, or to perform load and priority
management of CUs.
Using Compute Unit Name to Get Handle of All Asymmetrical Compute Units
If a kernel instantiates multiple CUs that are not symmetrical, the clCreateKernel
command can be specified with CU names to
create different CU groups. In this case, the host program can reference a specific CU
group by using the cl_kernel
handle returned by
clCreateKernel
.
In the following example, the kernel mykernel
has five CUs: K1, K2, K3, K4, and K5. The K1, K2, and K3 CUs are a symmetrical group, having
symmetrical connection on the device. Similarly, CUs K4 and K5 form a second symmetrical CU
group. The following code segment shows how to address a specific CU group using cl_kernel
handles.
// Kernel handle for Symmetrical compute unit group 1: K1,K2,K3
cl_kernel kernelA = clCreateKernel(program,"mykernel:{K1,K2,K3}",&err);
for(i=0; i<3; i++) {
// Creating buffers for the kernel_handle1
.....
// Setting kernel arguments for kernel_handle1
.....
// Enqueue buffers for the kernel_handle1
.....
// Possible candidates of the executions K1,K2 or K3
clEnqueueTask(commands, kernelA, 0, NULL, NULL);
//
}
// Kernel handle for Symmetrical compute unit group 1: K4, K5
cl_kernel kernelB = clCreateKernel(program,"mykernel:{K4,K5}",&err);
for(int i=0; i<2; i++) {
// Creating buffers for the kernel_handle2
.....
// Setting kernel arguments for kernel_handle2
.....
// Enqueue buffers for the kernel_handle2
.....
// Possible candidates of the executions K4 or K5
clEnqueueTask(commands, kernelB, 0, NULL, NULL);
}
Event Synchronization
All OpenCL enqueue-based API calls are
asynchronous. These commands will return immediately after the command is enqueued in the
command queue. To pause the host program to wait for results, or resolve any dependencies
among the commands, an API call such as clFinish
or clWaitForEvents
can be used to block
execution of the host program.
The following code shows examples for clFinish
and clWaitForEvents
.
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue);
// Create event, read memory from device, wait for read to complete, verify results
cl_event readevent;
// host memory for output vector
int host_mem_output_ptr[MAX_LENGTH];
//Enqueue ReadBuffer, with associated event object
clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words,
host_mem_output_ptr, 0, NULL, &readevent );
// Wait for clEnqueueReadBuffer event to finish
clWaitForEvents(1, &readevent);
// After read is complete, verify results
...
Note how the commands have been used in the example above:
- The
clFinish
API has been explicitly used to block the host execution until the kernel execution is finished. This is necessary otherwise the host can attempt to read back from the FPGA buffer too early and may read garbage data. - The data transfer from FPGA memory to the local host machine is done
through
clEnqueueReadBuffer
. Here the last argument ofclEnqueueReadBuffer
returns an event object that identifies this particular read command, and can be used to query the event, or wait for this particular command to complete. TheclWaitForEvents
command specifies a single event (the readevent), and waits to ensure the data transfer is finished before verifying the data.
Post-Processing and FPGA Cleanup
At the end of the host code, all the allocated resources should be released by using proper release functions. If the resources are not properly released, the Vitis core development kit might not able to generate a correct performance related profile and analysis report.
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID);
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);
Summary
As discussed in earlier topics, the recommended coding style for the host program in the Vitis core development kit includes the following points:
- Add error checking after each OpenCL API call for debugging purpose, if required.
- In the Vitis core development kit, one or
more kernels are separately compiled/linked to build the XCLBIN file. The API
clCreateProgramWithBinary
is used to build thecl_program
object from the kernel binary. - Use buffer for setting the kernel argument (
clSetKernelArg
) before any enqueue operation on the buffer. - Transfer data back and forth from the host code to the kernel by using
clEnqueueMigrateMemObjects
. - For data center platforms using
CL_MEM_USE_HOST_PTR
, applyposix_memalign
to align the host memory pointer at 4K boundary as described in Allocating Page-Aligned Host Memory. - Preferably use the out-of-order command queue for concurrent command execution on the FPGA.
- Execute the whole workload with
clEnqueueTask
, rather than splitting the workload by usingclEnqueueNDRangeKernel
. - Use event synchronization commands,
clFinish
andclWaitForEvents
, to resolve dependencies of the asynchronous OpenCL API calls. - Release all OpenCL allocated resources when finished.