Programming the Host Application
In the SDAccel™ environment, host code is written in C or C++ using the industry-standard OpenCL™ API. The SDAccel environment provides an OpenCL 1.2 embedded profile conformant runtime API. The SDAccel runtime also supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. Refer to OpenCL Installable Client Driver Loader for details and installation instructions.
The SDAccel environment consists of a host x86 CPU and compute devices running on a Xilinx® FPGA. In general, the structure of the host application can be divided into three sections:
- Setting up the environment.
- Core command execution including executing one or more kernels.
- Post processing and FPGA release.
Do not use the fork() system call from an SDAccel environment application: fork() does not duplicate all the runtime threads, so the child process cannot run as a complete application in the SDAccel environment. It is advisable to use the posix_spawn() system call to launch another process from the SDAccel environment application.
Setting Up the OpenCL Environment
The host code in the SDAccel environment follows the OpenCL programming paradigm. To set up the environment properly, the host application needs to identify the standard OpenCL objects: platform, devices, context, command queue, and program.
Platform
The following code fragment identifies a Xilinx device-based platform:
cl_platform_id platform_id;      // platform id
cl_platform_id platforms[16];    // list of available platforms
char cl_platform_vendor[1001];
cl_uint platform_count;
cl_int err;

err = clGetPlatformIDs(16, platforms, &platform_count);

// Find Xilinx Platform
for (unsigned int iplat=0; iplat<platform_count; iplat++) {
    err = clGetPlatformInfo(platforms[iplat],
                            CL_PLATFORM_VENDOR,
                            1000,
                            (void *)cl_platform_vendor,
                            NULL);
    if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
        // Xilinx Platform found
        platform_id = platforms[iplat];
    }
}
The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Thereafter, clGetPlatformInfo is used to identify the Xilinx device-based platform by matching cl_platform_vendor with the string "Xilinx".
It is good practice to check the error code returned by every OpenCL API call. The following code fragment shows error checking for the clGetPlatformIDs command:
err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
    printf("Error: Failed to find an OpenCL platform!\n");
    printf("Test failed\n");
    exit(1);
}
Devices
After the platform detection, the Xilinx FPGA devices attached to the platform are identified. The SDAccel environment supports one or more Xilinx FPGA devices working together.
The following code demonstrates finding all the Xilinx devices (with an upper limit of 16) by using the clGetDeviceIDs API and printing their names:

cl_device_id devices[16];   // compute device id
char cl_device_name[1001];
cl_uint num_devices;

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR,
                     16, devices, &num_devices);
printf("INFO: Found %d devices\n", num_devices);

// Iterate over all devices to select the target device
for (uint i=0; i<num_devices; i++) {
    err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(cl_device_name), cl_device_name, 0);
    printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
The clGetDeviceIDs API is called with the device_type CL_DEVICE_TYPE_ACCELERATOR to receive all the available Xilinx devices.
Sub-Devices
In the SDAccel environment, devices sometimes contain multiple kernel instances, either of a single kernel or of different kernels. The OpenCL API clCreateSubDevices allows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, the SDAccel environment supports only equally divided sub-devices, each containing a single kernel instance.
The following example shows:
- The sub-devices are created by equal partition to execute one kernel instance per sub-device.
- Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
- The API calls related to kernel execution (and the corresponding buffer-related code) are not shown for the sake of simplicity; they would be contained inside the function run_cu.
cl_uint num_devices = 0;
cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY, 1, 0};

// Get the number of sub-devices
clCreateSubDevices(device, props, 0, nullptr, &num_devices);

// Container to hold the sub-devices
std::vector<cl_device_id> devices(num_devices);

// Second call of clCreateSubDevices
// We get sub-device handles in devices.data()
clCreateSubDevices(device, props, num_devices, devices.data(), nullptr);

// Iterating over sub-devices
std::for_each(devices.begin(), devices.end(), [kernel](cl_device_id sdev) {
    cl_int err;
    // Context for sub-device
    auto context = clCreateContext(0, 1, &sdev, nullptr, nullptr, &err);
    // Command-queue for sub-device
    auto queue = clCreateCommandQueue(context, sdev,
                                      CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    // Execute the kernel on the sub-device using the local context and queue
    run_cu(context, queue, kernel); // Function not shown
});
Context
The clCreateContext API is used to create a context that contains one or more Xilinx devices that will communicate with the host machine.

context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
In the code example above, the clCreateContext API is used to create a context that contains one Xilinx device. Create only one context per device from a host program. However, if sub-devices are used, the host program should use multiple contexts, one context for each sub-device.
Command Queues
One or more command queues for each device are created using the clCreateCommandQueue API.
The FPGA device can contain multiple kernels. When developing the host application,
there are two main programming approaches to execute kernels on a device:
- Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. The SDAccel runtime environment dispatches those kernels as soon as possible in any order allowing concurrent kernel execution on the FPGA.
- Multiple in-order command queues: Each kernel execution is requested through a different in-order command queue. In such cases, the SDAccel runtime environment dispatches kernels from the different command queues, improving performance by running them concurrently on the FPGA.
// Out-of-order Command queue
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
// In-order Command Queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
Program
As described in the SDAccel Build Process, the host and kernel code are compiled separately to create separate executable files: the host application (.exe) and the FPGA binary (.xclbin). When the host application is executed, it must load the .xclbin file using the clCreateProgramWithBinary API.
unsigned char *kernelbinary;
char *xclbin = argv[1];
printf("INFO: loading xclbin %s\n", xclbin);
int size=load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size;
cl_program program = clCreateProgramWithBinary(context, 1, &device_id,
&size_var,(const unsigned char **) &kernelbinary,
&status, &err);
// Function
int load_file_to_memory(const char *filename, char **result)
{
    uint size = 0;
    FILE *f = fopen(filename, "rb");
    if (f == NULL) {
        *result = NULL;
        return -1; // -1 means file opening fail
    }
    fseek(f, 0, SEEK_END);
    size = ftell(f);
    fseek(f, 0, SEEK_SET);
    *result = (char *)malloc(size+1);
    if (size != fread(*result, sizeof(char), size, f)) {
        free(*result);
        fclose(f);
        return -2; // -2 means file reading fail
    }
    fclose(f);
    (*result)[size] = 0;
    return size;
}
The above example performs the following steps:
- The kernel binary file, .xclbin, is passed in from the command line argument argv[1]. TIP: Passing the .xclbin file through a command line argument is specific to this example. You can also hardcode the kernel binary file in the application, use an environment variable, read it from a custom initialization file, or use any other suitable mechanism.
- The load_file_to_memory function is used to load the file contents into the host machine memory space.
- The clCreateProgramWithBinary API is used to complete the program creation process.
Executing Commands in the FPGA Device
Once the OpenCL environment is initialized, the host application is ready to issue commands to the device and interact with the kernels. These commands include:
- Setting up the kernels.
- Buffer transfer to/from the FPGA.
- Kernel execution on the FPGA.
- Event synchronization.
Setting Up the Kernels
After the initialization of all the preliminaries such as context, command queues, and program, the host application should identify the kernels that will execute on the device and set up their arguments.
Kernels Identification
The clCreateKernel API should be used to access the kernels present inside the .xclbin file. The kernel handle (cl_kernel type) denotes a kernel object that can now be used in the rest of the host program.

kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err); // etc
Setting Kernel Arguments
In the SDAccel environment framework, two types of kernel arguments can be set:
- The scalar arguments are used for small data transfer, such as constant or configuration type data. These are write-only arguments from the host application perspective.
- The buffer arguments are used for large data transfer.
Kernel arguments can be set using the clSetKernelArg command, as shown below. The following example shows setting kernel arguments for two scalar arguments and two buffer arguments.

cl_mem dev_buf1 = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, size, &host_mem_ptr1, NULL);
cl_mem dev_buf2 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size, &host_mem_ptr2, NULL);
int err = 0;
// Setting up scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);
cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);
// Setting up buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_buf1);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_buf2);
Note that the kernel arguments must be set with clSetKernelArg before performing any enqueue operation (for example, clEnqueueMigrateMemObjects) on any buffer.
Buffer Transfer to/from the FPGA Device
Data can be transferred to and from the FPGA device using the clCreateBuffer, clEnqueueWriteBuffer, and clEnqueueReadBuffer commands. However, Xilinx recommends using clEnqueueMigrateMemObjects instead of clEnqueueReadBuffer and clEnqueueWriteBuffer. The following code example demonstrates the basic approach:
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
    host_mem_ptr[i] = <... >
}

cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                    sizeof(int) * number_of_words, NULL, NULL);

err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
                           sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
For simple applications the example code above would be sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and fine-grain control.
Using clEnqueueMigrateMemObjects
Xilinx recommends using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve performance. A cl_mem object has two main parts: a host-side pointer and a device-side pointer. Before the kernel starts its operation, the device-side pointer is implicitly allocated in device memory (for example, at a specific location inside the device global memory) and the buffer becomes resident on the device. However, by using clEnqueueMigrateMemObjects, this allocation and data transfer occur up front, well ahead of the kernel execution. This especially helps to enable software pipelining when the host executes the same kernel multiple times, because the data transfer for the next transaction can occur while the kernel is still operating on the previous data set, thereby hiding the data transfer latency of successive kernel executions.
The following code example is modified to use clEnqueueMigrateMemObjects:
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector

// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
    host_mem_ptr[i] = <... >
}

cl_mem dev_mem_ptr = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    sizeof(int) * number_of_words, host_mem_ptr, NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
Allocating Page-Aligned Host Memory
The Xilinx Runtime allocates memory space on 4K boundaries for internal memory management. If the host memory pointer is not aligned to a page boundary, the Xilinx Runtime performs an extra memcpy to make it aligned. Hence, you should align the host memory pointer to a 4K boundary to avoid this extra memory copy operation.
The following is an example of how posix_memalign
is used instead of malloc
for the host memory space pointer.
int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));

// Aligning memory on a 4K boundary
posix_memalign((void **)&host_mem_ptr, 4096, MAX_LENGTH*sizeof(int));

// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
    host_mem_ptr[i] = <... >
}

cl_mem dev_mem_ptr = clCreateBuffer(context,
                                    CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    sizeof(int) * number_of_words, host_mem_ptr, NULL);

err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);
Using clEnqueueMapBuffer
Another approach for creating and managing buffers is to use clEnqueueMapBuffer. With this approach, it is not necessary to create a host space pointer aligned to a 4K boundary. The clEnqueueMapBuffer API maps the specified buffer and returns a pointer created by the Xilinx Runtime to this mapped region. Fill this host-side pointer with your data, and then use clEnqueueMigrateMemObjects to transfer the data to and from the device.
Below is an example that uses this style. Note that CL_MEM_USE_HOST_PTR is not used for clCreateBuffer.
// Two cl_mem buffer, for read and write by kernel
cl_mem dev_mem_read_ptr = clCreateBuffer(context,
CL_MEM_READ_ONLY,
sizeof(int) * number_of_words, NULL, NULL);
cl_mem dev_mem_write_ptr = clCreateBuffer(context,
CL_MEM_WRITE_ONLY,
sizeof(int) * number_of_words, NULL, NULL);
// Setting arguments
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_read_ptr);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_mem_write_ptr);
// Get the host-side pointers of the cl_mem buffer objects
cl_event migrate_event, enqueue_event, data_read_event;
auto host_write_ptr = (int *)clEnqueueMapBuffer(queue, dev_mem_read_ptr, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, nullptr, nullptr, &err);
auto host_read_ptr  = (int *)clEnqueueMapBuffer(queue, dev_mem_write_ptr, CL_TRUE, CL_MAP_READ, 0, bytes, 0, nullptr, nullptr, &err);

// Fill up host_write_ptr to send the data to the FPGA
for(int i=0; i< MAX; i++) {
    host_write_ptr[i] = <.... >
}

// Migrate the buffers to the device (dev_mem_write_ptr contents are produced by the kernel)
cl_mem mems[2] = {dev_mem_read_ptr, dev_mem_write_ptr};
clEnqueueMigrateMemObjects(queue, 2, mems, 0, 0, nullptr, &migrate_event);

// Schedule the kernel
clEnqueueTask(queue, kernel, 1, &migrate_event, &enqueue_event);
// Migrate data back to host
clEnqueueMigrateMemObjects(queue, 1, &dev_mem_write_ptr,
CL_MIGRATE_MEM_OBJECT_HOST,1,&enqueue_event, &data_read_event);
clWaitForEvents(1,&data_read_event);
// Now use the data from the host_read_ptr
Buffer Allocation on the Device
By default, all the memory interfaces from all the kernels are connected to a single default global memory bank when kernels are linked. As a result, only one compute unit (CU) can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA contains only one global memory bank, then this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This topic is discussed in greater detail in Connecting Kernel Ports to Global Memory. Overall performance is improved by enabling multiple kernel memory interfaces to concurrently read and write data from separate global memory banks.
In the SDAccel environment, the host code and the kernel code are compiled independently, so the Xilinx Runtime needs to detect the kernel's memory connectivity to send the data to the correct memory location from the host code. The 2019.1 Xilinx Runtime automatically finds the buffer location from the kernel binary files if clSetKernelArg is used before any enqueue operation on the buffer, for example clEnqueueMigrateMemObjects, as illustrated in the brief sketch below.
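For example, a minimal sketch of the required ordering, reusing the kernel, dev_mem_ptr, and commands names from the earlier examples, would be:

// Set the kernel argument first so the runtime can determine, from the kernel
// binary, which global memory bank this buffer should be allocated in
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_mem_ptr);

// Only after that, enqueue operations on the buffer
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0, NULL, NULL);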
Before the 2019.1 release, the OpenCL
host code required a Xilinx extension (cl_mem_ext_ptr
) to specify the exact buffer location on
the device. Though this method is still supported, it is no longer necessary and is not
documented in this version of the guide. For more information on specifying buffer
location using cl_mem_ext_ptr
, see the earlier version
of this guide.
Sub-Buffers
Though not very common, using sub-buffers can be very useful in specific situations. The following sections discuss scenarios where using sub-buffers can be beneficial.
Reading a Specific Portion from the Device Buffer
Consider a kernel that writes an amount of output data that depends on its input. The host can read the entire device buffer by using clEnqueueMigrateMemObjects, but that is a suboptimal approach as more data than required would be transferred. Ideally, the host only needs to read the exact amount of data that the kernel has written. One technique the kernel can adopt is to write the size of the output data at the start of the written output data. If using clEnqueueReadBuffer, the host code can call clEnqueueReadBuffer twice, first to read the size of the data, and then to read the exact amount of data by using the size information from the first read.
// Read the size of the output data first (blocking read)
clEnqueueReadBuffer(command_queue, device_write_ptr, CL_TRUE, 0, sizeof(int) * 1,
                    &kernel_write_size, 0, nullptr, &size_read_event);
// Read exactly kernel_write_size bytes of output data
clEnqueueReadBuffer(command_queue, device_write_ptr, CL_FALSE, DATA_READ_OFFSET,
                    kernel_write_size, host_ptr, 1, &size_read_event, &data_read_event);
With clEnqueueMigrateMemObjects, which is recommended over clEnqueueReadBuffer or clEnqueueWriteBuffer, you can adopt a similar approach by using sub-buffers. The sample code below demonstrates this (note that not all API arguments are shown).
//Create a small subbuffer
cl_buffer_region buffer_info_1={0,1*sizeof(int)};
cl_mem size_info = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY,
CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_1,&err);
// Map the subbuffer into the host space
auto size_info_host_ptr = clEnqueueMapBuffer(queue,size_info,,,, );
// Read only the subbuffer portion
clEnqueueMigrateMemObjects(queue, 1, &size_info,CL_MIGRATE_MEM_OBJECT_HOST,,,);
// Retrieve size information from the already mapped size_info_host_ptr
kernel_write_size = ...........
// Create sub-buffer again for required amount
cl_buffer_region buffer_info_2={DATA_READ_OFFSET, kernel_write_size};
cl_mem buffer_seg = clCreateSubBuffer (device_write_ptr, CL_MEM_WRITE_ONLY,
CL_BUFFER_CREATE_TYPE_REGION, &buffer_info_2,&err);
// Map the subbuffer into the host space
auto read_mem_host_ptr = clEnqueueMapBuffer(queue, buffer_seg,,,);
// Migrate the subbuffer
clEnqueueMigrateMemObjects(queue, 1, &buffer_seg,CL_MIGRATE_MEM_OBJECT_HOST,,,);
// Now use the read data from already mapped read_mem_host_ptr
Device Buffer Shared by Multiple Memory Ports or Multiple Kernels
Sometimes the memory ports of the kernels require only small amounts of data. Managing and sending many small-sized buffers can cause performance issues. Alternatively, the host can create one large buffer divided into small sub-buffers, and assign each sub-buffer as the kernel argument for each of the memory ports that require small amounts of data. This can improve performance, as the Xilinx Runtime handles one large buffer instead of several small buffers.
Once sub-buffers are created, they are used in the host code like regular buffers, as shown in the sketch below.
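The following is a minimal sketch of this approach, assuming a hypothetical kernel with four buffer arguments (one per memory port); the names NUM_PORTS, parent_buf, sub_bufs, and chunk_bytes are illustrative only. Note that each sub-buffer origin must meet the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment requirement.

#define NUM_PORTS 4
size_t chunk_bytes = 4096 * sizeof(int);   // per-port data size (illustrative)

// One large parent buffer covering all of the small regions
cl_mem parent_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                   NUM_PORTS * chunk_bytes, NULL, &err);

// Divide the parent buffer into sub-buffers, one per kernel memory port
cl_mem sub_bufs[NUM_PORTS];
for (int p = 0; p < NUM_PORTS; p++) {
    cl_buffer_region region = {p * chunk_bytes, chunk_bytes};
    sub_bufs[p] = clCreateSubBuffer(parent_buf, CL_MEM_READ_ONLY,
                                    CL_BUFFER_CREATE_TYPE_REGION, &region, &err);
    // Each sub-buffer is set as a separate kernel argument
    err |= clSetKernelArg(kernel, p, sizeof(cl_mem), &sub_bufs[p]);
}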
Kernel Execution
Assuming the kernel is compiled to a single hardware instance (or CU) on the
FPGA, the simplest method of executing the kernel is using clEnqueueTask
as shown below.
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
The Xilinx Runtime schedules the workload (the data passed through the OpenCL buffers set as kernel arguments) and schedules the kernel to run its compute-intensive task on the FPGA. Though clEnqueueNDRangeKernel is supported (only for OpenCL kernels), Xilinx recommends using clEnqueueTask.
There are various methods by which you can execute the kernel, multiple kernels, or multiple instances of the same kernel on the FPGA. These are discussed in the following sections.
Single Kernel Invocation for the Entire Task and Data Parallelism
Often the complete compute-intensive task is defined inside a single kernel, and the kernel is executed only one time to work on the entire data range. As there is an overhead associated with multiple kernel executions, this approach certainly helps in many cases. Though the kernel is executed only one time and works on the entire range of the data, parallelism (and thereby acceleration) is achieved on the FPGA inside the kernel hardware. If properly coded, the kernel is capable of achieving parallelism by various techniques such as instruction-level parallelism (loop pipelining) and function-level parallelism (dataflow). You will learn about different kernel coding techniques in Programming C/C++ Kernels.
However, the above-mentioned single clEnqueueTask is not always feasible due to various practical reasons. For example, the kernel code can become too big and complex to optimize if it attempts to perform all compute-intensive tasks in a single execution. Another possible case is when the host receives data over time rather than all at once. Therefore, depending on the situation and the application, there are different ways to break the data and the task into multiple clEnqueueTask commands, as discussed in the next sections.
Task Parallelism by Using Multiple Different Kernels
Sometimes multiple kernels can be designed to perform different tasks on the FPGA in parallel. By using multiple clEnqueueTask commands (through an out-of-order command queue), it is possible to have multiple kernels (performing different tasks) working in parallel on the FPGA, thereby enabling task parallelism, as shown in the sketch below.
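The following minimal sketch assumes two kernel objects, kernelA and kernelB (hypothetical names), whose arguments have already been set:

// An out-of-order command queue allows the runtime to dispatch both kernels concurrently
cl_command_queue ooo_queue = clCreateCommandQueue(context, device_id,
                                 CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

// Request both kernels; they can run in parallel on the FPGA
err = clEnqueueTask(ooo_queue, kernelA, 0, NULL, NULL);
err = clEnqueueTask(ooo_queue, kernelB, 0, NULL, NULL);

// Wait until both tasks have completed
clFinish(ooo_queue);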
Spatial Data Parallelism: Increase Number of Compute Units
If a single kernel has been compiled into multiple hardware instances (or CUs), clEnqueueTask can be called multiple times (using an out-of-order queue) to enable data parallelism. Each call of clEnqueueTask would schedule a workload on a different CU, with the CUs working on different data sets in parallel, as outlined in the sketch below.
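A brief sketch, assuming the out-of-order queue from the previous example, a constant num_data_sets, and a hypothetical helper set_buffer_args() that binds the kernel arguments to the i-th data set:

// Each clEnqueueTask can be dispatched by the runtime to a different CU
for (int i = 0; i < num_data_sets; i++) {
    set_buffer_args(kernel, i);   // hypothetical helper: bind the i-th data set
    err = clEnqueueTask(ooo_queue, kernel, 0, NULL, NULL);
}
clFinish(ooo_queue);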
Temporal Data Parallelism: Host to Kernel Dataflow
To understand this approach, assume a kernel has only one CU on the FPGA, and the host needs to use that CU multiple times on different sets of data. As shown in Using clEnqueueMigrateMemObjects, by using clEnqueueMigrateMemObjects it is possible to send data to the device global memory ahead of time (the data is transferred for the next kernel execution), thus hiding the data transfer latency behind the kernel execution and enabling software pipelining. However, by default, the kernel can start operating on the next set of data only when it has finished working on the current set. Though clEnqueueMigrateMemObjects hides the data transfer time, the kernel executions still remain sequential.
By enabling the host to kernel dataflow, it is even possible to further improve the performance by restarting the kernel while the kernel is still working on the previous sets of data. If the kernel is optimized in a manner such that it is capable of accepting the new data (for the next kernel operation) even when it is still working on the previous data (to achieve this the kernel has to be compiled in a certain manner, see Enabling Host to Kernel Dataflow), the XRT restarts the kernel as soon as possible, thus overlapping the multiple kernel executions.
This allows temporal parallelism between host and kernel, where each section of the kernel hardware works on a specific data set from a different clEnqueueTask command in a pipelined manner. However, the host still needs to fill the command queue ahead of time (by software pipelining) so that the kernel can restart as soon as it is ready to accept a new set of data, as illustrated in the sketch below.
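A minimal sketch of this software-pipelined enqueue pattern is shown below; it assumes the kernel has been compiled for host-to-kernel dataflow, NUM_ITER is a constant, ooo_queue is an out-of-order queue, and in_buf[i]/out_buf[i] are pre-created per-iteration buffers (hypothetical names):

cl_event write_ev[NUM_ITER], run_ev[NUM_ITER];
for (int i = 0; i < NUM_ITER; i++) {
    // Bind this iteration's buffers; arguments are captured at enqueue time
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf[i]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf[i]);
    // Send the next data set ahead of time (overlaps with the previous kernel run)
    clEnqueueMigrateMemObjects(ooo_queue, 1, &in_buf[i], 0, 0, NULL, &write_ev[i]);
    // The runtime can restart the kernel while it is still working on previous data
    clEnqueueTask(ooo_queue, kernel, 1, &write_ev[i], &run_ev[i]);
}
clWaitForEvents(NUM_ITER, run_ev);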
Figure: Conceptual diagram of the host-to-kernel dataflow.
For advanced designs, you can effectively combine spatial parallelism (using more hardware resources or CUs) and software pipelining (clEnqueueMigrateMemObjects) with temporal parallelism (host-to-kernel dataflow, overlapping kernel executions on each CU). If needed, you can combine all of these techniques.
Symmetrical and Asymmetrical Compute Units
During the kernel linking process, a kernel can have multiple CUs on the FPGA.
Symmetrical Compute Units
CUs are considered symmetrical when they have identical connections to global memory (when they have exactly the same --sp options). As a result, the Xilinx Runtime can use them interchangeably. A call to clEnqueueTask can result in the invocation of any one instance in a group of symmetrical CUs.
Asymmetrical Compute Units
CUs are considered asymmetrical when they do not have identical connections to global memory (when they do not have exactly the same --sp options). Using the same setup of the input (and output) buffers, it is not possible to execute such CUs interchangeably, so they are not execution-agnostic from the Xilinx Runtime perspective.
Kernel Handle and Compute Units
The first time clSetKernelArg
is called for
a given kernel object, the Xilinx Runtime selects a
group of symmetrical CUs for the subsequent executions of this kernel. When clEnqueueTask
is called, any of the symmetrical CUs in
that group can be used.
If all CUs for a given kernel are symmetrical, a single kernel object is sufficient to access any of these CUs. If there are asymmetrical CUs, the application will need to create as many kernel objects as there are groups of asymmetrical CUs to ensure all of them can be used.
Creating Kernel Objects for Specific Compute Units
// Create kernel object only for a specific compute unit
kernel1 = clCreateKernel(program, "<kernel_name_1>:{comp_unit_name_1}", &err);
// Create kernel object for two specific compute units
kernel1 = clCreateKernel(program, "<kernel_name_1>:{comp_unit_name_1,comp_unit_name_2}", &err);
This gives control within the application over which specific CU instance is used. This can be useful in the case of asymmetrical CUs or to perform explicit load and priority management of CUs.
Using Compute Unit Name to Get Handle of All Asymmetrical Compute Units
If a kernel does not have all symmetrical CUs, the enhanced clCreateKernel with the CU name can be used. In this case, the host needs to manage each group of symmetrical CUs separately, with a different cl_kernel handle. The following shows a hypothetical example.
Assume the kernel mykernel has five CUs: K1, K2, K3, K4, and K5. Also assume that CUs K1, K2, and K3 have symmetrical connections on the device and can be considered a group of symmetrical CUs. Similarly, CUs K4 and K5 form another group of symmetrical CUs. The code segment below shows how two cl_kernel handles are used to manage the two groups of symmetrical CUs.
// Kernel handle for symmetrical compute unit group 1: K1, K2, K3
cl_kernel kernel_handle1 = clCreateKernel(program, "mykernel:{K1,K2,K3}", &err);

for(int i=0; i<3; i++) {
    // Creating buffers for kernel_handle1
    .....
    // Setting kernel arguments for kernel_handle1
    .....
    // Enqueue buffers for kernel_handle1
    .....
    // Possible candidates for execution: K1, K2, or K3
    clEnqueueTask(commands, kernel_handle1, 0, NULL, NULL);
}

// Kernel handle for symmetrical compute unit group 2: K4, K5
cl_kernel kernel_handle2 = clCreateKernel(program, "mykernel:{K4,K5}", &err);

for(int i=0; i<2; i++) {
    // Creating buffers for kernel_handle2
    .....
    // Setting kernel arguments for kernel_handle2
    .....
    // Enqueue buffers for kernel_handle2
    .....
    // Possible candidates for execution: K4 or K5
    clEnqueueTask(commands, kernel_handle2, 0, NULL, NULL);
}
Event Synchronization
All OpenCL clEnqueueXXX API calls are asynchronous; these commands return immediately after being enqueued in the command queue. To resolve dependencies among the commands, an API call such as clWaitForEvents or clFinish can be used to pause or block execution of the host program. Example usage of the clWaitForEvents and clFinish commands is shown below:
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue);
// Read back the results from the device to verify the output
cl_event readevent;
int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector
clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words,
host_mem_output_ptr, 0, NULL, &readevent );
clWaitForEvents(1, &readevent); // Wait for clEnqueueReadBuffer event to finish
// Check Results
// Compare Golden values with host_mem_output_ptr
Note the following about the example above:
- The clFinish API has been explicitly used to block the host execution until the kernel execution is finished. This is necessary because otherwise the host can attempt to read back from the FPGA buffer too early and may read garbage data.
- The data transfer from the FPGA memory to the local host machine is done through clEnqueueReadBuffer. Here, the last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command and can be used to query the event, or to wait for this particular command to complete. clWaitForEvents specifies that single event and waits to ensure the data transfer is finished before checking the data from the host-side memory.
Post Processing and FPGA Cleanup
At the end of the host code, all the allocated resources should be released by using the proper release functions:
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID);
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);
Summary
As discussed in earlier topics, the recommended coding style for the host application in the SDAccel environment includes the following points:
- Add error checking after each OpenCL API call for debugging purposes, if required.
- In the SDAccel environment, one or more kernels are separately pre-compiled into the .xclbin file. The clCreateProgramWithBinary API is used to build the program from the kernel binary.
- Use clSetKernelArg to set the kernel buffer arguments before performing any enqueue operation on a buffer.
- Transfer data back and forth from the host code to the FPGA by using clEnqueueMigrateMemObjects.
- Use posix_memalign to align the host memory pointer to a 4K boundary.
- Preferably use an out-of-order command queue for concurrent command execution on the FPGA.
- Use synchronization commands to resolve the dependencies of asynchronous OpenCL API calls.