Programming the Host Application
In the SDAccel™ environment, host code is written in C or C++ language using the industry standard OpenCL™ API. The SDAccel environment provides an OpenCL 1.2 embedded profile conformant runtime API.
The Xilinx® runtime library supports the OpenCL Installable Client Driver (ICD) extension (cl_khr_icd). This extension allows multiple implementations of OpenCL to co-exist on the same system. Refer to OpenCL Installable Client Driver Loader for details and installation instructions.
The SDAccel environment consists of a host x86 CPU and compute devices running on a Xilinx® FPGA.
In general, the structure of the host code can be divided into three sections:
- Setting up the environment.
- Core command execution including executing one or more kernels.
- Post processing and FPGA release.
Note: It is not recommended to use the fork() system call from an SDAccel environment application. The fork() call does not duplicate all the runtime threads; hence, the child process cannot run as a complete application in the SDAccel environment. It is advisable to use the posix_spawn() system call to launch another process from the SDAccel environment application.
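For illustration, the following is a minimal sketch of launching a child process with posix_spawn(); the helper program name my_helper and its argument are hypothetical placeholders.
#include <spawn.h>
#include <sys/wait.h>
#include <stdio.h>
extern char **environ;
int launch_helper(void)
{
  pid_t pid;
  // Hypothetical helper binary and argument; replace with your own
  char *args[] = {(char *)"my_helper", (char *)"--verbose", NULL};
  // posix_spawnp() starts a new process without duplicating the
  // runtime threads of the calling SDAccel application
  int rc = posix_spawnp(&pid, "my_helper", NULL, NULL, args, environ);
  if (rc != 0) {
    fprintf(stderr, "posix_spawn failed with code %d\n", rc);
    return rc;
  }
  waitpid(pid, NULL, 0); // wait for the child process to finish
  return 0;
}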
Setting Up the OpenCL Environment
The host code in the SDAccel environment follows the OpenCL programming paradigm. To set up the environment properly, the host application should identify the standard OpenCL models: platform, devices, context, command queue, and program.
Platform
Upon initialization, the host application needs to identify a platform composed of one or more Xilinx devices. The following code fragment shows how the Xilinx platform is identified:
cl_int err;
cl_platform_id platform_id; // platform id
cl_platform_id platforms[16]; // list of platforms (up to 16)
cl_uint platform_count; // number of platforms found
char cl_platform_vendor[1001]; // buffer for the platform vendor string
err = clGetPlatformIDs(16, platforms, &platform_count);
// Find Xilinx Platform
for (unsigned int iplat=0; iplat<platform_count; iplat++) {
err = clGetPlatformInfo(platforms[iplat],
CL_PLATFORM_VENDOR,
1000,
(void *)cl_platform_vendor,
NULL);
if (strcmp(cl_platform_vendor, "Xilinx") == 0) {
// Xilinx Platform found
platform_id = platforms[iplat];
}
}
The OpenCL API call clGetPlatformIDs is used to discover the set of available OpenCL platforms for a given system. Thereafter, clGetPlatformInfo is used to identify the Xilinx device based platform by matching cl_platform_vendor with the string "Xilinx".
It is good practice to check the return value of every OpenCL API call. The following example shows error checking for the clGetPlatformIDs command:
err = clGetPlatformIDs(16, platforms, &platform_count);
if (err != CL_SUCCESS) {
printf("Error: Failed to find an OpenCL platform!\n");
printf("Test failed\n");
exit(1);
}
Devices
After the platform detection, the Xilinx FPGA devices attached to the platform are identified. The SDAccel environment supports one or more Xilinx FPGA devices working together.
The following code demonstrates finding all the Xilinx devices (with an upper limit of 16) using the clGetDeviceIDs API and printing their names.
cl_device_id devices[16]; // compute device id
cl_uint num_devices; // number of devices found
char cl_device_name[1001];
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR,
16, devices, &num_devices);
printf("INFO: Found %d devices\n", num_devices);
//iterate all devices to select the target device.
for (cl_uint i=0; i<num_devices; i++) {
err = clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(cl_device_name), cl_device_name, 0);
printf("CL_DEVICE_NAME %s\n", cl_device_name);
}
The clGetDeviceIDs API is called with the device_type CL_DEVICE_TYPE_ACCELERATOR to get all the available Xilinx devices.
Sub-devices
In the SDAccel environment, devices sometimes contain multiple kernel instances, of a single kernel or of different kernels. The OpenCL API clCreateSubDevices allows the host code to divide the device into multiple sub-devices containing one kernel instance per sub-device. Currently, the SDAccel environment supports only equally divided sub-devices, each containing one kernel instance.
The following example shows:
- The sub-devices are created by equal partition to execute one kernel instance per sub-device.
- Iterating over the sub-device list and using a separate context and command queue to execute the kernel on each of them.
- The API related to kernel execution (and the corresponding buffer-related code) is not shown for the sake of simplicity, but would be described inside the function run_cu.
cl_uint num_devices = 0;
cl_device_partition_property props[3] = {CL_DEVICE_PARTITION_EQUALLY,1,0};
// Get the number of sub-devices
clCreateSubDevices(device,props,0,nullptr,&num_devices);
// Container to hold the sub-devices
std::vector<cl_device_id> devices(num_devices);
// Second call of clCreateSubDevices
// We get sub-device handles in devices.data()
clCreateSubDevices(device,props,num_devices,devices.data(),nullptr);
// Iterating over sub-devices
std::for_each(devices.begin(),devices.end(),[kernel](cl_device_id sdev) {
  cl_int err;
  // Context for sub-device
  auto context = clCreateContext(0,1,&sdev,nullptr,nullptr,&err);
// Command-queue for sub-device
auto queue = clCreateCommandQueue(context,sdev,
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&err);
  // Execute the kernel on the sub-device using local context and queue
  run_cu(context,queue,kernel); // Function not shown
});
Currently, if a kernel has multiple hardware instances (which can be specified during the kernel compilation phase), the SDAccel environment execution model assumes that all those hardware instances have the same global memory connectivity. If they do not, you need to use sub-devices to allocate a separate cl_kernel for each of those hardware instances.
Context
The clCreateContext API is used to create a context that contains one or more Xilinx devices that will communicate with the host machine.
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
In the code example above, the API clCreateContext
is used to create a context that contains one Xilinx device. You can create only one context for a device from a host
program. However, the host program should use multiple contexts if sub-devices are used;
one context for each sub-device.
Command Queues
One or more command queues for each device are created using the clCreateCommandQueue API.
The FPGA device can contain multiple kernels. When developing the host application,
there are two main programming approaches to execute kernels on a device:
- Single out-of-order command queue: Multiple kernel executions can be requested through the same command queue. The SDAccel runtime environment dispatches those kernels as soon as possible in any order allowing concurrent kernel execution on the FPGA.
- Multiple in-order command queues: Each kernel execution is requested from a different in-order command queue. In such cases, the SDAccel runtime environment can dispatch kernels from any command queue, with the intention of improving performance by running them concurrently on the FPGA (a sketch of this approach follows the code example below).
// Out-of-order Command queue
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
// In-order Command Queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
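As a sketch of the second approach, two in-order queues can each carry one kernel; the kernel objects kernel_a and kernel_b are assumed to have been created earlier with clCreateKernel.
// Two independent in-order command queues on the same device
cl_command_queue queue_a = clCreateCommandQueue(context, device_id, 0, &err);
cl_command_queue queue_b = clCreateCommandQueue(context, device_id, 0, &err);
// Each kernel is enqueued on its own queue, so the runtime may
// dispatch them concurrently on the FPGA
err = clEnqueueTask(queue_a, kernel_a, 0, NULL, NULL);
err = clEnqueueTask(queue_b, kernel_b, 0, NULL, NULL);
// Block until both queues have drained
clFinish(queue_a);
clFinish(queue_b);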
Program
As described in the SDAccel Build Process,
the host and kernel code are compiled separately to create separate executable files:
the host application (.exe) and the FPGA binary
(.xclbin). When the host application is
executed, it must load the .xclbin file using the clCreateProgramWithBinary
API.
unsigned char *kernelbinary;
char *xclbin = argv[1];
printf("INFO: loading xclbin %s\n", xclbin);
int size=load_file_to_memory(xclbin, (char **) &kernelbinary);
size_t size_var = size;
cl_int status;
cl_program program = clCreateProgramWithBinary(context, 1, &device_id, &size_var,
(const unsigned char **) &kernelbinary, &status, &err);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
// Definition of the load_file_to_memory function
int load_file_to_memory(const char *filename, char **result)
{
int size = 0;
FILE *f = fopen(filename, "rb");
if (f == NULL) {
*result = NULL;
return -1; // -1 means file opening fail
}
fseek(f, 0, SEEK_END);
size = ftell(f);
fseek(f, 0, SEEK_SET);
*result = (char *)malloc(size+1);
if (size != (int)fread(*result, sizeof(char), size, f)) {
free(*result);
fclose(f); // close the file on the error path too
return -2; // -2 means file reading fail
}
fclose(f);
(*result)[size] = 0;
return size;
}
- The kernel binary file, .xclbin, is passed in from the command line argument, argv[1]. TIP: Passing the .xclbin through a command line argument is specific to this example. You can also hardcode the kernel binary file in the application.
- The load_file_to_memory function is used to load the file contents in the host machine memory space.
- The APIs clCreateProgramWithBinary and clBuildProgram are used to complete the program creation process.
Executing Commands in the FPGA Device
Once the OpenCL environment is initialized, the host application is ready to issue commands to the device and interact with the kernels. These commands include:
- Memory data transfer to and from the FPGA device.
- Kernel execution on FPGA.
- Event synchronization.
Buffer Transfer to/from the FPGA Device
Interactions between the host application and kernels rely on transferring data to and from the global memory in the device. The simplest way to send data back and forth from the FPGA is using the clCreateBuffer, clEnqueueWriteBuffer, and clEnqueueReadBuffer commands. The following code example demonstrates this:
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
host_mem_ptr[i] = <... >
}
cl_mem dev_mem_ptr = clCreateBuffer(context, CL_MEM_READ_WRITE,
sizeof(int) * number_of_words, NULL, NULL);
err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
For the majority of applications, the example code above is sufficient to transfer data from the host to the device memory. However, there are a number of coding practices you should adopt to maximize performance and gain fine-grained control.
Using clEnqueueMigrateMemObjects
Another consideration when transferring data is using clEnqueueMigrateMemObjects instead of clEnqueueWriteBuffer or clEnqueueReadBuffer to improve performance.
Typically, memory objects are implicitly migrated to a device for enqueued kernels.
Using this API call results in data transfer ahead of kernel execution to reduce
latency, particularly when a kernel is called multiple times.
The following code example is modified to use clEnqueueMigrateMemObjects:
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
host_mem_ptr[i] = <... >
}
cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj = host_mem_ptr;
d_bank0_ext.param = 0;
cl_mem dev_mem_ptr = clCreateBuffer(context,
CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
sizeof(int) * number_of_words, &d_bank0_ext, NULL);
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0,
NULL, NULL);
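clEnqueueMigrateMemObjects can also move data in the opposite direction. A minimal sketch, assuming the kernel has finished writing its results to dev_mem_ptr:
// Migrate the buffer contents back to host-accessible memory;
// CL_MIGRATE_MEM_OBJECT_HOST indicates the host as the destination
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr,
  CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL);
clFinish(commands); // make sure the transfer completed before reading the data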
Using posix_memalign for Host Memory Space
The SDAccel runtime allocates memory space on 4K boundaries for internal memory management. If the host memory pointer is not aligned to a 4K boundary, the runtime performs an extra memcpy to make it aligned. This does not significantly impact performance, but you should align the host memory pointer on a 4K boundary to be consistent with the SDAccel runtime memory management.
The following is an example of how posix_memalign
is used instead of malloc
for the host memory space pointer.
int *host_mem_ptr; // = (int*) malloc(MAX_LENGTH*sizeof(int));
// Aligning memory on a 4K boundary
posix_memalign((void **)&host_mem_ptr, 4096, MAX_LENGTH*sizeof(int));
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
host_mem_ptr[i] = <... >
}
cl_mem_ext_ptr_t d_bank0_ext;
d_bank0_ext.flags = XCL_MEM_DDR_BANK0;
d_bank0_ext.obj = host_mem_ptr;
d_bank0_ext.param = 0;
cl_mem dev_mem_ptr = clCreateBuffer(context,
CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
sizeof(int) * number_of_words, &d_bank0_ext, NULL);
err = clEnqueueMigrateMemObjects(commands, 1, &dev_mem_ptr, 0, 0,
NULL, NULL);
Enhanced Buffer Allocation
By default, all the memory interfaces from all the kernels are connected to a single global memory bank when kernels are linked. As a result, only one memory interface can transfer data to and from the global memory bank at a time, limiting the overall performance of the application. If the FPGA device contains only one global memory bank, this is the only option. However, if the device contains multiple global memory banks, you can customize the global memory bank connections by modifying the default connection. This improves overall performance by enabling multiple kernel memory interfaces to concurrently read and write data from separate global memory banks. This topic is discussed in greater detail in Customization of DDR Bank to Kernel Connection.
When kernel ports are mapped to memory banks other than the default one, it is necessary to use the enhanced buffer allocation pattern when creating the OpenCL buffers.
The enhanced buffer allocation pattern uses a Xilinx vendor extension, cl_mem_ext_ptr_t, to help the Xilinx runtime determine in which global memory bank the buffer should be allocated.
The cl_mem_ext_ptr_t type is a struct as defined below:
typedef struct {
  unsigned flags;
  void *obj;
  void *param;
} cl_mem_ext_ptr_t;
Use the explicit bank name method to set up cl_mem_ext_ptr_t for enhanced buffer allocation.
Explicit Bank Name Method
In this approach, the struct field flags is used to denote the DDR bank (XCL_MEM_DDR_BANK1, XCL_MEM_DDR_BANK2, etc.). The struct field param should not be used and must be set to NULL.
The following code example uses cl_mem_ext_ptr_t
to assign the device buffer to DDR Bank 2.
int host_mem_ptr[MAX_LENGTH]; // host memory for input vector
// Fill the memory input
for(int i=0; i<MAX_LENGTH; i++) {
host_mem_ptr[i] = <... >
}
cl_mem_ext_ptr_t d_bank2_ext;
d_bank2_ext.flags = XCL_MEM_DDR_BANK2;
d_bank2_ext.obj = NULL;
d_bank2_ext.param = 0;
cl_mem dev_mem_ptr = clCreateBuffer(context,
CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
sizeof(int) * number_of_words, &d_bank2_ext, NULL);
err = clEnqueueWriteBuffer(commands, dev_mem_ptr, CL_TRUE, 0,
sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, NULL);
Alternatively, the flags field can specify the memory bank index directly together with the XCL_MEM_TOPOLOGY flag:
var_ext.flags = <some integer> | XCL_MEM_TOPOLOGY
where 0, 1, 2, and 3 stand for different DDR banks. The older naming style of XCL_MEM_DDR_BANK0, etc., still works for existing platforms.
Kernel Setup and Execution
After setting up the OpenCL environment, the host application is ready to set up and execute the kernels. This involves:
- Identifying the kernels.
- Setting kernel arguments.
- Executing kernels on the FPGA.
Identifying the Kernels
The first step is to identify the kernels from the program and create the kernel objects (of the cl_kernel type) in the host code. This is done by the clCreateKernel command with the kernel name as an argument:
kernel1 = clCreateKernel(program, "<kernel_name_1>", &err);
kernel2 = clCreateKernel(program, "<kernel_name_2>", &err); // etc
Setting Kernel Arguments
Kernel arguments can be of two types:
- The scalar arguments are used for small data transfer, such as for constant or configuration type data. These are write-only arguments.
- The buffer arguments are used for large data transfer as discussed in Buffer Transfer to/from the FPGA Device.
Kernel arguments are set using the clSetKernelArg command, as shown below. The following example shows setting kernel arguments for two scalar arguments and three buffer arguments.
int err = 0;
// Setting up scalar arguments
cl_uint scalar_arg_image_width = 3840;
err |= clSetKernelArg(kernel, 0, sizeof(cl_uint), &scalar_arg_image_width);
cl_uint scalar_arg_image_height = 2160;
err |= clSetKernelArg(kernel, 1, sizeof(cl_uint), &scalar_arg_image_height);
// Setting up buffer arguments
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &dev_mem_ptr0);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &dev_mem_ptr1);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &dev_mem_ptr2);
Enqueuing the Kernels
The kernels are enqueued to run on the FPGA using the clEnqueueTask or clEnqueueNDRangeKernel commands. Xilinx recommends using the clEnqueueTask command to execute the kernel over the entire range of the input data set using the maximum number of work group items:
err = clEnqueueTask(commands, kernel, 0, NULL, NULL);
The clEnqueueTask command is the same as calling clEnqueueNDRangeKernel with work_dim set to 1, global_work_offset set to NULL, global_work_size[0] set to 1, and local_work_size[0] set to 1.
Just like all the enqueue commands, clEnqueueTask and clEnqueueNDRangeKernel are asynchronous in nature. The host code continues executing without waiting for the kernel computation to complete on the FPGA device. This allows the host program to enqueue more kernels, either the same kernel multiple times over different sets of data, or different kernels. After finishing its work, the kernel writes the result data to the global memory bank. This data is read back to the host memory space using the clEnqueueReadBuffer or clEnqueueMigrateMemObjects command.
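For reference, the equivalent clEnqueueNDRangeKernel call with the parameter values described above would look like this sketch:
size_t global_work_size[1] = {1};
size_t local_work_size[1] = {1};
err = clEnqueueNDRangeKernel(commands, kernel,
  1,    // work_dim
  NULL, // global_work_offset
  global_work_size,
  local_work_size,
  0, NULL, NULL);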
Event Synchronization
All OpenCL clEnqueueXXX API calls are asynchronous. In other words, these commands return immediately after the command is enqueued in the command queue. To resolve the dependencies among the commands, an API call such as clWaitForEvents or clFinish can be used to pause or block execution of the host program.
Example usages of the clWaitForEvents and clFinish commands are shown below:
err = clEnqueueTask(command_queue, kernel, 0, NULL, NULL);
// Execution will wait here until all commands in the command queue are finished
clFinish(command_queue);
// Read back the results from the device to verify the output
cl_event readevent;
int host_mem_output_ptr[MAX_LENGTH]; // host memory for output vector
clEnqueueReadBuffer(command_queue, dev_mem_ptr, CL_TRUE, 0, sizeof(int) * number_of_words,
host_mem_output_ptr, 0, NULL, &readevent );
clWaitForEvents(1, &readevent); // Wait for clEnqueueReadBuffer event to finish
// Check Results
// Compare Golden values with host_mem_output_ptr
- The clFinish API has been explicitly used to block the host execution until the kernel execution is finished. This is necessary; otherwise, the host could attempt to read back from the FPGA buffer too early and might read garbage data.
- The data transfer from the FPGA memory to the local host machine is done through clEnqueueReadBuffer. Here, the last argument of clEnqueueReadBuffer returns an event object that identifies this particular read command and can be used to query the event, or wait for this particular command to complete. The clWaitForEvents call specifies the single event and waits to ensure the data transfer is finished before checking the data from the host-side memory.
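Events can also express dependencies directly between commands: an event returned by one enqueue call can be passed in the wait list of a later one. A minimal sketch, reusing the buffer and kernel objects from the examples above:
cl_event write_event;
// Non-blocking host-to-device transfer that returns an event
err = clEnqueueWriteBuffer(command_queue, dev_mem_ptr, CL_FALSE, 0,
  sizeof(int) * number_of_words, host_mem_ptr, 0, NULL, &write_event);
// The kernel will not start until the write completes
err = clEnqueueTask(command_queue, kernel, 1, &write_event, NULL);
clReleaseEvent(write_event); // release the event once it is no longer needed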
Post Processing and FPGA Cleanup
At the end of the host code, all the allocated resources should be released using the proper release functions. The following example verifies the results against golden values and then releases the resources:
bool failed = false;
for (i=0; i<N; i++)
if (Res[i] != GOLD_RES[i])
failed=true;
clReleaseCommandQueue(Command_Queue);
clReleaseContext(Context);
clReleaseDevice(Target_Device_ID);
clReleaseKernel(Kernel);
clReleaseProgram(Program);
free(Platform_IDs);
free(Device_IDs);
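Note that any cl_mem buffers created with clCreateBuffer should also be released; a minimal sketch for the buffers used in the earlier examples:
clReleaseMemObject(dev_mem_ptr); // release each buffer object created with clCreateBuffer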
Summary
As discussed in earlier topics, the recommended coding style for the host application in the SDAccel environment includes the following points:
- Add error checking after each OpenCL API call for debugging purposes, if required.
- In the SDAccel environment, one or more kernels are separately pre-compiled into the .xclbin file. The API clCreateProgramWithBinary is used to build the program from the kernel binary.
- Use cl_mem_ext_ptr_t to match a custom kernel memory interface to the DDR bank connection that was used to build the kernel binary.
- Transfer data back and forth between the host code and the FPGA by using clEnqueueMigrateMemObjects.
- Use posix_memalign to align the host memory pointer on a 4K boundary.
- Use an out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA.
- Execute the whole workload with clEnqueueTask, rather than splitting the workload by using clEnqueueNDRangeKernel.
- Use synchronization commands to resolve dependencies among the asynchronous OpenCL API calls.