Host Optimization
This section focuses on Host Code optimization. The host code uses the OpenCL™ API to schedule the individual compute unit executions and data transfers from and to the FPGA board. As a result, you need to be thinking about concurrent execution through the OpenCL queue(s). This section discusses in detail common pitfalls and how to recognize and address them.
Reducing Overhead of Kernel Enqueing
The OpenCL API execution model supports data parallel and task parallel programming models. Kernels are usually enqueued by the OpenCL Runtime multiple times and then scheduled to be executed on the device. You must send the command to start the kernel in one of two ways:
- Using clEnqueueNDRange API for the data parallel case.
- Using clEnqueueTask for the task parallel case.
The dispatching process is executed on the host processor and the actual commands and kernel arguments need to be sent to the FPGA via PCIe® link. In the current Xilinx runtime (XRT), the overhead of dispatching the command and arguments to the FPGA is between 30us and 60us, depending the number of arguments on the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed.
For the data parallel case, Xilinx® recommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size as shown in the code snippet below:
size_t global = 1;
size_t local = 1;
clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
&global, &local, 2, write_events.data(),
&kernel_events[0]));
For the task parallel case, Xilinx recommends that you minimize the calls to clEnqueueTask. Ideally, you should finish all the work load in a single call to clEnqueueTask.
Data Transfers
Overlapping Data Transfers with Kernel Computation
Applications, such as database analytics, have a much larger data set than the available memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications.
Below is the vector add kernel from the OpenCL Overlap Data Transfers with Kernel Computation Example in the host category from Xilinx On-boarding Example GitHub.
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vadd(global int* c,
global const int* a,
global const int* b,
const int offset,
const int elements)
{
int end = offset + elements;
vadd_loop: for (int x=offset; x<end; ++x) {
c[x] = a[x] + b[x];
}
}
There are four tasks to perform in the host application for this example:
- Write buffer a (Wa)
- Write buffer b (Wb)
- Execute vadd kernel
- Read buffer c (Rc)
The asynchronous nature of OpenCL data transfer and kernel execution APIs allows overlap of data transfers and kernel execution as illustrated in the figure below. In this example, double buffering is used for all buffers so that the compute unit can process one set of buffers while the host can operate on the other set of buffers. The OpenCL event object provides an easy way to set up complex operation dependencies and synchronize host threads and device operations. The arrows in the figure below show how event triggering can be set up to achieve optimal performance.
The host code snippet below enqueues the four tasks in a loop. It also
sets up event synchronization between different tasks to ensure that data dependencies
are met for each task. The double buffering is set up by passing different memory
objects values to clEnqueueMigrateMemObjects
API. The
event synchronization is achieved by having each API call wait for other event as well
as trigger its own event when the API completes.
for (size_t iteration_idx = 0; iteration_idx < num_iterations; iteration_idx++) {
int flag = iteration_idx % 2;
if (iteration_idx >= 2) {
clWaitForEvents(1, &map_events[flag]);
OCL_CHECK(clReleaseMemObject(buffer_a[flag]));
OCL_CHECK(clReleaseMemObject(buffer_b[flag]));
OCL_CHECK(clReleaseMemObject(buffer_c[flag]));
OCL_CHECK(clReleaseEvent(read_events[flag]));
OCL_CHECK(clReleaseEvent(kernel_events[flag]));
}
buffer_a[flag] = clCreateBuffer(world.context,
CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
bytes_per_iteration, &A[iteration_idx * elements_per_iteration], NULL);
buffer_b[flag] = clCreateBuffer(world.context,
CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
bytes_per_iteration, &B[iteration_idx * elements_per_iteration], NULL);
buffer_c[flag] = clCreateBuffer(world.context,
CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
bytes_per_iteration, &device_result[iteration_idx * elements_per_iteration], NULL);
array<cl_event, 2> write_events;
printf("Enqueueing Migrate Mem Object (Host to Device) calls\n");
// These calls are asynchronous with respect to the main thread because we
// are passing the CL_FALSE as the third parameter. Because we are passing
// the events from the previous kernel call into the wait list, it will wait
// for the previous operations to complete before continuing
OCL_CHECK(clEnqueueMigrateMemObjects(
world.command_queue, 1, &buffer_a[iteration_idx % 2],
0 /* flags, 0 means from host */,
0, NULL,
&write_events[0]));
set_callback(write_events[0], "ooo_queue");
OCL_CHECK(clEnqueueMigrateMemObjects(
world.command_queue, 1, &buffer_b[iteration_idx % 2],
0 /* flags, 0 means from host */,
0, NULL,
&write_events[1]));
set_callback(write_events[1], "ooo_queue");
xcl_set_kernel_arg(kernel, 0, sizeof(cl_mem), &buffer_c[iteration_idx % 2]);
xcl_set_kernel_arg(kernel, 1, sizeof(cl_mem), &buffer_a[iteration_idx % 2]);
xcl_set_kernel_arg(kernel, 2, sizeof(cl_mem), &buffer_b[iteration_idx % 2]);
xcl_set_kernel_arg(kernel, 3, sizeof(int), &elements_per_iteration);
printf("Enqueueing NDRange kernel.\n");
// This event needs to wait for the write buffer operations to complete
// before executing. We are sending the write_events into its wait list to
// ensure that the order of operations is correct.
OCL_CHECK(clEnqueueNDRangeKernel(world.command_queue, kernel, 1, nullptr,
&global, &local, 2 , write_events.data(),
&kernel_events[flag]));
set_callback(kernel_events[flag], "ooo_queue");
printf("Enqueueing Migrate Mem Object (Device to Host) calls\n");
// This operation only needs to wait for the kernel call. This call will
// potentially overlap the next kernel call as well as the next read
// operations
OCL_CHECK( clEnqueueMigrateMemObjects(world.command_queue, 1, &buffer_c[iteration_idx % 2],
CL_MIGRATE_MEM_OBJECT_HOST, 1, &kernel_events[flag], &read_events[flag]));
set_callback(read_events[flag], "ooo_queue");
clEnqueueMapBuffer(world.command_queue, buffer_c[flag], CL_FALSE, CL_MAP_READ, 0,
bytes_per_iteration, 1, &read_events[flag], &map_events[flag], 0);
set_callback(map_events[flag], "ooo_queue");
OCL_CHECK(clReleaseEvent(write_events[0]));
OCL_CHECK(clReleaseEvent(write_events[1]));
}
The Application Timeline view below clearly shows that the data transfer time is completely hidden, while the compute unit vadd_1 is running constantly.
Buffer Memory Segmentation
Allocation and deallocation of memory buffers can lead to memory segmentation in the DDRs. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel.
This problem occurs most often when multiple pthreads for different compute units are used, and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it just seems the processes are sleeping.
Each buffer allocated by runtime should be continuous in hardware. For large memory, it might take a lot of time to wait for that space to be freed, when many buffers are allocated and deallocated. This can be resolved by allocating device buffer, and reusing it between different enqueues of a kernel.
Compute Unit Scheduling
Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels.
Multiple In-Order Command Queues
The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required.
Below is the code snippet from the Concurrent Kernel Execution Example in host category from SDAccel Getting Started Examples on GitHub that sets up multiple in-order command queues and enqueues commands into each queue:
cl_command_queue ordered_queue1 = clCreateCommandQueue(
world.context, world.device_id, CL_QUEUE_PROFILING_ENABLE, &err)
cl_command_queue ordered_queue2 = clCreateCommandQueue(
world.context, world.device_id, CL_QUEUE_PROFILING_ENABLE, &err);
clEnqueueNDRangeKernel(ordered_queue1, kernel_mscale, 1, offset,
global, local, 0, nullptr,
&kernel_events[0]));
clEnqueueNDRangeKernel(ordered_queue1, kernel_madd, 1, offset,
global, local, 0, nullptr,
&kernel_events[1]);
clEnqueueNDRangeKernel(ordered_queue2, kernel_mmult, 1, offset,
global, local, 0, nullptr,
&kernel_events[2]);
Single Out-of-Order Command Queue
The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must set up event dependencies and synchronizations explicitly if required.
Below is the code snippet from the Concurrent Kernel Execution Example from SDAccel Getting Started Examples on GitHub that sets up a single out-of-order command queue and enqueues commands:
cl_command_queue ooo_queue = clCreateCommandQueue(
world.context, world.device_id,
CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
clEnqueueNDRangeKernel(ooo_queue, kernel_mscale, 1, offset, global,
local, 0, nullptr, &ooo_events[0]);
clEnqueueNDRangeKernel(ooo_queue, kernel_madd, 1, offset, global,
local, 1,
&ooo_events[0], // Event from previous call
&ooo_events[1]);
clEnqueueNDRangeKernel(ooo_queue, kernel_mmult, 1, offset, global,
local, 0,
nullptr, // Does not depend on previous call
&ooo_events[2])
The Application Timeline view (as shown in the following figure) that the
compute unit mmult_1
is running in parallel with the compute
units mscale_1
and madd_1
,
using both multiple in-order queues and single out-of-order queue methods.
Using clEnqueueMigrateMemObjects to Transfer Data
The OpenCL framework provides a number of
APIs for transferring data between the host and the device. Typically, data movement APIs, such
as clEnqueueWriteBuffer
and clEnqueueReadBuffer
, implicitly migrate memory objects to the device after they are
enqueued. They do not guarantee when the data is transferred. This makes it difficult for the
host application to overlap the placements of the memory objects onto the device with the
computation carried out by kernels.
OpenCL 1.2 framework introduced a new API,
clEnqueueMigrateMemObjects
. Using this API, memory migration
can be explicitly performed ahead of the dependent commands. This allows the application to
preemptively change the association of a memory object, through regular command queue scheduling,
to prepare for another upcoming command. This also permits an application to overlap the
placement of memory objects with other unrelated operations before these memory objects are
needed, potentially hiding transfer latencies. After the event associated by clEnqueueMigrateMemObjects
has been marked CL_COMPLETE, the memory
objects specified in mem_objects have been successfully migrated to the device associated with
command_queue.
The clEnqueueMigrateMemObjects
API can also be
used to direct the initial placement of a memory object after creation, possibly avoiding the
initial overhead of instantiating the object on the first enqueued command to use it.
Another advantage of clEnqueueMigrateMemObjects
is that it can migrate multiple memory objects in a single
API call. This reduces the overhead of scheduling and calling functions for transferring data for
more than one memory object.
Below is the code snippet showing the usage of clEnqueueMigrateMemObjects
from Vector Multiplication for
XPR Device example in the host category from SDAccel Getting Started Examples on GitHub.
int err = clEnqueueMigrateMemObjects(
world.command_queue,
1,
&d_mul_c,
CL_MIGRATE_MEM_OBJECT_HOST,
0,
NULL,
NULL);