Programming for SDAccel
The custom processing architecture generated by the SDAccel™ environment for a kernel running on a Xilinx® FPGA provides opportunities for significant performance gains. However, you must take advantage of these opportunities by writing your host and kernel code specifically for acceleration on an FPGA.
The host application runs on x86 servers and uses the SDAccel runtime to manage interactions with the FPGA kernels. The host application is written in C/C++ using OpenCL™ APIs. The custom kernels run within a Xilinx® FPGA on an SDAccel platform.
The SDAccel hardware platform contains global memory banks. The data transfer from the host machine to kernels and from kernels to the host happens through these global memory banks. Communication between the host x86 machine and the SDAccel accelerator board occurs across the PCIe® bus.
The following topics discuss how to write code for the host application to set up the Xilinx Runtime (XRT), load the kernel binary into the SDAccel platform, pass data efficiently between the host application and the kernel, and trigger the kernel on the FPGA at the appropriate time in the host application.
In addition, the FPGA fabric can support multiple kernels running simultaneously. Therefore, you can create multiple instances of a single kernel, or configure multiple kernels on the same device, to increase the performance of the host application. Kernels running on the FPGA can have one or more memory interfaces to connect to the global memory of the platform. You will manage both the number of kernels running on the FPGA, and the specific memory banks accessed by the kernel through xocc linking options during the build process.
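For example, the following is a minimal sketch of xocc linking options (the kernel name, compute unit names, and bank assignments are illustrative) that creates two compute units of a kernel and maps each memory interface to a different DDR bank:
# Link two compute units of krnl_idct and map each AXI memory interface to a DDR bank
xocc --link --nk krnl_idct:2:idct_1.idct_2 \
     --sp idct_1.m_axi_gmem:bank0 \
     --sp idct_2.m_axi_gmem:bank1 \
     -o binary_container.xclbin krnl_idct.xo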
The content discussed here is provided in greater detail in the SDAccel Environment Programmers Guide. Refer to that guide for details of the host application, kernel code, and the interactions between them.
Coding the Host Application
When creating the host application, you must manage the required overhead to set up and configure the SDAccel runtime, program and launch the kernel, pass data back and forth between the host application and the kernel, as well as address the primary function of the application.
Setting Up the Runtime
- To set up the OpenCL runtime environment, you need to identify the Xilinx platform using the clGetPlatformIDs and clGetPlatformInfo commands. For example:
// Get all available platforms and find the Xilinx platform
std::vector<cl_platform_id> platforms(platform_count);
clGetPlatformIDs(platform_count, platforms.data(), nullptr);
for (int p = 0; p < (int)platform_count; ++p) {
    platform_id = platforms[p];
    clGetPlatformInfo(platform_id, CL_PLATFORM_VENDOR, 1000, (void *)cl_platform_vendor, NULL);
    clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, 1000, (void *)cl_platform_name, NULL);
    if (!strcmp(cl_platform_vendor, "Xilinx")) {
        ...
    }
}
- Identify the Xilinx devices on the platform that are available for enqueuing kernels, using the clGetDeviceIDs command. Finding the device IDs requires the platform ID discovered in the prior step. For example:
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ACCELERATOR, 1, &device_id, NULL);
- Set up the context using clCreateContext. The context is the environment within which work-items execute, and it identifies the devices to be assigned transactions from the command queue. The example below shows the creation of the context:
cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
- Define the command queue using clCreateCommandQueue. The command queue is a list of commands waiting to be executed on a device. You can set up the command queue to handle commands in the order submitted, or out-of-order so that a command can be executed as soon as it is ready. Use an out-of-order command queue, or multiple in-order command queues, for concurrent kernel execution on the FPGA. An example follows:
// Create an out-of-order command queue with profiling enabled
cl_command_queue commands = clCreateCommandQueue(context, device_id,
    CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
- Finally, in the host code you need to set up the program, which contains and configures the kernels to be passed to the command queue by the host application. The load_file_to_memory function is used to load the file contents into the host machine memory space. The clCreateProgramWithBinary command downloads the FPGA binary (.xclbin) to the device and returns a cl_program handle. The following example shows the creation of the program using these API calls:
char *fpga_bin;
size_t fpga_bin_size;
fpga_bin_size = load_file_to_memory(binaryName, &fpga_bin);
cl_program program = clCreateProgramWithBinary(context, 1,
    (const cl_device_id *)&device_id, &fpga_bin_size,
    (const unsigned char **)&fpga_bin, NULL, &err);
Transferring Data to/from the FPGA Device
Data transfers between the host application and the FPGA device can be implemented using the clCreateBuffer, clEnqueueReadBuffer, and clEnqueueWriteBuffer commands. However, to transfer the data required ahead of the transaction, use the clEnqueueMigrateMemObjects command. Using this command results in reduced latency in the application. The following code example demonstrates this:
// Move Buffer over input vector
mBlockExt.obj = blocks->data() + mNumBlocks64*64*start;
mQExt.obj = q->data();
mInBuffer[0] = clCreateBuffer(mContext,
                              CL_MEM_EXT_PTR_XILINX | CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                              mNumBlocks64*64*sizeof(int16_t),
                              &mBlockExt,
                              &err);
// Schedule actual writing of data
clEnqueueMigrateMemObjects(mQ, 2, mInBuffer, 0, 0, nullptr, &inEvVec[mCount]);
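The same command can also move results back to the host once the kernel has completed. The following is a minimal sketch using the naming style of the example above (mOutBuffer and outEvVec are assumed to be defined elsewhere in the host code; runEvVec[mCount] is the kernel-completion event produced by the clEnqueueTask call shown in the next section):
// Schedule migration of the output buffer back to host memory once the
// kernel run event has completed
clEnqueueMigrateMemObjects(mQ, 1, mOutBuffer,
    CL_MIGRATE_MEM_OBJECT_HOST, 1, &runEvVec[mCount], &outEvVec[mCount]);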
Setting Up the Kernel
- Create a kernel from the program and the loaded FPGA binary using the clCreateKernel command:
// Create Kernel
cl_kernel krnl = clCreateKernel(program, "krnl_idct", &err);
- Set the kernel arguments using the clSetKernelArg command. For example:
// Set the kernel arguments
clSetKernelArg(mKernel, 0, sizeof(cl_mem), &mInBuffer[0]);
clSetKernelArg(mKernel, 1, sizeof(cl_mem), &mInBuffer[1]);
clSetKernelArg(mKernel, 2, sizeof(cl_mem), &mOutBuffer[0]);
clSetKernelArg(mKernel, 3, sizeof(int), &m_dev_ignore_dc);
clSetKernelArg(mKernel, 4, sizeof(unsigned int), &mNumBlocks64);
- Schedule the kernel to run on the FPGA using the clEnqueueTask command. The request to execute the kernel is placed into the command queue and either waits for its turn, or is executed when ready, depending on the nature of the queue. For example:
clEnqueueTask(mQ, mKernel, 1, &inEvVec[mCount], &runEvVec[mCount]);
- Because the clEnqueueTask (and clEnqueueMigrateMemObjects) command is asynchronous and returns immediately after the command is enqueued in the command queue, you might need to manage the scheduling of events within the host application. To resolve dependencies among the commands in the host application, you can use the clWaitForEvents or clFinish commands to pause or block execution of the host program. For example:
// Execution waits until all commands in the command queue are finished
clFinish(command_queue);

// Wait for the clEnqueueReadBuffer event to finish
clWaitForEvents(1, &readevent);
Kernel Language Support
The SDAccel environment supports kernels expressed in OpenCL C, C/C++, and RTL (SystemVerilog, Verilog, or VHDL). You can use different kernel types in the same application. However, each kernel has specific requirements and coding styles that should be used.
Kernels created from OpenCL C and C/C++ are well suited to software and algorithm developers, making it easier to start from an existing C/C++ application and accelerate portions of it. Regardless of the source language, each kernel must provide the following hardware interfaces:
- A single slave AXI4-Lite interface used to access control registers (to pass scalar arguments and to start/stop the kernel)
- At least one of the following interfaces (can have both interfaces):
- AXI4 master interface to communicate with memory.
- AXI4-Stream interface for transferring data between kernels.
In the case of OpenCL kernels, the AXI4-Lite interface is generated automatically, and the AXI4 memory mapped interfaces are generated based on the __global qualifier in the function definition. For C/C++ kernels, use INTERFACE pragmas to map the arguments to the AXI4-Lite and AXI4 memory mapped interfaces. For RTL kernels, you are responsible for adding these interfaces.
Writing OpenCL C Kernels
The SDAccel environment supports the OpenCL C language constructs and built-in functions from the OpenCL 1.0 embedded profile. The following is an example of an OpenCL C kernel for matrix multiplication that can be compiled with the SDAccel environment.
__kernel __attribute__ ((reqd_work_group_size(16,16,1)))
void mult(__global int* a, __global int* b, __global int* output)
{
    // Each work-item computes one element of the 16x16 output matrix
    int r = get_local_id(0);
    int c = get_local_id(1);
    int rank = get_local_size(0);
    int running = 0;

    for (int index = 0; index < 16; index++) {
        int aIndex = r*rank + index;
        int bIndex = index*rank + c;
        running += a[aIndex] * b[bIndex];
    }

    output[r*rank + c] = running;
    return;
}
Writing C/C++ Kernels
Kernels written in C/C++ are also supported by the SDAccel environment. The matrix multiplication kernel above can be expressed in C/C++ code as shown below. For kernels captured in this way, the SDAccel environment supports all of the optimization techniques available in Vivado® HLS. Keep in mind that expressing kernels in this way requires compliance with a specific function signature style.
By default, kernels captured in C/C++ for HLS make no assumptions about the physical interfaces used to transport the function argument data. HLS uses pragmas embedded in the code to direct the compiler as to which physical interface to generate for each function port. For the function to be treated as a valid HLS C/C++ kernel, each function argument must have a valid HLS INTERFACE pragma.
#include <string.h>

void mmult(int *a, int *b, int *output)
{
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=output bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    const int rank = 16;
    int running = 0;
    int bufa[256];
    int bufb[256];
    int bufc[256];

    // Copy the 16x16 input matrices from global memory into local buffers
    memcpy(bufa, (int *) a, 256*sizeof(int));
    memcpy(bufb, (int *) b, 256*sizeof(int));

    for (unsigned int c = 0; c < rank; c++) {
        for (unsigned int r = 0; r < rank; r++) {
            running = 0;
            for (int index = 0; index < rank; index++) {
            #pragma HLS pipeline
                int aIndex = r*rank + index;
                int bIndex = index*rank + c;
                running += bufa[aIndex] * bufb[bIndex];
            }
            bufc[r*rank + c] = running;
        }
    }

    // Copy the result back to global memory
    memcpy((int *) output, bufc, 256*sizeof(int));
    return;
}
Pointer Arguments
All pointer arguments are mapped to global memory. The data is accessed through AXI interfaces, which can be mapped to different memory banks. The memory interface specification requires the following two pragmas:
- The first pragma defines the AXI4 memory mapped interface through which the argument is accessed. An offset is always required. The offset=slave option means that the offset of the array <variable_name> will be made available through the AXI4-Lite slave interface of the kernel.
#pragma HLS INTERFACE m_axi port=<variable name> offset=slave bundle=<AXI_MM_name>
- The second pragma maps the argument to the AXI4-Lite slave interface. Scalars (and pointer offsets) are mapped to a single AXI4-Lite control interface, which must be named control.
#pragma HLS INTERFACE s_axilite port=<variable name> bundle=control
Note the naming of the generated AXI4 memory mapped interfaces. With earlier platforms, the interface name M_AXI_ARG_NAME was used, making arg_name uppercase irrespective of the original capitalization and prefixing it with M_AXI_. With current platforms (version 5.x or later), the interface name m_axi_arg_name is used; arg_name is made lowercase irrespective of the original capitalization and prefixed with m_axi_.
Scalars
Scalars are considered constant inputs and should also be mapped to the s_axilite interface. The control interface specification is generated by the following pragma:
#pragma HLS INTERFACE s_axilite port=<variable name> bundle=control
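As an illustration, the following minimal sketch (the kernel name, argument names, and trip count are hypothetical) shows a scalar argument mapped to the control interface alongside a pointer argument:
// Hypothetical kernel: scales 256 integers in global memory by a scalar factor
void scale(int *data, int factor)
{
#pragma HLS INTERFACE m_axi port=data offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=data bundle=control
#pragma HLS INTERFACE s_axilite port=factor bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (int i = 0; i < 256; i++)
        data[i] *= factor;
}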
Detailed information on how these pragmas are used is available in the SDx Pragma Reference Guide.
When a kernel is defined in C++, use extern "C" { ... } around the functions targeted to be kernels. The use of extern "C" instructs the compiler/linker to use the C naming and calling conventions.
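For example, a minimal sketch of this wrapping applied to the mmult kernel shown earlier (only the declaration is repeated here):
// Preserve the C symbol name for the kernel when compiling as C++
extern "C" {
void mmult(int *a, int *b, int *output);
}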
When kernel arguments are passed as structs, Xilinx recommends that the struct has a size in bytes that is a power of two. Taking into consideration that the maximum bit width of the underlying interface is 512 bits, or 64 bytes, the recommended size of the struct is 4, 8, 16, 32, or 64 bytes. To reduce the risk of misalignment between the host code and the kernel code, Xilinx recommends that the struct elements use types of the same size.

C++ arbitrary precision data types can be used for global memory pointers on a kernel. They are not supported for scalar kernel inputs that are passed by value.
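For example, a minimal sketch of a struct that follows these recommendations (the struct and member names are illustrative), 16 bytes in total with all members using the same 4-byte type:
// 16 bytes in total (a power of two); all members use the same 4-byte type
typedef struct {
    int x;
    int y;
    int z;
    int w;
} points_t;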
Writing RTL Kernels
RTL kernels have both software and hardware requirements that must be met for them to be used in the SDAccel environment framework. On the software side, the RTL kernel must adhere to the register definitions described in Kernel Software Requirements. On the hardware side, the kernel requires the interfaces outlined in Kernel Interface Requirements.
For complete details on creating and using RTL kernels, see RTL Kernels.