Host Programming
In the Vitis™ environment, the host application can be written in native C++ using the Xilinx® runtime (XRT) native C++ API or the industry-standard OpenCL™ API. The XRT native API is described here in brief, with additional details available under XRT Native API on the XRT documentation site. Refer to OpenCL Programming for a discussion of writing the host application using the OpenCL API.
In general, the structure of the host application can be divided into the following steps:
- Specifying the accelerator device ID and loading the .xclbin
- Setting up the kernel and kernel arguments
- Transferring data between the host and kernels
- Running the kernel and returning results
The XRT native C++ API requires linking with the xrt_coreutil library. Compiling host code with the XRT native C++ API requires the C++14 standard (-std=c++14) or newer. For example:
g++ -g -std=c++14 -I$XILINX_XRT/include -L$XILINX_XRT/lib -lxrt_coreutil -pthread
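For reference, the following is a minimal sketch that strings these steps together using the XRT native C++ API. It assumes a hypothetical vadd kernel that takes two input buffers, an output buffer, and a scalar size, and it reads the .xclbin path from the command line; each step is covered in detail in the sections that follow.
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main(int argc, char** argv) {
    // Step 1: open the device and load the .xclbin passed on the command line
    auto device = xrt::device(0);
    auto uuid = device.load_xclbin(argv[1]);

    // Step 2: set up the kernel and a buffer object for each buffer argument
    auto krnl = xrt::kernel(device, uuid, "vadd");            // hypothetical kernel name
    const int DATA_SIZE = 4096;
    const size_t vector_size_bytes = DATA_SIZE * sizeof(int);
    auto bo0    = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
    auto bo1    = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
    auto bo_out = xrt::bo(device, vector_size_bytes, krnl.group_id(2));

    // Step 3: fill the input buffers and sync them to the device
    auto in0 = bo0.map<int*>();
    auto in1 = bo1.map<int*>();
    for (int i = 0; i < DATA_SIZE; ++i) { in0[i] = i; in1[i] = i; }
    bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Step 4: run the kernel, wait for completion, and read back the results
    auto run = krnl(bo0, bo1, bo_out, DATA_SIZE);
    run.wait();
    bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    auto out = bo_out.map<int*>();
    return (out[0] == in0[0] + in1[0]) ? 0 : 1;               // trivial sanity check
}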
IMPORTANT: Do not use the fork() system call from a Vitis core development kit application. The fork() call does not duplicate all of the runtime threads, so the child process cannot run as a complete application in the Vitis core development kit. It is advisable to use the posix_spawn() system call to launch another process from the Vitis software platform application.
Specifying the Device ID and Loading the XCLBIN
To use the Xilinx runtime (XRT) environment properly, the host application needs to identify the accelerator card and device ID that the kernel will run on, and load the device binary (.xclbin) into the device.
The XRT API provides a device class (xrt::device) that can be used to specify the device ID on the accelerator card, and an XCLBIN class (xrt::xclbin) that defines the program for the runtime. You must use the following include statement in your source code to load these classes:
#include <xrt/xrt_kernel.h>
The following code snippet creates a device object by specifying the device ID from the target platform, and then loads the .xclbin into the device, returning the UUID for the program.
//Setup the Environment
unsigned int device_index = 0;
std::string binaryFile = parser.value("kernel.xclbin");
std::cout << "Open the device" << device_index << std::endl;
auto device = xrt::device(device_index);
std::cout << "Load the xclbin " << binaryFile << std::endl;
auto uuid = device.load_xclbin(binaryFile);
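If needed, you can also query basic information about the opened device through xrt::device::get_info. The following one-line sketch, which assumes a recent XRT release providing the xrt::info::device::name query, prints the platform name of the device:
// Optional: print the platform name of the opened device (recent XRT releases)
std::cout << "Device name: " << device.get_info<xrt::info::device::name>() << std::endl;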
TIP: The device ID can be obtained using the xbutil command for a specific accelerator.
Setting Up XRT-Managed Kernels and Kernel Arguments
After identifying devices and loading the program, the host application
should identify the kernels that execute on the device, and set up the kernel arguments.
All kernels the host application interacts with are defined within the loaded .xclbin
file, and so should be identified from there.
For XRT-managed kernels, the XRT API provides a kernel class (xrt::kernel) that is used to access the kernels contained within the .xclbin file. The kernel object identifies an XRT-managed kernel in the .xclbin loaded into the Xilinx device that can be run by the host application.
The XRT API also provides an IP class (xrt::ip) to identify the user-managed kernels in the .xclbin file. The use of the kernel and buffer objects requires the addition of the following include statements in your source code:
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>
The following code example identifies a kernel ("vadd") defined in the program (uuid) loaded onto the device:
auto krnl = xrt::kernel(device, uuid, "vadd");
TIP: You can use the xclbinutil command to examine the contents of an existing .xclbin file and determine the kernels contained within.
std::cout << "Allocate Buffer in Global Memory\n";
auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
auto bo1 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
auto bo_out = xrt::bo(device, vector_size_bytes, krnl.group_id(2));
The kernel object (xrt::kernel) includes a method, kernel.group_id(), that returns the memory bank associated with each kernel argument. You assign a buffer object to each buffer argument of the kernel; buffers are not created for scalar arguments.
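For illustration, assuming a hypothetical kernel signature such as void vadd(const int* in1, const int* in2, int* out, int size), the buffer objects bo0, bo1, and bo_out created above cover arguments 0 through 2, while the scalar size argument is simply passed by value when the kernel is started (as covered in Executing Kernels on the Device):
// The scalar argument (index 3) needs no buffer object; pass it directly at run time
auto run = krnl(bo0, bo1, bo_out, DATA_SIZE);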
Creating Multiple Compute Units
When building the .xclbin file, you can specify the number of kernel instances, or compute units (CUs), to implement in the hardware by using the --connectivity.nk option as described in Creating Multiple Instances of a Kernel. After the .xclbin has been built, you can access the CUs from the host application.
A single kernel object (xrt::kernel) can be used to execute multiple CUs as long as the CUs have identical interface connectivity, meaning the CUs have the same memory connections (krnl.group_id). If the CUs do not all have the same connectivity, you can create a separate kernel object for each unique configuration of the kernel, as shown in the example below.
auto krnl1 = xrt::kernel(device, xclbin_uuid, "vadd:{vadd_1,vadd_2}");
auto krnl2 = xrt::kernel(device, xclbin_uuid, "vadd:{vadd_3}");
In the example above, krnl1 can be used to launch the CUs vadd_1 and vadd_2, which have matching connectivity, and krnl2 can be used to launch vadd_3, which has different connectivity.
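For example, the following sketch (with hypothetical buffer names) launches two executions from the same kernel object; XRT schedules each run on an idle CU with matching connectivity:
// Two independent runs from the same kernel object; each is dispatched to a free
// compute unit (vadd_1 or vadd_2) with matching connectivity
auto run_a = krnl1(bo0_a, bo1_a, bo_out_a, DATA_SIZE);
auto run_b = krnl1(bo0_b, bo1_b, bo_out_b, DATA_SIZE);
run_a.wait();
run_b.wait();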
Transferring Data between Host and Kernels
Transferring data to and from the memory in the accelerator card or device uses the buffer objects (xrt::bo) created in Setting Up XRT-Managed Kernels and Kernel Arguments. The class constructor typically allocates a regular, 4K-aligned buffer object. The following code creates regular buffer objects that have a host backing pointer allocated by user space in heap memory, and a device-side buffer allocated in the memory bank associated with the kernel argument (krnl.group_id). Optional flags in the xrt::bo constructor let you create non-standard types of buffers for use in special circumstances, as described in Creating Special Buffers.
std::cout << "Allocate Buffer in Global Memory\n";
auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
auto bo1 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
auto bo_out = xrt::bo(device, vector_size_bytes, krnl.group_id(2));
With the buffers established and filled with data, there are a number of methods to enable transfers between the host and the kernel, as described below:
- Using xrt::bo::sync(): Use xrt::bo::sync to sync data from the host to the device with the XCL_BO_SYNC_BO_TO_DEVICE flag, or from the device to the host with the XCL_BO_SYNC_BO_FROM_DEVICE flag, together with xrt::bo::write or xrt::bo::read to write the buffer from the host application or read the buffer from the device.
bo0.write(buff_data);
bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
bo1.write(buff_data);
bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE);
...
bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
bo_out.read(buff_data);
Note: If the buffer is created using a user pointer as described in Creating Buffers from User Pointers, the xrt::bo::sync call is sufficient, and the xrt::bo::write or xrt::bo::read commands are not required.
- Using xrt::bo::map(): This method maps the host-side buffer backing pointer to a user pointer.
// Map the contents of the buffer object into host memory
auto bo0_map = bo0.map<int*>();
auto bo1_map = bo1.map<int*>();
auto bo_out_map = bo_out.map<int*>();
The host code can subsequently use the mapped pointer for data reads and writes. However, after writing to the mapped pointer (or before reading from the mapped pointer), the xrt::bo::sync() command should be used with the required direction flag for the DMA operation.
for (int i = 0; i < DATA_SIZE; ++i) {
    bo0_map[i] = i;
    bo1_map[i] = i;
}
// Synchronize buffer content with device side
bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE);
There are additional buffer types and transfer scenarios supported by the XRT native API, as described in Miscellaneous Other Buffers.
Executing Kernels on the Device
The execution of a kernel is associated with a class called xrt::run that implements methods to start and wait for kernel execution. Most interaction with kernel objects is accomplished through xrt::run objects, created from a kernel to represent an execution of the kernel. The run object can be explicitly constructed from a kernel object, or implicitly constructed by starting a kernel execution as shown below.
std::cout << "Execution of the kernel\n";
auto run = krnl(bo0, bo1, bo_out, DATA_SIZE);
run.wait();
The code example above demonstrates launching the kernel execution using the xrt::kernel() operator with the list of arguments for the kernel, which returns an xrt::run object. This is an asynchronous operator that returns after starting the run. The xrt::run::wait() member function is used to block the current thread until the run is complete.
The xrt::run object can also be explicitly constructed and used to relaunch the same kernel function if desired:
auto run = xrt::run(krnl);
run.set_arg(0, bo0); // Arguments are specified starting from 0
run.set_arg(1, bo1);
run.set_arg(2, bo_out);
run.start();
run.wait();
In this example, the run object is explicitly constructed from the kernel object, the kernel arguments are specified with run.set_arg(), and the run execution is launched by the run.start() command. Finally, the current thread is blocked as it waits for the kernel to finish.
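Because the run object remains valid after the kernel completes, it can also be reused to relaunch the kernel, updating only the arguments that change. A minimal sketch, where bo_out2 is a hypothetical second output buffer created in the same way as bo_out:
// Relaunch the same kernel, replacing only the output buffer argument
run.set_arg(2, bo_out2);
run.start();
run.wait();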
After the kernel has completed its execution, you can sync the kernel results back to the host application using code similar to the following example:
// Get the output;
bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
// Validate our results
if (std::memcmp(bo_out_map, bufReference, DATA_SIZE))
throw std::runtime_error("Value read back does not match reference");
Setting Up User-Managed Kernels and Argument Buffers
User-managed kernels require the use of the XRT native API for the host application, and are specified as an IP object of the xrt::ip class. The following is a high-level overview of how to structure your host application to access user-managed kernels from an .xclbin file.
- Add the following header files to include the XRT native API:
#include "experimental/xrt_ip.h"
#include "xrt/xrt_bo.h"
  - experimental/xrt_ip.h: Defines the IP as an object of xrt::ip.
  - xrt/xrt_bo.h: Lets you create buffer objects in the XRT native API.
- Set up the application environment as described in Specifying the Device ID and Loading the XCLBIN.
- The IP object (xrt::ip) is constructed from the xrt::device object, the uuid of the .xclbin, and the name of the user-managed kernel. The xrt::ip class differs from the standard xrt::kernel class, and indicates that XRT does not manage the IP but does provide access to its registers:
//User Managed Kernel = IP
auto ip = xrt::ip(device, uuid, "Vadd_A_B");
- Create buffers for the IP arguments:
auto <buf_name> = xrt::bo(<device>,<DATA_SIZE>,<flag>,<bank_id>);
Where the buffer object constructor uses the following fields:
- <device>: xrt::device object of the accelerator card.
- <DATA_SIZE>: Size of the buffer as defined by the width and quantity of data.
- <flag>: Flag for creating the buffer objects.
- <bank_id>: Defines the memory bank on the device where the buffer should be allocated for IP access. The memory bank specified must match the corresponding IP port's connection inside the .xclbin file; otherwise you will get bad_alloc when running the application. You can specify the assignment of the kernel argument using the --connectivity.sp command as explained in Mapping Kernel Ports to Memory.
For example:
auto buf_in_a = xrt::bo(device,DATA_SIZE,xrt::bo::flags::normal,0);
auto buf_in_b = xrt::bo(device,DATA_SIZE,xrt::bo::flags::normal,0);
TIP: Verify the IP connectivity to determine the specific memory bank, or you can get this information from the Vitis-generated .xclbin.info file.
For example, the following information for a user-managed kernel from the .xclbin could guide the construction of buffer objects in your host code:
Instance: Vadd_A_B_1
Base Address: 0x1c00000
Argument: scalar00
Register Offset: 0x10
Port: s_axi_control
Memory: <not applicable>
Argument: A
Register Offset: 0x18
Port: m00_axi
Memory: bank0 (MEM_DDR4)
Argument: B
Register Offset: 0x24
Port: m01_axi
Memory: bank0 (MEM_DDR4)
- Transfer data between host and device:
auto a_data = buf_in_a.map<int*>();
auto b_data = buf_in_b.map<int*>();
// Sync Buffers
buf_in_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
buf_in_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);
xrt::bo::map() allows mapping the host-side buffer backing pointer to a user pointer. However, before reading from the mapped pointer or after writing to the mapped pointer, you should use xrt::bo::sync() with the direction flag for the DMA operation. After preparing the buffer (buffer creation and sync operation as shown above), you are free to pass all the necessary information to the IP with direct register write operations. For example, the following code passes the buffer base addresses to the IP through the xrt::ip::write_register() command, writing to the registers to move data from the host application to the kernel:
// Get the device-side addresses of the buffers
auto a_addr = buf_in_a.address();
auto b_addr = buf_in_b.address();
ip.write_register(REG_OFFSET_A,a_addr);
ip.write_register(REG_OFFSET_A+4,a_addr>>32);
ip.write_register(REG_OFFSET_B,b_addr);
ip.write_register(REG_OFFSET_B+4,b_addr>>32);
- Start the IP execution. Because the IP is user-managed, you can employ any number of register writes and reads to start the IP, check its status, or restart it. The following example uses an s_axilite interface to access control signals in the control register:
uint32_t axi_ctrl = 0;
std::cout << "INFO:IP Start" << std::endl;
axi_ctrl = IP_START;
ip.write_register(CSR_OFFSET, axi_ctrl);
// Wait until the IP is DONE
axi_ctrl = 0;
while((axi_ctrl & IP_IDLE) != IP_IDLE) {
    axi_ctrl = ip.read_register(CSR_OFFSET);
}
- After IP execution is finished, you can transfer the data back to the host with the xrt::bo::sync command, using the appropriate flag to dictate the buffer transfer direction:
buf_in_b.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
- Optionally, profile the application. Because XRT is not in charge of starting or stopping the kernel, you cannot directly profile the operation of user-managed kernels as you would XRT-managed kernels. However, you can use the user_range and user_event objects, as discussed in Custom Profiling of the Host Application, to profile elements of the host application. For example, the following code captures the time it takes to write the registers from the host application:
// Write Registers
range.start("Phase 4a", "Write A Register");
ip.write_register(REG_OFFSET_A,a_addr);
ip.write_register(REG_OFFSET_A+4,a_addr>>32);
range.end();
range.start("Phase 4b", "Write B Register");
ip.write_register(REG_OFFSET_B,b_addr);
ip.write_register(REG_OFFSET_B+4,b_addr>>32);
range.end();
You can observe some aspects of the application and kernel operation in the Vitis analyzer.
Enabling Auto-Restart of User-Managed Kernels
In user-managed kernels that implement the ap_ctrl_chain protocol from Vitis HLS, you can set the auto_restart bit in the s_axilite control register so that the kernel restarts automatically. A user-managed kernel that uses the auto_restart bit is called a never-ending kernel, as described in Streaming Data in User-Managed Never-Ending Kernels.
Programming the never-ending kernel requires the host application to set the auto_restart signal in the s_axilite control register at address 0x00; otherwise the kernel will simply run in single execution mode and wait for the host application to start it again. To program the kernel control register, use the following process:
- Set up the host application to access the user-managed kernel as an IP object of the xrt::ip class, as previously described.
class as previously described. - Write the value 129 (binary
10000001
) into the Control register, setting both the ap_start and auto_restart bits, enabling the kernel to run in never-ending mode. Thes_axilite
control register is located at0x00
, just as for otherap_ctrl_chain
kernels.IMPORTANT: Do not write anything else to the control register space which can lead to non-deterministic behavior.
auto ip = xrt::ip(device, uuid, "krnl_stream_vdatamover");
int startNow = 129;
size_t control_offset = 0;
ip.write_register(control_offset,startNow);
Summary
As discussed in earlier topics, the recommended coding style for the host program in the Vitis core development kit includes the following points:
- In the Vitis core development kit, one or more kernels are separately compiled and linked to build the .xclbin file. The device.load_xclbin(binaryFile) command is used to load the kernel binary.
- Create xrt::kernel objects from the loaded device binary, and associate buffer objects (xrt::bo) with the memory banks assigned to the kernel arguments.
- Transfer data back and forth between the host application and the kernel using xrt::bo::sync commands together with buffer read and write commands.
- Execute the kernel using an xrt::run object to start the kernel and wait for kernel execution to complete.
- Additionally, you can add error checking after XRT API calls for debugging purposes, if required, as shown in the sketch below.