Host Programming

In the Vitis™ environment, the host application can be written in native C++ using the Xilinx® runtime (XRT) native C++ API or the industry-standard OpenCL™ API. The XRT native API is described here in brief, with additional details available under XRT Native API on the XRT documentation site. Refer to OpenCL Programming for a discussion of writing the host application using the OpenCL API.

TIP: For examples of host programming using the XRT native API refer to host_xrt in the Vitis_Accel_Examples.

In general, the structure of the host application can be divided into the following steps:

  1. Specifying the accelerator device ID and loading the .xclbin
  2. Setting up the kernel and kernel arguments
  3. Transferring data between the host and kernels
  4. Running the kernel and returning results
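The following minimal sketch shows these four steps together, using only the XRT native C++ API calls covered in the remainder of this section; the kernel name ("vadd"), buffer sizes, and argument layout are placeholders borrowed from the examples below.

#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main() {
    // 1. Specify the device and load the .xclbin
    auto device = xrt::device(0);
    auto uuid = device.load_xclbin("kernel.xclbin");

    // 2. Set up the kernel and its argument buffers
    auto krnl = xrt::kernel(device, uuid, "vadd");
    size_t vector_size_bytes = 1024 * sizeof(int);
    auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
    auto bo1 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
    auto bo_out = xrt::bo(device, vector_size_bytes, krnl.group_id(2));

    // 3. Transfer input data from the host to the device
    bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // 4. Run the kernel and read back the results
    auto run = krnl(bo0, bo1, bo_out, 1024);
    run.wait();
    bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    return 0;
}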
To use the native XRT APIs, the host application must link with the xrt_coreutil library. Compiling host code that uses the XRT native C++ API requires the C++14 standard (-std=c++14) or newer. For example:
g++ -g -std=c++14 -I$XILINX_XRT/include -L$XILINX_XRT/lib -lxrt_coreutil -pthread
IMPORTANT: When multithreading the host program, exercise caution when calling the fork() system call from a Vitis core development kit application. fork() does not duplicate all of the runtime threads, so the child process cannot run as a complete application in the Vitis core development kit. It is advisable to use the posix_spawn() system call instead to launch another process from the Vitis software platform application.
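As a minimal sketch of that recommendation (the helper executable path and arguments below are placeholders, not part of the XRT API), a child process could be launched with posix_spawn() as follows:

#include <spawn.h>
#include <sys/wait.h>

extern char **environ;

// Launch a helper process without duplicating the XRT runtime threads
static int launch_helper()
{
    pid_t pid;
    char* const args[] = {const_cast<char*>("/bin/true"), nullptr}; // placeholder command
    int rc = posix_spawn(&pid, "/bin/true", nullptr, nullptr, args, environ);
    if (rc != 0)
        return rc;                // spawn failed
    int status = 0;
    waitpid(pid, &status, 0);     // wait for the helper process to exit
    return status;
}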

Specifying the Device ID and Loading the XCLBIN

To use the Xilinx runtime (XRT) environment properly, the host application needs to identify the accelerator card and device ID that the kernel will run on, and load the device binary (.xclbin) into the device.

The XRT API includes a Device class (xrt::device) that can be used to specify the device ID on the accelerator card, and an XCLBIN class (xrt::xclbin) that defines the program for the runtime. You must use the following include statement in your source code to load these classes:
#include <xrt/xrt_kernel.h>

The following code snippet creates a device object by specifying the device ID from the target platform, and then loads the .xclbin into the device, returning the UUID for the program.

//Setup the Environment
unsigned int device_index = 0;
std::string binaryFile = parser.value("kernel.xclbin");
std::cout << "Open the device" << device_index << std::endl;
auto device = xrt::device(device_index);
std::cout << "Load the xclbin " << binaryFile << std::endl;
auto uuid = device.load_xclbin(binaryFile);
TIP: The device ID can be obtained using the xbutil command for a specific accelerator.

Setting Up XRT-Managed Kernels and Kernel Arguments

After identifying devices and loading the program, the host application should identify the kernels that execute on the device, and set up the kernel arguments. All kernels the host application interacts with are defined within the loaded .xclbin file, and so should be identified from there.

For XRT-managed kernels, the XRT API provides a Kernel class (xrt::kernel), that is used to access the kernels contained within the .xclbin file. The kernel object identifies an XRT-managed kernel in the .xclbin loaded into the Xilinx device that can be run by the host application.

TIP: As discussed in Setting Up User-Managed Kernels and Argument Buffers, you should use the IP class (xrt::ip) to identify the user-managed kernels in the .xclbin file.

The use of the kernel and buffer objects requires the addition of the following include statements in your source code:

#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

The following code example identifies a kernel ("vadd") defined in the program (uuid) loaded onto the device:

auto krnl = xrt::kernel(device, uuid, "vadd");
TIP: You can also use the xclbinutil command to examine the contents of an existing .xclbin file and determine the kernels contained within.
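For example, a command along the following lines (using the --info and --input options of xclbinutil, with a placeholder file name) reports the kernels and their memory connectivity:

xclbinutil --info --input vadd.xclbin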
After identifying the kernel or kernels to be run, you need to define buffer objects to associate with the kernel arguments and to enable data transfer between the host application and the kernel instance or compute unit (CU):
std::cout << "Allocate Buffer in Global Memory\n";
auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
auto bo1 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
auto bo_out = xrt::bo(device, vector_size_bytes, krnl.group_id(2));

The kernel object (xrt::kernel) includes a method, kernel.group_id(), that returns the memory bank associated with each kernel argument. Assign a buffer object to each buffer-type kernel argument; buffer objects are not created for scalar arguments.

Creating Multiple Compute Units

When building the .xclbin file you can specify the number of kernel instances, or compute units (CU) to implement into the hardware by using the --connectivity.nk option as described in Creating Multiple Instances of a Kernel. After the .xclbin has been built, you can access the CUs from the host application.

A single kernel object (xrt::kernel) can be used to execute multiple CUs as long as the CUs have identical interface connectivity, meaning the CUs have the same memory connections (krnl.group_id). If all CUs do not have the same kernel connectivity, then you can create a separate kernel object for each unique configuration of the kernel, as shown in the example below.

auto krnl1 = xrt::kernel(device, xclbin_uuid, "vadd:{vadd_1,vadd_2}");
auto krnl2 = xrt::kernel(device, xclbin_uuid, "vadd:{vadd_3}");

In the example above, krnl1 can be used to launch the CUs vadd_1 and vadd_2 which have matching connectivity, and krnl2 can be used to launch vadd_3, which has different connectivity.

TIP: If you create a single kernel object for multiple CUs without matching connectivity, then XRT assigns one or more CUs with matching connectivity to the kernel object, and ignores the other CUs in the hardware when executing the kernel.
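For example, a sketch of dispatching two concurrent runs through the shared kernel object might look like the following; the buffer objects are placeholders assumed to have been created as described later in this section:

// Each call returns its own run object; XRT dispatches each run to a
// free CU (vadd_1 or vadd_2) associated with krnl1
auto run1 = krnl1(bo_a0, bo_b0, bo_out0, DATA_SIZE);
auto run2 = krnl1(bo_a1, bo_b1, bo_out1, DATA_SIZE);
run1.wait();
run2.wait();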

Transferring Data between Host and Kernels

Transferring data to and from the memory on the accelerator card or device uses the buffer objects (xrt::bo) created in Setting Up XRT-Managed Kernels and Kernel Arguments.

The class constructor typically allocates a regular 4K aligned buffer object. The following code creates regular buffer objects that have a host backing pointer allocated by user space in heap memory, and a device-side buffer allocated in the memory bank associated with the kernel argument (krnl.group_id). Optional flags in the xrt::bo constructor let you create non-standard types of buffers for use in special circumstances as described in Creating Special Buffers.

std::cout << "Allocate Buffer in Global Memory\n";
auto bo0 = xrt::bo(device, vector_size_bytes, krnl.group_id(0));
auto bo1 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
auto bo_out = xrt::bo(device, vector_size_bytes, krnl.group_id(2));
IMPORTANT: A single buffer cannot be bigger than 4 GB. However, to maximize throughput from the host to global memory, Xilinx also recommends keeping the buffer size at least 2 MB if possible.

With the buffers established and filled with data, there are a number of methods to enable transfers between the host and the kernel, as described below:

Using xrt::bo::sync()
Use xrt::bo::sync to transfer data from the host to the device with the XCL_BO_SYNC_BO_TO_DEVICE flag, or from the device to the host with the XCL_BO_SYNC_BO_FROM_DEVICE flag. Use xrt::bo::write to write the buffer contents from the host application, and xrt::bo::read to read the buffer contents back from the device.
bo0.write(buff_data);
bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
bo1.write(buff_data);
bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE);
...
bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
bo_out.read(buff_data);
Note: If the buffer is created using a user-pointer as described in Creating Buffers from User Pointers, the xrt::bo::sync call is sufficient, and the xrt::bo::write or xrt::bo::read commands are not required.
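As a brief sketch of that case (see Creating Buffers from User Pointers for the full discussion; the 4 KB-aligned allocation below is an assumption for illustration), a buffer wrapped around an existing host allocation only needs the sync call:

#include <cstdlib>

// Allocate a 4 KB-aligned host buffer and wrap it in a buffer object
void* host_ptr = nullptr;
posix_memalign(&host_ptr, 4096, vector_size_bytes);
auto bo_user = xrt::bo(device, host_ptr, vector_size_bytes, krnl.group_id(0));

// The host pointer is the backing store, so no bo::write is needed
bo_user.sync(XCL_BO_SYNC_BO_TO_DEVICE);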
Using xrt::bo::map()
This method maps the host-side buffer backing pointer to a user pointer.
// Map the contents of the buffer object into host memory
auto bo0_map = bo0.map<int*>();
auto bo1_map = bo1.map<int*>();
auto bo_out_map = bo_out.map<int*>();

The host code can subsequently exercise the user pointer for data reads and writes. However, after writing to the mapped pointer (or before reading from the mapped pointer) the xrt::bo::sync() command should be used with the required direction flag for the DMA operation.

for (int i = 0; i < DATA_SIZE; ++i) {
   bo0_map[i] = i;
   bo1_map[i] = i;
}

// Synchronize buffer content with device side
bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
bo1.sync(XCL_BO_SYNC_BO_TO_DEVICE);

There are additional buffer types and transfer scenarios supported by the XRT native API, as described in Miscellaneous Other Buffers.

Executing Kernels on the Device

The execution of a kernel is associated with a class called xrt::run that implements methods to start and wait for kernel execution. Most interaction with kernel objects is accomplished through xrt::run objects, created from a kernel to represent an execution of the kernel.

The run object can be explicitly constructed from a kernel object, or implicitly constructed by starting a kernel execution as shown below.

std::cout << "Execution of the kernel\n";
auto run = krnl(bo0, bo1, bo_out, DATA_SIZE);
run.wait();

The above code example demonstrates launching the kernel execution using the xrt::kernel() operator with the list of arguments for the kernel that returns an xrt::run object. This is an asynchronous operator that returns after starting the run. The xrt::run::wait() member function is used to block the current thread until the run is complete.

TIP: Upon finishing the kernel execution, the xrt::run object can be used to relaunch the same kernel function if desired.
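For example, a sketch of such a relaunch with refreshed input data (buff_data here stands in for a newly prepared host array) might look like the following:

// Refresh the input buffers, then relaunch using the same run object
bo0.write(buff_data);
bo0.sync(XCL_BO_SYNC_BO_TO_DEVICE);
run.start();   // restart with the previously bound arguments
run.wait();
bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);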
An alternative approach to run the kernel is shown in the code below:
auto run = xrt::run(krnl);
run.set_arg(0, bo0); // Arguments are specified starting from 0
run.set_arg(1, bo1);
run.set_arg(2, bo_out);
run.set_arg(3, DATA_SIZE);
run.start();
run.wait();

In this example, the run object is explicitly constructed from the kernel object, the kernel arguments are specified with run.set_arg(), and the run execution is launched by the run.start() command. Finally, the current thread is blocked while it waits for the kernel to finish.

After the kernel has completed its execution, you can sync the kernel results back to the host application using code similar to the following example:

// Get the output;
bo_out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);

// Validate our results
if (std::memcmp(bo_out_map, bufReference, DATA_SIZE * sizeof(int)))
   throw std::runtime_error("Value read back does not match reference");

Setting Up User-Managed Kernels and Argument Buffers

User-managed kernels require the use of the XRT native API for the host application, and are specified as an IP object of the xrt::ip class. The following is a high-level overview of how to structure your host application to access user-managed kernels from an .xclbin file.

  1. Add the following header files to include the XRT native API:
    #include "experimental/xrt_ip.h"
    #include "xrt/xrt_bo.h"
    
    • experimental/xrt_ip.h: Defines the IP as an object of xrt::ip.
    • xrt/xrt_bo.h: Lets you create buffer objects in the XRT native API.
  2. Set up the application environment as described in Specifying the Device ID and Loading the XCLBIN.
  3. The IP object (xrt::ip) is constructed from the xrt::device object, the uuid of the .xclbin, and the name of the user-managed kernel. The xrt::ip differs from the standard xrt::kernel, and indicates that XRT does not manage the IP but does provide access to registers:
    //User Managed Kernel = IP
    auto ip = xrt::ip(device, uuid, "Vadd_A_B");
  4. Create buffers for the IP arguments:
    auto <buf_name> = xrt::bo(<device>,<DATA_SIZE>,<flag>,<bank_id>);

    Where the buffer object constructor uses the following fields:

    • <device>: xrt::device object of the accelerator card.
    • <DATA_SIZE>: Size of the buffer as defined by the width and quantity of data.
    • <flag>: Flag for creating the buffer objects.
    • <bank_id>: Defines the memory bank on the device where the buffer should be allocated for IP access. The memory bank specified must match the corresponding IP port's connection inside the .xclbin file; otherwise, you will get a bad_alloc error when running the application. You can specify the assignment of the kernel argument using the --connectivity.sp command as explained in Mapping Kernel Ports to Memory.

    For example:

    auto buf_in_a = xrt::bo(device,DATA_SIZE,xrt::bo::flags::normal,0);
    auto buf_in_b = xrt::bo(device,DATA_SIZE,xrt::bo::flags::normal,0);
    
    TIP: Verify the IP connectivity to determine the specific memory bank, or you can get this information from the Vitis generated .xclbin.info file.

    For example, the following information for a user-managed kernel from the .xclbin could guide the construction of buffer objects in your host code:

    Instance:        Vadd_A_B_1
       Base Address: 0x1c00000
    
       Argument:          scalar00
       Register Offset:   0x10
       Port:              s_axi_control
       Memory:            <not applicable>
    
       Argument:          A
       Register Offset:   0x18
       Port:              m00_axi
       Memory:            bank0 (MEM_DDR4)
    
       Argument:          B
       Register Offset:   0x24
       Port:              m01_axi
       Memory:            bank0 (MEM_DDR4)
    
  5. Transfer data between host and device:
        auto a_data = buf_in_a.map<int*>();
        auto b_data = buf_in_b.map<int*>();
    
        // Sync Buffers
        buf_in_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
        buf_in_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    

    xrt::bo::map() maps the host-side buffer backing pointer to a user pointer. However, before reading from the mapped pointer, or after writing to it, you should use xrt::bo::sync() with the required direction flag for the DMA operation.

  6. After preparing the buffers (buffer creation and sync operations as shown above), you can pass the necessary information to the IP with direct register writes. For example, the code below passes the buffer base addresses to the IP through the xrt::ip::write_register() command, moving the data references from the host application to the kernel:
        // Device-side base addresses of the argument buffers
        uint64_t a_addr = buf_in_a.address();
        uint64_t b_addr = buf_in_b.address();

        // REG_OFFSET_A/REG_OFFSET_B match the register offsets reported in the .xclbin.info output above
        ip.write_register(REG_OFFSET_A, a_addr);
        ip.write_register(REG_OFFSET_A+4, a_addr >> 32);
        ip.write_register(REG_OFFSET_B, b_addr);
        ip.write_register(REG_OFFSET_B+4, b_addr >> 32);
    
  7. Start the IP execution. Because the IP is user-managed, you can use any number of register reads and writes to start the IP, check its status, and restart it as needed. The following example uses an s_axilite interface to access control signals in the control register:
        uint32_t axi_ctrl = 0;
        std::cout << "INFO:IP Start" << std::endl;
        axi_ctrl = IP_START;
        ip.write_register(CSR_OFFSET, axi_ctrl);
    
        // Wait until the IP is done (poll until the IP_IDLE bit is set)
        axi_ctrl = 0;
        while((axi_ctrl & IP_IDLE) != IP_IDLE) {
            axi_ctrl = ip.read_register(CSR_OFFSET);
        }
     
  8. After IP execution is finished, you can transfer the data back to the host using the xrt::bo::sync command with the appropriate flag to dictate the buffer transfer direction.
        buf_in_b.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    
  9. Optionally profile the application.

    Because XRT is not in charge of starting or stopping the kernel, you cannot directly profile the operation of user-managed kernels as you would XRT-managed kernels. However, you can use the user_range and user_event objects, as discussed in Custom Profiling of the Host Application, to profile elements of the host application; a brief setup sketch follows this list. For example, the following code captures the time it takes to write the registers from the host application:

        // Write Registers
        range.start("Phase 4a", "Write A Register");
        ip.write_register(REG_OFFSET_A,a_addr);
        ip.write_register(REG_OFFSET_A+4,a_addr>>32);
        range.end();
        range.start("Phase 4b", "Write B Register");
        ip.write_register(REG_OFFSET_B,b_addr);
        ip.write_register(REG_OFFSET_B+4,b_addr>>32);
        range.end();

    You can observe these ranges, along with other aspects of the application and kernel operation, in the Vitis analyzer.
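As referenced in step 9 above, the following is a minimal sketch of the profiling setup assumed by that example; it presumes the xrt::profile::user_range class from the experimental/xrt_profile.h header described in Custom Profiling of the Host Application:

#include "experimental/xrt_profile.h"

// Constructing the range opens a first labeled interval; each subsequent
// start()/end() pair records another interval in the application timeline
xrt::profile::user_range range("Phase 4", "Write IP registers");
range.end();

range.start("Phase 4a", "Write A Register");
// ... register writes as shown in step 9 ...
range.end();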

Enabling Auto-Restart of User-Managed Kernels

In some user-managed kernels that implement the ap_ctrl_chain protocol from Vitis HLS, you can set the auto_restart bit in the s_axilite control register so that the kernel will restart automatically. A user-managed kernel that uses the auto_restart bit is called a never-ending kernel as described in Streaming Data in User-Managed Never-Ending Kernels.

Programming the never-ending kernel requires the host application to set the auto_restart signal in the s_axilite control register at address 0x00; otherwise, the kernel will simply run in single-execution mode and wait for the host application to start it again. To program the kernel control register, use the following process:

  1. Set up the host application to access the user-managed kernel as an IP object of the xrt::ip class as previously described.
  2. Write the value 129 (binary 10000001) into the Control register, setting both the ap_start and auto_restart bits, enabling the kernel to run in never-ending mode. The s_axilite control register is located at 0x00, just as for other ap_ctrl_chain kernels.
    IMPORTANT: Do not write anything else to the control register space, as doing so can lead to non-deterministic behavior.
The following code example sets up the IP object and writes the control register as described above. With ap_start set, the never-ending kernel begins execution, and with auto_restart set, it never ends:
auto ip = xrt::ip(device, uuid, "krnl_stream_vdatamover");
int startNow = 129;            // ap_start (bit 0) + auto_restart (bit 7)
size_t control_offset = 0x00;  // s_axilite control register
ip.write_register(control_offset, startNow);

Summary

As discussed in earlier topics, the recommended coding style for the host program in the Vitis core development kit includes the following points:

  1. In the Vitis core development kit, one or more kernels are separately compiled/linked to build the .xclbin file. The device.load_xclbin(binaryFile) command is used to load the kernel binary.
  2. Create xrt::kernel objects from the loaded device binary, and associate buffer objects (xrt::bo) with the memory banks assigned to kernel arguments.
  3. Transfer data back and forth between the host application and the kernel using xrt::bo::sync commands and buffer read and write commands.
  4. Execute the kernel using an xrt::run object to start the kernel and wait for kernel execution.
  5. Additionally, you can add error checking after XRT API calls for debugging purposes, if required.