Programming the PS Host Application

In Creating a Data Flow Graph (Including Kernels), the discussion centered around a very simple AI Engine graph application. The top-level application initialized the graph, ran the graph, and ended the graph. However, for actual AI Engine graph applications, the host code must do much more than those simple tasks. The top-level PS application running on the Cortex®-A72 controls the graph and PL kernels: managing data inputs to the graph, handling data outputs from the graph, and controlling any PL kernels working with the graph.

In addition, AI Engine graph applications can run on Linux operating systems or on bare-metal systems. The requirements for programming in these two systems are significantly different, as outlined in the following topics. Xilinx provides drivers, used by the API calls in the host program, to control the graph and PL kernels based on the operating system. In Linux, this support is provided by the libadf_api_xrt library; in bare-metal systems, the AI Engine kernels are controlled using the graph APIs, and PL kernels are controlled using libUIO driver calls.

Preventing Multiple Graph Executions

In cases where your graph is implemented in a PS-based host application, you must define a conditional compilation directive (#ifdef) in your graph.cpp code to ensure the graph is initialized and run only once. The following example code is the simple application defined in Creating a Data Flow Graph (Including Kernels) with the additional guard macros __AIESIM__ and __X86SIM__.

#include "project.h"

simpleGraph mygraph;
simulation::platform<1,1> platform("input.txt", "output.txt");
connect<> net0(platform.src[0], mygraph.in);
connect<> net1(mygraph.out, platform.sink[0]);

#if defined(__AIESIM__) || defined(__X86SIM__)

int main(void) {
  mygraph.init();
  mygraph.run(<number_of_iterations>);
  mygraph.end();
  return 0;
}
#endif

This conditional directive compiles the main() function only for use with the AI Engine simulator or the x86 simulator. It prevents the graph from being initialized or run twice, once from the graph and again from the PS host application. The directive lets graph.cpp run in standalone simulation or in hardware emulation of the system design, which also uses the AI Engine simulator. However, when running in hardware, the graph is initialized and run from the PS application rather than from graph.cpp.

Host Programming on Linux

In Linux operating systems, the ADF API controls the AI Engine graph. The Xilinx Runtime (XRT) API is used to control PL kernels. The Xilinx Runtime (XRT) API can also be used to control the AI Engine graph. The following figure shows the APIs and drivers required in this system.

Figure 1: AI Engine XRT Software Stack

Controlling the AI Engine Graph with the ADF API

ADF APIs are used to control graph execution in the top-level application, or host code, as described in AI Engine Programming. For example, the following code is for a synchronous update of run-time parameters for AI Engine kernels in the graph:

// ADF API: run and update graph parameters (RTP)
gr.run(4);
gr.update(gr.trigger,10);
gr.update(gr.trigger,10);
gr.update(gr.trigger,100);
gr.update(gr.trigger,100);
gr.wait();
TIP: Calling graph.end() terminates the graph; it cannot recover after end() has been called. Use graph.wait() instead to wait for runs to complete.

In the host application (host.cpp), the graph.update() function is called to update the RTPs, and graph.run() is called to launch the AI Engine kernels in the graph. In hardware emulation and hardware flows, the ADF API calls the XRT API, and adf::registerXRT() is used to manage the relationship between them.

IMPORTANT: adf::registerXRT() must be called before any ADF API control or interaction with the graph.

The following is example code showing the RTP update and execution by the ADF API.

// update graph parameters (RTP) & run
adf::registerXRT(dhdl, uuid);
gr.update(gr.size, 1024);//update RTP
gr.run(16);//start AIE kernel
gr.wait();

In the preceding example, gr.run(16) specifies a run of 16 iterations, and gr.wait() blocks until the AI Engine kernels complete.

The code example shows that adf::registerXRT() requires the device handle (dhdl) and UUID of the XCLBIN image. They can be obtained using the XRT APIs:

auto dhdl = xrtDeviceOpen(0);//device index=0
xrtDeviceLoadXclbinFile(dhdl,xclbinFilename);
xuid_t uuid;
xrtDeviceGetXclbinUUID(dhdl, uuid);

Controlling the PL Kernel with the XRT API

Xilinx provides an open-source XRT API for controlling the execution of PL kernels when programming the host code for Linux.

The execution model for the XRT API controlling PL kernels is as follows:

  1. Get the device handle and load the XCLBIN. Get the UUID as needed.
  2. Allocate buffer objects and map to host memory. Process and transfer data from host memory to device memory.
  3. Get kernel and run handles, set arguments for kernels, and launch kernels.
  4. Wait for kernel completion.
  5. Transfer data from global memory in the device back to host memory.
  6. Host code continues processing using the new data in the host memory.

When using the native XRT API, the host application looks like the following.

// 1. Open device, load xclbin, and get uuid
auto dhdl = xrtDeviceOpen(0);//device index=0

xrtDeviceLoadXclbinFile(dhdl,xclbinFilename);
xuid_t uuid;
xrtDeviceGetXclbinUUID(dhdl, uuid);

// 2. Allocate output buffer objects and map to host memory

xrtBufferHandle out_bohdl = xrtBOAlloc(dhdl, output_size_in_bytes, 0, /*BANK=*/0);
std::complex<short> *host_out = (std::complex<short>*)xrtBOMap(out_bohdl);

// 3. Get kernel and run handles, set arguments for kernel, and launch kernel
xrtKernelHandle s2mm_khdl = xrtPLKernelOpen(dhdl, uuid, "s2mm"); // Open kernel handle
xrtRunHandle s2mm_rhdl = xrtRunOpen(s2mm_khdl); 
xrtRunSetArg(s2mm_rhdl, 0, out_bohdl); // set kernel arg
xrtRunSetArg(s2mm_rhdl, 2, OUTPUT_SIZE); // set kernel arg
xrtRunStart(s2mm_rhdl); //launch s2mm kernel

// ADF API:run, update graph parameters (RTP) and so on
...

// 4. Wait for kernel completion
auto state = xrtRunWait(s2mm_rhdl);

// 5. Sync output device buffer objects to host memory

xrtBOSync(out_bohdl, XCL_BO_SYNC_BO_FROM_DEVICE, output_size_in_bytes, /*OFFSET=*/0);

// 6. Post-processing on host memory ("host_out")

After post-processing the data, release the allocated objects:

graph.end();
xrtRunClose(s2mm_rhdl);
xrtKernelClose(s2mm_khdl);

xrtBOFree(out_bohdl);
xrtDeviceClose(dhdl);
TIP: After graph.end() is called, the AI Engine kernels cannot be restarted. To run the host application multiple times, comment out graph.end() if the host does not depend on it for synchronization, or replace graph.end() with graph.wait() to perform the synchronization.

Controlling the AI Engine Graph with the XRT C API

The Xilinx-provided open-source XRT API can also be used to control execution of the AI Engine graph when programming the host code for Linux. To control the AI Engine graph, XRT provides APIs through the header file experimental/xrt_graph.h.

TIP: The header file experimental/xrt_graph.h is included in the header file experimental/xrt_kernel.h. So, including experimental/xrt_kernel.h is sufficient to use the XRT API to control the AI Engine graph.

XRT graph APIs contain both a C and C++ version. Example code to control the AI Engine graph using the XRT C API is as follows:

int narrow_filter[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
int wide_filter[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
xrtGraphHandle ghdl = xrtGraphOpen(dhdl, uuid, "gr");
if (ghdl) {
    xrtGraphUpdateRTP(ghdl, "gr.fir24.in[1]", (char*)narrow_filter, 12*sizeof(int));
    xrtGraphRun(ghdl, 16);
    xrtGraphWait(ghdl, 0);
    xrtGraphUpdateRTP(ghdl, "gr.fir24.in[1]", (char*)wide_filter, 12*sizeof(int));
    xrtGraphRun(ghdl, 16);
    ...
    xrtGraphEnd(ghdl, 0);
    xrtGraphClose(ghdl);
}
TIP: The file Work/ps/c_rts/aie_control_xrt.cpp contains information about the graph, RTP, GMIO, and initialization configurations. You can find the information required by these XRT APIs there.

Controlling the AI Engine Graph with the XRT C++ API

As stated in the previous section, XRT provides C and C++ APIs through the header file experimental/xrt_graph.h to control the AI Engine graphs.

XRT provides the class graph in the namespace xrt, with member functions to control the graph. Example code to control the AI Engine graph using the XRT C++ API is as follows:

using namespace adf;
// Open xclbin
auto device = xrt::device(0); //device index=0
auto uuid = device.load_xclbin(xclbinFilename);
auto dhdl = xrtDeviceOpenFromXcl(device);
...
int coeffs_readback[12];
int narrow_filter[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
int wide_filter[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
auto ghdl=xrt::graph(device,uuid,"gr");
ghdl.update("gr.fir24.in[1]",narrow_filter);
ghdl.run(16);
ghdl.wait();
ghdl.read("gr.fir24.inout[0]",coeffs_readback);//Read after graph::wait. RTP update effective
ghdl.update("gr.fir24.in[1]",wide_filter);
ghdl.run(16);
ghdl.read("gr.fir24.inout[0]", coeffs_readback);//Async read
ghdl.end();
Note: The file Work/ps/c_rts/aie_control_xrt.cpp contains information about the graph, RTP, GMIO, and initialization configurations. For example, the RTP port name gr.fir24.in[1] used in the preceding code can be found in the RTP Configurations section of aie_control_xrt.cpp.

Error Reporting Through the XRT API

XRT provides error reporting APIs, which fall into two categories: synchronous and asynchronous. Synchronous errors are detected during an XRT run-time function call and are reported to the caller as POSIX-compliant error codes.

An asynchronous error might not be related to the current XRT function call or the application that is running. Asynchronous errors are cached in driver subsystems and can be accessed by the user application through the asynchronous error reporting APIs. Cached errors are persistent until explicitly cleared. Persistent errors are not necessarily indicative of the current system state, for example, a board might have been reset and be functioning correctly while previously cached errors are still available. To avoid current state confusion, asynchronous errors have a timestamp attached indicating when the error occurred. The timestamp can be compared to, for example, the timestamp for last xbutil2 reset.

The errors cached by the driver contain a system error code and additional metadata as defined in xrt_error_code.h, which is shared between the user space and the kernel space.

The error code format for asynchronous errors is as shown here:

/**
 * xrtErrorCode layout
 *
 * This layout is internal to XRT (akin to a POSIX error code).
 *
 * The error code is populated by driver and consumed by XRT
 * implementation where it is translated into an actual error / info /
 * warning that is propagated to the end user.
 *
 * 63 - 48  47 - 40   39 - 32   31 - 24   23 - 16    15 - 0
 * --------------------------------------------------------
 * |    |    |    |    |    |    |    |    |    |    |----| xrtErrorNum
 * |    |    |    |    |    |    |    |    |----|---------- xrtErrorDriver
 * |    |    |    |    |    |    |----|-------------------- xrtErrorSeverity
 * |    |    |    |    |----|------------------------------ xrtErrorModule
 * |    |    |----|---------------------------------------- xrtErrorClass
 * |----|-------------------------------------------------- reserved
 *
 */
typedef uint64_t xrtErrorCode;
typedef uint64_t xrtErrorTime;

#define XRT_ERROR_NUM_MASK		0xFFFFUL
#define XRT_ERROR_NUM_SHIFT		0
#define XRT_ERROR_DRIVER_MASK		0xFUL
#define XRT_ERROR_DRIVER_SHIFT		16
#define XRT_ERROR_SEVERITY_MASK		0xFUL
#define XRT_ERROR_SEVERITY_SHIFT	24
#define XRT_ERROR_MODULE_MASK		0xFUL
#define XRT_ERROR_MODULE_SHIFT		32
#define XRT_ERROR_CLASS_MASK		0xFUL
#define XRT_ERROR_CLASS_SHIFT		40

#define	XRT_ERROR_CODE_BUILD(num, driver, severity, module, eclass) \
	((((num) & XRT_ERROR_NUM_MASK) << XRT_ERROR_NUM_SHIFT) | \
	(((driver) & XRT_ERROR_DRIVER_MASK) << XRT_ERROR_DRIVER_SHIFT) | \
	(((severity) & XRT_ERROR_SEVERITY_MASK) << XRT_ERROR_SEVERITY_SHIFT) | \
	(((module) & XRT_ERROR_MODULE_MASK) << XRT_ERROR_MODULE_SHIFT) | \
	(((eclass) & XRT_ERROR_CLASS_MASK) << XRT_ERROR_CLASS_SHIFT))

#define XRT_ERROR_NUM(code) (((code) >> XRT_ERROR_NUM_SHIFT) & XRT_ERROR_NUM_MASK)
#define XRT_ERROR_DRIVER(code) (((code) >> XRT_ERROR_DRIVER_SHIFT) & XRT_ERROR_DRIVER_MASK)
#define XRT_ERROR_SEVERITY(code) (((code) >> XRT_ERROR_SEVERITY_SHIFT) & XRT_ERROR_SEVERITY_MASK)
#define XRT_ERROR_MODULE(code) (((code) >> XRT_ERROR_MODULE_SHIFT) & XRT_ERROR_MODULE_MASK)
#define XRT_ERROR_CLASS(code) (((code) >> XRT_ERROR_CLASS_SHIFT) & XRT_ERROR_CLASS_MASK)

/**
 * xrt_error_num - XRT specific error numbers
 */

enum xrtErrorNum {
XRT_ERROR_NUM_FIRWWALL_TRIP = 1,
XRT_ERROR_NUM_TEMP_HIGH,
XRT_ERROR_NUM_AIE_SATURATION,
XRT_ERROR_NUM_AIE_FP,
XRT_ERROR_NUM_AIE_STREAM,
XRT_ERROR_NUM_AIE_ACCESS,
XRT_ERROR_NUM_AIE_BUS,
XRT_ERROR_NUM_AIE_INSTRUCTION,
XRT_ERROR_NUM_AIE_ECC,
XRT_ERROR_NUM_AIE_LOCK,
XRT_ERROR_NUM_AIE_DMA,
XRT_ERROR_NUM_AIE_MEM_PARITY,
XRT_ERROR_NUM_UNKNOWN
};

enum xrtErrorDriver {
  XRT_ERROR_DRIVER_XOCL,
  XRT_ERROR_DRIVER_XCLMGMT,
  XRT_ERROR_DRIVER_ZOCL,
  XRT_ERROR_DRIVER_AIE,
  XRT_ERROR_DRIVER_UNKNOWN
};

enum xrtErrorSeverity {
  XRT_ERROR_SEVERITY_EMERGENCY = 0,
  XRT_ERROR_SEVERITY_ALERT,
  XRT_ERROR_SEVERITY_CRITICAL,
  XRT_ERROR_SEVERITY_ERROR,
  XRT_ERROR_SEVERITY_WARNING,
  XRT_ERROR_SEVERITY_NOTICE,
  XRT_ERROR_SEVERITY_INFO,
  XRT_ERROR_SEVERITY_DEBUG,
  XRT_ERROR_SEVERITY_UNKNOWN
};

enum xrtErrorModule {
  XRT_ERROR_MODULE_FIREWALL = 0,
  XRT_ERROR_MODULE_CMC,
  XRT_ERROR_MODULE_AIE_CORE,
  XRT_ERROR_MODULE_AIE_MEMORY,
  XRT_ERROR_MODULE_AIE_SHIM,
  XRT_ERROR_MODULE_AIE_NOC,
  XRT_ERROR_MODULE_AIE_PL,
  XRT_ERROR_MODULE_AIE_UNKNOWN
};

enum xrtErrorClass {
XRT_ERROR_CLASS_FIRST_ENTRY = 1,
XRT_ERROR_CLASS_SYSTEM = XRT_ERROR_CLASS_FIRST_ENTRY,
XRT_ERROR_CLASS_AIE,
XRT_ERROR_CLASS_HARDWARE,
XRT_ERROR_CLASS_UNKNOWN,
XRT_ERROR_CLASS_LAST_ENTRY = XRT_ERROR_CLASS_UNKNOWN
};

The API header file experimental/xrt_error.h defines the APIs for accessing currently cached errors. It provides the xrtErrorGetLast() and xrtErrorGetString() APIs to retrieve system-level asynchronous errors.

/**
 * xrtErrorGetLast - Get the last error code and its timestamp of a given error class.
 *
 * @handle:       Device handle.
 * @class:        Error Class for the last error to get.
 * @error:        Returned XRT error code.
 * @timestamp:    The timestamp when the error generated
 *
 * Return:        0 on success or appropriate XRT error code.
 */
int
xrtErrorGetLast(xrtDeviceHandle handle, xrtErrorClass ecl, xrtErrorCode* error, uint64_t* timestamp);

/**
 * xrtErrorGetString - Get the description string of a given error code.
 *
 * @handle:       Device handle.
 * @error:        XRT error code.
 * @out:          Preallocated output buffer for the error string.
 * @len:          Length of output buffer.
 * @out_len:      Output of length of message, ignored if null.
 *
 * Return:        0 on success or appropriate XRT error code.
 *
 * Specifying out_len while passing nullptr for output buffer will
 * return the message length, which can then be used to allocate the
 * output buffer itself.
 */
int
xrtErrorGetString(xrtDeviceHandle, xrtErrorCode error, char* out, size_t len, size_t* out_len);

The application can call xrtErrorGetLast() with a given error class to get the latest error code. The application can call xrtErrorGetString() with a given error code to get the error string corresponding to this error code. XRT maintains the latest error for each class and an associated timestamp for when the error was generated.

xbutil2 can be used to report errors. The error report accumulates all the errors from the various classes and sorts them by timestamp. The report queries the drivers as to when the last reset was requested. This reset will be merged (using the timestamp) into the report listing.

$ xbutil2 examine -r error               
Asynchronous Errors
  Time                               Class               Module              Driver              Severity            Error Code          
  2020-Oct-08 16:40:02               CLASS_SYSTEM        MODULE_FIREWALL     DRIVER_XOCL         SEVERITY_EMERGENCY  FIREWALL_TRIP


$ xbutil2 examine -r error -f JSON-2020.2
{
    "schema_version": {
        "schema": "JSON",
        "creation_date": "Fri Oct  9 11:04:24 2020 GMT"
    },
    "devices": [
        {
            "asynchronous_errors": [
                {
                    "timestamp": "1602175202572070700",
                    "class": "CLASS_SYSTEM",
                    "module": "MODULE_FIREWALL",
                    "severity": "SEVERITY_EMERGENCY",
                    "driver": "DRIVER_XOCL",
                    "error_code": {
                        "error_id": "1",
                        "error_msg": "FIREWALL_TRIP"
                    }
                }
            ]
        }
    ]
}

xbutil2 can also be used to report AI Engine running status and read registers for debug purposes. For example, the following command reads the status of kernels after the graph has executed.

$ xbutil2 examine -r aie

--------------------------
1/1 [0000:00:00.0] : edge
--------------------------
Aie
  Aie_Metadata
  GRAPH[ 0] Name : gr
          Status : running
    SNo. Core [C:R] Iteration_Memory [C:R] Iteration_Memory_Addresses 
    [ 0] 23:1 23:1 16388 
    [ 1] 23:2 23:0 6980 
    [ 2] 23:3 23:1 4 
    [ 3] 24:1 24:0 4 
    [ 4] 24:2 24:2 4 
    [ 5] 24:3 24:1 4 
    [ 6] 25:1 25:1 4 


Core [ 0]
  Column : 23
  Row : 1
  Core:
    Status : core_done
    Program Counter : 0x00000308
    Link Register : 0x00000290
    Stack Pointer : 0x000340a0
  DMA:
    MM2S:
      Channel:
        Id : 0
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

    S2MM:
      Channel:
        Id : 0
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

  Locks:
    0 : released_for_write
    1 : released_for_write
    2 : released_for_write
    3 : released_for_write
    4 : released_for_write
    5 : released_for_write
    6 : released_for_write
    7 : released_for_write
    8 : released_for_write
    9 : released_for_write
    10 : released_for_write
    11 : released_for_write
    12 : released_for_write
    13 : released_for_write
    14 : released_for_write
    15 : released_for_write


  Events:
    core : 1, 2, 5, 22, 23, 24, 28, 29, 31, 32, 35, 36, 38, 39, 40, 44, 45, 47, 68
    memory : 1, 43, 44, 45, 106, 113

......


Core [ 6]
  Column : 25
  Row : 1
  Core:
    Status : enabled, east_lock_stall
    Program Counter : 0x000001e6
    Link Register : 0x000000b0
    Stack Pointer : 0x00030020
  DMA:
    MM2S:
      Channel:
        Id : 0
        Channel Status : stalled_on_requesting_lock
        Queue Size : 0
        Queue Status : okay
        Current BD : 2

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

    S2MM:
      Channel:
        Id : 0
        Channel Status : running
        Queue Size : 0
        Queue Status : okay
        Current BD : 0

        Id : 1
        Channel Status : idle
        Queue Size : 0
        Queue Status : okay
        Current BD : 0


  Locks:
    0 : acquired_for_write
    1 : released_for_write
    2 : released_for_write
    3 : released_for_write
    4 : released_for_write
    5 : released_for_write
    6 : released_for_write
    7 : released_for_write
    8 : released_for_write
    9 : released_for_write
    10 : released_for_write
    11 : released_for_write
    12 : released_for_write
    13 : released_for_write
    14 : released_for_write
    15 : released_for_write

  Events:
    core : 1, 2, 5, 22, 26, 28, 29, 31, 32, 35, 38, 39, 44
    memory : 1, 20, 21, 23, 35, 43, 44, 106, 113

The following command can be used to read specific registers for debug purposes.

$ xbutil2 advanced --read-aie-reg -d 0000:00:0 0 25 Core_Status 
Register Core_Status Value of Row:0 Column:25 is 0x00000201

For AI Engine register definitions, see the Versal ACAP AI Engine Register Reference (AM015). For details on xbutil2 command use, see https://xilinx.github.io/XRT/master/html/index.html.

AI Engine Error Events

This section provides error and related debug information for the errors obtained using the XRT error reporting APIs described previously. These errors are propagated from the AI Engine array and can be used to debug application-specific errors in hardware. For errors with class XRT_ERROR_CLASS_AIE, you can obtain additional information by enabling the dmesg logs, which provide the causes of the error (described in the following tables). An example log is shown here:

[ 6616.963964] aie aie0: Asserted tile error event 56 at col 6 row 7
[ 6616.970234] aie aie0: Asserted tile error event 56 at col 7 row 8
[ 6616.979187] aie aie0: Asserted tile error event 56 at col 8 row 5
Note: The tile location is indicated by the col and row numbers. Row 0 is the SHIM (interface) tile; AI Engine tiles start from row 1.

The following tables list the various categories of error, in addition to the exact error number, description, and tips on the next steps to debug and resolve the errors.

Table 1. CORE Module Error Events
Error Group No. Name Description Debug Tips
Instruction Errors 59 Instruction Decompression Error Event generated when the AI Engine cannot decompress a fetched instruction. This can happen if the program instructions are corrupt. Validate ELF generation. Regenerate the ELF file with the Vitis compiler (v++) --package command. If the issue persists, contact Xilinx support.
Access Errors 55 PM Reg Access Failure This error can happen on a bank access conflict to PM between the memory mapped AXI interface and the AI Engine. Contact Xilinx support.
60 DM address out of range Event generated if the AI Engine tries to access a memory location outside of 0x20000 – 0x3FFFF. Run the AI Engine simulator (aiesimulator) with --enable-memory-check, which will flag any access violations.
65 PM address out of range Event generated if the PC is out of range. Run the AI Engine simulator (aiesimulator) with --enable-memory-check, which will flag any access violations.
66 DM access to unavailable Event generated if the AI Engine issues an access to an isolated tile in its neighborhood. Check if the kernel running on the AI Engine accesses data memory of an isolated tile (a different partition).

If the issue persists, contact Xilinx support.

Bus Errors 58 AXI MM Slave Error Event generated if the memory mapped AXI interface slave read/write request is for an address which does not exist in the AI Engine tile. If the PL IP is accessing the AI Engine registers using the memory mapped AXI interface, check whether the PL IP accesses invalid registers.

If the issue persists, contact Xilinx support.

Stream Errors 54 TLAST in WSS words 0-2 Event generated if TLAST is not on the 4th word of a wide stream. If PL IP is used to generate the stream, check if it generates TLAST correctly.

If the issue persists, contact Xilinx support.

56 Stream Pkt Parity Error Event generated if there is any parity error in the packet header. Check the data source, such as the PL IP which generates the packets, to see if the packet is valid and if the parity bit is correctly calculated. If the data is from a PL IP, check the packet header generated by the PL IP.
57 Control Pkt Error Control Packet Error Check the data source, such as the PL IP which generates the packets, to see if it generates the packets correctly.

If the issue persists, contact Xilinx support.

ECC Errors 64 PM ECC Error 2bit Event generated when a 2-bit ECC error is detected. Re-run the application.

If the issue persists, contact Xilinx support.

62 PM ECC Error Scrub 2bit Event generated if the ECC scrubber detects a 2-bit ECC error. Re-run the application.

If the issue persists, contact Xilinx support.

Lock Errors 67 Lock Access to unavailable Event generated if the AI Engine issues a lock access to an isolated tile in its neighborhood. Contact Xilinx support.
  1. CORE refers to the AI Engine in the AI Engine tile.
Table 2. MEMORY Module Error Events
Errors Group No. Name Description Debug Tips
ECC Errors 88 DM ECC Error Scrub 2bit Event generated when ECC scrubber detects 2-bit ECC error in bank 0 or bank 1 of DM. Re-run the application.

If the issue persists, contact Xilinx support.

90 DM ECC Error 2bit Event generated when 2-bit ECC error is detected during access to bank 0 or 1 of DM. This data memory ECC error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface. Re-run the application.

If the issue persists, contact Xilinx support.

Memory Parity Errors 91 DM Parity Error Bank 2 Event generated when a parity error is detected during access to DM bank 2.

This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface.

Re-run the application.

If the issue persists, contact Xilinx support.

92 DM Parity Error Bank 3 Event generated when a parity error is detected during access to DM bank 3.

This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface.

Re-run the application.

If the issue persists, contact Xilinx support.

93 DM Parity Error Bank 4 Event generated when a parity error is detected during access to DM bank 4.

This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface.

Re-run the application.

If the issue persists, contact Xilinx support.

94 DM Parity Error Bank 5 Event generated when a parity error is detected during access to DM bank 5.

This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface.

Re-run the application.

If the issue persists, contact Xilinx support.

95 DM Parity Error Bank 6 Event generated when a parity error is detected during access to DM bank 6.

This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface.

Re-run the application.

If the issue persists, contact Xilinx support.

96 DM Parity Error Bank 7 Event generated when a parity error is detected during access to DM bank 7.

This data memory parity error can be caused by DM access from the AI Engine, tile DMA, or memory mapped AXI interface.

Re-run the application.

If the issue persists, contact Xilinx support.

DMA Errors 97 DMA S2MM 0 Error This error can be caused by writing to the BD task queue of S2MM channel 0 when it is full. If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full.

If the issue persists, contact Xilinx support.

98 DMA S2MM 1 Error This error can be caused by writing to the BD task queue of S2MM channel 1 when it is full. If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full.

If the issue persists, contact Xilinx support.

99 DMA MM2S 0 Error This error can be caused by writing to the BD task queue of MM2S channel 0 when it is full. If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full.

If the issue persists, contact Xilinx support.

100 DMA MM2S 1 Error

This error can be caused by writing to the BD task queue of MM2S channel 1 when it is full.

If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full.

If the issue persists, contact Xilinx support.

Table 3. SHIM Module Error Events
Error Group No. Name Description Debug Tips
Bus Errors 62 AXI MM Slave Tile Error Event generated if a memory mapped AXI interface slave request comes to an interface tile but the address is invalid. If using the PL IP to access the AI Engine register with the memory mapped AXI interface, check if the IP tries to access the wrong address.

If the issue persists, contact Xilinx support.

64 AXI MM Decode NSU Error The memory mapped AXI interface traffic internally has responded with a DECERR. For example, if a column or set of tiles is clock gated, a decode error is generated internally and travels on the memory mapped AXI interface to the interface tile to generate this event. If using the PL IP to access the AI Engine registers using the memory mapped AXI interface, check if the IP tries to access a tile that is clock gated.

If the issue persists, contact Xilinx support.

65 AXI MM Slave NSU Error The memory mapped AXI interface traffic internally has responded with a SLVERR. For example, an AI Engine tile in that interface tile column has responded with a slave error, which travels over the memory mapped AXI interface to the interface tile. If using the PL IP to access the AI Engine registers with the memory mapped AXI interface, check if the IP tries to access a wrong address.

If the issue persists, contact Xilinx support.

66 AXI MM Unsupported Traffic The memory mapped AXI interface from the NoC has made a request that the AI Engine does not support. If using the PL IP to access the AI Engine register with the memory mapped AXI interface, check if the IP generates unsupported memory mapped AXI interface requests.
67 AXI MM Unsecure Access in Secure Mode The memory mapped AXI interface from the NoC is violating the secure mode (trying to route unsecured traffic when AI Engine only supports secure traffic). Check if the AI Engine array is configured in secure mode.
68 AXI MM Byte Strobe Error The memory mapped AXI interface from the NoC is writing with non-complete 32-bit words (within a 32-bit word, all byte strobes must be set). If the PL IP is accessing the AI Engine using the memory mapped AXI interface, check if all byte strobes are set for a 32-bit word.
Stream Error 63 Control Pkt Error Control Packet Error If the PL IP is generating the control packets, check if the IP generates packets properly.

If the issue persists, contact Xilinx support.

DMA Error 69 DMA S2MM 0 Error This DMA error is for DMA S2MM channel 0. It can be caused by:
  • writing to the BD task queue when it is full;
  • decode error when it tries to access the memory
  • slave error when it tries to access the memory
If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full.

If you manage buffer descriptors in your application, check if the memory address sent to the interface tile DMA buffer descriptor is invalid.

If the issue persists, contact Xilinx support.

70 DMA S2MM 1 Error This DMA error is for DMA S2MM channel 1. It can be caused by:
  • writing to the BD task queue when it is full;
  • decode error when it tries to access the memory
  • slave error when it tries to access the memory
If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full.

If you manage buffer descriptors in your application, check if the memory address sent to the interface tile DMA buffer descriptor is invalid.

If the issue persists, contact Xilinx support.

71 DMA MM2S 0 Error This DMA error is for DMA MM2S channel 0. It can be caused by:
  • writing to the BD task queue when it is full
  • a decode error when the DMA tries to access the memory
  • a slave error when the DMA tries to access the memory
If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full, and check that the memory address written to the interface tile DMA buffer descriptor is valid.

If the issue persists, contact Xilinx support.

72 DMA MM2S 1 Error This DMA error is for DMA MM2S channel 1. It can be caused by:
  • writing to the BD task queue when it is full
  • a decode error when the DMA tries to access the memory
  • a slave error when the DMA tries to access the memory
If you manage buffer descriptors in your application, verify that you are not pushing new buffer descriptors when the queue is full, and check that the memory address written to the interface tile DMA buffer descriptor is valid.

If the issue persists, contact Xilinx support.
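The queue-full check recommended for the DMA errors above can be sketched as a simple guard in host code. This is an illustrative model only: the queue depth of 4 and the `pending` counter are assumptions, not the real AI Engine driver interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch only: the queue depth and pending counter are
 * assumptions, not the real driver API. The point is the guard: never
 * push a new buffer descriptor task when the task queue is full. */
#define BD_TASK_QUEUE_DEPTH 4u

typedef struct {
    uint32_t pending; /* BD tasks currently queued on this DMA channel */
} dma_channel_t;

static bool try_push_bd_task(dma_channel_t *ch)
{
    if (ch->pending >= BD_TASK_QUEUE_DEPTH)
        return false;  /* queue full: pushing here would raise a DMA error */
    ch->pending++;     /* model the push of one BD task */
    return true;
}
```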

  1. SHIM refers to the interface tiles in the AI Engine array.

Host Code Reference with ADF API and XRT API

This section provides a summary of the XRT APIs that control the PL kernels and the graph, as well as the mapping between the ADF API and the XRT API. Complete host code using either the ADF API or the XRT API to control the graph is also provided for reference.

Note: This section lists only a subset of the APIs. See https://github.com/xilinx/xrt for the latest and most detailed information about the XRT API.
Table 4. XRT APIs
XRT API Description
Category: Device handle (experimental/xrt_device.h)
xrtDeviceHandle xrtDeviceOpen(unsigned int index); Open a device and obtain its handle.
xrtDeviceHandle xrtDeviceOpenFromXcl(xclDeviceHandle xhdl); Get a device handle from xclDeviceHandle.
int xrtDeviceClose(xrtDeviceHandle dhdl); Close an opened device.
int xrtDeviceLoadXclbinFile(xrtDeviceHandle dhdl, const char* xclbin_fnm); Read and load an XCLBIN file.
void xrtDeviceGetXclbinUUID(xrtDeviceHandle dhdl, xuid_t out); Get UUID of XCLBIN image loaded on device.
Category: PL kernel handle (experimental/xrt_kernel.h)
xrtKernelHandle xrtPLKernelOpen(xrtDeviceHandle deviceHandle, const xuid_t xclbinId, const char *name); Open a PL kernel and obtain its handle.
int xrtKernelClose(xrtKernelHandle kernelHandle); Close an opened kernel.
xrtRunHandle xrtKernelRun(xrtKernelHandle kernelHandle, ...); Start a kernel execution.
xrtRunHandle xrtRunOpen(xrtKernelHandle kernelHandle); Open a new run handle for a kernel without starting kernel.
int xrtRunSetArg(xrtRunHandle rhdl, int index, ...); Set a specific kernel argument for this run.
int xrtRunUpdateArg(xrtRunHandle rhdl, int index, ...); Asynchronous update of kernel argument.
int xrtRunStart(xrtRunHandle rhdl); Start existing run handle.
enum ert_cmd_state xrtRunWait(xrtRunHandle rhdl); Wait for a run to complete.
int xrtRunClose(xrtRunHandle rhdl); Close a run handle.
Category: Graph handle (experimental/xrt_graph.h)
xrtGraphHandle xrtGraphOpen(xrtDeviceHandle handle, const uuid_t xclbinUUID, const char *graphName); Open a graph and obtain its handle.
void xrtGraphClose(xrtGraphHandle gh); Close an open graph.
int xrtGraphRun(xrtGraphHandle gh, int iterations); Start a graph execution.
int xrtGraphWait(xrtGraphHandle gh, uint64_t cycle); Wait a set number of AI Engine cycles since the last xrtGraphRun and then stop the graph. If cycle is 0, wait until the graph is finished. If the graph has already run more than the set number of cycles, stop the graph immediately.
int xrtGraphResume(xrtGraphHandle gh); Resume a suspended graph.
int xrtGraphEnd(xrtGraphHandle gh, uint64_t cycle); Wait a set number of AI Engine cycles since the last xrtGraphRun and then end the graph. If cycle is 0, wait until the graph is finished before ending the graph. If the graph has already run more than the set number of cycles, stop the graph immediately and end it.
int xrtGraphUpdateRTP(xrtGraphHandle gh, const char *hierPathPort, const char *buffer, size_t size); Update RTP value of port with hierarchical name.
int xrtGraphReadRTP(xrtGraphHandle gh, const char *hierPathPort, char *buffer, size_t size); Read RTP value of port with hierarchical name.
Category: AIE handle (experimental/xrt_aie.h)
int xrtAIESyncBO(xrtDeviceHandle handle, xrtBufferHandle bohdl, const char *gmioName, enum xclBOSyncDirection dir, size_t size, size_t offset); Transfer data between DDR memory and interface tile DMA channel.
Category: Buffer object handle (experimental/xrt_bo.h)
xrtBufferHandle xrtBOAlloc(xrtDeviceHandle dhdl, size_t size, xrtBufferFlags flags, xrtMemoryGroup grp); Allocate a BO of requested size with appropriate flags.
int xrtBOFree(xrtBufferHandle bhdl); Free a previously allocated BO.
int xrtBOSync(xrtBufferHandle bhdl, enum xclBOSyncDirection dir, size_t size, size_t offset); Synchronize buffer contents in requested direction.
void* xrtBOMap(xrtBufferHandle bhdl); Memory map BO into user address space.
Category: Error reporting (experimental/xrt_error.h)
int xrtErrorGetLast(xrtDeviceHandle handle, xrtErrorClass ecl, xrtErrorCode* error, uint64_t* timestamp); Get the last error code and its timestamp of a given error class.
int xrtErrorGetString(xrtDeviceHandle, xrtErrorCode error, char* out, size_t len, size_t* out_len); Get the description string of a given error code.

The following table lists the mapping between the ADF API and XRT API. The xrtGraphOpen(), xrtPLKernelOpen(), xrtRunOpen(), xrtKernelClose() XRT APIs are called inside the ADF APIs when required and there is no corresponding mapping listed.

Table 5. ADF API and XRT API mapping
Graph API XRT API
graph::run() xrtGraphRun(xrtGraphHandle, 0) for AI Engine.
graph::run(iterations) xrtGraphRun(xrtGraphHandle, iterations) for AI Engine.
graph::wait() xrtGraphWait(xrtGraphHandle, 0) for AI Engine.
graph::wait(aie_cycles) xrtGraphWait(xrtGraphHandle, aie_cycles) for AI Engine.
graph::resume() xrtGraphResume(xrtGraphHandle)
graph::end() xrtGraphEnd(xrtGraphHandle, 0) and then xrtGraphClose(xrtGraphHandle) for AI Engine.
graph::end(aie_cycles) xrtGraphEnd(xrtGraphHandle, aie_cycles) and then xrtGraphClose(xrtGraphHandle) for AI Engine.
graph::update() xrtGraphUpdateRTP() for AI Engine.
graph::read() xrtGraphReadRTP() for AI Engine.
GMIO::malloc() xrtBOAlloc() and xrtBOMap()
GMIO::free() xrtBOFree()
GMIO::gm2aie_nb() N/A
GMIO::aie2gm_nb() N/A
GMIO::wait() N/A
GMIO::gm2aie() xrtAIESyncBO(...,XCL_BO_SYNC_BO_GMIO_TO_AIE,...)
GMIO::aie2gm() xrtAIESyncBO(...,XCL_BO_SYNC_BO_AIE_TO_GMIO,...)
adf::event APIs for profiling and event trace N/A

The following is reference host code using the ADF API and the XRT API. __USE_ADF_API__ is a user-defined macro that switches between the ADF API and the XRT API to control the AI Engine graph.

#include <stdlib.h>
#include <fstream>
#include <iostream>
#include "host.h"
#include <unistd.h>
#include <complex>
#include "adf/adf_api/XRTConfig.h"
#include "experimental/xrt_kernel.h"

#include "graph.cpp"

#define OUTPUT_SIZE 2048

using namespace adf;

int main(int argc, char* argv[]) {

    size_t output_size_in_bytes = OUTPUT_SIZE * sizeof(int);

    //TARGET_DEVICE macro needs to be passed from gcc command line
    if(argc != 2) {
	printf("Usage: %s <xclbin>\r\n",argv[0]);
	return EXIT_FAILURE;
    }
    char* xclbinFilename = argv[1];
	
    int ret;
    // Open xclbin
    auto dhdl = xrtDeviceOpen(0);//device index=0
    if(!dhdl){
	printf("Device open error\n");
	return EXIT_FAILURE;
    }
    ret=xrtDeviceLoadXclbinFile(dhdl,xclbinFilename);
    if(ret){
	printf("Xclbin Load fail\n");
	return EXIT_FAILURE;
    }
    xuid_t uuid;
    xrtDeviceGetXclbinUUID(dhdl, uuid);
    
    // output memory
    xrtBufferHandle out_bohdl = xrtBOAlloc(dhdl, output_size_in_bytes, 0, /*BANK=*/0);
    std::complex<short> *host_out = (std::complex<short>*)xrtBOMap(out_bohdl);

    // s2mm ip
    xrtKernelHandle s2mm_khdl = xrtPLKernelOpen(dhdl, uuid, "s2mm");
    xrtRunHandle s2mm_rhdl = xrtRunOpen(s2mm_khdl);
    xrtRunSetArg(s2mm_rhdl, 0, out_bohdl);
    xrtRunSetArg(s2mm_rhdl, 2, OUTPUT_SIZE);
    xrtRunStart(s2mm_rhdl);
    printf("run s2mm\n");

#if __USE_ADF_API__
    // update graph parameters (RTP) & run
    adf::registerXRT(dhdl, uuid);
    printf("Register XRT\r\n");
    int narrow_filter[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
    int wide_filter[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
    gr.run(16);//start AIE kernel   
    gr.update(gr.fir24.in[1], narrow_filter, 12);//update AIE kernel RTP
    printf("Update fir24 done\r\n");
    printf("Graph run done\r\n");
    gr.wait(); // wait for AIE kernel to complete
    printf("Graph wait done\r\n");    
    gr.update(gr.fir24.in[1], wide_filter, 12);//Update AIE kernel RTP
    printf("Update fir24 done\r\n");
    gr.run(16);//start AIE kernel
    printf("Graph run done\r\n");
#else
    int narrow_filter[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
    int wide_filter[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
    auto ghdl=xrtGraphOpen(dhdl,uuid,"gr");
    if(!ghdl){
	printf("Graph Open error\r\n");
    }else{
	printf("Graph Open ok\r\n");
    }
    int size=1024;
    xrtKernelHandle noisegen_khdl = xrtPLKernelOpen(dhdl, uuid, "random_noise");
    xrtRunHandle noisegen_rhdl = xrtRunOpen(noisegen_khdl);
    xrtRunSetArg(noisegen_rhdl, 1, size);
    xrtRunStart(noisegen_rhdl);
    printf("run noisegen\n");
    ret=xrtGraphUpdateRTP(ghdl,"gr.fir24.in[1]",(char*)narrow_filter,12*sizeof(int));
    if(ret!=0){
	printf("Graph RTP update fail\n");
	return 1;
    }
    ret=xrtGraphRun(ghdl,16);
    if(ret){
	printf("Graph run error\r\n");
    }else{
	printf("Graph run ok\r\n");
    }
    ret=xrtGraphWait(ghdl,0);
    if(ret){
	printf("Graph wait error\r\n");
    }else{
	printf("Graph wait ok\r\n");
    }
    xrtRunWait(noisegen_rhdl);
    xrtRunSetArg(noisegen_rhdl, 1, size);
    xrtRunStart(noisegen_rhdl);
    printf("run noisegen\n");
    ret=xrtGraphUpdateRTP(ghdl,"gr.fir24.in[1]",(char*)wide_filter,12*sizeof(int));
    if(ret!=0){
	printf("Graph RTP update fail\n");
	return 1;
    }
    ret=xrtGraphRun(ghdl,16);
    if(ret){
	printf("Graph run error\r\n");
    }else{
	printf("Graph run ok\r\n");
    }
#endif
    // wait for s2mm done
    auto state = xrtRunWait(s2mm_rhdl);
    printf("s2mm completed with status %d\r\n",state);
	
    xrtBOSync(out_bohdl, XCL_BO_SYNC_BO_FROM_DEVICE, output_size_in_bytes, /*OFFSET=*/0);

    std::ofstream out("out.txt",std::ofstream::out);
    std::ifstream golden("data/filtered.txt",std::ifstream::in);
    short g_real=0,g_imag=0;
    int match = 0;
    for (int i = 0; i < OUTPUT_SIZE; i++) {
	golden >> std::dec >> g_real;
	golden >> std::dec >> g_imag;
	if(g_real!=host_out[i].real() || g_imag!=host_out[i].imag()){
	    printf("ERROR: i=%d gold.real=%d gold.imag=%d out.real=%d out.imag=%d\r\n",i,g_real,g_imag,host_out[i].real(),host_out[i].imag());
	    match=1;
	}
	out<<host_out[i].real()<<" "<<host_out[i].imag()<<" "<<std::endl;
    }
    out.close();
    golden.close();

#if __USE_ADF_API__
    gr.end();
#else
    ret=xrtGraphEnd(ghdl,0);
    if(ret){
	printf("Graph end error\r\n");
    }
    xrtRunClose(noisegen_rhdl);
    xrtKernelClose(noisegen_khdl);
    xrtGraphClose(ghdl);
#endif
    xrtRunClose(s2mm_rhdl);
    xrtKernelClose(s2mm_khdl);
    xrtBOFree(out_bohdl);
    xrtDeviceClose(dhdl);


    printf("TEST %s\r\n", match ? "FAILED" : "PASSED");

    return (match ? EXIT_FAILURE : EXIT_SUCCESS);
}

The XRT API has C and C++ versions for controlling the PL kernels. For more information about the C++ version of the XRT API, see https://xilinx.github.io/XRT/master/html/xrt_native_apis.html.

Host Programming for Bare-metal Systems

In a bare-metal/standalone environment, Xilinx provides a standalone board support package (BSP), drivers, and libraries that applications can use to reduce development effort. As described in Host Programming on Linux, the top-level application for bare-metal systems must also integrate and manage the AI Engine graph and PL kernels.

TIP: The steps to integrate a bare-metal system with the AI Engine graph and PL kernels are described in Building a Bare-metal System, or in Building a Bare-metal AI Engine in the Vitis IDE.

The following is an example top-level application (main.cpp) for a bare-metal system:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include "platform.h"
#include "xparameters.h"
#include "xil_io.h"
#include "xil_cache.h"
#include "input.h"
#include "golden.h"
...
void InitData(int32_t** out, int size)
{
    int i;
    *out = (int32_t*)malloc(sizeof(int32_t) * size);

    if(!*out) {
        printf("Allocation of memory failed \n");
        exit(-1);
    }

    for(i = 0; i < size; i++) {
        (*out)[i] = 0xABCDEF00;
    }
}

int RunTest(uint64_t mm2s_base, uint64_t s2mm_base, int32_t* in, int32_t* golden, 
    int32_t* out, int input_size, int output_size)
{
    int i;
    int errCount = 0;
    uint64_t memAddr = (uint64_t)in;
    uint64_t mem_outAddr = (uint64_t)out;

    printf("Starting test w/ cu\n");

    Xil_Out32(mm2s_base + MEM_OFFSET, (uint32_t) memAddr);
    Xil_Out32(mm2s_base + MEM_OFFSET + 4, 0);
    Xil_Out32(s2mm_base + MEM_OFFSET, (uint32_t) mem_outAddr);
    Xil_Out32(s2mm_base + MEM_OFFSET + 4, 0);
    Xil_Out32(mm2s_base + SIZE_OFFSET, input_size);
    Xil_Out32(s2mm_base + SIZE_OFFSET, output_size);
    Xil_Out32(mm2s_base + CTRL_OFFSET, 1);
    Xil_Out32(s2mm_base + CTRL_OFFSET, 1);

    printf("GRAPH INIT\n");
    clipgraph.init();

    printf("GRAPH RUN\n");
    clipgraph.run();

    while(1) {
        uint32_t v = Xil_In32(s2mm_base + CTRL_OFFSET);
        if(v & 6) {
            break;
        }
    }

    printf("PLIO IP DONE!\n");

    for(i = 0; i < output_size; i++) {
        if((((int32_t*)out)[i] != ((int32_t*)golden)[i]) ) {
            printf("Error found in sample %d: %d != golden %d\n", i+1, ((int32_t*)out)[i], ((int32_t*)golden)[i]);
            errCount++;
        }
        else
            printf("%d\n ",((int32_t*)out)[i]);
    }

    printf("Ending test w/ cu\n");
    return errCount;
}

int main()
{
    int i;
    int32_t* out;
    int errCount;

    Xil_DCacheDisable();
    init_platform();
    sleep(1);
    
    printf("Beginning test\n");
    InitData(&out, OUTPUT_SIZE);
    errCount = RunTest(MM2S_BASE, S2MM_BASE, (int32_t*)cint16input, int32golden, out, INPUT_SIZE, OUTPUT_SIZE);

    if(errCount == 0)
        printf("Test passed. \n");
    else
        printf("Test failed! Error count: %d \n",errCount);

    cleanup_platform();
    return errCount;
}

The following are the steps in the code example:

  • The main() function initializes the platform and the data, runs the test, verifies the result, and returns the error count.
  • InitData() allocates a buffer of the requested size and initializes it to a known data pattern.
  • RunTest() passes the necessary data to the kernels for processing and returns the number of mismatches.
  • clipgraph.init() initializes the AI Engine tiles that the kernels run on.
  • clipgraph.run() starts the kernels running on their associated tiles.
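The polling loop in RunTest() waits on the s2mm kernel's control register. The v & 6 test checks bit 1 (ap_done) and bit 2 (ap_idle) of the standard Vitis HLS ap_ctrl register layout; the helper below is a minimal sketch of that check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of the v & 6 test in RunTest(): in the standard Vitis HLS
 * ap_ctrl register, bit 0 is ap_start, bit 1 is ap_done, bit 2 is ap_idle. */
static bool hls_kernel_finished(uint32_t ap_ctrl)
{
    return (ap_ctrl & 0x6) != 0;  /* true when ap_done or ap_idle is set */
}
```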

The preceding code example references xparameters.h, which is automatically generated from the bare-metal BSP. Ensure the bare-metal BSP is properly generated so that the memory mapped addresses for all drivers are correctly assigned.

xil_io.h declares the general-purpose register I/O APIs, such as Xil_Out32() and Xil_In32(). These are the preferred method for accessing the drivers.
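As a rough illustration of the Xil_Out32()/Xil_In32() read and write semantics, the sketch below models them against a simulated register file. This is a hypothetical model only: the real functions perform volatile 32-bit accesses to physical device addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model only: a byte array stands in for the device's register
 * space; the real Xil_Out32()/Xil_In32() perform volatile 32-bit accesses
 * to physical addresses. */
static uint8_t sim_regs[0x100];

static void sim_out32(uintptr_t offset, uint32_t value)
{
    memcpy(&sim_regs[offset], &value, sizeof value);  /* 32-bit write */
}

static uint32_t sim_in32(uintptr_t offset)
{
    uint32_t value;
    memcpy(&value, &sim_regs[offset], sizeof value);  /* 32-bit read */
    return value;
}
```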

Addressing Kernels in Bare-metal Applications

For bare-metal applications, when addressing the PL kernels from the embedded application, you must use the control registers, reading and writing to each kernel at the base address and offsets at which it is implemented in hardware. In the application discussed earlier, the embedded application delivers data to the MM2S kernel, which introduces it to the AI Engine graph for the Interpolator and Classifier kernels, and reads data from the S2MM kernel to continue processing in the embedded application. In this case, address the MM2S and S2MM kernels as they are implemented in the PL region of the fixed platform.

The main.cpp of the example shows the #define statements for the kernel base addresses and the address offsets for specific registers. For example:

#define MM2S_BASE XPAR_MM2S_S_AXI_CONTROL_BASEADDR
#define S2MM_BASE XPAR_S2MM_S_AXI_CONTROL_BASEADDR

#define MEM_OFFSET 0x10
#define SIZE_OFFSET 0x1C
#define CTRL_OFFSET 0x0

To determine the addresses and offsets for the kernels, examine the files in the fixed platform. The base addresses of the implemented kernels are defined in the fixed platform xparameters.h file, located in the <platform_name>/standalone_domain/bspinclude/include folder. For the example design, the following entries in xparameters.h give the base addresses of these kernels.

/* Definitions for peripheral MM2S */
#define XPAR_MM2S_S_AXI_CONTROL_BASEADDR 0xA4020000
#define XPAR_MM2S_S_AXI_CONTROL_HIGHADDR 0xA402FFFF

/* Definitions for peripheral S2MM */
#define XPAR_S2MM_S_AXI_CONTROL_BASEADDR 0xA4030000
#define XPAR_S2MM_S_AXI_CONTROL_HIGHADDR 0xA403FFFF
Note: The xparameters.h file is generated and addressing is dynamic. It is best to reference the address macros for kernels rather than to hard code the addresses.
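Given the example base addresses above and the offsets defined in main.cpp, the effective register addresses can be checked with simple arithmetic. The values here come from this example design's generated xparameters.h and will differ per platform:

```c
#include <assert.h>
#include <stdint.h>

/* Example values only: base addresses come from this design's generated
 * xparameters.h, offsets from the kernel driver header. */
#define XPAR_MM2S_S_AXI_CONTROL_BASEADDR 0xA4020000u
#define XPAR_S2MM_S_AXI_CONTROL_BASEADDR 0xA4030000u
#define MEM_OFFSET  0x10u
#define SIZE_OFFSET 0x1Cu

/* Effective address passed to Xil_Out32()/Xil_In32(). */
static uint64_t reg_addr(uint64_t base, uint64_t offset)
{
    return base + offset;
}
```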

The register address offsets are defined in the <kernel_driver>_hw.h file in the _x/<kernel>/<kernel>/<kernel>/solution/impl/ip/drivers/<kernel>_v1_0/src folder of the compiled kernel produced by the Vitis™ compiler. For example, the MM2S kernel driver header, xmm2s_mm2s_hw.h, contains the following definitions.

#define XMM2S_MM2S_CONTROL_ADDR_AP_CTRL    0x00
#define XMM2S_MM2S_CONTROL_ADDR_GIE        0x04
#define XMM2S_MM2S_CONTROL_ADDR_IER        0x08
#define XMM2S_MM2S_CONTROL_ADDR_ISR        0x0c
#define XMM2S_MM2S_CONTROL_ADDR_MEM_V_DATA 0x10
#define XMM2S_MM2S_CONTROL_BITS_MEM_V_DATA 64
#define XMM2S_MM2S_CONTROL_ADDR_SIZE_DATA  0x1c
#define XMM2S_MM2S_CONTROL_BITS_SIZE_DATA  32

Use these offsets when reading or writing to the kernels. For instance, the example application main.cpp uses the following to write the buffer address to the MM2S kernel.

Xil_Out32(MM2S_BASE + MEM_OFFSET, (uint32_t) memAddr);
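Because MEM_V_DATA is a 64-bit register accessed through a 32-bit interface, RunTest() writes it as two 32-bit halves: the low word at MEM_OFFSET and the high word at MEM_OFFSET + 4 (the example writes 0 for the high word because the buffer sits below 4 GB). A minimal sketch of that split:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the two 32-bit writes used for the 64-bit MEM_V_DATA register:
 * the low word goes to MEM_OFFSET and the high word to MEM_OFFSET + 4. */
static void split_addr64(uint64_t addr, uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)(addr & 0xFFFFFFFFu);  /* written at MEM_OFFSET     */
    *hi = (uint32_t)(addr >> 32);          /* written at MEM_OFFSET + 4 */
}
```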