Using a Virtual Platform

AI Engine Programming briefly introduced the simulation platform class with file I/O support. This chapter continues the discussion and describes other variations in detail.

Virtual Platform

A virtual platform specification helps connect the data flow graph to the external I/O mechanisms specific to the chosen target for testing or eventual deployment. The platform can be specified for a simulation, emulation, or actual hardware execution target.

The current release supports only a simulation platform, which means you can execute a data flow graph in a software simulation environment. The specification is as follows.

simulation::platform<#inputs, #outputs> platform-name(port-attribute-list);

The #inputs and #outputs specify how many input and output ports the platform needs to connect to the data flow graph. The platform object is pre-populated with src and sink arrays of output and input ports, respectively, to make the connection. The following example uses the simpleGraph graph from AI Engine Programming.

simpleGraph mygraph;
simulation::platform<1,1> platform("input.txt","output.txt");
connect<> net0(platform.src[0], mygraph.in);
connect<> net1(mygraph.out, platform.sink[0]);

The port-attribute list within the platform declaration is an enumeration of attributes of each platform port starting with all the inputs and followed by the outputs. These are described in the following sections.

FileIO Attributes

By default, a platform port attribute is a string name used to construct an attribute of type FileIO. The string specifies the name of an input or output file relative to the current directory that will source or sink the platform data. The explicit form is specified in the following example using a FileIO constructor.

FileIO *in = new FileIO("input.txt");
FileIO *out = new FileIO("output.txt");
simulation::platform<1,1> platform(in,out);

FileIO ports are solely for the purpose of application simulation in the absence of an actual hardware platform. They are provided as a matter of convenience to test out a data flow graph in isolation before it is connected to a real platform. An actual hardware platform exports either stream or memory ports.

PLIO Attributes

A PLIO port attribute is used to make external stream connections that cross the AI Engine to programmable logic (PL) boundary. This situation arises when a hardware platform is designed separately and the PL blocks are already instantiated inside the platform. This hardware design is exported from the Vivado tools as an XSA package and should be specified when creating a new project in the Vitis™ tools using that platform. The XSA contains a logical architecture interface specification that identifies which AI Engine I/O ports the platform can support. The following is an example interface specification containing stream ports (from the AI Engine perspective).

Table 1. Example Logical Architecture Port Specification
AI Engine Port Annotation Type Direction Data Width Clock Frequency (MHz)
S00_AXIS Weight0 stream slave 32 300
S01_AXIS Datain0 stream slave 32 300
M00_AXIS Dataout0 stream master 32 300

This interface specification describes how the platform exports two stream input ports (slave ports on the AI Engine array interface) and one stream output port (master port on the AI Engine array interface). A PLIO attribute specification is used to represent these interface ports and connect them to their respective destination or source kernel ports in the data flow graph.

The following example shows how the PLIO attributes from the previous table can be used in a program to read input data from a file or write output data to a file. The width and clock frequency of each PLIO port are also provided in the PLIO constructor. The constructor syntax is described in more detail in Adaptive Data Flow Graph Specification Reference.

adf::PLIO *wts = new adf::PLIO("Weight0", adf::plio_64_bits, "inputwts.txt", 300);
adf::PLIO *din = new adf::PLIO("Datain0", adf::plio_64_bits, "din.txt", 300);
adf::PLIO *dout = new adf::PLIO("Dataout0", adf::plio_64_bits, "dout.txt");
simulation::platform<2,1> platform(wts, din, dout);

The example simulation platform can then be connected to a graph that expects two input streams and one output stream in the usual way. During compilation, the logical architecture should be specified using the option --logical-arch=<filename>. This option is automatically populated by the Vitis tools if you have specified the XSA while creating the project. When simulated, the input weights and data are read from the two supplied files and the output data is produced in the designated output file in a streaming manner.

When a hardware platform is exported, all the AI Engine to PL stream connections are already routed to specific physical channels from the PL side.

Wide Stream Data Path PLIO

Typically, the AI Engine array runs at a higher clock frequency than the internal programmable logic. The AI Engine compiler can be given the --pl-freq option to identify the frequency at which the PL blocks are expected to run. To balance the throughput between the AI Engine and the programmable logic, the PL blocks can be designed with a wider stream data path (64-bit or 128-bit), which is then automatically sequentialized into a 32-bit stream on the AI Engine stream network at the AI Engine to PL interface crossing.

The following example shows how wide stream PLIO attributes can be used in a program to read input data from a file or write output data to a file. The constructor syntax is described in more detail in Adaptive Data Flow Graph Specification Reference.

PLIO *attr_o = new PLIO("TestLogicalNameOut", plio_128_bits, "data/output.txt");
PLIO *attr_i = new PLIO("TestLogicalNameIn", plio_128_bits, "data/input.txt");

simulation::platform<1, 1> platform(attr_i, attr_o);  // Platform with PLIO
MEPL128BitClass gMePl;                                // Toplevel graph
connect<> net0(platform.src[0], gMePl.in);
connect<> net1(gMePl.out, platform.sink[0]);

In the previous example, a simulation platform with two 128-bit PLIO attributes is declared: one for input and one for output. The platform ports are then hooked up to the graph in the usual way. Data files specified in the PLIO attributes are then automatically opened for reading the input or writing the output respectively.

When simulating PLIO with data files, the data must be organized to match both the width of the PL block and the data type of the connecting port on the AI Engine kernel. For example, a data file representing a 32-bit PL interface feeding an AI Engine kernel that expects int16 should be organized as two columns per row, where each column is a 16-bit value. Similarly, a data file representing a 64-bit PL interface feeding an AI Engine kernel that expects cint16 should be organized as four columns per row, where each column is a 16-bit real or imaginary value. The same 64-bit PL interface feeding an AI Engine kernel with an int32 port would need the data organized as two columns per row of 32-bit values. The following examples show the input file format for these scenarios.

64-bit PL interface feeding AI Engine kernel expecting cint16
input file:
0 0 0 0
1 1 1 1
2 2 2 2

64-bit PL interface feeding AI Engine kernel expecting int32
input file:
0 0
1 1
2 2
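The following is a minimal sketch, not part of the tool flow, of one way to generate such an input file programmatically. It targets the 64-bit PL interface feeding a cint16 kernel described above (four 16-bit columns per row, that is, two cint16 samples per 64-bit word); the file name and sample count are arbitrary choices for illustration.

#include <fstream>

int main() {
    // Write a PLIO input file for a 64-bit PL interface feeding a cint16 kernel:
    // four 16-bit columns per row (real imag real imag), i.e., two cint16 samples per row.
    std::ofstream ofs("data/input.txt");   // file name is an assumption
    const int NUM_SAMPLES = 32;            // total cint16 samples to generate
    for (int i = 0; i < NUM_SAMPLES; i += 2) {
        // sample i: real=i, imag=i; sample i+1: real=i+1, imag=i+1
        ofs << i << " " << i << " " << (i + 1) << " " << (i + 1) << "\n";
    }
    return 0;
}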

With these wide PLIO attribute specifications, the AI Engine compiler automatically generates the AI Engine array interface configuration to convert 64-bit or 128-bit data into a sequence of 32-bit words. The AXI4-Stream protocol followed by all PL IP blocks ensures that partial data can also be sent on a wider data path, with the appropriate strobe signals indicating which words are valid.

GMIO Attributes

A GMIO port attribute is used to make external memory-mapped connections to or from the global memory. These connections are made between an AI Engine graph and the logical global memory ports of a hardware platform design. The platform can be a base platform from Xilinx or a custom platform that is exported from the Vivado tools as a Xilinx device support archive (XSA) package.

The AI Engine tools support one-to-one mapping of a GMIO port to a tile DMA channel; mapping multiple GMIO ports to one tile DMA channel is not supported. There is a limit on the number of GMIO ports supported for a given device. For example, the XCVC1902 device on the VCK190 board has 16 AI Engine to NoC master units (NMUs) in total. Each AI Engine to NoC NMU supports two MM2S and two S2MM channels, so at most 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs are supported. Note that this can be further limited by the existing hardware platform.

Note: GMIO channel constraints should not be used for AI Engine compilation.

While developing data flow graph applications on top of an existing hardware platform, you need to know what global memory ports are exported by the underlying XSA and their functionality. In particular, any input or output ports exposed on the platform are recorded within the XSA and can be viewed as a logical architecture interface.

Programming Model for AI Engine–DDR Memory Connection

The GMIO port attribute can be used to initiate AI Engine–DDR memory read and write transactions in the PS program. This enables data transfer between an AI Engine and the DDR controller through APIs written in the PS program. The following example shows how to use GMIO APIs to send data to an AI Engine for processing and retrieve the processed data back to the DDR through the PS program.

GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
 
myGraph gr;
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);

int main(int argc, char ** argv)
{
	 const int BLOCK_SIZE=256; 
    int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
 
    // provide input data to AI Engine in inputArray
	 for(int i=0;i<BLOCK_SIZE;i++){
		inputArray[i]=i;
	 }
 
    gr.init();
          
    gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
    gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
 
    gr.run(8);
 
    gmioOut.wait();
 
    // can start to access output data from AI Engine in outputArray
	... 

	GMIO::free(inputArray);
	GMIO::free(outputArray);
    gr.end();
}

This example declares two GMIO objects: gmioIn represents the DDR memory space to be read by the AI Engine and gmioOut represents the DDR memory space to be written by the AI Engine. The constructor specifies the logical name of the GMIO, the burst length (which can be 64, 128, or 256 bytes) of the memory-mapped AXI4 transaction, and the required bandwidth (in MB/s).

GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);

Assuming the application graph (myGraph) has an input port (myGraph::in) connecting to the processing kernels and an output port (myGraph::out) producing the processed data from the kernels, the following code connects gmioIn (as a platform source) to the input port of the graph and connects gmioOut (as a platform sink) to the output port of the graph.

simulation::platform<1,1> plat(&gmioIn, &gmioOut);
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);

Inside the main function, two 256-element int32 arrays are allocated by GMIO::malloc. inputArray points to the memory space to be read by the AI Engine and outputArray points to the memory space to be written by the AI Engine. In Linux, the virtual address passed to GMIO::gm2aie_nb, GMIO::aie2gm_nb, GMIO::gm2aie, and GMIO::aie2gm must be allocated by GMIO::malloc. After the input memory space is allocated, the input data can be initialized.

const int BLOCK_SIZE=256;
int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));

GMIO::gm2aie_nb is used to initiate memory-mapped AXI4 transactions for the AI Engine to read from DDR memory spaces. The first argument in GMIO::gm2aie_nb is a pointer to the start address of the memory space for the transaction. The second argument is the transaction size in bytes. The memory space for the transaction must be within the memory space allocated by GMIO::malloc. Similarly, GMIO::aie2gm_nb is used to initiate memory-mapped AXI4 transactions for the AI Engine to write to DDR memory spaces. GMIO::gm2aie_nb and GMIO::aie2gm_nb are non-blocking in the sense that they return as soon as the transaction is issued; they do not wait for the transaction to complete. By contrast, GMIO::gm2aie and GMIO::aie2gm behave in a blocking manner.
In this example, assume that in one iteration the graph consumes 32 int32 values from the input port and produces 32 int32 values to the output port. To run eight iterations, the graph consumes 256 int32 values and produces 256 int32 values. The corresponding memory-mapped AXI4 transactions are initiated using the following code: one gmioIn.gm2aie_nb call issues a read transaction for eight iterations' worth of data, and one gmioOut.aie2gm_nb call issues a write transaction for eight iterations' worth of data.

gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));

gr.run(8) is also a non-blocking call; it runs the graph for eight iterations. To synchronize DDR read/write accesses between the PS and the AI Engine, you can use GMIO::wait to block PS execution until the GMIO transaction is complete. In this example, gmioOut.wait() is called to wait for the output data to be written to the outputArray DDR memory space.

Note: The memory is non-cacheable for GMIO in Linux.

After that, the PS program can access the data. When the PS has completed processing, the memory space allocated by GMIO::malloc can be released by GMIO::free.

GMIO::free(inputArray);
GMIO::free(outputArray);

GMIO APIs can be used in various ways to exercise different levels of control over read/write access and synchronization between the AI Engine, the PS, and DDR memory. GMIO::gm2aie, GMIO::aie2gm, GMIO::gm2aie_nb, and GMIO::aie2gm_nb can be called multiple times to associate different memory spaces with the same GMIO object during different phases of graph execution. Different GMIO objects can be associated with the same memory space for in-place AI Engine–DDR read/write access. The blocking GMIO::gm2aie and GMIO::aie2gm APIs are themselves synchronization points for data transportation and kernel execution. Calling GMIO::gm2aie (or GMIO::aie2gm) is equivalent to calling GMIO::gm2aie_nb (or GMIO::aie2gm_nb) followed immediately by GMIO::wait.
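For example, the following two fragments behave the same (a minimal sketch reusing the gmioIn object and inputArray buffer from the earlier example):

// Blocking form: returns only after the read transaction has completed.
gmioIn.gm2aie(inputArray, BLOCK_SIZE*sizeof(int32));

// Equivalent non-blocking form: issue the transaction, then wait for it explicitly.
gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gmioIn.wait();

The following example shows a combination of the aforementioned use cases.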

GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
 
myGraph gr;
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
 
int main(int argc, char ** argv)
{
	const int BLOCK_SIZE=256;
    // dynamically allocate memory spaces for in-place AI Engine read/write access
	int32* inoutArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
     
    gr.init();
 
    for (int k=0; k<4; k++)
    {
		// provide input data to AI Engine in inoutArray
		for(int i=0;i<BLOCK_SIZE;i++){
			inoutArray[i]=i;
		}
         
        gr.run(8);
        for (int i=0; i<8; i++)
        {
            gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
            gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
        }
        gmioOut.wait();
     
        // can start to access output data from AI Engine in inoutArray
		...        
    }
	GMIO::free(inoutArray);
    gr.end();
}

In the example above, the two GMIO objects gmioIn and gmioOut use the same memory space, pointed to by inoutArray, for in-place read and write access.

Because the data flow dependencies among the kernels inside the graph are not known, the blocking gmioIn.gm2aie call is used to ensure write-after-read ordering on the inoutArray memory space: the transaction data is copied from DDR memory to AI Engine local memory before the write transaction to the same memory space is issued by gmioOut.aie2gm_nb.

gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); // blocking call to ensure transaction data is read from DDR to AI Engine
gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));

gmioOut.wait() ensures that the data has been migrated to DDR memory. After it returns, the PS can access the output data for post-processing.

The graph execution is divided into four phases by the for loop, for (int k=0; k<4; k++). inoutArray can be re-initialized in each loop iteration with different data to be processed in each phase.

Hardware Emulation and Hardware Flows

GMIO is not only used with the virtual platform in the AI Engine simulator; it also works in the hardware emulation and hardware flows. To enable this, add the following code to graph.cpp.

#if !defined(__AIESIM__)
    #include "adf/adf_api/XRTConfig.h"
    #include "experimental/xrt_kernel.h"
    // Create XRT device handle for ADF API
    
    char* xclbinFilename = argv[1];
    auto dhdl = xrtDeviceOpen(0);//device index=0
    xrtDeviceLoadXclbinFile(dhdl,xclbinFilename);
    xuid_t uuid;
    xrtDeviceGetXclbinUUID(dhdl, uuid);
       
    adf::registerXRT(dhdl, uuid);
#endif

Using the guard macro __AIESIM__, the same version of graph.cpp can work for the AI Engine simulator, hardware emulation, and hardware flows. Note that the preceding code should be placed before calling the graph or the GMIO ADF APIs. At the end of the program, close the device using the xrtDeviceClose() API.

#if !defined(__AIESIM__)
    xrtDeviceClose(dhdl);
#endif

To compile the code for hardware flow, see Programming the PS Host Application.

While it is recommended to use the ADF APIs to control GMIO, it is possible to use XRT. However, only synchronous mode GMIO transactions are supported. The API for synchronously transferring data can be found in experimental/xrt_aie.h:

/**
 * xrtAIESyncBO() - Transfer data between DDR and Shim DMA channel
 *
 * @handle:          Handle to the device
 * @bohdl:           BO handle.
 * @gmioName:        GMIO name
 * @dir:             GM to AIE or AIE to GM
 * @size:            Size of data to synchronize
 * @offset:          Offset within the BO
 *
 * Return:          0 on success, or appropriate error number.
 *
 * Synchronize the buffer contents between GMIO and AIE.
 * Note: Upon return, the synchronization is done or error out
 */
int
xrtAIESyncBO(xrtDeviceHandle handle, xrtBufferHandle bohdl, const char *gmioName, enum xclBOSyncDirection dir, size_t size, size_t offset);
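The following is a minimal usage sketch rather than a complete host program: it assumes the xrtDeviceHandle dhdl and xclbin loading shown earlier in graph.cpp, reuses the GMIO logical name "gmioIn" from the earlier example, and assumes the experimental/xrt_bo.h header and the XCL_BO_SYNC_BO_GMIO_TO_AIE direction constant; the buffer size, flags, and memory group are illustrative values.

#include <cstdint>
#include "experimental/xrt_aie.h"   // xrtAIESyncBO, as referenced above
#include "experimental/xrt_bo.h"    // xrtBOAlloc/xrtBOMap/xrtBOFree (header path is an assumption)

void send_input_via_xrt(xrtDeviceHandle dhdl)
{
    const size_t SIZE_IN_BYTES = 256 * sizeof(int32_t);

    // Allocate a buffer object in global memory and fill it from the host.
    xrtBufferHandle in_bohdl = xrtBOAlloc(dhdl, SIZE_IN_BYTES, 0, 0); // flags and memory group are illustrative
    int32_t* in_bomapped = reinterpret_cast<int32_t*>(xrtBOMap(in_bohdl));
    for (int i = 0; i < 256; i++) in_bomapped[i] = i;

    // Synchronously move the buffer contents from global memory to the AI Engine
    // through the GMIO port with logical name "gmioIn".
    xrtAIESyncBO(dhdl, in_bohdl, "gmioIn", XCL_BO_SYNC_BO_GMIO_TO_AIE, SIZE_IN_BYTES, 0);

    // Output data can be synchronized back the other way with XCL_BO_SYNC_BO_AIE_TO_GMIO.

    xrtBOFree(in_bohdl);
}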

Performance Comparison Between AI Engine/PL and AI Engine/NoC Interfaces

The AI Engine array interface consists of the PL and NoC interface tiles. The AI Engine array interface tiles manage the following two high-performance interfaces.

  • AI Engine to PL
  • AI Engine to NoC

The following image shows the AI Engine array interface structure.

Figure 1: AI Engine Array Interface Topology


One AI Engine to PL interface tile contains eight streams from the PL to the AI Engine and six streams from the AI Engine to the PL. The following table shows one AI Engine to PL interface tile capacity.

Table 2. AI Engine Array Interface to PL Interface Bandwidth Performance
Connection Type Number of Connections Data Width (bits) Clock Domain Bandwidth per Connection (GB/s) Aggregate Bandwidth (GB/s)
PL to AI Engine array interface 8 64 PL (500 MHz) 4 32
AI Engine array interface to PL 6 64 PL (500 MHz) 4 24
Note: All bandwidth calculations in this section assume a nominal 1 GHz AI Engine clock for a -1L speed grade device at VCCINT = 0.70V with the PL interface running at half the frequency of the AI Engine as an example.

The exact number of PL and NoC interface tiles is device-specific. For example, in the VC1902 device, there are 50 columns of AI Engine array interface tiles. However, only 39 array interface tiles are available to the PL interface. Therefore, the aggregate bandwidth for the PL interface is approximately:

  • 24 GB/s * 39 = 0.936 TB/s from AI Engine to PL
  • 32 GB/s * 39 = 1.248 TB/s from PL to AI Engine

The number of array interface tiles available to the PL interface and the total bandwidth of the AI Engine to PL interface for other devices and across different speed grades are specified in Versal AI Core Series Data Sheet: DC and AC Switching Characteristics (DS957).

GMIO uses DMA in the AI Engine to NoC interface tile. The DMA has two 32-bit incoming streams from the AI Engine and two 32-bit streams to the AI Engine. In addition, it has one 128-bit memory mapped AXI master interface to the NoC NMU. The performance of one AI Engine to NoC interface tile is shown in the following table.

Table 3. AI Engine to NoC Interface Tile Bandwidth Performance
Connection Type Number of Connections Bandwidth per Connection (GB/s) Aggregate Bandwidth (GB/s)
AI Engine to DMA 2 4 8
DMA to NoC 1 16 16
DMA to AI Engine 2 4 8
NoC to DMA 1 16 16

The exact number of AI Engine to NoC interface tiles is device-specific. For example, in the VC1902 device, there are 16 AI Engine to NoC interface tiles. So, the aggregate bandwidth for the NoC interface is approximately:

  • 8 GB/s * 16 = 128 GB/s from AI Engine to NoC
  • 8 GB/s * 16 = 128 GB/s from NoC to AI Engine

When accessing DDR memory, the number of integrated DDR memory controllers (DDRMCs) in the platform limits the performance of DDR memory reads and writes. For example, if all four DDRMCs in a VC1902 device are fully used, the hard limit for DDR memory access is as follows.

  • 3200 Mb/s * 64 bit * 4 DDRMCs / 8 (bits per byte) = 102.4 GB/s

The performance of GMIO access to DDR memory through the NoC is further restricted by the number of NoC lanes in the horizontal and vertical NoC, the inter-NoC configuration, and QoS. Note that DDR memory read and write efficiency is also largely affected by the access pattern and other overheads. For more information about the NoC, memory controller use, and performance numbers, see the Versal ACAP Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).

For a single connection from the AI Engine or to the AI Engine, both PLIO and GMIO have a hard bandwidth limit of 4 GB/s. Some advantages and disadvantages for choosing PLIO or GMIO are shown in the following table.

Table 4. Comparison of PLIO vs GMIO
PLIO
  Advantages
    • The number of AI Engine to PL interface streams is larger, hence larger aggregate bandwidth
    • No interference between different stream connections
    • Supports packet switching
  Disadvantages
    • Congestion risk if there are too many stream connections in a region of the device
    • Timing closure required for achieving best performance

GMIO
  Advantages
    • No PL resources required
    • No timing closure requirement
  Disadvantages
    • Fewer GMIO ports available
    • Aggregate bandwidth is lower
    • Multiple GMIO ports compete for bandwidth