Using a Virtual Platform
AI Engine Programming briefly introduced the simulation platform class with file I/O support. This chapter continues the discussion and describes other variations in detail.
Virtual Platform
A virtual platform specification helps to connect the data flow graph with the external I/O mechanisms specific to the chosen target for testing or eventual deployment. The platform can be specified for a simulation, emulation, or actual hardware execution target.
The current release supports only a simulation platform, which means that the data flow graph is executed in a software simulation environment. The specification is:
simulation::platform<#inputs, #outputs> platform-name(port-attribute-list);
The #inputs and #outputs parameters specify how many input and output ports are needed in the platform to connect to the data flow graph. The platform object is pre-populated with src and sink arrays of output and input ports (respectively) to make the connection. The simpleGraph example from AI Engine Programming is used in this example.
simpleGraph mygraph;
simulation::platform<1,1> platform("input.txt","output.txt");
connect<> net0(platform.src[0], mygraph.in);
connect<> net1(mygraph.out, platform.sink[0]);
The port-attribute list within the platform declaration is an enumeration of the attributes of each platform port, starting with all the inputs followed by all the outputs. These attributes are described in the following sections.
FileIO Attributes
By default, a platform port attribute is a string name used to construct an attribute of type FileIO. The string specifies the name of an input or output file, relative to the current directory, that will source or sink the platform data. The explicit form is specified in the following example using a FileIO constructor.
FileIO *in = new FileIO("input.txt");
FileIO *out = new FileIO("output.txt");
simulation::platform<1,1> platform(in,out);
FileIO ports are solely for the purpose of application simulation in the absence of an actual hardware platform. They are provided as a convenience to test a data flow graph in isolation before it is connected to a real platform. An actual hardware platform exports either stream or memory ports.
PLIO Attributes
A PLIO port attribute is used to make external stream connections that cross the AI Engine to programmable logic (PL) boundary. This situation arises when a hardware platform is designed separately and the PL blocks are already instantiated inside the platform. This hardware design is exported from the Vivado tools as an XSA package, which should be specified when creating a new project in the Vitis™ tools using that platform. The XSA contains a logical architecture interface specification that identifies which AI Engine I/O ports can be supported by the platform. The following is an example interface specification containing stream ports (looking from the AI Engine perspective).
AI Engine Port | Annotation | Type | Direction | Data Width | Clock Frequency (MHz) |
---|---|---|---|---|---|
S00_AXIS | Weight0 | stream | slave | 32 | 300 |
S01_AXIS | Datain0 | stream | slave | 32 | 300 |
M00_AXIS | Dataout0 | stream | master | 32 | 300 |
This interface specification describes how the platform exports two stream input ports (slave ports on the AI Engine array interface) and one stream output port (master port on the AI Engine array interface). A PLIO attribute specification is used to represent and connect these interface ports to their respective destination or source kernel ports in the data flow graph.
The following example shows how the PLIO attributes shown in the previous table can be used in a program to read input data from a file or write output data to a file. The PLIO width and the frequency of the PLIO port are also provided in the PLIO constructor. The constructor syntax is described in more detail in Adaptive Data Flow Graph Specification Reference.
adf::PLIO *wts = new adf::PLIO("Weight0", adf::plio_64_bits, "inputwts.txt", 300);
adf::PLIO *din = new adf::PLIO("Datain0", adf::plio_64_bits, "din.txt", 300);
adf::PLIO *dout = new adf::PLIO("Dataout0", adf::plio_64_bits, "dout.txt");
simulation::platform<2,1> platform(wts, din, dout);
The example simulation platform can then be connected to a graph that expects two input streams and one output stream in the usual way.
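For example, assuming a hypothetical graph class with two stream input ports (wts, din) and one stream output port (out), where the class and port names are illustrative only, the connections could be made as follows:
myWeightedGraph mygraph;                       // hypothetical graph with two inputs and one output
connect<> net0(platform.src[0], mygraph.wts);  // Weight0 stream into the graph
connect<> net1(platform.src[1], mygraph.din);  // Datain0 stream into the graph
connect<> net2(mygraph.out, platform.sink[0]); // graph output to Dataout0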
During compilation, the logical architecture should be specified using the option --logical-arch=<filename>. This option is automatically populated by the Vitis tools if you specified the XSA while creating the project. When simulated, the input weights and data are read from the two supplied files and the output data is produced in the designated output file in a streaming manner.
When a hardware platform is exported, all the AI Engine to PL stream connections are already routed to specific physical channels from the PL side.
Wide Stream Data Path PLIO
Typically, the AI Engine array runs at a higher clock frequency than the internal programmable logic. The AI Engine compiler can be given a compiler option --pl-freq to identify the frequency at which the PL blocks are expected to run. To balance the throughput between the AI Engine and the internal programmable logic, it is possible to design the PL blocks for a wider stream data path (64-bit, 128-bit), which is then sequentialized automatically into a 32-bit stream on the AI Engine stream network at the AI Engine to PL interface crossing.
The following example shows how wide stream PLIO attributes can be used in a program to read input data from a file or write output data to a file. The constructor syntax is described in more detail in Adaptive Data Flow Graph Specification Reference.
PLIO *attr_o = new PLIO("TestLogicalNameOut", plio_128_bits, "data/output.txt");
PLIO *attr_i = new PLIO("TestLogicalNameIn", plio_128_bits, "data/input.txt");
simulation::platform<1, 1> platform(attr_i, attr_o); // Platform with PLIO
MEPL128BitClass gMePl; // Toplevel graph
connect<> net0(platform.src[0], gMePl.in);
connect<> net1(gMePl.out, platform.sink[0]);
In the previous example, a simulation platform with two 128-bit PLIO attributes is declared: one for input and one for output. The platform ports are then hooked up to the graph in the usual way. Data files specified in the PLIO attributes are then automatically opened for reading the input or writing the output respectively.
When simulating PLIO with data files, the data should be organized to accommodate both the width of the PL block and the data type of the connecting port on the AI Engine block. For example, a data file representing a 32-bit PL interface to an AI Engine kernel expecting int16 should be organized as two columns per row, where each column represents a 16-bit value. As another example, a data file representing a 64-bit PL interface to an AI Engine kernel expecting cint16 should be organized as four columns per row, where each column represents a 16-bit real or imaginary value. The same 64-bit PL interface feeding an AI Engine kernel with an int32 port would need the data organized as two columns per row, where each column represents a 32-bit real value. The following examples show the format of the input file for the previously mentioned scenarios.
64-bit PL interface feeding an AI Engine kernel expecting cint16, input file:
0 0 0 0
1 1 1 1
2 2 2 2
64-bit PL interface feeding an AI Engine kernel expecting int32, input file:
0 0
1 1
2 2
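For completeness, a 32-bit PL interface feeding an AI Engine kernel expecting int16 follows the two-column convention described above, where each column represents a 16-bit value:
0 0
1 1
2 2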
With these wide PLIO attribute specifications, the AI Engine compiler automatically generates the AI Engine array interface configuration to convert 64-bit or 128-bit data into a sequence of 32-bit words. The AXI4-Stream protocol followed by all PL IP blocks ensures that partial data can also be sent on a wider data path, with the appropriate strobe signals describing which words are valid.
GMIO Attributes
A GMIO port attribute is used to make external memory-mapped connections to or from the global memory. These connections are made between an AI Engine graph and the logical global memory ports of a hardware platform design. The platform can be a base platform from Xilinx or a custom platform that is exported from the Vivado tools as a Xilinx device support archive (XSA) package.
The AI Engine tools map each GMIO port to a tile DMA channel one to one; mapping multiple GMIO ports to one tile DMA channel is not supported. There is a limit on the number of GMIO ports supported for a given device. For example, the XCVC1902 device on the VCK190 board has 16 AI Engine to NoC master units (NMUs) in total. Each AI Engine to NMU interface supports two MM2S and two S2MM channels. So, at most 32 AI Engine GMIO inputs and 32 AI Engine GMIO outputs can be supported, but note that this can be further limited by the existing hardware platform.
While developing data flow graph applications on top of an existing hardware platform, you need to know what global memory ports are exported by the underlying XSA and their functionality. In particular, any input or output ports exposed on the platform are recorded within the XSA and can be viewed as a logical architecture interface.
Programming Model for AI Engine–DDR Memory Connection
The GMIO port attribute can be used to initiate AI Engine–DDR memory read and write transactions in the PS program. This enables data transfer between an AI Engine and the DDR controller through APIs written in the PS program. The following example shows how to use GMIO APIs to send data to an AI Engine for processing and retrieve the processed data back to the DDR through the PS program.
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
myGraph gr;
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;
    int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
    int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));

    // provide input data to AI Engine in inputArray
    for(int i=0;i<BLOCK_SIZE;i++){
        inputArray[i]=i;
    }

    gr.init();
    gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
    gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
    gr.run(8);
    gmioOut.wait();

    // can start to access output data from AI Engine in outputArray
    ...

    GMIO::free(inputArray);
    GMIO::free(outputArray);
    gr.end();
}
This example declares two GMIO objects. gmioIn represents the DDR memory space to be read by the AI Engine and gmioOut represents the DDR memory space to be written by the AI Engine. The constructor specifies the logical name of the GMIO, the burst length (that can be 64, 128, or 256 bytes) of the memory-mapped AXI4 transaction, and the required bandwidth (in MB/s).
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
Assuming the application graph (myGraph) has an input port (myGraph::in) connecting to the processing kernels and an output port (myGraph::out) producing the processed data from the kernels, the following code connects gmioIn (as a platform source) to the input port of the graph and connects gmioOut (as a platform sink) to the output port of the graph.
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
Inside the main function, two 256-element int32 arrays are allocated by GMIO::malloc. inputArray points to the memory space to be read by the AI Engine and outputArray points to the memory space to be written by the AI Engine. In Linux, the virtual address passed to GMIO::gm2aie_nb, GMIO::aie2gm_nb, GMIO::gm2aie, and GMIO::aie2gm must be allocated by GMIO::malloc. After the input memory is allocated, the data can be initialized.
const int BLOCK_SIZE=256;
int32 *inputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
int32 *outputArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));
GMIO::gm2aie_nb is used to initiate memory-mapped AXI4 transactions for the AI Engine to read from DDR memory spaces. The first argument in GMIO::gm2aie_nb is a pointer to the start address of the memory space for the transaction. The second argument is the transaction size in bytes. The memory space for the transaction must be within the memory space allocated by GMIO::malloc. Similarly, GMIO::aie2gm_nb is used to initiate memory-mapped AXI4 transactions for the AI Engine to write to DDR memory spaces. GMIO::gm2aie_nb and GMIO::aie2gm_nb are non-blocking functions in the sense that they return immediately when the transaction is issued; they do not wait for the transaction to complete. By contrast, GMIO::gm2aie and GMIO::aie2gm behave in a blocking manner. In this example, one gmioIn.gm2aie_nb call issues a read transaction for eight iterations worth of data, and one gmioOut.aie2gm_nb call issues a write transaction for eight iterations worth of data.
gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gmioOut.aie2gm_nb(outputArray, BLOCK_SIZE*sizeof(int32));
gr.run(8) is also a non-blocking call that runs the graph for eight iterations. To synchronize between the PS and the AI Engine for DDR read/write access, you can use GMIO::wait to block PS execution until the GMIO transaction is complete. In this example, gmioOut.wait() is called to wait for the output data to be written to the outputArray DDR memory space. After that, the PS program can access the data. When the PS has completed processing, the memory space allocated by GMIO::malloc can be released by GMIO::free.
GMIO::free(inputArray);
GMIO::free(outputArray);
GMIO APIs can be used in various ways to achieve different levels of control over read/write access and synchronization between the AI Engine, the PS, and DDR memory. GMIO::gm2aie, GMIO::aie2gm, GMIO::gm2aie_nb, and GMIO::aie2gm_nb can be called multiple times to associate different memory spaces with the same GMIO object during different phases of graph execution. Different GMIO objects can be associated with the same memory space for in-place AI Engine–DDR read/write access. The blocking versions, GMIO::gm2aie and GMIO::aie2gm, are themselves synchronization points for data transportation and kernel execution. Calling GMIO::gm2aie (or GMIO::aie2gm) is equivalent to calling GMIO::gm2aie_nb (or GMIO::aie2gm_nb) followed immediately by GMIO::wait.
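As a minimal sketch of this equivalence, using the gmioIn object and inputArray buffer from the earlier example:
// Form 1: blocking transfer; returns only after the read transaction completes
gmioIn.gm2aie(inputArray, BLOCK_SIZE*sizeof(int32));
// Form 2: equivalent non-blocking transfer followed by explicit synchronization
gmioIn.gm2aie_nb(inputArray, BLOCK_SIZE*sizeof(int32));
gmioIn.wait();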
The following example shows a combination of the aforementioned use cases.
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);
simulation::platform<1,1> plat(&gmioIn, &gmioOut);
myGraph gr;
connect<> c0(plat.src[0], gr.in);
connect<> c1(gr.out, plat.sink[0]);
int main(int argc, char ** argv)
{
    const int BLOCK_SIZE=256;

    // dynamically allocate memory spaces for in-place AI Engine read/write access
    int32* inoutArray=(int32*)GMIO::malloc(BLOCK_SIZE*sizeof(int32));

    gr.init();
    for (int k=0; k<4; k++)
    {
        // provide input data to AI Engine in inoutArray
        for(int i=0;i<BLOCK_SIZE;i++){
            inoutArray[i]=i;
        }

        gr.run(8);
        for (int i=0; i<8; i++)
        {
            gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); // blocking call to ensure transaction data is read from DDR to AI Engine
            gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
        }
        gmioOut.wait();

        // can start to access output data from AI Engine in inoutArray
        ...
    }

    GMIO::free(inoutArray);
    gr.end();
}
In the example above, the two GMIO objects gmioIn and gmioOut use the same memory space, allocated and pointed to by inoutArray, for in-place read and write access.
Because the data flow dependencies among the kernels inside the graph are not known, and to ensure write-after-read ordering on the inoutArray memory space, the blocking version gmioIn.gm2aie is called to ensure that the transaction data is copied from DDR memory to AI Engine local memory before a write transaction to the same memory space is issued by gmioOut.aie2gm_nb.
gmioIn.gm2aie(inoutArray+i*32, 32*sizeof(int32)); //blocking call to ensure transaction data is read from DDR to AI Engine
gmioOut.aie2gm_nb(inoutArray+i*32, 32*sizeof(int32));
gmioOut.wait() ensures that the data has been migrated to DDR memory. After it returns, the PS can access the output data for post-processing.
The graph execution is divided into four phases by the for loop, for (int k=0; k<4; k++). inoutArray can be re-initialized in the for loop with different data to be processed in different phases.
Hardware Emulation and Hardware Flows
GMIO is not only used in the virtual platform for the AI Engine simulator; it can also work in the hardware emulation and hardware flows. To allow it to work in hardware emulation and hardware flows, add the following code to graph.cpp.
#if !defined(__AIESIM__)
#include "adf/adf_api/XRTConfig.h"
#include "experimental/xrt_kernel.h"
// Create XRT device handle for ADF API
char* xclbinFilename = argv[1];
auto dhdl = xrtDeviceOpen(0);//device index=0
xrtDeviceLoadXclbinFile(dhdl,xclbinFilename);
xuid_t uuid;
xrtDeviceGetXclbinUUID(dhdl, uuid);
adf::registerXRT(dhdl, uuid);
#endif
Using the guard macro __AIESIM__, the same version of graph.cpp can work for the AI Engine simulator, hardware emulation, and hardware flows. Note that the preceding code should be placed before calling the graph or the GMIO ADF APIs. At the end of the program, close the device using the xrtDeviceClose() API.
#if !defined(__AIESIM__)
xrtDeviceClose(dhdl);
#endif
To compile the code for hardware flow, see Programming the PS Host Application.
While it is recommended to use the ADF APIs to control GMIO, it is possible to use XRT. However, only synchronous mode GMIO transactions are supported. The API to perform synchronous data transfers can be found in experimental/xrt_aie.h:
/**
* xrtAIESyncBO() - Transfer data between DDR and Shim DMA channel
*
* @handle: Handle to the device
* @bohdl: BO handle.
* @gmioName: GMIO name
* @dir: GM to AIE or AIE to GM
* @size: Size of data to synchronize
* @offset: Offset within the BO
*
* Return: 0 on success, or appropriate error number.
*
* Synchronize the buffer contents between GMIO and AIE.
* Note: Upon return, the synchronization is done or error out
*/
int
xrtAIESyncBO(xrtDeviceHandle handle, xrtBufferHandle bohdl, const char *gmioName, enum xclBOSyncDirection dir, size_t size, size_t offset);
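The following is a minimal usage sketch, assuming the device handle dhdl has been created and the xclbin registered as shown earlier, and that the design contains a GMIO port whose logical name is "gmioIn"; the buffer flags, memory group, and sizes here are illustrative assumptions rather than values required by XRT.
#include "experimental/xrt_aie.h"
#include "experimental/xrt_bo.h"

// Sketch: statements inside the host program after adf::registerXRT(dhdl, uuid)
const int BLOCK_SIZE = 256;

// Allocate a buffer object in device-accessible memory
// (flags 0 and memory group 0 are assumptions; use the values required by your platform)
xrtBufferHandle in_bo = xrtBOAlloc(dhdl, BLOCK_SIZE*sizeof(int), 0, 0);
int* in_ptr = (int*)xrtBOMap(in_bo);
for (int i = 0; i < BLOCK_SIZE; i++) in_ptr[i] = i;

// Synchronously transfer the buffer contents from global memory to the AI Engine
// through the GMIO port named "gmioIn" (must match the GMIO logical name in the graph)
xrtAIESyncBO(dhdl, in_bo, "gmioIn", XCL_BO_SYNC_BO_GMIO_TO_AIE, BLOCK_SIZE*sizeof(int), /*offset=*/0);

xrtBOFree(in_bo);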
Performance Comparison Between AI Engine/PL and AI Engine/NoC Interfaces
The AI Engine array interface consists of the PL and NoC interface tiles. The AI Engine array interface tiles manage the following two high-performance interfaces.
- AI Engine to PL
- AI Engine to NoC
The following image shows the AI Engine array interface structure.
One AI Engine to PL interface tile contains eight streams from the PL to the AI Engine and six streams from the AI Engine to the PL. The following table shows the capacity of one AI Engine to PL interface tile.
Connection Type | Number of Connections | Data Width (bits) | Clock Domain | Bandwidth per Connection (GB/s) | Aggregate Bandwidth (GB/s) |
---|---|---|---|---|---|
PL to AI Engine array interface | 8 | 64 | PL (500 MHz) | 4 | 32 |
AI Engine array interface to PL | 6 | 64 | PL (500 MHz) | 4 | 24 |
The exact number of PL and NoC interface tiles is device-specific. For example, in the VC1902 device, there are 50 columns of AI Engine array interface tiles. However, only 39 array interface tiles are available to the PL interface. Therefore, the aggregate bandwidth for the PL interface is approximately:
- 24 GB/s * 39 = 0.936 TB/s from AI Engine to PL
- 32 GB/s * 39 = 1.248 TB/s from PL to AI Engine
The number of array interface tiles available to the PL interface and total bandwidth of the AI Engine to PL interface for other devices and across different speed grades is specified in Versal AI Core Series Data Sheet: DC and AC Switching Characteristics (DS957).
GMIO uses DMA in the AI Engine to NoC interface tile. The DMA has two 32-bit incoming streams from the AI Engine and two 32-bit streams to the AI Engine. In addition, it has one 128-bit memory mapped AXI master interface to the NoC NMU. The performance of one AI Engine to NoC interface tile is shown in the following table.
Connection Type | Number of Connections | Bandwidth per Connection (GB/s) | Aggregate Bandwidth (GB/s) |
---|---|---|---|
AI Engine to DMA | 2 | 4 | 8 |
DMA to NoC | 1 | 16 | 16 |
DMA to AI Engine | 2 | 4 | 8 |
NoC to DMA | 1 | 16 | 16 |
The exact number of AI Engine to NoC interface tiles is device-specific. For example, in the VC1902 device, there are 16 AI Engine to NoC interface tiles. So, the aggregate bandwidth for the NoC interface is approximately:
- 8 GB/s * 16 = 128 GB/s from AI Engine to NoC
- 8 GB/s * 16 = 128 GB/s from NoC to AI Engine
When accessing DDR memory, the number of integrated DDR memory controllers (DDRMCs) in the platform limits the DDR memory read and write performance. For example, if all four DDRMCs in a VC1902 device are fully used, the hard limit for DDR memory access is as follows.
- 3200 Mb/s * 64 bit * 4 DDRMCs / 8 = 102.4 GB/s
The performance of GMIO access to DDR memory through the NoC is further restricted by the number of NoC lanes in the horizontal and vertical NoC, the inter-NoC configuration, and QoS settings. Note that DDR memory read and write efficiency is largely affected by the access pattern and other overheads. For more information about the NoC, memory controller use, and performance numbers, see the Versal ACAP Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).
For a single connection from the AI Engine or to the AI Engine, both PLIO and GMIO have a hard bandwidth limit of 4 GB/s. Some advantages and disadvantages for choosing PLIO or GMIO are shown in the following table.
 | PLIO | GMIO |
---|---|---|
Advantages | Streams can be processed by custom PL blocks and can use wide (64-bit/128-bit) data paths; the AI Engine to PL interface offers higher aggregate bandwidth | Direct access to DDR memory through the NoC without consuming PL resources; read/write transactions can be initiated and synchronized from the PS program |
Disadvantages | Requires a hardware platform design with the PL blocks and stream routing already in place | Lower aggregate bandwidth than the AI Engine to PL interface; the number of GMIO ports is limited by the AI Engine to NoC interface tiles and by the platform |