Simulating an AI Engine Graph Application
This chapter describes the various execution targets available to simulate AI Engine applications at different levels of abstraction, accuracy, and speed. AI Engine graphs can be simulated in three different simulation environments.
The x86 simulator is a functional simulator as described in x86 Functional Simulator. It can be used to functionally simulate your AI Engine graph, and is very useful for early iterations in the kernel and graph development cycle. It, however, does not provide timing, resource, or performance information.
The AI Engine simulator (aiesimulator) models the timing and resources of the AI Engine array, while using transaction-level, timed SystemC models for the NoC, DDR memory, PL, and PS. This allows faster performance analysis of your AI Engine applications and accurate estimation of AI Engine resource use, with cycle-approximate timing information.
Finally, when you are ready to simulate the entire AI Engine graph targeting a specific board and platform, along with PL kernels and your host application, you can use the Vitis™ hardware emulation flow. This flow includes the SystemC model of the AI Engine and transaction-level SystemC models for the NoC, DDR memory, PL, and PS. You can also include RTL simulation models of your PL kernels and IPs. The options provided to this flow are described in this chapter.
As shown in Integrating the Application Using the Vitis Tools Flow and Using the Vitis IDE, the Vitis™ compiler builds the system-level project to run the simulator from the IDE. Alternatively, the options can be specified on a command line or in a script.
AI Engine SystemC Simulator
The Versal™ ACAP AI Engine SystemC simulator (aiesimulator) models the global memory (DDR memory) and the network on chip (NoC) in addition to the AI Engine array. When the application is compiled using the SystemC simulation target, the AI Engine SystemC simulator can be invoked as follows.
aiesimulator --pkg-dir=./Work
The various configuration and binary files are generated by the
AI Engine compiler under the Work directory (see Compiling an AI Engine Graph Application) and specified using the --pkg-dir
option to the simulator. The graph is initialized, run, and
terminated by a control thread expressed in the main
application. The AI Engine
compiler compiles that control thread with a PS IP wrapper to be directly loaded
into the simulator.
By default, graph.run() specifies a graph that runs forever. The AI Engine compiler generates code to execute the data flow graph in a perpetual while loop, so the simulation also runs perpetually. To create terminating programs for debugging, specify graph.run(<number_of_iterations>) in your graph code to limit execution to the specified number of iterations. The specified number of iterations can be any positive integer value; graph.run(-1) also specifies a graph that runs forever.
The AI Engine simulator command first configures the simulator as specified in the compiler-generated Work/config/scsim_config.json file. This includes loading PL IP blocks and their connections, configuring I/O data file drivers, and configuring the NoC and global memory (DDR memory) connections. It then executes the specified PS application and finally exits the simulator.
The AI Engine simulator has an optional --profile option, which enables printfs in kernel code to appear on the console and also generates profile information. The --dump-vcd <filename> option generates a value change dump (VCD) for the duration of the simulation. The --simulation-cycle-timeout <number-of-cycles> option can be used to exit the simulation after a given number of clock cycles.
Note: With graph.run(), the simulation runs forever. You need to press Ctrl+C twice to exit the simulator. Use the --simulation-cycle-timeout option to stop the simulator on an exact cycle; the total cycles that appear in the profiling report are then the same on each run.
Note: Do not include <iostream> in the kernel code to enable printfs. The use of #include <iostream> in kernel code results in a compilation error for both the x86 simulator and the SystemC simulator.
Simulator Options
The complete set of AI Engine simulator (aiesimulator) options is described in this section. In most cases, specifying just --pkg-dir is sufficient.
Options | Description |
---|---|
-h, --help | Show this help message and exit. |
--dump-vcd FILE | Dump VCD waveform information into FILE. Because the tool appends .vcd to the specified file name, it is not necessary to include the file suffix. |
--gm-init-file <file> | Read the global memory image from <file>. This loads the memory initialization file as described in Simulating Global Memory. |
--pkg-dir <PKG_DIR> | Specify the package directory, for example, ./Work. |
--profile | Generates profiling data for all used cores. Allows generation of printf trace messages on stdout and collects profiling statistics during simulation. This can slightly slow down the simulator. Optionally, the profile of specific cores can be requested by listing their (column,row) coordinates. |
--simulation-cycle-timeout CYCLES | Run the application for the given number of cycles after it is loaded. TIP: Specify the --simulation-cycle-timeout option to end the simulation session after the specified number of cycles. However, when specifying a simulation timeout during the debug process, be sure to specify a large number of cycles, because the debug session terminates when the timeout cycle is reached. |
--online [-ctf] [-wdb] | Generates trace data online during simulation; -ctf and -wdb select the output format. TIP: The --online option and the --dump-vcd option cannot be used together. If both options are specified, only the --online option takes effect. |
--enable-memory-check | Enables run-time checks of program and data memory boundary accesses. Any access violation is reported as a [WARNING] message. By default, this option is disabled. |
Simulation Input and Output Data Streams
The default bit width for input/output streams is 32 bits. The PLIO bit width determines the number of samples per line in the simulation input file. The interpretation of the samples on each line of the input file depends on the data type expected and the PLIO data width. The following table shows how the samples in the input data file are interpreted, depending on the data type and its corresponding PLIO interface specification.
Data Type | PLIO 32 bit PLIO *in0 = new PLIO("DataIn1", adf::plio_32_bits) | PLIO 64 bit PLIO *in0 = new PLIO("DataIn1", adf::plio_64_bits) | PLIO 128 bit PLIO *in0 = new PLIO("DataIn1", adf::plio_128_bits) |
---|---|---|---|
int8 | // 4 values per line 6 8 3 2 | // 8 values per line 6 8 3 2 6 8 3 2 | // 16 values per line 6 8 3 2 6 8 3 2 6 8 3 2 6 8 3 2 |
int16 | // 2 values per line 24 18 | // 4 values per line 24 18 24 18 | // 8 values per line 24 18 24 18 24 18 24 18 |
int32 | // 1 value per line 2386 | // 2 values per line 2386 2386 | // 4 values per line 2386 2386 2386 2386 |
int64 | N/A | // 1 value per line 45678 | // 2 values per line 45678 95578 |
cint16 | // 1 cint16 value per line (real, imaginary) 1980 485 | // 2 cint16 values per line 1980 45 180 85 | // 4 cint16 values per line 1980 485 180 85 980 48 190 45 |
cint32 | N/A | // 1 cint32 value per line (real, imaginary) 1980 485 | // 2 cint32 values per line 1980 45 180 85 |
float | // 1 float value per line 893.5689 | // 2 float values per line 893.5689 3459.3452 | // 4 float values per line 893.5689 39.32 459.352 349.345 |
cfloat | N/A | // 1 cfloat value per line (real, imaginary) 893.5689 24156.456 | // 2 cfloat values per line (real, imaginary) 893.5689 24156.456 93.689 256.46 |
Simulating Global Memory
When an application accesses global memory using the GMIO specification (see GMIO Attributes), the simulation needs to model the DDR memory and the routing network connecting the DDR memory to the PL and AI Engines. AI Engine to DDR memory connections are mediated by the DMA data mover that is embedded in the AI Engine array interface and controlled through the GMIO APIs in the PS program. Connections to DDR memory from an AXI4-Stream port on a PL block are mediated by a soft GMIO data mover block, which is generated automatically by the AI Engine compiler for simulation purposes. The data mover converts the streaming interface from the PL blocks to memory-mapped AXI4 interface transactions over the NoC with a specific start address, block size, and burst size as shown in GMIO Attributes.
While simulating with global memory, a memory data file can be supplied using an additional option, --gm-init-file, which initializes the DDR memory with predefined data. This file is a textual byte-dump of the DDR memory starting at a given address. The format of this file is as follows:
<startaddr>:
<byte>
<byte>
…
For example, the AI Engine simulator can be invoked with global memory initialization in the following way:
aiesimulator --pkg-dir=./Work --gm-init-file=dump.txt
The simulator also produces an output byte dump for the DDR
memory used, in the simulation output directory (default: aiesimulator_output). The name of the output file is based on the
internal location of the DDR memory bank (for example, DDRMC_SITE_X1Y0.mem) starting at the base address 0x0
. You can use this dump to verify the global memory
transactions.
Simulator Options for Hardware Emulation
The AI Engine simulator generates an
options file that lists the options used for simulating the AI Engine graph application. The options file is automatically
generated when the AI Engine simulator is run.
You can reuse the AI Engine simulator options from the initial graph-level simulation run later in the system-level hardware emulation. You can also manually edit the options file to specify other options as required. The following table lists the options that can be specified in the aiesim_options.txt file. This file is located in the aiesimulator_output directory and is created if either the --dump-vcd or --profile option is used with the aiesimulator command. The file can be specified as part of the command line to launch the hardware emulator using the launch_hw_emu.sh script, as described in Running the System. An example command line is as follows.
./launch_hw_emu.sh \
-add-env VITIS_LAUNCH_WAVEFORM_BATCH=1 \
-aie-sim-options ${FULL_PATH}/aiesimulator_output/aiesim_options.txt
where ${FULL_PATH}
must be the
full path to the file or directory.
Command | Arguments | Description |
---|---|---|
AIE_DUMP_VCD | <filename> | When AIE_DUMP_VCD is specified, the simulation generates VCD data and writes it to the specified <filename>.vcd. |
AIE_PROFILE | All, or a list of (col,row) pairs such as (1,2)(3,4)... | Profiles either all the used AI Engines or the selected AI Engines listed. Hardware emulation generates profile data files in the sim/behav_waveform/xsim directory; these can be viewed in the Vitis analyzer by opening the default.aierun_summary file. This option also logs the ADF kernel printf data to the sim/behav_waveform/xsim/simulate.log file. |
AIE_PKG_DIR | /path_to_work_dir/Work | This mandatory option sets the path to the Work directory generated by the AI Engine compiler. If you do not specify this option, the generated sim/behav_waveform/xsim/default.aierun_summary file will not have the correct Work directory setting, which impacts the display of the summary file in the Vitis analyzer. |
The following command brings up the XSIM waveform GUI during hardware emulation.
./launch_hw_emu.sh -g
Additionally, you can add more advanced options to log waveform data without having to launch emulation with the Vivado logic simulator GUI. An example command line is as follows.
./launch_hw_emu.sh \
-user-pre-sim-script pre-sim.tcl
The pre-sim.tcl contains Tcl commands to add waveforms or log design waveforms. For an example, see the Vitis Accelerated Software Development Flow Documentation in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416) and for Tcl commands see Vivado Design Suite User Guide: Logic Simulation (UG900).
Enabling Third-Party Simulators
The following table shows the v++ --link configuration setting used to select each supported third-party simulator.
Simulator | v++ --link Configuration |
---|---|
Questa Advanced Simulator | EXPORT simulator=questa |
Xcelium | EXPORT simulator=xcelium |
VCS | EXPORT simulator=vcs |
When the modifications have been made, build the design as normal and run the launch_hw_emu.sh script; the new simulator will be used. More information on emulation is provided in Running the System.
x86 Functional Simulator
When starting development on AI Engine graphs and kernels, it is critical to verify the design behavior. This is called functional simulation, and it is useful for identifying bugs in the design. For example, window sizes in the graph are related to the number of iterations in a kernel. Troubleshooting whether every kernel in a large graph is absorbing and generating the right number of samples can be time consuming and iterative in nature. The x86 simulator is an ideal choice for testing, debugging, and verifying this kind of behavior because of the speed of iteration and the high level of data visibility it provides the developer. The x86 simulation does not provide timing, resource, or performance information. The x86 simulator runs exclusively on the tool development machine, which means its performance and memory use are defined by the development machine.
While the AI Engine simulator fully models the memory of the AI Engine array, this also means that the AI Engine simulator is limited by the memory space of the AI Engine. The x86 simulator is not limited in this way, and allows a nearly unlimited number of debug printf()s, large arrays, and variables. When combined with the ability to single-step through the kernel, very complex design issues can be quickly isolated and addressed.
Several macros are provided to improve the kernel developer's quality of life. These macros are intended for use with the x86 simulator and can be left in the code for maintainability purposes. With the benefits of the x86 simulator come some trade-offs. Several types of graph constructs are not supported, and cycle-accurate behavior cannot be guaranteed to match between the AI Engine simulator and the x86 simulator. The x86 simulator is not a replacement for the AI Engine simulator. The AI Engine simulator should still be run on all designs to verify behavior and obtain performance information. For different stages of project development, one tool or the other might better suit your needs.
To run the x86 simulator, change the AI Engine compiler target to x86sim
.
aiecompiler --target=x86sim graph.cpp
After the application is compiled for x86 simulation, the x86 simulator can be invoked as follows.
x86simulator
The complete x86 simulator command help is shown in the following code.
$ x86simulator [-h] [--help] [--h] [--pkg-dir=PKGDIR]
optional arguments:
-h,--help --h show this help message and exit
--pkg-dir=PKG_DIR Set the package directory. ex: Work
--timeout=secs Terminate simulation after specified number of seconds
--gdb Invoke from gdb
The compiled binary for x86 native simulation is produced by the AI Engine compiler under the Work directory (see Compiling an AI Engine Graph Application) and is started automatically by the x86 simulator.
The input and the output files are specified in the following snippet of graph code.
adf::PLIO in1("In", adf::plio_32_bits, "In1.txt");
adf::PLIO out1("Out", adf::plio_32_bits, "Out1.txt");
simulation::platform<1,1> platform(&in1,&out1);
When running, the x86 simulator looks in the current working directory for data/In1.txt, which is one of the inputs used by the ADF graph. To distinguish the output files for the x86 simulator from the output files for the AI Engine simulator, Out1.txt is located in current_working_dir/x86simulator_output/data/.
The output files produced by the simulator can be compared with a golden output ignoring white space differences.
diff -w <data>/golden.txt <data>/output.txt
Compiling the Design
The GNU debugger allows for C/C++ debugging similar to an IDE based debugger. It allows setting of breakpoints, single stepping, stepping over functions, and multiple hit counts on breakpoints. For AI Engine kernel development the x86 simulator enables single step debugging of kernel code using GDB.
The target
argument for the AI Engine compiler must be set to x86sim
to use GDB.
aiecompiler --target=x86sim graph.cpp
Additionally, compiling with -O0 minimizes optimizations, which improves debug visibility. If additional debug visibility is required, it is possible to reduce the compiler optimization level by passing the optimization parameter through the preprocessor options as follows.
aiecompiler --target=x86sim --Xpreproc=-O0 graph.cpp
Using GDB
After successful compilation with the appropriate target you can launch
the simulation and automatically attach a GDB instance to it. To launch an interactive
GDB session run the command with the switch --gdb
as
follows.
x86simulator --gdb
By default, when running the x86 simulator with the --gdb command line switch, it breaks immediately before entering main() in graph.cpp. This pauses execution before any AI Engine kernels have started, because the graph has not been run. To exit GDB, type quit; type help for more commands.
To set a breakpoint at the start of a kernel, use break <kernel_function_name>.
By typing continue (shorthand c), the debugger runs until the breakpoint in <kernel_function_name> is reached. When the breakpoint is reached, it is possible to examine local stack variables and function arguments. The following table shows some commonly used GDB instructions that allow examination of these variables.
Command | Description |
---|---|
info stack | Shows a trace of the function call stack at the current breakpoint. |
info locals | Shows the current status of local variables within the scope of the function call shown in the call stack. |
print <local_variable_name> | Prints the current value of a single variable. |
finish | Exits the current function call but keeps the simulation paused. |
continue | Causes the debugger to run to completion. |
GDB is a very powerful debugger with many features. Full documentation of GDB is beyond the scope of this document. For more information see https://www.gnu.org/software/gdb/.
Macros
Xilinx provides several predefined compiler macros to help when using the x86 simulator. In your top-level graph test bench (usually called graph.cpp), it can be useful to use the following preprocessor macros with a conditional #if to help include or exclude appropriate code.
Macro | Description |
---|---|
__X86SIM__ | Use this predefined macro to designate code that is applicable only for the x86sim flow. |
__AIESIM__ | Use this predefined macro to designate code that is applicable only for the aiesimulator flow. |
X86SIM_KERNEL_NAME | Use this macro with printf() to tag instrumentation text with the kernel instance name. |
Note: The __X86SIM__ and __AIESIM__ macros have two underscore characters at the front and behind.
The following is an example of the macro code.
myAIEgraph g;
#if defined(__AIESIM__) || defined(__X86SIM__)
int main()
{
g.init();
g.run(4);
g.end();
return 0;
}
#endif
The previous example shows the __AIESIM__ and __X86SIM__ macros guarding the main() used by a graph.cpp file. This main() must be excluded from emulation flows, and these macros provide that flexibility. Additionally, consider using the __X86SIM__ macro to selectively enable debug instrumentation only during x86 simulation.
Printf() Macros
The x86 simulator executes multiple kernels in parallel on separate threads. This means that printf() debug messages can often be interleaved and non-deterministic when viewed in standard output. To help identify which kernel is printing which line, the X86SIM_KERNEL_NAME macro can be useful. The following is an example showing how to combine it with printf().
#include <adf/x86sim/x86simDebug.h>
void simple(input_window_float * in, output_window_float * out) {
for (unsigned i=0; i<NUM_SAMPLES;i++) {
float val = window_readincr(in);
window_writeincr(out,val+0.15);
}
static int count = 0;
printf("%s: %s %d\n",__FILE__,X86SIM_KERNEL_NAME,++count);
}
Using printf() in Kernels
Vector data types are the most commonly used types in AI Engine kernel code. To debug vector operations within
a kernel it is helpful to use printf
.
Using printf() with Vector Data Types
To printf()
the native vector
types you can use a technique as follows.
v4cint16 input_vector;
...
int16_t* print_ptr = (int16_t*)&input_vector;
for (int qq = 0; qq < 4; qq++) { // 4 complex samples: one real/imaginary pair per iteration
    printf("vector re: %d, im: %d\r\n", print_ptr[2*qq], print_ptr[2*qq+1]);
}
With the AI Engine simulator, the --profile option is required in order to observe printf() outputs. With the x86 simulator, no additional options are needed to enable printf calls. This is one of the benefits of the x86 simulator.
Note: Avoid std::cout in kernel and host code. If std::cout is used, its outputs can appear interleaved, given the multi-threaded nature of the x86 simulator. Using printf() is recommended instead.
Considerations
Memory Model
Source files with static or global variables are shared between multiple AI Engines. This is not the same behavior as the AI Engine simulator. If multiple kernels access the same memory, the x86 simulator model does not allocate per-kernel memory. Instead, all kernels share the same global-scope memory. Reading and writing the same memory from multiple kernels results in data corruption during x86 simulation. To avoid this problem, keep memory allocations function-scoped to a kernel, or use the AI Engine simulator, which fully models the memory.
Graph API Calls
Graph constructs for timed execution (graph.wait(N), graph.resume(), and graph.end(N)) behave differently in the AI Engine simulator and the x86 simulator. For the AI Engine simulator, N specifies the processor cycle timeout, whereas for the x86 simulator it specifies a wall clock timeout in milliseconds. Thus, if your test bench uses timed execution, the AI Engine simulator and x86 simulator might produce different results; in particular, the amount of output data produced might differ.
Preprocessor Considerations
When running the AI Engine compiler with a target of x86sim, the compiler ignores the --Xchess option. This means that the x86sim flow does not support kernel-specific compile options.
To understand this better consider the following example. A common
method of making compile-time modifications to C code is using preprocessor
directives such as #ifndef
. To control these
preprocessor directives it is helpful to pass #defines
through the command line compiler options. The following
example code block takes two different actions based on a preprocessor
directive.
void example_kernel()
{
#ifdef SIM
printf("Simulation Mode\n");
#else
printf("Default Mode\n");
#endif
}
To define the SIM macro at compile time with the AI Engine compiler targeting hardware
(hw
) you can do the following.
aiecompiler -target=hw -Xchess="example_kernel:-DSIM"
Because the -Xchess argument is ignored when the compilation target is set to x86sim, SIM is not defined for the x86 simulator case and the output of the kernel is Default Mode.
If you need to specify preprocessor options with the x86 simulator, you can do so using aiecompiler --target=x86sim --Xpreproc instead of -Xchess. It is important to note that any options passed in this manner apply to all source code and all target flows.
Option | Description |
---|---|
--Xchess=<string> | Can be used to pass kernel-specific options to the CHESS compiler that is used to compile code for each AI Engine. The option string is specified as <kernel-function>:<optionid>=<value>. This option string is included during compilation of generated source files on the AI Engine where the specified kernel function is mapped. |
--Xpreproc=<string> | Pass a general option to the preprocessor phase for all source code compilations (for example, --Xpreproc=-D<var>=<value>). |
Packet Switching and RTP
The x86 simulator and AI Engine simulator have some important differences when it comes to packet switching and run-time parameters (RTP). See Run-Time Parameter Specification and Explicit Packet Switching if you are not already familiar with these constructs.
Both packet switching and RTPs exhibit behavior that can manifest itself as a difference between the AI Engine simulator and the x86 simulator. It is important to understand that this dissimilarity is still correct in both cases. As discussed in Explicit Packet Switching, packet switched streams are non-deterministic in all design flows (x86 simulator, AI Engine simulator, and hardware emulation).
Packet Switching
Packet Stream connections have a field known as the packet ID. If the source of the packet ID field comes from within the ADF graph, the x86 simulator uses the canonical, zero-based indexing scheme for packet IDs. The first branch on the output of a split node has a packet ID equal to 0 followed by 1, 2, 3, etc. If the source of the packet ID field comes from outside the ADF graph there will be discrepancies between the AI Engine simulator and the x86 simulator. To resolve these discrepancies see Packet Switching and the AI Engine Simulator for additional information on providing custom packet IDs when the source is outside the ADF graph.
The nature of packet merging means that the x86 simulator and the AI Engine simulator produce non-deterministic results. If an AI Engine on a packet split branch finishes processing data before any of the other branches its data appears on the output of the packet merge first. Exactly which core finishes first is highly dependent on both the kernel code and the incoming data. So any downstream processing blocks must be prepared to handle this behavior.
Synchronous and Asynchronous RTP
Both synchronous and asynchronous run-time parameters are fully supported in the x86 simulator. However the precise timing and cycle accuracy of when an RTP update occurs differs between the x86 simulator and the AI Engine simulator. Asynchronous RTPs in particular do not affect a kernel on a specific known cycle by their very nature of being asynchronous. This is true of asynchronous RTPs in both the x86 simulator and the AI Engine simulator.
Best Practices
The x86 simulator works best with the following recommendations.
- Keep variables in kernels function scoped.
- Keep the x86 simulator test bench simple: use init(), run(X), and end().
- Keep variables in the ADF graph class scoped (see Global Graph-scoped Tables and C++ Kernel Class Support).