Simulating an AI Engine Graph Application

This chapter describes the various execution targets available to simulate AI Engine applications at different levels of abstraction, accuracy, and speed. AI Engine graphs can be simulated in three different simulation environments.

The x86 simulator is a functional simulator as described in x86 Functional Simulator. It can be used to functionally simulate your AI Engine graph, and is very useful for early iterations in the kernel and graph development cycle. It, however, does not provide timing, resource, or performance information.

The AI Engine simulator (aiesimulator) models the timing and resources of the AI Engine array, while using transaction-level, timed SystemC models for the NoC, DDR memory, PL, and PS. This allows for faster performance analysis of your AI Engine applications and accurate estimation of the AI Engine resource use, with cycle-approximate timing information.

Finally, when you are ready to simulate the entire AI Engine graph targeting a specific board and platform, along with PL kernels and your host application, you can use the Vitis™ hardware emulation flow. This flow includes the SystemC model of the AI Engine and transaction-level SystemC models for the NoC, DDR memory, PL, and PS. You can also include RTL simulation models of your PL kernels and IPs. The options provided to this flow are described in this chapter.

As shown in Integrating the Application Using the Vitis Tools Flow and Using the Vitis IDE, the Vitis™ compiler builds the system-level project to run the simulator from the IDE. Alternatively, the options can be specified on a command line or in a script.

AI Engine SystemC Simulator

The Versal™ ACAP AI Engine SystemC simulator (aiesimulator) includes the modeling of the global memory (DDR memory) and the network on chip (NoC) in addition to the AI Engine array. When the application is compiled using the SystemC simulation target, the AI Engine SystemC simulator can be invoked as follows.

aiesimulator --pkg-dir=./Work
IMPORTANT: Using AI Engine simulator requires the setup described in Setting Up the Vitis Tool Environment.

The various configuration and binary files are generated by the AI Engine compiler under the Work directory (see Compiling an AI Engine Graph Application) and specified using the --pkg-dir option to the simulator. The graph is initialized, run, and terminated by a control thread expressed in the main application. The AI Engine compiler compiles that control thread with a PS IP wrapper to be directly loaded into the simulator.

By default, graph.run() specifies a graph that runs forever. The AI Engine compiler generates code that executes the data flow graph in a perpetual while loop, so the simulation also runs perpetually. To create terminating programs for debugging, specify graph.run(<number_of_iterations>) in your graph code to limit execution to the specified number of iterations. The number of iterations can be any positive integer.

Note: graph::run(-1) specifies a graph that runs forever.

The AI Engine simulator command first configures the simulator as specified in the compiler generated Work/config/scsim_config.json file. This includes loading PL IP blocks and their connections, configuring I/O data file drivers, and configuring the NoC and global memory (DDR memory) connections. It then executes the specified PS application and finally exits the simulator.

The AI Engine simulator has an optional --profile option, which enables printfs in kernel code to appear on the console, and also generates profile information. Also, the --dump-vcd <filename> option generates a value change dump (VCD) for the duration of the simulation. The --simulation-cycle-timeout <number-of-cycles> can be used to exit the simulation after a given number of clock cycles.

IMPORTANT: If you do not provide either a cycle limit (--simulation-cycle-timeout) or a number of iterations to graph.run(), the simulation runs forever. Press Ctrl+C twice to exit the simulator.
TIP: You might observe cycle count differences between simulation runs on the same design. This is because the simulator waits a few seconds for all pending transactions (such as DMA) to finish. During this wait, the simulator process is still ticking but can be context-switched by the OS, so the total cycles can differ between runs. To ensure that the total cycles are the same for each run, use the AI Engine simulator --simulation-cycle-timeout option to stop the simulator on an exact cycle. The total cycles that appear in the profiling report are then the same on each run.
Note: Do not include <iostream> in the kernel code to enable printfs. The use of #include <iostream> in the kernel code results in a compilation error for both the x86 simulator and the SystemC simulators.

Simulator Options

The complete set of AI Engine simulator (aiesimulator) options is described in this section. In most cases, specifying only --pkg-dir is sufficient.

Table 1. AI Engine Simulator Options
Options Description
-h, --help Show this help message and exit.
--dump-vcd FILE Dump VCD waveform information into FILE. Because the tool appends .vcd to the specified file name, it is not necessary to include the file suffix.
--gm-init-file <file> Read global memory image from file. This loads the memory initialization file as described in Simulating Global Memory.
--pkg-dir <PKG_DIR> Specify the package directory, for example, ./Work.
--profile Generates profiling data for all used cores. Enables printf trace messages on stdout and collects profiling statistics during simulation. This can slightly slow down the simulator.

Optionally, you can profile specific cores by using --profile=(col,row)(col,row)....

--simulation-cycle-timeout CYCLES Run the application for a given number of cycles after it is loaded.
TIP: Specify the --simulation-cycle-timeout option to end the simulation session after the specified number of cycles. However, when specifying a simulation timeout during debug, be sure to specify a large number of cycles, because the debug session terminates when the timeout cycle is reached.
--online [-ctf] [-wdb]

Call vcdanalyze to parse VCD data on the fly and optionally produce common trace format (CTF) or waveform database (WDB) output.

TIP: The --online and --dump-vcd options cannot be used together. If both options are specified, only the --online option takes effect.
--enable-memory-check Enables run-time program and data memory boundary access checks. Any access violation is reported as a [WARNING] message. By default, this option is disabled.

Simulation Input and Output Data Streams

The default bit width for input/output streams is 32 bits. The PLIO bit width, together with the expected data type, determines how many samples appear on each line of the simulation input file. The following table shows how the samples in the input data file are interpreted, depending on the data type and its corresponding PLIO interface specification.

Table 2. Simulation Input Data Dependency on Data Type and PLIO Width
The PLIO width is set in the graph by the PLIO declaration, for example:

PLIO *in0 = new PLIO("DataIn1", adf::plio_32_bits)
PLIO *in0 = new PLIO("DataIn1", adf::plio_64_bits)
PLIO *in0 = new PLIO("DataIn1", adf::plio_128_bits)

int8
  PLIO 32 bit, 4 values per line: 6 8 3 2
  PLIO 64 bit, 8 values per line: 6 8 3 2 6 8 3 2
  PLIO 128 bit, 16 values per line: 6 8 3 2 6 8 3 2 6 8 3 2 6 8 3 2

int16
  PLIO 32 bit, 2 values per line: 24 18
  PLIO 64 bit, 4 values per line: 24 18 24 18
  PLIO 128 bit, 8 values per line: 24 18 24 18 24 18 24 18

int32
  PLIO 32 bit, 1 value per line: 2386
  PLIO 64 bit, 2 values per line: 2386 2386
  PLIO 128 bit, 4 values per line: 2386 2386 2386 2386

int64
  PLIO 32 bit: N/A
  PLIO 64 bit, 1 value per line: 45678
  PLIO 128 bit, 2 values per line: 45678 95578

cint16 (each value is a real, imaginary pair)
  PLIO 32 bit, 1 value per line: 1980 485
  PLIO 64 bit, 2 values per line: 1980 45 180 85
  PLIO 128 bit, 4 values per line: 1980 485 180 85 980 48 190 45

cint32 (each value is a real, imaginary pair)
  PLIO 32 bit: N/A
  PLIO 64 bit, 1 value per line: 1980 485
  PLIO 128 bit, 2 values per line: 1980 45 180 85

float
  PLIO 32 bit, 1 value per line: 893.5689
  PLIO 64 bit, 2 values per line: 893.5689 3459.3452
  PLIO 128 bit, 4 values per line: 893.5689 39.32 459.352 349.345

cfloat (each value is a real, imaginary pair)
  PLIO 32 bit: N/A
  PLIO 64 bit, 1 value per line: 893.5689 24156.456
  PLIO 128 bit, 2 values per line: 893.5689 24156.456 93.689 256.46

Simulating Global Memory

When an application accesses global memory using the GMIO specification (see GMIO Attributes), the simulation needs to model the DDR memory and the routing network connecting the DDR memory to the PL and AI Engines. AI Engine to DDR memory connections are mediated by the DMA data mover that is embedded in the AI Engine array interface and controlled through the GMIO APIs in the PS program. Connections to DDR memory from an AXI4-Stream port on a PL block are mediated by a soft GMIO data mover block, which is generated automatically by the AI Engine compiler for simulation purposes. The data mover converts the streaming interface from the PL blocks to memory-mapped AXI4 interface transactions over the NoC with a specific start address, block size, and burst size as shown in GMIO Attributes.

While simulating with global memory, a memory data file can be supplied using an additional option, --gm-init-file, which initializes the DDR memory with predefined data. This file is a textual byte-dump of the DDR memory starting at a given address. The format of this file is as follows:

<startaddr>:
<byte>
<byte>
…
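For illustration, a hypothetical dump.txt that initializes four bytes starting at address 0x10000 could look like the following (the address and byte values are placeholders, not from a real design):

```
0x10000:
1A
02
FF
00
```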

For example, the AI Engine simulator can be invoked with global memory initialization in the following way:

aiesimulator --pkg-dir=./Work --gm-init-file=dump.txt

The simulator also produces an output byte dump for the DDR memory used, in the simulation output directory (default: aiesimulator_output). The name of the output file is based on the internal location of the DDR memory bank (for example, DDRMC_SITE_X1Y0.mem) starting at the base address 0x0. You can use this dump to verify the global memory transactions.

Simulator Options for Hardware Emulation

The AI Engine simulator generates an options file that lists the options used to simulate the AI Engine graph application. This file, aiesim_options.txt, is created in the aiesimulator_output directory when either the --dump-vcd or --profile option is used with the aiesimulator command. You can reuse these options later in system-level hardware emulation, and you can manually edit the file to specify other options as required. The following table lists the options that can be specified in aiesim_options.txt. The file can be passed on the command line when launching the hardware emulator with the launch_hw_emu.sh script, as described in Running the System. An example command line is as follows.

./launch_hw_emu.sh \
-add-env VITIS_LAUNCH_WAVEFORM_BATCH=1 \
-aie-sim-options ${FULL_PATH}/aiesimulator_output/aiesim_options.txt

where ${FULL_PATH} must be the full path to the file or directory.

Table 3. AI Engine Options for Hardware Emulation
Command Arguments Description
AIE_DUMP_VCD <filename> When AIE_DUMP_VCD is specified, the simulation generates VCD data and writes it to the specified <filename>.vcd.
AIE_PROFILE All | (1,2)(3,4)... This option profiles either all the used AI Engines or selected AI Engines listed. Hardware Emulation generates profile data files in the sim/behav_waveform/xsim directory and can be viewed in the Vitis analyzer by opening the file default.aierun_summary. This option also logs the ADF kernel printf data to sim/behav_waveform/xsim/simulate.log file.
AIE_PKG_DIR /path_to_work_dir/Work This is a mandatory option that sets the path to the Work directory generated by the AI Engine compiler. If you do not specify this option, the generated sim/behav_waveform/xsim/default.aierun_summary file will not have the correct Work directory setting, which will impact the display of the summary file in the Vitis analyzer.

The following command brings up the XSIM waveform GUI during hardware emulation.

./launch_hw_emu.sh -g

Additionally, you can add more advanced options to log waveform data without having to launch emulation with the Vivado logic simulator GUI. An example command line is as follows.

./launch_hw_emu.sh \
-user-pre-sim-script pre-sim.tcl

The pre-sim.tcl contains Tcl commands to add waveforms or log design waveforms. For an example, see the Vitis Accelerated Software Development Flow Documentation in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416) and for Tcl commands see Vivado Design Suite User Guide: Logic Simulation (UG900).

Enabling Third-Party Simulators

Third-party simulators such as Questa Advanced Simulator (Mentor Graphics), Xcelium (Cadence), and VCS (Synopsys) are supported when executing hardware emulation of your design. You can enable these simulators by updating the Vitis configuration file (config.ini or system.cfg).
Table 4. Vitis Link Settings
Simulator v++ --link Configuration
Questa Advanced Simulator EXPORT simulator=questa
[advanced]
param=hw_emu.simulator=QUESTA
[vivado]
prop=project.__CURRENT__.compxlib.questa_compiled_library_dir=/path/to/questa/2020.4/lin64/lib/
Xcelium EXPORT simulator=xcelium
[advanced]
param=hw_emu.simulator=XCELIUM
[vivado]
prop=project.__CURRENT__.simulator.xcelium_install_dir=/path/to/xcelium/bin/
prop=project.__CURRENT__.compxlib.xcelium_compiled_library_dir=/path/to/xcelium/20.09.006/lin64/lib/
prop=fileset.sim_1.xcelium.simulate.runtime=1000us
prop=fileset.sim_1.xcelium.elaborate.xmelab.more_options={-timescale 1ns/1ps}
VCS EXPORT simulator=vcs
[advanced]
param=hw_emu.simulator=VCS
[vivado]
prop=project.__CURRENT__.simulator.vcs_install_dir=/path/to/vcs/R-2020.12/bin/
prop=project.__CURRENT__.compxlib.vcs_compiled_library_dir=/path/to/clibs/vcs/R-2020.12/lin64/lib/
prop=project.__CURRENT__.simulator.vcs_gcc_install_dir=/path/to/synopsys/vg_gnu/2019.06/amd64/gcc-6.2.0_64/bin
prop=fileset.sim_1.vcs.simulate.log_all_signals=false

After the modifications have been made, build the design as normal and run the launch_hw_emu.sh script; the new simulator is used. More information on emulation is provided in Running the System.

x86 Functional Simulator

When starting development on AI Engine graphs and kernels, it is critical to verify the design behavior. This is called functional simulation, and it is useful for identifying bugs in the design. For example, window sizes in the graph are related to the number of iterations in a kernel, and troubleshooting whether every kernel in a large graph absorbs and generates the right number of samples can be time-consuming and iterative. The x86 simulator is an ideal choice for testing, debugging, and verifying this kind of behavior because of its speed of iteration and the high level of data visibility it provides the developer. The x86 simulation does not provide timing, resource, or performance information. The x86 simulator runs exclusively on the tool development machine, which means its performance and memory use are determined by that machine.

While the AI Engine simulator fully models the memory of the AI Engine array, this also means the AI Engine simulator is limited by the memory space of the AI Engine. The x86 simulator is not limited in this way and supports effectively unlimited debug printf() output, large arrays, and variables. Combined with the ability to single-step through kernel code, very complex design issues can be quickly isolated and addressed.

Several macros are provided to improve the kernel developer's quality of life. These macros are intended for use with the x86 simulator and can be left in the code for maintainability purposes. With the benefits of the x86 simulator come some trade-offs: several types of graph constructs are not supported, and cycle-accurate behavior cannot be guaranteed to match between the AI Engine simulator and the x86 simulator. The x86 simulator is not a replacement for the AI Engine simulator, which should still be run on all designs to verify behavior and obtain performance information. For different stages of project development, one tool or the other might better suit your needs.

To run the x86 simulator, change the AI Engine compiler target to x86sim.

aiecompiler --target=x86sim graph.cpp

After the application is compiled for x86 simulation, the x86 simulator can be invoked as follows.

x86simulator

The complete x86 simulator command help is shown in the following code.

$ x86simulator [-h] [--help] [--h] [--pkg-dir=PKGDIR]
optional arguments:
-h, --help, --h    Show this help message and exit
--pkg-dir=PKG_DIR  Set the package directory. For example: Work
--timeout=secs     Terminate simulation after the specified number of seconds
--gdb              Invoke the simulator from gdb

The compiled binary for x86 native simulation is produced by the AI Engine compiler under the Work directory (see Compiling an AI Engine Graph Application) and is started automatically by the x86 simulator.

The input and the output files are specified in the following snippet of graph code.

adf::PLIO in1("In", adf::plio_32_bits, "In1.txt");
adf::PLIO out1("Out", adf::plio_32_bits, "Out1.txt");
simulation::platform<1,1> platform(&in1,&out1);

When running, the x86 simulator looks in the current working directory for data/In1.txt, which is one of the inputs used by the ADF graph. To distinguish x86 simulator output files from AI Engine simulator output files, Out1.txt is written to current_working_dir/x86simulator_output/data/.
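For example, with the adf::plio_32_bits int32 input declared above, a minimal data/In1.txt could contain one value per line (the values below are placeholders):

```
2386
-17
1024
0
```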

The output files produced by the simulator can be compared with a golden output ignoring white space differences.

diff -w <data>/golden.txt <data>/output.txt

Compiling the Design

The GNU debugger (GDB) allows C/C++ debugging similar to an IDE-based debugger, including setting breakpoints, single stepping, stepping over functions, and multiple hit counts on breakpoints. For AI Engine kernel development, the x86 simulator enables single-step debugging of kernel code using GDB.

The target argument for the AI Engine compiler must be set to x86sim to use GDB.

aiecompiler --target=x86sim graph.cpp

Additionally, passing the -O0 option through the preprocessor options minimizes compiler optimizations, which improves debug visibility. The optimization parameter can be passed to the preprocessor as follows.

aiecompiler --target=x86sim --Xpreproc=-O0 graph.cpp

Using GDB

After successful compilation with the appropriate target, you can launch the simulation and automatically attach a GDB instance to it. To launch an interactive GDB session, run the command with the --gdb switch as follows.

x86simulator --gdb
Note: Launching GDB is supported only from the command line in the 2021.1 release.

By default, when running the x86 simulator with the gdb command line switch, it breaks immediately before entering main() in graph.cpp. This pauses execution before any AI Engine kernels have started, because the graph has not yet been run. Type quit to exit GDB, or help to list more commands.

Setting a breakpoint can be done in multiple ways. One method is to use the following syntax.
break <kernel_function_name>

By typing continue (shorthand c), the debugger runs until the breakpoint in <kernel_function_name> is reached. When the breakpoint is reached, it is possible to examine local stack variables and function arguments. The following table shows some commonly used GDB instructions for examining these variables.

Table 5. Common GDB Instructions
Command Description
info stack Shows a trace of the function call stack at the current breakpoint.
info locals Shows the current status of local variables within the scope of the function call shown in the call stack.
print <local_variable_name> Prints the current value of a single variable.
finish Exits the current function call but keeps the simulation paused.
continue Resumes execution until the next breakpoint is reached, or runs to completion.

GDB is a very powerful debugger with many features. Full documentation of GDB is beyond the scope of this document. For more information see https://www.gnu.org/software/gdb/.
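Putting these commands together, an illustrative session against a kernel function named simple might look like the following (simulator output elided):

```
(gdb) break simple
(gdb) continue
(gdb) info stack
(gdb) info locals
(gdb) print count
(gdb) finish
(gdb) continue
```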

Macros

Xilinx provides several predefined compiler macros to help when using the x86 simulator. In your top-level graph test bench (usually called graph.cpp), it can be useful to use the following preprocessor macros with a conditional #if to include or exclude appropriate code.

Macro Description
__X86SIM__ Use this predefined macro to designate code that is applicable only for the x86sim flow.
__AIESIM__ Use this predefined macro to designate code that is applicable only for the aiesimulator flow.
X86SIM_KERNEL_NAME Use this macro with printf() to tag instrumentation text with the kernel instance name.

Note: Macros whose names are surrounded with underscores have two underscore characters before and after the name.

The following is an example of the macro code.

myAIEgraph g;
#if defined(__AIESIM__) || defined(__X86SIM__)
int main()
{
   g.init();
   g.run(4);
   g.end();
   return 0;
}
#endif
TIP: The __AIESIM__ macro is used in the AI Engine simulator only and __X86SIM__ is applicable for the x86 simulator.

The previous example shows the __AIESIM__ and __X86SIM__ macros surrounding the main() used by a graph.cpp file. This main() must be excluded from emulation flows, and these macros provide that flexibility. Additionally, consider using the __X86SIM__ macro to selectively enable debug instrumentation only during x86 simulation.

Printf() Macros

The x86 simulator executes multiple kernels in parallel on separate threads. This means that printf() debug messages can often be interleaved and non-deterministic when viewed in standard output. To help identify which kernel is printing which line the X86SIM_KERNEL_NAME macro can be useful. The following is an example showing how to combine it with printf().

Note: To use X86SIM_KERNEL_NAME you must include adf/x86sim/x86simDebug.h as shown in the following code.
#include <adf/x86sim/x86simDebug.h>

void simple(input_window_float * in, output_window_float * out) {
   for (unsigned i=0; i<NUM_SAMPLES;i++) {
       float val = window_readincr(in);
       window_writeincr(out,val+0.15);
   }
   static int count = 0;
   printf("%s: %s %d\n",__FILE__,X86SIM_KERNEL_NAME,++count);
}

Using printf() in Kernels

Vector data types are the most commonly used types in AI Engine kernel code. To debug vector operations within a kernel it is helpful to use printf.

Using printf() with Vector Data Types

To printf() the native vector types you can use a technique as follows.

v4cint16 input_vector;
...
int16_t* print_ptr = (int16_t*)&input_vector;
for (int qq = 0; qq < 4; qq++) { // 4 iterations: print one int16 real/imaginary pair per loop
    printf("vector re: %d, im: %d\r\n", print_ptr[2*qq], print_ptr[2*qq+1]);
}

With the AI Engine simulator the --profile option is required in order to observe printf() outputs. With the x86 simulator no additional options are needed to enable printf calls. This is one of the benefits of the x86 simulator.

IMPORTANT: Xilinx recommends avoiding std::cout in kernel and host code. If std::cout is used, its output can appear interleaved because of the multi-threaded nature of the x86 simulator. Using printf() is recommended instead.

Considerations

Memory Model

In the x86 simulator, static and global variables in a source file are shared between the multiple AI Engine kernels compiled from that file. This is not the same behavior as the AI Engine simulator. If multiple kernels access the same memory, the x86 simulator model does not allocate per-kernel memory; instead, all kernels share the same global-scope memory. Reading and writing the same memory from multiple kernels results in data corruption during x86 simulation. To avoid this problem, keep memory allocations function-scoped within a kernel, or use the AI Engine simulator, which fully models the memory.

Graph API Calls

Graph constructs for timed execution (graph.wait(N), graph.resume(), and graph.end(N)) behave differently in the AI Engine simulator and the x86 simulator. For the AI Engine simulator, N specifies a processor-cycle timeout, whereas for the x86 simulator it specifies a wall-clock timeout in milliseconds. Thus, if your test bench uses timed execution, the AI Engine simulator and x86 simulator might produce different results; in particular, the amount of output data produced might differ.

Preprocessor Considerations

When running the AI Engine compiler with a target of x86sim the compiler ignores the --Xchess option. This means that the x86sim flow does not support kernel-specific compile options.

To understand this better consider the following example. A common method of making compile-time modifications to C code is using preprocessor directives such as #ifndef. To control these preprocessor directives it is helpful to pass #defines through the command line compiler options. The following example code block takes two different actions based on a preprocessor directive.

void example_kernel()
{
  #ifdef SIM
    printf("Simulation Mode\n");
  #else
    printf("Default Mode\n");
  #endif
}

To define the SIM macro at compile time with the AI Engine compiler targeting hardware (hw) you can do the following.

aiecompiler --target=hw --Xchess="example_kernel:-DSIM"

Because the --Xchess argument is ignored when the compilation target is set to x86sim, SIM is not defined for the x86 simulator case and the output of the kernel is Default Mode.

If you need to specify preprocessor options for the x86 simulator, use aiecompiler --target=x86sim --Xpreproc instead of --Xchess. Note that options passed in this manner apply to all source code and all target flows.

Table 6. AI Engine Compiler Command Line Options
Option Description
--Xchess=<string>

Can be used to pass kernel specific options to the CHESS compiler that is used to compile code for each AI Engine.

The option string is specified as <kernel-function>:<optionid>=<value>. This option string is included during compilation of generated source files on the AI Engine where the specified kernel function is mapped.
--Xpreproc=<string>

Passes a general option to the preprocessor phase for all source code compilations (AIE/PS/PL/x86sim). For example:

--Xpreproc=-D<var>=<value>

Packet Switching and RTP

The x86 simulator and AI Engine simulator have some important differences when it comes to packet switching and run-time parameters (RTP). See Run-Time Parameter Specification and Explicit Packet Switching if you are not already familiar with these constructs.

Both packet switching and RTPs exhibit behavior that can manifest itself as a difference between the AI Engine simulator and the x86 simulator. It is important to understand that this dissimilarity is still correct in both cases. As discussed in Explicit Packet Switching, packet switched streams are non-deterministic in all design flows (x86 simulator, AI Engine simulator, and hardware emulation).

Packet Switching

Packet stream connections have a field known as the packet ID. If the source of the packet ID comes from within the ADF graph, the x86 simulator uses a canonical, zero-based indexing scheme for packet IDs: the first branch on the output of a split node has packet ID 0, followed by 1, 2, 3, and so on. If the source of the packet ID comes from outside the ADF graph, there will be discrepancies between the AI Engine simulator and the x86 simulator. To resolve these discrepancies, see Packet Switching and the AI Engine Simulator for information on providing custom packet IDs when the source is outside the ADF graph.

The nature of packet merging means that the x86 simulator and the AI Engine simulator produce non-deterministic results. If an AI Engine on a packet-split branch finishes processing data before the other branches, its data appears on the output of the packet merge first. Exactly which core finishes first depends heavily on both the kernel code and the incoming data, so any downstream processing blocks must be prepared to handle this behavior.

Synchronous and Asynchronous RTP

Both synchronous and asynchronous run-time parameters are fully supported in the x86 simulator. However, the precise timing of when an RTP update occurs differs between the x86 simulator and the AI Engine simulator. Asynchronous RTPs in particular do not affect a kernel on a specific known cycle, by their very nature of being asynchronous. This is true in both the x86 simulator and the AI Engine simulator.

Note: The x86 simulator is by nature a functional simulator, whereas the AI Engine simulator models cycles approximately, so differences are expected. Xilinx recommends partitioning your design into pieces that benefit from the x86 simulator.

Best Practices

The x86 simulator works best with the following recommendations.