Running Emulation

Development of a user application and hardware kernels targeting an FPGA requires a phased development approach. Because FPGAs, Versal™ ACAPs, and Zynq® UltraScale+™ MPSoCs are programmable devices, building the device binary for hardware takes some time. To enable quicker iterations without going through the full hardware compilation flow, the Vitis™ tool provides emulation targets on which the application and kernels can be run. Compiling for emulation targets is significantly faster than compiling for the actual hardware. Additionally, emulation targets provide full visibility into the application or accelerator, making it easier to perform debugging. Once your design passes in emulation, you can compile and run the application on the hardware platform in the later stages of development.

The Vitis tool provides two emulation targets:

Software emulation (sw_emu)
The software emulation build compiles and links quickly, and the host program runs either natively on an x86 processor or in the QEMU emulation environment. The PL kernels are compiled natively and run on the host machine. This build target lets you quickly iterate on both the host code and kernel logic.
Hardware emulation (hw_emu)
As in software emulation, the host program runs natively on x86 or in QEMU, but the kernel code is compiled into an RTL behavioral model that runs in the Vivado® simulator or another supported third-party simulator. This build-and-run loop takes longer but provides a cycle-accurate view of the kernel logic.

Compiling for either of the emulation targets is seamlessly integrated into the Vitis command line and IDE flows. You can compile your host and kernel source code for either emulation target without making any changes to the source code. The host code does not need to be compiled differently for emulation; the same host executable or PS application ELF binary can be used. Emulation targets support most features, including XRT APIs, buffer transfers, platform memory SP tags, and kernel-to-kernel connections.

Running Emulation Targets

The emulation targets have their own target-specific drivers, which are loaded by XRT. Thus, the same CPU binary can be run as-is, without recompiling, simply by changing the target mode at runtime. Based on the value of the XCL_EMULATION_MODE environment variable, XRT loads the target-specific driver and makes the application interface with an emulation model of the hardware. The allowed values of XCL_EMULATION_MODE are sw_emu and hw_emu. If XCL_EMULATION_MODE is not set, XRT loads the hardware driver.
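For example, in a Bash shell you could select hardware emulation, or unset the variable to target real hardware again (a minimal sketch based on the behavior described above):

export XCL_EMULATION_MODE=hw_emu    # run against the hw_emu driver
unset XCL_EMULATION_MODE            # XRT loads the hardware driver again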

IMPORTANT: You must set XCL_EMULATION_MODE when running emulation.
Figure 1: XRT Drivers

You can also use the xrt.ini file to configure various options applicable to emulation. There is an [Emulation] specific section in xrt.ini, as described in xrt.ini File.
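For example, a minimal xrt.ini sketch might enable profiling and select batch mode for the hardware emulation simulator; these option names follow the xrt.ini File documentation and should be verified against your XRT version:

[Debug]
profile=true

[Emulation]
debug_mode=batch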

Data Center vs. Embedded Platforms

Emulation is supported for both data center and embedded platforms. For data center platforms, the host application is compiled for an x86 server, while the device is modeled as a separate x86 process emulating the hardware. The user host code and the device model process communicate using RPC calls. For embedded platforms, where the CPU code runs on the embedded Arm processor, the emulation flows use QEMU (Quick Emulator) to mimic the Arm-based PS subsystem. In QEMU, you can boot embedded Linux and run Arm binaries on the emulation targets.

For running software emulation (sw_emu) and hardware emulation (hw_emu) of a data center application, you must compile an emulation model of the accelerator card using the emconfigutil command and set the XCL_EMULATION_MODE environment variable prior to launching your application. The steps are detailed in Running Emulation on Data Center Accelerator Cards.

For running sw_emu or hw_emu of an embedded application, you must launch the QEMU emulation environment on the x86 processor to model the execution environment of the Arm processor. This requires the use of the launch_emulator.py command, or shell scripts generated during the build process. The details of this flow are explained in Running Emulation on an Embedded Processor Platform.

QEMU

QEMU stands for Quick Emulator. It is a generic, open-source machine emulator. Xilinx provides a customized QEMU model that mimics the Arm-based processing system present on Versal ACAP, Zynq® UltraScale+™ MPSoC, and Zynq-7000 SoC devices. The QEMU model provides the ability to execute CPU instructions at near real-time speed without the need for real hardware. For more information, refer to the Xilinx Quick Emulator User Guide: QEMU.

For hardware emulation, the Vitis emulation targets use QEMU and co-simulate it with an RTL and SystemC-based model of the rest of the design to provide a complete execution model of the entire platform. You can boot an embedded Linux kernel on it and run XRT-based accelerator applications. Because QEMU executes Arm instructions, you can run your Arm binaries in the emulation flows as-is, without recompiling. QEMU also allows you to debug your application using GDB and TCF-based target connections from the Xilinx System Debugger (XSDB).

The Vitis emulation flow also uses QEMU to emulate the MicroBlaze™ processor to model the platform management modules (PLM and PMU) of the devices. On Versal devices, the PLM firmware is used to load the PDI to program sections of the PS and AI Engine model.

To ensure that the QEMU configuration matches the platform, additional files must be provided as part of the sw directory of Vitis platforms. Two common files, qemu_args.txt and pmc_args.txt, contain the command line arguments to be used when launching QEMU. When you create a custom platform, these two files are automatically added to your platform with default contents. You can review the files and edit them as needed to model your custom platform. Refer to a Xilinx embedded platform for an example.

Because QEMU is a generic model, it uses a Linux device tree style DTB formatted file to enable and configure various hardware modules. A default QEMU hardware DTB file is shipped with the Vitis tools in the <vitis_installation>/data/emulation/dtbs folder. However, if your platform requires a different QEMU DTB, you can package it as part of your platform.

TIP: The QEMU DTB represents the hardware configuration for QEMU, and is different from the DTB used by the Linux kernel.

Running Emulation on Data Center Accelerator Cards

TIP: Set up the command shell or window as described in Setting Up the Vitis Environment prior to running the builds.
  1. Set the desired runtime settings in the xrt.ini file. This step is optional.

    As described in xrt.ini File, the file specifies various parameters to control debugging, profiling, and message logging in XRT when running the host application and kernel execution. This enables the runtime to capture debugging and profile data as the application is running. The Emulation group in the xrt.ini provides features that affect your emulation run.

    TIP: Be sure to use the v++ -g option when compiling your kernel code for emulation mode.
  2. Create an emconfig.json file from the target platform as described in emconfigutil Utility. This is required for running hardware or software emulation.

    The emulation configuration file, emconfig.json, is generated from the specified platform using the emconfigutil command, and provides information used by the XRT library during emulation. The following example creates the emconfig.json file for the specified target platform:

    emconfigutil --platform xilinx_u200_xdma_201830_2
    In emulation mode, the runtime looks for the emconfig.json file in the same directory as the host executable, and reads in the target configuration for the emulation runs.
    TIP: It is mandatory to have an up-to-date JSON file for running emulation on your target platform.
  3. Set the XCL_EMULATION_MODE environment variable to sw_emu (software emulation) or hw_emu (hardware emulation) as appropriate. This changes the application execution to emulation mode.

    Use the following syntax to set the environment variable for C shell (csh):

    setenv XCL_EMULATION_MODE sw_emu

    Bash shell:

    export XCL_EMULATION_MODE=sw_emu
    IMPORTANT: The emulation targets will not run if the XCL_EMULATION_MODE environment variable is not properly set.
  4. Run the application.

    With the runtime initialization file (xrt.ini), the emulation configuration file (emconfig.json), and the XCL_EMULATION_MODE environment variable set, run the host executable with the desired command line arguments.

    IMPORTANT: The INI and JSON files must be in the same directory as the executable.

    For example:

    ./host.exe kernel.xclbin
    TIP: This command line assumes that the host program is written to take the name of the xclbin file as an argument, as most Vitis examples and tutorials do. However, your application may have the name of the xclbin file hard-coded into the host program, or may require a different approach to running the application.
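Putting these steps together, a typical data center emulation session in a Bash shell might look like the following sketch, reusing the example platform and host command from above:

emconfigutil --platform xilinx_u200_xdma_201830_2
export XCL_EMULATION_MODE=hw_emu
./host.exe kernel.xclbin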

Running Emulation on an Embedded Processor Platform

Note: Set the file size limit on your machine to unlimited or to a high value (over 16 GB), because embedded hardware emulation can create large files to model memory.
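For example, in a Bash shell you might raise the limit before launching emulation (a sketch; the exact mechanism depends on your shell and system policy):

ulimit -f unlimited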
TIP: Set up the command shell or window as described in Setting Up the Vitis Environment prior to running the builds.
  1. Set the desired runtime settings in the xrt.ini file.

    As described in xrt.ini File, the file specifies various parameters to control debugging, profiling, and message logging in XRT when running the host application and kernel execution. As described in Enabling Profiling in Your Application, this enables the runtime to capture debugging and profile data as your application is running.

    The xrt.ini file, as well as any additional files required for running the application, must be included in the output files as explained in Packaging for Embedded Platforms.

    TIP: Be sure to use the v++ -g option when compiling your kernel code for emulation mode.
  2. Launch the QEMU emulation environment by running the launch_sw_emu.sh script or launch_hw_emu.sh script.
    launch_sw_emu.sh -forward-port 1440 22

    The script is created in the emulation directory during the packaging process, and uses the launch_emulator.py command to set up and launch QEMU. When launching the emulation script, you can also specify options for the launch_emulator.py command, such as the -forward-port option to forward the QEMU port to an open port on the local system. This is needed when copying files from QEMU, as discussed in step 5 below. Refer to launch_emulator Utility for details of the command.

    Another example is specifying launch_hw_emu.sh -enable-debug, which opens additional XTERMs for the QEMU and PL processes so you can observe live transcripts of command execution to aid in debugging the application. This is not enabled by default, but can be useful when needed for debug.

  3. Mount and configure the QEMU shell with the required settings.

    The Xilinx embedded base platforms have rootfs on a separate EXT4 partition on the SD card. After booting Linux, this partition needs to be mounted. If you are running emulation manually, you need to run the following commands from the QEMU shell:

    mount /dev/mmcblk0p1 /mnt
    cd /mnt
    export LD_LIBRARY_PATH=/mnt:/tmp:$LD_LIBRARY_PATH
    export XCL_EMULATION_MODE=hw_emu
    export XILINX_XRT=/usr
    export XILINX_VITIS=/mnt
    TIP: You can set the XCL_EMULATION_MODE environment variable to sw_emu for software emulation, or hw_emu for hardware emulation. This configures the host application to run in emulation mode.
  4. Run the application from within the QEMU shell.

    With the runtime initialization file (xrt.ini) in place and the XCL_EMULATION_MODE environment variable set, run the host executable with the command line required by the host application. For example:

    ./host.elf kernel.xclbin
    TIP: This command line assumes that the host program is written to take the name of the xclbin file as an argument, as most Vitis examples and tutorials do. However, your application can have the name of the xclbin file hard-coded into the host program, or can require a different approach to running the application.
  5. After the application run has completed, you might have some files that were produced by the runtime, such as opencl_summary.csv, opencl_trace.csv, and xclbin.run_summary. These files can be found in the /mnt folder inside the QEMU environment. However, to view these files you must copy them from the QEMU Linux system back to your local system. The files can be copied using the scp command as follows:
    scp -P 1440 root@<host-ip-address>:/mnt/<file> <dest_path>

    Where:

    • 1440 is the QEMU port to connect to.
    • root@<host-ip-address> is the root login for the PetaLinux running under QEMU on the specified IP address. The default root password is "root".
    • /mnt/<file> is the path and file name of the file you want to copy from the QEMU environment.
    • <dest_path> specifies the path and file name to copy the file to on the local system.
    For example:
    scp -P 1440 root@172.55.12.26:/mnt/xclbin.run_summary .
  6. When your application has completed emulation and you have copied any needed files, press Ctrl+a followed by x to terminate the QEMU shell and return to the Linux shell.
    Note: If you have trouble terminating the QEMU environment, you can kill the processes it launches to run the environment. The tool reports the process IDs (pids) at the start of the transcript, or you can specify the -pid-file option to capture the pids when launching emulation.
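    For example, assuming the option takes a file name, a launch that records the process IDs for later cleanup might look like:
    ./launch_hw_emu.sh -pid-file emu_pids.txt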

Speed and Accuracy of Hardware Emulation

Hardware emulation uses a mix of SystemC and RTL co-simulation to provide a balance between accuracy and speed of simulation. The SystemC models are a mix of purely functional models and performance-approximate models. Hardware emulation does not mimic the hardware with 100% accuracy, so you should expect some differences in behavior between running emulation and executing your application on hardware. This can lead to significant differences in application performance, and sometimes differences in functionality can also be observed.

Functional differences with hardware typically point to a race condition or some unpredictable behavior in your design. An issue seen in hardware might not always be reproducible in hardware emulation, though most behavior related to interactions between the host and the accelerator, or between the accelerator and the memory, is reproducible in hardware emulation. This makes hardware emulation an excellent tool for debugging issues with your accelerator prior to running on hardware.

The following table lists models that are used to mimic the hardware platform and their accuracy levels.

Table 1. Hardware Platform Models and Their Accuracy
  • Host to Card PCIe® Connection and DMA (XDMA, SlaveBridge): For data center platforms, the connection to the x86 host server over PCIe is a purely functional model and does not have any performance modeling. Thus, any issues related to PCIe bandwidth are not reflected in hardware emulation runs.
  • UltraScale™ DDR Memory, SmartConnect: The SystemC models for the DDR memory controller, AXI SmartConnect, and other data path IP are usually throughput-approximate. They typically do not model the exact latency of the hardware IP. The models can be used to gauge a relative performance trend as you modify your application or the accelerator kernel.
  • AI Engine: The AI Engine SystemC model is cycle-approximate, though it is not intended to be 100% cycle accurate. A common model is used between the AI Engine simulator and hardware emulation, enabling a reasonable comparison between the two stages.
  • Versal NoC and DDR Models: The Versal NoC and DDR SystemC models are cycle-approximate.
  • Arm Processing Subsystem (PS, CIPS): The Arm PS is modeled using QEMU, which is a purely functional execution model. For more information, see QEMU.
  • User Kernel (accelerator): Hardware emulation uses the RTL for the user accelerator. As a result, the accelerator behavior by itself is 100% accurate. However, the accelerator is surrounded by other, approximate models.
  • Other I/O Models: For hardware emulation, generic Python- or C-based traffic generators can be interfaced with the emulation process. You can generate abstract traffic at the AXI protocol level that mimics the I/O in your design. Because these models are abstract, any issues observed on the specific hardware board will not be visible in hardware emulation.

Because hardware emulation uses RTL co-simulation as its execution model, the speed of execution is orders of magnitude slower than real hardware. Xilinx recommends using small data buffers. For example, if you have a configurable vector addition that performs a 1024-element vadd in hardware, you might restrict it to 16 elements in emulation. This lets you test your application with the accelerator while still completing execution in a reasonable time.
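As an illustration only (not taken from the Vitis examples), host code might choose the buffer size based on whether an emulation target is active, using the XCL_EMULATION_MODE variable described earlier:

#include <cstdlib>  // std::getenv

// Hypothetical helper: use a small vector in emulation and the full size on hardware.
static size_t choose_vadd_elements() {
    const char* mode = std::getenv("XCL_EMULATION_MODE"); // "sw_emu", "hw_emu", or unset
    return (mode != nullptr) ? 16 : 1024;
}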

Working with Simulators in Hardware Emulation

Simulator Support

The Vitis tool uses the Vivado logic simulator (xsim) as the default simulator for all platforms, including Alveo Data Center accelerator cards, and Versal and Zynq UltraScale+ MPSoC embedded platforms. However, for Versal embedded platforms, like xilinx_vck190_base or custom platforms similar to it, the Vitis tool also supports the use of third-party simulators for hardware emulation: Mentor Graphics Questa Advanced Simulator, Xcelium, and VCS. The specific versions of the supported simulators are the same as the versions supported by Vivado Design Suite.

TIP: For data center platforms, hardware emulation supports the U250_XDMA platform with Questa Advanced Simulator. This support does not include features like peer-to-peer (P2P), SlaveBridge, or other features unless explicitly mentioned.

Enabling a third-party simulator requires some additional configuration options to be implemented during generation of the device binary (.xclbin) and supporting Tcl scripts. The specific requirements for each simulator are discussed below. Also note that you should run the Vivado setup for third-party simulators before using those simulators in Vitis. Specifically, you must pre-compile the simulation models using the compile_simlib Tcl command. For more details on third-party simulator setup, see the Vivado Design Suite User Guide: Logic Simulation (UG900).
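For example, pre-compiling the libraries for Questa might look like the following Tcl sketch (install and output paths are placeholders; verify the compile_simlib options against UG900):

compile_simlib -simulator questa -simulator_exec_path /tools/gensys/questa/2020.4/bin -directory <install_dir>/clibs/questa/2020.4 -family all -language all -library all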

Questa
Add the following advanced parameters and Vivado properties to a configuration file for use during linking:
## Final set of additional options required for running simulation using Questa Simulator
[advanced]
param=hw_emu.simulator=QUESTA
[vivado]
prop=project.__CURRENT__.simulator.questa_install_dir=/tools/gensys/questa/2020.4/bin/
prop=project.__CURRENT__.compxlib.questa_compiled_library_dir=<install_dir>/clibs/questa/2020.4/lin64/lib/
prop=fileset.sim_1.questa.compile.sccom.cores={4}
After generating the configuration file you can use it in the v++ command line as follows:
v++ -link --config questa_sim.cfg
Xcelium
Add the following advanced parameters and Vivado properties to a configuration file for use during linking:
## Final set of additional options required for running simulation using Xcelium Simulator
[advanced]
param=hw_emu.simulator=XCELIUM
[vivado]
prop=project.__CURRENT__.simulator.xcelium_install_dir=/tools/dist/xlm/20.09.006/tools.lnx86/xcelium/bin/
prop=project.__CURRENT__.compxlib.xcelium_compiled_library_dir=/clibs/xcelium/20.09.006/lin64/lib/ 
prop=fileset.sim_1.xcelium.elaborate.xmelab.more_options={-timescale 1ns/1ps} 
After generating the configuration file you can use it in the v++ command line as follows:
v++ -link --config xcelium.cfg
VCS
Add the following advanced parameters and Vivado properties to a configuration file for use during linking:
## Final set of additional options required for running simulation using VCS Simulator
[advanced]
param=hw_emu.simulator=VCS
[vivado]
prop=project.__CURRENT__.simulator.vcs_install_dir=/tools/gensys/vcs/R-2020.12/bin/
prop=project.__CURRENT__.compxlib.vcs_compiled_library_dir=/clibs/vcs/R-2020.12/lin64/lib/
prop=project.__CURRENT__.simulator.vcs_gcc_install_dir=/tools/installs/synopsys/vg_gnu/2019.06/amd64/gcc-6.2.0_64/bin
After generating the configuration file you can use it in the v++ command line as follows:
v++ -link --config vcs_sim.cfg

You can use the -user-pre-sim-script and -user-post-sim-script options from the launch_emulator.py command to specify Tcl scripts to run before the start of simulation, or after simulation completes. As an example, in these scripts, you can use the $cwd command to get the run directory of the simulator and copy any files needed prior to simulation, or copy any output files generated at the end of simulation.

To enable hardware emulation, you must set up the environment for simulation in the Vivado Design Suite. A key step is pre-compiling the RTL and SystemC models for use with the simulator. To do this, run the compile_simlib command in the Vivado tool. For more information on pre-compiling simulation models, refer to the Vivado Design Suite User Guide: Logic Simulation (UG900).

When creating your Versal platform for simulation, the Vivado tool generates a simulation wrapper that must be instantiated in your simulation test bench. If the top-most design module is <top>, then calling launch_simulation in the Vivado tool generates a <top>_sim_wrapper module and an xlnoc.bd. These files are generated as simulation-only sources and are overwritten any time launch_simulation is called. Platform developers must instantiate the <top>_sim_wrapper module in the test bench rather than their own <top> module.

Using the Simulator Waveform Viewer

Hardware emulation uses RTL and SystemC models for execution. Application and HLS-based kernel developers do not generally need to be aware of hardware-level details; the Vitis analyzer provides sufficient detail about the hardware execution model. However, advanced users who are familiar with hardware signals and protocols can launch hardware emulation with the simulator waveform viewer running, as described in Waveform View and Live Waveform Viewer.

By default, when running v++ --link -t hw_emu, the tool compiles the simulation models in optimized mode. When you also specify the -g switch, the hardware emulation models are compiled in debug mode. At runtime, use the -g switch with the launch_hw_emu.sh command to run the simulator interactively in GUI mode with waveforms displayed. By default, the hardware emulation flow adds common signals of interest to the waveform window, but you can pause the simulator to add further signals of interest and then resume simulation.
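For example, a debug-enabled build and an interactive launch might look like the following sketch (the platform and kernel names are placeholders, and other options are omitted):

v++ --link -t hw_emu -g --platform <platform> kernel.xo -o kernel.xclbin
./launch_hw_emu.sh -g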

AXI Transactions Display in XSIM Waveform

Many models in hardware emulation use SystemC transaction-level modeling (TLM). In these cases, interactions between the models cannot be viewed as RTL waveforms. However, Vivado simulator (xsim) provides a transaction level viewer. For standard platforms, these interface objects can be added to the waveform view, similar to how RTL signals are added. As an example, to add an AXI interface to the waveform, use the following Tcl command in xsim:
add_wave <HDL_objects>

Using the add_wave command, you can specify full or relative paths to HDL objects. For additional details on how to interpret the TLM waveform and how to add interfaces in the GUI, see the Vivado Design Suite User Guide: Logic Simulation (UG900).
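For example, with a hypothetical design hierarchy in which the kernel instance appears at /emu_wrapper/emu_i/vadd_1, an AXI4 interface object could be added with:

add_wave /emu_wrapper/emu_i/vadd_1/m_axi_gmem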

Working with SystemC Models

SystemC models in the Vitis application acceleration development flow allow you to quickly model an RTL algorithm for rapid analysis in software and hardware emulation. Using this approach you can model portions of your system while the RTL kernel is still in development, but you want to move forward with some system analysis.

The SystemC model feature supports all the XRT-managed kernel execution models using ap_ctrl_hs and ap_ctrl_chain. It also supports modeling both AXI4 memory mapped interfaces (m_axi) and AXI4-Stream interfaces (axis), as well as register reads and write of the s_axilite interface.

You can model your kernel code in SystemC TLM models, provide interfaces to other kernels and the host application, and use it during emulation. You can create a Xilinx object file (XO) to link the SystemC model to other kernels in your xclbin. The sections that follow discuss the creation of SystemC models, the use of the create_sc_xo command to create the XO, and generating the xclbin using the v++ command.

TIP: Keep in mind that the SystemC model is not cycle accurate, and therefore impacts the timing results of your emulation. It does not reflect the true bandwidth, latency, or throughput of the RTL code.

Coding a SystemC Model

The process for defining a SystemC model uses the following steps:

  1. Include header files "xtlm_ap_ctrl.h" and "xtlm.h".
  2. Derive your kernel from a predefined class based on the supported kernel types: ap_ctrl_chain, ap_ctrl_hs, etc.
  3. Declare and define the AXI interfaces used on your kernel.
  4. Add required kernel arguments with the correct address offset and size.
  5. Write the kernel body in the main_thread() SC_THREAD.

This process is demonstrated in the code example below.

When creating the SystemC model, you derive the kernel from a class based on a supported control protocol: xtlm_ap_ctrl_chain, xtlm_ap_ctrl_hs, or xtlm_ap_ctrl_none. Use the following structure to create your SystemC model.

TIP: The following example is based on the simple vector addition (VADD) example design found in the Vitis_Accel_Examples on GitHub.
#include "xtlm.h"
#include "xtlm_ap_ctrl.h"
 
class vadd : public xsc::xtlm_ap_ctrl_hs
{
    public:
        SC_HAS_PROCESS(vadd);
        vadd(sc_module_name name, xsc::common_cpp::properties& _properties):
        xsc::xtlm_ap_ctrl_hs(name)
        {
            DEFINE_XTLM_AXIMM_MASTER_IF(in1, 32);
            DEFINE_XTLM_AXIMM_MASTER_IF(in2, 32);
            DEFINE_XTLM_AXIMM_MASTER_IF(out_r, 32);
 
           ADD_MEMORY_IF_ARG(in1,   0x10, 0x8);
           ADD_MEMORY_IF_ARG(in2,   0x18, 0x8);
           ADD_MEMORY_IF_ARG(out_r, 0x20, 0x8);
           ADD_SCALAR_ARG(size,     0x28, 0x4);
 
            SC_THREAD(main_thread);
        }
 
 
        //! Declare aximm interfaces..
        DECLARE_XTLM_AXIMM_MASTER_IF(in1);
        DECLARE_XTLM_AXIMM_MASTER_IF(in2);
        DECLARE_XTLM_AXIMM_MASTER_IF(out_r);
 
        //! Declare scalar args...
        unsigned int size;
 
        void main_thread()
        {
            wait(ev_ap_start); //! Wait on ap_start event...
 
            //! Copy kernel args configured by host...
            uint64_t  in1_base_addr = kernel_args[0];
            uint64_t  in2_base_addr = kernel_args[1];
            uint64_t  out_r_base_addr = kernel_args[2];
            size = kernel_args[3];
 
            unsigned data1, data2, data_r;
            for(int i = 0; i < size; i++) {
                in1->read(in1_base_addr + (i*4), (unsigned char*)&data1);  //! Read from in1 interface
                in2->read(in2_base_addr + (i*4), (unsigned char*)&data2);  //! Read from in2 interface
 
                //! Add data1 & data2 and write back result
                data_r = data1 + data2;                //! Add
                out_r->write(out_r_base_addr + (i*4), (unsigned char*)&data_r); //! Write the result
            }
 
            ap_done(); //! completed Kernel computation...
        }
 
};

The include files are available in the Vitis installation hierarchy under the $XILINX_VIVADO/data/systemc/simlibs/ folder.

The kernel name is specified when defining the class for the SystemC model, as shown above, inheriting from the xtlm_ap_ctrl_hs class.

You must declare and define the AXI interfaces associated with the kernel arguments as shown by the following constructs:
DECLARE_XTLM_AXIMM_MASTER_IF(in1);
DEFINE_XTLM_AXIMM_MASTER_IF(in1, 32);
The declaration associates the interface type with the argument, and the definition specifies the data width of the interface. You must also declare the register offset and size for each kernel argument, as shown in the following:
ADD_MEMORY_IF_ARG(in1, 0x10, 0x8);

When specifying the kernel arguments, offsets, and size, these values should match the values reflected in the AXI4-Lite interface of the XRT-managed kernel as described in SW-Controllable Kernels and Control Requirements for XRT-Managed Kernels.

Addresses below 0x10 are reserved for use by XRT for managing kernel execution; kernel arguments can be specified from 0x10 onwards. Most importantly, the arguments, offsets, and sizes specified in the SystemC model must match the values used in the Vitis HLS or RTL kernel.

The kernel is executed in the SystemC main_thread. This thread waits until the ap_start bit is set by the host application or XRT, at which time the kernel can process the argument values as shown:

  1. The kernel waits for a signal to begin from XRT (ev_ap_start).
  2. Kernel arguments are mapped to variables in the SystemC model.
  3. The inputs are read.
  4. The vectors are added and the result is captured.
  5. The output is written back to the host.
  6. The finished signal is sent to XRT (ap_done).

Creating the XO

To generate an XO file from the SystemC model, use the create_sc_xo command. This command takes the SystemC kernel source file as input and creates the IP that generates the XO, which can then be linked with the target platform and other kernels using the Vitis compiler. For example:

create_sc_xo vadd.cpp 

Generating an XO file from the source file involves a number of intermediate steps, such as generating a package IP script and running the package_xo command. These intermediate commands can be used for debugging if necessary.

The output of the above create_sc_xo command is vadd.xo.

Linking with the v++ Command

Link the SystemC model XO file by adding the kernel to the v++ --link command line:
v++ --link --platform <platform> --target hw_emu \
--config ./vadd.cfg --input_files ../vadd.xo --output ../vadd.link.xclbin \
--optimize 0 --save-temps --temp_dir ./hw_emu

The SystemC model can be used for both software emulation and hardware emulation, but is not supported for hardware build targets.

When a SystemC model is included in the xclbin, the design is no longer clock cycle accurate due to the limitations of the TLM.

Using I/O Traffic Generators

Some user applications such as video streaming and Ethernet-based applications make use of I/O ports on the platform to stream data into and out of the platform. For these applications, performing hardware emulation of the design requires a mechanism to mimic the hardware behavior of the I/O port, and to simulate data traffic running through the ports. I/O traffic generators let you model traffic through the I/O ports during hardware emulation in the Vitis application acceleration development flow, or during logic simulation in the Vivado Design Suite. This supports both AXI4-Stream and AXI4 memory map interface I/O emulation.

Adding Traffic Generators to Your Design

Xilinx devices have rich I/O interfaces. The Alveo accelerator cards primarily have PCIe and DDR memory interfaces, which have their own specific models. However, your platform could also have other I/O, for example GT-kernel based generic I/O, video streams, and sensor data. I/O traffic generator kernels provide a method for platforms and applications to inject traffic onto the I/O during simulation.

This solution requires both the inclusion of streaming I/O kernels (XO) or IP in your design, and the use of a Python/C++/C library provided by Xilinx to inject traffic or to capture output data from the emulation process. The Xilinx-provided library can be used to integrate traffic generator code into your application, run it as a separate process, and have it interface with the emulation process. Currently, Xilinx provides a library that enables interfacing at the AXI4-Stream level to mimic any streaming I/O, and at the AXI3/AXI4 memory-mapped level to mimic any memory-mapped I/O.

AXI4-Stream I/O Model for Streaming Traffic

The following section is specific to AXI4-Stream. The streaming I/O model can be used to emulate streaming traffic on the platform, and also supports delay modeling. You can add streaming I/O to your application, or add it to your custom platform design, as described below:

  • Streaming I/O kernels can be added to the device binary (xclbin) file like any other compiled kernel object (XO) file, using the v++ --link command. The Vitis installation provides kernels for AXI4-Stream interfaces of various data widths. These can be found in the software installation at $XILINX_VITIS/data/emulation/XO.

    Add these to your designs using the following example command:

    v++ -t hw_emu --link $XILINX_VITIS/data/emulation/XO/sim_ipc_axis_master_32.xo $XILINX_VITIS/data/emulation/XO/sim_ipc_axis_slave_32.xo ... 

    In the example above, the sim_ipc_axis_master_32.xo and sim_ipc_axis_slave_32.xo provide 32-bit master and slave kernels that can be linked with the target platform and other kernels in your design to create the .xclbin file for the hardware emulation build.

  • IPC modules can also be added to a platform block design using the Vivado IP integrator feature for Versal and Zynq UltraScale+ MPSoC custom platforms. The tool provides sim_ipc_axis_master_v1_0 and sim_ipc_axis_slave_v1_0 IP to add to your platform design. These can be found in the software installation at $XILINX_VIVADO/data/emulation/hw_em/ip_repo.

    The following is an example Tcl script used to add IPC IP to your platform design, which will enable you to inject data traffic into your simulation from an external process written in Python or C++:

    #Update IP Repository path if required
    set_property  ip_repo_paths $XILINX_VIVADO/data/emulation/hw_em/ip_repo [current_project]
    ## Add AXIS Master
    create_bd_cell -type ip -vlnv xilinx.com:ip:sim_ipc_axis_master:1.0 sim_ipc_axis_master_0
    #Change Model Property if required
    set_property -dict [list CONFIG.C_M00_AXIS_TDATA_WIDTH {64}] [get_bd_cells sim_ipc_axis_master_0]
    
    ##Add AXIS Slave
    create_bd_cell -type ip -vlnv xilinx.com:ip:sim_ipc_axis_slave:1.0 sim_ipc_axis_slave_0
    #Change Model Property if required
    set_property -dict [list CONFIG.C_S00_AXIS_TDATA_WIDTH {64}] [get_bd_cells sim_ipc_axis_slave_0]

Writing Traffic Generators in Python

You must also include a traffic generator process while simulating your application, to generate data traffic on the I/O traffic generators or to capture output data from the emulation process. The Xilinx-provided Python or C++ library can be used to create the traffic generator code, as described below. An application can also communicate with multiple I/O interfaces; it is not necessary for each instance of the I/O utilities to run in a separate process or thread. If your application demands it, consider the non-blocking versions of the APIs (details are provided in the following sections).

  • For Python, set $PYTHONPATH on the command terminal:
    setenv PYTHONPATH $XILINX_VIVADO/data/emulation/hw_em/lib/python:\
    $XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/python/
  • Sample Python code to connect with the gt_master instance would look like the following:
    Blocking Send
    from xilinx_xtlm import ipc_axis_master_util
    from xilinx_xtlm import xtlm_ipc
    import struct
     
    import binascii
      
    #Instantiating AXI Master Utilities
    master_util = ipc_axis_master_util("gt_master")
      
    #Create payload
    payload = xtlm_ipc.axi_stream_packet()
    payload.data = "BINARY_DATA" # One way of getting "BINARY_DATA" from integer can be like payload.data = bytes(bytearray(struct.pack("i", int_number))) More info @ https://docs.python.org/3/library/struct.html
    payload.tlast = True #AXI Stream Fields
    #Optional AXI Stream Parameters
    payload.tuser = "OPTIONAL_BINARY_DATA"
    payload.tkeep = "OPTIONAL_BINARY_DATA"
      
    #Send Transaction
    master_util.b_transport(payload)
    master_util.disconnect() #Disconnect connection between Python & Emulation
  • Sample Python code to connect with the gt_slave instance would look like the following:
    Blocking Receive
    from xilinx_xtlm import ipc_axis_slave_util
    from xilinx_xtlm import xtlm_ipc
      
    #Instantiating AXI Slave Utilities
    slave_util = ipc_axis_slave_util("gt_slave")
      
      
    #Sample payload (Blocking Call)
    payload = slave_util.sample_transaction()
    slave_util.disconnect() #Disconnect connection between Python & Emulation
  • The non-blocking versions of the Python APIs can be found at:
    $XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/python/xilinx_xtlm.py

Writing Traffic Generators in C++

  • For C++ the APIs are available at:
    $XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/cpp/inc/axis
    The C++ API provides both blocking and non-blocking function support. The following snippets show the usage.
    TIP: A sample Makefile is also available to generate the executable.
  • Blocking send:
    A simple API is available if you do not need fine-grained control (recommended):
    #include "xtlm_ipc.h" //Include file
    void send_data() 
    {
     //! Instantiate IPC socket with name matching in IPI diagram...
     xtlm_ipc::axis_initiator_socket_util<xtlm_ipc::BLOCKING> socket_util("gt_master");
     const unsigned int NUM_TRANSACTIONS = 8;
     std::vector<char> data;
     std::cout << "Sending " << NUM_TRANSACTIONS << " data transactions..." <<std::endl;
     for(int i = 0; i < NUM_TRANSACTIONS; i++) {
     data = generate_data();
     print(data);
     socket_util.transport(data.data(), data.size());
     }
    }

    Advanced users who need fine-grained control over the AXI4-Stream attributes can use the following:

    #include "xtlm_ipc.h" //Include file
     
    void send_packets()
    {
        //! Instantiate IPC socket with name matching in IPI diagram...
        xtlm_ipc::axis_initiator_socket_util<xtlm_ipc::BLOCKING> socket_util("gt_master");
     
        const unsigned int NUM_TRANSACTIONS = 8;
        xtlm_ipc::axi_stream_packet packet;
     
        std::cout << "Sending " << NUM_TRANSACTIONS << " Packets..." <<std::endl;
        for(int i = 0; i < NUM_TRANSACTIONS; i++) {
            xtlm_ipc::axi_stream_packet packet;
            // generate_data() is your custom code to generate traffic 
            std::vector<char> data = generate_data();
            //! Set packet attributes...
            packet.set_data(data.data(), data.size());
            packet.set_data_length(data.size());
            packet.set_tlast(1);
            //Additional AXIS attributes can be set if required
            socket_util.transport(packet); //Blocking transport API to send the transaction
        }
    }
  • Blocking receive:
    A simple API is available if you do not need fine-grained control (recommended):
    #include "xtlm_ipc.h" //Include file
    void receive_data()
    {
     //! Instantiate IPC socket with name matching in IPI diagram...
     xtlm_ipc::axis_target_socket_util<xtlm_ipc::BLOCKING> socket_util("gt_slave");
     const unsigned int NUM_TRANSACTIONS = 100;
     unsigned int num_received = 0;
     std::vector<char> data;
     std::cout << "Receiving " << NUM_TRANSACTIONS << " data transactions..." <<std::endl;
     while(num_received < NUM_TRANSACTIONS) {
     socket_util.sample_transaction(data);
     print(data);
     num_received += 1;
     }
    }

    Advanced users who need fine-grained control over the AXI4-Stream attributes can use the following:

    #include "xtlm_ipc.h"
     
    void receive_packets()
    {
        //! Instantiate IPC socket with name matching in IPI diagram...
        xtlm_ipc::axis_target_socket_util<xtlm_ipc::BLOCKING> socket_util("gt_slave");
     
        const unsigned int NUM_TRANSACTIONS = 8;
        unsigned int num_received = 0;
        xtlm_ipc::axi_stream_packet packet;
     
        std::cout << "Receiving " << NUM_TRANSACTIONS << " packets..." <<std::endl;
        while(num_received < NUM_TRANSACTIONS) {
            socket_util.sample_transaction(packet); //API to sample the transaction
            //Process the packet as per requirement.
            num_received += 1;
        }
    }
  • Non-Blocking send:
    #include <algorithm>    // std::generate
    #include "xtlm_ipc.h"
     
    //A sample implementation of generating random data.
    xtlm_ipc::axi_stream_packet generate_packet()
    {
        xtlm_ipc::axi_stream_packet packet;
        // generate_data() is your custom code to generate traffic
        std::vector<char> data = generate_data();
     
        //! Set packet attributes...
        packet.set_data(data.data(), data.size());
        packet.set_data_length(data.size());
        packet.set_tlast(1);
        //packet.set_tlast(std::rand()%2);
        //! Option to set tuser tkeep optional attributes...
     
        return packet;
    }
     //Simple Usage
    
    void send_data() 
    {
        //! Instantiate IPC socket with name matching in IPI diagram...
        xtlm_ipc::axis_initiator_socket_util<xtlm_ipc::NON_BLOCKING> socket_util("gt_master");
    
        const unsigned int NUM_TRANSACTIONS = 8;
        std::vector<char> data;
    
        std::cout << "Sending " << NUM_TRANSACTIONS << " data transactions..." <<std::endl;
        for(int i = 0; i < NUM_TRANSACTIONS/2; i++) {
            data = generate_data();
            print(data);
            socket_util.transport(data.data(), data.size());
        }
    
        std::cout<< "Adding Barrier to complete all outstanding transactions..." << std::endl;
        socket_util.barrier_wait();
        for(int i = NUM_TRANSACTIONS/2; i < NUM_TRANSACTIONS; i++) {
            data = generate_data();
            print(data);
            socket_util.transport(data.data(), data.size());
        }
    }
    void send_packets()
    {
        //! Instantiate IPC socket with name matching in IPI diagram...
        xtlm_ipc::axis_initiator_socket_util<xtlm_ipc::NON_BLOCKING> socket_util("gt_master"); 
        // Instantiate Non Blocking specialization
     
        const unsigned int NUM_TRANSACTIONS = 8;
        xtlm_ipc::axi_stream_packet packet;
     
        std::cout << "Sending " << NUM_TRANSACTIONS << " Packets..." <<std::endl;
        for(int i = 0; i < NUM_TRANSACTIONS; i++) {
            packet = generate_packet(); // Or the user's test pattern / live data, etc.
            socket_util.transport(packet);
        }
    }
  • Non-Blocking receive:
    #include <unistd.h>
    #include "xtlm_ipc.h"
    //Simple Usage 
    void receive_data()
    {
        //! Instantiate IPC socket with name matching in IPI diagram...
        xtlm_ipc::axis_target_socket_util<xtlm_ipc::NON_BLOCKING> socket_util("gt_slave");
    
        const unsigned int NUM_TRANSACTIONS = 8;
        unsigned int num_received = 0, num_outstanding = 0;
        std::vector<char> data;
    
        std::cout << "Receiving " << NUM_TRANSACTIONS << " data transactions..." <<std::endl;
        while(num_received < NUM_TRANSACTIONS) {
            num_outstanding = socket_util.get_num_transactions();
            num_received += num_outstanding;
            
            if(num_outstanding != 0) {
                std::cout << "Outstanding data transactions = "<< num_outstanding <<std::endl;
                for(int i = 0; i < num_outstanding; i++) {
                    socket_util.sample_transaction(data);
                    print(data);
                }
            }
            usleep(100000);
        }
    }
    void receive_packets()
    {
        //! Instantiate IPC socket with name matching in IPI diagram...
        xtlm_ipc::axis_target_socket_util<xtlm_ipc::NON_BLOCKING> socket_util("gt_slave");
     
        const unsigned int NUM_TRANSACTIONS = 8;
        unsigned int num_received = 0, num_outstanding = 0;
        xtlm_ipc::axi_stream_packet packet;
     
        std::cout << "Receiving " << NUM_TRANSACTIONS << " packets..." <<std::endl;
        while(num_received < NUM_TRANSACTIONS) {
            num_outstanding = socket_util.get_num_transactions();
            num_received += num_outstanding;
             
            if(num_outstanding != 0) {
                std::cout << "Outstanding packets = "<< num_outstanding <<std::endl;
                for(int i = 0; i < num_outstanding; i++) {
                    socket_util.sample_transaction(packet);
                    print(packet);
                }
            }
            usleep(100000); // Because sampling is non-blocking, allow some delay between consecutive samplings
        }
    }
  • The following is an example Makefile for the blocking receive above:
    GCC=/usr/bin/g++
    IPC_XTLM=$(XILINX_VIVADO)/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/cpp/
    PROTO_PATH=$(XILINX_VIVADO)/data/simmodels/xsim/2021.1/lnx64/6.2.0/ext/protobuf/
    BOOST=$(XILINX_VIVADO)/tps/boost_1_64_0/
     
    SRC_FILE=b_receive.cpp
    .PHONY: run all
     
    default: all
     
    all : b_receive
     
    b_receive: $(SRC_FILE)
        $(GCC)   $(SRC_FILE) $(IPC_XTLM)/src/common/xtlm_ipc.pb.cc $(IPC_XTLM)/src/axis/*.cpp $(IPC_XTLM)/src/common/*.cpp -I$(IPC_XTLM)/inc/ -I$(PROTO_PATH)/include/ -L$(PROTO_PATH) -lprotobuf -o $@ -lpthread -I$(BOOST)/
    
  • The C APIs can be found at:
    $XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/C/inc/axis/c_axis_socket.h
    They can be linked against the pre-compiled library at:
    $XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/C/lib/

A full system-level example is available at https://github.com/Xilinx/Vitis_Accel_Examples/tree/master/emulation.

AXI4 Memory Map External Traffic through Python/C++

AXI4 memory map external traffic support has the following limitations:

  • Only transaction-level granularity is supported.
  • Re-ordering of transactions is not supported.
  • Parallel Read, Write transactions are not supported (transactions will be serialized).
  • Unaligned transactions are not supported.

The following figure shows the high-level design.

Figure 2: AXI4 Memory Map External Traffic Design

Use Cases

The use cases include the following:

  • Emulate an AXI4 memory map master/slave through an external process such as Python/C++. This lets you emulate the design quickly without investing resources in developing an RTL AXI4 master or slave.
  • A chip-to-chip connection between two FPGAs can be emulated with AXI4 memory map inter-process communication.

API/Pseudo Code

A single instance of an AXI4 memory map packet is used for the complete transaction, in line with how payloads are used in the Xilinx SystemC modules. For the AXI4 master, a b_transport(aximm_packet) API is provided; after the call, the aximm_packet is updated with the response given by the AXI4 slave. For the AXI4 slave, sample_transaction() and send_response(aximm_packet) APIs are provided.

The following code snippets show the API usage in the context of C++.

  • Code snippet for C++ Master:
    auto payload = generate_random_transaction(); //Custom random transaction generator. Users can configure AXI properties on the payload.
    /* Or User can set the AXI transaction properties as follows
    payload->set_addr(std::rand() * 4);
    payload->set_len(1 + (std::rand() % 255));
    payload->set_size(1 << (std::rand() % 3));
    */
     
    master_util.b_transport(*payload.get(), std::rand() % 0x10); //A blocking call. The response is updated in the same payload. Each AXI MM transaction uses the same payload for the whole transaction
    std::cout << "-----------Transaction Response------------" << std::endl;
    std::cout << *payload << std::endl; //Prints AXI transaction info
  • Code snippet for C++ Slave:
    auto& payload = slave_util.sample_transaction(); // Sample the transaction
     
    //If it is read transaction, give read data
    if(payload.cmd() == xtlm_ipc::aximm_packet_command_READ)
    {
        rd_resp.resize(payload.len()*payload.size());
        std::generate(rd_resp.begin(), rd_resp.end(), []()
        {   return std::rand()%0x100;});
    }
     
    //Set AXI response (for Read & Write)
    payload.set_resp(std::rand()%4);
    slave_util.send_response(payload); //Send the response to the master

The following code snippets show the API usage in the context of Python.

You need to set PYTHONPATH as follows:

  • For example, on C Shell:
    setenv PYTHONPATH $XILINX_VIVADO/data/emulation/hw_em/lib/python:$XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/python
  • Code snippet of Python Master:
    aximm_payload = xtlm_ipc.aximm_packet()
    random_packet(aximm_payload) # Custom function to set AXI Properties randomly
    #Or user can set AXI properties as required
    #aximm_payload.addr = int(random.randint(0, 1000000)*4)
    #aximm_payload.len = random.randint(1, 64)
    #aximm_payload.size = 4
     
    master_util.b_transport(aximm_payload)
    #After this call aximm_payload will have updated response as set by the AXI Slave.
  • Code snippet of Python Slave:
    aximm_payload = slave_util.sample_transaction()
    aximm_payload.resp = random.randint(0,3)
    if not aximm_payload.cmd: # if it is a read transaction, set random data
        tot_bytes = aximm_payload.len * aximm_payload.size
        for i in range(0, int(tot_bytes/SIZE_OF_EACH_DATA_IN_BYTES)):
            aximm_payload.data += bytes(bytearray(struct.pack(">I", random.randint(0,60000)))) # Binary data should be aligned with C struct
             
    slave_util.send_resp(aximm_payload)

AXI4 Memory Map I/O Limitations in the Platform

The following shows the AXI4 memory map I/O limitations in the platform:

  • During platform development, AXI4 memory map I/O can be connected to any memory/slave.
  • A master AXI4 memory map I/O cannot connect to a kernel, because the kernel cannot provide an additional slave interface.
  • AXI4 memory map Slave I/O can be used without any restrictions.
  • AXI4 memory map Master I/O can be used where data needs to be driven from external process to memory/slave.

XO Usage

The use cases of the AXI4 memory map I/O XO differ from those of the AXI4-Stream I/O XO. AXI4 memory map XOs have a few usage limitations during the Vitis link stage, listed below:

  • Only AXI4 memory map Master I/O can be used.
  • AXI4 memory map Master I/O can connect only with available slaves in the platform.
  • AXI4 memory map Master I/O cannot communicate with kernel in the design.

For XO usage during link stage:

  • To generate the XO, use the script available at $XILINX_VITIS/data/emulation/XO/scripts/aximm_xo_creation.sh
  • The required XO configuration can be generated using the script as shown below:
    $XILINX_VITIS/data/emulation/XO/scripts/aximm_xo_creation.sh --address_width <adr_width> --data_width <data_width> --id_width <id_width> --output_path <output_path>.xo 
    $XILINX_VITIS/data/emulation/XO/scripts/aximm_xo_creation.sh --address_width 64 --data_width 64 --id_width 4 --output_path sim_ipc_aximm_master.xo
  • After generating the XO, it can be used in the design with a connectivity configuration such as the following (sample usage; the actual connection depends on your requirements):
    [connectivity]
    nk=sim_ipc_aximm_master:1:aximm_master
    sp=aximm_master.M_AXIMM:HBM[0]

Running Traffic Generators

After generating an external process binary as shown above using the headers and sources available at $XILINX_VIVADO/data/emulation/ip_utils/xtlm_ipc/xtlm_ipc_v1_0/<supported_language>, you can run the emulation using the following steps:

  1. Launch the Vitis hardware emulation or Vivado simulation using the standard process and wait for the simulation to start.
  2. From one or more separate terminals, launch the external traffic generator process (Python, C++, or C).
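For example, for an embedded hardware emulation run, the two steps might look like the following sketch, where my_traffic_gen.py is a hypothetical script built on the Python utilities shown earlier:

# Terminal 1: launch emulation and wait for the simulation to start
./launch_hw_emu.sh

# Terminal 2: run the external traffic generator process
python3 my_traffic_gen.py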