Performance Analysis of AI Engine Graph Application

A system-level view of program execution helps identify correctness and performance problems. Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using explicit print statements or traditional interactive debuggers, so a systematic way of collecting system-level traces of program execution is needed. The AI Engine architecture has direct support for the generation, collection, and streaming of events as trace data during simulation, hardware emulation, or hardware execution.

Note: The event trace feature for hardware execution is an early access feature.

AI Engine Simulation-Based Performance Analysis

In simulation, value change dump (VCD) files can be used to view time-stamped events, the different event types, and the data associated with each event. VCD files provide a detailed dump of the simulated hardware signals. Additionally, a profile summary provides annotated details of the overall application performance.

AI Engine Simulation-Based Value Change Dump

In the simulation framework, the AI Engine simulator can generate a detailed dump of the hardware signals in the form of value change dump (VCD) files. A defined set of abstract events describes the execution of a multi-kernel AI Engine program. VCD file output is enabled using the aiesimulator --dump-vcd option.

After simulation, or emulation, the VCD file can be processed into events and viewed on a timeline in the Vitis™ analyzer. The events contain information such as time stamps, different event types, and data associated with each event. This information can be correlated to the compiler generated debug information. This includes program counter values mapped to function names and instruction offsets, and source level symbolic data offsets for memory accesses.

The abstract AI Engine events are independent of the VCD format and can also be extracted directly from the hardware. The event traces can be produced as plain text, comma-separated values (CSV), common trace format (CTF), or waveform database (WDB) files, and the generated event trace data can be viewed in the Vitis analyzer.

VCD File Generation

To generate a VCD file from the Vitis IDE, right-click your AI Engine graph project in the Project Explorer and select Run As > Run Configurations as described in Creating the AI Engine Graph Project and Top-Level System Project. This opens the run configuration panel for the current project.

Figure 1: Vitis IDE to Enable VCD File Generation

Select the AI Engine Emulator option and double-click to open a New_configuration. Select the Generate Trace check box to enable trace capture, and select the VCD Trace button. By default, this produces a VCD dump in a file called foo.vcd in the current directory. You can rename the file if you like.

The VCD file can also be generated by invoking the AI Engine simulator with the --dump-vcd <filename> option on the command line. The VCD file is written to the directory in which the simulation is run. Assuming that the program is compiled using the AI Engine compiler, the simulator can be invoked in a shell with the VCD option.

$ aiesimulator --pkg-dir=./Work --dump-vcd=foo

This command produces the VCD file (foo.vcd) which is written to the current directory.

AI Engine Trace from VCD

The vcdanalyze utility is provided to generate an AI Engine event trace from the VCD file. This process is integrated into the Vitis tool flow automatically. From the Vitis IDE, after a simulation run has finished capturing AI Engine events, you can right-click on the project from the Project Explorer and select Analyze AIE Events. The trace data is produced under the current project at Traces/AIE_AXI_Trace and various views are automatically loaded into the current project.

The raw event trace under the directory Traces/AIE_AXI_Trace/ctf/events.txt should look like the following:

time=1741000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=65536,data1=0,tlast=0
time=1742000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=196610,data1=0,tlast=0
time=1743000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=327684,data1=0,tlast=0
time=1743000,event=CORE_RESET,col=1,row=0
time=1744000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=458758,data1=0,tlast=0
time=1745000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=589832,data1=0,tlast=0
time=1746000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=720906,data1=0,tlast=0
time=1747000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=851980,data1=0,tlast=0
time=1748000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=983054,data1=0,tlast=0
time=1749000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=1,data1=0,tlast=0
time=1750000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=131075,data1=0,tlast=0
time=1751000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=262149,data1=0,tlast=0
time=2186000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2190000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2194000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2198000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2202000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b2
time=2206000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b3

The following command produces the AI Engine trace data for foo.vcd in text form in the ./trdata/events.txt file.

vcdanalyze -vcd foo.vcd

TIP: Use vcdanalyze -h to get help for the command.

The following command produces a CSV file from the AI Engine trace data from the foo.vcd file.

vcdanalyze -vcd=foo.vcd -csv

The following command produces the waveform data files from the AI Engine trace data from the foo.vcd file.

vcdanalyze -vcd foo.vcd -wdb

Viewing the Run Summary in the Vitis Analyzer

After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.

During simulation of the AI Engine graph, the AI Engine simulator captures performance and activity metrics and writes the report to the output directory ./aiesimulator_output. The AI Engine simulator run_summary is named default.aierun_summary.

The run_summary can be viewed in the Vitis analyzer. The summary contains a collection of reports capturing the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary use the following command:

vitis_analyzer ./aiesimulator_output/default.aierun_summary

The Vitis analyzer opens displaying the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis analyzer, see Using the Vitis Analyzer in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).

TIP: You should set the $AIE_COMPILER_WORKDIR environment variable prior to launching hardware emulation. This ensures that the correct path is set in the run_summary file which is used by the Vitis analyzer to locate, process, and display trace data. If the environment variable is not specified, then the Vitis analyzer looks for the ./Work directory inside the current directory and uses the first one found.

The listed reports include:

Summary
This is the top level of the report, providing details of the run, such as the date, tool version, and the command line used to launch the simulator.
Profile
When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels, presenting a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphical presentations of the metric data.

The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention, which is useful for system-level performance tuning and debug. System performance is presented in terms of latency (the number of cycles taken to execute the system) and throughput (data/time taken). Sub-optimal system performance forces you to examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the Profile Summary report is shown:

Trace
Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis analyzer.

Features of the trace report include:

  • Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
  • There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
  • By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
  • The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles, so this section does not necessarily match the lock requests of the core shown in the left pane of the image.
  • If a lock is not released, a red bar extends through the end of simulation time.
  • Clicking the left or right arrows takes you to the start and end of a state, respectively.
  • The data view shows the data flowing through stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, where one packet might get delayed behind another packet when sharing the same stream channel.

Using Trace Compass to Visualize AI Engine Traces

The CTF-based AI Engine trace can be visualized using the free eclipse tool called Trace Compass. This tool is already integrated into the Vitis IDE as a plug-in, allowing you to visualize your traces from the Vitis IDE main panel.

Note: You can download standalone versions and documentation of Trace Compass from their website at http://tracecompass.org.
  1. In the Vitis IDE, after capturing trace data during simulation, right-click on your project and select Analyze AIE Events. This imports the event data from the simulation and creates various views to analyze them. Your screen could look as follows:

  2. To see various views, toggle between the Statistics, Data View, System View, and Function View tabs.

Trace Views

The trace reports support several views:

  • The top window shows a textual list of events in chronological order with various event types and other relevant information. The top row in each column allows you to filter events based on textual patterns. In the bottom window, there are multiple tabs providing different views relating to the execution.
  • The Statistics tab shows the aggregate event statistics based on the selected set of events or a time slice.
  • The System View tab represents the state of system resources such as AI Engines, locks, and DMAs.
  • The Function View tab represents the state of various kernels executing on an AI Engine (core).
  • The Data View tab represents the state of data flowing through the stream switch network.

The following are screen shots of the function view, system view, and data view. The top bar of a view has several options: A legend explaining the colors, zoom in and zoom out, going to beginning and end of state, and correlating it to a textual event that causes the state change. Each view consists of a series of aligned timelines depicting the state of a certain resource or program object. Various events are represented in each timeline. You can hover over the timeline to see the information collected. Clicking on the timeline in one view creates a time bar that allows you to see the corresponding events at that time in other views.

Figure 2: System View with AI Engines, Locks, and DMAs


As shown in the system view, there are three sections: ActiveCores, ActiveDMA, and Locks. If there are PL blocks used in the application, the system view also shows a fourth section, ActivePLPorts. By using lock IDs in the ActiveCores, ActiveDMA, and Locks sections you can identify how the AI Engines and DMAs interact with one another by acquiring and releasing locks. The currently executing function name is shown when hovering over the Core(0,0).pc bar. The color coding is shown in the legend that opens when you click the legend icon (to the left of the home icon, which resets the timescale to its default). Clicking the left or right arrows takes you to the beginning and end of a state, respectively. A text window shows you the event that caused the state change. In this example, all locks are properly acquired and released. If a lock is not released, you will see a red bar that extends through the end of simulation time.

Figure 3: Legend
Figure 4: Function View Showing Running and Stalled Kernels and Main on Each AI Engine


The function view is most useful when analyzing the application from the program standpoint. There is a separate timeline for each kernel mapped to an AI Engine (core), and the view shows when the kernel is executing (blue) or stalled. A detailed pop-up window with details such as the types of stall and duration comes up when you hover over the stalls in the function view.

Figure 5: Data View Showing Data Flowing in the Stream Switch Network


The data view shows the data flowing through the stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, when one packet might get delayed behind another packet when sharing the same stream channel.

Profiling the AI Engine

You can obtain profiling data when you run your design in simulation or in hardware at run time. Analyzing this data helps you gauge the efficiency of the kernels, the stall and active times associated with each AI Engine, and pinpoint the AI Engine kernel whose performance might not be optimal. This also allows you to collect data on design latency, throughput, and bandwidth.

You have two options to gather this information:
  1. Use run-time event APIs in your PS host code
  2. Use performance counters built into the hardware using a compile time option

Run-Time Event API for Performance Profiling

You can collect profile statistics of your design by calling event APIs in your PS host code. These event APIs are available both during simulation and when you run the design in hardware.

The AI Engine has hardware performance counters that can be configured to count hardware events for measuring performance metrics. You can use the run-time event API together with the graph control API to profile certain performance metrics during a controlled period of graph execution. The event API supports platform I/O (PLIO) and GMIO ports to measure performance metrics such as I/O port bandwidth, graph throughput, and graph latency. These are preliminary use cases and more capability will be provided in the near future.

Note: The run-time event API is not applicable to PL GMIO ports.

Profiling Platform I/O Port Bandwidth

The bandwidth of a platform I/O port can be defined as the average number of bytes transferred per second, which can be derived as the total number of bytes transferred divided by the time when the port is transferring or is stalled (for example, due to back pressure). The following example shows how to profile I/O port bandwidth using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.

gr.init();
event::handle handle = event::start_profiling(plio_out, event::io_total_stream_running_to_idle_cycles);
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle);
double bandwidth = (double)256 * sizeof(int32) / (cycle_count * 1e-9); //byte per second

In the example, after the graph is initialized, the event::start_profiling is called to configure the AI Engine to count the accumulated clock cycles between the stream running event and the stream idle event. In other words, it counts the number of cycles when the stream port is in running or in stall state. The first argument in event::start_profiling can be a PLIO or a GMIO object, in this case, it is plio_out. The second argument is event::io_profiling_option enumeration, and in this case, the enumeration is set to event::io_total_stream_running_to_idle_cycles. event::start_profiling returns a handle, which will be used later to read the counter value and to stop the profile. After the graph finishes eight iterations, you can call event::read_profiling to read the counter value by supplying the handle. After profiling is done, it is recommended to stop the performance counter by calling event::stop_profiling with the handle so the hardware resources configured to do the profile can be released for other uses. Finally, the bandwidth is derived by dividing the total number of bytes transferred (256 × sizeof(int32)) by the time spent when the stream port is active (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).

Profiling Graph Throughput

Graph throughput can be defined as the average number of bytes produced (or consumed) per second. The following example shows how to profile graph throughput using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.

gr.init();
event::handle handle = event::start_profiling(plio_out, event::io_stream_start_to_bytes_transferred_cycles, 256*sizeof(int32));
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle);
double throughput = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // byte per second

In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event to the event that indicates 256 × sizeof(int32) bytes have been transferred, assuming that the stream stops right after the specified number of bytes are transferred. If the stream continues after that number of bytes has been transferred, the counter continues to run and never stops. The first argument in event::start_profiling is plio_out, the second argument is set to event::io_stream_start_to_bytes_transferred_cycles, and the third argument specifies the number of bytes to be transferred before stopping the counter. The graph throughput is derived by dividing the total number of bytes produced in eight iterations (256 × sizeof(int32)) by the time spent from the first output data to the last output data (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).

Profiling Graph Latency

Graph latency can be defined as the time spent from receiving the first input data to producing the first output data. The following example shows how to profile graph latency using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and gmio_in is the GMIO object connecting to the graph input port.

gr.init();
event::handle handle = event::start_profiling(gmio_in, plio_out, event::io_stream_start_difference_cycles);
gr.run(8);
gr.wait();
long long latency_in_cycles = event::read_profiling(handle);
event::stop_profiling(handle);

In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event of the input I/O port to the stream start event of the output I/O port. The first and second arguments in event::start_profiling can be GMIO or PLIO ports, representing the input and the output I/O port, respectively. In this example, gmio_in is the input I/O port and plio_out is the output I/O port. The third argument is set to the event::io_stream_start_difference_cycles enumeration. The counter value directly indicates the graph latency in cycles.

Profiling Using Performance Counters

You can compile your AI Engine design with performance counters that can be read and collected at run time while the design is executing in hardware. These counters are programmed in the hardware to gather the following statistics for each active AI Engine in your design:

  1. Active Cycles – the total clock cycles that a tile has been active
  2. Stall Cycles – the total clock cycles that a tile has stalled in one of four ways: memory, stream, cascade, and lock

To enable this feature, you need to specify a compile time option as well as turn it on at run time. To compile the performance counters into your design, use the following option.

aiecompiler --aie-heat-map

When these counters are in your design, you can turn on their capture at run time using the following code in an xrt.ini file.

[Debug]
aie_profile = true

The data can then be viewed and analyzed using the Vitis analyzer in a few different ways, including heat map, histogram, and profile summary. Analyzing this profile will help you determine the active and stall times associated with each AI Engine, and pinpoint the AI Engine whose performance might not be optimal as the design runs on hardware. The following sections include a more detailed description of the two profile views supported in the Vitis analyzer.

AI Engine Heat Maps and Histograms

The AI Engine heat map displays the active and stall cycles in correspondence with the Array View of your design. AI Engines can be highlighted based on their active and stall cycles.

The heat map and histogram view is displayed in the Performance Metrics tab when you open the Run Summary in the Vitis analyzer.

Figure 6: Heat Map and Histogram View

The preceding figure shows a heat map and histogram view for an example design that contains 64 tiles. The graph of tiles can be categorized based on utilization % or stall time, as selected above the graph. Histogram bins can be added and modified using the settings tab in the upper right above the graph, creating a customized view of the tiles and how they were utilized during the run of your design. This enables you to identify the lowest utilized (or most stalled) tiles in your design, pinpointing bottlenecks to optimize and potentially improve the overall performance of your design.

Figure 7: Array View

The preceding image shows an array view, displayed in the Array tab when you open the Run Summary in the Vitis analyzer. If this tab is not displayed, then go to the Summary tab and provide the AI Engine compiler Work directory by clicking on the Set AI Engine compile summary link. The tiles selected in the histogram table are cross-probed and highlighted here. A table is also provided, listing tile-specific information, such as kernel name, source file, and specified run-time ratio. See Viewing Compilation Results in the Vitis Analyzer for more information.

Profile Summary

Kernel utilization metrics can be visualized in a graphical format in the Profile Summary tab of the Vitis analyzer. Note that data for a maximum of only 10 tiles can be displayed in this view.

Figure 8: Profile Summary

The preceding image shows the profile summary for a design that includes AI Engines. Graphs are shown for Active Time (ms), and Usage (%). A tabular summary is also provided below these graphs listing the overall active and stall times, active utilization (%), and clock frequency (in MHz). To select/deselect different tiles to show in the graphs, use the check boxes under the Chart column.

Guidance Summary

Guidance Summary information is displayed in the Run Guidance tab.

Figure 9: Run Guidance Tab

The preceding image shows the run guidance for your hardware run, providing suggestions on ways to potentially improve the performance of your design. Two rules are provided for AI Engines: AIE_STALL and AIE_UTILIZATION. The rule AIE_STALL checks for tiles that have stall times of less than 20% of active times. If this threshold is not met, then a resolution is provided, including a link to more information on how to potentially improve your design.

The rule AIE_UTILIZATION checks for tiles that have utilization greater than 50%. Similar to AIE_STALL, if this threshold is not met, then a resolution is provided, including a link to more information on how to potentially improve your design.

To change the threshold values for either rule, click on the threshold value link, and a dialog box pops up, allowing you to modify the value to any value between 0 and 100. See Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) for more information on how to use and interpret guidance results for all types of Vitis tools flows.

Event Tracing in Hardware [Early Access]

In hardware you must prepare the design when compiling the AI Engine graph application to ensure the libadf.a supports capturing trace data at run time. Event tracing in hardware builds begins with the aiecompiler command, using the --event-trace option. This option sets up the hardware device to capture AI Engine run-time trace data by informing the v++ linker to add and configure debugging and profiling IP (DPA) to the PL region of the device.

There are three parts to the event trace flow as listed:
  • Event trace build flow
  • Running the design in hardware and capturing trace data at run time
  • Viewing and analyzing the trace data using the Vitis analyzer

Event Trace Build Flow

The event trace build flow is as follows:
  1. Compile the graph with --event-trace and other appropriate flags.

    An example AI Engine compiler command for event tracing is as follows:

    aiecompiler --verbose --pl-freq=100 --workdir=./myWork \
    --event-trace=functions --num-trace-streams=1 --include="${XILINX_HLS}/include" \
    --include="./" --include="./src" --include="./src/kernels" --include="./data" \
    ./src/graph.cpp
    Note: The preceding example illustrates compiling the design with the --event-trace=functions option, which captures function transitions on the AI Engine.
  2. Compile and link the design using the Vitis compiler.

    After compiling the AI Engine graph application, you must build the other elements of the system as described in Integrating the Application Using the Vitis Tools Flow. With --event-trace enabled in the libadf.a file from the AI Engine compiler, the system hardware generated by the Vitis compiler includes the compiled ELF file for the PS application, the compiled ELF files for the AI Engine processors, and the XCLBIN file for the PL. These are the elements you need to run the system on hardware.

    After linking to create the device binary, run the Vitis compiler --package step to create the sd_card folder and files needed to boot the device, as described in Packaging. This step packages everything needed to build the BOOT.BIN file for the system. When packaging the boot files for the device, you must also specify the --package.defer_aie_run to load the AI Engine application with the ELF file, but wait to run it until graph.run directs it, as described in Graph Execution Control.
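A hedged sketch of such a package command follows. The platform name, input file names, and output names are placeholders; refer to Packaging for the complete set of options required by your platform:

```shell
# Sketch only: platform, file names, and paths below are placeholders.
v++ --package --target hw \
    --platform xilinx_vck190_base_202020_1 \
    --package.defer_aie_run \
    --package.boot_mode sd \
    --package.out_dir ./sd_card \
    my_system.xclbin libadf.a -o my_system_package.xclbin
```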

Running the Design in Hardware and Capturing Trace Data at Run Time

XRT and XSDB are two ways to run the design on the Arm processor in hardware and capture trace data at run time. XRT is supported on the Linux platform, whereas XSDB is supported on both bare metal and Linux platforms. Details of the steps involved in both flows are described as follows.

The Xilinx Software Debugger (xsdb) flow is as follows:

  1. Set up xsdb as described below to connect to the device hardware.

    When running the application, the trace data is stored in DDR memory by the debugging and profiling IP. To capture and evaluate this data, you must connect to the hardware device using xsdb. This command is typically used to program the device and debug bare-metal applications. Connect your system to the hardware platform or device over JTAG, launch the xsdb command in a command shell, and run the following sequence of commands:

    xsdb% connect
    
    xsdb% source $::env(XILINX_VITIS)/scripts/vitis/util/aie_trace_profile.tcl
    xsdb% aietrace::initialize $PROJECT/xclbin.link_summary 0x800000000 0x80000
    
    # Execute the PS host application (.elf) on Linux
    ## After the application completes processing.
    xsdb% aietrace::offload

    where:

    • connect: Launches the hw_server and connects xsdb to the device.
    • source $::env(XILINX_VITIS)/scripts/vitis/util/aie_trace_profile.tcl: Sources the Tcl trace command to set up the xsdb environment.
    • aietrace::initialize PROJECT/xclbin.link_summary 0x800000000 0x80000: Initializes the DPA IP to begin capturing trace data. The values 0x800000000 and 0x80000 specify the starting DDR memory address at which to write the trace data and the amount of data to store.
      IMPORTANT: The DDR memory address used in aietrace::initialize should be a high address to limit any chance of running into memory conflicts with the OS on the xilinx_vck190_base_202020_1 platform or the application. For a custom platform, make sure you know how much DDR memory is being used and plan accordingly.
    • aietrace::offload: Instructs the DPA IP to offload the trace event data from the DDR memory. This command should wait until after the application completes. The data is written to the event_trace<N>.txt file in the current working directory from where xsdb was launched. An aie_trace_profile.run_summary file is also created. It can be opened in the Vitis analyzer as explained in Viewing the Run Summary in the Vitis Analyzer.
      TIP: If you do not remove the event_trace<N>.txt when running the graph again, the old files will be overwritten by the new run results.
  2. Run the design on hardware to trace hardware events.
  3. Offload the captured trace data.
  4. Use the Vitis analyzer to import and analyze data.

The Xilinx Runtime (XRT) flow is as follows:

  1. Burn the generated sd_card.img to the physical SD card.
  2. Create the xrt.ini file in the sd_card folder as described in this section to enable the XRT flow.

    An example xrt.ini file is shown in the following.

    [Debug]
    aie_trace=true
    aie_trace_buffer_size=10M
  3. Run the design on hardware to trace hardware events.
  4. Copy the captured trace data from the sd_card folder to your design at the same level as the design Work directory. The trace data is generated in the same location as the host application on the SD card.
  5. Use the Vitis analyzer to import and analyze data.

When running the application, the streaming interface between the AI Engine and the System DPA IP (highlighted in orange in the previous image) can become overloaded with event trace data captured from the application. In this case, you might need to increase the number of available streaming channels to capture data with the --num-trace-streams option to the AI Engine compiler.

Note: Implementing the DPA IP in hardware consumes device resources and so can impact the availability of resources for your PL kernels, and other elements of your design. Refer to Using Multiple Event Trace Streams for more information.

Viewing and Analyzing the Trace Data Using Vitis Analyzer

The Vitis analyzer should be used to view and analyze trace data. After the trace data has been captured either using XRT or XSDB, you should have all the data needed to open the Event Trace view in the Vitis analyzer.

Figure 10: Event Trace in Vitis Analyzer

Open the run summary file with the Vitis analyzer to view event trace data. An example is shown:

vitis_analyzer ./aie_trace_profile.run_summary

Limitations

The event trace feature is an Early Access feature. You might see timing synchronization issues between the kernels on some designs, where the execution times of the kernels are skewed by some clock cycles.

Using Multiple Event Trace Streams

As AI Engine designs grow larger, tracking the events produced while running the design can be useful for identifying performance bottlenecks and for understanding how the overall AI Engine array is operating. However, larger designs produce more and more events, which can create a bottleneck as the trace IP records them. To capture all of this data effectively and quickly, consider instantiating multiple event trace streams. Multiple streams spread out the event data coming from the AI Engine, letting the trace IP store it correctly and in a timely manner.

To increase the trace streams in a design, use the aiecompiler --num-trace-streams option, which can have a value in the range of 1 to 16. The following table provides guidance on the number of trace streams to use, depending on the size of the design.

Table 1. Number of Event Trace Streams Methodology

  Number of AI Engines    Recommended Number of Streams
  Less than 10            1
  Between 10 and 20       2
  Between 20 and 40       4
  Between 40 and 80       8
  Larger than 80          16
  1. It is recommended to only use up to 16 trace streams due to the resource utilization impact on the PL and DMA channel resources.

After the change to the AI Engine compiler option, recompile and re-link the XCLBIN file and libadf.a using the Vitis compiler with a config file as described in Linking the System.

v++ -l --config system.cfg ...

The config file includes the following advanced example options:

[advanced]
param=compiler.aieTraceClockSelect=fastest

where compiler.aieTraceClockSelect is the trace clock setting. The value is default or fastest. default is 150 MHz and fastest is 300 MHz.

Recompiling the graph and relinking the XCLBIN file prepares the tool to instantiate additional trace IP into the design to accommodate the added trace events being captured.

Note: Using multiple trace streams consumes more of the programmable logic resources on the device, depending upon how many streams, what kind of events are being captured, and how many tiles are being analyzed.