Performance Analysis of AI Engine Graph Application
A system-level view of program execution can be helpful in identifying correctness and performance problems. Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using explicit print statements or traditional interactive debuggers. A systematic way of collecting system-level traces of program execution is needed. The AI Engine architecture has direct support for the generation, collection, and streaming of events as trace data during simulation, hardware emulation, or hardware execution.
AI Engine Simulation-Based Performance Analysis
In simulation, to view time stamped events, different event types, and data associated with each event, value change dump (VCD) files can be used. VCD files provide a detailed dump of the simulated hardware signals. Additionally, a profile summary provides annotated details for the overall application performance.
AI Engine Simulation-Based Value Change Dump
In the simulation framework, the AI Engine simulator can generate a detailed dump of the hardware signals in the form of value change dump (VCD) files. The execution of a multi-kernel AI Engine program is described in terms of a defined set of abstract events. Output of a VCD file is enabled using the aiesimulator --dump-vcd command.
After simulation or emulation, the VCD file can be processed into events and viewed on a timeline in the Vitis™ analyzer. The events contain information such as time stamps, event types, and data associated with each event. This information can be correlated with the compiler-generated debug information, which includes program counter values mapped to function names and instruction offsets, and source-level symbolic data offsets for memory accesses.
The abstract AI Engine events are independent of the VCD format and are directly extracted from the hardware. The event traces can be produced as plain text, comma-separated values (CSV), common trace format (CTF), or waveform database (WDB) files, and the generated event trace data can be viewed in the Vitis analyzer.
VCD File Generation
To generate a VCD file from the Vitis IDE, right-click your AI Engine graph project in the Project Explorer and open the run configuration as described in Creating the AI Engine Graph Project and Top-Level System Project. This opens the run configuration panel for the current project.
Select the AI Engine Emulator option and double-click to open a New_configuration. Select the Generate Trace check box to enable trace capture, and select the VCD Trace button. By default, this produces a VCD dump in a file called foo.vcd in the current directory. You can rename the file if you like.
The VCD file can also be generated by invoking the AI Engine simulator with the --dump-vcd <filename> option on the command line. The VCD file is generated in the same directory where the simulation is run. Assuming that the program is compiled using the AI Engine compiler, the simulator can be invoked in a shell with the VCD option.
$ aiesimulator --pkg-dir=./Work --dump-vcd=foo
This command produces the VCD file (foo.vcd) which is written to the current directory.
AI Engine Trace from VCD
The vcdanalyze utility is provided to generate an AI Engine event trace from the VCD file. This process is integrated into the Vitis tool flow automatically. From the Vitis IDE, after a simulation run has finished capturing AI Engine events, you can right-click the project in the Project Explorer and select Analyze AIE Events. The trace data is produced under the current project at Traces/AIE_AXI_Trace, and various views are automatically loaded into the current project.
The raw event trace under the directory Traces/AIE_AXI_Trace/ctf/events.txt should look like the following:
time=1741000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=65536,data1=0,tlast=0
time=1742000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=196610,data1=0,tlast=0
time=1743000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=327684,data1=0,tlast=0
time=1743000,event=CORE_RESET,col=1,row=0
time=1744000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=458758,data1=0,tlast=0
time=1745000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=589832,data1=0,tlast=0
time=1746000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=720906,data1=0,tlast=0
time=1747000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=851980,data1=0,tlast=0
time=1748000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=983054,data1=0,tlast=0
time=1749000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=1,data1=0,tlast=0
time=1750000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=131075,data1=0,tlast=0
time=1751000,event=FROM_PL,name=tl.me.shim.tile_0_0.pl_interface.pl_to_shim0.data0,col=0,streamid=0,data0=262149,data1=0,tlast=0
time=2186000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2190000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2194000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b6
time=2198000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b7
time=2202000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b2
time=2206000,event=DM_WRITE_REQ,col=0,row=0,port=tl.me.array.tile_0_1.mm.dm.port_AXI_write_b3
The following command produces the AI Engine trace data for foo.vcd in text form in the ./trdata/events.txt file.
vcdanalyze -vcd foo.vcd
Use vcdanalyze -h to get help for the command. The following command produces a CSV file from the AI Engine trace data in the foo.vcd file.
vcdanalyze -vcd=foo.vcd -csv
The following command produces waveform data files from the AI Engine trace data in the foo.vcd file.
vcdanalyze -vcd foo.vcd -wdb
Viewing the Run Summary in the Vitis Analyzer
After running the system, whether in simulation, hardware emulation, or in hardware, a run_summary report is generated when the application has been properly configured.
During simulation of the AI Engine graph, the AI Engine simulator captures performance and activity metrics and writes the report to the output directory ./aiesimulator_output. The AI Engine simulator run_summary is named default.aierun_summary.
The run_summary can be viewed in the Vitis analyzer. The summary contains a collection of reports capturing the performance profile of the AI Engine application as it runs. For example, to open the AI Engine simulator run summary, use the following command:
vitis_analyzer ./aiesimulator_output/default.aierun_summary
The Vitis analyzer opens displaying the Summary page of the report. The Report Navigator view of the tool lists the different reports that are available in the summary. For a complete understanding of the Vitis analyzer, see Using the Vitis Analyzer in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).
Set the $AIE_COMPILER_WORKDIR environment variable prior to launching hardware emulation. This ensures that the correct path is set in the run_summary file, which is used by the Vitis analyzer to locate, process, and display trace data. If the environment variable is not specified, the Vitis analyzer looks for the ./Work directory inside the current directory and uses the first one found.
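For example, assuming the AI Engine compiler work directory is ./myWork (the path is a placeholder for your own work directory), set the variable in the shell before launching hardware emulation:
export AIE_COMPILER_WORKDIR=$(pwd)/myWork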
The listed reports include:
- Summary: The top level of the report, listing the details of the run, such as date, tool version, and the command line used to launch the simulator.
- Profile: When the aiesimulator --profile option is specified, the simulator collects profiling data on the AI Engine graph and kernels, presenting a high-level view of the AI Engine graphs and the kernels mapped to processors, with tables and graphic presentation of metric data. The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine processor/DMA memory stalls, deadlock, interference, critical paths, and maximum contention. This is useful for system-level performance tuning and debug. System performance is presented in terms of latency (number of cycles taken to execute the system) and throughput (data/time taken). Sub-optimal system performance forces you to examine and control (through constraints) mapping and buffer packing, stream and packet switch allocation, interaction with neighboring processors, and external interfaces. An example of the Profile Summary report is shown:
- Trace: Problems such as missing or mismatching locks, buffer overruns, and incorrect programming of DMA buffers are difficult to debug using traditional interactive debug techniques. Event trace provides a systematic way of collecting system-level traces for the program events, providing direct support for generation, collection, and streaming of hardware events as a trace. The following image shows the Trace report open in the Vitis analyzer.
Features of the trace report include:
- Each tile is reported. Within each tile the report includes core, DMA, locks, and I/O if there are PL blocks in the graph.
- There is a separate timeline for each kernel mapped to a core. It shows when the kernel is executing (blue) or stalled (red) due to memory conflicts or waiting for stream data.
- By using lock IDs in the core, DMA, and locks sections you can identify how cores and DMAs interact with one another by acquiring and releasing locks.
- The lock section shows the activities of the locks in the tile, both the allocation and release for read and write lock requests. A particular lock can be allocated by nearby tiles, so this section does not necessarily match the lock requests of the core shown in the left pane of the image.
- If a lock is not released, a red bar extends through the end of simulation time.
- Clicking the left or right arrows takes you to the start and end of a state, respectively.
- The data view shows the data flowing through the stream switch network, with slave entry points and master exit points at each hop. This is most useful for finding routing delays, as well as network congestion effects with packet switching, where one packet might get delayed behind another when sharing the same stream channel.
Using Trace Compass to Visualize AI Engine Traces
The CTF-based AI Engine trace can be visualized using Trace Compass, a free Eclipse-based tool. This tool is already integrated into the Vitis IDE as a plug-in, allowing you to visualize your traces from the Vitis IDE main panel.
- In the Vitis IDE, after capturing trace data during simulation, you can right-click your project and select Analyze AIE Events. This imports the event data from the simulation and creates various views to analyze them. Your screen could look as follows:
- To see various views, toggle between the Statistics, Data View, System View, and Function View tabs.
Trace Views
The trace reports support several views:
- The top window shows a textual list of events in chronological order with various event types and other relevant information. The top row of each column allows you to filter events based on textual patterns. In the bottom window, multiple tabs provide different views relating to the execution.
- The Statistics tab shows the aggregate event statistics based on the selected set of events or a time slice.
- The System View tab represents the state of system resources such as AI Engines, locks, and DMAs.
- The Function View tab represents the state of various kernels executing on an AI Engine (core).
- The Data View tab represents the state of data flowing through the stream switch network.
The following are screen shots of the function view, system view, and data view. The top bar of a view has several options: A legend explaining the colors, zoom in and zoom out, going to beginning and end of state, and correlating it to a textual event that causes the state change. Each view consists of a series of aligned timelines depicting the state of a certain resource or program object. Various events are represented in each timeline. You can hover over the timeline to see the information collected. Clicking on the timeline in one view creates a time bar that allows you to see the corresponding events at that time in other views.
As shown in the system view, there are up to four sections: ActiveCores, ActiveDMA, Locks, and, if there are PL blocks used in the application, ActivePLPorts. By using lock IDs in the ActiveCores, ActiveDMA, and Locks sections you can identify how the AI Engines and DMAs interact with one another by acquiring and releasing locks. The currently executing function name is shown when hovering over the Core(0,0).pc bar. The color coding is shown in the legend that opens with a click on the legend icon (left of the home icon, which resets the timescale to its default). Clicking the left or right arrows takes you to the beginning and end of a state, respectively. A text window shows you the event that caused the state change. In this example, all locks are properly acquired and released. If a lock is not released, you will see a red bar that extends through the end of simulation time.
The function view is most useful when analyzing the application from the program standpoint. There is a separate timeline for each kernel mapped to an AI Engine (core), and the view shows when the kernel is executing (blue) or stalled. A pop-up window with details such as the type and duration of the stall appears when you hover over the stalls in the function view.
The data view shows the data flowing through the stream switch network with slave entry points and master exit points at each hop. This is most useful in finding the routing delays, as well as network congestion effects with packet switching, when one packet might get delayed behind another packet when sharing the same stream channel.
Profiling the AI Engine
You can obtain profiling data when you run your design in simulation or in hardware at run time. Analyzing this data helps you gauge the efficiency of the kernels, determine the stall and active times associated with each AI Engine, and pinpoint the AI Engine kernel whose performance might not be optimal. This also allows you to collect data on design latency, throughput, and bandwidth. There are two ways to profile the AI Engine:
- Use run-time event APIs in your PS host code
- Use performance counters built into the hardware using a compile time option
Run-Time Event API for Performance Profiling
You can collect profile statistics of your design by calling event APIs in your PS host code. These event APIs are available both during simulation and when you run the design in hardware.
The AI Engine has hardware performance counters that can be configured to count hardware events for measuring performance metrics. You can use the run-time event API together with the graph control API to profile certain performance metrics during a controlled period of graph execution. The event API supports graph I/O ports (PLIO and GMIO) for measuring performance metrics such as I/O port bandwidth, graph throughput, and graph latency. These are preliminary use cases and more capability will be provided in the near future.
Profiling Platform I/O Port Bandwidth
The bandwidth of a platform I/O port can be defined as the average number of bytes transferred per second, which can be derived as the total number of bytes transferred divided by the time during which the port is transferring or is stalled (for example, due to back pressure). The following example shows how to profile I/O port bandwidth using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.
gr.init();
// Count accumulated cycles while the output stream is running or stalled
event::handle handle = event::start_profiling(plio_out, event::io_total_stream_running_to_idle_cycles);
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counter
double bandwidth = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // bytes per second, assuming 1 GHz
In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the accumulated clock cycles between the stream running event and the stream idle event. In other words, it counts the number of cycles when the stream port is in the running or stalled state. The first argument in event::start_profiling can be a PLIO or a GMIO object; in this case, it is plio_out. The second argument is an event::io_profiling_option enumeration, in this case set to event::io_total_stream_running_to_idle_cycles. event::start_profiling returns a handle, which is used later to read the counter value and to stop the profiling. After the graph finishes eight iterations, you can call event::read_profiling to read the counter value by supplying the handle. After profiling is done, it is recommended to stop the performance counter by calling event::stop_profiling with the handle so the hardware resources configured for profiling can be released for other uses. Finally, the bandwidth is derived by dividing the total number of bytes transferred (256 × sizeof(int32)) by the time spent while the stream port is active (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).
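The 1e-9 factor converts cycles to seconds only for a 1 GHz AI Engine clock. A minimal sketch of the general conversion, where aie_clock_hz is a placeholder for your device's actual AI Engine clock frequency:
double aie_clock_hz = 1e9; // placeholder: set to the actual AI Engine clock frequency in Hz
double bandwidth = (double)256 * sizeof(int32) / ((double)cycle_count / aie_clock_hz); // bytes per second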
Profiling Graph Throughput
Graph throughput can be defined as the average number of bytes produced (or consumed) per second. The following example shows how to profile graph throughput using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and the graph is designed to produce 256 int32 data samples in eight iterations.
gr.init();
// Count cycles from the stream start event until 256*sizeof(int32) bytes have been transferred
event::handle handle = event::start_profiling(plio_out, event::io_stream_start_to_bytes_transferred_cycles, 256*sizeof(int32));
gr.run(8);
gr.wait();
long long cycle_count = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counter
double throughput = (double)256 * sizeof(int32) / (cycle_count * 1e-9); // bytes per second, assuming 1 GHz
In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event to the event that indicates 256 × sizeof(int32) bytes have been transferred, assuming that the stream stops right after the specified number of bytes are transferred. If the stream continues after that number of bytes has been transferred, the counter continues to run and never stops. The first argument in event::start_profiling is plio_out, the second argument is set to event::io_stream_start_to_bytes_transferred_cycles, and the third argument specifies the number of bytes to be transferred before stopping the counter. The graph throughput is derived by dividing the total number of bytes produced in eight iterations (256 × sizeof(int32)) by the time spent from the first output data to the last output data (cycle_count × 1e-9, assuming the AI Engine is running at 1 GHz).
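As a quick sanity check with hypothetical numbers: if cycle_count reads 2048 on a 1 GHz AI Engine, the 256 × sizeof(int32) = 1024 bytes took 2048 ns to emerge, so the throughput is 1024 / (2048 × 1e-9) = 500 MB/s.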
Profiling Graph Latency
Graph latency can be defined as the time spent from receiving the first input data to producing the first output data. The following example shows how to profile graph latency using the event API. In the example, gr is the application graph object, plio_out is the PLIO object connecting to the graph output port, and gmio_in is the GMIO object connecting to the graph input port.
gr.init();
// Count cycles between the input stream start event and the output stream start event
event::handle handle = event::start_profiling(gmio_in, plio_out, event::io_stream_start_difference_cycles);
gr.run(8);
gr.wait();
long long latency_in_cycles = event::read_profiling(handle);
event::stop_profiling(handle); // release the performance counter
In the example, after the graph is initialized, event::start_profiling is called to configure the AI Engine to count the clock cycles from the stream start event of the input I/O port to the stream start event of the output I/O port. The first and second arguments in event::start_profiling can be GMIO or PLIO ports, representing the input and the output I/O port, respectively. In this example, gmio_in is the input I/O port and plio_out is the output I/O port. The third argument is set to the event::io_stream_start_difference_cycles enumeration. The counter value directly indicates the graph latency in cycles.
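The same clock assumption as in the earlier examples applies when converting cycles to time: a hypothetical latency_in_cycles of 500 on a 1 GHz AI Engine corresponds to 500 ns. For a different clock, divide by the actual frequency:
double latency_seconds = (double)latency_in_cycles / aie_clock_hz; // aie_clock_hz is a placeholder for the AI Engine clock frequency in Hz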
Profiling Using Performance Counters
You can compile your AI Engine design with performance counters that can be read and collected at run time while the design is executing in hardware. These counters are programmed in the hardware to gather the following statistics for each active AI Engine in your design:
- Active Cycles: the total clock cycles during which a tile has been active
- Stall Cycles: the total clock cycles during which a tile has stalled in one of four ways: memory, stream, cascade, or lock
To include these counters in your design, compile the graph with the aiecompiler --aie-heat-map option. When the counters are in your design, you can turn on their capture at run time using the following entry in an xrt.ini file.
[Debug]
aie_profile = true
The data can then be viewed and analyzed using the Vitis analyzer in a few different ways, including heat map, histogram, and profile summary. Analyzing this profile will help you determine the active and stall times associated with each AI Engine, and pinpoint the AI Engine whose performance might not be optimal as the design runs on hardware. The following sections include a more detailed description of the two profile views supported in the Vitis analyzer.
AI Engine Heat Maps and Histograms
The AI Engine heat map displays the active and stall cycles in correspondence with the Array View of your design. AI Engines can be highlighted based on their active and stall cycles.
The heat map and histogram view is displayed in the Performance Metrics tab when you open the Run Summary in the Vitis analyzer.
The preceding figure shows a heat map and histogram view for an example design that contains 64 tiles. The graph of tiles can be categorized based on utilization % or stall time, as selected above the graph. Histogram bins can be added and modified using the settings tab in the upper right above the graph, creating a customized view of the tiles and how they were utilized during the run of your design. This enables you to identify the lowest utilized (or most stalled) tiles in your design, pinpointing bottlenecks to optimize and potentially improve the overall performance of your design.
The preceding image shows an array view, displayed in the Array tab when you open the Run Summary in the Vitis analyzer. If this tab is not displayed, then go to the Summary tab and provide the AI Engine compiler Work directory by clicking on the Set AI Engine compile summary link. The tiles selected in the histogram table are cross-probed and highlighted here. A table is also provided, listing tile-specific information, such as kernel name, source file, and specified run-time ratio. See Viewing Compilation Results in the Vitis Analyzer for more information.
Profile Summary
Kernel utilization metrics can be visualized in a graphical format in the Profile Summary tab of the Vitis analyzer. Note that data for a maximum of 10 tiles can be displayed in this view.
The preceding image shows the profile summary for a design that includes AI Engines. Graphs are shown for Active Time (ms), and Usage (%). A tabular summary is also provided below these graphs listing the overall active and stall times, active utilization (%), and clock frequency (in MHz). To select/deselect different tiles to show in the graphs, use the check boxes under the Chart column.
Guidance Summary
Guidance Summary information is displayed in the Run Guidance tab.
The preceding image shows the run guidance for your hardware run, providing suggestions on ways to potentially improve the performance of your design. Two rules are provided for AI Engines: AIE_STALL and AIE_UTILIZATION. The rule AIE_STALL checks for tiles that have stall times of less than 20% of active times. If this threshold is not met, then a resolution is provided, including a link to more information on how to potentially improve your design.
The rule AIE_UTILIZATION checks for tiles that have utilization greater than 50%. Similar to AIE_STALL, if this threshold is not met, then a resolution is provided, including a link to more information on how to potentially improve your design.
To change the threshold values for either rule, click on the threshold value link, and a dialog box pops up, allowing you to modify the value to any value between 0 and 100. See Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393) for more information on how to use and interpret guidance results for all types of Vitis tools flows.
Event Tracing in Hardware [Early Access]
In hardware, you must prepare the design when compiling the AI Engine graph application to ensure the libadf.a supports capturing trace data at run time. Event tracing in hardware builds begins with the aiecompiler command, using the --event-trace option. This option sets up the hardware device to capture AI Engine run-time trace data by informing the v++ linker to add and configure debugging and profiling IP (DPA) in the PL region of the device. The event trace flow includes:
- Event trace build flow
- Running the design in hardware and capturing trace data at run time
- Viewing and analyzing the trace data using the Vitis analyzer
Event Trace Build Flow
- Compile the graph with --event-trace and other appropriate flags. An example AI Engine compiler command for event tracing is as follows:
aiecompiler --verbose --pl-freq=100 --workdir=./myWork \
    --event-trace=functions --num-trace-streams=1 --include="${XILINX_HLS}/include" \
    --include="./" --include="./src" --include="./src/kernels" --include="./data" \
    ./src/graph.cpp
Note: The preceding example illustrates compiling the design with the --event-trace=functions configuration that captures function transitions on the AI Engine.
- Compile and link the design using the Vitis compiler. After compiling the AI Engine graph application, you must build the other elements of the system as described in Integrating the Application Using the Vitis Tools Flow. With --event-trace enabled in the libadf.a file from the AI Engine compiler, the system hardware generated by the Vitis compiler includes the compiled ELF file for the PS application, the compiled ELF files for the AI Engine processors, and the XCLBIN file for the PL. These are the elements you need to run the system on hardware. After linking to create the device binary, run the Vitis compiler --package step to create the sd_card folder and files needed to boot the device, as described in Packaging. This step packages everything needed to build the BOOT.BIN file for the system. When packaging the boot files for the device, you must also specify the --package.defer_aie_run option to load the AI Engine application with the ELF file, but wait to run it until graph.run directs it, as described in Graph Execution Control. An illustrative package command is sketched after this list.
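For illustration only, a minimal sketch of the package step; the platform, file names, and output name are placeholders, and the full option set depends on your platform (see Packaging for the authoritative command):
v++ --package -t hw --platform xilinx_vck190_base_202020_1 \
    ./binary_container.xclbin ./libadf.a \
    --package.defer_aie_run \
    -o aie_graph.xclbin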
Running the Design in Hardware and Capturing Trace Data at Run Time
XRT and XSDB are two ways to run the design on the Arm processor in hardware and capture trace data at run time. XRT is supported on the Linux platform, whereas XSDB is supported on both bare metal and Linux platforms. Details of the steps involved in both flows are described as follows.
The Xilinx Software Debugger (xsdb) flow is as follows:
- Set up xsdb as described below to connect to the device hardware. When running the application, the trace data is stored in DDR memory by the debugging and profiling IP. To capture and evaluate this data, you must connect to the hardware device using xsdb. This command is typically used to program the device and debug bare-metal applications. Connect your system to the hardware platform or device over JTAG, launch the xsdb command in a command shell, and run the following sequence of commands:
xsdb% connect
xsdb% source $::env(XILINX_VITIS)/scripts/vitis/util/aie_trace_profile.tcl
xsdb% aietrace::initialize $PROJECT/xclbin.link_summary 0x800000000 0x80000
# Execute the PS host application (.elf) on Linux
## After the application completes processing.
xsdb% aietrace::offload
where:
- connect: Launches the hw_server and connects xsdb to the device.
- source $::env(XILINX_VITIS)/scripts/vitis/util/aie_trace_profile.tcl: Sources the Tcl trace command to set up the xsdb environment.
- aietrace::initialize $PROJECT/xclbin.link_summary 0x800000000 0x80000: Initializes the DPA IP to begin capturing trace data. The values 0x800000000 and 0x80000 specify the starting DDR memory address where the trace data is written and the amount of data to store. IMPORTANT: The DDR memory address used in aietrace::initialize should be a high address to limit any chance of running into memory conflicts with the OS on the xilinx_vck190_base_202020_1 platform or the application. For a custom platform, make sure you know how much DDR memory is being used and plan accordingly.
- aietrace::offload: Instructs the DPA IP to offload the trace event data from the DDR memory. This command should be run only after the application completes. The data is written to the event_trace<N>.txt file in the current working directory from where xsdb was launched. An aie_trace_profile.run_summary file is also created, which can be opened in the Vitis analyzer as explained in Viewing the Run Summary in the Vitis Analyzer. TIP: If you do not remove the event_trace<N>.txt files when running the graph again, the old files are overwritten by the new run results.
- Run the design on hardware to trace hardware events.
- Offload the captured trace data.
- Use the Vitis analyzer to import and analyze data.
The Xilinx Runtime (XRT) flow is as follows:
- Burn the generated sd_card.img to the physical SD card.
- Create the xrt.ini file in the sd_card folder as described in this section to enable the XRT flow. An example xrt.ini file is shown in the following:
[Debug]
aie_trace=true
aie_trace_buffer_size=10M
- Run the design on hardware to trace hardware events.
- Copy the captured trace data from the sd_card folder to your design at the same level as the design Work directory. The trace data is generated in the same location as the host application on the SD card.
- Use the Vitis analyzer to import and analyze data.
When running the application, the streaming interface between the AI Engine and the System DPA IP (highlighted in orange in the previous image) can become overloaded with event trace data captured from the application. In this case, you might need to increase the number of streaming channels available to capture data with the --num-trace-streams option to the AI Engine compiler.
Viewing and Analyzing the Trace Data Using Vitis Analyzer
The Vitis analyzer should be used to view and analyze trace data. After the trace data has been captured either using XRT or XSDB, you should have all the data needed to open the Event Trace view in the Vitis analyzer.
Open the run summary file with the Vitis analyzer to view event trace data. An example is shown:
vitis_analyzer ./aie_trace_profile.run_summary
Limitations
The event trace feature is an Early Access feature. You might see timing synchronization issues between kernels on some designs; the execution time between kernels can be skewed by some clock cycles.
Using Multiple Event Trace Streams
As AI Engine designs grow larger, tracking the events produced while running the design can be useful for identifying performance bottlenecks as well as understanding how the AI Engine as a whole is operating for the design. With larger designs, more and more events are produced, which can create a bottleneck at the trace IP recording them. To capture all this data effectively and quickly, you should consider instantiating multiple event trace streams. These streams spread out the event data coming from the AI Engine, letting it be stored correctly and in a timely manner.
To increase the trace streams in a design, use the aiecompiler --num-trace-streams option, which can have a value in the range of 1 to 16. The following table provides guidance on the number of trace streams to use, depending on the size of the design.
| Number of AI Engines | Recommended Number of Streams |
|---|---|
| Less than 10 | 1 |
| Between 10 and 20 | 2 |
| Between 20 and 40 | 4 |
| Between 40 and 80 | 8 |
| Larger than 80 | 16 |
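For example, a design with roughly 30 AI Engines calls for four streams. A hypothetical recompilation, mirroring the earlier event trace example, might look like the following:
aiecompiler --event-trace=functions --num-trace-streams=4 --workdir=./myWork ./src/graph.cpp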
After the change to the AI Engine compiler option, recompile and re-link the XCLBIN file and libadf.a using the Vitis compiler with a config file as described in Linking the System.
v++ -l --config system.cfg ...
The config file includes the following advanced example options:
[advanced]
param=compiler.aieTraceClockSelect=fastest
where compiler.aieTraceClockSelect is the trace clock setting. The value is default or fastest: default is 150 MHz and fastest is 300 MHz.
Recompiling the graph and relinking the XCLBIN file prepares the tool to instantiate additional trace IP into the design to accommodate the added trace events being captured.