Profiling the Application

The Vitis™ core development kit generates various system and kernel resource performance reports during compilation. These reports help you establish a performance baseline for your application, identify bottlenecks, and identify target functions that can be accelerated in hardware kernels, as discussed in Methodology for Architecting a Device Accelerated Application. The Xilinx® Runtime (XRT) collects profiling data during application execution in both emulation and hardware builds. Examples of profiling and event data that can be reported include:

  • Host and device timeline events
  • OpenCL™ or XRT native API call sequences
  • Kernel execution sequence
  • Kernel start and stop signals
  • FPGA trace data including AXI transactions
  • Power profile data for the accelerator card
  • AI Engine profiling and event trace
  • User event and range profiling

Profiling reports and data can be used to isolate performance bottlenecks in the application, identify problems in the system, and optimize the design to improve performance. Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.

There are four distinct areas to consider when performing algorithm optimization in Vitis: system resource usage and performance, kernel optimization, host optimization, and data transfer optimization. Vitis reports and graphical tools support your efforts to profile and optimize these areas.

When properly enabled as described in Enabling Profiling in Your Application, reports are automatically generated while running the active build, either from the command line as described in Building and Running the Application, or from the Vitis integrated design environment (IDE). Separate reports are generated for the different build targets and can be found in the respective report directories. Refer to Output Directories of the v++ Command or Output Directories from the Vitis IDE for more information on locating these reports.

Reports can be viewed in Vitis analyzer, or in some cases from the Vitis IDE. To access these reports from Vitis analyzer, open the run_summary report as explained in Using the Vitis Analyzer.

Baselining Functionality and Performance

Methodology for Accelerating Applications with the Vitis Software Platform provides an overview of designing an application, beginning with profiling the application to identify functions to accelerate, leading into recommended ways of developing C/C++ accelerators. As discussed in that guide, it is very important to understand the architecture and performance of your application before you start any optimization effort. This is achieved by establishing a baseline for the application in terms of functions and performance.

Figure 1: Baselining Functions and Performance Flow

Identify Bottlenecks

The first step is to identify the bottlenecks of your application running on your target platform. The most effective way is to run the application with profiling tools, such as the user profiling features described in Custom Profiling of the Host Application, or valgrind, callgrind, and GNU gprof. The profiling data generated by these tools shows the call graph with the number of calls to all functions and their execution time.

Run Software and Hardware Emulation

Run software and hardware emulation on the accelerated application as described in Running Emulation, to verify functional correctness, and to generate profiling data on the host code and the kernels. Use Vitis analyzer to review the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baseline performance estimate for timing interval, latency, and resource utilization, such as DSP and block RAM.

Build and Run the Application

The last step in baselining is building and running the application on an FPGA acceleration card, like one of the Alveo™ Data Center accelerator cards, as described in Running the Application Hardware Build. Analyze the reports from the system compilation, and the profiling data from application execution to see the actual performance and resource utilization on hardware.

TIP: Save all the reports during the baseline process, so that you can refer back to them and compare results during optimization.

Enabling Profiling in Your Application

To enable profiling and capture event trace data during the execution of your application, you must instrument your application for this task. Additional logic must be enabled, consuming additional device resources, to track the host and kernel execution steps and to capture event data. This process requires modifying your kernel XO during compilation and the xclbin during linking to capture different types of profile data from the device side activity, configuring the Xilinx runtime (XRT) as described in the xrt.ini File to capture data during the application runtime, and optionally modifying your host application to capture custom data.

TIP: While capturing profile data is a critical part of the profiling and optimization process for building your accelerated application, it does consume additional resources and impacts performance. You should be sure to clean these elements out of your final production build.

There are many different types of profiling for your applications, depending on which elements your system includes, and what type of data you want to capture. The following table shows some of the levels of profiling that can be enabled, and notes which are complementary and which are not.

Table 1. Profiling Host and Kernels

Host Application OpenCL API
OpenCL API and some limited device side (kernel) profiling. Specified by the use of the opencl_summary and opencl_trace options in the xrt.ini file. Generates the opencl_summary.csv and opencl_trace.csv files.

Host Application XRT Native API
Specified by the use of the native_xrt_trace option in the xrt.ini file. Generates trace events for the XRT API.

Host Application User-Event Profiling
Requires additional code in the host application as described in Custom Profiling of the Host Application. Generates user range data and user events for the host application.

Low Overhead Profiling
Specified by the use of the lop_trace option in the xrt.ini file. Generates the lop_trace.csv file as described in Enabling Low Overhead Profiling. Is disabled by opencl_summary=true in the xrt.ini file.

Device Side Profiling
Enabled by the use of --profile options during v++ compilation and linking, as described in --profile Options. Enables capturing data traffic between the host and kernel, kernel stalls, the execution times of kernels and compute units (CUs), as well as monitoring activity in Versal AI Engines.

AI Engine Graph and Kernels
Specified by the use of the aie_profile and aie_trace options in the xrt.ini file. These options can be specified together or separately. Generates the aie_profile_<device>.csv and aie_trace_##_<stream id>.txt reports. Cannot be used with profile=true in the xrt.ini file. Is also disabled by the presence of user event profiling in the host application.

Power Profile
Specified by the use of the power_profile option in the xrt.ini file. Generates the power_profile_<device>.csv report.

Vitis AI Profiling
Specified by the use of the vitis_ai_profile option in the xrt.ini file. Enables counter profiling of DPUs to generate the opencl_summary.csv file. Is disabled by opencl_summary=true in the xrt.ini file.

The device binary (xclbin) file is configured for capturing limited device-side profiling data by default. However, using the --profile option during the Vitis compiler linking process instruments the device binary by adding Acceleration Monitors and AXI Performance Monitors to the system. This option has multiple instrumentation options: --profile.data, --profile.stall, and --profile.exec, as described in the --profile Options.

As an example, add --profile.data to the v++ linking command line:
v++ -g -l --profile.data all:all:all ...
TIP: Be sure to also use the v++ -g option when compiling your kernel code for debugging with software or hardware emulation.
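The data, stall, and execution instrumentation options can also be combined on the same link command line. The following command is only a sketch; refer to the --profile Options section for the exact argument syntax supported by each option:
v++ -g -l --profile.data all:all:all --profile.stall all:all --profile.exec all:all ...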

After your application is enabled for profiling during the v++ compile and link process, data gathering during application runtime must also be enabled in XRT by editing the xrt.ini file as discussed above. For example, the following xrt.ini file enables OpenCL profiling, power profiling, and event and stall trace capture when the application is run:

[Debug]
opencl_summary=true
opencl_trace=true
power_profile=true
data_transfer_trace=coarse
stall_trace=all

To enable the profiling of Kernel Internals data, you must also add the debug_mode tag in the [Emulation] section of the xrt.ini:

[Emulation]
debug_mode=batch
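
For example, a combined xrt.ini file that enables both the profiling options shown above and Kernel Internals capture during emulation could look like the following sketch, assembled from the snippets above:

[Debug]
opencl_summary=true
opencl_trace=true
data_transfer_trace=coarse
stall_trace=all

[Emulation]
debug_mode=batch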

If you are collecting a large amount of trace data, you can increase the amount of memory available for capturing data by specifying the --trace_memory option during v++ linking, and adding the trace_buffer_size keyword in the xrt.ini file.

--trace_memory
Indicates what type of memory to use for capturing trace data, as described in Vitis Compiler General Options.
trace_buffer_size
Specifies the amount of memory to use for capturing the trace data during the application runtime.
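
For example, the following hypothetical combination reserves DDR memory for trace capture at link time, and then sizes the runtime trace buffer in the xrt.ini file. This is only a sketch, and the 64M value is illustrative; see Vitis Compiler General Options for the supported --trace_memory values:

v++ -l --trace_memory DDR ...

[Debug]
trace_buffer_size=64M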

Finally, as discussed in Continuous Trace Capture, you can enable continuous trace capture to continuously offload device trace data while the application is running, so that in the event of an application or system crash, some trace data is available to help debug the application.

Continuous Trace Capture

The Vitis tool supports recording continuous trace data while the application is running. An application can run for a very long time and generate a significant amount of trace data, which can lead to issues such as incomplete trace data, especially when the memory resource used for trace data is not large enough. Using continuous trace, analysis of the trace can be carried out while the application is still running, or if the application crashes before completion.

With the ability to continuously capture trace data, the Timeline Trace and Application Timeline reports can be dynamically updated in the Vitis analyzer tool while your application is running. Once these reports are loaded in Vitis Analyzer, there is a hyperlink available indicating that the current report is being modified on the disk. If new data needs to be loaded, Reload or Auto-Reload options are available on the banner to let you view the updated report as your application runs and trace data is generated.

Continuous trace is not enabled by default. In addition, the memory resources of an FPGA are not unlimited, so if the application generates a large amount of trace data, a circular buffer can be used to store it. The circular buffer is written, offloaded to the host, and then reused. By enabling a circular buffer with continuous trace, less memory is needed, saving available resources on the device. However, an application run with continuous trace and a circular buffer can result in multiple device trace files.

TIP: For hardware emulation, only host-side continuous trace is available. For hardware runs, both host-side and device-side continuous trace are available.

Using the memory resource as a circular buffer is recommended in scenarios such as long-running designs, or when debugging application crashes where partial trace data is still valuable.

The circular buffer implementation is automatically turned on when continuous trace is enabled in the xrt.ini file. Enabling continuous trace requires the following settings:

  • In the xrt.ini file, continuous_trace is set to TRUE
  • v++ linking option --trace_memory is set to DDR or HBM

You can optionally set:

  • The size of the trace buffer using trace_buffer_size in the xrt.ini file. This defaults to 1 MB.
  • The interval at which the trace buffer is offloaded from the device using trace_buffer_offload_interval_ms in the xrt.ini file. The default is 10 ms.
  • The interval at which files are dumped by setting trace_file_dump_interval_s. The default is 3 seconds.
IMPORTANT: The circular buffer can be force-enabled by setting trace_buffer_offload_interval_ms to 0 ms.
As an example, if you enable continuous_trace with a trace_buffer_size of 8K and the default trace_buffer_offload_interval_ms of 10 ms, the requested trace offload rate is 819200 bytes/s, which is less than the minimum required rate of 100 MB/s. In this scenario, the circular buffer is NOT enabled and an XRT warning is reported:
[XRT] WARNING: Unable to use circular buffer for continuous trace offload. Please increase trace buffer size and/or reduce continuous
trace interval. Minimum required offload rate (bytes per second) : 104857600 Requested offload rate : 819200
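The rates in the warning can be reproduced directly from the settings:

requested offload rate = trace_buffer_size / offload interval = 8192 bytes / 0.010 s = 819200 bytes/s
minimum required rate  = 100 MB/s = 100 x 1024 x 1024 bytes/s = 104857600 bytes/s

Because the requested rate is below the minimum required rate, the circular buffer is not used.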
Here is an example of xrt.ini settings:
[Debug]
opencl_summary=true
opencl_trace=true
data_transfer_trace=coarse
stall_trace=all
continuous_trace=true
// The following are optional and needed only in rare circumstances

trace_buffer_size=20M
trace_buffer_offload_interval_ms=10
trace_file_dump_interval_s=2

The following are the results of these settings:

  • opencl_summary: Enables the generation of the host-related OpenCL API profile summary report; the opencl_summary.csv file is created.
  • opencl_trace: Enables the generation of the host-related OpenCL API trace; the opencl_trace.csv file is created.
  • data_transfer_trace: Enables the collection of kernel activity to be added to the profile summary and trace; device_trace_0.csv files are created, where 0 is the device number.
  • stall_trace: Enables the capture of stall information generated by the compute unit hardware.
  • continuous_trace: Enables the continuous dumping of trace files and the continuous reading of device data into the host.
  • trace_buffer_offload_interval_ms: Controls the interval, in milliseconds, at which trace data is read from the device into the host.
  • trace_file_dump_interval_s: Controls the interval, in seconds, between dumps of trace files.

As a result, there are several CSV files generated in addition to the xclbin.run_summary as part of the application run using the above xrt.ini file. Vitis Analyzer only needs the generated run_summary file and will use the relevant CSV files to display the profile summary and timeline trace.

Here are the recommendations on setting up an application for trace data dumping:

  1. By default, an 8K FIFO is used for saving trace data. The FIFO size can be increased, although sizes above 64K are not recommended, and the FIFO must be preallocated as part of the v++ linking step. It is preferable to use device memory for saving trace data. If you specify a memory bank for trace, you can use the trace_buffer_size option in the xrt.ini file to control the amount of trace captured at runtime. With device memory, the default size is 1M and the maximum size is 4095M.
  2. If you are still unable to capture the full trace, disable stall trace by setting stall_trace=off, or use stall_trace=on with data_transfer_trace=coarse.
  3. If the application requires a larger trace buffer, enable the circular buffer by setting continuous_trace=true with the default settings of trace_buffer_offload_interval_ms=10 and trace_file_dump_interval_s=5. Ideally, the continuous trace feature should be used for the following cases:
    • Long-running designs with minimal trace generated
    • Debugging application crashes, where some .csv files might still be available for debugging
  4. If the application run is still unable to capture the full trace, the trace_buffer_size can be increased further.
  5. If the application still creates so much trace data that the host cannot keep up, use a smaller trace_file_dump_interval_s, which creates multiple files, one per interval.
  6. Lastly, continuous trace can generate several trace files during the application run in addition to the xclbin.run_summary file. Vitis Analyzer only needs the generated run_summary file and picks up the relevant CSV files to display the profile summary and timeline trace.

Custom Profiling of the Host Application

All XRT-related actions from the host application are automatically tracked for profiling, through either the OpenCL API calls or the XRT API calls. However, you can also profile the host application beyond the XRT-related events, capturing event data based on user-specified actions or events.

TIP: You can use these features early in the design process as described in Baselining Functionality and Performance, even prior to separating functions to run in the Xilinx device hardware.
This feature provides two types of custom profiling:
User range
Profiles the specified start/end times across a range of code. This captures the span of time within which an action occurs in the host application.
User events
Marks an event in the timeline. The user event is added to the timeline waveform at whatever point in time it occurs.

The user_range and user_event data can be captured to the Profile Summary and Timeline Trace reports for display in Vitis analyzer. As seen in the figure below, the Profile Summary shows the number of occurrences of a given event and the range. The User Ranges table also reports the Min/Max/Avg/Total duration of the user-defined ranges in the host code. In the Timeline Trace report user_range elements in the host code are displayed in a separate row, and user_event markers are added at specific points on the timeline.

Figure 2: Profile Summary – User Range

Using custom profiling requires a few changes in your host application source code and build process. You must make use of the C or C++ API in your code, as described below, and you must include the xrt_coreutil library when linking your host application.

TIP: An example of user_range and user_event can be seen in the host code at https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/host/debug_profile/src/host.cpp.

Profiling of C++ Code

For C++ code the provided objects are:

user_range
This object captures the start time and end time of a measured range of activity with the specified ID. The object constructor is:
user_range(const char* label, const char* tooltip);
user_event
This object marks an event occurring at a single point in time, adding the specified label onto the timeline trace. The object constructor is:
user_event() 

A user_range object starts keeping track of time immediately upon construction when the label and tooltip are passed to the constructor. Usage details of the user_range objects:

  • If a user_range is instantiated using the default constructor, no time is marked until the user calls user_range.start() with the label and tooltip.
  • You can instantiate a user_range object passing the label and tooltip strings. This starts monitoring the range immediately.
  • You must call user_range.start() and user_range.end() to capture ranges of time you are interested in.
  • If user_range.end() is not called, then any range being tracked lasts until the user_range object is destructed.
  • The user_range object can be reused any number of times, by calling user_range.start()/user_range.end() pairs in the host code.
  • Sequential calls to user_range.start() ignore all but the first call until user_range.end() terminates the range.
  • Sequential calls to user_range.end() ignore all but the first call until user_range.start() starts a new range.

Usage of the user_event objects:

  • A user_event object must be instantiated using the default constructor.
  • Calls to user_event.mark() create a user marker on the timeline trace at that particular time.
  • user_event.mark() takes an optional const char* argument which appears as a label on the timeline trace.

The debug_profile example of the Vitis_Accel_Examples demonstrates user event profiling in a host application. With your host application properly instrumented, XRT can capture profile data from these user-defined ranges and events, as well as the standard XRT API-based events. You must enable profiling in the xrt.ini file as explained previously.
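
The following minimal sketch illustrates the C++ flow described above. The header path and namespace shown here (experimental/xrt_profile.h and xrt::profile) are assumptions based on the linked debug_profile example; confirm them against your XRT installation, and link the application with the xrt_coreutil library:

#include <experimental/xrt_profile.h>  // assumed header providing the user_range/user_event objects

int main() {
  // Passing the label and tooltip to the constructor starts tracking the range immediately
  xrt::profile::user_range range("Setup", "Host-side buffer allocation and device programming");
  // ... allocate buffers, program the device, create kernel handles ...
  range.end();                         // close the range explicitly

  // The same object can be reused for another range of interest
  range.start("Compute", "Kernel execution and result readback");
  // ... run the kernel and transfer results back to the host ...
  range.end();

  // Mark a single point in time on the timeline trace
  xrt::profile::user_event event;
  event.mark("Starting verification");
  // ... verify results ...
  return 0;
}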

Profiling of C Code

For C code the provided functions are:

xrtURStart()
This function establishes the start time of a measured range of activity with the specified ID. The function signature is:
void xrtURStart(unsigned int id, const char* label, const char* tooltip)
xrtUREnd()
This function marks the end time of a measured range with the specified ID. The function signature is:
void xrtUREnd(unsigned int id) 
xrtUEMark()
This function marks an event occurring at a single point in time, adding the specified label onto the timeline trace. The function signature is:
void xrtUEMark(const char* label) 

Use the xrtURStart() and xrtUREnd() functions to start keeping track of time immediately, and specify an ID to pair the start/end calls and define the user range. Usage details of the user_range functions:

  • Start/End ranges of one ID can be nested inside other Start/End ranges of a different ID.
  • It is your responsibility to make sure the IDs match for the Start/End range you are profiling.
    IMPORTANT: Multiple calls to xrtURStart and xrtUREnd with the same ID can cause unexpected behavior.
  • The user range can have a label that is added to the timeline, and a tooltip that is displayed when you place the cursor over the user range.

A call to xrtUEMark() will create a user marker on the timeline trace at the point of the event.

  • xrtUEMark() lets you specify a label for the event. The label will appear on the timeline with the mark.
  • You can use NULL for the label to add an unlabeled mark.

The following is example code:

int main(int argc, char* argv[]) {
  xrtURStart(0, "Software execution", "Whole program execution") ;
...
  //TARGET_DEVICE macro needs to be passed from gcc command line
  if(argc != 2) {
      std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
      return EXIT_FAILURE;
  }
....
    q.enqueueTask(krnl_vector_add);

    // The result of the previous kernel execution will need to be retrieved in
    // order to view the results. This call will transfer the data from FPGA to
    // source_results vector
    q.enqueueMigrateMemObjects({buffer_result},CL_MIGRATE_MEM_OBJECT_HOST);

    q.finish();

    xrtUEMark("Starting verification") ;

Enabling Low Overhead Profiling

The Vitis software platform supports low overhead profiling that provides minimal information with little effect on execution time. Using this option during runtime, the timeline trace is still available but with a reduced amount of information. Low overhead profiling captures minimal information on OpenCL events and dumps a CSV file called lop_trace.csv at the end of execution. Low overhead profiling can be run in all three flows (hardware, hardware emulation, and software emulation).

To enable low overhead profiling, there is a new flag in the "Debug" section of the xrt.ini File called lop_trace. By default, lop_trace is FALSE and must be enabled by setting the ini parameter to TRUE.

xrt.ini file
[Debug]
lop_trace=true
TIP: The lop_trace parameter can be enabled alongside other profiling parameters, but doing so eliminates any benefit of low overhead profiling by capturing all profiling data as well.

When lop_trace=true is enabled, the runtime will generate lop_trace.csv which can be viewed in the Run Summary within Vitis analyzer.

vitis_analyzer <project>.run_summary

To obtain the lowest possible overhead, information collected in normal OpenCL profiling is omitted. Specifically, the following information is expected to not be available in the low overhead profiling trace:

  • Device events, such as compute unit executions or kernel memory transfers
  • Information about memory reads or writes, such as destination address or size
  • Information about kernel enqueues, such as kernel name or NDRange sizes
  • Dependencies between buffer transfers and kernel enqueue

Guidance

The Vitis core development kit has a comprehensive design guidance tool that provides immediate, actionable guidance to the software developer for issues detected in their designs. These issues might be related to the source code, or due to missed tool optimizations. The rules are generic, based on an extensive set of reference designs, and might not be applicable to your specific design. It is up to you to understand the specific guidance rules and take appropriate action based on your specific algorithm and requirements.

Guidance is generated from the Vitis HLS, Vitis profiler, and Vivado Design Suite when invoked by the v++ compiler. The generated design guidance can have several severity levels: warning messages, informational messages, and design rule checks are provided during software emulation, hardware emulation, and system builds. The profile design guidance helps you interpret the profiling results, allowing you to focus on improving performance.

Guidance includes message text for reported violations, a brief suggested resolution, and a detailed resolution provided as a web link. You can determine your next course of action based on the suggested resolution. This helps improve productivity by quickly highlighting issues and directing you to additional information on using the Vitis technology.

Design guidance is automatically generated after building or running an application from the command line or Vitis IDE.

You can open the Guidance report as discussed in Using the Vitis Analyzer. To access the Guidance report, open the Compile Summary, the Link Summary, or the Run Summary, and open the Guidance report.

  • Kernel Guidance is generated by the Vitis HLS tool after a kernel is built using the v++ compile command. This can be viewed in the Vitis analyzer by opening the Compile Summary report. Kernel Guidance, as well as the Compile Summary file, is generated for each kernel compiled. Kernel Guidance includes recommendations on using dataflow, and possible reasons why the expected throughput could not be achieved.
  • System Guidance is generated after the kernels are linked using the v++ link command. This can be viewed in the Vitis analyzer by opening the Link Summary report. System Guidance includes all Kernel Guidance checks, and provides a comprehensive review before running your application.
  • Run Guidance is generated when your generated .xclbin is run, and is a feature of XRT. This can be viewed by opening the Run Summary in the Vitis analyzer. Run Guidance includes checks such as whether kernel stall time is above 50%, and recommendations such as whether PLRAM can be used instead of DDR.

With the Guidance report open, the Guidance view displays the messages along with resolution columns. The resolutions also have extended weblink help available.

The following image shows an example of the Guidance report displayed in the Vitis analyzer. For example, clicking a link in the Name column opens a description of the rule check. Links in the Details column can open source code, select a design object such as a kernel, or navigate to another report.

Figure 3: Design Guidance Example
TIP: As described in Setting Guidance Thresholds, you can manually edit the values in the Threshold column of the Run Guidance report to customize the report.

There is one HTML guidance report for each run of the v++ command, including compile and link. The report files are located in the --report_dir under the specific output name. For example:

  • v++_compile_<output>_guidance.html for v++ compilation
  • v++_link_<output>_guidance.html for v++ linking

You can click the web link in the Resolution column to get additional details about the resolution. The Guidance Messaging web page lists all of the current messages for your review.

Figure 4: Guidance Messaging Web Page

Kernel and Compute Unit objects, as well as profile reported data values, can also be cross-probed to other views like the System Diagram or Profile Report. Refer to Working with Reports for more information.

Opening the Guidance Report

When kernels are compiled and when the FPGA binary is linked, guidance reports are generated automatically by the v++ command. You can view these reports in the Vitis analyzer by opening the <output_filename>.compile_summary or the <output_filename>.link_summary for the application project. The <output_filename> is the output of the v++ command.

As an example, launch the Vitis analyzer and open the report using this command:

vitis_analyzer <output_filename>.link_summary

When the Vitis analyzer opens, it displays the link summary report, as well as the compile summaries, and a collection of reports generated during the compile and link processes. Both the compile and link steps generate Guidance reports, which you can view by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.

Interpreting Guidance Data

The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations.

In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column and annotated with symbols representing the severity. These are listed individually in the HTML report. In addition, as the HTML report does not show tooltips, a full Name column is included in the HTML report as well.

The following list describes all fields and their purpose as included in the HTML guidance reports.

Id
Each guidance rule is assigned a unique ID. Use this ID to uniquely identify a specific message from the guidance report.
Name
The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view.
Severity
The Severity column allows the easy identification of the importance of a guidance rule.
Full Name
The Full Name provides a less cryptic name compared to the mnemonic name in the Name column.
Categories
Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view.
Threshold
The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices.
Actual
The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met.
Details
The Details column provides a brief message describing the specifics of the current rule.
Resolution
The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a popup window or the documentation with tips and code snippets that you can apply to the specific issue.

System Estimate Report

Building the hardware system and the FPGA binary to run on Xilinx devices is the step of the build process with the longest execution time. Build time is also affected by the target device and the number of compute units instantiated onto the FPGA fabric. Therefore, it is useful to estimate the performance of an application without needing to build it for the system hardware.

The System Estimate report provides estimates of FPGA resource usage and the estimated frequency at which the hardware accelerated kernels can operate. The report is automatically generated for hardware emulation and system hardware builds. The report contains high-level details of the user kernels, including resource usage and estimated frequency. This report can be used to guide design optimization.

You can also force the generation of the System Estimate report with the following option:

v++ .. --report_level estimate
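
For example, a hardware emulation link command requesting the estimate report might look like the following. This is a hypothetical command line; the kernel and platform names are taken from the example report later in this section:

v++ -l -t hw_emu --platform xilinx_u200_xdma_201830_2 --report_level estimate -o mmult.hw_emu.xilinx_u200_xdma_201830_2.xclbin mmult.xo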

An example report is shown in the figure:

Figure 5: System Estimate

Opening the System Estimate Report

The System Estimate report can be opened in the Vitis analyzer tool, which is intended for viewing reports generated by the Vitis compiler when the application is built, and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer <output_filename>.link_summary

The <output_filename> is the output of the v++ command. This opens the Link Summary for the application project in the Vitis analyzer tool. Then, select the System Estimate report. Refer to Using the Vitis Analyzer for more information.

Interpreting the System Estimate Report

The System Estimate report generated by the v++ command provides information on every binary container in the application, as well as every compute unit in the design. The report is structured as follows:

  • Target device information
  • Summary of every kernel in the application
  • Detailed information on every binary container in the solution

The following example report file represents the information generated for the estimate report:

----------------------------------------------------------------------------
Design Name:             mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device:           xilinx:u200:xdma:201830.2
Target Clock:            300.000000MHz
Total number of kernels: 1
----------------------------------------------------------------------------

Kernel Summary
Kernel Name  Type  Target              OpenCL Library                          Compute Units
-----------  ----  ------------------  --------------------------------------  -------------
mmult        c     fpga0:OCL_REGION_0  mmult.hw_emu.xilinx_u200_xdma_201830_2  1


-----------------------------------------------------------------------------
OpenCL Binary:     mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit  Kernel Name  Module Name  Target Frequency  Estimated Frequency
------------  -----------  -----------  ----------------  -------------------
mmult_1       mmult        mmult        300.300293        411.015198

Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case  
------------  -----------  -----------  --------------  ---------  --------  ----------  
mmult_1       mmult        mmult        826 ~ 829       825        827       828         

Area Information
Compute Unit  Kernel Name  Module Name  FF     LUT    DSP   BRAM  URAM
------------  -----------  -----------  -----  -----  ----  ----  ----
mmult_1       mmult        mmult        81378  35257  1036  2     0
----------------------------------------------------------------------------

Design and Target Device Summary

All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report:

----------------------------------------------------------------------------
Design Name:             mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device:           xilinx:u200:xdma:201830.2
Target Clock:            300.000000MHz
Total number of kernels: 1
----------------------------------------------------------------------------

For the design summary, the information provided includes the following:

Target Device
Name of the Xilinx device on the target platform that runs the FPGA binary built by the Vitis compiler.
Target Clock
Specifies the target operating frequency for the compute units (CUs) mapped to the FPGA fabric.

Kernel Summary

This section lists all of the kernels defined for the application project. The following example shows the kernel summary:

Kernel Summary
Kernel Name  Type  Target              OpenCL Library                          Compute Units
-----------  ----  ------------------  --------------------------------------  -------------
mmult        c     fpga0:OCL_REGION_0  mmult.hw_emu.xilinx_u200_xdma_201830_2  1

In addition to the kernel name, the summary also provides the execution target and type of the input source. Because there is a difference in compilation and optimization methodology for OpenCL™, C, and C/C++ source files, the type of kernel source file is specified.

The Kernel Summary section is the last summary information in the report. From here, detailed information on each compute unit binary container is presented.

Timing Information

For each binary container, the detail section begins with the execution target of all compute units (CUs). It also provides timing information for every CU. As a general rule, if the estimated frequency for the FPGA binary is higher than the target frequency, the CU will be able to run in the device. If the estimated frequency is below the target frequency, the kernel code for the CU needs to be further optimized to run correctly on the FPGA fabric. This information is shown in the following example:

OpenCL Binary:     mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit  Kernel Name  Module Name  Target Frequency  Estimated Frequency
------------  -----------  -----------  ----------------  -------------------
mmult_1       mmult        mmult        300.300293        411.015198

It is important to understand the difference between the target and estimated frequencies. CUs are not placed in isolation into the FPGA fabric. CUs are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications.

Because the CU custom logic is generated one kernel at a time, an estimated frequency that is higher than the target frequency indicates that the CU can run at the higher estimated frequency. Therefore, the CU should meet timing at the target frequency during implementation of the FPGA binary.

Latency Information

The latency information presents the execution profile of each CU in the binary container. When analyzing this data, it is important to recognize that all values are measured from the CU boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for CUs targeted at the FPGA fabric. The following is an example of the latency report:

Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case  
------------  -----------  -----------  --------------  ---------  --------  ----------  
mmult_1       mmult        mmult        826 ~ 829       825        827       828        

The latency report is divided into the following fields:

  • Start interval
  • Best case latency
  • Average case latency
  • Worst case latency

The start interval defines the amount of time that has to pass between invocations of a CU for a given kernel.

The best, average, and worst case latency numbers refer to how much time it takes the CU to generate the results of one ND Range data tile for the kernel. For cases where the kernel does not have data dependent computation loops, the latency values will be the same. Data dependent execution of loops introduces data specific latency variation that is captured by the latency report.

The interval or latency numbers will be reported as "undef" for kernels with one or more of the conditions listed below:
  • OpenCL kernels that do not have explicit reqd_work_group_size(x,y,z)
  • Kernels that have loops with variable bounds
Note: The latency information reflects estimates based on the analysis of the loop transformations and exploited parallelism of the model. Advanced transformations, such as pipelining and dataflow, can heavily change the actual throughput numbers. Therefore, latency numbers can only be used as relative guides between different runs.

Area Information

Although the FPGA can be thought of as a blank computational canvas, there are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by the Vitis compiler to generate the custom logic for each CU in the design. The quantity of fundamental resources needed to implement the custom logic for a single CU determines how many CUs can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a single CU:

Area Information
Compute Unit  Kernel Name  Module Name  FF     LUT    DSP   BRAM  URAM
------------  -----------  -----------  -----  -----  ----  ----  ----
mmult_1       mmult        mmult        81378  35257  1036  2     0
----------------------------------------------------------------------

HLS Report

The HLS report provides details about the high-level synthesis (HLS) process of a user kernel and is generated during the compilation process for hardware emulation and system builds. This process translates the C/C++ and OpenCL kernel into the hardware description language used for implementing the kernel logic on the FPGA. The report provides estimated FPGA resource usage, operating frequency, latency, and interface signals of the custom-generated hardware logic. These details provide many insights to guide kernel optimization.

When running from the Vitis IDE, this report can be found in the following directory: _x/<kernel_name>.<target>.<platform>/<kernel_name>/<kernel_name>/solution/syn/report

The HLS report can be opened from the Vitis analyzer by opening the Compile Summary, or the Link Summary as described in Using the Vitis Analyzer. An example of the HLS report is shown.

Figure 6: HLS Report

Generating and Opening the HLS Report

IMPORTANT: You must specify the --save-temps option during the build process to preserve the intermediate files produced by Vitis HLS, including the reports. The HLS report and HLS guidance are only generated for hardware emulation and system builds for C and OpenCL kernels. They are not generated for software emulation or RTL kernels.

The HLS report can be viewed through the Vitis analyzer by opening the <output_filename>.compile_summary or the <output_filename>.link_summary for the application project. The <output_filename> is the output of the v++ command.

You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer <output_filename>.compile_summary

When the Vitis analyzer opens, it displays the Compile Summary and a collection of reports generated during the compile process. Refer to Using the Vitis Analyzer for more information.

Interpreting the HLS Report

The HLS Synthesis report is a spreadsheet listing the module hierarchy in the left column. This section describes one part of the HLS report: the Performance and Resource Estimates. Each module and loop generated by the HLS run is represented in this hierarchy. The HLS Synthesis report contains the following columns:

  • Violation Type
  • Latency in clock cycles
  • Latency in absolute time (µs)
  • Iteration latency
  • Iteration Interval
  • Loop Tripcount
  • Pipelined
  • Utilization Estimates of BRAM, DSP, FF, and LUT
  • Negative Slack

If this information is part of a hierarchical block, it will sum up the information of the blocks contained in the hierarchy. Therefore, the hierarchy can also be navigated from within the report when it is clear which instance contributes to the overall design.

CAUTION: The absolute counts of cycles and latency numbers are based on estimates identified during HLS synthesis, especially with advanced transformations, such as pipelining and dataflow. Therefore, these numbers might not accurately reflect the final results. If you encounter question marks in the report, this might be due to variable bound loops, and you are encouraged to set trip counts for such loops to have some relative estimates presented in this report.

Profile Summary Report

When properly configured, the Xilinx Runtime (XRT) collects profiling data on host applications and kernels. XRT automatically captures profiling data for the host application as it makes calls to the runtime either through OpenCL or XRT API calls. You can add user calls to your host application to capture additional profiling information, as explained in Custom Profiling of the Host Application. To capture details of the kernel operations you must instrument kernels using the --profile options as explained in the next section.

After the application finishes execution, the Profile Summary report is saved as a .csv file in the directory where the compiled host code is executed. The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine the kernel execution and data transfer statistics.

TIP: The Profile Summary report can be generated for all build configurations. However, with the software emulation build, the report will not include any data transfer details under kernel execution efficiency and data transfer efficiency. This information is only generated in hardware emulation or system builds.

An example of the Profile Summary report is shown below.

Figure 7: Profile Summary

Generating and Opening the Profile Summary Report

Capturing the data required for the Profile Summary requires a few steps prior to actually running the application.

  1. The FPGA binary (xclbin) file is configured for capturing profiling data by default. However, using the v++ --profile option during the linking process enables a greater level of detail in the profiling data captured. For more information, see the --profile Options.
  2. The runtime requires the presence of an xrt.ini file, as described in xrt.ini File, that includes the keyword for capturing profiling data:
    [Debug]
    opencl_summary = true
    opencl_device_counter=true 
  3. To enable the profiling of Kernel Internals data, you must also add the debug_mode tag in the [Emulation] section of the xrt.ini:
    [Emulation]
    debug_mode = batch

With profiling enabled in the device binary and in the xrt.ini file, the runtime creates the opencl_summary.csv report file when running the application, and also creates the profile_kernels.csv file when Kernel Internals is enabled. These files are linked to the Profile Summary report which can be viewed in the Vitis analyzer tool through the Run Summary. Open the Run Summary using the following command:

vitis_analyzer <project>.run_summary

Interpreting the Profile Summary

The profile summary includes a number of useful statistics for your host application and kernels. The report provides a general idea of the functional bottlenecks in your application. The following tables show the profile summary descriptions.

Settings

This displays the report and XRT configuration settings.

Summary

This displays summary statistics including device execution time and device power.

Kernels & Compute Units

The following table displays the profile summary data for all kernel functions scheduled and executed.

Table 2. Kernel Execution
Name Description
Kernel Name of kernel
Enqueues Number of times kernel is enqueued. When the kernel is enqueued only once, the following stats are all the same.
Total Time Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model) (in ms)
Minimum Time Minimum runtime of all enqueues
Average Time Average kernel runtime (in ms)

(Total time) / (Number of enqueues)

Maximum Time Maximum runtime of all enqueues (in ms)

The following table displays the profile summary data for top kernel functions.

Table 3. Top Kernel Execution
Name Description
Kernel Name of kernel
Kernel Instance Address Host address of kernel instance (in hex)
Context ID Context ID on host
Command Queue ID Command queue ID on host
Device Name of device where kernel was executed (format: <device>-<ID>)
Start Time Start time of execution (in ms)
Duration Duration of execution (in ms)

The following table displays the profile summary data for all compute units on the device.

Table 4. Compute Unit Utilization
Name Description
Compute Unit Name of compute unit
Kernel Kernel this compute unit is associated with
Device Name of the device (format: <device>-<ID>)
Calls Number of times the compute unit is called
Dataflow Execution Specifies whether the CU is executed with dataflow
Max Parallel Executions Number of executions in the dataflow region
Dataflow Acceleration Shows the performance improvement due to dataflow execution
CU Utilization (%) Shows the percent of the total kernel runtime that is consumed by the CU
Total Time Sum of the runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average runtime of all calls (in ms): (Total time) / (Number of work groups)
Maximum Time Maximum runtime of all calls (in ms)
Clock Frequency Clock frequency used for a given accelerator (in MHz)

The following table displays the profile summary data for running times and stalls for compute units on the device.

Table 5. Compute Unit Running Times & Stalls
Name Description
Compute Unit Name of compute unit
Execution Count Execution count of the compute unit
Running Time Total time compute unit was running (in µs)
Intra-Kernel Dataflow Stalls (%) Percent time the compute unit was stalling from intra-kernel streams
External Memory Stalls (%) Percent time the compute unit was stalling from external memory accesses
Inter-Kernel Pipe Stalls (%) Percent time the compute unit was stalling from inter-kernel pipe accesses

Kernel Data Transfers

The following table displays the data transfers between kernels and global memory.

Table 6. Data Transfer
Name Description
Compute Unit Port Name of compute unit/port
Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Memory Resources Memory resource accessed by this port
Transfer Type Type of kernel data transfers
Number of Transfers Number of kernel data transfers (in AXI transactions)
Note: This might contain printf transfers.
Transfer Rate Rate of kernel data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active
Avg Bandwidth Utilization (%) Average bandwidth of kernel data transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (0.6 * Max. Theoretical Rate)

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of kernel data transfers (in ns)
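
As a simple illustration of how the Transfer Rate and Avg Bandwidth Utilization columns relate, consider hypothetical numbers that are not taken from any real report: a port that moves 256 MB while its CU is active for 0.5 s, on a memory interface with a maximum theoretical rate of 16000 MB/s:

Transfer Rate             = 256 MB / 0.5 s              = 512 MB/s
Avg Bandwidth Utilization = (100 * 512) / (0.6 * 16000) = 5.33%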

The following table displays the top data transfers between kernels and global memory.

Table 7. Top Data Transfer
Name Description
Compute Unit Name of compute unit
Device Name of device
Number of Transfers Number of write and read data transfers
Avg Bytes per Transfer Average bytes of kernel data transfers:

Average Bytes = (Total Bytes) / (Number of Transfers)

Transfer Efficiency (%) Efficiency of kernel data transfers:

Efficiency = (Average Bytes) / min((Memory Byte Width * 256), 4096)

Total Data Transfer Total data transferred by kernels (in MB):

Total Data = (Total Write) + (Total Read)

Total Write Total data written by kernels (in MB)
Total Read Total data read by kernels (in MB)
Total Transfer Rate Average total data transfer rate (in MB/s):

Total Transfer Rate = (Total Data Transfer) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active

The following table displays the data transfer streams.

Note: This table is only shown if there is stream data
Table 8. Data Transfer Streams
Name Description
Master Port Name of master compute unit and port
Master Kernel Arguments List of kernel arguments attached to this port
Slave Port Name of slave compute unit and port
Slave Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Number of Transfers Number of stream data packets
Transfer Rate Rate of stream data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Link Utilization (%) Link utilization (%):

Link Utilization = 100 * (Link Busy Cycles - Link Stall Cycles - Link Starve Cycles) / (Link Busy Cycles)

Link Starve (%) Link starve (%):

Link Starve = 100 * (Link Starve Cycles) / (Link Busy Cycles)

Link Stall (%) Link stall (%):

Link Stall = 100 * (Link Stall Cycles) / (Link Busy Cycles)

Host Data Transfers

The following table displays profile data for all write transfers between the host and device memory through the PCI Express® link.

Table 9. Top Memory Writes
Name Description
Buffer Address Specifies the address location for the buffer
Context ID OpenCL Context ID on host
Command Queue ID OpenCL Command queue ID on host
Start Time Start time of write operation (in ms)
Duration Duration of write operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Writing Rate Data transfer rate (in MB/s):

(Buffer Size)/(Duration)

The following table displays profile data for all read transfers between the host and device memory through the PCI Express® link.

Table 10. Top Memory Reads
Name Description
Buffer Address Specifies the address location for the buffer
Context ID Context ID on host
Command Queue ID Command queue ID on host
Start Time Start time of read operation (in ms)
Duration Duration of read operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Reading Rate Data transfer rate (in MB/s):

(Buffer Size) / (Duration)

The following table displays the data transfers from the host to global memory.

Table 11. Data Transfer
Name Description
Context:Number of Devices Context ID and number of devices in context
Transfer Type Type of kernel host transfers
Number of Buffer Transfers Number of host buffer transfers
Note: This might contain printf transfers.
Transfer Rate Rate of host buffer transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Avg Bandwidth Utilization (%) Average bandwidth of host buffer transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate)

Avg Size Average size of host buffer transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Total Time Sum of host buffer transfer durations (in ms)
Avg Time Average of host buffer transfer durations (in ms)

API Calls

The following table displays the profile data for all OpenCL host API function calls executed in the host application. The top displays a bar graph of the API call time as a percent of total time.

Table 12. API Calls
Name Description
API Name Name of the API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
Calls Number of calls to this API made by the host application
Total Time Sum of runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average Time (in ms)

(Total time) / (Number of calls)

Maximum Time Maximum runtime of all calls (in ms)

Device Power

The following table displays the profile data for device power.

Table 13. Device Power
Name Description
Power Used By Platform Shows a line graph of the three power rails on a Data Center acceleration card:
  • 12V Auxiliary
  • 12V PCIe
  • Internal power
These show the power (W) usage of the card over time.
Temperature One chart is created for each device that has non-zero temperature readings. Displays one line for each temperature sensor, with readouts in °C.
Fan Speed One chart is created for each device that has non-zero fan speed readings. The fan speed is measured in RPM.

Kernel Internals

The following table displays the running time for compute units in microseconds (µs) and reports stall time as a percent of the running time.

TIP: The Kernel Internals tab reports time in µs, while the rest of the Profile Summary reports time in milliseconds (ms).
Table 14. CU Runtime and Stalls
Name Description
Compute Unit Indicates the compute unit instance name
Running Time Reports the total running time for the CU (in µs)
Intra-Kernel Dataflow Stalls (%) Reports the percentage of running time consumed in stalls when streaming data between kernels
External Memory Stalls (%) Reports the percentage of running time consumed in stalls for memory transfers outside the CU
Inter-Kernel Pipe Stalls (%) Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU

The following table displays the data transfers for specific ports on the compute unit.

Table 15. CU Port Data Transfers
Name Description
Port Indicates the port name on the compute unit
Compute Unit Indicates the compute unit instance name
Write Time Specifies the total data write time on the port (in µs)
Outstanding Write (%) Specifies the percentage of the runtime consumed in the write process
Read Time Specifies the total data read time on the port (in µs)
Outstanding Read (%) Specifies the percentage of the runtime consumed in the read process

The following table displays the functional port data transfers on the compute unit.

Table 16. Functional Port Data Transfers
Name Description
Port Name of port
Function Name of function
Compute Unit Name of compute unit
Write Time Total time the port had an outstanding write (in µs)
Outstanding Write (%) Percent time the port had an outstanding write
Read Time Total time the port had an outstanding read (in µs)
Outstanding Read (%) Percent time the port had an outstanding read

The following table displays the running time and stalls for each function on the compute unit.

Table 17. Functions
Name Description
Compute Unit Name of compute unit
Function Name of function
Running Time Total time function was running (in ms)
Intra-Kernel Dataflow Stalls Percentage of running time the function was stalled on intra-kernel streams
External Memory Stalls Percentage of running time the function was stalled on external memory accesses
Inter-Kernel Pipe Stalls Percentage of running time the function was stalled on inter-kernel pipe accesses

Shell Data Transfers

The following table displays the DMA data transfers.

Table 18. DMA Data Transfer
Name Description
Device Name of device (format: <device>-<ID>)
Transfer Type Type of data transfers
Number of Transfers Number of data transfers (in AXI transactions)
Transfer Rate Rate of data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Total Data Transfer Total amount of data transferred (in MB)
Total Time Total duration of data transfers (in ms)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of data transfers (in ns)

For DMA bypass and Global Memory to Global Memory data transfers, see the DMA Data Transfer table above.

NoC Counters

The NoC Counters section displays the NoC Counters Read and NoC Counters Write tables. These sections are only displayed if there is non-zero NoC counter data.

Each section has a table containing summary data with line graphs for transfer rate and latency. The graphs can have multiple NoC counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.

Depending on the design, it can be possible to correlate NoC counters to CU ports. In this case, the CU port appears in the table, and selecting it cross-probes to the system diagram, profile summary, and any other views that include CU ports as selectable objects.

Table 19. NoC Counters Read or Write
Name Description
Compute Unit Port Name of compute unit/port
Name Name of NoC port
Traffic Class Traffic class type
Requested QoS (MB/s) Requested quality of service (in MB/s)
Min Transfer Rate Minimum data transfer rate (in MB/s)
Avg Transfer Rate Average data transfer rate (in MB/s)
Max Transfer Rate Maximum data transfer rate (in MB/s)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Min Latency Minimum latency of data transfers (in ns)
Avg Latency Average latency of data transfers (in ns)
Max Latency Maximum latency of data transfers (in ns)

AI Engine Counters

The AI Engine Counters section is displayed if there is non-zero AI Engine counter data. If there is an incompatible configuration of the AI Engine counters, this section displays a message stating that the configuration does not support performance profiling.

This section has a table containing summary data with line graphs for active time and usage. The usage chart is only available if stall profiling is enabled.

The graphs can have multiple AI Engine counters, so you can toggle the counters ON/OFF through check boxes in the Chart column of the table.

It is possible to cross-probe tiles to the AI Engine array and graph views.

Note: Depending on how the AI Engine counters are configured, one or more metric columns might appear. These include memory stall, stream stall, call inst time, group error time, etc. For more information, see Versal ACAP AI Engine Programming Environment User Guide (UG1076).
Table 20. AI Engine Counters
Name Description
Tile AI Engine Tile [Column, Row]
Clock Frequency (MHz) Frequency (in MHz) of clock used for AI Engine tiles

Timeline Trace

The Timeline Trace collects and displays host and kernel events on a common timeline to help you understand and visualize the overall health and performance of your system. The graphical representation lets you identify issues with kernel synchronization and concurrent execution. The displayed events include:

  • OpenCL API calls from the host code.
  • Device trace data, including compute unit and AXI transaction start/stop events.
  • Host events and kernel start/stops.

While this is useful for debugging and profiling the application, the timeline and device trace data are not collected by default because collecting this data adds time to the application execution. However, the trace data is collected with dedicated resources in the kernel and does not affect kernel functionality. By default, the data is offloaded from the device at the end of the run when trace memory is allocated with the v++ --trace_memory option; when continuous trace offload is enabled in the xrt.ini file, the data is instead offloaded while the application is running.

The following is a snapshot of the Timeline Trace window, which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown at the bottom of the image. Host activities include creating the program, running the kernels, and transferring data between global memory and the host. Kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvements.

Figure 8: Timeline Trace

Timeline data can be enabled and collected through the command line flow. However, viewing must be done in the Vitis analyzer as described in Using the Vitis Analyzer.

Generating and Opening the Timeline Trace

To generate the Timeline Trace report, you must complete the following steps to enable timeline and device trace data collection in the command line flow:

  1. Instrument the FPGA binary during linking by adding Acceleration Monitors and AXI Performance Monitors to kernels using the v++ --profile option as described in --profile Options. As an example, add --profile.data to the v++ linking command line:
    v++ -g -l --profile.data all:all:all ...
  2. After the kernels are instrumented during the build process, data gathering must also be enabled during the runtime execution of the application by editing the xrt.ini file. Refer to xrt.ini File for more information.

    The following xrt.ini file enables maximum information gathering when the application is run:

    [Debug]
    opencl_summary=true
    opencl_trace=true
    data_transfer_trace=coarse
    stall_trace=all
    
    TIP: If you are collecting a large amount of trace data, you might need to use the --trace_memory option with the v++ command and the trace_buffer_size keyword in the xrt.ini file (see the sketch following these steps).

    After running the application, the Timeline Trace data is captured in CSV files called opencl_trace.csv and device_0.csv.

  3. The CSV report can be viewed in the Vitis analyzer tool by opening the Run Summary produced during the application execution. You can launch the Vitis analyzer and open the Run Summary using the following command:
    vitis_analyzer <project>.run_summary
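
The following sketch shows one way to handle a large amount of trace data: allocate trace memory at link time, and add buffer size and continuous offload settings to the [Debug] section of the xrt.ini file shown in step 2. The memory resource (DDR[1]), buffer size, and continuous trace keywords shown here are assumptions for illustration only; confirm the supported values in --profile Options and xrt.ini File.

v++ -g -l --profile.data all:all:all --trace_memory DDR[1] ...

[Debug]
trace_buffer_size=64M
continuous_trace=true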

Interpreting the Timeline Trace

The Timeline Trace window displays host and device events on a common timeline. This information helps you understand details of application execution and identify potential areas for improvements. The Timeline Trace report has two main sections: Host and Device. The Host section shows the trace of all the activity originating from the host side. The Device section shows the activity of the CUs on the FPGA.

The report has the following structure:

  • Host
    OpenCL API Calls
    All OpenCL API calls are traced here. The activity time is measured from the host perspective.
    General
    All general OpenCL API calls such as clCreateProgramWithBinary, clCreateContext, and clCreateCommandQueue, are traced here.
    Queue
    OpenCL API calls that are associated with a specific command queue are traced here. This includes commands such as clEnqueueMigrateMemObjects, and clEnqueueNDRangeKernel. If the user application creates multiple command queues, then this section shows all the queues and activities.
    Data Transfer
    In this section the DMA transfers from the host to the device memory are traced. There are multiple DMA threads implemented in the OpenCL runtime, and there is typically an equal number of DMA channels. The DMA transfer is initiated by the user application by calling OpenCL APIs such as clEnqueueMigrateMemObjects. These DMA requests are forwarded to the runtime, which delegates them to one of the threads. Data transfers from the host to the device appear under Write, as they are written by the host, and transfers from the device to the host appear under Read.
    Kernel Enqueues
    The kernels enqueued by the host program are shown here. The kernels here should not be confused with the kernels/CUs on the device. Here, kernel refers to the NDRange kernels and tasks created by the OpenCL commands clEnqueueNDRangeKernel and clEnqueueTask. These are plotted against the time measured from the host’s perspective. Multiple kernels can be scheduled to execute at the same time, and they are traced from the point they are scheduled to run until the end of the kernel execution; this is the reason for multiple entries. The number of rows depends on the number of overlapping kernel executions.
    Note: Overlapping of the kernels should not be mistaken for actual parallel execution on the device as the process might not be ready to execute right away.
  • Device "name"
    Binary Container "name"
    Binary container name.
    Accelerator "name"
    Name of the compute unit (also known as the Accelerator) on the FPGA.
    User Functions
    In the case of Vitis HLS kernels, functions that are implemented as dataflow processes are traced here. The trace for these functions shows the number of active instances of these functions currently executing in parallel. These names are generated in hardware emulation when the waveform is enabled.
    Note: Function level activity is only possible in hardware emulation.
    • Function: "name a"
    • Function: "name b"
    Read
    A CU reads from the DDR over AXI-MM ports. The trace of data read by a CU is shown here. The activity is shown as transactions, and the tool-tip for each transaction shows more details of the AXI transaction. These names are generated when --profile.data is specified for the CU.
    Write
    A CU writes to the DDR over AXI-MM ports. The trace of data written by a CU is shown here. The activity is shown as transactions and the tool-tip for each transaction shows more details of the AXI transaction. This is generated when --profile.data is specified for the CU.

Waveform View and Live Waveform Viewer

The Vitis core development kit can generate a Waveform view when running hardware emulation. It displays in-depth details at the system level, CU level, and function level. The details include data transfers between the kernel and global memory and data flow through inter-kernel pipes. These details provide many insights into performance bottlenecks from the system level down to individual function calls to help optimize your application.

The Live Waveform Viewer is similar to the Waveform view; however, it provides even lower-level details with some degree of interactivity. The Live Waveform Viewer can also be opened using the Vivado logic simulator, xsim.

Note: The Waveform view allows you to examine the device transactions from within the Vitis analyzer, as described in Using the Vitis Analyzer. In contrast, the Live Waveform Viewer opens the Vivado simulation waveform viewer to examine the hardware transactions in addition to any user-selected signals.

Waveform data is not collected by default because it requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. Refer to Generating and Opening the Waveform Reports for instructions on enabling these features.

Figure 9: Waveform View

You can also open the waveform database (.wdb) file with the Vivado logic simulator through the Linux command line:

xsim -gui <filename.wdb> &
TIP: The .wdb file is written to the directory where the compiled host code is executed.

Generating and Opening the Waveform Reports

Follow these instructions to enable waveform data collection from the command line during hardware emulation and open the viewer (a consolidated command sketch follows these steps):

  1. Enable debug code generation during compilation and linking using the -g option.
    v++ -c -g -t hw_emu ...
  2. Create an xrt.ini file in the same directory as the host executable with the following contents (see xrt.ini File for more information):
    [Emulation]
    debug_mode=batch

    The debug_mode=batch setting enables the capture of waveform data (.wdb) by running the simulation in batch mode. You can also enable the Live Waveform Viewer to launch the simulation in interactive mode using the following setting in the xrt.ini:

    [Emulation]
    debug_mode=gui
    TIP: If Live Waveform Viewer is enabled, the simulation waveform opens during the hardware emulation run.
  3. Run the hardware emulation build of the application as described in Running the Application Hardware Build. The hardware transaction data is collected in the waveform database file, <hardware_platform>-<device_id>-<xclbin_name>.wdb. Refer to Output Directories of the v++ Command or Output Directories from the Vitis IDE for more information on locating these reports.
  4. Open the Waveform view in the Vitis analyzer by opening the Run Summary, and opening the Waveform report:
    vitis_analyzer <project>.run_summary
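
The following consolidated sketch recaps the batch-mode flow above as shell commands. The host executable and xclbin names are placeholders and assumptions, and additional emulation setup is still required as described in Running the Application Hardware Build; substitute the names from your own project.

# Build for hardware emulation with debug enabled (compile and link)
v++ -c -g -t hw_emu ...
v++ -l -g -t hw_emu ...

# Create xrt.ini next to the host executable to capture waveform data in batch mode
cat > xrt.ini <<EOF
[Emulation]
debug_mode=batch
EOF

# Run the hardware emulation build (host executable and xclbin names are placeholders)
XCL_EMULATION_MODE=hw_emu ./host_app binary_container_1.xclbin

# Open the Run Summary in Vitis analyzer, or open the .wdb directly in the Vivado logic simulator
vitis_analyzer <project>.run_summary
xsim -gui <hardware_platform>-<device_id>-<xclbin_name>.wdb &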

Interpreting Data in the Waveform Views

The following image shows the Waveform view:

Figure 10: Waveform View

The Waveform and Live Waveform views are organized hierarchically for easy navigation.

  • The Waveform view is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows the viewer to descend all the way down to the individual signals responsible for the abstracted data. However, because the Waveform view is generated from the post-processed data, no additional signals can be added to the report, and some of the runtime analysis cannot be visualized, such as DATAFLOW transactions.
  • The Live Waveform viewer displays the live Vivado logic simulator (xsim) run, so you can add extra signals and internals of the register transfer level (RTL) design to the live view. Refer to the Vivado Design Suite User Guide: Logic Simulation (UG900) for information on working with the Waveform viewer.

The hierarchy of the Waveform and Live Waveform views include the following:

Device "name"
Target device name.
Binary Container "name"
Binary container name.
Memory Data Transfers
For each DDR Bank, this shows the trace of all the read and write request transactions arriving at the bank from the host.
Kernel "name" 1:1:1
For each kernel and for each compute unit of that kernel, this section breaks down the activities originating from the compute unit.
Compute Unit: "name"
Compute unit name.
CU Stalls (%)
Stall signals are provided by the Vitis HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL pipes). The stall bus shown in the detailed kernel trace compiles all of the lowest-level stall signals and reports the percentage that are stalling at any point in time. This provides a measure of how much of the kernel is stalling at any point in the simulation.

For example, if there are 100 lowest level stall signals and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it is 9%.

Data Transfers
This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
User Functions
This information is available for the HLS kernels and shows the user functions.
Function: "name"
Function name.
Dataflow/Pipeline Activity
This shows the number of parallel executions of the function if the function is implemented as a dataflow process.
Active Iterations
This shows the currently active iterations of the dataflow. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution.
StallNoContinue
This is a stall signal that indicates whether there were any output stalls experienced by the dataflow processes (the function is done, but it has not received a continue signal from the adjacent dataflow process).
RTL Signals
These are the underlying RTL control signals that were used to interpret the above transaction view of the dataflow process.
Function Stalls
Shows the different types of stalls experienced by the process.
External Memory
Stalls experienced while accessing the DDR memory.
Inter-Kernel Pipe
If the compute units communicate with each other through pipes, this shows the related stalls.
Intra-Kernel Dataflow
FIFO activity internal to the kernel.
Function I/O
Actual interface signals.