Profiling the Application

The Vitis™ core development kit generates various system and kernel resource performance reports during compilation. These reports help you establish a performance baseline for your application, identify bottlenecks, and identify target functions that can be accelerated in hardware kernels, as discussed in Methodology for Architecting a Device Accelerated Application. The Xilinx Runtime (XRT) also collects profiling data during application execution in both emulation and system mode configurations. Examples of the reported data include:

  • Host and device timeline events
  • OpenCL™ API call sequence
  • Kernel execution sequence
  • Kernel start and stop signals
  • FPGA trace data including AXI transactions
  • Power profile information for the accelerator card.

Profiling reports and data can be used to isolate performance bottlenecks in the application, identify problems in the system, and optimize the design to improve performance. Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.

There are four distinct areas to consider when performing algorithm optimization in Vitis: system resource usage and performance, kernel optimization, host optimization, and data transfer optimization. The following Vitis reports and graphical tools support your efforts to profile and optimize these areas.

Reports are automatically generated after running the active build, either from the command line as described in Running an Application, or from the Vitis integrated design environment (IDE). Separate sets of reports are generated for all three build targets and can be found in the respective report directories. Refer to Directory Structure for more information on locating these reports.

Reports can be viewed in Vitis analyzer, or in some cases from the Vitis IDE. To access these reports from Vitis analyzer, refer to Using the Vitis Analyzer.

Enabling Profiling in Your Application

To enable profiling and the capture of trace data during the execution of your application, you must instrument the application for this task. This adds logic to the device binary and consumes additional device resources to track the behavior of the host and kernel execution steps.

TIP: While capturing profile data is a critical part of profiling and optimizing your accelerated application, it consumes additional resources and impacts performance. Be sure to remove these elements from your final production build.

The device binary (xclbin) file is configured for capturing profiling data by default. However, using the --profile_kernel option during the Vitis compiler linking process instruments the FPGA binary by adding Acceleration Monitors and AXI Performance Monitors to kernels. This option has three distinct instrumentation options: data, stall, and exec, as described in the Vitis Compiler Command.

As an example, add --profile_kernel to the v++ linking command line:
v++ -g -l --profile_kernel data:all:all:all ...
TIP: Be sure to also use the v++ -g option (or --debug) when compiling your kernel code for software or hardware emulation.

After your application is enabled for profiling during the build process, data gathering must also be enabled in XRT by editing the xrt.ini file. Refer to xrt.ini File for more information.

The following xrt.ini file will enable maximum information gathering when the application is run:
[Debug]
profile=true
power_profile=true
timeline_trace=true
data_transfer_trace=coarse
stall_trace=all
To enable the profiling of Kernel Internals data, you must also add the debug_mode tag in the [Emulation] section of the xrt.ini:
[Emulation]
debug_mode=batch
For Live Waveform Viewer, debug_mode is as follows:
[Emulation]
debug_mode=gui
TIP: If Live Waveform Viewer is enabled, the simulation waveform opens during the hardware emulation run.

If you are collecting a large amount of trace data, you can specify the --trace_memory option for the Vitis compiler, and add the trace_buffer_size keyword in the xrt.ini file.

  • --trace_memory: indicates what type of memory to use for capturing profile data, as described in Vitis Compiler General Options.
  • trace_buffer_size: Specifies the amount of memory to use for capturing the profile data.

Finally, you can enable continuous trace capture to offload device trace data continuously while the application is running, so that in the event of an application or system crash, some trace data is available to help debug the application. To enable it, add the continuous_trace keyword to the xrt.ini file.
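
As an example, the following sketch combines these settings; the memory resource, buffer size, and option values shown are placeholders to adjust for your platform and the amount of trace data you expect to collect:
v++ -g -l --profile_kernel data:all:all:all --trace_memory DDR ...

[Debug]
profile=true
timeline_trace=true
trace_buffer_size=256M
continuous_trace=true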

Baselining Functionalities and Performance

Methodology for Accelerating Applications with the Vitis Software Platform provides an overview of designing an application, beginning with profiling the application to identify functions to accelerate, and leading into recommended ways of developing C/C++ accelerators. As discussed in that guide, it is very important to understand the architecture and performance of your application before you start any optimization effort. This is achieved by establishing a baseline for the application in terms of functions and performance.

Figure 1: Baselining Functions and Performance Flow

Identify Bottlenecks

The first step is to identify the bottlenecks of your application running on your target platform. The most effective way is to run the application with profiling tools, such as valgrind, callgrind, and GNU gprof. The profiling data generated by these tools shows the call graph with the number of calls to all functions and their execution time.
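
As an example, the following minimal gprof flow shows the general pattern; the source file name and compiler flags are only placeholders, and valgrind/callgrind follow a similar run-then-inspect sequence:
g++ -pg -O2 -o host host.cpp                # build with gprof instrumentation
./host                                      # run the workload; writes gmon.out
gprof ./host gmon.out > gprof_report.txt    # call graph with call counts and times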

Run Software and Hardware Emulation

Run software and hardware emulation on the accelerated application as described in Running an Application, to verify functional correctness, and to generate profiling data on the host code and the kernels. Use Vitis analyzer to review the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baseline performance estimate for timing interval, latency, and resource utilization, such as DSP and block RAM.

Build and Run the Application

The last step in baselining is building and running the application on an FPGA acceleration card, like one of the Alveo™ Data Center accelerator cards. Analyze the reports from the system compilation, and the profiling data from application execution to see the actual performance and resource utilization on hardware.

TIP: Save all the reports during the baseline process, so that you can refer back to them and compare results during optimization.

Guidance

The Vitis core development kit has a comprehensive design guidance tool that provides immediate, actionable guidance to the software developer for issues detected in their designs. These issues might be related to the source code, or due to missed tool optimizations. The guidance rules are generic, based on an extensive set of reference designs, and therefore might not be applicable to your specific design. It is up to you to understand the specific guidance rules and take appropriate action based on your specific algorithm and requirements.

Guidance is generated by Vitis HLS, the Vitis profiler, and the Vivado Design Suite when invoked by the v++ compiler. The generated design guidance can have several severity levels; warning messages, informational messages, and design rule checks are provided during software emulation, hardware emulation, and system builds. The profile design guidance helps you interpret the profiling results, which allows you to focus on improving performance.

Guidance includes message text for reported violations, a brief suggested resolution, and a detailed resolution provided as a web link. You can determine your next course of action based on the suggested resolution. This helps improve productivity by quickly highlighting issues and directing you to additional information on using the Vitis technology.

Design guidance is automatically generated after building or running an application from the command line or Vitis IDE.

You can open the Guidance report as discussed in Using the Vitis Analyzer. To access the Guidance report, open the Compile Summary, the Link Summary, or the Run Summary, and open the Guidance report.

  • Kernel Guidance is generated by the Vitis HLS tool after a kernel is built using the v++ compile command. This can be viewed in the Vitis analyzer by opening the Compile Summary report. Kernel Guidance and Compile Summary files are generated for each kernel compiled. Kernel Guidance includes recommendations on using dataflow, and possible reasons why the expected throughput could not be achieved.
  • System Guidance is generated after the kernels are linked using the v++ link command. This can be viewed in the Vitis analyzer by opening the Link Summary report. System Guidance includes all Kernel Guidance checks and provides a comprehensive review before running your application.
  • Run Guidance is generated when your .xclbin file is run, and is a feature of XRT. This can be viewed by opening the Run Summary in the Vitis analyzer. Run Guidance includes checks such as whether kernel stalls exceed 50%, and recommendations such as using PLRAM instead of DDR where applicable.

With the Guidance report open, the Guidance view displays the messages along with resolution columns. Extended help for the resolutions is also available through web links.

The following image shows an example of the Guidance report displayed in the Vitis analyzer. Clicking a link in the Name column opens a description of the rule check. Links in the Details column can open source code, select a design object such as a kernel, or navigate to another report.

Figure 2: Design Guidance Example

There is one HTML guidance report for each run of the v++ command, including compile and link. The report files are located in the directory specified by --report_dir, under the specific output name. For example:

  • v++_compile_<output>_guidance.html for v++ compilation
  • v++_link_<output>_guidance.html for v++ linking

You can click the web link in the Resolution column to get additional details about the resolution. The Guidance Messaging web page lists all of the current messages for your review.

Figure 3: Guidance Messaging Web Page

Kernel and Compute Unit objects, as well as profile reported data values, can also be cross-probed to other views like the System Diagram or Profile Report. Refer to Working with Reports for more information.

Opening the Guidance Report

When kernels are compiled and when the FPGA binary is linked, guidance reports are generated automatically by the v++ command. You can view these reports in the Vitis analyzer by opening the <output_filename>.compile_summary or the <output_filename>.link_summary for the application project. The <output_filename> is the output of the v++ command.

As an example, launch the Vitis analyzer and open the report using this command:

vitis_analyzer <output_filename>.link_summary

When the Vitis analyzer opens, it displays the link summary report, as well as the compile summary and a collection of reports generated during the compile and link processes. Both the compile and link steps generate Guidance reports, which you can view by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.

Interpreting Guidance Data

The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations.

In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column and annotated with symbols representing the severity. In the HTML report, the rules are listed individually, and because the HTML report does not show tooltips, it also includes a Full Name column.

The following list describes all fields and their purpose as included in the HTML guidance reports.

Id
Each guidance rule is assigned a unique ID. Use this ID to uniquely identify a specific message from the guidance report.
Name
The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view.
Severity
The Severity column allows the easy identification of the importance of a guidance rule.
Full Name
The Full Name provides a less cryptic name compared to the mnemonic name in the Name column.
Categories
Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view.
Threshold
The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices.
Actual
The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met.
Details
The Details column provides a brief message describing the specifics of the current rule.
Resolution
The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a popup window or the documentation with tips and code snippets that you can apply to the specific issue.

System Estimate Report

Building the hardware system and the FPGA binary to run on Xilinx devices is the process step with the longest execution time. Build time is also affected by the target device and the number of compute units instantiated onto the FPGA fabric. Therefore, it is useful to estimate the performance of an application without needing to build it for the system hardware.

The System Estimate report provides estimates of FPGA resource usage and the estimated frequency at which the hardware accelerated kernels can operate. The report is automatically generated for hardware emulation and system hardware builds. The report contains high-level details of the user kernels, including resource usage and estimated frequency. This report can be used to guide design optimization.

You can also force the generation of the System Estimate report with the following option:

v++ ... --report_level estimate

An example report is shown in the figure:

Figure 4: System Estimate

Opening the System Estimate Report

The System Estimate report can be opened in the Vitis analyzer tool, which is intended for viewing reports generated by the Vitis compiler when the application is built and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer <output_filename>.link_summary

The <output_filename> is the output of the v++ command. This opens the Link Summary for the application project in the Vitis analyzer tool. Then, select the System Estimate report. Refer to Using the Vitis Analyzer for more information.

TIP: Because the System Estimate report is a text file, you can also view it in a text editor on the target platform.

Interpreting the System Estimate Report

The System Estimate report generated by the v++ command provides information on every binary container in the application, as well as every compute unit in the design. The report is structured as follows:

  • Target device information
  • Summary of every kernel in the application
  • Detailed information on every binary container in the solution

The following example report file represents the information generated for the estimate report:

-------------------------------------------------------------------------------
Design Name:             mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device:           xilinx:u200:xdma:201830.2
Target Clock:            300.000000MHz
Total number of kernels: 1
-------------------------------------------------------------------------------

Kernel Summary
Kernel Name  Type  Target              OpenCL Library                          Compute Units
-----------  ----  ------------------  --------------------------------------  -------------
mmult        c     fpga0:OCL_REGION_0  mmult.hw_emu.xilinx_u200_xdma_201830_2  1


-------------------------------------------------------------------------------
OpenCL Binary:     mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit  Kernel Name  Module Name  Target Frequency  Estimated Frequency
------------  -----------  -----------  ----------------  -------------------
mmult_1       mmult        mmult        300.300293        411.015198

Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case  
------------  -----------  -----------  --------------  ---------  --------  ----------  
mmult_1       mmult        mmult        826 ~ 829       825        827       828         

Area Information
Compute Unit  Kernel Name  Module Name  FF     LUT    DSP   BRAM  URAM
------------  -----------  -----------  -----  -----  ----  ----  ----
mmult_1       mmult        mmult        81378  35257  1036  2     0
-------------------------------------------------------------------------------

Design and Target Device Summary

All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report:

-------------------------------------------------------------------------------
Design Name:             mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device:           xilinx:u200:xdma:201830.2
Target Clock:            300.000000MHz
Total number of kernels: 1
-------------------------------------------------------------------------------

For the design summary, the information provided includes the following:

Target Device
Name of the Xilinx device on the target platform that runs the FPGA binary built by the Vitis compiler.
Target Clock
Specifies the target operating frequency for the compute units (CUs) mapped to the FPGA fabric.

Kernel Summary

This section lists all of the kernels defined for the application project. The following example shows the kernel summary:

Kernel Summary
Kernel Name  Type  Target              OpenCL Library                          Compute Units
-----------  ----  ------------------  --------------------------------------  -------------
mmult        c     fpga0:OCL_REGION_0  mmult.hw_emu.xilinx_u200_xdma_201830_2  1

In addition to the kernel name, the summary also provides the execution target and the type of the input source. Because there is a difference in compilation and optimization methodology for OpenCL™ and C/C++ source files, the type of kernel source file is specified.

The Kernel Summary section is the last summary information in the report. From here on, detailed information on each binary container and its compute units is presented.

Timing Information

For each binary container, the detail section begins with the execution target of all compute units (CUs). It also provides timing information for every CU. As a general rule, if the estimated frequency for the FPGA binary is higher than the target frequency, the CU will be able to run in the device. If the estimated frequency is below the target frequency, the kernel code for the CU needs to be further optimized to run correctly on the FPGA fabric. This information is shown in the following example:

OpenCL Binary:     mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region

Timing Information (MHz)
Compute Unit  Kernel Name  Module Name  Target Frequency  Estimated Frequency
------------  -----------  -----------  ----------------  -------------------
mmult_1       mmult        mmult        300.300293        411.015198

It is important to understand the difference between the target and estimated frequencies. CUs are not placed in isolation into the FPGA fabric. CUs are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications.

Because the CU custom logic is generated one kernel at a time, an estimated frequency that is higher than the target frequency indicates that the CU can run at the higher estimated frequency. Therefore, the CU should meet timing at the target frequency during implementation of the FPGA binary.

Latency Information

The latency information presents the execution profile of each CU in the binary container. When analyzing this data, it is important to recognize that all values are measured from the CU boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for CUs targeted at the FPGA fabric. The following is an example of the latency report:

Latency Information (clock cycles)
Compute Unit  Kernel Name  Module Name  Start Interval  Best Case  Avg Case  Worst Case  
------------  -----------  -----------  --------------  ---------  --------  ----------  
mmult_1       mmult        mmult        826 ~ 829       825        827       828         

The latency report is divided into the following fields:

  • Start interval
  • Best case latency
  • Average case latency
  • Worst case latency

The start interval defines the amount of time that has to pass between invocations of a CU for a given kernel.

The best, average, and worst case latency numbers refer to how much time it takes the CU to generate the results of one ND Range data tile for the kernel. For cases where the kernel does not have data dependent computation loops, the latency values will be the same. Data dependent execution of loops introduces data specific latency variation that is captured by the latency report.

The interval or latency numbers are reported as "undef" for kernels that meet one or more of the conditions listed below (see the example after the note):
  • OpenCL kernels that do not have an explicit reqd_work_group_size(x,y,z)
  • Kernels that have loops with variable bounds
Note: The latency information reflects estimates based on the analysis of the loop transformations and exploited parallelism of the model. Advanced transformations, such as pipelining and dataflow, can heavily change the actual throughput numbers. Therefore, latency should only be used as a relative guide between different runs.
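
For the first condition above, adding the reqd_work_group_size attribute to the OpenCL kernel allows defined interval and latency numbers to be reported. The following is only a sketch; the kernel name and arguments are illustrative:
__kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void mmult(__global const int *in1, __global const int *in2, __global int *out)
{
    /* kernel body */
}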

Area Information

Although the FPGA can be thought of as a blank computational canvas, there are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by the Vitis compiler to generate the custom logic for each CU in the design. The quantity of fundamental resources needed to implement the custom logic for a single CU determines how many CUs can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a single CU:

Area Information
Compute Unit  Kernel Name  Module Name  FF     LUT    DSP   BRAM  URAM
------------  -----------  -----------  -----  -----  ----  ----  ----
mmult_1       mmult        mmult        81378  35257  1036  2     0
-------------------------------------------------------------------------------

HLS Report

The HLS report provides details about the high-level synthesis (HLS) process of a user kernel and is generated in hardware emulation and system builds. This process translates the C/C++ and OpenCL kernel into the hardware description language used for implementing the kernel logic on the FPGA. The report provides estimated FPGA resource usage, operating frequency, latency, and interface signals of the custom-generated hardware logic. These details provide many insights to guide kernel optimization.

When running from the command line, this report can be found in the following directory:

_x/<kernel_name>.<target>.<platform>/<kernel_name>/<kernel_name>/solution/syn/report

The HLS report can be opened from the Vitis analyzer by opening the Compile Summary, or the Link Summary as described in Using the Vitis Analyzer. An example of the HLS report is shown.

Figure 5: HLS Report

Generating and Opening the HLS Report

IMPORTANT: You must specify the --save-temps option during the build process to preserve the intermediate files produced by Vitis HLS, including the reports. The HLS report and HLS guidance are only generated for hardware emulation and system builds for C and OpenCL kernels. They are not generated for software emulation or RTL kernels.

The HLS report can be viewed through the Vitis analyzer by opening the <output_filename>.compile_summary or the <output_filename>.link_summary for the application project. The <output_filename> is the output of the v++ command.

You can launch the Vitis analyzer and open the report using the following command:

vitis_analyzer <output_filename>.compile_summary

When the Vitis analyzer opens, it displays the Compile Summary and a collection of reports generated during the compile process. Refer to Using the Vitis Analyzer for more information.

Interpreting the HLS Report

The HLS Synthesis report is a spreadsheet listing the module hierarchy in the left column. Each module and loop generated by the HLS run is represented in this hierarchy. The HLS Synthesis report contains the following columns:

  • Violation Type
  • Latency in clock cycles
  • Latency in absolute time (µs)
  • Iteration latency
  • Iteration Interval
  • Loop Tripcount
  • Pipelined
  • Utilization Estimates of BRAM, DSP, FF, and LUT
  • Negative Slack

If this information is part of a hierarchical block, it will sum up the information of the blocks contained in the hierarchy. Therefore, the hierarchy can also be navigated from within the report when it is clear which instance contributes to the overall design.

CAUTION: The absolute counts of cycles and latency are based on estimates identified during HLS synthesis, especially with advanced transformations such as pipelining and dataflow. These numbers might not accurately reflect the final results. If you encounter question marks in the report, this might be due to variable-bound loops; you are encouraged to set trip counts for such loops to have some relative estimates presented in this report.
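
As an example, a variable-bound loop can be annotated with the HLS LOOP_TRIPCOUNT pragma so that the report shows relative estimates instead of question marks. This is a sketch only; the function and the bounds are illustrative, and the pragma affects reporting, not the generated hardware:
// Annotate a variable-bound loop with an estimated trip count for reporting.
void accumulate(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
#pragma HLS loop_tripcount min=1 max=256 avg=128
        sum += in[i];
    }
    *out = sum;
}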

Profile Summary Report

When properly configured, the Xilinx Runtime (XRT) collects profiling data on host applications and kernels. After the application finishes execution, the Profile Summary report is saved as a .csv file in the directory where the compiled host code is executed.

The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine kernel execution and data transfer statistics.

TIP: The Profile Summary report can be generated for all build configurations. However, with the software emulation build, the report will not include any data transfer details under kernel execution efficiency and data transfer efficiency. This information is only generated in hardware emulation or system builds.

An example of the Profile Summary report is shown below.

Figure 6: Profile Summary

The Profile Summary report displays a navigation index based on available sections. The report presents statistics as bar graphs; refer to Interpreting the Profile Summary for details.

Generating and Opening the Profile Summary Report

Capturing the data required for the Profile Summary requires a few steps prior to actually running the application.

  1. The FPGA binary (xclbin) file is configured for capturing profiling data by default. However, using the v++ --profile_kernel option during the linking process enables a greater level of detail in the profiling data captured. For more information on the --profile_kernel option, see the Vitis Compiler Command.
  2. The runtime requires the presence of an xrt.ini file, as described in xrt.ini File, that includes the keyword for capturing profiling data:
    [Debug]
    profile = true
  3. To enable the profiling of Kernel Internals data, you must also add the debug_mode tag in the [Emulation] section of the xrt.ini:
    [Emulation]
    debug_mode = batch

With profiling enabled in the FPGA binary and in the xrt.ini file, the runtime creates the profile_summary.csv report file when running the application, and also creates the profile_kernels.csv and timeline_kernels.csv files when Kernel Internals is enabled.

The CSV report can be viewed in a spreadsheet tool or utility, or it can be opened in the Vitis analyzer tool, which is intended for viewing reports generated by the Vitis compiler when the application is built and by the XRT library when the application is run. You can launch the Vitis analyzer and access the Profile Summary report through the Run Summary using the following command:

vitis_analyzer project1.run_summary

Interpreting the Profile Summary

The profile summary includes a number of useful statistics for your host application and kernels. The report provides a general idea of the functional bottlenecks in your application. The following tables show the profile summary descriptions.

Settings

This displays the report and XRT configuration settings.

Summary

This displays summary statistics including device execution time and device power.

Kernels & Compute Units

The following table displays the profile summary data for all kernel functions scheduled and executed.

Table 1. Kernel Execution
Name Description
Kernel Name of kernel
Enqueues Number of times kernel is enqueued. When the kernel is enqueued only once, the following stats will all be the same.
Total Time Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model) (in ms)
Minimum Time Minimum runtime of all enqueues
Average Time Average kernel runtime (in ms)

(Total time) / (Number of enqueues)

Maximum Time Maximum runtime of all enqueues (in ms)

The following table displays the profile summary data for top kernel functions.

Table 2. Top Kernel Execution
Name Description
Kernel Name of kernel
Kernel Instance Address Host address of kernel instance (in hex)
Context ID OpenCL Context ID on host
Command Queue ID OpenCL Command queue ID on host
Device Name of OpenCL device where kernel was executed (format: <device>-<ID>)
Start Time Start time of execution (in ms)
Duration Duration of execution (in ms)

The following table displays the profile summary data for all compute units on the device.

Table 3. Compute Unit Utilization
Name Description
Compute Unit Name of compute unit
Kernel Kernel this compute unit is associated with
Device Name of the OpenCL device (format: <device>-<ID>)
Calls Number of times the compute unit is called
Dataflow Execution Specifies whether the CU is executed with dataflow
Max Parallel Executions Number of executions in the dataflow region
Dataflow Acceleration Shows the performance improvement due to dataflow execution
Total Time Sum of the runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average runtime of all calls (in ms): (Total time) / (Number of work groups)
Maximum Time Maximum runtime of all calls (in ms)
Clock Frequency Clock frequency used for a given accelerator (in MHz)
CU Utilization (%) Shows the percent of the total kernel runtime that is consumed by the CU

The following table displays the profile summary data for running times and stalls for compute units on the device.

Table 4. Compute Unit Running Times & Stalls
Name Description
Compute Unit Name of compute unit
Running Time Total time compute unit was running (in µs)
Intra-Kernel Dataflow Stalls (%) Percent time the compute unit was stalling from intra-kernel streams
External Memory Stalls (%) Percent time the compute unit was stalling from external memory accesses
Inter-Kernel Pipe Stalls (%) Percent time the compute unit was stalling from inter-kernel pipe accesses

Kernel Data Transfers

The following table displays the data transfers between kernels and global memory.

Table 5. Data Transfer
Name Description
Compute Unit Port Name of compute unit/port
Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Memory Resources Memory resource accessed by this port
Transfer Type Type of kernel data transfers
Number of Transfers Number of kernel data transfers (in AXI transactions)
Note: This might contain printf transfers.
Transfer Rate Rate of kernel data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active
Avg Bandwidth Utilization (%) Average bandwidth of kernel data transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (0.6 * Max. Theoretical Rate)

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of kernel data transfers (in ns)

The following table displays the top data transfers between kernels and global memory.

Table 6. Top Data Transfer
Name Description
Compute Unit Name of compute unit
Device Name of device
Number of Transfers Number of write and read data transfers
Avg Bytes per Transfer Average bytes of kernel data transfers:

Average Bytes = (Total Bytes) / (Number of Transfers)

Transfer Efficiency (%) Efficiency of kernel data transfers:

Efficiency = (Average Bytes) / min((Memory Byte Width * 256), 4096)

Total Data Transfer Total data transferred by kernels (in MB):

Total Data = (Total Write) + (Total Read)

Total Write Total data written by kernels (in MB)
Total Read Total data read by kernels (in MB)
Total Transfer Rate Average total data transfer rate (in MB/s):

Total Transfer Rate = (Total Data Transfer) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active

The following table displays the data transfer streams.

Note: This table is only shown if there is stream data.
Table 7. Data Transfer Streams
Name Description
Master Port Name of master compute unit and port
Master Kernel Arguments List of kernel arguments attached to this port
Slave Port Name of slave compute unit and port
Slave Kernel Arguments List of kernel arguments attached to this port
Device Name of device (format: <device>-<ID>)
Number of Transfers Number of stream data packets
Transfer Rate Rate of stream data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total CU Execution Time)

Where total CU execution time is the total time the CU was active

Avg Size Average size of kernel data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Link Utilization (%) Link utilization (%):

Link Utilization = 100 * (Link Busy Cycles - Link Stall Cycles - Link Starve Cycles) / (Link Busy Cycles)

Link Starve (%) Link starve (%):

Link Starve = 100 * (Link Starve Cycles) / (Link Busy Cycles)

Link Stall (%) Link stall (%):

Link Stall = 100 * (Link Stall Cycles) / (Link Busy Cycles)

Host Data Transfers

The following table displays profile data for all write transfers between the host and device memory through the PCI Express® link.

Table 8. Top Memory Writes
Name Description
Buffer Address Specifies the address location for the buffer
Context ID OpenCL Context ID on host
Command Queue ID OpenCL Command queue ID on host
Start Time Start time of write operation (in ms)
Duration Duration of write operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Writing Rate Data transfer rate (in MB/s):

(Buffer Size)/(Duration)

The following table displays profile data for all read transfers between the host and device memory through the PCI Express® link.

Table 9. Top Memory Reads
Name Description
Buffer Address Specifies the address location for the buffer
Context ID OpenCL Context ID on host
Command Queue ID OpenCL Command queue ID on host
Start Time Start time of read operation (in ms)
Duration Duration of read operation (in ms)
Buffer Size Amount of data being transferred (in KB)
Reading Rate Data transfer rate (in MB/s):

(Buffer Size)/(Duration)

The following table displays the data transfers from the host to global memory.

Table 10. Data Transfer
Name Description
Context:Number of Devices Context ID and number of devices in context
Transfer Type Type of kernel host transfers
Number of Buffer Transfers Number of host buffer transfers
Note: This might contain printf transfers.
Transfer Rate Rate of host buffer transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Avg Bandwidth Utilization (%) Average bandwidth of host buffer transfers:

Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate)

Avg Size Average size of host buffer transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Total Time Sum of host buffer transfer durations (in ms)
Avg Time Average of host buffer transfer durations (in ms)

API Calls

The following table displays the profile data for all OpenCL host API function calls executed in the host application. The top of the report displays a bar graph of API call time as a percentage of total time.

Table 11. API Calls
Name Description
API Name Name of the API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
Calls Number of calls to this API made by the host application
Total Time Sum of runtimes of all calls (in ms)
Minimum Time Minimum runtime of all calls (in ms)
Average Time Average Time (in ms)

(Total time) / (Number of calls)

Maximum Time Maximum runtime of all calls (in ms)

Device Power

The following table displays the profile data for device power.

Table 12. Device Power
Name Description
Power Used By Platform Shows a line graph of the three power rails on a Data Center acceleration card:
  • 12V Auxiliary
  • 12V PCIe
  • Internal power
These show the power (W) usage of the card over time.

Kernel Internals

The following table displays the running time for compute units in microseconds (µs) and reports stall time as a percentage of the running time.

TIP: The Kernel Internals tab reports time in µs, while the rest of the Profile Summary reports time in milliseconds (ms).
Table 13. CU Runtime and Stalls
Name Description
Compute Unit Indicates the compute unit instance name
Running Time Reports the total running time for the CU (in µs)
Intra-Kernel Dataflow Stalls (%) Reports the percentage of running time consumed in stalls when streaming data between kernels
External Memory Stalls (%) Reports the percentage of running time consumed in stalls for memory transfers outside the CU
Inter-Kernel Pipe Stalls (%) Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU

The following table displays the data transfers for specific ports on the compute unit.

Table 14. CU Port Data Transfers
Name Description
Port Indicates the port name on the compute unit
Compute Unit Indicates the compute unit instance name
Write Time Specifies the total data write time on the port (in µs)
Outstanding Write (%) Specifies the percentage of the runtime consumed in the write process
Read Time Specifies the total data read time on the port (in µs)
Outstanding Read (%) Specifies the percentage of the runtime consumed in the read process

The following table displays the functional port data transfers on the compute unit.

Table 15. Functional Port Data Transfers
Name Description
Port Name of port
Function Name of function
Compute Unit Name of compute unit
Write Time Total time the port had an outstanding write (in µs)
Outstanding Write (%) Percent time the port had an outstanding write
Read Time Total time the port had an outstanding read (in µs)
Outstanding Read (%) Percent time the port had an outstanding read

The following table displays the running time and stalls on the compute unit.

Table 16. Functions
Name Description
Compute Unit Name of compute unit
Function Name of function
Running Time Total time function was running (in ms)
Intra-Kernel Dataflow Stalls Percent time the function was stalling from intra-kernel streams
External Memory Stalls Percent time the function was stalling from external memory accesses
Inter-Kernel Pipe Stalls Percent time the function was stalling from inter-kernel pipe accesses

Shell Data Transfers

The following table displays the DMA data transfers.

Table 17. DMA Data Transfer
Name Description
Device Name of device (format: <device>-<ID>)
Transfer Type Type of data transfers
Number of Transfers Number of data transfers (in AXI transactions)
Transfer Rate Rate of data transfers (in MB/s):

Transfer Rate = (Total Bytes) / (Total Time in µs)

Total Data Transfer Total amount of data transferred (in MB)
Total Time Total duration of data transfers (in ms)
Avg Size Average size of data transfers (in KB):

Average Size = (Total KB) / (Number of Transfers)

Avg Latency Average latency of data transfers (in ns)

For DMA bypass and Global Memory to Global Memory data transfers, see the DMA Data Transfer table above.

Application Timeline

The Application Timeline collects and displays host and kernel events on a common timeline to help you understand and visualize the overall health and performance of your systems. The graphical representation lets you see issues regarding kernel synchronization and efficient concurrent execution. The displayed events include:

  • OpenCL API calls from the host code.
  • Device trace data including compute units, AXI transaction start/stop.
  • Host events and kernel start/stops.

While this is useful for debugging and profiling the application, timeline and device trace data are not collected by default because collecting them can affect performance by adding time to the application execution. However, the trace data is collected with dedicated resources in the kernel and does not affect kernel functionality. The data is offloaded only at the end of the run (see the v++ --trace_memory option).

The following is a snapshot of the Application Timeline window which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown on the bottom of the image. Host activities include creating the program, running the kernel and data transfers between global memory and the host. The kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvements.

Figure 7: Application Timeline

Timeline data can be enabled and collected through the command line flow. However, viewing must be done in the Vitis analyzer as described in Using the Vitis Analyzer.

Generating and Opening the Application Timeline

To generate the Application Timeline report, you must complete the following steps to enable timeline and device trace data collection in the command line flow:

  1. Instrument the FPGA binary during linking, by adding Acceleration Monitors and AXI Performance Monitors to kernels using the v++ --profile_kernel option. This option has three distinct instrumentation options (data, stall, and exec), as described in the Vitis Compiler Command. As an example, add --profile_kernel to the v++ linking command line:
    v++ -g -l --profile_kernel data:all:all:all ...
  2. After the kernels are instrumented during the build process, data gathering must also be enabled during the application runtime execution by editing the xrt.ini file. Refer to xrt.ini File for more information.
    The following xrt.ini file will enable maximum information gathering when the application is run:
    [Debug]
    profile=true
    timeline_trace=true
    data_transfer_trace=coarse
    stall_trace=all
    
    TIP: If you are collecting a large amount of trace data, you might need to use the --trace_memory with the v++ command, and the trace_buffer_size keyword in the xrt.ini.

    After running the application, the Application Timeline data is captured in a CSV file called timeline_trace.csv.

  3. The CSV report can be viewed in a spreadsheet tool or utility, or it can be opened in the Vitis analyzer tool, which is intended for viewing reports generated by the Vitis compiler when the application is built and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:
    vitis_analyzer timeline_trace.csv

Interpreting the Application Timeline

The Application Timeline window displays host and device events on a common timeline. This information helps you understand details of application execution and identify potential areas for improvements. The Application Timeline report has two main sections: Host and Device. The Host section shows the trace of all the activity originating from the host side. The Device section shows the activity of the CUs on the FPGA.

The report has the following structure:

  • Host
    OpenCL API Calls
    All OpenCL API calls are traced here. The activity time is measured from the host perspective.
    General
    All general OpenCL API calls such as clCreateProgramWithBinary, clCreateContext, and clCreateCommandQueue, are traced here.
    Queue
    OpenCL API calls that are associated with a specific command queue are traced here. This includes commands such as clEnqueueMigrateMemObjects, and clEnqueueNDRangeKernel. If the user application creates multiple command queues, then this section shows all the queues and activities.
    Data Transfer
    In this section, the DMA transfers from the host to the device memory are traced. There are multiple DMA threads implemented in the OpenCL runtime, and there is typically an equal number of DMA channels. A DMA transfer is initiated by the user application by calling OpenCL APIs such as clEnqueueMigrateMemObjects. These DMA requests are forwarded to the runtime, which delegates them to one of the threads. Data transfers from the host to the device appear under Write because they are written by the host, and transfers from the device to the host appear under Read. A minimal host-code sketch showing calls that produce these rows appears after this list.
    Kernel Enqueues
    The kernels enqueued by the host program are shown here. The kernels here should not be confused with the kernels/CUs on the device. Here, kernel refers to the NDRangeKernels and tasks created by the OpenCL commands clEnqueueNDRangeKernel and clEnqueueTask. These are plotted against time measured from the host's perspective. Multiple kernels can be scheduled to execute at the same time, and they are traced from the point they are scheduled to run until the end of the kernel execution; this is the reason for multiple entries. The number of rows depends on the number of overlapping kernel executions.
    Note: Overlapping of the kernels should not be mistaken for actual parallel execution on the device, as the process might not be ready to execute right away.
  • Device "name"
    Binary Container "name"
    Binary container name.
    Accelerator "name"
    Name of the compute unit (a.k.a., Accelerator) on the FPGA.
    User Functions
    In the case of Vitis HLS kernels, functions that are implemented as dataflow processes are traced here. The trace for these functions shows the number of active instances that are currently executing in parallel. These names are generated in hardware emulation when the waveform is enabled.
    Note: Function level activity is only possible in hardware emulation.
    • Function: "name a"
    • Function: "name b"
    Read
    A CU reads from the DDR over AXI-MM ports. The trace of data read by a CU is shown here. The activity is shown as transactions, and the tooltip for each transaction shows more details of the AXI transaction. These names are generated when --profile_kernel data is used, where the name format is m_axi_<bundle name>(port).
    Write
    A CU writes to the DDR over AXI-MM ports. The trace of data written by a CU is shown here. The activity is shown as transactions, and the tooltip for each transaction shows more details of the AXI transaction. These names are generated when --profile_kernel data is used, where the name format is m_axi_<bundle name>(port).
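
The following host-code sketch shows the kind of call sequence that produces these timeline rows: a host-to-device migration traced under Data Transfer > Write, a kernel enqueue traced under Kernel Enqueues, and a device-to-host migration traced under Data Transfer > Read. It assumes the command queue, kernel, and buffers were created earlier, and is illustrative only:
#include <CL/cl.h>

/* Enqueue one write/execute/read sequence; each call appears as an event
   on the Application Timeline under the Host section. */
cl_int run_once(cl_command_queue q, cl_kernel krnl, cl_mem in_buf, cl_mem out_buf)
{
    cl_event write_done, kernel_done;

    /* Host to device: traced under Data Transfer > Write */
    clEnqueueMigrateMemObjects(q, 1, &in_buf, 0, 0, NULL, &write_done);

    /* Kernel execution: traced under Kernel Enqueues */
    clEnqueueTask(q, krnl, 1, &write_done, &kernel_done);

    /* Device to host: traced under Data Transfer > Read */
    clEnqueueMigrateMemObjects(q, 1, &out_buf, CL_MIGRATE_MEM_OBJECT_HOST,
                               1, &kernel_done, NULL);

    return clFinish(q);
}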

Low Overhead Profiling

The Vitis software platform supports an option for low overhead profiling that provides minimal information with little effect on execution time. When this option is used at run time, the timeline trace is still available, but with a limited amount of information. The low overhead profiling plug-in captures minimal information on OpenCL events and dumps a CSV file called lop_trace.csv at the end of execution. Low overhead profiling can be run in all three flows (hardware, hardware emulation, and software emulation).

Adding --profile at compile time is not required because no profile information is collected. Additionally, no guidance summary is generated.

How to Enable Low Overhead Profiling

To enable the low overhead plug-in, use the lop_trace flag in the [Debug] section of the xrt.ini file. By default, lop_trace is false and must be enabled by setting the parameter to true.

xrt.ini file
[Debug]
lop_trace=true

When enabled with lop_trace=true, the runtime generates lop_trace.csv. Use the vp_analyze command to generate the wdb/wcfg files so that they can be visualized within the Vitis analyzer.

vp_analyze lop_trace.csv

To obtain the lowest possible overhead, information collected in normal OpenCL profiling is omitted. Specifically, the following information is expected to not be available in the low overhead profiling trace:

  • Device events, such as compute unit executions or kernel memory transfers
  • Information about memory reads or writes, such as destination address or size
  • Information about kernel enqueues, such as kernel name or NDRange sizes
  • Dependencies between buffer transfers and kernel enqueue

While the lop_trace parameter can be mixed with other profiling parameters, doing so removes any benefit of low overhead by including all the portions of profiling that cause the high overhead. If profiling and lop_trace options are mixed, a message is displayed at standard output.

Waveform View and Live Waveform Viewer

The Vitis core development kit can generate a Waveform view when running hardware emulation. It displays in-depth details at the system level, CU level, and function level. The details include data transfers between the kernel and global memory and data flow through inter-kernel pipes. These details provide many insights into performance bottlenecks from the system level down to individual function calls to help optimize your application.

The Live Waveform Viewer is similar to the Waveform view; however, it provides even lower-level details with some degree of interactivity. The Live Waveform Viewer can also be opened using the Vivado logic simulator, xsim.

Note: The Waveform view allows you to examine the device transactions from within the Vitis analyzer, as described in Using the Vitis Analyzer. In contrast, the Live Waveform Viewer generates the Vivado simulation waveform viewer to examine the hardware transactions in addition to user selected signals.

Waveform view and Live Waveform Viewer data are not collected by default because collecting them requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. Refer to Generating and Opening the Waveform Reports for instructions on enabling these features.

Figure 8: Waveform View

You can also open the waveform database (.wdb) file with the Vivado logic simulator through the Linux command line:

xsim -gui <filename.wdb> &
TIP: The .wdb file is located in the directory where the compiled host code is executed.

Generating and Opening the Waveform Reports

Follow these instructions to enable waveform data collection from the command line during hardware emulation and open the viewer:

  1. Enable debug code generation during compilation and linking using the -g option.
    v++ -c -g -t hw_emu ...
  2. Create an xrt.ini file in the same directory as the host executable with the following contents (see xrt.ini File for more information):
    [Debug]
    profile=true
    timeline_trace=true
    
    [Emulation]
    debug_mode=batch
    For Live Waveform Viewer, debug_mode is as follows:
    [Emulation]
    debug_mode=gui
    TIP: If Live Waveform Viewer is enabled, the simulation waveform opens during the hardware emulation run.
  3. Run the hardware emulation build of the application as described in Running an Application. The hardware transaction data is collected in the waveform database file, <hardware_platform>-<device_id>-<xclbin_name>.wdb. Refer to Directory Structure for the location of this file.
  4. Open the Waveform view in the Vitis analyzer as described in Waveform View and Live Waveform Viewer.

Interpreting Data in the Waveform Views

The following image shows the Waveform view:

Figure 9: Waveform View

The Waveform and Live Waveform views are organized hierarchically for easy navigation.

  • The Waveform view is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows the viewer to descend all the way down to the individual signals responsible for the abstracted data. However, because the Waveform view is generated from the post-processed data, no additional signals can be added to the report, and some of the runtime analysis cannot be visualized, such as DATAFLOW transactions.
  • The Live Waveform Viewer displays the Vivado logic simulator (xsim) run, so you can add extra signals and internals of the register transfer level (RTL) design to the live view. Refer to the Vivado Design Suite User Guide: Logic Simulation (UG900) for information on working with the waveform viewer.

The hierarchy of the Waveform and Live Waveform views include the following:

Device "name"
Target device name.
Binary Container "name"
Binary container name.
Memory Data Transfers
For each DDR Bank, this shows the trace of all the read and write request transactions arriving at the bank from the host.
Kernel "name" 1:1:1
For each kernel and for each compute unit of that kernel, this section breaks down the activities originating from the compute unit.
Compute Unit: "name"
Compute unit name.
CU Stalls (%)
Stall signals are provided by the Vitis HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL pipes). The stall bus shown in the detailed kernel trace aggregates all of the lowest-level stall signals and reports the percentage that are stalling at any point in time. This provides a measure of how much of the kernel is stalling at any point in the simulation.

For example, if there are 100 lowest level stall signals and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it is 9%.

Data Transfers
This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
User Functions
This information is available for the HLS kernels and shows the user functions.
Function: "name"
Function name.
Dataflow/Pipeline Activity
This shows the number of parallel executions of the function if the function is implemented as a dataflow process.
Active Iterations
This shows the currently active iterations of the dataflow. The number of rows is dynamically incremented to accommodate the visualization of any concurrent execution.
StallNoContinue
This is a stall signal that indicates whether there were any output stalls experienced by the dataflow processes (the function is done, but it has not received a continue from the adjacent dataflow process).
RTL Signals
These are the underlying RTL control signals that were used to interpret the above transaction view of the dataflow process.
Function Stalls
Shows the different types of stalls experienced by the process.
External Memory
Stalls experienced while accessing the DDR memory.
Inter-Kernel Pipe
If the compute units communicate with each other through pipes, this shows the related stalls.
Intra-Kernel Dataflow
FIFO activity internal to the kernel.
Function I/O
Actual interface signals.
Function: "name"
Function name.
Function: "name"
Function name.