Profiling the Application
The Vitis™ core development kit generates various system and kernel resource performance reports during compilation. These reports help you establish a performance baseline for your application, identify bottlenecks, and identify target functions that can be accelerated in hardware kernels, as discussed in Methodology for Architecting a Device Accelerated Application. The Xilinx® Runtime (XRT) collects profiling data during application execution in both emulation and hardware builds. Examples of profiling and event data that can be reported include:
- Host and device timeline events
- OpenCL™ or XRT native API call sequences
- Kernel execution sequence
- Kernel start and stop signals
- FPGA trace data including AXI transactions
- Power profile data for the accelerator card
- AI Engine profiling and event trace
- User event and range profiling
Profiling reports and data can be used to isolate performance bottlenecks in the application, identify problems in the system, and optimize the design to improve performance. Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.
There are four distinct areas to be considered when performing algorithm optimization in Vitis: System resource usage and performance, kernel optimization, host optimization, and data transfer optimization. The following Vitis reports and graphical tools support your efforts to profile and optimize these areas:
- Guidance
- System Estimate Report
- HLS Report
- Profile Summary Report
- Application Timeline
- Waveform View and Live Waveform Viewer
When properly enabled as described in Enabling Profiling in Your Application, reports are automatically generated while running the active build, either from the command line as described in Building and Running the Application, or from the Vitis integrated design environment (IDE). Separate reports are generated for the different build targets and can be found in the respective report directories. Refer to Output Directories of the v++ Command or Output Directories from the Vitis IDE for more information on locating these reports.
Reports can be viewed in Vitis analyzer, or in some cases from the Vitis IDE. To access these reports from Vitis analyzer, open the run_summary report as explained in Using the Vitis Analyzer.
Baselining Functionality and Performance
Methodology for Accelerating Applications with the Vitis Software Platform provides an overview of designing an application, beginning with profiling the application to identify functions to accelerate, and leading into recommended ways of developing C/C++ accelerators. As discussed in that guide, it is very important to understand the architecture and performance of your application before you start any optimization effort. This is achieved by establishing a baseline for the application in terms of functions and performance.
Identify Bottlenecks
The first step is to identify the bottlenecks of your application running on your target platform. The most effective way is to run the application with profiling tools such as the user profiling features described in Custom Profiling of the Host Application, or valgrind, callgrind, and GNU gprof. The profiling data generated by these tools shows the call graph with the number of calls to all functions and their execution time.
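As a minimal sketch of this step, the following commands collect a call graph with callgrind or gprof; the host binary name host_app is a placeholder, not part of the Vitis flow:
valgrind --tool=callgrind ./host_app
callgrind_annotate callgrind.out.<pid>
Alternatively, compile the host code with the -pg option and use GNU gprof:
g++ -pg -O2 -o host_app host_app.cpp
./host_app
gprof ./host_app gmon.out > gprof_report.txt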
Run Software and Hardware Emulation
Run software and hardware emulation on the accelerated application as described in Running Emulation, to verify functional correctness, and to generate profiling data on the host code and the kernels. Use Vitis analyzer to review the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baseline performance estimate for timing interval, latency, and resource utilization, such as DSP and block RAM.
Build and Run the Application
The last step in baselining is building and running the application on an FPGA acceleration card, like one of the Alveo™ Data Center accelerator cards, as described in Running the Application Hardware Build. Analyze the reports from the system compilation, and the profiling data from application execution to see the actual performance and resource utilization on hardware.
Enabling Profiling in Your Application
To enable profiling and capture event trace data during the execution of your application, you must instrument your application for this task. You must enable additional logic, which consumes additional device resources, to track the host and kernel execution steps and capture event data. This process involves optionally modifying your host application to capture custom data, modifying your kernel XO during compilation and the XCLBIN during linking to capture different types of profile data from the device-side activity, and configuring the Xilinx Runtime (XRT) as described in the xrt.ini File to capture data during the application runtime.
There are many different types of profiling for your applications, depending on which elements your system includes and what type of data you want to capture. The following table shows some of the levels of profiling that can be enabled, and indicates which are complementary and which are not.
Profile/Trace | Description | Comments |
---|---|---|
Host Application OpenCL API | OpenCL API and some limited device side (kernel) profiling. Specified by the use of the profile and timeline_trace options in the xrt.ini file. | Generates the profile_summary.csv and timeline_trace.csv files. |
Host Application XRT Native API | Specified by the use of the xrt_profile option in the xrt.ini file. | Generates trace events for the XRT API. |
Host Application User-Event Profiling | Requires additional code in the host application as described in Custom Profiling of the Host Application. | Generates user range data and user events for the host application. |
Low Overhead Profiling | Specified by the use of the lop_trace option in the xrt.ini file. | Generates the lop_trace.csv file as described in Enabling Low Overhead Profiling. |
Device Side Profiling | Enabled by the use of --profile options during v++ compilation and linking, as described in --profile Options. | Enables capturing data traffic between the host and kernel, kernel stalls, the execution times of kernels and compute units (CUs), as well as monitoring activity in Versal AI Engines. |
AI Engine Graph and Kernels | Specified by the use of the aie_profile and aie_trace options in the xrt.ini file. These options can be specified together or separately. | Generates the aie_profile_<device>.csv and aie_trace_##_<stream id>.txt reports. Is also disabled by the presence of user event profiling in the host application. |
Power Profile | Specified by the use of the power_profile option in the xrt.ini file. | Generates the power_profile_<device>.csv report. |
Vitis AI Profiling | Specified by the use of the vitis_ai_profile option in the xrt.ini file. | Enables counter profiling of DPUs to generate the profile_summary.csv file. |
The device binary (xclbin) file is configured for capturing limited device-side profiling data by default. However, using the --profile option during the Vitis compiler linking process instruments the device binary by adding Acceleration Monitors and AXI Performance Monitors to the system. This option has multiple instrumentation options: --profile.data, --profile.stall, and --profile.exec, as described in the --profile Options. For example, add --profile.data to the v++ linking command line:
v++ -g -l --profile.data all:all:all ...
TIP: Use the v++ -g option when compiling your kernel code for debugging with software or hardware emulation.
After your application is enabled for profiling during the v++ compile and link process, data gathering during application runtime must also be enabled in XRT by editing the xrt.ini file as discussed above. For example, the following xrt.ini file will enable OpenCL profiling, power profiling, and event and stall trace capture when the application is run:
[Debug]
profile=true
power_profile=true
timeline_trace=true
data_transfer_trace=coarse
stall_trace=all
To enable the profiling of Kernel Internals data, you must also add the debug_mode tag in the [Emulation] section of the xrt.ini file:
[Emulation]
debug_mode=batch
If you are collecting a large amount of trace data, you can increase the amount of available memory for capturing data by specifying the --trace_memory option during v++ linking and adding the trace_buffer_size keyword in the xrt.ini file.
--trace_memory
- Indicates what type of memory to use for capturing trace data, as described in Vitis Compiler General Options.
trace_buffer_size
- Specifies the amount of memory to use for capturing the trace data during the application runtime.
Finally, you can enable continuous trace capture to continuously offload device trace data while the application is running, so that in the event of an application or system crash, some trace data is available to help debug the application. To enable it, add the continuous_trace keyword to the xrt.ini file.
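The following sketch shows how these pieces might fit together; the memory bank (DDR[1]) and buffer size (100M) are example values, not requirements. On the v++ link command line:
v++ -l --profile.data all:all:all --trace_memory DDR[1] ...
And in the xrt.ini file:
[Debug]
profile=true
timeline_trace=true
trace_buffer_size=100M
continuous_trace=true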
Custom Profiling of the Host Application
All XRT-related actions from the host application are automatically tracked for profiling, through either the OpenCL API calls or the XRT API calls. However, you can also profile the host application beyond the XRT-related events, capturing event data based on user-specified actions or events. This feature provides two types of custom profiling:
- User range
- Profiles the specified start/end times across a range of code. This captures the span of time within which an action occurs in the host application.
- User events
- Marks an event in the timeline. The user event is added to the timeline waveform at whatever point in time it occurs.
Using custom profiling requires a few changes in your host application source code and build process. You must make use of the C or C++ API in your code, as described below, and you must include the xrt_coreutil library when linking your host application.
- The C/C++ API is described below, but can also be found at the following URL: https://github.com/Xilinx/XRT/blob/master/src/runtime_src/core/include/experimental/xrt_profile.h
- For both C and C++ you must add the following include:
#include "xrt/core/include/experimental/xrt_profile.h"
- When linking the host code, add -lxrt_coreutil to the compiler command line, as shown in the example after this list.
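For example, a host link command might look like the following; the file names are placeholders, and XILINX_XRT is assumed to point to your XRT installation:
g++ -std=c++17 -I$XILINX_XRT/include -L$XILINX_XRT/lib -o host_app host_app.cpp -lxrt_coreutil -pthread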
Profiling of C++ Code
For C++ code the provided objects are:
user_range
- This object captures the start time and end time of a measured range of activity with the specified ID. The object constructor is:
user_range(const std::string& label, const std::string& tooltip);
user_event
- This object marks an event occurring at a single point in time, adding the specified label onto the timeline trace. The object constructor is:
user_event()
Use the user_range constructor to create an object and start keeping track of time immediately upon construction. Usage details of the user_range objects:
- If a user_range is instantiated using the default constructor, no time is marked until the user calls user_range.start() with the label and tooltip.
- You can instantiate a user_range object passing the label and tooltip strings. This starts monitoring the range immediately.
- You must call user_range.start() and user_range.stop() to capture the ranges of time you are interested in.
- If user_range.stop() is not called, then any range being tracked lasts until the user_range object is destructed.
- The user_range object can be reused any number of times, by calling user_range.start()/user_range.stop() pairs in the host code.
- Sequential calls to user_range.start() ignore all but the first call.
- Sequential calls to user_range.stop() ignore all but the first call.
Usage of the user_event objects:
- A user_event object must be instantiated using the default constructor.
- Calls to user_event.mark() will create a user marker on the timeline trace at that particular time.
- user_event.mark() takes an optional const char* argument which will appear as a label on the timeline trace.
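The following minimal sketch shows one way these objects might be used in a host application. It assumes the xrt::profile namespace declared in the header referenced above; the surrounding host code (buffer setup, kernel launch, verification) is omitted:
#include "xrt/core/include/experimental/xrt_profile.h"

int main()
{
    {
        // Range starts at construction; because stop() is never called here,
        // the range ends when the object goes out of scope.
        xrt::profile::user_range setup_range("Setup", "Host-side buffer setup and kernel launch");
        // ... allocate buffers, program the device, enqueue the kernel ...
    }

    // Mark a single point in time on the timeline trace.
    xrt::profile::user_event events;
    events.mark("Verification started");
    // ... verify results ...
    return 0;
}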
The debug_profile example of the Vitis_Accel_Examples demonstrates user event profiling in a host application. With your host application properly instrumented, XRT can capture profile data from these user-defined ranges and events, as well as the standard XRT API-based events. You must enable profiling in the xrt.ini file as explained previously.
Profiling of C Code
For C code the provided functions are:
xrtURStart()
- This function establishes the start time of a measured range of activity with the specified ID. The function signature is:
void xrtURStart(unsigned int id, const char* label, const char* tooltip)
xrtUREnd()
- This function marks the end time of a measured range with the specified ID. The function signature is:
void xrtUREnd(unsigned int id)
xrtUEMark()
- This function marks an event occurring at a single point in time, adding the specified label onto the timeline trace. The function signature is:
void xrtUEMark(const char* label)
Use the xrtURStart() and xrtUREnd() functions to start keeping track of time immediately, and specify an ID to pair the start/end calls and define the user range. Usage details of the user range functions:
- Start/End ranges of one ID can be nested inside other Start/End ranges of a different ID.
- It is your responsibility to make sure the IDs match for the Start/End range you are profiling. IMPORTANT: Multiple calls to xrtURStart and xrtUREnd with the same ID can cause unexpected behavior.
- The user range can have a label that is added to the timeline, and a tooltip that is displayed when you place the cursor over the user range.
A call to xrtUEMark() will create a user marker on the timeline trace at the point of the event.
- xrtUEMark() lets you specify a label for the event. The label will appear on the timeline with the mark.
- You can use NULL for the label to add an unlabeled mark.
The following is example code:
int main(int argc, char* argv[]) {

  xrtURStart(0, "Software execution", "Whole program execution") ;
  ...
  //TARGET_DEVICE macro needs to be passed from gcc command line
  if(argc != 2) {
    std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
    return EXIT_FAILURE;
  }
  ....
  q.enqueueTask(krnl_vector_add);

  // The result of the previous kernel execution will need to be retrieved in
  // order to view the results. This call will transfer the data from FPGA to
  // source_results vector
  q.enqueueMigrateMemObjects({buffer_result}, CL_MIGRATE_MEM_OBJECT_HOST);

  q.finish();

  xrtUEMark("Starting verification") ;
Enabling Low Overhead Profiling
The Vitis software platform supports low overhead profiling that provides minimal information with little effect on execution time. Using this option during runtime, the timeline trace is still available but with a reduced amount of information. Low overhead profiling captures minimal information on OpenCL events and dumps a CSV file called lop_trace.csv at the end of execution. Low overhead profiling can be run in all three flows (hardware, hardware emulation, and software emulation).
To enable low overhead profiling, there is a flag in the [Debug] section of the xrt.ini File called lop_trace. By default, lop_trace is false and must be enabled by setting the parameter to true.
xrt.ini file
[Debug]
lop_trace=true
The lop_trace parameter can be enabled alongside other profiling parameters, but doing so eliminates any benefit of low overhead profiling by capturing all profiling data as well. When lop_trace=true is enabled, the runtime generates lop_trace.csv, which can be viewed in the Run Summary within Vitis analyzer:
vitis_analyzer <project>.run_summary
To obtain the lowest possible overhead, information collected in normal OpenCL profiling is omitted. Specifically, the following information is expected to not be available in the low overhead profiling trace:
- Device events, such as compute unit executions or kernel memory transfers
- Information about memory reads or writes, such as destination address or size
- Information about kernel enqueues, such as kernel name or NDRange sizes
- Dependencies between buffer transfers and kernel enqueue
Guidance
The Vitis core development kit has a comprehensive design guidance tool that provides immediate, actionable guidance to the software developer for issues detected in their designs. These issues might be related to the source code, or due to missed tool optimizations. The guidance rules are generic, based on an extensive set of reference designs, and might not be applicable to your specific design. It is up to you to understand the specific guidance rules and take appropriate action based on your specific algorithm and requirements.
Guidance is generated by Vitis HLS, the Vitis profiler, and the Vivado Design Suite when invoked by the v++ compiler. The generated design guidance can have several severity levels; warning messages, informational messages, and design rule checks are provided during software emulation, hardware emulation, and system builds. The profile design guidance helps you interpret the profiling results, which allows you to focus on improving performance.
Guidance includes message text for reported violations, a brief suggested resolution, and a detailed resolution provided as a web link. You can determine your next course of action based on the suggested resolution. This helps improve productivity by quickly highlighting issues and directing you to additional information on using the Vitis technology.
Design guidance is automatically generated after building or running an application from the command line or Vitis IDE.
You can open the Guidance report as discussed in Using the Vitis Analyzer. To access the Guidance report, open the Compile Summary, the Link Summary, or the Run Summary, and open the Guidance report.
- Kernel Guidance is generated by the Vitis HLS tool after a kernel is built using the v++ compile command. This can be viewed in the Vitis analyzer by opening the Compile Summary report. Kernel guidance, as well as Compile Summary files, is generated for each kernel compiled. Kernel guidance includes recommendations on using Dataflow, and possible reasons why the expected throughput could not be achieved.
- System Guidance is generated after the kernels are built using the v++ link command. This can be viewed in the Vitis analyzer by opening the Link Summary report. System guidance includes all Kernel Guidance checks, and provides a comprehensive review before running your application.
- Run Guidance is generated when your generated .xclbin is run, and is a feature of XRT. This can be viewed by opening the Run Summary in the Vitis analyzer. Run Guidance includes checks such as whether Kernel Stall is above 50%, and recommendations such as whether PLRAM can be used instead of DDR.
With the Guidance report open, the Guidance view displays the messages along with resolution columns. The resolutions also have extended weblink help available.
The following image shows an example of the Guidance report displayed in the Vitis analyzer. For example, clicking a link in the Name column opens a description of the rule check. Links in the Details column can open source code, select a design object such as a kernel, or navigate to another report.
There is one HTML guidance report for each run of the v++ command, including compile and link. The report files are located in the --report_dir directory under the specific output name. For example:
- v++_compile_<output>_guidance.html for v++ compilation
- v++_link_<output>_guidance.html for v++ linking
You can click the web link in the Resolution column to get additional details about the resolution. The Guidance Messaging web page lists all of the current messages for your review.
Kernel and Compute Unit objects, as well as profile reported data values, can also be cross-probed to other views like the System Diagram or Profile Report. Refer to Working with Reports for more information.
Opening the Guidance Report
When kernels are compiled and when the FPGA binary is linked, guidance reports are generated automatically by the v++ command. You can view these reports in the Vitis analyzer by opening the <output_filename>.compile_summary or the <output_filename>.link_summary for the application project. The <output_filename> is the output of the v++ command.
As an example, launch the Vitis analyzer and open the report using this command:
vitis_analyzer <output_filename>.link_summary
When the Vitis analyzer opens, it displays the link summary report, as well as the compile summary, and a collection of reports generated during the compile and link processes. Both the compile and link steps generate Guidance reports to view by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.
Interpreting Guidance Data
The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations.
In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column and annotated with symbols representing the severity. These are listed individually in the HTML report. In addition, as the HTML report does not show tooltips, a full Name column is included in the HTML report as well.
The following list describes all fields and their purpose as included in the HTML guidance reports.
- Id
- Each guidance rule is assigned a unique ID. Use this id to uniquely identify a specific message from the guidance report.
- Name
- The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view.
- Severity
- The Severity column allows the easy identification of the importance of a guidance rule.
- Full Name
- The Full Name provides a less cryptic name compared to the mnemonic name in the Name column.
- Categories
- Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view.
- Threshold
- The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices.
- Actual
- The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met.
- Details
- The Details column provides a brief message describing the specifics of the current rule.
- Resolution
- The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a popup window or the documentation with tips and code snippets that you can apply to the specific issue.
System Estimate Report
The process step with the longest execution time is building the hardware system and the FPGA binary to run on Xilinx devices. Build time is also affected by the target device and the number of compute units instantiated onto the FPGA fabric. Therefore, it is useful to estimate the performance of an application without needing to build it for the system hardware.
The System Estimate report provides estimates of FPGA resource usage and the estimated frequency at which the hardware accelerated kernels can operate. The report is automatically generated for hardware emulation and system hardware builds. The report contains high-level details of the user kernels, including resource usage and estimated frequency. This report can be used to guide design optimization.
You can also force the generation of the System Estimate report with the following option:
v++ .. --report_level estimate
An example report is shown in the figure:
Opening the System Estimate Report
The System Estimate report can be opened in the Vitis analyzer tool, which is intended for viewing reports generated by the Vitis compiler when the application is built, and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:
vitis_analyzer <output_filename>.link_summary
The <output_filename> is the output of the v++ command. This opens the Link Summary for the application project in the Vitis analyzer tool. Then, select the System Estimate report. Refer to Using the Vitis Analyzer for more information.
Interpreting the System Estimate Report
The System Estimate report generated by the v++ command provides information on every binary container in the application, as well as every compute unit in the design. The report is structured as follows:
- Target device information
- Summary of every kernel in the application
- Detailed information on every binary container in the solution
The following example report file represents the information generated for the estimate report:
----------------------------------------------------------------------------
Design Name: mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device: xilinx:u200:xdma:201830.2
Target Clock: 300.000000MHz
Total number of kernels: 1
----------------------------------------------------------------------------
Kernel Summary
Kernel Name Type Target OpenCL Library Compute Units
----------- ---- ------------------ -------------------------------------- -------------
mmult c fpga0:OCL_REGION_0 mmult.hw_emu.xilinx_u200_xdma_201830_2 1
-----------------------------------------------------------------------------
OpenCL Binary: mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region
Timing Information (MHz)
Compute Unit Kernel Name Module Name Target Frequency Estimated Frequency
------------ ----------- ----------- ---------------- -------------------
mmult_1 mmult mmult 300.300293 411.015198
Latency Information (clock cycles)
Compute Unit Kernel Name Module Name Start Interval Best Case Avg Case Worst Case
------------ ----------- ----------- -------------- --------- -------- ----------
mmult_1 mmult mmult 826 ~ 829 825 827 828
Area Information
Compute Unit Kernel Name Module Name FF LUT DSP BRAM URAM
------------ ----------- ----------- ----- ----- ---- ---- ----
mmult_1 mmult mmult 81378 35257 1036 2 0
----------------------------------------------------------------------------
Design and Target Device Summary
All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report:
----------------------------------------------------------------------------
Design Name: mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device: xilinx:u200:xdma:201830.2
Target Clock: 300.000000MHz
Total number of kernels: 1
----------------------------------------------------------------------------
For the design summary, the information provided includes the following:
- Target Device
- Name of the Xilinx device on the target platform that runs the FPGA binary built by the Vitis compiler.
- Target Clock
- Specifies the target operating frequency for the compute units (CUs) mapped to the FPGA fabric.
Kernel Summary
This section lists all of the kernels defined for the application project. The following example shows the kernel summary:
Kernel Summary
Kernel Name Type Target OpenCL Library Compute Units
----------- ---- ------------------ -------------------------------------- -------------
mmult c fpga0:OCL_REGION_0 mmult.hw_emu.xilinx_u200_xdma_201830_2 1
In addition to the kernel name, the summary also provides the execution target and type of the input source. Because there is a difference in compilation and optimization methodology for OpenCL™, C, and C/C++ source files, the type of kernel source file is specified.
The Kernel Summary section is the last summary information in the report. From here, detailed information on each compute unit binary container is presented.
Timing Information
For each binary container, the detail section begins with the execution target of all compute units (CUs). It also provides timing information for every CU. As a general rule, if the estimated frequency for the FPGA binary is higher than the target frequency, the CU will be able to run in the device. If the estimated frequency is below the target frequency, the kernel code for the CU needs to be further optimized to run correctly on the FPGA fabric. This information is shown in the following example:
OpenCL Binary: mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region
Timing Information (MHz)
Compute Unit Kernel Name Module Name Target Frequency Estimated Frequency
------------ ----------- ----------- ---------------- -------------------
mmult_1 mmult mmult 300.300293 411.015198
It is important to understand the difference between the target and estimated frequencies. CUs are not placed in isolation into the FPGA fabric. CUs are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications.
Because the CU custom logic is generated one kernel at a time, an estimated frequency that is higher than the target frequency indicates that the CU can run at the higher estimated frequency. Therefore, the CU should meet timing at the target frequency during implementation of the FPGA binary.
Latency Information
The latency information presents the execution profile of each CU in the binary container. When analyzing this data, it is important to recognize that all values are measured from the CU boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for CUs targeted at the FPGA fabric. The following is an example of the latency report:
Latency Information (clock cycles)
Compute Unit Kernel Name Module Name Start Interval Best Case Avg Case Worst Case
------------ ----------- ----------- -------------- --------- -------- ----------
mmult_1 mmult mmult 826 ~ 829 825 827 828
The latency report is divided into the following fields:
- Start interval
- Best case latency
- Average case latency
- Worst case latency
The start interval defines the amount of time that has to pass between invocations of a CU for a given kernel.
The best, average, and worst case latency numbers refer to how much time it takes the CU to generate the results of one ND Range data tile for the kernel. For cases where the kernel does not have data dependent computation loops, the latency values will be the same. Data dependent execution of loops introduces data specific latency variation that is captured by the latency report.
The latency values are reported as undefined for the following cases:
- OpenCL kernels that do not have explicit reqd_work_group_size(x,y,z)
- Kernels that have loops with variable bounds
Area Information
Although the FPGA can be thought of as a blank computational canvas, there are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by the Vitis compiler to generate the custom logic for each CU in the design. The quantity of fundamental resources needed to implement the custom logic for a single CU determines how many CUs can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a single CU:
Area Information
Compute Unit Kernel Name Module Name FF LUT DSP BRAM URAM
------------ ----------- ----------- ----- ----- ---- ---- ----
mmult_1 mmult mmult 81378 35257 1036 2 0
----------------------------------------------------------------------
HLS Report
The HLS report provides details about the high-level synthesis (HLS) process of a user kernel and is generated during the compilation process for hardware emulation and system builds. This process translates the C/C++ and OpenCL kernel into the hardware description language used for implementing the kernel logic on the FPGA. The report provides estimated FPGA resource usage, operating frequency, latency, and interface signals of the custom-generated hardware logic. These details provide many insights to guide kernel optimization.
When running from the Vitis IDE, this report can be found in the following directory: _x/<kernel_name>.<target>.<platform>/<kernel_name>/<kernel_name>/solution/syn/report
The HLS report can be opened from the Vitis analyzer by opening the Compile Summary, or the Link Summary as described in Using the Vitis Analyzer. An example of the HLS report is shown.
Generating and Opening the HLS Report
IMPORTANT: You must use the --save-temps option during the build process to preserve the intermediate files produced by Vitis HLS, including the reports. The HLS report and HLS guidance are only generated for hardware emulation and system builds for C and OpenCL kernels. They are not generated for software emulation or RTL kernels.
The HLS report can be viewed through the Vitis analyzer by opening the <output_filename>.compile_summary or the <output_filename>.link_summary for the application project. The <output_filename> is the output of the v++ command.
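For example, a kernel compile command that preserves the HLS reports might look like the following; the kernel name and platform are taken from the earlier example report and are placeholders for your own design:
v++ -c -t hw_emu --platform xilinx_u200_xdma_201830_2 --save-temps -k mmult -o mmult.xo mmult.cpp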
You can launch the Vitis analyzer and open the report using the following command:
vitis_analyzer <output_filename>.compile_summary
When the Vitis analyzer opens, it displays the Compile Summary and a collection of reports generated during the compile process. Refer to Using the Vitis Analyzer for more information.
Interpreting the HLS Report
The HLS Synthesis report is a spreadsheet listing the module hierarchy in the left column. Each module and loop generated by the HLS run is represented in this hierarchy. The HLS Synthesis report contains the following columns:
- Violation Type
- Latency in clock cycles
- Latency in absolute time (µs)
- Iteration latency
- Iteration Interval
- Loop Tripcount
- Pipelined
- Utilization Estimates of BRAM, DSP, FF, and LUT
- Negative Slack
If this information is part of a hierarchical block, it will sum up the information of the blocks contained in the hierarchy. Therefore, the hierarchy can also be navigated from within the report when it is clear which instance contributes to the overall design.
Profile Summary Report
When properly configured, the Xilinx Runtime (XRT) collects profiling data on host applications and kernels. XRT automatically captures profiling data for the host application as it makes calls to the runtime, either through OpenCL or XRT API calls. You can add user calls to your host application to capture additional profiling information, as explained in Custom Profiling of the Host Application. To capture details of the kernel operations, you must instrument the kernels using the --profile options as explained in the next section.
After the application finishes execution, the Profile Summary report is saved as a .csv file in the directory where the compiled host code is executed. The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine the kernel execution and data transfer statistics.
An example of the Profile Summary report is shown below.
Generating and Opening the Profile Summary Report
Capturing the data required for the Profile Summary requires a few steps prior to actually running the application.
- The FPGA binary (xclbin) file is configured for capturing profiling data by default. However, using the v++ --profile option during the linking process enables a greater level of detail in the profiling data captured. For more information, see the --profile Options.
- The runtime requires the presence of an xrt.ini file, as described in xrt.ini File, that includes the keyword for capturing profiling data:
[Debug]
profile = true
- To enable the profiling of Kernel Internals data, you must also add the debug_mode tag in the [Emulation] section of the xrt.ini:
[Emulation]
debug_mode = batch
With profiling enabled in the device binary and in the xrt.ini file, the runtime creates the profile_summary.csv report file when running the application, and also creates the profile_kernels.csv and timeline_kernels.csv files when Kernel Internals is enabled. These files are linked to the Profile Summary report which can be viewed in the Vitis analyzer tool through the Run Summary. Open the Run Summary using the following command:
vitis_analyzer <project>.run_summary
Interpreting the Profile Summary
The profile summary includes a number of useful statistics for your host application and kernels. The report provides a general idea of the functional bottlenecks in your application. The following tables show the profile summary descriptions.
Settings
This displays the report and XRT configuration settings.
Summary
This displays summary statistics including device execution time and device power.
Kernels & Compute Units
The following table displays the profile summary data for all kernel functions scheduled and executed.
Name | Description |
---|---|
Kernel | Name of kernel |
Enqueues | Number of times kernel is enqueued. When the kernel is enqueued only once, the following stats will all be the same. |
Total Time | Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model) (in ms) |
Minimum Time | Minimum runtime of all enqueues |
Average Time | Average kernel runtime (in ms) (Total time) / (Number of enqueues) |
Maximum Time | Maximum runtime of all enqueues (in ms) |
The following table displays the profile summary data for top kernel functions.
Name | Description |
---|---|
Kernel | Name of kernel |
Kernel Instance Address | Host address of kernel instance (in hex) |
Context ID | Context ID on host |
Command Queue ID | Command queue ID on host |
Device | Name of device where kernel was executed (format: <device>-<ID> ) |
Start Time | Start time of execution (in ms) |
Duration | Duration of execution (in ms) |
The following table displays the profile summary data for all compute units on the device.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Kernel | Kernel this compute unit is associated with |
Device | Name of the device (format: <device>-<ID> ) |
Calls | Number of times the compute unit is called |
Dataflow Execution | Specifies whether the CU is executed with dataflow |
Max Parallel Executions | Number of executions in the dataflow region |
Dataflow Acceleration | Shows the performance improvement due to dataflow execution |
CU Utilization (%) | Shows the percent of the total kernel runtime that is consumed by the CU |
Total Time | Sum of the runtimes of all calls (in ms) |
Minimum Time | Minimum runtime of all calls (in ms) |
Average Time | Average runtime of all calls (in ms): (Total time) / (Number of work groups) |
Maximum Time | Maximum runtime of all calls (in ms) |
Clock Frequency | Clock frequency used for a given accelerator (in MHz) |
The following table displays the profile summary data for running times and stalls for compute units on the device.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Running Time | Total time compute unit was running (in µs) |
Intra-Kernel Dataflow Stalls (%) | Percent time the compute unit was stalling from intra-kernel streams |
External Memory Stalls (%) | Percent time the compute unit was stalling from external memory accesses |
Inter-Kernel Pipe Stalls (%) | Percent time the compute unit was stalling from inter-kernel pipe accesses |
Kernel Data Transfers
The following table displays the data transfers from kernels to the global memory.
Name | Description |
---|---|
Compute Unit Port | Name of compute unit/port |
Kernel Arguments | List of kernel arguments attached to this port |
Device | Name of device (format: <device>-<ID> ) |
Memory Resources | Memory resource accessed by this port |
Transfer Type | Type of kernel data transfers |
Number of Transfers | Number of kernel data transfers (in AXI transactions) Note: This might contain printf transfers. |
Transfer Rate | Rate of kernel data transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total CU Execution Time) Where total CU execution time is the total time the CU was active |
Avg Bandwidth Utilization (%) | Average bandwidth of kernel data transfers: Bandwidth Utilization (%) = (100 * Transfer Rate) / (0.6 * Max. Theoretical Rate) |
Avg Size | Average size of kernel data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Avg Latency | Average latency of kernel data transfers (in ns) |
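As a hypothetical worked example of the formulas above, a port that moves 600 MB while its CU is active for 0.1 s has
Transfer Rate = 600 MB / 0.1 s = 6000 MB/s
and, assuming a maximum theoretical rate of 12,000 MB/s,
Avg Bandwidth Utilization (%) = (100 * 6000) / (0.6 * 12,000) ≈ 83%.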
The following table displays the top data transfers from kernels to the global memory.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Device | Name of device |
Number of Transfers | Number of write and read data transfers |
Avg Bytes per Transfer | Average bytes of kernel data transfers: Average Bytes = (Total Bytes) / (Number of Transfers) |
Transfer Efficiency (%) | Efficiency of kernel data transfers: Efficiency = (Average Bytes) / min((Memory Byte Width * 256), 4096) |
Total Data Transfer | Total data transferred by kernels (in MB): Total Data = (Total Write) + (Total Read) |
Total Write | Total data written by kernels (in MB) |
Total Read | Total data read by kernels (in MB) |
Total Transfer Rate | Average total data transfer rate (in MB/s): Total Transfer Rate = (Total Data Transfer) / (Total CU Execution Time) Where total CU execution time is the total time the CU was active |
The following table displays the data transfer streams.
Name | Description |
---|---|
Master Port | Name of master compute unit and port |
Master Kernel Arguments | List of kernel arguments attached to this port |
Slave Port | Name of slave compute unit and port |
Slave Kernel Arguments | List of kernel arguments attached to this port |
Device | Name of device (format: <device>-<ID>) |
Number of Transfers | Number of stream data packets |
Transfer Rate | Rate of stream data transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total CU Execution Time) Where total CU execution time is the total time the CU was active |
Avg Size | Average size of kernel data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Link Utilization (%) | Link utilization (%): Link Utilization = 100 * (Link Busy Cycles - Link Stall Cycles - Link Starve Cycles) / (Link Busy Cycles) |
Link Starve (%) | Link starve (%): Link Starve = 100 * (Link Starve Cycles) / (Link Busy Cycles) |
Link Stall (%) | Link stall (%): Link Stall = 100 * (Link Stall Cycles) / (Link Busy Cycles) |
Host Data Transfers
The following table displays profile data for all write transfers between the host and device memory through the PCI Express® link.
Name | Description |
---|---|
Buffer Address | Specifies the address location for the buffer |
Context ID | OpenCL Context ID on host |
Command Queue ID | OpenCL Command queue ID on host |
Start Time | Start time of write operation (in ms) |
Duration | Duration of write operation (in ms) |
Buffer Size | Amount of data being transferred (in KB) |
Writing Rate | Data transfer rate (in MB/s): (Buffer Size)/(Duration) |
The following table displays profile data for all read transfers between the host and device memory through the PCI Express® link.
Name | Description |
---|---|
Buffer Address | Specifies the address location for the buffer |
Context ID | Context ID on host |
Command Queue ID | Command queue ID on host |
Start Time | Start time of read operation (in ms) |
Duration | Duration of read operation (in ms) |
Buffer Size | Amount of data being transferred (in KB) |
Reading Rate | Data transfer rate (in MB/s): (Buffer Size) / (Duration) |
The following table displays the data transfers from the host to the global memory.
Name | Description |
---|---|
Context:Number of Devices | Context ID and number of devices in context |
Transfer Type | Type of kernel host transfers |
Number of Buffer Transfers | Number of host buffer transfers Note: This might contain printf transfers. |
Transfer Rate | Rate of host buffer transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total Time in µs) |
Avg Bandwidth Utilization (%) | Average bandwidth of host buffer transfers: Bandwidth Utilization (%) = (100 * Transfer Rate) / (Max. Theoretical Rate) |
Avg Size | Average size of host buffer transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Total Time | Sum of host buffer transfer durations (in ms) |
Avg Time | Average of host buffer transfer durations (in ms) |
API Calls
The following table displays the profile data for all OpenCL host API function calls executed in the host application. The top displays a bar graph of the API call time as a percent of total time.
Name | Description |
---|---|
API Name | Name of the API function (for example, clCreateProgramWithBinary , clEnqueueNDRangeKernel ) |
Calls | Number of calls to this API made by the host application |
Total Time | Sum of runtimes of all calls (in ms) |
Minimum Time | Minimum runtime of all calls (in ms) |
Average Time | Average Time (in ms) (Total time) / (Number of calls) |
Maximum Time | Maximum runtime of all calls (in ms) |
Device Power
The following table displays the profile data for device power.
Name | Description |
---|---|
Power Used By Platform | Shows a line graph of the three power rails on a Data Center acceleration card. |
Temperature | One chart will be created for each device that has non-zero temperature readings. Displays one line for each temperature sensor with readouts in (°C). |
Fan Speed | One chart will be created for each device that has non-zero fan speed readings. The fan speed is measured in RPM. |
Kernel Internals
The following table displays the running time for compute units in microseconds (µs) and reports stall time as a percent of the running time.
Name | Description |
---|---|
Compute Unit | Indicates the compute unit instance name |
Running Time | Reports the total running time for the CU (in µs) |
Intra-Kernel Dataflow Stalls (%) | Reports the percentage of running time consumed in stalls when streaming data between kernels |
External Memory Stalls (%) | Reports the percentage of running time consumed in stalls for memory transfers outside the CU |
Inter-Kernel Pipe Stalls (%) | Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU |
The following table displays the data transfers for specific ports on the compute unit.
Name | Description |
---|---|
Port | Indicates the port name on the compute unit |
Compute Unit | Indicates the compute unit instance name |
Write Time | Specifies the total data write time on the port (in µs) |
Outstanding Write (%) | Specifies the percentage of the runtime consumed in the write process |
Read Time | Specifies the total data read time on the port (in µs) |
Outstanding Read (%) | Specifies the percentage of the runtime consumed in the read process |
The following table displays the functional port data transfers on the compute unit.
Name | Description |
---|---|
Port | Name of port |
Function | Name of function |
Compute Unit | Name of compute unit |
Write Time | Total time the port had an outstanding write (in µs) |
Outstanding Write (%) | Percent time the port had an outstanding write |
Read Time | Total time the port had an outstanding read (in µs) |
Outstanding Read (%) | Percent time the port had an outstanding read |
The following table displays the running time and stalls on the compute unit.
Name | Description |
---|---|
Compute Unit | Name of compute unit |
Function | Name of function |
Running Time | Total time function was running (in ms) |
Intra-Kernel Dataflow Stalls | Percent time the function was stalling from intra-kernel streams (in ms) |
External Memory Stalls | Percent time the function was stalling from external memory accesses (in ms) |
Inter-Kernel Pipe Stalls | Percent time the function was stalling from inter-kernel pipe accesses (in ms) |
Shell Data Transfers
The following table displays the DMA data transfers.
Name | Description |
---|---|
Device | Name of device (format: <device>-<ID>) |
Transfer Type | Type of data transfers |
Number of Transfers | Number of data transfers (in AXI transactions) |
Transfer Rate | Rate of data transfers (in MB/s): Transfer Rate = (Total Bytes) / (Total Time in µs) |
Total Data Transfer | Total amount of data transferred (in MB) |
Total Time | Total duration of data transfers (in ms) |
Avg Size | Average size of data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Avg Latency | Average latency of data transfers (in ns) |
For DMA bypass and Global Memory to Global Memory data transfers, see the DMA Data Transfer table above.
NoC Counters
The NoC Counters section displays the NoC Counters Read and NoC Counters Write data. These sections are only displayed if there is non-zero NoC counter data.
Each section has a table containing summary data with line graphs for transfer rate and latency. The graphs can have multiple NoC counters, so you will be able to toggle the counters ON/OFF through check boxes in the Chart column of the table.
Depending on the design, it can be possible to correlate NoC counters to CU ports. In this case, the CU port will appear in the table, and selecting it will cross-probe to the system diagram, profile summary, and any other views that include CU ports as selectable objects.
Name | Description |
---|---|
Name | Name of NoC port |
Traffic Class | Traffic class type |
Requested QoS | QoS (MB/s): Requested quality of service (in MB/s) |
Min Transfer Rate | Rate of minimum data transfers (in MB/s) |
Avg Transfer Rate | Rate of average data transfers (in MB/s) |
Max Transfer Rate | Rate of maximum data transfers (in MB/s) |
Avg Size | Average size of data transfers (in KB): Average Size = (Total KB) / (Number of Transfers) |
Min Latency | Minimum latency of data transfers (in ns) |
Avg Latency | Average latency of data transfers (in ns) |
Max Latency | Maximum latency of data transfers (in ns) |
AI Engine Counters
AI Engine counters are displayed if there is non-zero AI Engine counter data. If there is an incompatible configuration of the AI Engine counters, this section displays a message stating that the configuration does not support performance profiling.
This section has a table containing summary data with line graphs for active time and usage. The usage chart is only available if stall profiling is enabled.
The graphs can have multiple AI Engine counters, so you will be able to toggle the counters ON/OFF through check boxes in the Chart column of the table.
You can cross-probe tiles to the AI Engine array and graph views.
Name | Description |
---|---|
Tile | AI Engine Tile [Column, Row] |
Active Time (ms) | Amount of time (in ms) this tile was active |
Stall Time (ms) | Amount of time (in ms) this tile was active but stalled |
Stall Time (%) | Percent of time this tile was active but stalled |
Active Utilization (ms) | Amount of time (in ms) this tile was active and not stalled |
Active Utilization (%) | Percent of time this tile was active and not stalled |
Clock Frequency (MHz) | Frequency (in MHz) of clock used for AI Engine tiles |
Application Timeline
The Application Timeline collects and displays host and kernel events on a common timeline to help you understand and visualize the overall health and performance of your systems. The graphical representation lets you see issues regarding kernel synchronization and efficient concurrent execution. The displayed events include:
- OpenCL API calls from the host code.
- Device trace data including compute units, AXI transaction start/stop.
- Host events and kernel start/stops.
While this is useful for debugging and profiling the application, the timeline and device trace data are not collected by default, because collecting this data adds time to the application execution. However, the trace data is collected with dedicated resources in the kernel, and does not affect kernel functionality. The data is offloaded only at the end of the run (v++ --trace_memory option).
The following is a snapshot of the Application Timeline window which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown on the bottom of the image. Host activities include creating the program, running the kernel and data transfers between global memory and the host. The kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvements.
Timeline data can be enabled and collected through the command line flow. However, viewing must be done in the Vitis analyzer as described in Using the Vitis Analyzer.
Generating and Opening the Application Timeline
To generate the Application Timeline report, you must complete the following steps to enable timeline and device trace data collection in the command line flow:
- Instrument the FPGA binary during linking by adding Acceleration Monitors and AXI Performance Monitors to kernels using the v++ --profile option as described in --profile Options. As an example, add --profile.data to the v++ linking command line:
v++ -g -l --profile.data all:all:all ...
- After the kernels are instrumented during the build process, data gathering must also be enabled during the runtime execution of the application by editing the xrt.ini file. Refer to xrt.ini File for more information. The following xrt.ini file will enable maximum information gathering when the application is run:
[Debug]
profile=true
timeline_trace=true
data_transfer_trace=coarse
stall_trace=all
TIP: If you are collecting a large amount of trace data, you might need to use the --trace_memory option with the v++ command, and the trace_buffer_size keyword in the xrt.ini.
After running the application, the Application Timeline data is captured in a CSV file called timeline_trace.csv.
- The CSV report can be viewed in the Vitis analyzer tool by opening the Run Summary produced during the application execution. You can launch the Vitis analyzer and open the Run Summary using the following command:
vitis_analyzer <project>.run_summary
Interpreting the Application Timeline
The Application Timeline window displays host and device events on a common timeline. This information helps you understand details of application execution and identify potential areas for improvements. The Application Timeline report has two main sections: Host and Device. The Host section shows the trace of all the activity originating from the host side. The Device section shows the activity of the CUs on the FPGA.
The report has the following structure:
- Host
- OpenCL API Calls
- All OpenCL API calls are traced here. The activity time is measured from the host perspective.
- Device "name"
- Binary Container "name"
- Binary container name.
Waveform View and Live Waveform Viewer
The Vitis core development kit can generate a Waveform view when running hardware emulation. It displays in-depth details at the system-level, CU level, and at the function level. The details include data transfers between the kernel and global memory and data flow through inter-kernel pipes. These details provide many insights into performance bottlenecks from the system-level down to individual function calls to help optimize your application.
The Live Waveform Viewer is similar to the Waveform view; however, it provides even lower-level details with some degree of interactivity. The Live Waveform Viewer can also be opened using the Vivado logic simulator, xsim.
Waveform data is not collected by default because it requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. Refer to Generating and Opening the Waveform Reports for instructions on enabling these features.
You can also open the waveform database (.wdb) file with the Vivado logic simulator through the Linux command line:
xsim -gui <filename.wdb> &
Generating and Opening the Waveform Reports
Follow these instructions to enable waveform data collection from the command line during hardware emulation and open the viewer:
- Enable debug code generation during compilation and linking using the -g option:
v++ -c -g -t hw_emu ...
- Create an xrt.ini file in the same directory as the host executable with the following contents (see xrt.ini File for more information):
[Debug]
profile=true
timeline_trace=true
[Emulation]
debug_mode=batch
The debug_mode=batch setting enables the capture of waveform data (.wdb) by running the simulation in batch mode. You can also enable the Live Waveform Viewer to launch the simulation in interactive mode using the following setting in the xrt.ini:
[Emulation]
debug_mode=gui
TIP: If the Live Waveform Viewer is enabled, the simulation waveform opens during the hardware emulation run.
- Run the hardware emulation build of the application as described in Running the Application Hardware Build. The hardware transaction data is collected in the waveform database file, <hardware_platform>-<device_id>-<xclbin_name>.wdb. Refer to Output Directories of the v++ Command or Output Directories from the Vitis IDE for more information on locating these reports.
- Open the Waveform view in the Vitis analyzer by opening the Run Summary, and opening the Waveform report:
vitis_analyzer <project>.run_summary
Interpreting Data in the Waveform Views
The following image shows the Waveform view:
The Waveform and Live Waveform views are organized hierarchically for easy navigation.
- The Waveform view is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows the viewer to descend all the way down to the individual signals responsible for the abstracted data. However, because the Waveform view is generated from the post-processed data, no additional signals can be added to the report, and some of the runtime analysis cannot be visualized, such as DATAFLOW transactions.
- The Live Waveform viewer displays the Vivado logic simulator (xsim) run, so you can add extra signals and internals of the register transfer level (RTL) design to the live view. Refer to the Vivado Design Suite User Guide: Logic Simulation (UG900) for information on working with the Waveform viewer.
The hierarchy of the Waveform and Live Waveform views include the following:
- Device "name"
- Target device name.
- Binary Container "name"
- Binary container name.
- Memory Data Transfers
- For each DDR Bank, this shows the trace of all the read and write request transactions arriving at the bank from the host.
- Kernel "name" 1:1:1
- For each kernel and for each compute
unit of that kernel, this section breaks down the
activities originating from the compute unit.
- Compute Unit: "name"
- Compute unit name.
- CU Stalls (%)
- Stall signals are provided by the Vitis HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL pipes). The stall bus shown in the detailed kernel trace compiles all of the lowest level stall signals and reports the percentage that are stalling at any point in time. This provides a factor of how much of the kernel is stalling at any point in the simulation.
For example, if there are 100 lowest level stall signals and 10 are active on a given clock cycle, then the CU Stall percentage is 10%. If one goes inactive, then it is 9%.
- Data Transfers
- This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
- User Functions
- This information is available for the HLS kernels and shows the user functions.
- Function: "name"
- Function name.
- Function: "name"
- Function name.
- Function: "name"
- Function name.