Profiling and Optimization
The SDAccel™ environment generates various system and kernel resource performance reports during compilation. It also collects profiling data during application execution in both emulation and system mode configurations. Examples of the reported data include:
- Host and device timeline events
- OpenCL™ API call sequence
- Kernel execution sequence
- FPGA trace data including AXI transactions
- Kernel start and stop signals
Together the reports and profiling data can be used to isolate performance bottlenecks in the application and optimize the design to improve performance.
Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.
There are four distinct areas to consider when optimizing an algorithm in SDAccel: system resource usage and performance, kernel optimization, host optimization, and PCIe® bandwidth optimization. The following SDAccel reports and graphical tools support your efforts to profile and optimize these areas:
- System Estimate
- Design Guidance
- HLS Report
- Profile Summary
- Application Timeline
- Waveform View and Live Waveform Viewer
Reports are automatically generated after running the active build via the SDAccel GUI or xocc Makefile flows.
Separate sets of reports are generated for all three build configurations and can be found in the respective report directories.
The Profile Summary and Application Timeline reports are generated for all three build configurations and are located under the default application sub-directory.
Reports can be viewed in a web browser, in a spreadsheet viewer, or from the SDAccel GUI. To access these reports from the SDx™ integrated design environment, make sure the Assistant view is visible and double-click the desired report.
The following sections briefly describe the various reports and graphical visualization tools, and how they can be used to profile and optimize your design. For complete details on each report, along with optimization steps and coding guidelines, see the SDAccel Environment Profiling and Optimization Guide.
Design Guidance
The SDAccel environment has a comprehensive design guidance tool that provides immediate, actionable guidance to software application developers for issues detected in their designs. Guidance is generated from HLS, the SDx Profiler, and the Vivado® Design Suite when invoked from xocc. The generated design guidance can have several severity levels: errors, advisories, warnings, and critical warnings are provided during software emulation, hardware emulation, and system builds.
The guidance includes hyperlinks, examples, and links to documentation. This improves productivity for current users by quickly highlighting issues and propels new users to more quickly become experts in using the SDAccel tool.
Design guidance is automatically generated after building or running a design in the SDx GUI with results contained in the Guidance view located in the console area of the SDx GUI. Hovering over the guidance highlights solutions and suggestions.
The following image shows an example of guidance given by the SDx GUI. It details ways to increase the bandwidth use of the kernels. Clicking a link displays an expanded view of the actionable guidance. In this case, it displays guidance for maximizing use of global memory bandwidth.
There is one HTML guidance report for each command line run of xocc, including compile and link. The report files are generated in the --report_dir location under the specific .xo name.
The name of the report file is given below, where <output> is the .xo name:
- xocc_compile_<output>_guidance.html for xocc compilation
- xocc_link_t_guidance.html for xocc linking
The profile design guidance helps you interpret the profiling results and know exactly where to focus to improve performance. Specific details of the reports and additional design guidance details can be found in the SDAccel Environment Profiling and Optimization Guide.
System Estimate Report
The System Estimate report, generated by SDAccel HLS, provides estimates of FPGA resource usage and the frequency at which the hardware accelerated kernels can operate. It is automatically generated for Emulation-HW and System builds, and can be found under the respective directory of the Assistant view shown below.
The report contains high-level details of the user kernels including resource usage and estimated frequency. The results can be used to guide the design optimization. For instance, if the target frequency is not met, it might be necessary to revisit the source code.
An example report is shown in the following graphic. It shows the following for the krnl_vadd kernel:
- It is estimated to operate at a frequency of 411 MHz, which exceeds the 300 MHz targeted frequency.
- In the best case it has a latency of one cycle.
- Its estimated FPGA resource usage is 2353 FFs, 3948 LUTs, no DSPs, and three block RAMs.
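From the command line, the level of reports generated is controlled with the --report_level option of xocc:
xocc .. --report_level <arg>
where arg specifies the level of reports generated.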
For additional details on the System Estimate report, see the SDAccel Environment Profiling and Optimization Guide. For information on the --report_level xocc option, see the SDx Command and Utility Reference Guide (UG1279).
HLS Report
The HLS Report provides details about the high-level synthesis (HLS) process of a user kernel and is generated in Hardware emulation and System builds. This process translates the C/C++ and OpenCL kernel into a hardware description language responsible for implementing the functionality on the FPGA. It provides estimated FPGA resource usage, operating frequency, latency and interface signals of the custom-generated hardware logic. These details provide the programmer many insights to guide kernel optimization.
The HLS Report can be opened by selecting the report in the Assistant and double-clicking. An example of the HLS report follows.
When running from the command line, this report can be found in the following directory:
_x/<kernel_name>.<target>.<platform>/<kernel_name>/<kernel_name>/solution/syn/report
For additional details on the HLS Report, see the SDAccel Environment Profiling and Optimization Guide.
Profile Summary Report
The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the program is gathered by SDAccel and grouped into categories. The Profile Summary enables the programmer to drill down to the actual Data Transfer and Kernel Execution numbers and statistics.
To open the Profile Summary report in the SDx IDE, double-click the Profile Summary report under the Assistant as shown in the following image.
An example of the Profile Summary report is shown here.
The report has multiple tabs that can be selected. A description of each tab is given in the following table.
Tab | Description |
---|---|
Top Operations | Kernels and Global Memory. This tab shows a summary of top operations. It displays the profile data for top data transfers between FPGA and device memory. |
Kernels & Compute Units | Displays the profile data for all kernels and compute units. |
Data Transfers | Host and Global Memory. This table displays the profile data for all read and write transfers between the host and device memory through the PCIe link. It also displays data transfers between kernels and global memory, if enabled. |
OpenCL APIs | Displays the profile data for all OpenCL C host API function calls executed in the host application. |
For command line users, the profile summary data is generated by using the --profile_kernel option during the linking stage.
The --profile_kernel syntax is given below:
--profile_kernel <[data]:<[kernel_name|all]:[compute_unit_name|all]:[interface_name|all]:[counters|all]>
See the SDAccel Environment Profiling and Optimization Guide for complete details.
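As an illustration of the colon-separated field order in the syntax line above, the following minimal Python sketch composes a --profile_kernel argument. The helper name profile_kernel_option and the kernel name krnl_vadd used in the example are illustrative assumptions, and the sketch only formats a string; which field combinations xocc accepts should be confirmed in the referenced guides.

```python
def profile_kernel_option(kernel="all", compute_unit="all",
                          interface="all", counters="all"):
    """Compose a --profile_kernel argument.

    Field order follows the syntax line in the text:
    data:<kernel>:<compute unit>:<interface>:<counters>.
    This helper is hypothetical and only builds a string; it does not
    validate the names against a real design.
    """
    fields = ["data", kernel, compute_unit, interface, counters]
    for field in fields:
        # A colon inside a field would shift every later field.
        if not field or ":" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "--profile_kernel " + ":".join(fields)


# Monitor every kernel, compute unit, and interface.
print(profile_kernel_option())
# Restrict monitoring to one (illustrative) kernel name.
print(profile_kernel_option(kernel="krnl_vadd"))
```

The resulting string would be appended to the xocc link command.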
Application Timeline
Application Timeline collects and displays host and device events on a common timeline to help you understand and visualize the overall health and performance of your systems. These events include:
- OpenCL API calls from the host code.
- Device trace data including Compute units, AXI transaction start/stop.
- Host events and kernel start/stops.
This graphical representation enables the programmer to identify issues regarding kernel synchronization and efficient concurrent execution.
Double-click Application Timeline in the Reports window to open the Application Timeline window.
The following is a snapshot of the Application Timeline window which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown on the bottom of the image. Host activities include creating the program, running the kernel and data transfers between global memory and the host. The kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvements.
Timeline data can be enabled and collected through the command line flow; however, viewing must be done through the GUI. Complete instructions for enabling timeline data collection and displaying the data through both the command line and GUI flows are given in the SDAccel Environment Profiling and Optimization Guide.
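The kind of question the timeline helps answer, whether the device sits idle between kernel runs, can be sketched in a few lines of Python. The interval list below is made-up example data (start and end times in arbitrary units), not the format of any SDAccel trace file, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into disjoint busy periods."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous busy period: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(period) for period in merged]


def idle_gaps(intervals):
    """Return the gaps between consecutive busy periods."""
    merged = merge_intervals(intervals)
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(merged, merged[1:])]


# Hypothetical kernel execution intervals read off a timeline by hand.
kernel_runs = [(0, 4), (3, 6), (9, 12)]
print(merge_intervals(kernel_runs))
print(idle_gaps(kernel_runs))
```

Long gaps in such a listing would correspond to stretches on the timeline where no kernel is running, which is where host-side scheduling or data transfers are worth inspecting.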
Waveform View and Live Waveform Viewer
The SDx Development Environment can generate a Waveform View when running hardware emulation. It displays in-depth details on the emulation results at the system level, compute unit (CU) level, and at the function level. The details include data transfers between the kernel and global memory and data flow through inter-kernel pipes. These details provide many insights into the performance bottleneck from the system level down to the individual function call to help developers optimize their applications.
The Live Waveform Viewer is similar to the Waveform View; however, it provides even lower-level details. It can also be opened using xsim, a Xilinx tool used by hardware designers.
Waveform View and Live Waveform Viewer data are not collected by default because collecting them requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. The SDAccel Environment Profiling and Optimization Guide describes the setup required to enable data collection for the Waveform View and Live Waveform Viewer for both the GUI and the command line.
Double-click the Waveform in the Assistant view (shown in the following image) to open the Waveform View window.
The Live Waveform Viewer can be viewed if you select Launch Live Waveform in the Run Configuration Main tab. If Launch Live Waveform is not selected, you can open the waveform database (.wdb) with xsim through the Linux command line. The .wdb file is located in the sub-directory Emulation-HW/<kernel_name>-Default within the project directory.
Use the following Linux command line to open xsim:
xsim -gui <filename.wdb> &
An example of the xsim Live Waveform Viewer is shown in the following image.
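When the exact .wdb file name is not known, a small script can search the emulation directory named above and print the corresponding xsim command. This is a sketch under assumptions: the function names are hypothetical, the search is recursive because the exact depth of the .wdb file below Emulation-HW/<kernel_name>-Default is not stated here, and the example path passed to xsim_command is illustrative.

```python
from pathlib import Path
import shlex


def find_waveform_databases(project_dir, kernel_name):
    """List .wdb files under Emulation-HW/<kernel_name>-Default."""
    emu_dir = Path(project_dir) / "Emulation-HW" / f"{kernel_name}-Default"
    # rglob returns nothing if the directory does not exist.
    return sorted(emu_dir.rglob("*.wdb"))


def xsim_command(wdb_path):
    """Build the xsim command line from the text for one .wdb file."""
    return f"xsim -gui {shlex.quote(str(wdb_path))} &"


# Illustrative file name; real names depend on the project.
print(xsim_command("Emulation-HW/krnl_vadd-Default/example.wdb"))
```

The command is only printed, not executed, so the sketch can be tried on a machine without the Xilinx tools installed.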
Kernel SLR and DDR Memory Assignments
Kernel compute unit (CU) instance and DDR memory resource floorplanning are keys to meeting quality of results of your design in terms of frequency and resources. Floorplanning involves explicitly allocating CUs (a kernel instance) to SLRs and mapping CUs to DDR memory resources. When floorplanning, both CU resource usage and DDR memory bandwidth requirements need to be considered.
The largest Xilinx FPGAs are made up of multiple stacked silicon dies. Each stack is referred to as a super logic region (SLR) and has a fixed amount of resources and memory including DDR interfaces. Available device SLR resources which can be used for custom logic can be found in SDx Environments Release Notes, Installation, and Licensing Guide or can be displayed using the platforminfo utility described in the SDx Command and Utility Reference Guide (UG1279).
You can use the actual kernel resource utilization values to help distribute CUs across SLRs to reduce congestion in any one SLR. The system estimate report lists the number of resources (LUTs, Flip-Flops, block RAMs, etc.) used by the kernels early in the design cycle. The report can be generated during hardware emulation and system compilation through the command line or GUI and is described in System Estimate Report.
Use this information along with the available SLR resources to help assign CUs to SLRs such that no one SLR is over-utilized. The less congestion in an SLR, the better the tools can map the design to the FPGA resources and meet your performance target. For mapping memory resources and CUs, see Mapping Kernel Interfaces to Memory Resources and Allocating Compute Units to SLRs, respectively.
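One simple way to turn resource estimates into a first-pass assignment is a greedy balance: place the largest CU first, each time into the SLR that ends up least utilized. The Python sketch below is an illustration of that idea only, not the method used by the tools. It considers LUTs alone, and the CU names, LUT counts for filter_1, and SLR capacities are made-up values (3948 LUTs matches the krnl_vadd example report).

```python
def assign_cus_to_slrs(cu_luts, slr_capacity_luts):
    """Greedy LUT balancing of compute units across SLRs.

    cu_luts: {cu_name: estimated LUTs}, e.g. from the System Estimate report.
    slr_capacity_luts: {slr_name: LUTs available for custom logic}.
    Returns {cu_name: slr_name}. Hypothetical helper for illustration.
    """
    used = {slr: 0 for slr in slr_capacity_luts}
    placement = {}
    # Largest CUs first; names break ties so the result is deterministic.
    ordered = sorted(cu_luts.items(), key=lambda item: (-item[1], item[0]))
    for cu, luts in ordered:
        # Pick the SLR with the lowest utilization after adding this CU.
        best = min(
            used,
            key=lambda slr: ((used[slr] + luts) / slr_capacity_luts[slr], slr),
        )
        if used[best] + luts > slr_capacity_luts[best]:
            raise ValueError(f"{cu} does not fit in any SLR")
        used[best] += luts
        placement[cu] = best
    return placement


# Made-up numbers for illustration only.
cus = {"vadd_1": 3948, "vadd_2": 3948, "filter_1": 12000}
slrs = {"SLR0": 100000, "SLR1": 100000}
print(assign_cus_to_slrs(cus, slrs))
```

A real floorplan would also weigh flip-flops, block RAMs, DSPs, and the DDR connectivity discussed in this section, so the output is a starting point rather than a final assignment.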
After allocating your CUs to SLRs, map any CU master AXI port(s) to DDR memory resources. Xilinx recommends connecting to a DDR memory resource in the same SLR as the CU. This reduces competition for the limited SLR-crossing connection resources. In addition, connections between SLRs use super long line (SLL) routing resources, which incur a greater delay than standard intra-SLR routing.
It might be necessary to cross an SLR region to connect to a DDR resource in a different SLR. However, if both the --sp and the --slr directives are explicitly defined, the tools automatically add additional crossing logic to minimize the effect of the SLL delay and facilitate better timing closure.
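The following Python sketch shows how the two directives might be composed for one compute unit. The argument shapes assumed here, --slr <cu_name>:<SLR> and --sp <cu_name>.<port>:<bank>, as well as the CU, port, and bank names in the example, are assumptions for illustration and should be confirmed in the SDx Command and Utility Reference Guide (UG1279) and against the target platform.

```python
def placement_options(cu_name, slr, port_to_bank):
    """Build --slr and --sp option strings for one compute unit.

    Assumed (not verified here) argument forms:
      --slr <cu_name>:<SLR>
      --sp  <cu_name>.<port>:<bank>
    """
    options = [f"--slr {cu_name}:{slr}"]
    # Sort ports so the generated command line is stable between runs.
    for port, bank in sorted(port_to_bank.items()):
        options.append(f"--sp {cu_name}.{port}:{bank}")
    return options


# Illustrative names: CU vadd_1 in SLR1, its AXI master port on bank1.
for option in placement_options("vadd_1", "SLR1", {"m_axi_gmem": "bank1"}):
    print(option)
```

Specifying both directives for the same CU is the case in which the text says the tools add crossing logic when the memory bank sits in a different SLR.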
Guidelines for Kernels that Access Multiple Memory Banks
The DDR memory resources are distributed across the super logic regions (SLRs) of the platform. Because the number of connections available for crossing between SLRs is limited, the general guidance is to place a kernel in the same SLR as the DDR memory resource with which it has the most connections. This reduces competition for SLR-crossing connections and avoids consuming extra logic resources associated with SLR crossing.
As shown in the previous figure, when a kernel has a single AXI interface that maps only a single memory bank, the platforminfo utility described in the SDx Command and Utility Reference Guide (UG1279) lists the SLR that is associated with the memory bank of the kernel and, therefore, the SLR where the kernel would be best placed. In this scenario, the design tools might automatically place the kernel in that SLR without need for extra input; however, you might need to provide an explicit SLR assignment for some of the kernels under the following conditions:
- The design contains a large number of kernels accessing the same memory bank.
- A kernel requires some specialized logic resources that are not available in the SLR of the memory bank.
When a kernel has multiple AXI interfaces and all of the interfaces of the kernel access the same memory bank, it can be treated in a very similar way to the kernel with a single AXI interface, and the kernel should reside in the same SLR as the memory bank that its AXI interfaces are mapping.
When a kernel has multiple AXI interfaces to multiple memory banks in different SLRs, the recommendation is to place the kernel in the SLR that has the majority of the memory banks accessed by the kernel (shown in the figure above). This minimizes the number of SLR crossings required by this kernel, which leaves more SLR crossing resources available for other kernels in your design to reach your memory banks.
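The majority rule can be expressed as a short Python sketch. The interface names, bank names, and the bank-to-SLR mapping below are hypothetical; on a real platform the mapping would come from the platforminfo utility.

```python
from collections import Counter


def preferred_slr(interface_banks, bank_slr):
    """Pick the SLR that hosts the most memory connections of a kernel.

    interface_banks: {axi_interface: memory_bank}
    bank_slr: {memory_bank: slr_name}
    Ties are broken by SLR name so the result is deterministic.
    """
    counts = Counter(bank_slr[bank] for bank in interface_banks.values())
    top = max(counts.values())
    return min(slr for slr, count in counts.items() if count == top)


# Hypothetical kernel with three AXI interfaces on three banks.
interfaces = {"m_axi_gmem0": "bank0", "m_axi_gmem1": "bank1", "m_axi_gmem2": "bank2"}
banks = {"bank0": "SLR0", "bank1": "SLR1", "bank2": "SLR1"}
print(preferred_slr(interfaces, banks))
```

In this example two of the three connections land in SLR1, so placing the kernel there leaves only one connection that has to cross an SLR boundary.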
When the kernel is mapping memory banks from different SLRs, explicitly specify the SLR assignment as described in Kernel SLR and DDR Memory Assignments.
As shown in the previous figure, when a platform contains more than two SLRs, it is possible that the kernel might map a memory bank that is not in the immediately adjacent SLR to its most commonly mapped memory bank. When this scenario arises, memory accesses to the distant memory bank must cross more than one SLR boundary and incur additional SLR-crossing resource costs. To avoid such costs, it might be better to place the kernel in an intermediate SLR where it only requires less expensive crossings into the adjacent SLRs.
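To see why an intermediate SLR can win, consider a toy cost model in which a connection that spans several SLR boundaries is penalized more than the same number of single-boundary crossings. The quadratic penalty below is an assumption chosen purely for illustration, not a figure from the tools, and SLRs are represented by their integer index.

```python
def crossing_cost(candidate_slr, accessed_slrs):
    """Toy cost: each connection costs (number of boundaries crossed) squared.

    The squaring is a hypothetical penalty that makes multi-SLR hops
    disproportionately expensive; real costs depend on the device.
    """
    return sum((candidate_slr - slr) ** 2 for slr in accessed_slrs)


def best_slr(accessed_slrs, num_slrs):
    """Return the SLR index with the lowest toy crossing cost."""
    return min(
        range(num_slrs),
        key=lambda candidate: (crossing_cost(candidate, accessed_slrs), candidate),
    )


# Hypothetical kernel on a three-SLR platform: two connections to banks
# in SLR0 and one connection to a bank in SLR2.
accessed = [0, 0, 2]
for candidate in range(3):
    print(candidate, crossing_cost(candidate, accessed))
print(best_slr(accessed, 3))
```

Under this model the majority SLR (SLR0) pays for one two-boundary hop, while the intermediate SLR1 pays for three single-boundary crossings and comes out cheaper. Whether that holds for a given design depends on the actual crossing resources and timing.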