Debug Techniques
This chapter describes the different styles of debugging techniques applicable to SDSoC™ applications. It highlights different approaches for software-based debugging and hardware-oriented techniques. In the software-based approaches, a full understanding of the implementation of the design in the FPGA is not required. However, this concept can only be extended to a certain degree, at which point a hardware-based detailed analysis should be performed. Highlighting pure software debugging techniques is not the intent of this document.
When debugging SDSoC applications, you can use the same methods and techniques that are used for debugging standard C/C++ applications. Most SDSoC applications consist of specific functions tagged for hardware acceleration and surrounded by standard C/C++ code.
When debugging an SDSoC application with a board attached to the debug host machine, you can right-click on a build configuration in the Assistant view, and select the option to begin a debug session.
You can select options other than the default settings by creating a new custom debug configuration. As the debug environment is initialized, Xilinx recommends that you switch to the Debug perspective when prompted. The Debug perspective provides the ability to debug the standard C/C++ portions of the application by single-stepping code, setting and removing breakpoints, displaying variables, dumping registers, viewing memory, and controlling the code flow with “run until” and “jump to” debugging directives. Inputs and outputs can be observed before and after the function call to determine the correct behavior.
You can determine if a hardware accelerated application meets its real-time requirements by placing debug statements to start and stop a counter just before and just after a hardware accelerated function. The SDx™ environment provides the sds_clock_counter() function, which is typically used to calculate the elapsed time for a hardware accelerated function.
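The start/stop counter approach described above can be sketched as follows. This is a minimal illustration, not the tool's own code: the helper names elapsed_ticks and meets_deadline are hypothetical, and the counter is passed in as a function pointer so that the same code can be built on a host machine. On the target, the assumption is that sds_clock_counter (declared in sds_lib.h) would be passed as the counter and sds_clock_frequency() would supply the ticks-per-second value.

```cpp
#include <cassert>

// Function-pointer type matching a free-running 64-bit counter such as
// sds_clock_counter() (assumed signature: unsigned long long f(void)).
using counter_fn = unsigned long long (*)();

// Runs `work` once and returns the number of counter ticks that elapsed.
// `read_counter` must be a monotonically increasing counter.
template <typename Work>
unsigned long long elapsed_ticks(counter_fn read_counter, Work work)
{
    assert(read_counter != nullptr);
    const unsigned long long start = read_counter();
    work();  // for example, the call to the hardware accelerated function
    const unsigned long long stop = read_counter();
    return stop - start;
}

// Hypothetical deadline check: true if `ticks`, counted at
// `ticks_per_second`, fits within `budget_seconds`.
inline bool meets_deadline(unsigned long long ticks,
                           unsigned long long ticks_per_second,
                           double budget_seconds)
{
    return static_cast<double>(ticks) <=
           budget_seconds * static_cast<double>(ticks_per_second);
}
```

A call site might look like elapsed_ticks(sds_clock_counter, [&] { my_accel(in, out); }), where my_accel is a placeholder for the accelerated function.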
You can also perform debugging without a target board connected to the debug host by building the SDx project for emulation. During emulation, you can control and observe the software and data just as before through the debug perspective view, but you can also view the hardware accelerated functions through a Vivado® simulator waveform viewer. You can observe accelerator signaling for conditions such as accelerator start and accelerator done, and you can monitor data buses for inputs and outputs. Building a project for emulation also avoids a possibly long Vivado implementation step to generate an FPGA bitstream.
See the SDSoC Environment Debugging Guide for information on using the interactive debuggers in the SDx IDE.
Debugging System Hangs and Runtime Errors
Programs compiled using sds++ can be debugged using the standard debuggers supplied with the SDx environment or Vivado. Typical runtime errors are incorrect results, premature program exits, and program hangs. The first two kinds of error are familiar to C/C++ programmers, and can be debugged by stepping through the code using a debugger.
A program hang is a runtime error caused by specifying an incorrect amount of data to be transferred across a streaming connection created using #pragma SDS data access_pattern(A:SEQUENTIAL), by specifying a streaming interface in a synthesizable function within the Vivado High-Level Synthesis (HLS) tool, or by a C-callable hardware function in a pre-built library that has streaming hardware interfaces. A program hangs when the consumer of a stream is waiting for more data from the producer, but the producer has stopped sending data. Consider the following code fragment that results in streaming input/output from a hardware function:
#pragma SDS data access_pattern(in_a:SEQUENTIAL, out_b:SEQUENTIAL)
void f1(int in_a[20], int out_b[20]); // declaration
void f1(int in_a[20], int out_b[20]) { // definition
    int i;
    for (i = 0; i < 19; i++) { // reads only 19 of the 20 elements
        out_b[i] = in_a[i];
    }
}
The in_a[] array has 20 elements, but the loop only reads 19 of them. Anything calling f1 would appear to hang, waiting indefinitely for f1 to consume the final element. Program errors that lead to hangs can be detected by using system emulation to ascertain whether the data signals are static by reviewing the associated protocol signals such as TLAST, ap_ready, ap_done, and TREADY. Program errors causing hangs can also be detected by instrumenting the code to flag streaming access errors such as non-sequential access or incorrect access counts within a function and running in software. Streaming access issues are typically flagged as improper streaming access warnings in the log file, and you can determine if these are actual errors. Running your application on the SDSoC emulator is a good way to gain visibility of data transfers with a debugger. You can see where in software the system is hanging (often within a cf_wait() call), and can then inspect associated data transfers in the simulation waveform view, which gives you access to signals on the hardware blocks associated with the data transfer.
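One way to instrument code for a software-only run, as mentioned above, is to route array reads through a small checker that records the access order and count. The following is a sketch under assumptions: the class SequentialReadChecker and the function f1_instrumented are hypothetical names introduced for illustration and are not part of the SDSoC libraries.

```cpp
#include <cassert>
#include <cstddef>

// Software-only wrapper that records whether every element of an array
// was read exactly once, in increasing index order.
class SequentialReadChecker {
public:
    SequentialReadChecker(const int* data, std::size_t length)
        : data_(data), length_(length)
    {
        assert(data != nullptr);
    }

    // Reads element `index`. Flags an error if the index is out of range
    // or is not the next one in sequence.
    int read(std::size_t index)
    {
        if (index >= length_) {
            out_of_sequence_ = true;
            return 0;
        }
        if (index != next_) {
            out_of_sequence_ = true;
        }
        next_ = index + 1;
        ++reads_;
        return data_[index];
    }

    bool sequential() const { return !out_of_sequence_; }
    // True only if all elements were consumed in order.
    bool complete() const { return !out_of_sequence_ && reads_ == length_; }
    std::size_t reads() const { return reads_; }

private:
    const int* data_;
    std::size_t length_;
    std::size_t next_ = 0;
    std::size_t reads_ = 0;
    bool out_of_sequence_ = false;
};

// Same loop shape as f1 in the text, with reads routed through the checker.
inline void f1_instrumented(SequentialReadChecker& in_a, int out_b[],
                            std::size_t count)
{
    for (std::size_t i = 0; i < count; i++) {
        out_b[i] = in_a.read(i);
    }
}
```

Running f1_instrumented with a count of 19 on a 20-element input leaves complete() false, which points at the same mismatch that causes the hang in hardware.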
System Hang Debugging Example
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:large], out[0:small])
void too_large_copy(int* in, int* out, int small, int large)
{
    for (int i = 0; i < small; i++) {
        out[i] = in[i];
    }
}

int main()
{
    int* temp_var1 = new int[1024 * 1024];
    int* temp_var2 = new int[1024 * 1024];
    // Hangs because the input DMA continues to try to feed data to a halted HLS core.
    too_large_copy(temp_var1, temp_var2, 1024, 1024 * 1024);
}
In this case, the direct memory access (DMA) continues to try to send data to the hardware function, whereas the hardware function is already done and is not accepting any data. This results in a system hang.
- To debug this type of issue, build the code for emulation on the base platform. When the application is compiled, start the emulator. One way to do this is from the Assistant window: right-click the Active build configuration for the application and select Start/Stop Emulator.
- In the Emulation dialog box, ensure that the Show Waveform (Programmable Logic only) check box is checked. This brings up the Vivado Simulator where the state of different interfaces can be viewed in the Waveform window. To monitor the interfaces of the hardware function, right-click on the function and select Add to Wave window. This adds all the I/O ports of the selected function to the Waveform window.
- Start the simulator by clicking the Run All icon in the toolbar.
- Go back to the SDx IDE, and then launch the application on the debugger. To do this, select the application to be debugged, right-click, and then select Launch on Emulator (SDx Application Debugger). In the Confirm Perspective Switch dialog box, click Yes. The Debug Perspective opens with the application running on the hardware. The code execution stops at the main program entry.
- Click the Resume button on the toolbar to execute the application.
The application is now stuck: a system hang has been encountered.
- To determine the cause of the system hang, go back to Vivado Design Suite. Look at the state of the ap_done, ap_start, ap_idle, and ap_ready signals for the function. The state of these signals indicates that a transaction started when the ap_start signal went High and ended when the ap_done signal was asserted and then returned Low. The ap_ready and ap_idle signals likewise indicate the state of the function.
Analyzing the state of the DMA at the same point in time, you can see that while the hardware function has finished accepting data, the DMA is still writing to it, as indicated by the M00_AXIS_tready and the M00_AXIS_tvalid signals.
Now that you know the cause of the system hang, you can go back to the hardware function code and fix any outstanding issues.
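One possible fix for the example, sketched under the assumption that the function only ever needs the first elements of the input, is to declare the same element count for both arrays so that the data mover transfers exactly what the loop consumes. The name matched_copy is hypothetical. When built with a host compiler, the SDS pragmas are simply ignored.

```cpp
#include <cassert>
#include <vector>

// Both arrays are declared with the same element count, so the input DMA
// sends exactly as many elements as the loop reads.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:count], out[0:count])
void matched_copy(int* in, int* out, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        out[i] = in[i];
    }
}
```

The alternative is to keep the larger input size in the pragma and have the loop read every input element, discarding the ones it does not need.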
Causes of System Hangs
There are other situations where a system hang can occur as listed below:
- If you can Ctrl+C out of the application, there was probably not enough data from the accelerator. The Arm® processor is expecting more data than the accelerator is sending. Review latencies if there is more than one path from a producer to a consumer. Designs where there are multiple paths with unequal latencies between two accelerators (for example, A -> B ... -> Z, while there is also A -> Z direct) need to be fixed at the design level by equalizing the branches.
- If Ctrl+C does not work, but you can ping or ssh into the board, there is not enough data in a Scatter Gather DMA (SGDMA) operation. Review the data movers (copy or zero-copy) and the access pattern.
- If you cannot ping the board and it has hard locked, only coming back to life after a power cycle, common causes are interaction between the following:
  - The SDSoC environment design and IP on the platform. Debug with the ChipScope™ feature and peeking and poking of registers; see Hardware Debugging in SDSoC Using ChipScope and Peeking and Poking IP Registers.
  - The SDSoC environment design and C-callable IP libraries. Debug with the ChipScope feature and peeking and poking of registers; see Hardware Debugging in SDSoC Using ChipScope and Peeking and Poking IP Registers.
  - The RTL or the SW driver generated in the SDSoC flow. If you have enough Vivado Design Suite or C driver experience you might be able to debug this; otherwise, contact the Xilinx forums.
Causes of Runtime Errors
The following list shows other sources of runtime errors:
- Improper placement of wait() statements could result in the following issues:
  - The software might read invalid data before a hardware accelerator has written the correct value.
  - A blocking wait() might be called before a related accelerator is started, resulting in a system hang.
- Inconsistent use of the memory consistency SDS data mem_attribute pragma can result in incorrect results.
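The wait placement issue can be illustrated with a short sketch. It assumes that the wait() statements referred to above correspond to the paired #pragma SDS async(ID) and #pragma SDS wait(ID) directives; the function names scale_by_two and sum_scaled are hypothetical. On a host compiler the pragmas are ignored and the call runs synchronously, so the code still produces the same result.

```cpp
#include <cassert>

// Stand-in for a function selected for hardware acceleration.
void scale_by_two(const int* in, int* out, int count)
{
    for (int i = 0; i < count; i++) {
        out[i] = 2 * in[i];
    }
}

// Starts the accelerator, leaves room for unrelated work, and waits
// before reading the results.
int sum_scaled(const int* in, int* out, int count)
{
    assert(count >= 0);

    // With sds++, control returns as soon as the transfers are set up.
#pragma SDS async(1)
    scale_by_two(in, out, count);

    // Work placed here must not read `out`: the accelerator may still be
    // writing it. The matching wait must come after the async call, never
    // before it.
#pragma SDS wait(1)

    // Reading `out` is safe only after the wait.
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += out[i];
    }
    return sum;
}
```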
Unexpected Data Values
- Go back to software debug and confirm that your software is good.
- If the software debug is good, you need to visually inspect the code. Two common causes of unexpected data are the use of the #pragma SDS data or the #pragma SDS data zero_copy pragmas.
- If you are using #pragma SDS data pragmas, the tools trust what you write. Confirm that the data access pattern in the code matches the data access pattern specified by the pragma.
- An incorrectly sized (normally too large) #pragma SDS data zero_copy can pull invalid data from cache. This is seen in hardware. Emulation is likely to pass as there is no cache controller in software.
Peeking and Poking IP Registers
With the Xilinx® System Debugger tool (XSDB), you can understand what is happening with the IP blocks included with the platform or the various C-callable IP blocks. From the Xilinx Software Command Line Tool (XSCT) console, you can read and write registers within various IP blocks in the integrated design. Registers can be read by typing the memory read command, mrd. Likewise, a writable register in any IP in the design can be written to by typing the mwr command in the XSCT console. For help with commands, type <command> -help.
The register addresses can be found in the Vivado project for the design, located at <project_name>/<Debug or Release>/_sds/p0/vivado/prj/prj.xpr. Double-clicking prj.xpr opens the project in Vivado. In the Vivado IDE, click the relevant entry under Flow Navigator to open the design, and then click the Address Editor tab to view the memory map information.
For details on XSDB, refer to SDK Online Help (UG782).
Event Tracing
This section describes how traces are collected and displayed in the SDSoC environment.
Runtime Trace Collection
Software traces are inserted into the same storage path as the hardware traces and receive a time stamp using the same timer/counter as hardware traces. This single-trace data stream is buffered in the hardware system and accessed over JTAG by the host PC.
In the SDSoC environment, traces are read back constantly while the program executes, in an attempt to empty the hardware buffer as quickly as possible and prevent buffer overflow. However, trace data is only displayed after the application has finished.
Trace data is collected in real time when you are running on the hardware. For information about connecting to the hardware, refer to Connecting to the Hardware.
Trace Visualization
The SDSoC environment displays a graphical rendering of the hardware and software trace stream. Each trace point in the user application is given a unique name, and its own axis on the timeline. In general, a trace point can create multiple trace events throughout the execution of the application, for example, if the same block of code is executed in a loop, or if an accelerator is invoked more than once.
Each trace event has a few different attributes: name, type, start time, stop time, and duration. This data is shown as a tool-tip when the cursor hovers above one of the event rectangles in the view.
Troubleshooting
The following section provides general information on troubleshooting the different conditions encountered during event tracing.
- Incremental build flow: The SDSoC environment does not support any incremental build flow using the trace feature. To ensure the correct build of your application and correct trace collection, do a project clean first, followed by a build after making any changes to your source code. Even if the source code you change does not relate to or impact any function marked for hardware, you can see incorrect results.
- Programming and bitstream: The trace functionality is a single-use type of analysis. The timer used for time-stamping events is not started until the first event occurs, and runs indefinitely afterward. If you run your software application once after programming the bitstream, the timer is in an unknown state after your program is finished running. Running your software for a second time results in incorrect timestamps for events. Be sure to program the bitstream first, followed by downloading your software application, each and every time you run your application to take advantage of the trace feature. Your application will run correctly a second time, but the trace data will not be correct. For Linux, you need to reboot because the bitstream is loaded during boot time by U-Boot.
- Buffering up traces: In the SDSoC environment, traces are buffered up and read out in real time as the application executes (although at a slower speed than they are created on the device), but are displayed after the application finishes in a post-processing fashion. This relies on having enough buffer space to store traces until they can be read out by the host PC. By default, there is enough buffer space for 1024 traces. After the buffer fills up, subsequent traces that are produced are dropped and lost. An error condition is set when the buffer overflows. Any traces created after the buffer overflows are not collected, and traces just prior to the overflow might be displayed incorrectly.
- Errors: In the SDSoC environment, traces are buffered up in hardware before being read out over JTAG by the host PC. If traces are produced faster than they are consumed, a buffer overflow event might occur. The trace infrastructure recognizes this and sets an error flag that is detected during the collection on the host PC. After the error flag is parsed during trace data collection, collection is halted and the trace data that was read successfully is prepared for display. However, some data read successfully just prior to the buffer overflow might appear incorrectly in the visualization.
After an overflow occurs, an error file is created in the <build_config>/_sds/trace directory with the name in the following format: archive_DAY_MON_DD_HH_MM_SS_-GMT_YEAR_ERROR. You must reprogram the device (reboot Linux and so on) prior to running the application and collecting trace data again. The only way to reset the trace hardware in the design is with reprogramming.
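The overflow behavior described in this section can be modeled in software to make the consequences concrete. The class below is a hypothetical model written for illustration, not the trace hardware's implementation: once the buffer overflows, the error flag stays set and every later trace is dropped until the model is reset, mirroring the need to reprogram the device.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical software model of a bounded trace buffer.
class TraceBufferModel {
public:
    explicit TraceBufferModel(std::size_t capacity) : capacity_(capacity)
    {
        assert(capacity > 0);
    }

    // Producer side (the device). Returns false if the trace was dropped.
    bool push(unsigned trace)
    {
        if (overflowed_) {
            return false;  // nothing is collected after an overflow
        }
        if (entries_.size() == capacity_) {
            overflowed_ = true;  // error condition is latched
            return false;
        }
        entries_.push_back(trace);
        return true;
    }

    // Consumer side (the host reading traces out). Returns false when empty.
    bool pop(unsigned& trace)
    {
        if (entries_.empty()) {
            return false;
        }
        trace = entries_.front();
        entries_.pop_front();
        return true;
    }

    bool overflowed() const { return overflowed_; }

    // Models reprogramming the device, which is the only reset the text gives.
    void reprogram()
    {
        entries_.clear();
        overflowed_ = false;
    }

private:
    std::size_t capacity_;
    std::deque<unsigned> entries_;
    bool overflowed_ = false;
};
```

With a capacity of 1024, the model accepts traces as long as the consumer keeps up, and latches the error on the first push that finds the buffer full.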
Debugging with Software/Hardware Cross Probing
After an SDx environment application has been created and functions are marked for hardware acceleration, build the design with the appropriate settings. Then, connect to the target board (see Connecting to the Hardware).
Setting Debug Configurations
- In the Project Explorer view, click the ELF (.elf) file in the Debug folder in the project.
- In the toolbar, click Debug, or use the Debug drop-down list to select a debug launch option.
- Alternatively, right-click the project and select a debug launch option from the context menu. The Confirm Perspective Switch dialog box appears.
- Ensure that the board is switched on before debugging the project. Click Yes to switch to the debug perspective. You are now in the Debug Perspective of the SDx IDE.
Note: The debugger resets the system, programs and initializes the device, and then breaks at the main function. The source code is shown in the center panel, and local variables are shown in the top right corner panel. The SDx environment log at the bottom right panel shows the Debug Configuration log.
Before you run the application, connect a serial terminal to the board so that you can see the output from your program. As an example, the following settings can be used:
- Connection Type: Serial
- Port: COM<n>
- Baud Rate: 115200
Running the Application
Click Resume to run your application and observe the output in the terminal window. The source code window shows the _exit function, and the Terminal tab shows the output from the application.
The code stops execution at the main function, as can be seen in the Debug tab. Additional breakpoints can be set in the code at specific points to stop the execution of the code at that specific point. Breakpoints can be enabled or disabled by double-clicking on the vertical blue bar adjacent to the line numbers in the code. Execution of the code can be resumed by clicking the Resume icon on the toolbar.
Tips for Debugging Performance
- sds_clock_counter(): Use this function to determine how much time different code sections, such as the accelerated code and the non-accelerated code, take to execute.
- sds_clock_frequency(): This function returns the number of CPU cycles per second.
You can estimate the actual hardware acceleration time by looking at the latency numbers in the Vivado Design Suite High-Level Synthesis (HLS) tool report files (_sds/vhls/…/*.rpt) or in the IDE. A latency of X accelerator clock cycles equals X * (processor_clock_freq/accelerator_clock_freq) processor clock cycles. Compare this with the time spent on the actual function call to determine the overhead of setup and data transfers.
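The conversion above is a single multiplication and division. The sketch below uses a hypothetical helper name, accel_to_processor_cycles, and the frequencies in the usage note are made-up values chosen only to show the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Converts a latency in accelerator clock cycles into processor clock
// cycles: X * (processor_clock_freq / accelerator_clock_freq).
// The multiplication is done first so that integer division does not
// truncate the ratio when the frequencies are not exact multiples.
// Very large inputs could overflow the 64-bit product.
inline std::uint64_t accel_to_processor_cycles(std::uint64_t accel_cycles,
                                               std::uint64_t processor_hz,
                                               std::uint64_t accelerator_hz)
{
    assert(accelerator_hz != 0);
    return (accel_cycles * processor_hz) / accelerator_hz;
}
```

For example, an HLS latency of 1000 cycles with an assumed 100 MHz accelerator clock and an assumed 600 MHz processor clock corresponds to 1000 * (600/100) = 6000 processor clock cycles.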
For best performance improvement, the time required for executing the accelerated function must be much smaller than the time required for executing the original software function. If this is not true, try to run the accelerator at a higher frequency by selecting a different clkid on the sds++ command line. If that does not work, try to determine whether the data transfer overhead is a significant part of the accelerated function execution time, and reduce the data transfer overhead. Note that the default clkid corresponds to 100 MHz for all platforms. More details about the clkid values for the given platform can be obtained by running sds++ -sds-pf-info <path>/<platform_name>.
- Move more code into the accelerated function so that the computation time increases, and the ratio of computation to data transfer time is improved.
- Reduce the amount of data to be transferred by modifying the code or using pragmas to transfer only the required data.
- Sequentialize the access pattern as observed from the accelerator code, because it is more efficient to burst transfers than to make a series of unrelated random accesses.
- Ensure that data transfers make use of system ports that are appropriate for the cache-ability of the data being transferred. Cache flushing can be a resource-intensive procedure, and using coherent ports to access coherent data, and non-coherent ports to access non-coherent data, makes a significant impact.
- Use sds_alloc() instead of malloc(), where possible. The memory that sds_alloc() returns is physically contiguous, which enables the use of data movers that require physically contiguous memory and are faster to configure. Also, pinning virtual pages, which is necessary when transferring data allocated by malloc(), is very costly.
Troubleshooting Compile and Link Time Errors
Typical compile/link time errors are indicated by error messages issued when running make. To analyze further, look at the log files and rpt files in the _sds/reports sub-directory created by the SDSoC environment in the build directory. The most recently generated log file usually indicates the cause of the error, such as a syntax error in the corresponding input file, or an error generated by the tool chain while synthesizing accelerator hardware or the data motion network.
The following are tips and strategies to address errors specific to the SDSoC environment.
Tool Errors Are Reported by Tools in the SDSoC Environment Chain
Try the following troubleshooting steps:
- Check whether the corresponding code adheres to the Coding Guidelines in SDSoC Environment Programmers Guide.
- Check the syntax of pragmas.
- Check for typos in pragmas that might prevent them from being applied as intended.
Vivado Design Suite High-Level Synthesis (HLS) Cannot Meet Timing Requirement
- Select a slower clock frequency for the accelerator in the SDx IDE (or with the sdscc/sds++ command line parameter).
- Modify the code structure to allow HLS to generate a faster implementation. See the Improving Hardware Function Parallelism section in SDSoC Profiling and Optimization Guide for more information on how to do this.
Vivado Tools Cannot Meet Timing
- In the SDx IDE, select a slower clock frequency for the data motion network or accelerator, or both (from the command line, use sdscc/sds++ command line parameters).
- Use the -xp option to specify a Vivado implementation strategy to improve results. For example: -impl-strategy Performance_Explore
- Synthesize the HLS block at a higher clock frequency so that the synthesis and implementation tools have a bigger margin.
- Modify the C/C++ code passed to HLS, or add more HLS directives to make the HLS block go faster.
- Reduce the size of the design in cases where the resource usage exceeds 80%. Refer to the Vivado tools reports in the _sds folder.
The Design Is Too Large to Fit
- Reduce the number of accelerated functions.
- Change the coding style for an accelerator function to produce a more compact accelerator. You can reduce the amount of parallelism using the mechanisms described in the Improving Hardware Function Parallelism section in SDSoC Profiling and Optimization Guide.
- Modify pragmas and coding styles (pipelining) that cause multiple instances of accelerators to be created.
- Use pragmas to select smaller data movers such as AXIFIFO instead of AXIDMA_SG.
- Rewrite hardware functions to have fewer input and output parameters/arguments, especially in cases where the inputs/outputs are continuous stream (sequential access array argument) types that prevent the sharing of data mover hardware.
Troubleshooting Performance Issues
The SDSoC environment provides some basic performance monitoring capabilities in the form of the sds_clock_counter() function. Use this function to determine how much time different code sections, such as the accelerated and the non-accelerated code, take to execute.
To estimate the actual hardware acceleration time, you need to know the latency numbers from the Vivado HLS report, the clock frequency for the accelerator, and the Arm CPU clock frequency. The Vivado HLS report with the latency numbers can be opened from the Assistant view. To view the clock frequency for the accelerator, go to the Hardware Functions section of the Project Settings. Click on the Platform link in the Project Overview to open the Platform Summary dialog. The CPU frequency is shown under Clock Frequencies. A latency of X accelerator clock cycles is equal to X * (<processor clock frequency>/<accelerator clock frequency>) processor clock cycles. Compare this with the time spent on the actual function call to determine the data transfer overhead.
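The comparison described above can be sketched as a small calculation that splits a measured call time into estimated compute time and the remainder, which is attributed to setup and data transfer. The struct and function names are hypothetical, and the figures in the usage note are made-up values for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Result of splitting a measured call into compute and overhead, both
// expressed in processor clock cycles.
struct OverheadEstimate {
    std::uint64_t compute_cycles;   // HLS latency converted to processor cycles
    std::uint64_t overhead_cycles;  // remainder of the measured call
};

// `measured_call_cycles` is the processor-cycle count measured around the
// function call (for example with sds_clock_counter()).
// `hls_latency_cycles` is the latency from the Vivado HLS report.
inline OverheadEstimate estimate_overhead(std::uint64_t measured_call_cycles,
                                          std::uint64_t hls_latency_cycles,
                                          std::uint64_t processor_hz,
                                          std::uint64_t accelerator_hz)
{
    assert(accelerator_hz != 0);
    const std::uint64_t compute =
        (hls_latency_cycles * processor_hz) / accelerator_hz;
    // Clamp at zero in case the report latency exceeds the measurement.
    const std::uint64_t overhead =
        measured_call_cycles > compute ? measured_call_cycles - compute : 0;
    return OverheadEstimate{compute, overhead};
}
```

For instance, a measured call of 10000 processor cycles with a reported latency of 1000 accelerator cycles, at assumed clocks of 600 MHz and 100 MHz, gives 6000 compute cycles and 4000 cycles of overhead.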
For best performance improvement, the time required for executing the accelerated function must be much smaller than the time required for executing the original software function. If this is not true, try to run the accelerator at a higher frequency by selecting a different clkid on the sdscc/sds++ command line. If that does not work, try to determine whether the data transfer overhead is a significant part of the accelerated function execution time, and reduce the data transfer overhead. The clkid values for a given platform can be obtained by running the following command:
sds++ -sds-pf-info <path>/<platform_name>
- Move more code into the accelerated function so that the computation time increases, and the ratio of computation to data transfer time is improved.
- Reduce the amount of data to be transferred by modifying the code or using pragmas to transfer only the required data.