Debug Techniques

This chapter describes the different styles of debugging techniques applicable to SDSoC™ applications. It highlights different approaches for software-based debugging and hardware-oriented techniques. In the software-based approaches, a full understanding of the implementation of the design in the FPGA is not required. However, this concept can only be extended to a certain degree, at which point a hardware-based detailed analysis should be performed. Highlighting pure software debugging techniques is not the intent of this document.

When debugging SDSoC applications, you can use the same methods and techniques that are used for debugging standard C/C++ applications. Most SDSoC applications consist of specific functions tagged for hardware acceleration and surrounded by standard C/C++ code.

When debugging an SDSoC application with a board attached to the debug host machine, you can right-click on a build configuration in the Assistant view, and select the Debug > Launch on Hardware option to begin a debug session.

You can select options other than the default settings by using the Debug > Debug Configurations command to create a new custom debug configuration. As the debug environment is initialized, Xilinx recommends that you switch to the Debug perspective when prompted. The debug perspective view provides the ability to debug the standard C/C++ portions of the application by single-stepping code, setting and removing breakpoints, displaying variables, dumping registers, viewing memory, and controlling the code flow with “run until” and “jump to” debugging directives. Inputs and outputs can be observed before and after the function call to determine the correct behavior.

You can determine if a hardware accelerated application meets its real-time requirements by placing debug statements to start and stop a counter just before and just after a hardware accelerated function. The SDx™ environment provides the sds_clock_counter() function, which is typically used to calculate the elapsed time for a hardware accelerated function.
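The counter-based measurement described above can be sketched as follows. This is a minimal sketch, not the SDx implementation: CycleTimer, CounterFn, and host_counter are hypothetical names introduced for illustration, and host_counter() is a std::chrono stand-in so that the example also runs on a development host. On the target, you would instead pass a function that returns the value of sds_clock_counter().

```cpp
#include <chrono>
#include <cstdint>

// Signature of a free-running counter source. On the target this would wrap
// sds_clock_counter(); it is left pluggable here (assumption of this sketch).
using CounterFn = std::uint64_t (*)();

// Hypothetical stand-in counter for host runs: nanoseconds from a steady clock.
inline std::uint64_t host_counter() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}

// Accumulates the counter ticks spent between start() and stop() calls.
class CycleTimer {
public:
    explicit CycleTimer(CounterFn read) : read_(read) {}

    void start() { begin_ = read_(); ++calls_; }
    void stop()  { total_ += read_() - begin_; }

    std::uint64_t total() const { return total_; }
    // Mean ticks per start()/stop() pair; 0 if the timer was never started.
    std::uint64_t average() const { return calls_ ? total_ / calls_ : 0; }

private:
    CounterFn read_;
    std::uint64_t begin_ = 0;
    std::uint64_t total_ = 0;
    std::uint64_t calls_ = 0;
};
```

Call start() immediately before the hardware accelerated function and stop() immediately after it; average() then gives the mean number of ticks per call, which you can compare against the real-time budget.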

You can also perform debugging without a target board connected to the debug host by building the SDx project for emulation. During emulation, you can control and observe the software and data just as before through the debug perspective view, but you can also view the hardware accelerated functions through a Vivado® simulator waveform viewer. You can observe accelerator signaling for conditions such as accelerator start and accelerator done, and you can monitor data buses for inputs and outputs. Building a project for emulation also avoids a possibly long Vivado implementation step to generate an FPGA bitstream.

See the SDSoC Environment Debugging Guide for information on using the interactive debuggers in the SDx IDE.

Debugging System Hangs and Runtime Errors

Programs compiled using sds++ can be debugged using the standard debuggers supplied with the SDx environment or Vivado. Typical runtime errors are incorrect results, premature program exits, and program hangs. The first two kinds of error are familiar to C/C++ programmers, and can be debugged by stepping through the code using a debugger.

Note: Applications might hang when you are running on the board. Hangs commonly happen due to a mismatch of data size between the producer and the consumer.

A program hang is a runtime error caused by specifying an incorrect amount of data to be transferred across a streaming connection created using #pragma SDS data access_pattern(A:SEQUENTIAL), by specifying a streaming interface in a synthesizable function within the Vivado High-Level Synthesis (HLS) tool, or by a C-callable hardware function in a pre-built library that has streaming hardware interfaces. A program hangs when the consumer of a stream is waiting for more data from the producer, but the producer has stopped sending data. Consider the following code fragment that results in streaming input/output from a hardware function:

#pragma SDS data access_pattern(in_a:SEQUENTIAL, out_b:SEQUENTIAL)
void f1(int in_a[20], int out_b[20]);     // declaration

void f1(int in_a[20], int out_b[20]) {    // definition
   int i;
   for (i=0; i < 19; i++) {
       out_b[i] = in_a[i];
   }
}

The array in_a[] has 20 elements, but the loop reads only 19 of them. Anything calling f1 would appear to hang, waiting indefinitely for f1 to consume the final element. Program errors that lead to hangs can be detected by using system emulation to ascertain whether the data signals are static by reviewing the associated protocol signals such as TLAST, ap_ready, ap_done, and TREADY. Program errors causing hangs can also be detected by instrumenting the code to flag streaming access errors, such as non-sequential access or incorrect access counts within a function, and running in software. Streaming access issues are typically flagged as improper streaming access warnings in the log file, and you can determine if these are actual errors.

Running your application on the SDSoC emulator is a good way to gain visibility of data transfers with a debugger. You can see where in software the system is hanging (often within a cf_wait() call), and can then inspect associated data transfers in the simulation waveform view, which gives you access to signals on the hardware blocks associated with the data transfer.
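As one way to instrument code for software-only runs, the following sketch wraps an array in a probe that records whether elements are accessed in order and how many were touched. SeqProbe and f1_body are hypothetical names introduced here and are not part of the SDSoC libraries; f1_body repeats the loop of f1 with the trip count as a parameter so that the miscount can be reproduced.

```cpp
#include <cstddef>

// Hypothetical software-only probe; not part of the SDSoC libraries.
template <typename T>
class SeqProbe {
public:
    SeqProbe(T* data, std::size_t expected)
        : data_(data), expected_(expected) {}

    // A stream delivers index 0, 1, 2, ... in order; flag anything else.
    T& operator[](std::size_t i) {
        if (i != count_) in_order_ = false;
        ++count_;
        return data_[i];
    }

    std::size_t count() const { return count_; }
    bool in_order() const { return in_order_; }
    // True only if exactly `expected` elements were accessed sequentially.
    bool ok() const { return in_order_ && count_ == expected_; }

private:
    T* data_;
    std::size_t expected_;
    std::size_t count_ = 0;
    bool in_order_ = true;
};

// Same loop as f1, templated so that it can run against the probes.
template <typename In, typename Out>
void f1_body(In& in_a, Out& out_b, int n) {
    for (int i = 0; i < n; i++) {
        out_b[i] = in_a[i];
    }
}
```

Running f1_body with n set to 19 against probes that expect 20 elements leaves ok() false with a count of 19, which is the mismatch that would stall the stream in hardware.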

System Hang Debugging Example

As another example, consider the following code that results in streaming input/output from the hardware function:

#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:large], out[0:small])
void too_large_copy(int* in, int* out, int small, int large)
{
	for(int i = 0; i < small; i++) {out[i] = in[i];}
}


int main()
{
	int* temp_var1 = new int[1024 * 1024];
	int* temp_var2 = new int[1024 * 1024];

	too_large_copy(temp_var1, temp_var2, 1024, 1024 * 1024); //hangs because the input DMA continues to try to feed data to a halted HLS core

}

In this case, the direct memory access (DMA) continues to try to send data to the hardware function, whereas the hardware function is already done and is not accepting any data. This results in a system hang.

To debug this type of issue, build the code for emulation on the base platform. When the application is compiled, start the emulator by selecting Xilinx > Start/Stop Emulator. Alternatively, you can start the emulator from the Assistant window: right-click the Active build configuration for the application and select Start/Stop Emulator.
In the Emulation dialog box, ensure that the Show Waveform (Programmable Logic only) check box is checked. This brings up the Vivado Simulator where the state of different interfaces can be viewed in the Waveform window. To monitor the interfaces of the hardware function, right-click on the function and select Add to Wave window. This adds all the I/O ports of the selected function to the Waveform window.
Start the simulator by clicking the Run All icon in the toolbar.
Go back to the SDx IDE, and then launch the application on the debugger. To do this, select the application to be debugged, right-click, and then select Launch on Emulator (SDx Application Debugger).

In the Confirm Perspective Switch dialog box, click Yes. The Debug Perspective opens with the application running on the emulator. The code execution stops at the main program entry.
Click the Resume button on the toolbar to execute the application.

The application is now stuck: a system hang has been encountered.
To determine the cause of the system hang, go back to Vivado Design Suite. Look at the state of the ap_done, ap_start, ap_idle, and ap_ready signals for the function. The state of these signals indicates that a transaction started when the ap_start signal went High and ended when the ap_done signal went High. The ap_ready and ap_idle signals likewise indicate the state of the function.

Analyzing the state of the DMA at the same point of time, you can see that while the hardware function has finished accepting data, the DMA is still writing to it, as indicated by the M00_AXIS_tready and the M00_AXIS_tvalid signals.

Now that you know the cause of the system hang, you can go back to the hardware function code and fix any outstanding issues.
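One possible fix for the example above is sketched below: the length declared in the copy pragma for the input is made equal to the number of elements that the function actually reads. The function name matched_copy and the single count argument are assumptions made for illustration; the right fix for your design depends on how much data the function really needs.

```cpp
// The data movers now transfer exactly `count` elements in each direction,
// which is the number of elements the loop consumes and produces.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in[0:count], out[0:count])
void matched_copy(int* in, int* out, int count)
{
    for (int i = 0; i < count; i++) {
        out[i] = in[i];
    }
}
```

A standard host compiler ignores the SDS pragmas (possibly with an unknown-pragma warning), so the function body can be checked in software before rebuilding for emulation.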

Causes of System Hangs

There are other situations where a system hang can occur as listed below:

If you can Ctrl+C out of the application, there was probably not enough data from the accelerator: the Arm® processor is expecting more data than the accelerator is sending. Review latencies if there is more than one path from a producer to a consumer. Designs where there are multiple paths with unequal latencies between two accelerators (for example, A -> B ... -> Z, while there is also A -> Z direct) need to be fixed at the design level by equalizing the branches.
If Ctrl+C does not work, but you can ping or ssh into the board, there is not enough data in a Scatter Gather DMA (SGDMA) operation. Review the data movers (copy or zero-copy) and the access pattern.
If you cannot ping the board and it has hard locked, only coming back to life after a power cycle, common causes are interaction between the following:
1. The SDSoC environment design and IP on the platform. Debug with the ChipScope™ feature and peeking and poking of registers; see Hardware Debugging in SDSoC Using ChipScope and Peeking and Poking IP Registers.
2. The SDSoC environment design and C-callable IP libraries. Debug with the ChipScope feature and peeking and poking of registers; see Hardware Debugging in SDSoC Using ChipScope and Peeking and Poking IP Registers.
3. The RTL or the SW driver generated in the SDSoC flow. If you have enough Vivado Design Suite or C driver experience you might be able to debug this; otherwise, contact the Xilinx forums.

Causes of Runtime Errors

The following list shows other sources of runtime errors:

Improper placement of wait() statements could result in the following issues:
- The software might read invalid data before a hardware accelerator has written the correct value.
- A blocking wait() might be called before a related accelerator is started, resulting in a system hang.
Inconsistent use of the SDS data mem_attribute pragma, which controls memory consistency, can produce incorrect results.
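The following sketch shows a wait placement that avoids both issues listed above. It assumes that the asynchronous call is expressed with #pragma SDS async(ID) and a matching #pragma SDS wait(ID); scale and scale_and_sum are hypothetical functions introduced for illustration. On a host compiler the pragmas are ignored and the call simply runs synchronously.

```cpp
// Hypothetical function marked for hardware; doubles each input element.
void scale(const int* in, int* out, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = 2 * in[i];
    }
}

int scale_and_sum(const int* in, int* out, int n)
{
    // Start the accelerator first; control returns before it finishes.
    #pragma SDS async(1)
    scale(in, out, n);

    // Independent CPU work can go here, but it must not read `out` yet.

    // Block only after the accelerator was started, and before reading `out`.
    #pragma SDS wait(1)

    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += out[i];
    }
    return sum;
}
```

Moving the wait above the call would block on an accelerator that was never started, and moving the summation above the wait would read out before the accelerator has written it.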

Unexpected Data Values

When the application is running, it is possible to get unexpected data. The hardware function might not be returning the expected data, or it might be returning expected data at the wrong time. This can be caused by hardware and/or software issues. If hardware is the suspected root cause, check data inputs to your board using the ChipScope feature if needed. If software is the suspected root cause, perform the following steps:

Go back to software debug and confirm that your software is good.
If the software debug is good, you need to visually inspect the code. Two common causes of unexpected data are incorrect use of the #pragma SDS data pragmas and of the #pragma SDS data zero_copy pragma.
If you are using #pragma SDS data pragmas, the tools trust what you write. Confirm that the data access pattern in the code matches the data access pattern specified by the pragma.
An incorrectly sized (normally too large) #pragma SDS data zero_copy can pull invalid data from cache. This is seen in hardware. Emulation is likely to pass as there is no cache controller in software.

Peeking and Poking IP Registers

With the Xilinx® System Debugger tool (XSDB), you can understand what is happening with the IP blocks included with the platform or the various C-callable IP blocks. From the Xilinx Software Command Line Tool (XSCT) console, you can read and write registers within various IP blocks in the integrated design. Registers can be read by typing the memory read command, mrd. Likewise, a writable register in any IP in the design can be written to by typing the mwr command in the XSCT console. For help with commands, type <command> -help.

You need to be familiar with the memory map of the various IP blocks within the design to be able to perform reads and writes to the registers. You can access this information by opening the Vivado project and looking at the address editor. The Vivado project can be found at:

<project_name>/<Debug or Release>/_sds/p0/vivado/prj/prj.xpr

Double-clicking prj.xpr opens the project in Vivado. In the Vivado IDE, click IP Integrator > Open Block Design under Flow Navigator. Click the Address Editor tab to view the memory map information.

For details on XSDB, refer to SDK Online Help (UG782).

CAUTION:

Trying to access an address that is not mapped results in a BUS ERROR. Addresses that are mapped, but lack proper backing, result in a system hang.
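Because an unmapped address causes a bus error, it can help to check an address against the memory map before issuing mrd or mwr. The sketch below assumes that you have copied the offset addresses and ranges from the Address Editor into a table. AddressRange and is_mapped are hypothetical names, and any addresses you try it with are your own. The check covers only whether an address is mapped; it cannot tell whether a mapped address has proper backing.

```cpp
#include <cstddef>
#include <cstdint>

// One row of the memory map, as read from the Address Editor.
struct AddressRange {
    std::uint64_t base;  // offset address of the IP block
    std::uint64_t size;  // range of the block in bytes
};

// Returns true if the access [addr, addr + len) lies entirely inside one entry.
inline bool is_mapped(const AddressRange* map, std::size_t entries,
                      std::uint64_t addr, std::uint64_t len) {
    for (std::size_t i = 0; i < entries; i++) {
        if (addr < map[i].base) continue;
        const std::uint64_t offset = addr - map[i].base;
        if (offset < map[i].size && len <= map[i].size - offset) {
            return true;
        }
    }
    return false;
}
```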

Event Tracing

This section describes how traces are collected and displayed in the SDSoC environment.

Runtime Trace Collection

Software traces are inserted into the same storage path as the hardware traces and receive a time stamp using the same timer/counter as hardware traces. This single-trace data stream is buffered in the hardware system and accessed over JTAG by the host PC.

In the SDSoC environment, traces are read back constantly while the program executes, in an attempt to empty the hardware buffer as quickly as possible and prevent buffer overflow. However, trace data is displayed only after the application has finished.

Trace data is collected in real time when you are running on the hardware. For information about connecting to the hardware, refer to Connecting to the Hardware.

Trace Visualization

The SDSoC environment displays a graphical rendering of the hardware and software trace stream. Each trace point in the user application is given a unique name, and its own axis on the timeline. In general, a trace point can create multiple trace events throughout the execution of the application, for example, if the same block of code is executed in a loop, or if an accelerator is invoked more than once.

Each trace event has a few different attributes: name, type, start time, stop time, and duration. This data is shown as a tool-tip when the cursor hovers above one of the event rectangles in the view.

Troubleshooting

The following section provides general information on troubleshooting the different conditions encountered during event tracing.

Incremental build flow: The SDSoC environment does not support any incremental build flow using the trace feature. To ensure the correct build of your application and correct trace collection, do a project clean first, followed by a build after making any changes to your source code. Even if the source code you change does not relate to or impact any function marked for hardware, you can see incorrect results.

Programming and bitstream: The trace functionality is a single-use type of analysis. The timer used for time-stamping events is not started until the first event occurs, and runs indefinitely afterward. If you run your software application once after programming the bitstream, the timer is in an unknown state after your program is finished running. Running your software for a second time results in incorrect timestamps for events. Be sure to program the bitstream first, followed by downloading your software application, each and every time you run your application to take advantage of the trace feature. Your application will run correctly a second time, but the trace data will not be correct. For Linux, you need to reboot because the bitstream is loaded during boot time by U-Boot.

Buffering up traces: In the SDSoC environment, traces are buffered up and read out in real time as the application executes (although at a slower speed than they are created on the device), but are displayed after the application finishes in a post-processing fashion. This relies on having enough buffer space to store traces until they can be read out by the host PC. By default, there is enough buffer space for 1024 traces. After the buffer fills up, subsequent traces that are produced are dropped and lost. An error condition is set when the buffer overflows. Any traces created after the buffer overflows are not collected, and traces just prior to the overflow might be displayed incorrectly.

Errors: In the SDSoC environment, traces are buffered up in hardware before being read out over JTAG by the host PC. If traces are produced faster than they are consumed, a buffer overflow event might occur. The trace infrastructure recognizes this and sets an error flag that is detected during the collection on the host PC. After the error flag is parsed during trace data collection, collection is halted and the trace data that was read successfully is prepared for display. However, some data read successfully just prior to the buffer overflow might appear incorrectly in the visualization.

After an overflow occurs, an error file is created in the <build_config>/_sds/trace directory with the name in the following format: archive_DAY_MON_DD_HH_MM_SS_-GMT_YEAR_ERROR. You must reprogram the device (reboot Linux and so on) prior to running the application and collecting trace data again. The only way to reset the trace hardware in the design is with reprogramming.
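The overflow behavior can be illustrated with a toy software model. This is not the SDSoC trace hardware: TraceBufferModel is a hypothetical class, and only the default capacity of 1024 traces comes from the text above. The error flag is modeled as sticky, mirroring the need to reprogram the device before collecting again.

```cpp
#include <cstddef>

// Toy model of a fixed-size trace buffer with a sticky overflow flag.
class TraceBufferModel {
public:
    explicit TraceBufferModel(std::size_t capacity = 1024)
        : capacity_(capacity) {}

    // Device side: emit one trace. Returns false if the trace was dropped.
    bool produce() {
        if (overflowed_ || used_ == capacity_) {
            overflowed_ = true;  // stays set until the model is re-created
            ++dropped_;
            return false;
        }
        ++used_;
        return true;
    }

    // Host side: read out up to `n` traces. Collection halts after an overflow.
    std::size_t consume(std::size_t n) {
        if (overflowed_) return 0;
        const std::size_t taken = n < used_ ? n : used_;
        used_ -= taken;
        return taken;
    }

    std::size_t used() const { return used_; }
    std::size_t dropped() const { return dropped_; }
    bool overflowed() const { return overflowed_; }

private:
    std::size_t capacity_;
    std::size_t used_ = 0;
    std::size_t dropped_ = 0;
    bool overflowed_ = false;
};
```

Calling produce() more often than consume() drains the buffer eventually sets the flag, after which every further trace is counted as dropped.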

Debugging with Software/Hardware Cross Probing

After an SDx environment application has been created and functions are marked for hardware acceleration, build the design with the appropriate settings. Then, connect to the target board (see Connecting to the Hardware).

Setting Debug Configurations

In the Project Explorer view, click the ELF (.elf) file in the Debug folder in the project.
In the toolbar, click Debug, or use the Debug drop-down list to select Debug As > Launch on Hardware (SDx Application Debugger).
Alternatively, right-click the project and select Debug As > Launch on Hardware (SDx Application Debugger). The Confirm Perspective Switch dialog box appears.
Ensure that the board is switched on before debugging the project. Click Yes to switch to the debug perspective. You are now in the Debug Perspective of the SDx IDE.
Note: The debugger resets the system, programs and initializes the device, and then breaks at the main function. The source code is shown in the center panel, and local variables are shown in the top right corner panel. The SDx environment log at the bottom right panel shows the Debug Configuration log.
Before you run the application, connect a serial terminal to the board so that you can see the output from your program. As an example, the following settings can be used:
- Connection Type: Serial
- Port: COM<n>
- Baud Rate: 115200

Running the Application

Click Resume to run your application and observe the output in the terminal window. The source code window shows the _exit function, and the Terminal tab shows the output from the application.

The code stops execution at the main function, as can be seen in the Debug tab. Additional breakpoints can be set in the code at specific points to stop the execution of the code at that specific point. Breakpoints can be enabled or disabled by double-clicking on the vertical blue bar adjacent to the line numbers in the code. Execution of the code can be resumed by clicking the Resume icon on the toolbar.

Tips for Debugging Performance

The SDSoC environment provides some basic performance monitoring capabilities with the following functions:

sds_clock_counter(): Use this function to determine how much time different code sections, such as the accelerated code and the non-accelerated code, take to execute.
sds_clock_frequency(): This function returns the number of CPU cycles per second.

You can estimate the actual hardware acceleration time by looking at the latency numbers in the Vivado Design Suite High-Level Synthesis (HLS) tool report files (_sds/vhls/…/*.rpt) or in the IDE under Reports > HLS Report. A latency of X accelerator clock cycles equals X * (processor_clock_freq/accelerator_clock_freq) processor clock cycles. Compare this with the time spent on the actual function call to determine the overhead of setup and data transfers.
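The conversion above can be written as a small helper. This is an illustrative sketch: accel_to_cpu_cycles is a hypothetical name, and the frequencies in the usage note are example values rather than the clocks of any particular platform.

```cpp
#include <cstdint>

// Converts an HLS latency in accelerator clock cycles to processor clock
// cycles: X * (processor_clock_freq / accelerator_clock_freq).
inline std::uint64_t accel_to_cpu_cycles(std::uint64_t accel_cycles,
                                         std::uint64_t cpu_hz,
                                         std::uint64_t accel_hz) {
    // Multiply before dividing so that a non-integer frequency ratio is not
    // truncated early. Very large latencies could overflow 64 bits.
    return accel_cycles * cpu_hz / accel_hz;
}
```

For example, with an assumed 600 MHz processor clock and a 100 MHz accelerator clock, a reported latency of 10,000 accelerator cycles corresponds to 10,000 * 6 = 60,000 processor cycles.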

For best performance improvement, the time required for executing the accelerated function must be much smaller than the time required for executing the original software function. If this is not true, try to run the accelerator at a higher frequency by selecting a different clkid on the sds++ command line. If that does not work, try to determine whether the data transfer overhead is a significant part of the accelerated function execution time, and reduce the data transfer overhead. Note that the default clkid is 100 MHz for all platforms. More details about the clkid values for the given platform can be obtained by running sds++ -sds-pf-info <path>/<platform_name>.

If the data transfer overhead is large, the following changes might help:

Move more code into the accelerated function so that the computation time increases, and the ratio of computation to data transfer time is improved.
Reduce the amount of data to be transferred by modifying the code or using pragmas to transfer only the required data.
Sequentialize the access pattern as observed from the accelerator code, because it is more efficient to burst transfers than to make a series of unrelated random accesses.
Ensure that data transfers make use of system ports that are appropriate for the cache-ability of the data being transferred. Cache flushing can be a resource-intensive procedure, and using coherent ports to access coherent data, and non-coherent ports to access non-coherent data, makes a significant impact.
Use sds_alloc() instead of malloc() where possible. The memory that sds_alloc() returns is physically contiguous, which enables the use of data movers that require physically contiguous memory and are faster to configure. Also, pinning virtual pages, which is necessary when transferring data allocated by malloc(), is very costly.
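The last item can be sketched as a pair of wrappers that use sds_alloc() and sds_free() when building with sds++ and fall back to the standard heap elsewhere. The wrapper names accel_alloc and accel_free are hypothetical. The sketch also assumes that the SDSoC compilers define the __SDSCC__ macro and that sds_lib.h declares the two functions; confirm both for your release.

```cpp
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__          // assumption: defined by sdscc/sds++
#include "sds_lib.h"      // assumption: declares sds_alloc() and sds_free()
#endif

// Hypothetical helper: physically contiguous memory under sds++, heap elsewhere.
inline void* accel_alloc(std::size_t bytes) {
#ifdef __SDSCC__
    return sds_alloc(bytes);
#else
    return std::malloc(bytes);
#endif
}

// Releases memory obtained from accel_alloc() with the matching deallocator.
inline void accel_free(void* p) {
#ifdef __SDSCC__
    sds_free(p);
#else
    std::free(p);
#endif
}
```

Keeping allocation behind one pair of functions makes it harder to mix allocators by accident, for example by freeing an sds_alloc() buffer with free().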

Troubleshooting Compile and Link Time Errors

Typical compile/link time errors are indicated by error messages issued when running make. To analyze further, look at the log files and rpt files in the _sds/reports sub-directory created by the SDSoC environment in the build directory. The most recently generated log file usually indicates the cause of the error, such as a syntax error in the corresponding input file, or an error generated by the tool chain while synthesizing accelerator hardware or the data motion network.

The following are tips and strategies to address errors specific to the SDSoC environment.

Tool Errors Are Reported by Tools in the SDSoC Environment Chain

Try the following troubleshooting steps:

Check whether the corresponding code adheres to the Coding Guidelines in SDSoC Environment Programmers Guide.
Check the syntax of pragmas. See the for more details.
Check for typos in pragmas that might prevent them from being applied as intended.

Vivado Design Suite High-Level Synthesis (HLS) Cannot Meet Timing Requirement

Try the following troubleshooting steps:

Select a slower clock frequency for the accelerator in the SDx IDE (or with the sdscc/sds++ command line parameter).
Modify the code structure to allow HLS to generate a faster implementation. See the Improving Hardware Function Parallelism section in SDSoC Profiling and Optimization Guide for more information on how to do this.

Vivado Tools Cannot Meet Timing

Try the following troubleshooting steps:

In the SDx IDE, select a slower clock frequency for the data motion network or accelerator, or both (from the command line, use sdscc/sds++ command line parameters).
Use the -xp option to specify a Vivado implementation strategy to improve results. For example:
```
-impl-strategy Performance_Explore
```
Synthesize the HLS block at a higher clock frequency so that the synthesis and implementation tools have a bigger margin.
Modify the C/C++ code passed to HLS, or add more HLS directives to make the HLS block go faster.
Reduce the size of the design in cases where the resource usage exceeds 80%. Refer to the Vivado tools reports in the _sds folder.

The Design Is Too Large to Fit

Try the following troubleshooting steps:

Reduce the number of accelerated functions.
Change the coding style for an accelerator function to produce a more compact accelerator. You can reduce the amount of parallelism using the mechanisms described in the Improving Hardware Function Parallelism section in SDSoC Profiling and Optimization Guide.
Modify pragmas and coding styles (pipelining) that cause multiple instances of accelerators to be created.
Use pragmas to select smaller data movers such as AXIFIFO instead of AXIDMA_SG.
Rewrite hardware functions to have fewer input and output parameters/arguments, especially in cases where the inputs/outputs are continuous stream (sequential access array argument) types that prevent the sharing of data mover hardware.

Troubleshooting Performance Issues

The SDSoC environment provides some basic performance monitoring capabilities in the form of the sds_clock_counter() function. Use this function to determine how much time different code sections, such as the accelerated and the non-accelerated code, take to execute.

To estimate the actual hardware acceleration time, you need to know the latency numbers from the Vivado HLS report, the clock frequency for the accelerator, and the Arm CPU clock frequency. To open the Vivado HLS report for the latency numbers, in the Assistant view, go to <Project Name> > <Build Configuration> > <Accelerator Name> > HLS report. To view the clock frequency for the accelerator, go to the Hardware Functions section of the Project Settings. Click on the Platform link in the Project Overview to open the Platform Summary dialog. The CPU frequency is shown under Clock Frequencies. A latency of X accelerator clock cycles is equal to X * (<processor clock frequency>/<accelerator clock frequency>) processor clock cycles. Compare this with the time spent on the actual function call to determine the data transfer overhead.
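The comparison described above can be sketched as follows, assuming that you have already measured the function call in processor cycles (for example with sds_clock_counter()) and read the latency from the HLS report. OverheadEstimate and estimate_overhead are hypothetical names, and the numbers in the usage note are example values only.

```cpp
#include <cstdint>

struct OverheadEstimate {
    std::uint64_t compute_cpu_cycles;   // HLS latency expressed in CPU cycles
    std::uint64_t overhead_cpu_cycles;  // measured call time minus compute time
};

// Splits a measured call time into estimated compute and transfer overhead.
inline OverheadEstimate estimate_overhead(std::uint64_t measured_cpu_cycles,
                                          std::uint64_t hls_latency_cycles,
                                          std::uint64_t cpu_hz,
                                          std::uint64_t accel_hz) {
    // X * (processor clock frequency / accelerator clock frequency)
    const std::uint64_t compute = hls_latency_cycles * cpu_hz / accel_hz;
    // Clamp at zero in case the report latency exceeds the measurement.
    const std::uint64_t overhead =
        measured_cpu_cycles > compute ? measured_cpu_cycles - compute : 0;
    return {compute, overhead};
}
```

For example, a call measured at 100,000 processor cycles with an HLS latency of 10,000 cycles, an assumed 600 MHz processor clock, and a 100 MHz accelerator clock gives 60,000 cycles of compute and 40,000 cycles of overhead.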

For best performance improvement, the time required for executing the accelerated function must be much smaller than the time required for executing the original software function. If this is not true, try to run the accelerator at a higher frequency by selecting a different clkid on the sdscc/sds++ command line. If that does not work, try to determine whether the data transfer overhead is a significant part of the accelerated function execution time, and reduce the data transfer overhead.

Note: More details about the clkid values for a given platform can be obtained by running the following command:

sds++ -sds-pf-info <path>/<platform_name>

If the data transfer overhead is large, the following changes might help:

Move more code into the accelerated function so that the computation time increases, and the ratio of computation to data transfer time is improved.
Reduce the amount of data to be transferred by modifying the code or using pragmas to transfer only the required data.