Hardware Tracing
The SDSoC environment supports hardware event tracing of accelerators cross-compiled using Vivado HLS, and data transfers over AXI4-Stream connections. When the sdscc/++ linker is invoked with the -trace option, it automatically inserts hardware monitor IP cores into the generated system to log these event types:
- Accelerator start and stop, defined by ap_start and ap_done signals.
- Data transfer start and stop, defined by AXI4-Stream handshake and TLAST signals.
ap_start and ap_done signals are not part of the IP interface: #pragma HLS interface s_axilite port=foo
The following table shows the approximate resource utilization of these hardware monitor cores on a Zynq-7000 (xc7z020-1clg400) device:
Core Name | LUTs | FFs | BRAMs | DSPs |
---|---|---|---|---|
Accelerator | 79 | 18 | 0 | 0 |
AXI4-Stream (basic) | 79 | 14 | 0 | 0 |
AXI4-Stream (statistics) | 132 | 183 | 0 | 0 |
The AXI4-Stream monitor core has two modes: basic and statistics. Basic mode only generates the start/stop trace events. Statistics mode additionally enables an AXI4-Lite interface to two 32-bit registers. The register at offset 0x0 holds the word count of the current, ongoing transfer. The register at offset 0x4 holds the word count of the previous transfer. As soon as a transfer completes, the current count is moved to the previous register. By default, the AXI4-Stream core is configured in basic mode. Future releases will enable the user to choose which mode to use. The core itself already supports both modes, so it could potentially be reconfigured manually in the Vivado tools, but doing so is not supported in the current release.
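The register behavior described above can be sketched as a small behavioral model. This is only an illustration in Python, not the monitor's RTL or a driver for it: the class and method names are hypothetical, and it assumes that the current count restarts from zero once it has been moved to the previous register, and that one word is counted per accepted AXI4-Stream beat. The text only states that the count is moved on completion.

```python
class StreamStatsModel:
    """Behavioral sketch of the two statistics registers (hypothetical model)."""

    CURRENT_OFFSET = 0x0   # word count of the current, ongoing transfer
    PREVIOUS_OFFSET = 0x4  # word count of the previous transfer
    MASK32 = 0xFFFFFFFF    # both registers are 32 bits wide

    def __init__(self):
        self.current = 0
        self.previous = 0

    def beat(self, tlast=False):
        """Record one accepted word; tlast=True marks the end of a transfer."""
        self.current = (self.current + 1) & self.MASK32
        if tlast:
            # Transfer complete: move the current count to the previous register.
            self.previous = self.current
            # Assumption: the current counter restarts from zero afterwards.
            self.current = 0

    def read(self, offset):
        """Return the register value at the given AXI4-Lite offset."""
        if offset == self.CURRENT_OFFSET:
            return self.current
        if offset == self.PREVIOUS_OFFSET:
            return self.previous
        raise ValueError(f"no register at offset {offset:#x}")


if __name__ == "__main__":
    mon = StreamStatsModel()
    for i in range(5):               # a five-word transfer, TLAST on the last word
        mon.beat(tlast=(i == 4))
    mon.beat()                       # two words into the next transfer
    mon.beat()
    print(mon.read(0x0), mon.read(0x4))  # prints "2 5"
```

Under these assumptions, reading offset 0x0 in the middle of a transfer shows progress so far, while offset 0x4 keeps the size of the last completed transfer until the next TLAST.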
In addition to the hardware trace monitor cores, the output trace event signals are combined by a single integration core. This core has a parameterizable number of ports (from 1 to 63), and can thus support up to 63 individual monitor cores (either accelerator or AXI4-Stream). The resource utilization of this core depends on the number of ports enabled, and thus on the number of monitor cores inserted. The following table shows the resource utilization of this core for a Zynq-7000 (xc7z020-1clg400) device:
Number of Ports | LUTs | FFs | BRAMs | DSPs |
---|---|---|---|---|
1 | 241 | 404 | 0 | 0 |
2 | 307 | 459 | 0 | 0 |
3 | 366 | 526 | 0 | 0 |
4 | 407 | 633 | 0 | 0 |
6 | 516 | 686 | 0 | 0 |
8 | 644 | 912 | 0 | 0 |
16 | 1243 | 1409 | 0 | 0 |
32 | 2190 | 2338 | 0 | 0 |
63 | 3830 | 3812 | 0 | 0 |
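One way to reduce the table above to a per-port figure is to divide each row by its port count and average the results. The sketch below does this in Python. The function names are hypothetical, and the unweighted mean over the listed configurations is an assumption about how such an average might be formed, not a documented method.

```python
# (ports, LUTs, FFs) rows copied from the integration core table above.
INTEGRATION_CORE = [
    (1, 241, 404),
    (2, 307, 459),
    (3, 366, 526),
    (4, 407, 633),
    (6, 516, 686),
    (8, 644, 912),
    (16, 1243, 1409),
    (32, 2190, 2338),
    (63, 3830, 3812),
]


def per_port_cost(rows):
    """Return (ports, LUTs per port, FFs per port) for each table row."""
    return [(ports, luts / ports, ffs / ports) for ports, luts, ffs in rows]


def mean_per_port_cost(rows):
    """Unweighted mean of the per-port cost over all listed configurations."""
    costs = per_port_cost(rows)
    mean_luts = sum(c[1] for c in costs) / len(costs)
    mean_ffs = sum(c[2] for c in costs) / len(costs)
    return mean_luts, mean_ffs


if __name__ == "__main__":
    for ports, luts, ffs in per_port_cost(INTEGRATION_CORE):
        print(f"{ports:2d} ports: {luts:6.1f} LUTs/port, {ffs:6.1f} FFs/port")
    print("mean per port: %.1f LUTs, %.1f FFs" % mean_per_port_cost(INTEGRATION_CORE))
```

The per-row values also show that the cost per port falls as ports are added, since the fixed part of the core is shared across more monitors.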
Averaged over the configurations in the table, the integration core uses about 110 look-up tables (LUTs) and 160 flip-flops (FFs) per port, with the per-port cost falling as the number of ports grows.
At the system level, for example, the resource utilization for the matrix multiplication template application on the ZC702 platform (using the same xc7z020-1clg400 part) is shown in the table below:
System | LUTs | FFs | BRAMs | DSPs |
---|---|---|---|---|
Base (no trace) | 16,433 | 21,426 | 46 | 160 |
Event trace enabled | 17,612 | 22,829 | 48 | 160 |
Based on the results above, the difference between the designs is approximately 1,000 LUTs, 1,200 FFs, and two BRAMs. This design has a single accelerator with three AXI4-Stream ports (two inputs and one output). When event trace is enabled, four monitors are inserted into the system (one accelerator and three AXI4-Stream monitors), in addition to a single integration core and other associated read-out logic. Given the resource estimates above, about 720 LUTs and 700 FFs come from the actual trace monitoring hardware: the four monitors account for 316 LUTs and 60 FFs, and the four-port integration core for 407 LUTs and 633 FFs. The remaining 280 LUTs, 500 FFs, and two BRAMs come from the read-out logic, which converts the AXI4-Stream output trace data stream to JTAG. The resource utilization of this read-out logic is static and does not vary with the design.
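The estimate for the trace monitoring hardware can be reproduced from the per-core tables. The following Python sketch adds up the monitors and the matching integration core row, and also computes the raw difference between the two system-level rows. The names are hypothetical, and the lookup only covers port counts that appear in the integration core table. Note that the raw difference taken directly from the system table (1,179 LUTs and 1,403 FFs) is somewhat larger than the rounded figures quoted in the text.

```python
# (LUTs, FFs) per monitor core, from the monitor core table.
MONITOR_COST = {
    "accelerator": (79, 18),
    "axis_basic": (79, 14),
    "axis_statistics": (132, 183),
}

# (LUTs, FFs) per integration core configuration, keyed by number of ports.
INTEGRATION_CORE_COST = {
    1: (241, 404), 2: (307, 459), 3: (366, 526), 4: (407, 633),
    6: (516, 686), 8: (644, 912), 16: (1243, 1409),
    32: (2190, 2338), 63: (3830, 3812),
}

# (LUTs, FFs, BRAMs) for the matrix multiplication example on the ZC702.
BASE_SYSTEM = (16433, 21426, 46)
TRACED_SYSTEM = (17612, 22829, 48)


def trace_hardware_cost(monitors):
    """Sum monitors plus integration core; monitors maps kind -> count."""
    ports = sum(monitors.values())
    if ports not in INTEGRATION_CORE_COST:
        raise KeyError(f"no table row for {ports} ports")
    luts, ffs = INTEGRATION_CORE_COST[ports]
    for kind, count in monitors.items():
        m_luts, m_ffs = MONITOR_COST[kind]
        luts += count * m_luts
        ffs += count * m_ffs
    return luts, ffs


def system_delta(base, traced):
    """Element-wise difference between the traced and the base design."""
    return tuple(t - b for b, t in zip(base, traced))


if __name__ == "__main__":
    # One accelerator monitor and three basic AXI4-Stream monitors.
    print(trace_hardware_cost({"accelerator": 1, "axis_basic": 3}))  # (723, 693)
    print(system_delta(BASE_SYSTEM, TRACED_SYSTEM))  # (1179, 1403, 2)
```

The first result rounds to the 720 LUTs and 700 FFs attributed to the monitors and integration core.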