Improving System Performance
This chapter describes underlying principles and inference rules within the SDSoC™ system compiler to assist the programmer in improving overall system performance through the following:
- Increased parallelism in the hardware function.
- Increased system parallelism and concurrency.
- Improved access to external memory from programmable logic.
- An understanding of the data motion network (default behavior and user specification).
There are many factors that affect overall system performance. A well-designed system generally balances computation and communication, so that all hardware components remain occupied doing meaningful work.
- Some applications are compute-bound; for these applications, concentrate on maximizing throughput and minimizing latency in hardware accelerators.
- Other applications might be memory-bound, in which case you might need to restructure algorithms to increase temporal and spatial locality in the hardware; for example, adding copy-loops or memcopy to pull blocks of data into hardware rather than making random array accesses to external memory, as sketched below.
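The following is a minimal sketch of this kind of restructuring; the function name, array names, and block size are illustrative assumptions, not taken from the SDSoC examples.
#include <string.h>

#define BLOCK 256  // illustrative block length; total is assumed to be a multiple of BLOCK

// Pull a contiguous block of external memory into a local buffer (mapped to
// on-chip memory after synthesis) before computing, instead of issuing random
// accesses to external memory from inside the inner loop.
void accumulate_blocks(int *ext_data, int *result, int total)
{
    int local[BLOCK];
    int sum = 0;
    for (int i = 0; i < total; i += BLOCK) {
        memcpy(local, &ext_data[i], BLOCK * sizeof(int)); // burst copy into local memory
        for (int j = 0; j < BLOCK; j++) {
            sum += local[j]; // all accesses now hit local memory
        }
    }
    *result = sum;
}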
Control over the various aspects of optimization is provided through the use of pragmas in the code. For a complete description of the available pragmas, refer to the pragma reference documentation for the SDSoC environment.
Improving Hardware Function Parallelism
This section provides a concise introduction to writing efficient code that can be cross-compiled into programmable logic.
The SDSoC environment employs the Vivado® High-Level Synthesis (HLS) tool as a programmable logic cross-compiler to transform C/C++ functions into hardware.
By applying the principles described in this section, you can dramatically increase the performance of the synthesized functions, which can lead to significant increases in overall system performance for your application.
Top-Level Hardware Function Guidelines
This section describes coding guidelines to ensure that the Vivado HLS tool hardware function has a consistent interface with object code generated by the Arm® core GNU toolchain.
Use Standard C99 Data Types for Top-Level Hardware Function Arguments
- Avoid using arrays of bool. An array of bool has a different memory layout between the GNU Arm cross compiler and the HLS tool.
- Avoid using hls::stream at the hardware function top-level interface. This data type helps the HLS tool compiler synthesize efficient logic within a hardware function but does not apply to application software.
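As a minimal illustration of these guidelines (the function name and array size are assumptions, not from the SDSoC examples), a top-level hardware function might expose fixed-width C99 types and plain arrays, keeping any hls::stream usage internal to the function body:
#include <stdint.h>

// Top-level hardware function: uses uint8_t instead of an array of bool, and
// plain array arguments instead of hls::stream at the interface.
void threshold(uint8_t in[1024], uint8_t out[1024], uint8_t level)
{
    for (int i = 0; i < 1024; i++) {
        out[i] = (in[i] > level) ? 1 : 0;
    }
}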
Omit HLS Interface Directives for Top-Level Hardware Function Arguments
Although supported, a top-level hardware function should not, in general, contain HLS interface pragmas. The sdscc/sds++ (referred to as sds++) system compiler automatically generates the appropriate HLS tool interface directives. Use the following SDS pragmas to direct the sds++ system compiler to generate the required HLS tool interface directives:
- #pragma SDS data zero_copy(): Use to generate a shared memory interface implemented as an AXI master interface in hardware.
- #pragma SDS data access_pattern(argument:SEQUENTIAL): Use to generate a streaming interface implemented as a FIFO interface in hardware.
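For example, the pragmas might be applied as follows; this is a sketch with an illustrative function signature, not code from the SDSoC samples:
// A is streamed through a FIFO interface; B is accessed through a shared
// memory (AXI master) interface. The pragmas immediately precede the
// declaration visible to the caller.
#pragma SDS data access_pattern(A:SEQUENTIAL)
#pragma SDS data zero_copy(B)
void filter(int A[1024], int B[1024]);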
If you specify the interface using #pragma HLS interface for a top-level function argument, the SDSoC environment does not generate the HLS tool interface directive for that argument; you must then ensure that the generated hardware interface is consistent with all other function argument hardware interfaces. To avoid interface inconsistencies and sds++ system compiler error messages, it is strongly recommended (though not absolutely mandatory) that you omit HLS interface pragmas.
Using Vivado Design Suite HLS Libraries
This section describes how to use the Vivado HLS tool libraries with the SDSoC environment.
The HLS tool libraries are provided as source code with the HLS tool installation in the SDSoC environment. Consequently, you can use these libraries as you would any other source code that you plan to cross-compile for programmable logic using the HLS tool. In particular, ensure that the source code conforms to the rules described in Hardware Function Argument Types, which might require you to provide a C/C++ wrapper function to ensure the functions export a software interface to your application.
In the SDSoC IDE, the synthesizable finite impulse response (FIR) example template for all basic platforms provides an example that uses an HLS tool library. You can find several additional code examples that employ the HLS tool libraries in the samples/hls_lib directory. For example, samples/hls_lib/hls_math contains an example that implements and uses a square root function.
The file my_sqrt.h contains:
#ifndef _MY_SQRT_H_
#define _MY_SQRT_H_
#ifdef __SDSVHLS__
#include "hls_math.h"
#else
// The hls_math.h file includes hdl_fpo.h which contains actual code and
// will cause linker error in the ARM compiler, hence we add the function
// prototypes here
static float sqrtf(float x);
#endif
void my_sqrt(float x, float *ret);
#endif // _MY_SQRT_H_
The file my_sqrt.cpp contains:
#include "my_sqrt.h"
void my_sqrt(float x, float *ret)
{
*ret = sqrtf(x);
}
The makefile has the commands to compile these files:
sds++ -c -hw my_sqrt -sds-pf zc702 my_sqrt.cpp
sds++ -c my_sqrt_test.cpp
sds++ my_sqrt.o my_sqrt_test.o -o my_sqrt_test.elf
Increasing System Parallelism and Concurrency
Increasing the level of concurrent execution is a standard way to increase overall system performance, and increasing the level of parallel execution is a standard way to increase concurrency. Programmable logic is well-suited to implement architectures with application-specific accelerators that run concurrently, especially communicating through flow-controlled streams that synchronize between data producers and consumers.
In the SDSoC environment, you influence the
macro-architecture parallelism at the function and data mover level, and the micro-architecture
parallelism within hardware accelerators. By understanding how the sds++/sdscc
(referred to as sds++
) compiler infers
system connectivity and data movers, you can structure application code and apply pragmas, as
needed, to control hardware connectivity between accelerators and software, data mover selection,
number of accelerator instances for a given hardware function, and task level software control.
You can control the micro-architecture parallelism, concurrency, and throughput for hardware functions within the Vivado HLS tool, or within the IP you incorporate as C-callable and linkable libraries.
At the system level, the sds++
compiler chains
together hardware functions when the data flow between them does not require transferring
arguments out of programmable logic and back to system memory.
For example, consider the code in the following figure, where mmult
and madd
functions have been selected for
hardware.
Because the intermediate array variable tmp1
is used only to pass data between the two hardware functions, the sds++
compiler chains the two functions together in hardware with a direct connection
between them.
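The figure is not reproduced here; the following sketch shows the call pattern it describes, with illustrative argument types and matrix dimension:
#define N 32  // illustrative matrix dimension

void mmult(float A[N*N], float B[N*N], float C[N*N]); // selected for hardware
void madd(float A[N*N], float B[N*N], float C[N*N]);  // selected for hardware

void compute(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];    // used only to pass data between the two hardware calls
    mmult(A, B, tmp1);  // result stays in programmable logic
    madd(tmp1, C, D);   // consumed over a direct connection; no transfer back to memory
}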
It is instructive to consider a timeline for the calls to hardware, as shown in the following figure:
The program preserves the original program semantics, but instead of the standard Arm core procedure calling sequence, each hardware function call is broken into multiple phases involving setup, execution, and cleanup, both for the data movers (DM) and the accelerators. The CPU, in turn, sets up each hardware function (that is, the underlying IP control interface) and the data transfers for the function call with non-blocking APIs, and then waits for all calls and transfers to complete.
In the example shown in the following figure, the mmult
and madd
functions run concurrently whenever
their inputs become available. The ensemble of function calls is orchestrated in the compiled
program by control code automatically generated by the sds++
system compiler according to the program, data mover, and accelerator structure.
In general, it is impossible for the sds++
system compiler to determine side effects of function calls in your application code (for
example, sds++
might have no access to source code for functions
within linked libraries), so any intermediate access of a variable occurring lexically between
hardware function calls requires the compiler to transfer data back to memory.
For example, a simple but injudicious change that uncomments the debug print statement in the "wrong place," as shown in the following figure, can result in a significantly different data transfer graph and, consequently, an entirely different generated system and application performance.
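Continuing the illustrative sketch above, the change amounts to accessing tmp1 from software between the two hardware calls:
#include <stdio.h>

void compute(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];
    mmult(A, B, tmp1);
    // Uncommenting the next line forces tmp1 back to system memory between the
    // calls, breaking the direct PL-to-PL connection:
    // printf("tmp1[0] = %f\n", tmp1[0]);
    madd(tmp1, C, D);
}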
A program can invoke a single hardware function from multiple call sites. In
this case, the sds++
system compiler behaves as follows. If any
of the function calls results in "direct connection" data flow, then the sds++
system compiler creates an instance of the hardware function that services every
similar direct connection, and an instance of the hardware function that services the remaining
calls between memory (software) and PL.
To achieve high performance in PL, one of the best methods is structuring your application code with "direct connection" data flow between hardware functions. You can create deep pipelines of accelerators connected with data streams, increasing the opportunity for concurrent execution.
There is another way in which you can increase parallelism and concurrency
using the sds++
system compiler. You can direct the system
compiler to create multiple instances of a hardware function by inserting the following pragma
immediately preceding a call to the function.
#pragma SDS resource(<id>) // <id> a non-negative integer
This pragma creates a hardware instance that is referenced by <id>
.
A simple code snippet that creates two instances of a hardware function mmult
is as follows.
{
#pragma SDS resource(1)
mmult(A, B, C); // instance 1
#pragma SDS resource(2)
mmult(D, E, F); // instance 2
}
If creating multiple instances of an accelerator is not what you want, the sds_async mechanism gives the programmer the ability to handle the "hardware threads" explicitly to achieve very high levels of parallelism and concurrency. However, like any explicit multi-threaded programming model, it requires careful attention to synchronization details to avoid non-deterministic behavior or deadlocks. For more information, refer to the SDSoC Environment Programmers Guide.
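A minimal sketch of this mechanism, using the async and wait pragmas documented in the SDSoC Environment Programmers Guide (the call and its arguments are illustrative):
{
#pragma SDS async(1)
    mmult(A, B, C);   // returns immediately; the accelerator runs in the background
    // ... unrelated CPU work can execute here, overlapped with the accelerator ...
#pragma SDS wait(1)
    // C is guaranteed to be valid only after the wait completes
}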
Data Motion Network Generation in SDSoC
This section describes:
- Components that make up the data motion network in the SDSoC environment, and also provides guidelines to help you understand the data motion network generated by the SDSoC compiler.
- Guidelines to help you guide the data motion network generation by using appropriate SDSoC pragmas.
Every transfer between the software program and a hardware function requires a data mover, which consists of a hardware component that moves the data, and an operating system-specific library function. The following table lists supported data movers and various properties for each:
SDSoC Data Mover | Vivado® IP Data Mover | Accelerator IP Port Types | Transfer Size | Contiguous Memory Only |
---|---|---|---|---|
axi_dma_simple | axi_dma | bram, ap_fifo, axis | ≤ 32 MB | Yes |
axi_dma_sg | axi_dma | bram, ap_fifo, axis | N/A | No (but recommended) |
axi_fifo | axi_fifo_mm_s | bram, ap_fifo, axis | ≤ 300 B | No |
zero_copy | accelerator IP | aximm master | N/A | Yes |
- For array arguments, the data mover inference is based on transfer size, hardware function port mapping, and function call site information. The selection of data movers is a trade-off between performance and resources, for example:
  - The axi_dma_simple data mover is the most efficient bulk transfer engine and supports up to 32 MB transfers, so it is best for transfers under that limit.
  - The axi_fifo data mover does not require as many hardware resources as the DMA, but due to its slower transfer rates, is preferred only for payloads of up to 300 bytes.
  - The axi_dma_sg (scatter-gather DMA) data mover provides slower DMA performance and consumes more hardware resources but has fewer limitations, and in the absence of any pragma directives, is often the best default data mover.
You can specify the data mover selection by inserting a pragma into program source immediately before the function declaration; for example:
#pragma SDS data data_mover(A:AXI_DMA_SIMPLE)
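In context, this is a sketch with an illustrative function signature; the pragma immediately precedes the declaration of the function whose argument it names:
// Force a simple DMA for A; B is left to the compiler's default selection.
// AXI_DMA_SIMPLE requires physically contiguous memory, so the buffer passed
// for A should be allocated with sds_alloc().
#pragma SDS data data_mover(A:AXI_DMA_SIMPLE)
void foo(int A[1024], int B[1024]);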
A #pragma SDS is always treated as a rule, not a hint, so you must ensure that its use conforms with the data mover requirements in the previous table.
The data motion network in the SDSoC environment is made up of three components:
- The memory system ports on the PS (A)
- Data movers between the PS and accelerators as well as among accelerators (B)
- The hardware interface on an accelerator (C)
The following figure illustrates these three components.
Without any SDS pragma, the SDSoC environment automatically generates the data motion network based on an analysis of the source code; however, the SDSoC environment also provides pragmas for you to guide the data motion network generation, as described in the following sections.
System Port
A system port connects a data mover to the PS. It can be an acceptance filter ID (AFI) port, which corresponds to the high-performance ports; a memory interface generator (MIG) port, which is a PL-based DDR memory controller; or a stream port on the Zynq®-7000 SoC or Zynq® UltraScale+™ MPSoC processors.
The AFI port is a non-cache-coherent port. If needed, cache coherency, such as cache flushing and cache invalidation, is maintained by software.
Whether to use the AFI port depends on the cache requirements of the transferred data, the cache attribute of the data, and the data size. If the data is allocated with sds_alloc_non_cacheable() or sds_register_dmabuf(), it is better to connect to the AFI port to avoid cache flushing and invalidation. These functions are declared in sds_lib.h and described in the environment APIs in the SDSoC Environment Programmers Guide (UG1278).
The SDSoC system compiler analyzes these memory attributes for the data transactions with the accelerator, and connects data movers to the appropriate system port.
To override the compiler decision, or in some cases where the compiler is not able to do such analysis, you can use the following pragma to specify the system port:
#pragma SDS data sys_port(arg:ip_port)
where ip_port can be either AFI or MIG. For example, the following pragma directly connects argument A to a FIFO AXI interface:
#pragma SDS data sys_port(A:fifo_S_AXI)
void foo(int* A, int* B, int* C);
The pragma can also specify a streaming interface port:
#pragma SDS data sys_port(A:stream_fifo_S_AXIS)
void foo(int* A, int* B, int* C);
Use the following sds++
system compiler
command to see the list of system ports for the platform:
sds++ -sds-pf-info <platform> -verbose
Data Mover
The data mover transfers data between the PS and accelerators and among accelerators. The SDSoC environment can generate various types of data movers based on the properties and size of the data being transferred.
- Scalar: Scalar data is always transferred by the AXI_LITE data mover.
- Array: The sds++ system compiler can generate the following data movers, depending on the memory attributes and data size of the array: AXI_DMA_SG, AXI_DMA_SIMPLE, AXI_FIFO, zero_copy (accelerator-mastered AXI4 bus), or AXI_LITE. For example, if the array is allocated using malloc(), the memory is not physically contiguous, and the SDSoC environment generates a scatter-gather DMA (AXI_DMA_SG); however, if the data size is less than 300 bytes, AXI_FIFO is generated instead because its data transfer time is less than that of AXI_DMA_SG, and it occupies much less PL resource.
- Struct or Class: The implementation of a struct depends on how the struct is passed to the hardware (passed by value, passed by reference, or as an array of structs) and the type of data mover selected. The following table shows the various implementations.
Struct Pass Method | Default (no pragma) | #pragma SDS data zero_copy(arg) | #pragma SDS data zero_copy(arg[0:SIZE]) | #pragma SDS data copy(arg) | #pragma SDS data copy(arg[0:SIZE]) |
---|---|---|---|---|---|
pass by value (struct RGB arg) | Each field is flattened and passed individually as a scalar or an array. | Not supported; results in an error. | Not supported; results in an error. | The struct is packed into a single wide scalar. | Each field is flattened and passed individually as a scalar or an array. The value of SIZE is ignored. |
pass by pointer (struct RGB *arg) or reference (struct RGB &arg) | Each field is flattened and passed individually as a scalar or an array. | The struct is packed into a single wide scalar. The data is transferred to the hardware accelerator through an AXI4 bus. | The struct is packed into a single wide scalar. The number of data values transferred to the hardware accelerator through an AXI4 bus is defined by the value of SIZE. | The struct is packed into a single wide scalar. | The struct is packed into a single wide scalar. The number of data values transferred to the hardware accelerator by the data mover is defined by the value of SIZE. |
array of structs (struct RGB arg[1024]) | Each struct element of the array is packed into a single wide scalar. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using an AXI4 bus. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using an AXI4 bus. The value of SIZE determines the number of data values transferred. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using a data mover such as AXI_DMA_SG or AXI_DMA_SIMPLE. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using a data mover such as AXI_DMA_SG or AXI_DMA_SIMPLE. The value of SIZE determines the number of data values transferred. |
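As an illustration of the pass-by-pointer row with a sized zero_copy pragma (the struct layout, function name, and element count below are assumptions for the example):
struct RGB {
    unsigned char r, g, b;
};

// Each RGB element is packed into a single wide scalar, and the accelerator
// fetches the 1024 elements itself over its AXI master interface.
#pragma SDS data zero_copy(pixels[0:1024])
void colorize(struct RGB *pixels);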
Determining which data mover to use for transferring an array depends on two attributes of the array: data size and physical memory contiguity. For example, if the memory size is 1 MB and not physically contiguous (allocated by malloc()), use AXI_DMA_SG. The following table shows the applicability of these data movers.
Data Mover | Physical Memory Contiguity | Data Size |
---|---|---|
AXI_DMA_SG | Either | > 300 bytes |
AXI_DMA_SIMPLE | Contiguous | < 32 MB |
AXI_FIFO | Non-contiguous | < 300 bytes |
Normally, the SDSoC cross-compiler analyzes the array that is transferred to the hardware accelerator for these two attributes, and selects the appropriate data mover accordingly. However, there are cases where such analysis is not possible. In those cases, the SDSoC cross-compiler issues a warning message, as shown in the following example, stating that it is unable to determine the memory attributes; you can then specify the attributes using SDS pragmas.
WARNING: [DMAnalysis 83-4492] Unable to determine the memory attributes passed to rgb_data_in of function img_process at
C:/simple_sobel/src/main_app.c:84
The following pragma specifies the memory attributes:
#pragma SDS data mem_attribute(function_argument:contiguity)
The contiguity
can be either PHYSICAL_CONTIGUOUS
or NON_PHYSICAL_CONTIGUOUS
. Use the following pragma to specify the data size:
#pragma SDS data copy(function_argument[offset:size])
The size
can be a number or an arbitrary
expression.
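For instance, the warning above could be resolved with both pragmas together. The signature below is a hypothetical sketch of the img_process function named in the warning, assuming the buffer is allocated with sds_alloc() and its length is passed in size:
#pragma SDS data mem_attribute(rgb_data_in:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(rgb_data_in[0:size])
void img_process(unsigned char *rgb_data_in, int size);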
Zero Copy Data Mover
The zero copy data mover is unique because it covers both the accelerator interface and the data mover. The syntax of this pragma is:
#pragma SDS data zero_copy(arg[offset:size])
The [offset:size]
is optional, and only needed
if the data transfer size for an array cannot be determined at compile time.
By default, the SDSoC environment assumes copy semantics for an array argument, meaning the data is explicitly copied from the PS to the accelerator through a data mover. When this zero_copy pragma is specified, the SDSoC environment generates an AXI master interface for the specified argument on the accelerator, which grabs the data from the PS as specified in the accelerator code.
To use the zero_copy pragma, the memory corresponding to the array must be physically contiguous, that is, allocated with sds_alloc().
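A minimal sketch of the allocation and call pattern (the function name and sizes are illustrative):
#include "sds_lib.h"

#pragma SDS data zero_copy(A[0:1024])
void scale(int *A);  // hardware function; reads and writes A over its AXI master

void run(void)
{
    // zero_copy requires physically contiguous memory, so use sds_alloc()
    int *A = (int *)sds_alloc(1024 * sizeof(int));
    scale(A);
    sds_free(A);
}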
Accelerator Interface
- Scalar: For a scalar argument, a register interface is generated to pass the value into and/or out of the accelerator.
- Arrays: The hardware interface on an accelerator for transferring an array can be either a RAM interface or a streaming interface, depending on how the accelerator accesses the data in the array.
The RAM interface allows the data to be accessed randomly within the accelerator; however, it requires the entire array to be transferred to the accelerator before any memory accesses can happen within the accelerator. Moreover, the use of this interface requires block RAM resources on the accelerator side to store the array.
The streaming interface, on the other hand, does not require memory to store the whole array; it allows the accelerator to pipeline the processing of array elements. For example, the accelerator can start processing a new array element while the previous ones are still being processed. However, the streaming interface requires the accelerator to access the array in a strict sequential order, and the amount of data transferred must match what the accelerator expects.
The SDSoC environment, by default, generates the RAM interface for an array; however, the SDSoC environment provides pragmas to direct it to generate the streaming interface.
- Struct or class: The implementation of a struct depends on how the struct is passed to the hardware (passed by value, passed by reference, or as an array of structs) and the type of data mover selected. The previous table shows the various implementations.
The following SDS pragma can be used to guide the interface generation for the accelerator:
#pragma SDS data access_pattern(function_argument:pattern)
where pattern can be either RANDOM or SEQUENTIAL, and function_argument can be an array argument name of the accelerator function.
If an array argument's access pattern is specified as RANDOM, a RAM interface is generated. If it is specified as SEQUENTIAL, a streaming interface is generated.
- The default access pattern for an array argument is RANDOM.
- The specified access pattern must be consistent with the behavior of the accelerator function. For SEQUENTIAL access patterns, the function must access every array element in a strict sequential order.
- This pragma applies only to arguments without the zero_copy pragma.
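For example (an illustrative signature), specifying a SEQUENTIAL access pattern for in causes a streaming interface to be generated, which the function body must honor by reading every element exactly once, in order:
#pragma SDS data access_pattern(in:SEQUENTIAL)
void sum1024(int in[1024], int *result)
{
    int acc = 0;
    for (int i = 0; i < 1024; i++) {
        acc += in[i];  // strictly sequential: every element read exactly once
    }
    *result = acc;
}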