Improving System Performance
This chapter describes underlying principles and inference rules within the SDSoC™ system compiler to assist the programmer in improving overall system performance through the following:
- Increased parallelism in the hardware function.
- Increased system parallelism and concurrency.
- Improved access to external memory from programmable logic.
- An understanding of the data motion network (default behavior and user specification).
There are many factors that affect overall system performance. A well-designed system generally balances computation and communication, so that all hardware components remain occupied doing meaningful work.
- Some applications are compute-bound; for these applications, concentrate on maximizing throughput and minimizing latency in hardware accelerators.
- Other applications might be memory-bound, in which case you might need to restructure algorithms to increase temporal and spatial locality in the hardware; for example, adding copy-loops or memcopy to pull blocks of data into hardware rather than making random array accesses to external memory, as sketched below.
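The following is a minimal sketch of this kind of restructuring; the function name, array names, and block size are illustrative assumptions, not taken from the SDSoC examples.
#include <string.h>

#define BLOCK 256  // illustrative block length; total is assumed to be a multiple of BLOCK

// Pull a contiguous block of external memory into a local buffer (mapped to
// on-chip memory after synthesis) before computing, instead of issuing random
// accesses to external memory from inside the inner loop.
void accumulate_blocks(int *ext_data, int *result, int total)
{
    int local[BLOCK];
    int sum = 0;
    for (int i = 0; i < total; i += BLOCK) {
        memcpy(local, &ext_data[i], BLOCK * sizeof(int)); // burst copy into local memory
        for (int j = 0; j < BLOCK; j++) {
            sum += local[j]; // all accesses now hit local memory
        }
    }
    *result = sum;
}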
Control over the various aspects of optimization is provided through the use of pragmas in the code. For a complete description of the available pragmas, refer to the pragma reference documentation for the SDSoC environment.
Improving Hardware Function Parallelism
This section provides a concise introduction to writing efficient code that can be cross-compiled into programmable logic.
The SDSoC environment employs the Vivado® High-Level Synthesis (HLS) tool as a programmable logic cross-compiler to transform C/C++ functions into hardware.
By applying the principles described in this section, you can dramatically increase the performance of the synthesized functions, which can lead to significant increases in overall system performance for your application.
Top-Level Hardware Function Guidelines
This section describes coding guidelines to ensure that the Vivado HLS tool hardware function has a consistent interface with object code generated by the Arm® core GNU toolchain.
Use Standard C99 Data Types for Top-Level Hardware Function Arguments
- Avoid using arrays of bool. An array of bool has a different memory layout between the GNU Arm cross compiler and the HLS tool.
- Avoid using hls::stream at the hardware function top-level interface. This data type helps the HLS tool compiler synthesize efficient logic within a hardware function but does not apply to application software.
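As a minimal illustration of these guidelines (the function name and array size are assumptions, not from the SDSoC examples), a top-level hardware function might expose fixed-width C99 types and plain arrays, keeping any hls::stream usage internal to the function body:
#include <stdint.h>

// Top-level hardware function: uses uint8_t instead of an array of bool, and
// plain array arguments instead of hls::stream at the interface.
void threshold(uint8_t in[1024], uint8_t out[1024], uint8_t level)
{
    for (int i = 0; i < 1024; i++) {
        out[i] = (in[i] > level) ? 1 : 0;
    }
}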
Omit HLS Interface Directives for Top-Level Hardware Function Arguments
Although supported, a top-level hardware function should not, in general, contain HLS interface pragmas. The sdscc/sds++ (referred to as sds++) system compiler automatically generates the appropriate HLS tool interface directives. Use the following SDS pragmas to direct the sds++ system compiler to generate the required HLS tool interface directives:
- #pragma SDS data zero_copy(): Use to generate a shared memory interface implemented as an AXI master interface in hardware.
- #pragma SDS data access_pattern(argument:SEQUENTIAL): Use to generate a streaming interface implemented as a FIFO interface in hardware.
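For example, the pragmas might be applied as follows; this is a sketch with an illustrative function signature, not code from the SDSoC samples:
// A is streamed through a FIFO interface; B is accessed through a shared
// memory (AXI master) interface. The pragmas immediately precede the
// declaration visible to the caller.
#pragma SDS data access_pattern(A:SEQUENTIAL)
#pragma SDS data zero_copy(B)
void filter(int A[1024], int B[1024]);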
If you specify the interface using #pragma HLS interface for a top-level function argument, the SDSoC environment does not generate the HLS tool interface directive for that argument; you must then ensure that the generated hardware interface is consistent with all other function argument hardware interfaces. To avoid interface inconsistencies and sds++ system compiler error messages, it is strongly recommended (though not absolutely mandatory) that you omit HLS interface pragmas.
Using Vivado Design Suite HLS Libraries
This section describes how to use the Vivado HLS tool libraries with the SDSoC environment.
The HLS tool libraries are provided as source code with the HLS tool installation in the SDSoC environment. Consequently, you can use these libraries as you would any other source code that you plan to cross-compile for programmable logic using the HLS tool. In particular, ensure that the source code conforms to the rules described in Hardware Function Argument Types, which might require you to provide a C/C++ wrapper function to ensure the functions export a software interface to your application.
In the SDSoC IDE, the synthesizable finite impulse response (FIR) example template for all basic platforms provides an example that uses an HLS tool library. You can find several additional code examples that employ the HLS tool libraries in the samples/hls_lib directory. For example, samples/hls_lib/hls_math contains an example that implements and uses a square root function.
The file my_sqrt.h contains:
#ifndef _MY_SQRT_H_
#define _MY_SQRT_H_
#ifdef __SDSVHLS__
#include "hls_math.h"
#else
// The hls_math.h file includes hdl_fpo.h which contains actual code and
// will cause linker error in the ARM compiler, hence we add the function
// prototypes here
static float sqrtf(float x);
#endif
void my_sqrt(float x, float *ret);
#endif // _MY_SQRT_H_
The file my_sqrt.cpp contains:
#include "my_sqrt.h"
void my_sqrt(float x, float *ret)
{
*ret = sqrtf(x);
}
The makefile has the commands to compile these files:
sds++ -c -hw my_sqrt -sds-pf zc702 my_sqrt.cpp
sds++ -c my_sqrt_test.cpp
sds++ my_sqrt.o my_sqrt_test.o -o my_sqrt_test.elf
Increasing System Parallelism and Concurrency
Increasing the level of concurrent execution is a standard way to increase overall system performance, and increasing the level of parallel execution is a standard way to increase concurrency. Programmable logic is well-suited to implement architectures with application-specific accelerators that run concurrently, especially communicating through flow-controlled streams that synchronize between data producers and consumers.
In the SDSoC environment, you influence the
macro-architecture parallelism at the function and data mover level, and the micro-architecture
parallelism within hardware accelerators. By understanding how the sds++/sdscc
(referred to as sds++
) compiler infers
system connectivity and data movers, you can structure application code and apply pragmas, as
needed, to control hardware connectivity between accelerators and software, data mover selection,
number of accelerator instances for a given hardware function, and task level software control.
You can control the micro-architecture parallelism, concurrency, and throughput for hardware functions within the Vivado HLS tool, or within the IP you incorporate as C-callable and linkable libraries.
At the system level, the sds++
compiler chains
together hardware functions when the data flow between them does not require transferring
arguments out of programmable logic and back to system memory.
For example, consider the code in the following figure, where mmult
and madd
functions have been selected for
hardware.
Because the intermediate array variable tmp1
is used only to pass data between the two hardware functions, the sds++
compiler chains the two functions together in hardware with a direct connection
between them.
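The figure is not reproduced here; the following sketch shows the call pattern it describes, with illustrative argument types and matrix dimension:
#define N 32  // illustrative matrix dimension

void mmult(float A[N*N], float B[N*N], float C[N*N]); // selected for hardware
void madd(float A[N*N], float B[N*N], float C[N*N]);  // selected for hardware

void compute(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];    // used only to pass data between the two hardware calls
    mmult(A, B, tmp1);  // result stays in programmable logic
    madd(tmp1, C, D);   // consumed over a direct connection; no transfer back to memory
}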
It is instructive to consider a timeline for the calls to hardware, as shown in the following figure:
The program preserves the original program semantics, but instead of the standard Arm core procedure calling sequence, each hardware function call is broken into multiple phases involving setup, execution, and cleanup, both for the data movers (DM) and the accelerators. The CPU, in turn, sets up each hardware function (that is, the underlying IP control interface) and the data transfers for the function call with non-blocking APIs, and then waits for all calls and transfers to complete.
In the example shown in the following figure, the mmult
and madd
functions run concurrently whenever
their inputs become available. The ensemble of function calls is orchestrated in the compiled
program by control code automatically generated by the sds++
system compiler according to the program, data mover, and accelerator structure.
In general, it is impossible for the sds++
system compiler to determine side effects of function calls in your application code (for
example, sds++
might have no access to source code for functions
within linked libraries), so any intermediate access of a variable occurring lexically between
hardware function calls requires the compiler to transfer data back to memory.
For example, a simple but injudicious change that uncomments the debug print statement in the "wrong place," as shown in the following figure, can result in a significantly different data transfer graph and, consequently, an entirely different generated system and application performance.
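Continuing the illustrative sketch above, the change amounts to accessing tmp1 from software between the two hardware calls:
#include <stdio.h>

void compute(float A[N*N], float B[N*N], float C[N*N], float D[N*N])
{
    float tmp1[N*N];
    mmult(A, B, tmp1);
    // Uncommenting the next line forces tmp1 back to system memory between the
    // calls, breaking the direct PL-to-PL connection:
    // printf("tmp1[0] = %f\n", tmp1[0]);
    madd(tmp1, C, D);
}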
A program can invoke a single hardware function from multiple call sites. In
this case, the sds++
system compiler behaves as follows. If any
of the function calls results in "direct connection" data flow, then the sds++
system compiler creates an instance of the hardware function that services every
similar direct connection, and an instance of the hardware function that services the remaining
calls between memory (software) and PL.
To achieve high performance in PL, one of the best methods is structuring your application code with "direct connection" data flow between hardware functions. You can create deep pipelines of accelerators connected with data streams, increasing the opportunity for concurrent execution.
There is another way in which you can increase parallelism and concurrency
using the sds++
system compiler. You can direct the system
compiler to create multiple instances of a hardware function by inserting the following pragma
immediately preceding a call to the function.
#pragma SDS resource(<id>) // <id> a non-negative integer
This pragma creates a hardware instance that is referenced by <id>
.
A simple code snippet that creates two instances of a hardware function mmult
is as follows.
{
#pragma SDS resource(1)
mmult(A, B, C); // instance 1
#pragma SDS resource(2)
mmult(D, E, F); // instance 2
}
If creating multiple instances of an accelerator is not what you want, the sds_async mechanism gives the programmer the ability to handle the "hardware threads" explicitly to achieve very high levels of parallelism and concurrency. However, like any explicit multi-threaded programming model, it requires careful attention to synchronization details to avoid non-deterministic behavior or deadlocks. For more information, refer to the SDSoC Environment Programmers Guide.
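A minimal sketch of this mechanism, using the async and wait pragmas documented in the SDSoC Environment Programmers Guide (the call and its arguments are illustrative):
{
#pragma SDS async(1)
    mmult(A, B, C);   // returns immediately; the accelerator runs in the background
    // ... unrelated CPU work can execute here, overlapped with the accelerator ...
#pragma SDS wait(1)
    // C is guaranteed to be valid only after the wait completes
}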
Data Motion Network Generation in SDSoC
This section describes:
- Components that make up the data motion network in the SDSoC environment, and also provides guidelines to help you understand the data motion network generated by the SDSoC compiler.
- Guidelines to help you guide the data motion network generation by using appropriate SDSoC pragmas.
Every transfer between the software program and a hardware function requires a data mover, which consists of a hardware component that moves the data, and an operating system-specific library function. The following table lists supported data movers and various properties for each:
SDSoC Data Mover | Vivado® IP Data Mover | Accelerator IP Port Types | Transfer Size | Contiguous Memory Only |
---|---|---|---|---|
axi_dma_simple | axi_dma | bram, ap_fifo, axis | ≤ 32 MB | Yes |
axi_dma_sg | axi_dma | bram, ap_fifo, axis | N/A | No (but recommended) |
axi_fifo | axi_fifo_mm_s | bram, ap_fifo, axis | ≤ 300 B | No |
zero_copy | accelerator IP | aximm master | N/A | Yes |
- For array arguments, the data mover inference is based on transfer size, hardware function port mapping, and function call site information. The selection of data movers is a trade-off between performance and resources, for example:
  - The axi_dma_simple data mover is the most efficient bulk transfer engine and supports up to 32 MB transfers, so it is best for transfers under that limit.
  - The axi_fifo data mover does not require as many hardware resources as the DMA, but due to its slower transfer rates, is preferred only for payloads of up to 300 bytes.
  - The axi_dma_sg (scatter-gather DMA) data mover provides slower DMA performance and consumes more hardware resources but has fewer limitations, and in the absence of any pragma directives, is often the best default data mover.
You can specify the data mover selection by inserting a pragma into program source immediately before the function declaration; for example:
#pragma SDS data data_mover(A:AXI_DMA_SIMPLE)
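In context, this is a sketch with an illustrative function signature; the pragma immediately precedes the declaration of the function whose argument it names:
// Force a simple DMA for A; B is left to the compiler's default selection.
// AXI_DMA_SIMPLE requires physically contiguous memory, so the buffer passed
// for A should be allocated with sds_alloc().
#pragma SDS data data_mover(A:AXI_DMA_SIMPLE)
void foo(int A[1024], int B[1024]);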
A #pragma SDS is always treated as a rule, not a hint, so you must ensure that its use conforms with the data mover requirements in the previous table.
The data motion network in the SDSoC environment is made up of three components:
- The memory system ports on the PS (A)
- Data movers between the PS and accelerators as well as among accelerators (B)
- The hardware interface on an accelerator (C)
The following figure illustrates these three components.
Without any SDS pragma, the SDSoC environment automatically generates the data motion network based on an analysis of the source code; however, the SDSoC environment also provides pragmas for you to guide the data motion network generation, as described in the following sections.
System Port
A system port connects a data mover to the PS. It can be an acceptance filter ID (AFI) port, which corresponds to the high-performance ports; a memory interface generator (MIG) port, which is a PL-based DDR memory controller; or a stream port on the Zynq®-7000 SoC or Zynq® UltraScale+™ MPSoC processors.
The AFI port is a non-cache-coherent port. If needed, cache coherency, such as cache flushing and cache invalidation, is maintained by software.
Whether to use the AFI port depends on the cache requirements of the transferred data, the cache attribute of the data, and the data size. If the data is allocated with sds_alloc_non_cacheable() or sds_register_dmabuf(), it is better to connect to the AFI port to avoid cache flushing and invalidation. These functions are declared in sds_lib.h and described in the environment APIs in the SDSoC Environment Programmers Guide (UG1278).
The SDSoC system compiler analyzes these memory attributes for the data transactions with the accelerator, and connects data movers to the appropriate system port.
To override the compiler decision, or in some cases where the compiler is not able to do such analysis, you can use the following pragma to specify the system port:
#pragma SDS data sys_port(arg:ip_port)
where ip_port can be either AFI or MIG. For example, the following pragma directly connects argument A to a FIFO AXI interface:
#pragma SDS data sys_port(A:fifo_S_AXI)
void foo(int* A, int* B, int* C);
The pragma can also specify a streaming interface port:
#pragma SDS data sys_port(A:stream_fifo_S_AXIS)
void foo(int* A, int* B, int* C);
Use the following sds++
system compiler
command to see the list of system ports for the platform:
sds++ -sds-pf-info <platform> -verbose
Data Mover
The data mover transfers data between the PS and accelerators and among accelerators. The SDSoC environment can generate various types of data movers based on the properties and size of the data being transferred.
- Scalar: Scalar data is always transferred by the AXI_LITE data mover.
- Array: The sds++ system compiler can generate the following data movers, depending on the memory attributes and data size of the array: AXI_DMA_SG, AXI_DMA_SIMPLE, AXI_FIFO, zero_copy (accelerator-mastered AXI4 bus), or AXI_LITE. For example, if the array is allocated using malloc(), the memory is not physically contiguous, and the SDSoC environment generates a scatter-gather DMA (AXI_DMA_SG); however, if the data size is less than 300 bytes, AXI_FIFO is generated instead because its data transfer time is less than that of AXI_DMA_SG, and it occupies much less PL resource.
- Struct or Class: The implementation of a struct depends on how the struct is passed to the hardware (passed by value, passed by reference, or as an array of structs) and the type of data mover selected. The following table shows the various implementations.
Struct Pass Method | Default (no pragma) | #pragma SDS data zero_copy(arg) | #pragma SDS data zero_copy(arg[0:SIZE]) | #pragma SDS data copy(arg) | #pragma SDS data copy(arg[0:SIZE]) |
---|---|---|---|---|---|
pass by value (struct RGB arg) | Each field is flattened and passed individually as a scalar or an array. | Not supported; results in an error. | Not supported; results in an error. | The struct is packed into a single wide scalar. | Each field is flattened and passed individually as a scalar or an array. The value of SIZE is ignored. |
pass by pointer (struct RGB *arg) or reference (struct RGB &arg) | Each field is flattened and passed individually as a scalar or an array. | The struct is packed into a single wide scalar. The data is transferred to the hardware accelerator through an AXI4 bus. | The struct is packed into a single wide scalar. The number of data values transferred to the hardware accelerator through an AXI4 bus is defined by the value of SIZE. | The struct is packed into a single wide scalar. | The struct is packed into a single wide scalar. The number of data values transferred to the hardware accelerator by the data mover is defined by the value of SIZE. |
array of structs (struct RGB arg[1024]) | Each struct element of the array is packed into a single wide scalar. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using an AXI4 bus. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using an AXI4 bus. The value of SIZE determines the number of data values transferred. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using a data mover such as AXI_DMA_SG or AXI_DMA_SIMPLE. | Each struct element of the array is packed into a single wide scalar. The data is transferred to the hardware accelerator using a data mover such as AXI_DMA_SG or AXI_DMA_SIMPLE. The value of SIZE determines the number of data values transferred. |
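As an illustration of the pass-by-pointer row with a sized zero_copy pragma (the struct layout, function name, and element count below are assumptions for the example):
struct RGB {
    unsigned char r, g, b;
};

// Each RGB element is packed into a single wide scalar, and the accelerator
// fetches the 1024 elements itself over its AXI master interface.
#pragma SDS data zero_copy(pixels[0:1024])
void colorize(struct RGB *pixels);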
Determining which data mover to use for transferring an array depends on two attributes of the array: data size and physical memory contiguity. For example, if the memory size is 1 MB and not physically contiguous (allocated by malloc()), use AXI_DMA_SG. The following table shows the applicability of these data movers.
Data Mover | Physical Memory Contiguity | Data Size |
---|---|---|
AXI_DMA_SG | Either | > 300 bytes |
AXI_DMA_SIMPLE | Contiguous | < 32 MB |
AXI_FIFO | Non-contiguous | < 300 bytes |
Normally, the SDSoC cross-compiler analyzes the array that is transferred to the hardware accelerator for these two attributes, and selects the appropriate data mover accordingly. However, there are cases where such analysis is not possible. In those cases, the SDSoC cross-compiler issues a warning message, as shown in the following example, stating that it is unable to determine the memory attributes; you can then specify the attributes using SDS pragmas.
WARNING: [DMAnalysis 83-4492] Unable to determine the memory attributes passed to rgb_data_in of function img_process at
C:/simple_sobel/src/main_app.c:84
The following pragma specifies the memory attributes:
#pragma SDS data mem_attribute(function_argument:contiguity)
The contiguity
can be either PHYSICAL_CONTIGUOUS
or NON_PHYSICAL_CONTIGUOUS
. Use the following pragma to specify the data size:
#pragma SDS data copy(function_argument[offset:size])
The size
can be a number or an arbitrary
expression.
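For instance, the warning above could be resolved with both pragmas together. The signature below is a hypothetical sketch of the img_process function named in the warning, assuming the buffer is allocated with sds_alloc() and its length is passed in size:
#pragma SDS data mem_attribute(rgb_data_in:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(rgb_data_in[0:size])
void img_process(unsigned char *rgb_data_in, int size);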
Zero Copy Data Mover
The zero copy data mover is unique because it covers both the accelerator interface and the data mover. The syntax of this pragma is:
#pragma SDS data zero_copy(arg[offset:size])
The [offset:size]
is optional, and only needed
if the data transfer size for an array cannot be determined at compile time.
By default, the SDSoC environment assumes copy semantics for an array argument, meaning the data is explicitly copied from the PS to the accelerator through a data mover. When this zero_copy pragma is specified, the SDSoC environment generates an AXI master interface for the specified argument on the accelerator, which grabs the data from the PS as specified in the accelerator code.
To use the zero_copy pragma, the memory corresponding to the array must be physically contiguous, that is, allocated with sds_alloc().
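A minimal sketch of the allocation and call pattern (the function name and sizes are illustrative):
#include "sds_lib.h"

#pragma SDS data zero_copy(A[0:1024])
void scale(int *A);  // hardware function; reads and writes A over its AXI master

void run(void)
{
    // zero_copy requires physically contiguous memory, so use sds_alloc()
    int *A = (int *)sds_alloc(1024 * sizeof(int));
    scale(A);
    sds_free(A);
}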
Accelerator Interface
- Scalar: For a scalar argument, a register interface is generated to pass the value into and/or out of the accelerator.
- Arrays: The hardware interface on an accelerator for transferring an array can be either a RAM interface or a streaming interface, depending on how the accelerator accesses the data in the array.
The RAM interface allows the data to be accessed randomly within the accelerator; however, it requires the entire array to be transferred to the accelerator before any memory accesses can happen within the accelerator. Moreover, the use of this interface requires block RAM resources on the accelerator side to store the array.
The streaming interface, on the other hand, does not require memory to store the whole array; it allows the accelerator to pipeline the processing of array elements. For example, the accelerator can start processing a new array element while the previous ones are still being processed. However, the streaming interface requires the accelerator to access the array in a strict sequential order, and the amount of data transferred must match what the accelerator expects.
The SDSoC environment, by default, generates the RAM interface for an array; however, the SDSoC environment provides pragmas to direct it to generate the streaming interface.
- Struct or class: The implementation of a struct depends on how the struct is passed to the hardware (passed by value, passed by reference, or as an array of structs) and the type of data mover selected. The previous table shows the various implementations.
The following SDS pragma can be used to guide the interface generation for the accelerator:
#pragma SDS data access_pattern(function_argument:pattern)
where pattern can be either RANDOM or SEQUENTIAL, and function_argument can be an array argument name of the accelerator function.
If an array argument's access pattern is specified as RANDOM, a RAM interface is generated. If it is specified as SEQUENTIAL, a streaming interface is generated.
- The default access pattern for an array argument is RANDOM.
- The specified access pattern must be consistent with the behavior of the accelerator function. For SEQUENTIAL access patterns, the function must access every array element in a strict sequential order.
- This pragma applies only to arguments without the zero_copy pragma.
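For example (an illustrative signature), specifying a SEQUENTIAL access pattern for in causes a streaming interface to be generated, which the function body must honor by reading every element exactly once, in order:
#pragma SDS data access_pattern(in:SEQUENTIAL)
void sum1024(int in[1024], int *result)
{
    int acc = 0;
    for (int i = 0; i < 1024; i++) {
        acc += in[i];  // strictly sequential: every element read exactly once
    }
    *result = acc;
}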