Introduction
The Zynq®-7000 SoC and the Zynq® UltraScale+™ MPSoC integrate the software programmability of an Arm®-based processor with the hardware programmability of an FPGA, enabling key analytics and hardware acceleration while integrating CPU, DSP, ASSP, and mixed-signal functionality on a single device. The SDSoC™ environment is a tool suite that includes an Eclipse-based integrated development environment (IDE) for implementing heterogeneous embedded systems using the Zynq platforms.
The SDSoC environment includes system compilers that transform C/C++ programs into complete hardware/software systems, with selected functions compiled into programmable logic to enable hardware acceleration of those functions. This guide provides software programmers with an understanding of the underlying hardware used to provide the acceleration, including factors that might limit or improve performance, and a methodology to help achieve the highest system performance.
A detailed knowledge of hardware is not required to get started with the SDSoC environment. However, an understanding of the hardware resources available on the device and how a hardware function achieves very high performance through increased parallelism helps you select the appropriate compiler optimization directives to meet your performance needs.
The methodology can be summarized as follows:
- Identify functions to be accelerated in hardware. Profiling your software can help identify compute-intensive regions.
- Optimize the hardware function code to achieve performance targets using the Vivado® High-Level Synthesis (HLS) tool coding guidelines.
- Optimize data transfers between the processing system (PS) and hardware functions in the programmable logic (PL). This step can involve restructuring data accesses to be burst-friendly and selecting appropriate data movers, as described in this document.
After providing an understanding of how hardware acceleration is achieved, this guide concludes with a number of real-world examples demonstrating the methodology for use in your own SDSoC-based applications.
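As a running illustration (hypothetical code, not taken from the SDSoC examples), a compute-intensive loop such as a multiply-accumulate over fixed-size arrays is a typical candidate for acceleration: a constant trip count and sequential array accesses follow the Vivado HLS coding guidelines and are burst-friendly for data movers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical candidate for hardware acceleration: a fixed-size
// multiply-accumulate. The constant loop bound and sequential
// (burst-friendly) accesses suit the Vivado HLS coding guidelines.
#define N 1024

void madd(const int a[N], const int b[N], int c[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        // Each iteration is independent, so the HLS tool can pipeline it.
        c[i] = c[i] + a[i] * b[i];
    }
}
```

On the CPU this is an ordinary function; when marked for hardware, the same source is compiled by the HLS tool into a pipelined datapath.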
Software Acceleration with SDSoC
When compared with processor architectures, the structures that comprise the programmable logic (PL) in a Xilinx device enable a high degree of parallelism in application execution. The custom processing architecture generated by the sds++/sdscc (referred to as sds++) system compiler for a hardware function in an accelerator presents a different execution paradigm from CPU execution, and provides an opportunity for significant performance gains. While you can re-target an existing embedded processor application for acceleration in PL, writing your application to use the source code libraries of existing hardware functions, such as the Xilinx xfOpenCV library, or modifying your code to better use the PL device architecture, yields significant performance gains and power reduction.
CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, such as the multiplication of two values, results in multiple assembly instructions that must be executed across multiple clock cycles.
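For illustration (the actual instruction sequence depends on the target ISA and compiler), even a one-line multiplication in C++ expands into several processor instructions executed over multiple clock cycles:

```cpp
#include <cassert>

// A single source-level operation...
int scale(int x, int factor) {
    return x * factor;
    // ...on a typical load/store ISA becomes a sequence such as:
    //   ldr  r0, [x]       ; load first operand from memory
    //   ldr  r1, [factor]  ; load second operand from memory
    //   mul  r2, r0, r1    ; multiply
    //   str  r2, [result]  ; store the result
    // (illustrative pseudo-assembly, not actual compiler output)
}
```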
An FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx devices have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, the FPGA programming logic acts as a blank canvas to define and implement your acceleration functions. The FPGA compiler creates a unique circuit optimized for each application or algorithm; for example, only implementing multiply and accumulate hardware for a neural net—not a whole ALU.
The sds++ system compiler, invoked with the -c option, compiles a file into a hardware IP by invoking the Vivado High-Level Synthesis (HLS) tool on the desired function definition. Before calling the HLS tool, the sds++ compiler translates #pragma SDS directives into pragmas understood by the HLS tool. The HLS tool performs hardware-oriented transformations and optimizations, including scheduling, pipelining, and dataflow operations, to increase concurrency.
The sds++ linker analyzes program dataflow involving calls into and between hardware functions, maps these calls onto a system hardware data motion network, and generates software control code (called stubs) to orchestrate accelerators and data transfers through data movers. As described in the following section, the sds++ linker performs data transfer scheduling to identify operations that can be shared, and inserts wait barrier API calls into stubs to ensure that program semantics are preserved.
Execution Model of an SDSoC Application
The execution model for an SDSoC environment application can be understood in terms of the normal execution of a C++ program running on the target CPU after the platform has booted. It is useful to understand how a C++ binary executable interfaces to hardware.
The set of declared hardware functions within a program is compiled into hardware accelerators that are accessed with the standard C runtime through calls into these functions. Each hardware function call in effect invokes the accelerator as a task, and each of the arguments to the function is transferred between the CPU and the accelerator, becoming accessible by the program after accelerator task completion. Data transfers between memory and accelerators are accomplished through data movers, such as a DMA engine, automatically inserted into the system by the sds++ system compiler, taking into account user data mover pragmas such as zero_copy.
To ensure program correctness, the system compiler intercepts each call to a hardware function, and replaces it with a call to a generated stub function that has an identical signature but with a derived name. The stub function orchestrates all data movement and accelerator operation, synchronizing software and accelerator hardware at the exit of the hardware function call. Within the stub, all accelerator and data mover control is realized through a set of send and receive APIs provided by the sds_lib library.
When program dataflow between hardware function calls involves array arguments that are not accessed after the function calls have been invoked within the program (other than by destructors or free() calls), and when the hardware accelerators can be connected using streams, the system compiler transfers data from one hardware accelerator to the next through direct hardware stream connections, rather than implementing a round trip to and from memory. This optimization can result in significant performance gains and a reduction in hardware resources.
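A hypothetical sketch of this pattern (function names invented for illustration): an intermediate buffer is written by one hardware function, read only by the next, and never touched again, so the system compiler is free to stream it directly between accelerators.

```cpp
#include <cassert>
#include <cstddef>

#define N 256

// Two functions selected for hardware acceleration (hypothetical names).
void stage1(const int in[N], int tmp[N]) {
    for (std::size_t i = 0; i < N; ++i)
        tmp[i] = in[i] * 2;
}

void stage2(const int tmp[N], int out[N]) {
    for (std::size_t i = 0; i < N; ++i)
        out[i] = tmp[i] + 1;
}

void pipeline(const int in[N], int out[N]) {
    int tmp[N];
    stage1(in, tmp);  // tmp is produced here...
    stage2(tmp, out); // ...consumed here, and not accessed afterward, so the
                      // system compiler may connect the accelerators with a
                      // direct stream instead of a round trip through memory.
}
```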
- Initialization of the sds_lib library occurs during the program constructor, before entering main().
- Within a program, every call to a hardware function is intercepted by a function call into a stub function with the same function signature (other than name) as the original function. Within the stub function, the following steps occur:
  - A synchronous accelerator task control command is sent to the hardware.
  - For each argument to the hardware function, an asynchronous data transfer request is sent to the appropriate data mover, with an associated wait() handle. A non-void return value is treated as an implicit output scalar argument.
  - A barrier wait() is issued for each transfer request. If a data transfer between accelerators is implemented as a direct hardware stream, the barrier wait() for this transfer occurs in the stub function for the last accelerator function in the chain for this argument.
- Cleanup of the sds_lib library occurs during the program destructor, upon exiting main().
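The stub behavior above can be modeled in plain C++ as a simplified, hypothetical sketch; the real stubs are generated by sds++ and use the sds_lib send/receive APIs, which are not reproduced here. Here std::async stands in for the asynchronous transfer request, and future::get() plays the role of the barrier wait().

```cpp
#include <cassert>
#include <future>
#include <vector>

// Stands in for the accelerator task body (hypothetical).
int accel_sum(const std::vector<int>& a) {
    int sum = 0;
    for (int v : a) sum += v;
    return sum;
}

// Model of a generated stub with a derived name (invented here):
// transfers are issued asynchronously with an associated wait handle,
// then a barrier wait preserves the original call's semantics.
int _p0_accel_sum_stub(const std::vector<int>& a) {
    // 1. Issue the task and input transfer asynchronously, keeping a handle.
    std::future<int> handle = std::async(std::launch::async, accel_sum, a);
    // 2. Barrier wait: block until the task completes, so the non-void
    //    return value (an implicit output scalar) is valid on return.
    return handle.get();
}
```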
Sometimes the programmer has insight into the potential concurrent execution of accelerator tasks that cannot be automatically inferred by the system compiler. In this case, the sds++ system compiler supports a #pragma SDS async(ID) that can be inserted immediately preceding a call to a hardware function. This pragma instructs the compiler to generate a stub function without any barrier wait() calls for data transfers. As a result, after issuing all data transfer requests, control returns to the program, enabling concurrent execution of the program while the accelerator is running. In this case, it is your responsibility to insert a #pragma SDS wait(ID) within the program at appropriate synchronization points; these are resolved into sds_wait(ID) API calls to correctly synchronize hardware accelerators, their implicit data movers, and the CPU.
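A typical usage pattern looks like the following sketch (vadd is a hypothetical hardware function; the SDS pragmas are meaningful only to sds++, and a standard compiler simply ignores them): CPU work is overlapped with the accelerator, and the wait pragma marks the point where the results must be valid.

```cpp
#include <cassert>
#include <cstddef>

#define N 64

// Hypothetical function selected for hardware acceleration.
void vadd(const int a[N], const int b[N], int c[N]) {
    for (std::size_t i = 0; i < N; ++i) c[i] = a[i] + b[i];
}

int run(const int a[N], const int b[N], int c[N]) {
    #pragma SDS async(1)      // stub issues transfers but does not wait
    vadd(a, b, c);

    int other_work = 21 * 2;  // CPU work that can overlap the accelerator

    #pragma SDS wait(1)       // resolved to sds_wait(1); c is valid after this
    return other_work;
}
```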
Note: An async(ID) pragma requires a matching wait(ID) pragma.

SDSoC Build Process
The SDSoC build process uses a standard compilation and linking process. Similar to g++, the sds++ system compiler invokes sub-processes to accomplish compilation and linking.

As shown in the following figure, compilation is extended beyond producing object code that runs on the CPU: it also includes compilation and linking of hardware functions into IP blocks using the Vivado High-Level Synthesis (HLS) tool, and creation of standard object files (.o) using the target CPU toolchain. System linking consists of program analysis of caller/callee relationships for all hardware functions, and the generation of an application-specific hardware/software network to implement every hardware function call. The sds++ system compiler invokes all necessary tools, including Vivado HLS (the function compiler), the Vivado Design Suite to implement the generated hardware system, and the Arm compiler and sds++ linker to create the application binaries that run on the CPU and invoke the accelerator stubs for each hardware function, outputting a complete bootable system for an SD card.
The compilation process includes the following tasks:
- Analyzing the code and running a compilation for the main application on the Arm core, as well as a separate compilation for each of the hardware accelerators.
- Compiling the application code through standard GNU Arm compilation tools with an object (.o) file produced as final output.
- Running the hardware accelerated functions through the HLS tool to start the process of custom hardware creation with an object (.o) file as output.
After compilation, the linking process includes the following tasks:
- Analyzing the data movement through the design and modifying the hardware platform to accept the accelerators.
- Implementing the hardware accelerators into the programmable logic (PL) region using the Vivado Design Suite to run synthesis and implementation, and generate the bitstream for the device.
- Updating the software images with hardware access APIs to call the hardware functions from the embedded processor application.
- Producing an integrated SD card image that can boot the board with the application in an Executable and Linkable Format (ELF) file.