SDAccel Introduction and Overview
The SDAccel™ environment provides a framework for developing and delivering FPGA accelerated data center applications using standard programming languages. The SDAccel environment includes a familiar software development flow with an Eclipse-based integrated development environment (IDE), and an architecturally optimizing compiler that makes efficient use of FPGA resources. Developers of accelerated applications will use a familiar software programming work flow to take advantage of FPGA acceleration with little or no prior FPGA or hardware design experience. Acceleration kernel developers can use a hardware-centric approach working through the HLS compiler with standard programming languages to produce a heterogeneous application with both software and hardware components. The software component, or application, is developed in C/C++ with OpenCL™ API calls; the hardware component, or kernel, is developed in C/C++, OpenCL, or RTL. The SDAccel environment accommodates various methodologies, allowing developers to start from either the software component or the hardware component.
Xilinx® FPGAs offer many advantages over traditional CPU/GPU acceleration, including a custom architecture capable of implementing any function that can run on a processor, resulting in better performance at lower power dissipation. To realize the advantages of software acceleration on a Xilinx device, you should look to accelerate large compute-intensive portions of your application in hardware. Implementing these functions in custom hardware gives you an ideal balance between performance and power. The SDAccel environment provides tools and reports to profile the performance of your host application, and determine where the opportunities for acceleration are. The tools also provide automated runtime instrumentation of cache, memory and bus usage to track real-time performance on the hardware.
The SDAccel environment targets acceleration hardware platforms such as the Xilinx Alveo™ U200 and U250 Data Center accelerator cards. These acceleration platforms are designed for computationally intensive applications, specifically applications for live video transcoding, data analytics, and artificial intelligence (AI) applications using machine learning. There are also a number of available third-party acceleration platforms compatible with the SDAccel environment.
A growing number of FPGA-accelerated libraries are available through the SDAccel environment, such as the Xilinx Machine Learning (ML) Suite for optimizing and deploying accelerated ML inference applications. Predefined accelerator functions target applications such as artificial intelligence, with support for many common machine learning frameworks (Caffe, MxNet, and TensorFlow), as well as video processing, encryption, and big data analysis. These predefined accelerator libraries, offered by Xilinx and third-party developers, can be quickly integrated into your accelerated application project to speed development.
Software Acceleration with SDAccel
When compared with processor architectures, the structures that comprise the programmable logic (PL) fabric in a Xilinx FPGA enable a high degree of parallelism in application execution. The custom processing architecture generated by SDAccel for a kernel presents a different execution paradigm from CPU execution, and provides opportunity for significant performance gains. While you can re-target an existing application for acceleration on an FPGA, understanding the FPGA architecture and revising your host and kernel code appropriately will significantly improve performance. Refer to the SDAccel Environment Programmers Guide for more information on writing your host and kernel code, and managing data transfers between them.
CPUs have fixed resources and offer limited opportunities for parallelization of tasks or operations. A processor, regardless of its type, executes a program as a sequence of instructions generated by processor compiler tools, which transform an algorithm expressed in C/C++ into assembly language constructs that are native to the target processor. Even a simple operation, like the addition of two values, results in multiple assembly instructions that must be executed across multiple clock cycles. This is why software engineers spend so much time restructuring their algorithms to increase the cache hit rate and decrease the processor cycles used per instruction.
On the other hand, the FPGA is an inherently parallel processing device capable of implementing any function that can run on a processor. Xilinx FPGAs have an abundance of resources that can be programmed and configured to implement any custom architecture and achieve virtually any level of parallelism. Unlike a processor, where all computations share the same ALU, operations in an FPGA are distributed and executed across a configurable array of processing resources. The FPGA compiler creates a unique circuit optimized for each application or algorithm. The FPGA programming fabric acts as a blank canvas to define and implement your acceleration functions.
The SDAccel compiler exercises the capabilities of the FPGA fabric through the processes of scheduling, pipelining, and dataflow.
- Scheduling: The process of identifying the data and control dependencies between different operations to determine when each will execute. The compiler analyzes dependencies between adjacent operations as well as across time, and groups operations to execute in the same clock cycle when possible, or to overlap function calls as permitted by the dataflow dependencies.
- Pipelining: A technique to increase instruction-level parallelism in the hardware implementation of an algorithm by overlapping independent stages of operations or functions. The data dependence in the original software implementation is preserved for functional equivalence, but the required circuit is divided into a chain of independent stages. All stages in the chain run in parallel on the same clock cycle. Pipelining is a fine-grained optimization that removes the CPU restriction requiring the current function call or operation to complete fully before the next can begin.
- Dataflow: Enables multiple functions implemented in the FPGA to execute in a parallel and pipelined manner instead of sequentially, implementing task-level parallelism. The compiler extracts this level of parallelism by evaluating the interactions between different functions of a program based on their inputs and outputs. In terms of software execution, this transformation applies to parallel execution of functions within a single kernel.
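The following kernel sketch is a hypothetical example, not taken from the SDAccel documentation, showing how pipelining and dataflow are typically expressed in C/C++ kernel code with Vivado HLS pragmas: each loop is pipelined, and the read, compute, and write functions run as concurrent tasks connected by streams. The function and variable names are illustrative, and the interface pragmas that map the pointer arguments to memory interfaces are omitted for brevity.

```cpp
#include "hls_stream.h"

static void read_in(const int *in, hls::stream<int> &s, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1   // one new memory read starts every clock cycle
        s.write(in[i]);
    }
}

static void compute(hls::stream<int> &s_in, hls::stream<int> &s_out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1   // the multiply-add is scheduled as a pipelined loop
        s_out.write(s_in.read() * 2 + 1);
    }
}

static void write_out(hls::stream<int> &s, int *out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1   // one result is written per cycle once the pipeline fills
        out[i] = s.read();
    }
}

extern "C" void example_kernel(const int *in, int *out, int n) {
#pragma HLS DATAFLOW        // read_in, compute, and write_out execute as parallel tasks
    hls::stream<int> s1, s2;
    read_in(in, s1, n);
    compute(s1, s2, n);
    write_out(s2, out, n);
}
```

The DATAFLOW pragma overlaps the three functions at the task level, while the PIPELINE pragmas overlap loop iterations within each function.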
Another advantage of a Xilinx FPGA is its ability to be dynamically reconfigured. Just as loading a compiled program re-purposes a processor, reconfiguring the FPGA during runtime can re-purpose its resources to implement additional kernels as the accelerated application runs. This allows a single SDAccel accelerator board to provide acceleration for multiple functions within an application, either sequentially or concurrently.
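In the host application, this runtime reconfiguration is typically driven by loading a different FPGA binary (.xclbin) through the standard OpenCL clCreateProgramWithBinary call, with XRT handling the device configuration. The sketch below is a minimal, hedged illustration; the load_xclbin function name and the file handling are assumptions, not part of the SDAccel API.

```cpp
#include <CL/cl.h>
#include <fstream>
#include <vector>

// Load a compiled FPGA binary from disk and create an OpenCL program from it.
// Creating a program from a new .xclbin re-purposes the FPGA for a different
// set of kernels while the application keeps running.
cl_program load_xclbin(cl_context ctx, cl_device_id dev, const char *path) {
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::streamsize size = f.tellg();
    std::vector<unsigned char> bin(static_cast<size_t>(size));
    f.seekg(0);
    f.read(reinterpret_cast<char *>(bin.data()), size);

    const unsigned char *ptr = bin.data();
    size_t len = bin.size();
    cl_int err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &len, &ptr, nullptr, &err);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);  // required before creating kernels
    return prog;
}
```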
Execution Model of an SDAccel Application
The SDAccel environment is designed to provide a simplified development experience for FPGA-based software acceleration platforms. The general structure of the acceleration platform is shown in the following figure.
The custom application runs on the host x86 server and uses OpenCL API calls to interact with the FPGA accelerators; the Xilinx runtime (XRT) manages those interactions. The application is written in C/C++ using the OpenCL APIs, while the custom kernels run within the Xilinx FPGA. Communication between the host x86 machine and the accelerator board occurs across the PCIe bus.
The SDAccel hardware platform contains global memory banks. The data transfer between the host machine and kernels, in either direction, occurs through these global memory banks. The kernels running on the FPGA can have one or more memory interfaces. The connection from the memory banks to those memory interfaces is programmable and determined by linking options of the compiler.
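As a hedged illustration (the kernel and bundle names are hypothetical), the sketch below shows a kernel exposing two memory interfaces through separate m_axi bundles; which global memory bank each bundle connects to is then selected by the linking options.

```cpp
// Each m_axi bundle becomes a separate memory interface (port) on the kernel.
extern "C" void vadd(const int *a, const int *b, int *c, int n) {
#pragma HLS INTERFACE m_axi     port=a offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=b offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi     port=c offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=a      bundle=control
#pragma HLS INTERFACE s_axilite port=b      bundle=control
#pragma HLS INTERFACE s_axilite port=c      bundle=control
#pragma HLS INTERFACE s_axilite port=n      bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    // a and c share one interface (gmem0) while b uses a second one (gmem1),
    // so reads of a and b can be served by different global memory banks.
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        c[i] = a[i] + b[i];
    }
}
```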
The SDAccel execution model follows these steps:
- The host application writes the data needed by a kernel into the global memory of the attached device through the PCIe interface.
- The host application programs the kernel with its input parameters.
- The host application triggers the execution of the kernel function on the FPGA.
- The kernel performs the required computation while reading and writing data from global memory, as necessary.
- The kernel writes data back to the memory banks and notifies the host that it has completed its task.
- The host application reads data back from global memory into the host memory space, and continues processing as needed.
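The steps above map onto standard OpenCL API calls in the host application. The following is a minimal, hedged sketch: the context, command queue, and kernel are assumed to have been created already (the program itself comes from the .xclbin), and the kernel name, argument list, and buffer sizes are illustrative, matching the vadd kernel sketched earlier.

```cpp
#include <CL/cl.h>
#include <vector>

void run_vadd(cl_context ctx, cl_command_queue q, cl_kernel krnl,
              const std::vector<int> &a, const std::vector<int> &b,
              std::vector<int> &c) {
    size_t bytes = a.size() * sizeof(int);
    cl_int err;

    // Write the data needed by the kernel into global memory on the device.
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, nullptr, &err);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, nullptr, &err);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);
    clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, bytes, a.data(), 0, nullptr, nullptr);
    clEnqueueWriteBuffer(q, d_b, CL_TRUE, 0, bytes, b.data(), 0, nullptr, nullptr);

    // Program the kernel with its input parameters.
    int n = static_cast<int>(a.size());
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(krnl, 2, sizeof(cl_mem), &d_c);
    clSetKernelArg(krnl, 3, sizeof(int),    &n);

    // Trigger execution of the kernel function on the FPGA. The kernel reads
    // and writes global memory while it runs and signals completion to XRT.
    clEnqueueTask(q, krnl, 0, nullptr, nullptr);

    // Read the results back from global memory into host memory.
    clEnqueueReadBuffer(q, d_c, CL_TRUE, 0, bytes, c.data(), 0, nullptr, nullptr);
    clFinish(q);

    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
}
```

Blocking calls are used here for simplicity; production host code typically uses non-blocking calls and events, as discussed under Best Practices later in this document.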
The FPGA can accommodate multiple kernel instances at one time; these can be different kernels or multiple instances of the same kernel. The XRT transparently orchestrates the communication between the host application and the kernels in the accelerator. The number of instances of a kernel is determined by compilation options.
SDAccel Build Process
The SDAccel environment offers all of the features of a standard software development environment:
- Optimized compiler for host applications
- Cross-compilers for the FPGA
- Robust debugging environment to help identify and resolve issues in the code
- Performance profilers to identify bottlenecks and optimize the code
Within this environment, the build process uses a standard compilation and linking process for both the software and hardware elements of the project. As shown in the following figure, the host application is built through one process using the standard GCC compiler, and the FPGA binary is built through a separate process using the Xilinx xocc compiler.
- Host application build process using GCC:
- Each host application source file is compiled to an object file (.o).
- The object files (.o) are linked with the Xilinx SDAccel runtime shared library to create the executable (.exe).
- The FPGA build process is highlighted in the following figure:
- Each kernel is independently compiled to a Xilinx object (.xo) file.
- C/C++ and OpenCL C kernels are compiled for implementation on an FPGA using the xocc compiler. This step leverages the Vivado® HLS compiler. Pragmas and attributes supported by Vivado HLS can be used in C/C++ and OpenCL C kernel source code to specify the desired kernel micro-architecture and control the result of the compilation process.
- RTL kernels are compiled using the package_xo utility. The RTL Kernel Wizard in the SDAccel environment can be used to simplify this process.
- The kernel .xo files are linked with the hardware platform (shell) to create the FPGA binary (.xclbin). Important architectural aspects are determined during the link step. In particular, this is where connections from kernel ports to global memory banks are established and where the number of instances for each kernel is specified.
- When the build target is software or hardware emulation, as described below, xocc generates simulation models of the device contents.
- When the build target is the system (actual hardware), xocc generates the FPGA binary for the device leveraging the Vivado Design Suite to run synthesis and implementation.
Build Targets
The SDAccel tool build process generates the host application executable (.exe) and the FPGA binary (.xclbin). The SDAccel build target defines the nature of the FPGA binary generated by the build process.
The SDAccel tool provides three different build targets. The two emulation targets are used for debug and validation purposes, while the default hardware target is used to generate the actual FPGA binary:
- Software Emulation (sw_emu): Both the host application code and the kernel code are compiled to run on the x86 processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system.
- Hardware Emulation (hw_emu): The kernel code is compiled into a hardware model (RTL), which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and for getting initial performance estimates.
- System (hw): The kernel code is compiled into a hardware model (RTL) and is then implemented on the FPGA, resulting in a binary that will run on the actual FPGA device.
SDAccel Design Methodology
The SDAccel environment supports two primary use cases:
- Software-Centric Design: This approach focuses on improving the performance of an application written by software programmers, by accelerating the compute-intensive functions or bottlenecks identified while profiling the application.
- Hardware-Centric Design: The acceleration kernel developer creates an optimized kernel that can be called as a library element by the application developer. Kernel languages are not specific to the methodology; a software-centric flow can also use C/C++, OpenCL, or RTL for the kernel. The main differences between the two approaches are the starting point (the software application or the kernels) and the emphasis that comes with it.
The two use cases can be combined, allowing teams of software and hardware developers to define accelerator kernels and develop applications that use them. This combined methodology involves different components of the application, developed by different people, potentially from different companies. You can leverage predefined kernel libraries available for use in your accelerated application, or develop all the acceleration functions within your own team.
Software-Centric Design
The software-centric approach to accelerated application development, or acceleration kernel development, uses code written as a standard software program, with some attention to the specific architecture of the code. For more information see the SDAccel Environment Profiling and Optimization Guide. The software development flow typically uses the following steps:
- Profile the application: Baseline the application in terms of functionality and performance, and isolate the functions to be accelerated in hardware. Functions that consume the most execution time are good candidates to be offloaded and accelerated on FPGAs.
- Code the desired kernel(s): Convert functions to OpenCL C or C/C++ kernels without any optimization (a small illustration follows this list). The application code calling these kernels will also need to be converted to use the OpenCL APIs for data movement and task scheduling.
- Verify functionality, iterate as needed: Run software emulation to ensure functional correctness. Run hardware emulation to generate host and kernel profiling data, including:
- Estimated FPGA resource usage (non-RTL)
- Overall application performance
- Visual timeline showing host calls and kernel start/stop times
- Optimize for performance, iterate as needed: Use the various compilation reports and profiling data generated during hardware emulation and system runs to guide your optimization effort. Common optimization objectives include:
- Optimize data movement from the host to/from global memory, and data movement from global memory to/from the kernel.
- Maximize parallelism across software requests.
- Maximize parallelism across multiple kernels.
- Maximize task and instruction level parallelism within kernels.
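As a small, hedged illustration of the kernel-coding step above (the function names are hypothetical), an existing C function can be turned into an initial, unoptimized C++ kernel simply by giving it C linkage so that the xocc compiler can build it; interface pragmas and performance optimizations are added in later iterations.

```cpp
// Original software function, called directly by the application on the CPU.
void scale_cpu(const float *in, float *out, float factor, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}

// The same function recast as an unoptimized FPGA kernel. The host application
// now invokes it through the OpenCL APIs (buffers, kernel arguments, enqueue)
// instead of a direct function call.
extern "C" void scale_kernel(const float *in, float *out, float factor, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}
```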
Hardware-Centric Design
The hardware-centric flow focuses first on developing and optimizing the kernel(s), and typically leverages advanced FPGA design techniques. For more information, see the SDAccel Environment Profiling and Optimization Guide. The hardware-centric development flow typically uses the following steps:
- Baseline the application in terms of functionalities and performance and isolate functions to be accelerated in hardware.
- Estimate cycle budgets and performance requirements to define accelerator architecture and interfaces.
- Develop accelerator.
- Verify functionality and performance. Iterate as needed.
- Optimize timing and resource utilization. Iterate as needed.
- Import kernel into SDAccel.
- Develop sample host code to test with a dummy kernel having the same interfaces as the actual kernel.
- Verify the kernel works correctly with the host code using hardware emulation or by running on the actual hardware. Iterate as needed.
- Use the Activity Timeline, the Profile Summary, and timers in the source code to measure performance and to optimize the host code. Iterate as needed.
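For the sample-host-code step above, a hedged sketch of a dummy kernel is shown below: it exposes the same argument list and interfaces as the planned accelerator but only copies input to output, so the host code and data movement can be exercised before the real kernel is ready. The names and interface choices are assumptions for illustration.

```cpp
// Stand-in kernel with the same interfaces as the real accelerator.
extern "C" void accel_stub(const int *in, int *out, int n) {
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=in     bundle=control
#pragma HLS INTERFACE s_axilite port=out    bundle=control
#pragma HLS INTERFACE s_axilite port=n      bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    // Pass-through behavior: enough to verify host-side buffers, argument
    // setup, and data transfers end to end.
    for (int i = 0; i < n; ++i)
        out[i] = in[i];
}
```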
Best Practices for Acceleration with SDAccel
Below are some specific things to keep in mind when developing your application code and hardware function in the SDAccel environment. You can find additional information in the SDAccel Environment Profiling and Optimization Guide.
- Look to accelerate functions that have a high ratio of compute time to input and output data volume. Compute time can be greatly reduced using FPGA kernels, but data volume adds transfer latency.
- Accelerate functions that have a self-contained control structure and do not require regular synchronization with the host.
- Transfer large blocks of data from host to global device memory. One large transfer is more efficient than several smaller transfers. Run a bandwidth test to find the optimal transfer size.
- Only copy data back to host when necessary. Data written to global memory by a kernel can be directly read by another kernel. Memory resources include PLRAM (small size but fast access with lowest latency), HBM (moderate size and access speed with some latency), and DDR (large size but slow access with high latency).
- Take advantage of the multiple global memory resources to evenly distribute bandwidth across kernels.
- Maximize bandwidth usage between kernel and global memory by performing 512-bit wide bursts.
- Cache data in local memory within the kernels. Accessing local memories is much faster than accessing global memory.
- In the host application, use events and non-blocking transactions to launch multiple requests in a parallel and overlapping manner (a sketch follows this list).
- In the FPGA, use different kernels to take advantage of task-level parallelism, and use multiple compute units (CUs) to take advantage of data-level parallelism, executing multiple tasks in parallel to further increase performance.
- Within the kernels, take advantage of task-level parallelism with dataflow, and of instruction-level parallelism with loop unrolling and loop pipelining, to maximize throughput.
- Some Xilinx FPGAs contain multiple partitions called super logic regions (SLRs). Keep the kernel in the same SLR as the global memory bank that it accesses.
- Use software and hardware emulation to validate your code frequently to make sure it is functionally correct.
- Frequently review the SDAccel Guidance report as it provides clear and actionable feedback regarding deficiencies in your project.
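The following is a hedged host-code sketch of the event-driven, non-blocking pattern recommended above. The assumptions (not from this guide) are that the command queue was created with the out-of-order property (or one in-order queue per job), that each job uses its own cl_kernel object, and that the kernel takes an input buffer, an output buffer, and an element count as arguments.

```cpp
#include <CL/cl.h>

void enqueue_job(cl_command_queue q, cl_kernel krnl,
                 cl_mem d_in, cl_mem d_out, int n, size_t bytes,
                 const void *h_in, void *h_out, cl_event *done) {
    cl_event wrote, ran;

    clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_out);
    clSetKernelArg(krnl, 2, sizeof(int),    &n);

    // Non-blocking transfer of one large block of input data to global memory.
    clEnqueueWriteBuffer(q, d_in, CL_FALSE, 0, bytes, h_in, 0, nullptr, &wrote);

    // The kernel launch depends only on its own input transfer.
    clEnqueueTask(q, krnl, 1, &wrote, &ran);

    // Non-blocking read-back, dependent only on this kernel run; 'done' signals
    // completion of the whole job so several jobs can be in flight at once.
    clEnqueueReadBuffer(q, d_out, CL_FALSE, 0, bytes, h_out, 1, &ran, done);

    clReleaseEvent(wrote);
    clReleaseEvent(ran);
}
```

Several such jobs can be enqueued back to back; the host waits on the returned events (for example, with clWaitForEvents) only when the results are actually needed, allowing transfers and kernel executions for independent jobs to overlap.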