Overview
Core Overview
The Xilinx® Deep Learning Processing Unit (DPU) is a programmable engine optimized for convolutional neural networks. It is composed of a high-performance scheduler module, a hybrid computing array module, an instruction fetch unit module, and a global memory pool module. The DPU uses a specialized instruction set, which allows for the efficient implementation of many convolutional neural networks. Deployed convolutional neural networks include VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN, among others.
The DPU IP can be implemented in the programmable logic (PL) of the selected Zynq® UltraScale+™ MPSoC device with direct connections to the processing system (PS). The DPU requires instructions to implement a neural network and accessible memory locations for input images as well as temporary and output data. A program running on the application processing unit (APU) is also required to service interrupts and coordinate data transfers.
The top-level block diagram of the DPU is shown in the following figure.
- APU - Application Processing Unit
- PE - Processing Engine
- DPU - Deep Learning Processing Unit
- RAM - Random Access Memory
Navigating Content by Design Process
Xilinx® documentation is organized around a set of standard design processes to help you find relevant content for your current development task. All Versal™ ACAP design process Design Hubs can be found on the Xilinx.com website. This document covers the following design processes:
- System and Solution Planning
- Identifying the components, performance, I/O, and data transfer requirements at a system level. Includes application mapping for the solution to PS, PL, and AI Engine. Topics in this document that apply to this design process include:
- Hardware, IP, and Platform Development
- Creating the PL IP blocks for the hardware platform, creating PL kernels, functional simulation, and evaluating the Vivado® timing, resource use, and power closure. Also involves developing the hardware platform for system integration. Topics in this document that apply to this design process include:
- System Integration and Validation
- Integrating and validating the system functional performance, including timing, resource use, and power closure. Topics in this document that apply to this design process include:
Hardware Architecture
The detailed hardware architecture of the DPU is shown in the following figure. After start-up, the DPU fetches instructions from the off-chip memory to control the operation of the computing engine. The instructions are generated by the Vitis™ AI compiler, where substantial optimizations are performed.
On-chip memory is used to buffer input, intermediate, and output data to achieve high throughput and efficiency. The data is reused as much as possible to reduce the external memory bandwidth. A deep pipelined design is used for the computing engine. The processing elements (PE) take full advantage of the fine-grained building blocks such as multipliers, adders, and accumulators in Xilinx devices.
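The multiply-accumulate pattern described above can be modeled in software. The sketch below is a conceptual illustration only: it mimics, in Python, the multiplier/adder/accumulator structure a processing element builds from DSP slices. The loop order and function name are illustrative assumptions, not the DPU's actual hardware schedule.

```python
def conv2d_mac(image, kernel):
    """Direct 2D convolution (valid padding) built from explicit MAC steps."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for oy in range(oh):
        for ox in range(ow):
            acc = 0  # plays the role of a PE's accumulator register
            for ky in range(kh):
                for kx in range(kw):
                    # one multiply + one add per step, as in a DSP slice
                    acc += image[oy + ky][ox + kx] * kernel[ky][kx]
            out[oy][ox] = acc
    return out

img = [[1, 2, 3, 0],
       [4, 5, 6, 1],
       [7, 8, 9, 2],
       [0, 1, 2, 3]]
k = [[1, 0],
     [0, -1]]
print(conv2d_mac(img, k))  # → [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]
```

In hardware, the same input pixels feed many such accumulations in parallel, which is why keeping them in on-chip memory (rather than re-reading external memory for every output) saves bandwidth.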
DPU with Enhanced Usage of DSP
A DSP double data rate (DDR) technique is used to improve the performance achieved with the device. Two input clocks for the DPU are therefore needed: one for general logic and another, at twice that frequency, for the DSP slices. The difference between a DPU that does not use the DSP DDR technique and one using the enhanced DSP usage architecture is shown here.
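A back-of-the-envelope estimate shows why the doubled DSP clock matters for peak throughput. The MAC count and clock frequencies below are assumptions chosen for illustration, not guaranteed values for any particular DPU configuration.

```python
def peak_ops_per_second(macs_per_dsp_cycle, dsp_clk_hz):
    # Each MAC counts as two operations: one multiply plus one add.
    return 2 * macs_per_dsp_cycle * dsp_clk_hz

general_clk = 300e6        # general-logic clock frequency (assumed)
dsp_clk = 2 * general_clk  # DSP slices run at twice the general frequency
macs = 2048                # MACs retired per DSP clock cycle (assumed)

print(peak_ops_per_second(macs, dsp_clk) / 1e12, "TOPS")
```

Running the DSP slices at twice the general-logic frequency doubles the peak operation rate for the same number of physical DSP slices, which is the motivation for the dual-clock requirement.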
Development Tools
Two flows are supported for integrating the DPU into your project: the Vivado flow and the Vitis™ flow.
The Xilinx Vivado® Design Suite is required to integrate the DPU into your projects for the Vivado flow. Vivado Design Suite 2021.1 or later is recommended. Contact your local sales representative if the project requires an older version of Vivado.
The Vitis unified software platform 2021.1 or later is required to integrate the DPU for the Vitis flow.
Device Resources
The DPU logic resource usage is scalable across UltraScale+™ MPSoC devices. For more information on resource utilization, see DPU Configuration.
DPU Development Flow
The DPU requires a device driver which is included in the Xilinx Vitis™ AI development kit.
Free developer resources can be obtained from the Xilinx website: https://github.com/Xilinx/Vitis-AI.
The Vitis AI User Guide (UG1414) describes how to use the DPU with the Vitis AI tools. The basic development flow is shown in the following figure. First, use Vivado or Vitis to generate the bitstream. Then, download the bitstream to the target board and install the related driver. For instructions on installing the related driver and dependent libraries, see the Vitis AI User Guide (UG1414).
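Alongside the bitstream, a network must be compiled into DPU instructions with the Vitis AI compiler before it can run on the board. The command below is a sketch of that step, assuming a quantized XIR model is already available; the file names and paths are placeholders, and the full flow is documented in UG1414.

```shell
# Compile a quantized model for the DPU with the Vitis AI compiler:
#   -x  quantized input model (placeholder name)
#   -a  arch file describing the DPU configuration in your bitstream
#   -o  output directory for the compiled model
#   -n  name to give the compiled model
vai_c_xir -x quantized_model.xmodel -a arch.json -o ./compiled -n my_network
```

The resulting compiled model is what the application running on the APU loads at runtime via the DPU driver and runtime libraries.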
Example System with the DPUCZDX8G
The figure below shows an example system block diagram with the Xilinx® Zynq® UltraScale+™ MPSoC using a camera input. The DPU is integrated into the system through an AXI interconnect to perform deep learning inference tasks such as image classification, object detection, and semantic segmentation.
Vitis AI Development Kit
The Vitis AI development environment is used for AI inference on Xilinx hardware platforms. It consists of optimized IP cores, tools, libraries, models, and example designs.
As shown in the following figure, the Vitis AI development kit consists of AI Compiler, AI Quantizer, AI Optimizer, AI Profiler, AI Library, and Xilinx Runtime Library (XRT).
For more information about the Vitis AI development kit, see the Vitis AI User Guide in the Vitis AI User Documentation (UG1431).
You can download the Vitis AI development kit for free from https://github.com/Xilinx/Vitis-AI.