Vitis AI Tools Overview
Deep-Learning Processor Unit
The deep-learning processor unit (DPU) is a programmable engine optimized for deep neural networks. It is a group of parameterizable IP cores pre-implemented on the hardware with no place and route required. It is designed to accelerate the computing workloads of deep learning inference algorithms widely adopted in various computer vision applications, such as image/video classification, semantic segmentation, and object detection/tracking. The DPU is released with the Vitis AI specialized instruction set, thus facilitating the efficient implementation of deep learning networks.
An efficient tensor-level instruction set is designed to support and accelerate various popular convolutional neural networks, such as VGG, ResNet, GoogLeNet, YOLO, SSD, and MobileNet, among others. The DPU is scalable to fit various Xilinx Zynq®-7000 devices, Zynq UltraScale+ MPSoCs, Xilinx Kria KV260, Versal cards, and Alveo boards from Edge to Cloud to meet the requirements of many diverse applications.
A configuration file, arch.json, is generated during the Vitis flow and is used by the Vitis AI compiler for model compilation. Whenever the configuration of the DPU is modified, a new arch.json is generated, and the models must be recompiled using the new arch.json file. In the DPU-TRD, the arch.json file is located at $TRD_HOME/prj/Vitis/binary_container_1/link/vivado/vpl/prj/prj.gen/sources_1/bd/xilinx_zcu102_base/ip/xilinx_zcu102_base_DPUCZDX8G_1_0/arch.json.
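arch.json is a small JSON file. As a minimal sketch (assuming a local copy and the fingerprint field that DPUCZDX8G builds typically carry to identify the exact DPU configuration), you can inspect it before recompiling:

```python
import json

# Path is a placeholder; point it at the arch.json from your build.
ARCH_JSON = "arch.json"

with open(ARCH_JSON) as f:
    arch = json.load(f)

# The "fingerprint" field identifies the DPU configuration; compiled
# models must match it, which is why a modified DPU requires recompiling.
print(arch.get("fingerprint"))
```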
Vitis AI offers a series of different DPUs for embedded devices, such as the Xilinx Zynq®-7000 SoC, Zynq® UltraScale+™ MPSoC, Kria KV260, and Versal cards, as well as for Alveo cards such as the U50, U200, U250, and U280, enabling unique differentiation and flexibility in terms of throughput, latency, scalability, and power.
DPU Naming
Vitis AI 1.2 and later releases use a new DPU naming scheme to differentiate various DPUs designed for different purposes. The old DPUv1/v2/v3 naming is deprecated.
The new DPU naming convention is shown in the following figure:
Figure: DPU Naming Example
To understand the mapping between the old DPU naming scheme and the current naming scheme, see the following table:
Example | DPU | App | HW Platform | Q Method | Q Bitwidth | Design Target | Major | Minor | Patch | DPU Name
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
DPUv1 | DPU | C | AD | X | 8 | G | 3 | 0 | 0 | DPUCADX8G-3.0.0
DPUv2 | DPU | C | ZD | X | 8 | G | 1 | 4 | 1 | DPUCZDX8G-1.4.1
DPUv3e | DPU | C | AH | X | 8 | H | 1 | 0 | 0 | DPUCAHX8H-1.0.0
DPUv3me | DPU | C | AH | X | 8 | L | 1 | 0 | 0 | DPUCAHX8L-1.0.0
DPUv3int8 | DPU | C | AD | F | 8 | H | 1 | 0 | 0 | DPUCADF8H-1.0.0
XRNN | DPU | R | AH | R | 16 | L | 1 | 0 | 0 | DPURAHR16L-1.0.0
XVDPU | DPU | C | VD | X | 8 | G | 1 | 0 | 0 | DPUCVDX8G-1.0.0
DPUv4e | DPU | C | VD | X | 8 | H | 1 | 0 | 0 | DPUCVDX8H-1.0.0
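The fields compose mechanically, so the convention is easy to check in code. The following is a small, hypothetical Python helper (not part of the Vitis AI tooling) that splits a current-scheme name into its fields:

```python
import re

# Pattern for the naming convention in the table above:
#   DPU + App + HW Platform + Q Method + Q Bitwidth + Design Target
#   followed by -Major.Minor.Patch
DPU_NAME = re.compile(
    r"DPU"
    r"(?P<app>[A-Z])"          # application: C (CNN) or R (RNN)
    r"(?P<platform>[A-Z]{2})"  # hardware platform: ZD, AD, AH, VD, ...
    r"(?P<qmethod>[A-Z])"      # quantization method: X, F, R, ...
    r"(?P<qbits>\d+)"          # quantization bitwidth: 8, 16, ...
    r"(?P<target>[A-Z])"       # design target: G, H, or L
    r"-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
)

def parse_dpu_name(name: str) -> dict:
    """Split a current-scheme DPU name into its naming fields."""
    m = DPU_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid DPU name: {name!r}")
    return m.groupdict()

print(parse_dpu_name("DPUCZDX8G-1.4.1"))
# {'app': 'C', 'platform': 'ZD', 'qmethod': 'X', 'qbits': '8',
#  'target': 'G', 'major': '1', 'minor': '4', 'patch': '1'}
```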
Zynq UltraScale+ MPSoC: DPUCZDX8G
The DPUCZDX8G IP has been optimized for the Zynq UltraScale+ MPSoC. You can integrate this IP as a block in the programmable logic (PL) of the selected Zynq UltraScale+ MPSoCs with direct connections to the processing system (PS). The DPU is user-configurable and exposes several parameters that can be specified to optimize PL resources or customize enabled features. If you want to integrate the DPU into customized AI projects or products, see the DPU-TRD at https://github.com/Xilinx/Vitis-AI/tree/master/dsa/DPU-TRD.
Alveo U50/U280 Card: DPUCAHX8H
The Xilinx DPUCAHX8H DPU is a programmable engine optimized for convolutional neural networks, mainly for high throughput applications. This unit includes a high performance scheduler module, a hybrid computing array module, an instruction fetch unit module, and a global memory pool module. The DPU uses a specialized instruction set, which allows an efficient implementation of many convolutional neural networks. Some examples of convolutional neural networks that are deployed include VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, and FPN.
The DPU IP can be implemented in the PL of the selected Alveo board. The DPU requires instructions to implement a neural network and accessible memory locations for input images as well as temporary and output data. A user-defined unit running on PL is also required to perform the necessary configuration, inject instructions, service interrupts, and coordinate data transfers.
The top-level block diagram of the DPU is shown in the following figure.
Alveo U50/U50LV/U280 Card: DPUCAHX8L
The DPUCAHX8L IP is a new general-purpose CNN accelerator that is optimized for HBM cards, such as the Alveo U50/U50LV and U280 cards, and designed for low-latency applications. It has a new low-latency DPU microarchitecture with an HBM memory subsystem supporting a MAC array of 4 TOPs to 5.3 TOPs. It supports back-to-back convolution and depthwise convolution engines to increase computing parallelism, and a hierarchical memory system (UltraRAM and HBM) to maximize data movement. With this low-latency DPU IP, the Vitis AI compiler supports the super layer interface and many new compiling strategies for kernel fusion and graph partition.
Alveo U200/U250 Card: DPUCADF8H
The DPUCADF8H is the DPU optimized for Alveo U200/U250 card and targeted for high-throughput applications. The key features of the DPUCADF8H are as follows:
- Throughput-oriented and high-efficiency computing engines: throughput is improved by 1.5x to 2.0x on different workloads
- Wide range of convolution neural network support
- Friendly to pruned convolution neural networks
- Optimized for high-resolution images
The top-level block diagram is shown in the following figure:
Versal AI Core Series: DPUCVDX8G
The DPUCVDX8G is a high-performance general CNN processing engine optimized for the Versal AI Core Series. The Versal devices can provide superior performance/watt over conventional FPGAs, CPUs, and GPUs. The DPUCVDX8G is composed of AI Engines and PL circuits. This IP is user-configurable and exposes several parameters which can be specified to optimize AI Engines and PL resources or customize features.
The top-level block diagram of DPUCVDX8G is shown in the following figure.
Versal AI Core Series: DPUCVDX8H
The DPUCVDX8H is a high-performance and high-throughput general CNN processing engine optimized for the Versal AI Core series. Besides traditional programmable logic, Versal devices integrate high-performance AI Engine arrays, high-bandwidth NoCs, DDR/LPDDR controllers, and other high-speed interfaces that can provide superior performance/watt over conventional FPGAs, CPUs, and GPUs. The DPUCVDX8H is implemented on Versal devices to leverage these benefits. You can configure the parameters to meet your data center application requirements.
The top-level block diagram of the DPUCVDX8H is shown in the following figure.
Vitis AI Model Zoo
The Vitis AI Model Zoo includes optimized deep learning models to speed up the deployment of deep learning inference on Xilinx platforms. These models cover different applications, including ADAS/AD, video surveillance, robotics, and data center. You can get started with these pre-trained models to enjoy the benefits of deep learning acceleration.
For more information, see Vitis AI Model Zoo on GitHub.
Vitis AI Optimizer
With world-leading model compression technology, you can reduce model complexity by 5x to 50x with minimal accuracy degradation. See Vitis AI Optimizer User Guide (UG1333) for information on the Vitis AI Optimizer.
The Vitis AI optimizer requires a commercial license to run. Contact your Xilinx sales representative for more information.
Vitis AI Quantizer
By converting the 32-bit floating-point weights and activations to a fixed-point format such as INT8, the Vitis AI quantizer can reduce computing complexity without losing prediction accuracy. The fixed-point network model requires less memory bandwidth, thus providing higher speed and power efficiency than the floating-point model.
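For illustration, here is a minimal sketch of the PyTorch quantizer flow (vai_q_pytorch). The torchvision model and the random tensors standing in for a calibration set are assumptions of the sketch; a real flow would calibrate with representative data:

```python
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer  # vai_q_pytorch

model = resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Calibration mode: run inputs through the quantized model so the
# quantizer can collect activation statistics.
quantizer = torch_quantizer("calib", model, (dummy_input,))
quant_model = quantizer.quant_model
with torch.no_grad():
    for _ in range(8):  # random stand-ins for a real calibration set
        quant_model(torch.randn(1, 3, 224, 224))
quantizer.export_quant_config()

# Test mode: evaluate the fixed-point model and export the .xmodel
# that the Vitis AI compiler consumes.
quantizer = torch_quantizer("test", model, (dummy_input,))
quantizer.quant_model(dummy_input)
quantizer.export_xmodel()
```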
Vitis AI Compiler
The Vitis AI compiler maps the AI model to a highly efficient instruction set and dataflow model. It also performs sophisticated optimizations, such as layer fusion and instruction scheduling, and reuses on-chip memory as much as possible.
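As a hedged sketch of the compile step, the XIR-based compiler (vai_c_xir) can be invoked on the quantized model from the quantizer sketch above; the paths and net name are placeholders, and arch.json must describe the target DPU:

```python
import subprocess

# All paths and names below are placeholders for this sketch.
subprocess.run(
    [
        "vai_c_xir",
        "-x", "quantize_result/ResNet_int.xmodel",  # quantizer output (name varies)
        "-a", "arch.json",                          # DPU configuration file
        "-o", "compiled",                           # output directory
        "-n", "resnet18",                           # name of the compiled model
    ],
    check=True,
)
```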
Vitis AI Profiler
The Vitis AI profiler profiles and visualizes AI applications to find bottlenecks and to help allocate computing resources among different devices. It is easy to use and requires no code changes. It can trace function calls and run time, and it can also collect hardware information, including CPU, DPU, and memory utilization.
Vitis AI Library
The Vitis AI Library is a set of high-level libraries and APIs built for efficient AI inference with DPUs. It fully supports XRT and is built on the Vitis AI runtime with Vitis runtime unified APIs.
The Vitis AI Library provides an easy-to-use and unified interface by encapsulating many efficient and high-quality neural networks. This simplifies the use of deep-learning neural networks, even for users without knowledge of deep-learning or FPGAs. The Vitis AI Library allows you to focus more on developing your applications rather than the underlying hardware.
Vitis AI Runtime
The Vitis AI runtime enables applications to use the unified high-level runtime API for both Cloud and Edge, making Cloud-to-Edge deployments seamless and efficient.
The Vitis AI runtime API provides the following features, illustrated in the sketch after this list:
- Asynchronous submission of jobs to the accelerator
- Asynchronous collection of jobs from the accelerator
- C++ and Python implementations
- Support for multi-threading and multi-process execution
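Below is a minimal sketch of the asynchronous VART flow in Python; the model path, the zero-filled input, and the int8 buffer data type are placeholder assumptions (check the tensors' actual dtype for a real model):

```python
import numpy as np
import vart
import xir

# Placeholder path: a model produced by the Vitis AI compiler.
graph = xir.Graph.deserialize("compiled/resnet18.xmodel")

# The compiler partitions the graph; pick the subgraph mapped to the DPU.
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = next(
    s for s in subgraphs
    if s.has_attr("device") and s.get_attr("device").upper() == "DPU"
)

runner = vart.Runner.create_runner(dpu_subgraph, "run")
in_tensor = runner.get_input_tensors()[0]
out_tensor = runner.get_output_tensors()[0]

# Buffers shaped like the model's tensors; int8 is typical for
# fixed-point models, but the dtype depends on the compiled model.
input_data = np.zeros(tuple(in_tensor.dims), dtype=np.int8)
output_data = np.zeros(tuple(out_tensor.dims), dtype=np.int8)

# Asynchronous submission and collection, per the feature list above.
job_id = runner.execute_async([input_data], [output_data])
runner.wait(job_id)
```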
The Vitis AI Runtime (VART) is the next generation runtime suitable for devices based on DPUCZDX8G, DPUCADF8H, DPUCAHX8H, DPUCVDX8G, and DPUCVDX8H.
- DPUCZDX8G is used for Edge devices, such as the ZCU102 and the ZCU104 evaluation boards, and the KV260 starter kit.
- DPUCADX8G and DPUCADF8H are used for Cloud devices, such as the Alveo U200 and U250 cards.
- DPUCAHX8H is used for Cloud devices, such as the Alveo U50, U50LV, and U280 cards.
- DPUCVDX8G is used for the Versal evaluation boards, such as the VCK190 board.
- DPUCVDX8H is used for the Versal ACAP VCK5000 board.
The framework of VART is shown in the following figure. For this Vitis AI release, VART is based on XRT. XIR is the Xilinx Intermediate Representation, the graph format exchanged between the Vitis AI compiler and the runtime.
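As a brief illustration (reusing the hypothetical compiled/resnet18.xmodel path from the runtime sketch above), an XIR graph can be inspected to see how the compiler partitioned the model across devices:

```python
import xir

# Placeholder path, as in the runtime sketch above.
graph = xir.Graph.deserialize("compiled/resnet18.xmodel")

# Each child subgraph records the device the compiler assigned to it,
# typically DPU for accelerated portions and CPU for the rest.
for sg in graph.get_root_subgraph().toposort_child_subgraph():
    device = sg.get_attr("device") if sg.has_attr("device") else "unassigned"
    print(f"{sg.get_name()}: device={device}")
```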