Compiling the Model
Vitis AI Compiler
The Vitis™ AI compiler (VAI_C) is the unified interface to a compiler family targeting the optimization of neural-network computations to a family of DPUs. Each compiler maps a network model to a highly optimized DPU instruction sequence.
After parsing the topology of the optimized and quantized input model, VAI_C constructs an internal computation graph as an intermediate representation (IR), together with corresponding control-flow and data-flow representations. It then performs multiple optimizations, for example, fusing computation nodes (such as folding batch normalization into the preceding convolution), scheduling instructions efficiently by exploiting inherent parallelism, and exploiting data reuse.
The Vitis AI Compiler generates the compiled model based on the DPU microarchitecture. Vitis AI supports several DPUs for different platforms and applications.
DPU Name | Hardware platform |
---|---|
DPUCZDX8G | Zynq® UltraScale+™ MPSoC |
DPUCAHX8H | Alveo™ U50, U280 Data Center accelerator cards |
DPUCAHX8L | Alveo U50, U280 Data Center accelerator cards |
DPUCADF8H | Alveo U200, U250 Data Center accelerator cards |
DPUCVDX8G | Versal™ ACAP VCK190 evaluation board, Versal AI Core Series |
DPUCVDX8H | Versal ACAP VCK5000 evaluation kit |
Compiling with an XIR-based Toolchain
Xilinx Intermediate Representation (XIR) is a graph-based intermediate representation of AI algorithms, designed for compilation and efficient deployment of the DPU on the FPGA platform. If you are an advanced user, you can apply whole-application acceleration to use the FPGA to its full potential by extending XIR to support customized IPs in the Vitis AI flow. XIR is the current foundation for the Vitis AI quantizer, compiler, runtime, and other tools.
XIR
XIR includes the Op, Tensor, Graph, and Subgraph libraries, which provide a clear and flexible representation of the computational graph. XIR has an in-memory format and a file format for different uses: the in-memory format is a graph object, and the file format is an xmodel. A graph object can be serialized to an xmodel, and an xmodel can be deserialized back to a graph object.
In the Op library, there is a well-defined set of operators covering the popular deep learning frameworks, e.g., TensorFlow, PyTorch, and Caffe, as well as all of the built-in DPU operators. This enhances XIR's expressive power and achieves one of its core goals: eliminating the differences between these frameworks and providing a unified representation for users and developers.
XIR also provides Python APIs named PyXIR, which enable Python users to fully access XIR in a Python environment, e.g., to co-develop and integrate Python projects with the current XIR-based tools without a large amount of work bridging the gap between different languages.
xir::Graph
Graph is the core component of XIR. It provides several significant APIs, e.g., xir::Graph::serialize, xir::Graph::deserialize, and xir::Graph::topological_sort.
The Graph is like a container: it maintains Ops as its vertices and uses the producer-consumer relation as the edges.
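For illustration, the following is a minimal C++ sketch of these Graph APIs. It assumes the XIR headers and library shipped with the Vitis AI repository; the header path, the exact return types, and the file name netname.xmodel are placeholders to adapt to your installation.

```cpp
#include <iostream>
#include <memory>
#include <xir/graph/graph.hpp>

int main() {
  // Deserialize an xmodel file into an in-memory graph object.
  std::unique_ptr<xir::Graph> graph = xir::Graph::deserialize("netname.xmodel");

  // Visit every op in topological (producer-before-consumer) order.
  for (auto* op : graph->topological_sort()) {
    std::cout << op->get_name() << " : " << op->get_type() << std::endl;
  }

  // Serialize the graph object back to an xmodel file.
  graph->serialize("netname_copy.xmodel");
  return 0;
}
```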
xir::Op
Op in XIR is the instance of the operator definition either in XIR or extended from XIR. All Op instances can only be created or added by the Graph according to the predefined built-in/extended op definition library. The Op definition mainly includes the input arguments and intrinsic attributes.
Besides the intrinsic predefined attributes, an Op instance can also carry additional extrinsic attributes by applying the xir::Op::set_attr API. Each Op instance can have only one output tensor, but it can have more than one fanout op.
xir::Tensor
Tensor is another important class in XIR. Unlike other frameworks' tensor definitions, XIR's Tensor is only a description of the data block it represents; the real data block is excluded from the Tensor.
The key attributes of a Tensor are its data type and shape.
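As a small illustration of the Op and Tensor classes, the following hedged C++ sketch attaches an extrinsic attribute to an op and inspects its single output tensor. The attribute name "my_note" is purely hypothetical, and the header paths and accessor names should be checked against the XIR release you are using.

```cpp
#include <iostream>
#include <string>
#include <xir/op/op.hpp>
#include <xir/tensor/tensor.hpp>

void inspect_op(xir::Op* op) {
  // Attach an extra (extrinsic) attribute and read it back.
  op->set_attr<std::string>("my_note", "picked_for_custom_ip");
  if (op->has_attr("my_note")) {
    std::cout << op->get_attr<std::string>("my_note") << std::endl;
  }

  // Every op owns exactly one output tensor. The tensor only describes the
  // data block (name, shape, data type); it does not hold the data itself.
  const xir::Tensor* tensor = op->get_output_tensor();
  std::cout << tensor->get_name() << " shape = [ ";
  for (auto dim : tensor->get_shape()) std::cout << dim << " ";
  std::cout << "], bit width = " << tensor->get_data_type().bit_width << std::endl;
}
```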
xir::Subgraph
XIR's Subgraph is a tree-like hierarchy that divides a set of ops into several non-overlapping sets. The Graph's entire op set can be seen as the root. Subgraphs can be nested, but they must remain non-overlapping, and a nested subgraph must be a child of the enclosing one.
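The subgraph hierarchy of a compiled xmodel can be walked in the same way. The sketch below is a hedged example: the "device" attribute key follows the convention used by the Vitis AI runtime samples (typical values are "DPU" and "CPU") and may vary between releases.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <xir/graph/graph.hpp>

int main() {
  auto graph = xir::Graph::deserialize("netname.xmodel");

  // The root subgraph owns the graph's entire op set.
  auto* root = graph->get_root_subgraph();

  // Its children partition the ops into non-overlapping sets (e.g., DPU/CPU).
  for (auto* child : root->children_topological_sort()) {
    std::string device = child->has_attr("device")
                             ? child->get_attr<std::string>("device")
                             : "unknown";
    std::cout << child->get_name() << " -> " << device << std::endl;
  }
  return 0;
}
```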
Compiling for DPU
The XIR-based compiler takes the quantized TensorFlow or Caffe model as input. First, it transforms the input model into the XIR format as the foundation for the following processes; most of the variations among different frameworks are eliminated and transferred to a unified representation in XIR. Then, it applies various optimizations to the graph and breaks the graph up into several subgraphs based on whether the operation can be executed on the DPU. Architecture-aware optimizations are applied to each subgraph, as required. For a DPU subgraph, the compiler generates the instruction stream and attaches it to the subgraph. Finally, the optimized graph, with the necessary information and instructions for VART, is serialized into a compiled xmodel file.
The XIR-based compiler supports the DPUCZDX8G series on the edge Zynq UltraScale+ MPSoC platforms, DPUCADF8H on the Alveo platform, DPUCAHX8H on the Alveo HBM platform optimized for high-throughput applications, DPUCAHX8L on the Alveo HBM platform optimized for low-latency applications, DPUCVDX8G on the Versal edge platform, and DPUCVDX8H on the Versal cloud platform. You can find the arch.json files for these platforms in /opt/vitis_ai/compiler/arch.
The steps to compile Caffe or TensorFlow models with VAI_C are the same for all of these DPUs. It is assumed that you have successfully installed the Vitis AI package, including VAI_C, and quantized your model with the Vitis AI quantizer.
Caffe
For Caffe, vai_q_caffe generates a prototxt file (deploy.prototxt) and a model file (deploy.caffemodel). Ensure that you specify the -keep_fixed_neuron option for vai_q_caffe because it is essential for the XIR-based compiler. Run the following command to get the compiled xmodel.
vai_c_caffe -p /PATH/TO/deploy.prototxt -c /PATH/TO/deploy.caffemodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname
The compiler creates three files in the OUTPUTPATH directory. netname_org.xmodel is the pre-compiled xmodel generated by the compiler. netname.xmodel is the compiled xmodel, which contains instructions and other necessary information. meta.json is for the Vitis AI runtime.
TensorFlow
For TensorFlow, vai_q_tensorflow generates two pb files; the quantize_eval_model.pb file is the input for the XIR-based compiler. The compilation command is as follows.
vai_c_tensorflow -f /PATH/TO/quantize_eval_model.pb -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname
The outputs are the same as those for Caffe.
Sometimes, the TensorFlow model does not contain input tensor shape information, which might cause the compilation to fail. You can specify the input tensor shape with an extra option such as --options '{"input_shape": "1,224,224,3"}'.
TensorFlow 2.x
For TensorFlow 2.x, the quantizer generates the quantized model in the hdf5 format.
vai_c_tensorflow2 -m /PATH/TO/quantized.h5 -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname
Currently, vai_c_tensorflow2 only supports Keras functional APIs.
PyTorch
For PyTorch, the quantizer NNDCT outputs the quantized model in the XIR format directly. Use vai_c_xir to compile it.
vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname
Compiling for Customized Accelerator
The XIR-based compiler works on a framework-independent XIR graph generated from deep learning frameworks. The parser removes the framework-specific attributes in the CNN models and transforms the models into XIR-based computing graphs. The compiler divides the computing graph into different subgraphs, leverages heterogeneous optimizations, and generates corresponding optimized machine code for the subgraphs.
When the model contains operations that the DPU cannot support, the corresponding subgraphs are created and mapped to the CPU. Because the FPGA fabric is programmable, you can create a dedicated IP to accelerate those operations and improve end-to-end performance. To enable customized accelerator IPs in the XIR-based toolchain, use the pipeline named plugin to extend XIR and the compiler.
In Plugin.hpp, the interface class Plugin is declared. Plugins are executed sequentially before the compiler starts to compile the graph for the DPU. First, a child subgraph is created for each operator, and each plugin picks the operators that it can accelerate, merges them into larger subgraphs, maps them to the customized IP, and attaches the necessary information for the runtime (VART::Runner), such as instructions, to the subgraphs.
Implementing a Plugin
- Implement Plugin::partition(). In std::set<xir::Subgraph*> partition(xir::Graph* graph), pick the desired operations and merge them into device-level subgraphs using the following helper functions:
  - xir::Subgraph* filter_by_name(xir::Graph* graph, const std::string& name) returns the subgraph with a specific name.
  - std::set<xir::Subgraph*> filter_by_type(xir::Graph* graph, const std::string& type) returns subgraphs with a specific type.
  - std::set<xir::Subgraph*> filter_by_template(xir::Graph* graph, xir::GraphTemplate* temp) returns subgraphs with a specific structure (Figure 4: Filter by Templates).
  - std::set<xir::Subgraph*> filter(xir::Graph* graph, std::function<std::set<xir::Subgraph*>(std::set<xir::Subgraph*>)> func) allows you to filter the subgraphs with a customized function. This method helps you find all uncompiled subgraphs.
  To merge the child subgraphs, use the merge_subgraph() helper function. This function can only merge subgraphs at the same level; if the subgraph list cannot be merged into one subgraph, the helper function merges them as far as possible.
- Specify the name, device, and runner for the subgraphs you picked in the Plugin::partition() function.
- Implement Plugin::compile(xir::Subgraph*). This function is called for every subgraph returned by the partition() function. You can attach information to the subgraphs for the runtime.
A minimal sketch that combines these steps is shown after this list.
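The following is a minimal, hedged sketch of how these steps might fit together. The header name, the base-class signatures, the merge_subgraph() signature, whether the helper functions are free functions or class members, and the attribute keys used to record the device and runner are all assumptions to verify against Plugin.hpp and the plugin samples linked below; the choice of "softmax" as the accelerated operator is purely illustrative.

```cpp
// Hypothetical plugin sketch; verify all signatures against Plugin.hpp.
#include <set>
#include <string>
#include <xir/graph/graph.hpp>
#include "Plugin.hpp"

class MyPlugin : public Plugin {
 public:
  std::set<xir::Subgraph*> partition(xir::Graph* graph) override {
    // Pick every child subgraph whose op type is "softmax" (illustrative only).
    auto picked = filter_by_type(graph, "softmax");

    // Merge the picked subgraphs where possible (same-level subgraphs only).
    // The return value is assumed to be the merged set.
    auto merged = merge_subgraph(picked);

    // Record the device and runner for the runtime (attribute keys assumed).
    for (auto* sg : merged) {
      sg->set_attr<std::string>("device", "MY_ACCEL");
      sg->set_attr<std::string>("runner", "libmy_accel_runner.so");
    }
    return merged;
  }

  void compile(xir::Subgraph* subgraph) override {
    // Attach whatever the custom runner needs at run time, e.g., instructions.
    subgraph->set_attr<std::string>("my_accel_instructions", "...");
  }
};

// Entry point loaded by the compiler; match the return type declared in Plugin.hpp.
extern "C" Plugin* get_plugin() { return new MyPlugin(); }
```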
Building the Plugin
Create an extern get_plugin() function and build the implementations into a shared library.
extern "C" plugin* get_plugin() { return new YOURPLUGIN(); }
Using the Plugin
Use --options '{"plugin": "libplugin0.so,libplugin1.so"}' on the vai_c command line to pass your plugin libraries to the compiler. When executing your plugins, the compiler opens each library and creates an instance of your plugin by loading the extern function named 'get_plugin'. If more than one plugin is specified, they are executed sequentially in the order defined by the command-line option. Compilation for the DPU and CPU is executed after all the plugins have run.
Samples
Check https://github.com/Xilinx/Vitis-AI/tree/master/tools/Vitis-AI-Runtime/VART/plugin-samples for samples.
Supported Operators and DPU Limitations
Xilinx is continuously improving the DPU IP and the compiler to support more operators with better performance. The following table lists some typical operations and the configurations such as kernel size, stride, etc. that the DPU can support. If the operation configurations exceed these limitations, the operator will be assigned to the CPU. Additionally, the operators that the DPU can support are dependent on the DPU types, ISA versions, and configurations.
You can configure the DPUs to suit your requirements. You can choose engines, adjust intrinsic parameters, and create your own DPU IP with the TRD projects, but this means that the limitations can differ considerably between configurations. Either consult the following product guides for information on configuration, or compile the model with your own DPU configuration; the compiler tells you which operators can be assigned to the CPU. The table shows a specific configuration of each DPU architecture.
- DPUCZDX8G for Zynq UltraScale+ MPSoCs Product Guide (PG338)
- DPUCAHX8L for Convolutional Neural Networks Product Guide (PG366)
- DPUCAHX8H for Convolutional Neural Network Product Guide (PG367)
- DPUCVDX8G for Versal ACAPs Product Guide (PG389)
The following operators are primitively defined in different deep learning frameworks. The compiler can automatically parse these operators, transform them into the XIR format, and distribute them to DPU or CPU. These operators are partially supported by the tools, and they are listed here for your reference.
Currently Supported Operators
Typical Operation Type in CNN | Parameters | DPUCZDX8G_ISA0_B4096_MAX_BG2 (ZCU102, ZCU104) | DPUCAHX8L_ISA0 (U50, U50LV, U280) | DPUCVDX8G_ISA1_C32B3 (VCK190) | DPUCAHX8H_ISA2 (U50, U50LV9E, U50LV10E, U280) | DPUCADF8H_ISA0 (U200, U250) | DPUCVDX8H_ISA1_F2W2 (VCK5000) |
---|---|---|---|---|---|---|---|
Intrinsic Parameter | channel_parallel: 16 bank_depth: 2048 |
channel_parallel: 32 bank_depth: 4096 |
channel_parallel: 16 bank_depth: 16384 |
channel_parallel: 16 bank_depth: 2048 |
channel_parallel: 16 bank_depth: 8192 |
channel_parallel: 64 bank_depth: 256 |
|
conv2d | Kernel size | w, h: [1, 16] | w, h: [1, 16] | w, h: [1, 16] w * h <= 64 |
w, h: [1, 16] | w, h: [1, 16] | w, h: [1, 16] |
Strides | w, h: [1, 8] | w, h: [1, 4] | w, h: [1, 8] | w, h: [1, 4] | w, h: [1, 8] | w, h: [1, 4] | |
Dilation | dilation * input_channel <= 256 * channel_parallel | ||||||
Paddings | pad_left, pad_right: [0, (kernel_w - 1) * dilation_w] | ||||||
pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h] | |||||||
In Size | kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth | ||||||
Out Size | output_channel <= 256 * channel_parallel | ||||||
Activation | ReLU, LeakyReLU, ReLU6 | ReLU, ReLU6 | ReLU, LeakyReLU, ReLU6, Hard-Swish, Hard-Sigmoid | ReLU, LeakyReLU, ReLU6 | ReLU, LeakyReLU | ReLU, LeakyReLU | |
Group* (Caffe) | group==1 | ||||||
depthwise-conv2d | Kernel size | w, h: [1, 16] | w, h: [3] | w, h: [1, 256] | Not supported | ||
Strides | w, h: [1, 8] | w, h: [1, 2] | w, h: [1, 8] | ||||
dilation | dilation * input_channel <= 256 * channel_parallel | ||||||
Paddings | pad_left, pad_right: [0, (kernel_w - 1) * dilation_w] | pad_left, pad_right: [0, 15 * dilation_w] | |||||
pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h] | pad_top, pad_bottom: [0, 15 * dilation_h] | ||||||
In Size | kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth | ||||||
Out Size | output_channel <= 256 * channel_parallel | ||||||
Activation | ReLU, ReLU6 | ReLU, ReLU6 | ReLU, ReLU6 | ||||
Group* (Caffe) | group==input_channel | ||||||
transposed-conv2d | Kernel size | kernel_w/stride_w, kernel_h/stride_h: [1, 16] | |||||
Strides | |||||||
Paddings | pad_left, pad_right: [1, kernel_w-1] | ||||||
pad_top, pad_bottom: [1, kernel_h-1] | |||||||
Out Size | output_channel <= 256 * channel_parallel | ||||||
Activation | ReLU, LeakyReLU, ReLU6 | ReLU, ReLU6 | ReLU, LeakyReLU, ReLU6, Hard-Swish, Hard-Sigmoid | ReLU, LeakyReLU, ReLU6 | ReLU, LeakyReLU | ReLU, LeakyReLU | |
depthwise-transposed-conv2d | Kernel size | kernel_w/stride_w, kernel_h/stride_h: [1, 16] | kernel_w/stride_w, kernel_h/stride_h: [3] | kernel_w/stride_w, kernel_h/stride_h: [1, 256] | Not supported | ||
Strides | |||||||
Paddings | pad_left, pad_right: [1, kernel_w-1] | pad_left, pad_right: [1, 15] | |||||
pad_top, pad_bottom: [1, kernel_h-1] | pad_top, pad_bottom: [1, 15] | ||||||
Out Size | output_channel <= 256 * channel_parallel | ||||||
Activation | ReLU, ReLU6 | ReLU, ReLU6 | ReLU, ReLU6 | ||||
max-pooling | Kernel size | w, h: [2, 8] | w, h: {2, 3, 5, 7, 8} | w, h: [1, 256] | w, h: [1, 8] | w, h: [1, 16] | w, h: {1, 2, 3, 7} |
Strides | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | |
Paddings | pad_left, pad_right: [1, kernel_w-1] | pad_left, pad_right: [1, 15] | pad_left, pad_right: [1, kernel_w-1] | ||||
pad_top, pad_bottom: [1, kernel_h-1] | pad_top, pad_bottom: [1, 15] | pad_top, pad_bottom: [1, kernel_h-1] | |||||
Activation | ReLU | not supported | ReLU, ReLU6 | not supported | ReLU | not supported | |
average-pooling | Kernel size | w, h: [2, 8] w==h |
w, h: {2, 3, 5, 7, 8} w==h |
w, h: [1, 256] | w, h: [1, 8] w==h |
w, h: [1, 16] | w, h: {1, 2, 3, 7} w==h |
Strides | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | w, h: [1, 8] | |
Paddings | pad_left, pad_right: [1, kernel_w-1] | pad_left, pad_right: [1, 15] | pad_left, pad_right: [1, kernel_w-1] | ||||
pad_top, pad_bottom: [1, kernel_h-1] | pad_top, pad_bottom: [1, 15] | pad_top, pad_bottom: [1, kernel_h-1] | |||||
Activation | ReLU | not supported | ReLU, ReLU6 | not supported | ReLU | not supported | |
eltwise | type | sum | sum | sum, prod | sum | sum | sum |
Input Channel | input_channel <= 256 * channel_parallel | ||||||
Activation | ReLU | ReLU | ReLU | ReLU | ReLU | ReLU | |
concat | Network-specific limitation, which relates to the size of feature maps, quantization results and compiler optimizations. | ||||||
reorg | Strides | reverse==false :
stride ^ 2 * input_channel <= 256 * channel_parallel reverse==true : input_channel <= 256 * channel_parallel |
|||||
pad | In Size | input_channel <= 256 * channel_parallel | |||||
Mode | "SYMMETRIC" ("CONSTANT" pad(value=0) would be fused into adjacent operators during compiler optimization process) | ||||||
global pooling | Global pooling will be processed as general pooling with a kernel size equal to the input tensor size. | | | | | |
InnerProduct, Fully Connected, Matmul | These ops will be transformed into a conv2d op. |
Operators Supported by TensorFlow
TensorFlow | XIR | DPU Implementations | ||
---|---|---|---|---|
OP type | Attributes | OP name | Attributes | |
placeholder / inputlayer* | shape | data | shape | Allocate memory for input data. |
data_type | ||||
const | | const | data, shape, data_type | Allocate memory for const data. |
conv2d | filter | conv2d | kernel | Convolution Engine |
strides | stride | |||
pad([0, 0, 0, 0]) | ||||
padding | pad_mode(SAME or VALID) | |||
dilations | dilation | |||
depthwiseconv2dnative | filter | depthwise-conv2d | kernel | Depthwise-Convolution Engine |
strides | stride | |||
explicit_paddings | padding | |||
padding | pad_mode(SAME or VALID) | |||
dilations | dilation | |||
conv2dbackpropinput / conv2dtranspose* | filter | transposed-conv2d | kernel | Convolution Engine |
strides | stride | |||
padding([0, 0, 0, 0]) | ||||
padding | pad_mode(SAME or VALID) | |||
dilations | dilation | |||
spacetobatchnd + conv2d + batchtospacend | block_shape | conv2d | dilation | Spacetobatch, Conv2d, and Batchtospace would be mapped to the Convolution Engine when the specific requirements are met. |
padding | ||||
filter | kernel | |||
strides | stride | |||
padding | pad_mode(SAME) | |||
dilations | dilations | |||
block_shape | ||||
crops | ||||
matmul / dense* | transpose_a | conv2d / matmul | transpose_a | The matmul would be transformed to a conv2d operation once the equivalent conv2d meets the hardware requirements and can be mapped to DPU. |
transpose_b | transpose_b | |||
maxpool / maxpooling2d* | ksize | maxpool | kernel | Pooling Engine |
strides | stride | |||
pad([0, 0, 0, 0]) | ||||
padding | pad_mode(SAME or VALID) | |||
avgpool / averagepooling2d* / globalaveragepooling2d* | ksize | avgpool | kernel | Pooling Engine |
strides | stride | |||
pad([0, 0, 0, 0]) | ||||
padding | pad_mode(SAME or VALID) | |||
count_include_pad (false) | ||||
count_include_invalid (true) | ||||
mean | axis | avgpool / reduction_mean | axis | Mean operation would be transformed to avgpool if the equivalent avgpool meets the hardware requirements and can be mapped to DPU. |
keep_dims | keep_dims | |||
relu | relu | Activations would be fused to adjacent operations such as convolution, add, etc. | ||
relu6 | relu6 | |||
leakyrelu | alpha | leakyrelu | alpha | |
fixneuron / quantizelayer* | bit_width | fix | bit_width | It would be divided into float2fix and fix2float during compilation; then, the float2fix and fix2float operations would be fused with adjacent operations into coarse-grained operations. |
quantize_pos | fix_point | |||
if_signed | ||||
round_mode | ||||
identity | identity | Identity would be removed. | ||
add, addv2 | add | If the add is an element-wise add, it would be mapped to the DPU Element-wise Add Engine; if the add is a channel-wise add, the compiler searches for opportunities to fuse the add with adjacent operations such as convolutions. | |
concatv2 / concatenate* | axis | concat | axis | We reduce the overhead resulting from the concat by special reading or writing strategies and allocating the on-chip memory carefully. |
pad / zeropadding2d* | paddings | pad | paddings | "CONSTANT" padding would be fused into adjacent operations. "SYMMETRIC" padding would be mapped to DPU instructions. "REFLECT" padding is not supported by DPU yet. |
mode | mode | |||
shape | shape | The shape operation would be removed. | ||
stridedslice | begin | stridedslice | begin | If they are shape-related operations, they would be removed during compilation. If they are components of a coarse-grained operation, they would be fused with adjacent operations. Otherwise, they would be compiled into CPU implementations. |
end | end | |||
strides | strides | |||
pack | axis | stack | axis | |
neg | neg | |||
mul | mul | |||
realdiv | div | |||
sub | sub | |||
prod | axis | reduction_product | axis | |
keep_dims | keep_dims | |||
sum | axis | reduction_sum | axis | |
keep_dims | keep_dims | |||
max | axis | reduction_max | axis | |
keep_dims | keep_dims | |||
resizebilinear | size/scale | resize | size | If the mode of the resize is 'BILINEAR', align_corner=false, half_pixel_centers = false, size = 2, 4, 8; align_corner=false, half_pixel_centers = true, size = 2, 4 can be transformed to DPU implementations (pad+depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size is an integer, the resize would be mapped to DPU implementations. |
align_corners | align_corners | |||
half_pixel_centers | half_pixel_centers | |||
mode="BILINEAR" | ||||
resizenearestneighbor | size/scale | resize | size | |
align_corners | align_corners | |||
half_pixel_centers | half_pixel_centers | |||
mode="NEAREST" | ||||
upsample2d | size/scale | resize | size | |
align_corners | ||||
half_pixel_centers | ||||
interpolation | mode | |||
reshape | shape | reshape | shape | They would be transformed to the reshape operation in some cases. Otherwise they would be mapped to CPU. |
transpose | perm | transpose | order | |
squeeze | axis | squeeze | axis | |
exp | exp | They would only be compiled into CPU implementations. | ||
softmax | axis | softmax | axis | |
sigmoid | sigmoid | |||
square+ rsqrt+ maximum | l2_normalize | axis | output = x / sqrt(max(sum(x ^ 2), epsilon)) would be fused into a l2_normalize in XIR. | |
epsilon | ||||
|
Operators Supported by Caffe
Caffe | XIR | DPU Implementation | ||
---|---|---|---|---|
OP name | Attributes | OP name | Attributes | |
input | shape | data | shape | Allocate memory for input data. |
data_type | ||||
convolution | kernel_size | conv2d (group = 1) / depthwise-conv2d (group = input channel) | kernel | If group == input channel, the convolution would be compiled into the Depthwise-Convolution Engine. If group == 1, the convolution would be mapped to the Convolution Engine. Otherwise, it would be mapped to the CPU. |
stride | stride | |||
pad | pad | |||
pad_mode (FLOOR) | ||||
dilation | dilation | |||
bias_term | ||||
num_output | ||||
group | ||||
deconvolution | kernel_size | transposed-conv2d (group = 1) / depthwise-transposed-conv2d (group = input channel) | kernel | If group == input channel, the deconvolution would be compiled into the Depthwise-Convolution Engine. If group == 1, the deconvolution would be mapped to the Convolution Engine. Otherwise, it would be mapped to the CPU. |
stride | stride | |||
pad | pad | |||
pad_mode (FLOOR) | ||||
dilation | dilation | |||
bias_term | ||||
num_output | ||||
group | ||||
innerproduct | bias_term | conv2d / matmul | transpose_a | The inner-product would be transformed to matmul, then the matmul would be transformed to conv2d and compiled to Convolution Engine. If the inner-product fails to be transformed, it would be implemented by CPU. |
num_output | transpose_b | |||
scale | bias_term | depthwise-conv2d / scale | The scale would be transformed to a depthwise-convolution; otherwise, it would be mapped to the CPU. | |
pooling | kernel_size | maxpool2d (pool_method = 0) / avgpool2d (pool_method = 1) | kernel_size | Pooling Engine |
stride | stride | |||
global_pooling | global | |||
pad | pad | |||
pool_method | pad_mode(CEIL) | |||
count_include_pad (true) | ||||
count_include_invalid (false) | ||||
eltwise | coeff = 1 | add | Element-wise Add Engine | |
operation = SUM | ||||
concat | axis | concat | axis | We reduce the overhead resulting from the concat by special reading or writing strategies and allocate the on-chip memory carefully. |
relu | negative_slope | relu / leakyrelu | alpha | Activations would be fused to adjacent operations such as convolution, add, etc. |
relu6 | relu6 | |||
fixneuron | bit_width | fix | bit_width | It would be divided into float2fix and fix2float during compilation; then, the float2fix and fix2float operations would be fused with adjacent operations into coarse-grained operations. |
quantize_pos | fix_point | |||
if_signed | ||||
round_mode | ||||
reshape | shape | reshape | shape | These are shape-related operations; they would be removed or transformed into reshape in most cases, which does not affect the on-chip data layout. Otherwise, they would be compiled to the CPU. |
permute | order | reshape / transpose | order | |
flatten | axis | reshape / flatten | start_axis | |
end_axis | end_axis | |||
reorg | strides | reorg | strides | If the reorg meets the hardware requirements, it would be mapped to DPU implementations. |
reverse | reverse | |||
deephiresize | scale | resize | size | If the mode of the resize is 'BILINEAR', align_corner=false, half_pixel_centers = false, size = 2, 4, 8; align_corner=false, half_pixel_centers = true, size = 2, 4 can be transformed to DPU implementations (pad+depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size is an integer, the resize would be mapped to DPU implementations. |
mode | mode | |||
align_corners=false | ||||
half_pixel_centers=false | ||||
gstiling | strides | gstiling | stride | If the strides of gstiling are integers, it may be mapped into special DPU read/write instructions. |
reverse | reverse | |||
slice | axis | strided_slice | begin | They would only be compiled into CPU implementations. |
slice_point | end | |||
strides | ||||
priorbox | min_sizes | priorbox | min_sizes | |
max_sizes | max_sizes | |||
aspect_ratio | aspect_ratio | |||
flip | flip | |||
clip | clip | |||
variance | variance | |||
step | step | |||
offset | offset | |||
softmax | axis | softmax | axis |
Operators Supported by PyTorch
PyTorch | XIR | DPU Implementation | ||
---|---|---|---|---|
API | Attributes | OP name | Attributes | |
Parameter | data | const | data | Allocate memory for input data. |
shape | ||||
data_type | ||||
Conv2d | in_channels | conv2d (groups = 1) / depthwise-conv2d (groups = input channel) | If groups == input channel, the convolution would be compiled into Depthwise-Convolution Engine. If groups == 1, the convolution would be mapped to Convolution Engine. Otherwise, it would be mapped to the CPU. | |
out_channels | ||||
kernel_size | kernel | |||
stride | stride | |||
padding | pad | |||
padding_mode('zeros') | pad_mode (FLOOR) | |||
groups | ||||
dilation | dilation | |||
ConvTranspose2d | in_channels | transposed-conv2d (groups = 1) / depthwise-transposed-conv2d (groups = input channel) | If groups == input channel, the convolution would be compiled into Depthwise-Convolution Engine. If groups == 1, the convolution would be mapped to Convolution Engine. Otherwise, it would be mapped to the CPU. | |
out_channels | ||||
kernel_size | kernel | |||
stride | stride | |||
padding | pad | |||
padding_mode('zeros') | pad_mode (FLOOR) | |||
groups | ||||
dilation | dilation | |||
matmul | conv2d / matmul | transpose_a | The matmul would be transformed to conv2d and compiled to Convolution Engine. If the matmul fails to be transformed, it would be implemented by CPU. | |
transpose_b | ||||
MaxPool2d / AdaptiveMaxPool2d | kernel_size | maxpool2d | kernel | Pooling Engine |
stride | stride | |||
padding | pad | |||
ceil_mode | pad_mode | |||
output_size (adaptive) | global | |||
AvgPool2d / AdaptiveAvgPool2d | kernel_size | avgpool2d | kernel | Pooling Engine |
stride | stride | |||
padding | pad | |||
ceil_mode | pad_mode | |||
count_include_pad | count_include_pad | |||
count_include_invalid (true) | ||||
output_size (adaptive) | global | |||
ReLU | relu | Activations would be fused to adjacent operations such as convolution, add, etc. | ||
LeakyReLU | negative_slope | leakyrelu | alpha | |
ReLU6 | relu6 | |||
Hardtanh | min_val = 0 | |||
max_val = 6 | ||||
ConstantPad2d / ZeroPad2d | padding | pad | paddings | "CONSTANT" padding would be fused adjacent operations. |
value = 0 | mode ("CONSTANT") | |||
add | add | If the add is an element-wise add, it would be mapped to the DPU Element-wise Add Engine. If the add is a channel-wise add, the compiler searches for opportunities to fuse the add with adjacent operations such as convolutions. If they are shape-related operations, they would be removed during compilation. If they are components of a coarse-grained operation, they would be fused with adjacent operations. Otherwise, they would be compiled into CPU implementations. | |
sub / rsub | sub | |||
mul | mul | |||
max | dim | reduction_max | axis | |
keepdim | keep_dims | |||
mean | dim | reduction_mean | axis | |
keepdim | keep_dims | |||
interpolate / upsample / upsample_bilinear / upsample_nearest | size | resize | size | If the mode of the resize is 'BILINEAR', align_corner=false, half_pixel_centers = false, size = 2, 4, 8; align_corner=false, half_pixel_centers = true, size = 2, 4 can be transformed to DPU implementations (pad+depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size are integers, the resize would be mapped to DPU implementations. |
scale_factor | ||||
mode | mode | |||
align_corners | align_corners | |||
half_pixel_centers = !align_corners | ||||
transpose | dim0 | transpose | order | These operations would be transformed to the reshape operation in some cases. Additionally, the compiler searches for opportunities to fuse the dimension transformation operations into the special load/save instructions of adjacent operations to reduce overhead. Otherwise, they would be mapped to the CPU. |
dim1 | ||||
permute | dims | |||
view | size | reshape | shape | |
flatten | start_dim | reshape / flatten | start_axis | |
end_dim | end_axis | |||
squeeze | dim | reshape / squeeze | axis | |
cat | dim | concat | axis | Reduce the overhead resulting from the concat by special reading or writing strategies and allocating the on-chip memory carefully. |
aten::slice* | dim | strided_slice | If the strided_slice is shape-related or is the component of a coarse-grained operation, it would be removed. Otherwise, the strided_slice would be compiled into CPU implementations. | |
start | begin | |||
end | end | |||
step | strides | |||
BatchNorm2d | eps | depthwise-conv2d / batchnorm | epsilon | If the batch_norm is quantized and can be transformed to a depthwise-conv2d equivalently, it would be transformed to depthwise-conv2d and the compiler would search for compilation opportunities to map the batch_norm into DPU implementations. Otherwise, the batch_norm would be executed by CPU. |
axis | ||||
moving_mean | ||||
moving_var | ||||
gamma | ||||
beta | ||||
softmax | dim | softmax | axis | They would only be compiled into CPU implementations. |
Tanh | tanh | |||
Sigmoid | sigmoid | |||
|
VAI_C Usage
The Vitis AI compilers for the supported frameworks are vai_c_caffe, vai_c_tensorflow, vai_c_tensorflow2, and vai_c_xir, and they are used across cloud-to-edge DPUs. The common options for VAI_C are illustrated in the following table.
Parameters | Description |
---|---|
--arch | The DPU architecture configuration file for the VAI_C compiler, in JSON format. For pre-built DPU xclbins in Vitis AI releases, you can find the corresponding arch.json file in the Vitis AI docker (/opt/vitis_ai/compiler/arch). The contents should look like {"target": "DPUCZDX8G_ISA0_B4096_MAX_BG2"}. For customized DPU IPs, the corresponding arch.json files are generated by the DPU TRD along with the DPU IPs. The contents should look like {"fingerprint": "0x1000000f7014407"}. The fingerprint is a 64-bit digital signature that identifies a DPU target. It consists of 1 byte to indicate the DPU type, 1 byte to indicate the ISA version, and 6 bytes to indicate specific configurations. The fingerprint is unique to each DPU configuration; the runtime relies on it to identify the DPU instance running on the current platform and to verify that the model was compiled for the same DPU target. "DPUCZDX8G_ISA0_B4096_MAX_BG2" is an alias for a specific fingerprint that is pre-defined in the compiler. |
--output_dir | Path of the output directory for vai_c_caffe and vai_c_tensorflow after the compilation process. |
--net_name | Name of the DPU kernel for the network model after it is compiled by VAI_C. |
--options | The list of extra options in the format of 'key':'value'. If multiple options are specified, they are separated by ','. Use --options '{"input_shape": "1,224,224,3"}' to specify the input shape manually. Use --options '{"plugins": "plugin0,plugin1"}' to specify plugin libraries. Use --options '{"output_ops": "op_name0,op_name1"}' to specify output ops. Note: Arguments specified with --options have the highest priority and override the values specified in other places. |