Compiling the Model

Vitis AI Compiler

The Vitis™ AI compiler (VAI_C) is the unified interface to a compiler family targeting the optimization of neural-network computations for a family of DPUs. Each compiler maps a network model to a highly optimized DPU instruction sequence.

A simplified description of the VAI_C framework is shown in the following figure. After parsing the topology of the optimized and quantized input model, VAI_C constructs an internal computation graph as an intermediate representation (IR), together with corresponding control flow and data flow representations. It then performs multiple optimizations, for example, computation node fusion (such as fusing batch norm into a preceding convolution), efficient instruction scheduling that exploits inherent parallelism, and exploitation of data reuse.

Figure 1: Vitis AI Compiler Framework

The Vitis AI Compiler generates the compiled model based on the DPU microarchitecture. Vitis AI supports several DPUs for different platforms and applications.

Table 1. DPUs on Different Hardware Platforms
DPU Name Hardware platform
DPUCZDX8G Zynq® UltraScale+™ MPSoC
DPUCAHX8H Alveo™ U50, U280 Data Center accelerator cards
DPUCAHX8L Alveo U50, U280 Data Center accelerator cards
DPUCADF8H Alveo U200, U250 Data Center accelerator cards
DPUCVDX8G Versal™ ACAP VCK190 evaluation board, Versal AI Core Series
DPUCVDX8H Versal ACAP VCK5000 evaluation kit

Compiling with an XIR-based Toolchain

Xilinx Intermediate Representation (XIR) is a graph-based intermediate representation of AI algorithms designed for the compilation and efficient deployment of the DPU on the FPGA platform. It is the current foundation for the Vitis AI quantizer, compiler, runtime, and other tools. If you are an advanced user, you can apply whole-application acceleration to use the FPGA to its maximum potential by extending XIR to support customized IPs in the Vitis AI flow.

XIR

XIR includes the Op, Tensor, Graph, and Subgraph libraries, which provide a clear and flexible representation of the computational graph. XIR has an in-memory format and a file format for different uses. The in-memory format of XIR is a graph object, and the file format is an xmodel. A graph object can be serialized to an xmodel, and an xmodel can be deserialized into a graph object.
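As a minimal, hedged illustration of the two formats, the following sketch creates a graph object, serializes it to an xmodel, and deserializes it back. It assumes the xir::Graph::create factory and the serialize/deserialize APIs named in the xir::Graph section below; the header path is an assumption and may differ between releases.

#include <xir/graph/graph.hpp>  // assumed header location

int main() {
  // In-memory format: a graph object.
  auto graph = xir::Graph::create("example_graph");

  // File format: an xmodel. Serialize the in-memory graph to a file ...
  graph->serialize("example.xmodel");

  // ... and deserialize it back into a new graph object.
  auto restored = xir::Graph::deserialize("example.xmodel");
  return 0;
}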

In the Op library, there is a well-defined set of operators that covers the popular deep learning frameworks, e.g., TensorFlow, PyTorch, and Caffe, as well as all of the built-in DPU operators. This enhances the expressiveness of XIR and achieves one of its core goals: eliminating the differences between these frameworks and providing a unified representation for users and developers.

XIR also provides Python APIs named PyXIR, which enable Python users to fully access XIR in a Python environment, e.g., to co-develop and integrate their Python projects with the current XIR-based tools without having to do a large amount of work to bridge the gap between different languages.

Figure 2: XIR Based Flow

xir::Graph

Graph is the core component of XIR. It provides several significant APIs, e.g., xir::Graph::serialize, xir::Graph::deserialize, and xir::Graph::topological_sort.

The Graph is like a container: it maintains Ops as its vertices and uses the producer-consumer relation as its edges.
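For example, the following hedged sketch deserializes a compiled xmodel and walks its ops in topological order; the get_name and get_type accessors on xir::Op are assumptions, as is the header path.

#include <iostream>
#include <xir/graph/graph.hpp>  // assumed header location

int main() {
  auto graph = xir::Graph::deserialize("/PATH/TO/netname.xmodel");

  // Visit every Op following the producer-consumer edges.
  for (auto* op : graph->topological_sort()) {
    std::cout << op->get_name() << " : " << op->get_type() << std::endl;
  }
  return 0;
}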

xir::Op

Op in XIR is the instance of the operator definition either in XIR or extended from XIR. All Op instances can only be created or added by the Graph according to the predefined built-in/extended op definition library. The Op definition mainly includes the input arguments and intrinsic attributes.

Besides the intrinsic predefined attributes, an Op instance can also carry additional extrinsic attributes through the xir::Op::set_attr API. Each Op instance can have only one output tensor, but more than one fanout op.
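A hedged sketch of attaching an extrinsic attribute is shown below. The templated set_attr/get_attr/has_attr form and the get_output_tensor accessor are assumptions that may differ between XIR versions, and "my_note" is a hypothetical attribute name.

#include <string>
#include <xir/op/op.hpp>  // assumed header location

// Tag an op with an extrinsic attribute and read it back.
void tag_op(xir::Op* op) {
  op->set_attr<std::string>("my_note", "attached by a custom tool");  // extrinsic attribute
  if (op->has_attr("my_note")) {
    auto note = op->get_attr<std::string>("my_note");
    (void)note;
  }
  // Each Op has exactly one output tensor, but may feed several fanout ops.
  auto* out_tensor = op->get_output_tensor();
  (void)out_tensor;
}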

xir::Tensor

Tensor is another important class in XIR. Unlike the tensor definitions in other frameworks, XIR's Tensor is only a description of the data block it represents; the real data block is excluded from the Tensor.

The key attributes of a Tensor are its data type and shape.

xir::Subgraph

XIR's Subgraph is a tree-like hierarchy that divides a set of ops into several non-overlapping sets. The Graph's entire op set can be seen as the root. Subgraphs can be nested, but they must remain non-overlapping, and nested subgraphs must be children of the enclosing subgraph.
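A hedged sketch of walking the subgraph tree of a compiled xmodel is shown below. The children_topological_sort() helper, the get_name accessor, and the "device" attribute follow conventions used in the VART samples and are assumptions here.

#include <iostream>
#include <string>
#include <xir/graph/graph.hpp>  // assumed header location

int main() {
  auto graph = xir::Graph::deserialize("/PATH/TO/netname.xmodel");

  // The root subgraph owns the graph's entire op set.
  auto* root = graph->get_root_subgraph();

  // Its children are the non-overlapping device-level partitions.
  for (auto* child : root->children_topological_sort()) {
    if (child->has_attr("device")) {
      std::cout << child->get_name() << " -> "
                << child->get_attr<std::string>("device") << std::endl;
    }
  }
  return 0;
}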

Compiling for DPU

The XIR-based compiler takes the quantized TensorFlow or Caffe model as the input. First, it transforms the input model into the XIR format as the foundation for the following processes. Most of the variations among different frameworks are eliminated and transferred to a unified representation in XIR. Then, it applies various optimizations to the graph and breaks the graph up into several subgraphs on the basis of whether the operations can be executed on the DPU. Architecture-aware optimizations are applied to each subgraph, as required. For each DPU subgraph, the compiler generates the instruction stream and attaches it to the subgraph. Finally, the optimized graph with the necessary information and instructions for VART is serialized into a compiled xmodel file.

The XIR-based compiler can support the DPUCZDX8G series on the Edge Zynq UltraScale+ MPSoC platforms, DPUCADF8H on the Alveo platform, DPUCAHX8H on the Alveo HBM platform optimized for high-throughput applications, DPUCAHX8L on the Alveo HBM platform optimized for low-latency applications, DPUCVDX8G on the Versal Edge platform, and DPUCVDX8H on the Versal Cloud platform. You can find the arch.json files for these platforms in /opt/vitis_ai/compiler/arch.
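For reference, a pre-built arch.json typically contains a single target entry (see also the --arch description in VAI_C Usage); for the ZCU102/ZCU104 DPU listed in Table 2, it looks like the following.

{"target": "DPUCZDX8G_ISA0_B4096_MAX_BG2"}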

The steps to compile Caffe or TensorFlow models with VAI_C are the same as for the previous DPUs. It is assumed that you have successfully installed the Vitis AI package, including VAI_C, and quantized your model with the Vitis AI quantizer.

Caffe

For Caffe, vai_q_caffe generates a prototxt (deploy.prototxt) and a model (deploy.caffemodel). Ensure that you specify the -keep_fixed_neuron option for vai_q_caffe because it is essential for the XIR-based compiler. Run the following command to get the compiled xmodel.

vai_c_caffe -p /PATH/TO/deploy.prototxt -c /PATH/TO/deploy.caffemodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

The compiler creates three files in the OUTPUTPATH directory: netname_org.xmodel is the pre-compiled xmodel generated by the compiler, netname.xmodel is the compiled xmodel that contains the instructions and other necessary information, and meta.json is for the Vitis AI runtime.

TensorFlow

For TensorFlow, vai_q_tensorflow generates two pb files; the quantize_eval_model.pb file is the input file for the XIR-based compiler. The compilation command is as follows.

vai_c_tensorflow -f /PATH/TO/quantize_eval_model.pb -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

The outputs are the same as the outputs for Caffe.

Sometimes, the TensorFlow model does not contain input tensor shape information, which might cause the compilation to fail. You can specify the input tensor shape with an extra option such as --options '{"input_shape": "1,224,224,3"}'.

TensorFlow 2.x

For TensorFlow 2.x, the quantizer generates the quantized model in the hdf5 format.

vai_c_tensorflow2 -m /PATH/TO/quantized.h5 -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

Currently, vai_c_tensorflow2 only supports Keras functional APIs.

PyTorch

For PyTorch, the quantizer NNDCT outputs the quantized model in the XIR format directly. Use vai_c_xir to compile it.

vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

Compiling for Customized Accelerator

The XIR-based compiler works in the context of a framework-independent XIR graph generated from deep learning frameworks. The parser removes the framework-specific attributes in the CNN models and transforms models into XIR-based computing graphs. The compiler divides the computing graph into different subgraphs, leverages heterogeneous optimizations, and generates corresponding optimized machine codes for subgraphs.

Figure 3: Compilation Flow

When the model contains operations that the DPU cannot support, some subgraphs are created and mapped to the CPU. The FPGA is flexible enough that you can create a specific IP to accelerate those operations for improved end-to-end performance. To enable customized accelerating IPs with an XIR-based toolchain, leverage a pipeline named plugin to extend the XIR and the compiler.

In Plugin.hpp, the interface class Plugin is declared. Plugins are executed sequentially before the compiler starts to compile the graph for the DPU. First, a child subgraph is created for each operator, and the plugin picks the operators it can accelerate. It merges them into larger subgraphs, maps them to the customized IP, and attaches the necessary information for runtime (VART::Runner), such as the instructions, on the subgraphs.

Implementing a Plugin

  1. Implement Plugin::partition()

    In std::set<xir::Subgraph*> partition(xir::Graph* graph), pick the desired operations and merge them into device-level subgraphs using the following helper functions.

    • xir::Subgraph* filter_by_name(xir::Graph* graph, const std::string& name) returns the subgraph with a specific name.
    • std::set<xir::Subgraph*> filter_by_type(xir::Graph* graph, const std::string& type) returns subgraphs with a specific type.
    • std::set<xir::Subgraph*> filter_by_template(xir::Graph* graph, xir::GraphTemplate* temp) returns subgraphs with a specific structure.
      Figure 4: Filter by Templates
    • std::set<xir::Subgraph*> filter(xir::Graph* graph, std::function<std::set<xir::Subgraph*>(std::set<xir::Subgraph*>)> func) allows you to filter the subgraphs with a customized function. This method helps you to find all uncompiled subgraphs.

    To merge the child subgraphs, use the merge_subgraph() helper function. However, this function can only merge subgraphs at the same level. If the subgraph list cannot be merged into one subgraph, the helper function merges them as far as possible.

  2. Specify the name, device, and runner for the subgraphs you picked in the Plugin::partition() function.
  3. Implement Plugin::compile(xir::Subgraph*). This function is called for all the subgraphs returned by the partition() function. You can attach information on subgraphs for runtime. A hedged sketch combining these steps is shown after this list.
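The following hedged sketch ties these three steps together. It assumes the Plugin interface and helper functions declared in Plugin.hpp as described above; the exact base-class method signatures, the return type of merge_subgraph(), and the attribute names used to tag the subgraphs ("device", "runner") are assumptions and may differ between Vitis AI releases.

#include <set>
#include <string>

#include "Plugin.hpp"  // declares the Plugin interface and the filter/merge helpers

// Hypothetical plugin for a custom accelerator IP.
class MyAccelPlugin : public Plugin {
 public:
  // Steps 1 and 2: pick the ops this IP can accelerate, merge them into
  // device-level subgraphs, and specify the name, device, and runner for them.
  std::set<xir::Subgraph*> partition(xir::Graph* graph) override {
    auto picked = filter_by_type(graph, "softmax");  // "softmax" is only illustrative
    auto merged = merge_subgraph(picked);            // merges subgraphs at the same level
    for (auto* sg : merged) {
      sg->set_attr<std::string>("device", "MY_IP");               // assumed attribute name
      sg->set_attr<std::string>("runner", "libmy_ip_runner.so");  // assumed attribute name
    }
    return merged;
  }

  // Step 3: called for every subgraph returned by partition(); attach the
  // information (for example, instructions) that VART::Runner needs.
  void compile(xir::Subgraph* subgraph) override {
    // subgraph->set_attr<...>(...);
  }
};

Build this class into a shared library and export it through get_plugin() as described in Building the Plugin below.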

Building the Plugin

Create an extern get_plugin() function and build the implementations into a shared library.

extern "C" plugin* get_plugin() { return new YOURPLUGIN(); }

Using the Plugin

Use --options '{"plugin": "libplugin0.so,libplugin1.so"}' in the vai_c command line option to pass your plugin library to compiler. When executing your plugin, the compiler opens the library and makes an instance of your plugin by loading your extern function named ‘get_plugin’. If more than one plugin is specified, they are executed sequentially in the order defined by the command line option. Compilation for DPU and CPU are executed after all the plugins have been implemented.

Samples

Check https://github.com/Xilinx/Vitis-AI/tree/master/tools/Vitis-AI-Runtime/VART/plugin-samples for samples.

Supported Operators and DPU Limitations

Xilinx is continuously improving the DPU IP and the compiler to support more operators with better performance. The following table lists some typical operations and the configurations such as kernel size, stride, etc. that the DPU can support. If the operation configurations exceed these limitations, the operator will be assigned to the CPU. Additionally, the operators that the DPU can support are dependent on the DPU types, ISA versions, and configurations.

You can configure the DPUs to suit your requirements. You can choose engines, adjust intrinsic parameters, and create your own DPU IP with the TRD projects, but this means that the limitations can be very different between configurations. Either use the following product guides for information on configuration or compile the model with your own DPU configuration. The compiler tells you which operators can be assigned to the CPU. The table shows a specific configuration of each DPU architecture.

  • DPUCZDX8G for Zynq UltraScale+ MPSoCs Product Guide (PG338)
  • DPUCAHX8L for Convolutional Neural Networks Product Guide (PG366)
  • DPUCAHX8H for Convolutional Neural Network Product Guide (PG367)
  • DPUCVDX8G for Versal ACAPs Product Guide (PG389)

The following operators are primitively defined in different deep learning frameworks. The compiler can automatically parse these operators, transform them into the XIR format, and distribute them to DPU or CPU. These operators are partially supported by the tools, and they are listed here for your reference.

Currently Supported Operators

Table 2. Currently Supported Operators
Typical Operation Type in CNN Parameters DPUCZDX8G_ISA0_B4096_MAX_BG2 (ZCU102, ZCU104) DPUCAHX8L_ISA0 (U50, U50LV, U280) DPUCVDX8G_ISA1_C32B3 (VCK190) DPUCAHX8H_ISA2 (U50, U50LV9E, U50LV10E, U280) DPUCADF8H_ISA0 (U200, U250) DPUCVDX8H_ISA1_F2W2 (VCK5000)
Intrinsic Parameter
DPUCZDX8G_ISA0_B4096_MAX_BG2: channel_parallel: 16, bank_depth: 2048
DPUCAHX8L_ISA0: channel_parallel: 32, bank_depth: 4096
DPUCVDX8G_ISA1_C32B3: channel_parallel: 16, bank_depth: 16384
DPUCAHX8H_ISA2: channel_parallel: 16, bank_depth: 2048
DPUCADF8H_ISA0: channel_parallel: 16, bank_depth: 8192
DPUCVDX8H_ISA1_F2W2: channel_parallel: 64, bank_depth: 256
conv2d Kernel size w, h: [1, 16] for all listed DPUs; DPUCVDX8G_ISA1_C32B3 additionally requires w * h <= 64
Strides w, h: [1, 8] w, h: [1, 4] w, h: [1, 8] w, h: [1, 4] w, h: [1, 8] w, h: [1, 4]
Dilation dilation * input_channel <= 256 * channel_parallel
Paddings pad_left, pad_right: [0, (kernel_w - 1) * dilation_w]
pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h]
In Size kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, LeakyReLU, ReLU6 ReLU, ReLU6 ReLU, LeakyReLU, ReLU6, Hard-Swish, Hard-Sigmoid ReLU, LeakyReLU, ReLU6 ReLU, LeakyReLU ReLU, LeakyReLU
Group* (Caffe) group==1
depthwise-conv2d Kernel size w, h: [1, 16] w, h: [3] w, h: [1, 256] Not supported
Strides w, h: [1, 8] w, h: [1, 2] w, h: [1, 8]
dilation dilation * input_channel <= 256 * channel_parallel
Paddings pad_left, pad_right: [0, (kernel_w - 1) * dilation_w] pad_left, pad_right: [0, 15 * dilation_w]
pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h] pad_top, pad_bottom: [0, 15 * dilation_h]
In Size kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, ReLU6 ReLU, ReLU6 ReLU, ReLU6
Group* (Caffe) group==input_channel
transposed-conv2d Kernel size kernel_w/stride_w, kernel_h/stride_h: [1, 16]
Strides
Paddings pad_left, pad_right: [1, kernel_w-1]
pad_top, pad_bottom: [1, kernel_h-1]
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, LeakyReLU, ReLU6 ReLU, ReLU6 ReLU, LeakyReLU, ReLU6, Hard-Swish, Hard-Sigmoid ReLU, LeakyReLU, ReLU6 ReLU, LeakyReLU ReLU, LeakyReLU
depthwise-transposed-conv2d Kernel size kernel_w/stride_w, kernel_h/stride_h: [1, 16] kernel_w/stride_w, kernel_h/stride_h: [3] kernel_w/stride_w, kernel_h/stride_h: [1, 256] Not supported
Strides
Paddings pad_left, pad_right: [1, kernel_w-1] pad_left, pad_right: [1, 15]
pad_top, pad_bottom: [1, kernel_h-1] pad_top, pad_bottom: [1, 15]
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, ReLU6 ReLU, ReLU6 ReLU, ReLU6
max-pooling Kernel size w, h: [2, 8] w, h: {2, 3, 5, 7, 8} w, h: [1, 256] w, h: [1, 8] w, h: [1, 16] w, h: {1, 2, 3, 7}
Strides w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8]
Paddings pad_left, pad_right: [1, kernel_w-1] pad_left, pad_right: [1, 15] pad_left, pad_right: [1, kernel_w-1]
pad_top, pad_bottom: [1, kernel_h-1] pad_top, pad_bottom: [1, 15] pad_top, pad_bottom: [1, kernel_h-1]
Activation ReLU not supported ReLU, ReLU6 not supported ReLU not supported
average-pooling Kernel size DPUCZDX8G: w, h: [2, 8], w==h; DPUCAHX8L: w, h: {2, 3, 5, 7, 8}, w==h; DPUCVDX8G: w, h: [1, 256]; DPUCAHX8H: w, h: [1, 8], w==h; DPUCADF8H: w, h: [1, 16]; DPUCVDX8H: w, h: {1, 2, 3, 7}, w==h
Strides w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8]
Paddings pad_left, pad_right: [1, kernel_w-1] pad_left, pad_right: [1, 15] pad_left, pad_right: [1, kernel_w-1]
pad_top, pad_bottom: [1, kernel_h-1] pad_top, pad_bottom: [1, 15] pad_top, pad_bottom: [1, kernel_h-1]
Activation ReLU not supported ReLU, ReLU6 not supported ReLU not supported
eltwise type sum sum sum, prod sum sum sum
Input Channel input_channel <= 256 * channel_parallel
Activation ReLU ReLU ReLU ReLU ReLU ReLU
concat Network-specific limitation, which relates to the size of feature maps, quantization results and compiler optimizations.
reorg Strides reverse==false: stride ^ 2 * input_channel <= 256 * channel_parallel; reverse==true: input_channel <= 256 * channel_parallel
pad In Size input_channel <= 256 * channel_parallel
Mode "SYMMETRIC" ("CONSTANT" pad(value=0) would be fused into adjacent operators during compiler optimization process)
global pooling Global pooling will be processed as general pooling with kernel size equal to the input tensor size.
InnerProduct, Fully Connected, Matmul These ops will be transformed into a conv2d op.
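As a worked example of the In Size constraint above: on DPUCZDX8G_ISA0_B4096_MAX_BG2 (channel_parallel: 16, bank_depth: 2048), a 3 x 3 convolution over 512 input channels needs 3 * 3 * ceil(512 / 16) = 9 * 32 = 288 bank entries, which is within the 2048-entry limit, so this constraint is satisfied. A 7 x 7 convolution over 8192 input channels would need 7 * 7 * ceil(8192 / 16) = 49 * 512 = 25088 > 2048, so it would exceed the limit and be assigned to the CPU.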

Operators Supported by TensorFlow

Table 3. Operators Supported by TensorFlow
TensorFlow XIR  DPU Implementations 
OP type Attributes OP name Attributes
placeholder / inputlayer* shape data shape Allocate memory for input data. 
data_type
const       const   data, shape, data_type Allocate memory for const data.
conv2d     filter conv2d     kernel Convolution Engine    
strides stride
  pad([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
dilations dilation
depthwiseconv2dnative     filter depthwise-conv2d     kernel Depthwise-Convolution Engine    
strides stride
explicit_paddings padding
padding pad_mode(SAME or VALID)
dilations dilation
conv2dbackpropinput / conv2dtranspose*     filter transposed-conv2d     kernel Convolution Engine    
strides stride
  padding([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
dilations dilation
spacetobatchnd + conv2d + batchtospacend        block_shape conv2d        dilation Spacetobatch, Conv2d, and Batchtospace would be mapped to the Convolution Engine when the specific requirements are met.
padding  
filter kernel
strides stride
padding pad_mode(SAME)
dilations dilations
block_shape  
crops  
matmul / dense*  transpose_a conv2d / matmul  transpose_a The matmul would be transformed to a conv2d operation once the equivalent conv2d meets the hardware requirements and can be mapped to DPU. 
transpose_b transpose_b
maxpool / maxpooling2d*    ksize maxpool    kernel Pooling Engine   
strides stride
  pad([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
avgpool / averagepooling2d* / globalavgeragepooling2d*      ksize avgpool      kernel Pooling Engine     
strides stride
  pad([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
  count_include_pad (false)
  count_include_invalid (true)
mean  axis avgpool / reduction_mean  axis Mean operation would be transformed to avgpool if the equivalent avgpool meets the hardware requirements and can be mapped to DPU. 
keep_dims keep_dims
relu   relu   Activations would be fused to adjacent operations such as convolution, add, etc.  
relu6   relu6  
leakyrelu alpha leakyrelu alpha
fixneuron / quantizelayer*    bit_width fix    bit_width It would be divided into float2fix and fix2float during compilation, then the float2fix and fix2float operations would be fused with adjacent operations into coarse-grained operations.
quantize_pos fix_point
  if_signed
  round_mode
identity   identity   Identity would be removed.
add, addv2   add   If the add is an element-wise add, the add would be mapped to the DPU Element-wise Add Engine; if the add is a channel-wise add, we search for opportunities to fuse the add with adjacent operations such as convolutions.
concatv2 / concatenate* axis concat axis We reduce the overhead resulting from the concat by special reading or writing strategies and allocating the on-chip memory carefully.
pad / zeropadding2d*  paddings pad  paddings "CONSTANT" padding would be fused into adjacent operations. "SYMMETRIC" padding would be mapped to DPU instructions. "REFLECT" padding is not supported by the DPU yet.
mode mode
shape   shape   The shape operation would be removed.
stridedslice   begin stridedslice   begin If they are shape-related operations, they would be removed during compilation. If they are components of a coarse-grained operation, they would be fused with adjacent operations. Otherwise, they would be compiled into CPU implementations.             
end end
strides strides
pack axis stack axis
neg   neg  
mul   mul  
realdiv   div  
sub   sub  
prod  axis reduction_product  axis
keep_dims keep_dims
sum  axis reduction_sum  axis
keep_dims keep_dims
max  axis reduction_max  axis
keep_dims keep_dims
resizebilinear    size/scale resize    size If the mode of the resize is 'BILINEAR', the cases align_corner=false, half_pixel_centers=false with size = 2, 4, 8, and align_corner=false, half_pixel_centers=true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size is an integer, the resize would be mapped to DPU implementations.
align_corners align_corners
half_pixel_centers half_pixel_centers
  mode="BILINEAR"
resizenearestneighbor    size/scale resize    size
align_corners align_corners
half_pixel_centers half_pixel_centers
  mode="NEAREST"
upsample2d    size/scale resize    size
  align_corners
  half_pixel_centers
interpolation mode
reshape shape reshape shape They would be transformed to the reshape operation in some cases. Otherwise they would be mapped to CPU.  
transpose perm transpose order
squeeze axis squeeze axis
exp   exp   They would only be compiled into CPU implementations.  
softmax axis softmax axis
sigmoid   sigmoid  
square + rsqrt + maximum    l2_normalize  axis output = x / sqrt(max(sum(x ^ 2), epsilon)) would be fused into an l2_normalize in XIR.
  epsilon
  1. The OPs in TensorFlow listed above are supported in XIR. All of them have CPU implementations in the toolchain.
  2. Operators marked with * are the operator names used when the TensorFlow version is greater than 2.0.

Operators Supported by Caffe

Table 4. Operators Supported by Caffe
Caffe XIR DPU Implementation 
OP name Attributes OP name Attributes
input  shape data  shape Allocate memory for input data. 
  data_type
convolution        kernel_size conv2d (group = 1) / depthwise-conv2d (group = input channel)        kernel If group == input channel, the convolution would be compiled into the Depthwise-Convolution Engine; if group == 1, the convolution would be mapped to the Convolution Engine. Otherwise, it would be mapped to the CPU.
stride stride
pad pad
  pad_mode (FLOOR)
dilation dilation
bias_term  
num_output  
group  
deconvolution        kernel_size transposed-conv2d (group = 1) / depthwise-transposed-conv2d (group = input channel)        kernel If group == input channel, the deconvolution would be compiled into the Depthwise-Convolution Engine; if group == 1, the deconvolution would be mapped to the Convolution Engine. Otherwise, it would be mapped to the CPU.
stride stride
pad pad
  pad_mode (FLOOR)
dilation dilation
bias_term  
num_output  
group  
innerproduct  bias_term conv2d / matmul  transpose_a The inner-product would be transformed to matmul, then the matmul would be transformed to conv2d and compiled to Convolution Engine. If the inner-product fails to be transformed, it would be implemented by CPU. 
num_output transpose_b
scale bias_term depthwise-conv2d / scale   The scale would be transformed to depthwise-convolution, otherwise, it would be mapped to CPU.
pooling       kernel_size maxpool2d (pool_method = 0) / avgpool2d (pool_method = 1)       kernel_size Pooling Engine      
stride stride
global_pooling global
pad pad
pool_method pad_mode(CEIL)
  count_include_pad (true)
  count_include_invalid (false)
eltwise  coeff = 1 add    Element-wise Add Engine 
operation = SUM  
concat axis concat axis We reduce the overhead resulting from the concat by special reading or writing strategies and allocate the on-chip memory carefully.
relu negative_slope relu / leakyrelu alpha Activations would be fused to adjacent operations such as convolution, add, etc. 
relu6   relu6  
fixneuron    bit_width fix    bit_width It would be divided into float2fix and fix2float during compilation, then the float2fix and fix2float operations would be fused with adjacent operations into coarse-grained operations.
quantize_pos fix_point
  if_signed
  round_mode
reshape shape reshape shape These operations are shape-related; they would be removed or transformed into reshape in most cases, which would not affect the on-chip data layout. Otherwise, they would be compiled to CPU.
permute order reshape / transpose order
flatten axis reshape / flatten start_axis
  end_axis   end_axis
reorg  strides reorg  strides If the reorg meets the hardware requirements, it would be mapped to DPU implementations. 
reverse reverse
deephiresize    scale resize    size If the mode of the resize is 'BILINEAR', the cases align_corner=false, half_pixel_centers=false with size = 2, 4, 8, and align_corner=false, half_pixel_centers=true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size is an integer, the resize would be mapped to DPU implementations.
mode mode
  align_corners=false
  half_pixel_centers=false
gstiling  strides gstiling  stride If the strides of gstiling are integers, it may be mapped into special DPU read/write instructions. 
reverse reverse
slice   axis strided_slice   begin They would only be compiled into CPU implementations.           
slice_point end
  strides
priorbox        min_sizes priorbox        min_sizes
max_sizes max_sizes
aspect_ratio aspect_ratio
flip flip
clip clip
variance variance
step step
offset offset
softmax axis softmax axis

Operators Supported by PyTorch

Table 5. Operators Supported by PyTorch
PyTorch XIR DPU Implementation 
API Attributes OP name Attributes
Parameter   data const   data Allocate memory for input data.  
shape
  data_type
Conv2d        in_channels conv2d (groups = 1) / depthwise-conv2d (groups = input channel)          If groups == input channel, the convolution would be compiled into Depthwise-Convolution Engine. If groups == 1, the convolution would be mapped to Convolution Engine. Otherwise, it would be mapped to the CPU.       
out_channels  
kernel_size kernel
stride stride
padding pad
padding_mode('zeros') pad_mode (FLOOR)
groups  
dilation dilation
ConvTranspose2d        in_channels transposed-conv2d (groups = 1) / depthwise-transposed-conv2d (groups = input channel)          If groups == input channel, the convolution would be compiled into Depthwise-Convolution Engine. If groups == 1, the convolution would be mapped to Convolution Engine. Otherwise, it would be mapped to the CPU.
out_channels  
kernel_size kernel
stride stride
padding pad
padding_mode('zeros') pad_mode (FLOOR)
groups  
dilation dilation
matmul    conv2d / matmul  transpose_a The matmul would be transformed to conv2d and compiled to Convolution Engine. If the matmul fails to be transformed, it would be implemented by CPU. 
  transpose_b
MaxPool2d / AdaptiveMaxPool2d     kernel_size maxpool2d     kernel Pooling Engine    
stride stride
padding pad
ceil_mode pad_mode
output_size (adaptive) global
AvgPool2d / AdaptiveAvgPool2d       kernel_size avgpool2d        kernel Pooling Engine      
stride stride
padding pad
ceil_mode pad_mode
count_include_pad count_include_pad
  count_include_invalid (true)
output_size (adaptive) global
ReLU   relu   Activations would be fused to adjacent operations such as convolution, add, etc.    
LeakyReLU negative_slope leakyrelu alpha
ReLU6   relu6    
Hardtanh  min_val = 0  
max_val = 6  
ConstantPad2d / ZeroPad2d  padding pad  paddings "CONSTANT" padding would be fused into adjacent operations.
value = 0 mode ("CONSTANT")
add   add   If the add is an element-wise add, the add would be mapped to DPU Element-wise Add Engine. If the add is a channel-wise add, search for opportunities to fuse the add with adjacent operations such as convolutions. If they are shape-related operations, they would be removed during compilation. If they are components of a coarse-grained operation, they would be fused with adjacent operations. Otherwise, they would be compiled into CPU implementations.      
sub / rsub   sub  
mul   mul  
max  dim reduction_max  axis
keepdim keep_dims
mean  dim reduction_mean  axis
keepdim keep_dims
interpolate / upsample / upsample_bilinear / upsample_nearest     size resize     size If the mode of the resize is 'BILINEAR', the cases align_corner=false, half_pixel_centers=false with size = 2, 4, 8, and align_corner=false, half_pixel_centers=true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the sizes are integers, the resize would be mapped to DPU implementations.
scale_factor  
mode mode
align_corners align_corners
  half_pixel_centers = !align_corners
transpose  dim0 transpose  order These operations would be transformed to the reshape operation in some cases. Additionally, the compiler searches for opportunities to fuse the dimension transformation operations into special load/save instructions of adjacent operations to reduce the overhead. Otherwise, they would be mapped to CPU.
dim1  
permute dims    
view size reshape shape
flatten  start_dim reshape / flatten  start_axis
end_dim end_axis
squeeze dim reshape / squeeze axis
cat dim concat axis Reduce the overhead resulting from the concat by special reading or writing strategies and allocating the on-chip memory carefully.
aten::slice*    dim strided_slice   If the strided_slice is shape-related or is the component of a coarse-grained operation, it would be removed. Otherwise, the strided_slice would be compiled into CPU implementations.   
start begin
end end
step strides
BatchNorm2d      eps depthwise-conv2d / batchnorm      epsilon If the batch_norm is quantized and can be transformed to a depthwise-conv2d equivalently, it would be transformed to depthwise-conv2d and the compiler would search for compilation opportunities to map the batch_norm into DPU implementations. Otherwise, the batch_norm would be executed by CPU.
  axis
  moving_mean
  moving_var
  gamma
  beta
softmax dim softmax axis They would only be compiled into CPU implementations. 
Tanh   tanh  
Sigmoid   sigmoid  
  1. If the slice of tensor in PyTorch is written in the Python syntax, it is transformed into aten::slice.

VAI_C Usage

The Vitis AI compiler front ends for the supported frameworks are vai_c_caffe, vai_c_tensorflow, vai_c_tensorflow2, and vai_c_xir, and they are used across cloud-to-edge DPUs. The common options for VAI_C are illustrated in the following table.

Table 6. VAI_C Common Options for Cloud and Edge DPU
Parameters Description
--arch The DPU architecture configuration file for the VAI_C compiler in JSON format. For pre-built DPU xclbins in Vitis AI releases, you can find the corresponding arch.json file in the Vitis AI docker (/opt/vitis_ai/compiler/arch). The contents should be something like {"target": "DPUCZDX8G_ISA0_B4096_MAX_BG2"}. For customized DPU IPs, the corresponding arch.json files are generated by the DPU TRD along with the DPU IPs. The contents should be something like {"fingerprint": "0x1000000f7014407"}. The fingerprint is a 64-bit digital signature to identify a DPU target. It consists of 1 byte to indicate the DPU type, 1 byte to indicate the ISA version, and 6 bytes to indicate specific configurations. The fingerprint is unique to each DPU configuration, and the runtime relies on it to identify the DPU instance running on the current platform and to verify that the model is compiled for the same DPU target. "DPUCZDX8G_ISA0_B4096_MAX_BG2" is an alias for a specific fingerprint that is pre-defined in the compiler.
--output_dir Path of the output directory for vai_c_caffe and vai_c_tensorflow after the compilation process.
--net_name Name of the DPU kernel for the network model after it is compiled by VAI_C.
--options The list of extra options in the format of 'key':'value'. If there are multiple options to be specified, they are separated by ','.

Use --options '{"input_shape": "1,224,224,3"}' to specify input shape manually.

Use --options '{"plugins": "plugin0,plugin1"}' to specify plugin libraries.

Use --options '{"output_ops": "op_name0,op_name1"}' to specify output ops

Note: Arguments specified with "--options" have the highest priority and override the values specified in other places.
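For example, to compile a quantized TensorFlow model while overriding the input shape and specifying output ops in a single command (the option values are illustrative):

vai_c_tensorflow -f /PATH/TO/quantize_eval_model.pb -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname --options '{"input_shape": "1,224,224,3", "output_ops": "op_name0,op_name1"}'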