Compiling the Model

Vitis AI Compiler

The Vitis™ AI compiler (VAI_C) is the unified interface to a compiler family targeting the optimization of neural-network computations for a family of DPUs. Each compiler maps a network model to a highly optimized DPU instruction sequence.

A simplified description of the VAI_C framework is shown in the following figure. After parsing the topology of the optimized and quantized input model, VAI_C constructs an internal computation graph as an intermediate representation (IR), together with corresponding control flow and data flow representations. It then performs multiple optimizations, for example, computation node fusion (such as fusing batch norm into a preceding convolution), efficient instruction scheduling that exploits inherent parallelism, and exploitation of data reuse.

Figure 1: Vitis AI Compiler Framework

The Vitis AI Compiler generates the compiled model based on the DPU microarchitecture. Vitis AI supports several DPUs for different platforms and applications.

Table 1. DPUs on Different Hardware Platforms
DPU Name Hardware platform
DPUCZDX8G Zynq® UltraScale+™ MPSoC
DPUCAHX8H Alveo™ U50, U280 Data Center accelerator cards
DPUCAHX8L Alveo U50, U280 Data Center accelerator cards
DPUCADF8H Alveo U200, U250 Data Center accelerator cards
DPUCVDX8G Versal™ ACAP VCK190 evaluation board, Versal AI Core Series
DPUCVDX8H Versal ACAP VCK5000 evaluation kit

Compiling with an XIR-based Toolchain

Xilinx Intermediate Representation (XIR) is a graph-based intermediate representation of AI algorithms designed for the compilation and efficient deployment of the DPU on the FPGA platform. It is the current foundation for the Vitis AI quantizer, compiler, runtime, and other tools. If you are an advanced user, you can apply whole-application acceleration to use the FPGA to its maximum potential by extending XIR to support customized IPs in the Vitis AI flow.

XIR

XIR includes the Op, Tensor, Graph, and Subgraph libraries, which provide a clear and flexible representation of the computational graph. XIR has an in-memory format and a file format for different uses. The in-memory format of XIR is a graph object, and the file format is an xmodel. A graph object can be serialized to an xmodel, and an xmodel can be deserialized into a graph object.
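As a minimal, hedged illustration of the two formats, the following sketch creates a graph object, serializes it to an xmodel, and deserializes it back. It assumes the xir::Graph::create factory and the serialize/deserialize APIs named in the xir::Graph section below; the header path is an assumption and may differ between releases.

#include <xir/graph/graph.hpp>  // assumed header location

int main() {
  // In-memory format: a graph object.
  auto graph = xir::Graph::create("example_graph");

  // File format: an xmodel. Serialize the in-memory graph to a file ...
  graph->serialize("example.xmodel");

  // ... and deserialize it back into a new graph object.
  auto restored = xir::Graph::deserialize("example.xmodel");
  return 0;
}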

In the Op library, there is a well-defined set of operators that covers the popular deep learning frameworks, e.g., TensorFlow, PyTorch, and Caffe, as well as all of the built-in DPU operators. This enhances the expressiveness of XIR and achieves one of its core goals: eliminating the differences between these frameworks and providing a unified representation for users and developers.

XIR also provides Python APIs named PyXIR, which enable Python users to fully access XIR in a Python environment, e.g., to co-develop and integrate their Python projects with the current XIR-based tools without having to do a large amount of work to bridge the gap between different languages.

Figure 2: XIR Based Flow

xir::Graph

Graph is the core component of XIR. It provides several significant APIs, e.g., xir::Graph::serialize, xir::Graph::deserialize, and xir::Graph::topological_sort.

The Graph is like a container: it maintains Ops as its vertices and uses the producer-consumer relation as its edges.
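For example, the following hedged sketch deserializes a compiled xmodel and walks its ops in topological order; the get_name and get_type accessors on xir::Op are assumptions, as is the header path.

#include <iostream>
#include <xir/graph/graph.hpp>  // assumed header location

int main() {
  auto graph = xir::Graph::deserialize("/PATH/TO/netname.xmodel");

  // Visit every Op following the producer-consumer edges.
  for (auto* op : graph->topological_sort()) {
    std::cout << op->get_name() << " : " << op->get_type() << std::endl;
  }
  return 0;
}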

xir::Op

Op in XIR is the instance of the operator definition either in XIR or extended from XIR. All Op instances can only be created or added by the Graph according to the predefined built-in/extended op definition library. The Op definition mainly includes the input arguments and intrinsic attributes.

Besides the intrinsic predefined attributes, an Op instance can also carry additional extrinsic attributes through the xir::Op::set_attr API. Each Op instance can have only one output tensor, but more than one fanout op.
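A hedged sketch of attaching an extrinsic attribute is shown below. The templated set_attr/get_attr/has_attr form and the get_output_tensor accessor are assumptions that may differ between XIR versions, and "my_note" is a hypothetical attribute name.

#include <string>
#include <xir/op/op.hpp>  // assumed header location

// Tag an op with an extrinsic attribute and read it back.
void tag_op(xir::Op* op) {
  op->set_attr<std::string>("my_note", "attached by a custom tool");  // extrinsic attribute
  if (op->has_attr("my_note")) {
    auto note = op->get_attr<std::string>("my_note");
    (void)note;
  }
  // Each Op has exactly one output tensor, but may feed several fanout ops.
  auto* out_tensor = op->get_output_tensor();
  (void)out_tensor;
}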

xir::Tensor

Tensor is another important class in XIR. Unlike the tensor definitions in other frameworks, XIR's Tensor is only a description of the data block it represents; the real data block is excluded from the Tensor.

The key attributes of a Tensor are its data type and shape.

xir::Subgraph

XIR's Subgraph is a tree-like hierarchy that divides a set of ops into several non-overlapping sets. The Graph's entire op set can be seen as the root. Subgraphs can be nested, but they must remain non-overlapping, and nested subgraphs must be children of the enclosing subgraph.
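A hedged sketch of walking the subgraph tree of a compiled xmodel is shown below. The children_topological_sort() helper, the get_name accessor, and the "device" attribute follow conventions used in the VART samples and are assumptions here.

#include <iostream>
#include <string>
#include <xir/graph/graph.hpp>  // assumed header location

int main() {
  auto graph = xir::Graph::deserialize("/PATH/TO/netname.xmodel");

  // The root subgraph owns the graph's entire op set.
  auto* root = graph->get_root_subgraph();

  // Its children are the non-overlapping device-level partitions.
  for (auto* child : root->children_topological_sort()) {
    if (child->has_attr("device")) {
      std::cout << child->get_name() << " -> "
                << child->get_attr<std::string>("device") << std::endl;
    }
  }
  return 0;
}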

Compiling for DPU

The XIR-based compiler takes the quantized TensorFlow or Caffe model as the input. First, it transforms the input model into the XIR format as the foundation for the following processes. Most of the variations among different frameworks are eliminated and transferred to a unified representation in XIR. Then, it applies various optimizations to the graph and breaks the graph up into several subgraphs on the basis of whether the operations can be executed on the DPU. Architecture-aware optimizations are applied to each subgraph, as required. For each DPU subgraph, the compiler generates the instruction stream and attaches it to the subgraph. Finally, the optimized graph with the necessary information and instructions for VART is serialized into a compiled xmodel file.

The XIR-based compiler can support the DPUCZDX8G series on the Edge Zynq UltraScale+ MPSoC platforms, DPUCADF8H on the Alveo platform, DPUCAHX8H on the Alveo HBM platform optimized for high-throughput applications, DPUCAHX8L on the Alveo HBM platform optimized for low-latency applications, DPUCVDX8G on the Versal Edge platform, and DPUCVDX8H on the Versal Cloud platform. You can find the arch.json files for these platforms in /opt/vitis_ai/compiler/arch.
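For reference, a pre-built arch.json typically contains a single target entry (see also the --arch description in VAI_C Usage); for the ZCU102/ZCU104 DPU listed in Table 2, it looks like the following.

{"target": "DPUCZDX8G_ISA0_B4096_MAX_BG2"}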

The steps to compile Caffe or TensorFlow models with VAI_C are the same as for the previous DPUs. It is assumed that you have successfully installed the Vitis AI package, including VAI_C, and quantized your model with the Vitis AI quantizer.

Caffe

For Caffe, vai_q_caffe generates a prototxt (deploy.prototxt) and a model (deploy.caffemodel). Ensure that you specify the -keep_fixed_neuron option for vai_q_caffe because it is essential for the XIR-based compiler. Run the following command to get the compiled xmodel.

vai_c_caffe -p /PATH/TO/deploy.prototxt -c /PATH/TO/deploy.caffemodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

The compiler creates three files in the OUTPUTPATH directory: netname_org.xmodel is the pre-compiled xmodel generated by the compiler, netname.xmodel is the compiled xmodel that contains the instructions and other necessary information, and meta.json is for the Vitis AI runtime.

TensorFlow

For TensorFlow, vai_q_tensorflow generates two pb files; the quantize_eval_model.pb file is the input file for the XIR-based compiler. The compilation command is as follows.

vai_c_tensorflow -f /PATH/TO/quantize_eval_model.pb -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

The outputs are the same as the outputs for Caffe.

Sometimes, the TensorFlow model does not contain input tensor shape information, which might cause the compilation to fail. You can specify the input tensor shape with an extra option such as --options '{"input_shape": "1,224,224,3"}'.

TensorFlow 2.x

For TensorFlow 2.x, the quantizer generates the quantized model in the hdf5 format.

vai_c_tensorflow2 -m /PATH/TO/quantized.h5 -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

Currently, vai_c_tensorflow2 only supports Keras functional APIs.

PyTorch

For PyTorch, the quantizer NNDCT outputs the quantized model in the XIR format directly. Use vai_c_xir to compile it.

vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname

Compiling for Customized Accelerator

The XIR-based compiler works in the context of a framework-independent XIR graph generated from deep learning frameworks. The parser removes the framework-specific attributes in the CNN models and transforms models into XIR-based computing graphs. The compiler divides the computing graph into different subgraphs, leverages heterogeneous optimizations, and generates corresponding optimized machine codes for subgraphs.

Figure 3: Compilation Flow

When the model contains operations that the DPU cannot support, some subgraphs are created and mapped to the CPU. The FPGA is flexible enough that you can create a specific IP to accelerate those operations for improved end-to-end performance. To enable customized accelerating IPs with an XIR-based toolchain, leverage a pipeline named plugin to extend the XIR and the compiler.

In Plugin.hpp, the interface class Plugin is declared. Plugins are executed sequentially before the compiler starts to compile the graph for the DPU. First, a child subgraph is created for each operator, and the plugin picks the operators it can accelerate. It merges them into larger subgraphs, maps them to the customized IP, and attaches the necessary information for runtime (VART::Runner), such as the instructions, on the subgraphs.

Implementing a Plugin

  1. Implement Plugin::partition()

    In std::set<xir::Subgraph*> partition(xir::Graph* graph), pick the desired operations and merge them into device-level subgraphs using the following helper functions.

    • xir::Subgraph* filter_by_name(xir::Graph* graph, const std::string& name) returns the subgraph with a specific name.
    • std::set<xir::Subgraph*> filter_by_type(xir::Graph* graph, const std::string& type) returns subgraphs with a specific type.
    • std::set<xir::Subgraph*> filter_by_template(xir::Graph* graph, xir::GraphTemplate* temp) returns subgraphs with a specific structure.
      Figure 4: Filter by Templates
    • std::set<xir::Subgraph*> filter(xir::Graph* graph, std::function<std::set<xir::Subgraph*>(std::set<xir::Subgraph*>)> func) allows you to filter the subgraphs with a customized function. This method helps you to find all uncompiled subgraphs.

    To merge the child subgraphs, use the merge_subgraph() helper function. However, this function can only merge subgraphs at the same level. If the subgraph list cannot be merged into one subgraph, the helper function merges them as far as possible.

  2. Specify the name, device, and runner for the subgraphs you picked in the Plugin::partition() function.
  3. Implement Plugin::compile(xir::Subgraph*). This function is called for all the subgraphs returned by the partition() function. You can attach information on subgraphs for runtime. A hedged sketch combining these steps is shown after this list.
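The following hedged sketch ties these three steps together. It assumes the Plugin interface and helper functions declared in Plugin.hpp as described above; the exact base-class method signatures, the return type of merge_subgraph(), and the attribute names used to tag the subgraphs ("device", "runner") are assumptions and may differ between Vitis AI releases.

#include <set>
#include <string>

#include "Plugin.hpp"  // declares the Plugin interface and the filter/merge helpers

// Hypothetical plugin for a custom accelerator IP.
class MyAccelPlugin : public Plugin {
 public:
  // Steps 1 and 2: pick the ops this IP can accelerate, merge them into
  // device-level subgraphs, and specify the name, device, and runner for them.
  std::set<xir::Subgraph*> partition(xir::Graph* graph) override {
    auto picked = filter_by_type(graph, "softmax");  // "softmax" is only illustrative
    auto merged = merge_subgraph(picked);            // merges subgraphs at the same level
    for (auto* sg : merged) {
      sg->set_attr<std::string>("device", "MY_IP");               // assumed attribute name
      sg->set_attr<std::string>("runner", "libmy_ip_runner.so");  // assumed attribute name
    }
    return merged;
  }

  // Step 3: called for every subgraph returned by partition(); attach the
  // information (for example, instructions) that VART::Runner needs.
  void compile(xir::Subgraph* subgraph) override {
    // subgraph->set_attr<...>(...);
  }
};

Build this class into a shared library and export it through get_plugin() as described in Building the Plugin below.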

Building the Plugin

Create an extern get_plugin() function and build the implementations into a shared library.

extern "C" plugin* get_plugin() { return new YOURPLUGIN(); }

Using the Plugin

Use --options '{"plugin": "libplugin0.so,libplugin1.so"}' in the vai_c command line option to pass your plugin library to compiler. When executing your plugin, the compiler opens the library and makes an instance of your plugin by loading your extern function named ‘get_plugin’. If more than one plugin is specified, they are executed sequentially in the order defined by the command line option. Compilation for DPU and CPU are executed after all the plugins have been implemented.

Samples

Check https://github.com/Xilinx/Vitis-AI/tree/master/tools/Vitis-AI-Runtime/VART/plugin-samples for samples.

Supported Operators and DPU Limitations

Xilinx is continuously improving the DPU IP and the compiler to support more operators with better performance. The following table lists some typical operations and the configurations such as kernel size, stride, etc. that the DPU can support. If the operation configurations exceed these limitations, the operator will be assigned to the CPU. Additionally, the operators that the DPU can support are dependent on the DPU types, ISA versions, and configurations.

You can configure the DPUs to suit your requirements. You can choose engines, adjust intrinsic parameters, and create your own DPU IP with the TRD projects, but this means that the limitations can be very different between configurations. Either use the following product guides for information on configuration or compile the model with your own DPU configuration. The compiler tells you which operators can be assigned to the CPU. The table shows a specific configuration of each DPU architecture.

  • DPUCZDX8G for Zynq UltraScale+ MPSoCs Product Guide (PG338)
  • DPUCAHX8L for Convolutional Neural Networks Product Guide (PG366)
  • DPUCAHX8H for Convolutional Neural Network Product Guide (PG367)
  • DPUCVDX8G for Versal ACAPs Product Guide (PG389)

The following operators are primitively defined in different deep learning frameworks. The compiler can automatically parse these operators, transform them into the XIR format, and distribute them to DPU or CPU. These operators are partially supported by the tools, and they are listed here for your reference.

Currently Supported Operators

Table 2. Currently Supported Operators
Typical Operation Type in CNN Parameters DPUCZDX8G_ISA0_B4096_MAX_BG2 (ZCU102, ZCU104) DPUCAHX8L_ISA0 (U50, U50LV, U280) DPUCVDX8G_ISA1_C32B3 (VCK190) DPUCAHX8H_ISA2 (U50, U50LV9E, U50LV10E, U280) DPUCADF8H_ISA0 (U200, U250) DPUCVDX8H_ISA1_F2W2 (VCK5000)
Intrinsic Parameter
DPUCZDX8G_ISA0_B4096_MAX_BG2: channel_parallel: 16, bank_depth: 2048
DPUCAHX8L_ISA0: channel_parallel: 32, bank_depth: 4096
DPUCVDX8G_ISA1_C32B3: channel_parallel: 16, bank_depth: 16384
DPUCAHX8H_ISA2: channel_parallel: 16, bank_depth: 2048
DPUCADF8H_ISA0: channel_parallel: 16, bank_depth: 8192
DPUCVDX8H_ISA1_F2W2: channel_parallel: 64, bank_depth: 256
conv2d Kernel size w, h: [1, 16] for all listed DPUs; DPUCVDX8G_ISA1_C32B3 additionally requires w * h <= 64
Strides w, h: [1, 8] w, h: [1, 4] w, h: [1, 8] w, h: [1, 4] w, h: [1, 8] w, h: [1, 4]
Dilation dilation * input_channel <= 256 * channel_parallel
Paddings pad_left, pad_right: [0, (kernel_w - 1) * dilation_w]
pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h]
In Size kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, LeakyReLU, ReLU6 ReLU, ReLU6 ReLU, LeakyReLU, ReLU6, Hard-Swish, Hard-Sigmoid ReLU, LeakyReLU, ReLU6 ReLU, LeakyReLU ReLU, LeakyReLU
Group* (Caffe) group==1
depthwise-conv2d Kernel size w, h: [1, 16] w, h: [3] w, h: [1, 256] Not supported
Strides w, h: [1, 8] w, h: [1, 2] w, h: [1, 8]
dilation dilation * input_channel <= 256 * channel_parallel
Paddings pad_left, pad_right: [0, (kernel_w - 1) * dilation_w] pad_left, pad_right: [0, 15 * dilation_w]
pad_top, pad_bottom: [0, (kernel_h - 1) * dilation_h] pad_top, pad_bottom: [0, 15 * dilation_h]
In Size kernel_w * kernel_h * ceil(input_channel / channel_parallel) <= bank_depth
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, ReLU6 ReLU, ReLU6 ReLU, ReLU6
Group* (Caffe) group==input_channel
transposed-conv2d Kernel size kernel_w/stride_w, kernel_h/stride_h: [1, 16]
Strides
Paddings pad_left, pad_right: [1, kernel_w-1]
pad_top, pad_bottom: [1, kernel_h-1]
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, LeakyReLU, ReLU6 ReLU, ReLU6 ReLU, LeakyReLU, ReLU6, Hard-Swish, Hard-Sigmoid ReLU, LeakyReLU, ReLU6 ReLU, LeakyReLU ReLU, LeakyReLU
depthwise-transposed-conv2d Kernel size kernel_w/stride_w, kernel_h/stride_h: [1, 16] kernel_w/stride_w, kernel_h/stride_h: [3] kernel_w/stride_w, kernel_h/stride_h: [1, 256] Not supported
Strides
Paddings pad_left, pad_right: [1, kernel_w-1] pad_left, pad_right: [1, 15]
pad_top, pad_bottom: [1, kernel_h-1] pad_top, pad_bottom: [1, 15]
Out Size output_channel <= 256 * channel_parallel
Activation ReLU, ReLU6 ReLU, ReLU6 ReLU, ReLU6
max-pooling Kernel size w, h: [2, 8] w, h: {2, 3, 5, 7, 8} w, h: [1, 256] w, h: [1, 8] w, h: [1, 16] w, h: {1, 2, 3, 7}
Strides w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8]
Paddings pad_left, pad_right: [1, kernel_w-1] pad_left, pad_right: [1, 15] pad_left, pad_right: [1, kernel_w-1]
pad_top, pad_bottom: [1, kernel_h-1] pad_top, pad_bottom: [1, 15] pad_top, pad_bottom: [1, kernel_h-1]
Activation ReLU not supported ReLU, ReLU6 not supported ReLU not supported
average-pooling Kernel size DPUCZDX8G: w, h: [2, 8], w==h; DPUCAHX8L: w, h: {2, 3, 5, 7, 8}, w==h; DPUCVDX8G: w, h: [1, 256]; DPUCAHX8H: w, h: [1, 8], w==h; DPUCADF8H: w, h: [1, 16]; DPUCVDX8H: w, h: {1, 2, 3, 7}, w==h
Strides w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8] w, h: [1, 8]
Paddings pad_left, pad_right: [1, kernel_w-1] pad_left, pad_right: [1, 15] pad_left, pad_right: [1, kernel_w-1]
pad_top, pad_bottom: [1, kernel_h-1] pad_top, pad_bottom: [1, 15] pad_top, pad_bottom: [1, kernel_h-1]
Activation ReLU not supported ReLU, ReLU6 not supported ReLU not supported
eltwise type sum sum sum, prod sum sum sum
Input Channel input_channel <= 256 * channel_parallel
Activation ReLU ReLU ReLU ReLU ReLU ReLU
concat Network-specific limitation, which relates to the size of feature maps, quantization results and compiler optimizations.
reorg Strides reverse==false: stride ^ 2 * input_channel <= 256 * channel_parallel; reverse==true: input_channel <= 256 * channel_parallel
pad In Size input_channel <= 256 * channel_parallel
Mode "SYMMETRIC" ("CONSTANT" pad(value=0) would be fused into adjacent operators during compiler optimization process)
global pooling Global pooling will be processed as general pooling with kernel size equal to the input tensor size.
InnerProduct, Fully Connected, Matmul These ops will be transformed into a conv2d op.
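As a worked example of the In Size constraint above: on DPUCZDX8G_ISA0_B4096_MAX_BG2 (channel_parallel: 16, bank_depth: 2048), a 3 x 3 convolution over 512 input channels needs 3 * 3 * ceil(512 / 16) = 9 * 32 = 288 bank entries, which is within the 2048-entry limit, so this constraint is satisfied. A 7 x 7 convolution over 8192 input channels would need 7 * 7 * ceil(8192 / 16) = 49 * 512 = 25088 > 2048, so it would exceed the limit and be assigned to the CPU.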

Operators Supported by TensorFlow

Table 3. Operators Supported by TensorFlow
TensorFlow XIR  DPU Implementations 
OP type Attributes OP name Attributes
placeholder / inputlayer* shape data shape Allocate memory for input data. 
data_type
const       const   data, shape, data_type Allocate memory for const data.
conv2d     filter conv2d     kernel Convolution Engine    
strides stride
  pad([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
dilations dilation
depthwiseconv2dnative     filter depthwise-conv2d     kernel Depthwise-Convolution Engine    
strides stride
explicit_paddings padding
padding pad_mode(SAME or VALID)
dilations dilation
conv2dbackpropinput / conv2dtranspose*     filter transposed-conv2d     kernel Convolution Engine    
strides stride
  padding([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
dilations dilation
spacetobatchnd + conv2d + batchtospacend        block_shape conv2d        dilation Spacetobatch, Conv2d, and Batchtospace would be mapped to the Convolution Engine when the specific requirements are met.
padding  
filter kernel
strides stride
padding pad_mode(SAME)
dilations dilations
block_shape  
crops  
matmul / dense*  transpose_a conv2d / matmul  transpose_a The matmul would be transformed to a conv2d operation once the equivalent conv2d meets the hardware requirements and can be mapped to DPU. 
transpose_b transpose_b
maxpool / maxpooling2d*    ksize maxpool    kernel Pooling Engine   
strides stride
  pad([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
avgpool / averagepooling2d* / globalavgeragepooling2d*      ksize avgpool      kernel Pooling Engine     
strides stride
  pad([0, 0, 0, 0])
padding pad_mode(SAME or VALID)
  count_include_pad (false)
  count_include_invalid (true)
mean  axis avgpool / reduction_mean  axis Mean operation would be transformed to avgpool if the equivalent avgpool meets the hardware requirements and can be mapped to DPU. 
keep_dims keep_dims
relu   relu   Activations would be fused to adjacent operations such as convolution, add, etc.  
relu6   relu6  
leakyrelu alpha leakyrelu alpha
fixneuron / quantizelayer*    bit_width fix    bit_width It would be divided into float2fix and fix2float during compilation, then the float2fix and fix2float operations would be fused with adjacent operations into coarse-grained operations.
quantize_pos fix_point
  if_signed
  round_mode
identity   identity   Identity would be removed.
add, addv2   add   If the add is an element-wise add, the add would be mapped to the DPU Element-wise Add Engine; if the add is a channel-wise add, we search for opportunities to fuse the add with adjacent operations such as convolutions.
concatv2 / concatenate* axis concat axis We reduce the overhead resulting from the concat by special reading or writing strategies and allocating the on-chip memory carefully.
pad / zeropadding2d*  paddings pad  paddings "CONSTANT" padding would be fused into adjacent operations. "SYMMETRIC" padding would be mapped to DPU instructions. "REFLECT" padding is not supported by the DPU yet.
mode mode
shape   shape   The shape operation would be removed.
stridedslice   begin stridedslice   begin If they are shape-related operations, they would be removed during compilation. If they are components of a coarse-grained operation, they would be fused with adjacent operations. Otherwise, they would be compiled into CPU implementations.             
end end
strides strides
pack axis stack axis
neg   neg  
mul   mul  
realdiv   div  
sub   sub  
prod  axis reduction_product  axis
keep_dims keep_dims
sum  axis reduction_sum  axis
keep_dims keep_dims
max  axis reduction_max  axis
keep_dims keep_dims
resizebilinear    size/scale resize    size If the mode of the resize is 'BILINEAR', the cases align_corner=false, half_pixel_centers=false with size = 2, 4, 8, and align_corner=false, half_pixel_centers=true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size is an integer, the resize would be mapped to DPU implementations.
align_corners align_corners
half_pixel_centers half_pixel_centers
  mode="BILINEAR"
resizenearestneighbor    size/scale resize    size
align_corners align_corners
half_pixel_centers half_pixel_centers
  mode="NEAREST"
upsample2d    size/scale resize    size
  align_corners
  half_pixel_centers
interpolation mode
reshape shape reshape shape They would be transformed to the reshape operation in some cases. Otherwise they would be mapped to CPU.  
transpose perm transpose order
squeeze axis squeeze axis
exp   exp   They would only be compiled into CPU implementations.  
softmax axis softmax axis
sigmoid   sigmoid  
square + rsqrt + maximum    l2_normalize  axis output = x / sqrt(max(sum(x ^ 2), epsilon)) would be fused into an l2_normalize in XIR.
  epsilon
  1. The OPs in TensorFlow listed above are supported in XIR. All of them have CPU implementations in the toolchain.
  2. Operators marked with * are the operator names used when the TensorFlow version is greater than 2.0.

Operators Supported by Caffe

Table 4. Operators Supported by Caffe
Caffe XIR DPU Implementation 
OP name Attributes OP name Attributes
input  shape data  shape Allocate memory for input data. 
  data_type
convolution        kernel_size conv2d (group = 1) / depthwise-conv2d (group = input channel)        kernel If group == input channel, the convolution would be compiled into the Depthwise-Convolution Engine; if group == 1, the convolution would be mapped to the Convolution Engine. Otherwise, it would be mapped to the CPU.
stride stride
pad pad
  pad_mode (FLOOR)
dilation dilation
bias_term  
num_output  
group  
deconvolution        kernel_size transposed-conv2d (group = 1) / depthwise-transposed-conv2d (group = input channel)        kernel If group == input channel, the deconvolution would be compiled into the Depthwise-Convolution Engine; if group == 1, the deconvolution would be mapped to the Convolution Engine. Otherwise, it would be mapped to the CPU.
stride stride
pad pad
  pad_mode (FLOOR)
dilation dilation
bias_term  
num_output  
group  
innerproduct  bias_term conv2d / matmul  transpose_a The inner-product would be transformed to matmul, then the matmul would be transformed to conv2d and compiled to Convolution Engine. If the inner-product fails to be transformed, it would be implemented by CPU. 
num_output transpose_b
scale bias_term depthwise-conv2d / scale   The scale would be transformed to depthwise-convolution, otherwise, it would be mapped to CPU.
pooling       kernel_size maxpool2d (pool_method = 0) / avgpool2d (pool_method = 1)       kernel_size Pooling Engine      
stride stride
global_pooling global
pad pad
pool_method pad_mode(CEIL)
  count_include_pad (true)
  count_include_invalid (false)
eltwise  coeff = 1 add    Element-wise Add Engine 
operation = SUM  
concat axis concat axis We reduce the overhead resulting from the concat by special reading or writing strategies and allocate the on-chip memory carefully.
relu negative_slope relu / leakyrelu alpha Activations would be fused to adjacent operations such as convolution, add, etc. 
relu6   relu6  
fixneuron    bit_width fix    bit_width It would be divided into float2fix and fix2float during compilation, then the float2fix and fix2float operations would be fused with adjacent operations into coarse-grained operations.
quantize_pos fix_point
  if_signed
  round_mode
reshape shape reshape shape These operations are shape-related; they would be removed or transformed into reshape in most cases, which would not affect the on-chip data layout. Otherwise, they would be compiled to CPU.
permute order reshape / transpose order
flatten axis reshape / flatten start_axis
  end_axis   end_axis
reorg  strides reorg  strides If the reorg meets the hardware requirements, it would be mapped to DPU implementations. 
reverse reverse
deephiresize    scale resize    size If the mode of the resize is 'BILINEAR', the cases align_corner=false, half_pixel_centers=false with size = 2, 4, 8, and align_corner=false, half_pixel_centers=true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the size is an integer, the resize would be mapped to DPU implementations.
mode mode
  align_corners=false
  half_pixel_centers=false
gstiling  strides gstiling  stride If the strides of gstiling are integers, it may be mapped into special DPU read/write instructions. 
reverse reverse
slice   axis strided_slice   begin They would only be compiled into CPU implementations.           
slice_point end
  strides
priorbox        min_sizes priorbox        min_sizes
max_sizes max_sizes
aspect_ratio aspect_ratio
flip flip
clip clip
variance variance
step step
offset offset
softmax axis softmax axis

Operators Supported by PyTorch

Table 5. Operators Supported by PyTorch
PyTorch XIR DPU Implementation 
API Attributes OP name Attributes
Parameter   data const   data Allocate memory for input data.  
shape
  data_type
Conv2d        in_channels conv2d (groups = 1) / depthwise-conv2d (groups = input channel)          If groups == input channel, the convolution would be compiled into Depthwise-Convolution Engine. If groups == 1, the convolution would be mapped to Convolution Engine. Otherwise, it would be mapped to the CPU.       
out_channels  
kernel_size kernel
stride stride
padding pad
padding_mode('zeros') pad_mode (FLOOR)
groups  
dilation dilation
ConvTranspose2d        in_channels transposed-conv2d (groups = 1) / depthwise-transposed-conv2d (groups = input channel)          If groups == input channel, the convolution would be compiled into Depthwise-Convolution Engine. If groups == 1, the convolution would be mapped to Convolution Engine. Otherwise, it would be mapped to the CPU.
out_channels  
kernel_size kernel
stride stride
padding pad
padding_mode('zeros') pad_mode (FLOOR)
groups  
dilation dilation
matmul    conv2d / matmul  transpose_a The matmul would be transformed to conv2d and compiled to Convolution Engine. If the matmul fails to be transformed, it would be implemented by CPU. 
  transpose_b
MaxPool2d / AdaptiveMaxPool2d     kernel_size maxpool2d     kernel Pooling Engine    
stride stride
padding pad
ceil_mode pad_mode
output_size (adaptive) global
AvgPool2d / AdaptiveAvgPool2d       kernel_size avgpool2d        kernel Pooling Engine      
stride stride
padding pad
ceil_mode pad_mode
count_include_pad count_include_pad
  count_include_invalid (true)
output_size (adaptive) global
ReLU   relu   Activations would be fused to adjacent operations such as convolution, add, etc.    
LeakyReLU negative_slope leakyrelu alpha
ReLU6   relu6    
Hardtanh  min_val = 0  
max_val = 6  
ConstantPad2d / ZeroPad2d  padding pad  paddings "CONSTANT" padding would be fused into adjacent operations.
value = 0 mode ("CONSTANT")
add   add   If the add is an element-wise add, the add would be mapped to DPU Element-wise Add Engine. If the add is a channel-wise add, search for opportunities to fuse the add with adjacent operations such as convolutions. If they are shape-related operations, they would be removed during compilation. If they are components of a coarse-grained operation, they would be fused with adjacent operations. Otherwise, they would be compiled into CPU implementations.      
sub / rsub   sub  
mul   mul  
max  dim reduction_max  axis
keepdim keep_dims
mean  dim reduction_mean  axis
keepdim keep_dims
interpolate / upsample / upsample_bilinear / upsample_nearest     size resize     size If the mode of the resize is 'BILINEAR', the cases align_corner=false, half_pixel_centers=false with size = 2, 4, 8, and align_corner=false, half_pixel_centers=true with size = 2, 4, can be transformed to DPU implementations (pad + depthwise-transposed conv2d). If the mode of the resize is 'NEAREST' and the sizes are integers, the resize would be mapped to DPU implementations.
scale_factor  
mode mode
align_corners align_corners
  half_pixel_centers = !align_corners
transpose  dim0 transpose  order These operations would be transformed to the reshape operation in some cases. Additionally, the compiler searches for opportunities to fuse the dimension transformation operations into special load/save instructions of adjacent operations to reduce the overhead. Otherwise, they would be mapped to CPU.
dim1  
permute dims    
view size reshape shape
flatten  start_dim reshape / flatten  start_axis
end_dim end_axis
squeeze dim reshape / squeeze axis
cat dim concat axis Reduce the overhead resulting from the concat by special reading or writing strategies and allocating the on-chip memory carefully.
aten::slice*    dim strided_slice   If the strided_slice is shape-related or is the component of a coarse-grained operation, it would be removed. Otherwise, the strided_slice would be compiled into CPU implementations.   
start begin
end end
step strides
BatchNorm2d      eps depthwise-conv2d / batchnorm      epsilon If the batch_norm is quantized and can be transformed to a depthwise-conv2d equivalently, it would be transformed to depthwise-conv2d and the compiler would search for compilation opportunities to map the batch_norm into DPU implementations. Otherwise, the batch_norm would be executed by CPU.
  axis
  moving_mean
  moving_var
  gamma
  beta
softmax dim softmax axis They would only be compiled into CPU implementations. 
Tanh   tanh  
Sigmoid   sigmoid  
  1. If the slice of tensor in PyTorch is written in the Python syntax, it is transformed into aten::slice.

VAI_C Usage

The Vitis AI compiler front ends for the supported frameworks are vai_c_caffe, vai_c_tensorflow, vai_c_tensorflow2, and vai_c_xir, and they are used across cloud-to-edge DPUs. The common options for VAI_C are illustrated in the following table.

Table 6. VAI_C Common Options for Cloud and Edge DPU
Parameters Description
--arch The DPU architecture configuration file for the VAI_C compiler in JSON format. For pre-built DPU xclbins in Vitis AI releases, you can find the corresponding arch.json file in the Vitis AI docker (/opt/vitis_ai/compiler/arch). The contents should be something like {"target": "DPUCZDX8G_ISA0_B4096_MAX_BG2"}. For customized DPU IPs, the corresponding arch.json files are generated by the DPU TRD along with the DPU IPs. The contents should be something like {"fingerprint": "0x1000000f7014407"}. The fingerprint is a 64-bit digital signature to identify a DPU target. It consists of 1 byte to indicate the DPU type, 1 byte to indicate the ISA version, and 6 bytes to indicate specific configurations. The fingerprint is unique to each DPU configuration, and the runtime relies on it to identify the DPU instance running on the current platform and to verify that the model is compiled for the same DPU target. "DPUCZDX8G_ISA0_B4096_MAX_BG2" is an alias for a specific fingerprint that is pre-defined in the compiler.
--output_dir Path of the output directory for vai_c_caffe and vai_c_tensorflow after the compilation process.
--net_name Name of the DPU kernel for the network model after it is compiled by VAI_C.
--options The list of extra options in the format of 'key':'value'. If there are multiple options to be specified, they are separated by ','.

Use --options '{"input_shape": "1,224,224,3"}' to specify input shape manually.

Use --options '{"plugins": "plugin0,plugin1"}' to specify plugin libraries.

Use --options '{"output_ops": "op_name0,op_name1"}' to specify output ops

Note: Arguments specified with "--options" have the highest priority and override the values specified in other places.
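For example, to compile a quantized TensorFlow model while overriding the input shape and specifying output ops in a single command (the option values are illustrative):

vai_c_tensorflow -f /PATH/TO/quantize_eval_model.pb -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname --options '{"input_shape": "1,224,224,3", "output_ops": "op_name0,op_name1"}'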