Deploying and Running the Model

Programming with VART

Vitis AI provides a C++ DpuRunner class with the following interfaces:

  1. std::pair<uint32_t, int> execute_async(
         const std::vector<TensorBuffer*>& input,
         const std::vector<TensorBuffer*>& output);

    Submit input tensors for execution and output tensors to store results. The host pointer is passed using the TensorBuffer object. This function returns a job ID and the status of the function call.

  2. int wait(int jobid, int timeout);

    The job ID returned by execute_async is passed to wait() to block until the job is complete and the results are ready.

  3. TensorFormat get_tensor_format()

    Query the DpuRunner for the Tensor format it expects.

    Returns DpuRunner::TensorFormat::NCHW or DpuRunner::TensorFormat::NHWC

  4. std::vector<Tensor*> get_input_tensors()
    std::vector<Tensor*> get_output_tensors()

    Query the DpuRunner for the shape and name of the input and output tensors it expects for its loaded Vitis AI model.

  5. To create a DpuRunner object, call the following function:
    create_runner(const xir::Subgraph* subgraph, const std::string& mode = "")

    It returns the following:

    std::unique_ptr<Runner>

    The input to create_runner is an XIR subgraph generated by the Vitis AI compiler.

TIP: To enable multi-threading with VART, create a runner for each thread.

C++ Example

// get dpu subgraph by parsing model file
auto runner = vart::Runner::create_runner(subgraph, "run");
// populate input/output tensors
auto job_data = runner->execute_async(inputs, outputs);
runner->wait(job_data.first, -1);
// process outputs

For more C++ examples, refer to Vitis AI Examples.

Vitis AI also provides a Python ctypes Runner class that mirrors the C++ class, using the C DpuRunner implementation:

class Runner:
    def __init__(self, path)
    def get_input_tensors(self)
    def get_output_tensors(self)
    def get_tensor_format(self)
    def execute_async(self, inputs, outputs)
        # differences from the C++ API:
        # 1. inputs and outputs are numpy arrays with C memory layout.
        #    The numpy arrays should be reused, as their internal buffer
        #    pointers are passed to the runtime. These buffer pointers
        #    may be memory-mapped to the FPGA DDR for performance.
        # 2. returns job_id, throws exception on error
    def wait(self, job_id)

Python Example

dpu_runner = runner.Runner(subgraph,"run")
# populate input/output tensors
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# process fpgaOutput
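
The following is a minimal sketch of allocating the reusable numpy buffers from the runner's tensor metadata. It assumes each tensor object exposes its shape through a dims attribute and that the model uses float32 buffers; the subgraph handle and the preprocessing step are placeholders.

import numpy as np

dpu_runner = runner.Runner(subgraph, "run")
input_tensors = dpu_runner.get_input_tensors()
output_tensors = dpu_runner.get_output_tensors()

# Allocate C-contiguous buffers once and reuse them for every call, because
# their internal pointers are handed to the runtime (see the note above).
fpgaInput = [np.zeros(t.dims, dtype=np.float32, order="C") for t in input_tensors]
fpgaOutput = [np.zeros(t.dims, dtype=np.float32, order="C") for t in output_tensors]

# fill fpgaInput with preprocessed data, then run
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# process fpgaOutput here, then reuse the same buffers for the next frame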

DPU Debug with VART

This section demonstrates how to verify DPU inference results with VART tools. TensorFlow ResNet50, Caffe ResNet50, and PyTorch ResNet50 networks are used as examples. The following are the four steps for debugging the DPU with VART:

  1. Generate a quantized inference model and reference result.
  2. Generate a DPU xmodel.
  3. Generate a DPU inference result.
  4. Crosscheck the reference result and the DPU inference result.

Before you start to debug the DPU result, ensure that you have set up the environment according to the instructions in the Getting Started section.

TensorFlow Workflow

To generate the quantized inference model and reference result, follow these steps:

  1. Generate the quantized inference model by running the following command.
    The quantized model, quantize_eval_model.pb, is generated in the quantize_model folder. (A sketch of the input_fn module referenced by --input_fn appears after this list.)
    vai_q_tensorflow quantize                                    \
        --input_frozen_graph ./float/resnet_v1_50_inference.pb   \
        --input_fn input_fn.calib_input                          \
        --output_dir quantize_model                              \
        --input_nodes input                                      \
        --output_nodes resnet_v1_50/predictions/Reshape_1        \
        --input_shapes ?,224,224,3                               \
        --calib_iter 100
  2. Generate the reference result by running the following command.
    vai_q_tensorflow dump                                          \
        --input_frozen_graph quantize_model/quantize_eval_model.pb \
        --input_fn input_fn.dump_input                             \
        --output_dir=dump_gpu

    The following figure shows part of the reference data.

  3. Generate the DPU xmodel by running the following command.
    vai_c_tensorflow --frozen_pb quantize_model/quantize_eval_model.pb \
      --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json       \
      --output_dir compile_model                                       \
      --net_name resnet50_tf
  4. Generate the DPU inference result by running the following command, which also compares the DPU inference result with the reference data automatically.
    env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1 XLNX_GOLDEN_DIR=./dump_gpu/dump_results_0 \
       xdputil run ./compile_model/resnet_v1_50_tf.xmodel            \
       ./dump_gpu/dump_results_0/input_aquant.bin                    \
       2>result.log 1>&2

    For more information about xdputil usage, run the xdputil --help command.

    After the command completes, the DPU inference result and the comparison log result.log are generated. The DPU inference results are located in the dump folder.

  5. Crosscheck the reference result and the DPU inference result.
    1. View comparison results for all layers.
      grep --color=always 'XLNX_GOLDEN_DIR.*layer_name' result.log
    2. View only the failed layers.
      grep --color=always 'XLNX_GOLDEN_DIR.*fail ! layer_name' result.log

    If the crosscheck fails, use the following methods to determine the layer at which it starts to fail.

    1. Check the inputs of the DPU and GPU, and make sure they use the same input data.
    2. Use the xdputil tool to generate a picture of the network structure.
      Usage: xdputil xmodel <xmodel> -s <svg>
      Note: In the Vitis AI docker environment, execute the following command to install the required library.
      sudo apt-get install graphviz

      When you open the generated picture, you can see many small boxes around the ops. Each box corresponds to a layer on the DPU. You can use the name of the last op to find its counterpart in the GPU dump results. The following figure shows part of the structure.

    3. Submit the files to Xilinx.

      If a certain layer proves to be wrong on the DPU, package the quantized model (for example, quantize_eval_model.pb) and send it to Xilinx with a detailed description for further analysis.
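
The commands above reference an input_fn module that you supply. The following is a minimal sketch of such a module, assuming a calib_input function that is called once per calibration iteration and a dump_input function that feeds a fixed batch for dumping; the image list, the preprocessing, and the "input" key (which must match --input_nodes) are placeholders.

# input_fn.py (sketch): each function receives the iteration index and returns
# a dict that maps input node names to numpy batches.
import cv2
import numpy as np

IMAGE_LIST = ["calib_images/img_0.jpg"]  # placeholder calibration images

def _preprocess(path):
    img = cv2.imread(path)
    img = cv2.resize(img, (224, 224)).astype(np.float32)
    return img

def calib_input(iter):
    # one image per iteration; a real script would batch and normalize properly
    image = _preprocess(IMAGE_LIST[iter % len(IMAGE_LIST)])
    return {"input": np.expand_dims(image, axis=0)}

def dump_input(iter):
    # dump a single fixed batch so the reference matches the DPU input exactly
    return {"input": np.expand_dims(_preprocess(IMAGE_LIST[0]), axis=0)}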

Caffe Workflow

To generate the quantized inference model and reference result, follow these steps:

  1. Generate the quantized inference model by running the following command.
    vai_q_caffe quantize -model float/test_quantize.prototxt \
    -weights float/trainval.caffemodel                       \
    -output_dir quantize_model                               \
    -keep_fixed_neuron                                       \
    2>&1 | tee ./log/quantize.log

    The following files are generated in the quantize_model folder.

    • deploy.caffemodel
    • deploy.prototxt
    • quantize_train_test.caffemodel
    • quantize_train_test.prototxt
  2. Generate the reference result by running the following command.
    DECENT_DEBUG=5 vai_q_caffe test -model quantize_model/dump.prototxt \
    -weights quantize_model/quantize_train_test.caffemodel              \
    -test_iter 1                                                        \
    2>&1 | tee ./log/dump.log

    This creates the dump_gpu folder and files as shown in the following figure.

  3. Generate the DPU xmodel by running the following command.
    vai_c_caffe --prototxt quantize_model/deploy.prototxt       \
    --caffemodel quantize_model/deploy.caffemodel               \
    --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json  \
    --output_dir compile_model                                  \
    --net_name resnet50
  4. Generate the DPU inference result by running the following command.
    env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1 \
        xdputil run ./compile_model/resnet50.xmodel \
        ./dump_gpu/data.bin 2>result.log 1>&2

    The DPU inference result and the comparison log result.log are generated if this command runs successfully. The DPU inference results are located in the dump folder.

  5. Crosscheck the reference result and the DPU inference result.

    The crosscheck mechanism first makes sure that the input(s) to a layer are identical to the reference, and then checks that the output(s) are identical too. This can be done with commands such as diff, vimdiff, and cmp; if two files are identical, diff and cmp return nothing on the command line. (A numpy-based comparison is sketched after this list.)

    1. Check the inputs of the DPU and GPU to ensure they use the same input data.
    2. Use the xdputil tool to generate a picture of the network structure.
      Usage: xdputil xmodel <xmodel> -s <svg>
      Note: To install the required library, execute the following command in the Vitis AI docker environment.
      sudo apt-get install graphviz

      The following figure is part of the ResNet50 model structure generated by xdputil.

    3. View the xmodel structure image and find the name of the model's last layer.
      Note: Check the last layer first. If the crosscheck of the last layer is successful, the crosscheck for all the layers will pass and there is no need to crosscheck each layer individually.

      For this model, the name of the last layer is `subgraph_fc1000_fixed_(fix2float)`.

      1. Search for the keyword fc1000 under dump_gpu and dump. You will find the reference result file fc1000.bin under dump_gpu and the DPU inference result 0.fc1000_inserted_fix_2.bin under dump/subgraph_fc1000/output/.
      2. Diff the two files.

        If the crosscheck for the last layer fails, perform the crosscheck from the first layer until you find the layer where the crosscheck fails.

      Note: For layers that have multiple inputs or outputs (for example, res2a_branch1), check input correctness before verifying the outputs.
    4. Submit the files to Xilinx if the DPU crosscheck fails.

      If a certain layer proves to be wrong on the DPU, prepare the following files as one package for further analysis and send it to Xilinx with a detailed description.

      • Float model and prototxt file
      • Quantized model, such as deploy.caffemodel, deploy.prototxt, quantize_train_test.caffemodel, and quantize_train_test.prototxt.
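
As an alternative to diff and cmp, the dumps can be compared programmatically. The following is a minimal sketch, assuming both files are raw int8 buffers of the same length (adjust the dtype if the reference is stored in another format); the file names follow the fc1000 example above.

import numpy as np

ref = np.fromfile("dump_gpu/fc1000.bin", dtype=np.int8)
dpu = np.fromfile("dump/subgraph_fc1000/output/0.fc1000_inserted_fix_2.bin", dtype=np.int8)

if ref.shape == dpu.shape and np.array_equal(ref, dpu):
    print("fc1000: crosscheck passed")
elif ref.shape != dpu.shape:
    print("fc1000: crosscheck FAILED, sizes differ:", ref.size, "vs", dpu.size)
else:
    print("fc1000: crosscheck FAILED, mismatched elements:", int(np.count_nonzero(ref != dpu)))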

PyTorch Workflow

To generate the quantized inference model and reference result, follow these steps:

  1. Generate the quantized inference model by running the following command. (A sketch of what such a quantization script typically does appears after this list.)
    python resnet18_quant.py --quant_mode calib --subset_len 200
  2. Generate the reference result by running the following command.
    python resnet18_quant.py --quant_mode test
  3. Generate the DPU xmodel by running the following command.
    vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json \
        -o /OUTPUTPATH -n netname
  4. Generate the DPU inference result.

    This step is the same as the corresponding step in the TensorFlow workflow.

  5. Crosscheck the reference result and the DPU inference result.

    This step is the same as the corresponding step in the TensorFlow workflow.
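
For reference, the following is a minimal sketch of what a quantization script such as resnet18_quant.py typically does with the vai_q_pytorch API; the model, input shape, evaluation loop, and export options are placeholders, and the actual example script may differ.

import sys
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer  # vai_q_pytorch API

def main(quant_mode):
    model = resnet18(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    # Wrap the float model; quant_mode is "calib" for calibration, "test" for evaluation/export.
    quantizer = torch_quantizer(quant_mode, model, (dummy_input,))
    quant_model = quantizer.quant_model

    # Forward passes: calibration data in "calib" mode, evaluation data in "test" mode.
    quant_model(dummy_input)

    if quant_mode == "calib":
        quantizer.export_quant_config()             # writes the quantization parameters
    else:
        quantizer.export_xmodel(deploy_check=True)  # writes the xmodel and golden dump data

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "calib")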

Multi-FPGA Programming

Most modern servers have multiple Xilinx® Alveo™ cards, and you may want to take advantage of them by scaling up and scaling out deep-learning inference. Vitis AI supports multi-FPGA servers using the following building blocks.

XRM

The Xilinx Resource Manager (XRM) manages and controls Xilinx FPGA resources on a machine. With the Vitis AI release, installing XRM is mandatory for running deep-learning solutions that use it. XRM is implemented as a server-client paradigm. It is an add-on library on top of XRT that facilitates multi-FPGA resource management; it is not a replacement for Xilinx XRT. The feature list for XRM is as follows:

  • Enables multi-FPGA heterogeneous support
  • C++ API and CLI for clients to allocate, use, and release resources
  • Enables resource allocation at FPGA, compute unit (CU), and service granularity
  • Automatic resource release
  • Multi-client support: enables requests from multiple clients, users, and processes
  • XCLBIN-to-DSA auto-association
  • Resource sharing among clients/users
  • Containerized support
  • User-defined functions
  • Logging support

https://github.com/Xilinx/XRM

Multi-FPGA, Multi-Graph Deployment with Vitis AI

Vitis AI provides applications built using the Unified Runner APIs to deploy multiple models on single or multiple FPGAs. A detailed description and examples are available on the Vitis AI GitHub (Multi-Tenant Multi-FPGA Deployment).

AI Kernel Scheduler

Real-world deep-learning applications involve multi-stage data-processing pipelines. These include compute-intensive preprocessing operations such as data loading from disk, decoding, resizing, color space conversion, scaling, and cropping; multiple ML networks of different kinds, such as CNNs; and various post-processing operations, such as NMS.

The AI Kernel Scheduler (AKS) is an application that automatically and efficiently pipelines such graphs with minimal effort from the user. It provides plug-and-play, highly configurable kernels for every stage of these complex graphs: for example, preprocessing kernels such as image decode and resize, CNN kernels such as the Vitis AI DPU kernel, and post-processing kernels such as SoftMax and NMS. You can create graphs using these kernels and execute jobs seamlessly to get maximum performance.

For more details and examples, see the Vitis AI GitHub (AI Kernel Scheduler).

Apache TVM, Microsoft ONNX Runtime, and TensorFlow Lite

In addition to VART and related APIs, Vitis AI integrates with the Apache TVM, Microsoft ONNX Runtime, and TensorFlow Lite frameworks for improved model support and automatic partitioning. This work incorporates community-driven machine learning framework interfaces that are not available through the standard Vitis AI compiler and quantizers. It also incorporates highly optimized CPU code for x86 and Arm® CPUs for layers that may not yet be available on Xilinx DPUs. These frameworks are supported on all Zynq® UltraScale+™ MPSoC and Alveo™-based DPUs.

Apache TVM

Apache TVM is an open source deep learning compiler stack focused on building efficient implementations for a wide variety of hardware architectures. It includes model parsing from TensorFlow, TensorFlow Lite (TFLite), Keras, PyTorch, MXNet, ONNX, Darknet, and others. Through the Vitis AI integration with TVM, Vitis AI can run models from these frameworks. TVM incorporates two phases. The first is a model compilation/quantization phase that produces the CPU/FPGA binary for your desired target CPU and DPU. Then, after installing the TVM runtime on your cloud or edge device, the TVM APIs in Python or C++ can be called to execute the model.

To read more about Apache TVM, see https://tvm.apache.org.

Vitis AI provides tutorials and installation guides for the Vitis AI and TVM integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/tvm.
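
As an illustration of the execution phase, the following is a minimal sketch of loading and running a compiled TVM module with the Python runtime API. The library name, input name, and shape are placeholders, and the Vitis AI-specific compilation and partitioning flow is covered in the guide linked above.

import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cpu()
lib = tvm.runtime.load_module("compiled_model.so")        # artifact from the compile phase
module = graph_executor.GraphModule(lib["default"](dev))  # create the graph executor

module.set_input("input", np.random.rand(1, 224, 224, 3).astype("float32"))
module.run()
output = module.get_output(0).numpy()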

Microsoft ONNX Runtime

Microsoft ONNX Runtime is an open source inference accelerator focused on ONNX models, which can be exported from a wide variety of training frameworks. It is the platform Vitis AI has integrated with to provide first-class ONNX model support. It provides easy-to-use runtime APIs in Python and C++ and can support models without the separate compilation phase that TVM requires. ONNX Runtime includes a partitioner that can automatically partition models between the CPU and FPGA, further easing model deployment. Finally, it incorporates the Vitis AI quantizer in a way that does not require a separate quantization setup.

To read more about Microsoft ONNX Runtime, see https://microsoft.github.io/onnxruntime/.

Vitis AI provides tutorials and installation guides on Vitis AI and ONNXRuntime integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/onnxruntime.
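
The following is a minimal sketch of running an ONNX model through ONNX Runtime with the Vitis AI execution provider. The provider name, model path, and input shape shown here are assumptions for illustration; the exact setup is described in the guide linked above.

import numpy as np
import onnxruntime as ort

# Fall back to the CPU provider for any operators the Vitis AI provider does not take.
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: data})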

TensorFlow Lite

TensorFlow Lite (TFLite) is an open source inference accelerator focused on TFLite models, which can be exported from TensorFlow. It is the platform Vitis AI has integrated with to provide first-class TFLite model support. It provides easy-to-use runtime APIs in Python and C++ and can support models without the separate compilation phase that TVM requires. TensorFlow Lite includes a partitioner that can automatically partition models between the CPU and FPGA, further easing model deployment. Finally, it also incorporates the Vitis AI quantizer in a way that does not require a separate quantization setup.

To read more about TensorFlow Lite, see https://tensorflow.org/lite.

Vitis AI provides tutorials and installation guides on Vitis AI and TensorFlow Lite integration on the GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/tflite.