Deploying and Running the Model
Programming with VART
Vitis AI provides a C++ DpuRunner class with the following interfaces:
- std::pair<uint32_t, int> execute_async(const std::vector<TensorBuffer*>& input, const std::vector<TensorBuffer*>& output);
  Submits input tensors for execution and output tensors to store results. The host pointer is passed using the TensorBuffer object. This function returns a job ID and the status of the function call.
- int wait(int jobid, int timeout);
  The job ID returned by execute_async is passed to wait() to block until the job is complete and the results are ready.
- TensorFormat get_tensor_format()
  Queries the DpuRunner for the tensor format it expects. Returns DpuRunner::TensorFormat::NCHW or DpuRunner::TensorFormat::NHWC.
- std::vector<Tensor*> get_input_tensors()
  std::vector<Tensor*> get_output_tensors()
  Query the DpuRunner for the shape and name of the input and output tensors it expects for its loaded Vitis AI model.
- To create a DpuRunner object, call the following function:
  create_runner(const xir::Subgraph* subgraph, const std::string& mode = "")
  It returns std::unique_ptr<Runner>. The input to create_runner is an XIR subgraph generated by the Vitis AI compiler.
C++ Example
// get dpu subgraph by parsing model file
auto runner = vart::Runner::create_runner(subgraph, "run");
// populate input/output tensors
auto job_data = runner->execute_async(inputs, outputs);
runner->wait(job_data.first, -1);
// process outputs
For more C++ examples, refer to Vitis AI Examples.
Vitis AI also provides a Python ctypes Runner class that mirrors the C++ class, using the C DpuRunner implementation:
class Runner:
    def __init__(self, path)
    def get_input_tensors(self)
    def get_output_tensors(self)
    def get_tensor_format(self)
    def execute_async(self, inputs, outputs)
        # Differences from the C++ API:
        # 1. inputs and outputs are numpy arrays with C memory layout;
        #    the numpy arrays should be reused, as their internal buffer
        #    pointers are passed to the runtime. These buffer pointers
        #    may be memory-mapped to the FPGA DDR for performance.
        # 2. returns job_id, throws an exception on error
    def wait(self, job_id)
Python Example
dpu_runner = runner.Runner(subgraph,"run")
# populate input/output tensors
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# process fpgaOutput
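The "populate input/output tensors" step means allocating numpy buffers that match the shapes the runner reports. The following is a minimal sketch, assuming each Tensor object returned by get_input_tensors()/get_output_tensors() exposes its shape through a dims attribute, that the model has a single input and a single output, that the runner expects float32 data, and that preprocess() and image stand in for your own data loading and preprocessing.
import numpy as np

# Query the runner for the tensor shapes it expects (dims is assumed to
# include the batch dimension, for example [1, 224, 224, 3] for NHWC).
input_tensors = dpu_runner.get_input_tensors()
output_tensors = dpu_runner.get_output_tensors()

# Allocate C-contiguous buffers once and reuse them across calls, because
# the runtime keeps the underlying buffer pointers.
fpgaInput = [np.empty(tuple(t.dims), dtype=np.float32, order="C") for t in input_tensors]
fpgaOutput = [np.empty(tuple(t.dims), dtype=np.float32, order="C") for t in output_tensors]

# Fill the input buffer in place with preprocessed data (placeholder).
fpgaInput[0][...] = preprocess(image)

# Submit the job and block until the results are ready.
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
prediction = np.argmax(fpgaOutput[0], axis=-1)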
DPU Debug with VART
This section demonstrates how to verify DPU inference results with the VART tools. TensorFlow ResNet50, Caffe ResNet50, and PyTorch ResNet50 networks are used as examples. The following are the four steps for debugging the DPU with VART:
- Generate a quantized inference model and reference result.
- Generate a DPU xmodel.
- Generate a DPU inference result.
- Crosscheck the reference result and the DPU inference result.
Before you start to debug the DPU result, ensure that you have set up the environment according to the instructions in the Getting Started section.
TensorFlow Workflow
To generate the quantized inference model and reference result, follow these steps:
- Generate the quantized inference model by running the following command to quantize the model. The quantized model, quantize_eval_model.pb, is generated in the quantize_model folder. (A sketch of a typical input_fn module is shown after this list.)
  vai_q_tensorflow quantize \
      --input_frozen_graph ./float/resnet_v1_50_inference.pb \
      --input_fn input_fn.calib_input \
      --output_dir quantize_model \
      --input_nodes input \
      --output_nodes resnet_v1_50/predictions/Reshape_1 \
      --input_shapes ?,224,224,3 \
      --calib_iter 100
- Generate the reference result by running the following command to generate reference data.
  vai_q_tensorflow dump \
      --input_frozen_graph quantize_model/quantize_eval_model.pb \
      --input_fn input_fn.dump_input \
      --output_dir=dump_gpu
  The following figure shows part of the reference data.
- Generate the DPU xmodel by running the following command to generate the DPU xmodel file.
  vai_c_tensorflow --frozen_pb quantize_model/quantize_eval_model.pb \
      --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json \
      --output_dir compile_model \
      --net_name resnet50_tf
- Generate the DPU inference result by running the following command, which produces the DPU inference result and compares it with the reference data automatically.
  env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1 XLNX_GOLDEN_DIR=./dump_gpu/dump_results_0 \
      xdputil run ./compile_model/resnet_v1_50_tf.xmodel \
      ./dump_gpu/dump_results_0/input_aquant.bin \
      2>result.log 1>&2
  For more information on xdputil usage, execute the xdputil --help command. After the above command runs, the DPU inference result and the comparison result, result.log, are generated. The DPU inference results are located in the dump folder.
- Crosscheck the reference result and the DPU inference result.
  - View the comparison results for all layers.
    grep --color=always 'XLNX_GOLDEN_DIR.*layer_name' result.log
  - View only the failed layers.
    grep --color=always 'XLNX_GOLDEN_DIR.*fail ! layer_name' result.log
  If the crosscheck fails, use the following methods to determine the layer at which the crosscheck starts to fail.
  - Check the inputs of the DPU and GPU, and make sure they use the same input data.
  - Use the xdputil tool to generate a picture displaying the network structure.
    Usage: xdputil xmodel <xmodel> -s <svg>
    Note: In the Vitis AI docker environment, execute the following command to install the required library: sudo apt-get install graphviz
    When you open the generated picture, you can see many small boxes around the ops. Each box represents a layer on the DPU. You can use the last op's name to find its corresponding entry in the GPU dump result. The following figure shows part of the structure.
- Submit the files to Xilinx.
  If a certain layer proves to be wrong on the DPU, prepare the quantized model, such as quantize_eval_model.pb, as one package for further analysis and send it to Xilinx with a detailed description.
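The --input_fn argument used above names a Python module that feeds preprocessed data to vai_q_tensorflow: a function that takes the iteration index and returns a dict mapping the graph's input node names to numpy arrays. The following is a minimal sketch of such a module; the image directory, preprocessing, and batch size are assumptions and must match how the model was trained.
# input_fn.py - minimal sketch (paths and preprocessing are assumptions)
import glob
import cv2
import numpy as np

CALIB_BATCH_SIZE = 1
images = sorted(glob.glob("calib_images/*.jpg"))

def _preprocess(path):
    # Resize to the network input size and scale to [0, 1]; the real
    # preprocessing must match the training pipeline.
    img = cv2.imread(path)
    img = cv2.resize(img, (224, 224))
    return img.astype(np.float32) / 255.0

def calib_input(iter):
    # Called once per calibration iteration (--calib_iter). Returns a dict
    # keyed by the input node name given to --input_nodes.
    batch = [_preprocess(images[(iter * CALIB_BATCH_SIZE + i) % len(images)])
             for i in range(CALIB_BATCH_SIZE)]
    return {"input": np.stack(batch)}

def dump_input(iter):
    # The dump step feeds one deterministic input so that the GPU reference
    # and the DPU run see identical data.
    return {"input": np.expand_dims(_preprocess(images[0]), axis=0)}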
Caffe Workflow
To generate the quantized inference model and reference result, follow these steps:
- Generate the quantized inference model by running the following command.
  vai_q_caffe quantize -model float/test_quantize.prototxt \
      -weights float/trainval.caffemodel \
      -output_dir quantize_model \
      -keep_fixed_neuron \
      2>&1 | tee ./log/quantize.log
  The following files are generated in the quantize_model folder:
  - deploy.caffemodel
  - deploy.prototxt
  - quantize_train_test.caffemodel
  - quantize_train_test.prototxt
- Generate the reference result by running the following command.
  DECENT_DEBUG=5 vai_q_caffe test -model quantize_model/dump.prototxt \
      -weights quantize_model/quantize_train_test.caffemodel \
      -test_iter 1 \
      2>&1 | tee ./log/dump.log
  This creates the dump_gpu folder and files as shown in the following figure.
- Generate the DPU xmodel by running the following command.
  vai_c_caffe --prototxt quantize_model/deploy.prototxt \
      --caffemodel quantize_model/deploy.caffemodel \
      --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json \
      --output_dir compile_model \
      --net_name resnet50
- Generate the DPU inference result by running the following command.
  env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1 \
      xdputil run ./compile_model/resnet50.xmodel \
      ./dump_gpu/data.bin 2>result.log 1>&2
  The DPU inference result and the comparison result, result.log, are generated if this command runs successfully. The DPU inference results are located in the dump folder.
- Crosscheck the reference result and the DPU inference result.
  The crosscheck mechanism is to first make sure the input(s) to a layer are identical to the reference, and then check that the output(s) are identical as well. This can be done with commands such as diff, vimdiff, and cmp. If two files are identical, diff and cmp return nothing on the command line. (A Python sketch for comparing two dump files numerically is shown after this list.)
  - Check the input of the DPU and GPU to ensure they use the same input data.
  - Use the xdputil tool to generate a picture displaying the network structure.
    Usage: xdputil xmodel <xmodel> -s <svg>
    Note: To install the required library, execute the following command in the Vitis AI docker environment: sudo apt-get install graphviz
    The following figure shows part of the ResNet50 model structure generated by xdputil.
  - View the xmodel structure image and find the name of the last layer of the model.
    Note: Check the last layer first. If the crosscheck of the last layer is successful, then the crosscheck for all the layers will pass, and there is no need to crosscheck each layer individually.
    For this model, the name of the last layer is `subgraph_fc1000_fixed_(fix2float)`.
  - Search for the keyword fc1000 under dump_gpu and dump. You will find the reference result file fc1000.bin under dump_gpu and the DPU inference result 0.fc1000_inserted_fix_2.bin under dump/subgraph_fc1000/output/.
  - Diff the two files.
    If the crosscheck for the last layer fails, perform the crosscheck from the first layer onward until you find the layer at which it fails.
    Note: For layers that have multiple inputs or outputs (for example, res2a_branch1), check the correctness of the inputs before verifying the outputs.
- Submit the files to Xilinx if the DPU crosscheck fails.
  If a certain layer proves to be wrong on the DPU, prepare the following files as one package for further analysis and send it to Xilinx with a detailed description.
  - Float model and prototxt file
  - Quantized model, such as deploy.caffemodel, deploy.prototxt, quantize_train_test.caffemodel, and quantize_train_test.prototxt
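As referenced above, raw dump files can also be compared numerically rather than only byte-by-byte. The following is a minimal sketch that compares the fc1000 reference and DPU dumps; the int8 dtype and the exact file paths are assumptions and should be adjusted to match your dump files.
# compare_dumps.py - minimal sketch (dtype and paths are assumptions)
import numpy as np

ref = np.fromfile("dump_gpu/fc1000.bin", dtype=np.int8)
dpu = np.fromfile("dump/subgraph_fc1000/output/0.fc1000_inserted_fix_2.bin", dtype=np.int8)

if ref.shape != dpu.shape:
    print("Size mismatch:", ref.shape, dpu.shape)
else:
    mismatches = np.flatnonzero(ref != dpu)
    if mismatches.size == 0:
        print("Crosscheck passed: the files are identical")
    else:
        print("Crosscheck failed: %d of %d values differ, first at index %d"
              % (mismatches.size, ref.size, mismatches[0]))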
PyTorch Workflow
To generate the quantized inference model and reference result, follow these steps:
- Generate the quantized inference model by running the following command to quantize the model. (A condensed sketch of what such a quantization script does is shown after this list.)
  python resnet18_quant.py --quant_mode calib --subset_len 200
- Generate the reference result by running the following command to generate reference data.
  python resnet18_quant.py --quant_mode test
- Generate the DPU xmodel by running the following command to generate the DPU xmodel file.
  vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname
- Generate the DPU inference result.
  This step is the same as the corresponding step in the TensorFlow workflow.
- Crosscheck the reference result and the DPU inference result.
  This step is the same as the corresponding step in the TensorFlow workflow.
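The resnet18_quant.py script used above follows the usual Vitis AI PyTorch quantizer (pytorch_nndct) pattern. The following condensed sketch is only an illustration: the model, the dummy input shape, and the evaluation loop are placeholders, and the real example script in the Vitis AI repository should be used as the reference.
# resnet18_quant sketch (model loading and evaluation loop are placeholders)
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer

def quantize(quant_mode):
    # quant_mode is "calib" for calibration or "test" for generating the
    # reference result and exporting the quantized xmodel.
    model = resnet18(pretrained=True).eval()
    dummy_input = torch.randn(1, 3, 224, 224)

    quantizer = torch_quantizer(quant_mode, model, (dummy_input,),
                                device=torch.device("cpu"))
    quant_model = quantizer.quant_model

    # A forward pass is required before exporting; a real script loops over
    # the --subset_len calibration/evaluation images here.
    with torch.no_grad():
        quant_model(dummy_input)

    if quant_mode == "calib":
        quantizer.export_quant_config()             # writes quantization parameters
    elif quant_mode == "test":
        quantizer.export_xmodel(deploy_check=True)  # dumps the xmodel and reference data

if __name__ == "__main__":
    quantize("calib")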
Multi-FPGA Programming
Most modern servers have multiple Xilinx® Alveo™ cards, and you may want to take advantage of them by scaling up and scaling out deep-learning inference. Vitis AI provides support for multi-FPGA servers using the following building blocks.
XRM
The Xilinx Resource Manager (XRM) manages and controls Xilinx FPGA resources on a machine. With the Vitis AI release, installing XRM is mandatory for running deep-learning solutions that use XRM. XRM is implemented as a server-client paradigm. It is an add-on library on top of XRT that facilitates multi-FPGA resource management; it is not a replacement for Xilinx XRT. The feature list for XRM is as follows:
- Enables multi-FPGA heterogeneous support
- C++ API and CLI for the clients to allocate, use, and release resources
- Enables resource allocation at FPGA, compute unit (CU), and service granularity
- Automatic resource release
- Multi-client support: enables requests from multiple clients, users, and processes
- XCLBIN-to-DSA auto-association
- Resource sharing among clients and users
- Containerized support
- User-defined functions
- Logging support
Multi-FPGA, Multi-Graph Deployment with Vitis AI
Vitis AI provides applications built using the Unified Runner APIs to deploy multiple models on a single FPGA or across multiple FPGAs. A detailed description and examples are available in the Vitis AI GitHub (Multi-Tenant Multi FPGA Deployment).
AI Kernel Scheduler
Real-world deep learning applications involve multi-stage data processing pipelines: compute-intensive pre-processing operations such as loading data from disk, decoding, resizing, color space conversion, scaling, and cropping; multiple ML networks of different kinds, such as CNNs; and various post-processing operations, such as NMS.
The AI Kernel Scheduler (AKS) is an application that automatically and efficiently pipelines such graphs with little effort from the user. It provides plug-and-play, highly configurable kernels for every stage of these complex graphs, for example, pre-processing kernels such as image decode and resize, a CNN kernel such as the Vitis AI DPU kernel, and post-processing kernels such as SoftMax and NMS. You can create graphs from these kernels and execute jobs seamlessly to get maximum performance.
For more details and examples, see the Vitis AI GitHub (AI Kernel Scheduler).
Apache TVM, Microsoft ONNX Runtime, and TensorFlow Lite
In addition to VART and related APIs, Vitis AI has integrated with the Apache TVM, Microsoft ONNX Runtime, and TensorFlow Lite frameworks for improved model support and automatic partitioning. This work incorporates community-driven machine learning framework interfaces that are not available through the standard Vitis AI compiler and quantizers. In addition, it incorporates highly optimized CPU code for x86 and Arm® CPUs for cases where certain layers are not yet available on Xilinx DPUs. These frameworks are supported on all Zynq® UltraScale+™ MPSoC and Alveo™-based DPUs.
Apache TVM
Apache TVM is an open source deep learning compiler stack focused on building efficient implementations for a wide variety of hardware architectures. It includes model parsing from TensorFlow, TensorFlow Lite (TFLite), Keras, PyTorch, MxNet, ONNX, Darknet, and others. Through the Vitis AI integration with TVM, Vitis AI can run models from these frameworks. The TVM flow incorporates two phases. The first is a model compilation/quantization phase, which produces the CPU/FPGA binary for your desired target CPU and DPU. Then, by installing the TVM runtime on your cloud or edge device, the TVM APIs in Python or C++ can be called to execute the model.
To read more about Apache TVM, see https://tvm.apache.org.
Vitis AI provides tutorials and installation guides on Vitis AI and TVM integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/tvm.
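As a rough illustration of the runtime phase, the following sketch loads a library produced by the compilation phase and executes it through TVM's Python graph executor. The file name, input name, and input shape are assumptions, and depending on the TVM version the executor module may be named graph_runtime instead of graph_executor.
import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load the compiled deployment library produced by the compilation phase
# (the file name is an assumption).
lib = tvm.runtime.load_module("resnet50_tvm_dpu.so")

# Create a graph executor on the host CPU; subgraphs offloaded to the DPU
# are dispatched from within the library.
dev = tvm.cpu()
module = graph_executor.GraphModule(lib["default"](dev))

# Set the input (name and shape are assumptions), run, and read the output.
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).asnumpy()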
Microsoft ONNX Runtime
Microsoft ONNX Runtime is an open source inference accelerator focused on ONNX models, which can be exported from a wide variety of training frameworks; it is the platform with which Vitis AI has integrated to provide first-class ONNX model support. It incorporates easy-to-use runtime APIs in Python and C++ and can support models without requiring the separate compilation phase that TVM requires. Included in ONNX Runtime is a partitioner that can automatically partition work between the CPU and FPGA, further enhancing the ease of model deployment. Finally, it also incorporates the Vitis AI quantizer in a way that does not require a separate quantization setup.
To read more about Microsoft ONNX Runtime, see https://microsoft.github.io/onnxruntime/.
Vitis AI provides tutorials and installation guides on Vitis AI and ONNXRuntime integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/onnxruntime.
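As a rough illustration, the following sketch runs an ONNX model through the ONNX Runtime Python API with a Vitis AI execution provider. The provider name, provider options, and the model and input names are assumptions for this sketch; refer to the linked repository for the exact setup.
import numpy as np
import onnxruntime as ort

# Create a session that prefers the Vitis AI execution provider and falls
# back to the CPU provider for unsupported layers (the provider name is an
# assumption; see the Vitis AI ONNX Runtime repository for details).
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)

# The model input name and shape are assumptions for this sketch.
input_name = session.get_inputs()[0].name
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

outputs = session.run(None, {input_name: data})
print(outputs[0].shape)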
TensorFlow Lite
TensorFlow Lite (TFLite) is an open source inference accelerator focused on TensorFlow Lite models, which can be exported from TensorFlow; it is the platform with which Vitis AI has integrated to provide first-class TFLite model support. It incorporates easy-to-use runtime APIs in Python and C++ and can support models without requiring the separate compilation phase that TVM requires. Included in TensorFlow Lite is a partitioner that can automatically partition work between the CPU and FPGA, further enhancing the ease of model deployment. Finally, it also incorporates the Vitis AI quantizer in a way that does not require a separate quantization setup.
To read more about TensorFlow Lite, see https://tensorflow.org/lite.
Vitis AI provides tutorials and installation guides on Vitis AI and TensorFlow Lite integration on the GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/tflite.