Libraries and Samples
Caffe Framework
The Vitis™ AI Library contains the following neural network libraries based on the Caffe framework:
TensorFlow Framework
The Vitis™ AI Library contains the following neural network libraries based on the TensorFlow framework:
PyTorch Framework
The Vitis™ AI Library supports the following types of neural network libraries based on the PyTorch framework:
- Classification
- ReID Detection
- Face Recognition
- Semantic Segmentation
- Pointpillars
- Medical Segmentation
- 3D Segmentation
- Pointpillars_nuscenes: Surround-view
- Centerpoint: 4D radar based 3D detection
- PointPainting: Image-lidar sensor fusion
- Depth Estimation
- Bayesian Crowd Counting
- Multi-task V3
The related libraries are open source and can be modified as needed. The source code is available on GitHub.
The Vitis™ AI Library provides image and video test samples for all of the above networks. In addition, the kit provides the corresponding performance test programs. For video-based testing, we recommend using raw video for evaluation, because software decoding on Arm® processors can have inconsistent decoding times, which may affect the accuracy of the evaluation.
Model Library
After the model packet is installed on the target, all the models are stored under /usr/share/vitis_ai_library/models/. Each model is stored in a separate folder, which is composed of the following files, by default:
- [model_name].xmodel
- [model_name].prototxt
Take the "inception_v1" model as an example. inception_v1.xmodel is the model data, and inception_v1.prototxt contains the parameters of the model.
Model Type
Classification
The Classification library is used to classify images. Such neural networks are trained on ImageNet for ILSVRC and can identify objects across its 1,000 classes. The Vitis AI Library integrates networks including, but not limited to, ResNet18, ResNet50, Inception_v1, Inception_v2, Inception_v3, Inception_v4, VGG, mobilenet_v1, mobilenet_v2, and Squeezenet into Xilinx libraries. The input is a picture with an object and the output is the top-K most probable categories.
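The library follows the create-and-run pattern used by the Vitis AI Library samples. The following is a minimal sketch of that pattern, assuming the vitis::ai::Classification class from <vitis/ai/classification.hpp>; verify the exact class and result field names against the headers of your installed release.

```cpp
// Minimal classification sketch (assumes the create()/run() pattern of the
// Vitis AI Library samples; verify class and field names against your release).
#include <iostream>
#include <opencv2/opencv.hpp>
#include <vitis/ai/classification.hpp>

int main(int argc, char* argv[]) {
  // Model name, e.g. "resnet50"; the model must be installed on the target.
  auto model = vitis::ai::Classification::create(argc > 1 ? argv[1] : "resnet50");
  cv::Mat image = cv::imread(argc > 2 ? argv[2] : "sample_classification.jpg");
  auto result = model->run(image);        // pre-processing is handled internally
  for (const auto& s : result.scores) {   // top-K categories, highest score first
    std::cout << "class index " << s.index << "  score " << s.score << "\n";
  }
  return 0;
}
```

The same pattern applies to the other classification models listed in the table below; only the model name passed to create() changes.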
The following table lists the classification models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | inception_resnet_v2_tf | TensorFlow |
2 | inception_v1_tf | |
3 | inception_v3_tf | |
4 | inception_v4_2016_09_09_tf | |
5 | mobilenet_v1_0_25_128_tf | |
6 | mobilenet_v1_0_5_160_tf | |
7 | mobilenet_v1_1_0_224_tf | |
8 | mobilenet_v2_1_0_224_tf | |
9 | mobilenet_v2_1_4_224_tf | |
10 | resnet_v1_101_tf | |
11 | resnet_v1_152_tf | |
12 | resnet_v1_50_tf | |
13 | vgg_16_tf | |
14 | vgg_19_tf | |
15 | mobilenet_edge_1_0_tf | |
16 | mobilenet_edge_0_75_tf | |
17 | inception_v2_tf | |
18 | MLPerf_resnet50_v1.5_tf | |
19 | resnet50_tf2 | |
20 | mobilenet_1_0_224_tf2 | |
21 | inception_v3_tf2 | |
22 | resnet_v2_50_tf | |
23 | resnet_v2_101_tf | |
24 | resnet_v2_152_tf | |
25 | efficientnet-b0_tf2 | |
26 | efficientNet-edgetpu-S_tf | |
27 | efficientNet-edgetpu-M_tf | |
28 | efficientNet-edgetpu-L_tf | |
29 | resnet50 | Caffe |
30 | resnet18 | |
31 | inception_v1 | |
32 | inception_v2 | |
33 | inception_v3 | |
34 | inception_v4 | |
35 | mobilenet_v2 | |
36 | squeezenet | |
37 | resnet50_pt | PyTorch |
38 | squeezenet_pt | |
39 | inception_v3_pt |
Face Detection
The Face Detection library uses the DenseBox neural network to detect human faces. The input is a picture with the faces you want to detect and the output is a vector of the result structure containing the information of each detection box. The following image shows the result of face detection.
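A minimal usage sketch for face detection is shown below, assuming the vitis::ai::FaceDetect class and the relative (0 to 1) box coordinates used by the library samples; check facedetect.hpp of your release for the exact field names.

```cpp
// Face detection sketch with the DenseBox models (assumed API; the boxes are
// reported in relative 0..1 units and scaled back to pixels here).
#include <opencv2/opencv.hpp>
#include <vitis/ai/facedetect.hpp>

int main() {
  auto det = vitis::ai::FaceDetect::create("densebox_640_360");
  cv::Mat image = cv::imread("sample_facedetect.jpg");
  auto result = det->run(image);
  for (const auto& r : result.rects) {    // one entry per detected face
    cv::rectangle(image,
                  cv::Rect(r.x * image.cols, r.y * image.rows,
                           r.width * image.cols, r.height * image.rows),
                  cv::Scalar(0, 255, 0), 2);
  }
  cv::imwrite("sample_facedetect_result.jpg", image);
  return 0;
}
```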
The following table lists the face detection models supported by the AI Library.
No | Model Name | Framework |
---|---|---|
1 | densebox_320_320 | Caffe |
2 | densebox_640_360 |
Face Landmark Detection
The Face Landmark network is used to detect five key points on a human face: the left eye, the right eye, the nose, the left corner of the lips, and the right corner of the lips. This network is used to correct the face orientation before face feature extraction; that is, if a face is not directly facing the camera (for example, tilted 20 degrees to the left or right), it is adjusted so that it faces the camera directly. The input image should be a face that has been detected by the face detection network. The output of the network is the five key points, which are normalized. The following image shows the result of face landmark detection.
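The following sketch illustrates the expected flow, assuming the vitis::ai::FaceLandmark class and a result holding five normalized (x, y) pairs; it is an illustration under those assumptions rather than a verified sample.

```cpp
// Face landmark sketch (assumed API): the input is a face crop returned by
// the face detection network; the five points are normalized to 0..1.
#include <opencv2/opencv.hpp>
#include <vitis/ai/facelandmark.hpp>

int main() {
  auto landmark = vitis::ai::FaceLandmark::create("face_landmark");
  cv::Mat face = cv::imread("cropped_face.jpg");   // a face crop, not a full frame
  auto result = landmark->run(face);
  for (const auto& p : result.points) {            // five (x, y) pairs
    cv::circle(face, cv::Point(p.first * face.cols, p.second * face.rows),
               3, cv::Scalar(0, 0, 255), -1);
  }
  cv::imwrite("face_landmark_result.jpg", face);
  return 0;
}
```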
The following table lists the face landmark models supported by the AI Library.
No | Model Name | Framework |
---|---|---|
1 | face_landmark | Caffe |
SSD Detection
The SSD Detection library is commonly used with the SSD neural network. SSD is a neural network used to detect objects. The input is a picture with the objects you want to detect. The output is a vector of the result structure containing the information of each detection box. The following image shows the result of SSD detection.
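The sketch below assumes the vitis::ai::SSD class with a result containing a bboxes vector (label, score, and a box in relative coordinates), following the pattern of the library samples; verify the field names in ssd.hpp of your release.

```cpp
// SSD detection sketch (assumed API): each entry in result.bboxes carries a
// class label, a confidence score, and a box in relative 0..1 coordinates.
#include <iostream>
#include <opencv2/opencv.hpp>
#include <vitis/ai/ssd.hpp>

int main() {
  auto ssd = vitis::ai::SSD::create("ssd_adas_pruned_0_95");
  cv::Mat image = cv::imread("sample_ssd.jpg");
  auto result = ssd->run(image);
  for (const auto& box : result.bboxes) {
    std::cout << "label " << box.label << "  score " << box.score << "\n";
    cv::rectangle(image,
                  cv::Rect(box.x * image.cols, box.y * image.rows,
                           box.width * image.cols, box.height * image.rows),
                  cv::Scalar(255, 0, 0), 2);
  }
  cv::imwrite("sample_ssd_result.jpg", image);
  return 0;
}
```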
The following table lists the SSD detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | ssd_mobilenet_v1_coco_tf | TensorFlow |
2 | ssd_mobilenet_v2_coco_tf | |
3 | ssd_resnet_50_fpn_coco_tf | |
4 | mlperf_ssd_resnet34_tf | |
5 | ssdlite_mobilenet_v2_coco_tf | |
6 | ssd_inception_v2_coco_tf | |
7 | ssd_pedestrian_pruned_0_97 | Caffe |
8 | ssd_traffic_pruned_0_9 | |
9 | ssd_adas_pruned_0_95 | |
10 | ssd_mobilenet_v2 |
Pose Detection
The Pose Detection library is used to detect the posture of the human body. This library includes a neural network that can identify 14 key points on the human body. The input is a picture of a person detected by a pedestrian detection neural network (for example, the SSD detection library). The output is a structure containing the coordinates of each point. The following image shows the result of pose detection.
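The following sketch illustrates the described two-stage flow, assuming the vitis::ai::SSD and vitis::ai::PoseDetect classes; the exact fields of the pose result are defined in posedetect.hpp of your release and are not accessed here.

```cpp
// Pose detection sketch (assumed API): a pedestrian is first located with the
// SSD detection library, then the cropped person image is passed to sp_net.
#include <opencv2/opencv.hpp>
#include <vitis/ai/posedetect.hpp>
#include <vitis/ai/ssd.hpp>

int main() {
  auto ssd  = vitis::ai::SSD::create("ssd_pedestrian_pruned_0_97");
  auto pose = vitis::ai::PoseDetect::create("sp_net");
  cv::Mat image = cv::imread("sample_posedetect.jpg");
  auto persons = ssd->run(image);
  for (const auto& box : persons.bboxes) {
    // Crop the detected pedestrian (boxes are in relative coordinates).
    cv::Rect roi(box.x * image.cols, box.y * image.rows,
                 box.width * image.cols, box.height * image.rows);
    roi &= cv::Rect(0, 0, image.cols, image.rows);   // clip to the frame
    auto result = pose->run(image(roi));
    // result holds the coordinates of the 14 body key points; the exact field
    // names are defined in posedetect.hpp of the installed release.
    (void)result;
  }
  return 0;
}
```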
The following table lists the pose detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | sp_net | Caffe |
Semantic Segmentation
Semantic segmentation assigns a semantic category to each pixel in the input image, that is, it identifies pixels as part of an object, say, a car, a road, a tree, a horse, etc. Libsegmentation is a segmentation library which can be used in ADAS applications. It offers simple interfaces for a developer to deploy segmentation tasks on a Xilinx® FPGA.
The following is an example of semantic segmentation, where "blue gray" denotes the sky, "green" denotes trees, "red" denotes people, "dark blue" denotes cars, "plum" denotes the road, and "gray" denotes structures.
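The sketch below assumes the vitis::ai::Segmentation class with run_8UC1/run_8UC3 variants, as used by the library samples; run_8UC1 is assumed to return a per-pixel class ID map and run_8UC3 a false-color rendering. Verify the names in segmentation.hpp of your release.

```cpp
// Segmentation sketch (assumed API): run_8UC1 returns a single-channel label
// map (one class ID per pixel) and run_8UC3 returns a false-color rendering.
#include <opencv2/opencv.hpp>
#include <vitis/ai/segmentation.hpp>

int main() {
  auto seg = vitis::ai::Segmentation::create("fpn");
  cv::Mat image = cv::imread("sample_segmentation.jpg");
  auto labels = seg->run_8UC1(image);   // labels.segmentation: CV_8UC1 class IDs
  auto colors = seg->run_8UC3(image);   // colors.segmentation: CV_8UC3 color map
  cv::imwrite("segmentation_labels.png", labels.segmentation);
  cv::imwrite("segmentation_colors.png", colors.segmentation);
  return 0;
}
```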
The following table lists the semantic segmentation models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | fpn | Caffe |
2 | FPN-resnet18_Endov | |
3 | semantic_seg_citys_tf2 | TensorFlow |
4 | mobilenet_v2_cityscapes_tf | |
5 | SemanticFPN_cityscapes_pt | PyTorch |
6 | ENet_cityscapes_pt | |
7 | unet_chaos-CT_pt | |
8 | SemanticFPN_Mobilenetv2_pt |
Road Line Detection
The Road Line Detection library is used to draw lane lines in ADAS applications. Each lane line is represented by a number that indicates its category, and a vector<Point> is used to draw the lane line. In the test code, a color map is used so that different types of lane lines are drawn in different colors. The points are stored in the container vector, and the OpenCV polygon interface cv::polylines() is used to draw the lane line. The following image shows the result of road line detection.
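The following sketch illustrates drawing the returned lane lines with cv::polylines(), assuming a vitis::ai::RoadLine class from <vitis/ai/lanedetect.hpp> whose result exposes a type and a point vector per line; the header, class, and field names are assumptions to verify against your installed release, and the color map is only illustrative.

```cpp
// Road line detection sketch (assumed API): each detected line carries a type
// (its category) and a vector of points that can be passed to cv::polylines().
#include <opencv2/opencv.hpp>
#include <vitis/ai/lanedetect.hpp>

int main() {
  auto det = vitis::ai::RoadLine::create("vpgnet_pruned_0_99");
  cv::Mat image = cv::imread("sample_lanedetect.jpg");
  auto result = det->run(image);
  // Illustrative color map: one color per lane-line category.
  std::vector<cv::Scalar> colors = {
      {0, 255, 0}, {255, 0, 0}, {0, 0, 255}, {0, 255, 255}};
  for (const auto& line : result.lines) {
    std::vector<std::vector<cv::Point>> poly{line.points_cluster};
    cv::polylines(image, poly, false, colors[line.type % colors.size()], 2);
  }
  cv::imwrite("sample_lanedetect_result.jpg", image);
  return 0;
}
```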
No | Model Name | Framework |
---|---|---|
1 | vpgnet_pruned_0_99 | Caffe |
YOLOv3 Detection
YOLO is a neural network used to detect objects, and the current version is v3. The input is a picture with one or more objects, and the output is a vector of the result structure containing the information of each detected object. The following image shows the result of YOLOv3 detection.
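A minimal sketch mirroring the test_jpeg_yolov3 sample is shown below, assuming the vitis::ai::YOLOv3 class and relative box coordinates; confirm the result fields in yolov3.hpp of your release.

```cpp
// YOLOv3 detection sketch (assumed API): each bbox entry has a label, a score,
// and a box in relative 0..1 coordinates.
#include <iostream>
#include <opencv2/opencv.hpp>
#include <vitis/ai/yolov3.hpp>

int main() {
  auto yolo = vitis::ai::YOLOv3::create("yolov3_bdd");
  cv::Mat image = cv::imread("sample_yolov3.jpg");
  auto result = yolo->run(image);
  for (const auto& box : result.bboxes) {
    std::cout << "label " << box.label << "  score " << box.score << "\n";
    cv::rectangle(image,
                  cv::Rect(box.x * image.cols, box.y * image.rows,
                           box.width * image.cols, box.height * image.rows),
                  cv::Scalar(0, 255, 255), 2);
  }
  cv::imwrite("sample_yolov3_result.jpg", image);
  return 0;
}
```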
The following table lists the YOLOv3 detection models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | yolov3_voc_tf | TensorFlow |
2 | yolov3_adas_pruned_0_9 | Caffe |
3 | yolov3_voc | |
4 | yolov3_bdd | |
5 | tiny_yolov3_vmss |
YOLOv4 Detection
No | Model Name | Framework |
---|---|---|
1 | yolov4_leaky_spp_m | Caffe |
2 | yolov4_leaky_spp_m_pruned_0_36 |
YOLOv2 Detection
No | Model Name | Framework |
---|---|---|
1 | yolov2_voc | Caffe |
2 | yolov2_voc_pruned_0_66 | |
3 | yolov2_voc_pruned_0_71 | |
4 | yolov2_voc_pruned_0_77 |
Openpose Detection
The Openpose Detection library detects human body key points. The network identifies the following 14 key points: 0: head, 1: neck, 2: L_shoulder, 3: L_elbow, 4: L_wrist, 5: R_shoulder, 6: R_elbow, 7: R_wrist, 8: L_hip, 9: L_knee, 10: L_ankle, 11: R_hip, 12: R_knee, 13: R_ankle.
The input of the network is 368x368. The following image shows the result of openpose detection.
The following table lists the Openpose detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | openpose_pruned_0_3 | Caffe |
RefineDet Detection
RefineDet is a neural network that is used to detect human bodies. The input is a picture with the individuals that you would like to detect. The output is a vector of the result structure containing the information of each detection box. The following image shows the result of RefineDet detection:
The following table lists the RefineDet detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | refinedet_pruned_0_8 | Caffe |
2 | refinedet_pruned_0_92 | |
3 | refinedet_pruned_0_96 | |
4 | refinedet_baseline | |
5 | refinedet_VOC_tf | TensorFlow |
ReID Detection
The task of person re-identification is to identify a person of interest at any time or place. This is done by extracting image features and comparing them: images of the same person should have similar features and a small feature distance, while images of different persons should have a large feature distance. Given a query image and a set of candidate images, the candidate with the smallest feature distance is identified as the same person as the query image. The following table lists the ReID detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | reid | Caffe |
2 | personreid-res18_pt | PyTorch |
3 | personreid-res50_pt | |
4 | facereid-large_pt | |
5 | facereid-small_pt |
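As a usage sketch for the ReID models listed above, the following compares the features of two person crops by cosine distance, assuming the vitis::ai::Reid class whose result exposes a normalized feature as a cv::Mat; verify the field name and normalization in reid.hpp of your release.

```cpp
// ReID sketch (assumed API): extract a feature for each person crop and compare
// them by cosine distance; a smaller distance means more likely the same person.
#include <iostream>
#include <opencv2/opencv.hpp>
#include <vitis/ai/reid.hpp>

int main() {
  auto reid = vitis::ai::Reid::create("personreid-res18_pt");
  cv::Mat query     = cv::imread("person_query.jpg");      // cropped person images
  cv::Mat candidate = cv::imread("person_candidate.jpg");
  cv::Mat feat_q = reid->run(query).feat;
  cv::Mat feat_c = reid->run(candidate).feat;
  // Cosine distance between the features (assumed to be normalized by the library).
  double distance = 1.0 - feat_q.dot(feat_c);
  std::cout << "feature distance: " << distance << "\n";
  return 0;
}
```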
Multi-task
The multi-task library is appropriate for a model that has multiple sub-tasks. The Multi-task model in the Vitis AI Library has two sub-tasks: semantic segmentation and SSD detection. The following table lists the multi-task models supported by the Vitis AI Library.
Number | Model Name | Framework |
---|---|---|
1 | multi_task | Caffe |
2 | MT-resnet18_mixed_pt | PyTorch |
Face Recognition
The face feature models are used for face recognition. They extract the features of a person's face, and the output of these models is a 512-dimensional feature vector. If you have two different images and you want to know whether they show the same person, use these models to extract features from the two images, and then use the calculation and mapping functions to get the similarity of the two images.
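The following sketch illustrates the comparison flow, assuming a vitis::ai::FaceFeature class whose result exposes the 512 float values; the exact container type behind the feature field is an assumption to verify in facefeature.hpp of your release.

```cpp
// Face feature sketch (assumed API): extract the 512-dimensional features of two
// face crops and compare them with cosine similarity.
#include <cmath>
#include <iostream>
#include <numeric>
#include <opencv2/opencv.hpp>
#include <vitis/ai/facefeature.hpp>

int main() {
  auto net = vitis::ai::FaceFeature::create("facerec_resnet20");
  cv::Mat face_a = cv::imread("face_a.jpg");     // cropped, aligned faces
  cv::Mat face_b = cv::imread("face_b.jpg");
  auto feat_a = net->run(face_a).feature;        // 512 float values per face
  auto feat_b = net->run(face_b).feature;
  double dot = std::inner_product(feat_a->begin(), feat_a->end(),
                                  feat_b->begin(), 0.0);
  double norm_a = std::sqrt(std::inner_product(feat_a->begin(), feat_a->end(),
                                               feat_a->begin(), 0.0));
  double norm_b = std::sqrt(std::inner_product(feat_b->begin(), feat_b->end(),
                                               feat_b->begin(), 0.0));
  std::cout << "cosine similarity: " << dot / (norm_a * norm_b) << "\n";
  return 0;
}
```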
The following table lists the face recognition models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | facerec_resnet20 | Caffe |
2 | facerec_resnet64 | |
3 | facerec-resnet20_mixed_pt | PyTorch |
Plate Detection
The Plate Detection library uses the DenseBox neural network to detect license plates. The input is a picture of a vehicle that has been detected by the SSD network, and the output is a structure containing the plate location information. The following image shows the result of plate detection.
The following table lists the plate detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | plate_detect | Caffe |
Plate Recognition
The Plate Recognition library uses a classification network to recognize license plate numbers (Chinese license plates only). The input is a picture of a license plate that has been detected by the plate detection network. The output is a structure containing the license plate number information. The following image shows the result of plate recognition.
The following table lists the plate recognition models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | plate_num | Caffe |
Medical Segmentation
Endoscopy is a common clinical procedure for the early detection of cancers in hollow-organs such as nasopharyngeal cancer, esophageal adenocarcinoma, gastric cancer, colorectal cancer, and bladder cancer. Accurate and temporally consistent localization and segmentation of diseased region-of-interests enable precise quantification and mapping of lesions from clinical endoscopy videos, which is critical for monitoring and surgical planning.
The medical segmentation model is used to classify diseased region-of-interests in the input image. It can be classified into many categories, including BE, cancer, HGD, polyp, and suspicious.
Libmedicalsegmentation is a segmentation library which can be used in segmentation of multi-class diseases in endoscopy. It offers simple interfaces for developers to deploy segmentation tasks on Xilinx FPGAs. The following is an example of medical segmentation, where the goal is to mark the diseased region.
The following is an example of semantic segmentation, where the goal is to predict class labels for each pixel in the image.
The following table lists the medical segmentation models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | FPN_Res18_Medical_segmentation | Caffe |
Medical Detection
The RefineDet model is based on vgg16. It is used for medical detection and can detect five types of diseases, namely, BE, cancer, HGD, polyp, and suspicious from an input endoscopy image like the Endoscopy Disease Detection and Segmentation database (EDD2020).
The following table lists the medical detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | RefineDet-Medical_EDD_tf | TensorFlow |
Medical Cell Segmentation
The nucleus is an organelle present within all eukaryotic cells, including human cells. Aberrant nuclear shape can be used to identify cancer cells, for example, in Pap smear tests for the diagnosis of cervical cancer. Medical cell segmentation models offer nuclear segmentation in digital microscopic tissue images, which enables the extraction of high-quality features for nuclear morphometric and other analyses in computational pathology. The following images show the results of cell segmentation.
The following table lists the Medical Cell Segmentation models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | medical_seg_cell_tf2 | TensorFlow |
Retinaface
The RetinaFace network is used to detect human faces and face landmarks. The input is a picture with the faces you would like to detect, and the output contains the face positions, scores, and face landmarks.
The following table lists the retinaface detection models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | retinaface | Caffe |
Face Quality
The Face Quality library uses the face quality network to compute the quality score of a face. If a face is clear and frontal, the score is high; a blurry or profile face gets a low score. The score ranges from 0 to 1. The library also provides face landmark positions. The input is a face image detected by the face detection network, and the output contains the quality score and five landmark key points.
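A minimal sketch is shown below, assuming a vitis::ai::FaceQuality5pt class whose result exposes a score and five landmark points; check facequality5pt.hpp of your release for the exact names.

```cpp
// Face quality sketch (assumed API): the input is a face crop from the face
// detection network; the output has a quality score in [0, 1] and five landmarks.
#include <iostream>
#include <opencv2/opencv.hpp>
#include <vitis/ai/facequality5pt.hpp>

int main() {
  auto quality = vitis::ai::FaceQuality5pt::create("face-quality");
  cv::Mat face = cv::imread("cropped_face.jpg");
  auto result = quality->run(face);
  std::cout << "quality score: " << result.score << "\n";
  for (const auto& p : result.points) {      // five normalized (x, y) key points
    std::cout << "landmark: " << p.first << ", " << p.second << "\n";
  }
  return 0;
}
```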
The following table lists the face quality models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | face-quality | Caffe |
2 | face-quality_pt | PyTorch |
Hourglass Pose Detection
The Hourglass network detects the following 16 key points of the human body: 0 - r ankle, 1 - r knee, 2 - r hip, 3 - l hip, 4 - l knee, 5 - l ankle, 6 - pelvis, 7 - thorax, 8 - upper neck, 9 - head top, 10 - r wrist, 11 - r elbow, 12 - r shoulder, 13 - l shoulder, 14 - l elbow, 15 - l wrist.
This network can detect the posture of only one person in the input image. The input of the network is 256x256. The following image shows the result of hourglass detection.
The following table lists the hourglass models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | hourglass-pe_mpii | Caffe |
Pointpillars
Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. The PointPillars model is a novel deep network and encoder that can be trained end-to-end on LiDAR point clouds, and it offers a leading architecture for 3D object detection from LiDAR. The following image shows the result of a PointPillars test.
The following table lists the pointpillars models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | pointpillars_kitti_12000_0_pt | PyTorch |
2 | pointpillars_kitti_12000_1_pt | PyTorch |
3D Segmentation
The 3D segmentation library can support the SalsaNext model, which is used for the uncertainty-aware semantic segmentation of a full 3D LiDAR point cloud in real-time. SalsaNext is the next version of SalsaNet which has an encoder-decoder architecture, where the encoder unit has a set of ResNet blocks and the decoder unit combines upsampled features from the residual blocks.
The following table lists the 3D segmentation models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | salsanext_pt | PyTorch |
2 | salsanext_v2_pt | PyTorch |
Covid19 Segmentation
The Covid19 segmentation library can support the COVID-Net model which is a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images.
The following table lists the Covid19 segmentation models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | FPN-resnet18_covid19-seg_pt | PyTorch |
Bayesian Crowd Counting
Bayesian Crowd Counting is a neural network that is used for crowd counting. The input is a picture of a crowd whose size you would like to estimate. The output is the estimated number of people, together with the density map of the input image. The following image shows the result of a Bayesian Crowd Counting test.
The following table lists the BCC models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | bcc_pt | PyTorch |
Product Recognition
The PMG model can be used for fine-grained product recognition, for example, on the RP2K dataset. The model is ResNet18-based, and the detailed model structure is shown in the figure below. On the RP2K dataset, this model achieves 96.4% top-1 floating-point accuracy with 13.82M parameters and 2.28 GFLOPs. The final deployment and quantization top-1 accuracies are 96.19% and 96.18%, respectively.
The following table lists the PMG models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | pmg_pt | PyTorch |
SA-Gate Segmentation
SA-Gate is a neural network that is used for indoor segmentation. The input is a pair of images: an RGB image and an HHA map generated from the depth map. The output is a map in which each pixel is assigned a semantic category, such as chair, bed, or other common indoor objects.
The following image shows the result of SA-Gate segmentation.
The following table lists the SA-Gate models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | SA_gate_pt | PyTorch |
RCAN Super Resolution
The RCAN model is a super-resolution network that reconstructs a high-resolution image from a low-resolution input; the length and width of the original image are each enlarged by a factor of two. It has important application value in the fields of monitoring equipment, satellite imaging, and medical imaging. The following images show the result of RCAN. The image is still clear after zooming in.
The following table lists the RCAN super resolution models supported by the Vitis AI Library.
No | Model Name | Framework |
---|---|---|
1 | rcan_pruned_tf | TensorFlow |
PointPainting
For AD/ADAS systems, sensor-fusion algorithms play a significant role in providing high-quality perception and increasing the safety level for driving. PointPainting provides a sensor-fusion framework that takes advantage of 2D semantic segmentation and 3D object detection models. Specifically, a network is first applied to the camera images for semantic segmentation. Based on the semantic information and the camera and LiDAR calibration information, the LiDAR point clouds are projected onto the images and fused with the semantic information to get the painted point clouds. Finally, the painted point clouds are consumed by the 3D object detector to achieve better perception.
The following table lists the PointPainting models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | pointpainting_nuscenes_40000_64_0_pt | PyTorch |
2 | pointpainting_nuscenes_40000_64_1_pt | PyTorch |
3 | semanticfpn_nuimage_576_320_pt | PyTorch |
Pointpillars_nuscenes
PointPillars is an efficient network for real-time 3D object detection on point clouds. Trained on the nuScenes dataset, this model outputs 3D bounding boxes and speed predictions for ten classes (including several kinds of vehicles, pedestrians, barriers, and traffic cones) in the surround-view range. With multi-sweep point clouds as input, PointPillars can achieve higher accuracy of 3D object detection and speed estimation at the cost of increased complexity in the pre-processing stage.
The following table lists the Pointpillars_nuscenes models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | pointpillars_nuscenes_40000_64_0_pt | PyTorch |
2 | pointpillars_nuscenes_40000_64_1_pt | PyTorch |
Multi-task V3
Multi-task V3 aims to perform different tasks in autonomous driving scenarios simultaneously while achieving good performance and efficiency. The tasks include object detection, segmentation, lane detection, drivable area segmentation, and depth estimation, which are important components of the autonomous driving perception module.
The following table lists the multi-task v3 models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | multi_task_v3_pt | PyTorch |
Centerpoint
4D radar is a high-resolution, long-range radar sensor that detects not only the distance, relative speed, and azimuth of objects, but also their height above the road level. Unlike LiDAR, it works well in all weather conditions, including fog and heavy rain. A state-of-the-art anchor-free 3D object detector, CenterPoint, is used. It is trained on the 4D radar data of the open dataset Astyx. Because the annotated samples are limited and the 4D radar point clouds are sparse, the 3D bounding box predictions are naturally less accurate. Although vehicles near the ego car can be correctly detected, there are still some false positive predictions, and some objects at longer distances cannot be detected. Fusing 4D radar object detection with camera images could boost the performance by a large margin.
The CenterPoint model is used for 4D radar detection, and the following figure shows the result of the CenterPoint model.
The following table lists the Centerpoint models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | centerpoint_0_pt | PyTorch |
2 | centerpoint_1_pt | PyTorch |
Depth Estimation
FADNet is a model used for depth estimation. It is a fast and accurate network for disparity estimation. It has three main features:
- It exploits efficient 2D-based correlation layers with stacked blocks to preserve fast computation.
- It combines the residual structures to make the deeper model easier to learn.
- It contains multi-scale predictions so as to exploit a multi-scale weight scheduling training technique to improve the accuracy.
The following images show the result of depth estimation. The first image is the left camera image input, the second image is the right camera image input and the third image is the running result of the FADNet model.
The following table lists the depth estimation models supported by the Vitis AI library.
No | Model Name | Framework |
---|---|---|
1 | FADNet_0_pt | PyTorch |
2 | FADNet_1_pt | PyTorch |
3 | FADNet_2_pt | PyTorch |
Model Samples
Currently, there are 37 model samples that are located in ~/Vitis-AI/demo/Vitis-AI-Library/samples. Each sample has the following four kinds of test samples:
- test_jpeg_[model type]
- test_video_[model type]
- test_performance_[model type]
- test_accuracy_[model type]
Take YOLOv3 as an example.
- Before you run the YOLOv3 detection example, you can choose one of the following yolov3 models to run:
- yolov3_bdd
- yolov3_voc
- yolov3_voc_tf
- Ensure that the following test programs exist:
- test_jpeg_yolov3
- test_video_yolov3
- test_performance_yolov3
- test_accuracy_yolov3_bdd
- test_accuracy_yolov3_adas_pruned_0_9
- test_accuracy_yolov3_voc
- test_accuracy_yolov3_voc_tf
If the executable program does not exist, you must cross-compile it on the host and then copy it to the target.
- To test the image data, execute the following command:
#./test_jpeg_yolov3 yolov3_bdd sample_yolov3.jpg
The result is printed on the terminal. Also, you can view the output image: sample_yolov3_result.jpg.
- To test the video data, execute the following command:
#./test_video_yolov3 yolov3_bdd video_input.mp4 -t 8
- To test the model performance, execute the following command:
#./test_performance_yolov3 yolov3_bdd test_performance_yolov3.list -t 8
The result is printed on the terminal.
- To test the model accuracy, prepare your own image dataset, image list file, and the ground truth of the images. Then execute the following command:
#./test_accuracy_yolov3_bdd [image_list_file] [output_file]
After the output_file is generated, a script file is needed to automatically compare the results. Finally, the accuracy result can be obtained.