PyTorch Version (vai_q_pytorch)

Installing vai_q_pytorch

vai_q_pytorch has GPU and CPU versions. It supports PyTorch versions 1.2 through 1.7.1 but does not support PyTorch data parallelism. There are two ways to install vai_q_pytorch:

Install Using Docker Containers

Vitis AI provides a Docker container for quantization tools, including vai_q_pytorch. After starting a GPU or CPU container, activate the Conda environment vitis-ai-pytorch.

conda activate vitis-ai-pytorch
Note: If you want to install additional packages in the Conda environment and encounter permission problems, create a separate Conda environment based on vitis-ai-pytorch instead of using vitis-ai-pytorch directly. The pt_pointpillars_kitti_12000_100_10.8G_1.3 model in the Xilinx Model Zoo is an example of this.

A new Conda environment with a specific PyTorch version (1.2 to 1.7.1) can be created using the /opt/vitis_ai/scripts/replace_pytorch.sh script. This script clones a Conda environment from vitis-ai-pytorch, uninstalls the original PyTorch, Torchvision, and vai_q_pytorch packages, installs the specified versions of PyTorch and Torchvision, and then re-installs vai_q_pytorch from source code.

Install from the Source Code

vai_q_pytorch is a Python package designed to work as a PyTorch plugin. It is open source and available in Vitis_AI_Quantizer. Installing vai_q_pytorch in a Conda environment is recommended. To do so, follow these steps:

  1. Add the CUDA_HOME environment variable in .bashrc.
    For the GPU version, if the CUDA library is installed in /usr/local/cuda, add the following line to .bashrc. If CUDA is installed in another directory, change the line accordingly.
    export CUDA_HOME=/usr/local/cuda
    For the CPU version, remove any CUDA_HOME environment variable settings from .bashrc. It is also recommended to clear the variable in the current shell by running the following command:
    unset CUDA_HOME
  2. Install PyTorch (1.2-1.7.1) and Torchvision.

    The following code takes PyTorch 1.4 and torchvision 0.5.0 as an example. You can find detailed instructions for other versions on the PyTorch website.

    pip install torch==1.4.0 torchvision==0.5.0
  3. Install other dependencies.
    pip install -r requirements.txt
  4. Install vai_q_pytorch.
    cd ./pytorch_binding 
    python setup.py install   # for users
    python setup.py develop   # for developers
  5. Verify the installation.
    python -c "import pytorch_nndct"
Note: If the installed PyTorch version is lower than 1.4, import pytorch_nndct before importing torch in your script. This is caused by a PyTorch bug in versions prior to 1.4. Refer to PyTorch GitHub issues 28536 and 19668 for details.
import pytorch_nndct
import torch

Running vai_q_pytorch

vai_q_pytorch is designed to work as a PyTorch plugin. Xilinx provides simple APIs to introduce the FPGA-friendly quantization feature. For a well-defined model, you only need to add a few lines of code to obtain a quantized model object. To do so, follow these steps:

Preparing Files for vai_q_pytorch

Prepare the following files for vai_q_pytorch.
Table 1. Input Files for vai_q_pytorch
No.  Name                 Description
1    model.pth            Pre-trained PyTorch model, generally a .pth file.
2    model.py             A Python script including the float model definition.
3    calibration dataset  A subset of the training dataset containing 100 to 1000 images.

Modifying the Model Definition

To make a PyTorch model quantizable, modify the model definition so that it meets the following conditions. An example is available in the Vitis AI GitHub repository.
  1. The model to be quantized should include the forward method only. All other functions should be moved outside the class or moved to a derived class. These functions usually act as pre-processing and post-processing. If they are not moved, the API removes them in the quantized module, which causes unexpected behavior when forwarding the quantized module.
  2. The float model should pass the jit trace test. Set the float module to evaluation status, then use the torch.jit.trace function to test the float model. A minimal check is sketched below.
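    The following is a minimal sketch of the trace test; MyFloatModel and the input shape are placeholders that should match your own model.

    import torch

    model = MyFloatModel()                       # placeholder: your float model definition
    model.eval()                                 # set the module to evaluation status
    dummy_input = torch.randn([1, 3, 224, 224])  # placeholder input shape
    # If tracing completes without errors, the model passes the jit trace test.
    traced = torch.jit.trace(model, dummy_input)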

Adding vai_q_pytorch APIs to Float Scripts

Before quantization, you typically have a trained float model and Python scripts that evaluate its accuracy/mAP. The quantizer API replaces the float module with a quantized module, and the existing evaluate function then runs forward passes on the quantized module. Quantize calibration determines the quantization steps of tensors during the evaluation process if the quant_mode flag is set to "calib". After calibration, evaluate the quantized model by setting quant_mode to "test". A consolidated sketch of the flow follows the steps below.
  1. Import the vai_q_pytorch module.
    from pytorch_nndct.apis import torch_quantizer, dump_xmodel
  2. Generate a quantizer with quantization needed input and get the converted model.
    input = torch.randn([batch_size, 3, 224, 224])
    quantizer = torch_quantizer(quant_mode, model, (input))
    quant_model = quantizer.quant_model
    
  3. Forward a neural network with the converted model.
    acc1_gen, acc5_gen, loss_gen = evaluate(quant_model, val_loader, loss_fn)
  4. Output the quantization result and deploy the model.
    if quant_mode == 'calib':
        quantizer.export_quant_config()
    if deploy:
        quantizer.export_xmodel()
    
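Putting these steps together, a minimal sketch of the whole flow is shown below; evaluate, val_loader, and loss_fn are assumed to come from your existing float evaluation script.

import torch
from pytorch_nndct.apis import torch_quantizer

def run_quantization(model, evaluate, val_loader, loss_fn,
                     quant_mode='calib', batch_size=32, deploy=False):
    # Dummy input with the same shape as the real model input; values can be random.
    dummy_input = torch.randn([batch_size, 3, 224, 224])

    quantizer = torch_quantizer(quant_mode, model, (dummy_input))
    quant_model = quantizer.quant_model

    # Forward the quantized model with the existing float evaluation flow.
    acc1, acc5, loss = evaluate(quant_model, val_loader, loss_fn)

    if quant_mode == 'calib':
        quantizer.export_quant_config()   # export quantization steps after calibration
    if deploy:
        quantizer.export_xmodel()         # dump the xmodel for the Vitis AI compiler
    return acc1, acc5, loss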

Running Quantization and Getting the Result

Note: vai_q_pytorch log messages use special colors and a special keyword, "NNDCT". "NNDCT" is an internal project name that may change in future releases. vai_q_pytorch log message types include "error", "warning", and "note". Pay attention to vai_q_pytorch log messages to check the flow status.
  1. Run command with "--quant_mode calib" to quantize model.
    python resnet18_quant.py --quant_mode calib --subset_len 200

    Calibration forwarding reuses the float evaluation flow to minimize code changes from the float script. Any loss and accuracy values displayed at the end can be ignored. Note the colored log messages with the special keyword "NNDCT".

    It is important to control the number of iterations during quantization and evaluation. Generally, 100 to 1000 images are enough for quantization, while the whole validation set is required for evaluation. The iteration count can be controlled in the data-loading part (see the data-loading sketch after this list). In this case, the subset_len argument controls the number of images used for network forwarding. If the float evaluation script does not have an argument with a similar role, you must add one.

    If this quantization command runs successfully, two important files are generated in the output directory ./quantize_result.

    ResNet.py
    Converted vai_q_pytorch format model.
    Quant_info.json
    Quantization steps of tensors. Retain this file for evaluating quantized models.
  2. To evaluate the quantized model, run the following command:
    python resnet18_quant.py --quant_mode test

    The accuracy displayed after the command executes successfully is the accuracy of the quantized model.

  3. To generate the xmodel for compilation, the batch size should be 1. Set subset_len=1 to avoid redundant iterations and run the following command:
    python resnet18_quant.py --quant_mode test --subset_len 1 --batch_size=1 --deploy

    Ignore the loss and accuracy displayed in the log during this run. The xmodel file for the Vitis AI compiler is generated in the output directory ./quantize_result and is later used for deployment to the FPGA.

    ResNet_int.xmodel: deployed model
    Note: XIR is available in the "vitis-ai-pytorch" Conda environment in the Vitis AI Docker image, but if vai_q_pytorch is installed from source code, you must install XIR in advance. If XIR is not installed, the xmodel file cannot be generated and the command returns an error. However, you can still check the accuracy in the output log.
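
A minimal sketch of controlling the iteration count in the data-loading part is shown below; the dataset construction and the subset_len and batch_size argument names are assumptions that mirror the example script above.

import torch
from torchvision import datasets, transforms

def load_calib_data(data_dir, subset_len=200, batch_size=32):
    # Build the validation dataset as the float evaluation script would.
    dataset = datasets.ImageFolder(data_dir, transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor()]))
    # Restrict forwarding to subset_len images during quantize calibration.
    if subset_len:
        dataset = torch.utils.data.Subset(dataset, list(range(subset_len)))
    return torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, shuffle=False)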

Module Partial Quantization

Use module partial quantization if not all the sub-modules in a model need to be quantized. In addition to the general vai_q_pytorch APIs, the QuantStub/DeQuantStub operator pair is used to mark the parts to be quantized. The following example demonstrates how to quantize subm0 and subm2 but not subm1; a usage sketch follows the class definition.

from pytorch_nndct.nn import QuantStub, DeQuantStub

class WholeModule(torch.nn.Module):
    def __init__(self,...):
        super().__init__()
        self.subm0 = ...
        self.subm1 = ...
        self.subm2 = ...

        # define QuantStub/DeQuantStub submodules
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, input):
        input = self.quant(input) # begin of part to be quantized
        output0 = self.subm0(input)
        output0 = self.dequant(output0) # end of part to be quantized

        output1 = self.subm1(output0)

        output1 = self.quant(output1) # begin of part to be quantized
        output2 = self.subm2(output1)
        output2 = self.dequant(output2) # end of part to be quantized

        return output2
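
With the stubs in place, the quantization flow itself is unchanged. A brief usage sketch is shown below, assuming the placeholder sub-modules and constructor arguments of WholeModule have been filled in; the input shape is also an assumption.

import torch
from pytorch_nndct.apis import torch_quantizer

model = WholeModule()
dummy_input = torch.randn([1, 3, 224, 224])
quantizer = torch_quantizer('calib', model, (dummy_input))
quant_model = quantizer.quant_model   # subm0 and subm2 are quantized; subm1 stays in float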

vai_q_pytorch Fast Finetuning

Generally, there is a small accuracy loss after quantization, but for some networks such as MobileNets, the accuracy loss can be large. In this situation, first try fast finetune. If fast finetune still does not yield satisfactory results, quantize finetuning can be used to further improve the accuracy of the quantized models.

The AdaQuant algorithm [1] uses a small set of unlabeled data. It not only calibrates the activations but also finetunes the weights. The Vitis AI quantizer implements this algorithm and calls it "fast finetuning" or "advanced calibration." Though slightly slower, fast finetuning can achieve better performance than quantize calibration. Similar to quantize finetuning, each run of fast finetuning produces a different result.

Fast finetuning does not train the model and only needs a limited number of iterations. For classification models on the ImageNet dataset, 1000 images are enough. Fast finetuning only requires some modification of the model evaluation script; there is no need to set up an optimizer for training. To use fast finetuning, a function that iterates model forwarding is needed, and it is called during fast finetuning. Re-calibration with the original inference code is recommended.

A complete example is available in the open source code.

# fast finetune model or load finetuned parameter before test
if fast_finetune:
    ft_loader, _ = load_data(
        subset_len=1024,
        train=False,
        batch_size=batch_size,
        sample_method=None,
        data_dir=args.data_dir,
        model_name=model_name)
    if quant_mode == 'calib':
        quantizer.fast_finetune(evaluate, (quant_model, ft_loader, loss_fn))
    elif quant_mode == 'test':
        quantizer.load_ft_param()
For parameter finetuning and re-calibration of this ResNet18 example, run the following command:
python resnet18_quant.py --quant_mode calib --fast_finetune
To test finetuned quantized model accuracy, run the following command:
python resnet18_quant.py --quant_mode test --fast_finetune
Note:
  1. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.

vai_q_pytorch QAT

Assuming that there is a pre-defined model architecture, use the following steps to do quantization-aware training. Take the ResNet18 model from Torchvision as an example. The complete model definition is available in the open-source example.

  1. Check if there are non-module operations to be quantized. ResNet18 uses ‘+’ to add two tensors. Replace them with pytorch_nndct.nn.modules.functional.Add.
  2. Check if there are modules to be called multiple times. Usually, such modules have no weights; the most common one is the torch.nn.ReLU module. Define multiple such modules and then call them separately in the forward pass. The revised definition that meets these requirements is as follows:
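    # Assumed context from the full definition: import torch.nn as nn,
    # from pytorch_nndct.nn.modules import functional, and the conv3x3
    # helper from the Torchvision ResNet source.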
    class BasicBlock(nn.Module):
      expansion = 1
    
      def __init__(self,
                   inplanes,
                   planes,
                   stride=1,
                   downsample=None,
                   groups=1,
                   base_width=64,
                   dilation=1,
                   norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
          norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
          raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
          raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride
    
        # Use a functional module to replace ‘+’
        self.skip_add = functional.Add()
    
        # Additional defined module
        self.relu2 = nn.ReLU(inplace=True)
    
      def forward(self, x):
        identity = x
    
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
    
        out = self.conv2(out)
        out = self.bn2(out)
    
        if self.downsample is not None:
          identity = self.downsample(x)
        
        # Use function module instead of ‘+’
        # out += identity
        out = self.skip_add(out, identity)
        out = self.relu2(out)
    
        return out
    
  3. Insert QuantStub and DeQuantStub.

    Use QuantStub to quantize the inputs of the network and DeQuantStub to de-quantize the outputs of the network. Any sub-network from QuantStub to DeQuantStub in a forward pass will be quantized. Multiple QuantStub-DeQuantStub pairs are allowed.

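    # Assumes nndct_nn refers to pytorch_nndct.nn, for example:
    # import pytorch_nndct.nn as nndct_nn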
    class ResNet(nn.Module):
    
      def __init__(self,
                   block,
                   layers,
                   num_classes=1000,
                   zero_init_residual=False,
                   groups=1,
                   width_per_group=64,
                   replace_stride_with_dilation=None,
                   norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
          norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer
    
        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
          # each element in the tuple indicates if we should replace
          # the 2x2 stride with a dilated convolution instead
          replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
          raise ValueError(
              "replace_stride_with_dilation should be None "
              "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(
            3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(
            block, 128, layers[1], stride=2, dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(
            block, 256, layers[2], stride=2, dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(
            block, 512, layers[3], stride=2, dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
    
        self.quant_stub = nndct_nn.QuantStub()
        self.dequant_stub = nndct_nn.DeQuantStub()
    
        for m in self.modules():
          if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
          elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
            nn.init.constant_(m.weight, 1)
            nn.init.constant_(m.bias, 0)
    
        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
          for m in self.modules():
            if isinstance(m, Bottleneck):
              nn.init.constant_(m.bn3.weight, 0)
            elif isinstance(m, BasicBlock):
              nn.init.constant_(m.bn2.weight, 0)
    
      def forward(self, x):
        x = self.quant_stub(x)
    
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
    
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
    
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        x = self.dequant_stub(x)
        return x
    
  4. Use QAT APIs to create the quantizer and train the model.
    def _resnet(arch, block, layers, pretrained, progress, **kwargs):
      model = ResNet(block, layers, **kwargs)
      if pretrained:
        #state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
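        # model_urls[arch] is assumed here to point to a local .pth file rather than a URL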
        state_dict = torch.load(model_urls[arch])
        model.load_state_dict(state_dict)
      return model
    
    def resnet18(pretrained=False, progress=True, **kwargs):
      r"""ResNet-18 model from
        `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>'_
    
        Args:
            pretrained (bool): If True, returns a model pre-trained on ImageNet
            progress (bool): If True, displays a progress bar of the download to stderr
        """
      return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress,
                     **kwargs)
    
    model = resnet18(pretrained=True)
    
    # Generate dummy inputs.
    input = torch.randn([batch_size, 3, 224, 224], dtype=torch.float32)
    
    # Create a quantizer
    quantizer = torch_quantizer(quant_mode='calib',
                                module=model,
                                input_args=input,
                                bitwidth=8,
                                qat_proc=True)
    quantized_model = quantizer.quant_model
    optimizer = torch.optim.Adam(
            quantized_model.parameters(), 
            lr, 
            weight_decay=weight_decay)
    
    # Use the optimizer to train the model, just like a normal float model.
    …
    
  5. Convert the trained model to a deployable model.

    After training, dump the quantized model to an xmodel. (A batch size of 1 is required for xmodel compilation.)

    # vai_q_pytorch interface function: deploy the trained model and convert to xmodel
    # (needs at least 1 iteration of inference with batch_size=1)
    quantizer.deploy(quantized_model)
    deployable_model = quantizer.deploy_model
    val_dataset2 = torch.utils.data.Subset(val_dataset, list(range(1)))
    val_loader2 = torch.utils.data.DataLoader(
        val_dataset2,
        batch_size=1,
        shuffle=False,
        num_workers=workers,
        pin_memory=True)
    validate(val_loader2, deployable_model, criterion, gpu)
    quantizer.export_xmodel()
    

vai_q_pytorch QAT Requirements

Generally, there is a small accuracy loss after quantization, but for some networks such as MobileNets, the accuracy loss can be large. In this situation, first try fast finetune. If fast finetune does not yield satisfactory results, QAT can be used to further improve the accuracy of the quantized models.

The QAT APIs have some requirements for the model to be trained.

  1. All operations to be quantized must be instances of torch.nn.Module, rather than Torch functions or Python operators. For example, it is common to use ‘+’ to add two tensors in PyTorch, but this is not supported in QAT. Instead, replace ‘+’ with pytorch_nndct.nn.modules.functional.Add. Operations that need replacement are listed in the following table; a minimal replacement sketch follows this list.
    Table 2. Operation-Replacement Mapping
    Operation   Replacement
    +           pytorch_nndct.nn.modules.functional.Add
    -           pytorch_nndct.nn.modules.functional.Sub
    torch.add   pytorch_nndct.nn.modules.functional.Add
    torch.sub   pytorch_nndct.nn.modules.functional.Sub
    IMPORTANT: A module to be quantized cannot be called multiple times in the forward pass.
  2. Use pytorch_nndct.nn.QuantStub and pytorch_nndct.nn.DeQuantStub at the beginning and end of the network to be quantized. The network can be the complete network or a partial sub-network.
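
The following is a minimal sketch of these requirements; the toy TinyBlock model is an assumption used for illustration only.

import torch
import torch.nn as nn
from pytorch_nndct.nn import QuantStub, DeQuantStub
from pytorch_nndct.nn.modules import functional

class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant_stub = QuantStub()      # quantize the network input
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
        # A quantized module is called only once per forward pass, so two
        # separate ReLU instances are defined instead of reusing one.
        self.relu1 = nn.ReLU(inplace=True)
        self.relu2 = nn.ReLU(inplace=True)
        self.skip_add = functional.Add()   # replaces '+' / torch.add
        self.dequant_stub = DeQuantStub()  # de-quantize the network output

    def forward(self, x):
        x = self.quant_stub(x)
        out = self.relu1(self.conv(x))
        out = self.skip_add(out, x)        # instead of: out = out + x
        out = self.relu2(out)
        return self.dequant_stub(out)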

vai_q_pytorch Usage

This section introduces the usage of execution tools and APIs to implement quantization and generate a model to be deployed on the target hardware. The APIs in the module pytorch_binding/pytorch_nndct/apis/quant_api.py are as follows:

class torch_quantizer()

Class torch_quantizer creates a quantizer object.

class torch_quantizer(): 
  def __init__(self,
               quant_mode: str, # ['calib', 'test']
               module: torch.nn.Module,
               input_args: Union[torch.Tensor, Sequence[Any]] = None,
               state_dict_file: Optional[str] = None,
               output_dir: str = "quantize_result",
               bitwidth: int = 8,
               device: torch.device = torch.device("cuda"),
               qat_proc: bool = False): 

Arguments

quant_mode
A string that indicates which quantization mode the process is using: "calib" for calibration of quantization and "test" for evaluation of the quantized model.
module
Float module to be quantized.
input_args
Input tensor with the same shape as the real input of the float module to be quantized; the values can be random numbers.
state_dict_file
Float module pretrained parameters file. If the float module has already loaded its parameters, this argument does not need to be set.
output_dir
Directory for quantization results and intermediate files. Default is "quantize_result".
bitwidth
Global quantization bit width. Default is 8.
device
Run the model on GPU or CPU.
qat_proc
Turn on quantize finetuning, also known as quantization-aware training (QAT).
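
A brief sketch of creating a quantizer in test mode on the CPU is shown below; float_model, the input shape, and the parameter file path are placeholders.

import torch
from pytorch_nndct.apis import torch_quantizer

dummy_input = torch.randn([1, 3, 224, 224])
quantizer = torch_quantizer(
    quant_mode='test',
    module=float_model,                 # placeholder: your float model
    input_args=(dummy_input,),
    state_dict_file='float_model.pth',  # placeholder; omit if parameters are already loaded
    output_dir='quantize_result',
    bitwidth=8,
    device=torch.device('cpu'))
quant_model = quantizer.quant_model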

def export_quant_config(self)

This function exports information related to the quantization steps.

def export_quant_config(self):

def export_xmodel(self, output_dir, deploy_check)

This function exports the xmodel and dumps the output data of the operators for detailed data comparison.

def export_xmodel(self, output_dir, deploy_check):

Arguments

output_dir
Directory for quantization results and intermediate files. Default is "quantize_result".
deploy_check
Flag that controls dumping of data for detailed data comparison. Default is False. If it is set to True, binary format data is dumped into the output_dir/deploy_check_data_int/ location.
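
A short usage sketch, assuming a quantizer created in "test" mode as described above:

# Dump the xmodel and per-operator output data for detailed comparison.
quantizer.export_xmodel(output_dir='quantize_result', deploy_check=True)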