PyTorch Version (vai_q_pytorch)
Installing vai_q_pytorch
vai_q_pytorch has GPU and CPU versions. It supports PyTorch versions 1.2 through 1.7.1 but does not support PyTorch data parallelism. There are two ways to install vai_q_pytorch:
Install Using Docker Containers
Vitis AI provides a Docker container for quantization tools, including vai_q_pytorch. After running a GPU/CPU container, activate the Conda environment vitis-ai-pytorch.
conda activate vitis-ai-pytorch
Some models require a PyTorch version other than the one shipped in vitis-ai-pytorch; for these, create a new Conda environment instead of using vitis-ai-pytorch directly. The pt_pointpillars_kitti_12000_100_10.8G_1.3 model in the Xilinx Model Zoo is an example of this. A new Conda environment with a specified PyTorch version (1.2~1.7.1) can be created using the /opt/vitis_ai/scripts/replace_pytorch.sh script. This script clones the vitis-ai-pytorch Conda environment, uninstalls the original PyTorch, Torchvision, and vai_q_pytorch packages, and then installs the specified versions of PyTorch and Torchvision and reinstalls vai_q_pytorch from source code.
Install from the Source Code
vai_q_pytorch is a Python package designed to work as a PyTorch plugin. It is open source and part of Vitis_AI_Quantizer. Installing vai_q_pytorch in a Conda environment is recommended. To do so, follow these steps:
- Add the CUDA_HOME environment variable in .bashrc. For the GPU version, if the CUDA library is installed in /usr/local/cuda, add the following line to .bashrc. If CUDA is installed in another directory, change the line accordingly.
export CUDA_HOME=/usr/local/cuda
For the CPU version, remove any CUDA_HOME environment variable settings from your .bashrc. It is also recommended to clean up the variable in the current shell by running the following command:
unset CUDA_HOME
- Install PyTorch (1.2-1.7.1) and Torchvision.
The following code takes PyTorch 1.4 and torchvision 0.5.0 as an example. You can find detailed instructions for other versions on the PyTorch website.
pip install torch==1.4.0 torchvision==0.5.0
- Install other dependencies.
pip install -r requirements.txt
- Install vai_q_pytorch.
cd ./pytorch_binding
python setup.py install   # for user
python setup.py develop   # for developer
- Verify the installation.
python -c "import pytorch_nndct"
Running vai_q_pytorch
vai_q_pytorch is designed to work as a PyTorch plugin. Xilinx provides simple APIs to introduce FPGA-friendly quantization. For a well-defined model, you only need to add a few lines of code to get a quantized model object. To do so, follow these steps:
Preparing Files for vai_q_pytorch
No. | Name | Description |
---|---|---|
1 | model.pth | Pre-trained PyTorch model, generally pth file. |
2 | model.py | A Python script including float model definition. |
3 | calibration dataset | A subset of the training dataset containing 100 to 1000 images. |
Modifying the Model Definition
- The model to be quantized should include only the forward method. All other functions should be moved outside the model or to a derived class. These functions usually perform pre-processing and post-processing. If they are not moved out, the API removes them from the quantized module, which causes unexpected behavior when forwarding the quantized module.
- The float model should pass the jit trace test. Set the float module to evaluation status, then use the torch.jit.trace function to test the float model, as sketched below.
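A minimal sketch of this check, assuming a placeholder float model class MyFloatModel and a 224x224 RGB input shape (both are illustrative, not part of vai_q_pytorch):
import torch

model = MyFloatModel()                      # placeholder: your float model definition
model.eval()                                # set the float module to evaluation status
dummy_input = torch.randn(1, 3, 224, 224)   # assumed input shape
# The model passes the jit trace test if tracing completes without errors.
traced_model = torch.jit.trace(model, dummy_input)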
Adding vai_q_pytorch APIs to Float Scripts
- Import the vai_q_pytorch module.
from pytorch_nndct.apis import torch_quantizer, dump_xmodel
- Generate a quantizer with the required quantization input and get the converted model.
input = torch.randn([batch_size, 3, 224, 224])
quantizer = torch_quantizer(quant_mode, model, (input))
quant_model = quantizer.quant_model
- Forward a neural network with the converted model.
acc1_gen, acc5_gen, loss_gen = evaluate(quant_model, val_loader, loss_fn)
- Output the quantization result and deploy the model.
if quant_mode == 'calib':
    quantizer.export_quant_config()
if deploy:
    quantizer.export_xmodel()
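Putting the four steps together, the following is a minimal end-to-end sketch; the evaluate function, val_loader, loss_fn, batch_size, and deploy flag are assumptions standing in for the corresponding objects in your float evaluation script.
import torch
from pytorch_nndct.apis import torch_quantizer

def run_quantization(model, evaluate, val_loader, loss_fn,
                     quant_mode='calib', deploy=False, batch_size=32):
    # Dummy input with the same shape as the real input; values can be random.
    dummy_input = torch.randn([batch_size, 3, 224, 224])

    # Generate a quantizer and get the converted model.
    quantizer = torch_quantizer(quant_mode, model, (dummy_input,))
    quant_model = quantizer.quant_model

    # Forward the converted model with the existing float evaluation flow.
    acc1, acc5, loss = evaluate(quant_model, val_loader, loss_fn)

    # Output the quantization result and, optionally, the xmodel for deployment.
    if quant_mode == 'calib':
        quantizer.export_quant_config()
    if deploy:
        quantizer.export_xmodel()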
Running Quantization and Getting the Result
- Run the command with "--quant_mode calib" to quantize the model.
python resnet18_quant.py --quant_mode calib --subset_len 200
During calibration, the forward pass borrows the float evaluation flow to minimize code changes relative to the float script. If loss and accuracy messages are displayed at the end, you can ignore them. Note the colored log messages with the special keyword "NNDCT".
It is important to control the number of iterations during quantization and evaluation. Generally, 100 to 1000 images are enough for quantization, while the whole validation set is required for evaluation. The iteration count can be controlled in the data loading part. In this case, the subset_len argument controls the number of images used for network forwarding (a sketch of such an argument follows the note below). If the float evaluation script does not have an argument with a similar role, you must add one. If this quantization command runs successfully, two important files are generated in the output directory ./quantize_result:
- ResNet.py
- Converted vai_q_pytorch format model.
- Quant_info.json
- Quantization steps of tensors. Retain this file for evaluating quantized models.
- To evaluate the quantized model, run the following command:
python resnet18_quant.py --quant_mode test
The accuracy displayed after the command has executed successfully is the accuracy of the quantized model.
- To generate the xmodel for compilation, the batch size must be 1. Set subset_len=1 to avoid redundant iterations, and run the following command:
python resnet18_quant.py --quant_mode test --subset_len 1 --batch_size=1 --deploy
Ignore the loss and accuracy displayed in the log during this run. The xmodel file for the Vitis AI compiler is generated in the output directory ./quantize_result and is then used for deployment to the FPGA.
ResNet_int.xmodel: deployed model
Note: XIR is available in the "vitis-ai-pytorch" Conda environment in the Vitis AI Docker, but if vai_q_pytorch is installed from source code, you have to install XIR in advance. If XIR is not installed, the xmodel file cannot be generated and the command returns an error. However, you can still check the accuracy in the output log.
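As referenced above, a subset_len-style argument can be implemented in the data loading code with torch.utils.data.Subset. The following is a minimal sketch; the load_data name and its arguments are placeholders, not part of vai_q_pytorch.
from torch.utils.data import DataLoader, Subset

def load_data(dataset, subset_len=None, batch_size=32):
    # Limit the number of images used for network forwarding during calibration.
    if subset_len is not None:
        dataset = Subset(dataset, list(range(subset_len)))
    return DataLoader(dataset, batch_size=batch_size, shuffle=False)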
Module Partial Quantization
You can use module partial quantization if not all the sub-modules in a model need to be quantized. Besides the general vai_q_pytorch APIs, the QuantStub/DeQuantStub operator pair can be used to realize it. The following example demonstrates how to quantize subm0 and subm2, but not subm1.
import torch
from pytorch_nndct.nn import QuantStub, DeQuantStub

class WholeModule(torch.nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.subm0 = ...
        self.subm1 = ...
        self.subm2 = ...

        # define QuantStub/DeQuantStub submodules
        self.quant = QuantStub()
        self.dequant = DeQuantStub()

    def forward(self, input):
        input = self.quant(input)        # begin of part to be quantized
        output0 = self.subm0(input)
        output0 = self.dequant(output0)  # end of part to be quantized

        output1 = self.subm1(output0)

        output1 = self.quant(output1)    # begin of part to be quantized
        output2 = self.subm2(output1)
        output2 = self.dequant(output2)  # end of part to be quantized
        return output2
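The partially quantized module is then handled like any other model with the general vai_q_pytorch APIs. A minimal sketch, assuming placeholder constructor arguments and input shape:
from pytorch_nndct.apis import torch_quantizer

model = WholeModule()                         # placeholder: constructor arguments omitted
dummy_input = torch.randn([1, 3, 224, 224])   # assumed input shape
quantizer = torch_quantizer('calib', model, (dummy_input,))
quant_model = quantizer.quant_model           # only subm0 and subm2 are quantized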
vai_q_pytorch Fast Finetuning
Generally, there is a small accuracy loss after quantization, but for some networks such as MobileNets, the accuracy loss can be large. In this situation, first try fast finetune. If fast finetune still does not yield satisfactory results, quantize finetuning can be used to further improve the accuracy of the quantized models.
The AdaQuant algorithm [1] uses a small set of unlabeled data. It not only calibrates the activations but also finetunes the weights. The Vitis AI quantizer implements this algorithm and calls it "fast finetuning" or "advanced calibration." Though slightly slower, fast finetuning can achieve better performance than quantize calibration. Similar to quantize finetuning, each run of fast finetuning produces a different result.
Fast finetuning does not train the model and only needs a limited number of iterations. For classification models on the ImageNet dataset, 1000 images are enough. Fast finetuning only requires some modification of the model evaluation script; there is no need to set up an optimizer for training. To use fast finetuning, a function for model forwarding iteration is needed; it is called during fast finetuning. Re-calibration with the original inference code is recommended.
A complete example is available in the open-source example code.
# fast finetune model or load finetuned parameter before test
if fast_finetune == True:
ft_loader, _ = load_data(
subset_len=1024,
train=False,
batch_size=batch_size,
sample_method=None,
data_dir=args.data_dir,
model_name=model_name)
if quant_mode == 'calib':
quantizer.fast_finetune(evaluate, (quant_model, ft_loader, loss_fn))
elif quant_mode == 'test':
quantizer.load_ft_param()
python resnet18_quant.py --quant_mode calib --fast_finetune
python resnet18_quant.py --quant_mode test --fast_finetune
1. Itay Hubara et al., Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming, arXiv:2006.10518, 2020.
vai_q_pytorch QAT
Assuming that there is a pre-defined model architecture, use the following steps to perform quantization aware training. The ResNet18 model from Torchvision is used as an example; the complete model definition can be found in the Torchvision source code.
- Check if there are non-module operations to be quantized. ResNet18 uses ‘+’ to add two tensors. Replace them with pytorch_nndct.nn.modules.functional.Add.
- Check if there are modules that are called multiple times. Usually such modules have no weights; the most common one is the torch.nn.ReLU module. Define one such module for each call site and call them separately in the forward pass. The revised definition that meets the requirements is as follows:
class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None,
                 groups=1, base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu1 = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

        # Use a functional module to replace ‘+’
        self.skip_add = functional.Add()

        # Additional defined module
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        # Use the functional module instead of ‘+’
        # out += identity
        out = self.skip_add(out, identity)
        out = self.relu2(out)

        return out
- Insert QuantStub and DeQuantStub. Use QuantStub to quantize the inputs of the network and DeQuantStub to de-quantize the outputs of the network. Any sub-network from QuantStub to DeQuantStub in a forward pass is quantized. Multiple QuantStub-DeQuantStub pairs are allowed.
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError(
                "replace_stride_with_dilation should be None "
                "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(
            3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(
            block, 128, layers[1], stride=2, dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(
            block, 256, layers[2], stride=2, dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(
            block, 512, layers[3], stride=2, dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        self.quant_stub = nndct_nn.QuantStub()
        self.dequant_stub = nndct_nn.DeQuantStub()

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def forward(self, x):
        x = self.quant_stub(x)
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        x = self.dequant_stub(x)
        return x
- Use QAT APIs to create the quantizer and train the model.
def _resnet(arch, block, layers, pretrained, progress, **kwargs):
    model = ResNet(block, layers, **kwargs)
    if pretrained:
        # state_dict = load_state_dict_from_url(model_urls[arch], progress=progress)
        state_dict = torch.load(model_urls[arch])
        model.load_state_dict(state_dict)
    return model

def resnet18(pretrained=False, progress=True, **kwargs):
    r"""ResNet-18 model from
    `"Deep Residual Learning for Image Recognition" <https://arxiv.org/pdf/1512.03385.pdf>`_

    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2], pretrained, progress, **kwargs)

model = resnet18(pretrained=True)

# Generate dummy inputs.
input = torch.randn([batch_size, 3, 224, 224], dtype=torch.float32)

# Create a quantizer
quantizer = torch_quantizer(quant_mode='calib',
                            module=model,
                            input_args=input,
                            bitwidth=8,
                            qat_proc=True)
quantized_model = quantizer.quant_model

optimizer = torch.optim.Adam(
    quantized_model.parameters(), lr, weight_decay=weight_decay)

# Use the optimizer to train the model, just like a normal float model.
…
- Convert the trained model to a deployable model.
After training, dump the quantized model to xmodel. (A batch size of 1 is required for xmodel compilation.)
# vai_q_pytorch interface function: deploy the trained model and convert to xmodel
# need at least 1 iteration of inference with batch_size=1
quantizer.deploy(quantized_model)
deployable_model = quantizer.deploy_model

val_dataset2 = torch.utils.data.Subset(val_dataset, list(range(1)))
val_loader2 = torch.utils.data.DataLoader(
    val_dataset2,   # use the one-sample subset so only one inference iteration runs
    batch_size=1,
    shuffle=False,
    num_workers=workers,
    pin_memory=True)
validate(val_loader2, deployable_model, criterion, gpu)
quantizer.export_xmodel()
vai_q_pytorch QAT Requirements
Generally, there is a small accuracy loss after quantization, but for some networks such as MobileNets, the accuracy loss can be large. In this situation, first try fast finetune. If fast finetune does not yield satisfactory results, QAT can be used to further improve the accuracy of the quantized models.
The QAT APIs have some requirements for the model to be trained.
- All operations to be quantized must be instances of torch.nn.Module, rather than Torch functions or Python operators. For example, it is common to use ‘+’ to add two tensors in PyTorch. However, this is not supported in QAT. Thus, replace ‘+’ with pytorch_nndct.nn.modules.functional.Add. Operations that need replacement are listed in the following table.

Table 2. Operation-Replacement Mapping

Operation | Replacement |
---|---|
+ | pytorch_nndct.nn.modules.functional.Add |
- | pytorch_nndct.nn.modules.functional.Sub |
torch.add | pytorch_nndct.nn.modules.functional.Add |
torch.sub | pytorch_nndct.nn.modules.functional.Sub |

IMPORTANT: A module to be quantized cannot be called multiple times in the forward pass.
- Use pytorch_nndct.nn.QuantStub and pytorch_nndct.nn.DeQuantStub at the beginning and end of the network to be quantized. The network can be the complete network or a partial sub-network.
vai_q_pytorch Usage
This section introduces the usage of execution tools and APIs to implement quantization and generate a model that can be deployed on the target hardware. The APIs in the module pytorch_binding/pytorch_nndct/apis/quant_api.py are as follows:
class torch_quantizer()
Class torch_quantizer creates a quantizer object.
class torch_quantizer():
def __init__(self,
quant_mode: str, # ['calib', 'test']
module: torch.nn.Module,
input_args: Union[torch.Tensor, Sequence[Any]] = None,
state_dict_file: Optional[str] = None,
output_dir: str = "quantize_result",
bitwidth: int = 8,
device: torch.device = torch.device("cuda"),
qat_proc: bool = False):
Arguments
- quant_mode
- A string that indicates which quantization mode the process is using: "calib" for calibration of quantization and "test" for evaluation of the quantized model.
- module
- Float module to be quantized.
- input_args
- Input tensor with the same shape as the real input of the float module to be quantized; the values can be random numbers.
- state_dict_file
- Float module pretrained parameters file. If the float module has already loaded its parameters, this does not need to be set.
- output_dir
- Directory for the quantization result and intermediate files. Default is "quantize_result".
- bitwidth
- Global quantization bit width. Default is 8.
- device
- Run the model on the GPU or CPU.
- qat_proc
- Turn on quantize finetuning, also known as quantization-aware training (QAT).
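A minimal sketch of constructing a quantizer in "test" mode on the CPU with explicit keyword arguments; the float_model object and input shape are placeholders:
import torch
from pytorch_nndct.apis import torch_quantizer

dummy_input = torch.randn([1, 3, 224, 224])   # assumed input shape
quantizer = torch_quantizer(quant_mode='test',
                            module=float_model,            # placeholder float model
                            input_args=dummy_input,
                            output_dir='quantize_result',
                            bitwidth=8,
                            device=torch.device('cpu'))
quant_model = quantizer.quant_model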
def export_quant_config(self)
This function exports information related to the quantization steps.
def export_quant_config(self):
def export_xmodel(self, output_dir, deploy_check)
This function exports the xmodel and dumps the output data of the operators for detailed data comparison.
def export_xmodel(self, output_dir, deploy_check):
Arguments
- output_dir
- Directory for the quantization result and intermediate files. Default is "quantize_result".
- deploy_check
- Flag that controls dumping of data for detailed data comparison. Default is False. If set to True, binary-format data is dumped to the output_dir/deploy_check_data_int/ location.
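A minimal sketch of calling this function with the deploy check enabled, using the default output directory shown above:
# Export the xmodel and dump operator output data for detailed comparison.
quantizer.export_xmodel(output_dir='quantize_result', deploy_check=True)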