Editor’s Note: This content is contributed by Tiantian Han, Sr. Software Engineer, and Tianyu Zhang, Sr. Design Engineer, at Xilinx


In resource-limited, high-performance, low-latency scenarios, AI inference must deliver lower power consumption and higher performance without sacrificing accuracy; this is especially critical in edge applications and low-latency ADAS. While 8-bit quantization can preserve high accuracy, it requires more hardware resources, and extremely low-bit quantization, such as binary or ternary, often suffers large accuracy degradation. We therefore propose a full-process, hardware-friendly quantization solution with 4-bit activations and 4-bit weights (4A4W) as a better accuracy/resource trade-off. With INT4 optimization, Xilinx achieves up to a 77% performance boost on real hardware compared with INT8, while maintaining accuracy comparable to that of full-precision models.
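To make the idea concrete, here is a minimal Python sketch of uniform symmetric 4-bit quantization. It is not the white paper's full-process pipeline, which also covers quantization-aware training and other hardware-friendly constraints; the power-of-two restriction on the scale shown here is an assumption commonly made in hardware-friendly schemes so that rescaling reduces to a bit shift.

```python
import numpy as np

def quantize_4bit(x, power_of_two_scale=True):
    """Uniform symmetric quantization to signed 4-bit values in [-8, 7].

    A minimal sketch only: a real 4A4W pipeline calibrates or learns the
    per-layer scale during quantization-aware training.
    """
    qmax = 2 ** (4 - 1) - 1                      # 7 for signed 4-bit
    scale = max(np.abs(x).max() / qmax, 1e-8)    # avoid a zero scale
    if power_of_two_scale:
        # Hardware-friendly assumption: restrict the scale to a power of
        # two so that dequantization becomes a simple bit shift.
        scale = 2.0 ** np.round(np.log2(scale))
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(8).astype(np.float32)
q, s = quantize_4bit(x)
print("max abs error:", np.abs(x - dequantize(q, s)).max())
```

In practice the scale for each layer would be derived from calibration data or learned during training rather than taken from a single tensor as above.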


The white paper describes our implementation of a low-precision CNN accelerator, the 4-bit XDPU, on the Zynq® UltraScale+™ MPSoC and Zynq-7000 SoC families (16nm and 28nm), which takes full advantage of these devices’ DSP capabilities by efficiently mapping convolutional computations. This solution achieves 2X solution-level performance over the 8-bit XDPU. On a 2D detection task in an ADAS system, the implementation reaches an inference speed of 230fps on a Zynq UltraScale+ MPSoC ZCU102 board, a 1.52X performance gain over the 8-bit XDPU.

The white paper focuses on:

  • A full-process hardware-friendly quantization solution for 4A4W that achieves accuracy comparable to full-precision models.
  • An extension of our low-bit quantization solution to different real-world computer vision tasks, demonstrating its effectiveness and generality.
  • Efficient mapping of convolution calculations onto the FPGA’s DSP slices. By packing four channels of MACs into one DSP slice per clock cycle, we significantly reduce the resources required for convolution (see the sketch after this list).
  • Efficient use of on-chip RAM. We reorganize on-chip memory management to handle data of different precisions in low-precision networks.
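The DSP-packing idea in the third bullet can be sketched in software. The snippet below is a conceptual Python illustration, not the actual DSP48E2 mapping from the white paper: it shows how several low-bit products can share one wide multiplier when each partial product is confined to its own bit field. Unsigned 4-bit operands are assumed for simplicity; the real 4-bit XDPU design also handles signed operands and accumulates results on the DSP slice.

```python
# Conceptual illustration (Python, not RTL). With unsigned 4-bit
# operands, each product w_i * a is at most 15 * 15 = 225, so it fits
# in an 8-bit lane and never carries into its neighbor. One wide
# multiplication therefore yields all four products at once.

LANE = 8  # bits per product lane; 15 * 15 = 225 < 2**8

def packed_mac(weights, activation):
    """Multiply one unsigned 4-bit activation by four unsigned 4-bit
    weights using a single wide multiplication."""
    assert all(0 <= w < 16 for w in weights) and 0 <= activation < 16
    packed = sum(w << (i * LANE) for i, w in enumerate(weights))
    product = packed * activation        # one "multiplier" invocation
    return [(product >> (i * LANE)) & (2 ** LANE - 1)
            for i in range(len(weights))]

weights, act = [3, 15, 0, 9], 11
assert packed_mac(weights, act) == [w * act for w in weights]
```

On the FPGA, the lanes are constrained by the width of the DSP slice's multiplier rather than by Python's arbitrary-precision integers, which is why careful operand placement and correction logic are needed in the real design.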


To learn more about the INT4 Optimization solution on Xilinx devices, please download and read the following white paper.

Note: We would like to thank Dong Li, Dongliang Xie, Guangdong Liu, Lu Tian, Tiantian Han, Tianyu Zhang, and Yi Shan for their contributions to this white paper.


Original Date: 06-30-2020