DPU Configuration
Introduction
The DPU IP provides some user-configurable parameters to optimize resource usage and customize different features. Different configurations can be selected for DSP slices, LUT, block RAM, and UltraRAM usage based on the amount of available programmable logic resources. There are also options for additional functions, such as channel augmentation, average pooling, depthwise convolution, and softmax. Furthermore, there is an option to determine the number of DPU cores that will be instantiated in a single DPU IP.
The deep neural network features and the associated parameters supported by the DPU are shown in the following table.
A configuration file named arch.json is generated during the Vivado or Vitis flow. The arch.json file is used by the Vitis AI Compiler for model compilation. For more information about the Vitis AI Compiler, refer to the Vitis AI User Guide (UG1414).
In the Vivado flow, the arch.json file is located at $TRD_HOME/prj/Vivado/srcs/top/ip/top_dpu_0/arch.json. In the Vitis flow, the arch.json file is located at $TRD_HOME/prj/Vitis/binary_container_1/link/vivado/vpl/prj/prj.gen/sources_1/bd/zcu102_base/ip/zcu102_base_DPUCZDX8G_1_0/arch.json.
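The two locations above can be resolved from a script. The following is a minimal sketch, assuming `$TRD_HOME` is available as a string; the helper names `arch_json_path` and `load_arch_json` are hypothetical, and no assumption is made about the keys stored in the file.

```python
import json

# Relative locations quoted in the text above for each flow.
ARCH_JSON_RELATIVE = {
    "vivado": "prj/Vivado/srcs/top/ip/top_dpu_0/arch.json",
    "vitis": (
        "prj/Vitis/binary_container_1/link/vivado/vpl/prj/prj.gen/"
        "sources_1/bd/zcu102_base/ip/zcu102_base_DPUCZDX8G_1_0/arch.json"
    ),
}


def arch_json_path(trd_home, flow):
    """Return the expected arch.json path for the 'vivado' or 'vitis' flow."""
    relative = ARCH_JSON_RELATIVE[flow.lower()]
    return trd_home.rstrip("/") + "/" + relative


def load_arch_json(trd_home, flow):
    """Load arch.json as a dictionary without interpreting its contents."""
    with open(arch_json_path(trd_home, flow), encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `arch_json_path("/opt/trd", "Vivado")` yields `/opt/trd/prj/Vivado/srcs/top/ip/top_dpu_0/arch.json`.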
Features | Parameter | Description |
---|---|---|
Convolution | Kernel Sizes | kernel_w: 1~16, kernel_h: 1~16 |
Convolution | Strides | stride_w: 1~8, stride_h: 1~8 |
Convolution | Padding_w | 0~(kernel_w-1) |
Convolution | Padding_h | 0~(kernel_h-1) |
Convolution | Input Size | Arbitrary |
Convolution | Input Channel | 1~256 * channel_parallel |
Convolution | Output Channel | 1~256 * channel_parallel |
Convolution | Activation | ReLU, ReLU6, and LeakyReLU |
Convolution | Dilation | dilation * input_channel ≤ 256 * channel_parallel && stride_w == 1 && stride_h == 1 |
Convolution | Constraint* | kernel_w * kernel_h * ceil(input_channel / channel_parallel) ≤ bank_depth/2 |
Depthwise Convolution | Kernel Sizes | kernel_w: 1~16, kernel_h: 1~16 |
Depthwise Convolution | Strides | stride_w: 1~8, stride_h: 1~8 |
Depthwise Convolution | Padding_w | 0~(kernel_w-1) |
Depthwise Convolution | Padding_h | 0~(kernel_h-1) |
Depthwise Convolution | Input Size | Arbitrary |
Depthwise Convolution | Input Channel | 1~256 * channel_parallel |
Depthwise Convolution | Output Channel | 1~256 * channel_parallel |
Depthwise Convolution | Activation | ReLU, ReLU6 |
Depthwise Convolution | Dilation | dilation * input_channel ≤ 256 * channel_parallel && stride_w == 1 && stride_h == 1 |
Depthwise Convolution | Constraint* | kernel_w * kernel_h * ceil(input_channel / channel_parallel) ≤ bank_depth/2 |
Deconvolution | Kernel Sizes | kernel_w: 1~16, kernel_h: 1~16 |
Deconvolution | Stride_w | stride_w * output_channel ≤ 256 * channel_parallel |
Deconvolution | Stride_h | Arbitrary |
Deconvolution | Padding_w | 0~(kernel_w-1) |
Deconvolution | Padding_h | 0~(kernel_h-1) |
Deconvolution | Input Size | Arbitrary |
Deconvolution | Input Channel | 1~256 * channel_parallel |
Deconvolution | Output Channel | 1~256 * channel_parallel |
Deconvolution | Activation | ReLU, ReLU6, and LeakyReLU |
Max Pooling | Kernel Sizes | kernel_w: 1~8, kernel_h: 1~8 |
Max Pooling | Strides | stride_w: 1~8, stride_h: 1~8 |
Max Pooling | Padding | padding_w: 0~(kernel_w-1), padding_h: 0~(kernel_h-1) |
Average Pooling | Kernel Sizes | Square sizes only: 2x2, 3x3, ..., 8x8 |
Average Pooling | Strides | stride_w: 1~8, stride_h: 1~8 |
Average Pooling | Padding | padding_w: 0~(kernel_w-1), padding_h: 0~(kernel_h-1) |
Max Reduce (max pooling for large size) | Kernel Sizes | kernel_w: 1~256, kernel_h: 1~256 |
Max Reduce (max pooling for large size) | Strides | Equal to kernel size |
Max Reduce (max pooling for large size) | Padding | Not supported |
Elementwise-Sum | Input Channel | 1~256 * channel_parallel |
Elementwise-Sum | Input Size | Arbitrary |
Elementwise-Sum | Feature Map Number | 1~4 |
Elementwise-Multiply | Input Channel | 1~256 * channel_parallel |
Elementwise-Multiply | Input Size | Arbitrary |
Elementwise-Multiply | Feature Map Number | 2 |
Concat | Output Channel | 1~256 * channel_parallel |
Reorg | Strides | stride * stride * input_channel ≤ 256 * channel_parallel |
Batch Normalization | - | - |
Fully Connected | Input_channel | Input_channel ≤ 2048 * channel_parallel |
Fully Connected | Output_channel | Arbitrary |
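The convolution limits in the table can be expressed as a simple pre-check. The following is a minimal sketch, not part of the DPU tooling: the function name `conv_supported` is hypothetical, the stride range of 1~8 is read from the Strides row, and `bank_depth` is passed in by the caller because its value is not defined in this table (2048 is used below only as an illustrative value).

```python
import math


def conv_supported(kernel_w, kernel_h, stride_w, stride_h,
                   input_channel, output_channel,
                   channel_parallel, bank_depth):
    """Check a convolution layer against the limits listed in the table."""
    max_channel = 256 * channel_parallel
    # Number of channel groups the input is split into.
    channel_groups = math.ceil(input_channel / channel_parallel)
    checks = [
        1 <= kernel_w <= 16 and 1 <= kernel_h <= 16,   # Kernel Sizes
        1 <= stride_w <= 8 and 1 <= stride_h <= 8,     # Strides
        1 <= input_channel <= max_channel,             # Input Channel
        1 <= output_channel <= max_channel,            # Output Channel
        # Constraint row: kernel area times channel groups must fit
        # in half of the bank depth.
        kernel_w * kernel_h * channel_groups <= bank_depth / 2,
    ]
    return all(checks)


# A 3x3, 64-channel layer with channel_parallel = 16 and an assumed
# bank_depth of 2048 passes; a 16x16 kernel over 4096 channels does not.
print(conv_supported(3, 3, 1, 1, 64, 64, 16, 2048))
print(conv_supported(16, 16, 1, 1, 4096, 64, 16, 2048))
```

Padding and dilation are left out to keep the sketch short; they would be added as further entries in `checks`.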
Configuration Options
The DPU can be configured with some predefined options, which include the number of DPU cores, the convolution architecture, DSP cascade, DSP usage, and UltraRAM usage. These options allow you to set the DSP slice, LUT, block RAM, and UltraRAM usage. The following figure shows the configuration page of the DPU.
The following sections describe the configuration options.
Number of DPU Cores
A maximum of four cores can be selected in one DPU IP. Multiple DPU cores can be used to achieve higher performance, at the cost of more programmable logic resources.
Contact your local Xilinx sales representative if you require more than four cores.
Architecture of the DPU
The DPU IP can be configured with various convolution architectures which are related to the parallelism of the convolution unit. The architectures for the DPU IP include B512, B800, B1024, B1152, B1600, B2304, B3136, and B4096.
There are three dimensions of parallelism in the DPU convolution architecture: pixel parallelism, input channel parallelism, and output channel parallelism. The input channel parallelism is always equal to the output channel parallelism (this is equivalent to channel_parallel in the previous table).
The different architectures require different programmable logic resources. The larger architectures can achieve higher performance with more resources. The parallelism for the different architectures is listed in the following table.
DPU Architecture | Pixel Parallelism (PP) | Input Channel Parallelism (ICP) | Output Channel Parallelism (OCP) | Peak Ops (operations per clock) |
---|---|---|---|---|
B512 | 4 | 8 | 8 | 512 |
B800 | 4 | 10 | 10 | 800 |
B1024 | 8 | 8 | 8 | 1024 |
B1152 | 4 | 12 | 12 | 1152 |
B1600 | 8 | 10 | 10 | 1600 |
B2304 | 8 | 12 | 12 | 2304 |
B3136 | 8 | 14 | 14 | 3136 |
B4096 | 8 | 16 | 16 | 4096 |
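The number in each architecture name matches twice the product of the three parallelism values, which is consistent with counting one multiply and one accumulate per parallel lane. The following sketch checks that relationship against the table; the factor of two is inferred from the table values rather than stated in the text, and the names `ARCHITECTURES` and `peak_ops` are hypothetical.

```python
# (PP, ICP, OCP) for each architecture, copied from the table above.
ARCHITECTURES = {
    "B512": (4, 8, 8),
    "B800": (4, 10, 10),
    "B1024": (8, 8, 8),
    "B1152": (4, 12, 12),
    "B1600": (8, 10, 10),
    "B2304": (8, 12, 12),
    "B3136": (8, 14, 14),
    "B4096": (8, 16, 16),
}


def peak_ops(pp, icp, ocp):
    """Peak operations per clock, assuming 2 ops (multiply + add) per lane."""
    return 2 * pp * icp * ocp


for name, (pp, icp, ocp) in ARCHITECTURES.items():
    # The numeric suffix of the name equals the computed peak ops.
    assert peak_ops(pp, icp, ocp) == int(name[1:])
    print(name, peak_ops(pp, icp, ocp))
```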
Resource Use
The resource utilization of a sample DPU single core project is as follows. The data is based on the ZCU102 platform with low RAM usage, depthwise convolution, average pooling, channel augmentation, leaky ReLU + ReLU6 features, and low DSP usage.
In the following tables, the triplet (PPxICPxOCP) after the architecture refers to the pixel parallelism, input channel parallelism, and output channel parallelism.
DPU Architecture | LUT | Register | Block RAM | DSP |
---|---|---|---|---|
B512 (4x8x8) | 27893 | 35435 | 73.5 | 78 |
B800 (4x10x10) | 30468 | 42773 | 91.5 | 117 |
B1024 (8x8x8) | 34471 | 50763 | 105.5 | 154 |
B1152 (4x12x12) | 33238 | 49040 | 123 | 164 |
B1600 (8x10x10) | 38716 | 63033 | 127.5 | 232 |
B2304 (8x12x12) | 42842 | 73326 | 167 | 326 |
B3136 (8x14x14) | 47667 | 85778 | 210 | 436 |
B4096 (8x16x16) | 53540 | 105008 | 257 | 562 |
Another example of a DPU single core project is based on the ZCU104 platform. In this project, the image and weights buffers utilize UltraRAM. The project is configured with low RAM usage, depthwise convolution, average pooling, channel augmentation, leaky ReLU + ReLU6 features, and low DSP usage. The resource utilization of this project is as follows.
DPU Architecture | LUT | Register | Block RAM | UltraRAM | DSP |
---|---|---|---|---|---|
B512 (4x8x8) | 27396 | 35251 | 1.5 | 18 | 78 |
B800 (4x10x10) | 30356 | 42463 | 1.5 | 40 | 117 |
B1024 (8x8x8) | 34134 | 50820 | 1.5 | 26 | 154 |
B1152 (4x12x12) | 33103 | 49502 | 2 | 44 | 164 |
B1600 (8x10x10) | 38526 | 63294 | 1.5 | 56 | 232 |
B2304 (8x12x12) | 42538 | 74000 | 2 | 60 | 326 |
B3136 (8x14x14) | 47270 | 85782 | 2 | 64 | 436 |
B4096 (8x16x16) | 52681 | 104562 | 2 | 68 | 562 |
RAM Usage
DPU Architecture | Low RAM Usage (block RAMs) | High RAM Usage (block RAMs) |
---|---|---|
B512 (4x8x8) | 73.5 | 89.5 |
B800 (4x10x10) | 91.5 | 109.5 |
B1024 (8x8x8) | 105.5 | 137.5 |
B1152 (4x12x12) | 123 | 145 |
B1600 (8x10x10) | 127.5 | 163.5 |
B2304 (8x12x12) | 167 | 211 |
B3136 (8x14x14) | 210 | 262 |
B4096 (8x16x16) | 257 | 317.5 |
Channel Augmentation
Channel augmentation is an optional feature for improving the efficiency of the DPU when the number of input channels is much lower than the available channel parallelism. For example, the input channel of the first layer in most CNNs is three, which does not fully use all the available hardware channels. However, when the number of input channels is larger than the channel parallelism, then channel augmentation may be utilized.
Thus, channel augmentation can improve the total efficiency for most CNNs, but it costs extra logic resources. The following table lists the extra LUT resources used with channel augmentation; the figures are for reference only.
DPU Architecture | Extra LUTs with Channel Augmentation |
---|---|
B512(4x8x8) | 3121 |
B800(4x10x10) | 2624 |
B1024(8x8x8) | 3133 |
B1152(4x12x12) | 1744 |
B1600(8x10x10) | 2476 |
B2304(8x12x12) | 1710 |
B3136(8x14x14) | 1946 |
B4096(8x16x16) | 1701 |
DepthwiseConv
In standard convolution, each input channel needs to perform the operation with one specific kernel, and then the result is obtained by combining the results of all channels together.
In depthwise separable convolution, the operation is performed in two steps: depthwise convolution and pointwise convolution. Depthwise convolution is performed for each feature map separately as shown on the left side of the following figure. The next step is to perform pointwise convolution, which is the same as standard convolution with kernel size 1x1. The parallelism of depthwise convolution is half that of the pixel parallelism.
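The two steps can be sketched in plain Python on nested lists. This is an illustration of the arithmetic only, assuming stride 1 and no padding; it says nothing about how the DPU schedules the operation, and the function names are hypothetical.

```python
def depthwise_conv(feature_maps, kernels):
    """Convolve each channel with its own kernel.

    feature_maps: [C][H][W], kernels: [C][KH][KW]; stride 1, no padding.
    """
    outputs = []
    for channel, kernel in zip(feature_maps, kernels):
        kh, kw = len(kernel), len(kernel[0])
        out_h = len(channel) - kh + 1
        out_w = len(channel[0]) - kw + 1
        outputs.append([
            [
                sum(channel[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
                for j in range(out_w)
            ]
            for i in range(out_h)
        ])
    return outputs


def pointwise_conv(feature_maps, weights):
    """1x1 convolution that mixes channels: weights is [O][C]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            [
                sum(w * feature_maps[c][i][j] for c, w in enumerate(row))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for row in weights
    ]


# Two 2x2 input channels, one 2x2 kernel per channel, two output channels.
x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
k = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
depthwise = depthwise_conv(x, k)
print(depthwise)
print(pointwise_conv(depthwise, [[1, 1], [1, -1]]))
```

The depthwise step never mixes channels; only the pointwise step combines them, which is what separates this from a standard convolution.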
DPU Architecture | Extra LUTs | Extra Block RAMs | Extra DSPs |
---|---|---|---|
B512(4x8x8) | 1734 | 4 | 12 |
B800(4x10x10) | 2293 | 4.5 | 15 |
B1024(8x8x8) | 2744 | 4 | 24 |
B1152(4x12x12) | 2365 | 5.5 | 18 |
B1600(8x10x10) | 3392 | 4.5 | 30 |
B2304(8x12x12) | 3943 | 5.5 | 36 |
B3136(8x14x14) | 4269 | 6.5 | 42 |
B4096(8x16x16) | 4930 | 7.5 | 48 |
ElementWise Multiply
The ElementWise Multiply calculates the Hadamard product of two input feature maps. The input channel of ElementWise Multiply ranges from 1 to 256 * channel_parallel.
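As a reference for the operation itself, the Hadamard product multiplies values at matching positions of two equally shaped feature maps. The following is a minimal sketch on nested lists; the function name is hypothetical.

```python
def elementwise_multiply(map_a, map_b):
    """Hadamard product of two feature maps shaped [C][H][W]."""
    return [
        [
            [a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(chan_a, chan_b)
        ]
        for chan_a, chan_b in zip(map_a, map_b)
    ]


# One channel, 2x2 feature maps.
print(elementwise_multiply([[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]))
```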
The extra resources with ElementWise Multiply are listed in the following table.
DPU Architecture | Extra LUTs | Extra FFs | Extra DSPs |
---|---|---|---|
B512(4x8x8) | 159 | -113 | 8 |
B800(4x10x10) | 295 | -93 | 10 |
B1024(8x8x8) | 211 | -65 | 8 |
B1152(4x12x12) | 364 | -274 | 12 |
B1600(8x10x10) | 111 | 292 | 10 |
B2304(8x12x12) | 210 | -158 | 12 |
B3136(8x14x14) | 329 | -267 | 14 |
B4096(8x16x16) | 287 | 78 | 16 |
AveragePool
The AveragePool option determines whether the average pooling operation will be performed on the DPU or not. The supported sizes range from 2x2, 3x3, …, to 8x8, with only square sizes supported.
The extra resources with AveragePool are listed in the following table.
DPU Architecture | Extra LUTs |
---|---|
B512(4x8x8) | 1507 |
B800(4x10x10) | 2016 |
B1024(8x8x8) | 1564 |
B1152(4x12x12) | 2352 |
B1600(8x10x10) | 1862 |
B2304(8x12x12) | 2338 |
B3136(8x14x14) | 2574 |
B4096(8x16x16) | 3081 |
ReLU Type
The ReLU Type option determines which kind of ReLU function can be used in the DPU. ReLU and ReLU6 are supported by default. The option “ReLU + LeakyReLU + ReLU6” means that LeakyReLU becomes available as an activation function.
DPU Architecture | Extra LUTs |
---|---|
B512(4x8x8) | 347 |
B800(4x10x10) | 725 |
B1024(8x8x8) | 451 |
B1152(4x12x12) | 780 |
B1600(8x10x10) | 467 |
B2304(8x12x12) | 706 |
B3136(8x14x14) | 831 |
B4096(8x16x16) | 925 |
Softmax
This option allows the softmax function to be implemented in hardware. The hardware implementation of softmax can be 160 times faster than a software implementation. Whether to enable this option depends on the available hardware resources and the desired throughput. Note that the hardware softmax supports at most 1023 categories. If the number of categories is greater than 1023, it is recommended to use the software softmax. For more information, refer to the Vitis AI Library User Guide (UG1354).
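The category limit suggests a simple dispatch rule in application code. The following sketch is illustrative only: `use_hardware_softmax` and `software_softmax` are hypothetical names, and the software path shown here is a generic numerically stable softmax, not the routine shipped with the Vitis AI Library.

```python
import math

# Limit stated in the text above for the hardware softmax.
HW_SOFTMAX_MAX_CATEGORIES = 1023


def use_hardware_softmax(num_categories):
    """Return True when the category count fits the hardware softmax."""
    return 1 <= num_categories <= HW_SOFTMAX_MAX_CATEGORIES


def software_softmax(logits):
    """Generic software fallback; subtracts the maximum for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


print(use_hardware_softmax(1000))
print(use_hardware_softmax(1024))
print(software_softmax([1.0, 2.0, 3.0]))
```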
When softmax is enabled, an AXI master interface named SFM_M_AXI and an interrupt port named sfm_interrupt will appear in the DPU IP. The softmax module uses m_axi_dpu_aclk as the AXI clock for SFM_M_AXI as well as for computation.
The extra resources with Softmax enabled are listed in the following table.
IP Name | Extra LUTs | Extra FFs | Extra BRAMs | Extra DSPs |
---|---|---|---|---|
Softmax | 9580 | 8019 | 4 | 14 |
Advanced Tab
The following figure shows the Advanced tab of the DPU configuration.
- S-AXI Clock Mode
- s_axi_aclk is the S-AXI interface clock. When Common with M-AXI Clock is selected, s_axi_aclk shares the same clock as m_axi_aclk and the s_axi_aclk port is hidden. When Independent is selected, a clock different from m_axi_aclk must be provided.
- dpu_2x Clock Gating
- dpu_2x clock gating is an option for reducing the power consumption of the DPU. When the option is enabled, a port named dpu_2x_clk_ce appears for each DPU core. The dpu_2x_clk_ce port should be connected to the clk_dsp_ce port in the dpu_clk_wiz IP. The dpu_2x_clk_ce signal can shut down the dpu_2x_clk when the computing engine in the DPU is idle. To generate the clk_dsp_ce port in the dpu_clk_wiz IP, the clocking wizard IP should be configured with specific options. For more information, see the Reference Clock Generation section.
- DSP Cascade
- The maximum length of the DSP48E slice cascade chain can be set. Longer cascade lengths typically use fewer logic resources but might have worse timing. Shorter cascade lengths might not be suitable for small devices as they require more hardware resources. Xilinx recommends selecting the mid-value, which is four, in the first iteration and adjusting the value if timing is not met.
- DSP Usage
- This allows you to select whether DSP48E slices will be used for accumulation in the DPU convolution module. When low DSP usage is selected, the DPU IP uses DSP slices only for multiplication in the convolution. In high DSP usage mode, the DSP slices are used for both multiplication and accumulation. Thus, high DSP usage consumes more DSP slices and fewer LUTs. The extra logic utilization of low DSP usage compared with high DSP usage is shown in the following table.
Table 12. Extra Resources of Low DSP Usage Compared with High DSP Usage
DPU Architecture | Extra LUTs | Extra Registers | Extra DSPs (1) |
---|---|---|---|
B512 | 1418 | 1903 | -32 |
B800 | 1445 | 2550 | -40 |
B1024 | 1978 | 3457 | -64 |
B1152 | 1661 | 2525 | -48 |
B1600 | 2515 | 4652 | -80 |
B2304 | 3069 | 4762 | -96 |
B3136 | 3520 | 6219 | -112 |
B4096 | 3900 | 7359 | -128 |
(1) Negative numbers imply a relative decrease.
- UltraRAM
- There are two kinds of on-chip memory resources in Zynq® UltraScale+™ devices: block RAM and UltraRAM. The available amount of each memory type is device-dependent. Each block RAM consists of two 18K slices which can be configured as 9b*4096, 18b*2048, or 36b*1024. UltraRAM has a fixed configuration of 72b*4096. A memory unit in the DPU has a width of ICP*8 bits and a depth of 2048. For the B1024 architecture, the ICP is eight, so the width of a memory unit is 8*8 = 64 bits, and each memory unit can be instantiated with one UltraRAM block. When the ICP is greater than eight, each memory unit in the DPU needs at least two UltraRAM blocks.
The DPU uses block RAM as the memory unit by default. For a target device with both block RAM and UltraRAM, configure the number of UltraRAM to determine how many UltraRAMs are used to replace some block RAMs. The number of UltraRAM should be set as a multiple of the number of UltraRAM required for a memory unit in the DPU. An example of block RAM and UltraRAM utilization is shown in the Summary tab section.
- Timestamp
- When enabled, the DPU records the time that the DPU project was synthesized. When disabled, the timestamp keeps the value at the moment of the last IP update.
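For the UltraRAM option described above, the number of UltraRAM blocks per memory unit follows from the widths given in the text: a memory unit is ICP*8 bits wide and an UltraRAM is 72 bits wide. The following sketch computes that lower bound and checks the "multiple of" rule for the configured UltraRAM count; the function names are hypothetical, and the calculation assumes width is the only limiting factor because the memory unit depth (2048) fits within the UltraRAM depth (4096).

```python
import math

ULTRARAM_WIDTH_BITS = 72  # fixed 72b*4096 configuration


def ultraram_per_memory_unit(icp):
    """Minimum UltraRAM blocks needed for one ICP*8-bit wide memory unit."""
    return math.ceil(icp * 8 / ULTRARAM_WIDTH_BITS)


def is_valid_ultraram_count(count, icp):
    """Check that the configured count is a multiple of the per-unit need."""
    return count >= 0 and count % ultraram_per_memory_unit(icp) == 0


for icp in (8, 10, 12, 14, 16):
    print(icp, ultraram_per_memory_unit(icp))
```

With ICP equal to eight the result is one block, and for every larger ICP used by the listed architectures it is two, which matches the description in the UltraRAM item.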
Summary Tab
A summary of the configuration settings is displayed in the Summary tab. The target version shows the DPU instruction set version number.
DPU Performance on Different Devices
The following table shows the peak theoretical performance of the DPU on different devices.
Device | DPU Configuration | Frequency (MHz) | Peak Theoretical Performance (GOPS) |
---|---|---|---|
Z7020 | B1152x1 | 200 | 230 |
ZU2 | B1152x1 | 370 | 426 |
ZU3 | B2304x1 | 370 | 852 |
ZU5 | B4096x1 | 350 | 1400 |
ZU7EV | B4096x2 | 330 | 2700 |
ZU9 | B4096x3 | 333 | 4100 |
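The table values are consistent with multiplying the peak operations per clock (the number in the architecture name) by the core count and the clock frequency. The following sketch reproduces that arithmetic; the function name `peak_gops` is hypothetical, and the published figures appear to be rounded, so small differences are expected.

```python
def peak_gops(ops_per_clock, cores, frequency_mhz):
    """Peak theoretical performance in GOPS."""
    return ops_per_clock * cores * frequency_mhz / 1000.0


# (device, ops per clock, cores, frequency in MHz) from the table above.
CONFIGS = [
    ("Z7020", 1152, 1, 200),
    ("ZU2", 1152, 1, 370),
    ("ZU3", 2304, 1, 370),
    ("ZU5", 4096, 1, 350),
    ("ZU7EV", 4096, 2, 330),
    ("ZU9", 4096, 3, 333),
]

for device, ops, cores, mhz in CONFIGS:
    print(device, round(peak_gops(ops, cores, mhz), 1))
```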
Performance of Different Models
In this section, the performance of several models is given for reference. The results shown in the following table were measured on a Xilinx® ZCU102 board with three B4096 cores with 16 threads running at 287 MHz.
Network Model | Workload (GOPs per image) | Input Image Resolution | Accuracy (DPU) | Frames per Second (FPS) |
---|---|---|---|---|
Inception-v1 | 3.2 | 224*224 | Top-1: 0.6954 | 452.4 |
ResNet50 | 7.7 | 224*224 | Top-1: 0.7338 | 163.4 |
MobileNet_v2 | 0.6 | 299*299 | Top-1: 0.6352 | 587.2 |
SSD_ADAS_VEHICLE | 6.3 | 480*360 | mAP: 0.4190 | 306.2 |
SSD_ADAS_PEDESTRIAN | 5.9 | 640*360 | mAP: 0.5850 | 279.2 |
SSD_MobileNet_v2 | 6.6 | 480*360 | mAP: 0.2940 | 124.7 |
YOLO-V3-VOC | 65.4 | 416*416 | mAP: 0.8153 | 43.6 |
YOLO-V3_ADAS | 5.5 | 512*256 | mAP: 0.5301 | 239.7 |
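One way to read this table is to convert each row into an achieved throughput (workload times FPS) and compare it with the theoretical peak of the test setup. The following sketch does that for a few rows. It is a rough indicator only: the helper names are hypothetical, and the peak is computed as 4096 operations per clock for each of the three cores at 287 MHz.

```python
def achieved_gops(workload_gops_per_image, fps):
    """Operations actually delivered per second, in GOPS."""
    return workload_gops_per_image * fps


def utilization(workload_gops_per_image, fps, peak_gops):
    """Fraction of the theoretical peak that the model reaches."""
    return achieved_gops(workload_gops_per_image, fps) / peak_gops


# Three B4096 cores at 287 MHz, as stated for the measurements.
PEAK_GOPS = 4096 * 3 * 287 / 1000.0

for name, workload, fps in [
    ("Inception-v1", 3.2, 452.4),
    ("ResNet50", 7.7, 163.4),
    ("YOLO-V3-VOC", 65.4, 43.6),
]:
    print(name, round(achieved_gops(workload, fps), 1),
          round(utilization(workload, fps, PEAK_GOPS), 3))
```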
Unsupported Models
Some models may not be supported on a specific DPU architecture because of a large feature map size. The following table lists the unsupported models for each architecture:
DPU Architecture | Unsupported Models |
---|---|
B512 | inception_resnet_v2_tf, vgg_16_tf, vgg_19_tf, mobilenet_edge_1_0_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt, pmg_pt |
B800 | vgg_16_tf, vgg_19_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt |
B1024 | inception_resnet_v2_tf, vgg_16_tf, vgg_19_tf, mobilenet_edge_1_0_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt, pmg_pt |
B1152 | vgg_16_tf, vgg_19_tf |
B1600 | vgg_16_tf, vgg_19_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt |
B2304 | vgg_16_tf, vgg_19_tf |
B3136 | vgg_16_tf, vgg_19_tf |
I/O Bandwidth Requirements
Network Model | B1152 Peak (MB/s) | B1152 Average (MB/s) | B4096 Peak (MB/s) | B4096 Average (MB/s) |
---|---|---|---|---|
Inception-v1 | 1704 | 890 | 4626 | 2474 |
ResNet50 | 2052 | 1017 | 5298 | 3132 |
SSD ADAS VEHICLE | 1516 | 684 | 5724 | 2049 |
YOLO-V3-VOC | 2076 | 986 | 6453 | 3290 |
For one DPU core to run at full speed, the peak I/O bandwidth requirement must be met. The I/O bandwidth is mainly used for accessing data through the AXI master interfaces (DPU0_M_AXI_DATA0 and DPU0_M_AXI_DATA1).
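The peak figures in the table can be used as a quick feasibility check for a single core. The following sketch is illustrative: the function name is hypothetical, and `available_mb_s` stands for whatever memory bandwidth the system designer has measured or budgeted for the DPU, which is not something this section specifies.

```python
# Peak I/O bandwidth per model and architecture (MB/s), from the table above.
PEAK_BANDWIDTH_MB_S = {
    ("Inception-v1", "B1152"): 1704,
    ("Inception-v1", "B4096"): 4626,
    ("ResNet50", "B1152"): 2052,
    ("ResNet50", "B4096"): 5298,
    ("SSD ADAS VEHICLE", "B1152"): 1516,
    ("SSD ADAS VEHICLE", "B4096"): 5724,
    ("YOLO-V3-VOC", "B1152"): 2076,
    ("YOLO-V3-VOC", "B4096"): 6453,
}


def meets_peak_bandwidth(model, architecture, available_mb_s):
    """Return True if the budgeted bandwidth covers the peak requirement."""
    return available_mb_s >= PEAK_BANDWIDTH_MB_S[(model, architecture)]


# Example with an assumed budget of 5000 MB/s for one core.
print(meets_peak_bandwidth("ResNet50", "B1152", 5000))
print(meets_peak_bandwidth("ResNet50", "B4096", 5000))
```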