DPU Configuration

Introduction

The DPU IP provides several user-configurable parameters to optimize resource usage and customize features. Different configurations can be selected for DSP slice, LUT, block RAM, and UltraRAM usage based on the amount of available programmable logic resources. There are also options for additional functions, such as channel augmentation, average pooling, depthwise convolution, and softmax. Furthermore, an option determines the number of DPU cores instantiated in a single DPU IP.

The deep neural network features and the associated parameters supported by the DPU are shown in the following table.

A configuration file named arch.json is generated during the Vivado or Vitis flow. The arch.json file is used by the Vitis AI compiler for model compilation. For more information about the Vitis AI compiler, refer to the Vitis AI User Guide (UG1414).

In the Vivado flow, the arch.json file is located at $TRD_HOME/prj/Vivado/srcs/top/ip/top_dpu_0/arch.json. In the Vitis flow, the arch.json file is located at $TRD_HOME/prj/Vitis/binary_container_1/link/vivado/vpl/prj/prj.gen/sources_1/bd/zcu102_base/ip/zcu102_base_DPUCZDX8G_1_0/arch.json.
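
To confirm which DPU configuration a given arch.json describes, the file can simply be inspected before compilation. The following is a minimal Python sketch; the exact fields in the file vary between DPU versions and releases, so it just loads and prints whatever is present:

    import json, os

    # Vivado flow location shown above; substitute the Vitis flow path as needed.
    path = os.path.expandvars("$TRD_HOME/prj/Vivado/srcs/top/ip/top_dpu_0/arch.json")
    with open(path) as f:
        arch = json.load(f)
    print(json.dumps(arch, indent=2))  # the fields identify the configured DPU target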

Table 1. Deep Neural Network Features and Parameters Supported by the DPU

Convolution
  Kernel Sizes: kernel_w: 1~16; kernel_h: 1~16
  Strides: stride_w: 1~8; stride_h: 1~8
  Padding_w: 0~(kernel_w-1)
  Padding_h: 0~(kernel_h-1)
  Input Size: Arbitrary
  Input Channel: 1~256 * channel_parallel
  Output Channel: 1~256 * channel_parallel
  Activation: ReLU, ReLU6, and LeakyReLU
  Dilation: dilation * input_channel ≤ 256 * channel_parallel && stride_w == 1 && stride_h == 1
  Constraint (see note 3): kernel_w * kernel_h * ceil(input_channel / channel_parallel) ≤ bank_depth/2

Depthwise Convolution
  Kernel Sizes: kernel_w: 1~16; kernel_h: 1~16
  Strides: stride_w: 1~8; stride_h: 1~8
  Padding_w: 0~(kernel_w-1)
  Padding_h: 0~(kernel_h-1)
  Input Size: Arbitrary
  Input Channel: 1~256 * channel_parallel
  Output Channel: 1~256 * channel_parallel
  Activation: ReLU, ReLU6
  Dilation: dilation * input_channel ≤ 256 * channel_parallel && stride_w == 1 && stride_h == 1
  Constraint (see note 3): kernel_w * kernel_h * ceil(input_channel / channel_parallel) ≤ bank_depth/2

Deconvolution
  Kernel Sizes: kernel_w: 1~16; kernel_h: 1~16
  Stride_w: stride_w * output_channel ≤ 256 * channel_parallel
  Stride_h: Arbitrary
  Padding_w: 0~(kernel_w-1)
  Padding_h: 0~(kernel_h-1)
  Input Size: Arbitrary
  Input Channel: 1~256 * channel_parallel
  Output Channel: 1~256 * channel_parallel
  Activation: ReLU, ReLU6, and LeakyReLU

Max Pooling
  Kernel Sizes: kernel_w: 1~8; kernel_h: 1~8
  Strides: stride_w: 1~8; stride_h: 1~8
  Padding: padding_w: 0~(kernel_w-1); padding_h: 0~(kernel_h-1)

Average Pooling
  Kernel Sizes: Only square sizes from 2x2 to 8x8 are supported
  Strides: stride_w: 1~8; stride_h: 1~8
  Padding: padding_w: 0~(kernel_w-1); padding_h: 0~(kernel_h-1)

Max Reduce (max pooling for large kernel sizes; see note 4)
  Kernel Sizes: kernel_w: 1~256; kernel_h: 1~256
  Strides: Equal to the kernel size
  Padding: Not supported

Elementwise-Sum
  Input Channel: 1~256 * channel_parallel
  Input Size: Arbitrary
  Feature Map Number: 1~4

Elementwise-Multiply
  Input Channel: 1~256 * channel_parallel
  Input Size: Arbitrary
  Feature Map Number: 2

Concat
  Output Channel: 1~256 * channel_parallel

Reorg
  Strides: stride * stride * input_channel ≤ 256 * channel_parallel

Batch Normalization
  -

Fully Connected (see note 2)
  Input_channel: ≤ 2048 * channel_parallel
  Output_channel: Arbitrary
  1. The parameter channel_parallel is determined by the DPU configuration. For example, channel_parallel for B1152 is 12, and channel_parallel for B4096 is 16 (see the Parallelism for Different Convolution Architectures table in the Configuration Options section).
  2. In some neural networks, the FC layer is connected to a Flatten layer. The Vitis AI compiler automatically combines the Flatten+FC pair into a single global CONV2D layer whose kernel size equals the input feature map size of the Flatten layer. In this case, the input feature map size cannot exceed the kernel size limitation of convolution (16), otherwise an error is generated during compilation. For example, flattening a 7x7 feature map into an FC layer produces a CONV2D with a 7x7 kernel, which is within the limit, whereas a 17x17 feature map is not.

    This limitation occurs only in the Flatten+FC situation and will be optimized in future releases.

  3. The bank_depth refers to the depth of the on-chip buffers. In all DPU architectures, the bank_depth of the feature map and weight buffers is 2048.
  4. Max Reduce is similar to Max Pooling but can handle larger kernel sizes. The constraint is kernel_w * kernel_h * input_channel ≤ PP * CP * bank_depth, where PP is the pixel parallelism and CP the channel parallelism.
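
The convolution limits above can be checked mechanically. The following Python sketch is an illustration only; the helper function is hypothetical and not part of the Vitis AI toolchain. It applies the kernel-size, channel-range, and bank-depth rules from Table 1 for a given channel_parallel:

    import math

    BANK_DEPTH = 2048  # feature map and weight bank depth for all DPU architectures

    def conv_supported(kernel_w, kernel_h, input_channel, output_channel,
                       channel_parallel):
        """Check a standard convolution layer against the Table 1 limits.

        channel_parallel is the ICP/OCP of the chosen architecture,
        e.g. 12 for B1152 and 16 for B4096.
        """
        if not (1 <= kernel_w <= 16 and 1 <= kernel_h <= 16):
            return False
        if not (1 <= input_channel <= 256 * channel_parallel):
            return False
        if not (1 <= output_channel <= 256 * channel_parallel):
            return False
        # Weight buffer constraint from Table 1 (see note 3).
        icp_groups = math.ceil(input_channel / channel_parallel)
        return kernel_w * kernel_h * icp_groups <= BANK_DEPTH / 2

    # Example: a 3x3 convolution with 512 input channels on B4096 (CP = 16).
    print(conv_supported(3, 3, 512, 512, channel_parallel=16))  # True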

Configuration Options

The DPU can be configured with several predefined options, which include the number of DPU cores, the convolution architecture, DSP cascade length, DSP usage, and UltraRAM usage. These options determine the DSP slice, LUT, block RAM, and UltraRAM usage. The following figure shows the configuration page of the DPU.

Figure 1: DPU Configuration – Arch Tab

The following sections describe the configuration options.

Number of DPU Cores

A maximum of four cores can be selected in one DPU IP. Multiple DPU cores can be used to achieve higher performance, at the cost of additional programmable logic resources.

Contact your local Xilinx sales representative if you require more than four cores.

Architecture of the DPU

The DPU IP can be configured with various convolution architectures which are related to the parallelism of the convolution unit. The architectures for the DPU IP include B512, B800, B1024, B1152, B1600, B2304, B3136, and B4096.

There are three dimensions of parallelism in the DPU convolution architecture: pixel parallelism, input channel parallelism, and output channel parallelism. The input channel parallelism is always equal to the output channel parallelism (this is equivalent to channel_parallel in the previous table).

Figure 2: Parallelism


The different architectures require different programmable logic resources. The larger architectures can achieve higher performance with more resources. The parallelism for the different architectures is listed in the following table.

Table 2. Parallelism for Different Convolution Architectures
DPU Architecture Pixel Parallelism (PP) Input Channel Parallelism (ICP) Output Channel Parallelism (OCP) Peak Ops (operations/per clock)
B512 4 8 8 512
B800 4 10 10 800
B1024 8 8 8 1024
B1152 4 12 12 1152
B1600 8 10 10 1600
B2304 8 12 12 2304
B3136 8 14 14 3136
B4096 8 16 16 4096
  1. In each clock cycle, the convolution array performs a multiplication and an accumulation, which are counted as two operations. Thus, the peak number of operations per cycle is equal to PP*ICP*OCP*2.
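
As a cross-check of this footnote, the Peak Ops column of Table 2 can be reproduced from the three parallelism dimensions. A short illustrative Python snippet:

    # (PP, ICP, OCP) per architecture, taken from Table 2.
    ARCHS = {
        "B512":  (4, 8, 8),   "B800":  (4, 10, 10),
        "B1024": (8, 8, 8),   "B1152": (4, 12, 12),
        "B1600": (8, 10, 10), "B2304": (8, 12, 12),
        "B3136": (8, 14, 14), "B4096": (8, 16, 16),
    }

    for name, (pp, icp, ocp) in ARCHS.items():
        # One multiply plus one accumulate per MAC counts as two operations.
        peak_ops = pp * icp * ocp * 2
        print(f"{name}: {peak_ops} ops/cycle")  # matches the architecture name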

Resource Use

The resource utilization of a sample single-core DPU project is shown below. The data is based on the ZCU102 platform with low RAM usage, the depthwise convolution, average pooling, channel augmentation, and leaky ReLU + ReLU6 features enabled, and low DSP usage.

In the following tables, the triplet (PPxICPxOCP) after the architecture refers to the pixel parallelism, input channel parallelism, and output channel parallelism.

Table 3. Resources of Different DPU Architectures
DPU Architecture LUT Register Block RAM DSP
B512 (4x8x8) 27893 35435 73.5 78
B800 (4x10x10) 30468 42773 91.5 117
B1024 (8x8x8) 34471 50763 105.5 154
B1152 (4x12x12) 33238 49040 123 164
B1600 (8x10x10) 38716 63033 127.5 232
B2304 (8x12x12) 42842 73326 167 326
B3136 (8x14x14) 47667 85778 210 436
B4096 (8x16x16) 53540 105008 257 562

Another example of a single-core DPU project is based on the ZCU104 platform. In this project, the image and weights buffers use UltraRAM. The project is configured with low RAM usage, the depthwise convolution, average pooling, channel augmentation, and leaky ReLU + ReLU6 features enabled, and low DSP usage. The resource utilization of this project is as follows.

Table 4. Resources of DPU using UltraRAM
DPU Architecture LUT Register Block RAM UltraRAM DSP
B512 (4x8x8) 27396 35251 1.5 18 78
B800 (4x10x10) 30356 42463 1.5 40 117
B1024 (8x8x8) 34134 50820 1.5 26 154
B1152 (4x12x12) 33103 49502 2 44 164
B1600 (8x10x10) 38526 63294 1.5 56 232
B2304 (8x12x12) 42538 74000 2 60 326
B3136 (8x14x14) 47270 85782 2 64 436
B4096 (8x16x16) 52681 104562 2 68 562

RAM Usage

The weights, bias, and intermediate features are buffered in the on-chip memory. The on-chip memory consists of RAM which can be instantiated as block RAM and UltraRAM. The RAM Usage option determines the total amount of on-chip memory used in different DPU architectures, and the setting is for all the DPU cores in the DPU IP. High RAM Usage means that the on-chip memory block will be larger, allowing the DPU more flexibility in handling the intermediate data. High RAM Usage implies higher performance in each DPU core. The number of BRAM36K blocks used in different architectures for low and high RAM Usage is illustrated in the following table.
Note: The DPU instruction set for different options of RAM Usage is different. When the RAM Usage option is modified, the DPU instructions file should be regenerated by recompiling the neural network. The following results are based on a DPU with depthwise convolution.
Table 5. Number of BRAM36K Blocks in Different Architectures for Each DPU Core
DPU Architecture Low RAM Usage High RAM Usage
B512 (4x8x8) 73.5 89.5
B800 (4x10x10) 91.5 109.5
B1024 (8x8x8) 105.5 137.5
B1152 (4x12x12) 123 145
B1600 (8x10x10) 127.5 163.5
B2304 (8x12x12) 167 211
B3136 (8x14x14) 210 262
B4096 (8x16x16) 257 317.5

Channel Augmentation

Channel augmentation is an optional feature for improving the efficiency of the DPU when the number of input channels is much lower than the available channel parallelism. For example, the input channel count of the first layer in most CNNs is three, which does not fully use all the available hardware channels. When the number of input channels is larger than the channel parallelism, channel augmentation provides no additional benefit.

Thus, channel augmentation can improve the total efficiency of most CNNs, at the cost of extra logic resources. The following table lists the extra LUT resources used when channel augmentation is enabled; the numbers are for reference.

Table 6. Extra LUTs of DPU with Channel Augmentation
DPU Architecture Extra LUTs with Channel Augmentation
B512(4x8x8) 3121
B800(4x10x10) 2624
B1024(8x8x8) 3133
B1152(4x12x12) 1744
B1600(8x10x10) 2476
B2304(8x12x12) 1710
B3136(8x14x14) 1946
B4096(8x16x16) 1701

DepthwiseConv

In standard convolution, each output channel is computed by convolving every input channel with its own kernel and then accumulating the results across all input channels.

In depthwise separable convolution, the operation is performed in two steps: depthwise convolution and pointwise convolution. Depthwise convolution is performed for each feature map separately as shown on the left side of the following figure. The next step is to perform pointwise convolution, which is the same as standard convolution with kernel size 1x1. The parallelism of depthwise convolution is half that of the pixel parallelism.
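
To make the two-step decomposition concrete, the following NumPy sketch filters each channel independently and then combines channels with a 1x1 convolution. It is an illustration in floating point (not the DPU's int8 arithmetic), with stride 1 and valid padding:

    import numpy as np

    def depthwise_separable_conv(x, dw_kernels, pw_kernels):
        """x: (H, W, C) feature map.

        dw_kernels: (k, k, C)  one spatial kernel per input channel
        pw_kernels: (C, C_out) 1x1 convolution weights
        """
        h, w, c = x.shape
        k = dw_kernels.shape[0]
        oh, ow = h - k + 1, w - k + 1

        # Step 1: depthwise convolution, each channel filtered independently.
        dw = np.empty((oh, ow, c))
        for ch in range(c):
            for i in range(oh):
                for j in range(ow):
                    dw[i, j, ch] = np.sum(x[i:i+k, j:j+k, ch] * dw_kernels[:, :, ch])

        # Step 2: pointwise (1x1) convolution combines the channels.
        return dw @ pw_kernels

    x = np.random.rand(8, 8, 3)
    out = depthwise_separable_conv(x, np.random.rand(3, 3, 3), np.random.rand(3, 4))
    print(out.shape)  # (6, 6, 4)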

Figure 3: Depthwise Convolution and Pointwise Convolution
Table 7. Extra resources of DPU with Depthwise Convolution
DPU Architecture Extra LUTs Extra Block RAMs Extra DSPs
B512(4x8x8) 1734 4 12
B800(4x10x10) 2293 4.5 15
B1024(8x8x8) 2744 4 24
B1152(4x12x12) 2365 5.5 18
B1600(8x10x10) 3392 4.5 30
B2304(8x12x12) 3943 5.5 36
B3136(8x14x14) 4269 6.5 42
B4096(8x16x16) 4930 7.5 48

ElementWise Multiply

The ElementWise Multiply calculates the Hadamard product of two input feature maps. The input channel of ElementWise Multiply ranges from 1 to 256 * channel_parallel.
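
As a concrete reference, the Hadamard product is simply element-by-element multiplication of two equally shaped feature maps, as in this small NumPy illustration:

    import numpy as np

    # The two feature maps must have identical shapes (H, W, C).
    a = np.random.rand(56, 56, 64)
    b = np.random.rand(56, 56, 64)
    hadamard = a * b  # element-by-element product; output shape is unchanged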

The extra resources used by ElementWise Multiply are listed in the following table.

Table 8. Extra Resources of DPU with ElementWise Multiply
DPU Architecture Extra LUTs Extra FFs (1) Extra DSPs
B512(4x8x8) 159 -113 8
B800(4x10x10) 295 -93 10
B1024(8x8x8) 211 -65 8
B1152(4x12x12) 364 -274 12
B1600(8x10x10) 111 292 10
B2304(8x12x12) 210 -158 12
B3136(8x14x14) 329 -267 14
B4096(8x16x16) 287 78 16
  1. Negative numbers imply a relative decrease.

AveragePool

The AveragePool option determines whether the average pooling operation is performed on the DPU. Only square sizes from 2x2 to 8x8 are supported.

The extra resources used by Average Pool are listed in the following table.

Table 9. Extra LUTs of DPU with Average Pool
DPU Architecture Extra LUTs
B512(4x8x8) 1507
B800(4x10x10) 2016
B1024(8x8x8) 1564
B1152(4x12x12) 2352
B1600(8x10x10) 1862
B2304(8x12x12) 2338
B3136(8x14x14) 2574
B4096(8x16x16) 3081

ReLU Type

The ReLU Type option determines which ReLU functions are available in the DPU. ReLU and ReLU6 are supported by default. The option "ReLU + LeakyReLU + ReLU6" adds LeakyReLU as an available activation function.

Note: The LeakyReLU coefficient is fixed at 0.1.
Table 10. Extra LUTs with ReLU + LeakyReLU + ReLU6 compared to ReLU+ReLU6
DPU Architecture Extra LUTs
B512(4x8x8) 347
B800(4x10x10) 725
B1024(8x8x8) 451
B1152(4x12x12) 780
B1600(8x10x10) 467
B2304(8x12x12) 706
B3136(8x14x14) 831
B4096(8x16x16) 925

Softmax

This option allows the softmax function to be implemented in hardware. The hardware implementation of softmax can be 160 times faster than a software implementation. Enabling this option depends on the available hardware resources and desired throughput. Note that the maximum number of categories supported by the hardware softmax is 1023. If the number of categories is greater than 1023, it is recommended to use the software softmax. For more information, refer to the Vitis AI Library User Guide (UG1354).
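
Functionally, the module computes the usual exponential normalization over the class scores. The following Python sketch is a numerically stable reference model of that function (an illustration of the math, not the hardware's fixed-point implementation):

    import numpy as np

    def softmax(scores):
        """Reference softmax over a 1-D vector of class scores."""
        # The hardware module supports at most 1023 categories; fall back
        # to a software implementation for larger vectors.
        assert scores.size <= 1023, "use software softmax above 1023 categories"
        shifted = scores - scores.max()  # subtract the max for numerical stability
        exps = np.exp(shifted)
        return exps / exps.sum()

    probs = softmax(np.random.rand(1000))
    print(probs.sum())  # ~1.0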

When softmax is enabled, an AXI master interface named SFM_M_AXI and an interrupt port named sfm_interrupt will appear in the DPU IP. The softmax module uses m_axi_dpu_aclk as the AXI clock for SFM_M_AXI as well as for computation.

The extra resources with Softmax enabled are listed in the following table.

Table 11. Extra Resources with Softmax
IP Name Extra LUTs Extra FFs Extra BRAMs Extra DSPs
Softmax 9580 8019 4 14

Advanced Tab

The following figure shows the Advanced tab of the DPU configuration.

Figure 4: DPU Configuration – Advanced Tab
S-AXI Clock Mode
s_axi_aclk is the S-AXI interface clock. When Common with M-AXI Clock is selected, s_axi_aclk shares the same clock as m_axi_aclk and the s_axi_aclk port is hidden. When Independent is selected, a clock different from m_axi_aclk must be provided.
dpu_2x Clock Gating
dpu_2x clock gating is an option for reducing the power consumption of the DPU. When the option is enabled, a port named dpu_2x_clk_ce appears for each DPU core. The dpu_2x_clk_ce port should be connected to the clk_dsp_ce port in the dpu_clk_wiz IP. The dpu_2x_clk_ce signal can shut down the dpu_2x_clk when the computing engine in the DPU is idle. To generate the clk_dsp_ce port in the dpu_clk_wiz IP, the clocking wizard IP should be configured with specific options. For more information, see the Reference Clock Generation section.
DSP Cascade
The maximum length of the DSP48E slice cascade chain can be set. Longer cascade lengths typically use fewer logic resources but might make timing closure harder. Shorter cascade lengths might not be suitable for small devices because they require more hardware resources. Xilinx recommends selecting the middle value, four, in the first iteration and adjusting it if timing is not met.
DSP Usage
This option selects whether DSP48E slices are used for accumulation in the DPU convolution module. When low DSP usage is selected, the DPU uses DSP slices only for multiplication in the convolution. In high DSP usage mode, DSP slices are used for both multiplication and accumulation; high DSP usage therefore consumes more DSP slices and fewer LUTs. The extra logic used by low DSP usage compared with high DSP usage is shown in the following table.
Table 12. Extra Resources of Low DSP Usage Compared with High DSP Usage
DPU Architecture Extra LUTs Extra Registers Extra DSPs (1)
B512 1418 1903 -32
B800 1445 2550 -40
B1024 1978 3457 -64
B1152 1661 2525 -48
B1600 2515 4652 -80
B2304 3069 4762 -96
B3136 3520 6219 -112
B4096 3900 7359 -128
  1. Negative numbers imply a relative decrease.
UltraRAM
There are two kinds of on-chip memory resources in Zynq® UltraScale+™ devices: block RAM and UltraRAM. The available amount of each memory type is device-dependent. Each block RAM consists of two 18K slices and can be configured as 9b*4096, 18b*2048, or 36b*1024. UltraRAM has a fixed configuration of 72b*4096. A memory unit in the DPU has a width of ICP*8 bits and a depth of 2048. For the B1024 architecture, the ICP is eight, so the width of a memory unit is 8*8 = 64 bits. In that case, each memory unit can be instantiated as one UltraRAM block, because a single UltraRAM is both wide enough (72 bits) and deep enough (4096). When the ICP is greater than eight, each memory unit in the DPU needs at least two UltraRAM blocks.
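
The UltraRAM count per memory unit follows directly from these widths. A minimal sketch of the arithmetic, for illustration only:

    import math

    URAM_WIDTH, URAM_DEPTH = 72, 4096  # fixed UltraRAM configuration (72b*4096)
    UNIT_DEPTH = 2048                  # depth of a DPU memory unit

    def urams_per_memory_unit(icp):
        """UltraRAM blocks needed for one ICP*8-bit wide, 2048-deep unit."""
        unit_width = icp * 8
        # A single UltraRAM is already deep enough (2048 <= 4096),
        # so only the width determines the count.
        assert UNIT_DEPTH <= URAM_DEPTH
        return math.ceil(unit_width / URAM_WIDTH)

    print(urams_per_memory_unit(8))   # B1024 (ICP = 8)  -> 1
    print(urams_per_memory_unit(16))  # B4096 (ICP = 16) -> 2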

The DPU uses block RAM as the memory unit by default. For a target device with both block RAM and UltraRAM, configure the number of UltraRAMs to determine how many are used in place of block RAMs. The number of UltraRAMs should be set as a multiple of the number required for one memory unit in the DPU. An example of block RAM and UltraRAM utilization is shown in the Summary tab section.

Timestamp
When enabled, the DPU records the time that the DPU project was synthesized. When disabled, the timestamp keeps the value at the moment of the last IP update.

Summary Tab

A summary of the configuration settings is displayed in the Summary tab. The target version shows the DPU instruction set version number.

Figure 5: Summary Tab of DPU Configuration

DPU Performance on Different Devices

The following table shows the peak theoretical performance of the DPU on different devices.

Table 13. DPU Performance (GOPs per Second, GOPS) on Different Devices
Device DPU Configuration Frequency (MHz) Peak Theoretical Performance (GOPS)
Z7020 B1152x1 200 230
ZU2 B1152x1 370 426
ZU3 B2304x1 370 852
ZU5 B4096x1 350 1400
ZU7EV B4096x2 330 2700
ZU9 B4096x3 333 4100
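
The values in this table follow from the per-cycle peaks in Table 2 multiplied by the core count and clock frequency, with some rounding in the published numbers. For example:

    def peak_gops(ops_per_cycle, cores, freq_mhz):
        # GOPS = (operations per cycle) * cores * (10^6 cycles per second) / 10^9
        return ops_per_cycle * cores * freq_mhz / 1000.0

    print(peak_gops(1152, 1, 370))  # ZU2, B1152x1 -> 426.24, listed as 426
    print(peak_gops(4096, 3, 333))  # ZU9, B4096x3 -> 4091.9, listed as 4100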

Performance of Different Models

In this section, the performance of several models is given for reference. The results shown in the following table were measured on a Xilinx® ZCU102 board using three B4096 cores running at 287 MHz with 16 threads.

Table 14. Performance of Different Models
Network Model Workload (GOPs per image) Input Image Resolution Accuracy (DPU) (2) Frames per second (FPS)
Inception-v1 3.2 224*224 Top-1: 0.6954 452.4
ResNet50 7.7 224*224 Top-1: 0.7338 163.4
MobileNet_v2 0.6 299*299 Top-1: 0.6352 587.2
SSD_ADAS_VEHICLE (1) 6.3 480*360 mAP: 0.4190 306.2
SSD_ADAS_PEDESTRIAN (1) 5.9 640*360 mAP: 0.5850 279.2
SSD_MobileNet_v2 6.6 480*360 mAP: 0.2940 124.7
YOLO-V3-VOC 65.4 416*416 mAP: 0.8153 43.6
YOLO-V3_ADAS (1) 5.5 512*256 mAP: 0.5301 239.7
  1. These models were pruned by the Vitis AI Optimizer.
  2. Accuracy values are obtained using 8-bit quantization.

Unsupported Models

Some models may not be supported in a specific DPU architecture due to their large feature map sizes. The following table lists the unsupported models for each architecture:

Table 15. Unsupported Models in Different DPU Architectures
DPU Architecture Unsupported Models
B512 inception_resnet_v2_tf, vgg_16_tf, vgg_19_tf, mobilenet_edge_1_0_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt, pmg_pt
B800 vgg_16_tf, vgg_19_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt
B1024 inception_resnet_v2_tf, vgg_16_tf, vgg_19_tf, mobilenet_edge_1_0_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt, pmg_pt
B1152 vgg_16_tf, vgg_19_tf
B1600 vgg_16_tf, vgg_19_tf, facerec_resnet20, facerec_resnet64, facerec-resnet20_mixed_pt
B2304 vgg_16_tf, vgg_19_tf
B3136 vgg_16_tf, vgg_19_tf

I/O Bandwidth Requirements

When different neural networks run on the DPU, the I/O bandwidth requirement changes depending on which network is currently being executed; even the individual layers of one network have different requirements. The I/O bandwidth requirements of several networks, averaged by layer, have been tested with one DPU core running at full speed. The peak and average I/O bandwidth requirements of four typical networks are shown in the table below, for two commonly used DPU architectures (B1152 and B4096).
Note: When multiple DPU cores run in parallel, each core might not be able to run at full speed due to the I/O bandwidth limitations.
Table 16. I/O Bandwidth Requirements for B1152 and B4096
Network Model  B1152 Peak (MB/s)  B1152 Average (MB/s)  B4096 Peak (MB/s)  B4096 Average (MB/s)
Inception-v1 1704 890 4626 2474
ResNet50 2052 1017 5298 3132
SSD ADAS VEHICLE 1516 684 5724 2049
YOLO-V3-VOC 2076 986 6453 3290

If one DPU core needs to run at full speed, the peak I/O bandwidth requirement must be met. The I/O bandwidth is mainly used for accessing data through the AXI master interfaces (DPU0_M_AXI_DATA0 and DPU0_M_AXI_DATA1).