FPGA Virtualization for Deep Learning: Achieving 3X Performance in the Cloud

This is a guest post from Song Yao, Xilinx Senior Director in AI Business.

Cloud computing has become the new computing paradigm. For cloud computing, virtualization is necessary to enable isolation between users, high flexibility and scalability, high security, and maximized utilization of hardware resources.

Since 2017, FPGAs have been widely adopted in cloud computing because of their programmability, low latency, and high energy efficiency. Amazon Web Services, Aliyun (Alibaba Cloud), Microsoft Azure, Huawei Cloud, and others all currently offer Xilinx FPGA instances. However, FPGA instances in the cloud are still physical instances aimed at single-task, static-workload scenarios, which means that multiple users can only share one FPGA instance through time-division multiplexing (TDM). There is an increasing demand for virtualized FPGAs.

The research group of Professor Yu Wang at Tsinghua University has been working on FPGA virtualization for years and recently proposed a framework that enables node-level FPGA virtualization for deep learning acceleration. Performance isolation for multiple users is enabled through a two-level instruction dispatch module (IDM) and a multi-core-based hardware resource pool. The overhead of online re-compilation is reduced to about 1 ms by a tiling-based instruction frame package design and a two-stage static-dynamic compilation flow. The baseline DNN accelerator design is based on Angel-Eye, an FPGA-based DNN accelerator published by Prof. Wang’s group in 2017.

The paper has just been presented at the 28th FCCM, a premier academic conference in the programmable computing area. A demo is also provided for anyone who wants to try it:

https://github.com/annoysss123/FPGA-Virt-Exp-on-Aliyun-f3

General Introduction

As shown in Figure 1 (a), an FPGA instance provided in the cloud usually has plentiful resources, such as the Xilinx VU9P, and could support a large number of users. For the public cloud, there are two typical isolation methods between users: physical resource isolation and performance isolation. Physical resource isolation allocates different hardware resources to different users, while performance isolation means that the performance provided to each user is not disturbed by the tasks of other users. For the private cloud, virtualization aims to maximize overall system performance.

In this research, two baseline designs are used for comparison: a static single-large-core design that supports multiple users through time-division multiplexing (TDM), and a static multi-core design with 16 small cores in which each core serves one user. The virtualized design also has 16 small cores but uses space-division multiplexing (SDM) to support multiple tasks dynamically.

Figure 1. Virtualization method for ISA-based DNN accelerator on FPGA: (a) Hardware architecture for public cloud; (b) Two-stage compiler design for private cloud

As shown in Figure 1 (b), a two-stage compilation flow is proposed to reduce the online re-compilation overhead. The basic idea is to tile the output feature map into blocks and generate an instruction frame package (IFP), a series of instructions, for each tile. During offline deployment, IFPs are generated based on the DNN model and the hardware configuration of the basic shareable units (small cores). During online deployment, only the pre-generated IFPs need to be re-allocated to the cores according to the hardware resources re-allocated for new users. A simple latency simulator is also proposed to predict the latency of each IFP and balance the workload among all allocated cores.

Figure 2. Hardware architecture of the proposed virtualized FPGA DNN accelerator

Hardware and Compiler Design

To realize virtualization, the hardware architecture of the DNN accelerator, shown in Figure 2, differs from that of a traditional DNN accelerator. First, as shown in Figure 2 (a), the accelerator adopts a Hardware Resource Pool (HRP) of multiple small cores, which are allocated exclusively to different users. In addition, unlike the original Instruction Distribution Module (IDM), which only implements instruction distribution and dependency management within a single core, a two-level IDM is designed to achieve multi-core sharing and synchronization, as shown in Figure 2 (b).

The first-level IDM has four modules: Instruction Mem., Instruction Decoder, Context-Switch Controller, and Multi-Core Sync. Controller. Instruction Mem. fetches the instructions from DDR and caches them until the next reconfiguration. The Instruction Decoder decodes them and sends each instruction to the second-level IDM of the corresponding core according to the core index of the instruction. The Context-Switch Controller records the index of the DNN layer that has been executed, so that other cores can continue computing from the intermediate results. The Multi-Core Sync. Controller manages the layer-wise multi-core synchronization.

The second-level IDM manages the computation inside each core. The Context-Switch Module can restart the computation based on the context information recorded by the first-level IDM in the online reconfiguration stage. The System Synchronization Controller generates the local synchronization signal when the computation of the current DNN layer finishes, and then waits for a valid global synchronization signal before the next layer starts.
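
The routing and layer-wise synchronization described above can be pictured with a small software model. This is only a sketch under stated assumptions: the `Instruction` fields, the function names, and the operation strings are hypothetical, and the real IDM is hardware whose instruction format is not given in this post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    core: int   # index of the small core that should run this instruction
    layer: int  # index of the DNN layer the instruction belongs to
    op: str     # opaque operation name (hypothetical), e.g. "CONV"


def first_level_dispatch(instructions):
    """Route each decoded instruction to the queue of its target core."""
    queues = defaultdict(list)
    for ins in instructions:
        queues[ins.core].append(ins)
    return dict(queues)


def run_layerwise(queues, num_layers):
    """Replay the queues layer by layer.

    A layer only starts after every core has finished the previous one,
    which models the local-signal / global-signal handshake.
    """
    trace = []
    for layer in range(num_layers):
        for core in sorted(queues):
            for ins in queues[core]:
                if ins.layer == layer:
                    trace.append((layer, core, ins.op))
        # All cores have raised their local signal: global synchronization.
        trace.append((layer, "sync", "GLOBAL_SYNC"))
    return trace


program = [
    Instruction(core=0, layer=0, op="CONV"),
    Instruction(core=1, layer=0, op="CONV"),
    Instruction(core=0, layer=1, op="CONV"),
    Instruction(core=1, layer=1, op="CONV"),
]
for step in run_layerwise(first_level_dispatch(program), num_layers=2):
    print(step)
```

The model runs the cores one after another for simplicity; in hardware they run in parallel, and only the synchronization points are shared.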

Figure 3. The compilation flow for virtualized FPGA DNN accelerator, including static compilation (left) and dynamic compilation (right).

Tiling is an important technique in DNN accelerator design: partitioning the feature maps into tiles enables massive parallelism and data reuse. Tiling can be applied along different dimensions, such as the height and width of the feature maps. However, since the compiler generates convolution instructions along the height dimension, the feature map width and the output channel dimension are selected as the tiling dimensions for generating IFPs.
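
As a rough illustration of the two tiling choices, the sketch below splits either the width or the output channels of one layer into contiguous tiles. The function names, the dictionary layout, and the example sizes are assumptions made for illustration; the actual IFP contents are not described in this post.

```python
def tile_ranges(size, num_tiles):
    """Split range(size) into num_tiles contiguous, near-equal (start, stop) ranges."""
    if not 1 <= num_tiles <= size:
        raise ValueError("need 1 <= num_tiles <= size")
    base, extra = divmod(size, num_tiles)
    ranges, start = [], 0
    for i in range(num_tiles):
        # The first `extra` tiles take one additional element each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def make_tiles(width, out_channels, num_tiles, dim):
    """Describe each tile by the width range and output-channel range it covers."""
    if dim == "width":
        return [{"w": r, "oc": (0, out_channels)}
                for r in tile_ranges(width, num_tiles)]
    if dim == "out_channels":
        return [{"w": (0, width), "oc": r}
                for r in tile_ranges(out_channels, num_tiles)]
    raise ValueError("dim must be 'width' or 'out_channels'")


# Hypothetical layer: 56-wide feature map with 64 output channels, 16 tiles.
by_width = make_tiles(56, 64, 16, "width")
by_channel = make_tiles(56, 64, 16, "out_channels")
print(by_width[0], by_channel[0])
```

Each tile would then correspond to one IFP, so a layer tiled 16 ways can be spread over anywhere from 1 to 16 small cores without regenerating instructions.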

The left part of Figure 3 shows how to get the latency results for each IFP using different tiling dimensions. After tiling, instructions in each tile are integrated into an IFP and then a cycle-level latency simulator predicts the latency of each IFP (T_1 to T_N or T_M). In the dynamic compilation stage, the dynamic compiler fetches the latency predictions and finds the optimal allocation strategy for multi-core sharing to minimize the total latency of a DNN layer.
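
To make the allocation step concrete, here is a minimal sketch that assigns IFPs to cores from their predicted latencies. It uses the classic longest-processing-time-first greedy heuristic, which is a stand-in chosen for brevity and is not necessarily the strategy the dynamic compiler uses; the latency values are made up.

```python
import heapq


def allocate_ifps(latencies, num_cores):
    """Greedy longest-processing-time-first assignment of IFPs to cores.

    Returns (assignment, makespan): assignment[c] lists the IFP indices
    given to core c, and makespan is the latency of the busiest core,
    i.e. the estimated latency of the layer.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    heap = [(0.0, core) for core in range(num_cores)]  # (load, core)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_cores)]
    # Place the longest IFPs first, always on the least-loaded core.
    for idx in sorted(range(len(latencies)), key=lambda i: -latencies[i]):
        load, core = heapq.heappop(heap)
        assignment[core].append(idx)
        heapq.heappush(heap, (load + latencies[idx], core))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Hypothetical predicted latencies (arbitrary units) for six IFPs.
predicted = [4, 3, 3, 2, 2, 2]
assignment, makespan = allocate_ifps(predicted, num_cores=2)
print(assignment, makespan)
```

Because the inputs are only a list of numbers, a step like this is cheap enough to run online, which is consistent with the millisecond-level dynamic compilation reported below.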

Experimental Results

To evaluate the performance of this virtualization design, the research team used a Xilinx Alveo U200, a Xilinx VU9P FPGA on Aliyun (Alibaba Cloud), and an NVIDIA Tesla V100 GPU to run four well-known DNN models for comparison: Inception v3, VGG16, MobileNet, and ResNet50. Xilinx SDAccel 2018.3 is used for hardware synthesis and software deployment. Hardware resource utilization results are shown in Table 1. The virtualized multi-core design uses 1% more logic and memory resources than the static multi-core design, since the two-level IDM costs slightly more resources. The multi-core designs consume almost twice the resources of the single-large-core design, since many modules, such as the data mover, are replicated for each small core.

Table 1. Hardware resource utilization

First, the cost of context switching with different numbers of re-allocated cores is evaluated. As shown in Table 2, static compilation takes 14.7-46.8 s to generate IFPs during offline deployment, while dynamic compilation costs only 0.4-1.5 ms. Including the time to transfer the instruction files, the total online reconfiguration overhead is limited to 0.45-1.70 ms. Compared with a non-virtualized design that takes tens of seconds to re-compile the whole DNN model, the online reconfiguration overhead of the virtualized design is negligible.

Table 2. Cost of context switching with different numbers of re-allocated cores

A good virtualization framework should achieve good performance isolation, which means that the performance of one user should not be influenced by the tasks of other users. The second experiment assumes that there are four possible users. One user is given a fixed share x of the total resources (100%, 75%, 50%, or 25%), the remaining users are varied to occupy the other (1 – x) of the resources, and the maximum and minimum performance that the fixed user obtains are recorded. As shown in Figure 4, when a user monopolizes all resources, there is no performance deviation. When a single user occupies 75%, 50%, and 25% of the total resources, the GPU virtualization solution shows performance deviations of 7.1-13.1%, 5.5-10.9%, and 6.5-8.1%, respectively, while the FPGA virtualization design keeps them within 1%. The FPGA virtualization solution achieves much better isolation than the GPU while meeting all the requirements for isolation.

Figure 4. Performance isolation: the performance deviation from the ideal situation for one user with different shares of hardware resources when there are 4 users.
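
One simple way to turn the measured maximum and minimum into a single number is the spread relative to the ideal throughput. The post does not give the exact formula behind the deviation percentages, so the definition below, along with the function name and sample values, is an assumption for illustration only.

```python
def performance_deviation(throughputs, ideal):
    """Relative spread of one user's throughput across co-tenant configurations.

    throughputs: the fixed user's measured throughput in each configuration
    ideal: the throughput the same share of resources delivers with no co-tenants
    """
    if ideal <= 0 or not throughputs:
        raise ValueError("need a positive ideal and at least one measurement")
    return (max(throughputs) - min(throughputs)) / ideal


# Made-up measurements for a user holding a fixed share of the cores while
# the other users' workloads change.
observed = [99.6, 100.0, 99.8]
deviation = performance_deviation(observed, ideal=100.0)
print(f"{deviation:.2%}")
```

Under this definition a value of zero means the user is completely unaffected by its neighbors, which is the case reported when one user holds all of the resources.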

A good virtualization framework should also deliver performance that scales linearly with the resources allocated to a user. Figure 5 shows the performance results for different DNN models and different tiling strategies. The red line represents the performance of a single large core, and the dark blue line represents the virtualized design with workload balancing.

For Inception v3 and VGG16, the performance loss of the virtualized design is just 0.95% and 3.93%, respectively. For MobileNet, however, the performance loss is 31.64%, since this compact DNN model requires much more memory bandwidth and the multi-core design further increases the bandwidth demand.

We can also see that for VGG16 the performance is nearly linear in the parallelism, since it is a computation-bound task. For Inception v3 and MobileNet, however, even the single-large-core design falls well short of ideal linearity at high parallelism, because both models are memory bound.

Figure 5. The single-task throughput under different situations with different parallelism.
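
The difference between computation-bound and memory-bound scaling can be reproduced with the standard roofline model, which is brought in here as an explanatory aid and is not taken from the paper. All numbers are hypothetical and chosen only to show the shape of the curves.

```python
def attainable_throughput(cores, peak_per_core, bandwidth, intensity):
    """Roofline bound on throughput.

    cores: number of small cores allocated to the task
    peak_per_core: peak compute per core (e.g. GOP/s)
    bandwidth: shared off-chip memory bandwidth (e.g. GB/s)
    intensity: operations performed per byte moved (e.g. OP/byte)
    """
    compute_roof = cores * peak_per_core
    memory_roof = bandwidth * intensity
    return min(compute_roof, memory_roof)


core_counts = (1, 2, 4, 8, 16)
# Hypothetical workloads: a high-intensity one and a low-intensity one.
compute_bound = [attainable_throughput(c, 100.0, 50.0, 64.0) for c in core_counts]
memory_bound = [attainable_throughput(c, 100.0, 50.0, 8.0) for c in core_counts]
print(compute_bound)
print(memory_bound)
```

In this toy setting the high-intensity workload keeps doubling with the core count, while the low-intensity one flattens once the shared bandwidth is saturated, which mirrors the qualitative gap between VGG16 and MobileNet.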

For the situation with multiple users and multiple tasks, performance is measured as the total throughput of the FPGA chip. In Figure 6, the four columns from light blue to dark blue represent the throughput of the virtualized cores, the virtualized cores optimized for a single task per core, the static multi-core design, and the large single-core design. A total of 16 cores are implemented on the FPGA, so at most 16 tasks can be supported simultaneously.

When there are only a few tasks (1, 2, or 4), the static multi-core design, which sends each task to one small core, cannot achieve high throughput, while the virtualized designs perform much better. When there are 8, 12, or 16 tasks, the throughput of the static single-large-core design does not improve, since all tasks are executed in a TDM manner. For the virtualized design, by contrast, the more tasks are executed, the higher the throughput. With a total of 16 tasks, the optimized virtualized design achieves the optimal throughput.

Figure 6. The multi-task throughput under different situations
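
Part of the gap at low task counts comes down to how many of the 16 cores are actually busy. The sketch below compares the two multi-core schemes on that point alone. The even split used for the virtualized case is an assumption, since the post does not describe the allocation policy, and the function names are hypothetical.

```python
def static_multicore_allocation(num_tasks, total_cores=16):
    """Static multi-core design: one small core per task, the rest stay idle."""
    if not 1 <= num_tasks <= total_cores:
        raise ValueError("need 1 <= num_tasks <= total_cores")
    return [1] * num_tasks


def virtualized_allocation(num_tasks, total_cores=16):
    """Virtualized design (assumed policy): share all cores as evenly as possible."""
    if not 1 <= num_tasks <= total_cores:
        raise ValueError("need 1 <= num_tasks <= total_cores")
    base, extra = divmod(total_cores, num_tasks)
    return [base + (1 if i < extra else 0) for i in range(num_tasks)]


def utilization(allocation, total_cores=16):
    """Fraction of the cores that have work assigned."""
    return sum(allocation) / total_cores


for tasks in (1, 2, 4, 8, 12, 16):
    static = utilization(static_multicore_allocation(tasks))
    virtual = utilization(virtualized_allocation(tasks))
    print(tasks, static, virtual)
```

Core utilization is only one factor. Throughput also depends on how well each model scales across cores, as discussed for Figure 5.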

In conclusion, the proposed FPGA virtualization framework provides excellent performance isolation, scalability, and flexibility. With an online reconfiguration overhead of about 1 ms and a 1.12% single-core performance loss, it achieves 1.07x-1.69x and 1.88x-3.12x performance improvement over the static single-core and static multi-core baseline designs, respectively. The virtualized FPGA design also scales nearly linearly with the hardware resources allocated. It will help further reduce the TCO of deep learning applications in the cloud.

If you are interested in the original paper, you can download it from arXiv:

https://arxiv.org/abs/2003.12101