Inferring Burst Transfer from/to Global Memory
The most common global memories used on Xilinx® acceleration cards are DDR3 and DDR4 SDRAMs. They are most efficient when operated in burst mode. In addition, switching between DDR reads and writes incurs overhead. Xilinx recommends that you transfer large amounts of data in a single burst to get the best efficiency from the memory controller and to keep the compute unit inside the FPGA device busy at all times.
The memory layout of data objects is a key factor in improving data transfer efficiency. Consider a 4x4 matrix “a” as an example. Conceptually, it is a two-dimensional array, as shown in the matrix logical layout in the Figure below. In C/C++, arrays are physically stored in row-major order: all data within a row are stored in consecutive locations, followed by the data within the next row, as shown in the matrix physical layout below. The implication is that if your algorithm reads the data column-wise, burst transfers will not be inferred, because each read targets a non-contiguous location. This can generally be optimized by either transposing your data in the host code or caching multiple columns of data in the kernel.
Figure: Memory Layout Matrix
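The row-major layout described above can be illustrated with a short host-side C++ sketch. It is only an illustration under stated assumptions, not code from any Xilinx example: the names `N`, `offset`, `row_offsets`, `column_offsets`, `is_contiguous`, and `transpose` are all hypothetical. The sketch computes the linear offsets touched by a row-wise and a column-wise read of the 4x4 matrix, checks whether they are consecutive (a precondition for a burst), and shows a simple host-side transpose that turns a column of the original matrix into a contiguous row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative matrix dimension (the 4x4 example from the text).
constexpr std::size_t N = 4;

// Row-major linear offset of element (row, col) in an N x N matrix.
std::size_t offset(std::size_t row, std::size_t col) {
    assert(row < N && col < N);
    return row * N + col;
}

// Offsets touched when reading one row from left to right.
std::vector<std::size_t> row_offsets(std::size_t row) {
    std::vector<std::size_t> out;
    for (std::size_t col = 0; col < N; ++col) {
        out.push_back(offset(row, col));
    }
    return out;
}

// Offsets touched when reading one column from top to bottom.
std::vector<std::size_t> column_offsets(std::size_t col) {
    std::vector<std::size_t> out;
    for (std::size_t row = 0; row < N; ++row) {
        out.push_back(offset(row, col));
    }
    return out;
}

// True when every offset is exactly one past the previous one,
// i.e. the accesses cover a single contiguous range.
bool is_contiguous(const std::vector<std::size_t>& offsets) {
    for (std::size_t i = 1; i < offsets.size(); ++i) {
        if (offsets[i] != offsets[i - 1] + 1) {
            return false;
        }
    }
    return true;
}

// Host-side transpose: column c of `a` becomes row c of the result,
// so a kernel that needs columns of `a` can read rows instead.
std::vector<int> transpose(const std::vector<int>& a) {
    assert(a.size() == N * N);
    std::vector<int> t(N * N);
    for (std::size_t row = 0; row < N; ++row) {
        for (std::size_t col = 0; col < N; ++col) {
            t[offset(col, row)] = a[offset(row, col)];
        }
    }
    return t;
}
```

For example, `column_offsets(1)` yields offsets 1, 5, 9, and 13, which are N elements apart, whereas `row_offsets(1)` yields 4, 5, 6, and 7. After `transpose`, the elements of the original column 1 occupy offsets 4 through 7 and can be fetched as one contiguous range.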
This chapter discusses the most common data access patterns and presents guidelines and examples on how to infer burst transfers for them, as well as how to analyze the profiling data to confirm that the bursts actually occur.