Tiled Read and Write

Some applications like very large matrix multiplications process large amount of data that cannot all be brought into the local memory at once for fast access. It is more practical and efficient to divide the data into tiles and read/compute/write one tile of data at a time. The figure below shows a square matrix divided into 16 tiles with equal size.

Figure: Matrix Divided into 16 Tiles

Below is the code snippet from the Window of 2D Array Example Using AXI4-Master Interface Example in “kernel_to_gmem” category on Xilinx On-boarding Example GitHub that shows the recommended coding style for inferring burst read and write of tiles of data from a two dimensional array (e.g. matrix). Note that the data from the global memory inx is read consecutively into the local memory tile in the innermost loop. Also note that the innermost loop is pipelined using the xcl_pipeline_loop attribute.
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void read_data(__global int *inx) {
    int tile[TILE_HEIGHT][TILE_WIDTH];
    rd_loop_i: for(int i = 0; i < TILE_PER_COLUMN; ++i) {
        rd_loop_j: for (int j = 0; j < TILE_PER_ROW; ++j) {
            rd_buf_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
                __attribute__((xcl_pipeline_loop))
                rd_buf_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
                    // should burst TILE_WIDTH in WORD beat
                    tile[m][n] = inx[TILE_HEIGHT*TILE_PER_ROW*TILE_WIDTH*i+TILE_PER_ROW*TILE_WIDTH*m+TILE_WIDTH*j+n];
                }
            }
            rd_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
                __attribute__((xcl_pipeline_loop))
                rd_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
                    write_pipe_block(inFifo, &tile[m][n]);
                }
            }
        }
    }