Vector Initialization, Load, and Store

Vector registers can be initialized, loaded, and saved in a variety of ways. For optimal performance, it is critical that the local memory that is used to load or save the vector registers be aligned on 16-byte boundaries.

Alignment

The standard alignas specifier (C++11) can be used to ensure proper alignment of local memory. In the following example, the reals array is aligned to a 16-byte boundary.

alignas(16) const int32 reals[8] = 
       {32767, 23170, 0, -23170, -32768, -23170, 0, 23170};
       // aligned to a 16-byte boundary, equivalent to "alignas(v4int32)"
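As a quick check of what alignas guarantees, the following host-side C++ sketch (not AI Engine kernel code) declares the same table with the standard std::int32_t type in place of int32 and tests its address. The helper is_aligned_16 is a hypothetical name introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Same table as in the example, using std::int32_t instead of the AI Engine int32 type.
alignas(16) const std::int32_t reals[8] =
    {32767, 23170, 0, -23170, -32768, -23170, 0, 23170};

// Hypothetical helper: returns true when p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With 4-byte elements, every fourth element of the array (reals[0], reals[4]) also starts on a 16-byte boundary, while reals[1] does not.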

Initialization

The following functions can be used to initialize vector registers as undefined, as all zeros, with data from local memory, or with part of the value set from another register and the remainder left undefined. Initialization using the undef_type() initializer ensures that the compiler can optimize regardless of the undefined parts of the value.

v8int32 v;
v8int32 uv = undef_v8int32(); //undefined
v8int32 nv = null_v8int32(); //all 0's
v8int32 iv = *(v8int32 *) reals; //re-interpret "reals" as "v8int32" pointer and load value from it
v16int32 sv = xset_w(0, iv); //create a new 512-bit vector with lower 256-bit set with "iv"

In the previous example, xset_w is one of the vector set intrinsic functions, named [T]set_[R], which create a vector where only one part is initialized and the other parts are undefined. [T] indicates the target vector register to be set: w for a W register (256-bit), x for an X register (512-bit), and y for a Y register (1024-bit). [R] indicates where the source value comes from: v for a V register (128-bit), w for a W register (256-bit), and x for an X register (512-bit). The [R] width must be smaller than the [T] width, so the valid vector set intrinsic functions are wset_v, xset_v, xset_w, yset_v, yset_w, and yset_x.
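The layout produced by a set intrinsic such as xset_w can be pictured with a plain C++ model. The sketch below is not the intrinsic itself: Vec256, PartialVec512, and model_xset_w are hypothetical names, a per-lane flag stands in for "undefined", and treating index 1 as the upper 256-bit half is an assumption (the example above only shows index 0 selecting the lower half).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for v8int32 (256-bit).
using Vec256 = std::array<std::int32_t, 8>;

// Hypothetical stand-in for a v16int32 (512-bit) whose lanes may be undefined.
struct PartialVec512 {
    std::array<std::int32_t, 16> lane{};
    std::array<bool, 16> defined{};  // false = lane is undefined
};

// Model of xset_w(idx, w): place w in 256-bit half idx (0 = lower, 1 = upper, assumed)
// and leave every other lane undefined.
inline PartialVec512 model_xset_w(int idx, const Vec256& w) {
    assert(idx == 0 || idx == 1);
    PartialVec512 x;
    const std::size_t base = static_cast<std::size_t>(idx) * w.size();
    for (std::size_t i = 0; i < w.size(); ++i) {
        x.lane[base + i] = w[i];
        x.defined[base + i] = true;
    }
    return x;
}
```

In this model, model_xset_w(0, iv) marks lanes 0 to 7 as defined and lanes 8 to 15 as undefined, which mirrors the comment on xset_w(0, iv) in the example.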

The static keyword also applies to vector data types. A static vector that is not explicitly initialized defaults to zero, and its value is kept between graph run iterations.

Load and Store

Load and Store from Vector Registers

The compiler supports standard pointer de-referencing and pointer arithmetic for vectors. Post increment of the pointer is the most efficient form for scheduling. No special intrinsic functions are needed to load vector registers.

v8int32 * ptr_coeff_buffer = (v8int32 *)ptr_kernel_coeff;
v8int32 kernel_vec0 = *ptr_coeff_buffer++; // 1st 8 values (0 .. 7)
v8int32 kernel_vec1 = *ptr_coeff_buffer;   // 2nd 8 values (8 .. 15)
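The pointer arithmetic in this snippet steps by whole vectors, not by single elements. The following host-side sketch illustrates that with a hypothetical Vec8 struct standing in for v8int32; sum_first_lanes is likewise a name invented for the illustration, and nothing here uses the AI Engine compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for v8int32: eight 32-bit lanes, 32-byte aligned.
struct alignas(32) Vec8 {
    std::int32_t lane[8];
};

// Sums lane 0 of n consecutive vectors, advancing the pointer with
// post-increment after each load, as in the kernel snippet.
inline std::int64_t sum_first_lanes(const Vec8* p, int n) {
    assert(p != nullptr && n >= 0);
    std::int64_t acc = 0;
    for (int i = 0; i < n; ++i) {
        const Vec8 v = *p++;  // load one vector, then step to the next one
        acc += v.lane[0];
    }
    return acc;
}
```

For a buffer holding the values 0 to 15, the first load yields values 0 to 7 and the second yields values 8 to 15, so the function returns 0 + 8.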

Load and Store from Memory

AI Engine APIs provide access methods to read and write data from data memory, streaming data ports, and cascade streaming ports, which can be used by AI Engine kernels. For additional details on the window and stream APIs, see Window and Streaming Data API in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416). In the following example, the window_readincr_v8(din) API reads a vector of eight cint16 samples from the input window into the data vector. Similarly, readincr_v8(cin) reads a vector of eight int16 samples from the cin stream, and writeincr_v4(cas_out, v) writes a vector of four cacc48 values to the cascade stream output.

void func(input_window_cint16 *din, 
			input_stream_int16 *cin, 
			output_stream_cacc48 *cas_out){
	v8cint16 data=window_readincr_v8(din);
	v8int16 coef=readincr_v8(cin);
	v4cacc48 v;
	…
	writeincr_v4(cas_out, v);
}

Load and Store Using Pointers

It is mandatory to use the window API for the inputs and outputs in the kernel function prototype. Within the kernel code, however, data can be read and written through a direct pointer reference.

void func(input_window_int16 *w_input, 
			output_window_cint16 *w_output){
	.....
	v16int16 *ptr_in  = (v16int16 *)w_input->ptr;
	v8cint16 *ptr_out = (v8cint16 *)w_output->ptr;
	......
}

The window structure is responsible for managing buffer locks and tracking the buffer type (ping/pong), which can add to the cycle count. This is especially true when loads and stores are out of order (scatter-gather). Using pointers can help reduce the cycle count required for load and store.

Note: If using pointers to load and store data, it is the designer’s responsibility to avoid out-of-bounds memory access.
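One way to keep pointer-based access in bounds is to derive the loop count from the buffer size instead of hard-coding it. The following is a minimal host-side sketch of that idea using plain arrays in place of window buffers; kLanes and copy_vectors are hypothetical names, and eight lanes per vector is an assumption chosen for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 8;  // lanes per vector in this model (assumption)

// Copies `samples` int32 values in chunks of kLanes and returns the number of
// whole vectors processed. The loop bound is derived from the buffer size, so
// neither pointer steps past the end of its buffer.
inline std::size_t copy_vectors(const std::int32_t* in, std::int32_t* out,
                                std::size_t samples) {
    assert(in != nullptr && out != nullptr);
    assert(samples % kLanes == 0);  // buffer must hold a whole number of vectors
    const std::size_t n_vec = samples / kLanes;
    for (std::size_t v = 0; v < n_vec; ++v) {
        for (std::size_t l = 0; l < kLanes; ++l) {
            *out++ = *in++;
        }
    }
    return n_vec;
}
```

In a real kernel the sample count would come from the window size configured in the graph; that connection is left out here.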

Load and Store Using Streams

Vector data can also be loaded from or stored in streams as shown in the following example.

void func(input_stream_int32 *s0, input_stream_int32 *s1, …){
	for(…){
		data0=readincr(s0);
		data1=readincr(s1);
		…
	}
}

For more information about window and streaming data API usage, see Window and Streaming Data API in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416).

Load and Store with Virtual Resource Annotations

The AI Engine can perform several vector load or store operations per cycle. However, for the load or store operations to be executed in parallel, they must target different memory banks. In general, the compiler tries to schedule as many memory accesses as possible in the same cycle, but there are some exceptions. Memory accesses coming from the same pointer are scheduled on different cycles. If the compiler schedules operations on multiple variables or pointers in the same cycle, memory bank conflicts can occur.

To avoid concurrent access to a memory through multiple variables or pointers, the compiler provides the following aie_dm_resource annotations, each of which names a different virtual resource. Accesses through types that are associated with the same virtual resource are not scheduled in the same cycle.

__aie_dm_resource_a
__aie_dm_resource_b
__aie_dm_resource_c
__aie_dm_resource_d
__aie_dm_resource_stack

For example, the following code annotates two arrays with the same __aie_dm_resource_a, which guides the compiler not to access them in the same instruction.

v8int32 va[32];
v8int32 vb[32];
v8int32 __aie_dm_resource_a* restrict p_va = (v8int32 __aie_dm_resource_a*)va;
v8int32 __aie_dm_resource_a* restrict p_vb = (v8int32 __aie_dm_resource_a*)vb;
//access va, vb by p_va, p_vb 
v8int32 vc;
vc=p_va[i]+p_vb[i];

The following code annotates an array and a window buffer with the same __aie_dm_resource_a, which guides the compiler not to access them in the same instruction.

void func(input_window_int32 * __restrict wa, ......
  v8int32 coeff[32];
  input_window_int32 sa;
  v8int32 __aie_dm_resource_a* restrict p_coeff = (v8int32 __aie_dm_resource_a*)coeff;
  input_window_int32 __aie_dm_resource_a* restrict p_wa = (input_window_int32 __aie_dm_resource_a*)&sa;
  window_copy(p_wa,wa);
  v8int32 va;
  va=window_readincr_v8(p_wa);//access wa by p_wa

Update, Extract, and Shift

To update portions of vector registers, the upd_v(), upd_w(), and upd_x() intrinsic functions are provided for 128-bit (v), 256-bit (w), and 512-bit (x) updates.

Note: An update overwrites a portion of the larger vector with the new data while keeping the other part of the vector live. This live state of the larger vector persists through multiple updates. If too many vectors are kept live unnecessarily, register spilling can occur and impact performance.

Similarly, ext_v(), ext_w(), and ext_x() intrinsic functions are provided to extract portions of the vector.
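The effect of updating and extracting a 256-bit portion of a 512-bit vector can be sketched with a plain C++ model. This is an illustration only, not the intrinsics themselves: Vec256, Vec512, model_upd_w, and model_ext_w are hypothetical names, and the argument order and the meaning of the index (0 = lower half, 1 = upper half) are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-ins for v8int32 (256-bit) and v16int32 (512-bit).
using Vec256 = std::array<std::int32_t, 8>;
using Vec512 = std::array<std::int32_t, 16>;

// Model of an upd_w-style update: return x with 256-bit half idx replaced by w.
// The other half of x is carried through unchanged.
inline Vec512 model_upd_w(Vec512 x, int idx, const Vec256& w) {
    assert(idx == 0 || idx == 1);
    const std::size_t base = static_cast<std::size_t>(idx) * w.size();
    for (std::size_t i = 0; i < w.size(); ++i) {
        x[base + i] = w[i];
    }
    return x;
}

// Model of an ext_w-style extract: return 256-bit half idx of x.
inline Vec256 model_ext_w(const Vec512& x, int idx) {
    assert(idx == 0 || idx == 1);
    const std::size_t base = static_cast<std::size_t>(idx) * 8;
    Vec256 w{};
    for (std::size_t i = 0; i < w.size(); ++i) {
        w[i] = x[base + i];
    }
    return w;
}
```

Because the update carries the untouched half forward, the whole 512-bit value has to stay in a register across the update, which is the liveness cost that the note on updates describes.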

To update or extract individual elements, the upd_elem() and ext_elem() intrinsic functions are provided. These must be used when loading or storing values that are not in contiguous memory locations, in which case loading or storing a vector requires multiple clock cycles. In the following example, the 0th element of vector v1 is updated with the value of a, which is 100.

int a = 100;
v4int32 v1 = upd_elem(undef_v4int32(), 0, a);

Another important use is to move data to the scalar unit to compute an inverse or square root. In the following example, the 0th element of vector vf is extracted and stored in the scalar variable f, and its inverse square root is then computed.

v4float vf;
float f=ext_elem(vf,0);
float i_f=invsqrt(f);

The shft_elem() intrinsic function can be used to update a vector by inserting a new element at the beginning of a vector and shifting the other elements by one.
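The element-level operations described in this section can be modeled in a few lines of portable C++. The sketch below is a host-side illustration, not the intrinsics: Vec4 and the model_ functions are hypothetical names, and the shift model follows the description above (new element in lane 0, existing elements moved up by one), with the assumption that the last element is dropped.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for v4int32.
using Vec4 = std::array<std::int32_t, 4>;

// Model of upd_elem: return v with lane idx replaced by val.
inline Vec4 model_upd_elem(Vec4 v, int idx, std::int32_t val) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < v.size());
    v[static_cast<std::size_t>(idx)] = val;
    return v;
}

// Model of ext_elem: return the value held in lane idx.
inline std::int32_t model_ext_elem(const Vec4& v, int idx) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < v.size());
    return v[static_cast<std::size_t>(idx)];
}

// Model of shft_elem: insert val at lane 0 and move the other lanes up by one.
// Dropping the last lane is an assumption of this model.
inline Vec4 model_shft_elem(const Vec4& v, std::int32_t val) {
    Vec4 out{};
    out[0] = val;
    for (std::size_t i = 1; i < out.size(); ++i) {
        out[i] = v[i - 1];
    }
    return out;
}
```

In this model, shifting 9 into a vector holding 1, 2, 3, 4 gives 9, 1, 2, 3.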