Vector Initialization, Load, and Store
Vector registers can be initialized, loaded, and saved in a variety of ways. For optimal performance, it is critical that the local memory that is used to load or save the vector registers be aligned on 16-byte boundaries.
Alignment
The alignas standard C specifier can be used to ensure proper alignment of local memory. In the following example, the reals array is aligned to a 16-byte boundary.
alignas(16) const int32 reals[8] =
{32767, 23170, 0, -23170, -32768, -23170, 0, 23170};
//aligned to a 16-byte boundary, equivalent to "alignas(v4int32)"
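The alignment guarantee can be checked on a host machine with standard C++ alone. The following is a minimal sketch, not AI Engine code: std::int32_t stands in for the int32 type, and is_aligned_16 is a hypothetical helper introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Same data as the example above; std::int32_t is an assumed stand-in
// for the AI Engine int32 type.
alignas(16) const std::int32_t reals[8] =
    {32767, 23170, 0, -23170, -32768, -23170, 0, 23170};

// Hypothetical helper: returns true when p lies on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Calling is_aligned_16(reals) is expected to return true on any conforming compiler, because alignas(16) constrains the address of the array itself.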
Initialization
The following functions can be used to initialize vector registers as undefined, as all zeros, with data from local memory, or with part of the values set from another register and the remaining part undefined. Initialization using the undef_type() initializer ensures that the compiler can optimize regardless of the undefined parts of the value.
v8int32 v;
v8int32 uv = undef_v8int32(); //undefined
v8int32 nv = null_v8int32(); //all 0's
v8int32 iv = *(v8int32 *) reals; //re-interpret "reals" as "v8int32" pointer and load value from it
v16int32 sv = xset_w(0, iv); //create a new 512-bit vector with lower 256-bit set with "iv"
In the previous example, the vector set intrinsic functions [T]set_[R] allow creating a vector where only one part is initialized and the other parts are undefined. Here, [T] indicates the target vector register to be set: w for a W register (256-bit), x for an X register (512-bit), and y for a Y register (1024-bit). [R] indicates where the source value comes from: v for a V register (128-bit), w for a W register (256-bit), and x for an X register (512-bit). Note that the [R] width is smaller than the [T] width. The valid vector set intrinsic functions are wset_v, xset_v, xset_w, yset_v, yset_w, and yset_x.
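The width rule can be written down as a small host-side model. This is a sketch in standard C++, not part of the AI Engine API: the Reg enumeration, slot_count, and is_valid_set are hypothetical names, and the model assumes that the first argument of a set intrinsic (the 0 in xset_w(0, iv)) selects one of the source-sized slots of the target register.

```cpp
#include <cassert>

// Register widths in bits, as listed in the text above.
enum class Reg : int { V = 128, W = 256, X = 512, Y = 1024 };

// Hypothetical helper: number of source-sized slots in the target register,
// which is the assumed range of the index argument (0 .. slot_count - 1).
constexpr int slot_count(Reg target, Reg source) {
    return static_cast<int>(target) / static_cast<int>(source);
}

// A [T]set_[R] combination is valid only when the source is narrower
// than the target.
constexpr bool is_valid_set(Reg target, Reg source) {
    return static_cast<int>(source) < static_cast<int>(target);
}

// xset_w: a 512-bit target holds two 256-bit slots.
static_assert(slot_count(Reg::X, Reg::W) == 2, "two W slots in an X register");
// wset_v: a 256-bit target holds two 128-bit slots.
static_assert(slot_count(Reg::W, Reg::V) == 2, "two V slots in a W register");
```

Under this model, exactly six target and source pairs satisfy is_valid_set, matching the six intrinsic names listed above.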
The static keyword applies to the vector data type as well. When not initialized, the default value is zero, and the value is kept between graph run iterations.
Load and Store
Load and Store from Vector Registers
The compiler supports standard pointer de-referencing and pointer arithmetic for vectors. Post increment of the pointer is the most efficient form for scheduling. No special intrinsic functions are needed to load vector registers.
v8int32 * ptr_coeff_buffer = (v8int32 *)ptr_kernel_coeff;
v8int32 kernel_vec0 = *ptr_coeff_buffer++; // 1st 8 values (0 .. 7)
v8int32 kernel_vec1 = *ptr_coeff_buffer; // 2nd 8 values (8 .. 15)
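The pointer arithmetic involved can be reproduced on a host machine. The sketch below is an illustration in standard C++, assuming a hypothetical Vec8i32 struct as a stand-in for v8int32 and a hypothetical sum_first_two helper. It shows that a post-increment on a vector pointer advances by one whole vector (eight values), not by one element.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for v8int32: eight 32-bit lanes, 32-byte aligned.
struct alignas(32) Vec8i32 {
    std::int32_t lane[8];
};

// Hypothetical helper: loads two consecutive vectors through one pointer
// and adds them lane by lane.
inline Vec8i32 sum_first_two(const Vec8i32* ptr) {
    Vec8i32 v0 = *ptr++;  // values 0..7; the pointer advances by one vector
    Vec8i32 v1 = *ptr;    // values 8..15
    Vec8i32 out{};
    for (int i = 0; i < 8; ++i) {
        out.lane[i] = v0.lane[i] + v1.lane[i];
    }
    return out;
}
```

For a buffer holding the values 0 through 15, lane 0 of the result is 0 + 8 and lane 7 is 7 + 15.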
Load and Store From Memory
AI Engine APIs provide access methods to read and write data from data memory, streaming data ports, and cascade streaming ports, which can be used by AI Engine kernels. For additional details on the window and stream APIs, see Window and Streaming Data API in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416). In the following example, the window_readincr_v8(din) API is used to read eight complex int16 samples from the window into the data vector. Similarly, readincr_v8(cin) is used to read eight int16 samples from the cin stream, and writeincr_v4(cas_out, v) is used to write data to a cascade stream output.
void func(input_window_cint16 *din,
input_stream_int16 *cin,
output_stream_cacc48 *cas_out){
v8cint16 data=window_readincr_v8(din);
v8int16 coef=readincr_v8(cin);
v4cacc48 v;
…
writeincr_v4(cas_out, v);
}
Load and Store Using Pointers
The kernel function prototype must use the window API for its inputs and outputs. Inside the kernel code, however, it is possible to use a direct pointer reference to read and write data.
void func(input_window_int16 *w_input,
output_window_cint16 *w_output){
.....
v16int16 *ptr_in = (v16int16 *)w_input->ptr;
v8cint16 *ptr_out = (v8cint16 *)w_output->ptr;
......
}
The window structure is responsible for managing buffer locks and tracking the buffer type (ping/pong), which can add to the cycle count. This is especially true when loads and stores are out of order (scatter-gather). Using pointers may help reduce the cycle count required for load and store.
Load and Store Using Streams
Vector data can also be loaded from or stored in streams as shown in the following example.
void func(input_stream_int32 *s0, input_stream_int32 *s1, …){
for(…){
data0=readincr(s0);
data1=readincr(s1);
…
}
}
For more information about window and streaming data API usage, see Window and Streaming Data API in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416).
Load and Store with Virtual Resource Annotations
AI Engine is able to perform several vector load or store operations per cycle. However, in order for the load or store operations to be executed in parallel, they must target different memory banks. In general, the compiler tries to schedule many memory accesses in the same cycle when possible, but there are some exceptions. Memory accesses coming from the same pointer are scheduled on different cycles. If the compiler schedules the operations on multiple variables or pointers in the same cycle, memory bank conflicts can occur.
To avoid concurrent access to a memory with multiple variables or pointers, the compiler provides the following aie_dm_resource annotations to annotate different virtual resources. Accesses using types that are associated with the same virtual resource are not scheduled to access the resource in the same cycle.
__aie_dm_resource_a
__aie_dm_resource_b
__aie_dm_resource_c
__aie_dm_resource_d
__aie_dm_resource_stack
For example, the following code annotates two arrays with the same __aie_dm_resource_a, which guides the compiler not to access them in the same instruction.
v8int32 va[32];
v8int32 vb[32];
v8int32 __aie_dm_resource_a* restrict p_va = (v8int32 __aie_dm_resource_a*)va;
v8int32 __aie_dm_resource_a* restrict p_vb = (v8int32 __aie_dm_resource_a*)vb;
//access va, vb by p_va, p_vb
v8int32 vc;
vc=p_va[i]+p_vb[i];
The following code annotates an array and a window buffer with the same __aie_dm_resource_a, which guides the compiler not to access them in the same instruction.
void func(input_window_int32 * __restrict wa, ......
v8int32 coeff[32];
input_window_int32 sa;
v8int32 __aie_dm_resource_a* restrict p_coeff = (v8int32 __aie_dm_resource_a*)coeff;
input_window_int32 __aie_dm_resource_a* restrict p_wa = (input_window_int32 __aie_dm_resource_a*)&sa;
window_copy(p_wa,wa);
v8int32 va;
va=window_readincr_v8(p_wa);//access wa by p_wa
Update, Extract, and Shift
To update portions of vector registers, the upd_v(), upd_w(), and upd_x() intrinsic functions are provided for 128-bit (v), 256-bit (w), and 512-bit (x) updates.
Similarly, the ext_v(), ext_w(), and ext_x() intrinsic functions are provided to extract portions of the vector.
To update or extract individual elements, the upd_elem() and ext_elem() intrinsic functions are provided. These must be used when loading or storing values that are not in contiguous memory locations and require multiple clock cycles to load or store a vector. In the following example, the 0th element of vector v1 is updated with the value of a, which is 100.
int a = 100;
v4int32 v1 = upd_elem(undef_v4int32(), 0, a);
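The non-contiguous case mentioned above can be illustrated with a host-side model. This is a sketch in standard C++ under stated assumptions: Vec4i32 is a hypothetical stand-in for v4int32, upd_elem_model imitates the element update described in the text (it is not the intrinsic itself), and gather_strided is a hypothetical helper that fills one lane per step from values spaced stride elements apart.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side stand-in for v4int32.
struct Vec4i32 {
    std::int32_t lane[4];
};

// Model of an element update: returns a copy of v with lane idx set to val.
inline Vec4i32 upd_elem_model(Vec4i32 v, int idx, std::int32_t val) {
    assert(idx >= 0 && idx < 4);
    v.lane[idx] = val;
    return v;
}

// Hypothetical helper: gathers four values that are stride elements apart.
// Each lane needs its own scalar load, so filling the vector takes
// several steps rather than one vector load.
inline Vec4i32 gather_strided(const std::int32_t* src, std::size_t stride) {
    Vec4i32 v{};  // zero-initialized here; kernel code could start from undef_v4int32()
    for (int i = 0; i < 4; ++i) {
        v = upd_elem_model(v, i, src[static_cast<std::size_t>(i) * stride]);
    }
    return v;
}
```

With a source array holding 0 through 9 and a stride of 3, the gathered lanes are the values at indices 0, 3, 6, and 9.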
Another important use is to move data to the scalar unit and compute an inverse or square root. In the following example, the 0th element of vector vf is extracted and stored in the scalar variable f.
v4float vf;
float f=ext_elem(vf,0);
float i_f=invsqrt(f);
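The same extract-then-compute pattern can be tried on a host machine. The sketch below uses standard C++ only: Vec4f, ext_elem_model, and invsqrt_model are hypothetical stand-ins for v4float, ext_elem(), and invsqrt(), and the model assumes that invsqrt returns the reciprocal of the square root.

```cpp
#include <cassert>
#include <cmath>

// Host-side stand-in for v4float.
struct Vec4f {
    float lane[4];
};

// Model of an element extract: returns lane idx of v as a scalar.
inline float ext_elem_model(const Vec4f& v, int idx) {
    assert(idx >= 0 && idx < 4);
    return v.lane[idx];
}

// Assumed meaning of invsqrt: reciprocal square root, computed on the scalar side.
inline float invsqrt_model(float x) {
    return 1.0f / std::sqrt(x);
}
```

For a vector whose 0th element is 4.0, the extracted scalar is 4.0 and its reciprocal square root is 0.5.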
The shft_elem() intrinsic function can be used to update a vector by inserting a new element at the beginning of a vector and shifting the other elements by one.
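The shifting behavior described above can be modeled in standard C++. This is a sketch, not the intrinsic: Lanes4 and shft_elem_model are hypothetical names, and the model assumes that "the beginning" means element 0 and that the last element is dropped when the others move up by one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side stand-in for a four-lane int32 vector.
using Lanes4 = std::array<std::int32_t, 4>;

// Model of a shift-element update: val enters at lane 0, every other
// element moves up by one lane, and the former last element is discarded.
inline Lanes4 shft_elem_model(const Lanes4& v, std::int32_t val) {
    Lanes4 out{};
    out[0] = val;
    for (std::size_t i = 1; i < out.size(); ++i) {
        out[i] = v[i - 1];
    }
    return out;
}
```

Applied repeatedly, this keeps the most recent samples of a sequence in a vector, with the newest sample always in lane 0.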