Introduction to Scalar and Vector Programming

This section provides an overview of the key elements of kernel programming for scalar and vector processing elements. The details of each element and the related optimization techniques are covered in the following sections.

The following example uses only the scalar engine. It shows a for loop iterating over 512 int32 elements. Each loop iteration multiplies int32 a by int32 b, stores the result in c, and writes c to an output window. The scalar_mul kernel operates on two input windows (blocks) of data of type input_window_int32 and produces an output window of data of type output_window_int32.

The window_readincr and window_writeincr APIs read from and write to the circular buffers outside the kernel. For additional details on the window APIs, see Window and Streaming Data API in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416).

void scalar_mul(input_window_int32* data1,
			input_window_int32* data2,
			output_window_int32* out){
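	// One int32 multiply per iteration, 512 iterations in total
	for(int i=0;i<512;i++)
	{
		int32 a=window_readincr(data1);	// read one element, then advance
		int32 b=window_readincr(data2);
		int32 c=a*b;
		window_writeincr(out,c);	// write one element, then advance
	}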
}
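
For checking results away from the target, the behavior of the scalar loop can be approximated in portable C++. The following is only a sketch under stated assumptions, not AI Engine code: WindowModel, read_incr, write_incr, and scalar_mul_model are hypothetical names introduced here, the window is modeled as a plain buffer with a position (no circular-buffer management), and the product is assumed to keep the low 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an int32 window: a buffer plus a current position.
struct WindowModel {
    std::vector<int32_t> data;
    std::size_t pos = 0;
};

// Read the element at the current position, then advance by one element.
int32_t read_incr(WindowModel& w) {
    assert(w.pos < w.data.size());
    return w.data[w.pos++];
}

// Write one element at the current position, then advance by one element.
void write_incr(WindowModel& w, int32_t value) {
    assert(w.pos < w.data.size());
    w.data[w.pos++] = value;
}

// Portable model of the scalar loop: one multiply per iteration.
void scalar_mul_model(WindowModel& in1, WindowModel& in2, WindowModel& out,
                      std::size_t count = 512) {
    for (std::size_t i = 0; i < count; ++i) {
        int32_t a = read_incr(in1);
        int32_t b = read_incr(in2);
        // Assumption: the product keeps the low 32 bits. Unsigned arithmetic
        // avoids signed-overflow undefined behavior in standard C++.
        int32_t c = static_cast<int32_t>(static_cast<uint32_t>(a) *
                                         static_cast<uint32_t>(b));
        write_incr(out, c);
    }
}
```

A model of this kind is useful mainly as a golden reference: feed it the same input data as the kernel and compare the output buffers element by element.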

The following example is a vectorized version of the same kernel.

void vect_mul(input_window_int32* __restrict data1,
			input_window_int32* __restrict data2,
			output_window_int32* __restrict out){
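	// 64 iterations x 8 lanes = 512 int32 elements
	for(int i=0;i<64;i++)
	chess_prepare_for_pipelining
	{
		v8int32 va=window_readincr_v8(data1);	// read 8 int32 lanes
		v8int32 vb=window_readincr_v8(data2);
		v8acc80 vt=mul(va,vb);			// lane-wise multiply into accumulators
		v8int32 vc=srs(vt,0);			// shift by 0, round, saturate to int32
		window_writeincr(out,vc);
	}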
}

Note the data types v8int32 and v8acc80 used in the previous kernel code. The window API window_readincr_v8 returns a vector of 8 int32 values, which are stored in the vector variables va and vb. These two variables are passed to the intrinsic function mul, which returns vt, a value of the accumulator type v8acc80. The shift-round-saturate function srs then reduces the v8acc80 value to a v8int32 value, vc, which is written to the output window. Additional details on the data types supported by the AI Engine are covered in the following sections.
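
The data flow just described (8 lanes in, wide accumulators, then shift-round-saturate back to int32) can be sketched in portable C++. This is a hedged model rather than the intrinsic implementation: Vec8, Acc8, mul_model, and srs_model are hypothetical names, int64_t stands in for each accumulator lane (wide enough for an int32 by int32 product), and round-half-up with saturation is an assumption. The rounding and saturation behavior of the real srs depends on the configured hardware modes, which this sketch does not model.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

using Vec8 = std::array<int32_t, 8>;  // stands in for v8int32
using Acc8 = std::array<int64_t, 8>;  // stands in for v8acc80 (assumption)

// Lane-wise multiply into wide accumulators, so no product is truncated.
Acc8 mul_model(const Vec8& a, const Vec8& b) {
    Acc8 acc{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        acc[lane] = static_cast<int64_t>(a[lane]) * b[lane];
    }
    return acc;
}

// Shift right with round-half-up, then saturate each lane to the int32 range.
Vec8 srs_model(const Acc8& acc, int shift) {
    assert(shift >= 0 && shift <= 31);
    const int64_t lo = std::numeric_limits<int32_t>::min();
    const int64_t hi = std::numeric_limits<int32_t>::max();
    Vec8 out{};
    for (std::size_t lane = 0; lane < 8; ++lane) {
        int64_t v = acc[lane];
        if (shift > 0) {
            v += int64_t{1} << (shift - 1);  // rounding bias (assumed mode)
            v >>= shift;                     // arithmetic shift assumed
        }
        if (v < lo) v = lo;                  // saturate instead of wrapping
        if (v > hi) v = hi;
        out[lane] = static_cast<int32_t>(v);
    }
    return out;
}
```

With a shift of 0, as in vect_mul, the model passes products through unchanged unless they fall outside the int32 range, in which case it clamps them. Whether the hardware clamps or wraps in that case depends on its saturation setting.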

The __restrict keyword, used on the input and output parameters of the vect_mul function, allows more aggressive compiler optimization by explicitly stating that the data accessed through each pointer is independent of the data accessed through the others.
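
The effect of __restrict is not specific to the AI Engine. As an illustrative sketch, the following portable C++ function (mul_no_alias is a hypothetical name) uses the __restrict extension accepted by common compilers such as GCC, Clang, and MSVC. Without the qualifier, the compiler must assume that a store to out[i] might change a later a[i+1] or b[i+1], which limits how far it can reorder or batch loads and stores. With it, the caller promises that the three arrays do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element-wise multiply. The __restrict qualifiers are a promise by the
// caller that a, b, and out never refer to overlapping memory.
void mul_no_alias(const int32_t* __restrict a,
                  const int32_t* __restrict b,
                  int32_t* __restrict out,
                  std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        // Unsigned arithmetic keeps the low 32 bits without signed-overflow
        // undefined behavior (an assumption about the desired result).
        out[i] = static_cast<int32_t>(static_cast<uint32_t>(a[i]) *
                                      static_cast<uint32_t>(b[i]));
    }
}
```

The promise is not checked: calling such a function with overlapping buffers is undefined behavior, so the qualifier should be added only when independence is guaranteed by construction.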

chess_prepare_for_pipelining is a compiler pragma that directs the kernel compiler to build an optimized software pipeline for the loop.
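
To give a feel for what software pipelining means, the sketch below pipelines a multiply loop by hand in portable C++: the loads for iteration i+1 are placed in the same loop body as the multiply and store for iteration i, with a prologue and an epilogue to fill and drain the pipeline. This is only a conceptual illustration with a hypothetical function name (mul_pipelined). The schedule the kernel compiler actually produces is chosen by the compiler and is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hand-pipelined element-wise multiply with two stages:
// stage 1 loads operands, stage 2 multiplies and stores.
void mul_pipelined(const int32_t* a, const int32_t* b, int64_t* out,
                   std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    if (n == 0) return;

    // Prologue: run stage 1 for iteration 0.
    int32_t a_next = a[0];
    int32_t b_next = b[0];

    // Steady state: stage 1 of iteration i+1 overlaps stage 2 of iteration i.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        const int32_t a_cur = a_next;
        const int32_t b_cur = b_next;
        a_next = a[i + 1];                             // stage 1, iteration i+1
        b_next = b[i + 1];
        out[i] = static_cast<int64_t>(a_cur) * b_cur;  // stage 2, iteration i
    }

    // Epilogue: run stage 2 for the last iteration.
    out[n - 1] = static_cast<int64_t>(a_next) * b_next;
}
```

On a processor that can issue loads and multiplies in the same cycle, overlapping the stages in this way hides load latency behind useful work. The pragma asks the compiler to find such a schedule automatically.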

The scalar version of this example function takes 1055 cycles, while the vectorized version takes only 99 cycles, a speedup of more than 10x (1055/99 is approximately 10.7). Vector processing by itself offers 8x the throughput for int32 multiplication, but it has a higher latency, so it would not deliver the full 8x overall. With the loop optimizations applied, however, the overall speedup exceeds 10x. The sections that follow describe in detail the data types that can be used, the registers available, and the kinds of optimizations that can be achieved on the AI Engine using concepts like software pipelining in loops and keywords like __restrict.
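
The arithmetic behind these figures can be reproduced with a small helper. The sketch below uses hypothetical function names (speedup, cycles_per_element) and takes only the cycle counts quoted above as inputs.

```cpp
#include <cassert>

// Ratio of baseline cycles to optimized cycles.
double speedup(double baseline_cycles, double optimized_cycles) {
    assert(optimized_cycles > 0.0);
    return baseline_cycles / optimized_cycles;
}

// Average number of cycles spent per processed element.
double cycles_per_element(double cycles, double elements) {
    assert(elements > 0.0);
    return cycles / elements;
}
```

For the numbers in this section, speedup(1055, 99) is about 10.66, cycles_per_element(1055, 512) is about 2.06 for the scalar kernel, and cycles_per_element(99, 512) is about 0.19 for the vectorized kernel, or roughly 1.55 cycles for each of its 64 loop iterations.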