Single Kernel Coding Examples

The following sections focus on mapping an algorithm onto a single AI Engine. They take into consideration the compute and memory bounds of the algorithm and the capabilities of the AI Engine. Scalar and vectorized versions of the code are shown to illustrate vectorization.

Matrix Vector Multiplication

The following matrix vector multiplication example focuses on vectorizing a single AI Engine kernel. It implements the following matrix vector multiplication equation.

C (64 x 1) = A (64 x 16) * B(16 x 1)

The example assumes that the data for the matrices is stored in column-major form and that the data type for matrices A and B is int16. With this layout, the elements of the output vector are computed as follows.

c0 = a0*b0 + a64*b1 + a128*b2 + a192*b3 + a256*b4 + a320*b5 + a384*b6 + a448*b7 + … 
c1 = a1*b0 + a65*b1 + a129*b2 + a193*b3 + a257*b4 + a321*b5 + a385*b6 + a449*b7 + … 
c2 = a2*b0 + a66*b1 + a130*b2 + a194*b3 + a258*b4 + a322*b5 + a386*b6 + a450*b7 + … 
c3 = a3*b0 + a67*b1 + a131*b2 + a195*b3 + a259*b4 + a323*b5 + a387*b6 + a451*b7 + …      
                            … 
c60 = a60*b0 + a124*b1 + a188*b2 + a252*b3 + a316*b4 + a380*b5 + a444*b6 + a508*b7 + … 
c61 = a61*b0 + a125*b1 + a189*b2 + a253*b3 + a317*b4 + a381*b5 + a445*b6 + a509*b7 + … 
c62 = a62*b0 + a126*b1 + a190*b2 + a254*b3 + a318*b4 + a382*b5 + a446*b6 + a510*b7 + … 
c63 = a63*b0 + a127*b1 + a191*b2 + a255*b3 + a319*b4 + a383*b5 + a447*b6 + a511*b7 + …
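
The indexing above follows directly from the column-major layout: element (i, j) of A is stored at offset j*64 + i, so the A operand of each successive term advances by 64. A minimal sketch of this mapping (array and function names are illustrative only):

#include <stdint.h>

// Illustrative only: pack a 64x16 matrix (indexed A[row][col]) into the
// column-major array a[] used above, so that a[j*64 + i] holds A(i,j) and
// c[i] is the sum over j of a[j*64 + i] * b[j].
void pack_column_major(const int16_t A[64][16], int16_t a[64*16]) {
    for (int j = 0; j < 16; j++)
        for (int i = 0; i < 64; i++)
            a[j*64 + i] = A[i][j];
}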

Kernel Coding Bounds

In this example, a total of 16 int16 x int16 multiplications is required per output value. Because matrix C consists of 64 values, a total of 16 * 64 = 1024 multiplications is required to complete one matrix vector multiplication. Given that an AI Engine can perform 32 16-bit multiplications per cycle, the minimum number of cycles required for the matrix vector multiplication is 1024/32 = 32. The summation of the individual terms requires no additional cycles because the addition is performed together with the multiplication in a MAC operation. Hence, the compute bound for the kernel is:

Compute bound = 32 cycles / invocation

Next, analyze the memory access bound for the kernel. To fully use the vector unit MAC performance, 32 16-bit multiplications must be performed per cycle. Vector B can be kept in a vector register because it is only 16 * 16-bit = 256 bits; it does not need to be fetched from AI Engine data memory or the tile interface for each MAC operation. The data of matrix A, however, is needed at a rate of 32 * 16-bit = 512 bits per cycle. Because the stream interface only supports 2 * 32 bits per cycle, the data must be fetched from data memory, which allows two 256-bit loads per cycle and therefore matches the MAC performance. Thus, if two 256-bit loads are performed each cycle, the memory bound for the kernel is:

Memory bound = 32 cycles / invocation

Note that the compute bound and memory bound are theoretical limits of the kernel realization. They do not take into account the function overhead outside the main computation loop. When the kernel is only part of a graph, the requirement might be relaxed due to bandwidth limitations of other kernels or lower overall system performance requirements.

Vectorization

For a complicated vector processing algorithm, starting with a scalar version is recommended because it also serves as a golden reference for verifying accuracy. The scalar version of the matrix vector multiplication is shown as follows.

void matmul_scalar(input_window_int16* matA, 
      input_window_int16* matB, 
      output_window_int16* matC){ //A[M,N], B[N,1], C[M,1]. M=64, N=16 
    for(int i=0; i<M; i++){  
        int temp = 0 ; 
        for(int j=0; j<N; j++){ 
            temp += window_read(matA)*window_readincr(matB) ;    
            window_incr(matA,64); //Jump of 64 elements to access the next element of the same row                             
        } 
        window_writeincr(matC,(int16_t)(temp>>15)) ;           
        window_incr(matA,1); //Jump of one element for moving to the next row.  
    }    
}

Note that in the previous code, matA is stored in column-major form and matB is a circular buffer to the kernel. matB can be read continuously with window_readincr when computing different rows of the output because the pointer wraps back to the start of the buffer.
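
Because the scalar version serves as the golden reference, it can also be expressed without the window API, for example to check results on the host. A minimal sketch, assuming column-major A and the same Q15-style output shift (function and array names are illustrative):

#include <stdint.h>

// Golden reference: C(64x1) = A(64x16) * B(16x1), A stored column-major.
// The result is shifted right by 15 to match the kernel's output scaling.
void matmul_golden(const int16_t* a, const int16_t* b, int16_t* c) {
    for (int i = 0; i < 64; i++) {
        int32_t acc = 0;
        for (int j = 0; j < 16; j++)
            acc += (int32_t)a[j*64 + i] * b[j];
        c[i] = (int16_t)(acc >> 15);
    }
}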

There are a total of 64 outputs (M=64), and each output needs 16 (N=16) multiplications. When choosing MAC intrinsics for vector processing with the data type int16 * int16, a MAC with 4, 8, or 16 lanes can be selected. These options are illustrated in the following figure.

Figure 1: Lane Selection

Note that the main difference between the 4, 8, and 16 lane MACs is how the data is consumed. Because the data is stored by column, the 16 lane MAC may be the best choice: only two blocks of contiguous data need to be loaded for each MAC operation, a0 to a15 and a64 to a79. a0 to a15 is 256 bits, which can be brought into the vector register with a single load.

To allow the two loads to occur in the same cycle, a0 to a15 and a64 to a79 must reside in separate memory banks. The data therefore needs to be divided, column by column, into two separate buffers to the kernel; that is, a0 to a63 go into the first buffer, a64 to a127 into the second buffer, a128 to a191 into the first buffer again, and so on.
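
One way to perform this split on the host or in the test bench is sketched below, assuming the column-major array a[] of 64 x 16 int16 values described earlier (buffer and function names are illustrative):

#include <stdint.h>
#include <string.h>

// Distribute the 16 columns alternately into two buffers so that consecutive
// columns land in different banks: even columns (a0..a63, a128..a191, ...)
// go to buf_a0, odd columns (a64..a127, ...) go to buf_a1.
void split_columns(const int16_t a[64*16], int16_t buf_a0[64*8], int16_t buf_a1[64*8]) {
    for (int col = 0; col < 16; col++) {
        int16_t* dst = (col % 2 == 0) ? buf_a0 : buf_a1;
        memcpy(dst + (col / 2) * 64, a + col * 64, 64 * sizeof(int16_t));
    }
}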

With this vectorization, the matrix vector multiplication becomes a loop with 64/16 = 4 iterations, where each iteration contains eight MAC operations and produces 16 output values. This is illustrated in the following figure.

Figure 2: Vectorization
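
For reference, the schedule illustrated above can be written as a restructured scalar model; this is a sketch, not intrinsics, and it assumes the column-major layout and Q15 output shift used in the scalar code (names are illustrative):

#include <stdint.h>

// Scalar model mirroring the vector schedule: four outer iterations, each
// accumulating 16 outputs with eight MAC steps, where every step consumes
// two columns of A (two 256-bit loads in the real kernel).
void matmul_vec16_model(const int16_t a[64*16], const int16_t b[16], int16_t c[64]) {
    for (int i = 0; i < 64/16; i++) {          // 4 iterations, 16 outputs each
        int32_t acc[16] = {0};
        for (int j = 0; j < 16; j += 2) {      // 8 MAC steps, 2 columns per step
            for (int lane = 0; lane < 16; lane++) {
                acc[lane] += (int32_t)a[j*64     + i*16 + lane] * b[j]
                           + (int32_t)a[(j+1)*64 + i*16 + lane] * b[j+1];
            }
        }
        for (int lane = 0; lane < 16; lane++)
            c[i*16 + lane] = (int16_t)(acc[lane] >> 15);
    }
}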

The mac16() intrinsic function to be used has the following interface.

v16acc48 mac16( v16acc48 acc, 
    v32int16 xbuff, 
    int xstart, 
    unsigned int xoffsets, 
    unsigned int xoffsets_hi, 
    unsigned int xsquare, 
    v16int16 zbuff, 
    int zstart, 
    unsigned int zoffsets, 
    unsigned int zoffsets_hi, 
    int zstep
)

The start, offsets, square, and step parameters control the indexing into the buffers (vector registers). For details about the lane addressing scheme with these parameters, see MAC Intrinsics.

Coding with MAC intrinsics can be seen in the following section.

Coding with Intrinsics

You have analyzed how the function will be mapped into the AI Engine vector processor. Now have a look at the first version of the vectorized code.

inline void mac16_sub(input_window_int16* matA, v16int16 &buf_matB, v16acc48 &acc, int i){
	v32int16 buf_matA = undef_v32int16(); // holds 32 elements of matA
	buf_matA=upd_w(buf_matA, 0, window_read_v16(matA));
	window_incr(matA,64);
	buf_matA = upd_w(buf_matA, 1, window_read_v16(matA));
	window_incr(matA,64);
	acc = mac16(acc,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
}

void matmul_vec16(input_window_int16*  matA,
		input_window_int16* matB,
		output_window_int16* matC){

	v16int16 buf_matB = window_read_v16(matB); // holds 16 elements of matB
	v16acc48 acc = null_v16acc48(); // holds acc value of Row * column dot product

	for (unsigned int i=0;i<M/16;i++)  //M=64, Each iteration computes 16 outputs
	{
		acc=null_v16acc48();
		for(int j=0;j<16;j+=2){
			mac16_sub(matA,buf_matB,acc,j);
		}
		window_writeincr(matC,srs(acc,15));
		window_incr(matA,16);
	}
}

In the main function matmul_vec16, the outer loop produces 16 output values per iteration. Inside the outer loop body, there is an inner loop with eight iterations. Each iteration of the inner loop calls the inline function mac16_sub, which performs one mac16 operation together with the two data loads required for that MAC.

Inside mac16_sub(), buf_matA is declared as a local variable, while buf_matB and acc are declared as local variables in the main function and passed between the functions by reference. This ensures that only one copy of each vector exists. The function has one parameter, i, which is used as the zstart argument of the mac16() intrinsic as follows; this specific use of the intrinsic (with i=0) is introduced in MAC Intrinsics.

acc = mac16(acc,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);

At the end of each iteration of the outer loop, the window pointer for the data is incremented by 16 (that is, 16 rows of the matrix).

Note: In this example, inline is used to guide the tool to remove the boundary of a function, and inline __attribute__((always_inline)) can be used to force removal of the boundary. Sometimes it is instead helpful to retain the boundary of a function using __attribute__((noinline)) void func(...). Note that whether or not a function is inlined can affect program memory usage and program optimization.
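
For illustration, the three forms mentioned in the note look as follows (hypothetical function names):

// Hint the compiler to remove the function boundary (the form used in this example).
inline void helper_hint(int x);

// Force removal of the function boundary.
inline __attribute__((always_inline)) void helper_forced(int x);

// Retain the function boundary.
__attribute__((noinline)) void helper_kept(int x);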

The compiled code for the kernel can be found in the disassembly view in the debug perspective of the Vitis™ IDE. Note that a graph is needed to compile the kernel with the AI Engine tools. For more information about the assembly code in the disassembly view, refer to Using Vitis IDE and Reports. For additional details on graph coding and Vitis IDE usage, refer to the Versal ACAP AI Engine Programming Environment User Guide (UG1076).
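
As an illustration, a minimal graph wrapping the kernel above might look like the following sketch. The file name matmul.cc and the port names are assumptions; the window sizes follow from the 64x16 int16 matrix (2048 bytes), the 16-element vector (32 bytes), and the 64-element output (128 bytes).

#include <adf.h>
using namespace adf;

// Kernel declaration (assumed to be implemented in matmul.cc).
void matmul_vec16(input_window_int16* matA, input_window_int16* matB,
                  output_window_int16* matC);

class MatmulGraph : public graph {
public:
    kernel mm;
    input_port inA, inB;
    output_port outC;
    MatmulGraph() {
        mm = kernel::create(matmul_vec16);
        source(mm) = "matmul.cc";
        runtime<ratio>(mm) = 1.0;
        connect<window<2048>>(inA, mm.in[0]);   // matrix A: 64*16 int16
        connect<window<32>>(inB, mm.in[1]);     // vector B: 16 int16
        connect<window<128>>(mm.out[0], outC);  // vector C: 64 int16
    }
};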

Figure 3: Assembly Code for the Loop

Note that the compiler automatically unrolls the inner loop and pipelines the outer loop. From the previous assembly code for the loop, each iteration requires 19 cycles. However, with one window interface for the data (matA), the minimum number of cycles required for the eight MACs is 16 (two loads of data per MAC). This degradation of performance is caused by the unbalanced window pointer increment at the end of the loop, and it can be resolved by merging that last increment into the last MAC operation. The optimized code is as follows.

inline void mac16_sub(input_window_int16* matA, v16int16 &buf_matB, v16acc48 &acc, int i,int incr_num){
	v32int16 buf_matA = undef_v32int16(); // holds 32 elements of matA
	buf_matA=upd_w(buf_matA, 0, window_read_v16(matA));
	window_incr(matA,64);
	buf_matA = upd_w(buf_matA, 1, window_read_v16(matA));
	window_incr(matA,incr_num);
	acc = mac16(acc,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
}

void matmul_vec16(input_window_int16*  matA,
		input_window_int16* matB,
		output_window_int16* matC){

	v16int16 buf_matB = window_read_v16(matB); // holds 16 elements of matB
	v16acc48 acc = null_v16acc48(); // holds acc value of Row * column dot product

	for (unsigned int i=0;i<M/16;i++)  //M=64, Each iteration computes 16 outputs
	{
		acc=null_v16acc48();
		for(int j=0;j<16;j+=2){
			int incr_num=(j==14)?80:64;
			mac16_sub(matA,buf_matB,acc,j,incr_num);
		}
		window_writeincr(matC,srs(acc,15));
	}
}

Note that the function mac16_sub has a new parameter, incr_num, which controls the pointer increment and differs for the last call in the inner loop. The increment of 80 in the last call ensures that the next iteration of the outer loop starts at the next 16 rows: across one outer iteration the pointer advances 15*64 + 80 = 1040 elements, and 1040 modulo the 1024-element circular window equals 16. The assembly code for the loop is now as shown in the following figure.

Figure 4: Optimized Assembly Code for the Loop

An iteration of the loop now requires 16 cycles, which means this version of the kernel takes 16*4 = 64 cycles per invocation. As seen in the previous section, the theoretical limit is 32 cycles per invocation, that is, eight cycles per loop iteration, which means that the eight MAC operations must be compacted into eight cycles. Depending on the system performance requirements, this can be achieved by splitting the input data, column by column, into two window buffers, matA_0 and matA_1. The data of the two windows is first read into two v16int16 vectors and then concatenated into one v32int16 vector to be used in the mac16 intrinsic. The code for the kernel is as follows.

inline void mac16_sub_loads(input_window_int16* matA_0, input_window_int16* matA_1, v16int16 &buf_matB, v16acc48 &acc, int i, int incr_num){
	v16int16 buf_matA0 = window_read_v16(matA_0);
	window_incr(matA_0,incr_num);
	v16int16 buf_matA1 = window_read_v16(matA_1);
	window_incr(matA_1,incr_num);
	acc = mac16(acc,concat(buf_matA0,buf_matA1),0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
}

void matmul_vec16(input_window_int16* __restrict matA_0,
		input_window_int16* __restrict matA_1,
		input_window_int16* __restrict matB,
		output_window_int16* __restrict matC){
	v16int16 buf_matB = window_read_v16(matB);
	for (unsigned int i=0;i<M/16;i++)  //M=64, Each iteration computes 16 outputs
	chess_prepare_for_pipelining
	{
		v16acc48 acc=null_v16acc48();
		for(int j=0;j<16;j+=2){
			int incr_num=(j==14)?80:64;
			mac16_sub_loads(matA_0,matA_1,buf_matB,acc,j,incr_num);
		}
		window_writeincr(matC,srs(acc,15));
	}
}

Note that two v16int16 vectors, buf_matA0 and buf_matA1, are defined and concatenated for the mac16 intrinsic. Also note that chess_prepare_for_pipelining is added to the loop and the __restrict keyword is applied to the window interfaces to ensure that the loop is pipelined and the window operations are well optimized.

IMPORTANT: The __restrict keyword cannot be used freely. Before using it, refer to Using the Restrict Keyword in AI Engine Kernels in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416).

The assembly code for the version with two window loads per cycle is as follows.

Figure 5: Assembly Code for Two Window Loads a Cycle

Matrix Multiplication

The following matrix multiplication example implements the equation:

C (64 x 2) = A (64 x 8) * B(8 x 2)

The example assumes that the data for the matrices is stored in column-major form and that the data type for matrices A and B is int16.

The first output column is computed as follows.

c0 = a0*b0 + a64*b1 + a128*b2 + a192*b3 + a256*b4 + a320*b5 + a384*b6 + a448*b7
c1 = a1*b0 + a65*b1 + a129*b2 + a193*b3 + a257*b4 + a321*b5 + a385*b6 + a449*b7
c2 = a2*b0 + a66*b1 + a130*b2 + a194*b3 + a258*b4 + a322*b5 + a386*b6 + a450*b7
c3 = a3*b0 + a67*b1 + a131*b2 + a195*b3 + a259*b4 + a323*b5 + a387*b6 + a451*b7
                            … 
c60 = a60*b0 + a124*b1 + a188*b2 + a252*b3 + a316*b4 + a380*b5 + a444*b6 + a508*b7
c61 = a61*b0 + a125*b1 + a189*b2 + a253*b3 + a317*b4 + a381*b5 + a445*b6 + a509*b7
c62 = a62*b0 + a126*b1 + a190*b2 + a254*b3 + a318*b4 + a382*b5 + a446*b6 + a510*b7
c63 = a63*b0 + a127*b1 + a191*b2 + a255*b3 + a319*b4 + a383*b5 + a447*b6 + a511*b7

The second output column is computed as follows.

c64 = a0*b8 + a64*b9 + a128*b10 + a192*b11 + a256*b12 + a320*b13 + a384*b14 + a448*b15
c65 = a1*b8 + a65*b9 + a129*b10 + a193*b11 + a257*b12 + a321*b13 + a385*b14 + a449*b15
c66 = a2*b8 + a66*b9 + a130*b10 + a194*b11 + a258*b12 + a322*b13 + a386*b14 + a450*b15
c67 = a3*b8 + a67*b9 + a131*b10 + a195*b11 + a259*b12 + a323*b13 + a387*b14 + a451*b15
                            … 
c124 = a60*b8 + a124*b9 + a188*b10 + a252*b11 + a316*b12 + a380*b13 + a444*b14 + a508*b15
c125 = a61*b8 + a125*b9 + a189*b10 + a253*b11 + a317*b12 + a381*b13 + a445*b14 + a509*b15
c126 = a62*b8 + a126*b9 + a190*b10 + a254*b11 + a318*b12 + a382*b13 + a446*b14 + a510*b15
c127 = a63*b8 + a127*b9 + a191*b10 + a255*b11 + a319*b12 + a383*b13 + a447*b14 + a511*b15

Kernel Coding Bounds

In this example, a total of 1024 int16 x int16 multiplications is required to compute the 128 output values. Given that an AI Engine can perform 32 16-bit multiplications per cycle, the compute bound for the kernel is as follows.

Compute bound = 32 cycles / invocation

Matrix B can be kept in a vector register because it is only 16 * 16-bit = 256 bits; it does not need to be fetched from AI Engine data memory or the tile interface for each MAC operation. Considering the data of matrix A needed for the computation, a total of 64 * 8 * 2 = 1024 bytes must be fetched from memory. Given that the AI Engine allows two 256-bit (32-byte) loads per cycle, the memory bound for the kernel is as follows.

Memory bound = 1024 / (2*32) = 16 cycles / invocation

The compute bound is larger than the memory bound. Hence, the goal of vectorization is to approach the theoretical limit of MAC operations in the vector processor.

Vectorization

The scalar reference code for this matrix multiplication example is shown as follows. Note that the data is stored in columns.

void matmul_mat8_scalar(input_window_int16* matA,
		input_window_int16* matB,
		output_window_int16* matC){

	for(int i=0; i<M; i++){//M=64
		for(int j=0;j<L;j++){//L=2
			int temp = 0 ;
			for(int k=0; k<N; k++){//N=8
				temp += window_read(matA)*window_readincr(matB);//B is circular buffer, size N*L
				window_incr(matA,64); //Jump of 64 elements to access the next element of the same row
			}
			window_write(matC,(int16_t)(temp>>15)) ;
			window_incr(matC,64); //Jump to the next column
		}
		window_incr(matA,1); //Jump of one element for moving to the next row.
		window_incr(matC,1); //Jump to the next row
	}
}

As analyzed in the previous example, Matrix Vector Multiplication, the mac16 intrinsic is the best choice because it computes 16 lanes together and 16 int16 values from a column can be loaded at once. To compute 16 output values of one column, four mac16 operations are needed. The same data from matrix A is used twice, once for each output column. Thus, two columns of A can be loaded and two mac16 operations used to accumulate into the two output columns. These two loads and two MACs are repeated four times to obtain the results of the two output columns. This method is shown in the following pseudo-code.

C_[0:15,0] = A_[0:15,0:1]*B_[0:1,0] 
C_[0:15,1] = A_[0:15,0:1]*B_[0:1,1] 

C_[0:15,0]+= A_[0:15,2:3]*B_[2:3,0] 
C_[0:15,1]+= A_[0:15,2:3]*B_[2:3,1]
 
C_[0:15,0]+= A_[0:15,4:5]*B_[4:5,0] 
C_[0:15,1]+= A_[0:15,4:5]*B_[4:5,1]

C_[0:15,0]+= A_[0:15,6:7]*B_[6:7,0] 
C_[0:15,1]+= A_[0:15,6:7]*B_[6:7,1]

In the previous pseudo-code, each "*" denotes a MAC operation. C_[0:15,0] and C_[0:15,1] denote the two output columns, which are accumulated separately. A_[0:15,0:1] denotes columns 0 and 1 of A, each with 16 elements, and B_[0:1,0] denotes the first two elements of column 0 of B. In the real vectorized code, this sequence is placed inside a loop because there are 64 output rows.
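
The same schedule can be written as a plain C model that makes the data reuse explicit; a minimal sketch, assuming column-major storage for A, B, and C and the same Q15 output shift (function and array names are illustrative):

#include <stdint.h>

// C(64x2) = A(64x8) * B(8x2), all column-major. Each step of the inner loop
// consumes two columns of A and accumulates into both output columns,
// mirroring the two loads and two MACs of the pseudo-code above.
void matmul_mat8_model(const int16_t* a, const int16_t* b, int16_t* c) {
    for (int i = 0; i < 64/16; i++) {            // 4 blocks of 16 rows
        int32_t acc0[16] = {0}, acc1[16] = {0};
        for (int k = 0; k < 8; k += 2) {         // 4 steps, 2 columns of A per step
            for (int lane = 0; lane < 16; lane++) {
                int row = i*16 + lane;
                int32_t a0 = a[k*64 + row], a1 = a[(k+1)*64 + row];
                acc0[lane] += a0*b[k]   + a1*b[k+1];       // first output column
                acc1[lane] += a0*b[8+k] + a1*b[8+k+1];     // second output column
            }
        }
        for (int lane = 0; lane < 16; lane++) {
            c[i*16 + lane]      = (int16_t)(acc0[lane] >> 15);
            c[64 + i*16 + lane] = (int16_t)(acc1[lane] >> 15);
        }
    }
}

The mac16 intrinsic function to be used has the following interface.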

v16acc48 mac16( v16acc48 acc,
    v64int16 xbuff,
    int xstart,
    unsigned int xoffsets,
    unsigned int xoffsets_hi,
    unsigned int xsquare,
    v16int16 zbuff,
    int zstart,
    unsigned int zoffsets,
    unsigned int zoffsets_hi,
    int zstep
)

The start, offsets, square, and step parameters control the indexing into the buffers (vector registers). For details about the lane addressing scheme with these parameters, see MAC Intrinsics.

Note that this mac16 intrinsic function prototype is different from the one used in the previous matrix vector multiplication example. The xbuff here is v64int16, which allows two sets of data to be stored and used in an interleaved way.

Coding with MAC intrinsics can be seen in the following section.

Coding with Intrinsics

You have analyzed how the function will be mapped into the AI Engine vector processor. Now have a look at the vectorized code.

void matmul_mat8(input_window_int16* matA,
		input_window_int16* matB,
		output_window_int16* matC){

	v16int16 buf_matB = window_read_v16(matB);

	v64int16 buf_matA = undef_v64int16();
	buf_matA=upd_w(buf_matA,0,window_read_v16(matA));
	window_incr(matA,64);
	buf_matA=upd_w(buf_matA,1,window_read_v16(matA));
	window_incr(matA,64);

	for (unsigned int i=0;i<M/16;i++)  //M=64, Each iteration computes 16 outputs
	chess_prepare_for_pipelining
	chess_loop_range(4,)
	{
		v16acc48 acc0=null_v16acc48();//For first output column
		v16acc48 acc1=null_v16acc48();//For second output column

		acc0 = mac16(acc0,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,0,0x0,0x0,1);
		buf_matA=upd_w(buf_matA,2,window_read_v16(matA));
		window_incr(matA,64);
		acc1 = mac16(acc1,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,8,0x0,0x0,1);
		buf_matA=upd_w(buf_matA,3,window_read_v16(matA));
		window_incr(matA,64);

		acc0 = mac16(acc0,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,2,0x0,0x0,1);
		buf_matA=upd_w(buf_matA,0,window_read_v16(matA));
		window_incr(matA,64);
		acc1 = mac16(acc1,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,10,0x0,0x0,1);
		buf_matA=upd_w(buf_matA,1,window_read_v16(matA));
		window_incr(matA,64);

		acc0 = mac16(acc0,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,4,0x0,0x0,1);
		buf_matA=upd_w(buf_matA,2,window_read_v16(matA));
		window_incr(matA,64);
		acc1 = mac16(acc1,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,12,0x0,0x0,1);
		buf_matA=upd_w(buf_matA,3,window_read_v16(matA));
		window_incr(matA,80);//point to next 16 rows

		acc0 = mac16(acc0,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,6,0x0,0x0,1);
		window_write(matC,srs(acc0,15));
		window_incr(matC,64);
		buf_matA=upd_w(buf_matA,0,window_read_v16(matA));
		window_incr(matA,64);
		acc1 = mac16(acc1,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,14,0x0,0x0,1);
		window_write(matC,srs(acc1,15));
		window_incr(matC,80);//point to next 16 rows
		buf_matA=upd_w(buf_matA,1,window_read_v16(matA));
		window_incr(matA,64);
	}
}

In the previous code, buf_matB holds matrix B and is loaded outside the loop. buf_matA holds matrix A, with two sets of A data stored in its lower and upper halves. When mac16 uses the value 0 for xstart, the lower half of buf_matA is used; when it uses 32, the upper half is used. acc0 and acc1 hold the accumulated values of the two output columns.

Note that buf_matA is preloaded before the loop. Inside the loop, the data loads with window buffer pointer increments, the MAC operations, and the stores are interleaved. To understand how the mac16() intrinsic works, refer to MAC Intrinsics. The assembly code for the loop is shown in the following figure.

Figure 6: Assembly Code for the Loop

From the previous assembly code, it can be seen that there is a MAC operation and a load operation in every cycle of the loop. Wide registers wr0, wr1, wr2, and wr3 are used for buf_matA. Accumulator registers bm0 and bm1 are used for the two accumulated results.

The keys to making the loop pipeline well are as follows (a condensed sketch of the pattern follows the list):

  • Preload the data into vector registers before the loop starts.
  • Interleave data loads, MAC operations, and data stores in the loop body.
  • Use a wide input data vector register (v64int16 in this example) so that the data loads and the MAC operations work on different parts of the vector register.
  • Use multiple accumulator registers and reuse input data for multiple outputs.
  • Pair each data load with its buffer pointer increment; the same applies to data stores and their buffer pointer increments.
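
A condensed sketch of this pattern, with the individual MAC and load steps elided into comments (illustrative only; the concrete MAC parameters and increments are those of the kernel above):

void pipelined_pattern(input_window_int16* matA,
		input_window_int16* matB,
		output_window_int16* matC){
	v16int16 buf_matB = window_read_v16(matB);             // preload B before the loop
	v64int16 buf_matA = undef_v64int16();                  // wide register: preload one half of A
	buf_matA = upd_w(buf_matA, 0, window_read_v16(matA)); window_incr(matA, 64);
	buf_matA = upd_w(buf_matA, 1, window_read_v16(matA)); window_incr(matA, 64);

	for (unsigned int i = 0; i < 4; i++)                   // 64/16 blocks of 16 rows
	chess_prepare_for_pipelining
	{
		v16acc48 acc0 = null_v16acc48();               // two accumulators so that the
		v16acc48 acc1 = null_v16acc48();               // loaded A data is reused twice
		// ... mac16 operations on one half of buf_matA interleaved with loads
		//     that update the other half; every load paired with its window_incr ...
		window_write(matC, srs(acc0, 15)); window_incr(matC, 64);  // store paired with incr
		window_write(matC, srs(acc1, 15)); window_incr(matC, 80);
	}
}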