Single Kernel Coding Examples
The following sections focus on mapping an algorithm onto a single AI Engine, taking into consideration the compute and memory bounds of the algorithm and the capabilities of the AI Engine. Both scalar and vectorized versions of the code are shown to illustrate vectorization.
Matrix Vector Multiplication
The following matrix vector multiplication example focuses on vectorization for a single AI Engine kernel. It implements the equation C = A * B, where A is a 64 x 16 matrix, B is a 16 x 1 vector, and C is the resulting 64 x 1 vector.
The example assumes that the data for the matrices is stored in column-major form and the data type for matrices A and B is int16.
c0 = a0*b0 + a64*b1 + a128*b2 + a192*b3 + a256*b4 + a320*b5 + a384*b6 + a448*b7 + …
c1 = a1*b0 + a65*b1 + a129*b2 + a193*b3 + a257*b4 + a321*b5 + a385*b6 + a449*b7 + …
c2 = a2*b0 + a66*b1 + a130*b2 + a194*b3 + a258*b4 + a322*b5 + a386*b6 + a450*b7 + …
c3 = a3*b0 + a67*b1 + a131*b2 + a195*b3 + a259*b4 + a323*b5 + a387*b6 + a451*b7 + …
…
c60 = a60*b0 + a124*b1 + a188*b2 + a252*b3 + a316*b4 + a380*b5 + a444*b6 + a508*b7 + …
c61 = a61*b0 + a125*b1 + a189*b2 + a253*b3 + a317*b4 + a381*b5 + a445*b6 + a509*b7 + …
c62 = a62*b0 + a126*b1 + a190*b2 + a254*b3 + a318*b4 + a382*b5 + a446*b6 + a510*b7 + …
c63 = a63*b0 + a127*b1 + a191*b2 + a255*b3 + a319*b4 + a383*b5 + a447*b6 + a511*b7 + …
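Equivalently, in summation form (A is stored column-major, so a[i + 64*j] is the element in row i, column j):
c[i] = sum_{j=0..15} ( a[i + 64*j] * b[j] ),  for i = 0, 1, ..., 63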
Kernel Coding Bounds
In this example, a total of 16 int16 x int16 multiplications are required per output value. As the matrix C consists of 64 values, a total of 16 * 64 = 1024 multiplications is required to complete one matrix vector multiplication. Given that 32 16-bit multiplications can be performed per cycle in an AI Engine, the minimum number of cycles required is 1024/32 = 32. The summation of the individual terms comes at no additional cycle cost because the addition is performed together with the multiplication in a MAC operation. Hence the compute bound for the kernel is:
Compute bound = 1024 / 32 = 32 cycles / invocation
Next, analyze the memory access bound for the kernel. To fully use the vector unit MAC performance, 32 16-bit multiplications must be performed per cycle. Vector b can be kept in a vector register because it is only 16 * 16-bit = 256 bits; it does not need to be fetched from the AI Engine data memory or tile interface for each MAC operation. The data of matrix A, however, is needed at a rate of 32 * 16-bit = 512 bits per cycle. The stream interface only supports 2 * 32 bits per cycle, so the data must be fetched from data memory instead, which allows two 256-bit loads per cycle and thus matches the MAC performance. If two 256-bit loads are performed each cycle, the memory bound for the kernel is:
Memory bound = (64 * 16 * 2 bytes) / (2 * 32 bytes) = 32 cycles / invocation
Note that the compute bound and memory bound are theoretical limits of the kernel realization. They do not take into account the function overhead outside the main computation loop. Because loads and MAC operations can be issued in the same cycle, the achievable lower bound on cycles per invocation is the larger of the two bounds; here both are 32 cycles. When the kernel is only part of a graph, these requirements might be relaxed due to bandwidth limitations of other kernels or lower system performance requirements.
Vectorization
For a complicated vector processing algorithm, starting with a scalar version is recommended because it also serves as a golden reference for verifying accuracy. The scalar version of the matrix vector multiplication is shown as follows.
void matmul_scalar(input_window_int16* matA,
                   input_window_int16* matB,
                   output_window_int16* matC){ //A[M,N], B[N,1], C[M,1]. M=64, N=16
  for(int i=0; i<M; i++){
    int temp = 0 ;
    for(int j=0; j<N; j++){
      temp += window_read(matA)*window_readincr(matB) ;
      window_incr(matA,64); //Jump of 64 elements to access the next element of the same row
    }
    window_writeincr(matC,(int16_t)(temp>>15)) ;
    window_incr(matA,1); //Jump of one element for moving to the next row.
  }
}
Note that in the previously shown code, matA is stored in column-major order and matB is a circular buffer to the kernel. matB can be read continuously with window_readincr when computing different rows of the output because the pointer loops back to the start of the buffer.
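Because the scalar version doubles as a golden reference, it can also be mirrored on the host with plain array indexing for direct comparison. The following is a minimal sketch; the function names and the explicit column-major indexing A[j*M + i] are illustrative assumptions, and the truncating shift matches the scalar kernel above (the vectorized kernel's srs() performs shift-round-saturate, which can differ depending on the rounding mode).

// Hypothetical host-side golden reference and output check.
#include <stdint.h>
#include <stdio.h>

#define M 64
#define N 16

// A is column-major: element (row i, column j) is A[j*M + i].
void matmul_golden(const int16_t *A, const int16_t *B, int16_t *C) {
    for (int i = 0; i < M; i++) {
        int32_t temp = 0;
        for (int j = 0; j < N; j++)
            temp += (int32_t)A[j * M + i] * B[j]; // same term order as the scalar kernel
        C[i] = (int16_t)(temp >> 15);             // truncating shift, as in the scalar code
    }
}

// Compare the kernel output against the golden reference; returns the mismatch count.
int check_outputs(const int16_t *C_kernel, const int16_t *C_golden) {
    int errors = 0;
    for (int i = 0; i < M; i++)
        if (C_kernel[i] != C_golden[i]) {
            printf("mismatch at row %d: %d != %d\n", i, C_kernel[i], C_golden[i]);
            errors++;
        }
    return errors;
}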
There are 64 outputs in total (M=64), and each output needs 16 (N=16) multiplications. When choosing MAC intrinsics for vector processing with the data type int16 * int16, MAC operations with 4, 8, or 16 lanes can be selected. These are illustrated in the following figure.
Note that the main difference between the 4, 8, and 16 lane MACs is how the data is consumed. Because the data is stored by column, the 16 lane MAC may be the best choice: only two blocks of contiguous data, a0 to a15 and a64 to a79, need to be loaded for each MAC operation. a0 to a15 occupy 256 bits, which can be brought into the vector register with a single load.
To allow the two loads to occur in the same cycle, a0 to a15 and a64 to a79 must reside in separate memory banks. The data therefore needs to be divided column by column into two separate buffers to the kernel: a0 to a63 go into the first buffer, a64 to a127 into the second buffer, a128 to a191 into the first buffer again, and so on.
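For illustration, this column-by-column split could be performed when preparing the kernel input, along the lines of the following sketch (the helper name and buffer layout are assumptions for illustration, not part of the original example):

#include <stdint.h>

#define M 64 // rows
#define N 16 // columns

// Distribute the columns of column-major A alternately into two buffers so that
// the two operand blocks of each MAC can be loaded in the same cycle:
// columns 0, 2, 4, ... go to A0 and columns 1, 3, 5, ... go to A1.
void split_columns(const int16_t *A, int16_t *A0, int16_t *A1) {
    for (int j = 0; j < N; j++) {
        int16_t *dst = (j % 2 == 0) ? &A0[(j / 2) * M] : &A1[(j / 2) * M];
        for (int i = 0; i < M; i++)
            dst[i] = A[j * M + i]; // copy one 64-element column
    }
}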
With this vectorization, the matrix vector multiplication becomes a loop with 64/16 = 4 iterations, where each iteration contains eight MAC operations and produces 16 output values. This is illustrated in the following figure.
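The same structure can be written as pseudo-code. This is a sketch only; each mac16 covers 16 lanes across two columns, so eight of them accumulate all 16 terms for 16 outputs.

// Pseudo-code sketch of the vectorized iteration structure
for i = 0 .. 3:                      // each iteration: output rows 16*i .. 16*i+15
    acc[0..15] = 0
    for j = 0, 2, 4, ..., 14:        // eight mac16 operations
        for l = 0 .. 15:             // 16 lanes in parallel
            acc[l] += a[16*i + l][j]   * b[j]
                    + a[16*i + l][j+1] * b[j+1]
    c[16*i .. 16*i+15] = srs(acc)    // shift-round-saturate back to int16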
The mac16() intrinsic function to be used has the following interface.
v16acc48 mac16( v16acc48 acc,
                v32int16 xbuff,
                int xstart,
                unsigned int xoffsets,
                unsigned int xoffsets_hi,
                unsigned int xsquare,
                v16int16 zbuff,
                int zstart,
                unsigned int zoffsets,
                unsigned int zoffsets_hi,
                int zstep
              )
Each buffer has associated parameters (start, offsets, square, and step) that determine the indexing into the buffer (vector register). For details about the lane addressing scheme with these parameters, see MAC Intrinsics. Coding with MAC intrinsics is shown in the following section.
Coding with Intrinsics
You have analyzed how the function will be mapped into the AI Engine vector processor. Now have a look at the first version of the vectorized code.
inline void mac16_sub(input_window_int16* matA, v16int16 &buf_matB, v16acc48 &acc, int i){
  v32int16 buf_matA = undef_v32int16(); // holds 32 elements of matA
  buf_matA = upd_w(buf_matA, 0, window_read_v16(matA));
  window_incr(matA,64);
  buf_matA = upd_w(buf_matA, 1, window_read_v16(matA));
  window_incr(matA,64);
  acc = mac16(acc,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
}

void matmul_vec16(input_window_int16* matA,
                  input_window_int16* matB,
                  output_window_int16* matC){
  v16int16 buf_matB = window_read_v16(matB); // holds 16 elements of matB
  v16acc48 acc = null_v16acc48(); // holds the accumulated row * column dot products
  for (unsigned int i=0;i<M/16;i++) //M=64, each iteration computes 16 outputs
  {
    acc = null_v16acc48();
    for(int j=0;j<16;j+=2){
      mac16_sub(matA,buf_matB,acc,j);
    }
    window_writeincr(matC,srs(acc,15));
    window_incr(matA,16);
  }
}
In the main function matmul_vec16, the outer loop produces 16 output values per iteration. Inside the outer loop body there is an inner loop with eight iterations, each of which calls the inline function mac16_sub. The inline function performs one mac16 operation together with the two data loads required for the MAC.
Inside mac16_sub(), buf_matA is declared as a local variable, while buf_matB and acc are declared as local variables in the main function and passed between the functions by reference (or pointer). This ensures that only one copy of each vector exists. The function has one parameter, i, that is used as zstart in the mac16() intrinsic as follows; this specific intrinsic call (with i=0) has been introduced in MAC Intrinsics.
acc = mac16(acc,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
At the end of each iteration of the outer loop, the window pointer for the data is incremented by 16 (that is, 16 rows of the matrix).
The inline keyword is used to guide the tool to remove the boundary of a function, and __attribute__((always_inline)) can be used to force removal of the boundary. Sometimes it is instead helpful to retain the boundary of a function using __attribute__((noinline)) void func(...). Note that whether or not a function is inlined can affect program memory usage and program optimization.
The compiled code for the kernel can be found in the disassembly view in the debug perspective of the Vitis™ IDE. Note that a graph is needed for compiling the kernel with the AI Engine tools. For more understanding of the assembly code in the disassembly view, refer to Using Vitis IDE and Reports. For additional details on graph coding and Vitis IDE usage, refer to the Versal ACAP AI Engine Programming Environment User Guide (UG1076).
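For reference, the three inlining controls mentioned above can be written as in the following sketch (the function names and bodies are placeholders for illustration):

// Placeholder functions illustrating the three inlining controls.
inline void sub_hint(int &x) { x += 1; }                                  // tool may remove the boundary
inline __attribute__((always_inline)) void sub_force(int &x) { x += 1; } // boundary removal forced
__attribute__((noinline)) void sub_keep(int &x) { x += 1; }               // boundary retained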
Note that the compiler automatically unrolls the inner loop and pipelines the outer loop. From the previous assembly code for the loop, each iteration requires 19 cycles. However, with one window interface for the data (matA), the minimum cycle count for eight MACs should be 16 (two loads of data per MAC). This degradation of performance is caused by the unbalanced window pointer increment at the end of the loop. It can be resolved by pairing the last increment with the last MAC operation. The optimized code is as follows.
inline void mac16_sub(input_window_int16* matA, v16int16 &buf_matB, v16acc48 &acc, int i, int incr_num){
  v32int16 buf_matA = undef_v32int16(); // holds 32 elements of matA
  buf_matA = upd_w(buf_matA, 0, window_read_v16(matA));
  window_incr(matA,64);
  buf_matA = upd_w(buf_matA, 1, window_read_v16(matA));
  window_incr(matA,incr_num);
  acc = mac16(acc,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
}

void matmul_vec16(input_window_int16* matA,
                  input_window_int16* matB,
                  output_window_int16* matC){
  v16int16 buf_matB = window_read_v16(matB); // holds 16 elements of matB
  v16acc48 acc = null_v16acc48(); // holds the accumulated row * column dot products
  for (unsigned int i=0;i<M/16;i++) //M=64, each iteration computes 16 outputs
  {
    acc = null_v16acc48();
    for(int j=0;j<16;j+=2){
      int incr_num = (j==14) ? 80 : 64;
      mac16_sub(matA,buf_matB,acc,j,incr_num);
    }
    window_writeincr(matC,srs(acc,15));
  }
}
Note that the function mac16_sub has a new parameter, incr_num, for the pointer increment, which is different for the last function call in the inner loop. The increment of 80 (that is, 64 + 16) for the last call combines the usual jump to the next column with a 16-row advance, so that the next iteration of the outer loop works on the next 16 rows of data. The assembly code for the loop is now as shown in the following figure.
An iteration of the loop requires 16 cycles, which means the kernel now takes 16*4 = 64 cycles per invocation. As seen in the previous section, the theoretical limit is 32 cycles per invocation, that is, eight cycles per iteration of the loop, which means that eight MAC operations must be compacted into eight cycles. Depending on the system performance requirements, this can be achieved by splitting the data input column by column into two window buffers, matA_0 and matA_1. The data of the two windows is first read into two v16int16 vectors and then concatenated into one v32int16 vector to be used in the mac16 intrinsic. The code for the kernel is as follows.
inline void mac16_sub_loads(input_window_int16* matA_0, input_window_int16* matA_1,
                            v16int16 &buf_matB, v16acc48 &acc, int i, int incr_num){
  v16int16 buf_matA0 = window_read_v16(matA_0);
  window_incr(matA_0,incr_num);
  v16int16 buf_matA1 = window_read_v16(matA_1);
  window_incr(matA_1,incr_num);
  acc = mac16(acc,concat(buf_matA0,buf_matA1),0,0x73727170,0x77767574,0x3120,buf_matB,i,0x0,0x0,1);
}

void matmul_vec16(input_window_int16* __restrict matA_0,
                  input_window_int16* __restrict matA_1,
                  input_window_int16* __restrict matB,
                  output_window_int16* __restrict matC){
  v16int16 buf_matB = window_read_v16(matB);
  for (unsigned int i=0;i<M/16;i++) //M=64, each iteration computes 16 outputs
  chess_prepare_for_pipelining
  {
    v16acc48 acc = null_v16acc48();
    for(int j=0;j<16;j+=2){
      int incr_num = (j==14) ? 80 : 64;
      mac16_sub_loads(matA_0,matA_1,buf_matB,acc,j,incr_num);
    }
    window_writeincr(matC,srs(acc,15));
  }
}
Note that two v16int16 vectors, buf_matA0 and buf_matA1, are defined and concatenated for the mac16 intrinsic. Also note that chess_prepare_for_pipelining is added to the loop and the __restrict keyword to the window interfaces, to ensure that the loop is pipelined and the window operations can be well optimized. The __restrict keyword cannot be used freely; before using it, refer to Using the Restrict Keyword in AI Engine Kernels in the AI Engine Documentation flow of the Vitis Unified Software Platform Documentation (UG1416).
The assembly code for the version with two window loads per cycle is shown in the following figure.
Matrix Multiplication
The following matrix multiplication example implements the equation C = A * B, where A is a 64 x 8 matrix, B is an 8 x 2 matrix, and C is the resulting 64 x 2 matrix.
The example assumes that the data for the matrices is stored in column-major form and the data type for matrices A and B is int16.
The first output column is computed as follows.
c0 = a0*b0 + a64*b1 + a128*b2 + a192*b3 + a256*b4 + a320*b5 + a384*b6 + a448*b7
c1 = a1*b0 + a65*b1 + a129*b2 + a193*b3 + a257*b4 + a321*b5 + a385*b6 + a449*b7
c2 = a2*b0 + a66*b1 + a130*b2 + a194*b3 + a258*b4 + a322*b5 + a386*b6 + a450*b7
c3 = a3*b0 + a67*b1 + a131*b2 + a195*b3 + a259*b4 + a323*b5 + a387*b6 + a451*b7
…
c60 = a60*b0 + a124*b1 + a188*b2 + a252*b3 + a316*b4 + a380*b5 + a444*b6 + a508*b7
c61 = a61*b0 + a125*b1 + a189*b2 + a253*b3 + a317*b4 + a381*b5 + a445*b6 + a509*b7
c62 = a62*b0 + a126*b1 + a190*b2 + a254*b3 + a318*b4 + a382*b5 + a446*b6 + a510*b7
c63 = a63*b0 + a127*b1 + a191*b2 + a255*b3 + a319*b4 + a383*b5 + a447*b6 + a511*b7
The second output column is computed as follows.
c64 = a0*b8 + a64*b9 + a128*b10 + a192*b11 + a256*b12 + a320*b13 + a384*b14 + a448*b15
c65 = a1*b8 + a65*b9 + a129*b10 + a193*b11 + a257*b12 + a321*b13 + a385*b14 + a449*b15
c66 = a2*b8 + a66*b9 + a130*b10 + a194*b11 + a258*b12 + a322*b13 + a386*b14 + a450*b15
c67 = a3*b8 + a67*b9 + a131*b10 + a195*b11 + a259*b12 + a323*b13 + a387*b14 + a451*b15
…
c124 = a60*b8 + a124*b9 + a188*b10 + a252*b11 + a316*b12 + a380*b13 + a444*b14 + a508*b15
c125 = a61*b8 + a125*b9 + a189*b10 + a253*b11 + a317*b12 + a381*b13 + a445*b14 + a509*b15
c126 = a62*b8 + a126*b9 + a190*b10 + a254*b11 + a318*b12 + a382*b13 + a446*b14 + a510*b15
c127 = a63*b8 + a127*b9 + a191*b10 + a255*b11 + a319*b12 + a383*b13 + a447*b14 + a511*b15
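Equivalently, in summation form, output column l (l = 0, 1) is:
c[i + 64*l] = sum_{k=0..7} ( a[i + 64*k] * b[8*l + k] ),  for i = 0, 1, ..., 63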
Kernel Coding Bounds
In this example, a total of 1024 int16 x int16 multiplications are required to compute the 128 output values. Given that 32 16-bit multiplications can be performed per cycle in an AI Engine, the compute bound for the kernel is as follows.
Compute bound = 1024 / 32 = 32 cycles / invocation
Matrix B can be stored in a vector register because it is only 16 * 16-bit = 256 bits; it does not need to be fetched from the AI Engine data memory or tile interface for each MAC operation. Considering the data of matrix A needed for computation, a total of 64*8*2 = 1024 bytes must be fetched from memory. Given that the AI Engine allows two 256-bit (32-byte) loads per cycle, the memory bound for the kernel is as follows.
Memory bound = 1024 / (2*32) = 16 cycles / invocation
It is seen that the compute bound is larger than the memory bound. Hence, the goal of vectorization is to approach the theoretical limit of MAC operations in the vector processor.
Vectorization
The scalar reference code for this matrix multiplication example is shown as follows. Note that the data is stored in columns.
void matmul_mat8_scalar(input_window_int16* matA,
                        input_window_int16* matB,
                        output_window_int16* matC){
  for(int i=0; i<M; i++){ //M=64
    for(int j=0;j<L;j++){ //L=2
      int temp = 0 ;
      for(int k=0; k<N; k++){ //N=8
        temp += window_read(matA)*window_readincr(matB); //B is circular buffer, size N*L
        window_incr(matA,64); //Jump of 64 elements to access the next element of the same row
      }
      window_write(matC,(int16_t)(temp>>15)) ;
      window_incr(matC,64); //Jump to the next column
    }
    window_incr(matA,1); //Jump of one element for moving to the next row.
    window_incr(matC,1); //Jump to the next row
  }
}
As analyzed in the previous example, Matrix Vector Multiplication, the mac16 intrinsic is the best choice because 16 int16 values from a column can be loaded at once and computed across 16 lanes together. To compute the 16 output values of one column block, four mac16 operations are needed. The same data of vector "a" is used twice, once for each of the two output columns. Thus, two columns of A can be loaded once and two mac16 operations used to accumulate into the two output columns. These two loads and two MACs are repeated four times to get the results of the two output columns. This method is shown in the following pseudo-code.
C_[0:15,0] = A_[0:15,0:1]*B_[0:1,0]
C_[0:15,1] = A_[0:15,0:1]*B_[0:1,1]
C_[0:15,0]+= A_[0:15,2:3]*B_[2:3,0]
C_[0:15,1]+= A_[0:15,2:3]*B_[2:3,1]
C_[0:15,0]+= A_[0:15,4:5]*B_[4:5,0]
C_[0:15,1]+= A_[0:15,4:5]*B_[4:5,1]
C_[0:15,0]+= A_[0:15,6:7]*B_[6:7,0]
C_[0:15,1]+= A_[0:15,6:7]*B_[6:7,1]
In the previous code, each "*" denotes a MAC operation. C_[0:15,0] and C_[0:15,1] denote the two output columns, which are accumulated separately. A_[0:15,0:1] denotes columns 0 and 1, each with 16 elements. B_[0:1,0] denotes column 0 with two elements. The real vectorized code wraps this sequence in a loop because there are 64 output rows. The mac16 intrinsic function to be used has the following interface.
v16acc48 mac16( v16acc48 acc,
                v64int16 xbuff,
                int xstart,
                unsigned int xoffsets,
                unsigned int xoffsets_hi,
                unsigned int xsquare,
                v16int16 zbuff,
                int zstart,
                unsigned int zoffsets,
                unsigned int zoffsets_hi,
                int zstep
              )
Each buffer has associated parameters (start, offsets, square, and step) that determine the indexing into the buffer (vector register). For details about the lane addressing scheme with these parameters, see MAC Intrinsics.
Note that this mac16 intrinsic function prototype is different from the one introduced in the previous matrix vector multiplication example. The xbuff here is v64int16, which allows two sets of data to be stored and used in an interleaved way. Coding with MAC intrinsics is shown in the following section.
Coding with Intrinsics
You have analyzed how the function will be mapped into the AI Engine vector processor. Now have a look at the vectorized code.
void matmul_mat8(input_window_int16* matA,
                 input_window_int16* matB,
                 output_window_int16* matC){
  v16int16 buf_matB = window_read_v16(matB);
  v64int16 buf_matA = undef_v64int16();
  buf_matA = upd_w(buf_matA,0,window_read_v16(matA));
  window_incr(matA,64);
  buf_matA = upd_w(buf_matA,1,window_read_v16(matA));
  window_incr(matA,64);
  for (unsigned int i=0;i<M/16;i++) //M=64, each iteration computes 16 outputs
  chess_prepare_for_pipelining
  chess_loop_range(4,)
  {
    v16acc48 acc0 = null_v16acc48(); //for first output column
    v16acc48 acc1 = null_v16acc48(); //for second output column
    acc0 = mac16(acc0,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,0,0x0,0x0,1);
    buf_matA = upd_w(buf_matA,2,window_read_v16(matA));
    window_incr(matA,64);
    acc1 = mac16(acc1,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,8,0x0,0x0,1);
    buf_matA = upd_w(buf_matA,3,window_read_v16(matA));
    window_incr(matA,64);
    acc0 = mac16(acc0,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,2,0x0,0x0,1);
    buf_matA = upd_w(buf_matA,0,window_read_v16(matA));
    window_incr(matA,64);
    acc1 = mac16(acc1,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,10,0x0,0x0,1);
    buf_matA = upd_w(buf_matA,1,window_read_v16(matA));
    window_incr(matA,64);
    acc0 = mac16(acc0,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,4,0x0,0x0,1);
    buf_matA = upd_w(buf_matA,2,window_read_v16(matA));
    window_incr(matA,64);
    acc1 = mac16(acc1,buf_matA,0,0x73727170,0x77767574,0x3120,buf_matB,12,0x0,0x0,1);
    buf_matA = upd_w(buf_matA,3,window_read_v16(matA));
    window_incr(matA,80); //point to next 16 rows
    acc0 = mac16(acc0,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,6,0x0,0x0,1);
    window_write(matC,srs(acc0,15));
    window_incr(matC,64);
    buf_matA = upd_w(buf_matA,0,window_read_v16(matA));
    window_incr(matA,64);
    acc1 = mac16(acc1,buf_matA,32,0x73727170,0x77767574,0x3120,buf_matB,14,0x0,0x0,1);
    window_write(matC,srs(acc1,15));
    window_incr(matC,80); //point to next 16 rows
    buf_matA = upd_w(buf_matA,1,window_read_v16(matA));
    window_incr(matA,64);
  }
}
In the previous code, buf_matB holds matrix B and is loaded outside the loop. buf_matA holds matrix A, with two sets of A data stored in its lower and higher parts. When mac16 is called with the value "0" for xstart, the lower part of buf_matA is used; with the value "32" for xstart, the higher part is used. acc0 and acc1 hold the accumulated values of the two output columns.
Note that buf_matA is preloaded before the loop. In the loop, the data loads with their window buffer pointer increments, the MAC operations, and the stores are interleaved. To understand how the mac16() intrinsic works, refer to MAC Intrinsics. The assembly code for the loop is as shown in the following figure.
From the previous assembly code, it is seen that there is a MAC operation and a load operation in every cycle of the loop. Wide registers wr0, wr1, wr2, and wr3 are used for buf_matA, and accumulator registers bm0 and bm1 are used for the two accumulated results.
The keys to making the loop well pipelined are as follows:
- Preload the data into vector registers before the loop starts.
- Interleave data loads, MAC operations, and data stores in the loop body.
- Use a wide input data vector register (v64int16 in the example) so that the data load and the MAC operation work on different parts of the vector register.
- Use multiple accumulator registers and reuse input data for multiple outputs.
- Pair each data load with its buffer pointer increment; the same applies to data stores and their pointer increments.
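These points can be condensed into a schematic outline of the loop structure. The following is a pseudo-code sketch only, mirroring the matmul_mat8 example above rather than prescribing a fixed recipe.

// Pseudo-code outline of a well-pipelined kernel loop (sketch only)
load loop-invariant data into a vector register       // e.g., buf_matB, loaded once
preload parts of the wide input register              // upd_w into buf_matA, each load
                                                      // paired with a window_incr
for each block of 16 outputs (chess_prepare_for_pipelining):
    zero the accumulators                             // acc0, acc1
    repeat four times:
        MAC on one part of the wide register -> acc0  // compute overlaps with the
        load next data into another part              // paired load and increment
        MAC on another part -> acc1                   // input data reused for column 2
        load next data into another part
    store srs(acc0) and srs(acc1)                     // each store paired with its incr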