Vector Register Lane Permutations

The AI Engine fixed-point vector unit datapath consists of the following three separate and largely independently usable paths:

  • Main MAC datapath
  • Shift-round-saturate path
  • Upshift path

The main multiplication path reads values from vector registers, permutes them in a user-controllable fashion, performs optional pre-adding, multiplies them, and, after some post-adding, accumulates them onto the previous value of the accumulator register.

While the main datapath stores to the accumulator registers, the shift-round-saturate path reads from the accumulator registers and stores to the vector registers or to data memory. The upshift path runs in parallel to the main datapath. It does not perform any multiplications; it simply reads vectors, upshifts them, and feeds the result into the accumulators. For details on the fixed-point and floating-point data paths, refer to the Versal ACAP AI Engine Architecture Manual (AM009). Details on the intrinsic functions that can be used to exercise these data paths can be found in the Versal ACAP AI Engine Intrinsics Documentation (UG1078).

As shown in the following figure, the basic functionality of the MAC data path is a vector multiply and accumulate operation between data from the X and Z buffers. Other parameters and options allow flexible data selection within the vectors and control over the number of output lanes, and optional features allow different input data sizes and pre-adding. There is an additional input buffer, the Y buffer, whose values can be pre-added with those from the X buffer before the multiplication occurs. The result from the intrinsic is added to an accumulator.

Figure 1: Functional Overview of the MAC Data Path

The operation can be described using lanes and columns. The number of lanes corresponds to the number of output values that will be generated from the intrinsic call. The number of columns is the number of multiplications that will be performed per output lane, with each of the multiplication results being added together. For example:

acc0 += z00*(x00+y00) + z01*(x01+y01) + z02*(x02+y02) + z03*(x03+y03)
acc1 += z10*(x10+y10) + z11*(x11+y11) + z12*(x12+y12) + z13*(x13+y13)
acc2 += z20*(x20+y20) + z21*(x21+y21) + z22*(x22+y22) + z23*(x23+y23)
acc3 += z30*(x30+y30) + z31*(x31+y31) + z32*(x32+y32) + z33*(x33+y33)

In this case, four outputs are being generated, so there are four lanes and four columns for each of the outputs with pre-addition from the X and Y buffers.

The parameters of the intrinsics allow for flexible data selection from the different input buffers for each lane and column, all following the same pattern of parameters. The following section introduces the data selection (or data permute) schemes with detailed examples that include shuffle and select intrinsics. Details around the mac intrinsic and its variants are also discussed in the following sections.

Data Selection

AI Engine intrinsic functions support various types of data selection. The details of the shuffle and select intrinsics are as follows.

Data Shuffle

The AI Engine shuffle intrinsic function selects data from a single input data buffer according to the start and offset parameters. This allows for flexible permutations of the input vector values without needing to rearrange them in memory. xbuff is the input data buffer, with xstart indicating the starting position offset for each lane in the xbuff data buffer and xoffsets indicating the position offset applied to the data buffer for each lane. The shuffle intrinsic function is available in 8, 16, and 32 lane variants (shuffle8, shuffle16, and shuffle32). The main permute for data (xoffsets) is at 32-bit granularity, and xsquare allows a further 16-bit granularity mini permute after the main permute. Thus, the 8-bit and 16-bit vector intrinsic functions can have an additional square parameter for more complex permutations.

For example, a shuffle16 intrinsic has the following function prototype.

v16int32 shuffle16 ( v16int32     xbuff,
                     int          xstart,
                     unsigned int xoffsets,
                     unsigned int xoffsets_hi )

The data permute is performed at 32-bit granularity. When the data size is 32 bits or 64 bits, the start and offsets are relative to the full data width (32 bits or 64 bits). The lane selection follows the regular lane selection scheme:

f: result [lane number] = xbuff [(xstart + xoffsets [lane number]) Mod input_samples]

The following example shows how shuffle works on the v16int32 vector. xoffsets and xoffsets_hi have 4 bits for each lane. This example moves the even and odd elements of the buffer into the lower and higher halves of the result, respectively.

Figure 2: Data Shuffle on int32 Type
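The following call is a minimal sketch of this kind of even/odd permute, assuming an input vector xbuff of type v16int32. Each nibble of xoffsets and xoffsets_hi, read from the lowest nibble upwards, is the offset added to xstart for one lane.

// Even elements (0, 2, ..., 14) go to lanes 0-7, odd elements (1, 3, ..., 15) to lanes 8-15.
v16int32 deinterleaved = shuffle16(xbuff, 0, 0xECA86420, 0xFDB97531);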

When the data permute is on 16-bit data, the intrinsic function includes another parameter, xsquare, allowing flexibility to perform data selection within each 4 x 16-bit block of data. The xoffsets values come in pairs: the first hex value is an absolute 32-bit offset and picks up two 16-bit values (index, index+1); the second hex value is an offset from the first value + 1 (in 32-bit granularity) and picks up two 16-bit values. For example, 0x00 selects indices 0, 1 and indices 2, 3, and 0x24 selects indices 8, 9 and indices 14, 15. Following is a shuffle example on the v32int16 vector.

Figure 3: Data Shuffle on int16 Type
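The pair encoding of xoffsets for 16-bit data can be decoded with the following pseudo C-style sketch (decoding only, not an intrinsic call; xstart is assumed to be 0):

// One hex pair of xoffsets (low nibble o0, high nibble o1) selects four 16-bit indices.
o0 = pair & 0xF;        // first hex value: absolute 32-bit offset
o1 = (pair >> 4) & 0xF; // second hex value: relative 32-bit offset from (o0 + 1)
first_pair  = {2*o0, 2*o0 + 1};                       // 0x24: indices 8, 9
second_pair = {2*(o0 + o1 + 1), 2*(o0 + o1 + 1) + 1}; // 0x24: indices 14, 15

The xsquare mini permute is then applied within each 4 x 16-bit block.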

Data Select

The select intrinsic chooses between the first set of lanes and the second according to the value of the select parameter. If the bit corresponding to a lane in select is zero, the value from the first set of lanes is returned; if the bit is one, the value from the second set of lanes is returned. For example, a select16 intrinsic function has the following function prototype.

v16int32 select16 ( unsigned int select,
                    v16int32     xbuff,
                    int          xstart,
                    unsigned int xoffsets,
                    unsigned int xoffsets_hi,
                    v16int32     ybuff,
                    int          ystart,
                    unsigned int yoffsets,
                    unsigned int yoffsets_hi )

For each bit of select (from low to high), a lane is selected either from xbuff (if the corresponding select bit is 0) or from ybuff (if the bit is 1). The data permute on the selected lane of xbuff or ybuff is performed as in shuffle, using the corresponding bits in xoffsets or yoffsets. Following is the pseudo C-style code for select.

for (int i = 0; i < 16; i++){
	idx = f( xstart, xoffsets[i]); //i'th 4 bits of offsets
	idy = f( ystart, yoffsets[i]);
	o[i] = select[i] ? y[idy]:x[idx];
}

For information about how f works in previous code, refer to the regular lane selection scheme equation listed at the beginning of this section.

When working on the int16 data type, the select intrinsic has additional xsquare and ysquare parameters, which allow a further 16-bit granularity mini permute after the main permute. For example, a select32 intrinsic function has the following function prototype.


v32int16 select32 ( unsigned int select,
                    v64int16     xbuff,
                    int          xstart,
                    unsigned int xoffsets,
                    unsigned int xoffsets_hi,
                    unsigned int xsquare,
                    int          ystart,
                    unsigned int yoffsets,
                    unsigned int yoffsets_hi,
                    unsigned int ysquare )

Following is the pseudo C-style code for select.

for (int i = 0; i < 32; i++){
	idx = f( xstart, xoffsets[i], xsquare); 
	idy = f( ystart, yoffsets[i], ysquare);
	o[i] = select[i] ? y[idy]:x[idx];
}

The following example uses select32 to interleave the first 16 elements of A and B (A first).

int16 A[32]={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
		16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
};
int16 B[32]={32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
		48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
};
v32int16 *pA=(v32int16*)A;
v32int16 *pB=(v32int16*)B;
v32int16 C = select32(0xAAAAAAAA, concat(*pA,*pB),
		0, 0x03020100, 0x07060504, 0x1100,
		32, 0x03020100, 0x07060504, 0x1100);

The output C for the previous code is as follows.

{0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47}

This can also be done using the shuffle32 intrinsic.

v32int16 C = shuffle32(concat(*pA,*pB),
	0, 0xF3F2F1F0, 0xF7F6F5F4, 0x3120);

The following figure shows how the previous select32 intrinsic works.

Figure 4: Data Select on int16 Type

MAC Intrinsics

MAC intrinsics perform vector multiply and accumulate operations between data from two buffers, the X and Z buffers, with other parameters and options allowing flexibility (data selection within the vectors, number of output lanes) and optional features (different input data sizes, pre-adding, and so on). There is an additional input buffer, the Y buffer, whose values can be pre-added with those from the X buffer before the multiplication occurs. The result from the intrinsic is added to an accumulator.

The parameters of the intrinsics allow for flexible data selection from the different input buffers for each lane and column, all following the same pattern of parameters. A starting point in the buffer is given by the (x/y/z)start parameter, which selects the element for the first lane and first column. To allow flexibility for each lane, (x/y/z)offsets provides an offset value for each lane that is added to the starting point. Finally, the (x/y/z)step parameter defines the step in data selection between each column based on the previous position. Note that when ystep is not specified in the intrinsic, it is the symmetric (negative) of xstep.
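Before the granularity rules described next are applied, the element selected for each lane and column can be sketched as follows (pseudo C-style, using the parameter names above; the Y index is shown for the case where ystep is not specified and therefore mirrors xstep):

x_index[lane][col] = xstart + xoffsets[lane] + col * xstep;
y_index[lane][col] = ystart + yoffsets[lane] - col * xstep; // ystep defaults to the negative of xstep
z_index[lane][col] = zstart + zoffsets[lane] + col * zstep;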

Main permute granularity for the x/y and z buffers is 32 bits and 16 bits, respectively. Complex numbers are considered as one entity for the permute (for example, cint16 counts as 32 bits for the permute). The zstart parameter must be a compile-time constant. 8-bit and 16-bit permute granularity in x/y and 8-bit permute granularity in z have certain limitations, as addressed towards the end of this section. The following sections cover the different data widths and explain the result of the MAC intrinsic on these data widths.

MAC on 32x32 bits

The following figure shows how start, offsets, and step work on the cint16 data type.

Figure 5: MAC4 on cint16 x cint16 Type

mac4 has four output lanes. The first column of data is selected by adding xstart to each 4-bit value of xoffsets. Each subsequent column of data is selected by adding xstep to the previous column. In AI Engine Vector Precision Support, it can be seen that there are eight MACs per cycle for the cint16 * cint16 operation. This means that mac4 has two columns of multiplication.

The coefficients of mac4 are chosen similarly by zstart, zoffsets, and zstep.
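Putting lanes, columns, and the selection parameters together, the data selection of this mac4 variant can be sketched in pseudo C-style code (a sketch only; indices are in cint16 elements and there are two columns per lane, as noted above):

for (int lane = 0; lane < 4; lane++){
	for (int col = 0; col < 2; col++){
		acc[lane] += xbuff[xstart + xoffsets[lane] + col*xstep]  // xoffsets[lane]: lane'th 4 bits of xoffsets
		           * zbuff[zstart + zoffsets[lane] + col*zstep]; // zoffsets[lane]: lane'th 4 bits of zoffsets
	}
}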

MAC on 32x16 bits

An example of MAC with pre-adding is as follows. With pre-adding, data from the X buffer can be pre-added with other data from the X buffer, or data from the X buffer can be pre-added with data from the Y buffer. The start, offsets, and step parameters work similarly to the previous example. The ystart parameter selects the starting point for the Y buffer (or for the second set of data from the X buffer). The step parameter works in reverse for the Y buffer (or the second set of data from the X buffer).

Figure 6: LMAC8_SYM on int32 x int16 Type
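The pre-adding shown in the figure can be sketched in the same pseudo C-style form (a sketch only; x, y, and z stand for the data selected from the X, Y, and Z buffers, where the Y data may come from the X buffer itself, and the number of columns depends on the intrinsic variant):

for (int lane = 0; lane < 8; lane++){
	for (int col = 0; col < columns; col++){
		acc[lane] += ( x[xstart + xoffsets[lane] + col*xstep]
		             + y[ystart + yoffsets[lane] - col*xstep] ) // step works in reverse on the Y side
		           * z[zstart + zoffsets[lane] + col*zstep];
	}
}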

MAC on 16x16 bits

An example of MAC with an int16 X buffer and an int16 Z buffer is as follows. Note that the permute granularity for the X buffer is 32 bits. The start and step parameters are always in terms of data type granularity; therefore, a value of 2 for 16-bit data selects data 2 * 16 bits away. The xoffsets values come in pairs: the first hex value is an absolute 32-bit offset and picks up two 16-bit values (index, index+1) in the even row; the second hex value is an offset from the first value + 1 (in 32-bit granularity) and picks up two 16-bit values in the odd row. So, the hex value 0x24 in xoffsets selects indices 8, 9 for the even row and indices 14, 15 for the odd row from xbuff, and the hex value 0x00 in xoffsets selects indices 0, 1 for the even row and indices 2, 3 for the odd row from xbuff.

There is another parameter, xsquare, to perform 16-bit granularity twiddling after the main permute. For example, the xsquare value 0x2103 (read from the lower hex value to the higher hex value) puts indices 3, 0 in the even row and indices 1, 2 in the odd row. How the xsquare parameter works can be seen in the center of the following figure.

Figure 7: MAC8 on int16 x int16 Type

The following figure shows an example of the mac16 intrinsic with int16 data and int16 coefficients. It is used in the matrix vector multiplication and matrix multiplication example designs in Single Kernel Coding Examples.

Figure 8: MAC16 on int16 x int16 Type

MAC on 8x8 bits

The following figures show MAC with an int8 X buffer and an int8 Z buffer. The first figure shows how data is permuted and the second figure shows how coefficients are permuted. Note that the permute granularities for the X buffer and Z buffer are 32 bits and 16 bits, respectively. The xoffsets values come in pairs: the first hex value is an absolute 32-bit offset and picks up 4 x 8-bit values (index, index+1, index+2, index+3); the second hex value is an offset from the first value + 1 (in 32-bit granularity) and picks up 4 x 8-bit values. For example, 0x00 selects indices 0, 1, 2, 3 as well as 4, 5, 6, 7, and 0x24 selects indices 16, 17, 18, 19 as well as 28, 29, 30, 31.

There is another parameter, xsquare, to perform 8-bit granularity twiddling after the main permute. How the xsquare parameter works in this example can be seen in the center of the following figure.

The start (xstart, zstart) and step (xstep, zstep) parameters are always in terms of data type granularity. Hence, a value of 2 for 16-bit data is 2 * 16 bits away, while a value of 2 for 8-bit data is 2 * 8 bits away. The step parameter applies to the next block of selected data; so, if a pair of offset values selects a 2 * 2 block, the step applies to the next 2 * 2 block. The step added to the index value must be aligned to the permute granularity (32 bits for data, 16 bits for coefficients). For example, when working with 8-bit data, xstep must be a multiple of four, and when working with 8-bit coefficients, zstep must be a multiple of two. The following two figures show how step works for data and coefficients.

Note that for the coefficient in int8 * int8 types, the 2 * 2 index block is duplicated to construct a 4 * 2 block. See how index 0, 1, 2, and 3 are duplicated in MAC8 on int8 x int8 Type (Z Part).

Figure 9: MAC8 on int8 x int8 Type (X Part)
Figure 10: MAC8 on int8 x int8 Type (Z Part)

Options

There is a rich set of MAC intrinsics with additional operations like pre-adding, pre-subtraction, and conjugation. The naming convention for the vector MAC intrinsics is as follows. Optional characteristics are shown in [] and mandatory ones in {}.

[l]{mac|msc|mul|negmul}{2|4|8|16}[_abs|_max|_min|_maxdiff][_conj][{_sym|_antisym}[_ct|_uct]][_c|_cc|_cn|_nc]

Every operation is either a multiplication, which initializes an accumulator, or a MAC operation, which accumulates onto a running accumulator, with 2, 4, 8, or 16 lanes.

l
Denotes that an accumulator with 80-bit lanes is used for the operation.
sym and antisym
Indicates the use of pre-adding and pre-subtraction respectively.
max, min, and maxdiff
Indicates the pre-selection of lanes in the xbuff based on the maximum, minimum, or maximum difference value.
abs
Indicates the pre-computation of the absolute value in the xbuff.
ct
Used for partial pre-adding and pre-subtraction (separate selection for the data input from X for the final column).
uct
Used for unit center optimization for certain types of FIR filters. Refer to the Versal ACAP AI Engine Intrinsics Documentation (UG1078) for more information.
n and c
Used to indicate that the complex conjugate will be used for one of the input buffers with complex values:
c
The only complex input buffer will be conjugated.
cn
Complex conjugate of X (or XY if pre-adding is used) buffer.
nc
Complex conjugate of Z buffer.
cc
Complex conjugate of both X (or XY if pre-adding is used) and Z buffers.
conj
Indicates that the complex conjugate of Z will be used when multiplying the data input from Y.
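For example, following this convention, lmac8_sym (shown in the LMAC8_SYM figure earlier in this section) denotes a MAC operation with 80-bit accumulator lanes, eight output lanes, and symmetric pre-adding.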

Data Permute and MAC Examples

The following example takes two vectors, with real parts in rva and imaginary parts in rvb (both of type v8int32), and creates a new complex vector, using the offsets to interleave the values as required.

v8cint32 cv = as_v8cint32(select16(0xaaaa, concat(rva, rvb),
		0, 0x03020100, 0x07060504,  8, 0x30201000, 0x70605040));

The following example shows how to extract the real and imaginary portions of a vector cv of type v8cint32.

v16int32 re_im  = shuffle16(as_v16int32(cv), 0, 0xECA86420, 0xFDB97531);  
v8int32 re = ext_w(re_im, 0);  
v8int32 im = ext_w(re_im, 1);

Shuffle intrinsic functions can be used to reorder the elements in a vector or to set all elements to the same value. Some intrinsic functions operate only on larger registers, but it is easy to use them for smaller registers. The following example shows how to implement a function that sets all four elements in a vector to a constant value.

v4int32 v2 = ext_v(shuffle16(xset_v(0, v1), 0, 0, 0), 0);

The following example shows how to multiply each element in rva by the first element in rvb. This is efficient for multiplying a vector by a constant value.

v8acc80 acc = lmul8(concat(rva,undef_v8int32()),0,0x76543210,rvb,0,0x00);

The following examples show how to multiply each element in rva by its corresponding element in rvb.

acc = lmul8(concat(rva, undef_v8int32()),0,0x76543210,rvb,0,0x76543210);
acc = lmul8(upd_w(undef_v16int32(),0,rva),0,0x76543210,rvb,0,0x76543210);

The following examples show how to do matrix multiplication for int8 x int8 data types with the mul intrinsic, assuming that data storage is row-based.

//Z_{2x8} * X_{8x8} = A_{2x8}
mul16(Xbuff, 0, 0x11101110, 16, 0x3120, Zbuff, 0, 0x44440000, 2, 0x3210);
//Z_{4x8} * X_{8x4} = A_{4x4}
mul16(Xbuff, 0, 0x00000000, 8, 0x3120, Zbuff, 0, 0xCC884400, 2, 0x3210);

If the kernel has multiple mul or mac intrinsics, try to keep the xoffsets and zoffsets parameters constant across uses and vary the xstart and zstart parameters. This helps prevent configuration register spills onto the stack.

For more information about vector lane permutations, refer to the Versal ACAP AI Engine Intrinsics Documentation (UG1078).