Vector Register Lane Permutations
The AI Engine fixed-point vector unit datapath consists of the following three separate and largely independently usable paths:
- Main MAC datapath
- Shift-round-saturate path
- Upshift path
The main multiplication path reads values from the vector registers, permutes them in a user-controllable fashion, performs optional pre-adding, multiplies them, and, after some post-adding, accumulates the results onto the previous value of the accumulator register.
While the main datapath stores to the accumulator registers, the shift-round-saturate path reads from the accumulator registers and stores to the vector registers or the data memory. The upshift path runs in parallel to the main datapath. It does not perform any multiplications but simply reads vectors, upshifts them, and feeds the result into the accumulators. For details on the fixed-point and floating-point data paths, refer to the Versal ACAP AI Engine Architecture Manual (AM009). Details on the intrinsic functions that can be used to exercise these data paths can be found in the Versal ACAP AI Engine Intrinsics Documentation (UG1078).
As shown in the following figure, the basic functionality of the MAC datapath consists of vector multiply and accumulate operations between data from the X and Z buffers. Additional parameters and options allow flexible data selection within the vectors and control over the number of output lanes, and optional features support different input data sizes and pre-adding. There is an additional input buffer, the Y buffer, whose values can be pre-added with those from the X buffer before the multiplication occurs. The result from the intrinsic is added to an accumulator.
The operation can be described using lanes and columns. The number of lanes corresponds to the number of output values that will be generated from the intrinsic call. The number of columns is the number of multiplications that will be performed per output lane, with each of the multiplication results being added together. For example:
acc0 += z00*(x00+y00) + z01*(x01+y01) + z02*(x02+y02) + z03*(x03+y03)
acc1 += z10*(x10+y10) + z11*(x11+y11) + z12*(x12+y12) + z13*(x13+y13)
acc2 += z20*(x20+y20) + z21*(x21+y21) + z22*(x22+y22) + z23*(x23+y23)
acc3 += z30*(x30+y30) + z31*(x31+y31) + z32*(x32+y32) + z33*(x33+y33)
In this case, four outputs are generated, so there are four lanes with four columns per output, with pre-addition of the X and Y buffers.
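Written as plain C for illustration only (the intrinsic selects these operands through its start, offsets, and step parameters rather than through two-dimensional arrays), the same computation is:
for (int lane = 0; lane < 4; lane++)
    for (int col = 0; col < 4; col++)
        acc[lane] += z[lane][col] * (x[lane][col] + y[lane][col]);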
The parameters of the intrinsics allow for flexible data selection from the different input buffers for each lane and column, all following the same pattern of parameters. The following section introduces the data selection (or data permute) schemes with detailed examples that include the shuffle and select intrinsics. Details around the mac intrinsic and its variants are also discussed in the following sections.
Data Selection
AI Engine intrinsic functions support various types of data selection. The details of the shuffle and select intrinsics are as follows.
Data Shuffle
The AI Engine shuffle intrinsic function selects data from a single input data buffer according to the start and offset parameters. This allows flexible permutations of the input vector values without needing to rearrange them in memory. xbuff is the input data buffer, xstart indicates the starting position offset for each lane in the xbuff data buffer, and xoffsets indicates the position offset applied to the data buffer. The shuffle intrinsic function is available in 8, 16, and 32 lane variants (shuffle8, shuffle16, and shuffle32). The main permute for data (xoffsets) is at 32-bit granularity, and xsquare allows a further 16-bit granularity mini permute after the main permute. Thus, the 8-bit and 16-bit vector intrinsic functions can take an additional square parameter for more complex permutations.
For example, a shuffle16 intrinsic has the following function prototype.
v16int32 shuffle16 ( v16int32 xbuff,
int xstart,
unsigned int xoffsets,
unsigned int xoffsets_hi
)
The data permute is performed at 32-bit granularity. When the data size is 32 bits or 64 bits, the start and offsets are relative to the full data width (32 bits or 64 bits). The lane selection follows the regular lane selection scheme:
f: result[lane number] = xbuff[(xstart + xoffsets[lane number]) Mod input_samples]
The following example shows how shuffle works on the v16int32 vector. xoffsets and xoffsets_hi have 4 bits for each lane. This example moves the even and odd elements of the buffer into the lower and higher parts of the buffer, respectively.
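The permutation in that example can be written as the following call. This is a sketch; the offset words are derived from the 4-bit-per-lane scheme above (even indices 0, 2, ..., 14 feed lanes 0 through 7 and odd indices feed lanes 8 through 15), and the same pattern is used later in this section to separate real and imaginary parts.
v16int32 even_odd = shuffle16(xbuff, 0, 0xECA86420, 0xFDB97531);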
When the data permute is on 16-bit data, the intrinsic function includes another parameter, xsquare, allowing flexibility to perform data selection within each 4 x 16-bit block of data. The xoffsets values come in pairs. The first hex value is an absolute 32-bit offset and picks up 2 x 16-bit values (index, index+1). The second hex value is an offset from the first value + 1 (as a 32-bit offset) and picks up 2 x 16-bit values. For example, 0x00 selects index 0, 1 and index 2, 3, while 0x24 selects index 8, 9 and index 14, 15. Following is a shuffle example on the v32int16 vector.
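To make the pair rule concrete, the following plain C helper (illustrative only; decode_pair16 is not an AI Engine API) computes the four 16-bit indexes selected by one xoffsets hex pair:
// lo is the lower hex value of the pair, hi the higher one
void decode_pair16(unsigned lo, unsigned hi, int idx[4]) {
    unsigned w0 = lo;           // first 32-bit offset
    unsigned w1 = lo + 1 + hi;  // second 32-bit offset, relative to the first value + 1
    idx[0] = 2 * w0; idx[1] = 2 * w0 + 1;   // pair 0x00: 0, 1     pair 0x24: 8, 9
    idx[2] = 2 * w1; idx[3] = 2 * w1 + 1;   // pair 0x00: 2, 3     pair 0x24: 14, 15
}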
Data Select
The select intrinsic selects between the first set of lanes or the second one according to the value of the select parameter. If the bit in select corresponding to a lane is zero, it returns the value in the first set of lanes. If the bit is one, it returns the value in the second set of lanes. For example, a select16 intrinsic function has the following function prototype.
v16int32 select16 ( unsigned int select,
v16int32 xbuff,
int xstart,
unsigned int xoffsets,
unsigned int xoffsets_hi,
v16int32 ybuff,
int ystart,
unsigned int yoffsets,
unsigned int yoffsets_hi
)
For each bit of select (from low to high), the intrinsic selects a lane either from xbuff (if the select parameter bit is 0) or from ybuff (if the select parameter bit is 1). The data permute on the selected lane of xbuff or ybuff is achieved by a shuffle with the corresponding bits in xoffsets or yoffsets.
Following is the pseudo C-style code for select.
for (int i = 0; i < 16; i++){
idx = f( xstart, xoffsets[i]); //i'th 4 bits of offsets
idy = f( ystart, yoffsets[i]);
o[i] = select[i] ? y[idy]:x[idx];
}
For information about how f works in the previous code, refer to the regular lane selection scheme equation listed at the beginning of this section.
When working on the int16 data type, the select intrinsic has an additional xsquare parameter which allows a further 16-bit granularity mini permute after the main permute. For example, a select32 intrinsic function has the following function prototype.
v32int16 select32 ( unsigned int select,
v64int16 xbuff,
int xstart,
unsigned int xoffsets,
unsigned int xoffsets_hi,
unsigned int xsquare,
int ystart,
unsigned int yoffsets,
unsigned int yoffsets_hi,
unsigned int ysquare
)
Following is the pseudo C-style code for select.
for (int i = 0; i < 32; i++){
idx = f( xstart, xoffsets[i], xsquare);
idy = f( ystart, yoffsets[i], ysquare);
o[i] = select[i] ? y[idy]:x[idx];
}
The following example uses select32 to interleave the first 16 elements of A and B (A first).
int16 A[32]={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
};
int16 B[32]={32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
};
v32int16 *pA=(v32int16*)A;
v32int16 *pB=(v32int16*)B;
v32int16 C = select32(0xAAAAAAAA, concat(*pA,*pB),
0, 0x03020100, 0x07060504, 0x1100,
32, 0x03020100, 0x07060504, 0x1100);
The output C for the previous code is as follows.
{0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47
}
This can also be done using the shuffle32 intrinsic.
v32int16 C = shuffle32(concat(*pA,*pB),
0, 0xF3F2F1F0, 0xF7F6F5F4, 0x3120);
The following figure shows how the previous select32 intrinsic works.
MAC Intrinsics
MAC intrinsics perform vector multiply and accumulate operations between data from two buffers, the X and Z buffers, with the other parameters and options allowing flexibility (data selection within the vectors, number of output lanes) and optional features (different input data sizes, pre-adding, and so on). There is an additional input buffer, the Y buffer, whose values can be pre-added with those from the X buffer before the multiplication occurs. The result from the intrinsic is added to an accumulator.
The parameters of the intrinsics allow for flexible data selection from the different input buffers for each lane and column, all following the same pattern of parameters. A starting point in the buffer is given by the (x/y/z)start parameter, which selects the first element for the first row as well as the first column. To allow flexibility for each lane, (x/y/z)offsets provides an offset value for each lane that is added to the starting point. Finally, the (x/y/z)step parameter defines the step in data selection between each column, based on the previous position. Note that when ystep is not specified in the intrinsic, it is the symmetric (reverse) of xstep.
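As a rough scalar model of this addressing (illustrative only; offset_of stands for extracting the offsets field that belongs to a given lane, and the pairing rules for 8-bit and 16-bit data described later are ignored):
for (int lane = 0; lane < LANES; lane++) {
    for (int col = 0; col < COLS; col++) {
        int xi = xstart + offset_of(xoffsets, lane) + col * xstep; // data index
        int zi = zstart + offset_of(zoffsets, lane) + col * zstep; // coefficient index
        acc[lane] += z[zi] * x[xi]; // with pre-adding, the Y selection is added to x[xi] first
    }
}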
Main permute granularity for the x/y and z buffers is 32 bits and 16 bits, respectively. Complex numbers are considered as one entity for the permute (for example, cint16 counts as 32 bits for the permute). The zstart parameter must be a compile-time constant. 8-bit and 16-bit permute granularity in x/y and 8-bit permute granularity in z have certain limitations, as addressed towards the end of this section. The following sections cover the different data widths and explain the result of the MAC intrinsic on these data widths.
MAC on 32x32 bits
The following figure shows how start, offsets, and step work on the cint16 data type. mac4 has four output lanes. The first column of data is selected by adding xstart to every 4 bits of xoffsets. Each subsequent column of data is selected by adding xstep to its previous column. In AI Engine Vector Precision Support, it can be seen that there are eight MACs per cycle for the cint16 * cint16 operation. This means that mac4 has two columns of multiplication.
The coefficients of mac4 are chosen similarly by zstart, zoffsets, and zstep.
MAC on 32x16 bits
An example of MAC with pre-adding is as follows. With pre-adding, data from the X buffer can be added with other data from the X buffer, or data from the X buffer and the Y buffer can be added. The start, offsets, and step parameters work similarly to the previous example. There is a ystart parameter for the Y buffer (or for the second selection from the X buffer). The step parameter works in reverse for the Y buffer (or for the second selection from the X buffer).
MAC on 16x16 bits
An example of MAC with an int16 X buffer and int16 Z buffer is as follows. Note that the permute granularity for the X buffer is 32 bits. The start and step parameters are always in terms of the data type granularity. Therefore, a value of 2 for 16-bit data selects data 2 * 16 bits away. The xoffsets parameter comes in pairs. The first hex value is an absolute 32-bit offset and picks up 2 x 16-bit values (index, index+1) in the even row. The second hex value is an offset from the first value + 1 (as a 32-bit offset) and picks up 2 x 16-bit values in the odd row. So the hex value 0x24 in xoffsets selects index 8, 9 for the even row and index 14, 15 for the odd row from xbuff, and the hex value 0x00 in xoffsets selects index 0, 1 for the even row and index 2, 3 for the odd row from xbuff.
There is another parameter, xsquare, to perform 16-bit granularity twiddling after the main permute. For example, the xsquare value 0x2103 (read from the lower hex value to the higher hex value) puts index 3, 0 in the even row and index 1, 2 in the odd row. How the xsquare parameter works can be seen in the center of the following figure.
The following figure is an example of the mac16 intrinsic with int16 data and int16 coefficients. It is used in the matrix vector multiplication and matrix multiplication example designs in Single Kernel Coding Examples.
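As a compact worked decode of the values discussed above (for orientation only):
// xoffsets pair 0x24: first value 4 -> 32-bit word 4 -> indexes 8, 9 (even row)
//                     second value 2 -> word 4 + 1 + 2 = 7 -> indexes 14, 15 (odd row)
// xsquare 0x2103 (digits read from low to high: 3, 0, 1, 2): the even row takes
//                     indexes 3, 0 and the odd row takes indexes 1, 2 of each 2 x 2 block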
MAC on 8x8 bits
The following figures show MAC with an int8 X buffer and int8 Z buffer. The first figure shows how data is permuted and the second figure shows how coefficients are permuted. Note that the permute granularities for the X buffer and Z buffer are 32 bits and 16 bits, respectively. The xoffsets parameter comes in pairs. The first hex value is an absolute 32-bit offset and picks up 4 x 8-bit values (index, index+1, index+2, index+3). The second hex value is an offset from the first value + 1 (as a 32-bit offset) and picks up 4 x 8-bit values. For example, 0x00 selects index 0, 1, 2, 3 as well as 4, 5, 6, 7, and 0x24 selects index 16, 17, 18, 19 as well as 28, 29, 30, 31.
There is another parameter, xsquare, to perform 8-bit granularity twiddling after the main permute. How the xsquare parameter works in this example can be seen in the center of the following figure.
The start (xstart, zstart) and step (xstep, zstep) parameters are always in terms of the data type granularity. Hence, a value of 2 for 16-bit data is 2 * 16 bits away, while a value of 2 for 8-bit data is 2 * 8 bits away. The step parameter applies to the next block of selected data. So, if a pair of offsets parameters selects a 2 * 2 block, the step applies to the next 2 * 2 block. The step added to the index value must be aligned to the permute granularity (32 bits for data, 16 bits for coefficients). For example, when working with 8-bit data, xstep must be a multiple of four. When working with 8-bit coefficients, zstep must be a multiple of two. The following two figures show how step works for data and coefficients.
Note that for the coefficient in int8 * int8 types, the 2 * 2 index block is duplicated to construct a 4 * 2 block. See how index 0, 1, 2, and 3 are duplicated in MAC8 on int8 x int8 Type (Z Part).
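For orientation, the alignment rule works out as follows; the values are examples consistent with the mul16 calls shown later in this section.
// 8-bit data (32-bit permute granularity):         xstep must be a multiple of 4, for example 8 or 16
// 8-bit coefficients (16-bit permute granularity): zstep must be a multiple of 2, for example 2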
Options
There is a rich set of MAC intrinsics with additional operations like pre-adding, pre-subtraction, and conjugation. The naming convention for the vector MAC intrinsics is as follows. Optional characteristics are shown in [] and mandatory ones in {}.
[l]{mac|msc|mul|negmul}{2|4|8|16}[_abs|_max|_min|_maxdiff][_conj][{_sym|_antisym}[_ct|_uct]][_c|_cc|_cn|_nc]
Every operation is either a multiplication, which initializes an accumulator, or a MAC operation, which accumulates onto a running accumulator, with 2, 4, 8, or 16 lanes.
- l - Denotes that an accumulator with 80-bit lanes is used for the operation.
- sym and antisym - Indicate the use of pre-adding and pre-subtraction, respectively.
- max, min, and maxdiff - Indicate the pre-selection of lanes in the xbuff based on the maximum, minimum, or maximum difference value.
- abs - Indicates the pre-computation of the absolute value in the xbuff.
- ct - Used for partial pre-adding and pre-subtraction (separate selection for the data input from X for the final column).
- uct - Used for unit center optimization for certain types of FIR filters. Refer to the Versal ACAP AI Engine Intrinsics Documentation (UG1078) for more information.
- n and c - Used to indicate that the complex conjugate will be used for one of the input buffers with complex values:
  - c - The only complex input buffer will be conjugated.
  - cn - Complex conjugate of the X (or XY if pre-adding is used) buffer.
  - nc - Complex conjugate of the Z buffer.
  - cc - Complex conjugate of both the X (or XY if pre-adding is used) and Z buffers.
- conj - Indicates that the complex conjugate of Z will be used when multiplying the data input from Y.
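As a reading aid, the lmul8 intrinsic used later in this section combines l (80-bit accumulator lanes, returning a v8acc80), mul (a multiplication that initializes the accumulator), and 8 output lanes; mac16 likewise denotes a 16-lane multiply-accumulate onto a running accumulator.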
Data Permute and MAC Examples
The following example takes two vectors, with the real parts in rva and the imaginary parts in rvb (both of type v8int32), and creates a new complex vector, using the offsets to interleave the values as required.
v8cint32 cv = as_v8cint32(select16(0xaaaa, concat(rva, rvb),
0, 0x03020100, 0x07060504, 8, 0x30201000, 0x70605040));
The following example shows how to extract the real and imaginary portions of a vector cv with type v8cint32.
v16int32 re_im = shuffle16(as_v16int32(cv), 0, 0xECA86420, 0xFDB97531);
v8int32 re = ext_w(re_im, 0);
v8int32 im = ext_w(re_im, 1);
Shuffle intrinsic functions can be used to reorder the elements in a vector or set all elements to the same value. Some intrinsic functions operate only on larger registers, but it is easy to use them for smaller registers. The following example shows how to implement a function that sets all four elements in a vector to a constant value.
v4int32 v2 = ext_v(shuffle16(xset_v(0, v1), 0, 0, 0), 0);
The following example shows how to multiply each element in rva by the first element in rvb. This is efficient for multiplying a vector by a constant value.
v8acc80 acc = lmul8(concat(rva,undef_v8int32()),0,0x76543210,rvb,0,0x00);
The following examples show how to multiply each element in rva by its corresponding element in rvb.
acc = lmul8(concat(rva, undef_v8int32()),0,0x76543210,rvb,0,0x76543210);
acc = lmul8(upd_w(undef_v16int32(),0,rva),0,0x76543210,rvb,0,0x76543210);
The following examples show how to do matrix multiplication for the int8 x int8 data types with the mul intrinsic, assuming that data storage is row based.
//Z_{2x8} * X_{8x8} = A_{2x8}
mul16(Xbuff, 0, 0x11101110, 16, 0x3120, Zbuff, 0, 0x44440000, 2, 0x3210);
//Z_{4x8} * X_{8x4} = A_{4x4}
mul16(Xbuff, 0, 0x00000000, 8, 0x3120, Zbuff, 0, 0xCC884400, 2, 0x3210);
If the kernel has multiple mul or mac intrinsics, try to keep the xoffsets and zoffsets parameters constant across uses and vary the xstart and zstart parameters. This helps prevent configuration register spills onto the stack.
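For example, the following sketch keeps the offsets words identical and varies only xstart (rva, rvc, and rvb are assumed to be v8int32 vectors; the values are illustrative):
v16int32 xb = concat(rva, rvc);
v8acc80 a0 = lmul8(xb, 0, 0x76543210, rvb, 0, 0x76543210); // lanes 0..7 of xb
v8acc80 a1 = lmul8(xb, 8, 0x76543210, rvb, 0, 0x76543210); // lanes 8..15 of xb, same offsets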
For more information about vector lane permutations, refer to the Versal ACAP AI Engine Intrinsics Documentation (UG1078).