Vector Processing Unit
The vector unit contains a fixed-point unit with 128 8-bit fixed-point multipliers and a floating-point unit with eight single-precision floating-point multipliers. The vector registers and permute network are shared between the fixed-point and floating-point multipliers. The peak performance depends on the size of the data types used by the operands. The following table provides the number of MAC operations that can be performed by the vector processor per instruction.
X Operand (bits) | Z Operand (bits) | Output (bits) | Number of MACs/Clock |
---|---|---|---|
8 real | 8 real | 48 real | 128 |
16 real | 8 real | 48 real | 64 |
16 real | 16 real | 48 real | 32 |
16 real | 16 complex | 48 complex | 16 |
16 complex | 16 real | 48 complex | 16 |
16 complex | 16 complex | 48 complex | 8 |
16 real | 32 real | 48/80 real | 16 |
16 real | 32 complex | 48/80 complex | 8 |
16 complex | 32 real | 48/80 complex | 8 |
16 complex | 32 complex | 48/80 complex | 4 |
32 real | 16 real | 48/80 real | 16 |
32 real | 16 complex | 48/80 complex | 8 |
32 complex | 16 real | 48/80 complex | 8 |
32 complex | 16 complex | 48/80 complex | 4 |
32 real | 32 real | 80 real | 8 |
32 real | 32 complex | 80 complex | 4 |
32 complex | 32 real | 80 complex | 4 |
32 complex | 32 complex | 80 complex | 2 |
32 SPFP | 32 SPFP | 32 SPFP | 8 |
The X operand is 1024 bits wide and the Z operand is 256 bits wide. To see how the components are used, consider the first row of the preceding table. The multiplier operands come from the same 1024-bit and 256-bit input registers, but some values are broadcast to multiple multipliers. Each of the 128 8-bit multipliers produces one product, and the products are post-added and accumulated into 16 or 8 accumulator lanes of 48 bits each.
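The fixed-point rows of the table follow a simple arithmetic pattern, which the following minimal Python sketch captures. It is an observation about the numbers in the table, not a documented formula. It assumes that an a-bit by b-bit real product consumes (a/8) × (b/8) of the 128 8-bit multipliers, and that a complex operand doubles the number of real products. The names `macs_per_clock` and `FIXED_POINT_ROWS` are hypothetical. The SPFP row is excluded because it uses the eight floating-point multipliers instead.

```python
# Hypothetical model of the fixed-point rows in the MAC table above.
# Assumption (not stated in the source): an a-bit x b-bit real product
# is built from (a/8) * (b/8) of the 8-bit multipliers.

NUM_MULTIPLIERS = 128  # 8-bit fixed-point multipliers, from the text


def macs_per_clock(x_bits, z_bits, x_complex=False, z_complex=False):
    """Estimate fixed-point MACs per clock from the 8-bit multiplier budget."""
    for bits in (x_bits, z_bits):
        if bits not in (8, 16, 32):
            raise ValueError(f"unsupported operand width: {bits}")
    # 8-bit multipliers assumed per real-by-real product.
    multipliers_per_product = (x_bits // 8) * (z_bits // 8)
    # real*real needs 1 real product, complex*real 2, complex*complex 4.
    real_products = (2 if x_complex else 1) * (2 if z_complex else 1)
    return NUM_MULTIPLIERS // (multipliers_per_product * real_products)


# (x_bits, x_complex, z_bits, z_complex, MACs/clock) copied from the table.
FIXED_POINT_ROWS = [
    (8, False, 8, False, 128),
    (16, False, 8, False, 64),
    (16, False, 16, False, 32),
    (16, False, 16, True, 16),
    (16, True, 16, False, 16),
    (16, True, 16, True, 8),
    (16, False, 32, False, 16),
    (16, False, 32, True, 8),
    (16, True, 32, False, 8),
    (16, True, 32, True, 4),
    (32, False, 16, False, 16),
    (32, False, 16, True, 8),
    (32, True, 16, False, 8),
    (32, True, 16, True, 4),
    (32, False, 32, False, 8),
    (32, False, 32, True, 4),
    (32, True, 32, False, 4),
    (32, True, 32, True, 2),
]

# Rows where the model disagrees with the table (expected to be empty).
mismatches = [
    row for row in FIXED_POINT_ROWS
    if macs_per_clock(row[0], row[2], row[1], row[3]) != row[4]
]
```

The function returns a value for any combination of 8-, 16-, and 32-bit widths, including combinations the table does not list. Only the rows in the table should be treated as supported datapaths.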
To calculate the maximum performance for a given datapath, multiply the number of MACs per instruction by the clock frequency of the AI Engine kernel. For example, with 16-bit real input vectors X and Z, the vector processor can achieve 32 MACs per instruction. Using the clock frequency for the slowest speed grade device results in: