AI Engine Architecture Overview

Programming the AI Engine array requires a thorough understanding of the algorithm to be implemented, the capabilities of the AI Engines, and the overall data flow between individual functional units. The AI Engine array supports three levels of parallelism:

SIMD: Through vector registers that allow multiple elements to be computed in parallel.
Instruction level: Through the VLIW architecture that allows multiple instructions to be executed in a single clock cycle.
Multicore: Through the AI Engine array, where up to 400 AI Engines can execute in parallel.

While most standard C code can be compiled for the AI Engine, the code might need substantial restructuring to achieve optimal performance on the AI Engine array. The power of an AI Engine lies in its ability, in each clock cycle, to execute a vector MAC operation, load two 256-bit vectors for the next operation, store a 256-bit vector from the previous operation, and increment a pointer or execute another scalar operation. The AI Engine compiler does not perform any automatic or pragma-based vectorization. The code must be rewritten to use SIMD intrinsic data types (for example, v8int32) and vector intrinsic functions (for example, mac(…)), and these must be executed within a pipelined loop to achieve optimal performance. The 32-bit scalar RISC processor has an ALU, some non-linear functions, and data type conversions. Each AI Engine has access to a limited amount of memory, which means that large data sets need to be partitioned.
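As a rough illustration of the restructuring described above, the following portable C++ sketch contrasts a scalar dot-product loop with a version that processes eight 32-bit lanes per iteration. Vec8 and mac8 are hypothetical stand-ins for an 8-lane vector type and a vector MAC; they are not the AI Engine intrinsics, and the sketch assumes the element count is a multiple of eight. The actual intrinsic types and functions are documented in UG1078.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an 8-lane 32-bit vector type such as v8int32.
// Plain C++ for illustration only; this is not the AI Engine intrinsic type.
using Vec8 = std::array<std::int32_t, 8>;

// Hypothetical stand-in for a vector multiply-accumulate: acc[i] += x[i] * z[i].
// Real accumulators are wider (see Table 1); int32 is used here for brevity.
inline Vec8 mac8(Vec8 acc, const Vec8& x, const Vec8& z) {
    for (std::size_t lane = 0; lane < 8; ++lane) {
        acc[lane] += x[lane] * z[lane];
    }
    return acc;
}

// Scalar reference: one multiply-accumulate per loop iteration.
std::int32_t dot_scalar(const std::int32_t* x, const std::int32_t* z, std::size_t n) {
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * z[i];
    }
    return sum;
}

// Restructured form: eight lanes per iteration, reduced once after the loop.
// Assumes n is a multiple of 8.
std::int32_t dot_blocked(const std::int32_t* x, const std::int32_t* z, std::size_t n) {
    Vec8 acc{};
    for (std::size_t i = 0; i < n; i += 8) {
        Vec8 vx{}, vz{};
        for (std::size_t lane = 0; lane < 8; ++lane) {
            vx[lane] = x[i + lane];  // models one vector load of X
            vz[lane] = z[i + lane];  // models one vector load of Z
        }
        acc = mac8(acc, vx, vz);     // models one vector MAC
    }
    std::int32_t sum = 0;
    for (std::int32_t v : acc) {
        sum += v;                    // final lane reduction
    }
    return sum;
}
```

Both functions return the same result for the same inputs; the blocked form simply exposes the per-iteration structure (two loads, one MAC) that a pipelined vector loop would need.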

AI Engine kernels are functions that run on an AI Engine, and form the fundamental building blocks of a data-flow graph specification. The data-flow graph is a Kahn process network with deterministic behavior that does not depend on the various computational or communication delays. AI Engine kernels are declared as void C/C++ functions that take window or stream arguments for graph connectivity. Kernels can also have static data and run-time parameter arguments that can be either asynchronous or triggering. Each kernel should be defined in its own source file.
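The determinism property can be illustrated with a small host-side model. The sketch below is plain C++, not the AI Engine graph API: Stream, scale_kernel, offset_kernel, and run_graph are hypothetical names, and a std::deque stands in for a stream connection. It assumes each kernel fires only when its input holds a token, and it is meant to show that two different firing orders yield the same output.

```cpp
#include <deque>
#include <vector>

// Hypothetical FIFO channel standing in for a stream connection in the graph.
using Stream = std::deque<int>;

// Each "kernel" is a void function that consumes one token and produces one.
void scale_kernel(Stream& in, Stream& out) {
    int v = in.front();
    in.pop_front();
    out.push_back(2 * v);
}

void offset_kernel(Stream& in, Stream& out) {
    int v = in.front();
    in.pop_front();
    out.push_back(v + 1);
}

// Runs the chain: input -> scale_kernel -> offset_kernel -> output.
// offset_first selects which kernel the scheduler tries first on each pass.
// A kernel fires only when its input FIFO is non-empty, which models a
// blocking read.
std::vector<int> run_graph(const std::vector<int>& input, bool offset_first) {
    Stream a(input.begin(), input.end());
    Stream b;
    Stream c;
    bool fired = true;
    while (fired) {
        fired = false;
        for (int step = 0; step < 2; ++step) {
            bool try_offset = ((step == 0) == offset_first);
            if (try_offset && !b.empty()) {
                offset_kernel(b, c);
                fired = true;
            }
            if (!try_offset && !a.empty()) {
                scale_kernel(a, b);
                fired = true;
            }
        }
    }
    return std::vector<int>(c.begin(), c.end());
}
```

Calling run_graph with the same input and either value of offset_first produces the same sequence, because each kernel's output depends only on the order of tokens on its input, not on when it is scheduled.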

Achieving overall system performance requires additional reading and experience with the architecture, with partitioning, and with generating the AI Engine data-flow graph and optimizing its connectivity. The Versal ACAP AI Engine Architecture Manual (AM009) contains more detailed information.

Xilinx provides DSP and communications libraries with optimized code for the AI Engine that should be used whenever possible. The supplied source code is also a great resource for learning about AI Engine kernel coding.

AI Engine Tile Architecture

The AI Engine array consists of a 2D array of AI Engine tiles, where each AI Engine tile contains an AI Engine, memory module, and tile interconnect module. An overview of such an AI Engine tile is shown in the following figure.

AI Engine: Each AI Engine is a very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, and a single store unit.
AI Engine Tile: An AI Engine tile contains an AI Engine and a local memory module, together with several communication paths to facilitate data exchange between tiles.
AI Engine Array: AI Engine array refers to the complete 2D array of AI Engine tiles.
AI Engine Program: The AI Engine program consists of a data-flow graph specification which is written in C/C++. This program is compiled and executed using the AI Engine tool chain.
AI Engine Kernels: Kernels are written in C/C++ using AI Engine vector data types and intrinsic functions. These are the computation functions running on an AI Engine. The kernels form the fundamental building blocks of a data-flow graph specification.

The following illustration shows the architecture of a single AI Engine.

Each AI Engine is a very long instruction word (VLIW) processor containing a scalar unit, a vector unit, two load units, and one store unit. The main compute power is provided by the vector unit. The vector unit contains a fixed-point unit with 128 8-bit fixed-point multipliers and a floating-point unit with eight single-precision floating-point multipliers. The vector registers and permute network are shared between the floating-point and fixed-point vector units. The peak performance depends on the size of the data types used by the operands. The following table provides the number of MAC operations that can be performed by the vector processor per instruction.

Table 1. Supported Precision Bit Width of the Vector Datapath
X Operand	Z Operand	Output	Number of MACs
8 real	8 real	48 real	128
16 real	8 real	48 real	64
16 real	16 real	48 real	32
16 real	16 complex	48 complex	16
16 complex	16 real	48 complex	16
16 complex	16 complex	48 complex	8
16 real	32 real	48/80 real	16
16 real	32 complex	48/80 complex	8
16 complex	32 real	48/80 complex	8
16 complex	32 complex	48/80 complex	4
32 real	16 real	48/80 real	16
32 real	16 complex	48/80 complex	8
32 complex	16 real	48/80 complex	8
32 complex	16 complex	48/80 complex	4
32 real	32 real	80 real	8
32 real	32 complex	80 complex	4
32 complex	32 real	80 complex	4
32 complex	32 complex	80 complex	2
32 SPFP	32 SPFP	32 SPFP	8

To calculate the maximum performance for a given datapath, multiply the number of MACs per instruction by the clock frequency of the AI Engine kernel. For example, with 16-bit real input vectors X and Z, the vector processor can achieve 32 MACs per instruction. Using the clock frequency for the slowest speed grade results in:

32 MACs * 1 GHz clock frequency = 32 Giga MAC operations/second

In most cases, 32 MACs/instruction remains a theoretical upper bound because the algorithm to be implemented cannot continuously use the full capabilities of the AI Engine or might be constrained by I/O bandwidth.
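The same calculation can be applied to other rows of Table 1. The following C++ sketch is a minimal illustration, assuming the 1 GHz clock used in the example above; the utilization parameter is a hypothetical factor introduced here to model the gap between the theoretical bound and sustained throughput, not a figure from this document.

```cpp
// MACs per instruction for three real-valued rows of Table 1
// (X operand bits x Z operand bits).
constexpr int kMacs8x8 = 128;
constexpr int kMacs16x16 = 32;
constexpr int kMacs32x32 = 8;

// Peak rate in giga-MACs per second: MACs per instruction times clock in GHz.
// Assumes one vector MAC instruction issues every cycle (theoretical bound).
constexpr double peak_gmacs(int macs_per_instruction, double clock_ghz) {
    return macs_per_instruction * clock_ghz;
}

// Sustained rate when only a fraction of cycles issue a vector MAC.
// utilization is a hypothetical value in [0, 1], chosen by the caller.
constexpr double sustained_gmacs(int macs_per_instruction, double clock_ghz,
                                 double utilization) {
    return peak_gmacs(macs_per_instruction, clock_ghz) * utilization;
}
```

For instance, peak_gmacs(kMacs16x16, 1.0) reproduces the 32 giga-MAC figure above, and a kernel that issues a vector MAC on only half of its cycles would sustain half of that.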

The main I/O interfaces for reading and writing compute data to and from the AI Engine are the data memory interfaces, the stream interfaces, and the cascade stream interfaces. A complete list of interfaces, including the program memory interface and debug interface, is available in the Versal ACAP AI Engine Architecture Manual (AM009).

Note: Xilinx highly recommends reading Versal ACAP AI Engine Architecture Manual (AM009) prior to starting your AI Engine kernel programming.

The data memory interface sees one contiguous memory consisting of the data memory modules in all four directions, with a total capacity of 128 KB. The AI Engine has two 256-bit-wide load units and one 256-bit-wide store unit.
The AI Engine has two 32-bit input AXI4-Stream interfaces and two 32-bit output AXI4-Stream interfaces. Each of these streams allows the AI Engine to have a 128-bit access every four clock cycles or a 32-bit access every cycle.
The 384-bit accumulator data from one AI Engine can be forwarded to the neighboring AI Engine by using the cascade stream interfaces to form a chain. The cascade stream interface is uni-directional, and its direction depends on the row where the AI Engine is located. There is a small, two-deep, 384-bit-wide FIFO on both the input and output streams, which together allow up to four values to be stored between AI Engines. Each cycle, 384 bits can be sent and received by the chained AI Engines.
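To compare these interfaces, it can help to express each one as bits moved per clock cycle. The sketch below is a minimal C++ illustration that uses the widths listed above; the constant and function names are hypothetical, and the 1 GHz clock in the usage note is taken from the earlier performance example, so the resulting figures are peak values only.

```cpp
// Bits moved per clock cycle, taken from the interface descriptions above.
constexpr int kLoadBitsPerCycle = 2 * 256;     // two 256-bit load units
constexpr int kStoreBitsPerCycle = 1 * 256;    // one 256-bit store unit
constexpr int kStreamInBitsPerCycle = 2 * 32;  // two 32-bit input streams
constexpr int kStreamOutBitsPerCycle = 2 * 32; // two 32-bit output streams
constexpr int kCascadeBitsPerCycle = 384;      // one 384-bit cascade transfer

// Peak bandwidth in gigabytes per second for a given clock in GHz.
constexpr double gbytes_per_second(int bits_per_cycle, double clock_ghz) {
    return bits_per_cycle / 8.0 * clock_ghz;
}

// Cycles needed to move `bits` over an interface, rounded up.
constexpr int cycles_to_move(int bits, int bits_per_cycle) {
    return (bits + bits_per_cycle - 1) / bits_per_cycle;
}
```

At 1 GHz, gbytes_per_second(kLoadBitsPerCycle, 1.0) gives 64 GB/s for the two load units combined, while cycles_to_move(256, 32) shows that a single 32-bit stream needs eight cycles to deliver one 256-bit vector. This gap is one reason an algorithm can be I/O bound rather than compute bound.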

The program memory size on the AI Engine is 16 KB, which allows storing 1024 instructions of 128 bits each (16 × 1024 × 8 bits / 128 bits per instruction). The AI Engine instructions are 128 bits wide and support multiple instruction formats and variable-length instructions to reduce the program memory size. Many instructions outside of the optimized inner loop can use the shorter formats.

Tools

Vitis Integrated Design Environment

The Vitis™ integrated design environment (IDE) can be used to target system programming of Xilinx® devices, including Versal™ devices with multiple AI Engine kernels. The following features are available in the tool.

An optimizing C/C++ compiler that compiles the kernels and graph code, making all of the necessary connections, placements, and checks to ensure proper functioning on the device.
A cycle-accurate simulator, accelerated functional simulator, and profiling tools.
A debugging environment that works in both simulation and hardware environments.

Vitis Command Line Tools

Command line tools are available to build, simulate, and generate output files and reports. Command line outputs which are generated by the IDE are captured to facilitate subsequent integration into customer build environments. The Vitis analyzer IDE is available for report viewing and analysis of the output files and reports generated by the command line tools.

Xilinx Model Composer and System Generator

Model Composer and System Generator offer a high-level graphical entry environment based on MATLAB® and Simulink® for simulation and code generation of designs that include AI Engine, HLS, and RTL components.

Import AI Engine kernels, graphs, HLS kernels, and RTL-based blocks (from System Generator) into one Simulink® design for fast co-simulation.
From the Simulink library browser, drag and drop optimized AI Engine functions such as Finite Impulse Response (FIR) filters into the design.
Verify the design using stimulus generated in MATLAB or Simulink, visualize the results, and compare the results with a golden reference. Generate graph code and test vectors.
Assemble imported and block library code to feed into downstream tools.

Documentation

The following links are useful in developing and programming your AI Engine kernels.

The Versal ACAP AI Engine Intrinsics Documentation (UG1078) lists all the intrinsic APIs and data types supported in the current release and serves as the reference guide for them.
The Chess Compiler User Manual has a list of all the pragmas and functions to help optimize your AI Engine kernel code. It can be found in the AI Engine lounge.
Versal ACAP AI Engine Register Reference (AM015)