AI Engine Programming

An AI Engine program consists of a data flow graph specification written in C++. As described in C++ Template Support, you can use template classes or functions to write the AI Engine graph or kernels. The application can be compiled and executed using the AI Engine tool chain. This chapter provides an introduction to writing an AI Engine program.

A complete class reference guide is shown in Adaptive Data Flow Graph Specification Reference. The example that is used in this chapter can be found as a template example in the Vitis™ environment when creating a new AI Engine project.

Prepare the Kernels

Kernels are computation functions that form the fundamental building blocks of the data flow graph specifications. Kernels are declared as ordinary C/C++ functions that return void and can use special data types as arguments (discussed in Window and Streaming Data API). Each kernel should be defined in its own source file. This organization is recommended for reusability and faster compilation. Furthermore, the kernel source files should include all relevant header files to allow for independent compilation. It is recommended that a header file (kernels.h in this documentation) should declare the function prototypes for all kernels used in a graph. An example is shown below.

#ifndef FUNCTION_KERNELS_H
#define FUNCTION_KERNELS_H

void simple(input_window_cint16 * in, output_window_cint16 * out);

#endif

In the example, the #ifndef and #endif are present to ensure that the include file is only included once, which is good C/C++ practice.
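For reference, a matching definition of the kernel might look like the following sketch of kernels.cc. The pass-through body is purely illustrative (a real kernel would perform some computation), and it assumes the window read/write API (window_readincr, window_writeincr) described in Window and Streaming Data API; this file only compiles under the AI Engine tool chain.

```cpp
// kernels.cc -- illustrative pass-through sketch of the 'simple' kernel.
// Assumes the window API (window_readincr/window_writeincr); a real kernel
// would replace the copy loop with actual computation.
#include <adf.h>
#include "kernels.h"

void simple(input_window_cint16 * in, output_window_cint16 * out) {
  for (unsigned i = 0; i < 32; i++) {      // 128-byte window = 32 cint16 samples
    cint16 sample = window_readincr(in);   // read one sample, advance the window
    window_writeincr(out, sample);         // write it to the output window
  }
}
```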

Creating a Data Flow Graph (Including Kernels)

The following process describes how to construct data flow graphs in C++.
  1. Define your application graph class in a separate header file (for example project.h). First, add the Adaptive Data Flow (ADF) header (adf.h) and include the kernel function prototypes. The ADF library includes all the required constructs for defining and executing the graphs on AI Engines.
    #include <adf.h>
    #include "kernels.h"
  2. Define your graph class by using the objects which are defined in the adf name space. All user graphs are derived from the class graph.
    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class simpleGraph : public graph {
    private:
      kernel first;
      kernel second;
    };

    This is the beginning of a graph class definition that declares two kernels (first and second).

  3. Add some top-level ports to the graph.
    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class simpleGraph : public graph {
    private:
      kernel first;
      kernel second;
    public:
      input_port in;
      output_port out;
    
    };
  4. Use the kernel::create function to instantiate the first and second C++ kernel objects using the functionality of the C function simple.
    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class simpleGraph : public graph {
    private:
      kernel first;
      kernel second;
    public:
      input_port in;
      output_port out;
      simpleGraph() {
          first = kernel::create(simple);
          second = kernel::create(simple);
      }
    };
  5. Add the connectivity information, which is equivalent to the nets in a data flow graph. In this description, ports are referenced by indices. The first input window or stream argument in the simple function is assigned index 0 in an array of input ports (in). Subsequent input arguments take ascending consecutive indices. The first output window or stream argument in the simple function is assigned index 0 in an array of output ports (out). Subsequent output arguments take ascending consecutive indices.
    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class simpleGraph : public graph {
    private:
      kernel first;
      kernel second;
    public:
      input_port in;
      output_port out;
    
      simpleGraph() {
        first = kernel::create(simple);
        second = kernel::create(simple);
        connect< window<128> > net0 (in, first.in[0]);
        connect< window<128> > net1 (first.out[0], second.in[0]);
        connect< window<128> > net2 (second.out[0], out);
      }
    };

    As shown, the input port at the top level is connected to the input port of the first kernel, the output port of the first kernel is connected to the input port of the second kernel, and the output port of the second kernel is connected to the output port exposed at the top level. The first kernel executes when 128 bytes of data (32 complex cint16 samples) have been collected in a buffer from an external source. This is specified as a window parameter on connection net0. Likewise, the second kernel executes when valid data is available in its input window, produced as the output of the first kernel over connection net1. Finally, the output of the second kernel is connected to the top-level output port through connection net2, specifying that upon termination the second kernel produces 128 bytes of data.

  6. Set the source file and run-time ratio for each of the kernels. The source file kernels.cc contains the source code for the kernels first and second. The run-time ratio is the ratio of the function run time to the cycle budget, and must be between 0 and 1. The cycle budget is the number of instruction cycles a function can take to either consume data from its input (when dealing with a rate-limited input data stream) or to produce a block of data on its output (when dealing with a rate-limited output data stream). This cycle budget can be changed by changing the block sizes.
    #include <adf.h>
    #include "kernels.h"
    
    using namespace adf;
    
    class simpleGraph : public graph {
    private:
      kernel first;
      kernel second;
    public:
      input_port in;
      output_port out;
      simpleGraph(){
        
        first = kernel::create(simple);
        second = kernel::create(simple);
        connect< window<128> > net0 (in, first.in[0]);
        connect< window<128> > net1 (first.out[0], second.in[0]);
        connect< window<128> > net2 (second.out[0], out);
    
        source(first) = "kernels.cc";
        source(second) = "kernels.cc";
    
        runtime<ratio>(first) = 0.1;
        runtime<ratio>(second) = 0.1;
    
      }
    };
    Note: See Run-Time Ratio for more information.
  7. Define a top-level application file (for example project.cpp) that contains an instance of your graph class and connect the graph to a simulation platform to provide file input and output. In this example, these files are called input.txt and output.txt.
    #include "project.h"
    
    simpleGraph mygraph;
    simulation::platform<1,1> platform("input.txt", "output.txt");
    connect<> net0(platform.src[0], mygraph.in);
    connect<> net1(mygraph.out, platform.sink[0]);
    
    int main(void) {
      adf::return_code ret;
      mygraph.init();
      ret=mygraph.run(<number_of_iterations>);
      if(ret!=adf::ok){
        printf("Run failed\n");
        return ret;
      }
      ret=mygraph.end();
      if(ret!=adf::ok){
        printf("End failed\n");
        return ret;
      }
      return 0;
    }
IMPORTANT: By default, mygraph.run() specifies that the graph runs forever. The AI Engine compiler generates code to execute the data flow graph in a perpetual while loop. To limit the execution of the graph for debugging and test, specify mygraph.run(<number_of_iterations>) in the graph code. The specified number of iterations can be one or more.

ADF APIs return the enumerated type adf::return_code to report the API execution status.

The main program is the driver for the graph. It is used to load, execute, and terminate the graph. See Run-Time Graph Control API for more details.

Note: Kernel code must be written so that no name clashes occur when two kernels get assigned to the same core.

Run-Time Ratio

The run-time ratio is a user-specified constraint that gives the AI Engine tools the flexibility to place multiple AI Engine kernels into a single AI Engine, if their combined run-time ratio is less than 1. The run-time ratio of a kernel can be computed using the following equation.
run-time ratio = (cycles for one run of the kernel)/(cycle budget)
The cycle budget is the cycles allowed to run one invocation of the kernel which depends on the system throughput requirement.
Cycles for one run of the kernel can be estimated in the initial design stage. For example, if the kernel contains a loop that can be well pipelined, so that each loop iteration completes in a known number of cycles, then the cycles for one run of the kernel can be estimated as follows.
cycles for one run = synchronization of synchronous buffers + function initialization + loop count * cycles per loop iteration + loop preamble and postamble
Note: For more information about loop pipelining, see the Versal ACAP AI Engine Kernel Coding User Guide (UG1079).

Cycles for one run of the kernel can also be profiled in the AI Engine simulator when vectorized code is available.

If multiple AI Engine kernels are put into a single AI Engine, they run in a sequential manner, one after the other, and they all run once with each iteration of graph::run. This means the following.

  • If the AI Engine run-time percentage (specified by the run-time constraint) is allocated for the kernel in each iteration of graph::run (or on an average basis, depending on the system requirement), the kernel performance requirement can be met.
  • For a single iteration of graph::run, the kernel takes no more percentage than that specified by the run-time constraint. Otherwise, it might affect other kernels' performance that are located in the same AI Engine.
  • Even if multiple kernels have a summarized run-time ratio less than one, they are not necessarily put into a single AI Engine. The mapping of an AI Engine kernel into an AI Engine is also affected by hardware resources. For example, there must be enough program memory to allow the kernels to be in the same AI Engine, and also, stream interfaces must be available to allow all the kernels to be in the same AI Engine.
  • When multiple kernels are put into the same AI Engine, resources might be saved. For example, the buffers between the kernels in the same AI Engine are single buffers instead of ping-pong buffers.
  • Increasing the run-time ratio of a kernel does not necessarily mean that the performance of the kernel or the graph is increased, because the performance is also affected by the data availability to the kernel and the data throughput in and out of the graph. An unreasonably high run-time ratio setting might result in inefficient resource utilization.
  • Low run-time ratio does not necessarily limit the performance of the kernel to the specified percentage of the AI Engine. For example, the kernel can run immediately when all the data is available if there is only one kernel in the AI Engine, no matter what run-time ratio is set.
  • Kernels in different top-level graphs cannot be placed into the same AI Engine, because the graph API must control each graph independently.
  • Set the run-time ratio as accurately as possible, because it affects not only the AI Engine to be used, but also the data communication routes between kernels. It might also affect other design flows, for example, the power estimation.

Recommended Project Directory Structure

The following directory structure and coding practices are recommended for organizing your AI Engine projects to provide clarity and reuse.

  • All adaptive data flow (ADF) graph class definitions (that is, all classes derived from the graph class adf::graph) must be located in a header file. Multiple ADF graph definitions can be included in the same header file. This class header file should be included in the main application file, where the actual top-level graph is declared in the file scope (see Creating a Data Flow Graph (Including Kernels)).
  • There should be no dependencies on the order in which header files are included. All header files must be self-contained, each including all the other header files that it needs.
  • There should be no file scoped variable or data-structure definitions in the graph header files. Any definitions (including static) must be declared in a separate source file that can be identified in the header property of the kernel where they are referenced (see Look-up Tables).
  • There is no need to declare the kernels under extern "C" {...}. However, this declaration can be used in an application meant to run full-program simulation, provided it adheres to the following conditions:
    • If the kernel-function declaration is wrapped with extern "C", then the definition must know about it. This can be done by either including the header file inside the definition file, or wrapping the definition with extern "C".
    • The extern "C" must be wrapped with #ifdef __cplusplus. This is analogous to how extern "C" is used in stdio.h.
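A kernels.h following these guidelines might look like the sketch below. The #ifdef __cplusplus guard keeps the header usable from both C and C++ translation units; the declaration itself still requires the AI Engine tool chain headers to compile.

```cpp
// kernels.h -- illustrative sketch of an extern "C" kernel declaration.
#ifndef FUNCTION_KERNELS_H
#define FUNCTION_KERNELS_H

#ifdef __cplusplus
extern "C" {
#endif

void simple(input_window_cint16 * in, output_window_cint16 * out);

#ifdef __cplusplus
}
#endif

#endif // FUNCTION_KERNELS_H
```

Remember that the definition in kernels.cc must see this declaration (by including this header) or be wrapped in extern "C" itself, so that the linkage of declaration and definition agree.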

Compiling and Running the Graph from the Command Line

  1. To compile your graph, execute the following command (see Compiling an AI Engine Graph Application for more details).
    aiecompiler project.cpp

    The program is called project.cpp. The AI Engine compiler reads the input graph specified, compiles it to the AI Engine array, produces various reports, and generates output files in the Work directory.

  2. After parsing the C++ input into a graphical intermediate form expressed in JavaScript object notation (JSON), the AI Engine compiler does the resource mapping and scheduling analysis and maps kernel nodes in the graph to the processing cores in the AI Engine array and data windows to memory banks. The JSON representation is augmented with mapping information. Each AI Engine also requires a schedule of all the kernels mapped to it.

    The input graph is first partitioned into groups of kernels to be mapped to the same core.

    The output of the mapper can also be viewed as a tabular report in the file project_mapping_analysis_report.txt. This reports the mapping of nodes to processing cores and data windows to memory banks. Inter-processor communication is appropriately double-banked as ping-pong buffers.

  3. The AI Engine compiler allocates the necessary locks, memory buffers, and DMA channels and descriptors, and generates routing information for mapping the graph onto the AI Engine array. It synthesizes a main program for each core that schedules all the kernels on the cores, and implements the necessary locking mechanism and data copy among buffers. The C program for each core is compiled using the Synopsys Single Core Compiler to produce loadable ELF files. The AI Engine compiler also generates control APIs to control the graph initialization, execution and termination from the main application and a simulator configuration script scsim_config.json. These are all stored within the Work directory under various sub-folders (see Compiling an AI Engine Graph Application for more details).
  4. After the compilation of the AI Engine graph, the AI Engine compiler writes a summary of compilation results called <graph-file-name>.aiecompile_summary to view in the Vitis analyzer. The summary contains a collection of reports, and diagrams reflecting the state of the AI Engine application implemented in the compiled build. The summary is written to the working directory of the AI Engine compiler as specified by the --workdir option, which defaults to ./Work.

    To open the AI Engine compiler summary, use the following command:

    vitis_analyzer ./Work/graph.aiecompile_summary
  5. To run the graph, execute the following command (see Simulating an AI Engine Graph Application for more details).
    aiesimulator --pkg-dir=./Work

    This starts the SystemC-based simulator with the control program being the main application. The graph APIs which are used in the control program configure the AI Engine array including setting up static routing, programming the DMAs, loading the ELF files onto the individual cores, and then initiates AI Engine array execution.

    At the end of the simulation, the output data is produced in the directory aiesimulator_output and it should match the reference data.

    The graph can be loaded at device boot time in hardware or through the host application. Details on deploying the graph in hardware and the flow associated with it is described in detail in Integrating the Application Using the Vitis Tools Flow.

Note: Only AI Engine kernels that have been modified are recompiled in subsequent compilations of the AI Engine graph; unmodified kernels are not recompiled.