AI Engine/Programmable Logic Integration

In addition to kernels operating on the AI Engines, you can specify kernels to run in the programmable logic (PL) region of the device. PL kernels can be written in RTL or in HLS C/C++. An HLS C/C++ kernel can be used directly in a graph. For RTL kernels, the RTL code is packaged into a kernel separately by the Vitis™ tools.

  • PL kernels can be modeled as C++ functions to represent the functionality in the graph. See Model PL Kernels with C++.
  • For PL kernels written in HLS C/C++, the functions can be used directly in an input graph, provided that the HLS features used are supported by the AI Engine compiler. See HLS Kernels Written in C/C++.

Model PL Kernels with C++

Kernels intended for the PL must use stream inputs and outputs rather than windows, using the API defined in Programmable Logic Stream Operations. To use these data types, the standard header adf.h must be included before the function declaration. PL kernels modeled in C/C++ can be simulated using the AI Engine simulator. However, this C/C++ model cannot be taken to hardware because the code is only a simulation construct.

Prepare the Kernels describes the header file for an AI Engine kernel; the same concept applies to PL kernels modeled in C++. The following code shows an example kernel declaration.

#ifndef MAGNITUDE_MODULE_H
#define MAGNITUDE_MODULE_H

#include <adf.h>

void pladd(input_stream_int32 * in1, input_stream_int32 *in2, output_stream_int32 * out);

#endif

In this example, the module has two stream inputs carrying 32-bit integers, and produces a single stream output carrying 32-bit integers. The body of the kernel is specified in a separate file and must include adf.h. The following is an example kernel body.

#include <adf.h>
#include <iostream>

void pladd(input_stream_int32 *in1, input_stream_int32 *in2, output_stream_int32 *out) {
  std::cerr << "Waiting for a value" << "\n";
  int value1 = readincr(in1);
  int value2 = readincr(in2);
  int newvalue = value1 + value2;
  std::cerr << "add " << value1 << " and " << value2 << " sent " << newvalue << "\n";
  writeincr(out, newvalue);
}

There is no restriction on the C++ code used. For debugging, in addition to performing the actual computation, the kernel can print values to the console. To read from a stream, use the readincr accessor functions for the various data types (see Reading and Advancing an Input Stream). To write to a stream, use the writeincr accessor functions (see Writing and Advancing an Output Stream). Both are blocking calls: readincr stalls when no input data is available, and writeincr stalls when there is insufficient buffering between the source and destination.

You can specify that a kernel is intended to run in the programmable logic by attaching an attribute to the kernel instance.

fabric<pl>(adder);

The source file implementing the functionality of the kernel must also be specified, in exactly the same way as for AI Engine kernels.

source(adder) = "module/add.cpp";

Connections between a PL kernel and an AI Engine kernel are similar to connections between two AI Engine kernels. The difference is that the connection now specifies a stream on the PL side and a window on the AI Engine side, rather than a window on both sides. For example, the following connections are used to connect the adder.

connect<window<32>, stream>(prod.out[0], adder.in[0]);
connect<window<32>, stream>(prod.out[1], adder.in[1]);
connect<stream, window<32> >(adder.out[0], cons.in[0]);

All source and destination combinations are supported (AI Engine to PL, PL to AI Engine, and PL to PL). Run-time parameters (see Run-Time Graph Control API) can be specified for kernels mapped to the programmable logic. The input specification for these parameters is exactly the same as for kernels mapped to the AI Engine array. Parameters in the PL are synthesized to registers or memories accessible via the memory-mapped AXI4 bus. Only scalar input parameters are supported in the PL, and they are modeled through a separate SystemC control thread.
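
For example, a scalar run-time parameter can be exposed on a PL kernel from the graph, just as for an AI Engine kernel. The following is a minimal sketch; the kernel name adder and the port index are illustrative and assume the kernel function declares an additional scalar argument:

input_port factor;                              // graph-level RTP port
connect<parameter>(factor, async(adder.in[2])); // synthesized to an AXI4-Lite register in the PL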

Programmable Logic Stream Operations

The following operations read data from the given input stream and advance the stream for the kernels mapped to the programmable logic.

int32 readincr(input_stream_int32 *w);
uint32 readincr(input_stream_uint32 *w);
cint16 readincr(input_stream_cint16 *w);
float readincr(input_stream_float *w);

int64 readincr(input_stream_int64 *w);
uint64 readincr(input_stream_uint64 *w);
cint32 readincr(input_stream_cint32 *w);
cfloat readincr(input_stream_cfloat *w);

The following operations write data to the given output stream and advance the stream for the kernels mapped to the programmable logic.

void writeincr(output_stream_int32 *w, int32 v);
void writeincr(output_stream_uint32 *w, uint32 v);
void writeincr(output_stream_cint16 *w, cint16 v);
void writeincr(output_stream_float *w, float v);

void writeincr(output_stream_int64 *w, int64 v);
void writeincr(output_stream_uint64 *w, uint64 v);
void writeincr(output_stream_cint32 *w, cint32 v);
void writeincr(output_stream_cfloat *w, cfloat v);

HLS Kernels Written in C/C++

This section describes programmable logic (PL) kernels written in C/C++ that work inside the AI Engine graph. It describes the restrictions on kernel interfaces, the constraints on the function signature, and the control protocols for RTP support.

For information on C/C++ kernels in the Vitis tools flow, see C/C++ Kernels in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).

Supported High-Level Synthesis Features in the AI Engine Compiler

The AI Engine compiler supports a subset of high-level synthesis (HLS) interfaces.

The supported input and output data types of HLS functions are:

  • hls::stream<ap_axis<N,0,0,0>> and hls::stream<ap_axiu<N,0,0,0>>, where N can be 32, 64, or 128
  • hls::stream<ap_int<N>> and hls::stream<ap_uint<N>>, where N can be 32 or 64
  • hls::stream<T>, where T can be int, unsigned int, long long, unsigned long long, or float
  • hls::stream<std::complex<T>>, where T can be int, short or float

The following scalar run-time parameter (RTP) data types are also supported:

  • ap_int<N> and ap_uint<N>, where N can be 8, 16, 32, or 64
  • short, unsigned short, int, unsigned int, long long, unsigned long long, and float
  • std::complex<T>, where T can be int, short, or float

The following array RTP data types are supported:

  • Arrays of ap_int<N> and ap_uint<N>, where N can be 8, 16, 32, or 64
  • Arrays of short, unsigned short, int, unsigned int, long long, unsigned long long, or float
  • Arrays of std::complex<T>, where T can be int, short, or float

Note: For PL kernels, ap_memory and s_axilite interfaces are required to support array RTPs. Synchronous array RTPs are not supported for PL kernels. Asynchronous array RTPs are supported for PL kernels in bare-metal systems.

When ap_int<N> and ap_uint<N> are used as RTPs, the PS program can use adf::graph::update and adf::graph::read with compatible data types to access RTPs. For example, use adf::graph::update(input_port& rtpPort, int32 value) to update the ap_int<32> run-time parameter. For more information on the use of RTPs, see Run-Time Graph Control API.
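
As a minimal sketch, assuming a graph gr whose PL kernel exposes an ap_int<32> asynchronous RTP through a graph-level port named gain (the names are illustrative):

gr.init();
gr.update(gr.gain, (int32)5); // write the ap_int<32> RTP using the compatible int32 overload
gr.run(1);
gr.wait();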

The only supported HLS function return type is void.

graph::update for an HLS scalar RTP wider than 32 bits should be called before graph::run.

To support the HLS math library inside the kernel, an additional linker option is needed for the AI Engine compiler:

--Xpslinker="-lhlsmc++-GCC46
-lIp_floating_point_v7_0_bitacc_cmodel 
-lIp_xfft_v9_1_bitacc_cmodel 
-lIp_fir_compiler_v7_2_bitacc_cmodel 
-lIp_dds_compiler_v6_0_bitacc_cmodel 
-L$(XILINX_HLS)/lnx64/tools/fpo_v7_0 
-L$(XILINX_HLS)/lnx64/lib/csim 
-L$(XILINX_HLS)/lnx64/tools/dds_v6_0 
-L$(XILINX_HLS)/lnx64/tools/fir_v7_0 
-L$(XILINX_HLS)/lnx64/tools/fft_v9_1 
-Wl,-rpath,$(XILINX_HLS)/lnx64/lib/csim 
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/fpo_v7_0 
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/fft_v9_1 
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/fir_v7_0 
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/dds_v6_0"

Using High-Level Synthesis IP in Data Flow Graphs

HLS functions can be used directly in data flow graphs. The only extra work is to write a separate header file that declares the function signatures and specifies the hls::stream directions for the HLS functions used in the graph. In this special header file, the function signature is declared the same as in the HLS function definition, except that the input or output direction of each hls::stream parameter is qualified using adf::dir::in<T> or adf::dir::out<T>, respectively.

For example, given an HLS function hls_mul defined in hls_mul.cpp:

#include "hls_stream.h"
#include "ap_int.h"
extern "C" void hls_mul(hls::stream<ap_axis<32,0,0,0>>& in, hls::stream<ap_axis<32,0,0,0>>& out, ap_int<32> inputParam, unsigned int loop_count)
{
  for (unsigned int n=0; n<loop_count; n++)
  {
    for (int i=0; i<8; i++)
    {
      ap_axis<32,0,0,0> value = in.read();
      value.data *= inputParam;
      out.write(value);
    }
  }
}

A separate header file, for example hls_kernels.h, should be written as follows:

#ifndef __HLS_KERNELS_H__
#define __HLS_KERNELS_H__
#include "hls_stream.h"
#include "ap_int.h"
#include "adf.h"
void hls_mul(adf::dir::in<hls::stream<ap_axis<32,0,0,0>>&> in, adf::dir::out<hls::stream<ap_axis<32,0,0,0>>&> out, ap_int<32> inputParam, unsigned int loop_count);
#endif

In this special header file, the data types of parameters in and out are changed to adf::dir::in<hls::stream<ap_axis<32,0,0,0>>&> and adf::dir::out<hls::stream<ap_axis<32,0,0,0>>&>, respectively, to indicate the direction of each hls::stream. This is necessary because the hls::stream data type does not provide the directional information needed by the AI Engine compiler. This header file should be included in the graph specification for adf::kernel::create and should not be included in any HLS source file.

To connect a port representing an hls::stream to another port using adf::connect<S,D>(src,dst), the template parameter (S or D) for the hls::stream port should be specified as adf::stream. In other words, an hls::stream port is still considered a stream and is treated the same as other stream ports such as input_stream_int32 or output_stream_int32. The following code shows how to create and connect an HLS kernel in a graph specification.

#include "hls_kernels.h" //declare hls_mul as shown above
#include "aie_kernels.h" //declare producer and consumer aie kernel functions
#include "adf.h"
using namespace adf;

class mygraph : public graph
{
public:
  kernel prd, mul, cns;
  mygraph()
  {
    prd = kernel::create(producer);
    cns = kernel::create(consumer);

    mul = kernel::create(hls_mul); //create a hls kernel from the declaration in hls_kernels.h
    fabric<pl>(mul); //hls kernel runs on programmable logic
    source(mul) = "hls_mul.cpp" //specify source file containing hls function definition

    adf::connect<window<32>, stream>(prd.out[0], mul.in[0]);
    adf::connect<stream, window<32>>(mul.out[0], cns.in[0]);
  }
};

IMPORTANT: Simulation of the PL components in C++ is not timing accurate because it is only intended for functional modeling.
Note: To verify correctness and optimize performance, use Vitis HLS for single HLS C++ kernels. After single-kernel development and graph-level SystemC simulation have completed successfully, the system, including the AI Engine kernels, PL kernels, and PS code, can be integrated using the Vitis tools flow.

Connecting High-Level Synthesis Kernels with Other Kernels in Data Flow Graphs

hls::stream<ap_axis<N,0,0,0>> and hls::stream<ap_axiu<N,0,0,0>> represent an AXI4-Stream with an API to specify tkeep per byte. However, the stream implementation inside the AI Engine and the adf stream API do not support per-byte tkeep; only per-word tkeep is supported. As a result, ap_axis<N,0,0,0> and ap_axiu<N,0,0,0> are assumed to be fully packed within each 32-bit word boundary when used inside data flow graphs. The following tables specify the allowed connections between hls::stream and adf stream and window data types in the various scenarios where you want to pack and unpack data using ap_axis and ap_axiu.

Table 1. Connecting hls::stream<T1> to/from input_stream_{T2}/output_stream_{T2} in an AI Engine
T1                   T2
ap_axis<32,0,0,0>    int8, int16, int32, cint16
ap_axis<64,0,0,0>    int8, int16, int32, int64, cint16, cint32
ap_axis<128,0,0,0>   int8, int16, int32, int64, cint16, cint32
ap_axiu<32,0,0,0>    uint8, uint32
ap_axiu<64,0,0,0>    uint8, uint32, uint64
ap_axiu<128,0,0,0>   uint8, uint32, uint64
Table 2. Connecting hls::stream<T1> to/from input_window_{T2}/output_window_{T2} in an AI Engine
T1                   T2
ap_axis<32,0,0,0>    int8, int16, int32, cint16
ap_axis<64,0,0,0>    int8, int16, int32, int64, cint16, cint32
ap_axis<128,0,0,0>   int8, int16, int32, int64, cint16, cint32
ap_axiu<32,0,0,0>    uint8, uint16, uint32
ap_axiu<64,0,0,0>    uint8, uint16, uint32, uint64
ap_axiu<128,0,0,0>   uint8, uint16, uint32, uint64
Table 3. Connecting hls::stream<T1> to/from input_stream_{T2}/output_stream_{T2} in Programmable Logic
T1                   T2
ap_axis<32,0,0,0>    int32, cint16
ap_axis<64,0,0,0>    int32, int64, cint16, cint32
ap_axis<128,0,0,0>   int32, int64, cint16, cint32
ap_axiu<32,0,0,0>    uint32
ap_axiu<64,0,0,0>    uint32, uint64
ap_axiu<128,0,0,0>   uint32, uint64
The following table specifies the allowed connections between hls::stream and adf stream and window data types for ap_int, ap_uint, std::complex, and native C/C++ element data types.
Table 4. Connecting hls::stream<T1> to/from input_stream_{T2}/output_stream_{T2} or input_window_{T2}/output_window_{T2}
T1                    T2
ap_int<32>            int32
ap_uint<32>           uint32
ap_int<64>            int64
ap_uint<64>           uint64
std::complex<short>   cint16
std::complex<int>     cint32
std::complex<float>   cfloat
int                   int32
unsigned int          uint32
long long             int64
unsigned long long    uint64
float                 float
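
For example, per Table 4, an HLS kernel port of type hls::stream<std::complex<short>> can connect to an AI Engine cint16 window or stream. A minimal sketch, with illustrative kernel names aie_prod, hls_k, and aie_cons:

adf::connect<window<64>, stream>(aie_prod.out[0], hls_k.in[0]); // cint16 window to hls::stream<std::complex<short>>
adf::connect<stream, window<64>>(hls_k.out[0], aie_cons.in[0]); // hls::stream back to a cint16 window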

AXI4-Lite Interface and Control Protocols

To conform to the stream interface protocol and data flow graph control semantics, PL kernels (whether written in HDL or HLS) must adhere to one or more of the following interface protocols.

  • Block level: AXI4-Lite interface with ap_ctrl_hs protocol.

    For HLS kernels, the corresponding INTERFACE pragma is:

    #pragma HLS interface s_axilite port=return bundle=control
    #pragma HLS interface ap_ctrl_hs port=return
    IMPORTANT: When a PL kernel inside a graph is under ADF API control, each adf::graph::run() starts the PL kernel once, regardless of the number of iterations passed to adf::graph::run(). In contrast, each adf::graph::run() runs an AI Engine kernel for the specified number of iterations. adf::graph::wait() waits for both the AI Engine kernels and the PL kernels inside the graph to complete. Make sure that the AI Engine kernels and PL kernels inside the graph can start and complete together in this synchronous mode.
  • hls::stream port: AXI4-Stream interface.

    For HLS kernels, the corresponding INTERFACE pragma is:

    #pragma HLS interface axis port={hls::stream port name}
  • Synchronous run-time parameter (RTP): AXI4-Lite interface with the ap_hs protocol.

    For HLS kernels, the corresponding INTERFACE pragmas are:

    #pragma HLS interface s_axilite port={RTP port name} bundle=control
    #pragma HLS interface ap_hs port={RTP port name}
  • Asynchronous RTP: AXI4-Lite interface with the ap_none protocol.

    For HLS kernels, the corresponding INTERFACE pragmas are:

    #pragma HLS interface s_axilite port={RTP port name} bundle=control
    #pragma HLS interface ap_none port={RTP port name}
    IMPORTANT: An asynchronous run-time parameter for a PL kernel should be initialized using adf::graph::update before adf::graph::run.
  • Asynchronous array run-time parameter: AXI4-Lite interface with the ap_memory protocol.

    For HLS kernels, the corresponding INTERFACE pragmas are:

    #pragma HLS interface s_axilite port={RTP port name} bundle=control
    #pragma HLS interface ap_memory port={RTP port name}
  • Array or pointer port: Memory-mapped AXI4 interface.

    For HLS kernels, the corresponding INTERFACE pragmas are:

    #pragma HLS interface m_axi port=mem offset=slave
    #pragma HLS interface s_axilite port=mem bundle=control

Using the previous hls_mul function as an example, and assuming inputParam and loop_count are asynchronous run-time parameters, the required HLS pragmas to conform with the data flow semantics are:

#include "hls_stream.h"
#include "ap_int.h"
extern "C" void hls_mul(hls::stream<ap_axis<32,0,0,0>>& in, hls::stream<ap_axis<32,0,0,0>>& out, ap_int<32> inputParam, unsigned int loop_count)
{
#pragma HLS interface axis port=in
#pragma HLS interface axis port=out
#pragma HLS interface s_axilite port=inputParam bundle=control
#pragma HLS interface ap_none port=inputParam
#pragma HLS interface s_axilite port=loop_count bundle=control
#pragma HLS interface ap_none port=loop_count
#pragma HLS interface s_axilite port=return bundle=control
#pragma HLS interface ap_ctrl_hs port=return
  for (unsigned int n=0; n<loop_count; n++)
  {
    for (int i=0; i<8; i++)
    {
      ap_axis<32,0,0,0> value = in.read();
      value.data *= inputParam;
      out.write(value);
    }
  }
}

Here, loop_count can be used to control the loop count when adf::graph::run is called. Before each adf::graph::run, call adf::graph::update to update loop_count so that the number of samples processed by the PL kernel matches the number of samples processed by the AI Engine kernels in adf::graph::run(<ITERATION_NUM>).
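
A minimal control sketch under these assumptions (a graph gr with a graph-level port loop_count wired to the RTP; the names are illustrative):

gr.init();
gr.update(gr.loop_count, ITERATION_NUM); // PL kernel loops ITERATION_NUM times
gr.run(ITERATION_NUM);                   // AI Engine kernels run ITERATION_NUM iterations
gr.wait();                               // waits for both the AI Engine and PL kernels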

Free-running Programmable Logic Kernels without AXI4-Lite Interface

By default, all PL kernels inside the graph are free-running (that is, they run forever after reset). Free-running kernels have no AXI4-Lite interface or control protocol.

To support PL kernels with an AXI4-Lite interface and control protocols, add the --pl-axi-lite=true option (the default is false) to the AI Engine compiler (aiecompiler).

HLS Programmable Logic Kernels AXI4-Lite Interface

Run-time parameters are supported using the AXI4-Lite interface. To use the AXI4-Lite interface, add the --pl-axi-lite=true option to the AI Engine compiler.
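
For example, a representative invocation (the graph source file name is illustrative):

aiecompiler --pl-axi-lite=true graph.cpp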

Run-Time Parameter Support Summary for PL Kernel

This section summarizes the run-time parameter (RTP) support status for the PL kernel inside the graph.

Table 5. RTP Support Status for PL Kernel Inside Graph
PL RTP   Input Synchronous   Input Asynchronous   Output Synchronous   Output Asynchronous
Scalar   Not Supported       Supported            Not Supported        Supported
Array    Not Supported       Supported            Not Supported        Supported
Note: The PL kernel RTP to PL kernel RTP connection is not supported.

Design Flow Using RTL Programmable Logic

RTL blocks are not supported inside the ADF graph. Communication between the RTL blocks and the ADF graph requires that you use PLIO interfacing. In the following example, interpolator and classify are AI Engine kernels. The interpolator AI Engine kernel streams data to a PL RTL block, which, in turn, streams data back to the AI Engine classify kernel.

class clipped : public graph {  
  private:
    kernel interpolator;
    kernel classify;
   
  public:
    port<input> in;
    port<output> clip_in;
    port<output> out;
    port<input> clip_out;

    clipped() {
      interpolator = kernel::create(fir_27t_sym_hb_2i);
      classify     = kernel::create(classifier);

      connect< window<INTERPOLATOR27_INPUT_BLOCK_SIZE, INTERPOLATOR27_INPUT_MARGIN> >(in, interpolator.in[0]);
      connect< window<POLAR_CLIP_INPUT_BLOCK_SIZE>, stream >(interpolator.out[0], clip_in);
      connect< stream >(clip_out, classify.in[0]);
      connect< window<CLASSIFIER_OUTPUT_BLOCK_SIZE> >(classify.out[0], out);

      std::vector<std::string> myheaders;
      myheaders.push_back("include.h");

      adf::headers(interpolator) = myheaders;
      adf::headers(classify) = myheaders;

      source(interpolator) = "kernels/interpolators/hb27_2i.cc";
      source(classify)    = "kernels/classifiers/classify.cc";

      runtime<ratio>(interpolator) = 0.8;
      runtime<ratio>(classify) = 0.8;
    };
};

clip_in and clip_out are ports to and from the polar_clip PL RTL kernel, which is connected to the AI Engine kernels in the graph. For example, the clip_in port is the output of the interpolator AI Engine kernel that is connected to the input of the polar_clip RTL kernel. The clip_out port is the input of the classify AI Engine kernel and the output of the polar_clip RTL kernel.

RTL Blocks and AI Engine Simulator

The top-level application file that contains an instance of your graph class and connects the graph to a simulation platform also needs to include the PLIO inputs and outputs of the RTL blocks. These files are called output_interp.txt and input_classify.txt in the following example.

#include "graph.h"

PLIO *in0 = new PLIO("DataIn1", adf::plio_32_bits,"data/input.txt");
PLIO *ai_to_pl = new PLIO("clip_in",adf::plio_32_bits, "data/output_interp.txt",100); 
PLIO *pl_to_ai = new PLIO("clip_out", adf::plio_32_bits,"data/input_classify.txt",100); 
PLIO *out0 = new PLIO("DataOut1",adf::plio_32_bits, "data/output.txt");

simulation::platform<2,2> platform(in0, pl_to_ai, out0, ai_to_pl);

clipped clipgraph;

connect<> net0(platform.src[0], clipgraph.in);
connect<> net1(clipgraph.clip_in,platform.sink[1]);
connect<> net2(platform.src[1],clipgraph.clip_out);
connect<> net3(clipgraph.out, platform.sink[0]);

#ifdef __AIESIM__
int main(int argc, char ** argv) {
    clipgraph.init();
    clipgraph.run();
    clipgraph.end();
    return 0;
}
#endif

To make the AI Engine simulator work, you must create input test bench files related to the RTL kernel. data/output_interp.txt captures the output of the interpolator AI Engine kernel, generated by the AI Engine simulator, and serves as the test bench input to the RTL kernel. The data/input_classify.txt file contains the data produced by the polar_clip kernel, which is input to the AI Engine classify kernel. Note that a PLIO can take an optional attribute, the PL clock frequency in MHz, which is 100 for the polar_clip.

RTL Blocks in Hardware Emulation and Hardware Flows

RTL kernels are fully supported in the hardware emulation and hardware flows. You need to add the RTL kernel with an nk option and link its interfaces with sc options, as shown in the following code. If necessary, adjust any clock frequency using freqHz. The following is an example Vitis configuration file.

[connectivity]
nk=mm2s:1:mm2s
nk=s2mm:1:s2mm
nk=polar_clip:1:polar_clip
sc=mm2s.s:ai_engine_0.DataIn1
sc=ai_engine_0.clip_in:polar_clip.in_sample
sc=polar_clip.out_sample:ai_engine_0.clip_out
sc=ai_engine_0.DataOut1:s2mm.s
[clock]
freqHz=100000000:polar_clip.ap_clk

For more information on RTL kernels and the Vitis flow, see Integrating the Application Using the Vitis Tools Flow.

Design Considerations for Graphs Containing Programmable Logic Kernels

The AI Engine array is made up of AI Engine tiles and AI Engine array interface tiles on the last row of the array. The types of interface tiles include AI Engine-to-PL and AI Engine-to-NoC.

Knowing the PL interface tile, which interfaces and adapts the signals between the AI Engines and the PL region, is essential to take full advantage of the bandwidth between AI Engines and the PL. The following figure shows an expanded view of a single PL interface tile.

Figure 1: AI Engine-PL Interface Tile
Note: Notice that the interface tile supports two different clock domains, the AI Engine clock and the PL clock, as well as a predefined number of streaming channels available to connect an AI Engine tile to a specific PL interface tile.

PL Interface Tile Capabilities

The AI Engine clock runs at a minimum of 1 GHz, or higher depending on the device speed grade. The default width of a stream channel is 32 bits. Because this frequency is higher than the PL clock frequency, a clock domain crossing to the PL region is always necessary, for example, to either one-half or one-quarter of the AI Engine clock frequency.

Note: Though not required, Xilinx recommends running the PL kernel at a frequency such that the AI Engine frequency is an integer multiple of the PL kernel frequency.

For C++ HLS PL kernels, choose a reasonable target frequency depending on the complexity of the implemented algorithm. The adf::pl_frequency constraint can be applied to each PL kernel in a graph; for example, the adf::pl_frequency(<PL_KERNEL>) = freq; construct can be used with the AI Engine compiler, and the --hls.clock option with the Vitis compiler when compiling HLS C/C++ into Xilinx object (XO) files.
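
For instance, a minimal sketch reusing the mul kernel from the earlier graph example (the 300 MHz target is illustrative):

adf::pl_frequency(mul) = 300; // constrain the HLS PL kernel to a 300 MHz clock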

AI Engine-to-PL Rate Matching

The AI Engine runs at 1 GHz or more (depending on the device) and can write at most two 32-bit streams per cycle. In contrast, an IP implemented in the PL might run at 500 MHz (depending on the device) while consuming a larger bit width. Rate matching balances the throughput from producer to consumer and ensures that neither process becomes a bottleneck for overall performance. The following equation expresses the rate matching for each channel:

Frequency(AI Engine) × Data per cycle(AI Engine) = Frequency(PL) × Data per cycle(PL)
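
For example, 1 GHz × 32 bits = 500 MHz × 64 bits = 250 MHz × 128 bits = 32 Gb/s per channel.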

The following table shows a PL rate-matching example for a 32-bit channel written every cycle by the AI Engine at 1 GHz. As shown, the PL IP has to consume twice the data at half the frequency, or four times the data at one quarter of the frequency.

Table 6. Frequency Response of AI Engine vs. PL Region
AI Engine Frequency   AI Engine Data per Cycle   PL Frequency   PL Data per Cycle
1 GHz                 32 bit                     500 MHz        64 bit
1 GHz                 32 bit                     250 MHz        128 bit

Because the need to match frequencies and adjust data-path widths is well understood by the Vitis compiler (v++), the tool automatically extracts the port width from the PL kernel and the frequency from the clock specification, and introduces an upsizer/downsizer to buffer the data exchanged between the AI Engine and PL regions and manage the rate matching.

To avoid deadlocks, it is important to ensure that if multiple channels are read or written between the AI Engine and the PL, the data rate per cycle is concurrently achieved on all channels. For example, if one channel requires 32 bits, and the second 64 bits, the AI Engine code must ensure that both channels are written adequately to avoid back pressure or starvation on the channel. Additionally, to avoid deadlock, writing/reading from the AI Engine and reading/writing in the PL code must follow the same chronological order.

The number of stream interfaces in the graph function definition for the PL kernel defines the number of AXI4-Stream interfaces; each argument results in the creation of a separate stream.

PL-to-PL Connections

Multiple PL kernels can be defined inside the graph, with stream connections between them. However, the AI Engine simulator only supports connections between PL kernel streams of the same width.

However, the Vitis compiler can automatically insert data-width converter and clock-domain crossing IP between connections of PL kernels outside the graph to resolve any mismatch in bit width or frequency. With these facts in mind, carefully choose which PL kernels go inside or outside the graph.
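
A minimal sketch of a PL-to-PL stream connection inside the graph, with illustrative kernel names plk1 and plk2 (both streams must have the same width):

adf::connect<stream>(plk1.out[0], plk2.in[0]); // same-width PL-to-PL stream connection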

AI Engine-PL Interface Performance

Versal AI Core series devices include an AI Engine array with the following column categories:

PL column
provides PL stream access. Each column supports eight 64-bit slave channels for streaming data into the AI Engine and six 64-bit master channels for streaming data to the PL.
NoC column
provides connectivity between the AI Engine array and the NoC. These interfaces can also connect to the PL.

To instruct the AI Engine compiler to select higher frequency interfaces, use the --pl-freq=<number> option to specify the clock frequency (in MHz) for the PL kernels. The default value is one quarter of the AI Engine core frequency, which varies with the speed grade. The following are examples:

  • Option to enable an AI Engine-PL frequency of 300 MHz for all AI Engine-PL interfaces:
    --pl-freq=300
  • To set a different frequency for a specific PLIO interface, use the following code in the ADF graph.
    adf::PLIO *<input>= new adf::PLIO(<logical_name>, <plio_width>, <file>, <FreqMHz>);
Note: The following information applies to the AI Engine device architecture documented in Versal ACAP AI Engine Architecture Manual (AM009).

The AI Engine-PL AXI4-Stream channels use boundary logic interface (BLI) connections that include optional BLI registers, with the exception of slave channels 3 and 7, which are slower interfaces. The performance of data transfer between the AI Engine and the PL depends on whether the optional BLI registers are enabled.

For less timing-critical designs, all eight channels can be used without the BLI registers, and PL timing can still be met. However, for higher frequency designs, only the six fast channels (0, 1, 2, 4, 5, 6) can be used, and the timing paths from the PL must be registered using the BLI registers.

To control the use of BLI registers across the AI Engine-PL channels, use the --pl-register-threshold=<number> compiler option, specified in MHz. The default value is 1/8 of the AI Engine frequency based on speed grade. Following is an example:

  • --pl-register-threshold=125

    The compiler maps any PLIO interface with an AI Engine-PL frequency higher than this setting (125 MHz in this case) to the high-speed channels with the BLI registers enabled. If the PLIO interface frequency is not higher than the pl-register-threshold value, any of the AI Engine-PL channels can be used.

In summary, if pl-freq < pl-register-threshold, all eight channels can be used, unregistered. If pl-freq > pl-register-threshold, only the six fast channels can be used, with registering. The pl-register-threshold option controls the threshold frequency beyond which only the fast channels (with registering) are used.