AI Engine/Programmable Logic Integration
In addition to kernels operating on the AI Engines, you can specify kernels to run in the programmable logic (PL) region of the device. PL kernels can be written in RTL or in HLS C/C++. An HLS C/C++ kernel can be used directly in a graph. For RTL kernels, the RTL code is packaged into a kernel object separately by the Vitis™ tools.
- PL kernels can be modeled as C++ functions to represent the functionality in the graph. See Model PL Kernels with C++.
- For PL kernels written in HLS C/C++, the functions can be directly used in an input graph, provided that the HLS features are supported in the AI Engine compiler. See HLS Kernels Written in C/C++.
Model PL Kernels with C++
Kernels intended for the PL must use stream inputs and outputs rather than windows, using the API defined in Programmable Logic Stream Operations. To use these data types, the standard header adf.h must be included before the function declaration. PL kernels modeled in C/C++ can be simulated with the AI Engine simulator. You cannot take this design to hardware, because the C/C++ model is only a simulation construct.
Prepare the Kernels describes the header file for the AI Engine; this is the same concept for PL kernels with C++. The following code shows an example kernel.
#ifndef MAGNITUDE_MODULE_H
#define MAGNITUDE_MODULE_H
#include <adf.h>
void pladd(input_stream_int32 * in1, input_stream_int32 *in2, output_stream_int32 * out);
#endif
In this example, the module has two stream inputs carrying 32-bit integers, and produces a single stream output carrying 32-bit integers. The body of the kernel is specified in a separate file and must include adf.h. The following is an example kernel body.
#include <adf.h>
#include <iostream> // required for std::cerr

void pladd(input_stream_int32 *in1, input_stream_int32 *in2, output_stream_int32 *out) {
    std::cerr << "Waiting for a value" << "\n";
    int value1 = readincr(in1);
    int value2 = readincr(in2);
    int newvalue = value1 + value2;
    std::cerr << "add " << value1 << " and " << value2 << " sent " << newvalue << "\n";
    writeincr(out, newvalue);
}
There is no restriction on the C++ code used. For debugging, in addition to performing actual computation, the outputs can be printed to the console. To read from a stream, use readincr accessor functions for reading various data types (Reading and Advancing an Input Stream). To write to a stream, use writeincr accessor functions for writing various data types (Writing and Advancing an Output Stream). Both of these are blocking calls where the API function stalls when there is insufficient buffering between the source and destination.
You can specify that a kernel is intended to run in the programmable logic by attaching an attribute to the kernel instance.
fabric<pl>(adder);
The source used to implement the functionality of the kernel must also be specified in exactly the same way the kernels for the AI Engine are specified.
source(adder) = "module/add.cpp";
The connections from a PL kernel to an AI Engine kernel are similar to the connections between AI Engine kernels. The difference is that the connection now specifies both a window and a stream (a window on the AI Engine side and a stream on the PL side), rather than just a window. For example, the following connections are used to connect the adder.
connect<window<32>, stream>(prod.out[0], adder.in[0]);
connect<window<32>, stream>(prod.out[1], adder.in[1]);
connect<stream, window<32> >(adder.out[0], cons.in[0]);
All sources and destinations are supported as connections (AI Engine to PL, PL to AI Engine, and PL to PL). Run-time parameters (see Run-Time Graph Control API) can be specified for kernels mapped to the programmable logic. The input specification for the parameters is exactly the same as that for kernels mapped to the AI Engine array. Parameters in the PL are synthesized to registers or memories accessible over the memory-mapped AXI4 bus. Only scalar input parameters are supported in the PL, and they are modeled through a separate SystemC control thread.
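Putting these pieces together, the following is a minimal graph sketch. The producer and consumer AI Engine kernels, the kernels.h header, and the source file paths are hypothetical; pladd is the PL kernel model shown above.

#include <adf.h>
#include "kernels.h" // hypothetical header declaring producer, consumer, and pladd

using namespace adf;

class addergraph : public graph {
private:
    kernel prod, adder, cons;
public:
    addergraph() {
        prod  = kernel::create(producer);  // AI Engine kernel with two stream outputs
        cons  = kernel::create(consumer);  // AI Engine kernel with one stream input
        adder = kernel::create(pladd);     // C++ model of the PL kernel

        fabric<pl>(adder);                 // map the adder to the programmable logic
        source(adder) = "module/add.cpp";
        source(prod)  = "kernels/producer.cc"; // hypothetical paths
        source(cons)  = "kernels/consumer.cc";
        runtime<ratio>(prod) = 0.9;
        runtime<ratio>(cons) = 0.9;

        connect<window<32>, stream>(prod.out[0], adder.in[0]);
        connect<window<32>, stream>(prod.out[1], adder.in[1]);
        connect<stream, window<32>>(adder.out[0], cons.in[0]);
    }
};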
Programmable Logic Stream Operations
The following operations read data from the given input stream and advance the stream for the kernels mapped to the programmable logic.
int32 readincr(input_stream_int32 *w);
uint32 readincr(input_stream_uint32 *w);
cint16 readincr(input_stream_cint16 *w);
float readincr(input_stream_float *w);
int64 readincr(input_stream_int64 *w);
uint64 readincr(input_stream_uint64 *w);
cint32 readincr(input_stream_cint32 *w);
cfloat readincr(input_stream_cfloat *w);
The following operations write data to the given output stream and advance the stream for the kernels mapped to the programmable logic.
void writeincr(output_stream_int32 *w, int32 v);
void writeincr(output_stream_uint32 *w, uint32 v);
void writeincr(output_stream_cint16 *w, cint16 v);
void writeincr(output_stream_float *w, float v);
void writeincr(output_stream_int64 *w, int64 v);
void writeincr(output_stream_uint64 *w, uint64 v);
void writeincr(output_stream_cint32 *w, cint32 v);
void writeincr(output_stream_cfloat *w, cfloat v);
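As an illustration, the following is a minimal PL kernel model (kernel name hypothetical) that conjugates a stream of complex 16-bit samples using these blocking accessors.

#include <adf.h>

// Conjugate one complex sample per invocation using the blocking stream API.
void plconj(input_stream_cint16 *in, output_stream_cint16 *out) {
    cint16 v = readincr(in); // blocking read of one complex sample
    v.imag = -v.imag;        // negate the imaginary part
    writeincr(out, v);       // blocking write of the result
}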
HLS Kernels Written in C/C++
This section describes programmable logic (PL) kernels that are written in C/C++, and work inside the AI Engine graph. It describes the restrictions on the kernel interfaces and constraints on function signature. It also describes control protocols for RTP support.
For information on C/C++ kernels in the Vitis tools flow, see C/C++ Kernels in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).
Supported High-Level Synthesis Features in the AI Engine Compiler
The AI Engine compiler supports a subset of high-level synthesis (HLS) interfaces.
The supported input and output data types of HLS functions are:
- hls::stream<ap_axis<N,0,0,0>> and hls::stream<ap_axiu<N,0,0,0>>, where N can be 32, 64, or 128
- hls::stream<ap_int<N>> and hls::stream<ap_uint<N>>, where N can be 32 or 64
- hls::stream<T>, where T can be int, unsigned int, long long, unsigned long long, or float
- hls::stream<std::complex<T>>, where T can be int, short, or float
The following scalar run-time parameter (RTP) data types are also supported:
- ap_int<N> and ap_uint<N>, where N can be 8, 16, 32, or 64
- short, unsigned short, int, unsigned int, long long, unsigned long long, and float
- std::complex<T>, where T can be int, short, or float
The following array RTP data types are supported:
- Arrays of ap_int<N> and ap_uint<N>, where N can be 8, 16, 32, or 64
- Arrays of short, unsigned short, int, unsigned int, long long, unsigned long long, or float
- Arrays of std::complex<T>, where T can be int, short, or float
The ap_memory and s_axilite interfaces are required to support array RTPs. Synchronous array RTPs are not supported for PL kernels. Asynchronous array RTPs are supported for PL kernels in bare-metal systems.
When ap_int<N> and ap_uint<N> are used as RTPs, the PS program can use adf::graph::update and adf::graph::read with compatible data types to access the RTPs. For example, use adf::graph::update(input_port& rtpPort, int32 value) to update an ap_int<32> run-time parameter. For more information on the use of RTPs, see Run-Time Graph Control API.
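For illustration, the following is a hedged PS-side control sketch (the graph class and RTP port names are hypothetical) that updates an ap_int<32> RTP before running the graph.

#include "graph.h" // hypothetical: defines mygraph with an ap_int<32> RTP port inputParam

mygraph gr;

int main(void) {
    gr.init();
    gr.update(gr.inputParam, (int32)5); // int32 is compatible with ap_int<32>
    gr.run(1);
    gr.wait();
    gr.end();
    return 0;
}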
The only supported HLS function return type is void. graph::update for an HLS scalar RTP greater than 32 bits should be called before graph::run.
To support the HLS math library inside the kernel, an additional linker option is needed for the AI Engine compiler:
--Xpslinker="-lhlsmc++-GCC46
-lIp_floating_point_v7_0_bitacc_cmodel
-lIp_xfft_v9_1_bitacc_cmodel
-lIp_fir_compiler_v7_2_bitacc_cmodel
-lIp_dds_compiler_v6_0_bitacc_cmodel
-L$(XILINX_HLS)/lnx64/tools/fpo_v7_0
-L$(XILINX_HLS)/lnx64/lib/csim
-L$(XILINX_HLS)/lnx64/tools/dds_v6_0
-L$(XILINX_HLS)/lnx64/tools/fir_v7_0
-L$(XILINX_HLS)/lnx64/tools/fft_v9_1
-Wl,-rpath,$(XILINX_HLS)/lnx64/lib/csim
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/fpo_v7_0
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/fft_v9_1
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/fir_v7_0
-Wl,-rpath,$(XILINX_HLS)/lnx64/tools/dds_v6_0"
Using High-Level Synthesis IP in Data Flow Graphs
HLS functions can be directly used in data flow graphs. The only extra work is to write a separate header file that declares the function signatures and specifies the hls::stream directions for the HLS functions used in data flow graphs. In this special header file, the function signature is declared the same as in the HLS function definition, except that the input or output direction of an hls::stream data type is qualified using adf::dir::in<T> and adf::dir::out<T>, respectively.
For example, consider the HLS function hls_mul defined in hls_mul.cpp:
#include "hls_stream.h"
#include "ap_axi_sdata.h" // defines ap_axis/ap_axiu
#include "ap_int.h"
extern "C" void hls_mul(hls::stream<ap_axis<32,0,0,0>>& in, hls::stream<ap_axis<32,0,0,0>>& out, ap_int<32> inputParam, unsigned int loop_count)
{
    for (unsigned int n = 0; n < loop_count; n++)
    {
        for (int i = 0; i < 8; i++)
        {
            ap_axis<32,0,0,0> value = in.read();
            value.data *= inputParam;
            out.write(value);
        }
    }
}
A separate header file, for example hls_kernels.h, should be written as follows:
#ifndef __HLS_KERNELS_H__
#define __HLS_KERNELS_H__
#include "hls_stream.h"
#include "ap_axi_sdata.h" // defines ap_axis/ap_axiu
#include "ap_int.h"
#include "adf.h"
void hls_mul(adf::dir::in<hls::stream<ap_axis<32,0,0,0>>&> in, adf::dir::out<hls::stream<ap_axis<32,0,0,0>>&> out, ap_int<32> inputParam, unsigned int loop_count);
#endif
In this special header file, the data types for parameters in and out are changed to adf::dir::in<hls::stream<ap_axis<32,0,0,0>>&> and adf::dir::out<hls::stream<ap_axis<32,0,0,0>>&>, respectively, to indicate the direction of each hls::stream, because the hls::stream data type does not provide the directional information needed by the AI Engine compiler. This header file should be included in the graph specification for adf::kernel::create and must not be included in any HLS source file.
To connect a port representing an hls::stream to another port using adf::connect<S,D>(src,dst), the template parameter (S or D) for the hls::stream port should be specified as adf::stream. In other words, the type of an hls::stream port is still considered a stream and is treated the same as other stream ports such as input_stream_int32 or output_stream_int32. The following code shows how to create and connect an HLS kernel in a graph specification.
#include "hls_kernels.h" //declare hls_mul as shown above
#include "aie_kernels.h" //declare producer and consumer aie kernel functions
#include "adf.h"
using namespace adf;
class mygraph : public graph
{
public:
    kernel prd, mul, cns;
    mygraph()
    {
        prd = kernel::create(producer);
        cns = kernel::create(consumer);
        mul = kernel::create(hls_mul); // create an HLS kernel from the declaration in hls_kernels.h
        fabric<pl>(mul);               // the HLS kernel runs on programmable logic
        source(mul) = "hls_mul.cpp";   // specify the source file containing the HLS function definition
        adf::connect<window<32>, stream>(prd.out[0], mul.in[0]);
        adf::connect<stream, window<32>>(mul.out[0], cns.in[0]);
    }
};
Connecting High-Level Synthesis Kernels with Other Kernels in Data Flow Graphs
hls::stream<ap_axis<N,0,0,0>> and hls::stream<ap_axiu<N,0,0,0>> represent an AXI4-Stream with an API to specify tkeep per byte. However, the stream implementation inside the AI Engine and the adf stream API do not support a per-byte tkeep; only a per-word tkeep is supported. As a result, ap_axis<N,0,0,0> and ap_axiu<N,0,0,0> are assumed to be fully packed within each 32-bit word boundary when used inside data flow graphs. The following tables specify the allowed connections between hls::stream and adf stream and window data types in various scenarios where you want to pack and unpack data using ap_axis and ap_axiu.
| T1 | T2 |
|---|---|
| ap_axis<32,0,0,0> | int8, int16, int32, cint16 |
| ap_axis<64,0,0,0> | int8, int16, int32, int64, cint16, cint32 |
| ap_axis<128,0,0,0> | int8, int16, int32, int64, cint16, cint32 |
| ap_axiu<32,0,0,0> | uint8, uint32 |
| ap_axiu<64,0,0,0> | uint8, uint32, uint64 |
| ap_axiu<128,0,0,0> | uint8, uint32, uint64 |
| T1 | T2 |
|---|---|
| ap_axis<32,0,0,0> | int8, int16, int32, cint16 |
| ap_axis<64,0,0,0> | int8, int16, int32, int64, cint16, cint32 |
| ap_axis<128,0,0,0> | int8, int16, int32, int64, cint16, cint32 |
| ap_axiu<32,0,0,0> | uint8, uint16, uint32 |
| ap_axiu<64,0,0,0> | uint8, uint16, uint32, uint64 |
| ap_axiu<128,0,0,0> | uint8, uint16, uint32, uint64 |
| T1 | T2 |
|---|---|
| ap_axis<32,0,0,0> | int32, cint16 |
| ap_axis<64,0,0,0> | int32, int64, cint16, cint32 |
| ap_axis<128,0,0,0> | int32, int64, cint16, cint32 |
| ap_axiu<32,0,0,0> | uint32 |
| ap_axiu<64,0,0,0> | uint32, uint64 |
| ap_axiu<128,0,0,0> | uint32, uint64 |
The following table shows the allowed connections between hls::stream and adf stream and window data types for ap_int, ap_uint, std::complex, and native C/C++ element data types.

| T1 | T2 |
|---|---|
| ap_int<32> | int32 |
| ap_uint<32> | uint32 |
| ap_int<64> | int64 |
| ap_uint<64> | uint64 |
| std::complex<short> | cint16 |
| std::complex<int> | cint32 |
| std::complex<float> | cfloat |
| int | int32 |
| unsigned int | uint32 |
| long long | int64 |
| unsigned long long | uint64 |
| float | float |
AXI4-Lite Interface and Control Protocols
To conform to the stream interface protocol and data flow graph control semantics, PL kernels (whether written in HDL or HLS) are required to adhere to one or more of the following interface protocols.
- Block level: AXI4-Lite interface with the ap_ctrl_hs protocol. For HLS kernels, the corresponding INTERFACE pragmas are:
#pragma HLS interface s_axilite port=return bundle=control
#pragma HLS interface ap_ctrl_hs port=return
IMPORTANT: When a PL kernel inside a graph is under the control of the ADF API, the PL kernel is started once in each adf::graph::run, regardless of the number of iterations set for adf::graph::run(). However, in each adf::graph::run(), an AI Engine kernel runs the number of iterations set for adf::graph::run(). In adf::graph::wait(), both the AI Engine kernels and the PL kernels inside the graph are waited on for completion. Make sure that the AI Engine kernels and PL kernels inside the graph can start and complete in synchronous mode.
- hls::stream port: AXI4-Stream interface. For HLS kernels, the corresponding INTERFACE pragma is:
#pragma HLS interface axis port={hls::stream port name}
- Synchronous run-time parameter (RTP): AXI4-Lite interface with the ap_hs protocol. For HLS kernels, the corresponding INTERFACE pragmas are:
#pragma HLS interface s_axilite port={RTP port name} bundle=control
#pragma HLS interface ap_hs port={RTP port name}
- Asynchronous RTP: AXI4-Lite interface with the ap_none protocol. For HLS kernels, the corresponding INTERFACE pragmas are:
#pragma HLS interface s_axilite port={RTP port name} bundle=control
#pragma HLS interface ap_none port={RTP port name}
IMPORTANT: An asynchronous run-time parameter for a PL kernel should be initialized using adf::graph::update before adf::graph::run.
- Asynchronous array run-time parameter: AXI4-Lite interface with the ap_memory protocol. For HLS kernels, the corresponding INTERFACE pragmas are:
#pragma HLS interface s_axilite port={RTP port name} bundle=control
#pragma HLS interface ap_memory port={RTP port name}
- Array or pointer port: memory-mapped AXI4 interface. For HLS kernels, the corresponding INTERFACE pragmas are:
#pragma HLS interface m_axi port=mem offset=slave
#pragma HLS interface s_axilite port=mem bundle=control
Using the previous hls_mul function as an example, and assuming inputParam and loop_count are asynchronous run-time parameters, the required HLS pragmas to conform with the data flow semantics are:
#include "hls_stream.h"
#include "ap_int.h"
extern "C" void hls_mul(hls::stream<ap_axis<32,0,0,0>>& in, hls::stream<ap_axis<32,0,0,0>>& out, ap_int<32> inputParam, unsigned int loop_count)
{
#pragma HLS interface axis port=in
#pragma HLS interface axis port=out
#pragma HLS interface s_axilite port=inputParam bundle=control
#pragma HLS interface ap_none port=inputParam
#pragma HLS interface s_axilite port=loop_count bundle=control
#pragma HLS interface ap_none port=loop_count
#pragma HLS interface s_axilite port=return bundle=control
#pragma HLS interface ap_ctrl_hs port=return
for (unsigned int n=0; n<loop_count; n++)
{
for (int i=0; i<8; i++)
{
ap_axis<32,0,0,0> value = in.read();
value.data *= inputParam;
out.write(value);
}
}
}
Here, loop_count can be used to control the loop count when adf::graph::run is called. Before each adf::graph::run, call adf::graph::update to update loop_count so that the number of samples processed by the PL kernel matches the number of samples processed by the AI Engine kernels for adf::graph::run(<ITERATION_NUM>).
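For example, the following control sketch (graph member names hypothetical) assumes each AI Engine iteration corresponds to the eight samples handled per outer-loop pass of hls_mul.

mygraph gr;
gr.init();
gr.update(gr.inputParam, 3); // asynchronous scalar RTP: set before run
gr.update(gr.loop_count, 4); // PL kernel will process 4 x 8 samples
gr.run(4);                   // AI Engine kernels run four iterations
gr.wait();                   // wait for both AI Engine and PL kernels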
Free-running Programmable Logic Kernels without AXI4-Lite Interface
By default, all PL kernels inside the graph are free-running (that is, they run forever after reset). Free-running kernels are kernels without any AXI4-Lite interface or control protocols.
To support PL kernels with an AXI4-Lite interface and control protocols, add the --pl-axi-lite=true option (default is false) to the AI Engine compiler (aiecompiler).
HLS Programmable Logic Kernels AXI4-Lite Interface
Run-time parameters are supported using the AXI4-Lite interface. To use the AXI4-Lite interface, add the --pl-axi-lite=true option to the AI Engine compiler.
Run-Time Parameter Support Summary for PL Kernel
This section summarizes the run-time parameter (RTP) support status for the PL kernel inside the graph.
| PL RTP | Input: Synchronous | Input: Asynchronous | Output: Synchronous | Output: Asynchronous |
|---|---|---|---|---|
| Scalar | Not Supported | Supported | Not Supported | Supported |
| Array | Not Supported | Supported | Not Supported | Supported |
Design Flow Using RTL Programmable Logic
RTL blocks are not supported inside the ADF graph. Communication between RTL blocks and the ADF graph requires PLIO interfacing. In the following example, interpolator and classify are AI Engine kernels. The interpolator AI Engine kernel streams data to a PL RTL block, which, in turn, streams data back to the classify AI Engine kernel.
class clipped : public graph {
private:
kernel interpolator;
kernel classify;
public:
port<input> in;
port<output> clip_in;
port<output> out;
port<input> clip_out;
clipped() {
interpolator = kernel::create(fir_27t_sym_hb_2i);
classify = kernel::create(classifier);
connect< window<INTERPOLATOR27_INPUT_BLOCK_SIZE, INTERPOLATOR27_INPUT_MARGIN> >(in, interpolator.in[0]);
connect< window<POLAR_CLIP_INPUT_BLOCK_SIZE>, stream >(interpolator.out[0], clip_in);
connect< stream >(clip_out, classify.in[0]);
connect< window<CLASSIFIER_OUTPUT_BLOCK_SIZE> >(classify.out[0], out);
std::vector<std::string> myheaders;
myheaders.push_back("include.h");
adf::headers(interpolator) = myheaders;
adf::headers(classify) = myheaders;
source(interpolator) = "kernels/interpolators/hb27_2i.cc";
source(classify) = "kernels/classifiers/classify.cc";
runtime<ratio>(interpolator) = 0.8;
runtime<ratio>(classify) = 0.8;
}
};
clip_in and clip_out are ports to and from the polar_clip PL RTL kernel, which is connected to the AI Engine kernels in the graph. For example, the clip_in port is the output of the interpolator AI Engine kernel that is connected to the input of the polar_clip RTL kernel. The clip_out port is the input of the classify AI Engine kernel and the output of the polar_clip RTL kernel.
RTL Blocks and AI Engine Simulator
The top-level application file, which contains an instance of your graph class and connects the graph to a simulation platform, also needs to include the PLIO inputs and outputs of the RTL blocks. These files are called output_interp.txt and input_classify.txt in the following example.
#include "graph.h"
PLIO *in0 = new PLIO("DataIn1", adf::plio_32_bits,"data/input.txt");
PLIO *ai_to_pl = new PLIO("clip_in",adf::plio_32_bits, "data/output_interp.txt",100);
PLIO *pl_to_ai = new PLIO("clip_out", adf::plio_32_bits,"data/input_classify.txt",100);
PLIO *out0 = new PLIO("DataOut1",adf::plio_32_bits, "data/output.txt");
simulation::platform<2,2> platform(in0, pl_to_ai, out0, ai_to_pl);
clipped clipgraph;
connect<> net0(platform.src[0], clipgraph.in);
connect<> net1(clipgraph.clip_in,platform.sink[1]);
connect<> net2(platform.src[1],clipgraph.clip_out);
connect<> net3(clipgraph.out, platform.sink[0]);
#ifdef __AIESIM__
int main(int argc, char ** argv) {
clipgraph.init();
clipgraph.run();
clipgraph.end();
return 0;
}
#endif
To make the AI Engine simulator work, you must create input test bench files related to the RTL kernel. data/output_interp.txt is the test bench input to the RTL kernel; the AI Engine simulator generates this output file from the interpolator AI Engine kernel. The data/input_classify.txt file contains data from the polar_clip kernel, which is input to the classify AI Engine kernel. Note that PLIO can have an optional attribute, the PL clock frequency, which is 100 MHz for the polar_clip.
RTL Blocks in Hardware Emulation and Hardware Flows
RTL kernels are fully supported in hardware emulation and hardware flows.
You need to add the RTL kernel with the nk option and link the interfaces with the sc option, as shown in the following code. If necessary, adjust any clock frequency using freqHz. The following is an example of a Vitis configuration file.
[connectivity]
nk=mm2s:1:mm2s
nk=s2mm:1:s2mm
nk=polar_clip:1:polar_clip
sc=mm2s.s:ai_engine_0.DataIn1
sc=ai_engine_0.clip_in:polar_clip.in_sample
sc=polar_clip.out_sample:ai_engine_0.clip_out
sc=ai_engine_0.DataOut1:s2mm.s
[clock]
freqHz=100000000:polar_clip.ap_clk
For more information on RTL kernels and the Vitis flow, see Integrating the Application Using the Vitis Tools Flow.
Design Considerations for Graphs Containing Programmable Logic Kernels
The AI Engine array is made up of AI Engine tiles and AI Engine array interface tiles on the last row of the array. The types of interface tiles include AI Engine-to-PL and AI Engine-to-NoC.
Knowing the PL interface tile, which interfaces and adapts the signals between the AI Engines and the PL region, is essential to take full advantage of the bandwidth between the AI Engine array and the PL.
PL Interface Tile Capabilities
The AI Engine clock runs at a minimum of 1 GHz, or higher depending on the device speed grade. The default width of a stream channel is 32 bits. Because this frequency is higher than the PL clock frequency, it is always necessary to perform a clock domain crossing to the PL region, for example, to either one-half or one-quarter of the AI Engine clock frequency.
For C++ HLS PL kernels, choose a reasonable target frequency depending on the complexity of the implemented algorithm. The adf::pl_frequency constraint can be used to constrain each PL kernel in a graph. For example, the adf::pl_frequency(<PL_KERNEL>) = freq; construct can be used in the AI Engine compiler, and the --hls.clock option can be used in the Vitis compiler when compiling HLS C/C++ into Xilinx object (XO) files.
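For example, a minimal sketch reusing the mul kernel handle from the earlier mygraph example:

fabric<pl>(mul);              // mul is mapped to the programmable logic
adf::pl_frequency(mul) = 300; // constrain this PL kernel to a 300 MHz target clock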
AI Engine-to-PL Rate Matching
The AI Engine runs at 1 GHz (or more, depending on the device) and can write at most two streams with a 32-bit data width per cycle. In contrast, an IP implemented in the PL can run at 500 MHz (depending on the device), while consuming a larger bit width. Rate matching is concerned with balancing the throughput from the producer to the consumer, and is used to ensure that neither of the processes creates a bottleneck with respect to the total performance. The following equation shows the rate matching for each channel:

Frequency(AI Engine) x Data per Cycle(AI Engine) = Frequency(PL) x Data per Cycle(PL)
The following table shows a PL rate matching example for a 32-bit channel written to each cycle by the AI Engine at 1 GHz. As shown, the PL IP has to consume two times the data at half the frequency or four times the data at one quarter of the frequency.
| AI Engine Frequency | AI Engine Data per Cycle | PL Frequency | PL Data per Cycle |
|---|---|---|---|
| 1 GHz | 32 bit | 500 MHz | 64 bit |
| 1 GHz | 32 bit | 250 MHz | 128 bit |
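For example, at 1 GHz and 32 bits per cycle, the AI Engine side of the channel sustains 32 Gb/s; the PL side matches this with 500 MHz x 64 bits or 250 MHz x 128 bits, both also 32 Gb/s.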
Because the need to match frequency and adjust data-path width is well understood by the Vitis compiler (v++), the tool automatically extracts the port width from the PL kernel and the frequency from the clock specification, and introduces an upsizer/downsizer to temporarily store the data exchanged between the AI Engine and the PL regions to manage the rate match.
To avoid deadlocks, it is important to ensure that if multiple channels are read or written between the AI Engine and the PL, the data rate per cycle is concurrently achieved on all channels. For example, if one channel requires 32 bits, and the second 64 bits, the AI Engine code must ensure that both channels are written adequately to avoid back pressure or starvation on the channel. Additionally, to avoid deadlock, writing/reading from the AI Engine and reading/writing in the PL code must follow the same chronological order.
The number of interfaces used in the graph function definition for the PL defines the number of AXI4-Stream interfaces. Each argument results in the creation of a separate stream.
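For example, the following declaration sketch (function name hypothetical) would result in three separate AXI4-Stream interfaces on the generated PL kernel:

// Each hls::stream argument becomes its own AXI4-Stream interface.
extern "C" void pl_mix(hls::stream<ap_axis<32,0,0,0>>& in0,  // AXI4-Stream 1
                       hls::stream<ap_axis<32,0,0,0>>& in1,  // AXI4-Stream 2
                       hls::stream<ap_axis<32,0,0,0>>& out); // AXI4-Stream 3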
PL-to-PL Connections
Multiple PL kernels can be defined inside the graph, with stream connections between the PL kernels. Note that for the AI Engine simulator, only PL kernel streams with the same width can be connected.
However, the Vitis compiler can automatically insert a data-width converter and clock-domain crossing IP between the connections of PL kernels outside the graph to resolve any mismatch in bit width or frequency. Taking these facts into consideration, you must carefully choose which PL kernels are inside or outside the graph.
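The following is a minimal sketch of a PL-to-PL stream connection inside a graph (kernel handles k1 and k2 hypothetical, both created with kernel::create):

fabric<pl>(k1);
fabric<pl>(k2);
// For the AI Engine simulator, both stream ports must have the same width.
connect<stream>(k1.out[0], k2.in[0]);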
AI Engine-PL Interface Performance
Versal AI Core series devices include an AI Engine array with the following column categories:
- PL column: provides PL stream access. Each column supports eight 64-bit slave channels for streaming data into the AI Engine and six 64-bit master channels for streaming data to the PL.
- NoC column: provides connectivity between the AI Engine array and the NoC. These interfaces can also connect to the PL.
To instruct the AI Engine compiler to select higher frequency interfaces, use the --pl-freq=<number> option to specify the clock frequency (in MHz) for the PL kernels. The default value is one quarter of the AI Engine core frequency, which varies for each speed grade. The following are examples:
- Option to enable an AI Engine-PL frequency of 300 MHz for all AI Engine-PL interfaces:
--pl-freq=300
- To set a different frequency for a specific PLIO interface, use the following code in the ADF graph:
adf::PLIO *<input> = new adf::PLIO(<logical_name>, <plio_width>, <file>, <FreqMHz>);
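For instance, a hedged instantiation (logical name, width, file, and frequency all illustrative):

adf::PLIO *clip_in = new adf::PLIO("clip_in", adf::plio_64_bits, "data/output_interp.txt", 300);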
The AI Engine-PL AXI4-Stream channels use boundary logic interface (BLI) connections that include optional BLI registers, with the exception of slave channels 3 and 7, which are slower interfaces. The performance of the data transfer between the AI Engine and the PL depends on whether the optional BLI registers are enabled.
For less timing-critical designs, all eight channels can be used without using the BLI registers. PL timing can still be met in this case. However, for higher frequency designs, only the six fast channels (0,1,2,4,5,6) can be used and the timing paths from the PL must be registered, using the BLI registers.
To control the use of BLI registers across the AI Engine-PL channels, use the --pl-register-threshold=<number> compiler option, specified in MHz. The default value is one eighth of the AI Engine frequency, based on speed grade. The following is an example:
--pl-register-threshold=125
The compiler maps any PLIO interface with an AI Engine-PL frequency higher than this setting (125 MHz in this case) to the high-speed channels with the BLI registers enabled. If the PLIO interface frequency is not higher than the pl-register-threshold value, any of the AI Engine-PL channels can be used.
In summary, if pl-freq < pl-register-threshold, all eight channels can be used, unregistered. If pl-freq > pl-register-threshold, only the six fast channels can be used, with registering. pl-register-threshold is a way to control the threshold frequency beyond which only the fast channels are used (with registering).
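For instance, a representative invocation combining both options (graph file name hypothetical):

aiecompiler --pl-freq=300 --pl-register-threshold=125 graph.cpp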