AI Engine/Programmable Logic Integration
When you are ready to consider interfacing to the programmable logic (PL), you must decide which platform to interface with. A platform is a fully contained image that defines both the hardware (XSA) and the software (bare metal, Linux, or both). The XSA contains the hardware description of the platform, which is defined in the Vivado® Design Suite; the software is defined either as a bare-metal setup or as a Linux image built through PetaLinux. Depending on the needs of your application, you might use an example reference platform provided by Xilinx, or a custom platform created by your organization.
Xilinx recommends interfacing through PLIO port attributes, which represent external stream connections that cross the AI Engine-PL boundary. A PLIO represents an ADF graph interface to the PL. The PL side could be, for example, a PL kernel, a platform IP representing a signal source or sink, or a data mover that interfaces the ADF graph to memory.
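For example, at the top level of a simulation application, PLIO objects stand in for the PL side of these connections. The following is a minimal sketch only; the header my_graph.h, the graph class mygraph, its ports in and out, and the data file names are placeholders rather than part of the example design described in the next section.
#include "my_graph.h"  // hypothetical header: assumed to include <adf.h> and declare mygraph

// Each PLIO names a logical AI Engine-PL stream, fixes its width, and (for simulation)
// attaches a text file that models the PL-side data source or sink.
PLIO *pl_src  = new PLIO("DataIn1",  adf::plio_32_bits, "data/input.txt");
PLIO *pl_sink = new PLIO("DataOut1", adf::plio_32_bits, "data/output.txt");

simulation::platform<1, 1> platform(pl_src, pl_sink);

mygraph gr;                                // hypothetical graph with ports 'in' and 'out'
connect<> net0(platform.src[0], gr.in);    // stream crossing from the PL into the ADF graph
connect<> net1(gr.out, platform.sink[0]);  // stream crossing from the ADF graph to the PL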
Design Flow Using RTL Programmable Logic
RTL blocks are not supported inside the ADF graph. Communication between RTL blocks and the ADF graph requires PLIO interfacing. In the following example, interpolator and classify are AI Engine kernels. The interpolator AI Engine kernel streams data to a PL RTL block, which, in turn, streams data back to the AI Engine classify kernel.
class clipped : public graph {
private:
  kernel interpolator;
  kernel classify;
public:
  port<input>  in;        // graph input from the platform
  port<output> clip_in;   // graph output feeding the polar_clip PL RTL kernel
  port<output> out;       // graph output to the platform
  port<input>  clip_out;  // graph input driven by the polar_clip PL RTL kernel

  clipped() {
    interpolator = kernel::create(fir_27t_sym_hb_2i);
    classify     = kernel::create(classifier);

    connect< window<INTERPOLATOR27_INPUT_BLOCK_SIZE, INTERPOLATOR27_INPUT_MARGIN> >(in, interpolator.in[0]);
    connect< window<POLAR_CLIP_INPUT_BLOCK_SIZE>, stream >(interpolator.out[0], clip_in);
    connect< stream >(clip_out, classify.in[0]);
    connect< window<CLASSIFIER_OUTPUT_BLOCK_SIZE> >(classify.out[0], out);

    std::vector<std::string> myheaders;
    myheaders.push_back("include.h");
    adf::headers(interpolator) = myheaders;
    adf::headers(classify) = myheaders;

    source(interpolator) = "kernels/interpolators/hb27_2i.cc";
    source(classify) = "kernels/classifiers/classify.cc";

    runtime<ratio>(interpolator) = 0.8;
    runtime<ratio>(classify) = 0.8;
  }
};
clip_in and clip_out are ports to and from the polar_clip PL RTL kernel, which is connected to the AI Engine kernels in the graph. For example, the clip_in port is the output of the interpolator AI Engine kernel that is connected to the input of the polar_clip RTL kernel. The clip_out port is the input of the classify AI Engine kernel and the output of the polar_clip RTL kernel.
RTL Blocks and AI Engine Simulator
The top-level application file that contains an instance of your graph class and connects the graph to a simulation platform also needs to include the PLIO inputs and outputs of the RTL blocks. These files are called output_interp.txt and input_classify.txt in the following example.
#include "graph.h"

// PLIOs for the platform I/O and for the two files that model the polar_clip RTL kernel boundary
PLIO *in0      = new PLIO("DataIn1",  adf::plio_32_bits, "data/input.txt");
PLIO *ai_to_pl = new PLIO("clip_in",  adf::plio_32_bits, "data/output_interp.txt", 100);
PLIO *pl_to_ai = new PLIO("clip_out", adf::plio_32_bits, "data/input_classify.txt", 100);
PLIO *out0     = new PLIO("DataOut1", adf::plio_32_bits, "data/output.txt");

simulation::platform<2, 2> platform(in0, pl_to_ai, out0, ai_to_pl);

clipped clipgraph;

connect<> net0(platform.src[0], clipgraph.in);
connect<> net1(clipgraph.clip_in, platform.sink[1]);
connect<> net2(platform.src[1], clipgraph.clip_out);
connect<> net3(clipgraph.out, platform.sink[0]);

#ifdef __AIESIM__
int main(int argc, char **argv) {
  clipgraph.init();
  clipgraph.run();
  clipgraph.end();
  return 0;
}
#endif
To make the AI Engine simulator work, you must create input test bench files related to the RTL kernel. The data/output_interp.txt file is the test bench input to the RTL kernel; the AI Engine simulator generates this output file from the interpolator AI Engine kernel. The data/input_classify.txt file contains data from the polar_clip kernel, which is input to the AI Engine classify kernel. Note that a PLIO can take an optional attribute, the PL clock frequency, which is set to 100 MHz for the polar_clip interfaces.
RTL Blocks in Hardware Emulation and Hardware Flows
RTL kernels are fully supported in hardware emulation and hardware flows. You need to add the RTL kernel with an nk option and link its interfaces with sc options, as shown in the following code. If necessary, adjust the clock frequency using freqHz. The following is an example of a Vitis configuration file.
[connectivity]
nk=mm2s:1:mm2s
nk=s2mm:1:s2mm
nk=polar_clip:1:polar_clip
sc=mm2s.s:ai_engine_0.DataIn1
sc=ai_engine_0.clip_in:polar_clip.in_sample
sc=polar_clip.out_sample:ai_engine_0.clip_out
sc=ai_engine_0.DataOut1:s2mm.s
[clock]
freqHz=100000000:polar_clip.ap_clk
For more information on RTL kernels and the Vitis flow, see Integrating the Application Using the Vitis Tools Flow. SystemC kernels can also be used in an emulation-only form. For this flow, see Working with SystemC Models in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).
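For reference, the mm2s and s2mm kernels named in the connectivity section above are simple PL data movers between memory and the AI Engine array. The following is a rough sketch of what such an mm2s kernel can look like in HLS C++; the interface widths and pragmas are illustrative assumptions, not the exact kernel shipped with the example design.
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

// Memory-to-stream data mover sketch: reads 32-bit words over an AXI4 master
// interface and forwards them to the AI Engine array over AXI4-Stream.
extern "C" void mm2s(ap_int<32> *mem, hls::stream<ap_axis<32, 0, 0, 0>> &s, int size) {
#pragma HLS INTERFACE m_axi port=mem offset=slave bundle=gmem
#pragma HLS INTERFACE axis port=s
#pragma HLS INTERFACE s_axilite port=mem bundle=control
#pragma HLS INTERFACE s_axilite port=size bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
    ap_axis<32, 0, 0, 0> x;
    x.data = mem[i];
    x.keep = -1;  // mark all bytes valid
    x.last = 0;
    s.write(x);
  }
}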
Design Considerations for Graphs Interacting with Programmable Logic
The AI Engine array is made up of AI Engine tiles and AI Engine array interface tiles on the last row of the array. The types of interface tiles include AI Engine-PL and AI Engine-NoC.
Knowledge of the PL interface tile, which interfaces and adapts the signals between the AI Engines and the PL region, is essential to take full advantage of the bandwidth between AI Engines and the PL. The following figure shows an expanded view of a single PL interface tile.
PL Interface Tile Capabilities
The AI Engine clock can run at up to 1 GHz for -1L speed grade devices, or higher for -2 and -3 speed grade devices. The default width of a stream channel is 32 bits. Because this frequency is higher than the PL clock frequency, it is always necessary to perform a clock domain crossing to the PL region, for example, to one-half or one-quarter of the AI Engine clock frequency.
For C++ HLS PL kernels, choose an appropriate target frequency depending on the complexity of the algorithm implemented. The --hls.clock option can be used with the Vitis compiler when compiling HLS C/C++ into Xilinx object (XO) files.
AI Engine-to-PL Rate Matching
The AI Engine runs at 1 GHz (or more, depending on the device) and can write at most two streams with a 32-bit data width per cycle. In contrast, a PL kernel can run at 500 MHz (half the frequency of the AI Engine) while consuming a larger bit width. Rate matching is concerned with balancing the throughput from the producer to the consumer, and is used to ensure that neither of the processes creates a bottleneck with respect to the total performance. The following equation shows the rate matching for each channel:

Frequency (AI Engine) × Data per Cycle (AI Engine) = Frequency (PL) × Data per Cycle (PL)
The following table shows a PL rate matching example for a 32-bit channel written every cycle by the AI Engine running at 1 GHz (-1L speed grade). As shown, the PL IP has to consume twice the data width at half the frequency, or four times the data width at one quarter of the frequency.
| AI Engine Frequency | AI Engine Data per Cycle | PL Frequency | PL Data per Cycle |
|---|---|---|---|
| 1 GHz | 32 bits | 500 MHz | 64 bits |
| 1 GHz | 32 bits | 250 MHz | 128 bits |
Because the need to match frequency and adjust data-path width is well understood by the Vitis compiler (v++), the tool automatically extracts the port width from the PL kernel and the frequency from the clock specification, and introduces an upsizer/downsizer to temporarily store the data exchanged between the AI Engine and PL regions to manage the rate match.
To avoid deadlocks, it is important to ensure that if multiple channels are read or written between the AI Engine and the PL, the data rate per cycle is concurrently achieved on all channels. For example, if one channel requires 32 bits, and the second 64 bits, the AI Engine code must ensure that both channels are written adequately to avoid back pressure or starvation on the channel. Additionally, to avoid deadlock, writing/reading from the AI Engine and reading/writing in the PL code must follow the same chronological order.
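As an illustration only (the kernel name, stream names, and loop bound below are hypothetical), an AI Engine kernel feeding two PL-bound streams with the 32-bit/64-bit split from the example above could interleave its writes so that every iteration services both channels in the same fixed order that the PL consumes them:
#include <adf.h>

// Hypothetical kernel: per iteration, one 32-bit word goes to the channel the PL
// reads as 32 bits per cycle, and two 32-bit words go to the channel the PL reads
// as 64 bits per cycle, always in the same chronological order as the PL side.
void dual_channel_writer(input_stream_int32 *in,
                         output_stream_int32 *ch32,
                         output_stream_int32 *ch64) {
  for (int i = 0; i < 128; i++) {
    int32 a = readincr(in);
    int32 b = readincr(in);
    writeincr(ch32, a);  // 32-bit channel: one word per iteration
    writeincr(ch64, a);  // 64-bit channel: two words per iteration...
    writeincr(ch64, b);  // ...to keep the per-cycle data rates balanced
  }
}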
The number of stream interfaces used in the PL kernel function definition determines the number of AXI4-Stream interfaces; each argument results in the creation of a separate stream.
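For example (a hypothetical HLS kernel, not part of the example design), a PL kernel declared with two hls::stream arguments is synthesized with two independent AXI4-Stream interfaces:
#include <ap_axi_sdata.h>
#include <hls_stream.h>

// Two stream arguments result in two AXI4-Stream interfaces on the synthesized kernel.
extern "C" void pl_passthrough(hls::stream<ap_axis<64, 0, 0, 0>> &in_sample,
                               hls::stream<ap_axis<64, 0, 0, 0>> &out_sample) {
#pragma HLS INTERFACE axis port=in_sample
#pragma HLS INTERFACE axis port=out_sample
#pragma HLS INTERFACE ap_ctrl_none port=return
  ap_axis<64, 0, 0, 0> x = in_sample.read();
  out_sample.write(x);
}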
AI Engine-PL Interface Performance
Versal AI Core series devices include an AI Engine array with the following column categories.
- PL column: provides PL stream access. Each column supports eight 64-bit slave channels for streaming data into the AI Engine and six 64-bit master channels for streaming data to the PL.
- NoC column: provides connectivity between the AI Engine array and the NoC. These interfaces can also connect to the PL.
To instruct the AI Engine compiler to select higher frequency interfaces, use the --pl-freq=<number> option to specify the clock frequency (in MHz) for the PL kernels. The default value is one quarter of the AI Engine frequency and the maximum supported value is one half of the AI Engine frequency; both values depend on the speed grade. The following are examples:
- Option to enable an AI Engine-PL frequency of 300 MHz for all AI Engine-PL interfaces:
  --pl-freq=300
- To set a different frequency for a specific PLIO interface, set it in the ADF graph using the following code:
  adf::PLIO *<input> = new adf::PLIO(<logical_name>, <plio_width>, <file>, <FreqMHz>);
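For instance (the values here are illustrative only, not taken from the example design), a 64-bit PLIO that runs its AI Engine-PL interface at 300 MHz could be declared as:
adf::PLIO *clip_in = new adf::PLIO("clip_in", adf::plio_64_bits, "data/output_interp.txt", 300);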
The AI Engine-PL AXI4-Stream channels use boundary logic interface (BLI) connections that include optional BLI registers, with the exception of slave channels 3 and 7; these two slave channels are slower interfaces. The performance of the data transfer between the AI Engine and the PL depends on whether the optional BLI registers are enabled.
For less timing-critical designs, all eight channels can be used without the BLI registers, and PL timing can still be met. However, for higher frequency designs, only the six fast channels (0, 1, 2, 4, 5, 6) can be used, and the timing paths from the PL must be registered using the BLI registers.
To control the use of BLI registers across the AI Engine-PL channels, use the --pl-register-threshold=<number> compiler option, specified in MHz. The default value is one eighth of the AI Engine frequency, based on speed grade. The following is an example:
- --pl-register-threshold=125
The compiler maps any PLIO interface with an AI Engine-PL frequency higher than this setting (125 MHz in this case) to the high-speed channels with the BLI registers enabled. If the PLIO interface frequency is not higher than the pl-register-threshold value, the compiler can use any of the AI Engine-PL channels.
In summary, if pl-freq < pl-register-threshold, all eight channels can be used unregistered. If pl-freq > pl-register-threshold, only the six fast channels can be used, with registering. The pl-register-threshold option controls the threshold frequency beyond which only the fast channels can be used (with registering).