AI Engine/Programmable Logic Integration

When you are ready to interface with the programmable logic (PL), you must decide which platform to target. A platform is a fully contained image that defines both the hardware (XSA) and the software (bare metal, Linux, or both). The XSA contains the hardware description of the platform, which is defined in the Vivado® Design Suite; the software is defined either as a bare-metal setup or as a Linux image created with PetaLinux. Depending on the needs of your application, you might use an example reference platform provided by Xilinx or a custom platform created by your organization.

Xilinx recommends interfacing through PLIO port attributes, which represent external stream connections that cross the AI Engine-PL boundary. A PLIO represents an ADF graph interface to the PL. The PL side could be, for example, a PL kernel, a platform IP representing a signal source or sink, or a data mover that interfaces the ADF graph to memory.

Alternatively, interface connections can use GMIO port attributes, which represent external memory-mapped connections to or from global memory. Further details on these attributes can be found in Using a Virtual Platform.
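
For reference, a GMIO connection is declared directly in the graph code. The following minimal sketch assumes the adf::GMIO constructor that takes a logical name, a burst length in bytes, and a required bandwidth in MB/s; the object names are illustrative only.

#include <adf.h>
using namespace adf;

// Illustrative GMIO port attributes: logical name, burst length in bytes
// (64, 128, or 256), and required bandwidth in MB/s.
GMIO gmioIn("gmioIn", 64, 1000);
GMIO gmioOut("gmioOut", 64, 1000);

// Inside a graph, GMIO objects connect to kernel ports much like PLIO,
// for example: connect<>(gmioIn.out[0], k.in[0]);
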
Note: Graphs with PL-only kernels are deprecated.

Design Flow Using RTL Programmable Logic

RTL blocks are not supported inside the ADF graph. Communication between the RTL blocks and the ADF graph requires that you use PLIO interfacing. In the following example, interpolator and classify are AI Engine kernels. The interpolator AI Engine kernel streams data to a PL RTL block, which, in turn, streams data back to the AI Engine classify kernel.

#include <adf.h>
#include "kernels.h" // assumed header declaring the fir_27t_sym_hb_2i and classifier kernel functions

using namespace adf;

class clipped : public graph {
  private:
    kernel interpolator;
    kernel classify;
   
  public:
    port<input> in;
    port<output> clip_in;
    port<output> out;
    port<input> clip_out;

    clipped() {
      interpolator = kernel::create(fir_27t_sym_hb_2i);
      classify     = kernel::create(classifier);

      // Platform input window into the interpolator kernel
      connect< window<INTERPOLATOR27_INPUT_BLOCK_SIZE, INTERPOLATOR27_INPUT_MARGIN> >(in, interpolator.in[0]);
      // Interpolator output streamed out of the graph to the polar_clip PL RTL kernel
      connect< window<POLAR_CLIP_INPUT_BLOCK_SIZE>, stream >(interpolator.out[0], clip_in);
      // Stream returned from the polar_clip PL RTL kernel into the classify kernel
      connect< stream >(clip_out, classify.in[0]);
      // Classify output window back to the platform
      connect< window<CLASSIFIER_OUTPUT_BLOCK_SIZE> >(classify.out[0], out);

      std::vector<std::string> myheaders;
      myheaders.push_back("include.h");

      adf::headers(interpolator) = myheaders;
      adf::headers(classify) = myheaders;

      source(interpolator) = "kernels/interpolators/hb27_2i.cc";
      source(classify)    = "kernels/classifiers/classify.cc";

      runtime<ratio>(interpolator) = 0.8;
      runtime<ratio>(classify) = 0.8;
    };
};

clip_in and clip_out are the ports to and from the polar_clip PL RTL kernel, which is connected to the AI Engine kernels in the graph. For example, the clip_in port is the output of the interpolator AI Engine kernel, which is connected to the input of the polar_clip RTL kernel. The clip_out port is the input of the classify AI Engine kernel, driven by the output of the polar_clip RTL kernel.

RTL Blocks and AI Engine Simulator

The top-level application file, which contains an instance of your graph class and connects the graph to a simulation platform, also needs to declare the PLIO inputs and outputs of the RTL blocks. In the following example, these are associated with the files output_interp.txt and input_classify.txt.

#include "graph.h"

// PLIO: logical name, bit width, data file, and optional PL clock frequency in MHz
PLIO *in0      = new PLIO("DataIn1",  adf::plio_32_bits, "data/input.txt");
PLIO *ai_to_pl = new PLIO("clip_in",  adf::plio_32_bits, "data/output_interp.txt", 100);
PLIO *pl_to_ai = new PLIO("clip_out", adf::plio_32_bits, "data/input_classify.txt", 100);
PLIO *out0     = new PLIO("DataOut1", adf::plio_32_bits, "data/output.txt");

simulation::platform<2, 2> platform(in0, pl_to_ai, out0, ai_to_pl);

clipped clipgraph;

connect<> net0(platform.src[0], clipgraph.in);        // input test vector into the graph
connect<> net1(clipgraph.clip_in, platform.sink[1]);  // interpolator output captured for the polar_clip RTL kernel
connect<> net2(platform.src[1], clipgraph.clip_out);  // polar_clip output data fed into the classify kernel
connect<> net3(clipgraph.out, platform.sink[0]);      // graph output to the test bench

#ifdef __AIESIM__
int main(int argc, char ** argv) {
    clipgraph.init();
    clipgraph.run();
    clipgraph.end();
    return 0;
}
#endif

To make the AI Engine simulator work, you must create test bench files related to the RTL kernel. The data/output_interp.txt file captures the stream that feeds the RTL kernel; the AI Engine simulator generates it from the output of the interpolator AI Engine kernel. The data/input_classify.txt file must contain the data that the polar_clip kernel would produce, which is the input to the AI Engine classify kernel. Note that PLIO can take an optional PL clock frequency attribute, which is 100 MHz for the polar_clip connections in this example.
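
With the data files in place, the graph can be compiled and simulated from the command line. The following invocation is a sketch only; the platform file is a placeholder and other compiler options are omitted.

aiecompiler --platform=<platform>.xpfm graph.cpp   # generates the Work directory
aiesimulator --pkg-dir=./Work                      # simulate using the compiled package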

RTL Blocks in Hardware Emulation and Hardware Flows

RTL kernels are fully supported in hardware emulation and hardware flows. You need to add the RTL kernel using the nk option and link its interfaces using the sc option, as shown in the following code. If necessary, adjust the kernel clock frequency using the freqHz option. The following is an example of a Vitis configuration file.

[connectivity]
nk=mm2s:1:mm2s
nk=s2mm:1:s2mm
nk=polar_clip:1:polar_clip
sc=mm2s.s:ai_engine_0.DataIn1
sc=ai_engine_0.clip_in:polar_clip.in_sample
sc=polar_clip.out_sample:ai_engine_0.clip_out
sc=ai_engine_0.DataOut1:s2mm.s
[clock]
freqHz=100000000:polar_clip.ap_clk
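
As a sketch only, a configuration file like this one (saved, for example, as system.cfg) is passed to the Vitis linker together with the compiled PL kernels and the AI Engine graph; the platform and output names below are placeholders.

v++ -l -t hw_emu --platform <platform>.xpfm --config system.cfg \
    mm2s.xo s2mm.xo polar_clip.xo libadf.a -o binary_container_1.xclbin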

For more information on RTL kernels and the Vitis flow, see Integrating the Application Using the Vitis Tools Flow. SystemC kernels can also be used in an emulation-only form. For this flow, see Working with SystemC Models in the Application Acceleration Development flow of the Vitis Unified Software Platform Documentation (UG1416).

Design Considerations for Graphs Interacting with Programmable Logic

The AI Engine array is made up of AI Engine tiles and AI Engine array interface tiles on the last row of the array. The types of interface tiles include AI Engine-PL and AI Engine-NoC.

Knowledge of the PL interface tile, which interfaces and adapts the signals between the AI Engines and the PL region, is essential to take full advantage of the bandwidth between AI Engines and the PL. The following figure shows an expanded view of a single PL interface tile.

Figure 1: AI Engine-PL Interface Tile


Note: The interface tile supports two different clock domains, the AI Engine clock and the PL clock, as well as a predefined number of streaming channels available to connect from the AI Engine tiles to a specific PL interface tile.

PL Interface Tile Capabilities

The AI Engine clock can run at up to 1 GHz for -1L speed grade devices, or higher for -2 and -3 speed grade devices. The default width of a stream channel is 32 bits. Because this frequency is higher than the PL clock frequency, a clock domain crossing to the PL region is always necessary, for example, to one-half or one-quarter of the AI Engine clock frequency.

Note: Though not required, Xilinx recommends running the PL kernel at a frequency such that the AI Engine frequency is an integer multiple of the PL kernel frequency.

For C++ HLS PL kernels, choose an appropriate target frequency depending on the complexity of the implemented algorithm. The --hls.clock option can be used with the Vitis compiler when compiling HLS C/C++ into Xilinx object (XO) files.
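
For example, an HLS data mover such as mm2s might be compiled to an XO file with a 300 MHz HLS target clock as sketched below; the platform and file names are placeholders, and --hls.clock takes a frequency in Hz followed by the kernel name.

v++ -c -t hw --platform <platform>.xpfm -k mm2s --hls.clock 300000000:mm2s \
    -o mm2s.xo mm2s.cpp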

AI Engine-to-PL Rate Matching

The AI Engine runs at 1 GHz (or more, depending on the device) and can write at most two streams with a 32-bit data width per cycle. In contrast, a PL kernel might run at 500 MHz (half the AI Engine frequency) while consuming a larger bit width. Rate matching balances the throughput from the producer to the consumer so that neither process becomes a bottleneck for the total performance. The following equation shows the rate matching for each channel:

Frequency AI Engine × Data AI Engine per cycle = Frequency PL × Data PL per cycle

The following table shows a PL rate-matching example for a 32-bit channel written by the AI Engine every cycle at 1 GHz for -1L speed grade devices. As shown, the PL IP has to consume twice the data at half the frequency, or four times the data at one-quarter of the frequency.

Table 1. Frequency Response of AI Engine Compared to PL Region

  AI Engine                      PL
  Frequency    Data per Cycle    Frequency    Data per Cycle
  1 GHz        32 bits           500 MHz      64 bits
  1 GHz        32 bits           250 MHz      128 bits

Because the need to match frequencies and adjust data-path widths is well understood by the Vitis compiler (v++), the tool automatically extracts the port width from the PL kernel and the frequency from the clock specification, and introduces an upsizer/downsizer to temporarily store the data exchanged between the AI Engine and PL regions to manage the rate match.

To avoid deadlocks, it is important to ensure that if multiple channels are read or written between the AI Engine and the PL, the data rate per cycle is concurrently achieved on all channels. For example, if one channel requires 32 bits, and the second 64 bits, the AI Engine code must ensure that both channels are written adequately to avoid back pressure or starvation on the channel. Additionally, to avoid deadlock, writing/reading from the AI Engine and reading/writing in the PL code must follow the same chronological order.
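
As an illustration, the following AI Engine kernel sketch keeps two output channels balanced by writing one 32-bit word to the first stream for every two 32-bit words written to the second, in a fixed chronological order that the PL consumer must mirror. The kernel name, loop count, and data layout are hypothetical, and the sketch assumes the ADF stream API (readincr/writeincr).

#include <adf.h>

// Hypothetical producer: channel A consumes 32 bits per PL transaction and channel B
// consumes 64 bits, so two words go to B for every word that goes to A, always in the
// same order that the PL kernel reads them.
void rate_balanced_producer(input_stream_int32 *in,
                            output_stream_int32 *chA,
                            output_stream_int32 *chB) {
  for (int i = 0; i < 256; ++i) {
    int32 a  = readincr(in);
    int32 b0 = readincr(in);
    int32 b1 = readincr(in);
    writeincr(chA, a);     // one 32-bit word on channel A
    writeincr(chB, b0);    // two 32-bit words on channel B
    writeincr(chB, b1);
  }
}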

The number of interfaces used in the function definition of the PL kernel defines the number of AXI4-Stream interfaces; each argument results in the creation of a separate stream.
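
On the PL side this means, for example, that an HLS kernel with two hls::stream arguments is given two AXI4-Stream interfaces. The following sketch is illustrative only; the kernel and port names are hypothetical.

#include <ap_axi_sdata.h>
#include <hls_stream.h>

// Hypothetical HLS PL kernel: each hls::stream argument becomes its own AXI4-Stream
// interface (in_sample and out_sample) when compiled with v++.
extern "C" void pl_passthrough(hls::stream<ap_axiu<64, 0, 0, 0> > &in_sample,
                               hls::stream<ap_axiu<64, 0, 0, 0> > &out_sample) {
#pragma HLS interface axis port=in_sample
#pragma HLS interface axis port=out_sample
    ap_axiu<64, 0, 0, 0> v;
    do {
        v = in_sample.read();   // blocking read from the slave AXI4-Stream
        out_sample.write(v);    // forward the word, preserving the TLAST sideband
    } while (!v.last);          // stop after the word flagged with TLAST
}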

AI Engine-PL Interface Performance

Versal AI Core series devices include an AI Engine array with the following column categories.

PL column
provides PL stream access. Each column supports eight 64-bit slave channels for streaming data into the AI Engine and six 64-bit master channels for streaming data to the PL.
NoC column
provides connectivity between the AI Engine array and the NoC. These interfaces can also connect to the PL.

To instruct the AI Engine compiler to select higher frequency interfaces, use the --pl-freq=<number> option to specify the clock frequency (in MHz) for the PL kernels. The default value is one-quarter of the AI Engine frequency, and the maximum supported value is one-half of the AI Engine frequency; both values depend on the speed grade. The following are examples:

  • Option to enable an AI Engine-PL frequency of 300 MHz for all AI Engine-PL interfaces:
    --pl-freq=300
  • To set a different frequency for a specific PLIO interface, use the following form in the ADF graph:
    adf::PLIO *<input> = new adf::PLIO(<logical_name>, <plio_width>, <file>, <FreqMHz>);
Note: The following information applies to the AI Engine device architecture documented in Versal ACAP AI Engine Architecture Manual (AM009).

The AI Engine-PL AXI4-Stream channels use boundary logic interface (BLI) connections that include optional BLI registers, with the exception of slave channels 3 and 7. These two slave channels are slower interfaces. The performance of the data transfer between the AI Engine and the PL depends on whether the optional BLI registers are enabled.

For less timing-critical designs, all eight channels can be used without using the BLI registers. PL timing can still be met in this case. However, for higher frequency designs, only the six fast channels (0,1,2,4,5,6) can be used and the timing paths from the PL must be registered, using the BLI registers.

To control the use of BLI registers across the AI Engine-PL channels, use the --pl-register-threshold=<number> compiler option, specified in MHz. The default value is 1/8 of the AI Engine frequency based on speed grade. Following is an example:

  • --pl-register-threshold=125

    The compiler maps any PLIO interface with an AI Engine-PL frequency higher than this setting (125 MHz in this case) to high-speed channels with the BLI registers enabled. If the PLIO interface frequency is not higher than the pl-register-threshold value, any of the AI Engine-PL channels can be used.

In summary, if pl-freq < pl-register-threshold, all eight channels can be used without registering. If pl-freq > pl-register-threshold, only the six fast channels can be used, with registering. The pl-register-threshold option therefore controls the threshold frequency beyond which only the fast channels (with BLI registers) are used.
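
Both options are passed to the AI Engine compiler, for example as follows; the platform path and source file are placeholders and other options are omitted.

aiecompiler --platform=<platform>.xpfm --pl-freq=300 --pl-register-threshold=125 graph.cpp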

Note: TLAST is required on a 64-bit stream between the AI Engine and the PL if single 32-bit words are sent. The AI Engine compiler automatically up-sizes 32-bit AI Engine-PL stream interfaces to internal 64-bit interfaces. When sending 32-bit stream data between the AI Engine and the PL, a single 32-bit word without TLAST is held in the interface until a second 32-bit word arrives to complete the 64-bit up-sizing. The workaround is to send TLAST after the last 32-bit word of the stream.
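
For example, an AI Engine kernel sending 32-bit words to the PL can assert TLAST on the final word. This sketch assumes a writeincr overload that takes a TLAST flag; the kernel name and word count are hypothetical.

#include <adf.h>

// Hypothetical kernel: stream 32-bit words to the PL and mark the final word with TLAST
// so that the internal 32-to-64-bit up-sizer releases any held word.
void stream_to_pl(input_stream_int32 *in, output_stream_int32 *out) {
  const int N = 128;
  for (int i = 0; i < N - 1; ++i)
    writeincr(out, readincr(in));       // intermediate words, TLAST not asserted
  writeincr(out, readincr(in), true);   // final word with TLAST asserted
}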