Specialized Graph Constructs
This chapter describes several graph constructs that help when modeling specific scenarios.
Look-up Tables
Static File-scoped Tables
Kernel functions can use private, read-only data structures that are accessed as file-scoped variables. The compiler allocates a limited amount of static heap space for such data. As an example, consider the following header file (user_parameter.h):
#ifndef USER_PARAMETER_H
#define USER_PARAMETER_H
#include <adf.h>
static int32 lutarray[8] = {1,2,3,4,5,6,0,0} ;
#endif
This header file can be included in the kernel source file, and the look-up table can be accessed directly inside a kernel function. The static modifier ensures that the array definition is local to this file. The AI Engine compiler then allocates this array in static heap space for the processor where this kernel is used.
#include "user_parameter.h"
void simple_lut(input_window_cint16 * in, output_window_cint16 * out){
v4cint32 tmp;
v4cacc48 acc;
v32cint16 coeffs;
coeffs = upd_w(coeffs, 0, lutarray);
window_readincr(in, tmp);
acc = mul4(tmp, 0, 0x3210, 1, coeffs, 0, 0x0000, 1);
acc = mac4(acc, tmp, 2, 0x3210, 1, coeffs, 2, 0x0000, 1);
acc = mac4(acc, tmp, 4, 0x3210, 1, coeffs, 4, 0x0000, 1);
window_writeincr(out, srs(acc));
}
Global Graph-scoped Tables
While the previous example only includes an eight-entry look-up table accessed as a global variable, many other algorithms require much larger look-up tables. Because AI Engine local memory is at a premium, it is much more efficient for the AI Engine compiler to manage the look-up table explicitly for specific kernels than to leave a large amount of stack or heap space on every processor. Such tables should not be declared static in the kernel header file.
#ifndef USER_PARAMETER_H
#define USER_PARAMETER_H
#include <adf.h>
int32 lutarray[8] = {1,2,3,4,5,6,0,0} ;
#endif
The kernel source continues to include the header file and use the table as before. However, you must now declare this table as extern in the graph class header and use the parameter::array(…) function to create a parameter object explicitly in the graph. You also need to attach this parameter object to the kernel as shown in the following code:
#include <adf.h>
extern int32 lutarray[8];
class simple_lut_graph : public graph {
public:
kernel k;
parameter p;
simple_lut_graph() {
k = kernel::create(simple);
p = parameter::array(lutarray);
connect<>(p,k);
...
}
};
Including this explicit specification of the look-up table in the graph description ensures that the compiler is aware of the requirement to reserve a suitably sized piece of memory for the look-up table when it allocates memory for kernel input and output buffers.
Shared Graph-scoped Tables
Sometimes, the same table definition is used in multiple kernels. Because the AI Engine architecture is a distributed address-space architecture, each processor binary image that executes such a kernel needs to have that table defined in its own local memory. To get the correct graph linkage spread across multiple processors, you must declare the tables as extern within the kernel source file as well as the graph class definition file. The actual table definition then needs to be specified in a separate header file that is attached as a property to the kernel, as shown below.
#include <adf.h>
extern int32 lutarray[8];
class simple_lut_graph : public adf::graph {
public:
kernel k;
parameter p;
simple_lut_graph() {
k = kernel::create(simple);
p = parameter::array(lutarray);
connect<>(p,k);
std::vector<std::string> myheaders;
myheaders.push_back("./user_parameter.h");
headers(k) = myheaders;
...
}
};
This ensures that the header file that defines the table is included in the final binary link wherever this kernel is used without causing re-definition errors.
Stream FIFO Depth
The AI Engine architecture uses stream data extensively for DMA-based I/O, for communicating between two AI Engines, and for communicating between the AI Engine and the programmable logic (PL). This raises the potential for a resource deadlock when the data flow graph has reconvergent stream paths. If the pipeline depth of one path is longer than the other, the producer kernel can stall and might not be able to push data into the shorter path because of back pressure. At the same time, the consumer kernel is waiting to receive data on the longer path due to the lack of data. If the order of data production and consumption between two stream paths is different, a deadlock can happen even between two kernels that are directly connected with two stream paths. The following figure illustrates the paths.
If the producer kernel is trying to push data on stream S1 and runs into back pressure while the consumer kernel is still trying to read data from stream S2, a deadlock occurs. A general way to fix this situation is to create more buffering in the paths that have back pressure by using a fifo_depth constraint on a connection.
p = kernel::create(producer);
c = kernel::create(consumer);
connect<stream> s1(p.out[0], c.in[0]);
connect<stream> s2(p.out[1], c.in[1]);
fifo_depth(s1) = 20;
fifo_depth(s2) = 10;
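The reconvergent-path deadlock described above can be sketched with a toy, single-threaded model in plain C++ (an illustration only, not the ADF implementation): a producer writes one sample to each of two bounded FIFOs per iteration, while the consumer drains one path completely before the other. With too little buffering on the path under back pressure, both sides stall.

```cpp
#include <queue>
#include <cstddef>

// Toy model of two reconvergent stream paths with bounded FIFOs.
// The producer writes one sample to S1 and one to S2 per iteration; the
// consumer drains all of S2 before reading S1 (a mismatched access order).
// Returns true if all data transfers, false if both sides stall (deadlock).
bool run_streams(std::size_t depth_s1, std::size_t depth_s2, int samples) {
    std::queue<int> s1, s2;
    int produced = 0;
    int consumed_s2 = 0, consumed_s1 = 0;
    while (consumed_s1 < samples || consumed_s2 < samples) {
        bool progress = false;
        // Producer: blocked if either FIFO it must write to is full.
        if (produced < samples && s1.size() < depth_s1 && s2.size() < depth_s2) {
            s1.push(produced);
            s2.push(produced);
            ++produced;
            progress = true;
        }
        // Consumer: reads S2 to completion before touching S1.
        if (consumed_s2 < samples) {
            if (!s2.empty()) { s2.pop(); ++consumed_s2; progress = true; }
        } else if (!s1.empty()) {
            s1.pop(); ++consumed_s1; progress = true;
        }
        if (!progress) return false;   // neither side can move: deadlock
    }
    return true;
}
```

In this model, giving S1 enough depth to absorb every sample the consumer defers (as fifo_depth does for the real stream) lets the transfer complete, while a shallow S1 deadlocks.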
Specifying FIFO Depth on the Producer and Consumer Side
Specifying a single fifo_depth value assigns the FIFO to the destination side, or consumer kernel. You can also specify two values for fifo_depth, assigning the first value to the source side (the producer) and the second value to the consumer, for greater control of the FIFO implementation. Both source and destination can be AI Engine kernels, or a combination of AI Engine kernels and PL kernels. For example:
fifo_depth(stream_input1) = {10, 6}; // producer fifo depth=10, consumer fifo depth=6
fifo_depth(stream_input2) = {8, 20};
Kernel Bypass
A bypass encapsulator construct, discussed in Run-Time Graph Reconfiguration Using Control Parameters, is used to execute a kernel conditionally. The control of the bypass is done through a run-time parameter: 0 for no bypass and 1 for bypass. In addition to the control parameter, the external connections of a bypassed kernel or graph are directed to the external ports of the bypass construct itself. Internally, the bypass construct is connected to the bypassed kernel or graph automatically by the compiler. The following example shows the required coding.
inout_port control;
bypass b;
kernel f, p, c;
f = kernel::create(filter);
...
b = bypass::create(f);
connect<parameter> (control, b.bp);
connect<window<128>> n1(p.out[0], b.in[0]);
connect<window<128>> n2(b.out[0], c.in[0]);
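The conditional-execution semantics of the bypass can be sketched in plain C++ (run_bypass is a hypothetical helper for illustration; the real ADF construct routes window connections, not scalar values):

```cpp
#include <functional>

// Sketch of bypass semantics: control == 1 routes the input around the
// kernel unchanged, control == 0 executes the kernel on the input.
int run_bypass(int control, const std::function<int(int)>& kernelFn, int input) {
    return control ? input : kernelFn(input);
}
```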
Explicit Packet Switching
Just as multiple AI Engine kernels can share a single processor and execute in an interleaved manner, multiple stream connections can be shared on a single physical channel. This mechanism is known as Packet Switching. The AI Engine architecture and compiler work together to provide a programming model where up to four stream connections can share the same physical channel.
The Explicit Packet Switching feature allows fine-grain control over how packets are generated, distributed, and consumed in a graph computation. Explicit Packet Switching is typically recommended in cases where many low-bandwidth streams from a common PL source can be distributed to different AI Engine destinations. Similarly, many low-bandwidth streams from different AI Engine sources to a common PL destination can also take advantage of this feature. Because a single physical channel is shared between multiple streams, you minimize the number of AI Engine-PL interface streams used. This section describes graph constructs to create packet-switched streams explicitly in the graph.
Packet Switching Graph Constructs
input_pktstream and output_pktstream are introduced to represent the multiplexed data streams as input to or output from a kernel, respectively. More details on the packet headers and data types can be found in Packet Stream Operations. To explicitly control the multiplexing and de-multiplexing of packets, two new templated node classes are added to the ADF graph library: pktsplit<n> and pktmerge<n>. A node instance of class pktmerge<n> is an n:1 multiplexer of n packet streams producing a single packet stream. A node instance of class pktsplit<n> is a 1:n de-multiplexer of a packet stream producing n different packet streams. The maximum number of allowable packet streams is four on a single physical channel (n≤4). See Adaptive Data Flow Graph Specification Reference for more details.
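The routing behavior of pktsplit<n> and pktmerge<n> can be modeled in plain C++ (a sketch only; the Packet type and the in-order merge policy here are illustrative assumptions, not the ADF implementation):

```cpp
#include <vector>
#include <array>
#include <utility>

// Toy model of pktsplit<n>/pktmerge<n> routing. Each packet carries an ID in
// 0..N-1; pktsplit routes a packet to the output branch matching its ID, and
// pktmerge concatenates whatever arrives on its input branches into one stream.
using Packet = std::pair<int, int>;   // {packet ID, payload}

template <int N>
std::array<std::vector<Packet>, N> pktsplit(const std::vector<Packet>& in) {
    std::array<std::vector<Packet>, N> out{};
    for (const Packet& p : in)
        if (p.first >= 0 && p.first < N)
            out[p.first].push_back(p);    // 1:n de-multiplex by packet ID
    return out;
}

template <int N>
std::vector<Packet> pktmerge(const std::array<std::vector<Packet>, N>& in) {
    std::vector<Packet> out;
    for (const auto& branch : in)         // n:1 multiplex
        out.insert(out.end(), branch.begin(), branch.end());
    return out;
}
```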
A kernel port can be declared as input_pktstream or output_pktstream. To connect a packet stream to or from a window of data meant for an AI Engine kernel, or to pass a packet stream through, use the following graph constructs:
connect<pktstream, window<32>>
connect<window<32>, pktstream>
connect<pktstream, pktstream>
To connect a stream of data from/to a PLIO connection use the following graph construct:
connect<input_port, pktstream>
connect<pktstream, output_port>
When a kernel receives packets of data as a window of data, the header and TLAST are dropped prior to the kernel receiving the window of data. If the kernel writes an output window of data, the packet header and TLAST are automatically inserted.
If a kernel receives an input_pktstream of data, the kernel needs to process the packet header and TLAST, in addition to the packet data. Similarly, if the kernel sends an output_pktstream of data, the kernel needs to insert the packet header and TLAST, in addition to the packet data, into the output stream. These concepts are illustrated in the following example:
class ExplicitPacketSwitching: public adf::graph {
private:
adf::kernel core[4];
adf::pktsplit<4> sp;
adf::pktmerge<4> mg;
public:
adf::port<input> in;
adf::port<output> out;
ExplicitPacketSwitching() {
core[0] = adf::kernel::create(aie_core1);
core[1] = adf::kernel::create(aie_core2);
core[2] = adf::kernel::create(aie_core3);
core[3] = adf::kernel::create(aie_core4);
adf::source(core[0]) = "aie_core1.cpp";
adf::source(core[1]) = "aie_core2.cpp";
adf::source(core[2]) = "aie_core3.cpp";
adf::source(core[3]) = "aie_core4.cpp";
sp = adf::pktsplit<4>::create();
mg = adf::pktmerge<4>::create();
for(int i=0;i<4;i++){
adf::runtime<ratio>(core[i]) = 0.9;
adf::connect<adf::pktstream, adf::window<32> > (sp.out[i], core[i].in[0]);
adf::connect<adf::window<32>, adf::pktstream > (core[i].out[0], mg.in[i]);
}
adf::connect (in, sp.in[0]);
adf::connect (mg.out[0], out);
}
};
The graph has one input PLIO port and one output PLIO port. The input packet stream from the PL is split four ways and input to four different AI Engine kernels. The output streams from the four AI Engine kernels are merged into one packet stream which is output to the PL. The Vitis analyzer Graph view of the code is shown as follows.
Packet Switching and the AI Engine Simulator
Explicit packet switching is supported by the AI Engine simulator. Consider the example of the previous graph that expects packet switched data from the PL; the data is split inside the AI Engine and sent to four AI Engine kernels. On the output side the four kernel outputs are merged into one output stream to the PL.
The input data file from the PL contains all the packet switched data from the PL, for the four AI Engine kernels in the previous example. It contains the data for different kernels, packet by packet. Each packet of data is for one window input for an AI Engine kernel. The data format is as follows.
2415853568
0
1
2
3
4
5
6
7 TLAST
The header value 2415853568 is 0x8FFF0000 in hex format. The last five bits are the packet ID, 0 in this case. The last data in the packet has the keyword TLAST, which denotes the last data for the window input for the kernel. You can construct the header for each packet manually, or write helper functions to generate the header. The AI Engine compiler generates a packet switching report file, Work/reports/packet_switching_report.json, that lists the packet IDs used in the graph. In addition, it generates the Work/temp/packet_ids_c.h and Work/temp/packet_ids_v.h header files that can be included in your C or Verilog kernel code.
Location Constraints
Kernel Location Constraints
When building large graphs with multiple subgraphs, it is sometimes useful to control the exact mapping of kernels to AI Engines, either relative to other kernels or in an absolute sense. The AI Engine compiler provides a mechanism to specify location constraints for kernels, which when used with the C++ template class specification, provides a powerful mechanism to create a robust, scalable, and predictable mapping of your graph onto the AI Engine array. It also reduces the choices for the mapper to try, which can considerably speed up the mapper. Consider the following graph specification:
#include <adf.h>
#include "kernels.h"
#define NUMCORES (COLS*ROWS)
using namespace adf;
template <int COLS, int ROWS, int STARTCOL, int STARTROW>
class indep_nodes_graph1 : public graph {
public:
kernel kr[NUMCORES];
port<input> datain[NUMCORES] ;
port<output> dataout[NUMCORES] ;
indep_nodes_graph1() {
for (int i = 0; i < COLS; i++) {
for (int j = 0; j < ROWS; j++) {
int k = i*ROWS + j;
kr[k] = kernel::create(mykernel);
source(kr[k]) = "kernels/kernel.cc";
runtime<ratio>(kr[k]) = 0.9;
location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j);
}
}
for (int i = 0; i < NUMCORES; i++) {
connect<stream, window<64> >(datain[i], kr[i].in[0]);
connect<window<64>, stream >(kr[i].out[0], dataout[i]);
}
};
};
The template parameters identify a COLS x ROWS logical array of kernels (COLS x ROWS = NUMCORES) that are placed within a larger logical device of some dimensionality starting at (STARTCOL, STARTROW) as the origin. Each kernel in that graph is constrained to be placed on a specific AI Engine. This is accomplished using an absolute location constraint for each kernel placing it on a specific processor tile. For example, the following declaration would create a 1 x 2 kernel array starting at offset (3,2). When embedded within a 4 x 4 logical device topology, the kernel array is constrained to the top right corner.
indep_nodes_graph1<1,2,3,2> mygraph;
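The placement arithmetic used by the constraint location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j) can be sketched as a standalone function (tile_of is a hypothetical helper for illustration, mirroring the k = i*ROWS + j indexing in the graph above):

```cpp
#include <utility>

// Maps logical kernel index k of a COLS x ROWS array, placed at origin
// (startcol, startrow), to the AI Engine tile it is constrained to.
std::pair<int, int> tile_of(int k, int rows, int startcol, int startrow) {
    int i = k / rows;   // logical column, as in k = i*ROWS + j
    int j = k % rows;   // logical row
    return {startcol + i, startrow + j};
}
```

For the 1 x 2 instantiation indep_nodes_graph1<1,2,3,2>, kernels 0 and 1 land on tiles (3,2) and (3,3).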
Previous versions of the tools provided the location<absolute>(k) function to specify kernel constraints and the proc(x,y) function to specify a processor tile location. These functions are now deprecated. Instead, use location<kernel>(k) to specify the kernel constraints and tile(x,y) to identify a specific tile location. See Adaptive Data Flow Graph Specification Reference for more information.
Buffer Location Constraints
The AI Engine compiler tries to automatically allocate buffers for windows, lookup tables, and run-time parameters in the most efficient manner possible. However, you might want to explicitly control their placement in memory. Similar to the kernels shown previously in this section, buffers inferred on a kernel port can also be constrained to be mapped to specific tiles, banks, or even address offsets using location constraints, as shown in the following example.
#include <adf.h>
#include "kernels.h"
#define NUMCORES (COLS*ROWS)
using namespace adf;
template <int COLS, int ROWS, int STARTCOL, int STARTROW>
class indep_nodes_graph2 : public graph {
public:
kernel kr[NUMCORES];
port<input> datain[NUMCORES] ;
port<output> dataout[NUMCORES] ;
indep_nodes_graph2() {
for (int i = 0; i < COLS; i++) {
for (int j = 0; j < ROWS; j++) {
int k = i*ROWS + j;
kr[k] = kernel::create(mykernel);
source(kr[k]) = "kernels/kernel.cc";
runtime<ratio>(kr[k]) = 0.9;
location<kernel>(kr[k]) = tile(STARTCOL+i, STARTROW+j); // kernel location
location<buffer>(kr[k].in[0]) =
{ address(STARTCOL+i, STARTROW+j, 0x0),
address(STARTCOL+i, STARTROW+j, 0x2000) }; // double buffer location
location<stack>(kr[k]) = bank(STARTCOL+i, STARTROW+j, 2); // stack location
location<buffer>(kr[k].out[0]) = location<kernel>(kr[k]); // relative buffer location
}
}
for (int i = 0; i < NUMCORES; i++) {
connect< stream, window<64> >(datain[i], kr[i].in[0]);
connect< window<64>, stream >(kr[i].out[0], dataout[i]);
}
};
};
In the previous code, the location of the double buffers at port kr[k].in[0] is constrained to specific memory tile address offsets that are created using the address(col,row,offset) constructor. Furthermore, the location of the system memory (including the stack and static heap) for the processor that executes kernel instance kr[k] is constrained to a particular bank using the bank(col,row,bankid) constructor. Finally, the tile location of the buffers connected to the port kr[k].out[0] is constrained to be the same tile as that of the kernel instance kr[k]. Buffer location constraints are only allowed on window kernel ports.
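The double-buffer constraint above places the ping and pong buffers at offsets 0x0 and 0x2000, which is consistent with memory banks of 0x2000 (8 KB) bytes. A minimal sketch of that bank arithmetic follows; the 8 KB bank size is an assumption inferred from the example offsets, and bank_base/bank_of are hypothetical helpers, not ADF APIs:

```cpp
#include <cstdint>

// Assumed bank size, inferred from the 0x0 / 0x2000 offsets in the example.
constexpr std::uint32_t BANK_SIZE = 0x2000;

// Base byte offset of a bank within a tile's data memory.
std::uint32_t bank_base(int bankid) { return bankid * BANK_SIZE; }

// Bank that a given byte offset falls into.
int bank_of(std::uint32_t offset) { return static_cast<int>(offset / BANK_SIZE); }
```

Under this assumption, the two double-buffer offsets land in banks 0 and 1, while location<stack> = bank(...,2) places the stack in a different bank, avoiding memory conflicts.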
Hierarchical Constraints
When creating complex graphs with multiple subgraph classes, or multiple instances of the same subgraph class, the location constraints described above can also be applied to each kernel instance or kernel port instance individually at the point of subgraph instantiation instead of at the definition. In this case, you need to specify the graph-qualified name of that kernel instance or kernel port instance in the constraint, as shown below. Also, make sure that the kernels or ports being constrained are defined as public members of the subgraph.
class ToplevelGraph : public graph {
public:
indep_nodes_graph1<1,2,3,2> mygraph;
port<input> datain[2] ;
port<output> dataout[2] ;
ToplevelGraph() {
for (int i = 0; i < 2; i++) {
connect<stream, window<64> >(datain[i], mygraph.datain[i]);
connect<window<64>, stream >(mygraph.dataout[i], dataout[i]);
// hierarchical constraints
location<stack>(mygraph.kr[i]) = bank(3, 2+i, 2);
location<buffer>(mygraph.kr[i].out[0]) = location<kernel>(mygraph.kr[i]);
}
};
};
Buffer Allocation Control
The AI Engine compiler automatically allocates the desired number of buffers for each memory connection. There are several different cases.
- Lookup tables are always allocated as single buffers because they are expected to be read-only and private to a kernel. No locks are needed to synchronize lookup table accesses because they are expected to be accessed in an exclusive manner.
- Window connections are usually assigned double buffers if the producer and consumer kernels are mapped to different processors, or if the producer or the consumer is a DMA. This enables the two agents to operate in a pipelined manner using ping-pong synchronization with two locks. The AI Engine compiler automatically generates this synchronization in the respective processor main functions.
- If the producer and consumer kernels are mapped to the same processor, then the window connection is given only one buffer and no lock synchronization is needed because the kernels are executed sequentially.
- Run-time parameter connections are always assigned double buffers along with a selector word to choose the next buffer to be accessed.
Sometimes, with window connections, it is desirable to use only single buffer synchronization instead of double buffers. This is useful when the local data memory is at a premium and the performance penalty of using a single buffer for data transfer is not critical. This can be achieved using the single_buffer(port<T>&) constraint.
single_buffer(first.in[0]);
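The performance trade-off between single and double buffering can be seen with a back-of-the-envelope timing model (a sketch, not a simulator; the two functions below are hypothetical helpers using the classic pipeline formula):

```cpp
// Time for n window transfers when producer and consumer share one buffer:
// they strictly alternate, so per-iteration costs add.
int single_buffer_time(int n, int tProd, int tCons) {
    return n * (tProd + tCons);
}

// Time with ping-pong double buffering: after the first buffer is filled,
// producer and consumer overlap, so the slower agent sets the rate.
int double_buffer_time(int n, int tProd, int tCons) {
    int tMax = tProd > tCons ? tProd : tCons;
    return tProd + (n - 1) * tMax + tCons;
}
```

For example, with 10 iterations at 3 cycles to produce and 2 to consume, the serialized single-buffer schedule takes 50 units versus 32 for the pipelined double-buffer schedule, at the cost of one extra buffer of local memory.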
C++ Kernel Class Support
The AI Engine compiler supports C++ kernel classes. The following example shows how to set the filter coefficients and the number of samples of a FIR filter class through a constructor. The C++ kernel class allows the internal state of each kernel instance to be encapsulated within the corresponding class object. In the following code, the filter coefficients (coeffs) are specified through the constructor. This resolves the problem of using a file-scope variable, global variable, or static function-scope variable to store the internal state of a C function kernel: when multiple instances of such a kernel are mapped to the same core, the internal state variables are shared across the instances and cause conflicts.
//fir.h
#pragma once
#include "adf.h"
#define NUM_COEFFS 12
class FIR
{
private:
int32 coeffs[NUM_COEFFS];
int32 tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(const int32(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window_int32* in, output_window_int32* out);
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
}
};
You are required to write the static void registerKernelClass() method in the header file. Inside the registerKernelClass() method, you need to call the REGISTER_FUNCTION macro. This macro is used to register the class run method to be executed on the AI Engine core to perform the kernel functionality. In the preceding example, FIR::filter is registered using this macro. The kernel class constructor and run method should be implemented inside a separate source file. The implementation of a run method of a kernel class is the same as writing a kernel function as described in previous chapters.
//fir.cpp
//implementation in this example is not optimized and is for illustration purpose
#include "fir.h"
FIR::FIR(const int32(&coefficients)[NUM_COEFFS], uint32 samples)
{
for (int i = 0; i < NUM_COEFFS; i++)
coeffs[i] = coefficients[i];
for (int i = 0; i < NUM_COEFFS; i++)
tapDelayLine[i] = 0;
numSamples = samples;
}
void FIR::filter(input_window_int32* in, output_window_int32* out)
{
for (int i = 0; i < numSamples; i++)
{
for (int j = NUM_COEFFS-1; j > 0; j--)
tapDelayLine[j] = tapDelayLine[j - 1];
tapDelayLine[0] = window_readincr(in);
int32 y = 0;
for (int j = 0; j < NUM_COEFFS; j++)
{
y += coeffs[j] * tapDelayLine[j];
}
window_writeincr(out, y);
}
}
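The tap-delay-line arithmetic in FIR::filter above can be checked on a host with plain arrays standing in for the window API (fir_filter is a hypothetical verification helper, not part of ADF):

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Same computation as FIR::filter, with std::vector replacing the window
// reads and writes: shift the delay line, insert the new sample, accumulate.
std::vector<std::int32_t> fir_filter(const std::vector<std::int32_t>& coeffs,
                                     const std::vector<std::int32_t>& x) {
    std::vector<std::int32_t> taps(coeffs.size(), 0);
    std::vector<std::int32_t> y;
    for (std::int32_t sample : x) {
        for (std::size_t j = taps.size() - 1; j > 0; j--)
            taps[j] = taps[j - 1];        // shift the tap delay line
        taps[0] = sample;
        std::int32_t acc = 0;
        for (std::size_t j = 0; j < taps.size(); j++)
            acc += coeffs[j] * taps[j];   // multiply-accumulate
        y.push_back(acc);
    }
    return y;
}
```

As expected for an FIR structure, feeding a unit impulse reproduces the coefficient sequence as the output.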
//graph.h
#pragma once
#include "adf.h"
#include "fir.h"
using namespace adf;
class mygraph : public graph
{
public:
input_port in1, in2;
output_port out1, out2;
kernel k1, k2;
mygraph()
{
//see lab8.3 for narrow filter coefficients
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
runtime<ratio>(k1) = 0.1;
source(k1) = "src/fir.cpp";
//see lab8.3 for wide filter coefficients
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
runtime<ratio>(k2) = 0.1;
source(k2) = "src/fir.cpp";
connect<window<32>>(in1, k1.in[0]);
connect<window<32>>(in2, k2.in[0]);
connect<window<32>>(k1.out[0], out1);
connect<window<32>>(k2.out[0], out2);
}
};
For a kernel class with a non-default constructor, you can specify the constructor parameter values in the arguments of kernel::create_object when creating a representation of a kernel instance. In the previous example, two FIR filter kernels (k1 and k2) are created using kernel::create_object<FIR>. k1 has filter coefficients { 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 } and k2 has filter coefficients { -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }. Both of them consume eight samples for each invocation.
The following code shows the AI Engine compiler generated program. The two FIR kernel objects are instantiated with the proper constructor parameters.
//Work/aie/x_y/src/x_y.cc
...
FIR i4({180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504}, 8);
FIR i5({-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539}, 8);
int main(void) {
...
// Kernel call : i4:filter
i4.filter(get_input_window_int32(window_buf0_buf0d),get_output_window_int32(window_buf2_buf2d));
...
// Kernel call : i5:filter
i5.filter(get_input_window_int32(window_buf1_buf1d),get_output_window_int32(window_buf3_buf3d));
...
}
A kernel class can have a member variable occupying a significant amount of memory space that might not fit into program memory. The location of the kernel class member variable can be controlled. The AI Engine compiler supports array reference member variables, which allow the compiler to allocate or constrain the memory space while passing the reference to the object.
//fir.h
#pragma once
#include "adf.h"
#define NUM_COEFFS 12
class FIR
{
private:
int32 (&coeffs)[NUM_COEFFS];
int32 tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(int32(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window_int32* in, output_window_int32* out);
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
REGISTER_PARAMETER(coeffs);
}
};
//fir.cpp
#include "fir.h"
FIR::FIR(int32(&coefficients)[NUM_COEFFS], uint32 samples)
: coeffs(coefficients)
{
for (int i = 0; i < NUM_COEFFS; i++)
tapDelayLine[i] = 0;
numSamples = samples;
}
void FIR::filter(input_window_int32* in, output_window_int32* out)
{
...
}
The previous example shows a slightly modified version of the FIR kernel class. Here, the member variable coeffs has the int32 (&)[NUM_COEFFS] data type. The constructor initializer coeffs(coefficients) initializes coeffs to the reference to an array allocated externally to the class object. To let the AI Engine compiler know that the coeffs member variable is intended to be allocated by the compiler, you must use REGISTER_PARAMETER to register the array reference member variable inside the registerKernelClass() method.
The use of kernel::create_object to create a representation of a FIR kernel instance and to specify the initial values of the constructor parameters is the same as in the previous example. See the following code.
//graph.h
...
class mygraph : public graph
{
...
mygraph()
{
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
...
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
...
}
};
The following code shows the corresponding AI Engine compiler generated program. The memory spaces for int32 i4_coeffs[12] and int32 i5_coeffs[12] are outside the kernel object instances and are passed into the FIR objects by reference.
//Work/aie/x_y/src/x_y.cc
int32 i4_coeffs[12] = {180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504};
FIR i4(i4_coeffs, 8);
int32 i5_coeffs[12] = {-21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539};
FIR i5(i5_coeffs, 8);
int main(void) {
...
// Kernel call : i4:filter
i4.filter(get_input_window_int32(window_buf0_buf0d),get_output_window_int32(window_buf2_buf2d));
...
// Kernel call : i5:filter
i5.filter(get_input_window_int32(window_buf1_buf1d),get_output_window_int32(window_buf3_buf3d));
...
}
Because the memory space for an array reference member variable is allocated by the AI Engine compiler, a location constraint can be applied to constrain the memory location of these arrays, as shown in the following example code. The REGISTER_PARAMETER macro allows kernel::create_object to create a parameter handle for an array reference member variable, such as k1.param[0] and k2.param[0], to which the location<parameter> constraint can be applied.
//graph.h
...
class mygraph : public graph
{
...
mygraph()
{
k1 = kernel::create_object<FIR>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
...
k2 = kernel::create_object<FIR>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539 }), 8);
...
location<parameter>(k1.param[0]) = address(…);
location<parameter>(k2.param[0]) = bank(…);
}
};
The C++ kernel class header files and the C++ kernel function template (see C++ Template Support) should not contain single-core specific intrinsic APIs and pragmas. This is the same programming guideline as writing regular C function kernels. This is because these header files are included in the graph header file and can be cross-compiled as part of the PS program. The Arm® cross-compiler cannot understand single-core intrinsic APIs or pragmas. Single-core specific programming content must be kept inside the source files.
C++ Template Support
A template is a powerful tool in C++. By passing the data type as a parameter, you eliminate the need to rewrite code to support different data types. Templates are expanded at compile time, like macros. The difference is that the compiler performs type checking before template expansion. The source code contains template function and class definitions, but the compiled code can contain multiple copies of the same function or class. Type parameters, non-type parameters, default arguments, scalar parameters, and template parameters can be passed to a template, and the compiler instantiates the function or class accordingly.
- Support for general C++ template features.
- Supported data types (T) and connection types between kernels:
  - Data type (T): int8, uint8, int16, uint16, cint16, int32, uint32, cint32, int64, uint64, float, cfloat. IMPORTANT: The acc48 and cacc48 data types are not supported in template stream connections.
  - Function parameter type: input_window<T>, output_window<T>, input_stream<T>, output_stream<T>
- The compiler does not support pre-compiled headers for template kernels.
Function Templates
Function template source code defines a generic function that can be used for different data types. Example function template:
// add.h
template<typename ELEMENT_TYPE, int FACTOR, size_t NUM_SAMPLES> void add(input_window<ELEMENT_TYPE>* in,
output_window<ELEMENT_TYPE>* out);
// add.cpp
template<typename ELEMENT_TYPE, int FACTOR, size_t NUM_SAMPLES> void add(input_window<ELEMENT_TYPE>* in,
output_window<ELEMENT_TYPE>* out)
{
for (int i=0; i<NUM_SAMPLES; i++)
{
ELEMENT_TYPE value = window_readincr(in);
value += FACTOR;
window_writeincr(out, value);
}
}
// graph.h
mygraph()
{
k[0] = kernel::create(add<int32, 6, 8>);
k[1] = kernel::create(add<int16, 3, 8>);
for (int i=0; i<NUM_KERNELS; i++)
{
runtime<ratio>(k[i]) = 0.3;
source(k[i]) = "src/add.cpp";
}
connect<window<32>>(in[0], k[0].in[0]);
connect<window<32>>(k[0].out[0], out[0]);
connect<window<16>>(in[1], k[1].in[0]);
connect<window<16>>(k[1].out[0], out[1]);
}
where:
- add.h declares the template add() function.
- add.cpp defines the code for the template add() function.
- graph.h uses the template add() function within the mygraph class.
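The instantiation behavior of the add<> template can be checked on a host with std::array standing in for the window API (a sketch under that substitution; the window-based signature in add.h is the real kernel interface):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Array-based analogue of the add<> function template: each instantiation
// bakes in its element type, FACTOR, and sample count at compile time.
template <typename ELEMENT_TYPE, int FACTOR, std::size_t NUM_SAMPLES>
std::array<ELEMENT_TYPE, NUM_SAMPLES>
add(const std::array<ELEMENT_TYPE, NUM_SAMPLES>& in) {
    std::array<ELEMENT_TYPE, NUM_SAMPLES> out{};
    for (std::size_t i = 0; i < NUM_SAMPLES; i++)
        out[i] = in[i] + FACTOR;   // same body for every instantiation
    return out;
}
```

Just as the graph creates kernels from add<int32, 6, 8> and add<int16, 3, 8>, each instantiation here is a distinct compiled function with its FACTOR fixed.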
Class Templates
Like function templates, class templates are useful when a class defines an object that is independent of a specific data type. Example class template:
// fir.h
...
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> class FIR
{
private:
ELEMENT_TYPE (&coeffs)[NUM_COEFFS];
ELEMENT_TYPE tapDelayLine[NUM_COEFFS];
uint32 numSamples;
public:
FIR(ELEMENT_TYPE(&coefficients)[NUM_COEFFS], uint32 samples);
void filter(input_window<ELEMENT_TYPE>* in, output_window<ELEMENT_TYPE>* out);
//user needs to write this function to register necessary info
static void registerKernelClass()
{
REGISTER_FUNCTION(FIR::filter);
REGISTER_PARAMETER(coeffs);
}
};
// fir.cpp
...
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> FIR<NUM_COEFFS, ELEMENT_TYPE>::FIR(ELEMENT_TYPE(&coefficients)[NUM_COEFFS], uint32 samples):coeffs(coefficients)
{
...
}
template<size_t NUM_COEFFS, typename ELEMENT_TYPE> void FIR<NUM_COEFFS, ELEMENT_TYPE>::filter(input_window<ELEMENT_TYPE>* in, output_window<ELEMENT_TYPE>* out)
{
...
}
// graph.h
...
mygraph()
{
k1 = kernel::create_object<FIR<12, int32>>(std::vector<int>({ 180, 89, -80, -391, -720, -834, -478, 505, 2063, 3896, 5535, 6504 }), 8);
runtime<ratio>(k1) = 0.1;
source(k1) = "src/fir.cpp";
headers(k1) = { "src/fir.h" };
k2 = kernel::create_object<FIR<15, int32>>(std::vector<int>({ -21, -249, 319, -78, -511, 977, -610, -844, 2574, -2754, -1066, 18539, 0, 0, 0 }), 8);
runtime<ratio>(k2) = 0.1;
source(k2) = "src/fir.cpp";
headers(k2) = { "src/fir.h" };
...
}
where:
- fir.h defines a class template where class FIR is declared.
- fir.cpp contains the class FIR implementation, including the class FIR member function filter implementation.
- graph.h demonstrates the template class FIR instantiation within the mygraph class.