Run-Time Graph Control API
This chapter describes the control APIs that can be used to initialize, run, update, and control the graph execution from an external controller. This chapter also describes how run-time parameters can be specified in the input graph specification that affect the data processing within the kernels and change the control flow of the overall graph synchronously or asynchronously.
Graph Execution Control
In Versal™ ACAPs with AI Engines, the processing system (PS) can be used to dynamically load, monitor, and control the graphs that are executing on the AI Engine array. Even if the AI Engine graph is loaded once as a single bitstream image, the PS program can be used to monitor the state of the execution and modify the run-time parameters of the graph.
The graph base class provides a number of API methods to control the initialization and execution of the graph that can be used in the main program. See Adaptive Data Flow Graph Specification Reference for more details.
Basic Iterative Graph Execution
The following example illustrates how to use the graph control APIs to initialize, run, wait, and terminate a graph for a specific number of iterations. A graph object mygraph is declared using a pre-defined graph class called simpleGraph. Then, in the main application, this graph object is initialized and run. The init() method loads the graph onto the AI Engine array at the prespecified AI Engine tiles. This includes loading the ELF binaries for each AI Engine, configuring the stream switches for routing, and configuring the DMAs for I/O. It leaves the processors in a disabled state. The run() method starts the graph execution by enabling the processors. The run API is where a specific number of iterations of the graph can be run by supplying a positive integer argument at run time. This form is useful for debugging your graph execution.
#include "project.h"
simpleGraph mygraph;
int main(void) {
mygraph.init();
mygraph.run(3); // run 3 iterations
mygraph.wait(); // wait for 3 iterations to finish
mygraph.run(10); // run 10 iterations
mygraph.end(); // wait for 10 iterations to finish
return 0;
}
wait() is used to wait for the first run to finish before starting the second run. wait has the same blocking effect as end, except that it allows re-running the graph without having to re-initialize it. Calling run back-to-back without an intervening wait for the previous run to finish can have an unpredictable effect, because the run API modifies the loop bounds of the active processors of the graph.
Finite Execution of Graph
For finite graph execution, the graph state is maintained across graph.run(n) calls. The AI Engine is not reinitialized and memory contents are not cleared after graph.run(n). In the following code example, after the first run of three invocations, the core-main wrapper code is left in a state where the kernel will start with the pong buffer in the next run (of ten iterations). The ping-pong buffer selector state is left as-is. graph.end() does not clean up the graph state (specifically, it does not re-initialize global variables), nor does it clean up stream switch configurations; it merely exits the core-main. To re-run the graph, you have to reload the PDI/XCLBIN.
#include "project.h"
simpleGraph mygraph;
int main(void) {
mygraph.init();
mygraph.run(3); // run 3 iterations
mygraph.wait(); // wait for 3 iterations to finish
mygraph.run(10); // run 10 iterations
mygraph.end(); // wait for 10 iterations to finish
return 0;
}
Infinite Graph Execution
The following example illustrates how to run the graph infinitely.
#include "project.h"
simpleGraph mygraph;
int main(void) {
mygraph.init(); // load the graph
mygraph.run(); // start the graph
return 0;
}
A graph object mygraph is declared using a pre-defined graph class called simpleGraph. Then, in the main application, this graph object is initialized and run. The init() method loads the graph onto the AI Engine array at the prespecified AI Engine tiles. This includes loading the ELF binaries for each AI Engine, configuring the stream switches for routing, and configuring the DMAs for I/O. It leaves the processors in a disabled state. The run() method starts the graph execution by enabling the processors. This graph runs forever because the number of iterations to be run is not provided to the run() method.
graph::run() without an argument runs the AI Engine kernels for the number of iterations specified in the previous run call; if no iteration count has ever been specified, the graph runs infinitely. If the graph is run with a finite number of iterations, for example,
mygraph.run(3);
mygraph.run();
the second run call also runs for three iterations.
Parallel Graph Execution
Among the above API methods, only the wait() and end() methods are blocking operations that can block the main application indefinitely. Therefore, if you declare multiple graphs at the top level, you need to interleave the APIs suitably to execute the graphs in parallel, as shown in the following example.
#include "project.h"
simpleGraph g1, g2, g3;
int main(void) {
g1.init(); g2.init(); g3.init();
g1.run(<num-iter>); g2.run(<num-iter>); g3.run(<num-iter>);
g1.end(); g2.end(); g3.end();
return 0;
}
Each graph can be started (run) only after it has been initialized (init). Also, to get parallel execution, all the graphs must be started (run) before any graph is waited upon for termination (end).
Timed Execution
In multi-rate graphs, not all kernels need to execute the same number of iterations. In such situations, a timed execution model is more suitable for testing. There are variants of the wait and end APIs that take a positive integer specifying a cycle timeout. This is the number of AI Engine cycles that the API call blocks before disabling the processors and returning. The blocking condition does not depend on any graph termination event; the graph can be in an arbitrary state when the timeout expires.
#include "project.h"
simpleGraph mygraph;
int main(void) {
mygraph.init();
mygraph.run();
mygraph.wait(10000); // wait for 10000 AI Engine cycles
mygraph.resume(); // continue executing
mygraph.end(15000); // wait for another 15000 cycles and terminate
}
resume() is used to resume execution from the point at which it was stopped after the first timeout. resume only resets the timer and enables the AI Engines. Calling resume after the AI Engine execution has already terminated has no effect.
Run-Time Parameter Specification
The data flow graphs shown until now are defined completely statically. However, in real situations you might need to modify the behavior of the graph based on some dynamic condition or event. The required modification could be in the data being processed, for example a modified mode of operation or a new coefficient table, or it could be in the control flow of the graph, such as conditional execution or dynamically reconfiguring a graph with another graph. Run-time parameters are useful in such situations. Either the kernels or the graphs can be defined to execute with parameters. Additional graph APIs are also provided to update or read these parameter values while the graph is running.
Two types of run-time parameters are supported. The first is the asynchronous or sticky parameters which can be changed at any time by either a controlling processor such as the Processing System (PS), or by another AI Engine kernel. They are read each time a kernel is invoked without any specific synchronization. These types of parameters can be used as filter coefficients that change infrequently, for example.
Synchronous or triggering parameters are the other type of supported run-time parameters. A kernel that requires a triggering parameter does not execute until these parameters have been written by a controlling processor. Upon a write, the kernel executes once, reading the new updated value. After completion, the kernel is blocked from executing until the parameter is updated again. This allows a different type of execution model from the normal streaming model, which can be useful for certain updating operations where blocking synchronization is important.
Run-time parameters can either be scalar values or array
values. In the case where a controlling processor (such as the PS) is responsible for
the update, the graph.update()
API should be used.
Specifying Run-Time Data Parameters
Parameter Inference
If an integer scalar value appears in the formal arguments of a kernel
function, then that parameter becomes a run-time parameter. In the following
example, the argument select
is a run-time
parameter.
#ifndef FUNCTION_KERNELS_H
#define FUNCTION_KERNELS_H
void simple_param(input_window_cint16 * in, output_window_cint16 * out, int select);
#endif
Scalar run-time parameters are supported for the following data types: int8, int16, int32, int64, uint8, uint16, uint32, uint64, cint16, cint32, float, and cfloat. Run-time parameters can also be arrays, as shown in the following filter_with_array_param function.
#ifndef FUNCTION_KERNELS_H
#define FUNCTION_KERNELS_H
void filter_with_array_param(input_window_cint16 * in, output_window_cint16 * out, const int32 (&coefficients)[32]);
#endif
Implicit ports are inferred for each parameter in the function argument, including the array parameters. The following table describes the type of port inferred for each function argument.
Formal Parameter | Port Class
---|---
T | Input
const T | Input
T & | Inout
const T & | Input
const T (&)[…] | Input
T (&)[…] | Inout
From the table, you can see that when the AI Engine cannot make externally visible changes to the function parameter, an input port is inferred. When the formal parameter is passed by value, a copy is made, so changes to that copy are not externally visible. When a parameter is passed with a const qualifier, the parameter cannot be written, so these are also treated as input ports.
When the AI Engine kernel is passed a parameter reference and it is able to modify it, an inout port is inferred and can be used to pass parameters between AI Engine kernels or to allow reading back of results from the control processor.
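As a hedged illustration only (the kernel and argument names below are hypothetical, and the window arguments are shown just for context), the following declaration sketches how each argument form from the table maps to an inferred port class:
void port_class_demo(input_window_cint16 * in,    // data input window (not a run-time parameter)
                     output_window_cint16 * out,  // data output window (not a run-time parameter)
                     int32 mode,                  // T passed by value  -> input port
                     const int32 &gain,           // const T &          -> input port
                     int32 &status,               // T &                -> inout port
                     const int32 (&coeffs)[16],   // const T (&)[...]   -> input port
                     int32 (&history)[16]);       // T (&)[...]         -> inout port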
The value of an inout port can be read back by the controlling processor using graph::read(). An inout port cannot be updated by graph::update(). If a parameter needs to be both written by the controlling processor and read back, it must appear twice in the kernel arg list, once as an input and once as an inout, for example, kernel_function(int32 foo_in, int32 &foo_out).
Parameter Hookup
Both input and inout run-time parameter ports can be connected to corresponding hierarchical ports in their enclosing graph. This is the mechanism by which parameters are exposed for run-time modification. In the following graph, an instance is created of the previously defined simple_param kernel. This kernel has two input ports and one output port. The first argument to appear in the argument list, in[0], is an input window. The second argument is an output window. The third argument is a run-time parameter (it is not a window or stream type) and is inferred as an input parameter, in[1], because it is passed by value.
In the following graph definition, a simple_param kernel is instantiated and windows are connected to in[0] and out[0] (the input and output windows of the kernel). The run-time parameter is connected to the graph input port, select_value.
class parameterGraph : public graph {
private:
  kernel first;
public:
  input_port select_value;
  input_port in;
  output_port out;
  parameterGraph() {
    first = kernel::create(simple_param);
    connect< window<32> >(in, first.in[0]);
    connect< window<32> >(first.out[0], out);
    connect<parameter>(select_value, first.in[1]);
  }
};
An array parameter can be hooked up in the same way. The compiler automatically allocates space for the array data so that it is accessible from the processor where this kernel gets mapped.
class arrayParameterGraph : public graph {
private:
  kernel first;
public:
  input_port coeffs;
  input_port in;
  output_port out;
  arrayParameterGraph() {
    first = kernel::create(filter_with_array_param);
    connect< window<32> >(in, first.in[0]);
    connect< window<32> >(first.out[0], out);
    connect<parameter>(coeffs, first.in[1]);
  }
};
Input Parameter Synchronization
The default behavior for input run-time parameter ports is triggering behavior. This means that the parameter plays a part in the rules that determine when a kernel can fire. In this graph example, the kernel only fires when three conditions are met:
- A valid window of 32 bytes of input data is available
- An empty window of 32 bytes is available for the output data
- A write to the input parameter takes place
In triggering mode, a single write to the input parameter allows the kernel to fire once, setting the input parameter value on every individual kernel call.
There is an alternative mode that allows input kernel parameters to be set asynchronously. To specify that parameters update asynchronously, use the async modifier when connecting a port.
connect<parameter>(param_port, async(first.in[1]));
When a kernel port is designated as asynchronous, it no longer plays a role in the firing rules for the kernel. When the parameter is written once, the value is observed in subsequent firings. At any time, the PS can write a new value for the run-time parameter. That value is observed on the next and any subsequent kernel firing.
Inout Parameter Synchronization
The default behavior for inout run-time parameter ports is asynchronous behavior. This means that the parameter can be read back by the controlling processor or another kernel, but the producer kernel execution is not affected. For synchronous behavior from the inout parameter, where the kernel blocks until the parameter value is read out on each invocation of the kernel, use the sync modifier when connecting the inout port to the enclosing graph, as follows.
connect<parameter>(sync(first.out[1]), param_port);
Run-Time Parameter Update/Read Mechanisms
This section describes the mechanisms to update or read back the run-time parameters. For these types of applications, it is usually better not to specify an iteration limit at compile time to allow the cores to run freely and monitor the effect of the parameter change.
Parameter Update/Read Using Graph APIs
In default compilation mode, the main
application is compiled as a separate control thread which needs to be executed on
the PS in parallel with the graph executing on the AI Engine array. The main
application
can use update and read APIs to access run-time parameters declared within the
graphs at any level. This section describes these APIs using examples.
Synchronous Update/Read
The following code shows the main application for the parameterGraph example (which contains the simple_param kernel) described in Specifying Run-Time Data Parameters.
#include "param.h"
parameterGraph mygraph;
int main(void) {
mygraph.init();
mygraph.run(2);
mygraph.update(mygraph.select_value, 23);
mygraph.update(mygraph.select_value, 45);
mygraph.end();
return 0;
}
In this example, the graph mygraph is initialized first and then run for two iterations. It has a triggered input parameter port select_value that must be updated with a new value for each invocation of the receiving kernel. The first argument of the update API identifies the port to be updated and the second argument provides the value. Several other forms of the update API are supported, based on the direction of the port, its data type, and whether it is a scalar or array parameter; see Adaptive Data Flow Graph Specification Reference.
If the program is compiled with a fixed number of test iterations, then for triggered parameters the number of update API calls in the main program must match the number of test iterations; otherwise, the simulation could be left waiting for additional updates. For asynchronous parameters, updates are done asynchronously with the graph execution, and the kernel uses the old value if an update has not been made.
Additionally, if the previous graph was compiled with a synchronous inout parameter, the update and read calls must be interleaved as shown in the following example.
#include "param.h"
parameterGraph mygraph;
int main(void) {
int result0, result1;
mygraph.init();
mygraph.run(2);
mygraph.update(mygraph.select_value, 23);
mygraph.read(mygraph.result_out, result0);
mygraph.update(mygraph.select_value, 45);
mygraph.read(mygraph.result_out, result1);
mygraph.end();
return 0;
}
In this example, it is assumed that the graph produces a scalar result every iteration through the inout port result_out. The read API is used to read out the value of this port synchronously after each iteration. The first argument of the read API is the graph inout port to be read back and the second argument is the location where the value will be stored (passed by reference).
The synchronous protocol ensures that the read operation waits for the value to be produced by the graph before sampling it, and that the graph waits for the value to be read before proceeding to the next iteration. This is why it is important to interleave the update and read operations.
Asynchronous Update/Read
When an input parameter is specified with the asynchronous protocol, the kernel execution waits for the first update to happen for parameter initialization. However, an arbitrary number of kernel invocations can take place before the next update. This is usually the intent of an asynchronous update during application deployment. For debugging, though, the wait API can be used to finish a predetermined set of iterations before the next update, as shown in the following example.
#include "param.h"
asyncGraph mygraph;
int main(void) {
int result0, result1;
mygraph.init();
mygraph.update(mygraph.select_value, 23);
mygraph.run(5);
mygraph.wait();
mygraph.update(mygraph.select_value, 45);
mygraph.run(15);
mygraph.end();
return 0;
}
In the previous example, after the initial update, five iterations are run to completion followed by another update, then followed by another set of 15 iterations. If the graph has asynchronous inout ports, that data can also be read back immediately after the wait (or end).
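For example, if asyncGraph also declared an asynchronous inout port (a hypothetical result_out port is assumed here, mirroring the earlier synchronous example), the produced value could be sampled right after the wait:
mygraph.run(5);
mygraph.wait();                            // all five iterations have finished
mygraph.read(mygraph.result_out, result0); // sample the asynchronous inout value
mygraph.update(mygraph.select_value, 45);  // then supply the next parameter value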
Another template for asynchronous updates is to use timeouts in the wait API, as shown in the following example.
#include "param.h"
asyncGraph mygraph;
int main(void) {
int result0, result1;
mygraph.init();
mygraph.run();
mygraph.update(mygraph.select_value, 23);
mygraph.wait(10000);
mygraph.update(mygraph.select_value, 45);
mygraph.resume();
mygraph.end(15000);
return 0;
}
In this example, the graph is set up to run forever. However, after the run API is called, the kernel still waits for the first update to happen for parameter initialization. The graph then runs for approximately 10,000 cycles before allowing the control thread to make another update. The new update takes effect at the next kernel invocation boundary. The graph is then allowed to run for another 15,000 cycles before terminating.
Chained Updates Between AI Engine Kernels
The previous run-time parameter examples highlight the ability to do run-time parameter updates from the control processor. It is also possible to propagate parameter updates between AI Engines. If an inout port on a kernel is connected to an input port on another kernel, then a chain of updates can be triggered through multiple AI Engines. Consider the two kernels defined in the following code. The producer has an input port that reads a scalar integer and an inout port that can read and write to an array of 32 integers. The consumer has an input port that can read an array of coefficients, and an output port that can write a window of data.
#ifndef FUNCTION_KERNELS_H
#define FUNCTION_KERNELS_H
void producer(const int32 &, int32 (&)[32] );
void consumer(const int32 (&)[32], output_window_cint16 *);
#endif
As shown in the following graph, the PS updates the scalar input of the producer kernel. When the producer kernel is run, it automatically triggers execution of the consumer kernel (when a buffer is available for the output data).
#include <adf.h>
#include "kernels.h"
using namespace adf;
class chainedGraph : public graph {
private:
  kernel first;
  kernel second;
public:
  input_port select_value;
  output_port out;
  chainedGraph() {
    first = kernel::create(producer);
    second = kernel::create(consumer);
    connect< window<32> >(second.out[0], out);
    connect<parameter>(select_value, first.in[0]);
    connect<parameter>(first.inout[0], second.in[0]);
  }
};
If the intention is to make a one-time update of values that are used in continuous processing of streams, the consumer parameter input port can use the async modifier to ensure that it runs continuously (once a parameter has been provided).
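Applied to the chainedGraph above, this amounts to adding the async modifier on the consumer input of the chained connection (a minimal variant of the connection shown earlier):
connect<parameter>(first.inout[0], async(second.in[0]));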
Run-Time Graph Reconfiguration Using Control Parameters
The run-time parameters are also used to switch the flow of data within the graph and provide alternative routes for processing dynamically. The most basic version of this type of processing is a kernel bypass that allows the data to be processed or pass-through based on a run-time parameter (see Kernel Bypass). This can be useful, for example, in multi-modal applications where switching from one mode to another requires bypassing a kernel.
Bypass Control Using Run-Time Parameters
The following figure shows an application supporting two channels of signal data, where one is split into two channels of lower bandwidth while the other must continue to run undisturbed. This type of dynamic reconfiguration is common in wireless applications.
In the figure, the first channel processes LTE20 data unchanged, while the middle channel is dynamically split into two LTE10 channels. The control parameters marked as carrier configuration RTP are used to split the data processing on a block boundary. When the middle channel is operating as an LTE20 channel, the 11-tap half-band kernel is bypassed. However, when the bandwidth of the middle channel is split between itself and the third channel forming two LTE10 channels, both of them need a 3-stage filter chain before the data can be mixed together. This is achieved by switching the 11-tap half-band filter back into the flow and reconfiguring the mixer to handle three streams of data instead of two.
The top-level input graph specification for the above application is shown in the following code.
class lte_reconfig : public graph {
private:
  kernel demux;
  kernel cf[3];
  kernel interp0[3];
  kernel interp1[2];
  bypass bphb11;
  kernel delay;
  kernel delay_byp;
  bypass bpdelay;
  kernel mixer;
public:
  input_port in;
  input_port fromPS;
  output_port out;
  lte_reconfig() {
    // demux also handles the control
    demux = kernel::create(demultiplexor);
    connect< window<1536> >(in, demux.in[0]);
    connect< parameter >(fromPS, demux.in[1]);
    runtime<ratio>(demux) = 0.1;
    source(demux) = "kernels/demux.cc";
    // instantiate all channel kernels
    for (int i=0; i<3; i++) {
      cf[i] = kernel::create(fir_89t_sym);
      source(cf[i]) = "kernels/fir_89t_sym.cc";
      runtime<ratio>(cf[i]) = 0.12;
    }
    for (int i=0; i<3; i++) {
      interp0[i] = kernel::create(fir_23t_sym_hb_2i);
      source(interp0[i]) = "kernels/hb23_2i.cc";
      runtime<ratio>(interp0[i]) = 0.1;
    }
    for (int i=0; i<2; i++) {
      interp1[i] = kernel::create(fir_11t_sym_hb_2i);
      source(interp1[i]) = "kernels/hb11_2i.cc";
      runtime<ratio>(interp1[i]) = 0.1;
    }
    bphb11 = bypass::create(interp1[0]);
    mixer = kernel::create(mixer_dynamic);
    source(mixer) = "kernels/mixer_dynamic.cc";
    runtime<ratio>(mixer) = 0.4;
    delay = kernel::create(sample_delay);
    source(delay) = "kernels/delay.cc";
    runtime<ratio>(delay) = 0.1;
    delay_byp = kernel::create(sample_delay);
    source(delay_byp) = "kernels/delay.cc";
    runtime<ratio>(delay_byp) = 0.1;
    bpdelay = bypass::create(delay_byp);
    // Graph connections
    for (int i=0; i<3; i++) {
      connect< window<512, 352> >(demux.out[i], cf[i].in[0]);
      connect< parameter >(demux.inout[i], cf[i].in[1]);
    }
    connect< parameter >(demux.inout[3], bphb11.bp);
    connect< parameter >(demux.inout[3], negate(bpdelay.bp));
    for (int i=0; i<3; i++) {
      connect< window<512, 64> >(cf[i].out[0], interp0[i].in[0]);
      connect< parameter >(cf[i].inout[0], interp0[i].in[1]);
    }
    // chan0 is LTE20 and is output right away
    connect< window<1024, 416> >(interp0[0].out[0], delay.in[0]);
    connect< window<1024> >(delay.out[0], mixer.in[0]);
    // chan1 is LTE20/10 and uses bypass
    connect< window<1024, 32> >(interp0[1].out[0], bphb11.in[0]);
    connect< parameter >(interp0[1].inout[0], bphb11.in[1]);
    connect< window<1024, 416> >(bphb11.out[0], bpdelay.in[0]);
    connect< window<1024> >(bpdelay.out[0], mixer.in[1]);
    // chan2 is LTE10 always
    connect< window<512, 32> >(interp0[2].out[0], interp1[1].in[0]);
    connect< parameter >(interp0[2].inout[0], interp1[1].in[1]);
    connect< window<1024> >(interp1[1].out[0], mixer.in[2]);
    // Mixer
    connect< parameter >(demux.inout[3], mixer.in[3]);
    connect< window<1024> >(mixer.out[0], out);
  }
};
The bypass specification is coded as a special encapsulator over the kernel to be bypassed. The port signature of the bypass matches the port signature of the kernel that it encapsulates. It also receives a run-time parameter to control the bypass: 0 for no bypass and 1 for bypass. The control can also be inverted by using the negate function, as shown.
The bypass parameter port of this graph is an ordinary scalar run-time parameter and can be driven by another kernel or by the Arm® processor using the interactive or scripted mechanisms described in Run-Time Parameter Update/Read Mechanisms. This can also be connected hierarchically by embedding it into an enclosing graph.
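As a hedged sketch only, the following main application drives the reconfiguration from the PS. The header name and the control values are assumptions: the demultiplexor kernel defines how the control word is interpreted, and the timing pattern assumes the fromPS parameter behaves asynchronously (for example, connected with the async modifier) so that the graph keeps running between updates.
#include "project.h" // assumed top-level header declaring lte_reconfig
lte_reconfig mygraph;
int main(void) {
  mygraph.init();
  mygraph.run();
  mygraph.update(mygraph.fromPS, 0); // illustrative value: middle channel runs as LTE20 (11-tap filter bypassed)
  mygraph.wait(10000);               // let the graph run for a while
  mygraph.update(mygraph.fromPS, 1); // illustrative value: split into two LTE10 channels (filter switched in)
  mygraph.resume();
  mygraph.end(15000);                // run for another 15000 cycles, then terminate
  return 0;
}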
Sharing Run-Time Parameters Across Multiple Kernels
The run-time parameter to switch channels in the previous graph is shared by the bypass encapsulator and the mixer kernel. Both entities need to see the same switched value at the same data boundary. When the nodes sharing the run-time parameter are mapped to the same AI Engine, switching of the parameter value is synchronized because each node, mapped to the same processor, processes the current set of data before any node processes the next set of data. However, when the sharing kernels are mapped to different processors, they can execute in a pipelined fashion on different sets of data, as shown in the following figure. Then, the run-time parameter should be pipelined along with the data.
In the current release, you need to pipeline the control parameter through the kernel by making it an inout parameter on the producing kernel connected to an input parameter on the consuming kernel. Pipelining across processors can be intermixed with a one-to-many broadcast connection within a single processor to create arbitrary control topologies.
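For example (the kernel instance names here are illustrative), the control value enters the producing kernel as an input parameter and is re-exposed as an inout parameter that feeds the next kernel's input parameter, so the control value travels down the pipeline along with the data:
connect<parameter>(ctrl_from_ps, producer.in[1]);      // control enters the first kernel in the pipeline
connect<parameter>(producer.inout[0], consumer.in[1]); // forwarded downstream along with the data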
Run-Time Parameter Support Summary
This section summarizes the AI Engine run-time parameter (RTP) support status. For RTP support for the PL kernel inside the graph, see Run-Time Parameter Support Summary for PL Kernel.
AI Engine RTP (from/to PS) | Input: Synchronous | Input: Asynchronous | Output: Synchronous | Output: Asynchronous
---|---|---|---|---
Scalar | Default | Supported | Supported | Default
Array | Default | Supported | Supported | Default
Code snippets for RTP connections from or to the PS:
connect<parameter>(fromPS, first.in[0]); //Synchronous RTP, default for input
connect<parameter>(fromPS, sync(first.in[0])); //Synchronous RTP
connect<parameter>(fromPS, async(first.in[0])); //Asynchronous RTP
connect<parameter>(second.inout[0], toPS); //Asynchronous RTP, default for output
connect<parameter>(async(second.inout[0]), toPS); //Asynchronous RTP
connect<parameter>(sync(second.inout[0]), toPS); //Synchronous RTP
AI Engine RTP to AI Engine RTP (From \ To) | To: Synchronous | To: Asynchronous | To: Not Specified
---|---|---|---
From: Synchronous | Synchronous | Not Supported | Synchronous
From: Asynchronous | Not Supported | Asynchronous | Asynchronous
From: Not Specified | Synchronous | Asynchronous | Synchronous
Code snippets for RTP connections between AI Engines:
connect<parameter>(first.inout[0], second.in[0]); //Not specified for output and input. Synchronous RTP from first.inout to second.in
connect<parameter>(sync(first.inout[0]), second.in[0]); //Specify "sync" for output. Synchronous RTP from first.inout to second.in
connect<parameter>(first.inout[0], sync(second.in[0])); //Specify "sync" for input. Synchronous RTP from first.inout to second.in
connect<parameter>(sync(first.inout[0]), sync(second.in[0])); //Specify "sync" for both. Synchronous RTP from first.inout to second.in
connect<parameter>(async(first.inout[0]), async(second.in[0])); //Specify "async" for both. Asynchronous RTP from first.inout to second.in
connect<parameter>(first.inout[0], async(second.in[0])); //Specify "async" for input. Asynchronous RTP from first.inout to second.in
connect<parameter>(async(first.inout[0]), second.in[0]); //Specify "async" for output. Asynchronous RTP from first.inout to second.in
connect<parameter>(async(first.inout[0]), sync(second.in[0])); //Not supported
connect<parameter>(sync(first.inout[0]), async(second.in[0])); //Not supported