Window and Streaming Data API

Data flow graph kernels operate on data streams, which are infinitely long sequences of typed values. These data streams can be broken into separate blocks that are processed by a kernel: kernels consume input blocks of data and produce output blocks of data. Kernels can also access the data streams in a sample-by-sample fashion. The data access APIs for these two cases are described in this chapter.

Note: The data movement APIs described in this chapter apply to both vector and scalar, signed and unsigned data. However, the AI Engine architecture supports unsigned integer vector arithmetic only for the 8-bit data types v16uint8, v32uint8, v64uint8, and v128uint8. For scalar arithmetic, all standard C unsigned integer data types are supported: unsigned char (uint8), unsigned short (uint16), unsigned int (uint32), and unsigned long long (uint64).

Data Access Mechanisms

Window-based Access

The view that a kernel has of incoming blocks of data is called an input window. Input windows are declared with a type that identifies the data contained within the window. This example shows a declaration of an input window carrying complex integers where both the real and the imaginary parts are 16 bits wide.

input_window_cint16 myFirstWindow;

The view that a kernel has of outgoing blocks of data is called an output window. Again, these are defined by a type. This example shows a declaration of an output window carrying 32-bit integers.

output_window_int32 myOtherWindow;

These window data structures are automatically inferred by the AI Engine compiler from the data flow graph connections and are automatically declared in the wrapper code implementing the graph control. The kernel functions merely operate on pointers to the window data structures that are passed to them as arguments. There is no need to declare these window data structures in the data flow graph or kernel program.

Synchronous Window Access

A kernel reads from its input windows and writes to its output windows. By default, the synchronization that is required to wait for an input window of data, or to provide an empty output window, is performed before entering the kernel. There is no synchronization needed within the kernel to read or write the individual elements of data after the kernel has started.

The size of the window (in bytes) is declared along with the connection declaration between a producer and a consumer port as shown in the following (see Connections for details). This establishes a window connection of 128 bytes between port in and the first input port of the kernel.

connect< window<128> > net0 (in, first.in[0]);

An optional second template parameter identifies the overlap (in bytes) from one block of data to the next, which is sometimes also referred to as the margin, as shown in the following. If a margin parameter is specified, the total memory allocated is window size + margin size.

connect< window<128, 32> > net1 (in, first.in[0]);

These windows are designed to be accessed sequentially, and kernel code starts accessing a window from its first position. A useful model is therefore that of a current position, which can be advanced or rolled back on reads or writes. When a kernel starts, the current position is already set correctly. For example, the current position for an input window for a filter is on the first sample to restore to the delay line. This could be an older sample in the case of filters requiring overlap of incoming data samples, in which case the connection needs to be declared using the overlap or margin as described above. Similarly, the current position for an output window is on the first sample to send to the next block, irrespective of whether that block requires a duplication of older samples. The kernel is free to manipulate this current position, and the position does not need to be at the end of the block when the kernel completes. Window data types are implemented as circular buffers.
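
The wrap-around behavior of the current position can be pictured with a small host-side model. This is only a sketch under stated assumptions: RingWindow, mock_incr, and mock_decr are hypothetical names invented for illustration, they are not part of the AI Engine API, and the real implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a window: fixed storage plus a
// "current position" that wraps around. Not the AI Engine implementation.
struct RingWindow {
    std::vector<int> buf;  // backing storage, one slot per sample
    std::size_t pos;       // current position, always in [0, buf.size())

    explicit RingWindow(std::size_t samples) : buf(samples, 0), pos(0) {
        assert(samples > 0);
    }
};

// Advance the current position by count samples, wrapping at the end.
inline void mock_incr(RingWindow &w, std::size_t count) {
    w.pos = (w.pos + count) % w.buf.size();
}

// Roll the current position back by count samples, wrapping at the start.
inline void mock_decr(RingWindow &w, std::size_t count) {
    const std::size_t n = w.buf.size();
    w.pos = (w.pos + n - (count % n)) % n;
}
```

In this model, advancing past the last sample lands back at the start of the buffer, which is why a kernel does not have to leave the position at the end of the block.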

Note: The minimum size for window allocation is 16 bytes, and window size allocation is rounded up to a multiple of 16 bytes. A margin (overlap), when specified, must be at least 32 bytes and a multiple of 32 bytes.
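
As a worked illustration of these sizing rules, the following sketch computes the memory that a window connection would occupy. The helper mock_window_alloc_bytes is hypothetical and not part of the AI Engine tools. It assumes that the rounding to 16 bytes applies to the window size and that the margin is then added unchanged; the text does not state exactly where the rounding is applied.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a tool API): estimated bytes allocated for a
// connection declared as window<window_bytes, margin_bytes>.
// Assumption: the window part is rounded up to a multiple of 16 bytes
// (minimum 16) and the margin is added on top.
inline std::size_t mock_window_alloc_bytes(std::size_t window_bytes,
                                           std::size_t margin_bytes) {
    // A margin of 0 means "no margin"; otherwise it must be a multiple of 32.
    assert(margin_bytes % 32 == 0);
    std::size_t rounded = ((window_bytes + 15) / 16) * 16;
    if (rounded < 16) {
        rounded = 16;
    }
    return rounded + margin_bytes;
}
```

Under these assumptions, window<128, 32> occupies 160 bytes and a 100-byte window without a margin occupies 112 bytes.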

Note: In a multicast communication approach, all receivers are required to have the same window size. For example,

connect< window<128> > net0 (in, first.in[0]);
connect< window<128> > net1 (in, second.in[0]);

Asynchronous Window Access

In some situations, a kernel does not consume a window's worth of data on every invocation, or does not produce a window's worth of data on every invocation. In these cases, you can control the buffer synchronization by declaring the kernel port to be async, as shown in the following.

connect< window<128, 32> > net1 (in, async(first.in[0]));

This declaration tells the compiler to omit synchronization of the window buffer upon entry to the kernel. Inside the kernel code, you must call the window synchronization APIs before accessing the window with the read/write APIs, as shown in the following.

void super_kernel(input_window_int32 * data, output_window_int32 * result) {
  ...
  window_acquire(data);     // acquire input window unconditionally inside the kernel
  if (<somecondition>) {
    window_acquire(result); // acquire output window conditionally 
  }
  ...                       // do some computation with "data" and "result"
  window_release(data);     // release input window inside the kernel
  if (<somecondition>) {  
    window_release(result); // release output window conditionally
  }
  ...
}

The window_acquire API performs the appropriate synchronization and initialization to ensure that the window object is available for read or write. The API keeps track of the appropriate buffer pointers and locks to be acquired internally, even if the window is shared across AI Engine processors and can be double-buffered. This API can be called unconditionally or conditionally under dynamic control, and is potentially a blocking operation. It is your responsibility to ensure that the corresponding window_release API is executed some time later (possibly even in a subsequent kernel call) to release the lock associated with that window object. Incorrect synchronization can lead to a deadlock in your code.
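
The pairing requirement can be illustrated with a host-side model. This is a sketch only: MockLock, mock_acquire, mock_release, and mock_kernel are hypothetical names, and the real window_acquire and window_release operate on hardware locks and may block instead of asserting.

```cpp
#include <cassert>

// Hypothetical stand-in for the lock behind a window object.
struct MockLock {
    int held;          // 1 while acquired, 0 otherwise
    int acquisitions;  // total number of acquires so far
    MockLock() : held(0), acquisitions(0) {}
};

inline void mock_acquire(MockLock &l) {
    assert(l.held == 0);  // a second acquire without a release would block
    l.held = 1;
    ++l.acquisitions;
}

inline void mock_release(MockLock &l) {
    assert(l.held == 1);  // releasing a lock that is not held is a bug
    l.held = 0;
}

// Same shape as the kernel shown earlier: the input is always acquired,
// the output only when produce_output is true, and each acquire is
// matched by a release under the same condition.
inline void mock_kernel(MockLock &data, MockLock &result, bool produce_output) {
    mock_acquire(data);
    if (produce_output) {
        mock_acquire(result);
    }
    // ... computation on "data" and "result" would go here ...
    mock_release(data);
    if (produce_output) {
        mock_release(result);
    }
}
```

The model releases every lock within the same call for simplicity; as described above, the real release may also happen in a subsequent kernel call, as long as it eventually happens.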

Stream-based Access

With a stream-based access model, the kernels receive an input stream or an output stream of typed data as an argument. Each access to these streams is synchronized, i.e., reads stall if the data is not available in the stream and writes stall if the stream is unable to accept new data.
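
The stall conditions can be pictured with a bounded FIFO on the host. This is a minimal sketch with hypothetical names (MockStream, mock_try_write, mock_try_read); a real stream access stalls the kernel, whereas this model reports "would stall" by returning false.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bounded FIFO standing in for a stream connection.
struct MockStream {
    std::deque<int> fifo;
    std::size_t capacity;
    explicit MockStream(std::size_t cap) : capacity(cap) {
        assert(cap > 0);
    }
};

// Returns false (would stall) when the stream cannot accept new data.
inline bool mock_try_write(MockStream &s, int v) {
    if (s.fifo.size() >= s.capacity) {
        return false;
    }
    s.fifo.push_back(v);
    return true;
}

// Returns false (would stall) when no data is available in the stream.
inline bool mock_try_read(MockStream &s, int &v) {
    if (s.fifo.empty()) {
        return false;
    }
    v = s.fifo.front();
    s.fifo.pop_front();
    return true;
}
```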

An AI Engine supports two 32-bit input stream ports with id=0 or 1 and two 32-bit output stream ports with id=0 or 1. This ID is supplied as an argument to the stream object constructors. The AI Engine compiler automatically allocates the input and output stream port IDs from left to right in the argument list of a kernel. Multiple kernels mapped to the same AI Engine are not allowed to share stream ports unless the streams are packet switched (see Explicit Packet Switching).

There is also a direct stream communication channel between the accumulator register of one AI Engine and the physically adjacent core, called a cascade. The cascade stream is connected within the AI Engine array in a snake-like linear fashion from AI Engine processor to processor.

The stream data structures are automatically inferred by the AI Engine compiler from data flow graph connections, and are automatically declared in the wrapper code implementing the graph control. The kernel functions merely operate on pointers to stream data structures that are passed to them as arguments. There is no need to declare these stream data structures in data flow graph or kernel program.

Window Operations for Kernels

Window Data Types

Table 1. Supported Window Data Types
Input Window Types	Output Window Types
`input_window_int8`	`output_window_int8`
`input_window_int16`	`output_window_int16`
`input_window_int32`	`output_window_int32`
`input_window_int64`	`output_window_int64`
`input_window_uint8`	`output_window_uint8`
`input_window_uint16`	`output_window_uint16`
`input_window_uint32`	`output_window_uint32`
`input_window_uint64`	`output_window_uint64`
`input_window_cint16`	`output_window_cint16`
`input_window_cint32`	`output_window_cint32`
`input_window_float`	`output_window_float`
`input_window_cfloat`	`output_window_cfloat`

Moving the Current Read/Write Position Forward

In the following description, <input_window_type> stands for any of the allowed input window data types. Likewise, <output_window_type> stands for any of the allowed output window data types.

The following functions advance the current read/write position by count elements of the underlying window type.

void window_incr(<input_window_type> *w, int count);
void window_incr(<output_window_type> *w, int count);

The following functions advance the current read/write position by four times count elements of the underlying window type.

void window_incr_v4(<input_window_type> *w, int count);
void window_incr_v4(<output_window_type> *w, int count);

The following functions advance the current read/write position by eight times count elements of the underlying window type.

void window_incr_v8(<input_window_type> *w, int count);
void window_incr_v8(<output_window_type> *w, int count);

The following functions advance the current read/write position by 16 times count elements of the underlying window type.

void window_incr_v16(<input_window_type> *w, int count);
void window_incr_v16(<output_window_type> *w, int count);

The following functions advance the current read/write position by 32 times count elements of the underlying window type.

void window_incr_v32(<input_window_type> *w, int count);
void window_incr_v32(<output_window_type> *w, int count);

The following functions advance the current read/write position by 64 times count elements of the underlying window type.

void window_incr_v64(<input_window_type> *w, int count);
void window_incr_v64(<output_window_type> *w, int count);
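
To make the step sizes concrete, the following sketch computes how many bytes a call moves the current position. The helper mock_incr_bytes is hypothetical, and it assumes that the offset equals the vector width times count times the size of the element type, which is how the descriptions above read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes moved by window_incr (lanes == 1) or
// window_incr_v<lanes> for element type T, assuming
// offset = lanes * count * sizeof(T).
template <typename T>
inline std::size_t mock_incr_bytes(std::size_t lanes, std::size_t count) {
    assert(lanes == 1 || lanes == 4 || lanes == 8 ||
           lanes == 16 || lanes == 32 || lanes == 64);
    return lanes * count * sizeof(T);
}
```

Under this assumption, advancing a window of 32-bit integers with the v8 variant and count = 2 moves the position by 64 bytes.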

Moving the Current Read/Write Position Backward

In the following description, <input_window_type> stands for any of the allowed input window data types. Likewise, <output_window_type> stands for any of the allowed output window data types.

The following functions move the current read/write position back by count elements of the underlying window type.

void window_decr(<input_window_type> *w, int count);
void window_decr(<output_window_type> *w, int count);

The following functions move the current read/write position back by four times count elements of the underlying window type.

void window_decr_v4(<input_window_type> *w, int count);
void window_decr_v4(<output_window_type> *w, int count);

The following functions move the current read/write position back by eight times count elements of the underlying window type.

void window_decr_v8(<input_window_type> *w, int count);
void window_decr_v8(<output_window_type> *w, int count);

The following functions move the current read/write position back by 16 times count elements of the underlying window type.

void window_decr_v16(<input_window_type> *w, int count);
void window_decr_v16(<output_window_type> *w, int count);

The following functions move the current read/write position back by 32 times count elements of the underlying window type.

void window_decr_v32(<input_window_type> *w, int count);
void window_decr_v32(<output_window_type> *w, int count);

The following functions move the current read/write position back by 64 times count elements of the underlying window type.

void window_decr_v64(<input_window_type> *w, int count);
void window_decr_v64(<output_window_type> *w, int count);

Reading Data from an Input Window

The following code reads a scalar typed value from an input window of the same type. The current position is not modified and both functional form (returns the value) and procedural form (modifies a reference argument) are provided.

int8 window_read(input_window_int8 *w);
int16 window_read(input_window_int16 *w);
int32 window_read(input_window_int32 *w);
int64 window_read(input_window_int64 *w);
uint8 window_read(input_window_uint8 *w);
uint16 window_read(input_window_uint16 *w);
uint32 window_read(input_window_uint32 *w);
uint64 window_read(input_window_uint64 *w);
cint16 window_read(input_window_cint16 *w);
cint32 window_read(input_window_cint32 *w);
float window_read(input_window_float *w);
cfloat window_read(input_window_cfloat *w);

void window_read(input_window_int8 *w, int8 &v );
void window_read(input_window_int16 *w, int16 &v );
void window_read(input_window_int32 *w, int32 &v );
void window_read(input_window_int64 *w, int64 &v );
void window_read(input_window_uint8 *w, uint8 &v );
void window_read(input_window_uint16 *w, uint16 &v );
void window_read(input_window_uint32 *w, uint32 &v );
void window_read(input_window_uint64 *w, uint64 &v );
void window_read(input_window_cint16 *w, cint16 &v);
void window_read(input_window_cint32 *w, cint32 &v);
void window_read(input_window_float *w, float &v);
void window_read(input_window_cfloat *w, cfloat &v);
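
The difference between the functional and procedural forms, and the fact that a plain read leaves the position untouched, can be sketched with a host-side model. PeekWindow and mock_read are hypothetical names used only for illustration; they are not the AI Engine types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of an input window of int samples.
struct PeekWindow {
    std::vector<int> buf;
    std::size_t pos;  // current position
    explicit PeekWindow(const std::vector<int> &samples)
        : buf(samples), pos(0) {
        assert(!samples.empty());
    }
};

// Functional form: returns the value at the current position.
inline int mock_read(const PeekWindow *w) {
    return w->buf[w->pos];
}

// Procedural form: stores the value in a reference argument.
inline void mock_read(const PeekWindow *w, int &v) {
    v = w->buf[w->pos];
}
```

Because neither form moves the position, two consecutive reads in this model return the same sample.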

The following code reads a 4-way vector of typed values from an input window of the same type. The current position is not modified and both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v4cint16 window_read_v4(input_window_cint16 *w);
v4int32 window_read_v4(input_window_int32 *w);
v4cint32 window_read_v4(input_window_cint32 *w);
v4int64 window_read_v4(input_window_int64 *w);
v4float window_read_v4(input_window_float *w);
v4cfloat window_read_v4(input_window_cfloat *w);

void window_read(input_window_cint16 *w, v4cint16 &v);
void window_read(input_window_int32 *w, v4int32 &v);
void window_read(input_window_cint32 *w, v4cint32 &v);
void window_read(input_window_int64 *w, v4int64 &v);
void window_read(input_window_float *w, v4float &v);
void window_read(input_window_cfloat *w, v4cfloat &v);

The following code reads an 8-way vector of typed values from an input window of the same type. The current position is not modified and both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v8int16 window_read_v8(input_window_int16 *w);
v8cint16 window_read_v8(input_window_cint16 *w);
v8int32 window_read_v8(input_window_int32 *w);
v8float window_read_v8(input_window_float *w);

void window_read(input_window_int16 *w, v8int16 &v);
void window_read(input_window_cint16 *w, v8cint16 &v);
void window_read(input_window_int32 *w, v8int32 &v);
void window_read(input_window_float *w, v8float &v);

The following code reads a 16-way vector of typed values from an input window of the same type. The current position is not modified and both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v16int8 window_read_v16(input_window_int8 *w);
v16uint8 window_read_v16(input_window_uint8 *w);
v16int16 window_read_v16(input_window_int16 *w);
v16cint16 window_read_v16(input_window_cint16 *w);
v16int32 window_read_v16(input_window_int32 *w);
v16cint32 window_read_v16(input_window_cint32 *w);
v16float window_read_v16(input_window_float *w);
v16cfloat window_read_v16(input_window_cfloat *w);

void window_read(input_window_int8 *w, v16int8 &v);
void window_read(input_window_uint8 *w, v16uint8 &v);
void window_read(input_window_int16 *w, v16int16 &v);
void window_read(input_window_cint16 *w, v16cint16 &v);
void window_read(input_window_int32 *w, v16int32 &v);
void window_read(input_window_cint32 *w, v16cint32 &v);
void window_read(input_window_float *w, v16float &v);
void window_read(input_window_cfloat *w, v16cfloat &v);

The following code reads a 32-way vector of typed values from an input window of the same type. The current position is not modified and both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v32int8 window_read_v32(input_window_int8 *w);
v32uint8 window_read_v32(input_window_uint8 *w);
v32int16 window_read_v32(input_window_int16 *w);
v32cint16 window_read_v32(input_window_cint16 *w);
v32int32 window_read_v32(input_window_int32 *w);
v32float window_read_v32(input_window_float *w);

void window_read(input_window_int8 *w, v32int8 &v);
void window_read(input_window_uint8 *w, v32uint8 &v);
void window_read(input_window_int16 *w, v32int16 &v);
void window_read(input_window_cint16 *w, v32cint16 &v);
void window_read(input_window_int32 *w, v32int32 &v);
void window_read(input_window_float *w, v32float &v);

The following code reads a 64-way vector of typed values from an input window of the same type. The current position is not modified and both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v64int8 window_read_v64(input_window_int8 *w);
v64uint8 window_read_v64(input_window_uint8 *w);
v64int16 window_read_v64(input_window_int16 *w);

void window_read(input_window_int8 *w, v64int8 &v);
void window_read(input_window_uint8 *w, v64uint8 &v);
void window_read(input_window_int16 *w, v64int16 &v);

Reading and Advancing an Input Window

The following code reads a scalar typed value from an input window of the same type and advances the window current position by one times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided.

int8 window_readincr(input_window_int8 *w);
int16 window_readincr(input_window_int16 *w);
int32 window_readincr(input_window_int32 *w);
int64 window_readincr(input_window_int64 *w);
uint8 window_readincr(input_window_uint8 *w);
uint16 window_readincr(input_window_uint16 *w);
uint32 window_readincr(input_window_uint32 *w);
uint64 window_readincr(input_window_uint64 *w);
cint16 window_readincr(input_window_cint16 *w);
cint32 window_readincr(input_window_cint32 *w);
float window_readincr(input_window_float *w);
cfloat window_readincr(input_window_cfloat *w);

void window_readincr(input_window_int8 *w, int8 &v );
void window_readincr(input_window_int16 *w, int16 &v );
void window_readincr(input_window_int32 *w, int32 &v );
void window_readincr(input_window_int64 *w, int64 &v );
void window_readincr(input_window_uint8 *w, uint8 &v );
void window_readincr(input_window_uint16 *w, uint16 &v );
void window_readincr(input_window_uint32 *w, uint32 &v );
void window_readincr(input_window_uint64 *w, uint64 &v );
void window_readincr(input_window_cint16 *w, cint16 &v);
void window_readincr(input_window_cint32 *w, cint32 &v);
void window_readincr(input_window_float *w, float &v );
void window_readincr(input_window_cfloat *w, cfloat &v);
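
A typical use of the read-and-advance pattern is a loop that walks through a block. The following host-side sketch models that pattern; FwdWindow, mock_readincr, and mock_sum are hypothetical names, and the wrap-around follows the circular-buffer description given earlier rather than the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of an input window of int samples.
struct FwdWindow {
    std::vector<int> buf;
    std::size_t pos;  // current position
    explicit FwdWindow(const std::vector<int> &samples)
        : buf(samples), pos(0) {
        assert(!samples.empty());
    }
};

// Read the sample at the current position, then advance by one sample.
inline int mock_readincr(FwdWindow *w) {
    const int v = w->buf[w->pos];
    w->pos = (w->pos + 1) % w->buf.size();
    return v;
}

// Example kernel body: accumulate n consecutive samples.
inline long mock_sum(FwdWindow *w, std::size_t n) {
    long acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += mock_readincr(w);
    }
    return acc;
}
```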

The following code reads a 4-way vector of typed values from an input window of the same type and advances the window current position by four times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v4cint16 window_readincr_v4(input_window_cint16 *w);
v4int32 window_readincr_v4(input_window_int32 *w);
v4cint32 window_readincr_v4(input_window_cint32 *w);
v4int64 window_readincr_v4(input_window_int64 *w);
v4float window_readincr_v4(input_window_float *w);
v4cfloat window_readincr_v4(input_window_cfloat *w);

void window_readincr(input_window_cint16 *w, v4cint16 &v);
void window_readincr(input_window_int32 *w, v4int32 &v);
void window_readincr(input_window_cint32 *w, v4cint32 &v);
void window_readincr(input_window_int64 *w, v4int64 &v);
void window_readincr(input_window_float *w, v4float &v);
void window_readincr(input_window_cfloat *w, v4cfloat &v);

The following code reads an 8-way vector of typed values from an input window of the same type and advances the window current position by eight times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v8int16 window_readincr_v8(input_window_int16 *w);
v8cint16 window_readincr_v8(input_window_cint16 *w);
v8int32 window_readincr_v8(input_window_int32 *w);
v8float window_readincr_v8(input_window_float *w);

void window_readincr(input_window_int16 *w, v8int16 &v);
void window_readincr(input_window_cint16 *w, v8cint16 &v);
void window_readincr(input_window_int32 *w, v8int32 &v);
void window_readincr(input_window_float *w, v8float &v);

The following code reads a 16-way vector of typed values from an input window of the same type and advances the window current position by sixteen times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v16int8 window_readincr_v16(input_window_int8 *w);
v16uint8 window_readincr_v16(input_window_uint8 *w);
v16int16 window_readincr_v16(input_window_int16 *w);
v16cint16 window_readincr_v16(input_window_cint16 *w);
v16int32 window_readincr_v16(input_window_int32 *w);
v16cint32 window_readincr_v16(input_window_cint32 *w);
v16float window_readincr_v16(input_window_float *w);
v16cfloat window_readincr_v16(input_window_cfloat *w);

void window_readincr(input_window_int8 *w, v16int8 &v);
void window_readincr(input_window_uint8 *w, v16uint8 &v);
void window_readincr(input_window_int16 *w, v16int16 &v);
void window_readincr(input_window_cint16 *w, v16cint16 &v);
void window_readincr(input_window_int32 *w, v16int32 &v);
void window_readincr(input_window_cint32 *w, v16cint32 &v);
void window_readincr(input_window_float *w, v16float &v);
void window_readincr(input_window_cfloat *w, v16cfloat &v);

The following code reads a 32-way vector of typed values from an input window of the same type and advances the window current position by thirty-two times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v32int8 window_readincr_v32(input_window_int8 *w);
v32uint8 window_readincr_v32(input_window_uint8 *w);
v32int16 window_readincr_v32(input_window_int16 *w);
v32cint16 window_readincr_v32(input_window_cint16 *w);
v32int32 window_readincr_v32(input_window_int32 *w);
v32float window_readincr_v32(input_window_float *w);

void window_readincr(input_window_int8 *w, v32int8 &v);
void window_readincr(input_window_uint8 *w, v32uint8 &v);
void window_readincr(input_window_int16 *w, v32int16 &v);
void window_readincr(input_window_cint16 *w, v32cint16 &v);
void window_readincr(input_window_int32 *w, v32int32 &v);
void window_readincr(input_window_float *w, v32float &v);

The following code reads a 64-way vector of typed values from an input window of the same type and advances the window current position by sixty-four times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v64int8 window_readincr_v64(input_window_int8 *w);
v64uint8 window_readincr_v64(input_window_uint8 *w);
v64int16 window_readincr_v64(input_window_int16 *w);

void window_readincr(input_window_int8 *w, v64int8 &v);
void window_readincr(input_window_uint8 *w, v64uint8 &v);
void window_readincr(input_window_int16 *w, v64int16 &v);

Reading and Decrementing an Input Window

The following code reads a scalar typed value from an input window of the same type and decrements the window current position by one times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided.

int8 window_readdecr(input_window_int8 *w);
int16 window_readdecr(input_window_int16 *w);
int32 window_readdecr(input_window_int32 *w);
int64 window_readdecr(input_window_int64 *w);
uint8 window_readdecr(input_window_uint8 *w);
uint16 window_readdecr(input_window_uint16 *w);
uint32 window_readdecr(input_window_uint32 *w);
uint64 window_readdecr(input_window_uint64 *w);
cint16 window_readdecr(input_window_cint16 *w);
cint32 window_readdecr(input_window_cint32 *w);
float window_readdecr(input_window_float *w);
cfloat window_readdecr(input_window_cfloat *w);

void window_readdecr(input_window_int8 *w, int8 &v );
void window_readdecr(input_window_int16 *w, int16 &v );
void window_readdecr(input_window_int32 *w, int32 &v );
void window_readdecr(input_window_int64 *w, int64 &v );
void window_readdecr(input_window_uint8 *w, uint8 &v );
void window_readdecr(input_window_uint16 *w, uint16 &v );
void window_readdecr(input_window_uint32 *w, uint32 &v );
void window_readdecr(input_window_uint64 *w, uint64 &v );
void window_readdecr(input_window_cint16 *w, cint16 &v);
void window_readdecr(input_window_cint32 *w, cint32 &v);
void window_readdecr(input_window_float *w, float &v );
void window_readdecr(input_window_cfloat *w, cfloat &v);
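
Read-and-decrement is convenient for walking a block from newest to oldest sample. The sketch below models this on the host. RevWindow, mock_readdecr, and mock_read_backward are hypothetical names, and the model assumes that the value is read at the current position before the position is moved back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of an input window of int samples.
struct RevWindow {
    std::vector<int> buf;
    std::size_t pos;  // current position
    explicit RevWindow(const std::vector<int> &samples)
        : buf(samples), pos(0) {
        assert(!samples.empty());
    }
};

// Read the sample at the current position, then step back by one sample
// (wrapping from the first slot to the last, as in a circular buffer).
inline int mock_readdecr(RevWindow *w) {
    const int v = w->buf[w->pos];
    const std::size_t n = w->buf.size();
    w->pos = (w->pos + n - 1) % n;
    return v;
}

// Collect n samples while moving backward through the window.
inline std::vector<int> mock_read_backward(RevWindow *w, std::size_t n) {
    std::vector<int> out;
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(mock_readdecr(w));
    }
    return out;
}
```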

The following code reads a 4-way vector of typed values from an input window of the same type and decrements the window current position by four times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v4cint16 window_readdecr_v4(input_window_cint16 *w);
v4int32 window_readdecr_v4(input_window_int32 *w);
v4cint32 window_readdecr_v4(input_window_cint32 *w);
v4int64 window_readdecr_v4(input_window_int64 *w);
v4float window_readdecr_v4(input_window_float *w);
v4cfloat window_readdecr_v4(input_window_cfloat *w);

void window_readdecr(input_window_cint16 *w, v4cint16 &v);
void window_readdecr(input_window_int32 *w, v4int32 &v);
void window_readdecr(input_window_cint32 *w, v4cint32 &v);
void window_readdecr(input_window_int64 *w, v4int64 &v);
void window_readdecr(input_window_float *w, v4float &v);
void window_readdecr(input_window_cfloat *w, v4cfloat &v);

The following code reads an 8-way vector of typed values from an input window of the same type and decrements the window current position by eight times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v8int16 window_readdecr_v8(input_window_int16 *w);
v8cint16 window_readdecr_v8(input_window_cint16 *w);
v8int32 window_readdecr_v8(input_window_int32 *w);
v8float window_readdecr_v8(input_window_float *w);

void window_readdecr(input_window_int16 *w, v8int16 &v);
void window_readdecr(input_window_cint16 *w, v8cint16 &v);
void window_readdecr(input_window_int32 *w, v8int32 &v);
void window_readdecr(input_window_float *w, v8float &v);

The following code reads a 16-way vector of typed values from an input window of the same type and decrements the window current position by sixteen times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v16int8 window_readdecr_v16(input_window_int8 *w);
v16uint8 window_readdecr_v16(input_window_uint8 *w);
v16int16 window_readdecr_v16(input_window_int16 *w);
v16cint16 window_readdecr_v16(input_window_cint16 *w);
v16int32 window_readdecr_v16(input_window_int32 *w);
v16cint32 window_readdecr_v16(input_window_cint32 *w);
v16float window_readdecr_v16(input_window_float *w);
v16cfloat window_readdecr_v16(input_window_cfloat *w);

void window_readdecr(input_window_int8 *w, v16int8 &v);
void window_readdecr(input_window_uint8 *w, v16uint8 &v);
void window_readdecr(input_window_int16 *w, v16int16 &v);
void window_readdecr(input_window_cint16 *w, v16cint16 &v);
void window_readdecr(input_window_int32 *w, v16int32 &v);
void window_readdecr(input_window_cint32 *w, v16cint32 &v);
void window_readdecr(input_window_float *w, v16float &v);
void window_readdecr(input_window_cfloat *w, v16cfloat &v);

The following code reads a 32-way vector of typed values from an input window of the same type and decrements the window current position by thirty-two times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v32int8 window_readdecr_v32(input_window_int8 *w);
v32uint8 window_readdecr_v32(input_window_uint8 *w);
v32int16 window_readdecr_v32(input_window_int16 *w);
v32cint16 window_readdecr_v32(input_window_cint16 *w);
v32int32 window_readdecr_v32(input_window_int32 *w);
v32float window_readdecr_v32(input_window_float *w);

void window_readdecr(input_window_int8 *w, v32int8 &v);
void window_readdecr(input_window_uint8 *w, v32uint8 &v);
void window_readdecr(input_window_int16 *w, v32int16 &v);
void window_readdecr(input_window_cint16 *w, v32cint16 &v);
void window_readdecr(input_window_int32 *w, v32int32 &v);
void window_readdecr(input_window_float *w, v32float &v);

The following code reads a 64-way vector of typed values from an input window of the same type and decrements the window current position by sixty-four times the size of the underlying data type. Both functional form (returns the value) and procedural form (modifies a reference argument) are provided. The memory data path is either 128 bits or 256 bits wide for vector operations.

v64int8 window_readdecr_v64(input_window_int8 *w);
v64uint8 window_readdecr_v64(input_window_uint8 *w);
v64int16 window_readdecr_v64(input_window_int16 *w);

void window_readdecr(input_window_int8 *w, v64int8 &v);
void window_readdecr(input_window_uint8 *w, v64uint8 &v);
void window_readdecr(input_window_int16 *w, v64int16 &v);

Writing Data to an Output Window

The following code writes a scalar typed value to an output window of the same type. The current position is not modified.

void window_write(output_window_int8 *w, int8 v);
void window_write(output_window_uint8 *w, uint8 v);
void window_write(output_window_int16 *w, int16 v);
void window_write(output_window_uint16 *w, uint16 v);
void window_write(output_window_cint16 *w, cint16 v);
void window_write(output_window_int32 *w, int32 v );
void window_write(output_window_uint32 *w, uint32 v );
void window_write(output_window_cint32 *w, cint32 v);
void window_write(output_window_int64 *w, int64 v );
void window_write(output_window_uint64 *w, uint64 v );
void window_write(output_window_float *w, float v );
void window_write(output_window_cfloat *w, cfloat v);
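
Because a plain write does not move the current position, a second write lands on the same slot. The following host-side sketch shows this effect; OutWindow and mock_write are hypothetical names and only model the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of an output window of int samples.
struct OutWindow {
    std::vector<int> buf;
    std::size_t pos;  // current position
    explicit OutWindow(std::size_t samples) : buf(samples, 0), pos(0) {
        assert(samples > 0);
    }
};

// Store v at the current position; the position itself is left unchanged.
inline void mock_write(OutWindow *w, int v) {
    w->buf[w->pos] = v;
}
```

In this model, writing 1 and then 2 leaves only the value 2 in the first slot, so a kernel that uses the non-advancing write has to move the position itself between writes.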

The following code writes a 4-way vector of a typed value to an output window of the same type. The current position is not modified.

void window_write(output_window_cint16 *w, v4cint16 v);
void window_write(output_window_int32 *w, v4int32 v );
void window_write(output_window_cint32 *w, v4cint32 v);
void window_write(output_window_int64 *w, v4int64 v );
void window_write(output_window_float *w, v4float v );
void window_write(output_window_cfloat *w, v4cfloat v);

The following code writes an 8-way vector of a typed value to an output window of the same type. The current position is not modified.

void window_write(output_window_int16 *w, v8int16 v);
void window_write(output_window_cint16 *w, v8cint16 v);
void window_write(output_window_int32 *w, v8int32 v );
void window_write(output_window_float *w, v8float v );

The following code writes a 16-way vector of a typed value to an output window of the same type. The current position is not modified.

void window_write(output_window_int8 *w, v16int8 v);
void window_write(output_window_uint8 *w, v16uint8 v);
void window_write(output_window_int16 *w, v16int16 v);
void window_write(output_window_cint16 *w, v16cint16 v);
void window_write(output_window_int32 *w, v16int32 v );
void window_write(output_window_cint32 *w, v16cint32 v);
void window_write(output_window_float *w, v16float v );
void window_write(output_window_cfloat *w, v16cfloat v);

The following code writes a 32-way vector of a typed value to an output window of the same type. The current position is not modified.

void window_write(output_window_int8 *w, v32int8 v);
void window_write(output_window_uint8 *w, v32uint8 v);
void window_write(output_window_int16 *w, v32int16 v);
void window_write(output_window_cint16 *w, v32cint16 v);
void window_write(output_window_int32 *w, v32int32 v );
void window_write(output_window_float *w, v32float v );

The following code writes a 64-way vector of a typed value to an output window of the same type. The current position is not modified.

void window_write(output_window_int8 *w, v64int8 v);
void window_write(output_window_uint8 *w, v64uint8 v);
void window_write(output_window_int16 *w, v64int16 v);

Writing and Advancing an Output Window

The following code writes a scalar typed value to an output window of the same type and advances the current position based upon that type.

void window_writeincr (output_window_int8 *w, int8 v);
void window_writeincr (output_window_uint8 *w, uint8 v);
void window_writeincr (output_window_int16 *w, int16 v);
void window_writeincr (output_window_uint16 *w, uint16 v);
void window_writeincr (output_window_cint16 *w, cint16 v);
void window_writeincr (output_window_int32 *w, int32 v );
void window_writeincr (output_window_uint32 *w, uint32 v );
void window_writeincr (output_window_cint32 *w, cint32 v);
void window_writeincr (output_window_int64 *w, int64 v);
void window_writeincr (output_window_uint64 *w, uint64 v);
void window_writeincr (output_window_float *w, float v );
void window_writeincr (output_window_cfloat *w, cfloat v);

The following code writes a 4-way vector of a typed value to an output window of the same type and advances the current position by four times the size of the underlying type.

void window_writeincr(output_window_cint16 *w, v4cint16 v);
void window_writeincr(output_window_int32 *w, v4int32 v );
void window_writeincr(output_window_cint32 *w, v4cint32 v);
void window_writeincr(output_window_int64 *w, v4int64 v );
void window_writeincr(output_window_float *w, v4float v );
void window_writeincr(output_window_cfloat *w, v4cfloat v);

The following code writes an 8-way vector of a typed value to an output window of the same type and advances the current position by eight times the size of the underlying type.

void window_writeincr(output_window_int16 *w, v8int16 v);
void window_writeincr(output_window_cint16 *w, v8cint16 v);
void window_writeincr(output_window_int32 *w, v8int32 v );
void window_writeincr(output_window_float *w, v8float v );

The following code writes a 16-way vector of a typed value to an output window of the same type and advances the current position by sixteen times the size of the underlying type.

void window_writeincr(output_window_int8 *w, v16int8 v);
void window_writeincr(output_window_uint8 *w, v16uint8 v);
void window_writeincr(output_window_int16 *w, v16int16 v);
void window_writeincr(output_window_cint16 *w, v16cint16 v);
void window_writeincr(output_window_int32 *w, v16int32 v );
void window_writeincr(output_window_cint32 *w, v16cint32 v);
void window_writeincr(output_window_float *w, v16float v );
void window_writeincr(output_window_cfloat *w, v16cfloat v);

The following code writes a 32-way vector of a typed value to an output window of the same type and advances the current position by thirty-two times the size of the underlying type.

void window_writeincr(output_window_int8 *w, v32int8 v);
void window_writeincr(output_window_uint8 *w, v32uint8 v);
void window_writeincr(output_window_int16 *w, v32int16 v);
void window_writeincr(output_window_cint16 *w, v32cint16 v);
void window_writeincr(output_window_int32 *w, v32int32 v );
void window_writeincr(output_window_float *w, v32float v );

The following code writes a 64-way vector of a typed value to an output window of the same type and advances the current position by sixty-four times the size of the underlying type.

void window_writeincr(output_window_int8 *w, v64int8 v);
void window_writeincr(output_window_uint8 *w, v64uint8 v);
void window_writeincr(output_window_int16 *w, v64int16 v);
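
The difference between window_write and window_writeincr can be illustrated with a small host-side model. This is only a sketch: the output_window_int32 struct below is a hypothetical stand-in (a buffer plus a current position) placed in a sketch namespace, not the window structure provided by the AI Engine tools, and it ignores details such as buffer bounds.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical stand-in for output_window_int32: a buffer and a current position.
struct output_window_int32 {
    std::vector<int32_t> buf;
    std::size_t pos;
};

// Models window_write: stores at the current position and leaves it unchanged.
inline void window_write(output_window_int32 *w, int32_t v) {
    w->buf[w->pos] = v;
}

// Models window_writeincr: stores at the current position, then advances by one element.
inline void window_writeincr(output_window_int32 *w, int32_t v) {
    w->buf[w->pos] = v;
    w->pos += 1;
}

// Two window_write calls land in the same slot, while each
// window_writeincr call moves on to the next slot.
inline output_window_int32 demo() {
    output_window_int32 w;
    w.buf.assign(4, 0);
    w.pos = 0;
    window_write(&w, 10);      // slot 0 = 10, position stays 0
    window_write(&w, 11);      // slot 0 overwritten with 11
    window_writeincr(&w, 20);  // slot 0 = 20, position becomes 1
    window_writeincr(&w, 30);  // slot 1 = 30, position becomes 2
    return w;
}

} // namespace sketch
```

In this model, a kernel that fills a window element by element would use the writeincr form, and the plain write form only when the position is managed separately.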

Stream Operations for Kernels

Stream Data Types

Table 2. Supported Stream Data Types
Input Stream Types	Output Stream Types
input_stream_int8	output_stream_int8
input_stream_int16	output_stream_int16
input_stream_int32	output_stream_int32
input_stream_int64	output_stream_int64
input_stream_uint8	output_stream_uint8
input_stream_uint16	output_stream_uint16
input_stream_uint32	output_stream_uint32
input_stream_uint64	output_stream_uint64
input_stream_cint16	output_stream_cint16
input_stream_cint32	output_stream_cint32
input_stream_acc48	output_stream_acc48
input_stream_cacc48	output_stream_cacc48
input_stream_acc80	output_stream_acc80
input_stream_cacc80	output_stream_cacc80
input_stream_accfloat	output_stream_accfloat
input_stream_caccfloat	output_stream_caccfloat
input_stream_float	output_stream_float
input_stream_cfloat	output_stream_cfloat

Each of the data types in the table can be read or written from the AI Engine as either scalars or in vector groups. However, there are restrictions on valid groupings based on the bus data width supported on the AI Engine to programmable logic interface ports or through the stream-switch network. The valid combinations for AI Engine kernels are vector bundles totaling up to 32 bits or 128 bits. The accumulator data types are only used to specify cascade-stream connections between adjacent AI Engines. Their valid groupings are based on the 384-bit wide cascade channel between two processors.
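
As a rough illustration of the 128-bit grouping, the number of elements in a vector bundle follows from dividing 128 bits by the element width. The sketch below checks this at compile time on a host compiler. The cint16_model and cint32_model structs and the lanes_per_128_bits helper are hypothetical names introduced here, not part of the AI Engine headers. The resulting counts agree with the vector widths used by the stream operations listed later in this section (for example, readincr_v16 for int8 and readincr_v2 for cint32).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the complex integer types (real and imaginary parts).
struct cint16_model { int16_t real; int16_t imag; };
struct cint32_model { int32_t real; int32_t imag; };

// Number of elements of type T that fit in one 128-bit vector bundle.
template <typename T>
constexpr std::size_t lanes_per_128_bits() {
    return 128 / (8 * sizeof(T));
}

static_assert(lanes_per_128_bits<int8_t>() == 16, "int8: 16 elements");
static_assert(lanes_per_128_bits<int16_t>() == 8, "int16: 8 elements");
static_assert(lanes_per_128_bits<int32_t>() == 4, "int32: 4 elements");
static_assert(lanes_per_128_bits<cint16_model>() == 4, "cint16: 4 elements");
static_assert(lanes_per_128_bits<cint32_model>() == 2, "cint32: 2 elements");
static_assert(lanes_per_128_bits<float>() == 4, "float: 4 elements");
```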

Note: To use these data types, it is necessary to use #include <adf.h> in the kernel source file.

Reading and Advancing an Input Stream

AI Engine Operations

The following operations read data from the given input stream and advance the stream on the AI Engine. Because there are two input stream ports on the AI Engine, the physical port assignment is made automatically by the AI Engine compiler and conveyed as part of the stream data structure. Data values can be read from the stream one at a time or as a vector. In the latter case, the stream operation stalls until all values are present. The data groupings are based on the underlying single-cycle, 32-bit stream operation or four-cycle, 128-bit stream operation. The cascade connection reads all accumulator values in parallel.

int32 readincr(input_stream_int32 *w);
uint32 readincr(input_stream_uint32 *w);
cint16 readincr(input_stream_cint16 *w);
float readincr(input_stream_float *w);
cfloat readincr(input_stream_cfloat *w);

v16int8 readincr_v16(input_stream_int8 *w);
v16uint8 readincr_v16(input_stream_uint8 *w);
v8int16 readincr_v8(input_stream_int16 *w);
v4cint16 readincr_v4(input_stream_cint16 *w);
v4int32 readincr_v4(input_stream_int32 *w);
v2cint32 readincr_v2(input_stream_cint32 *w);
v4float readincr_v4(input_stream_float *w);

v8acc48 readincr_v8(input_stream_acc48 *w);
v4cacc48 readincr_v4(input_stream_cacc48 *w);
v4acc80 readincr_v4(input_stream_acc80 * str);
v2cacc80 readincr_v2(input_stream_cacc80 * str);
v8float readincr_v8(input_stream_accfloat * str);
v4cfloat readincr_v4(input_stream_caccfloat * str);

Writing and Advancing an Output Stream

AI Engine Operations

The following operations write data to the given output stream and advance the stream on the AI Engine. Because there are two output stream ports on the AI Engine, the physical port assignment is made automatically by the AI Engine compiler and conveyed as part of the stream data structure. Data values can be written to the output stream one at a time or as a vector. In the latter case, the stream operation stalls until all values are written. The data groupings are based on the underlying single-cycle, 32-bit stream operation or four-cycle, 128-bit stream operation. The cascade connection writes all values in parallel.

void writeincr(output_stream_int32 *w, int32 v);
void writeincr(output_stream_uint32 *w, uint32 v);
void writeincr(output_stream_cint16 *w, cint16 v);
void writeincr(output_stream_float *w, float v);
void writeincr(output_stream_cfloat *w, cfloat v);

void writeincr_v16(output_stream_int8 *w, v16int8 v);
void writeincr_v16(output_stream_uint8 *w, v16uint8 v);
void writeincr_v8(output_stream_int16 *w, v8int16 v);
void writeincr_v4(output_stream_cint16 *w, v4cint16 v);
void writeincr_v4(output_stream_int32 *w, v4int32 v);
void writeincr_v2(output_stream_cint32 *w, v2cint32 v);
void writeincr_v4(output_stream_float *w, v4float v);

void writeincr_v8(output_stream_acc48 *w, v8acc48 v);
void writeincr_v4(output_stream_cacc48 *w, v4cacc48 v);
void writeincr_v4(output_stream_acc80* str, v4acc80 value);
void writeincr_v2(output_stream_cacc80* str, v2cacc80 value);
void writeincr_v8(output_stream_accfloat* str, v8float value);
void writeincr_v4(output_stream_caccfloat* str, v4cfloat value);
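
The read-and-advance and write-and-advance behavior can be sketched with a host-side model. The stream structs below are hypothetical stand-ins backed by a queue and placed in a sketch namespace. They are not the stream handles from adf.h, and the model assumes data is always available where the real operation would stall. The scale_by_two function is an illustrative kernel-shaped function, not part of the API.

```cpp
#include <cstdint>
#include <deque>

namespace sketch {

// Hypothetical host-side stand-ins for the stream handles.
struct input_stream_int32  { std::deque<int32_t> fifo; };
struct output_stream_int32 { std::deque<int32_t> fifo; };

// Models readincr: returns the next value and advances the stream,
// so the value is consumed and cannot be read again.
inline int32_t readincr(input_stream_int32 *s) {
    int32_t v = s->fifo.front();
    s->fifo.pop_front();
    return v;
}

// Models writeincr: appends one value to the output stream.
inline void writeincr(output_stream_int32 *s, int32_t v) {
    s->fifo.push_back(v);
}

// Kernel-shaped function: reads n samples, doubles each, and writes it out.
inline void scale_by_two(input_stream_int32 *in, output_stream_int32 *out, int n) {
    for (int i = 0; i < n; ++i) {
        writeincr(out, 2 * readincr(in));
    }
}

} // namespace sketch
```

Unlike window access, there is no separate non-advancing read in this listing: every stream read consumes the value it returns.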

Using Streams in Parallel

For streaming input and output interfaces, when performance is limited by the streaming interface, it is possible to use two streaming inputs or two streaming outputs in parallel. To use two parallel streams, it is recommended to use the following pairs of macros, where idx1 and idx2 are the two streams. Add the restrict keyword to the stream ports to optimize them for parallel processing. Each macro operates on 32 bits per cycle, or 128 bits per four cycles.

READINCR(SS_rsrc1, idx1) and READINCR(SS_rsrc2, idx2)
READINCRW(WSS_rsrc1, idx1) and READINCRW(WSS_rsrc2, idx2)
WRITEINCR(MS_rsrc1, idx1, val) and WRITEINCR(MS_rsrc2, idx2, val)
WRITEINCRW(WMS_rsrc1, idx1, val) and WRITEINCRW(WMS_rsrc2, idx2, val)

The following example code shows two parallel input streams using pipelining with an interval of 1.

void simple(input_stream_int32 * restrict data0, input_stream_int32 * restrict data1,
            output_stream_int32 * restrict out) {
    for (int i = 0; i < 1024; i++)
    chess_prepare_for_pipelining
    {
        // The two reads use different stream resources (SS_rsrc1, SS_rsrc2).
        int32_t d = READINCR(SS_rsrc1, data0);
        int32_t e = READINCR(SS_rsrc2, data1);
        WRITEINCR(MS_rsrc1, out, d + e);
    }
}

Packet Stream Operations

Table 3. Supported Packet Stream Data Types
Input Stream Types	Output Stream Types
input_pktstream	output_pktstream

Two additional stream data types are provided to characterize streaming data that consists of packetized interleaving of several different streams. These data types are useful when the number of independent data streams in your program exceeds the number of hardware stream channels or ports available. This mechanism is described in more detail in Explicit Packet Switching.

Packet Stream Reading and Writing

A data packet consists of a one-word (32-bit) packet header, followed by some number of data words, where the TLAST field of the last data word denotes the end-of-packet. The following operations are used to read and advance input packet streams and write and advance output packet streams.

int32 readincr(input_pktstream *w);
int32 readincr(input_pktstream *w, bool &tlast);

void writeincr(output_pktstream *w, int32 value);
void writeincr(output_pktstream *w, int32 value, bool tlast);

The APIs with the tlast argument help to read or write the end-of-packet condition when the packet size is not fixed.
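
A minimal host-side sketch of reading a variable-length packet follows. The input_pktstream struct here is a hypothetical stand-in that stores each word together with its TLAST flag, and Packet and read_packet are illustrative names, not part of the API. The sketch assumes every packet carries at least one data word after the header.

```cpp
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for input_pktstream: each entry is (word, tlast).
struct input_pktstream {
    std::deque<std::pair<int32_t, bool> > words;
};

// Models readincr without the tlast argument.
inline int32_t readincr(input_pktstream *s) {
    int32_t v = s->words.front().first;
    s->words.pop_front();
    return v;
}

// Models readincr with the tlast argument: also reports end-of-packet.
inline int32_t readincr(input_pktstream *s, bool &tlast) {
    tlast = s->words.front().second;
    return readincr(s);
}

struct Packet {
    int32_t header;
    std::vector<int32_t> payload;
};

// Reads one packet of unknown length: the header word first,
// then data words until the word flagged with TLAST.
inline Packet read_packet(input_pktstream *s) {
    Packet p;
    p.header = readincr(s);
    bool tlast = false;
    while (!tlast) {
        p.payload.push_back(readincr(s, tlast));
    }
    return p;
}

} // namespace sketch
```

When the packet size is fixed and known, the form without the tlast argument is sufficient and the loop can simply count words.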

Packet Processing

The first 32-bit word of a packet must always be a packet header, which encodes several bit fields as shown in the following table.

Table 4. Packet Bit Fields
Bits	Field
4-0	Packet ID
11-5	`7'b0000000`
14-12	Packet Type
15	`1'b0`
20-16	Source Row
27-21	Source Column
30-28	`3'b000`
31	Odd parity of bits[30:0]

The packet ID is assigned by the compiler based on routing requirements. The packet type can be any 3-bit pattern that you want to insert to identify the type of packet. The source row and column denote the AI Engine tile coordinates from where the packet originated. By convention, the source row and column for packets originating in the programmable logic (PL) are -1,-1.
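
The bit layout in Table 4 can be sketched as plain integer arithmetic. The helper names below are hypothetical and are not part of the AI Engine API. The sketch assumes that "odd parity" means bit 31 is chosen so that the complete 32-bit word contains an odd number of ones, and that a -1 coordinate is represented as an all-ones field (0x1F for the row, 0x7F for the column).

```cpp
#include <cstdint>

// Packs the Table 4 fields into a 32-bit header word (hypothetical helper).
inline uint32_t make_packet_header(uint32_t id, uint32_t type,
                                   uint32_t row, uint32_t col) {
    uint32_t h = 0;
    h |= (id   & 0x1Fu);        // bits 4-0: packet ID
    h |= (type & 0x07u) << 12;  // bits 14-12: packet type
    h |= (row  & 0x1Fu) << 16;  // bits 20-16: source row
    h |= (col  & 0x7Fu) << 21;  // bits 27-21: source column
    // Bits 11-5, 15, and 30-28 stay zero.

    // Count the ones in bits [30:0]; bit 31 is still zero at this point.
    uint32_t ones = 0;
    for (uint32_t b = h; b != 0; b >>= 1) {
        ones += b & 1u;
    }
    // Assumed odd parity: set bit 31 when the count so far is even.
    if (ones % 2 == 0) {
        h |= 1u << 31;
    }
    return h;
}

// Field extraction on the receive side (hypothetical helpers).
inline uint32_t header_packet_id(uint32_t h)   { return h & 0x1Fu; }
inline uint32_t header_packet_type(uint32_t h) { return (h >> 12) & 0x07u; }
```

For a packet that originates in the PL, the row and column arguments would be passed as 0x1F and 0x7F under the assumption stated above.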

It is your responsibility to construct and send an appropriate packet header at the beginning of every packet. On the receive side, the packet header needs to be received and decoded before reading the data.

The following operations help to assemble or disassemble the packet header in the AI Engine kernel.

void writeHeader(output_pktstream *str, unsigned int pcktType, unsigned int ID);
void writeHeader(output_pktstream *str, unsigned int pcktType, unsigned int ID, bool tlast);


uint32 getPacketid(input_pktstream *w, int index);
uint32 getPacketid(output_pktstream *w, int index);

The writeHeader API allows a packet header to be assembled with a given packet ID and packet type. The source row and column are inserted automatically using the coordinates of the AI Engine tile where this API is executed.

The getPacketid API allows the compiler-assigned packet ID to be queried on the input or output packet stream data structure. The index argument refers to the split or merge branch edge in the graph specification.

IMPORTANT: The writeHeader() and getPacketid() APIs are not supported in PL kernels.

IMPORTANT: The generateHeader API has been deprecated and replaced with the writeHeader API.

See Explicit Packet Switching for more details.