Data Movement Between AI Engines

Generally, there are two methods to transfer data between kernels: windows or streams. With windows, data transfers are realized as ping-pong buffers or, optionally, as a single buffer, and the AI Engine tools take care of buffer synchronization between the kernels. Designers decide the window size and buffer location between kernels when partitioning the application. If consecutive windows of data need to overlap, the AI Engine tools provide an option to set a margin for the window, which causes the tools to copy the overlapping data automatically.
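
For illustration, the following ADF graph sketch shows how a window connection with a margin might be declared. The kernel names (producer, consumer), source file names, window size (1024 bytes), and margin (32 bytes) are illustrative assumptions, not values taken from this section.

#include <adf.h>
using namespace adf;

// Hypothetical window-based kernels (defined in producer.cc / consumer.cc).
void producer(input_window_int32 *in, output_window_int32 *out);
void consumer(input_window_int32 *in, output_window_int32 *out);

class window_graph : public graph {
public:
    kernel k1, k2;
    input_plio  in;
    output_plio out;

    window_graph() {
        in  = input_plio::create("DataIn",   plio_32_bits, "data/input.txt");
        out = output_plio::create("DataOut", plio_32_bits, "data/output.txt");

        k1 = kernel::create(producer);
        k2 = kernel::create(consumer);
        source(k1) = "producer.cc";
        source(k2) = "consumer.cc";
        runtime<ratio>(k1) = 0.5;
        runtime<ratio>(k2) = 0.5;

        // 1024-byte windows; the k1 -> k2 connection also declares a 32-byte
        // margin, so the tools copy that overlap into the start of each window.
        connect< window<1024> >(in.out[0], k1.in[0]);
        connect< window<1024, 32> >(k1.out[0], k2.in[0]);
        connect< window<1024> >(k2.out[0], out.in[0]);
    }
};

Window sizes and margins in the graph are expressed in bytes and must match what the kernels expect.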

When using streams, the data movement involves two input stream ports and two output stream ports, along with one dedicated cascade stream input port and one cascade stream output port. Each stream port can provide 32 bits per cycle, or 128 bits every four cycles. Stream interfaces are bidirectional: an AI Engine can read from or write to neighboring or non-neighboring AI Engines through its stream ports. Cascade stream ports, however, are unidirectional and provide only one-way access between neighboring AI Engines.
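
As a sketch of how these ports appear in kernel code (the kernel names and data types below are assumptions), a pair of neighboring kernels connected through the cascade might be declared as follows:

#include <adf.h>

// Hypothetical first stage: reads a 32-bit input stream and forwards partial
// accumulations to its neighbor over the one-way cascade output port.
void stage0(input_stream_int32 *sin, output_stream_acc48 *cas_out);

// Hypothetical second stage: receives the cascade data from stage0 and
// produces a 32-bit output stream.
void stage1(input_stream_acc48 *cas_in, output_stream_int32 *sout);

In the graph, the two stages would be wired with connect<stream> for the stream ports and connect<cascade> for the cascade ports; because the cascade is a physical neighbor-to-neighbor connection, the tools place the two kernels in adjacent AI Engines.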

Data Communication via Shared Memory

In the case where multiple kernels fit in a single AI Engine, communication between two or more consecutive kernels can be established using a common buffer in the shared memory. In this case, only a single buffer is needed because the kernels are time-multiplexed.
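
Building on the hypothetical window graph sketched above, the following constraints sketch one way to express this; location, single_buffer, and the port names are ADF graph API constructs, and the choice of values is an assumption for illustration.

// Inside the graph constructor, after the connect<> calls:
// place both kernels on the same AI Engine so they are time-multiplexed,
// and use a single buffer instead of ping-pong on the shared connection.
location<kernel>(k2) = location<kernel>(k1);
single_buffer(k2.in[0]);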

For cases where the kernels are in separate but neighboring AI Engines, communication can be carried out through the shared memory module using ping-pong buffers. These buffers are placed in separate memory banks so that access conflicts are avoided. Synchronization is done through locks: the locks associated with the input and output buffers ensure that the buffers are ready before the AI Engine kernel accesses them. This type of communication saves routing resources and eliminates data transfer latency because no DMA or AXI4-Stream interconnect is needed.
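
For example (using the same hypothetical k1 and k2, with arbitrary tile coordinates), pinning the kernels to adjacent tiles keeps the connection in the shared memory module; normally the tools choose such a placement automatically.

// Neighboring placement: the k1 -> k2 window can live in the memory module
// shared by the two tiles, using ping-pong buffers synchronized by locks.
location<kernel>(k1) = tile(25, 0);
location<kernel>(k2) = tile(26, 0);   // horizontally adjacent tile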

Data Communication via Memory and DMA

For non-neighboring AI Engines, similar communication can be established using the DMA in the memory module associated with each AI Engine. Ping-pong buffers in each memory module are used, and synchronization is carried out with locks. Compared to shared memory communication, this increases both communication latency and memory resource usage.
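
Conversely (again with arbitrary, illustrative coordinates), forcing the same two kernels onto non-neighboring tiles makes the tools insert DMAs: each tile gets ping-pong buffers in its own memory module, synchronized by locks, with the data carried over the AXI4-Stream interconnect in between.

// Non-neighboring placement: no shared memory module, so the window is
// moved by DMA through the AXI4-Stream interconnect.
location<kernel>(k1) = tile(10, 0);
location<kernel>(k2) = tile(20, 3);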

Figure 1: Data Communication via Memory and DMA

Data Communication via AXI4-Stream Interconnect

AI Engines can communicate directly through the AXI4-Stream interconnect without any DMA or memory interaction. Data can be sent from one AI Engine to another, or broadcast, through the streaming interface. The data bandwidth of a streaming connection is 32 bits per cycle, and built-in handshake and backpressure mechanisms are available.
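
As a sketch (the kernel and port names are assumptions, and the kernels are assumed to have stream ports; k3 is a third kernel added to the same hypothetical graph), a unicast stream connection and a broadcast from the same output port could be declared as follows.

// Unicast: one producer stream to one consumer.
connect<stream>(k1.out[0], k2.in[0]);

// Multicast: the same output port fanned out to a second consumer; the
// stream switch delivers the data to both destinations at the same time.
connect<stream>(k1.out[0], k3.in[0]);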

For streaming input and output interfaces, when performance is limited by the number of streams, the AI Engine can use two streaming inputs or two streaming outputs in parallel instead of a single one. To use two parallel streams, it is recommended to use the following pairs of macros, where idx1 and idx2 are the two streams. Add the __restrict keyword to the stream ports to ensure they can be optimized for parallel processing.

READINCR(SS_rsrc1, idx1) and READINCR(SS_rsrc2, idx2) 
READINCRW(WSS_rsrc1, idx1) and READINCRW(WSS_rsrc2, idx2) 
WRITEINCR(MS_rsrc1, idx1, val) and WRITEINCR(MS_rsrc2, idx2, val) 
WRITEINCRW(WMS_rsrc1, idx1, val) and WRITEINCRW(WMS_rsrc2, idx2, val)

Following is sample code that uses two parallel input streams to achieve pipelining with an interval of 1, meaning that two reads, one addition, and one write are performed every cycle.

void simple(input_stream_int32 * __restrict data0,
            input_stream_int32 * __restrict data1,
            output_stream_int32 * __restrict out) {
    for(int i=0; i<1024; i++)
    chess_prepare_for_pipelining
    {
        int32_t d = READINCR(SS_rsrc1, data0);
        int32_t e = READINCR(SS_rsrc2, data1);
        WRITEINCR(MS_rsrc1, out, d+e);
    }
}

Intrinsics can be used to perform stream operations directly, but it is important not to swap the two streams from the mapping that the AI Engine tools have found.

v16float off = *(v16float*)offset;          // offset is assumed to point to 16 pre-loaded float values
v8float v8in = undef_v8float();
v8float v8out = undef_v8float();
for(int i=0;i<128/4;i++)
chess_prepare_for_pipelining
{
    v8in = concat(getf_wss(0),getf_wss(1)); // reads 8 float values, 4 from each wide input stream
    v8out = fpadd(v8in,off,0,0);            // floating-point vector add with the offset values
    put_wms(0,ext_v(v8out,0));              // writes 4 float values to wide output stream 0
    put_wms(1,ext_v(v8out,1));              // writes 4 float values to wide output stream 1
}

The stream connection can be unicast or multicast. Note that in the case of multicast communication, the data is sent to all the destination ports at the same time and only when all destinations are ready to receive data.