Mapper/Router Methodology

Design Convergence

This section describes the process of handling a failure in the AI Engine compiler during the mapper (place) or router steps.

Mapping Solution Not Found

Mapper failures typically occur in one of two modes: during a pre-check phase or during the actual mapping phase.

Pre-check failures have explicit error messages that point to the exact reason for the failure, as shown in the following example.

ERROR: [aiecompiler 47-772] Inst g.kernel_a is in conflicting pblocks:(0,0) (5,5) and (20,0) (25,5).

You can trace such errors to the design element or the constraints.

Mapping phase failures typically have an error message that looks like the following.

ERROR: [aiecompiler 47-51] AIE Mapper failed to find a legal solution. Please try to relax constraints and/or try alternate strategies like disableFloorplanning.

In this case, use the following steps to either narrow down the cause of the failure if it is design-related or help the tool to find a solution if the failure is tool-related.

Checking User-Defined Constraints

Incorrectly defined user constraints can cause mapper failure. Some of these will be caught by the mapper pre-checker. However, not all of them can be caught in the pre-check phase. In such cases, check for the following conditions.

If you have a large number of absolute location or co-location constraints in the graph, check that these constraints do not give conflicting directives to the mapper. This might occur because of the checkerboard nature of the AI Engine array as shown in the following diagram.
In this diagram, the kernels in red have absolute location constraints while the window buffer (green) between them has a co-location constraint with the first kernel. This will result in mapper failure.
Figure 1: Conflicting Absolute Location/Co-location Constraints
If you have absolute location constraints for kernels that are part of a cascade chain, check that these constraints are compatible with the architecture. In AI Engine architecture, the direction of the cascade changes in each row. If you have absolute constraints and cascades as shown in the following diagram, it will cause mapper failure.
Figure 2: Conflicting Cascade Direction
In some cases the size of the buffers being constrained to a particular tile might exceed the memory capacity of the tile (32 KB in AI Engine architecture). This will result in mapper failure.
Each tile in the AI Engine architecture has two input and two output DMA channels. If you have constrained buffers in such a way that a particular tile needs more than this number of DMA channels, it will result in mapper failure.
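The last two conditions above (tile memory capacity and DMA channel count) can be checked by hand before running the compiler. The following is a minimal sketch of that check, not part of the AI Engine tools. The `TileDemand` structure and the `fitsInTile` helper are hypothetical names introduced for illustration; only the 32 KB and two-input/two-output limits come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Limits quoted in the text for the AI Engine architecture.
constexpr std::size_t kTileMemoryBytes = 32 * 1024; // 32 KB of memory per tile
constexpr int kDmaInputChannels = 2;                // input DMA channels per tile
constexpr int kDmaOutputChannels = 2;               // output DMA channels per tile

// Hypothetical summary of everything your constraints pin to one tile.
// List a double (ping-pong) buffer as two entries, one per copy.
struct TileDemand {
    std::vector<std::size_t> bufferBytes; // size of each buffer constrained to the tile
    int dmaInputs;                        // input DMA channels those buffers require
    int dmaOutputs;                       // output DMA channels those buffers require
};

// Sum of all buffer sizes constrained to the tile.
inline std::size_t totalBufferBytes(const TileDemand& d) {
    std::size_t total = 0;
    for (std::size_t bytes : d.bufferBytes) total += bytes;
    return total;
}

// True when the demand stays within the memory and DMA limits above.
inline bool fitsInTile(const TileDemand& d) {
    assert(d.dmaInputs >= 0 && d.dmaOutputs >= 0);
    return totalBufferBytes(d) <= kTileMemoryBytes
        && d.dmaInputs <= kDmaInputChannels
        && d.dmaOutputs <= kDmaOutputChannels;
}
```

A tile for which `fitsInTile` returns false is a likely source of mapper failure and a good place to start relaxing constraints.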

Reducing Window Buffer Sizes for Very High Memory Density Designs

One of the main considerations when determining the window sizes for a design is that the number of cycles required for data loading is balanced with the number of compute cycles required by the kernel. This helps to pipeline the ping and pong buffer data loading with the kernel compute. For very high memory density designs, it makes sense to have smaller window sizes which can still balance the kernel compute because having larger window sizes might lead to mapper failure.

The following table shows the number of cycles required for the matrix multiplication of two matrices with 16-bit data. Example 1 and Example 2 have different matrix sizes, but both have their compute and data loading balanced. Note that only the larger of the A or B matrix size determines the data loading time whereas the time of kernel compute is determined by both sizes. This shows that Example 1 has smaller window sizes than Example 2, but the compute and data loading are balanced and can be pipelined.

Table 1. Matrix Multiplication Examples
	Matrix A Size	Matrix B Size	# Multiply Ops	# Cycles for Compute (32 ops/cycle)	# Cycles for Data Loading (32 bits/cycle)
Example 1	16x64	64x16	16384	512 (16384/32)	512 (64x16x16/32)
Example 2	16x64	64x32	32768	1024 (32768/32)	1024 (64x32x16/32)
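The arithmetic behind Table 1 can be reproduced with a short helper. This is a sketch under the table's own assumptions: 16-bit data, 32 multiply operations per cycle, and 32 bits loaded per cycle, with only the larger of the two matrices determining load time. The function name `estimateMatMul` and the `CycleEstimate` structure are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Cycle estimate for multiplying A (m x k) by B (k x n).
struct CycleEstimate {
    std::uint64_t multOps;       // total multiply operations: m * k * n
    std::uint64_t computeCycles; // multOps / opsPerCycle
    std::uint64_t loadCycles;    // bits of the larger matrix / loadBitsPerCycle
};

inline CycleEstimate estimateMatMul(std::uint64_t m, std::uint64_t k, std::uint64_t n,
                                    std::uint64_t bitsPerElement = 16,
                                    std::uint64_t opsPerCycle = 32,
                                    std::uint64_t loadBitsPerCycle = 32) {
    assert(opsPerCycle > 0 && loadBitsPerCycle > 0);
    const std::uint64_t multOps = m * k * n;
    // Only the larger of the A and B matrices determines the loading time.
    const std::uint64_t largerMatrixElems = std::max(m * k, k * n);
    return CycleEstimate{multOps,
                         multOps / opsPerCycle,
                         largerMatrixElems * bitsPerElement / loadBitsPerCycle};
}
```

For Example 1, `estimateMatMul(16, 64, 16)` gives 16384 multiply operations, 512 compute cycles, and 512 load cycles. For Example 2, `estimateMatMul(16, 64, 32)` gives 32768, 1024, and 1024. A window size is balanced when the last two values match.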

Providing User Guidance to Mapper

In some cases, the mapper error might be due to limitations of the tool. In such cases, it might be useful to provide guidance to the tool.

Turn off automatic AI Engine compiler floorplanning using the command line switch.
```
--Xmapper=disableFloorplanning
```
Create your own floorplanning by using either bounding box constraints in the graph or areaGroup constraints in the constraints file. This is particularly useful if there are multiple disjoint graphs in the design. Each of these separate graphs can be constrained to a particular region of the array. This technique not only helps design convergence but can also help improve performance by minimizing interference between different graph buffers. The following diagram shows an example using a 16 antenna transmit chain design. The highlighted kernels in blue belong to antenna 4.
Figure 3: Kernels in Antenna 4

The constraints file syntax to achieve this is as follows.
```
{
  "GlobalConstraints": {
    "areaGroup": {
      "name": "ant4_cores",
      "nodeGroup": ["tx_chain4.*"],
      "tileGroup": ["(16,0):(19,3)"]
    }
  }
}
```
The bounding box syntax to achieve this is as follows.
```
location<graph>(tx_chain4) = bounding_box(16,0,19,3);
```
Add co-location or absolute location constraints if necessary. Try this if the preceding guidance does not get the design to converge. Co-location constraints can be added between kernels and the buffers or system memory that you expect to be mapped to the same tile, as shown in the following example.
```
location<buffer>(kernel_1.out[0]) = location<kernel>(kernel_1);
location<stack>(kernel_1) = location<kernel>(kernel_1);
```
Absolute location constraints can also be added to certain key kernels or buffers to act as anchors and guide the mapper's placement of other components, as shown in the following example.
```
location<kernel>(kernel_1) = tile(20, 0);
location<buffer>(kernel_1.in[0]) =
        { address(19, 0, 0x0),
          address(19, 0, 0x2000) }; // double buffer needs two locations
```

Routing Solution Not Found

This section describes how to check router congestion, how to determine whether the mapper has put the router into an impossible situation, and how to fix it. It also describes how to check whether disabling packet switching is affecting congestion.

First check the user-defined constraints, which include the FIFO depth constraints and the area group constraints, and then check the mapping results.

FIFO Depth Constraints

There are limited switch and DMA FIFOs on the device. When deciding on fifo_depth constraints, it is important to consider the number of FIFOs you specify for an area. This includes taking into account whether the nets that have fifo_depth constraints also have area group constraints. If they do, make sure that all fifo_depth constraints can be met within the specified area.

If there is a high contention for switch FIFOs, consider moving to DMA FIFOs. Without changing the fifo_depth you can specify the DMA FIFO type using the following constraint.

location<fifo>(net1) = { dma_fifo() };

With careful consideration, FIFO location constraints can be applied, as shown in the following example.

location<fifo>(net2) = { dma_fifo(aie_tile, 15, 0, 0x3100, 32) };
location<fifo>(net3) = { ss_fifo(shim_tile, 16, 0, 0), dma_fifo(aie_tile, 17, 0, 0x3100, 48) };

Area Group Constraints

Sometimes only the placement of objects is considered when defining area group constraints. This can leave the router unable to form all its connections. The following image shows a variety of area group constraints that all allow the router to form connections. In all three cases, the routing never has to leave the defined area groups.

Figure 4: Routing in Defined Area Groups

In contrast there are two common errors with routing and area group constraints. The left side of the following image shows a missing area group for an object. The second case is one where all objects are contained within separate area groups but the two groups are not adjacent. In this case the router has no way to complete the routing of its nets without violating an area group constraint and the router will fail to find a legal solution.
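The non-adjacent case can be screened for with a simple geometric test on the tileGroup rectangles. The following is a minimal sketch, assuming that two area groups are routable between each other when their rectangles overlap or share an edge. This is an illustrative approximation, not the router's actual legality rule. The `TileRect` structure and `touchesOrOverlaps` function are hypothetical names.

```cpp
#include <cassert>

// Inclusive rectangle of tiles, mirroring a tileGroup such as "(16,0):(19,3)".
struct TileRect {
    int colMin, rowMin, colMax, rowMax;
};

// True when the rectangles overlap or share an edge, so a route can pass
// from one area group to the other without crossing tiles outside both.
// Rectangles that only touch at a corner are treated as not adjacent.
inline bool touchesOrOverlaps(const TileRect& a, const TileRect& b) {
    assert(a.colMin <= a.colMax && a.rowMin <= a.rowMax);
    assert(b.colMin <= b.colMax && b.rowMin <= b.rowMax);
    const bool colsOverlap = a.colMin <= b.colMax && b.colMin <= a.colMax;
    const bool rowsOverlap = a.rowMin <= b.rowMax && b.rowMin <= a.rowMax;
    const bool colsAbut = a.colMax + 1 == b.colMin || b.colMax + 1 == a.colMin;
    const bool rowsAbut = a.rowMax + 1 == b.rowMin || b.rowMax + 1 == a.rowMin;
    return (colsOverlap && rowsOverlap)  // overlapping regions
        || (colsAbut && rowsOverlap)     // side by side
        || (rowsAbut && colsOverlap);    // stacked vertically
}
```

For example, the groups `(0,0):(3,3)` and `(4,0):(7,3)` are adjacent, while `(0,0):(3,3)` and `(6,0):(9,3)` leave two unconstrained columns between them and would match the second error case described above.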

Checking Mapping Results

Check PLIO placement to see if it is causing routing congestion. The most common congestion area for the router is the interface tile region. The device has two more incoming/outgoing stream channels for PLIO than the interface tile has connections going into the AI Engine array.

If PLIOs are being constrained, make sure adequate resources into the AI Engine array are open to handle the locked PLIOs.
If possible, limit PLIO placement near shadow regions to decrease congestion of routing resources.
Check PLIO placement/constraints to verify that crossings such as the one shown in the following image are not occurring.

Improving Design Performance

Design throughput can be negatively impacted by memory or stream stalls and also by large skew between the graph outputs. These situations can be identified by viewing the simulator output in the Vitis™ analyzer tool. The following sections discuss techniques to follow in each of these cases.

Memory Stalls

The objective of the mapper is to prevent buffer conflicts where possible. It also has buffer optimization levels that bloat (increase the size of) buffers to prevent conflicts. These levels range from 0 (default) to 9 and are invoked with the Xmapper option --Xmapper=BufferOptLevel<level num>. At the highest buffer optimization level (9), no two buffers can be placed in the same bank. However, at the higher levels it might become impossible for the mapper to find a solution, in which case it errors out. If you see a large number of memory stalls, the first option is to cycle through the BufferOptLevel values to see whether higher levels produce fewer memory stalls.

Another option is to explicitly tell the mapper not to place two buffers in the same bank. If your simulation analysis indicates that significant throughput degradation is caused by memory stalls resulting from a bank conflict between buffers kernel_0.in[0] and kernel_1.out[0], you can direct the mapper not to place these buffers in the same bank as follows.

not_equal(location<buffer>(kernel_0.in[0]), location<buffer>(kernel_1.out[0]));

If DMA FIFOs are used in the design and they are placed in the same bank as other buffers, the Xrouter option DMAFIFOsInFreeBankOnly can force the router to place these FIFOs in free banks. This eliminates memory conflicts with the DMA FIFOs. If it is not possible to reserve an entire free bank for the DMA FIFO, location constraints can be used together with your own knowledge of where memory buffers are placed. In this case, it is important to know which buffers might cause stalls when they conflict with DMA FIFOs. The constraints can look like the following.

location<fifo>(net2) = { dma_fifo(aie_tile, 15, 0, 0x3100, 32) };

Stream Stalls

The driver of a cascade chain should be placed closer to the head of the cascade chain, if possible. This reduces routing latency for these long latency paths and can reduce the need for expensively large switch/DMA FIFOs.

Make sure fifo_depth constraints have been specified for nets that require buffering.
Minimize streams and favor windows.

If fifo_depth has been specified, check the FIFO report section in AIECompiler.log.

Large Skew Between Identical Graph Outputs

If a design has multiple instantiations of the same graph, consider using the stamp and repeat flow in the mapper. In this flow, you provide input to the mapper that all the graphs should be mapped in an exactly identical manner in the area group region that you have provided for each graph. This not only simplifies the problem for the mapper, it also significantly reduces skew between the outputs of different graphs. This is particularly important for wireless communications designs with multiple antennas. Steps to use the stamp and repeat flow are as follows.

Define an area group for each graph. For example, add the following constraints in the aiecst file.

"GlobalConstraints": {
  "areaGroup": {
    "name": "ant0_cores",
    "nodeGroup": ["tx_chain0.*"],
    "tileGroup": ["(0,0):(3,3)"]
  },
  "areaGroup": {
    "name": "ant1_cores",
    "nodeGroup": ["tx_chain1.*"],
    "tileGroup": ["(4,0):(7,3)"]
  }
}

or add constraints in the graph:

location<graph>(tx_chain0) = bounding_box(0,0,3,3);
location<graph>(tx_chain1) = bounding_box(4,0,7,3);

Define the stamp and repeat constraint in the aiecst file.
```
{
  "GlobalConstraints": {
    "isomorphicGraphGroup": {
      "name": "isoGroup",
      "referenceGraph": "tx_chain0",
      "stampedGraphs": ["tx_chain1"]
    }
  }
}
```
Note that the graph that is designated as the reference graph can also be given additional constraints such as co-location or absolute location constraints. These are automatically applied to other graphs with an appropriate offset.
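To picture what "an appropriate offset" means, the following sketch translates a location from the reference graph's area group into a stamped graph's area group. It assumes the offset is a plain translation by the difference between the lower-left corners of the two area groups. The text above does not document the mapper's actual computation, so treat this as an illustration only. `TileLoc`, `AreaGroup`, and `stampLocation` are hypothetical names.

```cpp
#include <cassert>

// A tile coordinate, as in tile(col, row).
struct TileLoc {
    int col, row;
};

// Inclusive tile rectangle, as in bounding_box(colMin, rowMin, colMax, rowMax).
struct AreaGroup {
    int colMin, rowMin, colMax, rowMax;
};

// Translate a location constrained in the reference graph to the matching
// location in a stamped graph (assumption: simple corner-to-corner offset).
inline TileLoc stampLocation(const TileLoc& ref,
                             const AreaGroup& refArea,
                             const AreaGroup& stampedArea) {
    // An identical mapping requires both areas to have the same shape.
    assert(refArea.colMax - refArea.colMin == stampedArea.colMax - stampedArea.colMin);
    assert(refArea.rowMax - refArea.rowMin == stampedArea.rowMax - stampedArea.rowMin);
    return TileLoc{ref.col + (stampedArea.colMin - refArea.colMin),
                   ref.row + (stampedArea.rowMin - refArea.rowMin)};
}
```

With tx_chain0 in `bounding_box(0,0,3,3)` and tx_chain1 in `bounding_box(4,0,7,3)`, a kernel anchored at `tile(1, 2)` in the reference graph would correspond to `tile(5, 2)` in the stamped graph under this assumption.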
Note: For more information see Mapping Constraints.