Mapper/Router Methodology
Design Convergence
This section describes the process of handling a failure in the AI Engine compiler during the mapper (place) or router steps.
Mapping Solution Not Found
Mapper failures typically occur in one of two phases: the pre-check phase or the actual mapping phase.
Pre-check failures have explicit error messages that point to the exact reason for the failure, as shown in the following example.
ERROR: [aiecompiler 47-772] Inst g.kernel_a is in conflicting pblocks:(0,0) (5,5) and (20,0) (25,5).
You can trace such errors to the design element or the constraints.
Mapping phase failures typically have an error message that looks like the following.
ERROR: [aiecompiler 47-51] AIE Mapper failed to find a legal solution. Please try to relax constraints and/or try alternate strategies like disableFloorplanning.
In this case, use the following steps to either narrow down the cause of the failure if it is design-related or help the tool to find a solution if the failure is tool-related.
Checking User-Defined Constraints
Incorrectly defined user constraints can cause mapper failure. Some of these will be caught by the mapper pre-checker. However, not all of them can be caught in the pre-check phase. In such cases, check for the following conditions.
- If you have a large number of absolute location or co-location constraints in the graph, check that these constraints do not give conflicting directives to the mapper. This might occur because of the checkerboard nature of the AI Engine array as shown in the following diagram.
In this diagram, the kernels in red have absolute location constraints while the window buffer (green) between them has a co-location constraint with the first kernel. This will result in mapper failure.
- If you have absolute location constraints for kernels that are part of a cascade chain, check that these constraints are compatible with the architecture. In AI Engine architecture, the direction of the cascade changes in each row. If you have absolute constraints and cascades as shown in the following diagram, it will cause mapper failure.
- In some cases the size of the buffers being constrained to a particular tile might exceed the memory capacity of the tile (32 KB in AI Engine architecture). This will result in mapper failure.
- Each tile in the AI Engine architecture has two input and two output DMA channels. If you have constrained buffers in such a way that a particular tile needs more than this number of DMA channels, it will result in mapper failure.
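The last two checks in the list above (per-tile memory capacity and DMA channel count) can be sketched as a small script run before compilation. This is only an illustration: the input format, the function name `check_tile_budgets`, and the example buffers are hypothetical and not part of the AI Engine tools. It assumes each entry's `size_bytes` already includes both ping and pong copies where applicable.

```python
from collections import defaultdict

TILE_MEMORY_BYTES = 32 * 1024  # per-tile memory capacity stated in the text
MAX_DMA_IN = 2                 # input DMA channels per tile
MAX_DMA_OUT = 2                # output DMA channels per tile


def check_tile_budgets(buffers):
    """Return a list of violation messages for constrained buffers.

    `buffers` is a hypothetical list of dicts with the keys
    name, tile (col, row), size_bytes, dma_in, dma_out.
    """
    mem = defaultdict(int)
    dma_in = defaultdict(int)
    dma_out = defaultdict(int)
    for b in buffers:
        mem[b["tile"]] += b["size_bytes"]
        dma_in[b["tile"]] += b["dma_in"]
        dma_out[b["tile"]] += b["dma_out"]

    problems = []
    for tile in sorted(mem):
        if mem[tile] > TILE_MEMORY_BYTES:
            problems.append(f"tile {tile}: {mem[tile]} bytes exceeds {TILE_MEMORY_BYTES}")
        if dma_in[tile] > MAX_DMA_IN:
            problems.append(f"tile {tile}: {dma_in[tile]} input DMA channels needed")
        if dma_out[tile] > MAX_DMA_OUT:
            problems.append(f"tile {tile}: {dma_out[tile]} output DMA channels needed")
    return problems


# Hypothetical constraints: two buffers pinned to tile (19, 0).
example = [
    {"name": "a", "tile": (19, 0), "size_bytes": 20 * 1024, "dma_in": 2, "dma_out": 0},
    {"name": "b", "tile": (19, 0), "size_bytes": 16 * 1024, "dma_in": 1, "dma_out": 0},
]
for message in check_tile_budgets(example):
    print(message)
```

In this example the tile is over budget on both memory (36 KB requested) and input DMA channels (three requested), so either condition alone would be enough to make the mapper fail.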
Reducing Window Buffer Sizes for Very High Memory Density Designs
One of the main considerations when determining the window sizes for a design is that the number of cycles required for data loading is balanced with the number of compute cycles required by the kernel. This helps to pipeline the ping and pong buffer data loading with the kernel compute. For very high memory density designs, it makes sense to have smaller window sizes which can still balance the kernel compute because having larger window sizes might lead to mapper failure.
The following table shows the number of cycles required for the matrix multiplication of two matrices with 16-bit data. Example 1 and Example 2 have different matrix sizes, but both have their compute and data loading balanced. Note that only the larger of the A or B matrix size determines the data loading time whereas the time of kernel compute is determined by both sizes. This shows that Example 1 has smaller window sizes than Example 2, but the compute and data loading are balanced and can be pipelined.
Example | Matrix A Size | Matrix B Size | # multops | # Cycles for Compute (32 ops/cycle) | # Cycles for Data Loading (32 bits/cycle) |
---|---|---|---|---|---|
Example 1 | 16x64 | 64x16 | 16384 | 512 (16384/32) | 512 (64x16x16/32) |
Example 2 | 16x64 | 64x32 | 32768 | 1024 (32768/32) | 1024 (64x32x16/32) |
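The arithmetic behind the table can be reproduced with a short calculation. The sketch below assumes the same rates as the table (32 multiply operations per cycle, 32 bits loaded per cycle, 16-bit data); the function name `matmul_cycles` is illustrative only.

```python
def matmul_cycles(a_shape, b_shape, bits_per_element=16,
                  ops_per_cycle=32, load_bits_per_cycle=32):
    """Return (mult_ops, compute_cycles, load_cycles) for A (m x k) times B (k x n)."""
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError("inner matrix dimensions must match")
    mult_ops = m * k * n
    compute_cycles = mult_ops // ops_per_cycle
    # Only the larger of the two input matrices determines the loading time.
    larger_elements = max(m * k, k * n)
    load_cycles = larger_elements * bits_per_element // load_bits_per_cycle
    return mult_ops, compute_cycles, load_cycles


print(matmul_cycles((16, 64), (64, 16)))  # Example 1
print(matmul_cycles((16, 64), (64, 32)))  # Example 2
```

A window size is a reasonable candidate when the second and third values are equal or close, because data loading into one buffer can then overlap with compute on the other.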
Providing User Guidance to Mapper
In some cases, the mapper error might be due to limitations of the tool. In such cases, it might be useful to provide guidance to the tool.
- Turn off automatic AI Engine compiler floorplanning using the command line switch.
--Xmapper=disableFloorplanning
- Create your own floorplanning by using either bounding box constraints in the graph or areaGroup constraints in the constraints file. This is particularly useful if there are multiple disjoint graphs in the design. Each of these separate graphs can be constrained to a particular region of the array. This technique not only helps design convergence but can also help improve performance by minimizing interference between different graph buffers. The following diagram shows an example using a 16 antenna transmit chain design. The highlighted kernels in blue belong to antenna 4. The constraints file syntax to achieve this is as follows.
{ "GlobalConstraints": { "areaGroup": { "name": "ant4_cores", "nodeGroup": ["tx_chain4.*"], "tileGroup": ["(16,0):(19,3)"] } } }
The bounding box syntax to achieve this is as follows.
location<graph>(tx_chain4) = bounding_box(16,0,19,3);
- Add co-location or absolute location constraints if necessary. If this guidance does not get the design to converge, you can try adding co-location or absolute location constraints. Co-location constraints can be added between kernels and buffers or system memory that you expect to be mapped to the same tile, as shown in the following example.
location<buffer>(kernel_1.out[0]) = location<kernel>(kernel_1);
location<stack>(kernel_1) = location<kernel>(kernel_1);
Absolute location constraints can also be added to certain key kernels or buffers to act as anchors and guide the mapper's placement of other components, as shown in the following example.
location<kernel>(kernel_1) = tile(20, 0);
location<buffer>(kernel_1.in[0]) = { address(19, 0, 0x0), address(19, 0, 0x2000) }; // double buffer needs two locations
Routing Solution Not Found
This section describes how to check router congestion to determine whether the mapper has put the router into an impossible situation, and how to fix that. It also covers how to check whether disabling packet switching is affecting the congestion.
First, check the user-defined constraints, which include the FIFO depth constraints and the area group constraints, and then check the mapping results.
FIFO Depth Constraints
There are limited switch and DMA FIFOs on the device. When deciding on fifo_depth constraints, it is important to consider the number of FIFOs you specify for an area. This includes taking into account whether the nets that have fifo_depth constraints also have area group constraints. In this case, make sure that all fifo_depth constraints can be met within the specified area.
If there is high contention for switch FIFOs, consider moving to DMA FIFOs. Without changing the fifo_depth you can specify the DMA FIFO type using the following constraint.
location<fifo>(net1) = { dma_fifo() };
With careful consideration, FIFO location constraints can be applied, as shown in the following example.
location<fifo>(net2) = { dma_fifo(aie_tile, 15, 0, 0x3100, 32) };
location<fifo>(net3) = { ss_fifo(shim_tile, 16, 0, 0), dma_fifo(aie_tile, 17, 0, 0x3100, 48) };
Area Group Constraints
Sometimes only the placement of objects is considered when defining the area group constraints. This can leave the router unable to form all of its connections. The following image shows a variety of area group constraints that all allow routing to form connections. In all three of these cases, the router never has to leave the defined area groups to complete its routing.
In contrast there are two common errors with routing and area group constraints. The left side of the following image shows a missing area group for an object. The second case is one where all objects are contained within separate area groups but the two groups are not adjacent. In this case the router has no way to complete the routing of its nets without violating an area group constraint and the router will fail to find a legal solution.
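The adjacency condition described above can be checked before running the compiler. The following is a minimal sketch under a simplified model: it treats two area groups as connected only when their tile rectangles overlap or share an edge (corner-only contact does not count), which is an assumption rather than the router's actual rule. The function names are hypothetical; the string format matches the tileGroup entries used in this document.

```python
import re


def parse_tile_group(text):
    """Parse a tileGroup string such as "(16,0):(19,3)" into (c0, r0, c1, r1)."""
    m = re.fullmatch(r"\((\d+),(\d+)\):\((\d+),(\d+)\)", text.strip())
    if m is None:
        raise ValueError(f"unrecognized tileGroup: {text!r}")
    c0, r0, c1, r1 = map(int, m.groups())
    return c0, r0, c1, r1


def groups_touch(a, b):
    """True if two inclusive tile rectangles overlap or share an edge."""
    # Number of empty columns/rows between the rectangles (negative if overlapping).
    col_gap = max(a[0], b[0]) - min(a[2], b[2]) - 1
    row_gap = max(a[1], b[1]) - min(a[3], b[3]) - 1
    if col_gap > 0 or row_gap > 0:
        return False
    # Both gaps zero means the rectangles only meet at a corner.
    return not (col_gap == 0 and row_gap == 0)


left = parse_tile_group("(0,0):(3,3)")
print(groups_touch(left, parse_tile_group("(4,0):(7,3)")))  # side by side
print(groups_touch(left, parse_tile_group("(5,0):(8,3)")))  # one empty column between
```

A `False` result for two groups that are connected by a net suggests the router would have to leave the area groups, which corresponds to the second error case described above.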
Checking Mapping Results
Check PLIO placement to see if it is causing routing congestion. The most common congestion area for the router is in the interface tile region. The device has two more incoming/outgoing stream channels for PLIO than the interface tile has connections going into the AI Engine array.
- If PLIOs are being constrained make sure adequate resources into the AI Engine array are open to handle the locked PLIOs.
- If possible, limit PLIO placement near shadow regions to decrease congestion of routing resources.
- Check PLIO placement/constraints to verify that a crossing such as the one shown in the following image is not occurring.
Improving Design Performance
Design throughput can be negatively impacted by memory or stream stalls and also by large skew between the graph outputs. These situations can be identified by viewing the simulator output in the Vitis™ analyzer tool. The following sections discuss techniques to be followed in each of these cases.
Memory Stalls
The objective of the mapper is to prevent buffer conflicts, where possible. It also has different buffer optimization levels that try to bloat or increase the size of buffers to prevent conflicts. These buffer optimization levels range from 0 (default) to 9. They are invoked with the Xmapper option, --Xmapper=BufferOptLevel<level num>. At the highest buffer optimization level (9), no two buffers can be placed in the same bank. However, it is important to know that at the higher buffer optimization levels it might become impossible for the mapper to find a solution, and it will error out. So the first option if you see a large number of memory stalls is to cycle through the BufferOptLevel options to see if fewer memory stalls are seen at higher levels.
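Cycling through the levels can be organized with a small helper script. The sketch below only builds the option strings in the form given above and picks a level from results you record yourself; it does not invoke the compiler. The function names and the stall counts in the example are hypothetical.

```python
def buffer_opt_flags(levels=range(0, 10)):
    """Build one --Xmapper=BufferOptLevel<n> option string per level."""
    flags = []
    for level in levels:
        if not 0 <= level <= 9:
            raise ValueError(f"BufferOptLevel must be 0..9, got {level}")
        flags.append(f"--Xmapper=BufferOptLevel{level}")
    return flags


def best_level(stalls_by_level):
    """Pick the level with the fewest memory stalls.

    `stalls_by_level` maps level -> stall count observed in simulation,
    or None when the mapper failed to find a solution at that level.
    Ties are broken in favor of the lower level.
    """
    ok = {lvl: n for lvl, n in stalls_by_level.items() if n is not None}
    if not ok:
        return None
    return min(ok, key=lambda lvl: (ok[lvl], lvl))


for flag in buffer_opt_flags():
    print(flag)

# Hypothetical results: level 9 did not converge.
print(best_level({0: 120, 3: 40, 6: 40, 9: None}))
```

Preferring the lower level on a tie is a design choice made here because higher levels make it harder for the mapper to converge.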
Another possibility is that you can explicitly inform the mapper not to place two buffers in the same bank. If your simulation analysis indicates that a significant throughput degradation is caused by a memory stall resulting from a bank conflict between buffer kernel_0.in[0] and kernel_1.out[0], you can provide a directive to the mapper to not place these buffers in the same bank as follows.
not_equal(location<buffer>(kernel_0.in[0]), location<buffer>(kernel_1.out[0]));
The Xrouter option DMAFIFOsInFreeBankOnly can force the router to place DMA FIFOs in free banks. This eliminates memory conflicts with the DMA FIFOs. If it is not possible to reserve an entire free bank for the DMA FIFO, location constraints can be used in coordination with your own knowledge of the memory buffers. In this case, it is important to know which buffers might cause stalls when conflicting with DMA FIFOs. The constraints can look like the following.
location<fifo>(net2) = { dma_fifo(aie_tile, 15, 0, 0x3100, 32) };
Stream Stalls
The driver of a cascade chain should be placed closer to the head of the cascade chain, if possible. This reduces routing latency for these long latency paths and can reduce the need for expensively large switch/DMA FIFOs.
- Make sure fifo_depth constraints have been specified for nets that require buffering.
- Minimize streams and favor windows.
If fifo_depth has been specified, you can check the log for the FIFO report section in AIECompiler.log.
Large Skew Between Identical Graph Outputs
If a design has multiple instantiations of the same graph, consider using the stamp and repeat flow in the mapper. In this flow, you provide input to the mapper that all the graphs should be mapped in an exactly identical manner in the area group region that you have provided for each graph. This not only simplifies the problem for the mapper, it also significantly reduces skew between the outputs of different graphs. This is particularly important for wireless communications designs with multiple antennas. Steps to use the stamp and repeat flow are as follows.
- Define an area group for each graph. For example, add the following constraints in the aiecst file.
"GlobalConstraints": { "areaGroup": { "name": "ant0_cores", "nodeGroup": ["tx_chain0.*"], "tileGroup": ["(0,0):(3,3)"] }, "areaGroup": { "name": "ant1_cores", "nodeGroup": ["tx_chain1.*"], "tileGroup": ["(4,0):(7,3)"] } }
Or add constraints in the graph:
location<graph>(tx_chain0) = bounding_box(0,0,3,3); location<graph>(tx_chain1) = bounding_box(4,0,7,3);
- Define stamp and repeat constraint in the aiecst file.
{ "GlobalConstraints": { "isomorphicGraphGroup": { "name": "isoGroup", "referenceGraph": "tx_chain0", "stampedGraphs": ["tx_chain1"] } } }
Note that the graph that is designated as the reference graph can also be given additional constraints such as co-location or absolute location constraints. These are automatically applied to other graphs with an appropriate offset.
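For designs with many identical graphs, the constraint text can be generated rather than written by hand. The sketch below assumes a layout inferred from the examples in this document, where each chain occupies a 4x4 region placed side by side starting at column 0; that layout, the `tx_chain` prefix as a parameter, and all function names are assumptions for illustration. It only produces text in the two syntaxes shown above.

```python
import json


def chain_bounding_box(index, width=4, height=4):
    """Return (c0, r0, c1, r1) for the index-th chain, regions placed side by side."""
    c0 = index * width
    return (c0, 0, c0 + width - 1, height - 1)


def graph_constraints(prefix, count, width=4, height=4):
    """Build one bounding_box graph constraint line per chain."""
    lines = []
    for i in range(count):
        c0, r0, c1, r1 = chain_bounding_box(i, width, height)
        lines.append(f"location<graph>({prefix}{i}) = bounding_box({c0},{r0},{c1},{r1});")
    return lines


def isomorphic_group(prefix, count, name="isoGroup"):
    """Build the stamp and repeat constraint with chain 0 as the reference graph."""
    return {
        "GlobalConstraints": {
            "isomorphicGraphGroup": {
                "name": name,
                "referenceGraph": f"{prefix}0",
                "stampedGraphs": [f"{prefix}{i}" for i in range(1, count)],
            }
        }
    }


print("\n".join(graph_constraints("tx_chain", 2)))
print(json.dumps(isomorphic_group("tx_chain", 2), indent=2))
```

Giving every chain a region of the same width and height keeps the regions consistent with the requirement that all graphs be mapped identically within their own area.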
Note: For more information see Mapping Constraints.