Optimization Techniques in Vitis HLS
This section outlines the various optimization techniques you can use to direct Vitis HLS to produce a micro-architecture that satisfies the desired performance and area goals. Using Vitis HLS, you can apply different optimization directives to the design, including:
- Pipelining tasks, allowing the next execution of the task to begin before the current execution is complete.
- Specifying a target latency for the completion of functions, loops, and regions.
- Specifying a limit on the number of resources used.
- Overriding the inherent or implied dependencies in the code to permit specific operations. For example, if it is acceptable to discard or ignore the initial data values, such as in a video stream, allow a memory read before write if it results in better performance.
- Specifying the I/O protocol to ensure function arguments can be connected to other hardware blocks with the same I/O protocol. Note: Vitis HLS automatically determines the I/O protocol used by any sub-functions. You cannot control these ports except to specify whether the port is registered.
- Optimizing for Throughput presents the primary optimizations in the order in which they are typically used: pipeline the tasks to improve performance, improve the flow of data between tasks, and optimize structures to address issues that may limit performance.
- Optimizing for Latency uses the techniques of latency constraints and the removal of loop transitions to reduce the number of clock cycles required to complete.
- Optimizing for Area focuses on how operations are implemented: controlling the number of operations, and how those operations are implemented in hardware, is the principal technique for improving area.
- Optimizing Logic discusses optimizations affecting the implementation of the RTL.
You can add optimization directives directly into the source code as compiler pragmas, or you can use Tcl set_directive commands to apply optimization directives in a Tcl script used by a solution during compilation, as discussed in Adding Pragmas and Directives. The following table lists the optimization directives provided by Vitis HLS as either pragma or Tcl directive.
Directive | Description |
---|---|
ALLOCATION | Specify a limit for the number of operations, implementations, or functions used. This can force the sharing of hardware resources and may increase latency. |
ARRAY_PARTITION | Partitions large arrays into multiple smaller arrays or into individual registers, to improve access to data and remove block RAM bottlenecks. |
ARRAY_RESHAPE | Reshape an array from one with many elements to one with greater word-width. Useful for improving block RAM accesses without using more block RAM. |
BIND_OP | Define a specific implementation for an operation in the RTL. |
BIND_STORAGE | Define a specific implementation for a storage element, or memory, in the RTL. |
DATAFLOW | Enables task-level pipelining, allowing functions and loops to execute concurrently. Used to optimize throughput and/or latency. |
DEPENDENCE | Used to provide additional information that can overcome loop-carried dependencies and allow loops to be pipelined (or pipelined with lower intervals). |
DISAGGREGATE | Break a struct down into its individual elements. |
EXPRESSION_BALANCE | Allows automatic expression balancing to be turned off. |
INLINE | Inlines a function, removing function hierarchy at this level. Used to enable logic optimization across function boundaries and improve latency/interval by reducing function call overhead. |
INTERFACE | Specifies how RTL ports are created from the function description. |
LATENCY | Allows a minimum and maximum latency constraint to be specified. |
LOOP_FLATTEN | Allows nested loops to be collapsed into a single loop with improved latency. |
LOOP_MERGE | Merge consecutive loops to reduce overall latency, increase sharing, and improve logic optimization. |
LOOP_TRIPCOUNT | Used for loops which have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting. |
OCCURRENCE | Used when pipelining functions or loops, to specify that the code in a location is executed at a lower rate than the code in the enclosing function or loop. |
PIPELINE | Reduces the initiation interval by allowing the overlapped execution of operations within a loop or function. |
RESET | This directive is used to add or remove reset on a specific state variable (global or static). |
SHARED | Specifies that a global variable, or function argument array, is shared among multiple dataflow processes, without the need for synchronization. |
STABLE | Indicates that a variable input or output of a dataflow region can be ignored when generating the synchronizations at entry and exit of the dataflow region. |
STREAM | Specifies that a specific array is to be implemented as a FIFO or RAM memory channel during dataflow optimization. When using hls::stream, the STREAM optimization directive is used to override the configuration of the hls::stream. |
TOP | The top-level function for synthesis is specified in the project settings. This directive may be used to specify any function as the top-level for synthesis, allowing different solutions within the same project to specify different top-level functions without needing to create a new project. |
UNROLL | Unroll for-loops to create multiple instances of the loop body and its instructions that can then be scheduled independently. |
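For example, the same PIPELINE directive can be applied either as a pragma in the C/C++ source or as an equivalent set_directive command in the solution Tcl script. The following is a minimal sketch; the top-level function top and the loop label Loop_Accum are hypothetical names:

// In the C/C++ source: apply the directive as a pragma inside the loop body
Loop_Accum: for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE II=1
    acc += buf[i];
}

# In the solution Tcl script: the equivalent set_directive command
set_directive_pipeline -II 1 "top/Loop_Accum"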
In addition to the optimization directives, Vitis HLS provides a number of configuration commands that can influence the performance of synthesis results. Details on using configuration commands can be found in Setting Configuration Options. The following table lists some of these commands.
GUI Directive | Description |
---|---|
Config Array Partition | Determines how arrays are partitioned, including global arrays and if the partitioning impacts array ports. |
Config Compile | Controls synthesis specific optimizations such as the automatic loop pipelining and floating point math optimizations. |
Config Dataflow | Specifies the default memory channel and FIFO depth in dataflow optimization. |
Config Interface | Controls I/O ports not associated with the top-level function arguments and allows unused ports to be eliminated from the final RTL. |
Config Op | Configures the default latency and implementation of specified operations. |
Config RTL | Provides control over the output RTL including file and module naming, reset style and FSM encoding. |
Config Schedule | Determines the effort level to use during the synthesis scheduling phase and the verbosity of the output messages. |
Config Storage | Configures the default latency and implementation of specified storage types. |
Config Unroll | Configures the default tripcount threshold for unrolling loops. |
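Configuration commands are typically issued in the solution Tcl script before synthesis. For example, the following sketch (with illustrative values only) disables automatic loop pipelining and makes FIFOs the default dataflow channel:

config_compile -pipeline_loops 0
config_dataflow -default_channel fifo -fifo_depth 16
csynth_design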
Controlling the Reset Behavior
The reset port is used in an FPGA to return the registers and block RAM connected to the reset port to an initial value any time the reset signal is applied. Typically the most important aspect of RTL configuration is selecting the reset behavior.
The presence and behavior of the RTL reset port is controlled using the config_rtl command, as shown in the following figure. You can access this command from the solution settings.
The reset settings include the ability to set the polarity of the reset and whether the reset is synchronous or asynchronous, but more importantly they control, through the reset option, which registers are reset when the reset signal is applied.
Note: When AXI4 interfaces are used in a design, the reset polarity is automatically set to active-Low regardless of the config_rtl configuration. This is required by the AXI4 standard.
The reset option has four settings:
- none: No reset is added to the design.
- control: This is the default and ensures all control registers are reset. Control registers are those used in state machines and to generate I/O protocol signals. This setting ensures the design can immediately start its operation state.
- state: This option adds a reset to control registers (as in the control setting), plus any registers or memories derived from static and global variables in the C/C++ code. This setting ensures static and global variables initialized in the C/C++ code are reset to their initialized value after the reset is applied.
- all: This adds a reset to all registers and memories in the design.
Finer-grain control over reset is provided through the RESET pragma or directive. Static and global variables can have a reset added through the RESET directive. Variables can also be removed from those being reset by using the RESET directive's off option.
Important: When using the state or all options, consider the effect on resetting arrays, as discussed in Initializing and Resetting Arrays.
Initialization Behavior
In C/C++, variables defined with the static qualifier and those defined in the global scope are initialized to zero, by default. These variables may optionally be assigned a specific initial value. For these initialized variables, the value in the C/C++ code is assigned at compile time (at time zero) and never again. In both cases, the initial value is implemented in the RTL.
- During RTL simulation the variables are initialized with the same values as the C/C++ code.
- The variables are also initialized in the bitstream used to program the FPGA. When the device powers up, the variables will start in their initialized state.
In the RTL, although the variables start with the same initial value as the C/C++ code, there is no way to force the variable to return to this initial state. To restore the initial state, variables must be implemented with a reset signal.
Initializing and Resetting Arrays
When the reset option state or all is used, it forces all arrays implemented as block RAM to be returned to their initialized state after reset. This may result in two very undesirable conditions in the RTL design:
- Unlike a power-up initialization, an explicit reset requires the RTL design to iterate through each address in the block RAM to set the value: this can take many clock cycles if the array is large, and requires more area resources to implement the reset.
- A reset is added to every array in the design.
To prevent adding reset logic onto every such block RAM, and incurring the cycle overhead to reset all elements in the RAM, specify the default control reset mode and use the RESET directive to identify individual static or global variables to be reset.
Alternatively, you can use the state reset mode and use the RESET directive's off option to identify individual static or global variables from which to remove the reset.
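For example, with the default control reset mode, an individual static variable can still be given a reset through the RESET pragma. The following is a minimal sketch; the variable name acc is illustrative:

// Reset mode "control": static variables are not reset by default
static int acc;
#pragma HLS RESET variable=acc // add a reset to acc only

// Conversely, with reset mode "state", remove the reset from one variable:
// #pragma HLS RESET variable=acc off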
Optimizing for Throughput
Use the following optimizations to improve throughput or reduce the initiation interval.
Function and Loop Pipelining
Pipelining allows operations to happen concurrently: each execution step does not have to complete all operations before it begins the next operation. Pipelining is applied to functions and loops. The throughput improvements in function pipelining are shown in the following figure.
Without pipelining, the function in the above example reads an input every 3 clock cycles and outputs a value after 2 clock cycles. The function has an initiation interval (II) of 3 and a latency of 3. With pipelining, for this example, a new input is read every cycle (II=1) with no change to the output latency.
Loop pipelining allows the operations in a loop to be implemented in an overlapping manner. In the following figure, (A) shows the default sequential operation where there are 3 clock cycles between each input read (II=3), and it requires 8 clock cycles before the last output write is performed.
In the pipelined version of the loop shown in (B), a new input sample is read every cycle (II=1) and the final output is written after only 4 clock cycles: substantially improving both the II and latency while using the same hardware resources.
Functions or loops are pipelined using the PIPELINE directive. The directive is specified in the region that constitutes the function or loop body. The initiation interval defaults to 1 if not specified but may be explicitly specified.
Pipelining is applied only to the specified region and not to the hierarchy below. However, all loops in the hierarchy below are automatically unrolled. Any sub-functions in the hierarchy below the specified function must be pipelined individually. If the sub-functions are pipelined, the pipelined function above them can take advantage of the pipeline performance. Conversely, any sub-function below the pipelined top-level function that is not pipelined might be the limiting factor in the performance of the pipeline.
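For example, the PIPELINE pragma is placed within the body of the function or loop to be pipelined. The following is a minimal sketch; the function and argument names are illustrative:

void scale4(int in[4], int out[4]) {
#pragma HLS PIPELINE II=1 // pipeline the entire function body
    // Any loop below this level is automatically unrolled
    for (int i = 0; i < 4; i++) {
        out[i] = in[i] * 2;
    }
}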
There is a difference in how pipelined functions and loops behave.
- In the case of functions, the pipeline runs forever and never ends.
- In the case of loops, the pipeline executes until all iterations of the loop are completed.
This difference in behavior is summarized in the following figure.
The difference in behavior impacts how inputs and outputs to the pipeline are processed. As seen in the figure above, a pipelined function will continuously read new inputs and write new outputs. By contrast, because a loop must first finish all operations in the loop before starting the next loop, a pipelined loop causes a “bubble” in the data stream; that is, a point when no new inputs are read as the loop completes the execution of the final iterations, and a point when no new outputs are written as the loop starts new loop iterations.
Rewinding Pipelined Loops for Performance
To avoid the issues shown in the previous figure (Function and Loop Pipelining), the PIPELINE pragma has an optional rewind option. This option enables the iterations of successive calls to the loop to overlap, when the loop is the outermost construct of the top function or of a dataflow process (and the dataflow region is called multiple times).
The following figure shows the operation when the rewind option is used when pipelining a loop. At the end of the loop iteration count, the loop starts to re-execute. While it generally re-executes immediately, a delay is possible and is shown and described in the GUI.
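A sketch of applying the rewind option, assuming the loop is the outermost construct of the top-level function (names are illustrative):

void top(int in[N], int out[N], int offset) {
    Loop_R: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE rewind // iterations of successive calls overlap
        out[i] = in[i] + offset;
    }
}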
Flushing Pipelines
Pipelines continue to execute as long as data is available at the input of the pipeline. If there is no data available to process, the pipeline will stall. This is shown in the following figure, where the input data valid signal goes low to indicate there is no more data. Once there is new data available to process, the pipeline will continue operation.
In some cases, it is desirable to have a pipeline that can be "emptied" or "flushed." The flush option is provided to perform this. When a pipeline is "flushed," the pipeline stops reading new inputs when none are available (as determined by a data valid signal at the start of the pipeline) but continues processing, shutting down each successive pipeline stage, until the final input has been processed through to the output of the pipeline.
The default style of pipelining implemented by Vitis HLS is defined by the config_compile -pipeline_style command. You can specify stalling pipelines (stp) or free-running flushing pipelines (frp) to be used throughout the design. You can also define a third type of flushable pipeline (flp) with the PIPELINE pragma or directive, using the enable_flush option. This option applies only to the specific scope of the pragma or directive, and does not change the global default assigned by config_compile.
Name | Stalled Pipeline (default) | Free-Running/Flushable Pipeline | Flushable Pipeline |
---|---|---|---|
Global Setting | config_compile -pipeline_style stp (default) | config_compile -pipeline_style frp | N/A |
Pragma/Directive | #pragma HLS pipeline | N/A | #pragma HLS pipeline enable_flush |
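For example, the global default can be set in the solution Tcl script while a single loop opts into the flushable style locally. A sketch (the loop label and body are illustrative):

# Tcl: make free-running pipelines (frp) the design-wide default
config_compile -pipeline_style frp

// C++: request a flushable pipeline for one specific loop only
Loop_F: for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE enable_flush
    out[i] = process(in[i]);
}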
Automatic Loop Pipelining
The config_compile configuration enables loops to be pipelined automatically based on the iteration count. This configuration is accessed through the solution settings.
The pipeline_loops option sets the iteration limit: all loops with an iteration count below this limit are automatically pipelined. The default is 64; setting the limit to 0 disables automatic loop pipelining.
Given the following example code:
for (y = 0; y < 480; y++) {
for (x = 0; x < 640; x++) {
for (i = 0; i < 5; i++) {
// do something 5 times
...
}
}
}
If the pipeline_loops option is set to 6, the innermost for loop in the above code snippet will be automatically pipelined. This is equivalent to the following code snippet:
for (y = 0; y < 480; y++) {
for (x = 0; x < 640; x++) {
for (i = 0; i < 5; i++) {
#pragma HLS PIPELINE II=1
// do something 5 times
...
}
}
}
If there are loops in the design for which you do not want to use automatic pipelining, apply the PIPELINE directive with the off option to that loop. The off option prevents automatic loop pipelining.
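For example (a sketch; the loop label and body are illustrative):

Small_Loop: for (int i = 0; i < 5; i++) {
#pragma HLS PIPELINE off // exclude this loop from automatic pipelining
    acc += d[i];
}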
Note: Vitis HLS applies the config_compile pipeline_loops option after performing all user-specified directives. For example, if Vitis HLS applies a user-specified UNROLL directive to a loop, the loop is first unrolled, and automatic loop pipelining cannot be applied.
Unrolling Loops to Improve Pipelining
By default, loops are kept rolled in Vitis HLS. These rolled loops generate a hardware resource which is used by each iteration of the loop. While this creates a resource efficient block, it can sometimes be a performance bottleneck.
Vitis HLS provides the ability to unroll or partially unroll FOR loops using the UNROLL pragma or directive.
The following figure shows both the advantages of loop unrolling and the implications that must be considered when unrolling loops. This example assumes the arrays a[i], b[i], and c[i] are mapped to block RAMs. This example shows how easy it is to create many different implementations by the simple application of loop unrolling.
- Rolled Loop: When the loop is rolled, each iteration is performed in a separate clock cycle. This implementation takes four clock cycles, requires only one multiplier, and each block RAM can be a single-port block RAM.
- Partially Unrolled Loop: In this example, the loop is partially unrolled by a factor of 2. This implementation requires two multipliers and dual-port RAMs to support two reads or writes to each RAM in the same clock cycle. However, it takes only 2 clock cycles to complete: half the initiation interval and half the latency of the rolled loop version.
- Unrolled Loop: In the fully unrolled version, all loop operations can be performed in a single clock cycle. This implementation, however, requires four multipliers. More importantly, it requires the ability to perform 4 reads and 4 writes in the same clock cycle. Because a block RAM has a maximum of two ports, this implementation requires the arrays to be partitioned.
To perform loop unrolling, you can apply the UNROLL directives to individual loops in the design. Alternatively, you can apply the UNROLL directive to a function, which unrolls all loops within the scope of the function.
If a loop is completely unrolled, all operations will be performed in parallel if data dependencies and resources allow. If operations in one iteration of the loop require the result from a previous iteration, they cannot execute in parallel but will execute as soon as the data is available. A completely unrolled and fully optimized loop will generally involve multiple copies of the logic in the loop body.
The following example code demonstrates how loop unrolling can be used to create an optimized design. In this example, the data is stored in the arrays as interleaved channels. If the loop is pipelined with II=1, each channel is only read and written every eighth clock cycle.
// Array Order : 0 1 2 3 4 5 6 7 8 9 10 etc. 16 etc...
// Sample Order: A0 B0 C0 D0 E0 F0 G0 H0 A1 B1 C2 etc. A2 etc...
// Output Order: A0 B0 C0 D0 E0 F0 G0 H0 A0+A1 B0+B1 C0+C2 etc. A0+A1+A2 etc...
#define CHANNELS 8
#define SAMPLES 400
#define N CHANNELS * SAMPLES
void foo (dout_t d_out[N], din_t d_in[N]) {
int i, rem;
// Store accumulated data
static dacc_t acc[CHANNELS];
// Accumulate each channel
For_Loop: for (i=0;i<N;i++) {
rem=i%CHANNELS;
acc[rem] = acc[rem] + d_in[i];
d_out[i] = acc[rem];
}
}
Partially unrolling the loop by a factor of 8 will allow each of the channels (every eighth sample) to be processed in parallel (if the input and output arrays are also partitioned in a cyclic manner to allow multiple accesses per clock cycle). If the loop is also pipelined with the rewind option, this design will continuously process all 8 channels in parallel if called in a pipelined fashion (that is, either at the top, or within a dataflow region).
void foo (dout_t d_out[N], din_t d_in[N]) {
#pragma HLS ARRAY_PARTITION variable=d_in cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=d_out cyclic factor=8 dim=1
int i, rem;
// Store accumulated data
static dacc_t acc[CHANNELS];
// Accumulate each channel
For_Loop: for (i=0;i<N;i++) {
#pragma HLS PIPELINE rewind
#pragma HLS UNROLL factor=8
rem=i%CHANNELS;
acc[rem] = acc[rem] + d_in[i];
d_out[i] = acc[rem];
}
}
Partial loop unrolling does not require the unroll factor to be an integer factor of the maximum iteration count. Vitis HLS adds an exit check to ensure partially unrolled loops are functionally identical to the original loop. For example, given the following code:
for(int i = 0; i < N; i++) {
a[i] = b[i] + c[i];
}
Loop unrolling by a factor of 2 effectively transforms the code to look
like the following example where the break
construct
is used to ensure the functionality remains the same:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
if (i+1 >= N) break;
a[i+1] = b[i+1] + c[i+1];
}
Because N is a variable, Vitis HLS might not be able to determine its maximum value (it could be driven from an input port). If the unrolling factor, which is 2 in this case, is an integer factor of the maximum iteration count N, the skip_exit_check option removes the exit check and associated logic. The effect of unrolling can now be represented as:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
}
This helps minimize the area and simplify the control logic.
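A sketch of applying the option, assuming the caller guarantees that N is always a multiple of 2:

for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2 skip_exit_check // no exit check is generated
    a[i] = b[i] + c[i];
}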
Addressing Failure to Pipeline
When a function is pipelined, all loops in the hierarchy below are automatically unrolled. This is a requirement for pipelining to proceed. If a loop has variable bounds it cannot be unrolled. This will prevent the function from being pipelined.
Static Variables
Static variables are used to keep data between loop iterations, often resulting in registers in the final implementation. If this is encountered in pipelined functions, Vitis HLS might not be able to optimize the design sufficiently, which would result in initiation intervals longer than required.
The following is a typical example of this situation:
function_foo()
{
    static bool change = 0;
    if (condition_xyz) {
        change = x; // store
    }
    y = change; // load
}
If Vitis HLS cannot optimize this code, the store operation requires a cycle and the load operation requires an additional cycle. If this function is part of a pipeline, the pipeline has to be implemented with a minimum initiation interval of 2, as the static change variable creates a loop-carried dependency.
One way the user can avoid this is to rewrite the code, as shown in the following example. It ensures that only a read or a write operation is present in each iteration of the loop, which enables the design to be scheduled with II=1.
function_readstream()
{
static bool change = 0;
bool change_temp = 0;
if (condition_xyz)
{
change = x; // store
change_temp = x;
}
else
{
change_temp = change; // load
}
y = change_temp;
}
Partitioning Arrays to Improve Pipelining
A common issue when pipelining functions is the following message:
INFO: [SCHED 204-61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('mem_load_2',
bottleneck.c:62) on array 'mem' due to limited memory ports.
WARNING: [SCHED 204-69] The resource limit of core:RAM:mem:p0 is 1, current
assignments:
WARNING: [SCHED 204-69] 'load' operation ('mem_load', bottleneck.c:62) on array
'mem',
WARNING: [SCHED 204-69] The resource limit of core:RAM:mem:p1 is 1, current
assignments:
WARNING: [SCHED 204-69] 'load' operation ('mem_load_1', bottleneck.c:62) on array
'mem',
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
In this example, Vitis HLS states it cannot reach the specified initiation interval (II) of 1 because it cannot schedule a load (read) operation (mem_load_2) onto the memory due to limited memory ports. The above message notes that the resource limit for core:RAM:mem:p0 is 1, which is used by the operation mem_load on line 62. The second port of the block RAM also has only 1 resource, which is used by operation mem_load_1. Due to this memory port contention, Vitis HLS reports a final II of 2 instead of the desired 1.
This issue is typically caused by arrays. Arrays are implemented as block RAM which only has a maximum of two data ports. This can limit the throughput of a read/write (or load/store) intensive algorithm. The bandwidth can be improved by splitting the array (a single block RAM resource) into multiple smaller arrays (multiple block RAMs), effectively increasing the number of ports.
Arrays are partitioned using the ARRAY_PARTITION directive. Vitis HLS provides three types of array partitioning, as shown in the following figure. The three styles of partitioning are:
- block: The original array is split into equally sized blocks of consecutive elements of the original array.
- cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.
- complete: The default operation is to split the array into its individual elements. This corresponds to resolving a memory into registers.
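For example, the three styles can be requested with the ARRAY_PARTITION pragma as follows (a sketch assuming 8-element arrays):

int a[8], b[8], c[8];
#pragma HLS ARRAY_PARTITION variable=a block factor=2 dim=1  // a[0..3] and a[4..7]
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=2 dim=1 // b[0,2,4,6] and b[1,3,5,7]
#pragma HLS ARRAY_PARTITION variable=c complete dim=1        // eight individual registers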
For block and cyclic partitioning, the factor option specifies the number of arrays that are created. In the preceding figure, a factor of 2 is used, that is, the array is divided into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the final array has fewer elements.
When partitioning multi-dimensional arrays, the dimension option is used to specify which dimension is partitioned. The following figure shows how the dimension option is used to partition the following example code:
void foo (...) {
int my_array[10][6][4];
...
}
The examples in the figure demonstrate how partitioning dimension 3 results in 4 separate arrays and partitioning dimension 1 results in 10 separate arrays. If zero is specified as the dimension, all dimensions are partitioned.
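For example, for the my_array declaration above, the alternatives are (a sketch; apply only one pragma per array):

#pragma HLS ARRAY_PARTITION variable=my_array complete dim=3 // 4 arrays, each [10][6]
#pragma HLS ARRAY_PARTITION variable=my_array complete dim=1 // 10 arrays, each [6][4]
#pragma HLS ARRAY_PARTITION variable=my_array complete dim=0 // all dimensions partitioned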
Automatic Array Partitioning
The config_array_partition configuration determines how arrays are automatically partitioned based on the number of elements. This configuration is accessed through the solution settings.
Managing Pipeline Dependencies
Vitis HLS constructs a hardware datapath that corresponds to the C/C++ source code.
When there is no pipeline directive, the execution is sequential so there are no dependencies to take into account. But when the design has been pipelined, the tool needs to deal with the same dependencies as found in processor architectures for the hardware that Vitis HLS generates.
Typical cases of data dependencies or memory dependencies are when a read or a write occurs after a previous read or write.
- A read-after-write (RAW), also called a true dependency, is when an instruction (and the data it reads/uses) depends on the result of a previous operation.
  - I1: t = a * b;
  - I2: c = t + 1;
  The read in statement I2 depends on the write of t in statement I1. If the instructions are reordered, it uses the previous value of t.
- A write-after-read (WAR), also called an anti-dependence, is when an instruction cannot update a register or memory (by a write) before a previous instruction has read the data.
  - I1: b = t + a;
  - I2: t = 3;
  The write in statement I2 cannot execute before statement I1, otherwise the result of b is invalid.
- A write-after-write (WAW) is a dependence where a register or memory must be written in a specific order, otherwise other instructions might be corrupted.
  - I1: t = a * b;
  - I2: c = t + 1;
  - I3: t = 1;
  The write in statement I3 must happen after the write in statement I1. Otherwise, the statement I2 result is incorrect.
- A read-after-read has no dependency, as instructions can be freely reordered if the variable is not declared as volatile. If it is, the order of instructions has to be maintained.
For example, when a pipeline is generated, the tool needs to take care that a register or memory location read at a later stage has not been modified by a previous write. This is a true dependency or read-after-write (RAW) dependency. A specific example is:
int top(int a, int b) {
int t,c;
I1: t = a * b;
I2: c = t + 1;
return c;
}
Statement I2 cannot be evaluated before statement I1 completes because there is a dependency on variable t. In hardware, if the multiplication takes 3 clock cycles, then I2 is delayed for that amount of time. If the above function is pipelined, Vitis HLS detects this as a true dependency and schedules the operations accordingly. It uses data forwarding optimization to remove the RAW dependency, so that the function can operate at II=1.
Memory dependencies arise when the example applies to an array rather than scalar variables.
int top(int a) {
int r=1,rnext,m,i,out;
static int mem[256];
L1: for(i=0;i<=254;i++) {
#pragma HLS PIPELINE II=1
I1: m = r * a; mem[i+1] = m; // line 7
I2: rnext = mem[i]; r = rnext; // line 8
}
return r;
}
In the above example, scheduling of loop L1 leads to a scheduling warning message:
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1,
distance = 1)
between 'store' operation (top.cpp:7) of variable 'm', top.cpp:7 on array 'mem' and
'load' operation ('rnext', top.cpp:8) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
There are no issues within the same iteration of the loop, as you write one index and read another. The two instructions could execute at the same time, concurrently. However, observe the reads and writes over a few iterations:
// Iteration for i=0
I1: m = r * a; mem[1] = m; // line 7
I2: rnext = mem[0]; r = rnext; // line 8
// Iteration for i=1
I1: m = r * a; mem[2] = m; // line 7
I2: rnext = mem[1]; r = rnext; // line 8
// Iteration for i=2
I1: m = r * a; mem[3] = m; // line 7
I2: rnext = mem[2]; r = rnext; // line 8
When considering two successive iterations, the multiplication result m (with a latency = 2) from statement I1 is written to a location that is read by statement I2 of the next iteration of the loop into rnext. In this situation, there is a RAW dependence, as the next loop iteration cannot start reading mem[i] before the previous computation's write completes.
Note that if the clock frequency is increased, then the multiplier needs more pipeline stages and increased latency. This will force II to increase as well.
Consider the following code, where the operations have been swapped, changing the functionality.
int top(int a) {
int r,m,i;
static int mem[256];
L1: for(i=0;i<=254;i++) {
#pragma HLS PIPELINE II=1
I1: r = mem[i]; // line 7
I2: m = r * a , mem[i+1]=m; // line 8
}
return r;
}
The scheduling warning is:
INFO: [SCHED 204-61] Pipelining loop 'L1'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1,
distance = 1)
between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 2,
distance = 1)
between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 3,
distance = 1)
between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
and 'load' operation ('r', top.cpp:7) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 4, Depth: 4.
Observe the continued read and writes over a few iterations:
Iteration with i=0
I1: r = mem[0]; // line 7
I2: m = r * a , mem[1]=m; // line 8
Iteration with i=1
I1: r = mem[1]; // line 7
I2: m = r * a , mem[2]=m; // line 8
Iteration with i=2
I1: r = mem[2]; // line 7
I2: m = r * a , mem[3]=m; // line 8
A longer II is needed because the RAW dependence is via reading r from mem[i], performing the multiplication, and writing to mem[i+1].
Removing False Dependencies to Improve Loop Pipelining
False dependencies are dependencies that arise when the compiler is too conservative. These dependencies do not exist in the real code, but cannot be determined by the compiler. These dependencies can prevent loop pipelining.
The following example illustrates false dependencies. In this example, the read and write accesses are to two different addresses in the same loop iteration. Both of these addresses are dependent on the input data and can point to any individual element of the hist array. Because of this, Vitis HLS assumes that both of these accesses can access the same location. As a result, it schedules the read and write operations to the array in alternating cycles, resulting in a loop II of 2. However, the code shows that hist[old] and hist[val] can never access the same location because they are in the else branch of the conditional if(old == val).
void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
int acc = 0;
int i, val;
int old = in[0];
for(i = 0; i < INPUT SIZE; i++)
{
#pragma HLS PIPELINE II=1
val = in[i];
if(old == val)
{
acc = acc + 1;
}
else
{
hist[old] = acc;
acc = hist[val] + 1;
}
old = val;
}
hist[old] = acc;
}
To overcome this deficiency, you can use the DEPENDENCE directive to provide Vitis HLS with additional information about the dependencies.
void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
int acc = 0;
int i, val;
int old = in[0];
#pragma HLS DEPENDENCE variable=hist intra RAW false
for(i = 0; i < INPUT SIZE; i++)
{
#pragma HLS PIPELINE II=1
val = in[i];
if(old == val)
{
acc = acc + 1;
}
else
{
hist[old] = acc;
acc = hist[val] + 1;
}
old = val;
}
hist[old] = acc;
}
When specifying dependencies there are two main types:
- Inter: Specifies that the dependency is between different iterations of the same loop. If this is specified as FALSE, it allows Vitis HLS to perform operations in parallel if the loop is pipelined, unrolled, or partially unrolled, and prevents such concurrent operation when specified as TRUE.
- Intra: Specifies a dependence within the same iteration of a loop, for example an array being accessed at the start and end of the same iteration. When intra dependencies are specified as FALSE, Vitis HLS may move operations freely within the loop, increasing their mobility and potentially improving performance or area. When the dependency is specified as TRUE, the operations must be performed in the order specified.
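For example, an inter-iteration dependence can be declared false when the access pattern guarantees that different iterations never touch the same address. A sketch (the array and bound names are illustrative; the correctness of the assertion is the user's responsibility):

// Writes go to even addresses, reads come from odd addresses, so no
// loop-carried (inter-iteration) dependence exists on buf
#pragma HLS DEPENDENCE variable=buf inter false
for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    buf[2*i] = buf[2*i+1] + in[i];
}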
Scalar Dependencies
Some scalar dependencies are much harder to resolve and often require changes to the source code. A scalar data dependency could look like the following:
while (a != b) {
if (a > b) a -= b;
else b -= a;
}
The next iteration of this loop cannot start until the current iteration has calculated the updated values of a and b, as shown in the following figure.
If the result of the previous loop iteration must be available before the current iteration can begin, loop pipelining is not possible. If Vitis HLS cannot pipeline with the specified initiation interval, it increases the initiation interval. If it cannot pipeline at all, as shown by the above example, it halts pipelining and proceeds to output a non-pipelined design.
Exploiting Task Level Parallelism: Dataflow Optimization
The dataflow optimization is useful on a set of sequential tasks (for example, functions and/or loops), as shown in the following figure.
The above figure shows a specific case of a chain of three tasks, but the communication structure can be more complex than shown.
Using this series of sequential tasks, dataflow optimization creates an architecture of concurrent processes, as shown below. Dataflow optimization is a powerful method for improving design throughput and latency.
The following figure shows how dataflow optimization allows the execution of tasks to overlap, increasing the overall throughput of the design and reducing latency.
In the following figure and example, (A) represents the case without the dataflow optimization. The implementation requires 8 cycles before a new input can be processed by func_A and 8 cycles before an output is written by func_C.
For the same example, (B) represents the case when the dataflow optimization is applied. func_A can begin processing a new input every 3 clock cycles (lower initiation interval) and it now requires only 5 clocks to output a final value (shorter latency).
This type of parallelism cannot be achieved without incurring some overhead in hardware. When a particular region, such as a function body or a loop body, is identified as a region to apply the dataflow optimization, Vitis HLS analyzes the function or loop body and creates individual channels that model the dataflow to store the results of each task in the dataflow region. These channels can be simple FIFOs for scalar variables, or ping-pong (PIPO) buffers for non-scalar variables like arrays. Each of these channels also contains signals to indicate when the FIFO or the ping-pong buffer is full or empty. These signals represent a handshaking interface that is completely data driven. By having individual FIFOs and/or ping-pong buffers, Vitis HLS frees each task to execute at its own pace, and the throughput is only limited by the availability of the input and output buffers. This allows for better interleaving of task execution than a normal pipelined implementation, but does so at the cost of additional FIFO or block RAM registers for the ping-pong buffer, as shown in the following figure.
Dataflow optimization potentially improves performance over a statically pipelined solution. It replaces the strict, centrally-controlled pipeline stall philosophy with more flexible and distributed handshaking architecture using FIFOs and/or ping-pong buffers (PIPOs). The replacement of the centralized control structure with a distributed one also benefits the fanout of control signals, for example register enables, which is distributed among the control structures of individual processes.
Dataflow optimization is not limited to a chain of processes, but can be used on any directed acyclic graph (DAG) structure. It can produce two different forms of overlapping: within an iteration if processes are connected with FIFOs, and across different iterations through PIPOs and FIFOs.
Canonical Forms
Vitis HLS transforms the region to apply the DATAFLOW optimization. Xilinx recommends writing the code inside this region (referred to as the canonical region) using canonical forms. There are two main canonical forms for the dataflow optimization:
- The canonical form for a function where sub-functions are not inlined:
  void dataflow(Input0, Input1, Output0, Output1) {
  #pragma HLS dataflow
      UserDataType C0, C1, C2;
      func1(read Input0, read Input1, write C0, write C1);
      func2(read C0, read C1, write C2);
      func3(read C2, write Output0, write Output1);
  }
- Dataflow inside a loop body. For the for loop (where no function inside is inlined), the integral loop variable should have:
  - Initial value declared in the loop header and set to 0.
  - The loop condition is a positive numerical constant or constant function argument.
  - Increment by 1.
  - Dataflow pragma needs to be inside the loop.
  void dataflow(Input0, Input1, Output0, Output1) {
      for (int i = 0; i < N; i++) {
      #pragma HLS dataflow
          UserDataType C0, C1, C2;
          func1(read Input0, read Input1, write C0, write C1);
          func2(read C0, read C1, write C2);
          func3(read C2, write Output0, write Output1);
      }
  }
Canonical Body
Inside the canonical region, the canonical body should follow these guidelines:
- Use a local, non-static scalar or array/pointer variable, or local static stream variable. A local variable is declared inside the function body (for dataflow in a function) or loop body (for dataflow inside a loop).
- A sequence of function calls that pass data forward (with no feedback), from a function to one that is lexically later, under the following conditions:
- Variables (except scalar) can have only one reading process and one writing process.
- Use write before read (producer before consumer) if you are using local variables, which then become channels.
- Use read before write (consumer before producer) if you are using function arguments. Any intra-body anti-dependencies must be preserved by the design.
- Function return type must be void.
- No loop-carried dependencies among different processes via variables.
- Inside the canonical loop (i.e., values written by one iteration and read by a following one).
- Among successive calls to the top function (i.e., inout argument written by one iteration and read by the following iteration).
Dataflow Checking
Vitis HLS has a dataflow checker which, when enabled, checks the code to see if it is in the recommended canonical form. Otherwise, it emits an error or warning message. By default this checker is set to warning. You can set the checker to error, or disable it by selecting off, in the strict mode of the config_dataflow Tcl command:
config_dataflow -strict_mode (off | error | warning)
Dataflow Optimization Limitations
The DATAFLOW optimization optimizes the flow of data between tasks (functions and loops), and ideally pipelined functions and loops for maximum performance. It does not require these tasks to be chained, one after the other, however there are some limitations in how the data is transferred.
The following behaviors can prevent or limit the overlapping that Vitis HLS can perform with DATAFLOW optimization:
- Single-producer-consumer violations
- Bypassing tasks
- Feedback between tasks
- Conditional execution of tasks
- Loops with multiple exit conditions
Single-producer-consumer Violations
For Vitis HLS to perform the DATAFLOW optimization, all elements passed between tasks must follow a single-producer-consumer model. Each variable must be driven from a single task and only be consumed by a single task. In the following code example, temp1 fans out and is consumed by both Loop2 and Loop3. This violates the single-producer-consumer model.
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
}
Loop2: for(int j = 0; j < N; j++) {
data_out1[j] = temp1[j] * 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out2[k] = temp1[k] * 456;
}
}
A modified version of this code uses the function Split to create a single-producer-consumer design. The following code block example shows how the data flows with the function Split. The data now flows between all four tasks, and Vitis HLS can perform the DATAFLOW optimization.
void Split (in[N], out1[N], out2[N]) {
// Duplicated data
L1:for(int i=0;i<N;i++) {
out1[i] = in[i];
out2[i] = in[i];
}
}
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N], temp2[N], temp3[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
}
Split(temp1, temp2, temp3);
Loop2: for(int j = 0; j < N; j++) {
data_out1[j] = temp2[j] * 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out2[k] = temp3[k] * 456;
}
}
Bypassing Tasks
In addition, data should generally flow from one task to another. If you bypass tasks, this can reduce the performance of the DATAFLOW optimization. In the following example, Loop1 generates the values for temp1 and temp2. However, the next task, Loop2, only uses the value of temp1. The value of temp2 is not consumed until after Loop2. Therefore, temp2 bypasses the next task in the sequence, which can limit the performance of the DATAFLOW optimization.
void foo(int data_in[N], int scale, int data_out[N]) {
 int temp1[N], temp2[N], temp3[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
temp2[i] = data_in[i] >> scale;
}
Loop2: for(int j = 0; j < N; j++) {
temp3[j] = temp1[j] + 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp2[k] + temp3[k];
}
}
Note: In cases such as this, set the depth of the PIPO buffer used for temp2 to 3, instead of the default depth of 2. This lets the buffer store the value intended for Loop3 while Loop2 is being executed. Similarly, a PIPO that bypasses two processes should have a depth of 4. Set the depth of the buffer with the STREAM pragma or directive:
#pragma HLS STREAM off variable=temp2 depth=3
Feedback between Tasks
Feedback occurs when the output from a task is consumed by a previous task in the DATAFLOW region. Feedback between tasks is not recommended in a DATAFLOW region. When Vitis HLS detects feedback, it issues a warning, depending on the situation, and might not perform the DATAFLOW optimization.
However, DATAFLOW can support feedback when used with hls::stream. The following example demonstrates this exception.
#include "ap_axi_sdata.h"
#include "hls_stream.h"
void firstProc(hls::stream<int> &forwardOUT, hls::stream<int> &backwardIN) {
static bool first = true;
int fromSecond;
//Initialize stream
if (first)
fromSecond = 10; // Initial stream value
else
//Read from stream
fromSecond = backwardIN.read(); //Feedback value
first = false;
//Write to stream
forwardOUT.write(fromSecond*2);
}
void secondProc(hls::stream<int> &forwardIN, hls::stream<int> &backwardOUT) {
backwardOUT.write(forwardIN.read() + 1);
}
void top(...) {
#pragma HLS dataflow
hls::stream<int> forward, backward;
firstProc(forward, backward);
secondProc(forward, backward);
}
In this simple design, when firstProc is executed, it uses 10 as an initial value for input. Because hls::stream does not support an initial value, this technique can be used to provide one without violating the single-producer-consumer rule. In subsequent iterations firstProc reads from the hls::stream through the backwardIN interface.
firstProc processes the value and sends it to secondProc, via a stream that goes forward in terms of the original C++ function execution order. secondProc reads the value on forwardIN, adds 1 to it, and sends it back to firstProc via the feedback stream that goes backward in the execution order.
From the second execution, firstProc uses the value read from the stream to do its computation, and the two processes can keep going forever, with both forward and feedback communication, using an initial value for the first execution.
Conditional Execution of Tasks
The DATAFLOW optimization does not optimize tasks that are conditionally executed. The following example highlights this limitation. In this example, the conditional execution of Loop1 and Loop2 prevents Vitis HLS from optimizing the data flow between these loops, because the data does not flow from one loop into the next.
void foo(int data_in[N], int data_out[N], int sel) {
int temp1[N], temp2[N];
if (sel) {
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * 123;
temp2[i] = data_in[i];
}
} else {
Loop2: for(int j = 0; j < N; j++) {
temp1[j] = data_in[j] * 321;
temp2[j] = data_in[j];
}
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
To ensure each loop is executed in all cases, you must transform the code as shown in the following example. In this example, the conditional statement is moved into the first loop. Both loops are always executed, and data always flows from one loop to the next.
void foo(int data_in[N], int data_out[N], int sel) {
int temp1[N], temp2[N];
Loop1: for(int i = 0; i < N; i++) {
if (sel) {
temp1[i] = data_in[i] * 123;
} else {
temp1[i] = data_in[i] * 321;
}
}
Loop2: for(int j = 0; j < N; j++) {
temp2[j] = data_in[j];
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
Loops with Multiple Exit Conditions
Loops with multiple exit points cannot be used in a DATAFLOW region. In the following example, Loop2 has three exit conditions:
- An exit defined by the value of N; the loop will exit when k>=N.
- An exit defined by the break statement.
- An exit defined by the continue statement.
#include "ap_int.h"
#define N 16
typedef ap_int<8> din_t;
typedef ap_int<15> dout_t;
typedef ap_uint<8> dsc_t;
typedef ap_uint<1> dsel_t;
void multi_exit(din_t data_in[N], dsc_t scale, dsel_t select, dout_t data_out[N]) {
    dout_t temp1[N], temp2[N];
    int i,k;
    Loop1: for(i = 0; i < N; i++) {
        temp1[i] = data_in[i] * scale;
        temp2[i] = data_in[i] >> scale;
    }
    Loop2: for(k = 0; k < N; k++) {
        switch(select) {
            case 0: data_out[k] = temp1[k] + temp2[k];
            case 1: continue;
            default: break;
        }
    }
}
Because a loop's exit condition is always defined by the loop bounds, the use of break or continue statements will prohibit the loop being used in a DATAFLOW region.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop contains additional tasks that might benefit from the DATAFLOW optimization, you must apply the DATAFLOW optimization to the loop, the sub-function, or inline the sub-function.
Note: std::complex variables can be used inside the DATAFLOW region. However, they should be used with an __attribute__((no_ctor)) as shown in the following example:
void proc_1(std::complex<float> (&buffer)[50], const std::complex<float> *in);
void proc_2(hls::stream<std::complex<float>> &fifo, const std::complex<float> (&buffer)[50], std::complex<float> &acc);
void proc_3(std::complex<float> *out, hls::stream<std::complex<float>> &fifo, const std::complex<float> acc);
void top(std::complex<float> *out, const std::complex<float> *in) {
#pragma HLS DATAFLOW
    std::complex<float> acc __attribute__((no_ctor)); // Here
    std::complex<float> buffer[50] __attribute__((no_ctor)); // Here
    hls::stream<std::complex<float>, 5> fifo; // Not here
proc_1(buffer, in);
proc_2(fifo, buffer, acc);
proc_3(out, fifo, acc);
}
Configuring Dataflow Memory Channels
Vitis HLS implements channels between the tasks as either ping-pong or FIFO buffers, depending on the access patterns of the producer and the consumer of the data:
- For scalar, pointer, and reference parameters, Vitis HLS implements the channel as a FIFO.
- If the parameter (producer or consumer) is an array, Vitis HLS implements the channel as a ping-pong buffer or a FIFO as follows:
- If Vitis HLS determines the data is accessed in sequential order, Vitis HLS implements the memory channel as a FIFO channel with a depth that is estimated to optimize performance (but can require manual tuning in practice).
- If Vitis HLS is unable to determine that the data is accessed in sequential order, or determines the data is accessed in an arbitrary manner, Vitis HLS implements the memory channel as a ping-pong buffer, that is, as two block RAMs each defined by the maximum size of the consumer or producer array. Note: A ping-pong buffer ensures that the channel always has the capacity to hold all samples without loss. However, this might be an overly conservative approach in some cases.
To explicitly specify the default channel used between tasks, use the config_dataflow configuration. This configuration sets the default channel for all channels in a design. To reduce the size of the memory used in the channel and allow for overlapping within an iteration, you can use a FIFO. To explicitly set the depth (that is, the number of elements) in the FIFO, use the -fifo_depth option.
Specifying the size of the FIFO channels overrides the default approach. If any task in the design can produce or consume samples at a greater rate than the specified size of the FIFO, the FIFOs might become empty (or full). In this case, the design halts operation because it is unable to read (or write), which might result in a stall or a deadlock state.
When setting the depth of the FIFOs, Xilinx recommends initially setting the depth as the maximum number of data values transferred (for example, the size of the array passed between tasks), confirming the design passes C/RTL co-simulation, and then reducing the size of the FIFOs and confirming C/RTL co-simulation still completes without issues. If RTL co-simulation fails, the size of the FIFO is likely too small to prevent stalling or a deadlock situation.
The Vitis HLS IDE can display a histogram of the occupation of each FIFO/PIPO buffer over time, after RTL co-simulation has been run. This can be useful to help determine the best depth for each buffer.
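A sketch of this methodology, assuming an array tmp of SAMPLES elements passed between two tasks:

// Step 1: a safe starting point, a FIFO deep enough to hold every transfer
#pragma HLS STREAM variable=tmp depth=SAMPLES
// Step 2: after C/RTL co-simulation passes, reduce the depth and re-verify:
// #pragma HLS STREAM variable=tmp depth=2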
Specifying Arrays as Ping-Pong Buffers or FIFOs
All arrays are implemented by default as ping-pong buffers to enable random access. These buffers can also be sized if needed. For example, in some circumstances, such as when a task is being bypassed, a performance degradation is possible. To mitigate this effect on performance, you can give more slack to the producer and consumer by increasing the size of these buffers using the STREAM directive, as shown below.
void top ( ... ) {
#pragma HLS dataflow
int A[1024];
#pragma HLS stream off variable=A depth=3
producer(A, B, …); // producer writes A and B
middle(B, C, ...); // middle reads B and writes C
consumer(A, C, …); // consumer reads A and C
}
In the interface, arrays on the top-level function interface are automatically specified as streaming if the array is set to interface type ap_fifo, axis, or ap_hs. Inside the design, all arrays must be specified as streaming using the STREAM directive if a FIFO is desired for the implementation.
Note: The -depth option can be used to specify the size of the FIFO.
The STREAM directive is also used to change any arrays in a DATAFLOW region from the default implementation specified by the config_dataflow configuration.
- If the config_dataflow default_channel is set as ping-pong, any array can be implemented as a FIFO by applying the STREAM directive to the array. Note: To use a FIFO implementation, the array must be accessed in a streaming manner.
- If the config_dataflow default_channel is set to FIFO, or Vitis HLS has automatically determined the data in a DATAFLOW region is accessed in a streaming manner, any array can still be implemented as a ping-pong implementation by applying the STREAM directive to the array with the -off option.
When an array in a DATAFLOW region is specified as streaming and implemented as a FIFO, the FIFO is typically not required to hold the same number of elements as the original array. The tasks in a DATAFLOW region consume each data sample as soon as it becomes available. The config_dataflow command with the -fifo_depth option, or the STREAM directive with the -depth option, can be used to set the size of the FIFO to the minimum number of elements required to ensure the flow of data never stalls. If the -off option is selected, the -depth option sets the depth (number of blocks) of the PIPO. The depth should be at least 2.
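For example, the design-wide default can be set in Tcl and then overridden for individual arrays with the STREAM pragma (a sketch; the array name A is illustrative):

# Tcl: default all dataflow channels to FIFOs of depth 2
config_dataflow -default_channel fifo -fifo_depth 2

// C++: force array A back to a ping-pong buffer with 4 blocks
#pragma HLS STREAM variable=A off depth=4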
Specifying Compiler-Created FIFO Depth
Start Propagation
The compiler might automatically create a start FIFO to propagate the ap_start/ap_ready handshake to an internal process. Such FIFOs can sometimes be a bottleneck for performance, in which case you can increase their default size (which can be incorrectly estimated by the tool) with the following command:
config_dataflow -start_fifo_depth <value>
These start FIFOs can also be removed, at the user's risk, for a given dataflow region with the pragma:
#pragma HLS DATAFLOW disable_start_propagation
Scalar Propagation
The compiler automatically propagates some scalars from C/C++ code through scalar FIFOs between processes. Such FIFOs can sometimes be a bottleneck for performance or cause deadlocks, in which case you can set the size (the default value is set to -fifo_depth) with the following command:
config_dataflow -scalar_fifo_depth <value>
Stable Arrays
The stable pragma can be used to mark input or output variables of a dataflow region. Its effect is to remove their corresponding synchronizations, assuming that the user guarantees this removal is indeed correct.
void dataflow_region(int A[...], ...) {
#pragma HLS stable variable=A
#pragma HLS dataflow
    proc1(...);
    proc2(A, ...);
}
Without the stable pragma, and assuming that A is read by proc2, then proc2 would be part of the initial synchronization (via ap_start) for the dataflow region where it is located. This means that proc1 would not restart until proc2 is also ready to start again, which would prevent dataflow iterations from being overlapped and induce a possible loss of performance. The stable pragma indicates that this synchronization is not necessary to preserve correctness.
In the previous example, without the stable pragma, and assuming that A is read by proc2, as proc2 is bypassing the tasks, there will be a performance loss.
stable
pragma, the compiler assumes
- If A is read by proc2, then the memory locations that are read will not be overwritten, by any other process or calling context, while dataflow_region is being executed.
- If A is written by proc2, then the memory locations written will not be read, before their definition, by any other process or calling context, while dataflow_region is being executed.
A typical scenario is when the caller updates or reads these variables only when the dataflow region has not started yet or has completed execution.
Using ap_ctrl_none Inside the Dataflow
The ap_ctrl_none block-level I/O protocol avoids the rigid synchronization scheme implied by the ap_ctrl_hs and ap_ctrl_chain protocols. These protocols require that all processes in the region are executed exactly the same number of times in order to better match the C/C++ behavior.
However, there are situations where, for example, the intent is to have a faster process that executes more frequently to distribute work to several slower ones.
For any dataflow region (except "dataflow-in-loop"), it is possible to
specify #pragma HLS interface ap_ctrl_none port=return
as long as all
the following conditions are satisfied:
- The region and all the processes it contains communicate only via FIFOs (hls::stream, streamed arrays, AXIS); that is, excluding memories.
- All the parents of the region, up to the top-level design, must fit the following requirements:
  - They must be dataflow regions (excluding "dataflow-in-loop").
  - They must all specify ap_ctrl_none.
This means that none of the parents of a dataflow region with ap_ctrl_none in the hierarchy can be:
- A sequential or pipelined FSM
- A dataflow region inside a for loop ("dataflow-in-loop")
The result of this pragma is that ap_ctrl_chain is not used to synchronize any of the processes inside that region. They are executed or stalled based on the availability of data in their input FIFOs and space in their output FIFOs. For example:
void region(...) {
#pragma HLS dataflow
#pragma HLS interface ap_ctrl_none port=return
    hls::stream<int> outStream1, outStream2;
    demux(inStream, outStream1, outStream2);
    worker1(outStream1, ...);
    worker2(outStream2, ...);
}
In this example, demux can be executed twice as frequently as worker1 and worker2. For example, it can have II=1 while worker1 and worker2 can have II=2, and still achieve a global II=1 behavior.
- Non-blocking reads may need to be used very carefully inside processes that are executed less frequently to ensure that C/C++ simulation works.
- The pragma is applied to a region, not to the individual processes inside it.
- Deadlock detection must be disabled in co-simulation. This can be done with the -disable_deadlock_detection option in cosim_design, as shown below.
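For example, in a Tcl script:
cosim_design -disable_deadlock_detection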
Optimizing for Latency
Using Latency Constraints
Vitis HLS supports the use of a latency constraint on any scope. Latency constraints are specified using the LATENCY directive.
When a maximum and/or minimum LATENCY constraint is placed on a scope, Vitis HLS tries to ensure all operations in the function complete within the range of clock cycles specified.
The latency directive applied to a loop specifies the required latency for a single iteration of the loop: it specifies the latency for the loop body, as the following example shows:
Loop_A: for (i=0; i<N; i++) {
#pragma HLS latency max=10
..Loop Body...
}
If the intention is to limit the total latency of all loop iterations, the latency directive should be applied to a region that encompasses the entire loop, as in this example:
Region_All_Loop_A: {
#pragma HLS latency max=10
Loop_A: for (i=0; i<N; i++)
{
..Loop Body...
}
}
In this case, even if the loop is unrolled, the latency directive sets a maximum limit on all loop operations.
If Vitis HLS cannot meet a maximum latency constraint it relaxes the latency constraint and tries to achieve the best possible result.
If a minimum latency constraint is set and Vitis HLS can produce a design with a lower latency than the minimum required it inserts dummy clock cycles to meet the minimum latency.
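The equivalent Tcl directive form might look like the following, reusing the loop label from the example above (the hierarchical path is an assumption for illustration):
set_directive_latency -max 10 "foo/Loop_A"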
Merging Sequential Loops to Reduce Latency
All rolled loops imply and create at least one state in the design FSM. When there are multiple sequential loops it can create additional unnecessary clock cycles and prevent further optimizations.
The following figure shows a simple example where a seemingly intuitive coding style has a negative impact on the performance of the RTL design.
In the preceding figure, (A) shows how, by default, each rolled loop in the design creates at least one state in the FSM. Moving between those states costs clock cycles: assuming each loop iteration requires one clock cycle, it takes a total of 11 cycles to execute both loops:
- 1 clock cycle to enter the ADD loop.
- 4 clock cycles to execute the add loop.
- 1 clock cycle to exit ADD and enter SUB.
- 4 clock cycles to execute the SUB loop.
- 1 clock cycle to exit the SUB loop.
- For a total of 11 clock cycles.
In this simple example it is obvious that an else branch in the ADD loop would also solve the issue but in a more complex example it may be less obvious and the more intuitive coding style may have greater advantages.
The LOOP_MERGE optimization directive is used to automatically merge loops. The LOOP_MERGE directive seeks to merge all loops within the scope in which it is placed. In the above example, merging the loops creates a control structure similar to that shown in (B) in the preceding figure, which requires only 6 clocks to complete. A code sketch follows the list of restrictions below.
Merging loops allows the logic within the loops to be optimized together. In the example above, using a dual-port block RAM allows the add and subtraction operations to be performed in parallel.
Currently, loop merging in Vitis HLS has the following restrictions:
- If loop bounds are all variables, they must have the same value.
- If loop bounds are constants, the maximum constant value is used as the bound of the merged loop.
- Loops with both variable bound and constant bound cannot be merged.
- The code between loops to be merged cannot have side effects: multiple execution of this code should generate the same results (a=b is allowed, a=a+1 is not).
- Loops cannot be merged when they contain FIFO accesses: merging would change the order of the reads and writes from a FIFO: these must always occur in sequence.
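As an illustration, a minimal sketch of applying LOOP_MERGE to the ADD/SUB pattern discussed above (the function signature, array names, and bounds are assumptions):
void add_sub(int din[4], int dout[4], bool mode) {
#pragma HLS LOOP_MERGE
  ADD: for (int i = 0; i < 4; i++) {
    if (mode) dout[i] = din[i] + 1;
  }
  SUB: for (int i = 0; i < 4; i++) {
    if (!mode) dout[i] = din[i] - 1;
  }
}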
Flattening Nested Loops to Improve Latency
In a similar manner to the consecutive loops discussed in the previous section, it requires additional clock cycles to move between rolled nested loops. It requires one clock cycle to move from an outer loop to an inner loop and from an inner loop to an outer loop.
In the small example shown here, this implies 200 extra clock cycles to execute loop Outer.
void foo_top(a, b, c, d) {
 ...
 Outer: while (j < 100) {
   Inner: while (i < 6) { // 1 cycle to enter inner
     ...
     LOOP_BODY
     ...
   } // 1 cycle to exit inner
 }
 ...
}
Vitis HLS provides the set_directive_loop_flatten command to allow labeled perfect and semi-perfect nested loops to be flattened, removing the need to re-code for optimal hardware performance and reducing the number of cycles it takes to perform the operations in the loop.
- Perfect loop nest
- Only the innermost loop has loop body content, there is no logic specified between the loop statements and all the loop bounds are constant.
- Semi-perfect loop nest
- Only the innermost loop has loop body content, there is no logic specified between the loop statements but the outermost loop bound can be a variable.
For imperfect loop nests, where the inner loop has variable bounds or the loop body is not exclusively inside the inner loop, designers should try to restructure the code, or unroll the loops in the loop body to create a perfect loop nest.
When the directive is applied to a set of nested loops, it should be applied to the innermost loop that contains the loop body.
set_directive_loop_flatten top/Inner
Loop flattening can also be performed using the directive tab in the IDE, either by applying it to individual loops or applying it to all loops in a function by applying the directive at the function level.
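The equivalent pragma form is placed inside the innermost loop that contains the loop body, reusing the labels from the earlier example:
Inner: while(i<6) {
#pragma HLS LOOP_FLATTEN
  ...
}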
Optimizing for Area
Data Types and Bit-Widths
The bit-widths of the variables in the C/C++ function directly impact the size of the storage elements and operators used in the RTL implementation. If a variable only requires 12 bits but is specified as an integer type (32-bit), it will result in larger and slower 32-bit operators being used, reducing the number of operations that can be performed in a clock cycle and potentially increasing initiation interval and latency. A short example follows the list below.
- Use the appropriate precision for the data types.
- Confirm the size of any arrays that are to be implemented as RAMs or registers. The area impact of any over-sized elements is wasteful in hardware resources.
- Pay special attention to multiplications, divisions, modulus or other complex arithmetic operations. If these variables are larger than they need to be, they negatively impact both area and performance.
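For example, a minimal sketch using the arbitrary-precision types provided with Vitis HLS (the variable names and widths are illustrative assumptions):
#include "ap_int.h"
ap_int<12> coeff;   // 12-bit signed value: smaller and faster operators than a 32-bit int
ap_uint<4> counter; // 4-bit unsigned counter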
Function Inlining
Function inlining removes the function hierarchy. A function is inlined using the INLINE directive. Inlining a function may improve area by allowing the components within the function to be better shared or optimized with the logic in the calling function. This type of function inlining is also performed automatically by Vitis HLS. Small functions are automatically inlined.
Inlining allows function sharing to be better controlled. For functions to be shared they must be used within the same level of hierarchy. In this code example, function foo_top calls foo twice and also calls function foo_sub.
foo_sub (p, q) {
 int q1 = q + 10;
 foo(p, q1); // foo_3
 ...
}
void foo_top(a, b, c, d) {
 ...
 foo(a, b);  // foo_1
 foo(a, c);  // foo_2
 foo_sub(a, d);
 ...
}
Inlining function foo_sub and using the ALLOCATION directive to specify that only one instance of function foo is used results in a design with only one instance of function foo: one-third the area of the example above.
foo_sub (p, q) {
#pragma HLS INLINE
 int q1 = q + 10;
 foo(p, q1); // foo_3
 ...
}
void foo_top(a, b, c, d) {
#pragma HLS ALLOCATION instances=foo limit=1 function
 ...
 foo(a, b);  // foo_1
 foo(a, c);  // foo_2
 foo_sub(a, d);
 ...
}
The INLINE directive optionally allows all functions below the specified function to be recursively inlined by using the recursive option. If the recursive option is used on the top-level function, all function hierarchy in the design is removed.
The INLINE off option can optionally be applied to functions to prevent them from being inlined. This option may be used to prevent Vitis HLS from automatically inlining a function.
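For example, a minimal sketch (the function name and body are assumptions for illustration):
void helper(int a, int b, int *res) {
#pragma HLS INLINE off
  *res = a + b; // helper remains a separate module in the RTL hierarchy
}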
The INLINE directive is a powerful way to substantially modify the structure of the code without actually performing any modifications to the source code and provides a very powerful method for architectural exploration.
Array Reshaping
The ARRAY_RESHAPE directive reforms the array with a vertical mode of remapping, and is used to reduce the number of block RAM consumed while providing parallel access to the data.
Given the following example code:
void foo (...) {
int array1[N];
int array2[N];
int array3[N];
#pragma HLS ARRAY_RESHAPE variable=array1 block factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array2 cycle factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array3 complete dim=1
...
}
The ARRAY_RESHAPE directive transforms the arrays into the form shown in the following figure.
The ARRAY_RESHAPE directive allows more data to be accessed in a single clock cycle. In cases where more data can be accessed in a single clock cycle, Vitis HLS might automatically unroll any loops consuming this data, if doing so will improve the throughput. The loop can be fully or partially unrolled to create enough hardware to consume the additional data in a single clock cycle. This feature is controlled using the config_unroll command and the option tripcount_threshold.
In the following example, any loops with a tripcount of less than 16 will be automatically unrolled if doing so improves the throughput.
config_unroll -tripcount_threshold 16
Function Instantiation
Function instantiation is an optimization technique that has the area benefits of maintaining the function hierarchy but provides an additional powerful option: performing targeted local optimizations on specific instances of a function. This can simplify the control logic around the function call and potentially improve latency and throughput.
The FUNCTION_INSTANTIATE directive exploits the fact that some inputs to a function may be a constant value when the function is called and uses this to both simplify the surrounding control structures and produce smaller more optimized function blocks. This is best explained by example.
Given the following code:
void foo_sub(bool mode){
#pragma HLS FUNCTION_INSTANTIATE variable=mode
if (mode) {
// code segment 1
} else {
// code segment 2
}
}
void foo(){
 foo_sub(true);
 foo_sub(false);
}
It is clear that function foo_sub has been written to perform multiple but exclusive operations (depending on whether mode is true or not). Each instance of function foo_sub is implemented in an identical manner: this is great for function reuse and area optimization, but means that the control logic inside the function must be more complex.
The FUNCTION_INSTANTIATE optimization allows each instance to be independently optimized, reducing the functionality and area. After FUNCTION_INSTANTIATE optimization, the code above can effectively be transformed to have two separate functions, each optimized for different possible values of mode, as shown:
void foo_sub1() {
 // code segment 1
}
void foo_sub2() {
 // code segment 2
}
void foo(){
 foo_sub1();
 foo_sub2();
}
If the function is used at different levels of hierarchy such that function sharing is difficult without extensive inlining or code modifications, function instantiation can provide the best means of improving area: many small locally optimized copies are better than many large copies that cannot be shared.
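The equivalent Tcl directive form might be written as follows, reusing the names from the example above (treat the exact form as a sketch):
set_directive_function_instantiate foo_sub mode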
Controlling Hardware Resources
During synthesis, Vitis HLS performs the following basic tasks:
- Elaborates the C, C++ source code into an internal database containing the operators in the C code, such as additions, multiplications, array reads, and writes.
- Maps the operators onto implementations in the hardware.
Implementations are the specific hardware components used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM).
Commands, pragmas and directives provide control over each of these steps, allowing you to control the hardware implementation at a fine level of granularity.
Limiting the Number of Operators
Explicitly limiting the number of operators to reduce area may be required in some cases: the default operation of Vitis HLS is to first maximize performance. Limiting the number of operators forces the sharing of operations and helps reduce area, but might cause a decline in performance.
The ALLOCATION directive allows you to limit how many operators are used in a design. For example, consider a design called foo that contains 317 multiplications, while the FPGA has only 256 multiplier resources (DSP macrocells). The ALLOCATION pragma shown below directs Vitis HLS to create a design with a maximum of 256 multiplication (mul) operators:
dout_t array_arith (dio_t d[317]) {
 static int acc;
 int i;
#pragma HLS ALLOCATION instances=mul limit=256 operation
 for (i=0;i<317;i++) {
#pragma HLS UNROLL
  acc += acc * d[i];
 }
 return acc;
}
You can use the type option to specify whether the ALLOCATION directive limits operations, implementations, or functions. The following table lists all the operations that can be controlled using the ALLOCATION directive; an equivalent Tcl example follows the table.
Operator | Description |
---|---|
add | Integer Addition |
ashr | Arithmetic Shift-Right |
dadd | Double-precision floating-point addition |
dcmp | Double-precision floating-point comparison |
ddiv | Double-precision floating-point division |
dmul | Double-precision floating-point multiplication |
drecip | Double-precision floating-point reciprocal |
drem | Double-precision floating-point remainder |
drsqrt | Double-precision floating-point reciprocal square root |
dsub | Double-precision floating-point subtraction |
dsqrt | Double-precision floating-point square root |
fadd | Single-precision floating-point addition |
fcmp | Single-precision floating-point comparison |
fdiv | Single-precision floating-point division |
fmul | Single-precision floating-point multiplication |
frecip | Single-precision floating-point reciprocal |
frem | Single-precision floating-point remainder |
frsqrt | Single-precision floating-point reciprocal square root |
fsub | Single-precision floating-point subtraction |
fsqrt | Single-precision floating-point square root |
icmp | Integer Compare |
lshr | Logical Shift-Right |
mul | Multiplication |
sdiv | Signed Divider |
shl | Shift-Left |
srem | Signed Remainder |
sub | Subtraction |
udiv | Unsigned Division |
urem | Unsigned Remainder |
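The equivalent Tcl directive for the pragma example above might be written as follows (the function name reuses the earlier example; treat the exact form as a sketch):
set_directive_allocation -limit 256 -type operation array_arith mul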
Controlling Hardware Implementation
When synthesis is performed, Vitis HLS uses the timing constraints specified by the clock, the delays specified by the target device, together with any directives specified by you, to determine which hardware implementations to use for various operators in the code. For example, to implement a multiplier operation, Vitis HLS could use a combinational multiplier or a pipelined multiplier.
The implementations which are mapped to operators during synthesis can be limited by specifying the ALLOCATION pragma or directive, in the same manner as the operators. Instead of limiting the total number of multiplication operations, you can choose to limit the number of combinational multipliers, forcing any remaining multiplications to be performed using pipelined multipliers (or vice versa).
The BIND_OP or BIND_STORAGE pragmas or directives are used to explicitly specify which implementations to use for specific operations or storage types. The following command informs Vitis HLS to use a two-stage pipelined multiplier using fabric logic for variable c. It is left to Vitis HLS which implementation to use for variable d.
int foo (int a, int b) {
int c, d;
#pragma HLS BIND_OP variable=c op=mul impl=fabric latency=2
c = a*b;
d = a*c;
return d;
}
In the following example, the BIND_OP pragma specifies that the add operation for variable temp is implemented using the dsp implementation. This ensures that the operation is implemented using a DSP module primitive in the final design. By default, add operations are implemented using LUTs.
void apint_arith(dinA_t inA, dinB_t inB,
dout1_t *out1
) {
dout2_t temp;
#pragma HLS BIND_OP variable=temp op=add impl=dsp
temp = inB + inA;
*out1 = temp;
}
Refer to the BIND_OP or BIND_STORAGE pragmas or directives to obtain details on the implementations available for assignment to operations or storage types.
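For storage elements, a minimal BIND_STORAGE sketch might look like the following (the array name and the chosen memory type/implementation are assumptions for illustration):
int buff[1024];
#pragma HLS BIND_STORAGE variable=buff type=ram_2p impl=bram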
In the following example, the BIND_OP pragma specifies the multiplication for out1 is implemented with a 3-stage pipelined multiplier.
void foo(...) {
#pragma HLS BIND_OP variable=out1 op=mul latency=3
// Basic arithmetic operations
*out1 = inA * inB;
*out2 = inB + inA;
*out3 = inC / inA;
*out4 = inD % inA;
}
If the assignment specifies multiple identical operators, the code must be modified to ensure there is a single variable for each operator to be controlled. For example, in the following code, if only the first multiplication (inA * inB) is to be implemented with a pipelined multiplier:
*out1 = inA * inB * inC;
The code should be changed to the following, with the pragma specified on the Result_tmp variable:
#pragma HLS BIND_OP variable=Result_tmp op=mul latency=3
Result_tmp = inA * inB;
*out1 = Result_tmp * inC;
Optimizing Logic
Inferring Shift Registers
Vitis HLS will now infer a shift register when encountering the following code:
int A[N]; // This will be replaced by a shift register
for(...) {
  // The loop below is the shift operation
  for (int i = 0; i < N-1; ++i)
    A[i] = A[i+1];
  A[N-1] = ...;
  // This is an access to the shift register
  ... A[x] ...
}
Shift registers can perform one shift per cycle, which offers a significant performance improvement, and also allow a random read access per cycle anywhere in the shift register, making them more flexible than a FIFO.
Controlling Operator Pipelining
Vitis HLS automatically determines the level of pipelining to use for internal operations. You can use the BIND_OP or BIND_STORAGE pragmas with the -latency option to explicitly specify the number of pipeline stages and override the number determined by Vitis HLS.
RTL synthesis might use the additional pipeline registers to help improve timing issues that might result after place and route. Registers added to the output of the operation typically help improve timing in the output datapath. Registers added to the input of the operation typically help improve timing in both the input datapath and the control logic from the FSM.
You can use the config_op command to pipeline all instances of a specific operation used in the design, assigning them the same implementation and pipeline depth. Refer to config_op for more information.
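For example, the following command assigns the dsp implementation with a latency of 3 to all multipliers in the design (the operation name and values are illustrative assumptions):
config_op mul -impl dsp -latency 3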
Optimizing Logic Expressions
During synthesis several optimizations, such as strength reduction and bit-width minimization are performed. Included in the list of automatic optimizations is expression balancing.
Expression balancing rearranges operators to construct a balanced tree and reduce latency.
- For integer operations expression balancing is on by default but may be disabled using the EXPRESSION_BALANCE pragma or directive.
- For floating-point operations, expression balancing is off by default but may be enabled.
Given the highly sequential code using assignment operators such as += and *= in the following example:
data_t foo_top (data_t a, data_t b, data_t c, data_t d)
{
data_t sum;
sum = 0;
sum += a;
sum += b;
sum += c;
sum += d;
return sum;
}
Without expression balancing, and assuming each addition requires one clock cycle, the complete computation for sum requires four clock cycles, as shown in the following figure.
However, additions a+b and c+d can be executed in parallel, allowing the latency to be reduced. After balancing, the computation completes in two clock cycles, as shown in the following figure. Expression balancing prohibits sharing and results in increased area.
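Conceptually, the balanced computation corresponds to:
sum = (a + b) + (c + d); // two adder levels instead of four sequential additions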
For integers, you can disable expression balancing using the EXPRESSION_BALANCE optimization directive with the off option.
By default, Vitis HLS does not perform the EXPRESSION_BALANCE optimization for operations of type float or double. When synthesizing float and double types, Vitis HLS maintains the order of operations performed in the C/C++ code to ensure that the results are the same as the C/C++ simulation. For example, in the following code example, all variables are of type float or double. The values of O1 and O2 are not the same even though they appear to perform the same basic calculation.
A=B*C; A=B*F;
D=E*F; D=E*C;
O1=A*D; O2=A*D;
This behavior is a function of the saturation and rounding in the C/C++ standard when performing operations with types float or double. Therefore, Vitis HLS always maintains the exact order of operations when variables of type float or double are present and does not perform expression balancing by default.
You can enable expression balancing for specific operations, or you can configure the tool to enable expression balancing with float and double types using the config_compile -unsafe_math_optimizations command as follows:
- In the Vitis HLS IDE, select Solution > Solution Settings.
- In the Solution Settings dialog box, click the General category, select config_compile, and enable unsafe_math_optimizations.
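Alternatively, from the Tcl console or a script:
config_compile -unsafe_math_optimizations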
With this setting enabled, Vitis HLS might change the order of operations to produce a more optimal design. However, the results of C/RTL co-simulation might differ from the C/C++ simulation.
The unsafe_math_optimizations feature also enables the no_signed_zeros optimization.
The no_signed_zeros optimization ensures that the following expressions used with float and double types are identical:
x - 0.0 = x;
x + 0.0 = x;
0.0 - x = -x;
x - x = 0.0;
x*0.0 = 0.0;
Without the no_signed_zeros optimization, the expressions above would not be equivalent due to rounding. The optimization may be optionally used without expression balancing by selecting only this option in the config_compile command.
If the unsafe_math_optimizations and no_signed_zeros optimizations are used, the RTL implementation will have different results than the C/C++ simulation. The test bench should be capable of ignoring minor differences in the result: check for a range, do not perform an exact comparison.
Optimizing Burst Transfers
Overview of Burst Transfers
Bursting is an optimization that tries to intelligently aggregate your memory accesses to the DDR to maximize the throughput bandwidth and/or minimize the latency. Bursting is one of the many optimizations to the kernel. Bursting typically gives you a 4-5x improvement while other optimizations, like access widening or ensuring there are no dependencies through the DDR, can provide even bigger performance improvements. Typically, bursting is useful when you have contention on the DDR ports from multiple competing kernels.
The figure above shows how the AXI protocol works. The HLS kernel sends out a read request for a burst of length 8 and then sends a write request burst of length 8. The read latency is defined as the time taken between the sending of the read request burst to when the data from the first read request in the burst is received by the kernel. Similarly, the write latency is defined as the time taken between when data for the last write in the write burst is sent and the time the write acknowledgment is received by the kernel. Read requests are usually sent at the first available opportunity while write requests get queued until the data for each write in the burst becomes available.
To help you understand the various latencies that are possible in the system, the following figure shows what happens when an HLS kernel sends a burst to the DDR.
When your design makes a read/write request, the request is sent to the DDR through several specialized helper modules. First, the M-AXI adapter serves as a buffer for the requests created by the HLS kernel. The adapter contains logic to cut large bursts into smaller ones (which it needs to do to prevent hogging the channel, or if the request crosses the 4 KB boundary; see Vivado Design Suite: AXI Reference Guide (UG1037)), and can also stall the sending of burst requests (depending on the maximum outstanding requests parameter) so that it can safely buffer the entirety of the data for each kernel. This can slightly increase write latency but can resolve deadlock due to concurrent requests (read or write) on the memory subsystem. You can configure the M-AXI interface to hold the write request until all data is available using config_interface -m_axi_conservative_mode.
Another way to view the latencies in the system is as follows: the interconnect has an average II of 2 while the DDR controller has an average II of 4-5 cycles on requests (while on the data they are both II=1). The interconnect arbitration strategy is based on the size of read/write requests, and so data requested with longer burst lengths get prioritized over requests with shorter bursts (thus leading to a bigger channel bandwidth being allocated to longer bursts in case of contention). Of course, a large burst request has the side-effect of preventing anyone else from accessing the DDR, and therefore there must be a compromise between burst length and reducing DDR port contention. Fortunately, the large latencies help prevent some of this port contention, and effective pipelining of the requests can significantly improve the bandwidth throughput available in the system.
Burst Semantics
For a given kernel, the HLS compiler implements the burst analysis optimization as a multi-pass optimization, but on a per function basis. Bursting is only done for a function and bursting across functions is not supported. The burst optimizations are reported in the Synthesis Summary report, and missed burst opportunities are also reported to help you improve burst optimization.
At first, the HLS compiler looks for memory accesses in the basic blocks of the function, such as memory accesses in a sequential set of statements inside the function. Assuming the preconditions of bursting are met, each burst inferred in these basic blocks is referred to as region burst. The compiler will automatically scan the basic block to build the longest sequence of accesses into a single region burst.
The compiler then looks at loops and tries to infer what are known as loop bursts. A loop burst is the sequence of reads/writes across the iterations of a loop. The compiler tries to infer the length of the burst by analyzing the loop induction variable and the trip count of the loop. If the analysis is successful, the compiler can chain the sequences of reads/writes in each iteration of the loop into one long loop burst. The compiler today automatically infers a loop or a region burst, but there is no way to specifically request a loop or a region burst. The code needs to be written so as to cause the tool to infer the loop or region burst.
To understand the underlying semantics of bursting, consider the following code snippet:
for(size_t i = 0; i < size; i+=4) {
    out[i+0] = f(in[i+0]);
    out[i+1] = f(in[i+1]);
    out[i+2] = f(in[i+2]);
    out[i+3] = f(in[i+3]);
}
The code above is typically used to perform a series of reads from an array, and a series of writes to an array from within a loop. The code below is what Vitis HLS may infer after performing the burst analysis optimization. Alongside the actual array accesses, the compiler will additionally make the required read and write requests that are necessary for the user selected AXI protocol.
Loop Burst
/* requests can move anywhere in func */
rb = ReadReq(in, size);
wb = WriteReq(out, size);
for(size_t i = 0; i < size; i+=4) {
    Write(wb, i+0) = f(Read(rb, i+0));
    Write(wb, i+1) = f(Read(rb, i+1));
    Write(wb, i+2) = f(Read(rb, i+2));
    Write(wb, i+3) = f(Read(rb, i+3));
}
WriteResp(wb);
If the compiler can successfully deduce the burst length from the induction variable (i) and the trip count of the loop (size), it will infer one big loop burst and will move the ReadReq, WriteReq, and WriteResp calls outside the loop, as shown in the Loop Burst code example. So, the read requests for all loop iterations are combined into one read request and all the write requests are combined into one write request. Note that all read requests are typically sent out immediately, while write requests are only sent out after the data becomes available.
However, if any of the preconditions of bursting are not met, as described in Preconditions and Limitations of Burst Transfer, the compiler may not infer a loop burst but will instead try to infer a region burst, where the ReadReq, WriteReq, and WriteResp are alongside the read/write accesses being burst optimized, as shown in the Region Burst code example. In this case, the read and write requests for each loop iteration are combined into one read or write request.
Region Burst
for(size_t i = 0; i < size; i+=4) {
    rb = ReadReq(in+i, 4);
    wb = WriteReq(out+i, 4);
    Write(wb, 0) = f(Read(rb, 0));
    Write(wb, 1) = f(Read(rb, 1));
    Write(wb, 2) = f(Read(rb, 2));
    Write(wb, 3) = f(Read(rb, 3));
    WriteResp(wb);
}
Preconditions and Limitations of Burst Transfer
Bursting Preconditions
Bursting is about aggregating successive memory access requests. Here are the set of preconditions that these successive accesses must meet for the bursting optimization to launch successfully:
- Must be all reads, or all writes – bursting reads and writes is not possible.
- Must be a monotonically increasing order of access (both in terms of the memory location being accessed as well as in time). You cannot access a memory location that is in between two previously accessed memory locations.
- Must be consecutive in memory – one next to another with no gaps or overlap and in forward order.
- The number of read/write accesses (or burst length) must be determinable before the request is sent out. This means that even if the burst length is parametric, it must be computed before the read/write request is sent out.
- If bundling two arrays to the same M-AXI port, bursting will be done only for one array, at most, in each direction at any given time.
- There must be no dependency issues from the time a burst request is initiated and finished.
Outer Loop Burst Failure Due to Overlapping Memory Accesses
Outer loop burst inference will fail in the following example because both iteration 0 and iteration 1 of the loop L1 access the same element in arrays a and b. Burst inference is an all or nothing type of optimization - the tool will not infer a partial burst. It is a greedy algorithm that tries to maximize the length of the burst. The auto-burst inference will try to infer a burst in a bottom up fashion - from the inner loop to the outer loop, and will stop when one of the preconditions is not met. In the example below the burst inference will stop when it sees that element 8 is being read again, and so an inner loop burst of length 9 will be inferred in this case.
L1: for (int i = 0; i < 8; ++i)
L2: for (int j = 0; j < 9; ++j)
b[i*8 + j] = a[i*8 + j];
itr 0: |0 1 2 3 4 5 6 7 8|
itr 1: | 8 9 10 11 12 13 14 15 16|
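A sketch of a rewrite that avoids the overlap, assuming each row is actually eight elements wide, makes the accesses consecutive across iterations and can allow a single outer loop burst of length 64 to be inferred:
L1: for (int i = 0; i < 8; ++i)
  L2: for (int j = 0; j < 8; ++j)
    b[i*8 + j] = a[i*8 + j];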
Usage of ap_int/ap_uint Types as Loop Induction Variables
Because the burst inference depends on the loop induction variable and the trip count, using non-native types can hinder the optimization from firing. It is recommended to always use a native unsigned integer type for the loop induction variable.
Must Enter Loop at Least Once
In some cases, the compiler can fail to infer that the max value of the loop induction variable can never be zero – that is, if it cannot prove that the loop will always be entered. In such cases, an assert statement will help the compiler infer this.
assert (N > 0);
L1: for(int a = 0; a < N; ++a) { … }
Inter or Intra Loop Dependencies on Arrays
If you write to an array location and then read from it in the same iteration or the next, this type of array dependency can be hard for the optimization to decipher. Basically, the optimization will fail for these cases because it cannot guarantee that the write will happen before the read.
Conditional Access to Memory
If the memory accesses are being made conditionally, it can cause the burst inferencing algorithm to fail as it cannot reason through the conditional statements. In some cases, the compiler will simplify the conditional and even remove it but it is generally recommended to not use conditional statements around the memory accesses.
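For example, a sketch of hoisting the memory accesses out of the condition (the names and the condition are assumptions for illustration):
for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE
  int v = in[i];             // unconditional read keeps the access pattern burst-friendly
  out[i] = use_f ? f(v) : v; // the condition is moved into the datapath
}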
M-AXI Accesses Made from Inside a Function Called from a Loop
Cross-functional array access analysis is not a strong suit for compiler transformations such as burst inferencing. In such cases, users can inline the function using the INLINE pragma or directive to avoid burst failures.
void my_function(hls::stream<T> &out_pkt, int *din, int input_idx) {
  T v;
  v.data = din[input_idx];
  out_pkt.write(v);
}
void my_kernel(hls::stream<T> &out_pkt,
               int *din,
               int num_512_bytes,
               int num_times) {
#pragma HLS INTERFACE m_axi port=din offset=slave bundle=gmem0
#pragma HLS INTERFACE axis port=out_pkt
#pragma HLS INTERFACE s_axilite port=din bundle=control
#pragma HLS INTERFACE s_axilite port=num_512_bytes bundle=control
#pragma HLS INTERFACE s_axilite port=num_times bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
  unsigned int idx = 0;
L0: for (int i = 0; i < num_times; ++i) {
  L1: for (int j = 0; j < num_512_bytes; ++j) {
#pragma HLS PIPELINE
      my_function(out_pkt, din, idx++);
    }
  }
}
Burst inferencing will fail because the memory accesses are being made from a called function. For the burst inferencing to work, it is recommended that users inline any such functions that are making accesses to the M-AXI memory.
An additional reason the burst inferencing will fail in this example is that the memory accessed through din in my_function is defined by a variable (idx) which is not a function of the loop induction variables i and j, and therefore may not be sequential or monotonic. Instead of passing idx, use (i*num_512_bytes+j).
Loop Burst Inference on a Dataflow Loop
Burst inference is not supported on a loop that has the DATAFLOW pragma or directive. However, each process/task inside the dataflow loop can have bursts. Also, sharing of M-AXI ports is not supported inside a dataflow region because the tasks can execute in parallel.
Options for Controlling AXI4 Burst Behavior
An optimal AXI4 interface is one in which the design never stalls while waiting to access the bus, and after bus access is granted, the bus never stalls while waiting for the design to read/write. To create the optimal AXI4 interface, the following command options are provided in the INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the AXI4 interface.
Note that some of these options can use internal storage to buffer data and this may have an impact on area and resources:
latency
- Specifies the expected latency of the AXI4 interface, allowing the design to initiate a bus request several cycles (latency) before the read or write is expected. If this figure is too low, the design will be ready too soon and may stall waiting for the bus. If this figure is too high, bus access may be granted but the bus may stall waiting on the design to start the access. Default latency in Vitis HLS is 64.
max_read_burst_length
- Specifies the maximum number of data values read during a burst transfer. Default value is 16.
num_read_outstanding
- Specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_read_outstanding*max_read_burst_length*word_size. Default value is 16.
max_write_burst_length
- Specifies the maximum number of data values written during a burst transfer. Default value is 16.
num_write_outstanding
- Specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_write_outstanding*max_write_burst_length*word_size. Default value is 16.
#pragma HLS interface m_axi port=input offset=slave bundle=gmem0
depth=1024*1024*16/(512/8) latency=100 num_read_outstanding=32 num_write_outstanding=32
max_read_burst_length=16 max_write_burst_length=16
- The interface is specified as having a latency of 100. The HLS compiler seeks to schedule the request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
- To further improve bus efficiency, the options num_write_outstanding and num_read_outstanding ensure the design contains enough buffering to store up to 32 read and/or write accesses. Each request will require its own buffer. This allows the design to continue processing until the bus requests are serviced.
max_read_burst_length
andmax_write_burst_length
ensure the maximum burst size is 16 and that the AXI4 interface does not hold the bus for longer than this. The HLS tool will partition longer bursts according to the specified burst length, and report this condition with a message like the following:Multiple burst reads of length 192 and bit width 128 in loop 'VITIS_LOOP_2'(./src/filter.cpp:247:21)has been inferred on port 'mm_read'. These burst requests might be further partitioned into multiple requests during RTL generation based on the max_read_burst_length settings.
These options allow the AXI4 interface to be optimized for the system in which it will operate. The efficiency of the operation depends on these values being set accurately. The provided default values are conservative, and may require changing depending on the memory access profile of your design.
Vitis HLS Command | Value | Description |
---|---|---|
config_interface -m_axi_conservative_mode | bool; default=false | Delay each M-AXI write request until the associated write data are entirely available (typically, buffered into the adapter or already emitted). This can slightly increase write latency but can resolve deadlock due to concurrent requests (read or write) on the memory subsystem. |
config_interface -m_axi_latency | uint; 0 is auto; default=0 (Vivado IP flow), 64 (Vitis Kernel flow) | Provide the scheduler with an expected latency for M-AXI accesses. Latency is the delay between a read request and the first read data, or between the last write data and the write response. Note that this number need not be exact; underestimation makes for a lower-latency schedule, but with longer dynamic stalls. The scheduler will account for the additional adapter latency and add a few cycles. |
config_interface -m_axi_min_bitwidth | uint; default=8 | Minimum bitwidth for M-AXI interface data channels. Must be a power of two between 8 and 1024. Note that this does not necessarily increase throughput if the actual accesses are smaller than the required interface. |
config_interface -m_axi_max_bitwidth | uint; default=1024 | Maximum bitwidth for M-AXI interface data channels. Must be a power of two between 8 and 1024. Note that this can decrease throughput if the actual accesses are bigger than the required interface, as they will be split into a multi-cycle burst of accesses. |
config_interface -m_axi_max_widen_bitwidth | uint; default=0 (Vivado IP flow), 512 (Vitis Kernel flow) | Allow the tool to automatically widen bursts on M-AXI interfaces up to the chosen bitwidth. Must be a power of two between 8 and 1024. Note that burst widening requires strong alignment properties (in addition to bursting). |
config_interface -m_axi_auto_max_ports | bool; default=false | If the option is false, all the M-AXI interfaces that are not explicitly bundled will be bundled into a single common interface, thus minimizing resource usage (single adapter). If the option is true, all the M-AXI interfaces that are not explicitly bundled will be mapped into individual interfaces, thus increasing the resource usage (multiple adapters). |
config_interface -m_axi_alignment_byte_size | uint; default=0 (Vivado IP flow), 64 (Vitis Kernel flow) | Assume top function pointers that are mapped to M-AXI interfaces are at least aligned to the provided width in bytes (power of two). This can help automatic burst widening. Warning: behavior will be incorrect if the pointers are not actually aligned at runtime. |
config_interface -m_axi_num_read_outstanding | uint; default=16 | Default value for the M-AXI num_read_outstanding interface parameter. |
config_interface -m_axi_num_write_outstanding | uint; default=16 | Default value for the M-AXI num_write_outstanding interface parameter. |
config_interface -m_axi_max_read_burst_length | uint; default=16 | Default value for the M-AXI max_read_burst_length interface parameter. |
config_interface -m_axi_max_write_burst_length | uint; default=16 | Default value for the M-AXI max_write_burst_length interface parameter. |
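For example, to allow automatic burst widening up to 512 bits and deepen the outstanding read queue (the values are illustrative assumptions):
config_interface -m_axi_max_widen_bitwidth 512
config_interface -m_axi_num_read_outstanding 32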
Examples of Recommended Coding Styles
As described in Synthesis Summary, Vitis HLS issues a report summarizing burst activities and also identifying burst failures. If bursts of variable lengths are done, then the report will mention that bursts of variable lengths were inferred. The compiler also provides burst messages that can be found in the compiler log, vitis_hls.log. These messages are issued before the scheduling step.
Simple Read/Write Burst Inference
INFO: [HLS 214-115] Burst read of variable length and bit width 32 has been inferred on port 'gmem'
INFO: [HLS 214-115] Burst write of variable length and bit width 32 has been inferred on port 'gmem' (./src/vadd.cpp:75:9).
The code for this example follows:
/****** BEGIN EXAMPLE *******/
#include <cstring> // memcpy
#define DATA_SIZE 2048
// Define internal buffer max size
#define BURSTBUFFERSIZE 256
//TRIPCOUNT identifiers
const unsigned int c_min = 1;
const unsigned int c_max = BURSTBUFFERSIZE;
const unsigned int c_chunk_sz = DATA_SIZE;
extern "C" {
void vadd(int *a, int size, int inc_value) {
// Map pointer a to AXI4-master interface for global memory access
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem max_read_burst_length=256 max_write_burst_length=256
// We also need to map a and return to a bundled axilite slave interface
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=size bundle=control
#pragma HLS INTERFACE s_axilite port=inc_value bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
int burstbuffer[BURSTBUFFERSIZE];
// Per iteration of this loop perform BURSTBUFFERSIZE vector addition
for (int i = 0; i < size; i += BURSTBUFFERSIZE) {
#pragma HLS LOOP_TRIPCOUNT min=c_min max=c_chunk_sz/c_max
int chunk_size = BURSTBUFFERSIZE;
//boundary checks
if ((i + BURSTBUFFERSIZE) > size)
chunk_size = size - i;
// memcpy creates a burst access to memory
// multiple calls of memcpy cannot be pipelined and will be scheduled sequentially
// memcpy requires a local buffer to store the results of the memory transaction
memcpy(burstbuffer, &a[i], chunk_size * sizeof(int));
// Calculate and write results to global memory, the sequential write in a for loop can be
// inferred as a memory burst access
calc_write:
for (int j = 0; j < chunk_size; j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_min max=c_max
#pragma HLS PIPELINE II=1
burstbuffer[j] = burstbuffer[j] + inc_value;
a[i + j] = burstbuffer[j];
}
}
}
} // extern "C"
Pipelining Between Bursts
Suppose bursts of length N are inferred in a pipelined inner loop, as in the following code:
for(int x=0; x < k; ++x) {
int off = f(x);
for(int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
... = gmem[off + i];
}
}
But notice that the outer loop is not pipelined. This means that while there is pipelining inside bursts, there won't be any pipelining between bursts. If the bursts are of fixed length, the code can be rewritten to unroll the inner loop and pipeline the outer loop, preserving the burst length while also enabling pipelining between the bursts:
for(int x=0; x < k; ++x) {
#pragma HLS PIPELINE II=N
int off = f(x);
for(int i = 0; i < N; ++i) {
#pragma HLS UNROLL
... = gmem[off + i];
}
}
Accessing Row Data from a Two-Dimensional Array
INFO: [HLS 214-115] Burst read of length 256 and bit width 512 has been inferred on port 'gmem' (./src/row_array_2d.cpp:43:5)
INFO: [HLS 214-115] Burst write of length 256 and bit width 512 has been inferred on port 'gmem' (./src/row_array_2d.cpp:56:5)
Notice that a bit width of 512 is achieved in this example. This is more efficient than the 32 bit width achieved in the simple example above. Bursting wider bit widths is another way bursts can be optimized as discussed in Automatic Port Width Resizing.
The code for this example follows:
/****** BEGIN EXAMPLE *******/
#include "hls_stream.h"
// Parameters Description:
// NUM_ROWS: matrix height
// WORD_PER_ROW: number of words in a row
// BLOCK_SIZE: number of words in an array
#define NUM_ROWS 64
#define WORD_PER_ROW 64
#define BLOCK_SIZE (WORD_PER_ROW*NUM_ROWS)
// Default datatype is integer
typedef int DTYPE;
typedef hls::stream<DTYPE> my_data_fifo;
// Read data function: reads data from global memory
void read_data(DTYPE *inx, my_data_fifo &inFifo) {
read_loop_i:
for (int i = 0; i < NUM_ROWS; ++i) {
read_loop_jj:
for (int jj = 0; jj < WORD_PER_ROW; ++jj) {
#pragma HLS PIPELINE II=1
inFifo << inx[WORD_PER_ROW * i + jj];
}
}
}
// Write data function - writes results to global memory
void write_data(DTYPE *outx, my_data_fifo &outFifo) {
write_loop_i:
for (int i = 0; i < NUM_ROWS; ++i) {
write_loop_jj:
for (int jj = 0; jj < WORD_PER_ROW; ++jj) {
#pragma HLS PIPELINE II=1
outFifo >> outx[WORD_PER_ROW * i + jj];
}
}
}
// Compute function is pretty simple because this example is focused on efficient
// memory access pattern.
void compute(my_data_fifo &inFifo, my_data_fifo &outFifo, int alpha) {
compute_loop_i:
for (int i = 0; i < NUM_ROWS; ++i) {
compute_loop_jj:
for (int jj = 0; jj < WORD_PER_ROW; ++jj) {
#pragma HLS PIPELINE II=1
DTYPE inTmp;
inFifo >> inTmp;
DTYPE outTmp = inTmp * alpha;
outFifo << outTmp;
}
}
}
extern "C" {
void row_array_2d(DTYPE *inx, DTYPE *outx, int alpha) {
// AXI master interface
#pragma HLS INTERFACE m_axi port = inx offset = slave bundle = gmem
#pragma HLS INTERFACE m_axi port = outx offset = slave bundle = gmem
// AXI slave interface
#pragma HLS INTERFACE s_axilite port = inx bundle = control
#pragma HLS INTERFACE s_axilite port = outx bundle = control
#pragma HLS INTERFACE s_axilite port = alpha bundle = control
#pragma HLS INTERFACE s_axilite port = return bundle = control
my_data_fifo inFifo;
// By default the FIFO depth is 2, user can change the depth by using
// #pragma HLS stream variable=inFifo depth=256
my_data_fifo outFifo;
// Dataflow enables task level pipelining, allowing functions and loops to execute
// concurrently. For more details please refer to UG902.
#pragma HLS DATAFLOW
// Read data from each row of 2D array
read_data(inx, inFifo);
// Do computation with the acquired data
compute(inFifo, outFifo, alpha);
// Write data to each row of 2D array
write_data(outx, outFifo);
return;
}
}
Summary
Write code in such a way that bursting can be inferred. Ensure that none of the preconditions are violated.
Bursting does not mean that you will get all your data in one shot – it is about merging the requests together into one request, but the data will arrive sequentially, one after another.
Burst length of 16 is ideal, but even burst lengths of 8 are enough. Bigger bursts have more latency while shorter bursts can be pipelined. Do not confuse bursting with pipelining, but note that bursts can be pipelined with other bursts.
If your bursts are of fixed length, you can unroll the inner loop where bursts are inferred and pipeline the outer loop. This will achieve the same burst length, but also pipelining between the bursts to enable higher throughput.
For greater throughput, focus on widening the interface up to 512 bits rather than simply achieving longer bursts.
Bigger bursts have higher priority with the AXI interconnect. No dynamic arbitration is done inside the kernel.
You can have two m_axi ports connected to the same DDR to model mutually exclusive access inside the kernel, but the AXI interconnect outside the kernel will arbitrate competing requests.
One way to get around the out-of-order access restriction is to create your own buffer in BRAM, store the bursts in this buffer and then use this buffer to do out of order accesses. This is typically called a line buffer and is a common optimization used in video processing.
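A minimal sketch of this line buffer pattern (the array names, sizes, and indices are assumptions for illustration):
int linebuf[WIDTH];
read_row: for (int i = 0; i < WIDTH; ++i) {
#pragma HLS PIPELINE
  linebuf[i] = gmem[row*WIDTH + i]; // sequential accesses allow a burst read
}
// Out-of-order accesses are now served from on-chip memory
... = linebuf[x2];
... = linebuf[x1];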
Review the Burst Optimization section of the Synthesis Summary report to learn more about burst optimizations in the design, and missed burst opportunities.
Adding RTL Blackbox Functions
The RTL blackbox enables the use of existing RTL IP in an HLS project. This lets you add RTL code to your C/C++ code for synthesis of the project by Vitis HLS. The RTL IP can be used in a sequential, pipeline, or dataflow region.
Integrating RTL IP into a Vitis HLS project requires the following files:
- C function signature for the RTL code. This can be placed into a header (.h) file.
- Blackbox JSON description file as discussed in JSON File for RTL Blackbox.
- RTL IP files.
To use the RTL blackbox in an HLS project, use the following steps.
- Call the C function signature from within your top-level function, or a sub-function in the Vitis HLS project.
- Add the blackbox JSON description file to your HLS project using the Add Files command from the Vitis HLS IDE as discussed in Creating a New Vitis HLS Project, or using the add_files command:
add_files -blackbox my_file.json
TIP: As explained in the next section, the new RTL Blackbox wizard can help you generate the JSON file and add the RTL IP to your project.
- Run the Vitis HLS design flow for simulation, synthesis, and co-simulation as usual.
Requirements and Limitations
RTL IP used in the RTL blackbox feature have the following requirements:
- Should be Verilog (.v) code.
- Must have a unique clock signal, and a unique active-High reset signal.
- Must have a CE signal that is used to enable or stall the RTL IP.
- Must use the ap_ctrl_chain protocol as described in Block-Level I/O Protocols.
Within Vitis HLS, the RTL blackbox feature:
- Supports only C++.
- Cannot connect to top-level interface I/O signals.
- Cannot directly serve as the design-under-test (DUT).
- Does not support struct or class type interfaces.
- Supports the following interface protocols as described in JSON File for RTL Blackbox:
- hls::stream
- The RTL blackbox IP supports the hls::stream interface. When this data type is used in the C function, use a FIFO RTL port protocol for this argument in the RTL blackbox IP.
- Arrays
- The RTL blackbox IP supports RAM interfaces for arrays. For array arguments in the C function, use one of the following RTL port protocols for the corresponding argument in the RTL blackbox IP:
  - Single port RAM – RAM_1P
  - Dual port RAM – RAM_T2P
- Scalars and Input Pointers
- The RTL blackbox IP supports C scalars and input pointers only in sequential and pipeline regions. They are not supported in a dataflow region. When these constructs are used in the C function, use the wire port protocol in the RTL IP.
- Inout and Output Pointers
- The RTL blackbox IP supports inout and output pointers only in sequential and pipeline regions. They are not supported in a dataflow region. When these constructs are used in the C function, the RTL IP should use ap_vld for output pointers, and ap_ovld for inout pointers.
Using the RTL Blackbox Wizard
Navigate to the project, right-click to open the RTL Blackbox Wizard as shown in the following figure:
The Wizard is organized into pages that break down the process of creating a JSON file. To navigate between pages, click Next and Back. Once the options are finalized, you can generate a JSON file by clicking OK. The following sections describe each page and its input options.
C++ Model and Header Files
In the Blackbox C/C++ files page, you provide the C++ files which form the functional model of the RTL IP. This C++ model is only used during C++ simulation and C++/RTL co-simulation. The RTL IP is combined with Vitis HLS results to form the output of synthesis.
In this page, you can perform the following:
- Click Add Files to add files.
- Click Edit CFLAGS to provide a linker flag to the functional C model.
- Click Next to proceed.
The C File Wizard page lets you specify the values used for the C functional model of the RTL IP. The fields include:
- C Function
- Specify the C function name of the RTL IP.
- C Argument Name
- Specify the name(s) of the function arguments. These should relate to the ports on the IP.
- C Argument Type
- Specify the data type used for each argument.
- C Port Direction
- Specify the port direction of the argument, corresponding to the port in the IP.
- RAM Type
- Specify the RAM type used at the interface.
- RTL Group Configuration
- Specifies the corresponding RTL signal name.
Click Next to proceed.
RTL IP Definition
The RTL Wizard page lets you define the RTL source for the IP. The fields to define include:
- RTL Files
- This option is used to add or remove the pre-existing RTL IP files.
- RTL Module Name
- Specify the top level RTL IP module name in this field.
- Performance
- Specify performance targets for the IP.
- Latency
- Latency is the time required for the design to complete. Specify the Latency information in this field.
- II
- Define the target II (Initiation Interval). This is the number of clock cycles before new input can be applied.
- Resource
- Specify the device resource utilization for the RTL IP. The resource information provided here will be combined with utilization from synthesis to report the overall design resource utilization. You should be able to extract this information from the Vivado Design Suite.
Click Next to proceed to the RTL Common Signal page, as shown below.
- module_clock
- Specify the name of the clock used in the RTL IP.
- module_reset
- Specify the name of the reset signal used in the IP.
- module_clock_enable
- Specify the name of the clock enable signal in the IP.
- ap_ctrl_chain_protocol_start
- Specify the name of the block control start signal used in the IP.
- ap_ctrl_chain_protocol_ready
- Specify the name of the block control ready signal used in the IP.
- ap_ctrl_chain_protocol_done
- Specify the name of the block control done signal used in the IP.
- ap_ctrl_chain_protocol_continue
- Specify the name of the block control continue signal used in the RTL IP.
Click Finish to automatically generate a JSON file for the specified IP. This can be confirmed through a log message such as the following:
[2019-08-29 16:51:10] RTL Blackbox Wizard Information: the "foo.json" file has been created in the rtl_blackbox/Source folder.
The JSON file can be accessed through the Source file folder; its format is described in the next section.
JSON File for RTL Blackbox
JSON File Format
The following table describes the JSON file format:
Item | Attribute | Description |
---|---|---|
`c_function_name` | | The C++ function name for the blackbox. The `c_function_name` must be consistent with the C function simulation model. |
`rtl_top_module_name` | | The RTL function name for the blackbox. The `rtl_top_module_name` must be consistent with the `c_function_name`. |
`c_files` | `c_file` | Specifies the C file used for the blackbox module. |
 | `cflag` | Provides any compile option necessary for the corresponding C file. |
`rtl_files` | | Specifies the RTL files for the blackbox module. |
`c_parameters` | `c_name` | Specifies the name of the argument used for the blackbox C++ function. Unused `c_parameters` can be deleted from the JSON file. |
 | `c_port_direction` | The access direction for the corresponding C argument (`in`, `out`, or `inout`). |
 | `RAM_type` | Specifies the RAM type to use if the corresponding C argument uses the RTL RAM protocol. Two types of RAM are used: `RAM_1P` (single-port RAM) and `RAM_T2P` (true dual-port RAM). |
 | `rtl_ports` | Specifies the RTL port protocol signals for the corresponding C argument (`c_name`). Every `c_parameter` should be associated with an `rtl_port`. Five types of RTL port protocols are used; refer to the RTL Port Protocol table for additional details. |
`c_return` | `c_port_direction` | Must be `out`. |
 | `rtl_ports` | Specifies the corresponding RTL port name used in the RTL blackbox IP. |
`rtl_common_signal` | `module_clock` | The unique clock signal for the RTL blackbox module. |
 | `module_reset` | Specifies the reset signal for the RTL blackbox module. The reset signal must be active-High (positive valid). |
 | `module_clock_enable` | Specifies the clock enable signal for the RTL blackbox module. The enable signal must be active-High (positive valid). |
 | `ap_ctrl_chain_protocol_idle` | The `ap_idle` signal in the `ap_ctrl_chain` protocol for the RTL blackbox module. |
 | `ap_ctrl_chain_protocol_start` | The `ap_start` signal in the `ap_ctrl_chain` protocol for the RTL blackbox module. |
 | `ap_ctrl_chain_protocol_ready` | The `ap_ready` signal in the `ap_ctrl_chain` protocol for the RTL blackbox IP. |
 | `ap_ctrl_chain_protocol_done` | The `ap_done` signal in the `ap_ctrl_chain` protocol for the RTL blackbox module. |
 | `ap_ctrl_chain_protocol_continue` | The `ap_continue` signal in the `ap_ctrl_chain` protocol for the RTL blackbox module. |
`rtl_performance` | `latency` | Specifies the latency of the RTL blackbox module. It must be a non-negative integer value. For combinational RTL IP, specify `0`; otherwise, specify the exact latency of the RTL module. |
 | `II` | Number of clock cycles before the function can accept new input data. It must be a non-negative integer value. `0` means the blackbox cannot be pipelined; any other value means the blackbox module is pipelined. |
`rtl_resource_usage` | `FF` | Specifies the register utilization for the RTL blackbox module. |
 | `LUT` | Specifies the LUT utilization for the RTL blackbox module. |
 | `BRAM` | Specifies the block RAM utilization for the RTL blackbox module. |
 | `URAM` | Specifies the URAM utilization for the RTL blackbox module. |
 | `DSP` | Specifies the DSP utilization for the RTL blackbox module. |
RTL Port Protocol | RAM Type | C Port Direction | Attribute | User-Defined Name | Notes |
---|---|---|---|---|---|
wire | | in | data_read_in | Specifies a user-defined name used in the RTL blackbox IP; this applies to every attribute in this table. For example, if the RTL port name is "flag", the JSON file entry is `"data_read_in" : "flag"`. | |
ap_vld | | out | data_write_out | | |
 | | | data_write_valid | | |
ap_ovld | | inout | data_read_in | | |
 | | | data_write_out | | |
 | | | data_write_valid | | |
FIFO | | in | FIFO_empty_flag | | Must be negative valid (active-Low). |
 | | | FIFO_read_enable | | |
 | | | FIFO_data_read_in | | |
 | | out | FIFO_full_flag | | Must be negative valid (active-Low). |
 | | | FIFO_write_enable | | |
 | | | FIFO_data_write_out | | |
RAM | RAM_1P | in | RAM_address | | |
 | | | RAM_clock_enable | | |
 | | | RAM_data_read_in | | |
 | | out | RAM_address | | |
 | | | RAM_clock_enable | | |
 | | | RAM_write_enable | | |
 | | | RAM_data_write_out | | |
 | | inout | RAM_address | | |
 | | | RAM_clock_enable | | |
 | | | RAM_write_enable | | |
 | | | RAM_data_write_out | | |
 | | | RAM_data_read_in | | |
RAM | RAM_T2P | in | RAM_address | | Signals with `_snd` belong to the second port of the RAM; signals without `_snd` belong to the first port. |
 | | | RAM_clock_enable | | |
 | | | RAM_data_read_in | | |
 | | | RAM_address_snd | | |
 | | | RAM_clock_enable_snd | | |
 | | | RAM_data_read_in_snd | | |
 | | out | RAM_address | | |
 | | | RAM_clock_enable | | |
 | | | RAM_write_enable | | |
 | | | RAM_data_write_out | | |
 | | | RAM_address_snd | | |
 | | | RAM_clock_enable_snd | | |
 | | | RAM_write_enable_snd | | |
 | | | RAM_data_write_out_snd | | |
 | | inout | RAM_address | | |
 | | | RAM_clock_enable | | |
 | | | RAM_write_enable | | |
 | | | RAM_data_write_out | | |
 | | | RAM_data_read_in | | |
 | | | RAM_address_snd | | |
 | | | RAM_clock_enable_snd | | |
 | | | RAM_write_enable_snd | | |
 | | | RAM_data_write_out_snd | | |
 | | | RAM_data_read_in_snd | | |
JSON File Example
This section provides details on manually writing the JSON file required for the RTL blackbox. The following is an example of a JSON file:
{
"c_function_name" : "foo",
"rtl_top_module_name" : "foo",
"c_files" :
[
{
"c_file" : "../../a/top.cpp",
"cflag" : ""
},
{
"c_file" : "xx.cpp",
"cflag" : "-D KF"
}
],
"rtl_files" : [
"../../foo.v",
"xx.v"
],
"c_parameters" : [{
"c_name" : "a",
"c_port_direction" : "in",
"rtl_ports" : {
"data_read_in" : "a"
}
},
{
"c_name" : "b",
"c_port_direction" : "in",
"rtl_ports" : {
"data_read_in" : "b"
}
},
{
"c_name" : "c",
"c_port_direction" : "out",
"rtl_ports" : {
"data_write_out" : "c",
"data_write_valid" : "c_ap_vld"
}
},
{
"c_name" : "d",
"c_port_direction" : "inout",
"rtl_ports" : {
"data_read_in" : "d_i",
"data_write_out" : "d_o",
"data_write_valid" : "d_o_ap_vld"
}
},
{
"c_name" : "e",
"c_port_direction" : "in",
"rtl_ports" : {
"FIFO_empty_flag" : "e_empty_n",
"FIFO_read_enable" : "e_read",
"FIFO_data_read_in" : "e"
}
},
{
"c_name" : "f",
"c_port_direction" : "out",
"rtl_ports" : {
"FIFO_full_flag" : "f_full_n",
"FIFO_write_enable" : "f_write",
"FIFO_data_write_out" : "f"
}
},
{
"c_name" : "g",
"c_port_direction" : "in",
"RAM_type" : "RAM_1P",
"rtl_ports" : {
"RAM_address" : "g_address0",
"RAM_clock_enable" : "g_ce0",
"RAM_data_read_in" : "g_q0"
}
},
{
"c_name" : "h",
"c_port_direction" : "out",
"RAM_type" : "RAM_1P",
"rtl_ports" : {
"RAM_address" : "h_address0",
"RAM_clock_enable" : "h_ce0",
"RAM_write_enable" : "h_we0",
"RAM_data_write_out" : "h_d0"
}
},
{
"c_name" : "i",
"c_port_direction" : "inout",
"RAM_type" : "RAM_1P",
"rtl_ports" : {
"RAM_address" : "i_address0",
"RAM_clock_enable" : "i_ce0",
"RAM_write_enable" : "i_we0",
"RAM_data_write_out" : "i_d0",
"RAM_data_read_in" : "i_q0"
}
},
{
"c_name" : "j",
"c_port_direction" : "in",
"RAM_type" : "RAM_T2P",
"rtl_ports" : {
"RAM_address" : "j_address0",
"RAM_clock_enable" : "j_ce0",
"RAM_data_read_in" : "j_q0",
"RAM_address_snd" : "j_address1",
"RAM_clock_enable_snd" : "j_ce1",
"RAM_data_read_in_snd" : "j_q1"
}
},
{
"c_name" : "k",
"c_port_direction" : "out",
"RAM_type" : "RAM_T2P",
"rtl_ports" : {
"RAM_address" : "k_address0",
"RAM_clock_enable" : "k_ce0",
"RAM_write_enable" : "k_we0",
"RAM_data_write_out" : "k_d0",
"RAM_address_snd" : "k_address1",
"RAM_clock_enable_snd" : "k_ce1",
"RAM_write_enable_snd" : "k_we1",
"RAM_data_write_out_snd" : "k_d1"
}
},
{
"c_name" : "l",
"c_port_direction" : "inout",
"RAM_type" : "RAM_T2P",
"rtl_ports" : {
"RAM_address" : "l_address0",
"RAM_clock_enable" : "l_ce0",
"RAM_write_enable" : "l_we0",
"RAM_data_write_out" : "l_d0",
"RAM_data_read_in" : "l_q0",
"RAM_address_snd" : "l_address1",
"RAM_clock_enable_snd" : "l_ce1",
"RAM_write_enable_snd" : "l_we1",
"RAM_data_write_out_snd" : "l_d1",
"RAM_data_read_in_snd" : "l_q1"
}
}],
"c_return" : {
"c_port_direction" : "out",
"rtl_ports" : {
"data_write_out" : "ap_return"
}
},
"rtl_common_signal" : {
"module_clock" : "ap_clk",
"module_reset" : "ap_rst",
"module_clock_enable" : "ap_ce",
"ap_ctrl_chain_protocol_idle" : "ap_idle",
"ap_ctrl_chain_protocol_start" : "ap_start",
"ap_ctrl_chain_protocol_ready" : "ap_ready",
"ap_ctrl_chain_protocol_done" : "ap_done",
"ap_ctrl_chain_protocol_continue" : "ap_continue"
},
"rtl_performance" : {
"latency" : "6",
"II" : "2"
},
"rtl_resource_usage" : {
"FF" : "0",
"LUT" : "0",
"BRAM" : "0",
"URAM" : "0",
"DSP" : "0"
}
}
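For reference, a C++ functional-model prototype consistent with the `c_parameters` in the JSON example above might look like the following. The `int` data types and the array size `N` are assumptions for illustration; only the argument names and directions come from the JSON. The FIFO arguments (`e` and `f`) are modeled as `hls::stream` objects, as is typical for FIFO interfaces in Vitis HLS:

```
#include "hls_stream.h"

#define N 16  // assumed array depth, for illustration only

// Prototype only: argument names and directions follow the JSON example;
// data types and sizes are assumptions.
int foo(int a,               // in    -> wire
        int b,               // in    -> wire
        int *c,              // out   -> ap_vld
        int *d,              // inout -> ap_ovld
        hls::stream<int> &e, // in    -> FIFO
        hls::stream<int> &f, // out   -> FIFO
        int g[N],            // in    -> RAM_1P
        int h[N],            // out   -> RAM_1P
        int i[N],            // inout -> RAM_1P
        int j[N],            // in    -> RAM_T2P
        int k[N],            // out   -> RAM_T2P
        int l[N]);           // inout -> RAM_T2P
// The int return value maps to c_return (RTL port ap_return).
```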