Optimization Techniques in Vitis HLS
This section outlines the various optimization techniques you can use to direct Vitis HLS to produce a micro-architecture that satisfies the desired performance and area goals. Using Vitis HLS, you can apply different optimization directives to the design, including:
- Pipelining tasks, allowing the next execution of the task to begin before the current execution is complete.
- Specifying a target latency for the completion of functions, loops, and regions.
- Specifying a limit on the number of resources used.
- Overriding the inherent or implied dependencies in the code to permit specific operations. For example, if it is acceptable to discard or ignore the initial data values, such as in a video stream, allow a memory read before write if it results in better performance.
- Specifying the I/O protocol to ensure function arguments can be connected to other hardware blocks with the same I/O protocol.
Note: Vitis HLS automatically determines the I/O protocol used by any sub-functions. You cannot control these ports except to specify whether the port is registered.
- Optimizing for Throughput presents the primary optimizations in the order in which they are typically used: pipeline the tasks to improve performance, improve the flow of data between tasks, and optimize structures to address issues which may limit performance.
- Optimizing for Latency uses the techniques of latency constraints and the removal of loop transitions to reduce the number of clock cycles required to complete.
- Optimizing for Area focuses on how operations are implemented: controlling the number of operations, and how those operations are implemented in hardware, is the principal technique for improving area.
- Optimizing Logic discusses optimizations affecting the implementation of the RTL.
You can add optimization directives directly into the source code as HLS pragmas, or you can use Tcl set_directive commands to apply optimization directives in a Tcl script used by a solution during compilation, as discussed in Adding Pragmas and Directives. The following table lists the optimization directives provided by Vitis HLS as either pragma or Tcl directive.
Directive | Description |
---|---|
ALLOCATION | Specify a limit for the number of operations, implementations, or functions used. This can force the sharing of hardware resources and may increase latency. |
ARRAY_PARTITION | Partitions large arrays into multiple smaller arrays or into individual registers, to improve access to data and remove block RAM bottlenecks. |
ARRAY_RESHAPE | Reshape an array from one with many elements to one with greater word-width. Useful for improving block RAM accesses without using more block RAM. |
BIND_OP | Define a specific implementation for an operation in the RTL. |
BIND_STORAGE | Define a specific implementation for a storage element, or memory, in the RTL. |
DATAFLOW | Enables task-level pipelining, allowing functions and loops to execute concurrently. Used to optimize throughput and/or latency. |
DEPENDENCE | Used to provide additional information that can overcome loop-carried dependencies and allow loops to be pipelined (or pipelined with lower intervals). |
DISAGGREGATE | Break a struct down into its individual elements. |
EXPRESSION_BALANCE | Allows automatic expression balancing to be turned off. |
INLINE | Inlines a function, removing function hierarchy at this level. Used to enable logic optimization across function boundaries and improve latency/interval by reducing function call overhead. |
INTERFACE | Specifies how RTL ports are created from the function description. |
LATENCY | Allows a minimum and maximum latency constraint to be specified. |
LOOP_FLATTEN | Allows nested loops to be collapsed into a single loop with improved latency. |
LOOP_MERGE | Merge consecutive loops to reduce overall latency, increase sharing, and improve logic optimization. |
LOOP_TRIPCOUNT | Used for loops that have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting. |
OCCURRENCE | Used when pipelining functions or loops, to specify that the code in a location is executed at a lesser rate than the code in the enclosing function or loop. |
PIPELINE | Reduces the initiation interval by allowing the overlapped execution of operations within a loop or function. |
RESET | This directive is used to add or remove reset on a specific state variable (global or static). |
SHARED | Specifies that a global variable, or function argument array, is shared among multiple dataflow processes without the need for synchronization. |
STABLE | Indicates that a variable input or output of a dataflow region can be ignored when generating the synchronizations at entry and exit of the dataflow region. |
STREAM | Specifies that a specific array is to be implemented as a FIFO or RAM memory channel during dataflow optimization. When using hls::stream, the STREAM optimization directive is used to override the configuration of the hls::stream. |
TOP | The top-level function for synthesis is specified in the project settings. This directive may be used to specify any function as the top-level for synthesis. This allows different solutions within the same project to specify different top-level functions for synthesis without needing to create a new project. |
UNROLL | Unroll for-loops to create multiple instances of the loop body and its instructions that can then be scheduled independently. |
In addition to the optimization directives, Vitis HLS provides a number of configuration commands that can influence the performance of synthesis results. Details on using configuration commands can be found in Setting Configuration Options. The following table reflects some of these commands.
GUI Directive | Description |
---|---|
Config Array Partition | Determines how arrays are partitioned, including global arrays and if the partitioning impacts array ports. |
Config Compile | Controls synthesis specific optimizations such as the automatic loop pipelining and floating point math optimizations. |
Config Dataflow | Specifies the default memory channel and FIFO depth in dataflow optimization. |
Config Interface | Controls I/O ports not associated with the top-level function arguments and allows unused ports to be eliminated from the final RTL. |
Config Op | Configures the default latency and implementation of specified operations. |
Config RTL | Provides control over the output RTL including file and module naming, reset style and FSM encoding. |
Config Schedule | Determines the effort level to use during the synthesis scheduling phase and the verbosity of the output messages. |
Config Storage | Configures the default latency and implementation of specified storage types. |
Config Unroll | Configures the default tripcount threshold for unrolling loops. |
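Configuration commands are set as Tcl commands in the solution script or Tcl console. As a minimal sketch (the option values shown are illustrative, not recommendations):

# Disable automatic loop pipelining for this solution
config_compile -pipeline_loops 0
# Reset control registers plus state derived from static and global variables
config_rtl -reset state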
Controlling the Reset Behavior
The reset port is used in an FPGA to return the registers and block RAM connected to the reset port to an initial value any time the reset signal is applied. Typically, the most important aspect of RTL configuration is selecting the reset behavior.
The presence and behavior of the RTL reset port is controlled using the config_rtl command, as shown in the following figure. You can access this command through the solution settings in the IDE.
The reset settings include the ability to set the polarity of the reset, and whether the reset is synchronous or asynchronous. More importantly, through the reset option, the settings control which registers are reset when the reset signal is applied.
Note: When AXI4 interfaces are used on a design, the reset polarity is automatically changed to active-Low irrespective of the setting in the config_rtl configuration. This is required by the AXI4 standard.
The reset option has four settings:
- none: No reset is added to the design.
- control: This is the default and ensures all control registers are reset. Control registers are those used in state machines and to generate I/O protocol signals. This setting ensures the design can immediately start its operation state.
- state: This option adds a reset to control registers (as in the control setting) plus any registers or memories derived from static and global variables in the C/C++ code. This setting ensures static and global variables initialized in the C/C++ code are reset to their initialized value after the reset is applied.
- all: This adds a reset to all registers and memories in the design.
Finer grain control over reset is provided through the RESET pragma or directive. Static and global variables can have a reset added through the RESET directive. Variables can also be removed from those being reset by using the RESET directive's off option.
Note: When using the state or all options, consider the effect on resetting arrays as discussed in Initializing and Resetting Arrays.
Initialization Behavior
In C/C++, variables defined with the static qualifier and those defined in the global scope are initialized to zero, by default. These variables may optionally be assigned a specific initial value. For these initialized variables, the value in the C/C++ code is assigned at compile time (at time zero) and never again. In both cases, the initial value is implemented in the RTL.
- During RTL simulation the variables are initialized with the same values as the C/C++ code.
- The variables are also initialized in the bitstream used to program the FPGA. When the device powers up, the variables will start in their initialized state.
In the RTL, although the variables start with the same initial value as the C/C++ code, there is no way to force the variable to return to this initial state. To restore the initial state, variables must be implemented with a reset signal.
Initializing and Resetting Arrays
Note: When the reset option state or all is used, it forces all arrays implemented as block RAM to be returned to their initialized state after reset. This may result in two very undesirable conditions in the RTL design:
- Unlike a power-up initialization, an explicit reset requires the RTL design to iterate through each address in the block RAM to set the value: this can take many clock cycles if N is large, and requires more area resources to implement the reset.
- A reset is added to every array in the design.
To prevent adding reset logic onto every such block RAM, and incurring the cycle overhead to reset all elements in the RAM, specify the default control reset mode and use the RESET directive to identify individual static or global variables to be reset.
Alternatively, you can use the state reset mode, and use the RESET directive's off option to identify individual static or global variables from which to remove the reset.
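For example, the following sketch shows both approaches (the function and variable names are hypothetical):

// Default "control" reset mode: explicitly add a reset to one state variable.
void running_sum(int in, int *out) {
    static int acc; // state variable implemented as a register
    #pragma HLS RESET variable=acc
    acc += in;
    *out = acc;
}

// "state" reset mode: exclude a large array from the reset to avoid a long,
// multi-cycle reset sequence on the block RAM.
void scratchpad(int in, int *out, int idx) {
    static int big_mem[1024];
    #pragma HLS RESET variable=big_mem off
    big_mem[idx] = in;
    *out = big_mem[idx];
}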
Optimizing for Throughput
Use the following optimizations to improve throughput or reduce the initiation interval.
Function and Loop Pipelining
Pipelining allows operations to happen concurrently: each execution step does not have to complete all operations before it begins the next operation. Pipelining is applied to functions and loops. The throughput improvements in function pipelining are shown in the following figure.
Without pipelining, the function in the above example reads an input every 3 clock cycles and outputs a value after 2 clock cycles. The function has an initiation interval (II) of 3 and a latency of 3. With pipelining, for this example, a new input is read every cycle (II=1) with no change to the output latency.
Loop pipelining allows the operations in a loop to be implemented in an overlapping manner. In the following figure, (A) shows the default sequential operation where there are 3 clock cycles between each input read (II=3), and it requires 8 clock cycles before the last output write is performed.
In the pipelined version of the loop shown in (B), a new input sample is read every cycle (II=1) and the final output is written after only 4 clock cycles: substantially improving both the II and latency while using the same hardware resources.
Functions or loops are pipelined using the PIPELINE directive. The directive is specified in the region that constitutes the function or loop body. The initiation interval defaults to 1 if not specified but may be explicitly specified.
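For example, a minimal sketch of pipelining a loop (the function and loop names are hypothetical):

#define N 32
void scale(int in[N], int out[N]) {
    SCALE_LOOP: for (int i = 0; i < N; i++) {
        #pragma HLS PIPELINE II=1
        // After the pipeline fills, one new iteration starts every clock cycle.
        out[i] = in[i] * 3;
    }
}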
Pipelining is applied only to the specified region and not to the hierarchy below. However, all loops in the hierarchy below are automatically unrolled. Any sub-functions in the hierarchy below the specified function must be pipelined individually. If the sub-functions are pipelined, the pipelined functions above it can take advantage of the pipeline performance. Conversely, any sub-function below the pipelined top-level function that is not pipelined might be the limiting factor in the performance of the pipeline.
There is a difference in how pipelined functions and loops behave.
- In the case of functions, the pipeline runs forever and never ends.
- In the case of loops, the pipeline executes until all iterations of the loop are completed.
This difference in behavior is summarized in the following figure.
The difference in behavior impacts how inputs and outputs to the pipeline are processed. As seen in the figure above, a pipelined function will continuously read new inputs and write new outputs. By contrast, because a loop must first finish all operations in the loop before starting the next loop, a pipelined loop causes a “bubble” in the data stream; that is, a point when no new inputs are read as the loop completes the execution of the final iterations, and a point when no new outputs are written as the loop starts new loop iterations.
Rewinding Pipelined Loops for Performance
To avoid the issues shown in the previous figure (Function and Loop Pipelining), the PIPELINE pragma has an optional command, rewind. This command enables the overlap of the execution of successive calls to the loop, when this loop is the outermost construct of the top function or of a dataflow process (and the dataflow region is executed multiple times).
The following figure shows the operation when the rewind
option is used when
pipelining a loop. At the end of the loop iteration count, the
loop starts to execute again. While it generally re-executes
immediately, a delay is possible and is shown and described in
the GUI.
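A minimal sketch, assuming the loop is the outermost construct of the top function (names are hypothetical):

#define N 32
void top(int in[N], int out[N]) {
    ACCUM: for (int i = 0; i < N; i++) {
        // rewind lets successive invocations of this loop overlap, avoiding
        // the pipeline "bubble" between calls.
        #pragma HLS PIPELINE II=1 rewind
        out[i] = in[i] + 1;
    }
}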
Flushing Pipelines
Pipelines continue to execute as long as data is available at the input of the pipeline. If there is no data available to process, the pipeline will stall. This is shown in the following figure, where the input data valid signal goes low to indicate there is no more data. Once there is new data available to process, the pipeline will continue operation.
In some cases, it is desirable to have a pipeline that can be “emptied” or
“flushed.” The flush
option is provided to perform this. When
a pipeline is “flushed” the pipeline stops reading new inputs when none are available (as
determined by a data valid signal at the start of the
pipeline) but continues processing, shutting down each successive pipeline stage, until the
final input has been processed through to the output of the pipeline.
The default style of pipelining implemented by Vitis HLS is defined by the config_compile
-pipeline_style
command. You can specify stalling pipelines (stp), or free-running
flushing pipelines (frp) to be used throughout the design. You can also define a third type of
flushable pipeline (flp) with the PIPELINE pragma or directive, using the enable_flush
option. This option applies to the specific scope of
the pragma or directive only, and does not change the global default assigned by config_compile
.
Name | Stalled Pipeline (default) | Free-Running/Flushable Pipeline | Flushable Pipeline |
---|---|---|---|
Global Setting | config_compile -pipeline_style stp (default) | config_compile -pipeline_style frp | N/A |
Pragma/Directive | #pragma HLS pipeline | N/A | #pragma HLS pipeline enable_flush |
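For example, a flushable pipeline can be requested on a single pipelined function (a sketch; enable_flush applies to pipelined functions, and the stream names are hypothetical):

#include "hls_stream.h"
void worker(hls::stream<int> &in, hls::stream<int> &out) {
    // The pipeline drains in-flight data to 'out' when 'in' stops supplying data.
    #pragma HLS PIPELINE II=1 enable_flush
    out.write(in.read() + 1);
}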
Automatic Loop Pipelining
The config_compile configuration enables loops to be pipelined automatically based on the iteration count. This configuration is accessed through the solution settings.
The pipeline_loops option sets the iteration limit. All loops with an iteration count below this limit are automatically pipelined. The default is 64. Setting the limit to 0 disables automatic loop pipelining.
Given the following example code:
for (y = 0; y < 480; y++) {
for (x = 0; x < 640; x++) {
for (i = 0; i < 5; i++) {
// do something 5 times
...
}
}
}
If the pipeline_loops
option is set to 6, the innermost
for
loop in the above code snippet will be automatically
pipelined. This is equivalent to the following code snippet:
for (y = 0; y < 480; y++) {
for (x = 0; x < 640; x++) {
for (i = 0; i < 5; i++) {
#pragma HLS PIPELINE II=1
// do something 5 times
...
}
}
}
If there are loops in the design for which you do not want to use automatic pipelining, apply the PIPELINE directive with the off option to that loop. The off option prevents automatic loop pipelining.
Note: Vitis HLS applies the config_compile pipeline_loops option after performing all user-specified directives. For example, if Vitis HLS applies a user-specified UNROLL directive to a loop, the loop is first unrolled, and automatic loop pipelining cannot be applied.
Unrolling Loops to Improve Pipelining
By default, loops are kept rolled in Vitis HLS. These rolled loops generate a hardware resource which is used by each iteration of the loop. While this creates a resource efficient block, it can sometimes be a performance bottleneck.
Vitis HLS provides the ability to unroll or partially unroll FOR loops using the UNROLL pragma or directive.
The following figure shows both the advantages of loop unrolling and the
implications that must be considered when unrolling loops. This example assumes the arrays
a[i]
, b[i]
, and c[i]
are
mapped to block RAMs. This example shows how easy it is to create many different
implementations by the simple application of loop unrolling.
- Rolled Loop: When the loop is rolled, each iteration is performed in a separate clock cycle. This implementation takes four clock cycles, requires only one multiplier, and each block RAM can be a single-port block RAM.
- Partially Unrolled Loop: In this example, the loop is partially unrolled by a factor of 2. This implementation requires two multipliers and dual-port RAMs to support two reads or writes to each RAM in the same clock cycle. However, this implementation takes only 2 clock cycles to complete: half the initiation interval and half the latency of the rolled loop version.
- Unrolled Loop: In the fully unrolled version, all loop operations can be performed in a single clock cycle. This implementation, however, requires four multipliers. More importantly, this implementation requires the ability to perform 4 read and 4 write operations in the same clock cycle. Because a block RAM only has a maximum of two ports, this implementation requires the arrays be partitioned.
To perform loop unrolling, you can apply the UNROLL directives to individual loops in the design. Alternatively, you can apply the UNROLL directive to a function, which unrolls all loops within the scope of the function.
If a loop is completely unrolled, all operations will be performed in parallel if data dependencies and resources allow. If operations in one iteration of the loop require the result from a previous iteration, they cannot execute in parallel but will execute as soon as the data is available. A completely unrolled and fully optimized loop will generally involve multiple copies of the logic in the loop body.
The following example code demonstrates how loop unrolling can be used to create an optimized design. In this example, the data is stored in the arrays as interleaved channels. If the loop is pipelined with II=1, each channel is only read and written every eighth clock cycle.
// Array Order : 0 1 2 3 4 5 6 7 8 9 10 etc. 16 etc...
// Sample Order: A0 B0 C0 D0 E0 F0 G0 H0 A1 B1 C2 etc. A2 etc...
// Output Order: A0 B0 C0 D0 E0 F0 G0 H0 A0+A1 B0+B1 C0+C2 etc. A0+A1+A2 etc...
#define CHANNELS 8
#define SAMPLES 400
#define N CHANNELS * SAMPLES
void foo (dout_t d_out[N], din_t d_in[N]) {
int i, rem;
// Store accumulated data
static dacc_t acc[CHANNELS];
// Accumulate each channel
For_Loop: for (i=0;i<N;i++) {
rem=i%CHANNELS;
acc[rem] = acc[rem] + d_in[i];
d_out[i] = acc[rem];
}
}
Partially unrolling the loop by a factor
of 8
will allow each of the channels (every eighth sample) to be processed in parallel (if the
input and output arrays are also partitioned in a cyclic
manner to allow multiple accesses per clock cycle). If the loop is also pipelined with the
rewind
option, this design will continuously process all 8
channels in parallel if called in a pipelined fashion (that is, either at the top, or within a
dataflow region).
void foo (dout_t d_out[N], din_t d_in[N]) {
#pragma HLS ARRAY_PARTITION variable=d_in type=cyclic factor=8 dim=1
#pragma HLS ARRAY_PARTITION variable=d_out type=cyclic factor=8 dim=1
int i, rem;
// Store accumulated data
static dacc_t acc[CHANNELS];
// Accumulate each channel
For_Loop: for (i=0;i<N;i++) {
#pragma HLS PIPELINE rewind
#pragma HLS UNROLL factor=8
rem=i%CHANNELS;
acc[rem] = acc[rem] + d_in[i];
d_out[i] = acc[rem];
}
}
Partial loop unrolling does not require the unroll factor to be an integer multiple of the maximum iteration count. Vitis HLS adds an exit check to ensure partially unrolled loops are functionally identical to the original loop. For example, given the following code:
for(int i = 0; i < N; i++) {
a[i] = b[i] + c[i];
}
Loop unrolling by a factor of 2 effectively transforms the code to look
like the following example where the break
construct
is used to ensure the functionality remains the same:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
if (i+1 >= N) break;
a[i+1] = b[i+1] + c[i+1];
}
Because N is a variable, Vitis HLS might not
be able to determine its maximum value (it could be driven from an input port). If the
unrolling factor, which is 2 in this case, is an integer factor of the maximum iteration count
N, the skip_exit_check
option removes the exit check and
associated logic. The effect of unrolling can now be represented as:
for(int i = 0; i < N; i += 2) {
a[i] = b[i] + c[i];
a[i+1] = b[i+1] + c[i+1];
}
This helps minimize the area and simplify the control logic.
Addressing Failure to Pipeline
When a function is pipelined, all loops in the hierarchy below are automatically unrolled. This is a requirement for pipelining to proceed. If a loop has variable bounds it cannot be unrolled. This will prevent the function from being pipelined.
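A common workaround, sketched below assuming a known upper bound MAX_ITER on the variable bound n (the names are hypothetical), is to rewrite the loop with a fixed bound and an internal guard so it can be unrolled:

#define MAX_ITER 64 // assumed upper bound on 'n'
int sum_bounded(int data[MAX_ITER], int n) {
    int sum = 0;
    // Fixed bounds allow the loop to be unrolled, so the enclosing function
    // can be pipelined; the guard preserves the original behavior.
    SUM_LOOP: for (int i = 0; i < MAX_ITER; i++) {
        if (i < n)
            sum += data[i];
    }
    return sum;
}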
Static Variables
Static variables are used to keep data between loop iterations, often resulting in registers in the final implementation. If this is encountered in pipelined functions, Vitis HLS might not be able to optimize the design sufficiently, which would result in initiation intervals longer than required.
The following is a typical example of this situation:
function_foo()
{
static bool change = 0;
if (condition_xyz){
change = x; // store
}
y = change; // load
}
If Vitis HLS cannot optimize this code, the store operation requires a cycle and the load operation requires an additional cycle. If this function is part of a pipeline, the pipeline has to be implemented with a minimum initiation interval of 2, as the static change variable creates a loop-carried dependency.
One way the user can avoid this is to rewrite the code, as shown in the following example. It ensures that only a read or a write operation is present in each iteration of the loop, which enables the design to be scheduled with II=1.
function_readstream()
{
static bool change = 0;
bool change_temp = 0;
if (condition_xyz)
{
change = x; // store
change_temp = x;
}
else
{
change_temp = change; // load
}
y = change_temp;
}
Partitioning Arrays to Improve Pipelining
A common issue when pipelining functions is the following message:
INFO: [SCHED 204-61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('mem_load_2',
bottleneck.c:62) on array 'mem' due to limited memory ports.
WARNING: [SCHED 204-69] The resource limit of core:RAM:mem:p0 is 1, current
assignments:
WARNING: [SCHED 204-69] 'load' operation ('mem_load', bottleneck.c:62) on array
'mem',
WARNING: [SCHED 204-69] The resource limit of core:RAM:mem:p1 is 1, current
assignments:
WARNING: [SCHED 204-69] 'load' operation ('mem_load_1', bottleneck.c:62) on array
'mem',
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
In this example, Vitis HLS states it cannot
reach the specified initiation interval (II) of 1 because it cannot schedule a load
(read) operation (mem_load_2
)
onto the memory because of limited memory ports. The above message notes that the resource
limit for "core:RAM:mem:p0 is 1
" which is used by the
operation mem_load
on line 62. The second port of the block
RAM also only has 1 resource, which is also used by operation mem_load_1
. Due to this memory port contention, Vitis HLS reports a final II of 2 instead of the desired 1.
This issue is typically caused by arrays. Arrays are implemented as block RAM which only has a maximum of two data ports. This can limit the throughput of a read/write (or load/store) intensive algorithm. The bandwidth can be improved by splitting the array (a single block RAM resource) into multiple smaller arrays (multiple block RAMs), effectively increasing the number of ports.
Arrays are partitioned using the ARRAY_PARTITION directive. Vitis HLS provides three types of array partitioning, as shown in the following figure. The three styles of partitioning are:
- block: The original array is split into equally sized blocks of consecutive elements of the original array.
- cyclic: The original array is split into equally sized blocks interleaving the elements of the original array.
- complete: The default operation is to split the array into its individual elements. This corresponds to resolving a memory into registers.
For block
and cyclic
partitioning the factor
option specifies the number of arrays that are created. In
the preceding figure, a factor of 2 is used, that is, the array is divided into two smaller
arrays. If the number of elements in the array is not an integer multiple of the factor, the
final array has fewer elements.
When partitioning multi-dimensional arrays, the dimension
option is used to specify which dimension is partitioned. The following
figure shows how the dimension
option is used
to partition the following example code:
void foo (...) {
int my_array[10][6][4];
...
}
The examples in the figure demonstrate how partitioning
dimension
3 results in 4 separate arrays and partitioning
dimension
1 results in 10 separate arrays. If zero is specified as the
dimension
, all dimensions are partitioned.
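For example, a sketch of the corresponding pragma (using the my_array declaration above):

void foo (...) {
    int my_array[10][6][4];
    // Partition dimension 3 completely: results in 4 separate [10][6] arrays.
    // Using dim=1 instead would produce 10 separate arrays, and dim=0 would
    // partition every dimension.
    #pragma HLS ARRAY_PARTITION variable=my_array type=complete dim=3
    ...
}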
Automatic Array Partitioning
The config_array_partition configuration determines how arrays are automatically partitioned based on the number of elements. This configuration is accessed through the solution settings.
Managing Pipeline Dependencies
Vitis HLS constructs a hardware datapath that corresponds to the C/C++ source code.
When there is no pipeline directive, the execution is sequential so there are no dependencies to take into account. But when the design has been pipelined, the tool needs to deal with the same dependencies as found in processor architectures for the hardware that Vitis HLS generates.
Typical cases of data dependencies or memory dependencies are when a read or a write occurs after a previous read or write.
- A read-after-write (RAW), also called a true dependency, is when an instruction
(and data it reads/uses) depends on the result of a previous operation.
- I1: t = a * b;
- I2: c = t + 1;
The read in statement I2 depends on the write of t in statement I1. If the instructions are reordered, it uses the previous value of t.
- A write-after-read (WAR), also called an anti-dependence, is when an
instruction cannot update a register or memory (by a write) before a previous instruction has
read the data.
- I1: b = t + a;
- I2: t = 3;
The write in statement I2 cannot execute before statement I1, otherwise the result of b is invalid.
- A write-after-write (WAW) is a dependence when a register or memory must be
written in specific order otherwise other instructions might be corrupted.
- I1: t = a * b;
- I2: c = t + 1;
- I3: t = 1;
The write in statement I3 must happen after the write in statement I1. Otherwise, the statement I2 result is incorrect.
- A read-after-read has no dependency as instructions can be freely reordered if the variable is not declared as volatile. If it is, then the order of instructions has to be maintained.
For example, when a pipeline is generated, the tool needs to take care that a register or memory location read at a later stage has not been modified by a previous write. This is a true dependency or read-after-write (RAW) dependency. A specific example is:
int top(int a, int b) {
int t,c;
I1: t = a * b;
I2: c = t + 1;
return c;
}
Statement I2 cannot be evaluated before statement I1 completes because there is a dependency on variable t. In hardware, if the multiplication takes 3 clock cycles, then I2 is delayed for that amount of time. If the above function is pipelined, Vitis HLS detects this as a true dependency and schedules the operations accordingly. It uses data forwarding optimization to remove the RAW dependency, so that the function can operate at II=1.
Memory dependencies arise when the same situation applies to an array rather than just scalar variables.
int top(int a) {
int r=1,rnext,m,i,out;
static int mem[256];
L1: for(i=0;i<=254;i++) {
#pragma HLS PIPELINE II=1
I1: m = r * a; mem[i+1] = m; // line 7
I2: rnext = mem[i]; r = rnext; // line 8
}
return r;
}
In the above example, scheduling of loop L1
leads
to a scheduling warning message:
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1,
distance = 1)
between 'store' operation (top.cpp:7) of variable 'm', top.cpp:7 on array 'mem' and
'load' operation ('rnext', top.cpp:8) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
There are no issues within the same iteration of the loop as you write an index and read another one. The two instructions could execute at the same time, concurrently. However, observe the read and writes over a few iterations:
// Iteration for i=0
I1: m = r * a; mem[1] = m; // line 7
I2: rnext = mem[0]; r = rnext; // line 8
// Iteration for i=1
I1: m = r * a; mem[2] = m; // line 7
I2: rnext = mem[1]; r = rnext; // line 8
// Iteration for i=2
I1: m = r * a; mem[3] = m; // line 7
I2: rnext = mem[2]; r = rnext; // line 8
When considering two successive iterations, the multiplication result m
(with a latency = 2) from statement I1
is written to a location that is read by statement I2
of the next iteration of the loop into rnext
. In
this situation, there is a RAW dependence as the next loop iteration cannot start reading mem[i]
before the previous computation's write completes.
Note that if the clock frequency is increased, then the multiplier needs more pipeline stages and increased latency. This will force II to increase as well.
Consider the following code, where the operations have been swapped, changing the functionality.
int top(int a) {
int r,m,i;
static int mem[256];
L1: for(i=0;i<=254;i++) {
#pragma HLS PIPELINE II=1
I1: r = mem[i]; // line 7
I2: m = r * a , mem[i+1]=m; // line 8
}
return r;
}
The scheduling warning is:
INFO: [SCHED 204-61] Pipelining loop 'L1'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1,
distance = 1)
between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 2,
distance = 1)
between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 3,
distance = 1)
between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem'
and 'load' operation ('r', top.cpp:7) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 4, Depth: 4.
Observe the continued read and writes over a few iterations:
Iteration with i=0
I1: r = mem[0]; // line 7
I2: m = r * a , mem[1]=m; // line 8
Iteration with i=1
I1: r = mem[1]; // line 7
I2: m = r * a , mem[2]=m; // line 8
Iteration with i=2
I1: r = mem[2]; // line 7
I2: m = r * a , mem[3]=m; // line 8
A longer II is needed because the RAW dependence is via reading r
from mem[i]
, performing the
multiplication, and writing to mem[i+1]
.
Removing False Dependencies to Improve Loop Pipelining
False dependencies are dependencies that arise when the compiler is too conservative. These dependencies do not exist in the actual code, but the compiler cannot determine this. These dependencies can prevent loop pipelining.
The following example illustrates false dependencies. In this example, the
read and write accesses are to two different addresses in the same loop iteration. Both of
these addresses are dependent on the input data, and can point to any individual element of
the hist
array. Because of this, Vitis HLS assumes that both of these accesses can access the same location. As a
result, it schedules the read and write operations to the array in alternating cycles,
resulting in a loop II of 2. However, the code shows that hist[old]
and hist[val]
can never access the same
location because they are in the else branch of the conditional if(old
== val)
.
void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
int acc = 0;
int i, val;
int old = in[0];
for(i = 0; i < INPUT_SIZE; i++)
{
#pragma HLS PIPELINE II=1
val = in[i];
if(old == val)
{
acc = acc + 1;
}
else
{
hist[old] = acc;
acc = hist[val] + 1;
}
old = val;
}
hist[old] = acc;
}
To overcome this deficiency, you can use the DEPENDENCE directive to provide Vitis HLS with additional information about the dependencies.
void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
int acc = 0;
int i, val;
int old = in[0];
#pragma HLS DEPENDENCE variable=hist type=intra direction=RAW dependent=false
for(i = 0; i < INPUT_SIZE; i++)
{
#pragma HLS PIPELINE II=1
val = in[i];
if(old == val)
{
acc = acc + 1;
}
else
{
hist[old] = acc;
acc = hist[val] + 1;
}
old = val;
}
hist[old] = acc;
}
When specifying dependencies, there are two main types:
- Inter: Specifies the dependency is between different iterations of the same loop. If this is specified as FALSE, it allows Vitis HLS to perform operations in parallel when the loop is pipelined, unrolled, or partially unrolled; it prevents such concurrent operation when specified as TRUE.
- Intra: Specifies dependence within the same iteration of a loop, for example an array being accessed at the start and end of the same iteration. When intra dependencies are specified as FALSE, Vitis HLS may move operations freely within the loop, increasing their mobility and potentially improving performance or area. When the dependency is specified as TRUE, the operations must be performed in the order specified.
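As a minimal sketch of the inter form (the names are hypothetical, and the pragma is only correct if the stated assumption actually holds):

#define M 128
void scatter_add(int buf[1024], const int idx[M], const int in[M]) {
    LOOP: for (int i = 0; i < M; i++) {
        #pragma HLS PIPELINE II=1
        // Assumption: idx[] never repeats a value within the pipeline depth,
        // so no iteration reads a location written by an in-flight iteration.
        #pragma HLS DEPENDENCE variable=buf type=inter dependent=false
        buf[idx[i]] += in[i];
    }
}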
Scalar Dependencies
Some scalar dependencies are much harder to resolve and often require changes to the source code. A scalar data dependency could look like the following:
while (a != b) {
if (a > b) a -= b;
else b -= a;
}
The next iteration of this loop cannot start until the current iteration has calculated the updated values of a and b, as shown in the following figure.
If the result of the previous loop iteration must be available before the current iteration can begin, loop pipelining is not possible. If Vitis HLS cannot pipeline with the specified initiation interval, it increases the initiation interval. If it cannot pipeline at all, as shown by the above example, it halts pipelining and proceeds to output a non-pipelined design.
Exploiting Task Level Parallelism: Dataflow Optimization
The dataflow optimization is useful on a set of sequential tasks (for example, functions and/or loops), as shown in the following figure.
The above figure shows a specific case of a chain of three tasks, but the communication structure can be more complex than shown.
Using this series of sequential tasks, dataflow optimization creates an architecture of concurrent processes, as shown below. Dataflow optimization is a powerful method for improving design throughput and latency.
The following figure shows how dataflow optimization allows the execution of tasks to overlap, increasing the overall throughput of the design and reducing latency.
In the following figure and example, (A) represents the case without the
dataflow optimization. The implementation requires 8 cycles before a new input can be
processed by func_A
and 8 cycles before an output is
written by func_C
.
For the same example, (B) represents the case when the dataflow
optimization is applied. func_A
can begin processing a
new input every 3 clock cycles (lower initiation interval) and it now only requires 5
clocks to output a final value (shorter latency).
This type of parallelism cannot be achieved without incurring some overhead in hardware. When a particular region, such as a function body or a loop body, is identified as a region to apply the dataflow optimization, Vitis HLS analyzes the function or loop body and creates individual channels that model the dataflow to store the results of each task in the dataflow region. These channels can be simple FIFOs for scalar variables, or ping-pong (PIPO) buffers for non-scalar variables like arrays. Each of these channels also contains signals to indicate when the FIFO or the ping-pong buffer is full or empty. These signals represent a handshaking interface that is completely data driven. By having individual FIFOs and/or ping-pong buffers, Vitis HLS frees each task to execute at its own pace, and the throughput is only limited by the availability of the input and output buffers. This allows for better interleaving of task execution than a normal pipelined implementation, but does so at the cost of additional FIFO or block RAM registers for the ping-pong buffer, as shown in the following figure.
Dataflow optimization potentially improves performance over a statically pipelined solution. It replaces the strict, centrally-controlled pipeline stall philosophy with more flexible and distributed handshaking architecture using FIFOs and/or ping-pong buffers (PIPOs). The replacement of the centralized control structure with a distributed one also benefits the fanout of control signals, for example register enables, which is distributed among the control structures of individual processes.
Dataflow optimization is not limited to a chain of processes, but can be used on any directed acyclic graph (DAG) structure. It can produce two different forms of overlapping: within an iteration if processes are connected with FIFOs, and across different iterations through PIPOs and FIFOs.
Canonical Forms
Vitis HLS transforms the region to apply the DATAFLOW optimization. Xilinx recommends writing the code inside this region (referred to as the canonical region) using canonical forms. There are two main canonical forms for the dataflow optimization:
- The canonical form for a function where sub-functions are not inlined.

void dataflow(Input0, Input1, Output0, Output1) {
    #pragma HLS dataflow
    UserDataType C0, C1, C2;
    func1(read Input0, read Input1, write C0, write C1);
    func2(read C0, read C1, write C2);
    func3(read C2, write Output0, write Output1);
}
- Dataflow inside a loop body.
For the for loop (where no function inside is inlined), the integral loop variable should have:
- Initial value declared in the loop header and set to 0.
- The loop condition is a positive numerical constant or constant function argument.
- Increment by 1.
- Dataflow pragma needs to be inside the loop.

void dataflow(Input0, Input1, Output0, Output1) {
    for (int i = 0; i < N; i++) {
        #pragma HLS dataflow
        UserDataType C0, C1, C2;
        func1(read Input0, read Input1, write C0, write C1);
        func2(read C0, read C1, write C2);
        func3(read C2, write Output0, write Output1);
    }
}
Canonical Body
Inside the canonical region, the canonical body should follow these guidelines:
- Use a local, non-static scalar or array/pointer variable, or local static stream variable. A local variable is declared inside the function body (for dataflow in a function) or loop body (for dataflow inside a loop).
- A sequence of function calls that pass data forward (with no feedback), from a
function to one that is lexically later, under the following conditions:
- Variables (except scalar) can have only one reading process and one writing process.
- Use write before read (producer before consumer) if you are using local variables, which then become channels.
- Use read before write (consumer before producer) if you are using function arguments. Any intra-body anti-dependencies must be preserved by the design.
- Function return type must be void.
- No loop-carried dependencies among different processes via
variables.
- Inside the canonical loop (i.e., values written by one iteration and read by a following one).
- Among successive calls to the top function (i.e., inout argument written by one iteration and read by the following iteration).
Dataflow Checking
Vitis HLS has a dataflow checker which, when enabled, checks the code to see if it is in the recommended canonical form. Otherwise it will emit an error or warning message to the user. By default this checker is set to warning. You can set the checker to error, or disable it by selecting off, in the strict mode of the config_dataflow Tcl command:
config_dataflow -strict_mode (off | error | warning)
Dataflow Optimization Limitations
The DATAFLOW optimization optimizes the flow of data between tasks (functions and loops), and ideally pipelined functions and loops for maximum performance. It does not require these tasks to be chained, one after the other, however there are some limitations in how the data is transferred.
The following behaviors can prevent or limit the overlapping that Vitis HLS can perform with DATAFLOW optimization:
- Single-producer-consumer violations
- Bypassing tasks
- Feedback between tasks
- Conditional execution of tasks
- Loops with multiple exit conditions
Single-producer-consumer Violations
For Vitis HLS to perform the DATAFLOW
optimization, all elements passed between tasks must follow a single-producer-consumer
model. Each variable must be driven from a single task and only be consumed by a single
task. In the following code example, temp1
fans out and is
consumed by both Loop2
and Loop3
. This violates the single-producer-consumer model.
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
}
Loop2: for(int j = 0; j < N; j++) {
data_out1[j] = temp1[j] * 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out2[k] = temp1[k] * 456;
}
}
A modified version of this code uses function Split
to create a single-producer-consumer design. The following code block
example shows how the data flows with the function Split
.
The data now flows between all four tasks, and Vitis HLS
can perform the DATAFLOW optimization.
void Split (int in[N], int out1[N], int out2[N]) {
// Duplicated data
L1:for(int i=0;i<N;i++) {
	out1[i] = in[i];
	out2[i] = in[i];
}
}
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N], temp2[N], temp3[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
}
Split(temp1, temp2, temp3);
Loop2: for(int j = 0; j < N; j++) {
data_out1[j] = temp2[j] * 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out2[k] = temp3[k] * 456;
}
}
Bypassing Tasks
In addition, data should generally flow from one task to another. If you
bypass tasks, this can reduce the performance of the DATAFLOW optimization. In the following
example, Loop1
generates the values for temp1
and temp2
. However, the
next task, Loop2
, only uses the value of temp1
. The value of temp2
is
not consumed until after
Loop2
. Therefore, temp2
bypasses the next task in the sequence, which can limit the performance of the DATAFLOW
optimization.
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
int temp1[N], temp2[N], temp3[N];
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * scale;
temp2[i] = data_in[i] >> scale;
}
Loop2: for(int j = 0; j < N; j++) {
temp3[j] = temp1[j] + 123;
}
Loop3: for(int k = 0; k < N; k++) {
data_out1[k] = temp2[k] + temp3[k];
}
}
To mitigate this effect, you can increase the depth of the PIPO buffer for temp2 to 3, instead of the default depth of 2. This lets the buffer store the value intended for Loop3, while Loop2 is being executed. Similarly, a PIPO that bypasses two processes should have a depth of 4. Set the depth of the buffer with the STREAM pragma or directive:

#pragma HLS STREAM type=pipo variable=temp2 depth=3
Feedback between Tasks
Feedback occurs when the output from a task is consumed by a previous task in the DATAFLOW region. Feedback between tasks is not recommended in a DATAFLOW region. When Vitis HLS detects feedback, it issues a warning, depending on the situation, and might not perform the DATAFLOW optimization.
However, DATAFLOW can support feedback when used with hls::streams
. The following example demonstrates this
exception.
#include "ap_axi_sdata.h"
#include "hls_stream.h"
void firstProc(hls::stream<int> &forwardOUT, hls::stream<int> &backwardIN) {
static bool first = true;
int fromSecond;
//Initialize stream
if (first)
fromSecond = 10; // Initial stream value
else
//Read from stream
fromSecond = backwardIN.read(); //Feedback value
first = false;
//Write to stream
forwardOUT.write(fromSecond*2);
}
void secondProc(hls::stream<int> &forwardIN, hls::stream<int> &backwardOUT) {
backwardOUT.write(forwardIN.read() + 1);
}
void top(...) {
#pragma HLS dataflow
hls::stream<int> forward, backward;
firstProc(forward, backward);
secondProc(forward, backward);
}
In this simple design, when firstProc
is
executed, it uses 10 as an initial value for input. Because hls::streams
do not support an initial value, this technique can be used to
provide one without violating the single-producer-consumer rule. In subsequent iterations
firstProc
reads from the hls::stream
through the backwardIN
interface.
firstProc
processes the value and sends
it to secondProc
, via a stream that goes
forward in terms of the original C++ function execution order.
secondProc
reads the value on forwardIN
, adds 1 to it, and sends it back to firstProc
via the feedback stream that goes backwards in
the execution order.
From the second execution, firstProc
uses the value read from the stream to do its computation, and the two processes can keep
going forever, with both forward and feedback communication, using an initial value for the
first execution.
Conditional Execution of Tasks
The DATAFLOW optimization does not optimize tasks that are conditionally
executed. The following example highlights this limitation. In this example, the conditional
execution of Loop1
and Loop2
prevents Vitis HLS from optimizing the
data flow between these loops, because the data does not flow from one loop into the
next.
void foo(int data_in[N], int data_out[N], int sel) {
int temp1[N], temp2[N];
if (sel) {
Loop1: for(int i = 0; i < N; i++) {
temp1[i] = data_in[i] * 123;
temp2[i] = data_in[i];
}
} else {
Loop2: for(int j = 0; j < N; j++) {
temp1[j] = data_in[j] * 321;
temp2[j] = data_in[j];
}
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
To ensure each loop is executed in all cases, you must transform the code as shown in the following example. In this example, the conditional statement is moved into the first loop. Both loops are always executed, and data always flows from one loop to the next.
void foo(int data_in[N], int data_out[N], int sel) {
int temp1[N], temp2[N];
Loop1: for(int i = 0; i < N; i++) {
if (sel) {
temp1[i] = data_in[i] * 123;
} else {
temp1[i] = data_in[i] * 321;
}
}
Loop2: for(int j = 0; j < N; j++) {
temp2[j] = data_in[j];
}
Loop3: for(int k = 0; k < N; k++) {
data_out[k] = temp1[k] * temp2[k];
}
}
Loops with Multiple Exit Conditions
Loops with multiple exit points cannot be used in a DATAFLOW region. In the following example, Loop2 has three exit conditions:
- An exit defined by the value of N; the loop will exit when k>=N.
- An exit defined by the break statement.
- An exit defined by the continue statement.

#include "ap_int.h"
#define N 16

typedef ap_int<8> din_t;
typedef ap_int<15> dout_t;
typedef ap_uint<8> dsc_t;
typedef ap_uint<1> dsel_t;

void multi_exit(din_t data_in[N], dsc_t scale, dsel_t select, dout_t data_out[N]) {
	dout_t temp1[N], temp2[N];
	int i,k;

	Loop1: for(i = 0; i < N; i++) {
		temp1[i] = data_in[i] * scale;
		temp2[i] = data_in[i] >> scale;
	}

	Loop2: for(k = 0; k < N; k++) {
		switch(select) {
			case 0: data_out[k] = temp1[k] + temp2[k];
			case 1: continue;
			default: break;
		}
	}
}

Because a loop's exit condition is always defined by the loop bounds, the use of break or continue statements will prohibit the loop being used in a DATAFLOW region.
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop contains additional tasks that might benefit from the DATAFLOW optimization, you must apply the DATAFLOW optimization to the loop, the sub-function, or inline the sub-function.
Note: You can use std::complex inside the DATAFLOW region. However, these variables should be used with an __attribute__((no_ctor)) as shown in the following example:

void proc_1(std::complex<float> (&buffer)[50], const std::complex<float> *in);
void proc_2(hls::Stream<std::complex<float>> &fifo, const std::complex<float> (&buffer)[50], std::complex<float> &acc);
void proc_3(std::complex<float> *out, hls::Stream<std::complex<float>> &fifo, const std::complex<float> acc);
void top(std::complex<float> *out, const std::complex<float> *in) {
#pragma HLS DATAFLOW
std::complex<float> acc __attribute__((no_ctor)); // Here
std::complex<float> buffer[50] __attribute__((no_ctor)); // Here
hls::Stream<std::complex<float>, 5> fifo; // Not here
proc_1(buffer, in);
proc_2(fifo, buffer, acc);
proc_3(out, fifo, acc);
}
Configuring Dataflow Memory Channels
Vitis HLS implements channels between the tasks as either ping-pong or FIFO buffers, depending on the access patterns of the producer and the consumer of the data:
- For scalar, pointer, and reference parameters, Vitis HLS implements the channel as a FIFO.
- If the parameter (producer or consumer) is an array, Vitis HLS implements the channel as a ping-pong buffer or a FIFO as follows:
  - If Vitis HLS determines the data is accessed in sequential order, Vitis HLS implements the memory channel as a FIFO channel with a depth that is estimated to optimize performance (but can require manual tuning in practice).
  - If Vitis HLS is unable to determine that the data is accessed in sequential order, or determines the data is accessed in an arbitrary manner, Vitis HLS implements the memory channel as a ping-pong buffer, that is, as two block RAMs each defined by the maximum size of the consumer or producer array.
Note: A ping-pong buffer ensures that the channel always has the capacity to hold all samples without a loss. However, this might be an overly conservative approach in some cases.
To explicitly specify the default channel used between tasks, use the config_dataflow
configuration. This configuration sets the default
channel for all channels in a design. To reduce the size of the memory used in the channel and
allow for overlapping within an iteration, you can use a FIFO. To explicitly set the depth
(that is, number of elements) in the FIFO, use the -fifo_depth
option.
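For example, a sketch of making FIFOs the default channel with an explicit depth (the value is illustrative and must be validated with C/RTL co-simulation):

config_dataflow -default_channel fifo -fifo_depth 16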
Specifying the size of the FIFO channels overrides the default approach. If any task in the design can produce or consume samples at a greater rate than the specified size of the FIFO, the FIFOs might become empty (or full). In this case, the design halts operation, because it is unable to read (or write). This might result in or lead to a stalled, deadlock state.
When setting the depth of the FIFOs, Xilinx recommends initially setting the depth as the maximum number data values transferred (for example, the size of the array passed between tasks), confirming the design passes C/RTL co-simulation, and then reducing the size of the FIFOs and confirming C/RTL co-simulation still completes without issues. If RTL co-simulation fails, the size of the FIFO is likely too small to prevent stalling or a deadlock situation.
The Vitis HLS IDE can display a histogram of the occupation of each FIFO/PIPO buffer over time, after RTL co-simulation has been run. This can be useful to help determine the best depth for each buffer.
Specifying Arrays as Ping-Pong Buffers or FIFOs
All arrays are implemented by default as ping-pong buffers to enable random access. These buffers can also be sized if needed. For example, in some circumstances, such as when a task is being bypassed, a performance degradation is possible. To mitigate this effect on performance, you can give more slack to the producer and consumer by increasing the size of these buffers by using the STREAM directive as shown below.
void top ( ... ) {
#pragma HLS dataflow
	int A[1024];
#pragma HLS stream off variable=A depth=3
	producer(A, B, …); // producer writes A and B
	middle(B, C, ...); // middle reads B and writes C
	consumer(A, C, …); // consumer reads A and C
}
In the interface, if an array on the top-level function interface is set as interface type ap_fifo, axis, or ap_hs, it is automatically specified as streaming.
Inside the design, all arrays must be specified as streaming using the STREAM directive if a FIFO is desired for the implementation.
Note: The -depth option can be used to specify the size of the FIFO.
The STREAM directive is also used to change any arrays in a DATAFLOW region from the default implementation specified by the config_dataflow configuration.
- If the config_dataflow default_channel is set as ping-pong, any array can be implemented as a FIFO by applying the STREAM directive to the array.
Note: To use a FIFO implementation, the array must be accessed in a streaming manner.
- If the config_dataflow default_channel is set to FIFO, or Vitis HLS has automatically determined the data in a DATAFLOW region is accessed in a streaming manner, any array can still be implemented as a ping-pong implementation by applying the STREAM directive to the array with the -off option.
When an array in a DATAFLOW region is specified as streaming and implemented as a FIFO, the FIFO is typically not required to hold the same number of elements as the original array. The tasks in a DATAFLOW region consume each data sample as soon as it becomes available. The config_dataflow command with the -fifo_depth option, or the STREAM directive with the -depth option, can be used to set the size of the FIFO to the minimum number of elements required to ensure the flow of data never stalls. If the -off option is selected, the -depth option sets the depth (number of blocks) of the PIPO. The depth should be at least 2.
Specifying Compiler-Created FIFO Depth
Start Propagation
The compiler might automatically create a start FIFO to propagate the ap_start/ap_ready handshake to an internal process. Such FIFOs can sometimes be a bottleneck for performance, in which case you can increase the default size (which can be incorrectly estimated by the tool) with the following command:

config_dataflow -start_fifo_depth <value>

If an unbounded slack between producer and consumer is needed, and internal processes can run forever, fully and safely driven by their inputs or outputs (FIFOs or PIPOs), these start FIFOs can be removed, at your own risk, locally for a given dataflow region with the pragma #pragma HLS DATAFLOW disable_start_propagation (see also Using ap_ctrl_none Inside the Dataflow below).
Scalar Propagation
The compiler automatically propagates some scalars from C/C++ code through scalar
FIFOs between processes. Such FIFOs can sometimes be a bottleneck for performance or
cause deadlocks, in which case you can set the size (the default value is set to
-fifo_depth
) with the following command:
config_dataflow -scalar_fifo_depth <value>
Stable Arrays
The stable
pragma can be used to mark input or output
variables of a dataflow region. Its effect is to remove their corresponding
synchronizations, assuming that the user guarantees this removal is indeed correct.
void dataflow_region(int A[...], ...
#pragma HLS stable variable=A
#pragma HLS dataflow
proc1(...);
proc2(A, ...);
Without the stable
pragma, and assuming that
A
is read by proc2
, then proc2
would be part of the
initial synchronization (via ap_start
), for the
dataflow region where it is located. This means that proc1
would not restart until proc2
is
also ready to start again, which would prevent dataflow iterations to be overlapped and
induce a possible loss of performance. The stable
pragma indicates that
this synchronization is not necessary to preserve correctness.
In the previous example, without the stable pragma, and assuming that A is read by proc2 while proc2 is bypassing the tasks, there will be a performance loss.
stable
pragma, the compiler assumes
that:- if
A
is read byproc2
, then the memory locations that are read will not be overwritten, by any other process or calling context, whiledataflow_region
is being executed. - if
A
is written byproc2
, then the memory locations written will not be read, before their definition, by any other process or calling context, whiledataflow_region
is being executed.
A typical scenario is when the caller updates or reads these variables only when the dataflow region has not started yet or has completed execution.
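A minimal sketch of such a caller follows (the function names and array size are hypothetical):
// Hypothetical caller: A is written only before the dataflow region starts
// and read only after it completes, so it is stable during execution.
int A[64];
initialize(A);            // runs to completion first
dataflow_region(A, ...);  // nothing else touches A while this executes
check_results(A);         // reads A only after the region is done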
Using ap_ctrl_none Inside the Dataflow
The ap_ctrl_none block-level I/O protocol avoids the rigid synchronization scheme implied by the ap_ctrl_hs and ap_ctrl_chain protocols. These protocols require that all processes in the region are executed exactly the same number of times, in order to better match the C/C++ behavior.
However, there are situations where the intent is, for example, to have a faster process that executes more frequently to distribute work to several slower ones.
For any dataflow region (except "dataflow-in-loop"), it is possible to specify #pragma HLS interface mode=ap_ctrl_none port=return as long as all the following conditions are satisfied:
- The region and all the processes it contains communicate only via FIFOs (hls::stream, streamed arrays, AXIS); that is, excluding memories.
- All the parents of the region, up to the top-level design, must satisfy the following requirements:
  - They must be dataflow regions (excluding "dataflow-in-loop").
  - They must all specify ap_ctrl_none.
This means that none of the parents of a dataflow region with ap_ctrl_none in the hierarchy can be:
- A sequential or pipelined FSM
- A dataflow region inside a for loop ("dataflow-in-loop")
The result of this pragma is that ap_ctrl_chain is not used to synchronize any of the processes inside that region. They are executed or stalled based on the availability of data in their input FIFOs and space in their output FIFOs. For example:
void region(...) {
#pragma HLS dataflow
#pragma HLS interface mode=ap_ctrl_none port=return
    hls::stream<int> outStream1, outStream2;
    demux(inStream, outStream1, outStream2);
    worker1(outStream1, ...);
    worker2(outStream2, ...);
}
In this example, demux can be executed twice as frequently as worker1 and worker2. For example, it can have II=1 while worker1 and worker2 have II=2, and still achieve a global II=1 behavior.
- Non-blocking reads may need to be used very carefully inside processes that are executed less frequently, to ensure that C/C++ simulation works.
- The pragma is applied to a region, not to the individual processes inside it.
- Deadlock detection must be disabled in co-simulation. This can be done with the -disable_deadlock_detection option in cosim_design.
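For example, in the solution Tcl script:
cosim_design -disable_deadlock_detection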
Improve Performance Using Stream-of-Blocks
The hls::stream_of_blocks type provides a user-synchronized stream that supports streaming blocks of data for process-level interfaces in a dataflow context, where each block is an array or multidimensional array. The intended use of stream-of-blocks is to replace array-based communication between a pair of processes within a dataflow region.
Currently, Vitis HLS implements arrays
written by a producer process and read by a consumer process in a dataflow region by
mapping them to ping pong buffers (PIPOs). The buffer exchange for a PIPO buffer is
driven by the ap_done
/ap_continue
handshake of the producer process, and by the ap_start
/ap_ready
handshake of the consumer process. In other words, the exchange occurs at the return of
the producer function and the calling of the consumer function in C++.
While this ensures a concurrent communication semantic that is fully compliant with the sequential C++ execution semantics, it also implies that the consumer cannot start until the producer is done, as shown in the following code example.
void producer (int b[M][N], ...) {
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
b[i][f(j)] = ...;
}
void consumer(int b[M][N], ...) {
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++)
      ... = b[i][g(j)] ...;
}
void top(...) {
#pragma HLS dataflow
int b[M][N];
#pragma HLS stream off variable=b
producer(b, ...);
consumer(b, ...);
}
This can unnecessarily limit throughput if the producer generates data for
the consumer in smaller blocks, for example by writing one row of the buffer output
inside a nested loop, and the consumer uses the data in smaller blocks by reading one
row of the buffer input inside a nested loop, as the example above does. In this
example, due to the non-sequential buffer column access in the inner loop you cannot
simply stream the array b. However, the row access in the outer loop is sequential, thus supporting hls::stream_of_blocks communication where each block is a one-dimensional array of size N.
The main purpose of the hls::stream_of_blocks feature is to provide PIPO-like functionality, but with user-managed explicit synchronization, accesses, and better coding style. Stream-of-blocks also lets you avoid dataflow-in-a-loop containing the producer and consumer, which would have been another way to optimize the example above without the very large PIPO buffer (2xMxN), as shown in the following example:
void producer (int b[N], ...) {
for (int j = 0; j < N; j++)
b[f(j)] = ...;
}
void consumer(int b[N], ...) {
for (int j = 0; j < N; j++)
... = b[g(j)];
}
void top(...) {
// The loop below is very constrained in terms of how it must be written
for (int i = 0; i < M; i++) {
#pragma HLS dataflow
int b[N];
#pragma HLS stream off variable=b
producer(b, ...); // writes b
consumer(b, ...); // reads b
}
}
The dataflow-in-a-loop code above is also not desirable because this structure has several limitations in Vitis HLS: for example, the loop structure must be very constrained (a single induction variable, starting from 0, compared with a constant or a function argument, and incremented by 1).
Stream-of-Blocks Modeling Style
On the other hand, for a stream-of-blocks the communication between the producer and the consumer is modeled as a stream of array-like objects, providing several advantages over array transfer through PIPO.
#include "hls_streamofblocks.h"
The stream-of-blocks object template is: hls::stream_of_blocks<block_type, depth> v, where:
- <block_type> specifies the data type of the array or multidimensional array held by the stream-of-blocks.
- <depth> is an optional argument that provides depth control just like hls::stream or PIPOs, and specifies the total number of blocks, including the one acquired by the producer and the one acquired by the consumer at any given time. The default value is 2.
- v specifies the variable name for the stream-of-blocks object.
- The producer or consumer process that wants to access the stream first needs to acquire access to it, using an hls::write_lock or hls::read_lock object.
- After the producer has acquired the lock, it can start writing (or reading) the acquired block. Once the block has been fully initialized, it is released by the producer when the write_lock object goes out of scope. Note: The producer process with a write_lock can also read the block, as long as it only reads from already-written locations, because the newly acquired buffer must be assumed to contain uninitialized data. The ability to both write and read the block is unique to the producer process; it is not supported for the consumer.
- The block is then queued in the stream-of-blocks in a FIFO fashion, and when the consumer acquires a read_lock object, the block can be read by the consumer process.
The main difference between hls::stream_of_blocks and the PIPO mechanism seen in the prior examples is that the block becomes available to the consumer as soon as the write_lock goes out of scope, rather than only at the return of the producer process. Hence the size of the storage required to manage the prior example is much smaller with a stream-of-blocks than with a PIPO: namely 2xN instead of 2xMxN.
Rewriting the prior example to use hls::stream_of_blocks is shown below. The producer acquires the block by constructing an hls::write_lock object called b, passing it a reference to the stream-of-blocks object, called s. The write_lock object provides an overloaded array access operator, letting the block be accessed as an array so the underlying storage can be accessed in random order, as shown in the example below.
Note: The acquisition of a block is performed by constructing a write_lock/read_lock object, and the release occurs automatically when that object is destructed as it goes out of scope. This approach uses the common Resource Acquisition Is Initialization (RAII) style of locking and unlocking.
#include "hls_streamofblocks.h"
typedef int buf[N];
void producer (hls::stream_of_blocks<buf> &s, ...) {
for (int i = 0; i < M; i++) {
// Allocation of hls::write_lock acquires the block for the producer
hls::write_lock<buf> b(s);
for (int j = 0; j < N; j++)
b[f(j)] = ...;
// Deallocation of hls::write_lock releases the block for the consumer
}
}
void consumer(hls::stream_of_blocks<buf> &s, ...) {
for (int i = 0; i < M; i++) {
// Allocation of hls::read_lock acquires the block for the consumer
hls::read_lock<buf> b(s);
for (int j = 0; j < N; j++)
... = b[g(j)] ...;
    // Deallocation of hls::read_lock releases the block to be reused by the producer
}
}
void top(...) {
#pragma HLS dataflow
hls::stream_of_blocks<buf> s;
  producer(s, ...);
  consumer(s, ...);
}
The key features of this approach include:
- The expected performance of the outer loop in the producer above is to achieve an overall Initiation Interval (II) of 1
- A locked block can be used as though it were private to the producer or the consumer process until it is released.
- The initial state of the array object is undefined for the producer, whereas for the consumer it contains the values written by the producer.
- The principal advantage of stream-of-blocks is to provide overlapped execution of multiple iterations of the consumer and the producer to increase throughput.
Resource Usage
The resource cost when increasing the depth beyond the default value of 2 is similar to the resource cost of PIPOs: each increment of 1 requires enough memory for one block, for example N 32-bit words in the example above.
The stream-of-blocks object can be bound to a specific RAM type by placing the BIND_STORAGE pragma where the stream-of-blocks is declared, for example in the top-level function. The stream-of-blocks uses 2-port BRAM (type=RAM_2P) by default.
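For example, the following sketch (the variable name s and the choice of a simple dual-port URAM are illustrative assumptions) binds a stream-of-blocks to a specific memory type:
hls::stream_of_blocks<buf> s;
#pragma HLS BIND_STORAGE variable=s type=ram_s2p impl=uram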
Checking for Full and Empty Blocks
The read_lock and write_lock are like while(1) loops: they keep trying to acquire the resource until they succeed, so code execution stalls until the lock is acquired. You can use the empty() and full() methods, as shown in the following example, to determine whether a call to read_lock or write_lock would stall due to the lack of available blocks to be acquired.
#include "hls_streamofblocks.h"
void reader(hls::stream_of_blocks<buf> &in1, hls::stream_of_blocks<buf> &in2, int out[M][N], int c) {
for(unsigned j = 0; j < M;) {
if (!in1.empty()) {
hls::read_lock<ppbuf> arr1(in1);
for(unsigned i = 0; i < N; ++i) {
out[j][i] = arr1[N-1-i];
}
j++;
} else if (!in2.empty()) {
      hls::read_lock<buf> arr2(in2);
for(unsigned i = 0; i < N; ++i) {
out[j][i] = arr2[N-1-i];
}
j++;
}
}
}
void writer(hls::stream_of_blocks<buf> &out1, hls::stream_of_blocks<buf> &out2, int in[M][N], int d) {
for(unsigned j = 0; j < M; ++j) {
if (d < 2) {
if (!out1.full()) {
        hls::write_lock<buf> arr(out1);
for(unsigned i = 0; i < N; ++i) {
arr[N-1-i] = in[j][i];
}
}
} else {
if (!out2.full()) {
        hls::write_lock<buf> arr(out2);
for(unsigned i = 0; i < N; ++i) {
arr[N-1-i] = in[j][i];
}
}
}
}
}
void top(int in[M][N], int out[M][N], int c, int d) {
#pragma HLS dataflow
  hls::stream_of_blocks<buf, 3> strm1, strm2; // Depth=3
writer(strm1, strm2, in, d);
reader(strm1, strm2, out, c);
}
The producer and the consumer processes can perform the following actions within any scope in their body. As shown in the various examples, the scope will typically be a loop, but this is not required. Other scopes such as conditionals are also supported. Supported actions include:
- Acquire a block, i.e. an array of any supported data type.
- In the case of the producer, the array will be empty, i.e. initialized according to the constructor (if any) of the underlying data type.
- In the case of the consumer, the array will be full (insofar as the producer has filled it; the same requirements as for PIPO buffers apply, namely that it must be fully written where needed).
- Use the block for both reading and writing as if it were private local memory, up to its maximum allocated number of ports, based on a BIND_STORAGE pragma or directive specified for the stream-of-blocks, which defines the ports each side can see:
  - 1 port means that each side can access only one port, and the final stream-of-blocks can use a single dual-port memory for implementation.
  - 2 ports means that each side can use 1 or 2 ports depending on the schedule:
    - If the scheduler uses 2 ports on at least one side, merging will not happen.
    - If the scheduler uses 1 port, merging can happen.
  - If the pragma is not specified, the scheduler decides, based on the same criteria currently used for local arrays. Moreover:
- The producer can both write and read the block it has acquired
- The consumer can only read the block it has acquired
- Automatically release the block when exiting the scope in which it
was acquired. A released block:
- If released by the producer, can be acquired by the consumer.
- If released by the consumer, can be acquired to be reused by the producer, after being re-initialized by the constructor, if any.
Note that the handshakes for a PIPO are ap_start/ap_ready on the consumer side and ap_done/ap_continue on the producer side, while the handshakes of a stream-of-blocks are its own read/empty_n on the consumer side and write/full_n on the producer side.
Modeling Feedback in Dataflow Regions
One main limitation of PIPO buffers is that they can flow only forward with respect to the function call sequence in C++. In other words, the following feedback connection is not supported with PIPOs, while it can be supported with hls::stream_of_blocks:
void top(...) {
int b[N];
for (int i = 0; i < M; i++) {
#pragma HLS dataflow
#pragma HLS stream off variable=b
consumer(b, ...); // reads b
producer(b, ...); // writes b
}
}
The following code example is contrived to demonstrate the concept:
#include "hls_streamofblocks.h"
typedef int buf[N];
void producer (hls::stream_of_blocks<buf> &out, ...) {
for (int i = 0; i < M; i++) {
hls::write_lock<buf> arr(out);
for (int j = 0; j < N; j++)
arr[f(j)] = ...;
}
}
void consumer(hls::stream_of_blocks<buf> &in, ...) {
if (in.empty()) // execute only when producer has already generated some meaningful data
return;
for (int i = 0; i < M; i++) {
hls::read_lock<buf> arr(in);
for (int j = 0; j < N; j++)
... = arr[g(j)];
...
}
}
void top(...) {
// Note the large non-default depth.
// The producer must complete execution before the consumer can start again, due to ap_ctrl_chain.
// A smaller depth would require ap_ctrl_none
hls::stream_of_blocks<buf, M+2> backward;
for (int i = 0; i < M; i++) {
#pragma HLS dataflow
consumer(backward, ...); // reads backward
producer(backward, ...); // writes backward
  }
}
Limitations
There are some limitations with the use of hls::stream_of_blocks that you should be aware of:
- Each hls::stream_of_blocks object must have a single producer and a single consumer process, and each must be a different process. In other words, local streams-of-blocks within a single process are not supported.
- You cannot use hls::stream_of_blocks within a sequential region. The producer and consumer must be separate concurrent processes in a dataflow region.
- You cannot use multiple nested acquire/release statements (write_lock/read_lock), for example in the same or nested scopes, as shown in the following example:

using ppbuf = int[N];
void readerImplicitNested(hls::stream_of_blocks<ppbuf>& in, ...) {
  for(unsigned j = 0; j < M; ++j) {
    hls::read_lock<ppbuf> arrA(in); // constructor would acquire A first
    hls::read_lock<ppbuf> arrB(in); // constructor would acquire B second
    for(unsigned i = 0; i < N; ++i)
      ... = arrA[f(i)] + arrB[g(i)];
    // destructor would release B first, then A
  }
}
However, you can use multiple sequential or mutually exclusive acquire/release statements (write_lock/read_lock), for example inside IF/ELSE branches or in two subsequent code blocks, as shown in the following example:

void readerSequential(hls::stream_of_blocks<ppbuf>& in, ...) {
  for(unsigned j = 0; j < M; ++j) {
    {
      hls::read_lock<ppbuf> arrA(in); // constructor acquires A
      for(unsigned i = 0; i < N; ++i)
        ... = arrA[f(i)];
    } // destructor releases A
    {
      hls::read_lock<ppbuf> arrB(in); // constructor acquires B
      for(unsigned i = 0; i < N; ++i)
        ... = arrB[g(i)];
    } // destructor releases B
  }
}
- Explicit release of locks in producer and consumer processes is not recommended, because locks are automatically released when they go out of scope. However, you can enable explicit lock release by adding #define EXPLICIT_ACQUIRE_RELEASE before #include "hls_streamofblocks.h" in your source code.
Programming Model for Multi-Port Access in HBM
HBM provides high bandwidth when an array is split across the different banks/pseudo-channels used by the design. Partitioning an array into different memory regions is a common practice in high-performance computing. The host allocates a single buffer, which is then spread across the pseudo-channels.
By default, Vitis HLS considers different pointers to be independent channels and removes any dependency analysis between them. However, when the host allocates a single buffer for both pointers, the ALIAS pragma lets the tool maintain the dependency analysis. The ALIAS pragma informs data dependence analysis of the distance between the pointers, specified with the distance option as shown below. Refer to the ALIAS pragma for more information.
//Assume that the host code looks like this:
int *buf = clCreateBuffer(ctxt, CL_MEM_READ_ONLY, 2*bank_size, ...);
clSetKernelArg(kernel, 0, 0x20000000, buf); // bank0
clSetKernelArg(kernel, 1, 0x20000000, buf+bank_size); // bank1
//The ALIAS pragma informs data dependence analysis about the pointer distance
void kernel(int *bank0, int *bank1, ...)
{
#pragma HLS alias ports=bank0,bank1 distance=bank_size
    ...
}
The ALIAS pragma can be specified using one of the following forms:
- Constant distance: #pragma HLS alias ports=arr0,arr1,arr2,arr3 distance=1024
- Variable distance: #pragma HLS alias ports=arr0,arr1,arr2,arr3 offset=0,512,1024,2048
Constraints:
- The depths of all the ports in the interface pragma must be the same.
- All ports must be assigned to different bundles, bound to different HBM controllers.
- The number of ports specified in the second form must be the same as the number of offsets specified: one offset per port.
- #pragma HLS interface offset=off is not supported.
- Each port can be used in only one ALIAS pragma.
Optimizing for Latency
Using Latency Constraints
Vitis HLS supports the use of a latency constraint on any scope. Latency constraints are specified using the LATENCY directive.
When a maximum and/or minimum LATENCY constraint is placed on a scope, Vitis HLS tries to ensure all operations in the function complete within the range of clock cycles specified.
The latency directive applied to a loop specifies the required latency for a single iteration of the loop: it specifies the latency for the loop body, as the following example shows:
Loop_A: for (i=0; i<N; i++) {
#pragma HLS latency max=10
..Loop Body...
}
If the intention is to limit the total latency of all loop iterations, the latency directive should be applied to a region that encompasses the entire loop, as in this example:
Region_All_Loop_A: {
#pragma HLS latency max=10
Loop_A: for (i=0; i<N; i++)
{
..Loop Body...
}
}
In this case, even if the loop is unrolled, the latency directive sets a maximum limit on all loop operations.
If Vitis HLS cannot meet a maximum latency constraint it relaxes the latency constraint and tries to achieve the best possible result.
If a minimum latency constraint is set and Vitis HLS can produce a design with a lower latency than the minimum required it inserts dummy clock cycles to meet the minimum latency.
Merging Sequential Loops to Reduce Latency
All rolled loops imply and create at least one state in the design FSM. When there are multiple sequential loops it can create additional unnecessary clock cycles and prevent further optimizations.
The following figure shows a simple example where a seemingly intuitive coding style has a negative impact on the performance of the RTL design.
In the preceding figure, (A) shows how, by default, each rolled loop in the design creates at least one state in the FSM. Moving between those states costs clock cycles: assuming each loop iteration requires one clock cycle, it takes a total of 11 cycles to execute both loops:
- 1 clock cycle to enter the ADD loop.
- 4 clock cycles to execute the add loop.
- 1 clock cycle to exit ADD and enter SUB.
- 4 clock cycles to execute the SUB loop.
- 1 clock cycle to exit the SUB loop.
- For a total of 11 clock cycles.
In this simple example it is obvious that an else branch in the ADD loop would also solve the issue, but in a more complex example it might be less obvious, and the more intuitive coding style can have greater advantages.
The LOOP_MERGE optimization directive is used to automatically merge loops. The LOOP_MERGE directive seeks to merge all loops within the scope in which it is placed. In the above example, merging the loops creates a control structure similar to that shown in (B) in the preceding figure, which requires only 6 clocks to complete (a usage sketch follows the restrictions list below).
Merging loops allows the logic within the loops to be optimized together. In the example above, using a dual-port block RAM allows the add and subtraction operations to be performed in parallel.
Currently, loop merging in Vitis HLS has the following restrictions:
- If loop bounds are all variables, they must have the same value.
- If loop bounds are constants, the maximum constant value is used as the bound of the merged loop.
- Loops with both variable bound and constant bound cannot be merged.
- The code between loops to be merged cannot have side effects: multiple execution of this code should generate the same results (a=b is allowed, a=a+1 is not).
- Loops cannot be merged when they contain FIFO accesses: merging would change the order of the reads and writes from a FIFO: these must always occur in sequence.
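A minimal usage sketch follows (the function, array names, and the constant bound N are hypothetical, not taken from the figure above):
void top(int A[N], int B[N], int C[N]) {
#pragma HLS LOOP_MERGE
  // Both loops have the same constant bound and no code between them,
  // so they can be merged into a single loop and FSM state.
  ADD: for (int i = 0; i < N; i++)
    B[i] = A[i] + 1;
  SUB: for (int i = 0; i < N; i++)
    C[i] = A[i] - 1;
}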
Flattening Nested Loops to Improve Latency
In a similar manner to the consecutive loops discussed in the previous section, additional clock cycles are required to move between rolled nested loops: one clock cycle to move from an outer loop to an inner loop, and one to move from an inner loop back to an outer loop.
In the small example shown here, this implies 200 extra clock cycles to execute loop Outer.
void foo_top(...) {
    ...
    Outer: while(j < 100) {
        Inner: while(i < 6) { // 1 cycle to enter inner
            ...
            LOOP_BODY
            ...
        } // 1 cycle to exit inner
    }
    ...
}
Vitis HLS provides the set_directive_loop_flatten
command to allow labeled perfect and semi-perfect
nested loops to be flattened, removing the need to re-code for optimal hardware performance
and reducing the number of cycles it takes to perform the operations in the loop.
- Perfect loop nest
- Only the innermost loop has loop body content, there is no logic specified between the loop statements and all the loop bounds are constant.
- Semi-perfect loop nest
- Only the innermost loop has loop body content, there is no logic specified between the loop statements but the outermost loop bound can be a variable.
For imperfect loop nests, where the inner loop has variable bounds or the loop body is not exclusively inside the inner loop, try to restructure the code, or unroll the loops in the loop body, to create a perfect loop nest.
When the directive is applied to a set of nested loops, it should be applied to the innermost loop that contains the loop body.
set_directive_loop_flatten foo_top/Inner
Loop flattening can also be performed using the directive tab in the IDE, either by applying it to individual loops or applying it to all loops in a function by applying the directive at the function level.
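Equivalently, the pragma form can be placed in the body of the innermost loop; a sketch based on the pseudocode above:
Outer: while(j < 100) {
  Inner: while(i < 6) {
#pragma HLS LOOP_FLATTEN
    ...
  }
}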
Optimizing for Area
Data Types and Bit-Widths
The bit-widths of the variables in the C/C++ function directly impact the size of the storage elements and operators used in the RTL implementation. If a variable only requires 12 bits but is specified as an integer type (32-bit), it will result in larger and slower 32-bit operators being used, reducing the number of operations that can be performed in a clock cycle and potentially increasing initiation interval and latency. In particular (a sketch using arbitrary precision types follows this list):
- Use the appropriate precision for the data types.
- Confirm the size of any arrays that are to be implemented as RAMs or registers. The area impact of any over-sized elements is wasteful in hardware resources.
- Pay special attention to multiplications, divisions, modulus or other complex arithmetic operations. If these variables are larger than they need to be, they negatively impact both area and performance.
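For example, using the arbitrary precision types (a sketch; the 12-bit width, names, and size N are illustrative assumptions):
#include "ap_int.h"
#define N 32
// A 12-bit input only needs a 12-bit datapath; the accumulator is sized
// to hold N worst-case additions (12 + log2(32) + sign margin = 18 bits).
void accumulate(ap_int<12> din[N], ap_int<18> *sum) {
  ap_int<18> acc = 0;
  for (int i = 0; i < N; i++)
    acc += din[i];
  *sum = acc;
}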
Function Inlining
Function inlining removes the function hierarchy. A function is inlined using the INLINE directive. Inlining a function may improve area by allowing the components within the function to be better shared or optimized with the logic in the calling function. This type of function inlining is also performed automatically by Vitis HLS. Small functions are automatically inlined.
Inlining allows function sharing to be better controlled. For functions to be shared, they must be used within the same level of hierarchy. In the following code example, function foo_top calls foo twice, and also calls function foo_sub, which itself calls foo.
void foo_sub (int p, int q) {
  int q1 = q + 10;
  foo(p, q1); // foo_3
  ...
}
void foo_top(int a, int b, int c, int d) {
  ...
  foo(a, b); // foo_1
  foo(a, c); // foo_2
  foo_sub(a, d);
  ...
}
Inlining function foo_sub, and using the ALLOCATION directive to specify that only one instance of function foo can be used, results in a design with only one instance of function foo: one-third the area of the example above.
void foo_sub (int p, int q) {
#pragma HLS INLINE
  int q1 = q + 10;
  foo(p, q1); // foo_3
  ...
}
void foo_top(int a, int b, int c, int d) {
#pragma HLS ALLOCATION instances=foo limit=1 function
...
foo(a,b); //foo_1
foo(a,c); //foo_2
foo_sub(a,d);
...
}
The INLINE directive optionally allows all functions below the specified function to be recursively inlined by using the recursive option. If the recursive option is used on the top-level function, all function hierarchy in the design is removed.
The INLINE off option can optionally be applied to functions to prevent them from being inlined. This option can be used to prevent Vitis HLS from automatically inlining a function.
The INLINE directive is a powerful way to substantially modify the structure of the code without actually performing any modifications to the source code and provides a very powerful method for architectural exploration.
Array Reshaping
The ARRAY_RESHAPE directive reforms the array with a vertical mode of remapping, and is used to reduce the number of block RAM consumed while providing parallel access to the data.
Given the following example code:
void foo (...) {
int array1[N];
int array2[N];
int array3[N];
#pragma HLS ARRAY_RESHAPE variable=array1 type=block factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array2 type=cycle factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array3 type=complete dim=1
...
}
The ARRAY_RESHAPE directive transforms the arrays into the form shown in the following figure.
The ARRAY_RESHAPE directive allows more data to be accessed in a single clock cycle. In cases where more data can be accessed in a single clock cycle, Vitis HLS might automatically unroll any loops consuming this data, if doing so improves the throughput. The loop can be fully or partially unrolled to create enough hardware to consume the additional data in a single clock cycle. This feature is controlled using the config_unroll command and its tripcount_threshold option.
In the following example, any loops with a tripcount of less than 16 are automatically unrolled if doing so improves the throughput.
config_unroll -tripcount_threshold 16
Function Instantiation
Function instantiation is an optimization technique that has the area benefits of maintaining the function hierarchy but provides an additional powerful option: performing targeted local optimizations on specific instances of a function. This can simplify the control logic around the function call and potentially improve latency and throughput.
The FUNCTION_INSTANTIATE directive exploits the fact that some inputs to a function may be a constant value when the function is called and uses this to both simplify the surrounding control structures and produce smaller more optimized function blocks. This is best explained by example.
Given the following code:
void foo_sub(bool mode){
#pragma HLS FUNCTION_INSTANTIATE variable=mode
if (mode) {
// code segment 1
} else {
// code segment 2
}
}
void foo(){
  foo_sub(true);
  foo_sub(false);
}
It is clear that function foo_sub has been written to perform multiple but exclusive operations (depending on whether mode is true or not). Each instance of function foo_sub is implemented in an identical manner: this is great for function reuse and area optimization, but means that the control logic inside the function must be more complex.
The FUNCTION_INSTANTIATE optimization allows each instance to be independently optimized, reducing the functionality and area. After FUNCTION_INSTANTIATE optimization, the code above can effectively be transformed to have two separate functions, each optimized for different possible values of mode, as shown:
void foo_sub1() {
  // code segment 1
}
void foo_sub2() {
  // code segment 2
}
void foo(){
  foo_sub1();
  foo_sub2();
}
If the function is used at different levels of hierarchy such that function sharing is difficult without extensive inlining or code modifications, function instantiation can provide the best means of improving area: many small locally optimized copies are better than many large copies that cannot be shared.
Controlling Hardware Resources
During synthesis, Vitis HLS performs the following basic tasks:
- Elaborates the C, C++ source code into an internal database containing the operators in the C code, such as additions, multiplications, array reads, and writes.
- Maps the operators onto implementations in the hardware.
Implementations are the specific hardware components used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM).
Commands, pragmas and directives provide control over each of these steps, allowing you to control the hardware implementation at a fine level of granularity.
Limiting the Number of Operators
Explicitly limiting the number of operators to reduce area may be required in some cases: the default operation of Vitis HLS is to first maximize performance. Limiting the number of operators in a design is a useful technique to reduce the area of the design: it helps reduce area by forcing the sharing of operations. However, this might cause a decline in performance.
The ALLOCATION directive allows you to limit how many operators are used in a design. For example, if a design called foo has 317 multiplications but the FPGA only has 256 multiplier resources (DSP macrocells), the ALLOCATION pragma shown below directs Vitis HLS to create a design with a maximum of 256 multiplication (mul) operators:
dout_t array_arith (dio_t d[317]) {
static int acc;
int i;
#pragma HLS ALLOCATION instances=mul limit=256 operation
for (i=0;i<317;i++) {
#pragma HLS UNROLL
acc += acc * d[i];
}
return acc;
}
You can use the type option to specify whether the ALLOCATION directive limits operations, implementations, or functions. The following table lists all the operations that can be controlled using the ALLOCATION directive.
| Operator | Description |
|---|---|
| add | Integer addition |
| ashr | Arithmetic shift-right |
| dadd | Double-precision floating-point addition |
| dcmp | Double-precision floating-point comparison |
| ddiv | Double-precision floating-point division |
| dmul | Double-precision floating-point multiplication |
| drecip | Double-precision floating-point reciprocal |
| drem | Double-precision floating-point remainder |
| drsqrt | Double-precision floating-point reciprocal square root |
| dsub | Double-precision floating-point subtraction |
| dsqrt | Double-precision floating-point square root |
| fadd | Single-precision floating-point addition |
| fcmp | Single-precision floating-point comparison |
| fdiv | Single-precision floating-point division |
| fmul | Single-precision floating-point multiplication |
| frecip | Single-precision floating-point reciprocal |
| frem | Single-precision floating-point remainder |
| frsqrt | Single-precision floating-point reciprocal square root |
| fsub | Single-precision floating-point subtraction |
| fsqrt | Single-precision floating-point square root |
| icmp | Integer comparison |
| lshr | Logical shift-right |
| mul | Multiplication |
| sdiv | Signed division |
| shl | Shift-left |
| srem | Signed remainder |
| sub | Subtraction |
| udiv | Unsigned division |
| urem | Unsigned remainder |
Controlling Hardware Implementation
When synthesis is performed, Vitis HLS uses the timing constraints specified by the clock and the delays specified by the target device, together with any directives specified by you, to determine which hardware implementations to use for various operators in the code. For example, to implement a multiplier operation, Vitis HLS could use a combinational multiplier or a pipelined multiplier.
The implementations which are mapped to operators during synthesis can be limited by specifying the ALLOCATION pragma or directive, in the same manner as the operators. Instead of limiting the total number of multiplication operations, you can choose to limit the number of combinational multipliers, forcing any remaining multiplications to be performed using pipelined multipliers (or vice versa).
The BIND_OP or BIND_STORAGE pragmas or directives are used to explicitly specify which implementations to use for specific operations or storage types. The following command informs Vitis HLS to use a two-stage pipelined multiplier implemented in fabric logic for variable c. Vitis HLS is left to choose the implementation for variable d.
int foo (int a, int b) {
int c, d;
#pragma HLS BIND_OP variable=c op=mul impl=fabric latency=2
c = a*b;
d = a*c;
return d;
}
In the following example, the BIND_OP pragma specifies that the add operation
for variable temp
is implemented using the dsp
implementation. This ensures that the operation is implemented
using a DSP module primitive in the final design. By default, add operations are implemented
using LUTs.
void apint_arith(dinA_t inA, dinB_t inB,
dout1_t *out1
) {
dout2_t temp;
#pragma HLS BIND_OP variable=temp op=add impl=dsp
temp = inB + inA;
*out1 = temp;
}
Refer to the BIND_OP or BIND_STORAGE pragmas or directives to obtain details on the implementations available for assignment to operations or storage types.
In the following example, the BIND_OP pragma specifies that the multiplication for out1 is implemented with a 3-stage pipelined multiplier.
void foo(...) {
#pragma HLS BIND_OP variable=out1 op=mul latency=3
// Basic arithmetic operations
*out1 = inA * inB;
*out2 = inB + inA;
*out3 = inC / inA;
*out4 = inD % inA;
}
If the assignment specifies multiple identical operators, the code must be modified to ensure there is a single variable for each operator to be controlled. For example, in the following code, if only the first multiplication (inA * inB) is to be implemented with a pipelined multiplier:
*out1 = inA * inB * inC;
The code should be changed to the following, with the pragma specified on the Result_tmp variable:
#pragma HLS BIND_OP variable=Result_tmp op=mul latency=3
Result_tmp = inA * inB;
*out1 = Result_tmp * inC;
Optimizing Logic
Inferring Shift Registers
Vitis HLS infers a shift register when it encounters the following code:
int A[N]; // This will be replaced by a shift register
for(...) {
// The loop below is the shift operation
for (int i = 0; i < N-1; ++i)
A[i] = A[i+1];
A[N-1] = ...;
// This is an access to the shift register
... A[x] ...
}
A shift register can perform one shift per cycle, which offers a significant performance improvement, and also allows a random read access per cycle anywhere in the shift register, making it more flexible than a FIFO.
Controlling Operator Pipelining
Vitis HLS automatically determines the level of pipelining to use for internal operations. You can use the BIND_OP or BIND_STORAGE pragmas with the -latency option to explicitly specify the number of pipeline stages and override the number determined by Vitis HLS.
RTL synthesis might use the additional pipeline registers to help improve timing issues that might result after place and route. Registers added to the output of the operation typically help improve timing in the output datapath. Registers added to the input of the operation typically help improve timing in both the input datapath and the control logic from the FSM.
You can use the config_op command to pipeline all instances of a specific operation used in the design with the same pipeline depth. Refer to config_op for more information.
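For example, the following command (the choice of operation, implementation, and depth is illustrative) gives every multiplier in the design a DSP implementation with a pipeline depth of 3:
config_op mul -impl dsp -latency 3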
Optimizing Logic Expressions
During synthesis, several optimizations, such as strength reduction and bit-width minimization, are performed. Included in the list of automatic optimizations is expression balancing.
Expression balancing rearranges operators to construct a balanced tree and reduce latency.
- For integer operations expression balancing is on by default but may be disabled using the EXPRESSION_BALANCE pragma or directive.
- For floating-point operations, expression balancing is off by default but may be enabled.
Consider the highly sequential code using the assignment operators += and *= in the following example:
data_t foo_top (data_t a, data_t b, data_t c, data_t d)
{
data_t sum;
sum = 0;
sum += a;
sum += b;
sum += c;
sum += d;
return sum;
}
Without expression balancing, and assuming each addition requires one clock cycle, the complete computation for sum requires four clock cycles, as shown in the following figure.
However, the additions a+b and c+d can be executed in parallel, allowing the latency to be reduced. After balancing, the computation completes in two clock cycles, as shown in the following figure. Note that expression balancing prohibits sharing, and can result in increased area.
For integers, you can disable expression balancing using the EXPRESSION_BALANCE optimization directive with the off option.
By default, Vitis HLS does not perform the EXPRESSION_BALANCE optimization for operations of type float or double. When synthesizing float and double types, Vitis HLS maintains the order of operations performed in the C/C++ code to ensure that the results are the same as the C/C++ simulation. For example, in the following code, all variables are of type float or double. The values of O1 and O2 are not the same, even though they appear to perform the same basic calculation.
A=B*C; A=B*F;
D=E*F; D=E*C;
O1=A*D; O2=A*D;
This behavior is a function of the saturation and rounding in the C/C++ standard when performing operations with types float or double. Therefore, Vitis HLS always maintains the exact order of operations when variables of type float or double are present, and does not perform expression balancing by default.
You can enable expression balancing for specific operations, or you can configure the tool to enable expression balancing with float and double types using the config_compile -unsafe_math_optimizations command, as follows:
- In the Vitis HLS IDE, select Solution > Solution Settings.
- In the Solution Settings dialog box, click the General category, select config_compile, and enable unsafe_math_optimizations.
With this setting enabled, Vitis HLS might change the order of operations to produce a more optimal design. However, the results of C/RTL co-simulation might differ from the C/C++ simulation.
The unsafe_math_optimizations feature also enables the no_signed_zeros optimization. The no_signed_zeros optimization ensures that the following expressions used with float and double types are identical:
x - 0.0 = x;
x + 0.0 = x;
0.0 - x = -x;
x - x = 0.0;
x*0.0 = 0.0;
Without the no_signed_zeros optimization, the expressions above would not be equivalent due to rounding. The optimization may optionally be used without expression balancing by selecting only this option in the config_compile command.
If the unsafe_math_optimizations and no_signed_zeros optimizations are used, the RTL implementation will have different results than the C/C++ simulation. The test bench should be capable of ignoring minor differences in the result: check for a range, do not perform an exact comparison.
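A hedged sketch of such a range check in the test bench (the tolerance value is an arbitrary starting point, not a recommendation):
#include <cmath>
#include <algorithm>
// Pass if hardware and software results agree within a relative tolerance
// rather than exactly.
bool close_enough(float hw, float sw) {
  const float tol = 1e-5f;
  return std::fabs(hw - sw) <= tol * std::max(1.0f, std::fabs(sw));
}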
Optimizing Burst Transfers
Overview of Burst Transfers
Bursting is an optimization that tries to intelligently aggregate your memory accesses to the DDR to maximize the throughput bandwidth and/or minimize the latency. It is only one of many kernel optimizations: bursting typically gives a 4-5x improvement, while other optimizations, like access widening or ensuring there are no dependencies through the DDR, can provide even bigger performance improvements. Typically, bursting is useful when you have contention on the DDR ports from multiple competing kernels.
The figure above shows how the AXI protocol works. The HLS kernel sends out a read request for a burst of length 8 and then sends a write request burst of length 8. The read latency is defined as the time taken between the sending of the read request burst to when the data from the first read request in the burst is received by the kernel. Similarly, the write latency is defined as the time taken between when data for the last write in the write burst is sent and the time the write acknowledgment is received by the kernel. Read requests are usually sent at the first available opportunity while write requests get queued until the data for each write in the burst becomes available.
To help you understand the various latencies that are possible in the system, the following figure shows what happens when an HLS kernel sends a burst to the DDR.
When your design makes a read/write request, the request is sent to the DDR
through several specialized helper modules. First, the M-AXI adapter serves as a
buffer for the requests created by the HLS kernel. The adapter contains logic to cut
large bursts into smaller ones (which it needs to do to prevent hogging the channel
or if the request crosses the 4 KB boundary, see Vivado Design Suite: AXI Reference
Guide (UG1037)), and can also stall the
sending of burst requests (depending on the maximum outstanding requests parameter)
so that it can safely buffer the entirety of the data for each kernel. This can
slightly increase write latency but can resolve deadlock due to concurrent requests
(read or write) on the memory subsystem. You can configure the M-AXI interface to hold the write request until all data is available using config_interface -m_axi_conservative_mode.
Another way to view the latencies in the system is as follows: the interconnect has an average II of 2 while the DDR controller has an average II of 4-5 cycles on requests (while on the data they are both II=1). The interconnect arbitration strategy is based on the size of read/write requests, and so data requested with longer burst lengths get prioritized over requests with shorter bursts (thus leading to a bigger channel bandwidth being allocated to longer bursts in case of contention). Of course, a large burst request has the side-effect of preventing anyone else from accessing the DDR, and therefore there must be a compromise between burst length and reducing DDR port contention. Fortunately, the large latencies help prevent some of this port contention, and effective pipelining of the requests can significantly improve the bandwidth throughput available in the system.
Burst Semantics
For a given kernel, the HLS compiler implements the burst analysis optimization as a multi-pass optimization, but on a per function basis. Bursting is only done for a function and bursting across functions is not supported. The burst optimizations are reported in the Synthesis Summary report, and missed burst opportunities are also reported to help you improve burst optimization.
At first, the HLS compiler looks for memory accesses in the basic blocks of the function, such as memory accesses in a sequential set of statements inside the function. Assuming the preconditions of bursting are met, each burst inferred in these basic blocks is referred to as region burst. The compiler will automatically scan the basic block to build the longest sequence of accesses into a single region burst.
The compiler then looks at loops and tries to infer what are known as loop bursts. A loop burst is the sequence of reads/writes across the iterations of a loop. The compiler tries to infer the length of the burst by analyzing the loop induction variable and the trip count of the loop. If the analysis is successful, the compiler can chain the sequences of reads/writes in each iteration of the loop into one long loop burst. The compiler today automatically infers a loop or a region burst, but there is no way to specifically request a loop or a region burst. The code needs to be written so as to cause the tool to infer the loop or region burst.
To understand the underlying semantics of bursting, consider the following code snippet:
for(size_t i = 0; i < size; i+=4) {
    out[i+0] = f(in[i+0]);
    out[i+1] = f(in[i+1]);
    out[i+2] = f(in[i+2]);
    out[i+3] = f(in[i+3]);
}
The code above is typically used to perform a series of reads from an array, and a series of writes to an array from within a loop. The code below is what Vitis HLS may infer after performing the burst analysis optimization. Alongside the actual array accesses, the compiler will additionally make the required read and write requests that are necessary for the user selected AXI protocol.
Loop Burst
/* requests can move anywhere in func */
rb = ReadReq(in, size);
wb = WriteReq(out, size);
for(size_t i = 0; i < size; i+=4) {
    Write(wb, i+0) = f(Read(rb, i+0));
    Write(wb, i+1) = f(Read(rb, i+1));
    Write(wb, i+2) = f(Read(rb, i+2));
    Write(wb, i+3) = f(Read(rb, i+3));
}
WriteResp(wb);
If the compiler can successfully deduce the burst length from the induction variable (i) and the trip count of the loop (derived from size), it infers one big loop burst and moves the ReadReq, WriteReq, and WriteResp calls outside the loop, as shown in the Loop Burst code example. So, the read requests for all loop iterations are combined into one read request, and all the write requests are combined into one write request. Note that all read requests are typically sent out immediately, while write requests are only sent out after the data becomes available.
However, if any of the preconditions of bursting are not met, as described in Preconditions and Limitations of Burst Transfer, the compiler may not infer a loop burst, but will instead try to infer a region burst where the ReadReq, WriteReq, and WriteResp calls are placed alongside the read/write accesses being burst optimized, as shown in the Region Burst code example. In this case, the read and write requests for each loop iteration are combined into one read or write request.
Region Burst
for(size_t i = 0; i < size; i+=4) {
    rb = ReadReq(in+i, 4);
    wb = WriteReq(out+i, 4);
Write(wb, 0) = f(Read(rb, 0));
Write(wb, 1) = f(Read(rb, 1));
Write(wb, 2) = f(Read(rb, 2));
Write(wb, 3) = f(Read(rb, 3));
WriteResp(wb);
}
Preconditions and Limitations of Burst Transfer
Bursting Preconditions
Bursting is about aggregating successive memory access requests. Here are the set of preconditions that these successive accesses must meet for the bursting optimization to launch successfully:
- Must be all reads, or all writes – bursting reads and writes is not possible.
- Must be a monotonically increasing order of access (both in terms of the memory location being accessed as well as in time). You cannot access a memory location that is in between two previously accessed memory locations.
- Must be consecutive in memory – one next to another with no gaps or overlap and in forward order.
- The number of read/write accesses (or burst length) must be determinable before the request is sent out. This means that even if the burst length is parametric, it must be computed before the read/write request is sent out.
- If bundling two arrays to the same M-AXI port, bursting will be done only for one array, at most, in each direction at any given time.
- There must be no dependency issues from the time a burst request is initiated and finished.
Outer Loop Burst Failure Due to Overlapping Memory Accesses
Outer loop burst inference will fail in the following example because both iteration 0 and iteration 1 of the loop L1 access the same element in arrays a and b. Burst inference is an all-or-nothing optimization: the tool will not infer a partial burst. It is a greedy algorithm that tries to maximize the length of the burst. The auto-burst inference tries to infer a burst in a bottom-up fashion, from the inner loop to the outer loop, and stops when one of the preconditions is not met. In the example below, the burst inference stops when it sees that element 8 is being read again, so an inner loop burst of length 9 is inferred in this case.
L1: for (int i = 0; i < 8; ++i)
L2: for (int j = 0; j < 9; ++j)
b[i*8 + j] = a[i*8 + j];
itr 0: |0 1 2 3 4 5 6 7 8|
itr 1: | 8 9 10 11 12 13 14 15 16|
Usage of ap_int/ap_uint Types as Loop Induction Variables
Because the burst inference depends on the loop induction variable and the trip count, using non-native types can hinder the optimization from firing. It is recommended to always use unsigned integer type for the loop induction variable.
Must Enter Loop at Least Once
In some cases, the compiler can fail to infer that the max value of the loop induction variable can never be zero – that is, if it cannot prove that the loop will always be entered. In such cases, an assert statement will help the compiler infer this.
assert (N > 0);
L1: for(int a = 0; a < N; ++a) { … }
Inter or Intra Loop Dependencies on Arrays
If you write to an array location and then read from it in the same iteration or the next, this type of array dependency can be hard for the optimization to decipher. Basically, the optimization will fail for these cases because it cannot guarantee that the write will happen before the read.
Conditional Access to Memory
If the memory accesses are being made conditionally, it can cause the burst inferencing algorithm to fail as it cannot reason through the conditional statements. In some cases, the compiler will simplify the conditional and even remove it but it is generally recommended to not use conditional statements around the memory accesses.
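Where the condition applies to the use of the data rather than to its address, a hedged rewrite such as the following sketch (names N, din, and cond are hypothetical) keeps the access pattern unconditional and burstable:
int sum = 0;
for (int i = 0; i < N; ++i) {
  int v = din[i];   // unconditional, consecutive read: burst can be inferred
  if (cond[i])
    sum += v;       // the condition now guards only the use of the data
}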
M-AXI Accesses Made from Inside a Function Called from a Loop
Cross-functional array access analysis is not a strong suit for compiler transformations such as burst inferencing. In such cases, users can inline the function using the INLINE pragma or directive to avoid burst failures.
void my_function(hls::stream<T> &out_pkt, int *din, int input_idx) {
T v;
v.data = din[input_idx];
out_pkt.write(v);
}
void my_kernel(hls::stream<T> &out_pkt,
int *din,
int num_512_bytes,
int num_times) {
#pragma HLS INTERFACE mode=m_axi port = din offset=slave bundle=gmem0
#pragma HLS INTERFACE mode=axis port=out_pkt
#pragma HLS INTERFACE mode=s_axilite port=din bundle=control
#pragma HLS INTERFACE mode=s_axilite port=num_512_bytes bundle=control
#pragma HLS INTERFACE mode=s_axilite port=num_times bundle=control
#pragma HLS INTERFACE mode=s_axilite port=return bundle=control
unsigned int idx = 0;
 L0: for (int i = 0; i < num_times; ++i) {
  L1: for (int j = 0; j < num_512_bytes; ++j) {
#pragma HLS PIPELINE
   my_function(out_pkt, din, idx++);
  }
 }
}
Burst inferencing will fail because the memory accesses are being made from a called function. For the burst inferencing to work, it is recommended that users inline any such functions that are making accesses to the M-AXI memory.
An additional reason the burst inferencing fails in this example is that the memory accessed through din in my_function is indexed by a variable (idx) that is not a function of the loop induction variables i and j, and therefore might not be sequential or monotonic. Instead of passing idx, use (i*num_512_bytes+j), as shown in the sketch below.
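A sketch combining both fixes recommended above (inlining the helper and making the index a function of the induction variables):
void my_function(hls::stream<T> &out_pkt, int *din, int input_idx) {
#pragma HLS INLINE
 T v;
 v.data = din[input_idx];
 out_pkt.write(v);
}
...
 L0: for (int i = 0; i < num_times; ++i) {
  L1: for (int j = 0; j < num_512_bytes; ++j) {
#pragma HLS PIPELINE
   // The index is now provably sequential across iterations
   my_function(out_pkt, din, i*num_512_bytes + j);
  }
 }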
Loop Burst Inference on a Dataflow Loop
Burst inference is not supported on a loop that has the DATAFLOW pragma or directive. However, each process/task inside the dataflow loop can have bursts. Also, sharing of M-AXI ports is not supported inside a dataflow region because the tasks can execute in parallel.
Options for Controlling AXI4 Burst Behavior
An optimal AXI4 interface is one in which the design never stalls while waiting to access the bus, and after bus access is granted, the bus never stalls while waiting for the design to read/write. To create the optimal AXI4 interface, the following command options are provided in the INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the AXI4 interface.
Note that some of these options can use internal storage to buffer data and this may have an impact on area and resources:
latency
- Specifies the expected latency of the AXI4 interface, allowing the design to initiate a bus request several cycles (latency) before the read or write is expected. If this figure is too low, the design will be ready too soon and might stall waiting for the bus. If this figure is too high, bus access might be granted but the bus might stall waiting on the design to start the access. The default latency in Vitis HLS is 64.
max_read_burst_length
- Specifies the maximum number of data values read during a burst transfer. Default value is 16.
num_read_outstanding
- Specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_read_outstanding * max_read_burst_length * word_size. Default value is 16.
max_write_burst_length
- Specifies the maximum number of data values written during a burst transfer. Default value is 16.
num_write_outstanding
- Specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design: a FIFO of size num_write_outstanding * max_write_burst_length * word_size. Default value is 16.
The following example applies these options to an M-AXI interface:
#pragma HLS interface mode=m_axi port=input offset=slave bundle=gmem0
depth=1024*1024*16/(512/8) latency=100 num_read_outstanding=32 num_write_outstanding=32
max_read_burst_length=16 max_write_burst_length=16
- The interface is specified as having a latency of 100. The HLS compiler seeks to schedule the request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
- To further improve bus efficiency, the options num_write_outstanding and num_read_outstanding ensure the design contains enough buffering to store up to 32 read and/or write accesses. Each request will require its own buffer. This allows the design to continue processing until the bus requests are serviced.
- Finally, the options max_read_burst_length and max_write_burst_length ensure the maximum burst size is 16 and that the AXI4 interface does not hold the bus for longer than this. The HLS tool will partition longer bursts according to the specified burst length, and report this condition with a message like the following: Multiple burst reads of length 192 and bit width 128 in loop 'VITIS_LOOP_2' (./src/filter.cpp:247:21) has been inferred on port 'mm_read'. These burst requests might be further partitioned into multiple requests during RTL generation based on the max_read_burst_length settings.
These options allow the AXI4 interface to be optimized for the system in which it will operate. The efficiency of the operation depends on these values being set accurately. The provided default values are conservative, and may require changing depending on the memory access profile of your design.
| Vitis HLS Command | Value | Description |
|---|---|---|
| config_rtl -m_axi_conservative_mode | bool; default=false | Delay each M-AXI write request until the associated write data are entirely available (typically, buffered into the adapter or already emitted). This can slightly increase write latency but can resolve deadlock due to concurrent requests (read or write) on the memory subsystem. |
| config_interface -m_axi_latency | uint; 0 is auto; default=0 (Vivado IP flow), 64 (Vitis Kernel flow) | Provide the scheduler with an expected latency for M-AXI accesses. Latency is the delay between a read request and the first read data, or between the last write data and the write response. Note that this number need not be exact; underestimation makes for a lower-latency schedule, but with longer dynamic stalls. The scheduler will account for the additional adapter latency and add a few cycles. |
| config_interface -m_axi_min_bitwidth | uint; default=8 | Minimum bitwidth for M-AXI interface data channels. Must be a power of two between 8 and 1024. Note that this does not necessarily increase throughput if the actual accesses are smaller than the required interface. |
| config_interface -m_axi_max_bitwidth | uint; default=1024 | Maximum bitwidth for M-AXI interface data channels. Must be a power of two between 8 and 1024. Note that this can decrease throughput if the actual accesses are bigger than the required interface, as they will be split into a multi-cycle burst of accesses. |
| config_interface -m_axi_max_widen_bitwidth | uint; default=0 (Vivado IP flow), 512 (Vitis Kernel flow) | Allow the tool to automatically widen bursts on M-AXI interfaces up to the chosen bitwidth. Must be a power of two between 8 and 1024. Note that burst widening requires strong alignment properties (in addition to bursting). |
| config_interface -m_axi_auto_max_ports | bool; default=false | If false, all M-AXI interfaces that are not explicitly bundled are bundled into a single common interface, minimizing resource usage (single adapter). If true, each M-AXI interface that is not explicitly bundled is mapped to an individual interface, increasing resource usage (multiple adapters). |
| config_interface -m_axi_alignment_byte_size | uint; default=0 (Vivado IP flow), 64 (Vitis Kernel flow) | Assume that top-level function pointers mapped to M-AXI interfaces are at least aligned to the provided width in bytes (power of two). This can help automatic burst widening. Warning: behavior will be incorrect if the pointers are not actually aligned at runtime. |
| config_interface -m_axi_num_read_outstanding | uint; default=16 | Default value for the M-AXI num_read_outstanding interface parameter. |
| config_interface -m_axi_num_write_outstanding | uint; default=16 | Default value for the M-AXI num_write_outstanding interface parameter. |
| config_interface -m_axi_max_read_burst_length | uint; default=16 | Default value for the M-AXI max_read_burst_length interface parameter. |
| config_interface -m_axi_max_write_burst_length | uint; default=16 | Default value for the M-AXI max_write_burst_length interface parameter. |
Examples of Recommended Coding Styles
As described in Synthesis Summary, Vitis HLS issues a report summarizing burst activities and identifying burst failures. If bursts of variable length are inferred, the report notes this. The compiler also provides burst messages in the compiler log, vitis_hls.log. These messages are issued before the scheduling step.
Simple Read/Write Burst Inference
The burst messages for this example report:
INFO: [HLS 214-115] Burst read of variable length and bit width 32 has been inferred on port 'gmem'
INFO: [HLS 214-115] Burst write of variable length and bit width 32 has been inferred on port 'gmem' (./src/vadd.cpp:75:9).
The code for this example follows:
/****** BEGIN EXAMPLE *******/
#include <cstring> // for memcpy
#define DATA_SIZE 2048
// Define internal buffer max size
#define BURSTBUFFERSIZE 256
//TRIPCOUNT identifiers
const unsigned int c_min = 1;
const unsigned int c_max = BURSTBUFFERSIZE;
const unsigned int c_chunk_sz = DATA_SIZE;
extern "C" {
void vadd(int *a, int size, int inc_value) {
// Map pointer a to AXI4-master interface for global memory access
#pragma HLS INTERFACE mode=m_axi port=a offset=slave bundle=gmem max_read_burst_length=256 max_write_burst_length=256
// We also need to map a and return to a bundled axilite slave interface
#pragma HLS INTERFACE mode=s_axilite port=a bundle=control
#pragma HLS INTERFACE mode=s_axilite port=size bundle=control
#pragma HLS INTERFACE mode=s_axilite port=inc_value bundle=control
#pragma HLS INTERFACE mode=s_axilite port=return bundle=control
int burstbuffer[BURSTBUFFERSIZE];
// Each iteration of this loop performs BURSTBUFFERSIZE vector additions
for (int i = 0; i < size; i += BURSTBUFFERSIZE) {
#pragma HLS LOOP_TRIPCOUNT min=c_min max=c_chunk_sz/c_max
int chunk_size = BURSTBUFFERSIZE;
//boundary checks
if ((i + BURSTBUFFERSIZE) > size)
chunk_size = size - i;
// memcpy creates a burst access to memory
// multiple calls of memcpy cannot be pipelined and will be scheduled sequentially
// memcpy requires a local buffer to store the results of the memory transaction
memcpy(burstbuffer, &a[i], chunk_size * sizeof(int));
// Calculate and write results to global memory, the sequential write in a for loop can be
// inferred as a memory burst access
calc_write:
for (int j = 0; j < chunk_size; j++) {
#pragma HLS LOOP_TRIPCOUNT min=c_min max=c_max
#pragma HLS PIPELINE II=1
burstbuffer[j] = burstbuffer[j] + inc_value;
a[i + j] = burstbuffer[j];
}
}
}
}
Pipelining Between Bursts
Consider the following code, in which a burst is inferred in the pipelined inner loop:
for(int x=0; x < k; ++x) {
int off = f(x);
for(int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
... = gmem[off + i];
}
}
Notice that the outer loop is not pipelined: there is pipelining inside each burst, but no pipelining between bursts. If the burst length N is fixed, you can rewrite the code to pipeline the outer loop and fully unroll the inner loop, which preserves the burst while overlapping successive bursts:
for(int x=0; x < k; ++x) {
#pragma HLS PIPELINE II=N
int off = f(x);
for(int i = 0; i < N; ++i) {
#pragma HLS UNROLL
... = gmem[off + i];
}
}
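With the outer loop pipelined at II=N and the inner loop unrolled, each burst still issues its N beats in order, but successive bursts now overlap, achieving the same burst length with higher throughput.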
Accessing Row Data from a Two-Dimensional Array
For this example, the burst messages report:
INFO: [HLS 214-115] Burst read of length 256 and bit width 512 has been inferred on port 'gmem' (./src/row_array_2d.cpp:43:5)
INFO: [HLS 214-115] Burst write of length 256 and bit width 512 has been inferred on port 'gmem' (./src/row_array_2d.cpp:56:5)
Notice that a bit width of 512 is achieved in this example. This is more efficient than the 32 bit width achieved in the simple example above. Bursting wider bit widths is another way bursts can be optimized as discussed in Automatic Port Width Resizing.
The code for this example follows:
/****** BEGIN EXAMPLE *******/
// Parameters Description:
// NUM_ROWS: matrix height
// WORD_PER_ROW: number of words in a row
// BLOCK_SIZE: number of words in an array
#define NUM_ROWS 64
#define WORD_PER_ROW 64
#define BLOCK_SIZE (WORD_PER_ROW*NUM_ROWS)
// Default datatype is integer
typedef int DTYPE;
typedef hls::stream<DTYPE> my_data_fifo;
// Read data function: reads data from global memory
void read_data(DTYPE *inx, my_data_fifo &inFifo) {
read_loop_i:
for (int i = 0; i < NUM_ROWS; ++i) {
read_loop_jj:
for (int jj = 0; jj < WORD_PER_ROW; ++jj) {
#pragma HLS PIPELINE II=1
inFifo << inx[WORD_PER_ROW * i + jj];
}
}
}
// Write data function - writes results to global memory
void write_data(DTYPE *outx, my_data_fifo &outFifo) {
write_loop_i:
for (int i = 0; i < NUM_ROWS; ++i) {
write_loop_jj:
for (int jj = 0; jj < WORD_PER_ROW; ++jj) {
#pragma HLS PIPELINE II=1
outFifo >> outx[WORD_PER_ROW * i + jj];
}
}
}
// Compute function is pretty simple because this example is focused on efficient
// memory access pattern.
void compute(my_data_fifo &inFifo, my_data_fifo &outFifo, int alpha) {
compute_loop_i:
for (int i = 0; i < NUM_ROWS; ++i) {
compute_loop_jj:
for (int jj = 0; jj < WORD_PER_ROW; ++jj) {
#pragma HLS PIPELINE II=1
DTYPE inTmp;
inFifo >> inTmp;
DTYPE outTmp = inTmp * alpha;
outFifo << outTmp;
}
}
}
extern "C" {
void row_array_2d(DTYPE *inx, DTYPE *outx, int alpha) {
// AXI master interface
#pragma HLS INTERFACE mode=m_axi port = inx offset = slave bundle = gmem
#pragma HLS INTERFACE mode=m_axi port = outx offset = slave bundle = gmem
// AXI slave interface
#pragma HLS INTERFACE mode=s_axilite port = inx bundle = control
#pragma HLS INTERFACE mode=s_axilite port = outx bundle = control
#pragma HLS INTERFACE mode=s_axilite port = alpha bundle = control
#pragma HLS INTERFACE mode=s_axilite port = return bundle = control
my_data_fifo inFifo;
// By default the FIFO depth is 2, user can change the depth by using
// #pragma HLS stream variable=inFifo depth=256
my_data_fifo outFifo;
// Dataflow enables task level pipelining, allowing functions and loops to execute
// concurrently. For more details please refer to UG902.
#pragma HLS DATAFLOW
// Read data from each row of 2D array
read_data(inx, inFifo);
// Do computation with the acquired data
compute(inFifo, outFifo, alpha);
// Write data to each row of 2D array
write_data(outx, outFifo);
return;
}
}
Summary
Write code in such a way that bursting can be inferred. Ensure that none of the preconditions are violated.
Bursting does not mean that you will get all your data in one shot – it is about merging the requests together into one request, but the data will arrive sequentially, one after another.
Burst length of 16 is ideal, but even burst lengths of 8 are enough. Bigger bursts have more latency while shorter bursts can be pipelined. Do not confuse bursting with pipelining, but note that bursts can be pipelined with other bursts.
If your bursts are of fixed length, you can unroll the inner loop where bursts are inferred and pipeline the outer loop. This will achieve the same burst length, but also pipelining between the bursts to enable higher throughput.
For greater throughput, focus on widening the interface up to 512 bits rather than simply achieving longer bursts.
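As a sketch of this approach (assuming the ap_uint type from ap_int.h; the function and bundle names here are hypothetical), casting the kernel arguments to a 512-bit type moves 64 bytes per beat:
#include "ap_int.h"
// Hypothetical example: each loop iteration transfers one 512-bit word
// (64 bytes), so a burst of 16 beats moves 1 KB.
void wide_copy(const ap_uint<512> *in, ap_uint<512> *out, int n) {
#pragma HLS INTERFACE mode=m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE mode=m_axi port=out offset=slave bundle=gmem1
copy_loop:
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i]; // sequential wide accesses: bursts are inferred
    }
}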
Bigger bursts have higher priority with the AXI interconnect. No dynamic arbitration is done inside the kernel.
You can have two m_axi ports connected to the same DDR to model mutually exclusive access inside the kernel, but the AXI interconnect outside the kernel will arbitrate competing requests.
One way to work around the out-of-order access restriction is to create your own buffer in BRAM, store the burst data in this buffer, and then perform the out-of-order accesses on the buffer, as sketched below. This is typically called a line buffer and is a common optimization used in video processing.
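A minimal sketch of this pattern follows (the function and buffer names are hypothetical): a sequential loop fills a local buffer so a burst is inferred, and a second loop then reads the local copy out of order.
void filter_row(const int *gmem, int *out) {
#pragma HLS INTERFACE mode=m_axi port=gmem offset=slave bundle=gmem0
#pragma HLS INTERFACE mode=m_axi port=out offset=slave bundle=gmem1
    int line[256]; // local line buffer, mapped to BRAM
read_burst:
    for (int i = 0; i < 256; ++i) {
#pragma HLS PIPELINE II=1
        line[i] = gmem[i]; // sequential reads: burst is inferred
    }
compute:
    for (int i = 0; i < 254; ++i) {
#pragma HLS PIPELINE II=1
        // out-of-order reads are served by the BRAM copy, not the AXI4 bus
        out[i] = line[i + 2] - line[i];
    }
}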
Review the Burst Optimization section of the Synthesis Summary report to learn more about burst optimizations in the design, and missed burst opportunities.
Adding RTL Blackbox Functions
The RTL blackbox enables the use of existing RTL IP in an HLS project. This lets you add RTL code to your C/C++ code for synthesis of the project by Vitis HLS. The RTL IP can be used in a sequential, pipeline, or dataflow region.
Integrating RTL IP into a Vitis HLS project requires the following files:
- C function signature for the RTL code. This can be placed into a header (.h) file; see the sketch after this list.
- Blackbox JSON description file as discussed in JSON File for RTL Blackbox.
- RTL IP files.
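For example, a minimal header for an RTL module named foo might look like the following (a sketch: the name foo and its arguments are hypothetical, and the signature must match the c_function_name and c_parameters declared in the JSON file):
// foo.h -- C++ signature standing in for the RTL IP "foo"
#ifndef FOO_H
#define FOO_H
int foo(int a, int b); // implemented in foo.v; the C++ model is used for simulation
#endif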
To use the RTL blackbox in an HLS project, use the following steps.
- Call the C function signature from within your top-level function, or a sub-function in the Vitis HLS project.
- Add the blackbox JSON description file to your HLS project using the Add Files command from the Vitis HLS IDE as discussed in Creating a New Vitis HLS Project, or using the add_files command:
add_files -blackbox my_file.json
TIP: As explained in the next section, the new RTL Blackbox wizard can help you generate the JSON file and add the RTL IP to your project.
- Run the Vitis HLS design flow for simulation, synthesis, and co-simulation as usual.
Requirements and Limitations
RTL IP used with the RTL blackbox feature must meet the following requirements:
- Should be Verilog (.v) code.
- Must have a unique clock signal, and a unique active-High reset signal.
- Must have a CE signal that is used to enable or stall the RTL IP.
- Must use the
ap_ctrl_chain
protocol as described in Block-Level Control Protocols.
Within Vitis HLS, the RTL blackbox feature:
- Supports only C++.
- Cannot connect to top-level interface I/O signals.
- Cannot directly serve as the design-under-test (DUT).
- Does not support struct or class type interfaces.
- Supports the following interface protocols, as described in JSON File for RTL Blackbox (see the signature sketch after this list):
- hls::stream
- The RTL blackbox IP supports the hls::stream interface. When this data type is used in the C function, use a FIFO RTL port protocol for this argument in the RTL blackbox IP.
- Arrays
- The RTL blackbox IP supports a RAM interface for arrays. For array arguments in the C function, use one of the following RTL port protocols for the corresponding argument in the RTL blackbox IP:
  - Single port RAM – RAM_1P
  - Dual port RAM – RAM_T2P
- Scalars and Input Pointers
- The RTL blackbox IP supports C scalars and input pointers only in sequential and pipeline regions. They are not supported in a dataflow region. When these constructs are used in the C function, use the wire port protocol in the RTL IP.
- Inout and Output Pointers
- The RTL blackbox IP supports inout and output pointers only in sequential and pipeline regions. They are not supported in a dataflow region. When these constructs are used in the C function, the RTL IP should use ap_vld for output pointers, and ap_ovld for inout pointers.
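As an illustration of these mappings (a sketch with hypothetical argument names), the following signature shows the RTL port protocol implied by each common C++ argument type:
#include "hls_stream.h"
// Each argument type implies the RTL port protocol the blackbox IP must implement.
void bb_core(hls::stream<int> &in_fifo, // FIFO protocol
             int coeffs[64],            // RAM protocol (RAM_1P or RAM_T2P)
             int scale,                 // wire protocol (scalar input)
             int *status);              // ap_vld protocol (output pointer)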
Using the RTL Blackbox Wizard
Navigate to the project and right-click to open the RTL Blackbox Wizard.
The Wizard is organized into pages that break down the process of creating a JSON file. To navigate between pages, click Next or Back. Once the options are finalized, you can generate the JSON file by clicking OK. The following sections describe each page and its input options.
C++ Model and Header Files
In the Blackbox C/C++ files page, you provide the C++ files which form the functional model of the RTL IP. This C++ model is only used during C++ simulation and C++/RTL co-simulation. The RTL IP is combined with Vitis HLS results to form the output of synthesis.
In this page, you can perform the following:
- Click Add Files to add files.
- Click Edit CFLAGS to provide a linker flag to the functional C model.
- Click Next to proceed.
The C File Wizard page lets you specify the values used for the C functional model of the RTL IP. The fields include:
- C Function
- Specify the C function name of the RTL IP.
- C Argument Name
- Specify the name(s) of the function arguments. These should relate to the ports on the IP.
- C Argument Type
- Specify the data type used for each argument.
- C Port Direction
- Specify the port direction of the argument, corresponding to the port in the IP.
- RAM Type
- Specify the RAM type used at the interface.
- RTL Group Configuration
- Specifies the corresponding RTL signal name.
Click Next to proceed.
RTL IP Definition
The RTL Wizard page lets you define the RTL source for the IP. The fields to define include:
- RTL Files
- This option is used to add or remove the pre-existing RTL IP files.
- RTL Module Name
- Specify the top-level RTL IP module name in this field.
- Performance
- Specify performance targets for the IP.
- Latency
- Latency is the time required for the design to complete. Specify the Latency information in this field.
- II
- Define the target II (initiation interval): the number of clock cycles before new input can be applied.
- Resource
- Specify the device resource utilization for the RTL IP. The resource information provided here will be combined with utilization from synthesis to report the overall design resource utilization. You should be able to extract this information from the Vivado Design Suite.
Click Next to proceed to the RTL Common Signal page, which contains the following fields:
- module_clock
- Specify the name of the clock used in the RTL IP.
- module_reset
- Specify the name of the reset signal used in the IP.
- module_clock_enable
- Specify the name of the clock enable signal in the IP.
- ap_ctrl_chain_protocol_start
- Specify the name of the block control start signal used in the IP.
- ap_ctrl_chain_protocol_ready
- Specify the name of the block control ready signal used in the IP.
- ap_ctrl_chain_protocol_done
- Specify the name of the block control done signal used in the IP.
- ap_ctrl_chain_protocol_continue
- Specify the name of the block control continue signal used in the RTL IP.
Click Finish to automatically generate a JSON file for the specified IP. This can be confirmed through the log message as shown below.
"[2019-08-29 16:51:10] RTL Blackbox Wizard Information: the "foo.json" file has been created in the rtl_blackbox/Source folder."
The JSON file can be accessed through the Source file folder, and will be generated as described in the next section.
JSON File for RTL Blackbox
JSON File Format
The following table describes the JSON file format:
Item | Attribute | Description
---|---|---
c_function_name | | The C++ function name for the blackbox. The c_function_name must be consistent with the C function simulation model.
rtl_top_module_name | | The RTL function name for the blackbox. The rtl_top_module_name must be consistent with the c_function_name.
c_files | c_file | Specifies the C file used for the blackbox module.
 | cflag | Provides any compile option necessary for the corresponding C file.
rtl_files | | Specifies the RTL files for the blackbox module.
c_parameters | c_name | Specifies the name of the argument used for the blackbox C++ function. Unused c_parameters can be deleted from the JSON file.
 | c_port_direction | The access direction for the corresponding C argument: in (read only), out (write only), or inout (both read and write).
 | RAM_type | Specifies the RAM type to use if the corresponding C argument uses the RTL RAM protocol. Two types of RAM are used: RAM_1P (single port) and RAM_T2P (true dual port).
 | rtl_ports | Specifies the RTL port protocol signals for the corresponding C argument (c_name). Every c_parameter should be associated with an rtl_port. Five types of RTL port protocols are used. Refer to the RTL Port Protocols table for additional details.
c_return | c_port_direction | It must be out.
 | rtl_ports | Specifies the corresponding RTL port name used in the RTL blackbox IP.
rtl_common_signal | module_clock | The unique clock signal for the RTL blackbox module.
 | module_reset | Specifies the reset signal for the RTL blackbox module. The reset signal must be active-High or positive valid.
 | module_clock_enable | Specifies the clock enable signal for the RTL blackbox module. The enable signal must be active-High or positive valid.
 | ap_ctrl_chain_protocol_idle | The ap_idle signal in the ap_ctrl_chain protocol for the RTL blackbox module.
 | ap_ctrl_chain_protocol_start | The ap_start signal in the ap_ctrl_chain protocol for the RTL blackbox module.
 | ap_ctrl_chain_protocol_ready | The ap_ready signal in the ap_ctrl_chain protocol for the RTL blackbox IP.
 | ap_ctrl_chain_protocol_done | The ap_done signal in the ap_ctrl_chain protocol for the RTL blackbox module.
 | ap_ctrl_chain_protocol_continue | The ap_continue signal in the ap_ctrl_chain protocol for the RTL blackbox module.
rtl_performance | latency | Specifies the latency of the RTL blackbox module. It must be a non-negative integer value. For combinatorial RTL IP specify 0; otherwise specify the exact latency of the RTL module.
 | II | Number of clock cycles before the function can accept new input data. It must be a non-negative integer value. 0 means the blackbox cannot be pipelined; otherwise, the blackbox module is pipelined.
rtl_resource_usage | FF | Specifies the register utilization for the RTL blackbox module.
 | LUT | Specifies the LUT utilization for the RTL blackbox module.
 | BRAM | Specifies the block RAM utilization for the RTL blackbox module.
 | URAM | Specifies the URAM utilization for the RTL blackbox module.
 | DSP | Specifies the DSP utilization for the RTL blackbox module.
RTL Port Protocol | RAM Type | C Port Direction | Attribute | User-Defined Name | Notes
---|---|---|---|---|---
wire | | in | data_read_in | Specifies a user-defined name used in the RTL blackbox IP. For example, for wire, if the RTL port name is "flag", then the JSON file entry is "data_read_in" : "flag". |
ap_vld | | out | data_write_out | |
 | | | data_write_valid | |
ap_ovld | | inout | data_read_in | |
 | | | data_write_out | |
 | | | data_write_valid | |
FIFO | | in | FIFO_empty_flag | | Must be negative valid.
 | | | FIFO_read_enable | |
 | | | FIFO_data_read_in | |
 | | out | FIFO_full_flag | | Must be negative valid.
 | | | FIFO_write_enable | |
 | | | FIFO_data_write_out | |
RAM | RAM_1P | in | RAM_address | |
 | | | RAM_clock_enable | |
 | | | RAM_data_read_in | |
 | | out | RAM_address | |
 | | | RAM_clock_enable | |
 | | | RAM_write_enable | |
 | | | RAM_data_write_out | |
 | | inout | RAM_address | |
 | | | RAM_clock_enable | |
 | | | RAM_write_enable | |
 | | | RAM_data_write_out | |
 | | | RAM_data_read_in | |
RAM | RAM_T2P | in | RAM_address | | Signals with _snd belong to the second port of the RAM. Signals without _snd belong to the first port.
 | | | RAM_clock_enable | |
 | | | RAM_data_read_in | |
 | | | RAM_address_snd | |
 | | | RAM_clock_enable_snd | |
 | | | RAM_data_read_in_snd | |
 | | out | RAM_address | |
 | | | RAM_clock_enable | |
 | | | RAM_write_enable | |
 | | | RAM_data_write_out | |
 | | | RAM_address_snd | |
 | | | RAM_clock_enable_snd | |
 | | | RAM_write_enable_snd | |
 | | | RAM_data_write_out_snd | |
 | | inout | RAM_address | |
 | | | RAM_clock_enable | |
 | | | RAM_write_enable | |
 | | | RAM_data_write_out | |
 | | | RAM_data_read_in | |
 | | | RAM_address_snd | |
 | | | RAM_clock_enable_snd | |
 | | | RAM_write_enable_snd | |
 | | | RAM_data_write_out_snd | |
 | | | RAM_data_read_in_snd | |
JSON File Example
This section provides details on manually writing the JSON file required for the RTL blackbox. The following is an example of a JSON file:
{
"c_function_name" : "foo",
"rtl_top_module_name" : "foo",
"c_files" :
[
{
"c_file" : "../../a/top.cpp",
"cflag" : ""
},
{
"c_file" : "xx.cpp",
"cflag" : "-D KF"
}
],
"rtl_files" : [
"../../foo.v",
"xx.v"
],
"c_parameters" : [{
"c_name" : "a",
"c_port_direction" : "in",
"rtl_ports" : {
"data_read_in" : "a"
}
},
{
"c_name" : "b",
"c_port_direction" : "in",
"rtl_ports" : {
"data_read_in" : "b"
}
},
{
"c_name" : "c",
"c_port_direction" : "out",
"rtl_ports" : {
"data_write_out" : "c",
"data_write_valid" : "c_ap_vld"
}
},
{
"c_name" : "d",
"c_port_direction" : "inout",
"rtl_ports" : {
"data_read_in" : "d_i",
"data_write_out" : "d_o",
"data_write_valid" : "d_o_ap_vld"
}
},
{
"c_name" : "e",
"c_port_direction" : "in",
"rtl_ports" : {
"FIFO_empty_flag" : "e_empty_n",
"FIFO_read_enable" : "e_read",
"FIFO_data_read_in" : "e"
}
},
{
"c_name" : "f",
"c_port_direction" : "out",
"rtl_ports" : {
"FIFO_full_flag" : "f_full_n",
"FIFO_write_enable" : "f_write",
"FIFO_data_write_out" : "f"
}
},
{
"c_name" : "g",
"c_port_direction" : "in",
"RAM_type" : "RAM_1P",
"rtl_ports" : {
"RAM_address" : "g_address0",
"RAM_clock_enable" : "g_ce0",
"RAM_data_read_in" : "g_q0"
}
},
{
"c_name" : "h",
"c_port_direction" : "out",
"RAM_type" : "RAM_1P",
"rtl_ports" : {
"RAM_address" : "h_address0",
"RAM_clock_enable" : "h_ce0",
"RAM_write_enable" : "h_we0",
"RAM_data_write_out" : "h_d0"
}
},
{
"c_name" : "i",
"c_port_direction" : "inout",
"RAM_type" : "RAM_1P",
"rtl_ports" : {
"RAM_address" : "i_address0",
"RAM_clock_enable" : "i_ce0",
"RAM_write_enable" : "i_we0",
"RAM_data_write_out" : "i_d0",
"RAM_data_read_in" : "i_q0"
}
},
{
"c_name" : "j",
"c_port_direction" : "in",
"RAM_type" : "RAM_T2P",
"rtl_ports" : {
"RAM_address" : "j_address0",
"RAM_clock_enable" : "j_ce0",
"RAM_data_read_in" : "j_q0",
"RAM_address_snd" : "j_address1",
"RAM_clock_enable_snd" : "j_ce1",
"RAM_data_read_in_snd" : "j_q1"
}
},
{
"c_name" : "k",
"c_port_direction" : "out",
"RAM_type" : "RAM_T2P",
"rtl_ports" : {
"RAM_address" : "k_address0",
"RAM_clock_enable" : "k_ce0",
"RAM_write_enable" : "k_we0",
"RAM_data_write_out" : "k_d0",
"RAM_address_snd" : "k_address1",
"RAM_clock_enable_snd" : "k_ce1",
"RAM_write_enable_snd" : "k_we1",
"RAM_data_write_out_snd" : "k_d1"
}
},
{
"c_name" : "l",
"c_port_direction" : "inout",
"RAM_type" : "RAM_T2P",
"rtl_ports" : {
"RAM_address" : "l_address0",
"RAM_clock_enable" : "l_ce0",
"RAM_write_enable" : "l_we0",
"RAM_data_write_out" : "l_d0",
"RAM_data_read_in" : "l_q0",
"RAM_address_snd" : "l_address1",
"RAM_clock_enable_snd" : "l_ce1",
"RAM_write_enable_snd" : "l_we1",
"RAM_data_write_out_snd" : "l_d1",
"RAM_data_read_in_snd" : "l_q1"
}
}],
"c_return" : {
"c_port_direction" : "out",
"rtl_ports" : {
"data_write_out" : "ap_return"
}
},
"rtl_common_signal" : {
"module_clock" : "ap_clk",
"module_reset" : "ap_rst",
"module_clock_enable" : "ap_ce",
"ap_ctrl_chain_protocol_idle" : "ap_idle",
"ap_ctrl_chain_protocol_start" : "ap_start",
"ap_ctrl_chain_protocol_ready" : "ap_ready",
"ap_ctrl_chain_protocol_done" : "ap_done",
"ap_ctrl_chain_protocol_continue" : "ap_continue"
},
"rtl_performance" : {
"latency" : "6",
"II" : "2"
},
"rtl_resource_usage" : {
"FF" : "0",
"LUT" : "0",
"BRAM" : "0",
"URAM" : "0",
"DSP" : "0"
}
}
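For reference, a C++ signature consistent with this JSON might look like the following (a sketch: the int element types and the array sizes are assumptions, since the JSON file does not encode them):
#include "hls_stream.h"
// Hypothetical signature for the "foo" blackbox described above: a/b use
// wire, c uses ap_vld, d uses ap_ovld, e/f are FIFO streams, g-i are
// single-port RAM arrays, j-l are dual-port RAM arrays, and the return
// value maps to ap_return.
int foo(int a, int b, int *c, int *d,
        hls::stream<int> &e, hls::stream<int> &f,
        int g[256], int h[256], int i[256],
        int j[256], int k[256], int l[256]);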