HLS Pragmas
Optimizations in Vivado HLS
In both SDAccel™ and SDSoC™ development environments, the hardware kernel must be synthesized from the OpenCL™, C, or C++ language into the register transfer level (RTL) that can be implemented into the programmable logic of a Xilinx® device. The Vivado® High-Level Synthesis (HLS) tool synthesizes RTL from the OpenCL, C, and C++ language descriptions.
The HLS tool is intended to work with your SDAccel or SDSoC development environment project without interaction. However, the HLS tool also provides pragmas that can be used to optimize the design: reduce latency, improve throughput performance, and reduce area and device resource usage of the resulting RTL code. These pragmas can be added directly to the source code for the kernel.
The HLS pragmas include the optimization types specified below:
Type | Attributes |
---|---|
Kernel Optimization | |
Function Inlining | |
Interface Synthesis | |
Task-level Pipeline | |
Pipeline | |
Loop Unrolling | |
Loop Optimization | |
Array Optimization | |
Structure Packing | |
pragma HLS allocation
Description
Specifies instance restrictions to limit resource allocation in the implemented kernel. This defines and can limit the number of register transfer level (RTL) instances and hardware resources used to implement specific functions, loops, operations or cores. The ALLOCATION pragma is specified inside the body of a function, a loop, or a region of code.
For example, if the C source has four instances of a function foo_sub, the ALLOCATION pragma can ensure that there is only one instance of foo_sub in the final RTL. All four instances of the C function are implemented using the same RTL block. This reduces resources used by the function, but negatively impacts performance.
The operations in the C code, such as additions, multiplications, array reads, and writes, can be limited by the ALLOCATION pragma. Cores, which operators are mapped to during synthesis, can be limited in the same manner as the operators. Instead of limiting the total number of multiplication operations, you can choose to limit the number of combinational multiplier cores, forcing any remaining multiplications to be performed using pipelined multipliers (or vice versa).
The ALLOCATION pragma applies to the scope it is specified within: a function, a loop, or a region of code. However, you can use the -min_op argument of the config_bind command to globally minimize operators throughout the design.
For more information, refer to config_bind in Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Syntax
Place the pragma inside the body of the function, loop, or region where it will apply.
#pragma HLS allocation instances=<list> \
limit=<value> <type>
Where:
instances=<list>
: Specifies the names of functions, operators, or cores.
limit=<value>
: Optionally specifies the limit of instances to be used in the kernel.
<type>
: Specifies that the allocation applies to a function, an operation, or a core (hardware component) used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The type is specified as one of the following:
  - function: Specifies that the allocation applies to the functions listed in the instances= list. The function can be any function in the original C or C++ code that has not been:
    - Inlined by the pragma HLS inline, or the set_directive_inline command, or
    - Inlined automatically by the Vivado High-Level Synthesis (HLS) tool.
  - operation: Specifies that the allocation applies to the operations listed in the instances= list. Refer to Vivado Design Suite User Guide: High-Level Synthesis (UG902) for a complete list of the operations that can be limited using the ALLOCATION pragma.
  - core: Specifies that the ALLOCATION applies to the cores, which are the specific hardware components used to create the design (such as adders, multipliers, pipelined multipliers, and block RAM). The actual core to use is specified in the instances= option. In the case of cores, you can specify which the tool should use, or you can define a limit for the specified core.
Example 1
Given a design with multiple instances of function foo, this example limits the number of instances of foo in the RTL for the hardware kernel to 2.
#pragma HLS allocation instances=foo limit=2 function
Example 2
Limits the number of multiplier operations used in the implementation of the function my_func to 1. This limit does not apply to any multipliers outside of my_func, or multipliers that might reside in sub-functions of my_func.
void my_func(data_t angle) {
#pragma HLS allocation instances=mul limit=1 operation
...
}
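As a fuller illustration of the function type, the following sketch limits a small helper to a single RTL instance even though it is called four times. The names scale and sum_scaled, the data_t typedef, and the arithmetic are assumptions made for this illustration only. A standard C++ compiler ignores the HLS pragmas, so the behavior can also be checked in software.

```cpp
#include <cstdint>

typedef int32_t data_t;  // assumed data type for this sketch

// Hypothetical helper that would otherwise be instantiated once per call.
data_t scale(data_t x) {
#pragma HLS inline off   // keep the function as its own block so the limit can apply
    return x * 3 + 1;
}

// Hypothetical top-level function: the four calls to scale() share one RTL instance.
data_t sum_scaled(data_t a, data_t b, data_t c, data_t d) {
#pragma HLS allocation instances=scale limit=1 function
    return scale(a) + scale(b) + scale(c) + scale(d);
}
```

Sharing one instance should reduce area, at the cost of the calls being scheduled one after another.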
See Also
- pragma HLS function_instantiate
- pragma HLS inline
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS array_map
Description
Combines multiple smaller arrays into a single large array to help reduce block RAM resources.
Designers typically use the pragma HLS array_map command (with the same instance= target) to combine multiple smaller arrays into a single larger array. This larger array can then be targeted to a single larger memory (RAM or FIFO) resource.
Each array is mapped into a block RAM or UltraRAM, when supported by the device. The basic block RAM unit provided in an FPGA is 18K. If many small arrays do not use the full 18K, a better use of the block RAM resources is to map many small arrays into a single larger array.
The ARRAY_MAP pragma supports two ways of mapping small arrays into a larger one:
- Horizontal mapping: this corresponds to creating a new array by concatenating the original arrays. Physically, this gets implemented as a single array with more elements.
- Vertical mapping: this corresponds to creating a new array by concatenating the original words in the array. Physically, this gets implemented as a single array with a larger bit-width.
The arrays are concatenated in the order that the pragmas are specified, starting at:
- Target element zero for horizontal mapping, or
- Bit zero for vertical mapping.
Syntax
Place the pragma in the C source within the boundaries of the function where the array variable is defined.
#pragma HLS array_map variable=<name> instance=<instance> \
<mode> offset=<int>
Where:
variable=<name>
: A required argument that specifies the array variable to be mapped into the new target array <instance>.
instance=<instance>
: Specifies the name of the new array to merge arrays into.
<mode>
: Optionally specifies the array map as being either horizontal or vertical.
  - Horizontal mapping is the default <mode>, and concatenates the arrays to form a new array with more elements. Remapping the original N arrays will require N cycles with 1 port block RAM, or ceiling (N/2) cycles with a 2 port block RAM.
  - Vertical mapping concatenates the array to form a new array with longer words. Remapping the original N arrays is similar to the horizontal mapping above except when the same index is used: this will require only 1 cycle.
offset=<int>
: Applies to horizontal type array mapping only. The offset specifies an integer value offset to apply before mapping the array into the new array <instance>. For example:
  - Element 0 of the array variable maps to element <int> of the new target.
  - Other elements map to <int+1>, <int+2>... of the new target.
IMPORTANT: If an offset is not specified, the Vivado High-Level Synthesis (HLS) tool calculates the required offset automatically to avoid overlapping array elements.
Example 1
Arrays array1 and array2 in function foo are mapped into a single array, specified as array3 in the following example:
void foo (...) {
int8 array1[M];
int12 array2[N];
#pragma HLS ARRAY_MAP variable=array1 instance=array3 horizontal
#pragma HLS ARRAY_MAP variable=array2 instance=array3 horizontal
...
loop_1: for(i=0;i<M;i++) {
array1[i] = ...;
array2[i] = ...;
...
}
...
}
Example 2
This example provides a horizontal mapping of array A[10] and array B[15] in function foo into a single new array AB[25].
- Element AB[0] will be the same as A[0].
- Element AB[10] will be the same as B[0] because no offset= option is specified.
- The bit-width of array AB[25] will be the maximum bit-width of either A[10] or B[15].
#pragma HLS array_map variable=A instance=AB horizontal
#pragma HLS array_map variable=B instance=AB horizontal
Example 3
The following example performs a vertical concatenation of arrays C and D into a new array CD, with the bit-width of C and D combined. The number of elements in CD is the maximum of the original arrays, C or D:
#pragma HLS array_map variable=C instance=CD vertical
#pragma HLS array_map variable=D instance=CD vertical
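The index arithmetic behind the two mapping modes can be modeled in plain C++. This is only a software sketch of the mapping described in this section, not the RTL that the tool emits. The array lengths follow Example 2; the 8-bit element width and the helper names are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

const std::size_t A_LEN = 10;  // A[10], as in Example 2
const std::size_t B_LEN = 15;  // B[15], as in Example 2

// Horizontal mapping with no offset: A starts at element zero of AB,
// and B follows directly after it, so B[j] lives at AB[A_LEN + j].
std::size_t horizontal_index_of_A(std::size_t i) { return i; }
std::size_t horizontal_index_of_B(std::size_t j) { return A_LEN + j; }

// Vertical mapping: words at the same index are concatenated into one
// wider word. The first mapped array starts at bit zero.
uint16_t vertical_word(uint8_t first, uint8_t second) {
    return static_cast<uint16_t>(static_cast<uint16_t>(first) |
                                 (static_cast<uint16_t>(second) << 8));
}
```

For instance, horizontal_index_of_B(0) evaluates to 10, which matches the statement that AB[10] is the same as B[0].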
See Also
- pragma HLS array_partition
- pragma HLS array_reshape
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS array_partition
Description
Partitions an array into smaller arrays or individual elements and provides the following:
- Results in RTL with multiple small memories or multiple registers instead of one large memory.
- Effectively increases the amount of read and write ports for the storage.
- Potentially improves the throughput of the design.
- Requires more memory instances or registers.
Syntax
Place the pragma in the C source within the boundaries of the function where the array variable is defined.
#pragma HLS array_partition variable=<name> \
<type> factor=<int> dim=<int>
Where:
variable=<name>
: A required argument that specifies the array variable to be partitioned.
<type>
: Optionally specifies the partition type. The default type is complete. The following types are supported:
  - cyclic: Cyclic partitioning creates smaller arrays by interleaving elements from the original array. The array is partitioned cyclically by putting one element into each new array before coming back to the first array to repeat the cycle until the array is fully partitioned. For example, if factor=3 is used:
    - Element 0 is assigned to the first new array.
    - Element 1 is assigned to the second new array.
    - Element 2 is assigned to the third new array.
    - Element 3 is assigned to the first new array again.
  - block: Block partitioning creates smaller arrays from consecutive blocks of the original array. This effectively splits the array into N equal blocks, where N is the integer defined by the factor= argument.
  - complete: Complete partitioning decomposes the array into individual elements. For a one-dimensional array, this corresponds to resolving a memory into individual registers. This is the default <type>.
factor=<int>
: Specifies the number of smaller arrays that are to be created.
IMPORTANT: For complete type partitioning, the factor is not specified. For block and cyclic partitioning the factor= is required.
dim=<int>
: Specifies which dimension of a multi-dimensional array to partition. Specified as an integer from 0 to <N>, for an array with <N> dimensions:
  - If a value of 0 is used, all dimensions of a multi-dimensional array are partitioned with the specified type and factor options.
  - Any non-zero value partitions only the specified dimension. For example, if a value 1 is used, only the first dimension is partitioned.
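To make the cyclic and block layouts concrete, the following sketch computes which of the new arrays (here called a bank) an element lands in. It is a software model of the description above rather than tool output. The helper names are hypothetical, and the block case is simplified by assuming that the array length is an exact multiple of the factor.

```cpp
#include <cstddef>

// Cyclic: elements are dealt out round-robin across `factor` banks.
std::size_t cyclic_bank(std::size_t index, std::size_t factor) {
    return index % factor;
}

// Block: consecutive elements stay together in one bank.
// Simplifying assumption: `length` is a multiple of `factor`.
std::size_t block_bank(std::size_t index, std::size_t length, std::size_t factor) {
    std::size_t block_size = length / factor;
    return index / block_size;
}
```

With factor=3, cyclic_bank maps elements 0, 1, 2, 3 to banks 0, 1, 2, 0, which is the sequence listed for the cyclic type.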
Example 1
This example partitions the 13 element array, AB[13], into four arrays using block partitioning:
#pragma HLS array_partition variable=AB block factor=4
Because four is not an integer factor of 13:
- Three of the new arrays have three elements each,
- One array has four elements (AB[9:12]).
Example 2
This example partitions dimension two of the two-dimensional array, AB[6][4] into two new arrays of dimension [6][2]:
#pragma HLS array_partition variable=AB block factor=2 dim=2
Example 3
This example partitions the second dimension of the two-dimensional in_local array into individual elements.
int in_local[MAX_SIZE][MAX_DIM];
#pragma HLS ARRAY_PARTITION variable=in_local complete dim=2
See Also
- pragma HLS array_map
- pragma HLS array_reshape
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
- xcl_array_partition
- SDAccel Environment Profiling and Optimization Guide
pragma HLS array_reshape
Description
Combines array partitioning with vertical array mapping.
The ARRAY_RESHAPE pragma combines the effect of ARRAY_PARTITION, breaking an array into smaller arrays, with the effect of the vertical type of ARRAY_MAP, concatenating elements of arrays by increasing bit-widths. This reduces the number of block RAM consumed while providing the primary benefit of partitioning: parallel access to the data. This pragma creates a new array with fewer elements but with greater bit-width, allowing more data to be accessed in a single clock cycle.
Given the following code:
void foo (...) {
int array1[N];
int array2[N];
int array3[N];
#pragma HLS ARRAY_RESHAPE variable=array1 block factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array2 cyclic factor=2 dim=1
#pragma HLS ARRAY_RESHAPE variable=array3 complete dim=1
...
}
The ARRAY_RESHAPE pragma transforms the arrays into the form shown in the following figure:
Syntax
Place the pragma in the C source within the region of a function where the array variable is defined.
#pragma HLS array_reshape variable=<name> \
<type> factor=<int> dim=<int>
Where:
variable=<name>
: A required argument that specifies the array variable to be reshaped.
<type>
: Optionally specifies the partition type. The default type is complete. The following types are supported:
  - cyclic: Cyclic reshaping creates smaller arrays by interleaving elements from the original array. For example, if factor=3 is used, element 0 is assigned to the first new array, element 1 to the second new array, element 2 is assigned to the third new array, and then element 3 is assigned to the first new array again. The final array is a vertical concatenation (word concatenation, to create longer words) of the new arrays into a single array.
  - block: Block reshaping creates smaller arrays from consecutive blocks of the original array. This effectively splits the array into <N> equal blocks where <N> is the integer defined by factor=, and then combines the <N> blocks into a single array with word-width*N.
  - complete: Complete reshaping decomposes the array into temporary individual elements and then recombines them into an array with a wider word. For a one-dimension array this is equivalent to creating a very-wide register (if the original array was N elements of M bits, the result is a register with N*M bits). This is the default type of array reshaping.
factor=<int>
: Specifies the amount to divide the current array by (or the number of temporary arrays to create). A factor of 2 splits the array in half, while doubling the bit-width. A factor of 3 divides the array into three, with triple the bit-width.
IMPORTANT: For complete type partitioning, the factor is not specified. For block and cyclic reshaping the factor= is required.
dim=<int>
: Specifies which dimension of a multi-dimensional array to partition. Specified as an integer from 0 to <N>, for an array with <N> dimensions:
  - If a value of 0 is used, all dimensions of a multi-dimensional array are partitioned with the specified type and factor options.
  - Any non-zero value partitions only the specified dimension. For example, if a value 1 is used, only the first dimension is partitioned.
object
: A keyword relevant for container arrays only. When the keyword is specified the ARRAY_RESHAPE pragma applies to the objects in the container, reshaping all dimensions of the objects within the container, but all dimensions of the container itself are preserved. When the keyword is not specified the pragma applies to the container array and not the objects.
Example 1
Reshapes (partitions and maps) an 8-bit array with 17 elements, AB[17], into a new 32-bit array with five elements using block mapping.
#pragma HLS array_reshape variable=AB block factor=4
Example 2
Reshapes the two-dimensional array AB[6][4] into a new array of dimension [6][2], in which dimension 2 has twice the bit-width:
#pragma HLS array_reshape variable=AB block factor=2 dim=2
Example 3
Reshapes the three-dimensional 8-bit array, AB[4][2][2] in function foo, into a new single element array (a register), 128 bits wide (4*2*2*8):
#pragma HLS array_reshape variable=AB complete dim=0
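The depth and width figures quoted in Examples 1 and 3 can be reproduced with a little arithmetic. The sketch below is a hypothetical software model for a one-dimensional view of the array. The Shape struct and helper names are assumptions, and rounding the depth up is inferred from Example 1 (17 elements become five) rather than taken from a stated rule.

```cpp
#include <cstddef>

struct Shape {
    std::size_t elements;  // depth of the reshaped array
    std::size_t bits;      // width of each reshaped word
};

// Block or cyclic reshape: the depth shrinks by `factor` (rounded up)
// and the word width grows by `factor`.
Shape reshape(std::size_t elements, std::size_t bits, std::size_t factor) {
    Shape s;
    s.elements = (elements + factor - 1) / factor;
    s.bits = bits * factor;
    return s;
}

// Complete reshape: all elements collapse into a single wide word.
Shape reshape_complete(std::size_t elements, std::size_t bits) {
    Shape s;
    s.elements = 1;
    s.bits = elements * bits;
    return s;
}
```

Here reshape(17, 8, 4) gives five 32-bit elements as in Example 1, and reshape_complete(4*2*2, 8) gives one 128-bit word as in Example 3.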
See Also
- pragma HLS array_map
- pragma HLS array_partition
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
- SDAccel Environment Profiling and Optimization Guide
pragma HLS data_pack
Description
Packs the data fields of a struct into a single scalar with a wider word width.
The DATA_PACK pragma is used for packing all the elements of a struct into a single wide vector to reduce the memory required for the variable, while allowing all members of the struct to be read and written to simultaneously. The bit alignment of the resulting new wide-word can be inferred from the declaration order of the struct fields. The first field takes the LSB of the vector, and the final element of the struct is aligned with the MSB of the vector.
If the struct contains arrays, the DATA_PACK pragma performs a similar operation as the ARRAY_RESHAPE pragma and combines the reshaped array with the other elements in the struct. Any arrays declared inside the struct are completely partitioned and reshaped into a wide scalar and packed with other scalar fields. However, a struct cannot be optimized with DATA_PACK and ARRAY_PARTITION or ARRAY_RESHAPE, as those pragmas are mutually exclusive.
IMPORTANT: Exercise caution when using the DATA_PACK optimization on struct objects with large arrays. If an array has 4096 elements of type int, this will result in a vector (and port) of width 4096*32=131072 bits. The Vivado High-Level Synthesis (HLS) tool can create this RTL design; however, it is very unlikely logic synthesis will be able to route this during the FPGA implementation.
In general, Xilinx recommends that you use arbitrary precision (or bit-accurate) data types. Standard C types are based on 8-bit boundaries (8-bit, 16-bit, 32-bit, and 64-bit); however, using arbitrary precision data types in a design lets you specify the exact bit-sizes in the C code prior to synthesis. The bit-accurate widths result in hardware operators that are smaller and faster. This allows more logic to be placed in the FPGA and for the logic to execute at higher clock frequencies. However, the DATA_PACK pragma also lets you align data in the packed struct along 8-bit boundaries, if needed.
If a struct port is to be implemented with an AXI4 interface you should consider using the DATA_PACK <byte_pad> option to automatically align member elements of the struct to 8-bit boundaries. The AXI4-Stream protocol requires that TDATA ports of the IP have a width in multiples of 8. It is a specification violation to define an AXI4-Stream IP with a TDATA port width that is not a multiple of 8; therefore, it is a requirement to round up TDATA widths to byte multiples. Refer to "Interface Synthesis and Structs" in Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Syntax
Place the pragma near the definition of the struct variable to pack:
#pragma HLS data_pack variable=<variable> \
instance=<name> <byte_pad>
Where:
variable=<variable>
: Specifies the variable to be packed.
instance=<name>
: Specifies the name of resultant variable after packing. If no <name> is specified, the input <variable> is used.
<byte_pad>
: Optionally specifies whether to pack data on an 8-bit boundary (8-bit, 16-bit, 24-bit, etc.). The two supported values for this option are:
  - struct_level: Pack the whole struct first, then pad it upward to the next 8-bit boundary.
  - field_level: First pad each individual element (field) of the struct on an 8-bit boundary, then pack the struct.
TIP: Deciding whether multiple fields of data should be concatenated together before (struct_level) or after (field_level) alignment to byte boundaries is generally determined by considering how atomic the data is. Atomic information is data that can be interpreted on its own, whereas non-atomic information is incomplete for the purpose of interpreting the data. For example, atomic data can consist of all the bits of information in a floating point number. However, the exponent bits in the floating point number alone would not be atomic. When packing information into TDATA, generally non-atomic bits of data are concatenated together (regardless of bit width) until they form atomic units. The atomic units are then aligned to byte boundaries using pad bits where necessary.
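The difference between the two <byte_pad> values shows up in the resulting word width. The following sketch is a hypothetical calculation based only on the definitions above; the helper names and the sample field widths (5 and 10 bits) are assumptions.

```cpp
#include <cstddef>

// Round a bit count up to the next multiple of 8.
std::size_t pad_to_byte(std::size_t bits) {
    return (bits + 7) / 8 * 8;
}

// struct_level: concatenate all fields first, then pad the total.
std::size_t struct_level_width(const std::size_t *field_bits, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += field_bits[i];
    }
    return pad_to_byte(total);
}

// field_level: pad every field first, then concatenate the padded fields.
std::size_t field_level_width(const std::size_t *field_bits, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += pad_to_byte(field_bits[i]);
    }
    return total;
}
```

For a struct with a 5-bit and a 10-bit field, this model gives 16 bits for struct_level (15 rounded up) and 24 bits for field_level (8 + 16).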
Example 1
Packs struct array AB[17] with three 8-bit fields (R, G, B) into a new 17 element array of 24 bits.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB[17];
#pragma HLS data_pack variable=AB
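The bit layout of one packed element can be modeled in software. This sketch assumes the ordering described earlier in this section, with the first field at the LSB and the last field at the MSB. The pack_pixel and unpack_pixel helpers are hypothetical and are not part of any HLS library.

```cpp
#include <cstdint>

typedef struct {
    unsigned char R, G, B;
} pixel;

// Model of the 24-bit packed word: R (first field) in bits 7:0,
// G in bits 15:8, B (last field) in bits 23:16.
uint32_t pack_pixel(const pixel &p) {
    return static_cast<uint32_t>(p.R)
         | (static_cast<uint32_t>(p.G) << 8)
         | (static_cast<uint32_t>(p.B) << 16);
}

// Inverse of pack_pixel: split the word back into its three fields.
pixel unpack_pixel(uint32_t word) {
    pixel p;
    p.R = static_cast<unsigned char>(word & 0xFF);
    p.G = static_cast<unsigned char>((word >> 8) & 0xFF);
    p.B = static_cast<unsigned char>((word >> 16) & 0xFF);
    return p;
}
```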
Example 2
Packs struct pointer AB with three 8-bit fields (typedef struct {unsigned char R, G, B;} pixel) in function foo, into a new 24-bit pointer.
typedef struct{
unsigned char R, G, B;
} pixel;
pixel AB;
#pragma HLS data_pack variable=AB
Example 3
In this example the DATA_PACK pragma is specified for the in and out arguments of the rgb_to_hsv function to instruct the compiler to pack the structure on an 8-bit boundary to improve the memory access:
void rgb_to_hsv(RGBcolor* in, // Access global memory as RGBcolor struct-wise
HSVcolor* out, // Access Global Memory as HSVcolor struct-wise
int size) {
#pragma HLS data_pack variable=in struct_level
#pragma HLS data_pack variable=out struct_level
...
}
See Also
- pragma HLS array_partition
- pragma HLS array_reshape
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS dataflow
Description
The DATAFLOW pragma enables task-level pipelining, allowing functions and loops to overlap in their operation, increasing the concurrency of the register transfer level (RTL) implementation, and increasing the overall throughput of the design.
All operations are performed sequentially in a C description. In the absence of any directives that limit resources (such as pragma HLS allocation), the Vivado High-Level Synthesis (HLS) tool seeks to minimize latency and improve concurrency. However, data dependencies can limit this. For example, functions or loops that access arrays must finish all read/write accesses to the arrays before they complete. This prevents the next function or loop that consumes the data from starting operation. The DATAFLOW optimization enables the operations in a function or loop to start operation before the previous function or loop completes all its operations.
When the DATAFLOW pragma is specified, the HLS tool analyzes the dataflow between sequential functions or loops and creates channels (based on ping pong RAMs or FIFOs) that allow consumer functions or loops to start operation before the producer functions or loops have completed. This allows functions or loops to operate in parallel, which decreases latency and improves the throughput of the RTL.
If no initiation interval (number of cycles between the start of one function or loop and the next) is specified, the HLS tool attempts to minimize the initiation interval and start operation as soon as data is available.
TIP: The config_dataflow command specifies the default memory channel and FIFO depth used in dataflow optimization. Refer to the config_dataflow command in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
The following coding styles can prevent or limit the DATAFLOW optimization:
- Single-producer-consumer violations
- Bypassing tasks
- Feedback between tasks
- Conditional execution of tasks
- Loops with multiple exit conditions
Finally, the DATAFLOW optimization has no hierarchical implementation. If a sub-function or loop contains additional tasks that might benefit from the DATAFLOW optimization, you must apply the optimization to the loop, the sub-function, or inline the sub-function.
Syntax
Place the pragma in the C source within the boundaries of the region, function, or loop.
#pragma HLS dataflow
Example 1
Specifies DATAFLOW optimization within the loop wr_loop_j.
wr_loop_j: for (int j = 0; j < TILE_PER_ROW; ++j) {
#pragma HLS DATAFLOW
    wr_buf_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
        wr_buf_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
#pragma HLS PIPELINE
            // should burst TILE_WIDTH in WORD beat
            outFifo >> tile[m][n];
        }
    }
    wr_loop_m: for (int m = 0; m < TILE_HEIGHT; ++m) {
        wr_loop_n: for (int n = 0; n < TILE_WIDTH; ++n) {
#pragma HLS PIPELINE
            outx[TILE_HEIGHT*TILE_PER_ROW*TILE_WIDTH*i+TILE_PER_ROW*TILE_WIDTH*m+TILE_WIDTH*j+n] = tile[m][n];
        }
    }
}
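For a self-contained view of the same idea at function level, the sketch below applies DATAFLOW to two sequential tasks connected by one intermediate array, which respects the single-producer, single-consumer style. The function names, the buffer length N, and the arithmetic are assumptions for illustration. A standard C++ compiler ignores the pragma and simply runs the stages in order.

```cpp
const int N = 8;  // assumed buffer length for this sketch

// Producer task: scales the input into an intermediate buffer.
static void stage_scale(const int in[N], int mid[N]) {
    for (int i = 0; i < N; ++i) {
        mid[i] = in[i] * 2;
    }
}

// Consumer task: reads each intermediate value exactly once, in order.
static void stage_offset(const int mid[N], int out[N]) {
    for (int i = 0; i < N; ++i) {
        out[i] = mid[i] + 1;
    }
}

// Hypothetical top level: with DATAFLOW, the tool may overlap the two
// stages through a channel built from mid.
void scale_then_offset(const int in[N], int out[N]) {
#pragma HLS dataflow
    int mid[N];  // written by one task, read by one task
    stage_scale(in, mid);
    stage_offset(mid, out);
}
```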
See Also
- pragma HLS allocation
- xcl_latency
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
- SDAccel Environment Profiling and Optimization Guide
pragma HLS dependence
Description
The DEPENDENCE pragma is used to provide additional information that can overcome loop-carry dependencies and allow loops to be pipelined (or pipelined with lower intervals).
The Vivado High-Level Synthesis (HLS) tool automatically detects the following dependencies:
- Within loops (loop-independent dependence), or
- Between different iterations of a loop (loop-carry dependence).
These dependencies impact when operations can be scheduled, especially during function and loop pipelining.
- Loop-independent dependence: The same element is accessed in the same loop iteration.
for (i=0;i<N;i++) { A[i]=x; y=A[i]; }
- Loop-carry dependence: The same element is accessed in a different loop iteration.
for (i=1;i<N;i++) { A[i]=A[i-1]*2; }
Under certain complex scenarios, automatic dependence analysis can be too conservative and fail to filter out false dependencies. This can happen with variable-dependent array indexing, or when an external requirement needs to be enforced (for example, two inputs are never the same index). The DEPENDENCE pragma allows you to explicitly specify the dependence and resolve a false dependence.
Syntax
Place the pragma within the boundaries of the function where the dependence is defined.
#pragma HLS dependence variable=<variable> <class> \
<type> <direction> distance=<int> <dependent>
Where:
variable=<variable>
: Optionally specifies the variable to consider for the dependence.
<class>
: Optionally specifies a class of variables in which the dependence needs clarification. Valid values include array or pointer.
TIP: <class> and variable= do not need to be specified together as you can either specify a variable or a class of variables within a function.
<type>
: Valid values include intra or inter. Specifies whether the dependence is:
  - intra: dependence within the same loop iteration. When dependence <type> is specified as intra, and <dependent> is false, the HLS tool might move operations freely within a loop, increasing their mobility and potentially improving performance or area. When <dependent> is specified as true, the operations must be performed in the order specified.
  - inter: dependence between different loop iterations. This is the default <type>. If dependence <type> is specified as inter, and <dependent> is false, it allows the HLS tool to perform operations in parallel if the function or loop is pipelined, or the loop is unrolled, or partially unrolled, and prevents such concurrent operation when <dependent> is specified as true.
<direction>
: Valid values include RAW, WAR, or WAW. This is relevant for loop-carry dependencies only, and specifies the direction for a dependence:
  - RAW (Read-After-Write - true dependence): The read instruction uses a value written by the write instruction.
  - WAR (Write-After-Read - anti dependence): The read instruction gets a value that is overwritten by the write instruction.
  - WAW (Write-After-Write - output dependence): Two write instructions write to the same location, in a certain order.
distance=<int>
: Specifies the inter-iteration distance for array access. Relevant only for loop-carry dependencies where dependence is set to true.
<dependent>
: Specifies whether a dependence needs to be enforced (true) or removed (false). The default is true.
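As a small illustration of the distance= option, the sketch below states a true loop-carry dependence in which each iteration reads the element written two iterations earlier. The function name, the array length LEN, and the choice of II=1 are assumptions; the pragma arguments follow the order given in the syntax above. Whether the tool reaches the requested interval depends on the target device and clock.

```cpp
const int LEN = 16;  // assumed array length for this sketch

// Each iteration reads data[i - 2], which was written two iterations earlier.
void accumulate_stride2(int data[LEN]) {
    for (int i = 2; i < LEN; ++i) {
#pragma HLS pipeline II=1
#pragma HLS dependence variable=data inter RAW distance=2 true
        data[i] = data[i] + data[i - 2];
    }
}
```

Stating the distance tells the scheduler that consecutive iterations do not touch the same element, while the dependence two iterations back is still enforced.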
Example 1
In the following example, the HLS tool does not have any knowledge about the value of cols and conservatively assumes that there is always a dependence between the write to buff_A[1][col] and the read from buff_A[1][col]. In an algorithm such as this, it is unlikely cols will ever be zero, but the HLS tool cannot make assumptions about data dependencies. To overcome this deficiency, you can use the DEPENDENCE pragma to state that there is no dependence between loop iterations (in this case, for both buff_A and buff_B).
void foo(int rows, int cols, ...)
{
    for (row = 0; row < rows + 1; row++) {
        for (col = 0; col < cols + 1; col++) {
#pragma HLS PIPELINE II=1
#pragma HLS dependence variable=buff_A inter false
#pragma HLS dependence variable=buff_B inter false
            if (col < cols) {
                buff_A[2][col] = buff_A[1][col]; // read from buff_A[1][col]
                buff_A[1][col] = buff_A[0][col]; // write to buff_A[1][col]
                buff_B[1][col] = buff_B[0][col];
                temp = buff_A[0][col];
            }
        }
    }
}
Example 2
Removes the dependence between Var1 in the same iterations of loop_1 in function foo.
#pragma HLS dependence variable=Var1 intra false
Example 3
Defines the dependence on all arrays in loop_2 of function foo to inform the HLS tool that all reads must happen after writes (RAW) in the same loop iteration.
#pragma HLS dependence array intra RAW true
See Also
- pragma HLS pipeline
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
- xcl_pipeline_loop
- SDAccel Environment Profiling and Optimization Guide
pragma HLS expression_balance
Description
Sometimes a C-based specification is written with a sequence of operations resulting in a long chain of operations in RTL. With a small clock period, this can increase the latency in the design. By default, the Vivado High-Level Synthesis (HLS) tool rearranges the operations using associative and commutative properties. This rearrangement creates a balanced tree that can shorten the chain, potentially reducing latency in the design at the cost of extra hardware.
The EXPRESSION_BALANCE pragma allows this expression balancing to be disabled, or to be expressly enabled, within a specified scope.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS expression_balance off
Where:
off
: Turns off expression balancing at this location.
TIP: Leaving this option out of the pragma enables expression balancing, which is the default mode.
Example 1
This example explicitly enables expression balancing in function my_func:
void my_func(char inval, char incr) {
#pragma HLS expression_balance
...
}
Example 2
Disables expression balancing within function my_func:
void my_func(char inval, char incr) {
#pragma HLS expression_balance off
...
}
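To see what the pragma acts on, consider a sum of four integers. The sketch below uses hypothetical function names. Both functions return the same value in software; the difference lies only in how the tool is allowed to group the additions in hardware.

```cpp
// With balancing turned off, the additions keep their written order:
// ((a + b) + c) + d, a chain of three dependent adders.
int sum_chain(int a, int b, int c, int d) {
#pragma HLS expression_balance off
    return a + b + c + d;
}

// With the default setting, the tool may regroup the same expression as
// (a + b) + (c + d), a tree that is two adders deep.
int sum_balanced(int a, int b, int c, int d) {
    return a + b + c + d;
}
```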
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS function_instantiate
Description
The FUNCTION_INSTANTIATE pragma is an optimization technique that has the area benefits of maintaining the function hierarchy but provides an additional powerful option: performing targeted local optimizations on specific instances of a function. This can simplify the control logic around the function call and potentially improve latency and throughput.
By default:
- Functions remain as separate hierarchy blocks in the register transfer level (RTL).
- All instances of a function, at the same level of hierarchy, make use of a single RTL implementation (block).
The FUNCTION_INSTANTIATE pragma is used to create a unique RTL implementation for each instance of a function, allowing each instance to be locally optimized according to the function call. This pragma exploits the fact that some inputs to a function may be a constant value when the function is called, and uses this to both simplify the surrounding control structures and produce smaller more optimized function blocks.
Without the FUNCTION_INSTANTIATE pragma, the following code results in a single RTL implementation of function foo_sub for all three instances of the function in foo. Each instance of function foo_sub is implemented in an identical manner. This is fine for function reuse and reducing the area required for each instance call of a function, but means that the control logic inside the function must be more complex to account for the variation in each call of foo_sub.
char foo_sub(char inval, char incr) {
#pragma HLS function_instantiate variable=incr
return inval + incr;
}
void foo(char inval1, char inval2, char inval3,
char *outval1, char *outval2, char * outval3)
{
*outval1 = foo_sub(inval1, 1);
*outval2 = foo_sub(inval2, 2);
*outval3 = foo_sub(inval3, 3);
}
In the code sample above, the FUNCTION_INSTANTIATE pragma results in three different implementations of function foo_sub, each independently optimized for the incr argument, reducing the area and improving the performance of the function. After FUNCTION_INSTANTIATE optimization, foo_sub is effectively transformed into three separate functions, each optimized for the specified values of incr.
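As a rough software analogy of the transformation described above, the sketch below writes out by hand what three specialized copies of foo_sub could look like. The names foo_sub_incr1, foo_sub_incr2, foo_sub_incr3, and specializations_match are hypothetical and only illustrate the idea; the HLS tool generates its own RTL blocks and does not produce C functions like these.

```c
/* Generic version: incr is a run-time argument. */
static char foo_sub(char inval, char incr) {
    return inval + incr;
}

/* Hypothetical specialized copies, one per call site, with incr folded
   in as a constant. The names are illustrative only. */
static char foo_sub_incr1(char inval) { return inval + 1; }
static char foo_sub_incr2(char inval) { return inval + 2; }
static char foo_sub_incr3(char inval) { return inval + 3; }

/* Returns 1 when every specialized copy matches the generic function. */
int specializations_match(char inval) {
    return foo_sub(inval, 1) == foo_sub_incr1(inval)
        && foo_sub(inval, 2) == foo_sub_incr2(inval)
        && foo_sub(inval, 3) == foo_sub_incr3(inval);
}
```

Because each copy sees a constant increment, the adder in each one can be simplified independently, which is the effect the pragma is aiming for.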
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS function_instantiate variable=<variable>
Where:
variable=<variable>
: A required argument that defines the function argument to use as a constant.
Example 1
In the following example, the FUNCTION_INSTANTIATE pragma (placed in function swInt) allows each instance of function swInt to be independently optimized with respect to the maxv function argument:
void swInt(unsigned int *readRefPacked, short *maxr, short *maxc, short *maxv){
#pragma HLS function_instantiate variable=maxv
uint2_t d2bit[MAXCOL];
uint2_t q2bit[MAXROW];
#pragma HLS array_partition variable=d2bit cyclic factor=FACTOR
#pragma HLS array_partition variable=q2bit cyclic factor=FACTOR
intTo2bit<MAXCOL/16>((readRefPacked + MAXROW/16), d2bit);
intTo2bit<MAXROW/16>(readRefPacked, q2bit);
sw(d2bit, q2bit, maxr, maxc, maxv);
}
See Also
- pragma HLS allocation
- pragma HLS inline
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS inline
Description
Removes a function as a separate entity in the hierarchy. After inlining, the function is dissolved into the calling function and no longer appears as a separate level of hierarchy in the register transfer level (RTL). In some cases, inlining a function allows operations within the function to be shared and optimized more effectively with surrounding operations. An inlined function cannot be shared. This can increase area required for implementing the RTL.
The INLINE pragma applies differently to the scope it is defined in depending on how it is specified:
INLINE
- Without arguments, the pragma means that the function it is specified in should be inlined upward into any calling functions or regions.
INLINE OFF
- Specifies that the function it is specified in should NOT be inlined upward into any calling functions or regions. This disables the inline of a specific function that may be automatically inlined, or inlined as part of a region or recursion.
INLINE REGION
- This applies the pragma to the region or the body of the function it is assigned in. It applies downward, inlining the contents of the region or function, but not inlining recursively through the hierarchy.
INLINE RECURSIVE
- This applies the pragma to the region or the body of the function it is assigned in. It applies downward, recursively inlining the contents of the region or function.
By default, inlining is only performed on the next level of function hierarchy, not sub-functions. However, the recursive option lets you specify inlining through levels of the hierarchy.
Syntax
Place the pragma in the C source within the body of the function or region of code.
#pragma HLS inline <region | recursive | off>
Where:
region
: Optionally specifies that all functions in the specified region (or contained within the body of the function) are to be inlined. This applies to the scope of the region.
recursive
: By default, only one level of function inlining is performed, and functions within the specified function are not inlined. The recursive option inlines all functions recursively within the specified function or region.
off
: Disables function inlining to prevent specified functions from being inlined. For example, if recursive is specified in a function, this option can prevent a particular called function from being inlined when all others are.
TIP: The Vivado High-Level Synthesis (HLS) tool automatically inlines small functions. Use the INLINE pragma with the off option to prevent this automatic inlining.
Example 1
This example inlines all functions within the region it is specified in, in this case the body of foo_top, but does not inline any lower level functions within those functions.
void foo_top(a, b, c, d) {
#pragma HLS inline region
...
Example 2
The following example inlines all functions within the body of foo_top, inlining recursively down through the function hierarchy, except that function foo_sub is not inlined. The recursive pragma is placed in function foo_top. The pragma to disable inlining is placed in the function foo_sub:
foo_sub (p, q) {
#pragma HLS inline off
int q1 = q + 10;
foo(p,q1);// foo_3
...
}
void foo_top(a, b, c, d) {
#pragma HLS inline region recursive
...
foo(a,b);//foo_1
foo(a,c);//foo_2
foo_sub(a,d);
...
}
Note: In this example, INLINE applies downward to the contents of function foo_top, but applies upward to the code calling foo_sub.
Example 3
This example inlines the copy_output function into any functions or regions calling copy_output.
void copy_output(int *out, int out_lcl[OSize * OSize], int output) {
#pragma HLS INLINE
// Calculate each work_item's result update location
int stride = output * OSize * OSize;
// Work_item updates output filter/image in DDR
writeOut: for(int itr = 0; itr < OSize * OSize; itr++) {
#pragma HLS PIPELINE
out[stride + itr] = out_lcl[itr];
}
}
See Also
- pragma HLS allocation
- pragma HLS function_instantiate
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS interface
Description
In C-based design, all input and output operations are performed, in zero time, through formal function arguments. In a register transfer level (RTL) design, these same input and output operations must be performed through a port in the design interface and typically operate using a specific input/output (I/O) protocol. For more information, refer to "Managing Interfaces" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
The INTERFACE pragma specifies how RTL ports are created from the function definition during interface synthesis.
The ports in the RTL implementation are derived from the following:
- Any function-level protocol that is specified: Function-level protocols, also called block-level I/O protocols, provide signals to control when the function starts operation, and indicate when function operation ends, is idle, and is ready for new inputs. The implementation of a function-level protocol is:
  - Specified by the <mode> values ap_ctrl_none, ap_ctrl_hs, or ap_ctrl_chain. The ap_ctrl_hs block-level I/O protocol is the default.
  - Associated with the function name.
- Function arguments: Each function argument can be specified to have its own port-level (I/O) interface protocol, such as valid handshake (ap_vld), or acknowledge handshake (ap_ack). Port-level interface protocols are created for each argument in the top-level function and the function return, if the function returns a value. The default I/O protocol created depends on the type of C argument. After the block-level protocol has been used to start the operation of the block, the port-level I/O protocols are used to sequence data into and out of the block.
- Global variables accessed by the top-level function, and defined outside its scope:
  - If a global variable is accessed, but all read and write operations are local to the function, the resource is created in the RTL design. There is no need for an I/O port in the RTL. If the global variable is expected to be an external source or destination, specify its interface in a similar manner as standard function arguments. See the Examples below.
When the INTERFACE pragma is used on sub-functions, only the register option can be used. The <mode> option is not supported on sub-functions.
Specifying Burst Mode
When specifying burst mode for interfaces using the max_read_burst_length or max_write_burst_length options (as described in the Syntax section), there are limitations and related considerations that are derived from the AXI standard:
- The burst length should be less than, or equal to 256 words per transaction, because ARLEN & AWLEN are 8 bits; the actual burst length is AxLEN+1.
- In total, no more than 4 KB is transferred per burst transaction.
- Do not cross the 4 KB address boundary.
- The bus width is specified as a power of 2, between 32-bits and 512-bits (i.e. 32, 64, 128, 256, 512 bits) or in bytes: 4, 8, 16, 32, 64.
With the 4KB limit, the max burst length for a bus width of:
- 32-bits is 256 words transferred in a single burst transaction. In this case, the total bytes transferred per transaction would be 1024.
- 64-bits is 256 words transferred in a single burst transaction. The total bytes transferred per transaction would be 2048.
- 128-bits is 256 words transferred in a single burst transaction. The total bytes transferred per transaction would be 4096.
- 256-bits is 128 words transferred in a single burst transaction. The total bytes transferred per transaction would be 4096.
- 512-bits is 64 words transferred in a single burst transaction. The total bytes transferred per transaction would be 4096.
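The values in the list above follow from applying both limits at once: at most 256 words per burst and at most 4 KB per burst. The helper below is a minimal sketch of that arithmetic; the function name max_burst_words is an assumption made for illustration and is not part of any Xilinx API.

```c
/* Maximum AXI burst length in words for a given bus width in bits,
   applying the two limits listed above: at most 256 words per burst
   and at most 4096 bytes (4 KB) per burst. */
unsigned max_burst_words(unsigned bus_width_bits) {
    const unsigned max_words = 256;   /* AxLEN is 8 bits, length is AxLEN+1 */
    const unsigned max_bytes = 4096;  /* 4 KB per burst transaction */
    unsigned bytes_per_word = bus_width_bits / 8;
    unsigned words_by_bytes = max_bytes / bytes_per_word;
    return words_by_bytes < max_words ? words_by_bytes : max_words;
}
```

For bus widths up to 128 bits the 256-word limit dominates; for wider buses the 4 KB limit does.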
Note: The actual transactions depend on the design. For example, pipelined accesses from a for-loop of 100 iterations, when max_read_burst_length or max_write_burst_length is set to 128, will not fill the max burst length.
However, if the design is doing longer accesses in the source code than the specified maximum burst length, the access will be split into smaller bursts. For example, a pipelined for-loop with 100 accesses and max_read_burst_length or max_write_burst_length set to 64 will be split into 2 transactions: one of the max burst length (64 words) and one transaction of the remaining data (a burst of 36 words).
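The splitting described above is a ceiling division plus a remainder. The following sketch reproduces the 100-access example; burst_count and last_burst_length are hypothetical helper names used only for illustration.

```c
/* Number of burst transactions needed for `accesses` consecutive words
   when each burst carries at most `max_burst_length` words. */
unsigned burst_count(unsigned accesses, unsigned max_burst_length) {
    return (accesses + max_burst_length - 1) / max_burst_length;
}

/* Length in words of the final burst. */
unsigned last_burst_length(unsigned accesses, unsigned max_burst_length) {
    unsigned rem = accesses % max_burst_length;
    return (rem == 0 && accesses > 0) ? max_burst_length : rem;
}
```

With 100 accesses and a maximum burst length of 64, this gives 2 bursts with a final burst of 36 words. With a maximum of 128, it gives a single burst of 100 words.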
Syntax
Place the pragma within the boundaries of the function.
#pragma HLS interface <mode> port=<name> bundle=<string> \
register register_mode=<mode> depth=<int> offset=<string> \
clock=<string> name=<string> \
num_read_outstanding=<int> num_write_outstanding=<int> \
max_read_burst_length=<int> max_write_burst_length=<int>
Where:
- <mode>: Specifies the interface protocol mode for function arguments, global variables used by the function, or the block-level control protocols. For detailed descriptions of these different modes see "Interface Synthesis Reference" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902). The mode can be specified as one of the following:
  - ap_none: No protocol. The interface is a data port.
  - ap_stable: No protocol. The interface is a data port. The HLS tool assumes the data port is always stable after reset, which allows internal optimizations to remove unnecessary registers.
  - ap_vld: Implements the data port with an associated valid port to indicate when the data is valid for reading or writing.
  - ap_ack: Implements the data port with an associated acknowledge port to acknowledge that the data was read or written.
  - ap_hs: Implements the data port with associated valid and acknowledge ports to provide a two-way handshake to indicate when the data is valid for reading and writing and to acknowledge that the data was read or written.
  - ap_ovld: Implements the output data port with an associated valid port to indicate when the data is valid for reading or writing. IMPORTANT: The HLS tool implements the input argument or the input half of any read/write arguments with mode ap_none.
  - ap_fifo: Implements the port with a standard FIFO interface using data input and output ports with associated active-Low FIFO empty and full ports. Note: You can only use this interface on read arguments or write arguments. The ap_fifo mode does not support bidirectional read/write arguments.
  - ap_bus: Implements pointer and pass-by-reference ports as a bus interface.
  - ap_memory: Implements array arguments as a standard RAM interface. If you use the RTL design in the Vivado IP integrator, the memory interface appears as discrete ports.
  - bram: Implements array arguments as a standard RAM interface. If you use the RTL design in the IP integrator, the memory interface appears as a single port.
  - axis: Implements all ports as an AXI4-Stream interface.
  - s_axilite: Implements all ports as an AXI4-Lite interface. The HLS tool produces an associated set of C driver files during the Export RTL process.
  - m_axi: Implements all ports as an AXI4 interface. You can use the config_interface command to specify either 32-bit (default) or 64-bit address ports and to control any address offset.
  - ap_ctrl_none: No block-level I/O protocol. Note: Using the ap_ctrl_none mode might prevent the design from being verified using the C/RTL co-simulation feature.
  - ap_ctrl_hs: Implements a set of block-level control ports to start the design operation and to indicate when the design is idle, done, and ready for new input data. Note: The ap_ctrl_hs mode is the default block-level I/O protocol.
  - ap_ctrl_chain: Implements a set of block-level control ports to start the design operation, continue operation, and indicate when the design is idle, done, and ready for new input data. Note: The ap_ctrl_chain interface mode is similar to ap_ctrl_hs but provides an additional input signal ap_continue to apply back pressure. Xilinx recommends using the ap_ctrl_chain block-level I/O protocol when chaining the HLS tool blocks together.
- port=<name>: Specifies the name of the function argument, function return, or global variable which the INTERFACE pragma applies to. TIP: Block-level I/O protocols (ap_ctrl_none, ap_ctrl_hs, or ap_ctrl_chain) can be assigned to a port for the function return value.
- bundle=<string>: Groups function arguments into AXI interface ports. By default, the HLS tool groups all function arguments specified as an AXI4-Lite (s_axilite) interface into a single AXI4-Lite port. Similarly, all function arguments specified as an AXI4 (m_axi) interface are grouped into a single AXI4 port. This option explicitly groups all interface ports with the same bundle=<string> into the same AXI interface port and names the RTL port the value specified by <string>. IMPORTANT: When specifying the bundle= name you should use all lower-case characters.
- register: An optional keyword to register the signal and any relevant protocol signals, and causes the signals to persist until at least the last cycle of the function execution. This option applies to the following interface modes:
  - ap_none
  - ap_ack
  - ap_vld
  - ap_ovld
  - ap_hs
  - ap_stable
  - axis
  - s_axilite
  TIP: The -register_io option of the config_interface command globally controls registering all inputs/outputs on the top function. Refer to the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
- register_mode=<forward|reverse|both|off>: Used with the register keyword, this option specifies if registers are placed on the forward path (TDATA and TVALID), the reverse path (TREADY), on both paths (TDATA, TVALID, and TREADY), or if none of the port signals are to be registered (off). The default register_mode is both. AXI4-Stream (axis) side-channel signals are considered to be data signals and are registered whenever the TDATA is registered.
- depth=<int>: Specifies the maximum number of samples for the test bench to process. This setting indicates the maximum size of the FIFO needed in the verification adapter that the HLS tool creates for RTL co-simulation. TIP: While depth is usually an option, it is required for m_axi interfaces.
- offset=<string>: Controls the address offset in AXI4-Lite (s_axilite) and AXI4 (m_axi) interfaces.
  - For the s_axilite interface, <string> specifies the address in the register map.
  - For the m_axi interface, <string> specifies one of the following values:
    - direct: Generate a scalar input offset port.
    - slave: Generate an offset port and automatically map it to an AXI4-Lite slave interface.
    - off: Do not generate an offset port.
  TIP: The -m_axi_offset option of the config_interface command globally controls the offset ports of all M_AXI interfaces in the design.
- clock=<name>: Optionally specified only for interface mode s_axilite. This defines the clock signal to use for the interface. By default, the AXI4-Lite interface clock is the same clock as the system clock. This option is used to specify a separate clock for the AXI4-Lite (s_axilite) interface. TIP: If the bundle option is used to group multiple top-level function arguments into a single AXI4-Lite interface, the clock option need only be specified on one of the bundle members.
- latency=<value>: When mode is m_axi, this specifies the expected latency of the AXI4 interface, allowing the design to initiate a bus request a number of cycles (latency) before the read or write is expected. If this figure is too low, the design will be ready too soon and may stall waiting for the bus. If this figure is too high, bus access may be granted but the bus may stall waiting on the design to start the access.
- num_read_outstanding=<int>: For AXI4 (m_axi) interfaces, this option specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design, a FIFO of size: num_read_outstanding * max_read_burst_length * word_size.
- num_write_outstanding=<int>: For AXI4 (m_axi) interfaces, this option specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls. This implies internal storage in the design, a FIFO of size: num_write_outstanding * max_write_burst_length * word_size.
- max_read_burst_length=<int>: For AXI4 (m_axi) interfaces, this option specifies the maximum number of data values read during a burst transfer.
- max_write_burst_length=<int>: For AXI4 (m_axi) interfaces, this option specifies the maximum number of data values written during a burst transfer. TIP: If the port is a read-only port, then set num_write_outstanding=1 and max_write_burst_length=2 to conserve memory resources. For write-only ports, set num_read_outstanding=1 and max_read_burst_length=2.
- name=<string>: This option is used to rename the port based on your own specification. The generated RTL port will use this name.
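The internal storage implied by the outstanding-request options can be estimated directly from the FIFO size formula given in their descriptions. The sketch below assumes word_size is measured in bytes, and axi_fifo_bytes is a hypothetical helper name; the argument values in the usage note are illustrative and are not tool defaults.

```c
/* Internal FIFO storage implied by an m_axi interface, following the
   formula in the option descriptions:
   outstanding requests * max burst length * word size.
   word_size_bytes is an assumed unit (bytes per data word). */
unsigned long axi_fifo_bytes(unsigned outstanding,
                             unsigned max_burst_length,
                             unsigned word_size_bytes) {
    return (unsigned long)outstanding * max_burst_length * word_size_bytes;
}
```

For instance, 16 outstanding reads of up to 16 words of 4 bytes each imply 1024 bytes of storage, while the read-only setting num_write_outstanding=1 with max_write_burst_length=2 reduces the write side to 8 bytes for the same word size.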
Example 1
In this example, both function arguments are implemented using an AXI4-Stream interface:
void example(int A[50], int B[50]) {
//Set the HLS native interface types
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
int i;
for(i = 0; i < 50; i++){
B[i] = A[i] + 5;
}
}
Example 2
The following turns off block-level I/O protocols, and is assigned to the function return value:
#pragma HLS interface ap_ctrl_none port=return
The function argument InData is specified to use the ap_vld interface, and also indicates the input should be registered:
#pragma HLS interface ap_vld register port=InData
This exposes the global variable lookup_table as a port on the RTL design, with an ap_memory interface:
#pragma HLS interface ap_memory port=lookup_table
Example 3
This example defines the INTERFACE standards for the ports of the top-level transpose function. Notice the use of the bundle= option to group signals.
// TOP LEVEL - TRANSPOSE
void transpose(int* input, int* output) {
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=input bundle=control
#pragma HLS INTERFACE s_axilite port=output bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS dataflow
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS latency
Description
Specifies a minimum or maximum latency value, or both, for the completion of functions, loops, and regions.
- Latency
- Number of clock cycles required to produce an output.
- Function latency
- Number of clock cycles required for the function to compute all output values, and return.
- Loop latency
- Number of cycles to execute all iterations of the loop.
See "Performance Metrics Example" of Vivado Design Suite User Guide: High-Level Synthesis (UG902).
The Vivado High-Level Synthesis (HLS) tool always tries to minimize latency in the design. When the LATENCY pragma is specified, the tool behavior is as follows:
- Latency is greater than the minimum, or less than the maximum: The constraint is satisfied. No further optimizations are performed.
- Latency is less than the minimum: If the HLS tool can achieve less than the minimum specified latency, it extends the latency to the specified value, potentially increasing sharing.
- Latency is greater than the maximum: If the HLS tool cannot schedule within the maximum limit, it increases effort to achieve the specified constraint. If it still fails to meet the maximum latency, it issues a warning, and produces a design with the smallest achievable latency in excess of the maximum.
Syntax
Place the pragma within the boundary of a function, loop, or region of code where the latency must be managed.
#pragma HLS latency min=<int> max=<int>
Where:
min=<int>
: Optionally specifies the minimum latency for the function, loop, or region of code.
max=<int>
: Optionally specifies the maximum latency for the function, loop, or region of code.
Note: Although both min and max are described as optional, one must be specified.
Example 1
Function foo is specified to have a minimum latency of 4 and a maximum latency of 8:
int foo(char x, char a, char b, char c) {
#pragma HLS latency min=4 max=8
char y;
y = x*a+b+c;
return y;
}
Example 2
In the following example, loop_1 is specified to have a maximum latency of 12. Place the pragma in the loop body as shown:
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS latency max=12
...
result = a + b;
}
}
Example 3
The following example creates a code region and groups signals that need to change in the same clock cycle by specifying zero latency:
// create a region { } with a latency = 0
{
#pragma HLS LATENCY max=0 min=0
*data = 0xFF;
*data_vld = 1;
}
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_flatten
Description
Allows nested loops to be flattened into a single loop hierarchy with improved latency.
In the register transfer level (RTL) implementation, it requires one clock cycle to move from an outer loop to an inner loop, and from an inner loop to an outer loop. Flattening nested loops allows them to be optimized as a single loop. This saves clock cycles, potentially allowing for greater optimization of the loop body logic.
Apply the LOOP_FLATTEN pragma to the loop body of the inner-most loop in the loop hierarchy. Only perfect and semi-perfect loops can be flattened in this manner:
- Perfect loop nests:
- Only the innermost loop has loop body content.
- There is no logic specified between the loop statements.
- All loop bounds are constant.
- Semi-perfect loop nests:
- Only the innermost loop has loop body content.
- There is no logic specified between the loop statements.
- The outermost loop bound can be a variable.
- Imperfect loop nests: When the inner loop has variable bounds (or the loop body is not exclusively inside the inner loop), try to restructure the code, or unroll the loops in the loop body to create a perfect loop nest.
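To make the perfect-nest conditions concrete, the sketch below shows a perfect loop nest next to a hand-flattened equivalent. The function names, array sizes, and loop labels are assumptions for illustration. The hand-flattened version only shows the intent of the optimization in software terms; it is not the code the HLS tool generates.

```c
#define ROWS 4
#define COLS 8

/* Perfect loop nest: constant bounds, and only the innermost loop
   has body content. The pragma sits in the innermost loop body. */
int sum_nested(int a[ROWS][COLS]) {
    int sum = 0;
    row_loop: for (int i = 0; i < ROWS; i++) {
        col_loop: for (int j = 0; j < COLS; j++) {
#pragma HLS loop_flatten
            sum += a[i][j];
        }
    }
    return sum;
}

/* Hand-written equivalent of the flattened form: a single loop over
   ROWS * COLS iterations, with the indices recovered arithmetically. */
int sum_flat(int a[ROWS][COLS]) {
    int sum = 0;
    for (int k = 0; k < ROWS * COLS; k++) {
        sum += a[k / COLS][k % COLS];
    }
    return sum;
}
```

A standard C compiler ignores the HLS pragma, so both functions can be compared in a software test bench before synthesis.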
Syntax
Place the pragma in the C source within the boundaries of the nested loop.
#pragma HLS loop_flatten off
Where:
off
: Is an optional keyword that prevents flattening from taking place. Can prevent some loops from being flattened while all others in the specified location are flattened.
Note: The presence of the LOOP_FLATTEN pragma enables the optimization.
Example 1
Flattens loop_1 in function foo and all (perfect or semi-perfect) loops above it in the loop hierarchy, into a single loop. Place the pragma in the body of loop_1.
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_flatten
...
result = a + b;
}
}
Example 2
Prevents loop flattening in loop_1:
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_flatten off
...
See Also
- pragma HLS loop_merge
- pragma HLS loop_tripcount
- pragma HLS unroll
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_merge
Description
Merge consecutive loops into a single loop to reduce overall latency, increase sharing, and improve logic optimization. Merging loops:
- Reduces the number of clock cycles required in the register transfer level (RTL) to transition between the loop-body implementations.
- Allows the loops to be implemented in parallel (if possible).
The LOOP_MERGE pragma will seek to merge all loops within the scope it is placed. For example, if you apply a LOOP_MERGE pragma in the body of a loop, the Vivado High-Level Synthesis (HLS) tool applies the pragma to any sub-loops within the loop but not to the loop itself.
The rules for merging loops are:
- If the loop bounds are variables, they must have the same value (number of iterations).
- If the loop bounds are constants, the maximum constant value is used as the bound of the merged loop.
- Loops with both variable bounds and constant bounds cannot be merged.
- The code between loops to be merged cannot have side effects. Multiple execution of this code should generate the same results (a=b is allowed, a=a+1 is not).
- Loops cannot be merged when they contain FIFO reads. Merging changes the order of the reads. Reads from a FIFO or FIFO interface must always be in sequence.
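The sketch below illustrates the simplest case covered by these rules: two consecutive loops with the same constant bound and no code between them. The function names, the N_SAMPLES constant, and the loop labels are hypothetical. The second function is a hand-merged equivalent, shown only to make the intended effect visible in plain C.

```c
#define N_SAMPLES 16

/* Two consecutive loops with the same constant bound and no
   intervening code, which makes them candidates for merging. */
void scale_then_offset(const int in[N_SAMPLES], int a[N_SAMPLES], int b[N_SAMPLES]) {
#pragma HLS loop_merge
    loop_a: for (int i = 0; i < N_SAMPLES; i++) {
        a[i] = in[i] * 2;
    }
    loop_b: for (int i = 0; i < N_SAMPLES; i++) {
        b[i] = in[i] + 3;
    }
}

/* Hand-merged equivalent: one loop performs both bodies per iteration. */
void scale_and_offset_merged(const int in[N_SAMPLES], int a[N_SAMPLES], int b[N_SAMPLES]) {
    for (int i = 0; i < N_SAMPLES; i++) {
        a[i] = in[i] * 2;
        b[i] = in[i] + 3;
    }
}
```

Both versions produce the same outputs because neither loop body depends on results from the other.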
Syntax
Place the pragma in the C source within the required scope or region of code:
#pragma HLS loop_merge force
Where:
force
: An optional keyword to force loops to be merged even when the HLS tool issues a warning.
IMPORTANT: In this case, you must manually ensure that the merged loop will function correctly.
Examples
Merges all consecutive loops in function foo into a single loop.
void foo (num_samples, ...) {
#pragma HLS loop_merge
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
...
All loops inside loop_2 (but not loop_2 itself) are merged by using the force option. Place the pragma in the body of loop_2.
loop_2: for(i=0;i< num_samples;i++) {
#pragma HLS loop_merge force
...
See Also
- pragma HLS loop_flatten
- pragma HLS loop_tripcount
- pragma HLS unroll
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS loop_tripcount
Description
The LOOP_TRIPCOUNT pragma can be applied to a loop to manually specify the total number of iterations performed by a loop.
The Vivado High-Level Synthesis (HLS) tool reports the total latency of each loop, which is the number of clock cycles to execute all iterations of the loop. The loop latency is therefore a function of the number of loop iterations, or tripcount.
The tripcount can be a constant value. It may depend on the value of variables used in the loop expression (for example, x<y), or depend on control statements used inside the loop. In some cases, the HLS tool cannot determine the tripcount, and the latency is unknown. This includes cases in which the variables used to determine the tripcount are:
- Input arguments or
- Variables calculated by dynamic operation.
In cases where the loop latency is unknown or cannot be calculated, the LOOP_TRIPCOUNT pragma lets you specify minimum and maximum iterations for a loop. This allows the tool to analyze how the loop latency contributes to the total design latency in the reports, and helps you determine appropriate optimizations for the design.
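As a rough illustration of how tripcount bounds turn into a latency range, the sketch below assumes a non-pipelined loop whose latency is simply the tripcount multiplied by the cycles per iteration. That model, the latency_range type, and the loop_latency function are assumptions made for this example; the figures the HLS tool reports come from its own scheduling.

```c
/* Loop latency range for a non-pipelined loop, modeled as
   tripcount * cycles per iteration (an assumed simplification). */
typedef struct {
    unsigned min_cycles;
    unsigned max_cycles;
} latency_range;

latency_range loop_latency(unsigned trip_min, unsigned trip_max,
                           unsigned cycles_per_iteration) {
    latency_range r;
    r.min_cycles = trip_min * cycles_per_iteration;
    r.max_cycles = trip_max * cycles_per_iteration;
    return r;
}
```

Under this model, a loop with min=12, max=16, and a hypothetical 3 cycles per iteration would be reported with a latency between 36 and 48 cycles.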
Syntax
Place the pragma in the C source within the body of the loop:
#pragma HLS loop_tripcount min=<int> max=<int> avg=<int>
Where:
max=<int>
: Specifies the maximum number of loop iterations.
min=<int>
: Specifies the minimum number of loop iterations.
avg=<int>
: Specifies the average number of loop iterations.
Examples
In the following example, loop_1 in function foo is specified to have a minimum tripcount of 12 and a maximum tripcount of 16:
void foo (num_samples, ...) {
int i;
...
loop_1: for(i=0;i< num_samples;i++) {
#pragma HLS loop_tripcount min=12 max=16
...
result = a + b;
}
}
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS occurrence
Description
When pipelining functions or loops, the OCCURRENCE pragma specifies that the code in a region is executed less frequently than the code in the enclosing function or loop. This allows the code that is executed less often to be pipelined at a slower rate, and potentially shared within the top-level pipeline. To determine the OCCURRENCE:
- A loop iterates <N> times.
- However, part of the loop body is enabled by a conditional statement, and as a result only executes <M> times, where <N> is an integer multiple of <M>.
- The conditional code has an occurrence that is N/M times slower than the rest of the loop body.
For example, in a loop that executes 10 times, a conditional statement within the loop that only executes two times has an occurrence of 5 (or 10/2).
Identifying a region with the OCCURRENCE pragma allows the functions and loops in that region to be pipelined with a higher initiation interval that is slower than the enclosing function or loop.
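The N/M ratio can be checked in a software model by counting executions. In the sketch below, the function name occurrence_of_region and the every-fifth-iteration condition are assumptions chosen to reproduce the 10/2 example; the comment marks where the OCCURRENCE pragma would be placed in a real kernel.

```c
/* Counts how often a loop body (N) and a conditional region inside it (M)
   execute, and returns the occurrence N/M. The condition used here
   (every fifth iteration) is only an illustration. */
int occurrence_of_region(int iterations) {
    int n = 0;  /* executions of the enclosing loop body */
    int m = 0;  /* executions of the conditional region */
    for (int i = 0; i < iterations; i++) {
        n++;
        if (i % 5 == 0) {
            /* Cond_Region: "#pragma HLS occurrence cycle=5" would go here */
            m++;
        }
    }
    return (m > 0) ? n / m : 0;
}
```

With 10 iterations, the region runs twice and the function returns 5, matching the cycle value that would be given to the pragma.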
Syntax
Place the pragma in the C source within a region of code.
#pragma HLS occurrence cycle=<int>
Where:
cycle=<int>
: Specifies the occurrence N/M:
- <N> is the number of times the enclosing function or loop is executed.
- <M> is the number of times the conditional region is executed.
IMPORTANT: <N> must be an integer multiple of <M>.
Examples
In this example, the region Cond_Region has an occurrence of 4 (it executes at a rate four times less often than the surrounding code that contains it):
Cond_Region: {
#pragma HLS occurrence cycle=4
...
}
See Also
- pragma HLS pipeline
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS pipeline
Description
The PIPELINE pragma reduces the initiation interval (II) for a function or loop by allowing the concurrent execution of operations.
A pipelined function or loop can process new inputs every <N> clock cycles, where <N> is the II of the loop or function. The default II for the PIPELINE pragma is 1, which processes a new input every clock cycle. You can also specify the initiation interval through the use of the II option for the pragma.
Pipelining a loop allows the operations of the loop to be implemented in a concurrent manner as shown in the following figure. In the following figure, (A) shows the default sequential operation where there are 3 clock cycles between each input read (II=3), and it requires 8 clock cycles before the last output write is performed.
If the Vivado High-Level Synthesis (HLS) tool cannot create a design with the specified II, it issues a warning and creates a design with the lowest possible II.
You can then analyze this design with the warning message to determine what steps must be taken to create a design that satisfies the required initiation interval.
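The effect of the initiation interval on total execution time can be approximated with a simple model: the first iteration takes its full depth, and each later iteration starts II cycles after the previous one. The function pipeline_cycles and this formula are a simplified sketch under those assumptions, not a description of how the HLS tool computes its reports.

```c
/* Cycles to finish `iterations` loop iterations when each iteration
   takes `depth` cycles and a new iteration starts every `ii` cycles.
   This is a simplified model, not a tool report. */
unsigned pipeline_cycles(unsigned iterations, unsigned depth, unsigned ii) {
    if (iterations == 0) {
        return 0;
    }
    return depth + ii * (iterations - 1);
}
```

For three iterations of a three-cycle body (illustrative values), the model gives 9 cycles with II=3 and 5 cycles with II=1.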
Syntax
Place the pragma in the C source within the body of the function or loop.
#pragma HLS pipeline II=<int> enable_flush rewind
Where:
II=<int>
: Specifies the desired initiation interval for the pipeline. The HLS tool tries to meet this request. Based on data dependencies, the actual result might have a larger initiation interval. The default II is 1.
enable_flush
: Optional keyword that implements a pipeline that will flush and empty if the data valid at the input of the pipeline goes inactive.
TIP: This feature is only supported for pipelined functions; it is not supported for pipelined loops.
rewind
: Optional keyword that enables rewinding, or continuous loop pipelining with no pause between one loop iteration ending and the next iteration starting. Rewinding is effective only if there is one single loop (or a perfect loop nest) inside the top-level function. The code segment before the loop:
- Is considered as initialization.
- Is executed only once in the pipeline.
- Cannot contain any conditional operations (if-else).
TIP: This feature is only supported for pipelined loops; it is not supported for pipelined functions.
Example 1
In this example, function foo is pipelined with an initiation interval of 1:
void foo(a, b, c, d) {
#pragma HLS pipeline II=1
    ...
}
See Also
- pragma HLS dependence
- xcl_pipeline_loop
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
- SDAccel Environment Profiling and Optimization Guide
pragma HLS reset
Description
The RESET pragma adds or removes resets for specific state variables (global or static).
The reset port is used in an FPGA to restore the registers and block RAM connected to the reset port to an initial value any time the reset signal is applied. The presence and behavior of the register transfer level (RTL) reset port is controlled using the config_rtl configuration file. The reset settings include the ability to set the polarity of the reset, and specify whether the reset is synchronous or asynchronous, but more importantly it controls, through the reset option, which registers are reset when the reset signal is applied. See "Clock, Reset, and RTL Output" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Greater control over reset is provided through the RESET pragma. If a variable is static or global, the RESET pragma is used to explicitly add a reset, or the variable can be removed from the reset by turning off the pragma. This can be particularly useful when static or global arrays are present in the design.
Syntax
Place the pragma in the C source within the boundaries of the variable life cycle.
#pragma HLS reset variable=<a> off
Where:
variable=<a>: Specifies the variable to which the pragma is applied.
off: Indicates that reset is not generated for the specified variable.
Example 1
This example adds reset to the variable a in function foo even when the global reset setting is none or control:
void foo(int in[3], char a, char b, char c, int out[3]) {
#pragma HLS reset variable=a
    ...
}
Example 2
This example removes reset from variable a in function foo even when the global reset setting is state or all:
void foo(int in[3], char a, char b, char c, int out[3]) {
#pragma HLS reset variable=a off
    ...
}
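Because the RESET pragma targets static or global state, a sketch with an explicit static variable may make the placement clearer. The function `accumulate` and the variable `acc` are hypothetical names used for illustration; the pragma is placed after the declaration, within the variable's life cycle.

```cpp
// Hypothetical accumulator: `acc` is a static state variable that keeps
// its value between calls.
int accumulate(int x) {
    static int acc = 0;
#pragma HLS reset variable=acc
    // Requests a reset for acc regardless of the global reset setting.
    acc += x;
    return acc;
}
```

Appending `off` to the same pragma line would instead request that no reset is generated for `acc`.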
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS resource
Description
The RESOURCE pragma specifies that a specific library resource (core) is used to implement a variable (array, arithmetic operation, or function argument) in the register transfer level (RTL). If the RESOURCE pragma is not specified, the Vivado High-Level Synthesis (HLS) tool determines the resource to use. The HLS tool implements the operations in the code using hardware cores. When multiple cores in the library can implement the operation, you can specify which core to use with the RESOURCE pragma. To generate a list of available cores, use the list_core command.
The list_core command obtains details on the cores available in the library. The list_core command can only be used in the HLS tool Tcl command interface, and a Xilinx device must be specified using the set_part command. If a device has not been selected, the list_core command does not have any effect.
For example, to specify which memory element in the library to use to implement an array, use the RESOURCE pragma. This allows you to control whether the array is implemented as a single or a dual-port RAM. This usage is important for arrays on the top-level function interface, because the memory type associated with the array determines the ports needed in the RTL.
You can use the latency= option to specify the latency of the core. For block RAMs on the interface, the latency= option allows you to model off-chip, non-standard SRAMs at the interface, for example supporting an SRAM with a latency of 2 or 3. For internal operations, the latency= option allows the operation to be implemented using more pipelined stages. These additional pipeline stages can help resolve timing issues during RTL synthesis.
For more information, see "Arrays on the Interface" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
To use the latency= option, the operation must have an available multi-stage core. The HLS tool provides a multi-stage core for all basic arithmetic operations (add, subtract, multiply and divide), all floating-point operations, and all block RAMs.
For best results, Xilinx recommends that you use -std=c99 for C and -fno-builtin for C and C++. To specify the C compile options, such as -std=c99, use the Tcl command add_files with the -cflags option. Alternatively, select the Edit CFLAGs button in the Project Settings dialog box. See "Creating a New Synthesis Project" in the Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Syntax
Place the pragma in the C source within the body of the function where the variable is defined.
#pragma HLS resource variable=<variable> core=<core>\
latency=<int>
Where:
variable=<variable>: A required argument that specifies the array, arithmetic operation, or function argument to assign the RESOURCE pragma to.
core=<core>: A required argument that specifies the core, as defined in the technology library.
latency=<int>: Specifies the latency of the core.
Example 1
In the following example, a two-stage pipelined multiplier is specified to implement the multiplication for variable c of the function foo. The HLS tool selects the core to use for variable d.
int foo (int a, int b) {
    int c, d;
#pragma HLS RESOURCE variable=c latency=2
    c = a*b;
    d = a*c;
    return d;
}
Example 2
In the following example, the coeffs[128] variable is an argument to the top-level function foo_top. This example specifies that coeffs is implemented with core RAM_1P from the library:
#pragma HLS resource variable=coeffs core=RAM_1P
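For context, the pragma could sit inside the function as sketched below. Only the pragma line and the names foo_top and coeffs come from this example; the second argument `x` and the loop body are assumptions added so the function is complete and can be compiled as ordinary C++.

```cpp
// Hypothetical body for foo_top. The array argument coeffs becomes a
// memory port on the RTL interface.
int foo_top(const int coeffs[128], int x) {
#pragma HLS resource variable=coeffs core=RAM_1P
    int acc = 0;
    for (int i = 0; i < 128; i++) {
        acc += coeffs[i] * x;  // one read of coeffs per iteration
    }
    return acc;
}
```

The loop reads one element per iteration, which is the kind of access pattern a single-port memory can serve.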
The ports created in the RTL to access the values of coeffs are defined in the RAM_1P core.
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS stream
Description
By default, array variables are implemented as RAM:
- Top-level function array parameters are implemented as a RAM interface port.
- General arrays are implemented as RAMs for read-write access.
- In sub-functions involved in DATAFLOW optimizations, the array arguments are implemented using a RAM ping pong buffer channel.
- Arrays involved in loop-based DATAFLOW optimizations are implemented as a RAM ping pong buffer channel.
If the data stored in the array is consumed or produced in a sequential manner, a more efficient communication mechanism is to use streaming data as specified by the STREAM pragma, where FIFOs are used instead of RAMs.
When an argument of the top-level function is specified with INTERFACE type ap_fifo, the array is automatically implemented as streaming.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS stream variable=<variable> depth=<int> dim=<int> off
Where:
variable=<variable>: Specifies the name of the array to implement as a streaming interface.
depth=<int>: Relevant only for array streaming in DATAFLOW channels. By default, the depth of the FIFO implemented in the RTL is the same size as the array specified in the C code. This option lets you modify the size of the FIFO and specify a different depth.
When the array is implemented in a DATAFLOW region, it is common to use the depth= option to reduce the size of the FIFO. For example, in a DATAFLOW region when all loops and functions are processing data at a rate of II=1, there is no need for a large FIFO because data is produced and consumed in each clock cycle. In this case, the depth= option may be used to reduce the FIFO size to 1 to substantially reduce the area of the RTL design.
TIP: The config_dataflow -depth command provides the ability to stream all arrays in a DATAFLOW region. The depth= option specified here overrides the config_dataflow command for the assigned <variable>.
dim=<int>: Specifies the dimension of the array to be streamed. The default is dimension 1. Specified as an integer from 0 to <N>, for an array with <N> dimensions.
off: Disables streaming data. Relevant only for array streaming in dataflow channels.
TIP: The config_dataflow -default_channel fifo command globally implies a STREAM pragma on all arrays in the design. The off option specified here overrides the config_dataflow command for the assigned variable, and restores the default of using a RAM ping pong buffer based channel.
Example 1
The following example specifies array A[10] to be streaming, and implemented as a FIFO:
#pragma HLS STREAM variable=A
Example 2
In this example, array B is set to streaming with a FIFO depth of 12:
#pragma HLS STREAM variable=B depth=12
Example 3
Array C has streaming disabled. In the example below, it is assumed to be enabled by config_dataflow:
#pragma HLS STREAM variable=C off
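The one-line examples above omit the surrounding code. The sketch below shows one way a streamed array might appear between two sub-functions in a DATAFLOW region. All names (`produce`, `consume`, `stream_top`, `buf`, `kSamples`) and the depth of 2 are illustrative assumptions. The producer writes each element once, in order, and the consumer reads them in the same order, which is the sequential access pattern streaming relies on.

```cpp
static const int kSamples = 16;  // hypothetical array length

// Producer: writes buf[0], buf[1], ... exactly once each, in order.
static void produce(const int in[kSamples], int buf[kSamples]) {
    for (int i = 0; i < kSamples; i++) {
        buf[i] = in[i] * 2;
    }
}

// Consumer: reads buf in the same order in which it was written.
static void consume(const int buf[kSamples], int out[kSamples]) {
    for (int i = 0; i < kSamples; i++) {
        out[i] = buf[i] + 1;
    }
}

void stream_top(const int in[kSamples], int out[kSamples]) {
#pragma HLS dataflow
    int buf[kSamples];
#pragma HLS stream variable=buf depth=2
    // buf is requested as a FIFO channel instead of a RAM ping pong buffer.
    produce(in, buf);
    consume(buf, out);
}
```

In software the function behaves like any sequential C++ code, so results can be checked before synthesis; a suitable FIFO depth depends on the relative rates of the two stages.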
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS top
Description
Attaches a name to a function, which can then be used with the set_top command to synthesize the function and any functions called from the specified top-level. This is typically used to synthesize member functions of a class in C/C++.
Specify the pragma in an active solution, and then use the set_top command with the new name.
Syntax
Place the pragma in the C source within the boundaries of the required location.
#pragma HLS top name=<string>
Where:
name=<string>: Specifies the name to be used by the set_top command.
Examples
Function foo_long_name is designated the top-level function, and renamed to DESIGN_TOP. After the pragma is placed in the code, the set_top command must still be issued from the Tcl command line, or from the top-level specified in the GUI project settings.
void foo_long_name () {
#pragma HLS top name=DESIGN_TOP
...
}
set_top DESIGN_TOP
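Since the description mentions class member functions as the typical use, a minimal sketch of that case follows. The class `Filter`, its member `apply`, and the name FILTER_TOP are hypothetical, and how the tool treats a particular class depends on the design, so read this as an illustration of pragma placement only.

```cpp
// Hypothetical class whose member function is named as the synthesis top.
class Filter {
public:
    int apply(int x) {
#pragma HLS top name=FILTER_TOP
        return x * 3;
    }
};
```

Following the example above, the matching Tcl step would be `set_top FILTER_TOP`.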
See Also
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
pragma HLS unroll
Description
Unroll loops to create multiple independent operations rather than a single collection of operations. The UNROLL pragma transforms loops by creating multiple copies of the loop body in the register transfer level (RTL) design, which allows some or all loop iterations to occur in parallel.
Loops in the C/C++ functions are kept rolled by default. When loops are rolled, synthesis creates the logic for one iteration of the loop, and the RTL design executes this logic for each iteration of the loop in sequence. A loop is executed for the number of iterations specified by the loop induction variable. The number of iterations might also be impacted by logic inside the loop body (for example, break conditions or modifications to a loop exit variable). Using the UNROLL pragma you can unroll loops to increase data access and throughput.
The UNROLL pragma allows the loop to be fully or partially unrolled. Fully unrolling the loop creates a copy of the loop body in the RTL for each loop iteration, so the entire loop can be run concurrently. Partially unrolling a loop lets you specify a factor <N>, to create <N> copies of the loop body and reduce the loop iterations accordingly. To unroll a loop completely, the loop bounds must be known at compile time. This is not required for partial unrolling.
Partial loop unrolling does not require <N> to be an integer factor of the maximum loop iteration count. The Vivado High-Level Synthesis (HLS) tool adds an exit check to ensure that partially unrolled loops are functionally identical to the original loop. For example, given the following code:
for(int i = 0; i < X; i++) {
#pragma HLS unroll factor=2
    a[i] = b[i] + c[i];
}
Loop unrolling by a factor of 2 effectively transforms the code to look like the following code, where the break construct is used to ensure the functionality remains the same, and the loop exits at the appropriate point:
for(int i = 0; i < X; i += 2) {
    a[i] = b[i] + c[i];
    if (i+1 >= X) break;
    a[i+1] = b[i+1] + c[i+1];
}
Because the maximum iteration count X is a variable, the HLS tool might not be able to determine its value and so adds an exit check and control logic to partially unrolled loops. However, if you know that the specified unrolling factor, 2 in this example, is an integer factor of the maximum iteration count X, the skip_exit_check option lets you remove the exit check and associated logic. This helps minimize the area and simplify the control logic.
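The role of the exit check can be verified in plain C++ by writing the rolled loop, the factor-2 form with the check, and the factor-2 form without it as separate functions. The function names are hypothetical, and the unrolled versions are hand-written stand-ins for what the tool generates, not its actual output.

```cpp
// Rolled reference loop.
void add_rolled(const int* b, const int* c, int* a, int X) {
    for (int i = 0; i < X; i++) {
        a[i] = b[i] + c[i];
    }
}

// Factor-2 unrolled form with the exit check: correct for any X.
void add_unrolled2(const int* b, const int* c, int* a, int X) {
    for (int i = 0; i < X; i += 2) {
        a[i] = b[i] + c[i];
        if (i + 1 >= X) break;  // exit check for odd X
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}

// Factor-2 unrolled form without the exit check, as skip_exit_check assumes.
// Only valid when X is even: for odd X it would touch element X.
void add_unrolled2_no_check(const int* b, const int* c, int* a, int X) {
    for (int i = 0; i < X; i += 2) {
        a[i] = b[i] + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}
```

For odd X only the version with the check matches the rolled loop. The version without it accesses one element past the intended range, which is why the responsibility for the bound moves to you when the check is removed.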
Loop unrolling can also be configured for the whole design with the config_unroll command. See config_unroll in the Vivado Design Suite User Guide: High-Level Synthesis (UG902) for more information.
Syntax
Place the pragma in the C/C++ source within the body of the loop to unroll.
#pragma HLS unroll factor=<N> region skip_exit_check
Where:
factor=<N>: Specifies a non-zero integer indicating that partial unrolling is requested. The loop body is repeated the specified number of times, and the iteration information is adjusted accordingly. If factor= is not specified, the loop is fully unrolled.
region: An optional keyword that unrolls all loops within the body (region) of the specified loop, without unrolling the enclosing loop itself.
skip_exit_check: An optional keyword that applies only if partial unrolling is specified with factor=. The elimination of the exit check is dependent on whether the loop iteration count is known or unknown:
- Fixed (known) bounds: No exit condition check is performed if the iteration count is a multiple of the factor. If the iteration count is not an integer multiple of the factor, the tool:
  - Prevents unrolling.
  - Issues a warning that the exit check must be performed to proceed.
- Variable (unknown) bounds: The exit condition check is removed as requested. You must ensure that:
  - The variable bound is an integer multiple of the specified unroll factor.
  - No exit check is in fact required.
Example 1
The following example fully unrolls loop_1 in function foo. Place the pragma in the body of loop_1 as shown:
loop_1: for(int i = 0; i < N; i++) {
#pragma HLS unroll
    a[i] = b[i] + c[i];
}
Example 2
This example specifies an unroll factor of 4 to partially unroll loop_2 of function foo, and removes the exit check:
void foo (...) {
    int8 array1[M];
    int12 array2[N];
    ...
    loop_2: for(i=0;i<M;i++) {
#pragma HLS unroll skip_exit_check factor=4
        array1[i] = ...;
        array2[i] = ...;
        ...
    }
    ...
}
Example 3
The following example fully unrolls all loops inside loop_1 in function foo, but not loop_1 itself due to the presence of the region keyword:
void foo(int data_in[N], int scale, int data_out1[N], int data_out2[N]) {
    int temp1[N];
    loop_1: for(int i = 0; i < N; i++) {
#pragma HLS unroll region
        temp1[i] = data_in[i] * scale;
        loop_2: for(int j = 0; j < N; j++) {
            data_out1[j] = temp1[j] * 123;
        }
        loop_3: for(int k = 0; k < N; k++) {
            data_out2[k] = temp1[k] * 456;
        }
    }
}
See Also
- pragma HLS loop_flatten
- pragma HLS loop_merge
- pragma HLS loop_tripcount
- Vivado Design Suite User Guide: High-Level Synthesis (UG902)
- opencl_unroll_hint
- SDAccel Environment Profiling and Optimization Guide