Optimizing the Hardware Function
The SDSoC™ environment employs heterogeneous cross-compilation, with Arm® CPU-specific compilers for the Zynq®-7000 and Zynq® UltraScale+™ MPSoC processor CPUs, and the Vivado® High-Level Synthesis (HLS) tool as a programmable logic (PL) cross-compiler for hardware functions. This section explains the default behavior and optimization directives associated with the HLS cross-compiler.
The default behavior of the HLS tool is to execute functions and loops sequentially, so that the hardware is an accurate reflection of the C/C++ code. Optimization directives can be used to enhance the performance of the hardware function, most notably by enabling pipelining, which substantially increases performance. This chapter outlines a general methodology for optimizing your design for high performance.
There are many possible goals when trying to optimize a design using the HLS tool. The methodology assumes you want to create a design with the highest possible performance, processing one sample of new input data every clock cycle, and so addresses those optimizations before the ones used for reducing latency or resources.
Understanding the Hardware Function Optimization Methodology
Hardware functions are synthesized in the PL by the Vivado HLS tool compiler. This compiler automatically translates C/C++ code into an FPGA hardware implementation and, as with all compilers, does so using compiler defaults.
In addition to the compiler defaults, the HLS tool provides a number of optimizations that are applied to the C/C++ code through the use of pragmas in the code. This chapter explains the optimizations that can be applied and a recommended methodology for applying them.
There are two flows for optimizing the hardware functions:
- Top-down flow
- In this flow, program decomposition into hardware functions proceeds top-down within the SDSoC environment, letting the system cross-compiler create pipelines of functions that automatically operate in dataflow mode. The microarchitecture for each hardware function is optimized using the HLS tool.
- Bottom-up flow
- In this flow, the hardware functions are optimized in isolation from the system using the HLS tool compiler provided in the Vivado Design Suite. The hardware functions are analyzed, optimizations directives can be applied to create an implementation other than the default, and the resulting optimized hardware functions are then incorporated into the SDSoC environment.
The bottom-up flow is often used in organizations where the software and hardware are optimized by different teams and can be used by software programmers who wish to take advantage of existing hardware implementations from within their organization or from partners. Both flows are supported, and the same optimization methodology is used in either case. Both workflows result in the same high-performance system. Xilinx sees the choice as a workflow decision made by individual teams and organizations and provides no recommendation on which flow to use.
The optimization methodology for hardware functions is shown in the following figure:
This figure details all the steps in the methodology and the subsequent sections in this chapter explain the optimizations in detail.
- Step 1: See Optimizing Metrics, and review the topics in this chapter prior to attempting to optimize.
- Step 2: See Pipelining for Performance.
- Step 3: See Optimizing Structures for Performance.
- Step 4: See Reducing Latency. This step is used to minimize or specifically control the latency through the design and is required only for applications where this is of concern.
- Step 5: See Reducing Area. This topic explains how to reduce the resources required for hardware implementation and is typically applied only when larger hardware functions fail to implement in the available resources. The FPGA has a fixed number of resources, and there is typically no benefit in creating a smaller implementation if the performance goals have been met.
Baselining Hardware Functions
Before you perform any hardware function optimization, it is important to understand the performance achieved with the existing code and compiler defaults, and to appreciate how performance is measured. Select the functions to implement in hardware and build the project.
After you build a project, a report is available in the Hardware Reports section of the IDE. The report is also available from <project name>/<build_config>/_sds/vhls/<hw_function>/solution/syn/report/<hw_function>.rpt. This report details the performance estimates and usage estimates.
The key factors in the performance estimates are, in order: timing, interval (which includes the loop initiation interval), and latency.
- The timing summary shows the target and estimated clock period. If the estimated clock period is greater than the target, the hardware will not function at this clock period. Reduce the clock frequency specified for the hardware function. Alternatively, because this is only an estimate at this point in the flow, it might be possible to proceed through the remainder of the flow if the estimate only exceeds the target by 20%. Further optimizations are applied when the bitstream is generated, and it might still be possible to satisfy the timing requirements. However, this is an indication that the hardware function is not guaranteed to meet timing.
- The function initiation interval (II) is the number of clock cycles before the function can accept new inputs and is generally the most critical performance metric in any system. In an ideal hardware function, the hardware processes data at the rate of one sample per clock cycle. If the largest data set passed into the hardware is size <N> (for example, my_array[<N>]), the most optimal II is <N> + 1. This means the hardware function processes <N> data samples in <N> clock cycles and can accept new data one clock cycle after all <N> samples are processed. It is possible to create a hardware function with an II < <N>; however, this requires greater resources in the programmable logic (PL) with typically little benefit. Often, this hardware function is ideal because it consumes and produces data at a rate faster than the rest of the system.
- The loop initiation interval is the number of clock cycles before the next iteration of a loop starts to process data. This metric becomes important as you delve deeper into the analysis to locate and remove performance bottlenecks.
- The latency is the number of clock cycles required for the function to compute all output values. This is simply the lag from when data is applied until when it is ready. For most applications this is of little concern, especially when the latency of the hardware function vastly exceeds that of the software or system functions, such as DMA; however, it is a performance metric that you should review and confirm is not an issue for your application.
- The loop iteration latency is the number of clock cycles it takes to complete one iteration of a loop, and the loop latency is the number of cycles to execute all iterations of the loop. See Optimizing Metrics.
The Area Estimates section of the report details how many resources are required in the PL to implement the hardware function and how many are available. The key metric here is Utilization (%), which should not exceed 100% for any of the resources. A figure greater than 100% means there are not enough resources to implement the hardware function, and a larger FPGA device might be required. As with the timing, at this point in the flow, this is an estimate. If the numbers are only slightly over 100%, it might be possible for the hardware to be optimized during bitstream creation.
You should already have an understanding of the required performance of your system and what metrics are required from the hardware functions; however, even if you are unfamiliar with hardware concepts such as clock cycles, you are now aware that the highest performing hardware functions have an II = <N> + 1, where <N> is the largest data set processed by the function. With an understanding of the current design performance and a set of baseline performance metrics, you can now proceed to apply optimization directives to the hardware functions.
Optimizing Metrics
The following table shows the first directive for you to consider adding to your design.
Directives and Configurations | Description |
---|---|
LOOP_TRIPCOUNT | Used for loops that have variable bounds. Provides an estimate for the loop iteration count. This has no impact on synthesis, only on reporting. |
A common issue when hardware functions are first compiled is report files showing the latency and interval as a question mark “?” rather than as numerical values. If the design has loops with variable loop bounds, the compiler cannot determine the latency or II and uses the “?” to indicate this condition. Variable loop bounds are where the loop iteration limit cannot be resolved at compile time, as when the loop iteration limit is an input argument to the hardware function, such as variable height, width, or depth parameters.
To resolve this condition, use the hardware function report to locate the lowest-level loop that fails to report a numerical value, and use the LOOP_TRIPCOUNT directive to apply an estimated tripcount. The tripcount is the minimum, average, and/or maximum number of expected iterations. This allows values for latency and interval to be reported and allows implementations with different optimizations to be compared.
Because the LOOP_TRIPCOUNT value is used only for reporting and has no impact on the resulting hardware implementation, any value can be used. However, an accurate expected value results in more useful reports.
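As an illustration, the following sketch applies the directive to a loop whose bound is a function argument; the loop, variable names, and tripcount values are hypothetical:

// 'width' is a function argument, so the bound cannot be resolved at compile time.
// The min/max/avg values are reporting hints only and do not change the hardware.
for (int i = 0; i < width; i++) {
#pragma HLS LOOP_TRIPCOUNT min=32 max=1920 avg=640
    out[i] = in[i];
}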
Pipelining for Performance
The next stage in creating a high-performance design is to pipeline the functions, loops, and operations. Pipelining results in the greatest level of concurrency and a very high level of performance. The following table shows the directives you can use for pipelining.
Directives and Configurations | Description |
---|---|
PIPELINE | Reduces the initiation interval by allowing the concurrent execution of operations within a loop or function. |
DATAFLOW | Enables task-level pipelining, allowing functions and loops to execute concurrently. Used to minimize interval. |
RESOURCE | Specifies pipelining on the hardware resource used to implement a variable (array, arithmetic operation). |
Config Compile | Allows loops to be automatically pipelined based on their iteration count when using the bottom-up flow. |
At this stage of the optimization process, you want to create as much concurrent operation as possible. You can apply the PIPELINE directive to functions and loops. You can use the DATAFLOW directive at the level that contains the functions and loops to make them work in parallel. Although rarely required, the RESOURCE directive can be used to squeeze out the highest levels of performance.
A recommended strategy is to work from the bottom up and be aware of the following:
- Some functions and loops contain sub-functions. If the sub-function is not pipelined, the function above it might show limited improvement when it is pipelined. The non-pipelined sub-function will be the limiting factor.
- Some functions and loops contain sub-loops. When you use the PIPELINE directive, the directive automatically unrolls all loops in the hierarchy below. This can create a great deal of logic. It might make more sense to pipeline the loops in the hierarchy below.
- For cases where it does make sense to pipeline the upper hierarchy and unroll any loops lower in the hierarchy, loops with variable bounds cannot be unrolled, and any loops and functions in the hierarchy above these loops cannot be pipelined. To address this issue, pipeline these loops with variable bounds, and use the DATAFLOW optimization to ensure the pipelined loops operate concurrently to maximize the performance of the task that contains the loops. Alternatively, rewrite the loop to remove the variable bound: apply a maximum upper bound with a conditional break, as shown in the sketch after this list.
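A minimal sketch of that rewrite, assuming a hypothetical variable bound limit and a chosen compile-time maximum MAX_ITER:

// 'limit' is the original variable bound (a function argument); MAX_ITER is a
// compile-time maximum that the bound is known never to exceed. The fixed bound
// allows the loop to be unrolled or pipelined; the conditional break preserves
// the original behavior.
Loop: for (int i = 0; i < MAX_ITER; i++) {
    if (i == limit)
        break;
    // ... original loop body ...
}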
The basic strategy at this point in the optimization process is to pipeline the tasks (functions and loops) as much as possible. For detailed information on which functions and loops to pipeline, see Hardware Function Pipeline Strategies.
Although not commonly used, you can also apply pipelining at the operator level. For example, wire routing in the FPGA can introduce large and unanticipated delays that make it difficult for the design to be implemented at the required clock frequency. In this case, you can use the RESOURCE directive to pipeline specific operations, such as multipliers, adders, and block RAM, adding pipeline register stages at the logic level and allowing the hardware function to process data at the highest possible performance level.
Hardware Function Pipeline Strategies
The key optimization directives for obtaining a high-performance design are the PIPELINE and DATAFLOW directives. This section discusses in detail how to apply these directives for various C code architectures.
There are two types of C/C++ functions: those that are frame-based and those that are sample-based. No matter which coding style is used, the hardware function can be implemented with the same performance. The difference is only in how the optimization directives are applied.
Frame-Based C Code
The primary characteristic of a frame-based coding style is that the function processes multiple data samples—a frame of data—typically supplied as an array or pointer with data accessed through pointer arithmetic during each transaction (a transaction is considered to be one complete execution of the C function). In this coding style, the data is typically processed through a series of loops or nested loops.
The following is an example outline of frame-based C code:
void foo(
  data_t in1[HEIGHT][WIDTH],
  data_t in2[HEIGHT][WIDTH],
  data_t out[HEIGHT][WIDTH]) {
 Loop1: for(int i = 0; i < HEIGHT; i++) {
  Loop2: for(int j = 0; j < WIDTH; j++) {
   out[i][j] = in1[i][j] * in2[i][j];
   Loop3: for(int k = 0; k < NUM_BITS; k++) {
    . . . .
   }
  }
 }
}
When seeking to pipeline any C/C++ code for maximum performance in hardware, you want to place the pipeline optimization directive at the level where a sample of data is processed.
The above example is representative of code used to process an image or video frame and can be used to highlight how to effectively pipeline hardware functions. Two sets of input are provided as frames of data to the function, and the output is also a frame of data. There are multiple locations where this function can be pipelined:
- At the level of function foo.
- At the level of loop Loop1.
- At the level of loop Loop2.
- At the level of loop Loop3.
There are advantages and disadvantages for placing the PIPELINE directive at various locations. Understanding them helps guide you to the best location to place the pipeline directive in your code.
- Function Level
- The function accepts a frame of data as input (in1 and in2). If the function is pipelined with II = 1, reading a new set of inputs every clock cycle, this informs the compiler to read all HEIGHT*WIDTH values of in1 and in2 in a single clock cycle. This is a lot of data to read in one cycle and is unlikely to be the design you want. If the PIPELINE directive is applied to function foo, all loops in the hierarchy below this level must be unrolled. This is a requirement for pipelining: there cannot be sequential logic inside the pipeline. Unrolling would create HEIGHT*WIDTH*NUM_BITS copies of the logic, which would lead to a large design. Because the data is accessed in a sequential manner, the arrays on the interface to the hardware function can be implemented as multiple types of hardware interface:
- Block RAM interface
- AXI4 interface
- AXI4-Lite interface
- AXI4-Stream interface
- FIFO interface
A block RAM interface can be implemented as a dual-port interface supplying two samples per clock. The other interface types can only supply one sample per clock. This would result in a bottleneck; there would be a highly parallel but large hardware design unable to process all the data in parallel, resulting in a waste of hardware resources.
- Loop1 Level
- The logic in Loop1 processes an entire row of the two-dimensional matrix. Placing the PIPELINE directive here would create a design that seeks to process one row in each clock cycle. Again, this would unroll the loops below and create additional logic. To make use of the additional hardware, an entire row of data would have to be transferred each clock cycle: an array of HEIGHT data words, with each word being WIDTH * <number of bits in data_t> bits wide. Because it is unlikely the host code running on the PS can process such large data words, this would again be a case where there are many highly parallel hardware resources that cannot operate in parallel due to bandwidth limitations.
- Loop2 Level
- The logic in Loop2 seeks to process one sample from the arrays. In an image algorithm, this is the level of a single pixel. This is the level to pipeline if the design is to process one sample per clock cycle. This is also the rate at which the interfaces consume and produce data to and from the PS.
This causes Loop3 to be completely unrolled and process one sample per clock. It is a requirement that all the operations in Loop3 execute in parallel. In a typical design, the logic in Loop3 is a shift register or is processing bits within a word. To execute at one sample per clock, you want these processes to occur in parallel and hence you want to unroll the loop. The hardware function created by pipelining Loop2 processes one data sample per clock and creates parallel logic only where needed to achieve the required level of data throughput.
- Loop3 Level
- As stated above, given that Loop2 operates on each data sample or pixel, Loop3 will typically be doing bit-level or data-shifting tasks, so this level is doing multiple operations per pixel. Pipelining this level would mean performing each operation in this loop once per clock and thus NUM_BITS clocks per pixel: processing at the rate of multiple clocks per pixel or data sample.
For example, Loop3 might contain a shift register holding the previous pixels required for a windowing or convolution algorithm. Adding the PIPELINE directive at this level informs the compiler to shift one data value every clock cycle. The design would only return to the logic in Loop2 and read the next inputs after NUM_BITS iterations, resulting in a very slow data processing rate.
The ideal location to pipeline in this example is Loop2. When dealing with frame-based code, you will want to pipeline at the loop level and typically pipeline the loop that operates at the level of a sample. If in doubt, place a print statement into the C code to confirm this is the level you wish to execute on each clock cycle.
For cases where there are multiple loops at the same level of hierarchy—the example above shows only a set of nested loops—the best location to place the PIPELINE directive can be determined for each loop and then the DATAFLOW directive applied to the function to ensure each of the loops executes in a concurrent manner.
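Applied to the earlier outline, a minimal sketch of pipelining at the Loop2 level looks as follows (Loop3 and its body are omitted for brevity):

void foo(data_t in1[HEIGHT][WIDTH],
         data_t in2[HEIGHT][WIDTH],
         data_t out[HEIGHT][WIDTH]) {
 Loop1: for (int i = 0; i < HEIGHT; i++) {
  Loop2: for (int j = 0; j < WIDTH; j++) {
#pragma HLS PIPELINE II=1
   // One sample is processed per clock; any loop below this level
   // (such as Loop3) is automatically unrolled.
   out[i][j] = in1[i][j] * in2[i][j];
  }
 }
}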
Sample-Based C Code
An example outline of sample-based C code is shown below. The primary characteristic of this coding style is that the function processes a single data sample during each transaction.
void foo (data_t *in, data_t *out) {
static data_t acc;
Loop1: for (int i=N-1;i>=0;i--) {
acc+= ..some calculation..;
}
*out=acc>>N;
}
Another characteristic of sample-based coding style is that the function often contains a static variable: a variable whose value must be remembered between invocations of the function, such as an accumulator or sample counter.
With sample-based code, the location of the PIPELINE directive is clear: to achieve an II = 1 and process one data value each clock cycle, the function must be pipelined. This unrolls any loops inside the function and creates additional hardware logic, but there is no way around this. If Loop1 is not pipelined, it takes a minimum of N clock cycles to complete. Only then can the function read the next input value.
When dealing with C code that processes at the sample level, the strategy is always to pipeline the function.
In this type of coding style, the loops are typically operating on arrays, performing shift-register or line-buffer functions. It is not uncommon to partition these arrays into individual elements, as discussed in Optimizing Structures for Performance, to ensure all samples are shifted in a single clock cycle. If the array is implemented in a block RAM, only a maximum of two samples can be read or written in each clock cycle, creating a data processing bottleneck.
The solution here is to pipeline function foo. Doing so results in a design that processes one sample per clock, as shown in the sketch below.
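For the sample-based outline above, a minimal sketch places the pragma at the top of the function body (the elided calculation is carried over from the outline):

void foo(data_t *in, data_t *out) {
#pragma HLS PIPELINE II=1
    // Pipelining the function unrolls Loop1 and creates the additional logic
    // needed to process one sample per clock.
    static data_t acc;
 Loop1: for (int i = N - 1; i >= 0; i--) {
        acc += /* ..some calculation.. */;
    }
    *out = acc >> N;
}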
Optimizing Structures for Performance
C code can contain descriptions that prevent a function or loop from being pipelined with the required performance. This is often implied by the structure of the C code or by the default logic structures used to implement it in the PL. In some cases, this might require a code modification, but in most cases these issues can be addressed using additional optimization directives.
The following example shows a case where an optimization directive is used to improve the structure of the implementation and the performance of pipelining. In this initial example, the PIPELINE directive is added to a loop to improve the performance of the loop. This example code shows a loop being used inside a function.
#include "bottleneck.h"
dout_t bottleneck(...) {
...
SUM_LOOP: for(i=3;i<N;i=i+4) {
#pragma HLS PIPELINE
sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
}
...
}
When the code above is compiled into hardware, the following message appears as output:
INFO: [SCHED 61] Pipelining loop 'SUM_LOOP'.
WARNING: [SCHED 69] Unable to schedule 'load' operation ('mem_load_2', bottleneck.c:62) on array 'mem' due to limited memory ports.
INFO: [SCHED 61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
The issue in this example is that arrays are implemented using the efficient block RAM resources in the PL. This results in a small, cost-efficient, fast design. The disadvantage of block RAM is that, like other memories, such as DDR or SRAM, they have a limited number of data ports, typically a maximum of two.
In the code above, four data values from mem are required to compute the value of sum. Because mem is an array and implemented in a block RAM that only has two data ports, only two values can be read (or written) in each clock cycle. With this configuration, it is impossible to compute the value of sum in one clock cycle and thus consume or produce data with an II of 1 (process one data sample per clock).
The memory port limitation issue can be solved by using the ARRAY_PARTITION directive on the mem array. This directive partitions arrays into smaller arrays, improving the data structure by providing more data ports and allowing a higher performance pipeline.
With the additional directive shown below, array mem is partitioned into two dual-port memories so that all four reads can occur in one clock cycle. There are multiple options for partitioning an array. In this case, cyclic partitioning with a factor of two ensures the first partition contains elements 0, 2, 4, and so forth, from the original array, and the second partition contains elements 1, 3, 5, and so forth. Because the partitioning ensures there are now two dual-port block RAMs (with a total of four data ports), elements 0, 1, 2, and 3 can be read in a single clock cycle.
#include "bottleneck.h"
dout_t bottleneck(...) {
#pragma HLS ARRAY_PARTITION variable=mem cyclic factor=2 dim=1
...
SUM_LOOP: for(i=3;i<N;i=i+4) {
#pragma HLS PIPELINE
sum += mem[i] + mem[i-1] + mem[i-2] + mem[i-3];
}
...
}
Other such issues might be encountered when trying to pipeline loops and functions. The following table lists the directives that are likely to address these issues by helping to reduce bottlenecks in data structures.
Directives and Configurations | Description |
---|---|
ARRAY_PARTITION | Partitions large arrays into multiple smaller arrays or into individual registers to improve access to data and remove block RAM bottlenecks. |
DEPENDENCE | Provides additional information that can overcome loop-carry dependencies and allow loops to be pipelined (or pipelined with lower intervals). |
INLINE | Inlines a function, removing all function hierarchy. Enables logic optimization across function boundaries and improves latency/interval by reducing function call overhead. |
UNROLL | Unrolls for-loops to create multiple independent operations rather than a single collection of operations, allowing greater hardware parallelism. This also allows for partial unrolling of loops. |
Config Array Partition | This configuration determines how arrays are automatically partitioned, including global arrays, and if the partitioning impacts array ports. |
Config Compile | Controls synthesis specific optimizations such as the automatic loop pipelining and floating point math optimizations. |
Config Schedule | Determines the effort level to use during the synthesis scheduling phase, the verbosity of the output messages, and to specify if II should be relaxed in pipelined tasks to achieve timing. |
Config Unroll | Allows all loops below the specified number of loop iterations to be automatically unrolled. |
In addition to the ARRAY_PARTITION directive, the configuration for array partitioning can be used to automatically partition arrays.
The DEPENDENCE directive might be required to remove implied dependencies when pipelining loops. Such dependencies are reported by message SCHED-68.
@W [SCHED-68] Target II not met due to carried dependence(s)
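As a hedged sketch, the directive can declare that no inter-iteration dependence exists when the algorithm guarantees it; the array and index names here are hypothetical:

// Safe only if the algorithm guarantees idx[] never repeats a value while
// iterations are in flight; the pragma tells the scheduler to ignore the
// assumed carried dependence on 'hist'.
for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=hist inter false
    hist[idx[i]] += 1;
}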
The INLINE directive removes function boundaries. It can be used to bring logic or loops up one level of hierarchy. It might be more efficient to pipeline the logic in a function by including it in the function above it, and to merge loops into the function above them, where the DATAFLOW optimization can then be used to execute all the loops concurrently without the overhead of the intermediate sub-function call. This might lead to a higher performing design.
The UNROLL directive might be required for cases where a loop cannot be pipelined with the required II. If a loop can only be pipelined with II = 4, it constrains the other loops and functions in the system to II = 4. In some cases, it might be worth unrolling or partially unrolling the loop to create more logic and remove a potential bottleneck. If the loop can only achieve II = 4, unrolling the loop by a factor of 4 creates logic that can process four iterations of the loop in parallel and achieve II = 1.
The configuration commands are used to change the optimization default settings and are available only from within the Vivado HLS tool when using a bottom-up flow. For more details, see the Vivado Design Suite User Guide: High-Level Synthesis (UG902). If optimization directives cannot be used to improve the initiation interval, changes to the code might be required; the same guide provides examples.
Reducing Latency
When the compiler finishes minimizing the initiation interval (II), it automatically seeks to minimize the latency. The optimization directives listed in the following table can be used to specify a particular latency or to instruct the compiler to achieve a latency lower than the one it produced, that is, to satisfy the latency directive even if it results in a higher II. This could result in a lower performance design.
Latency directives are generally not required because most applications have a required throughput but no required latency. When hardware functions are integrated with a processor, the latency of the processor is generally the limiting factor in the system.
If the loops and functions are not pipelined, the throughput is limited by the latency because the task does not start reading the next set of inputs until the current task has completed.
Directive | Description |
---|---|
LATENCY | Allows a minimum and maximum latency constraint to be specified. |
LOOP_FLATTEN | Allows nested loops to be collapsed into a single loop. This removes the loop transition overhead and improves the latency. Nested loops are automatically flattened when the PIPELINE directive is applied. |
LOOP_MERGE | Merges consecutive loops to reduce overall latency, increase logic resource sharing, and improve logic optimization. |
The loop optimization directives can be used to flatten a loop hierarchy or merge consecutive loops together. The benefit to latency comes from the fact that it typically costs a clock cycle in the control logic to enter and leave the logic created by a loop. The fewer the transitions between loops, the fewer clock cycles a design takes to complete.
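If a latency constraint is needed, a minimal sketch of the LATENCY directive looks as follows; the bounds and the function process are illustrative, and the pragma applies to the scope containing it:

// Constrain each iteration of this loop to complete in 4 to 10 cycles
// (values are illustrative).
for (int i = 0; i < N; i++) {
#pragma HLS LATENCY min=4 max=10
    out[i] = process(in[i]);
}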
Reducing Area
In hardware, the number of resources required to implement a logic function is referred to as the design area. Design area also refers to how much of the fixed-size PL fabric the resources consume. The area is important when the hardware is too large to be implemented in the target device, and when the hardware function consumes a very high percentage (> 90%) of the available area. This can result in difficulties when trying to wire the hardware logic together because the wires themselves require resources.
After meeting the required performance target or initiation interval (II), the next step might be to reduce the area while maintaining the same performance. This step can be optional: there is no advantage to reducing the area if the hardware function is operating at the required performance and no other hardware functions are to be implemented in the remaining space in the PL.
The most common area optimization is the optimization of dataflow memory channels to reduce the number of block RAM resources required to implement the hardware function. Each device has a limited number of block RAM resources.
If you used the DATAFLOW optimization, and the compiler cannot determine whether the tasks in the design are streaming data, it implements the memory channels between dataflow tasks using ping-pong buffers. These require two block RAMs each of size <N>, where <N> is the number of samples to be transferred between the tasks (typically the size of the array passed between tasks). If the design is pipelined and the data is streaming from one task to the next with values produced and consumed in a sequential manner, you can greatly reduce the area by using the STREAM directive to specify that the arrays are to be implemented in a streaming manner that uses a simple FIFO for which you can specify the depth. FIFOs with a small depth are implemented using registers and the PL fabric has many registers.
For most applications, the depth can be specified as 1, which results in the memory channel being implemented as a simple register. If the algorithm implements data compression or extrapolation, where some tasks consume more data than they produce or produce more data than they consume, some arrays must be specified with a higher depth:
- For tasks which produce and consume data at the same rate, specify the array between them to stream with a depth of 1.
- For tasks which reduce the data rate by a factor of X-to-1, specify arrays at the input of the task to stream with a depth of X. All arrays prior to this in the function should also have a depth of X to ensure the hardware function does not stall because the FIFOs are full.
- For tasks which increase the data rate by a factor of 1-to-Y, specify arrays at the output of the task to stream with a depth of Y. All arrays after this in the function should also have a depth of Y to ensure the hardware function does not stall because the FIFOs are full.
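A minimal sketch of the STREAM directive inside a DATAFLOW region follows; the tasks producer and consumer and the array tmp are hypothetical, and both tasks are assumed to produce and consume at the same rate:

void top(int in[N], int out[N]) {
#pragma HLS DATAFLOW
    int tmp[N];
// Implement the channel between the two tasks as a FIFO of depth 1 (a simple
// register) instead of a block RAM ping-pong buffer.
#pragma HLS STREAM variable=tmp depth=1
    producer(in, tmp);   // writes tmp sequentially
    consumer(tmp, out);  // reads tmp sequentially at the same rate
}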
The following table lists the other directives and configurations to consider when attempting to minimize the resources used to implement the design.
Directives and Configurations | Description |
---|---|
ALLOCATION | Specifies a limit for the number of operations, hardware resources, or functions used. This can force the sharing of hardware resources but might increase latency. |
ARRAY_MAP | Combines multiple smaller arrays into a single large array to help reduce the number of block RAM resources. |
ARRAY_RESHAPE | Reshapes an array from one with many elements to one with greater word width. Useful for improving block RAM accesses without increasing the number of block RAM. |
DATA_PACK | Packs the data fields of an internal struct into a single scalar with a wider word width, allowing a single control signal to control all fields. |
LOOP_MERGE | Merges consecutive loops to reduce overall latency, increase sharing, and improve logic optimization. |
OCCURRENCE | Used when pipelining functions or loops to specify that the code in a location is executed at a lesser rate than the code in the enclosing function or loop. |
RESOURCE | Specifies that a specific hardware resource (core) is used to implement a variable (array, arithmetic operation). |
STREAM | Specifies that a specific memory channel is to be implemented as a FIFO with an optional specific depth. |
Config Bind | Determines the effort level to use during the synthesis binding phase and can be used to globally minimize the number of operations used. |
Config Dataflow | This configuration specifies the default memory channel and FIFO depth in dataflow optimization. |
The ALLOCATION and RESOURCE directives are used to limit the number of operations and to select which cores (hardware resources) are used to implement the operations. For example, you could limit the function or loop to using only one multiplier, and specify it to be implemented using a pipelined multiplier.
If the ARRAY_PARTITION directive is used to improve the initiation interval, you might want to consider using the ARRAY_RESHAPE directive instead. The ARRAY_RESHAPE optimization performs a similar task to array partitioning; however, the reshape optimization recombines the elements created by partitioning into a single block RAM with wider data ports. This might prevent an increase in the number of block RAM resources required.
If the C code contains a series of loops with similar indexing, merging the loops with the LOOP_MERGE directive might allow some optimizations to occur. Finally, in cases where a section of code in a pipeline region is only required to operate at a slower rate than the rest of the region, the OCCURRENCE directive is used to indicate that this logic can be optimized to execute at that lower rate.
Design Optimization Workflow
Before performing any optimizations, it is recommended to create a new build configuration within the project. Using different build configurations allows one set of results to be compared against another. In addition to the standard Debug and Release configurations, custom configurations with more useful names (for example, Opt_ver1 and UnOpt_ver) can be created in the IDE using the toolbar button.
Different build configurations allow you to compare not only the results, but also the log files and even output RTL files used to implement the FPGA (the RTL files are only recommended for users very familiar with hardware design).
The basic optimization strategy for a high-performance design is:
- Create an initial or baseline design.
- Pipeline the loops and functions. Apply the DATAFLOW optimization to execute loops and functions concurrently.
- Address any issues that limit pipelining, such as array bottlenecks and loop dependencies (with the ARRAY_PARTITION and DEPENDENCE directives).
- Specify a specific latency or reduce the size of the dataflow memory channels, and use the ALLOCATION and RESOURCE directives to further reduce area.
In summary, the goal is always to meet performance first, before reducing area. If the strategy is to create a design with the fewest resources, omit the steps for improving performance, although the baseline results might be very close to the smallest possible design.
Throughout the optimization process, it is highly recommended to review the console output (or log file) after compilation. When the compiler cannot reach the specified performance goals of an optimization, it automatically relaxes the goals (except the clock frequency) and creates a design with the goals that can be satisfied. It is important to review the output from the compilation log files and reports to understand what optimizations have been performed.
For specific details on applying optimizations, refer to Vivado Design Suite User Guide: High-Level Synthesis (UG902).
Optimization Guidelines
This section documents several fundamental optimization techniques to enhance hardware function performance using the Vivado HLS tool. These techniques include function inlining, loop and function pipelining, loop unrolling, increasing local memory bandwidth, and streaming data flow between loops and functions.
Function Inlining
Similar to function inlining of software functions, it can be beneficial to inline hardware functions.
Function inlining replaces a function call by substituting a copy of the function body after resolving the actual and formal arguments. After that, the inlined function is dissolved and no longer appears as a separate level of hierarchy. Function inlining allows operations within the inlined function to be optimized more effectively with surrounding operations, thus improving the overall latency or the initiation interval for a loop.
To inline a function, put #pragma HLS inline at the beginning of the body of the desired function. The following code snippet directs the Vivado HLS tool to inline the mmult_kernel function:
void mmult_kernel(float in_A[A_NROWS][A_NCOLS],
float in_B[A_NCOLS][B_NCOLS],
float out_C[A_NROWS][B_NCOLS])
{
#pragma HLS INLINE
int index_a, index_b, index_d;
// rest of code body omitted
}
Loop Pipelining and Loop Unrolling
Both loop pipelining and loop unrolling improve the performance of hardware functions by exploiting the parallelism between loop iterations. This section presents the basic concepts of loop pipelining and loop unrolling, shows example code for applying these techniques, and discusses the factors that limit the parallelism they can achieve.
Loop Pipelining
In sequential languages such as C/C++, the operations in a loop are executed sequentially, and the next iteration of the loop can only begin when the last operation in the current loop iteration is complete. Loop pipelining allows the operations in a loop to be implemented in a concurrent manner as shown in the following figure.
As shown in the previous figure, without pipelining there are three clock cycles between the two RD operations, and it requires six clock cycles for the entire loop to finish. However, with pipelining, there is only one clock cycle between the two RD operations, and it requires four clock cycles for the entire loop to finish; that is, the next iteration of the loop can start before the current iteration is finished.
An important term in loop pipelining is the initiation interval (II), which is the number of clock cycles between the start times of consecutive loop iterations. In the above figure, the II is one because there is only one clock cycle between the start times of consecutive loop iterations.
To pipeline a loop, put #pragma HLS pipeline at the beginning of the loop body, as illustrated in the following code snippet. The Vivado HLS tool tries to pipeline the loop with the minimum II.
for (index_a = 0; index_a < A_NROWS; index_a++) {
for (index_b = 0; index_b < B_NCOLS; index_b++) {
#pragma HLS PIPELINE II=1
float result = 0;
for (index_d = 0; index_d < A_NCOLS; index_d++) {
float product_term = in_A[index_a][index_d] * in_B[index_d][index_b];
result += product_term;
}
out_C[index_a * B_NCOLS + index_b] = result;
}
}
Loop Unrolling
Loop unrolling is another technique to exploit parallelism between loop iterations. It creates multiple copies of the loop body and adjusts the loop iteration counter accordingly. The following code snippet shows a normal rolled loop:
int sum = 0;
for(int i = 0; i < 10; i++) {
sum += a[i];
}
After the loop is unrolled by a factor of two, the loop becomes:
int sum = 0;
for(int i = 0; i < 10; i+=2) {
sum += a[i];
sum += a[i+1];
}
Unrolling a loop by a factor of <N> creates <N> copies of the loop body; the loop variable referenced by each copy is updated accordingly (such as the a[i+1] in the above code snippet), and the loop iteration counter is also updated accordingly (such as the i+=2 in the above code snippet).
Loop unrolling creates more operations in each loop iteration, so that the Vivado HLS tool can exploit more parallelism among these operations. More parallelism means more throughput and higher system performance.
- When the factor <N> is less than the total number of loop iterations (10 in the example above), it is called a partial unroll.
- When the factor <N> is the same as the number of loop iterations, it is called a full unroll. While full unroll requires that the loop bounds are known at compile time, it exposes the most parallelism.
To unroll a loop, put #pragma HLS unroll [factor=N] at the beginning of the loop. Without the optional factor=N, the loop is fully unrolled.
int sum = 0;
for(int i = 0; i < 10; i++) {
#pragma HLS unroll factor=2
sum += a[i];
}
Factors Limiting the Parallelism Achieved by Loop Pipelining and Loop Unrolling
Both loop pipelining and loop unrolling exploit the parallelism between loop iterations. However, parallelism between loop iterations is limited by two main factors:
- The data dependencies between loop iterations.
- The number of available hardware resources.
A data dependence from an operation in one iteration to an operation in a subsequent iteration is called a loop-carried dependence. It implies that the operation in the subsequent iteration cannot start until the operation in the current iteration has finished computing the data input for the operation in the subsequent iteration. Loop-carried dependencies fundamentally limit the initiation interval that can be achieved with loop pipelining and the parallelism that can be exploited with loop unrolling.
The following example demonstrates loop-carried dependencies among operations producing and consuming the variables a and b.
while (a != b) {
    if (a > b)
        a -= b;
    else
        b -= a;
}
Operations in the next iteration of this loop cannot start until the current iteration has calculated and updated the values of a and b. Array accesses are a common source of loop-carried dependencies, as shown in the following example:
for (i = 1; i < N; i++)
mem[i] = mem[i-1] + i;
In this case, the next iteration of the loop must wait until the current iteration updates the content of the array. In the case of loop pipelining, the minimum initiation interval (II) is the total number of clock cycles required for the memory read, the add operation, and the memory write.
Another performance-limiting factor for loop pipelining and loop unrolling is the number of available hardware resources. The following figure shows an example of the issues created by resource limitations, which in this case prevent the loop from being pipelined with an initiation interval of one.
In this example, if the loop is pipelined with an initiation interval of one, there are two read operations every cycle. If the memory has only a single port, the two read operations cannot be executed simultaneously and must be executed over two cycles. So the minimum initiation interval can only be two, as shown in part (B) of the figure. The same can happen with other hardware resources. For example, if op_compute is implemented with a DSP core that cannot accept new inputs every cycle, and there is only one such DSP core, then op_compute cannot be issued to the DSP core each cycle, and an initiation interval of one is not possible.
Increasing Local Memory Bandwidth
This section shows several ways provided by the Vivado HLS tool to increase local memory bandwidth, which can be used together with loop pipelining and loop unrolling to improve system performance.
Arrays are intuitive and useful constructs in C/C++ programs. They allow the algorithm to be easily captured and understood. In the HLS tool, each array is implemented by default with a single-port memory resource; however, such a memory implementation might not be the most ideal memory architecture for performance-oriented programs. Refer to Loop Pipelining and Loop Unrolling for an example of resource contention caused by limited memory ports.
Array Partitioning
Arrays can be partitioned into smaller arrays. Physical implementations of memories have only a limited number of read and write ports, which can limit the throughput of a load/store-intensive algorithm. The memory bandwidth can sometimes be improved by splitting the original array (implemented as a single memory resource) into multiple smaller arrays (implemented as multiple memories), effectively increasing the number of load/store ports.
The Vivado HLS tool provides three types of array partitioning, as shown in the following figure.
- block
- The original array is split into equally sized blocks of consecutive elements of the original array.
- cyclic
- The original array is split into equally sized blocks interleaving the elements of the original array.
- complete
- The default operation is to split the array into its individual elements. This corresponds to implementing an array as a collection of registers rather than as a memory.
To partition an array in the HLS tool, insert this in the hardware function source code:
#pragma HLS array_partition variable=<variable> <block, cyclic, complete> factor=<int> dim=<int>
For block and cyclic partitioning, the factor option can be used to specify the number of arrays that are created. In the figure above, a factor of two is used, dividing the array into two smaller arrays. If the number of elements in the array is not an integer multiple of the factor, the last array will have fewer elements.
When partitioning multi-dimensional arrays, the dim option can be used to specify which dimension is partitioned. The following figure shows an example of partitioning different dimensions of a multi-dimensional array.
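For example, a hedged sketch partitioning the second dimension of a hypothetical coefficient array so that a full row of coefficients can be read in one cycle:

// Each row of 'coeffs' becomes individual registers, so all eight
// coefficients in a row are available in the same clock cycle.
int coeffs[8][8];
#pragma HLS ARRAY_PARTITION variable=coeffs complete dim=2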
Array Reshaping
Arrays can also be reshaped to increase the memory bandwidth. Reshaping takes different elements from a dimension in the original array, and combines them into a single wider element. Array reshaping is similar to array partitioning, but instead of partitioning into multiple arrays, it widens array elements. The following figure illustrates the concept of array reshaping.
To use array reshaping in the Vivado HLS tool, insert this in the hardware function source code:
#pragma HLS array_reshape variable=<variable> <block, cyclic, complete> factor=<int> dim=<int>
The options have the same meaning as the array partition pragma.
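For example, a hedged sketch reshaping a hypothetical array so that two elements share one wider word:

// 16 x 32-bit elements become 8 x 64-bit words in a single block RAM; with
// cyclic reshaping by a factor of two, each word holds a consecutive pair of
// original elements, so both can be read through one port access.
int data[16];
#pragma HLS ARRAY_RESHAPE variable=data cyclic factor=2 dim=1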
Data Flow Pipelining
The previously discussed optimization techniques are all "fine grain" parallelizing optimizations at the level of operators, such as multiplier, adder, and memory load/store operations. These techniques optimize the parallelism between these operators. Data flow pipelining on the other hand, exploits the "coarse grain" parallelism at the level of functions and loops. Data flow pipelining can increase the concurrency between functions and loops.
Function Data Flow Pipelining
The default behavior for a series of function calls in the Vivado HLS tool is to complete a function before starting the next function. In the following figure, part (A) shows the latency without function data flow pipelining. Assuming it takes eight cycles for the three functions to complete, the code requires eight cycles before a new input can be processed by func_A and also eight cycles before an output is written by func_C (assuming the output is written at the end of func_C).
An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming the execution of func_A takes three cycles, func_A can begin processing a new input every three clock cycles rather than waiting for all three functions to complete, resulting in increased throughput. The complete execution to produce an output then requires only five clock cycles, resulting in shorter overall latency.
The HLS tool implements function data flow pipelining by inserting "channels" between the functions. These channels are implemented as either ping-pong buffers or FIFOs, depending on the access patterns of the producer and the consumer of the data.
- If a function parameter (producer or consumer) is an array, the corresponding channel is implemented as a multi-buffer using standard memory accesses (with associated address and control signals).
- For scalar, pointer and reference parameters, as well as the function return, the channel is implemented as a FIFO, which uses less hardware resources (no address generation), but requires that the data is accessed sequentially.
To use function data flow pipelining, put #pragma HLS dataflow where the data flow optimization is desired. The following code snippet shows an example:
void top(a, b, c, d) {
#pragma HLS dataflow
func_A(a, b, i1);
func_B(c, i1, i2);
func_C(i2, d);
}
Loop Dataflow Pipelining
Data flow pipelining can also be applied to loops in a similar manner as it is applied to functions. It enables a sequence of loops, normally executed sequentially, to execute concurrently. Data flow pipelining should be applied to a function, loop, or region that contains either all functions or all loops: do not apply it to a scope that contains a mixture of loops and functions.
The following figure shows the advantages data flow pipelining can produce when applied to loops. Without data flow pipelining, loop N must execute and complete all iterations before loop M can begin. The same applies to the relationship between loops M and P.
In this example, it is eight cycles before loop N can start processing the next value and eight cycles before an output is written (assuming the output is written when loop P finishes).
With data flow pipelining, these loops can operate concurrently. An example execution with data flow pipelining is shown in part (B) of the figure above. Assuming loop M takes three cycles to execute, the code can accept new inputs every three cycles. Similarly, it can produce an output value every five cycles, using the same hardware resources. The Vivado HLS tool automatically inserts channels between the loops to ensure data can flow asynchronously from one loop to the next. As with function data flow pipelining, the channels between the loops are implemented either as multi-buffers or FIFOs.
To use loop data flow pipelining, put #pragma HLS dataflow where you want the data flow optimization, as in the sketch below.
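A minimal sketch, with hypothetical loops and array names:

void foo(int in[N], int out[N]) {
#pragma HLS DATAFLOW
    int tmp[N];
 Loop_N: for (int i = 0; i < N; i++)
        tmp[i] = in[i] * 2;
 Loop_M: for (int i = 0; i < N; i++)
        out[i] = tmp[i] + 1;
    // The tool inserts a channel (FIFO or ping-pong buffer) for 'tmp' so
    // Loop_M can begin consuming values before Loop_N completes.
}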
Hardware Function Interfacing
After defining which function needs acceleration, there are a few key items to ensure compilation is valid. The Vivado HLS tool data types (ap_int, ap_uint, ap_fixed, etc.) cannot be part of the function parameter list that the software part of the application calls. These data types are unique to the HLS tool and have no meaning outside of the intended tool and associated compiler.
For example, if the following function were written in the HLS tool, the parameter list would need to be adjusted to use standard C/C++ types, and the function body would have to handle moving the data between the generic types and the HLS tool types, as shown below:
void foo(ap_int<32> *a, ap_int<32> *b, ap_int<32> *c) { /* Function body */ } // the 32-bit width is illustrative
This needs to be modified to use standard types on the interface, with local variables carrying the HLS data types inside the function body:
void foo(int *a, int *b, int *c) {
    ap_int<32> local_a = *a; // widths illustrative; copy into HLS types locally
    ap_int<32> local_b = *b;
    ap_int<32> local_c = *c;
    // Remaining function body
}