Kernel Optimization
One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput vs. power consumption. The downside of creating custom logic is that the design needs to go through the traditional FPGA design flow.
The following guidelines help manage the design complexity and achieve the desired design goals.
Interface Attributes (Detailed Kernel Trace)
The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.
The following fields are the most important in respect to performance:
- Burst Length: Describes how many beats are sent within one transaction.
- Burst Size: Describes the number of bytes being transferred as part of one beat.
Given a burst length of 1 and just 4 bytes per beat, transferring any reasonable amount of data requires many individual AXI transactions. Therefore, small burst lengths, as well as burst sizes considerably smaller than 512 bits, are good opportunities for optimizing interface performance. The following sections show improved implementations:
Top Level Dataflow
Top-level dataflow optimization enables kernels to execute in a parallel processing architecture by overlapping multiple kernel executions. This optimization is especially useful for kernels with long latency and improves the performance of the overall application. Xilinx recommends enabling this optimization only for kernels that use dataflow at the top-level function, by applying the interface pragma ap_ctrl_chain to the return port (along with s_axilite).
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
The following example illustrates that both the ap_ctrl_chain and s_axilite options are required on the return port to enable top-level dataflow.
void N_stage_Adders(int *input, int *output, int incr, int size)
{
…
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
…
…
#pragma HLS dataflow
  read_input(input, streamArray[0], size);
  compute_loop: for (int i = 0 ; i < STAGES ; i++)
  {
#pragma HLS UNROLL
    adder(streamArray[i], streamArray[i+1], incr, size);
  }
  write_result(output, streamArray[STAGES], size);
}
Using Burst Data Transfers
Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller.
If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:
In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are implemented rather differently (shorter transaction time). If you hover over these transactions, you would see that the AXI interconnect has packed the 16 x 4-byte transactions into a single burst transaction of 1 x 64 bytes. This makes much more effective use of the AXI4 bandwidth. The next section focuses on this optimization technique in more detail.
Burst inference is heavily dependent on coding style and access pattern. To avoid potential modeling pitfalls, refer to the SDAccel Environment Programmers Guide (UG1277). However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:
void kernel(T in[1024], T out[1024]) {
  T tmpIn[1024];
  T tmpOut[1024];
  read(in, tmpIn);        // burst read from the AXI interface into local memory
  process(tmpIn, tmpOut); // computation operating only on local memories
  write(tmpOut, out);     // burst write from local memory to the AXI interface
}
In short, the function read is responsible for reading from the AXI input into an internal variable (tmpIn). The computation is implemented by the function process, working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output.
Isolating the read and write functions from the computation results in:
- Simple control structures (loops) in the read/write functions, which makes burst detection simpler (see the sketch after this list).
- Isolation of the computational function from the AXI interfaces, which simplifies potential kernel optimization. See the Kernel Optimization chapter for more information.
- Internal variables mapped to on-chip memory, which allows much faster access than AXI transactions. Acceleration platforms supported in the SDAccel environment can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
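As an illustration, a minimal sketch of such a burst-friendly read function, assuming the hypothetical read() helper and the placeholder type T from the snippet above, is simply a pipelined copy loop with a linear access pattern:
void read(T in[1024], T tmpIn[1024]) {
  read_loop: for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
    // Sequential, unconditional accesses allow the tool to infer burst transfers.
    tmpIn[i] = in[i];
  }
}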
Using AXI4 Data Width
The user data width between the kernel and the memory controller can be configured by the SDAccel environment compiler based on the data types of the kernel arguments. To maximize data throughput, Xilinx recommends that you choose data types that map to the full data width of the memory controller. The memory controller in all supported acceleration cards supports a 512-bit user interface, which can be mapped to OpenCL™ vector data types, such as int16, or the C/C++ arbitrary precision data type ap_int<512>.
As shown on the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit beat size (Burst Size 64 Bytes).
This example shows good interface configuration as it maximizes AXI data width and also shows actual burst transactions.
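For illustration only, a C/C++ kernel interface using the full 512-bit data width might be sketched as follows; the function and argument names are placeholders, and the interface pragmas reflect a typical m_axi/s_axilite setup:
#include "ap_int.h"

// Each ap_int<512> word carries 16 packed 32-bit values.
void wide_copy(ap_int<512> *in, ap_int<512> *out, int num_words) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=num_words bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
  for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II=1
    out[i] = in[i]; // one full 512-bit word transferred per loop iteration
  }
}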
Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.
OpenCL API Attributes
The OpenCL API provides attributes to support a more automatic approach to increasing the AXI data width usage. Changing the interface data types, as described above, is supported in the API as well, but requires the same code changes as in C/C++ to accommodate the larger input vector.
To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm. A detailed description can be found in the SDx Pragma Reference Guide (UG1253).
- vec_type_hint
- reqd_work_group_size
- xcl_zero_global_work_offset
The following example illustrates their combined functionality:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
size_t idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
In this case, the hard-coded interface is a 32-bit wide data path (int *c, int *a, int *b), which drastically limits the memory throughput if implemented directly. However, automatic widening and transformation is applied based on the values of the three attributes.
- __attribute__((vec_type_hint(int))): Declares that int is the main type used for computation and memory transfer (32-bit). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32 bits. This implies that, in theory, 16 values could be processed if vectorization can be applied.
- __attribute__((reqd_work_group_size(X, Y, Z))): Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, therefore defining the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.
The actual vectorization factor to be applied is the greatest common divisor of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.
The quotient of the maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.
One additional attribute is highly recommended for performance optimization of OpenCL interfaces:
- __attribute__((xcl_zero_global_work_offset)): Instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information regarding the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).
Note that applying these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually behaves well, but can interact poorly in rare situations.
For example:
- For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
- The resulting access requires many multiplexers and creates a difficult problem for the scheduler (it might severely increase memory usage and compilation time). Xilinx recommends that you use partitioning factors that are powers of two (as the vectorization factor is always a power of two).
- If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the II (initiation interval) is not met:
- This is not necessarily correlated with a loss of performance (usually the design still performs better) because the II is computed on the unrolled loop (which therefore has a multiplied throughput per iteration).
- The scheduler informs you of the possible resource constraints; resolving those will further improve the performance.
- Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in a non-vectorizable way).
Reducing Kernel-to-Kernel Communication Latency with OpenCL Pipes
This section specifically applies to OpenCL kernels. For C++ kernels, kernel-to-kernel streaming is provided. This is discussed in Memory Data Transfer Types.
The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency.
In the SDAccel development environment, pipes must be statically defined outside of all kernel functions; dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not currently supported. The depth of a pipe must be specified by using the xcl_reqd_pipe_depth attribute in the pipe declaration.
The valid depth values are as follows:
- 16
- 32
- 64
- 128
- 256
- 512
- 1024
- 2048
- 4096
- 8192
- 16384
- 32768
A given pipe can have one and only one producer and consumer in different kernels.
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
Pipes can be accessed using the standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. See the OpenCL C Specification, Version 2.0 from the Khronos Group for more details on these built-in functions.
The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.
int read_pipe_block (pipe gentype p, gentype *ptr)
int write_pipe_block (pipe gentype p, const gentype *ptr)
The following "Blocking Pipes Example" from the SDAccel Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));

// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
  __attribute__((xcl_pipeline_loop))
  mem_rd: for (int i = 0 ; i < size ; i++)
  {
    //blocking Write command to pipe P0
    write_pipe_block(p0, &input[i]);
  }
}

// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
  __attribute__((xcl_pipeline_loop))
  execute: for(int i = 0 ; i < size ; i++)
  {
    int input_data, output_data;
    //blocking read command to Pipe P0
    read_pipe_block(p0, &input_data);
    output_data = input_data + inc;
    //blocking write command to Pipe P1
    write_pipe_block(p1, &output_data);
  }
}

// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
  __attribute__((xcl_pipeline_loop))
  mem_wr: for (int i = 0 ; i < size ; i++)
  {
    //blocking read command to Pipe P1
    read_pipe_block(p1, &output[i]);
  }
}
The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.
Optimizing Computational Parallelism
By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. Conversely, the OpenCL API does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines, like FPGAs, allow more freedom to exploit computational parallelism.
Coding Data Parallelism
To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the SDAccel tool might not be able to apply the desired transformation due to the structure of the source code.
It is quite common that some computational parallelism might not be reflected in the source code to begin with. In this case, it needs to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in the Using Full AXI Data Width section. A 512-bit interface can be created using OpenCL vector data types such as int16 or the C/C++ arbitrary precision data type ap_int<512>. Refer to the Median Filter Example in the vision category at SDAccel Getting Started Examples on GitHub for the recommended way to use vectors.
Loop Parallelism
Loops are the basic C/C++/OpenCL™ API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:
for(int i = 0; i<255; i++) {
out[i] = in[i]+in[i+1];
}
out[255] = in[255];
This code iterates over an array of values and adds consecutive values, except for the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total latency of 510 cycles. This can be analyzed in detail in the Schedule Viewer in the HLS Project:
This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:
The key numbers here are the latency numbers and the total LUT usage. For example, depending on the configuration, you could get a latency of 511 and a total LUT usage of 47. As you will see, these values can vary widely based on the implementation choices. While this implementation requires very little area, it results in significant latency.
Unrolling Loops
Unrolling a loop instructs the tool to create independent hardware for the loop iterations so that they can execute in parallel. An OpenCL loop can be unrolled with the attribute __attribute__((opencl_unroll_hint)), while a C/C++ loop can use the UNROLL pragma: #pragma HLS UNROLL
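Applied to the loop from the previous section, a C/C++ sketch would look like this:
unroll_loop: for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
  // All 255 additions are instantiated as parallel hardware.
  out[i] = in[i] + in[i+1];
}
out[255] = in[255];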
With the unroll applied to this specific example, the Schedule Viewer in the HLS Project shows the following:
With an estimated performance of:
As you can see, the total latency was considerably improved to 127 cycles, and as expected, the computational hardware was increased to 4845 LUTs to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface to be used for the out variable. The SDAccel environment uses dual-port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Therefore, to see a fully parallel implementation, you must specify that the out variable is kept in registers, as shown in the following example:
#pragma HLS array_partition variable=out complete dim=0
For more information see the pragma HLS array_partition section in SDx Pragma Reference Guide.
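A sketch of how the pragma might be placed is shown below; the function name and the int element type are assumptions for illustration:
void adder(int in[256], int out[256]) {
  // Complete partitioning implements out as individual registers,
  // removing the two-writes-per-cycle memory bottleneck.
#pragma HLS array_partition variable=out complete dim=0
  unroll_loop: for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
    out[i] = in[i] + in[i+1];
  }
  out[255] = in[255];
}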
The results of this transformation can be observed in the following Schedule Viewer:
The associated estimates are:
As you can see, this code can be implemented as a combinatorial function requiring only a fraction of a clock cycle to complete.
Pipelining Loops
Pipelining loops allows you to overlap iterations of a loop in time. Allowing iterations to operate concurrently is often a good compromise, as resources can be shared between iterations (less resource usage), while requiring less execution time compared to loops that are not unrolled.
Pipelining is enabled in C/C++ through the following pragma:
#pragma HLS PIPELINE
While the OpenCL API uses the following attributes, for loops and work items respectively:
__attribute__((xcl_pipeline_loop))
__attribute__((xcl_pipeline_workitems))
More details to any of these specifications are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide.
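Applied to the same example loop, a C/C++ sketch would be:
pipeline_loop: for (int i = 0; i < 255; i++) {
#pragma HLS PIPELINE II=1
  // Iterations overlap: a new iteration can start while the previous
  // one is still completing.
  out[i] = in[i] + in[i+1];
}
out[255] = in[255];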
In this example, the Schedule Viewer in the HLS Project produces the following information:
With the overall estimates being:
Because each iteration of the loop consumes only two cycles of latency, there can only be a single iteration of overlap. This enables the total latency to be cut roughly in half compared to the original, resulting in 257 cycles of total latency. However, when compared to unrolling, this reduction in latency was achieved using fewer resources.
In most cases, loop pipelining by itself can improve overall performance. However, the effectiveness of the pipelining will depend on the structure of the loop. Some common limitations are:
- Resources with limited availability, such as memory ports or process channels, can limit the overlap of the iterations (initiation interval, or II).
- Similarly, loop-carried dependencies, such as those created by variables or conditions computed in one iteration affecting the next, might increase the initiation interval of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to eliminate these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array or breaking the dependencies altogether.
Task Parallelism
Task parallelism allows you to take advantage of data flow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks.
Look at the following example:
void run (ap_uint<16> in[1024],
          ap_uint<16> out[1024]
         ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}
When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given that the combined latency of processA and processB in the loop is 278, the total latency can be estimated as:
The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, task parallelism is achieved by adding the DATAFLOW pragma into the for-loop:
#pragma HLS DATAFLOW
For OpenCL API code, the equivalent transformation is performed with the following attribute:
__attribute__ ((xcl_dataflow))
Refer to SDx Pragma Reference Guide and SDAccel Environment Programmers Guide for more details regarding the specifics and limitations of these modifiers.
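For the example above, a sketch of the pragma placement inside the body of the for-loop (the code is unchanged except for the added pragma and comment):
void run (ap_uint<16> in[1024],
          ap_uint<16> out[1024]
         ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<8; i++) {
#pragma HLS DATAFLOW
    // processA and processB of successive iterations can now overlap,
    // communicating through automatically inserted ping-pong buffers.
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}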
As illustrated by the estimates in the HLS Report, applying the transformation considerably improves the overall performance by effectively using a double (ping-pong) buffering scheme between the tasks:
The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:
(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles
Using task parallelism is a very powerful way to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. The coding guidelines for applying DATAFLOW effectively are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide. However, to understand the final implementation of the DATAFLOW pragma, it is often necessary to actually look at the execution pattern of the individual tasks. Towards that end, the SDAccel environment provides the Detailed Kernel Trace, which nicely illustrates concurrent execution.
For this detailed kernel trace, the tool displays the start of the dataflowed loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until the completion of processA before it can start its first iteration. However, while processB completes the first iteration of the loop, processA already begins operating on the second iteration, and so forth.
A more abstract representation of this information is presented in the Application Timeline (Host & Device) and Device Hardware Transaction View (device-only during hardware emulation).
Optimizing Compute Units
Data Width
One aspect for performance is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL™ API code might only use large data types such as integers even at the ports of the design. However, as the algorithm gets mapped to a fully configurable implementation, smaller data types such as 10- or 12-bit might often suffice. Towards that end it is beneficial to check the size of basic operations in the HLS Synthesis report during optimization. In general, when the SDAccel environment maps an algorithm onto the FPGA, much processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the SDAccel environment generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
In the following figure, the basic operations and their bitwidth are reported.
Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions, and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time.
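As a hypothetical illustration, if an accumulation is known to fit in 14 bits, an arbitrary precision type can replace a full 32-bit int; the function name and widths below are assumptions:
#include "ap_int.h"

// Summing 16 10-bit samples needs at most 14 bits (16 * 1023 < 2^14),
// so a 14-bit adder tree is generated instead of 32-bit hardware.
ap_uint<14> accumulate(ap_uint<10> samples[16]) {
  ap_uint<14> acc = 0;
  for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
    acc += samples[i];
  }
  return acc;
}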
Fixed Point Arithmetic
Some applications use floating-point computation only because they were optimized for other hardware architectures. As explained in Deep Learning with INT8 Optimization on Xilinx Devices, using fixed-point arithmetic for applications like deep learning can significantly improve power efficiency and area while keeping the same level of accuracy.
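A minimal sketch of moving a computation from float to the HLS fixed-point type; the chosen bit widths and names are assumptions and must be validated against the accuracy requirements of the application:
#include "ap_fixed.h"

// ap_fixed<16,4>: 16 bits total, 4 integer bits, 12 fractional bits.
typedef ap_fixed<16,4> fixed_t;

fixed_t weighted_sum(const fixed_t data[8], const fixed_t coeff[8]) {
  fixed_t acc = 0;
  for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE II=1
    acc += data[i] * coeff[i];
  }
  return acc;
}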
Macro Operations
It is sometimes advantageous to think in terms of larger computational elements. The tool will operate on the source code independently of the remaining source code, effectively mapping the algorithm onto the FPGA without consideration of surrounding operations. When applied, the SDAccel tool keeps operational boundaries, effectively creating macro operations for specific code. This uses the following principles:
- Operational locality to the mapping process.
- Reduction in complexity for the heuristics.
In C/C++, a macro operation can be created by preventing function inlining:
#pragma HLS inline off
In the OpenCL API, the same kind of macro operation is obtained by not applying the following attribute when defining the function:
__attribute__((always_inline))
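A sketch of creating a macro operation in C/C++ by disabling inlining; the function name and body are placeholders:
// The inline-off pragma keeps filter_stage as a distinct operational unit
// instead of dissolving it into the surrounding code.
void filter_stage(int in[64], int out[64]) {
#pragma HLS inline off
  for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
    out[i] = (in[i] * 3) >> 2;
  }
}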
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and typically have better performance than the corresponding functions (without the native_ prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the SDAccel™ environment, these native_ built-in functions use the equivalent functions in the Vivado® HLS tool Math library, which are already optimized for Xilinx® FPGAs in terms of area and performance. Xilinx recommends using the native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
Optimizing Memory Architecture
A key aspect of implementation is memory architecture. Because of the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:
void run (ap_uint<16> in[256][4],
ap_uint<16> out[256]
) {
...
ap_uint<16> inMem[256][4];
ap_uint<16> outMem[256];
... Preprocess input to local memory
for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
ap_uint<16> sum = 0;
for( int i = 0; i<4; i++) {
sum += inMem[j][i];
}
outMem[j] = sum;
}
... Postprocess write local memory to output
}
This code adds the four values associated with the inner dimension of the two dimensional input array. If implemented without any additional modifications, it results in the following estimates:
The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This can be observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.
However, this improvement is largely due to the fact that this process uses both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project:
As you can see, two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result, as it completely blocks access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
This results in four array reads, all executed on different memories using a single port:
This uses a total of 256 * 4 cycles = 1024 cycles for loop 2.
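For reference, a sketch of the loop nest with the partitioning pragma applied; it mirrors the reshape version shown at the end of this section, with only the pragma changed:
void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
         ) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
  ... Preprocess input to local memory
  for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
#pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }
  ... Postprocess write local memory to output
}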
Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the following pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
This results in the same latency as the array partitioning, but with a single memory using a single port:
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making this the preferred solution. The complete code for this version is:
void run (ap_uint<16> in[256][4],
ap_uint<16> out[256]
) {
...
ap_uint<16> inMem[256][4];
ap_uint<16> outMem[256];
#pragma HLS array_reshape variable=inMem complete dim=2
... Preprocess input to local memory
for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
ap_uint<16> sum = 0;
for( int i = 0; i<4; i++) {
#pragma HLS UNROLL
sum += inMem[j][i];
}
outMem[j] = sum;
}
... Postprocess write local memory to output
}