Optimizing the Performance

Host Optimization

This section focuses on optimization of the host program, which uses the OpenCL™ API to schedule the individual compute unit executions, and data transfers to and from the FPGA board. For optimizing data transfer and compute calls, you need to think about concurrent execution of tasks through the OpenCL command queue(s). This section discusses common pitfalls, and how to recognize and address them.

Reducing Overhead of Kernel Enqueuing

The OpenCL-based execution model supports data parallel and task parallel programming models. An OpenCL host generally needs to call different kernels multiple times. These calls are enqueued in a command queue, either in a certain sequence or in an out-of-order command queue. Then, depending on the availability of compute resources and task data, they are scheduled for execution on the device.

Kernel calls can be enqueued for execution on a command queue using clEnqueueTask. The dispatching process is executed on the host processor. The dispatcher invokes kernel execution after transferring the kernel arguments to the accelerator running on the device. The dispatcher uses a low-level Xilinx® Runtime (XRT) library for transferring kernel arguments and issuing trigger commands for starting the compute. The overhead of dispatching the commands and arguments to the accelerator can be between 30 µs and 60 µs, depending on the number of arguments set for the kernel. You can reduce the impact of this overhead by minimizing the number of times the kernel needs to be executed, and minimizing calls to clEnqueueTask. Ideally, you should finish all the compute in a single call to clEnqueueTask.

You can minimize the calls to clEnqueueTask by batching your data and invoking the kernel one time, with a loop wrapped around the original implementation, to avoid the overhead of multiple enqueue calls. Batching can also improve data transfer performance between the host and the accelerator by transferring a few large data packets rather than many small ones. For more information on reducing the overhead of kernel execution, see Kernel Execution.

The following example shows a simple kernel with given work or data size to process.
#define SIZE 256
extern "C" {
    void add(int *a , int *b, int inc){
        int buff_a[SIZE];
        for(int i=0;i<SIZE;i++)
        {
            buff_a[i] = a[i];
        }
        for(int i=0;i<SIZE;i++)
        {
            b[i] = buff_a[i]+inc;
        }
    }
}
The following example shows the same simple kernel optimized to process batched data. Depending on the num_batches argument, the kernel can process multiple inputs of size 256 in a single call and avoid the overhead of multiple clEnqueueTask calls. The host application changes to allocate data and buffers in chunks of SIZE * num_batches, essentially batching the memory allocation and the transfer of data between the host and the device global memory.
#define SIZE 256
extern "C" {
    void add(int *a , int *b, int inc, int num_batches){
        int buff_a[SIZE];
        for(int j=0;j<num_batches;j++)
        {
            for(int i=0;i<SIZE;i++)
            {
                buff_a[i] = a[j*SIZE+i];
            }
            for(int i=0;i<SIZE;i++)
            {
                b[j*SIZE+i] = buff_a[i]+inc;
            }
        }
    }
}
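
The host-side changes are sketched below. This is a minimal sketch, not code from an example: it assumes the OpenCL C API, with context, queue, krnl_add, host_a, host_b, and err already set up, and a hypothetical num_batches of 64. The point is that the complete data set is allocated, transferred, and processed with a single clEnqueueTask call.

// Hypothetical host-side sketch: allocate and transfer the whole batch once,
// then launch the kernel with a single clEnqueueTask call.
int inc = 1;                                      // assumed increment value
int num_batches = 64;                             // assumption: 64 batches of SIZE elements
size_t bytes = sizeof(int) * SIZE * num_batches;  // batched allocation

cl_mem buffer_a = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, &err);
cl_mem buffer_b = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

clSetKernelArg(krnl_add, 0, sizeof(cl_mem), &buffer_a);
clSetKernelArg(krnl_add, 1, sizeof(cl_mem), &buffer_b);
clSetKernelArg(krnl_add, 2, sizeof(int),    &inc);
clSetKernelArg(krnl_add, 3, sizeof(int),    &num_batches);

// One write, one kernel dispatch, and one read cover the complete data set.
clEnqueueWriteBuffer(queue, buffer_a, CL_TRUE, 0, bytes, host_a, 0, NULL, NULL);
clEnqueueTask(queue, krnl_add, 0, NULL, NULL);
clEnqueueReadBuffer(queue, buffer_b, CL_TRUE, 0, bytes, host_b, 0, NULL, NULL);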

Optimizing Data Movement

Figure 1: Optimizing Data Movement Flow

In the OpenCL execution model, all data is transferred from the host main memory to the global device memory first, and then from the global device memory to the kernel for computation. The computation results are written back from the kernel to the global device memory, and lastly from the global device memory to the host main memory. A key factor in determining strategies for kernel data movement optimization is understanding how data can be moved efficiently between the different levels of memory, maximizing the use of the available bandwidth on all the memory interfaces.

Note: Optimize the data movement in the application before optimizing computation.

During data movement optimization, it is important to isolate data transfer code from computation code because inefficiency in computation might cause stalls in data movement. You should focus on modifying the data transfer logic in the host and kernel code during this optimization step. The goal is to maximize the system level data throughput by maximizing data transfer bandwidth and device global memory bandwidth usage. It usually takes multiple iterations of running software emulation, hardware emulation, as well as execution in hardware to achieve optimum performance.

Overlapping Data Transfers with Kernel Computation

Applications, such as database analytics, have a much larger data set than can be stored in the available global device memory on the acceleration device. They require the complete data to be transferred and processed in blocks. Techniques that overlap the data transfers with the computation are critical to achieve high performance for these applications.

An example can be found in the vadd kernel from the overlap example in the host category of Vitis Accelerated Examples on GitHub. This example demonstrates techniques to overlap host (CPU) and FPGA computation in the application. In this example, the kernel processes two arrays by adding them together and writing the result to the output. From the host perspective, there are four tasks to perform in this example:

  1. Write buffer a (Wa)
  2. Write buffer b (Wb)
  3. Execute vadd kernel
  4. Read buffer c (Rc)

Using a simple in-order command queue without data transfer optimization, the overall execution timeline trace should look similar to the one shown below:

Figure 2: Host View of Tasks

Using an out-of-order command queue, data transfer and kernel execution can overlap as illustrated in the figure below. In the host code for this example, double buffering is used for all buffers so that the kernel can process one set of buffers while the host can operate on the other set of buffers.

The OpenCL event object provides an easy method to set up complex operation dependencies and synchronize host threads and device operations. Events are OpenCL objects that track the status of operations. Event objects are created by kernel execution commands, read, write, and copy commands on memory objects, or user events created using clCreateUserEvent.

You can ensure an operation has completed by querying the events returned by these commands. The arrows in the figure below show how event triggering can be set up to achieve optimal performance.

Figure 3: Event Triggering Setup

In the example, the host code (host.cpp) enqueues the four tasks in a loop to process the complete data set. It also sets up event synchronization between the different tasks to ensure that data dependencies are met for each task. The double buffering is set up by passing different memory object values to the clEnqueueMigrateMemObjects API. The event synchronization is achieved by having each API call wait on the events of other calls, as well as trigger its own event when the call completes.
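
The following sketch shows one way such event chaining might look for a single loop iteration, using the buffer set selected by flag (0 or 1) for double buffering. It is a simplified illustration, not code from the example: the out-of-order queue and the buffer_a/buffer_b/buffer_c pairs and krnl_vadd are assumed to be created already.

// Hypothetical sketch of one iteration with double buffering (flag = i % 2).
cl_event write_ev, kernel_ev, read_ev;

// Wa/Wb: migrate the two input buffers of the current set to the device.
cl_mem inputs[2] = {buffer_a[flag], buffer_b[flag]};
clEnqueueMigrateMemObjects(queue, 2, inputs, 0 /* to device */, 0, NULL, &write_ev);

// Execute vadd only after the inputs of this set have been transferred.
clSetKernelArg(krnl_vadd, 0, sizeof(cl_mem), &buffer_a[flag]);
clSetKernelArg(krnl_vadd, 1, sizeof(cl_mem), &buffer_b[flag]);
clSetKernelArg(krnl_vadd, 2, sizeof(cl_mem), &buffer_c[flag]);
clEnqueueTask(queue, krnl_vadd, 1, &write_ev, &kernel_ev);

// Rc: migrate the result back to the host once the kernel has completed.
clEnqueueMigrateMemObjects(queue, 1, &buffer_c[flag], CL_MIGRATE_MEM_OBJECT_HOST,
                           1, &kernel_ev, &read_ev);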

The Application Timeline view below clearly shows that the data transfer time is completely hidden, while the compute unit vadd_1 is running constantly.

Figure 4: Data Transfer Time Hidden in Application Timeline View

Buffer Memory Segmentation

Allocation and deallocation of memory buffers can lead to memory segmentation in the DDR controllers. This might result in sub-optimal performance of compute units, even if they could theoretically execute in parallel.

This issue occurs most often when multiple pthreads for different compute units are used and the threads allocate and release many device buffers with different sizes every time they enqueue the kernels. In this case, the timeline trace will exhibit gaps between kernel executions and it might seem the processes are sleeping.

Each buffer allocated by the runtime must be contiguous in the device memory. When many buffers of different sizes are allocated and deallocated, the runtime might have to wait for a sufficiently large contiguous region of memory to become free before it can allocate a large buffer. This can be resolved by allocating the device buffers up front and reusing them between different enqueues of a kernel.
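
A minimal sketch of this buffer-reuse pattern is shown below. It assumes the OpenCL C API, with context, queue, kernel, and the host data arrays already set up, and with max_bytes sized for the largest chunk; the device buffers are created once and reused for every enqueue.

// Hypothetical sketch: allocate the device buffers once and reuse them,
// instead of creating and releasing buffers on every kernel enqueue.
cl_mem in_buf  = clCreateBuffer(context, CL_MEM_READ_ONLY,  max_bytes, NULL, &err);
cl_mem out_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY, max_bytes, NULL, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);

for (int iter = 0; iter < num_iterations; iter++) {
    clEnqueueWriteBuffer(queue, in_buf, CL_TRUE, 0, chunk_bytes, host_in[iter], 0, NULL, NULL);
    clEnqueueTask(queue, kernel, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, chunk_bytes, host_out[iter], 0, NULL, NULL);
}

// Release the buffers only after all enqueues have finished.
clReleaseMemObject(in_buf);
clReleaseMemObject(out_buf);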

Compute Unit Scheduling

Scheduling kernel operations is key to overall system performance. This becomes even more important when implementing multiple compute units (of the same kernel or of different kernels). This section examines the different command queues responsible for scheduling the kernels.

Multiple In-Order Command Queues

The following figure shows an example with two in-order command queues, CQ0 and CQ1. The scheduler dispatches commands from each queue in order, but commands from CQ0 and CQ1 can be pulled out by the scheduler in any order. You must manage synchronization between CQ0 and CQ1 if required.

Figure 5: Example with Two In-Order Command Queues

The following code, adapted from host.cpp of the concurrent_kernel_execution_c example, sets up two in-order command queues and enqueues commands into each queue:

    OCL_CHECK(err,
              cl::CommandQueue ordered_queue1(
                  context, device, CL_QUEUE_PROFILING_ENABLE, &err));
    OCL_CHECK(err,
              cl::CommandQueue ordered_queue2(
                  context, device, CL_QUEUE_PROFILING_ENABLE, &err));
...
    printf("[Ordered Queue 1]: Enqueueing scale kernel\n");
    OCL_CHECK(err,
              err = ordered_queue1.enqueueTask(
                  kernel_mscale, nullptr, &kernel_events[0]));
    set_callback(kernel_events[0], "scale");
...
    // The addition kernel depends on the scale kernel, so both are submitted
    // to the same in-order queue, which serializes them automatically.
    printf("[Ordered Queue 1]: Enqueueing addition kernel\n");
    OCL_CHECK(err,
              err = ordered_queue1.enqueueTask(
                  kernel_madd, nullptr, &kernel_events[1]));
    set_callback(kernel_events[1], "addition");
...
    // The matrix multiplication kernel is independent of the previous calls,
    // so it is submitted to the second in-order queue and can be scheduled
    // in parallel with the scale and addition kernels.
    printf("[Ordered Queue 2]: Enqueueing matrix multiplication kernel\n");
    OCL_CHECK(err,
              err = ordered_queue2.enqueueTask(
                  kernel_mmult, nullptr, &kernel_events[2]));
    set_callback(kernel_events[2], "matrix multiplication");

Single Out-of-Order Command Queue

The following figure shows an example with a single out-of-order command queue. The scheduler can dispatch commands from the queue in any order. You must manually define event dependencies and synchronizations as required.

Figure 6: Example with Single Out-of-Order Command Queue

The following is code extracted from host.cpp of the concurrent_kernel_execution_c example that sets up a single out-of-order command queue and enqueues commands as needed:

    OCL_CHECK(
        err,
        cl::CommandQueue ooo_queue(context,
                                   device,
                                   CL_QUEUE_PROFILING_ENABLE |
                                       CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,
                                   &err));
...
    printf("[OOO Queue]: Enqueueing scale kernel\n");
    OCL_CHECK(
        err,
        err = ooo_queue.enqueueTask(
            kernel_mscale,nullptr, &ooo_events[0]));
    set_callback(ooo_events[0], "scale");
...
    // This is an out of order queue, events can be executed in any order. Since
    // this call depends on the results of the previous call we must pass the
    // event object from the previous call to this kernel's event wait list.
    printf("[OOO Queue]: Enqueueing addition kernel (Depends on scale)\n");
    kernel_wait_events.resize(0);
    kernel_wait_events.push_back(ooo_events[0]);
    OCL_CHECK(err,
              err = ooo_queue.enqueueTask(
                  kernel_madd,
                  &kernel_wait_events, // Event from previous call
                  &ooo_events[1]));
    set_callback(ooo_events[1], "addition");
    // This call does not depend on previous calls so we are passing nullptr
    // into the event wait list. The runtime should schedule this kernel in
    // parallel to the previous calls.
    printf("[OOO Queue]: Enqueueing matrix multiplication kernel\n");
    OCL_CHECK(err,
              err = ooo_queue.enqueueTask(
                  kernel_mmult,
                  nullptr,
                  &ooo_events[2]));
    set_callback(ooo_events[2], "matrix multiplication");

The Application Timeline view shows that the compute unit mmult_1 is running in parallel with the compute units mscale_1 and madd_1, using both multiple in-order queues and single out-of-order queue methods.

Figure 7: Application Timeline View Showing mult_1 Running with mscale_1 and madd_1

Kernel Optimization

One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput versus power consumption. The following guidelines help manage the design complexity and achieve the desired design goals.

Optimizing Kernel Computation

Figure 8: Optimizing Kernel Computation Flow

The goal of kernel optimization is to create processing logic that can consume all the data as soon as it arrives at the kernel interfaces. The key metric is the initiation interval (II), that is, the number of clock cycles before the kernel can accept new input data. Optimizing the II is generally achieved by expanding the processing code to match the data path, with techniques such as function pipelining, loop unrolling, array partitioning, dataflow, etc.

Interface Attributes (Detailed Kernel Trace)

The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.

Figure 9: Accelerated Algorithm Kernel Trace

Most interesting with respect to performance are the fields:

Burst Length
Describes how many packages are sent within one transaction.
Burst Size
Describes the number of bytes being transferred as part of one package.

Given a burst length of 1 and just 4 bytes per package, it will require many individual AXI transactions to transfer any reasonable amount of data.

Note: The Vitis core development kit never creates burst sizes less than 4 bytes, even if smaller data is transmitted. In this case, if consecutive items are accessed without AXI bursts enabled, it is possible to observe multiple AXI reads to the same address.

Small burst lengths, as well as burst sizes considerably less than 512 bits, are therefore good opportunities to optimize interface performance.

Using Burst Data Transfers

Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller. Also, check the HLS report for bursting information.

Note: Burst transfers are inferred from successive requests for data at consecutive address locations. Refer to Reading and Writing by Burst for more details.

If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:

Figure 10: Burst Data Transfer with Detailed Kernel Trace

In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are implemented rather differently (shorter transaction time). Hovering over these transactions, you can see that the AXI interconnect has packed the 16 x 4-byte transactions into a single 1 x 64-byte transaction. This uses the AXI4 bandwidth even more effectively. The next section focuses on this optimization technique in more detail.

Burst inference is heavily dependent on coding style and access pattern. However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:

void kernel(T in[1024], T out[1024]) {
    T tmpIn[1024];
    T tmpOut[1024];
    read(in, tmpIn);          // burst read from the AXI input into a local buffer
    process(tmpIn, tmpOut);   // computation operates only on local buffers
    write(tmpOut, out);       // burst write from the local buffer to the AXI output
}

In short, the function read is responsible for reading from the AXI input to an internal variable (tmpIn). The computation is implemented by the function process working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes to the AXI output. For more information on burst, see the Vitis High-Level Synthesis User Guide (UG1399).

The isolation of the read and write functions from the computation results in:

  • Simple control structures (loops) in the read/write functions, which makes burst detection simpler.
  • The isolation of the computational function from the AXI interfaces, which simplifies potential kernel optimization. See Kernel Optimization for more information.
  • The internal variables being mapped to on-chip memory, which allows faster access compared to AXI transactions. Acceleration platforms supported in the Vitis core development kit can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
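
For illustration, the read, process, and write functions from the snippet above might be written as follows. This is a hedged sketch: T is a placeholder type and the computation in process is arbitrary. Each data transfer function contains only a simple pipelined loop over consecutive addresses, which is the pattern that eases burst inference.

typedef int T;   // placeholder element type

static void read(const T *in, T tmpIn[1024]) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        tmpIn[i] = in[i];     // sequential reads from consecutive addresses
    }
}

static void process(const T tmpIn[1024], T tmpOut[1024]) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        tmpOut[i] = tmpIn[i] + 1;   // placeholder computation on local buffers
    }
}

static void write(const T tmpOut[1024], T *out) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
        out[i] = tmpOut[i];   // sequential writes to consecutive addresses
    }
}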

Using Full AXI Data Width

The user data width between the kernel and the memory controller can be configured by the Vitis compiler based on the data types of the kernel arguments. To maximize the data throughput, Xilinx recommends that you choose data types that map to the full data width of the memory controller. The memory controller in all supported acceleration cards supports a 512-bit user interface, which can be mapped to the C/C++ arbitrary precision data type ap_int<512> or to OpenCL vector data types such as int16.
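
The following is a minimal sketch (not taken from an example) of a kernel interface that uses the full 512-bit data width; the kernel name and arguments are placeholders. Each access moves a 64-byte word, so the interface matches the memory controller width.

#include <ap_int.h>

extern "C" {
// Hypothetical kernel: 512-bit wide pointer arguments map to the full
// 512-bit user interface of the memory controller.
void wide_copy(const ap_uint<512> *in, ap_uint<512> *out, int num_words) {
    for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];   // one 64-byte word per access
    }
}
}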

As described in Memory Interface Width Considerations, the default is for Vitis HLS to automatically re-size the kernel interface ports up to 512 bits to improve burst access. As shown in the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit package size (Burst Size 64 bytes).

Figure 11: Burst AXI Transactions

This example shows a good interface configuration because it maximizes the AXI data width as well as the use of burst transactions.

Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.

Note: Use simple structs for kernel arguments that can be packed to a 32-bit boundary. Refer to the Custom Data Type Example in the kernel_to_gmem category at Xilinx Getting Started Example on GitHub for the recommended method of using structs.

Optimizing Kernel to Kernel Communication

Support for hardware accelerator pipelines that communicate through streams is one of the major advantages of FPGAs and FPGA-based SoCs, and such pipelines have long been used in DSP and image processing applications, as well as in communication systems. As described in Streaming Data Transfers between Kernels (K2K), AXI4-Stream interfaces can be used to stream data from one kernel to another without having to use the external memory, which greatly improves the overall system latency.

Kernel ports involved in streaming are defined within the kernel, and are not addressed by the host program. There is no need to send data back to global memory before it is forwarded for processing. The connections between the kernels are directly defined during the v++ linking process as described in Specifying Streaming Connections between Compute Units.
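
The following sketch (not from a shipped example) shows what a pair of streaming kernels might look like, with hypothetical names producer and consumer and a 32-bit AXI4-Stream payload. The producer writes to an axis port, the consumer reads from one, and the two ports would then be connected at v++ link time, for example with a line such as sc=producer_1.out_stream:consumer_1.in_stream in the [connectivity] section of the configuration file.

#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<32, 0, 0, 0> pkt_t;   // 32-bit data beat on the AXI4-Stream

extern "C" {
// Hypothetical producer: reads from global memory and streams the data out.
void producer(const int *in, hls::stream<pkt_t> &out_stream, int size) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem0
#pragma HLS INTERFACE axis port=out_stream
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        pkt_t pkt;
        pkt.data = in[i];
        out_stream.write(pkt);
    }
}

// Hypothetical consumer: reads the stream and writes the results to global memory.
void consumer(hls::stream<pkt_t> &in_stream, int *out, int size) {
#pragma HLS INTERFACE axis port=in_stream
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem1
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in_stream.read().data;
    }
}
}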

Optimizing Memory Architecture

Memory architecture is a key aspect of implementation. Due to the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:


void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
         ) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];

  ... Preprocess input to local memory
  
  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {

      sum += inMem[j][i];
    }
    outMem[j] = sum;
  } 

  ... Postprocess write local memory to output
}

This code adds the four values associated with the inner dimension of the two-dimensional input array. If implemented without any additional modifications, it results in the following estimates:

Figure 12: Performance Estimates

The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This is observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.

Figure 13: Performance Estimates

However, this improvement is largely because of the process using both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project:

Figure 14: Schedule Viewer

Two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesirable result because it completely blocks access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:

#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2

For more information, see pragma HLS array_partition.

This results in four array reads, all executed on different memories using a single port:

Figure 15: Executed Four Arrays Results

This uses a total of 256 * 4 = 1024 cycles for loop 2.

Figure 16: Performance Estimates

Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the pragma:

#pragma HLS array_reshape variable=inMem complete dim=2

For more information, see pragma HLS array_reshape.

This results in the same latency as the array partitioning, but with a single memory using a single port:

Figure 17: Latency Result

Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making it the preferred solution.

Note: This completes the array optimization; in a real design, the latency could be further improved by employing loop parallelism (see Loop Parallelism). The complete code with the array reshape pragma applied is shown below:

void run (ap_uint<16> in[256][4],
	  ap_uint<16> out[256]
	  ) {
  ...

  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
  #pragma HLS array_reshape variable=inMem complete dim=2
  
  ... Preprocess input to local memory
  
  for( int j=0; j<256; j++) {
    #pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
      #pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  } 

  ... Postprocess write local memory to output

}

Optimizing Computational Parallelism

By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. However, fully configurable computational engines like FPGAs allow more freedom to exploit computational parallelism.

Coding Data Parallelism

To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize the computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the Vitis technology might not be able to apply the desired transformation due to the structure of the source code.

It is quite common that some computational parallelism might not be reflected in the source code to begin with. In this case, it needs to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in Task Parallelism.

A 512-bit interface can be created using OpenCL vector data types such as int16 or the C/C++ arbitrary precision data type ap_int<512>. These vector types can also be used as a powerful way to model data parallelism within a kernel, with up to 16 data paths operating in parallel in the case of int16. Refer to the Median Filter Example in the vision category at Xilinx Getting Started Example on GitHub for the recommended method of using vectors.
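
As an illustration, the sketch below (hypothetical names, using C/C++ arbitrary precision types rather than OpenCL vectors) models 16-lane data parallelism: every 512-bit word carries 16 x 32-bit values, and the unrolled inner loop processes all 16 lanes in the same clock cycle.

#include <ap_int.h>

extern "C" {
// Hypothetical kernel: add two streams of 512-bit words, 16 lanes at a time.
void vadd16(const ap_uint<512> *a, const ap_uint<512> *b, ap_uint<512> *c, int num_words) {
    for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<512> wa = a[i], wb = b[i], wc;
        for (int lane = 0; lane < 16; lane++) {
#pragma HLS UNROLL
            ap_uint<32> va = wa.range(32 * lane + 31, 32 * lane);
            ap_uint<32> vb = wb.range(32 * lane + 31, 32 * lane);
            wc.range(32 * lane + 31, 32 * lane) = va + vb;   // one adder per lane
        }
        c[i] = wc;
    }
}
}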

Loop Parallelism

Loops are the basic C/C++/OpenCL API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:

  for(int i = 0; i<255; i++) {
    out[i] = in[i]+in[i+1];
  }
  out[255] = in[255];

This code iterates over an array of values and adds consecutive values, except for the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total of 510 cycles for the loop. This can be analyzed in detail through the Schedule Viewer in the HLS Project:

Figure 18: Implemented Loop Structure in Schedule Viewer

This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:

Figure 19: Synthesis Results Performance Estimates

The key numbers here are the latency and the total LUT usage. For example, depending on the configuration, you could get a latency of 511 and a total LUT usage of 47; these values vary based on implementation choices. While this implementation requires very little area, it results in significant latency.

Unrolling Loops

Unrolling a loop enables the full parallelism of the model to be used. To do this, mark a loop to be unrolled, and the tool creates an implementation with the most parallelism possible. An OpenCL loop can be marked with the UNROLL attribute:

__attribute__((opencl_unroll_hint))

Or a C/C++ loop can use the unroll pragma:

#pragma HLS UNROLL

For more information, see Loop Unrolling.
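
For example, applied to the C/C++ form of the loop shown earlier, the pragma is placed inside the loop body:

  for(int i = 0; i<255; i++) {
  #pragma HLS UNROLL
    out[i] = in[i]+in[i+1];
  }
  out[255] = in[255];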

When applied to this specific example, the Schedule Viewer in the HLS Project will be:

Figure 20: Schedule Viewer

The following figure shows the estimated performance:

Figure 21: Performance Estimates

The total latency is considerably improved to 127 cycles, and as expected, the computational hardware increases to 4845 LUTs to perform the same computation in parallel.

However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface used for the variable out. The Vitis core development kit uses dual-port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Thus, to see a fully parallel implementation, you must specify that the variable out should be kept in registers, as in this example:

#pragma HLS array_partition variable=out complete dim=0

For more information, see pragma HLS array_partition.

The results of this transformation can be observed in the following Schedule Viewer:

Figure 22: Transformation Results in Schedule Viewer

The associated estimates are:

Figure 23: Transformation Results Performance Estimates

Accordingly, this code can be implemented as a combinatorial function requiring only a fraction of a clock cycle to complete.

Pipelining Loops

Pipelining loops allows you to overlap iterations of a loop in time, as discussed in Loop Pipelining. Allowing loop iterations to operate concurrently is often a good approach, as resources can be shared between iterations (lower resource utilization than unrolling), while requiring less execution time than a loop whose iterations execute sequentially.

Pipelining is enabled in C/C++ through the pragma HLS pipeline:

#pragma HLS PIPELINE

While the OpenCL API uses the xcl_pipeline_loop attribute:

__attribute__((xcl_pipeline_loop))
Note: The OpenCL API has an additional method of specifying loop pipelining, see xcl_pipeline_workitems. The reason is that work-item loops are not explicitly stated, and pipelining these loops requires this attribute:
__attribute__((xcl_pipeline_workitems))
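
Applied to the C/C++ form of the earlier loop, the pipeline pragma is again placed inside the loop body:

  for(int i = 0; i<255; i++) {
  #pragma HLS PIPELINE
    out[i] = in[i]+in[i+1];
  }
  out[255] = in[255];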

In this example, the Schedule Viewer in the HLS Project produces the following information:

Figure 24: Pipelining Loops in Schedule Viewer

With the overall estimates being:

Figure 25: Performance Estimates

Because each iteration of the loop consumes only two cycles of latency, there can only be a single iteration of overlap. This cuts the total latency roughly in half compared to the original, resulting in 257 cycles. However, this reduction in latency is achieved using fewer resources than unrolling.

In most cases, loop pipelining by itself can improve overall performance. Yet, the effectiveness of the pipelining depends on the structure of the loop. Some common limitations are:

  • Resources with limited availability such as memory ports or process channels can limit the overlap of the iterations (Initiation Interval).
  • Loop-carried dependencies, such as those created when a variable or condition computed in one iteration affects the next, might increase the II of the pipeline.

These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to remove these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array or by breaking the dependencies altogether.

Task Parallelism

Task parallelism allows you to take advantage of dataflow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel, taking advantage of extra buffering introduced between the tasks.

See the following example:

void run (ap_uint<16> in[128*128],
	  ap_uint<16> out[128*128]
	  ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<128; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}

When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given the combined latency of processA and processB, the loop latency is 278, and the total latency can be estimated as:

Figure 26: Performance Estimates

The extra cycle is due to loop setup and can be observed in the Schedule Viewer.

For C/C++ code, task parallelism is performed by adding the DATAFLOW pragma into the for-loop:

#pragma HLS DATAFLOW

For OpenCL API code, add the attribute before the for-loop:

__attribute__ ((xcl_dataflow))

Refer to Dataflow Optimization, HLS Pragmas, and OpenCL Attributes for more details on this topic.

As illustrated by the estimates in the HLS report, applying the transformation considerably improves the overall performance by effectively using a double (ping-pong) buffer scheme between the tasks:

Figure 27: Performance Estimates

The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:

(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles 

Using task parallelism is a powerful method to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific piece of code can vary greatly. It is often necessary to look at the execution pattern of the individual tasks to understand the final implementation of the DATAFLOW pragma. Finally, the Vitis core development kit provides the Detailed Kernel Trace, which illustrates concurrent execution.

Figure 28: Detailed Kernel Trace

For this Detailed Kernel Trace, the tool displays the start of the dataflow loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until the completion of processA before it can start its first iteration. However, while processB is completing the first iteration of the loop, processA already begins operating on the second iteration, and so on.

A more abstract representation of this information is presented in Application Timeline for the host and device activity.

Optimizing Device Resources

Data Width

One of the most important aspects for performance, if not the most important, is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL code might only use large data types such as integers, even at the ports of the design. However, as the algorithm is mapped to a fully configurable implementation, smaller data types such as 10-/12-bit might often suffice. It is beneficial to check the size of basic operations in the HLS Synthesis report during optimization.

In general, when the Vitis core development kit maps an algorithm onto the FPGA, more processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the Vitis core development kit generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.

In the following figure, the basic operations and their bit-width are reported.

Figure 29: Operations Utilization Estimates

Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time.

Fixed Point Arithmetic

Some applications use floating-point computation only because they were optimized for another hardware architecture. Using fixed point arithmetic for applications like deep learning can significantly improve power efficiency and reduce area while keeping the same level of accuracy.

Note: Xilinx recommends exploring fixed point arithmetic for your application before committing to using floating point operations.
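
As a minimal sketch of what this looks like in C/C++ (the bit widths here are assumptions that must be tuned against the accuracy requirements of the application), a float multiply-accumulate might be replaced with an ap_fixed type:

#include <ap_fixed.h>

// Hypothetical example: 16 bits total, 6 integer bits, replacing float.
typedef ap_fixed<16, 6> data_t;

data_t mac8(const data_t a[8], const data_t b[8]) {
    data_t acc = 0;
    for (int i = 0; i < 8; i++) {
#pragma HLS UNROLL
        acc += a[i] * b[i];   // fixed-point multiply-accumulate
    }
    return acc;
}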

Macro Operations

It is sometimes advantageous to think in terms of larger computational elements. The tool will then operate on such a block of code independently of the remaining source code, effectively mapping it onto the FPGA without consideration of the surrounding operations. When applied, the Vitis technology keeps the operational boundaries, effectively creating macro operations for the specific code. This uses the following principles:

  • Operational locality to the mapping process
  • Reduction in complexity for the heuristics

This might create vastly different results when applied. In C/C++, macro operations are created with the help of #pragma HLS inline off. While in the OpenCL API, the same kind of macro operation can be generated by not specifying the following attribute when defining a function:

__attribute__((always_inline))

For more information, see pragma HLS inline.
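
As a small, hypothetical C/C++ sketch, the pragma keeps the following function as its own operational boundary (a macro operation) instead of dissolving it into the surrounding logic:

// Hypothetical macro operation: kept as a unit by disabling inlining.
int mac_unit(int a, int b, int acc) {
#pragma HLS INLINE off
    return acc + a * b;
}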

Using Optimized Libraries

The OpenCL specification provides many built-in math functions. All built-in math functions with the native_ prefix are mapped to one or more native device instructions and typically have better performance than the corresponding functions without the native_ prefix. The accuracy, and in some cases the input ranges, of these functions are implementation-defined. In the Vitis technology, these native_ built-in functions use the equivalent functions in the Vitis HLS tool Math library, which are already optimized for Xilinx FPGAs in terms of area and performance.

Note: Xilinx recommends using native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirement.

Exploring Kernel Optimizations Using Vitis HLS

All kernel optimizations using OpenCL or C/C++ can be performed from within the Vitis core development kit. The primary performance optimizations, such as those discussed in this section (pipelining function and loops, applying dataflow to enable greater concurrency between functions and loops, unrolling loops, etc.), are performed by the Vitis HLS tool.

The Vitis core development kit automatically calls the HLS tool. However, to use the GUI analysis capabilities, you must launch the HLS tool directly from within the Vitis technology. Using the HLS tool in standalone mode, as discussed in Compiling Kernels with the Vitis HLS, enables the following enhancements to the optimization methodology:

  • The ability to focus solely on the kernel optimization, because there is no requirement to execute emulation.
  • The ability to create multiple solutions, compare their results, and explore the solution space to find the optimal design.
  • The ability to use the interactive Analysis Perspective to analyze the design performance.
IMPORTANT: Only the kernel source code is incorporated back into the Vitis core development kit. After exploring the optimization space, ensure that all optimizations are applied to the kernel source code as OpenCL attributes or C/C++ pragmas.

To open the HLS tool in standalone mode, from the Assistant window, right-click the hardware function object, and select Open HLS Project, as shown in the following figure.

Figure 30: Open HLS Project

Topological Optimization

This section focuses on the topological optimization. It looks at the attributes related to the rough layout and implementation of multiple compute units and their impact on performance.

Multiple Compute Units

Depending on available resources on the target device, multiple compute units of the same kernel (or different kernels) can be created to run in parallel, which improves the system processing time and throughput. For more details, see Creating Multiple Instances of a Kernel.

Using Multiple DDR Banks

Acceleration cards supported in the Vitis technology provide one, two, or four DDR banks, and up to 80 GB/s of raw DDR bandwidth. For kernels moving large amounts of data between the FPGA and the DDR, Xilinx® recommends that you direct the Vitis compiler and runtime library to use multiple DDR banks.

In addition to DDR banks, the host application can access PLRAM to transfer data directly to a kernel. This feature is enabled using the connectivity.sp option in a configuration file specified with the v++ --config option. Refer to Mapping Kernel Ports to Memory for more information on implementing this optimization, and to Memory Mapped Interfaces for information on data transfers to the global memory banks.

To take advantage of multiple DDR banks, you need to assign CL memory buffers to different banks in the host code, as well as configure the xclbin file to match the bank assignment on the v++ command line.

The following block diagram shows the Global Memory Two Banks (C) example in Vitis Examples on GitHub. This example connects the input pointer interface of the kernel to DDR bank 0, and the output pointer interface to DDR bank 1.

Figure 31: Global Memory Two Banks Example

Assigning DDR Bank in Host Code

IMPORTANT: This is optional and only needed in specific cases as described below.

During the Vitis tool flow, the kernel port to memory bank connectivity can be established using the --connectivity.sp switch as described in Mapping Kernel Ports to Memory. The xclbin generated by v++ contains the information about the kernel port to memory connectivity so that XRT can allocate buffers appropriately. When a buffer is created in the host code, XRT automatically assigns the buffer to memory from the kernel xclbin, and manages the buffers internally. If a single kernel port is connected to multiple memory banks, XRT always starts from the lower numbered bank.

In most cases, this approach is sufficient. However, in some specific cases you might need to manually assign the buffer location (or special property) in the host code. For this purpose, the Xilinx OpenCL vendor extension provides a buffer extension called CL_MEM_EXT_PTR_XILINX to specifically manage bank assignment in the host code. The following code example shows the required header file and code for assigning input and output buffers to DDR bank 0 and bank 1:

#include <CL/cl_ext.h>
…
int main(int argc, char** argv) 
{
…
    cl_mem_ext_ptr_t inExt, outExt;  // Declaring two extensions for both buffers
    inExt.flags  = 0|XCL_MEM_TOPOLOGY; // Specify Bank0 Memory for input memory
    outExt.flags = 1|XCL_MEM_TOPOLOGY; // Specify Bank1 Memory for output Memory
    inExt.obj = 0   ; outExt.obj = 0; // Setting Obj and Param to Zero
    inExt.param = 0 ; outExt.param = 0;

    int err;
    //Allocate Buffer in Bank0 of Global Memory for Input Image using Xilinx Extension
    cl_mem buffer_inImage = clCreateBuffer(world.context, CL_MEM_READ_ONLY | CL_MEM_EXT_PTR_XILINX,
            image_size_bytes, &inExt, &err);
    if (err != CL_SUCCESS){
        std::cout << "Error: Failed to allocate device Memory" << std::endl;
        return EXIT_FAILURE;
    }
    //Allocate Buffer in Bank1 of Global Memory for Output Image using Xilinx Extension
    cl_mem buffer_outImage = clCreateBuffer(world.context, CL_MEM_WRITE_ONLY | CL_MEM_EXT_PTR_XILINX,
            image_size_bytes, &outExt, &err);
    if (err != CL_SUCCESS){
        std::cout << "Error: Failed to allocate device Memory" << std::endl;
        return EXIT_FAILURE;
    }
…
}

The extension pointer cl_mem_ext_ptr_t is a struct as defined below:

typedef struct{
    unsigned flags;
    void *obj;
    void *param;
  } cl_mem_ext_ptr_t;
  • Valid values for flags are:
    • XCL_MEM_DDR_BANK0
    • XCL_MEM_DDR_BANK1
    • XCL_MEM_DDR_BANK2
    • XCL_MEM_DDR_BANK3
    • <id> | XCL_MEM_TOPOLOGY
      Note: The <id> is determined by looking at the Memory Configuration section in the xxx.xclbin.info file generated next to the xxx.xclbin file. In the xxx.xclbin.info file, the global memory (DDR, HBM, PLRAM, etc.) is listed with an index representing the <id>.
  • obj is the pointer to the associated host memory allocated for the CL memory buffer, but only if the CL_MEM_USE_HOST_PTR flag is passed to the clCreateBuffer API; otherwise, set it to NULL.
  • param is reserved for future use. Always assign it to 0 or NULL.

Here are some specific cases where you might want to use the extension pointer:

P2P Buffer
For an explanation and example, refer to https://xilinx.github.io/XRT/master/html/p2p.html
Host-Memory Buffer
For an explanation and example, refer to https://xilinx.github.io/XRT/master/html/sb.html
Allocating the host buffer to a specific bank when the kernel port is connected to multiple banks
For example, DDR[0:1]. This use case is described in detail in the Using Multiple DDR Banks lab of the Vitis Optimizing Accelerated FPGA Applications: Bloom Filter Example tutorial.
Example of Allocating the Host Buffer to a Specific Bank

An example of the third case listed above, where you might need to use cl_mem_ext_ptr_t, is when the host and kernel are both accessing the DDR banks simultaneously, and you would like to split the data so that the kernel and host access memory banks in a ping-pong fashion. When the host is writing/reading to a specific memory bank, the kernel is writing/reading from another bank so that these host/kernel accesses do not compete and impact performance. For this scenario, you must manage the buffer allocation yourself.

The kernel ports in the xclbin are connected to DDR bank1 and bank2, and read the data from these banks alternately. The connectivity is established during linking by the Vitis compiler using the --connectivity.sp switch:

[connectivity]
sp=runOnfpga_1.input_words:DDR[1:2]

From the host code, you can send the input_words data to DDR banks 1 and 2 alternately. Two Xilinx extension pointer (cl_mem_ext_ptr_t) objects are created as shown in the example code below. The object flags determine which DDR bank each buffer is assigned to for the kernel to access. The kernel argument can be set to input_words[0] and input_words[1] for consecutive kernel enqueues.

#include <CL/cl_ext.h>
…
int main(int argc, char** argv) 
{
cl_mem_ext_ptr_t buffer_words_ext[2];

buffer_words_ext[0].flags = 1 | XCL_MEM_TOPOLOGY; // DDR[1]
buffer_words_ext[0].param = 0;
buffer_words_ext[0].obj   = input_doc_words;
buffer_words_ext[1].flags = 2 | XCL_MEM_TOPOLOGY; // DDR[2]
buffer_words_ext[1].param = 0;
buffer_words_ext[1].obj   = input_doc_words;
…
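
A possible continuation of this sketch is shown below; it is not part of the example code. It assumes that context, queue, err, the kernel handle krnl_runOnfpga, the buffer size size_bytes, and the iteration count num_iters already exist. Each extension pointer is passed as the host pointer of clCreateBuffer together with CL_MEM_EXT_PTR_XILINX (and CL_MEM_USE_HOST_PTR, because obj is set), and consecutive enqueues alternate between the two buffers.

cl_mem buffer_words[2];
for (int i = 0; i < 2; i++) {
    buffer_words[i] = clCreateBuffer(context,
            CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR | CL_MEM_EXT_PTR_XILINX,
            size_bytes, &buffer_words_ext[i], &err);
}

for (int iter = 0; iter < num_iters; iter++) {
    int flag = iter % 2;   // ping-pong between DDR[1] and DDR[2]
    clSetKernelArg(krnl_runOnfpga, 0, sizeof(cl_mem), &buffer_words[flag]);
    clEnqueueMigrateMemObjects(queue, 1, &buffer_words[flag], 0 /* to device */, 0, NULL, NULL);
    clEnqueueTask(queue, krnl_runOnfpga, 0, NULL, NULL);
}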

Assigning Global Memory for Kernel Code

Creating Multiple AXI Interfaces

OpenCL kernels, C/C++ kernels, and RTL kernels have different methods for assigning function parameters to AXI interfaces.

  • For OpenCL kernels, the --max_memory_ports option is required to generate one AXI4 interface for each global pointer on the kernel argument. The AXI4 interface name is based on the order of the global pointers on the argument list.

    The following code is taken from the example gmem_2banks_ocl in the ocl_kernels category from the Vitis Accel Examples on GitHub:

    __kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
    void apply_watermark(__global const TYPE * __restrict input, 
    __global TYPE * __restrict output, int width, int height) {
     ...
    }

    In this example, the first global pointer input is assigned an AXI4 name M_AXI_GMEM0, and the second global pointer output is assigned a name M_AXI_GMEM1.

  • For C/C++ kernels, multiple AXI4 interfaces are generated by specifying different “bundle” names in the HLS INTERFACE pragma for different global pointers. Refer to Kernel Interfaces for more information.

    The following is a code snippet from the gmem_2banks example that assigns the input pointer to the bundle gmem0 and the output pointer to the bundle gmem1. The bundle name can be any valid C string, and the AXI4 interface name generated will be M_AXI_<bundle_name>. For this example, the input pointer will have AXI4 interface name as M_AXI_gmem0, and the output pointer will have M_AXI_gmem1. Refer to pragma HLS interface for more information.
    #pragma HLS INTERFACE m_axi port=input  offset=slave bundle=gmem0
    #pragma HLS INTERFACE m_axi port=output offset=slave bundle=gmem1
    
  • For RTL kernels, the port names are generated during the import process by the RTL kernel wizard. The default names proposed by the RTL kernel wizard are m00_axi and m01_axi. If not changed, these names have to be used when assigning a DDR bank through the connectivity.sp option in the configuration file. Refer to Mapping Kernel Ports to Memory for more information.

Assigning AXI Interfaces to DDR Banks

IMPORTANT: When using more than one DDR interface, Xilinx requires you to specify the DDR memory bank for each kernel/CU, and specify the SLR to place the kernel into. For more information, see Mapping Kernel Ports to Memory and Assigning Compute Units to SLRs.

The following is an example configuration file that specifies the connectivity.sp option, and the v++ command line that connects the input pointer (M_AXI_GMEM0) to DDR bank 0 and the output pointer (M_AXI_GMEM1) to DDR bank 1:

The config_sp.cfg file:

[connectivity] 
sp=apply_watermark_1.m_axi_gmem0:DDR[0] 
sp=apply_watermark_1.m_axi_gmem1:DDR[1]

The v++ command line:

v++ apply_watermark --config config_sp.cfg

You can use the Device Hardware Transaction view to observe the actual DDR Bank communication, and to analyze DDR usage.

Figure 32: Device Hardware Transaction View Transactions on DDR Bank

Assigning AXI Interfaces to PLRAM

Some platforms support PLRAM. In these cases, use the same --connectivity.sp option as described in Assigning AXI Interfaces to DDR Banks, but use the name PLRAM[id]. Valid names supported by specific platforms can be found in the Memory Configuration section of the xclbin.info file generated alongside the xclbin.

Kernel SLR and DDR Memory Assignments

Kernel compute unit (CU) instance and DDR memory resource floorplanning are key to meeting the quality of results of your design in terms of frequency and resources. Floorplanning involves explicitly allocating CUs (kernel instances) to SLRs and mapping CUs to DDR memory resources. When floorplanning, consider both CU resource usage and DDR memory bandwidth requirements.

The largest Xilinx FPGAs are made up of multiple stacked silicon dies. Each die in the stack is referred to as a super logic region (SLR) and has a fixed amount of resources and memory, including DDR interfaces. The device SLR resources available for custom logic can be found in the Vitis Software Platform Release Notes, or can be displayed using the platforminfo utility described in platforminfo Utility.

You can use the actual kernel resource utilization values to help distribute CUs across SLRs to reduce congestion in any one SLR. The system estimate report lists the number of resources (LUTs, Flip-Flops, BRAMs, etc.) used by the kernels early in the design cycle. The report can be generated during hardware emulation and system compilation through the command line or GUI and is described in System Estimate Report.

Use this information along with the available SLR resources to help assign CUs to SLRs such that no one SLR is over-utilized. The less congestion in an SLR, the better the tools can map the design to the FPGA resources and meet your performance target. For mapping memory resources and CUs, see Mapping Kernel Ports to Memory and Assigning Compute Units to SLRs.

Note: While compute units can be connected to any available DDR memory resource, it is also necessary to account for the bandwidth requirements of the kernels when assigning to SLRs.

After allocating your CUs to SLRs, map any CU master AXI ports to DDR memory resources. Xilinx recommends connecting to a DDR memory resource in the same SLR as the CU. This reduces competition for the limited SLR-crossing connection resources. In addition, connections between SLRs use super long line (SLL) routing resources, which incur a greater delay than standard intra-SLR routing.

It might be necessary to cross an SLR region to connect to a DDR resource in a different SLR. However, if both the connectivity.sp and the connectivity.slr directives are explicitly defined, the tools automatically add the additional crossing logic to minimize the effect of the SLL delay and facilitate better timing closure.

Guidelines for Kernels that Access Multiple Memory Banks

The DDR memory resources are distributed across the super logic regions (SLRs) of the platform. Because the number of connections available for crossing between SLRs is limited, the general guidance is to place a kernel in the same SLR as the DDR memory resource with which it has the most connections. This reduces competition for SLR-crossing connections and avoids consuming extra logic resources associated with SLR crossing.

Figure 33: Kernel and Memory in Same SLR
Note: The image on the left shows a single AXI interface mapped to a single memory bank. The image on the right shows multiple AXI interfaces mapped to the same memory bank.

As shown in the previous figure, when a kernel has a single AXI interface that maps to only a single memory bank, the platforminfo utility described in platforminfo Utility lists the SLR that is associated with the memory bank of the kernel, and therefore the SLR where the kernel would be best placed. In this scenario, the design tools might automatically place the kernel in that SLR without the need for extra input; however, you might need to provide an explicit SLR assignment for some of the kernels under the following conditions:

  • The design contains a large number of kernels accessing the same memory bank.
  • A kernel requires some specialized logic resources that are not available in the SLR of the memory bank.

When a kernel has multiple AXI interfaces and all of the interfaces of the kernel access the same memory bank, it can be treated in a very similar way to a kernel with a single AXI interface, and the kernel should reside in the same SLR as the memory bank to which its AXI interfaces are mapped.

Figure 34: Memory Bank in Adjoining SLR
Note: The image on the left shows one SLR crossing is required when the kernel is placed in SLR0. The image on the right shows two SLR crossings are required for kernel to access memory banks.

When a kernel has multiple AXI interfaces to multiple memory banks in different SLRs, the recommendation is to place the kernel in the SLR that has the majority of the memory banks accessed by the kernel (shown in the figure above). This minimizes the number of SLR crossings required by this kernel, which leaves more SLR-crossing resources available for other kernels in your design to reach their memory banks.

When the kernel is mapping memory banks from different SLRs, explicitly specify the SLR assignment as described in Kernel SLR and DDR Memory Assignments.

Figure 35: Memory Banks Two SLRs Away
Note: The image on the left shows two SLR crossings are required to access all of the mapped memory banks. The image on the right shows three SLR crossings are required to access all of the mapped memory banks.

As shown in the previous figure, when a platform contains more than two SLRs, it is possible that the kernel might map a memory bank that is not in the SLR immediately adjacent to its most commonly mapped memory bank. When this scenario arises, memory accesses to the distant memory bank must cross more than one SLR boundary and incur additional SLR-crossing resource costs. To avoid such costs, it might be better to place the kernel in an intermediate SLR where it requires only less expensive crossings into the adjacent SLRs.