Memory Access Optimizations
Efficient memory accesses are critical to the performance of hardware functions running on an FPGA. The first category of optimizations that can be made to a hardware function is improving memory accesses and the data transfer rate between the PS and PL.
Data is often stored in an external memory. Accesses to an off-chip can take time and reduce the system performance. To reduce the memory access times and improve the runtime in high-performance systems, data can be stored in a local cache.
In addition to these memories, an FPGA provides local memory, where small and medium-sized data blocks can be stored and efficiently accessed. The memory architecture of a system, which uses FPGAs to accelerate performance, is similar to that of a CPU+GPU or CPU+DSP system, where consideration is given to making the most common memory accesses through the most local memory.
The following techniques can help improve the design of a hardware function by minimizing the latency impact of accessing and storing the data through the extensive use of local memories.
- Data Motion Optimization
- This includes the transfer of data between the PS and the PL components. The SDSoC™ environment implements a default data motion architecture based on the arguments of the functions selected for the PL. However, it could use optimization directives to ensure data is stored in contiguous memory. This would improve the efficiency of the data transfer. A scatter-gather transfer could also be used to transfer very large sets of data more efficiently.
- Data Access Patterns
- An FPGA excels at processing data quickly, in a highly concurrent manner. Poor data access patterns interrupt the flow of data and leave the computational power of an FPGA underused. Good data access patterns minimize the use of re-reading data and increase the use of conditional branching to process one sample in multiple ways.
- On-Chip Memories
- On-chip memories (that is, programmable logic RAM, or PLRAM) use the block RAM on the FPGA, and are physically located near the computation. They allow one-cycle reads and writes, which drastically improve memory access performance. Copying data using optimal and efficient data access patterns from the DDR to these local memories can be done very quickly using burst transactions and can considerably improve performance. Performance could also be improved by using optimal data access patterns, such as burst transactions, to efficiently copy the data from the DDR to these local memories.
Data Motion Optimization
The data motion network is the hardware used to connect the program running on the PS to the hardware function implemented in the PL fabric. The SDSoC environment automatically creates a data motion network based on the data types; however, directives can be used to optimize the implementation for both speed and hardware resources. The default data motion networks created for each data type are:
- Scalar
- Scalar data types are always transferred by the
AXI_LITE
data mover. This is a memory-mapped interface that requires very few hardware resources to implement, consisting of single word read/write transfers with handshaking. This makes it the ideal choice for function arguments that are only transferred once per invocation. These types rarely need to be addressed when optimizing the system. - Array
- Arrays contain multiple data values and are more efficiently transferred using fast DMA transfers. The SDSoC environment provides a number of options to control both the efficiency of the hardware and the resources used to implement the hardware, even allowing for the selection of specific data movers.
- Pointers
- By default, in the absence of any pragmas, a pointer argument is taken to be a scalar, even though in C/C++ it might denote a one-dimensional array type. If a pointer is written to or read from multiple times, pragmas should be used to ensure the efficient transfer of data.
- Struct
- A single
struct
is flattened, and each data member uses its own data mover depending on whether it is a scalar or array. For an array ofstruct
, the data motion network is the same as anarray
discussed previously.
With an understanding of the SDSoC environment default behavior, select the most optimum data motion network based on your own particular needs.
Memory Allocation
The sds++/sdscc
(referred to as
sds++
) compiler analyzes your program and selects data movers to
match the requirements for each hardware function call between software and hardware,
based on payload size, hardware interface on the accelerator, and properties of the
function arguments. When the compiler can guarantee an array argument is located in
physically contiguous memory, it can use the most efficient data movers. Allocating or
memory-mapping arrays with the following sds_lib
library
functions can inform the compiler that memory is physically contiguous.
// guarantees physically contiguous memory
sds_alloc(size_t size);
// paddr must point to contiguous memory
sds_mmap(void *paddr, size_t size, void *vaddr);
// assumes physically contiguous memory
sds_register_dmabuf(void *vaddr, int fd);
It is possible that due to the program structure, the sds++
compiler cannot definitively deduce the memory
contiguity, and when this occurs, it issues a warning message, as shown:
WARNING: [DMAnalysis 83-4492] Unable to determine the memory attributes passed to foo_arg_A of function foo at foo.cpp:102
You can inform the compiler that the data is allocated in a physically contiguous memory by inserting the following pragma immediately before the function declaration.
sds_alloc
to allocate such memory).#pragma SDS data mem_attribute (A:PHYSICAL_CONTIGUOUS) // default is NON_PHYSICAL_CONTIGUOUS
It is important to know how the data will be used, because it helps determine what data movers to use. This is because some require contiguous allocation, and others do not. For example, if the data to be used is random, it will use a scatter-gather DMA, which causes these types of overhead:
- walking page tables
- pinning pages
- creating descriptors
- cleaning up when the accelerator is completed
Copy and Shared Memory Semantics
By default, hardware function calls involve copy-in, copy-out semantics for function arguments. It is possible to impose a shared memory model for hardware function arguments, but you must keep in mind that while throughput on burst transfers is quite good, the latency to external DDR from the programmable logic is significantly higher than it is for the CPU. The following pragma, inserted immediately preceding the function declaration, is used to declare that a variable transfer employs shared memory semantics.
#pragma SDS data zero_copy(A[0:<array_size>]) // array_size = number of elements
Within a synthesizable hardware function, it is usually inefficient to read/write single words from the shared memory (specified using the ZERO_COPY pragma). A more efficient approach is to employ a pipelined for-loop to read/write data from memory in bursts and store it in a local memory.
For copy and zero-copy memory semantics, another efficient alternative is to stream data between programmable logic and external DDR to maximize memory efficiency. This stores data in local memory within a hardware function whenever you need to make non-sequential and repeated accesses to variables. For example, video applications typically have data coming in as pixel streams and implement line buffers in FPGA memory to support multiple accesses to the pixel stream data.
To declare to the sds++/sdscc
(referred to as
sds++
) compiler command that a hardware function can admit streaming access
for an array data transfer (that is, each element is accessed precisely once in index order),
insert the following pragma immediately preceding the function prototype.
#pragma SDS data access_pattern(A:SEQUENTIAL) // access pattern = SEQUENTIAL | RANDOM
For arrays passed as pointer typed arguments to hardware functions, sometimes the compilers can infer transfer size, but if they cannot, they issue the following message:
WARNING: [DMAnalysis 83-4439] Cannot determine data size for argument p of function foo
Use the following to specify the size of the data to be transferred.
#pragma SDS data copy(p[0:<array_size>]) // for example, int *p
You can vary the data transfer size on a per function call basis to avoid
transferring data that is not required by a hardware function by setting <array_size>
in the pragma definition to be an expression
defined in the scope of the function call (that is, all variables in the size expression must
be scalar arguments to the function), for example:
#pragma SDS data copy(A[0:L+2*T/3]) // scalar arguments L, T to same function
The sds++
compiler implements function call
interfaces by creating stub functions that have a specific synchronization code (such as
cf_wait()
) to preserve program behavior and ensure
consistency.
Data Cache Coherency
The sds++/sdscc
(referred to as sds++
) compiler
automatically generates software configuration code for each data mover required by the
system, including interfacing to underlying device drivers, as needed. The default assumption
is that the system compiler maintains cache coherency for the memory allocated to arrays
passed between the CPU and hardware functions; consequently, the compiler might generate code
to perform a cache flush before transferring data to a hardware function and to perform a
cache-invalidate before transferring data from a hardware function to the memory. Both actions
are necessary for correctness, but have performance implications. When using Zynq® device HP ports, for example, you can override the default
when you know that the CPU will not access the memory indicating that the correctness of the
application does not depend on cache coherency. To avoid the overhead of unnecessary cache
flushes, use the following API to allocate memory.
void *sds_alloc_non_cacheable(size_t size)
A typical use case of non-cacheable memory is a video application where some frame buffers are accessed by programmable logic but not the CPU.
Data Access Patterns
Designers select an FPGA to implement C code due to the superior performance of the FPGA—the parallel architecture of an FPGA allows it to perform operations much faster than the inherently sequential operations of a processor, and users typically want to take advantage of that performance.
The focus here is on understanding the impact that the access patterns inherent in the C code might have on the results. Although the access patterns of most concern are those into and out of the hardware function, it is worth considering the access patterns within functions as any bottlenecks within the hardware function will negatively impact the transfer rate into and out of the function.
To highlight how some data access patterns can negatively impact performance and demonstrate how other patterns can be used to fully embrace the parallelism and high performance capabilities of an FPGA, this section reviews an image convolution algorithm.
- The first part reviews the algorithm and highlights the data access aspects that limit the performance in an FPGA.
- The second part shows how the algorithm might be written to achieve the highest performance possible.
Algorithm with Poor Data Access Patterns
A standard convolution function applied to an image is used here to demonstrate how the C code can negatively impact the performance that is possible from an FPGA. In this example, a horizontal and then vertical convolution is performed on the data. Because the data at the edge of the image lies outside the convolution windows, the final step is to address the data around the border.
The algorithm structure can be summarized as follows:
- A horizontal convolution.
- Followed by a vertical convolution.
- Followed by a manipulation of the border pixels.
static void convolution_orig(
int width,
int height,
const T *src,
T *dst,
const T *hcoeff,
const T *vcoeff) {
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
// Horizontal convolution
HconvH:for(int row = 0; row < height; row++){
HconvWfor(int col = border_width; col < width - border_width; col++){
Hconv:for(int i = - border_width; i <= border_width; i++){
...
}
}
}
// Vertical convolution
VconvH:for(int row = border_width; row < height - border_width; row++){
VconvW:for(int col = 0; col < width; col++){
Vconv:for(int i = - border_width; i <= border_width; i++){
}
}
}
// Border pixels
Top_Border:for(int col = 0; col < border_width; col++){
}
Side_Border:for(int row = border_width; row < height - border_width; row++){
}
Bottom_Border:for(int = height - border_width; row < height; row++){
}
}
Standard Horizontal Convolution
The first step is to perform the convolution in the horizontal direction, as shown in the following figure.
The convolution is performed using K samples of data and K convolution coefficients. In the figure above, K is shown as 5, however, the value of K is defined in the code. To perform the convolution, a minimum of K data samples are required. The convolution window cannot start at the first pixel because the window would need to include pixels that are outside the image.
By performing a symmetric convolution, the first K data samples from input
src
can be convolved with the horizontal coefficients, and
the first output calculated. To calculate the second output, the next set of K data samples is
used. This calculation proceeds along each row until the final output is written.
The C code for performing this operation is shown below.
const int conv_size = K;
const int border_width = int(conv_size / 2);
#ifndef __SYNTHESIS__
T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif
Clear_Local:for(int i = 0; i < height * width; i++){
local[i]=0;
}
// Horizontal convolution
HconvH:for(int col = 0; col < height; col++){
HconvWfor(int row = border_width; row < width - border_width; row++){
int pixel = col * width + row;
Hconv:for(int i = - border_width; i <= border_width; i++){
local[pixel] += src[pixel + i] * hcoeff[i + border_width];
}
}
}
The code is straightforward; however, there are some issues with this C code that will negatively impact the quality of the hardware results.
The first issue is the large storage requirements during C compilation. The
intermediate results in the algorithm are stored in an internal local array. This requires an
array of HEIGHT*WIDTH
, which for a standard video image of 1920*1080 will
hold 2,073,600 values.
- For the compilers targeting Zynq®-7000 SoC or Zynq® UltraScale+™ MPSoC devices, as well as many host systems, this amount of local storage can lead to stack
overflows at runtime (for example, running on the target device, or running co-sim flows
within the Vivado® High-Level Synthesis (HLS) tool).
The data for a local array is placed on the stack and not the heap, which is managed by the
OS. When cross-compiling with
arm-linux-gnueabihf-g++
use the-Wl,"-z stacksize=4194304"
linker option to allocate sufficient stack space.Note: The syntax for this option varies for different linkers - When a function will only be run in hardware, a useful way to avoid such
issues is to use the
__SYNTHESIS__
macro. This macro is automatically defined by the system compiler when the hardware function is synthesized into hardware. The code shown above uses dynamic memory allocation during C simulation to avoid any compilation issues and only uses static storage during synthesis. A downside of using this macro is the code verified by C simulation is not the same code that is synthesized. In this case, however, the code is not complex, and the behavior will be the same. - The main issue with this local array is the quality of the FPGA
implementation. Because this is an array it will be implemented using internal FPGA block
RAM. This is a very large memory to implement inside the FPGA. It might require a larger and
more costly FPGA device. The use of block RAM can be minimized by using the
DATAFLOW
optimization and streaming the data through small efficient FIFOs, but this will require the data to be used in a streaming sequential manner. There is currently no such requirement.
The next issue relates to the performance: the initialization for the local
array. The Clear_Local
loop is used to set the values in array local to zero.
Even if this loop is pipelined in the hardware to execute in a high-performance manner, this
operation still requires approximately two million clock cycles
(HEIGHT*WIDTH
) to implement. While this memory is being initialized, the
system cannot perform any image processing. This same initialization of the data could be
performed using a temporary variable inside loop HConv to initialize the accumulation before
the write.
Finally, the throughput of the data, and thus the system performance, is fundamentally limited by the data access pattern.
- To create the first convolved output, the first K values are read from the input.
- To calculate the second output, a new value is read, and then the same K-1 values are re-read.
One of the keys to a high-performance FPGA is to minimize the access to and from the PS. Each access for data, which has previously been fetched, negatively impacts the performance of the system. An FPGA is capable of performing many concurrent calculations at once and reaching very high performance, but not while the flow of data is constantly interrupted by re-reading values.
With the code shown above, the data cannot be continuously streamed directly from the processor using a operation because the data is required to be re-read time and again.
Standard Vertical Convolution
The next step is to perform the vertical convolution shown in the following figure.
The process for the vertical convolution is similar to the horizontal convolution. A set of K data samples is required to convolve with the convolution coefficients, Vcoeff in this case. After the first output is created using the first K samples in the vertical direction, the next set of K values is used to create the second output. The process continues down through each column until the final output is created.
After the vertical convolution, the image is now smaller than the source image
src
, due to both the horizontal and vertical border
effect.
The following is the code for performing these operations:
Clear_Dst:for(int i = 0; i < height * width; i++){
dst[i]=0;
}
// Vertical convolution
VconvH:for(int col = border_width; col < height - border_width; col++){
VconvW:for(int row = 0; row < width; row++){
int pixel = col * width + row;
Vconv:for(int i = - border_width; i <= border_width; i++){
int offset = i * width;
dst[pixel] += local[pixel + offset] * vcoeff[i + border_width];
}
}
}
This code highlights similar issues to those already discussed with the horizontal convolution code:
- Many clock cycles are spent to set the values in the output image
dst
to zero. In this case, approximately another two million cycles for a 1920*1080 image size. - There are multiple accesses per pixel to re-read data stored in array
local
. - There are multiple writes per pixel to the output array/port
dst
.
The access patterns in the code above in fact creates the requirement to have such a large local array. The algorithm requires the data on row K to be available to perform the first calculation. Processing data down the rows before proceeding to the next column requires the entire image to be stored locally. This requires that all values be stored and results in large local storage on the FPGA.
In addition, when you reach the stage where you wish to use compiler
directives to optimize the performance of the hardware function, the flow of data between the
horizontal and vertical loop cannot be managed using a FIFO (a high-performance and
low-resource unit), because the data is not streamed out of array local
, a FIFO can only be used with sequential access patterns. Instead, this code
which requires arbitrary/random accesses requires a ping-pong block RAM to improve
performance. This doubles the memory requirements for the implementation of the local array to
approximately four million data samples, which is too large for an FPGA.
Standard Border Pixel Convolution
The final step in performing the convolution is to create the data around the border. These pixels can be created by simply reusing the nearest pixel in the convolved output. The following figures shows how this is achieved.
The border region is populated with the nearest valid value. The following code performs the operations shown in the previous figure.
int border_width_offset = border_width * width;
int border_height_offset = (height - border_width - 1) * width;
// Border pixels
Top_Border:for(int col = 0; col < border_width; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + border_width];
}
for(int row = border_width; row < width - border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + row];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_width_offset + width - border_width - 1];
}
}
Side_Border:for(int col = border_width; col < height - border_width; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[offset + border_width];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[offset + width - border_width - 1];
}
}
Bottom_Border:for(int col = height - border_width; col < height; col++){
int offset = col * width;
for(int row = 0; row < border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + border_width];
}
for(int row = border_width; row < width - border_width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + row];
}
for(int row = width - border_width; row < width; row++){
int pixel = offset + row;
dst[pixel] = dst[border_height_offset + width - border_width - 1];
}
}
The code suffers from the same repeated access for data. The data stored outside the FPGA in the array dst
must now be available to be read as input data re-read multiple times. Even in the first loop, dst[border_width_offset + border_width]
is read multiple times but the values of border_width_offset
and border_width
do not change.
This code is very intuitive to both read and write. When implemented with the SDSoC environment it is approximately 120M clock cycles, which meets or slightly exceeds the performance of a CPU. However, as shown in the next section, optimal data access patterns ensure this same algorithm can be implemented on the FPGA at a rate of one pixel per clock cycle, or approximately 2M clock cycles.
The summary from this review is that the following poor data access patterns negatively impact the performance and size of the FPGA implementation:
- Multiple accesses to read and then re-read data. Use local storage where possible.
- Accessing data in an arbitrary or random access manner. This requires the data to be stored locally in arrays and costs resources.
- Setting default values in arrays costs clock cycles and performance.
Algorithm with Optimal Data Access Patterns
The key to implementing the convolution example reviewed in the previous section as a high-performance design with minimal resources is doing the following:
- Maximize the flow of data through the system. Refrain from using any coding techniques or algorithm behavior that inhibits the continuous flow of data.
- Maximize the reuse of data. Use local caches to ensure there are no requirements to re-read data and the incoming data can keep flowing.
- Use conditional branching. This is expensive on a CPU, GPU, or DSP but optimal in an FPGA.
The first step is to understand how data flows through the system into and out of the FPGA. The convolution algorithm is performed on an image. When data from an image is produced and consumed, it is transferred in a standard raster-scan manner as shown in the following figure.
If the data is transferred to the FPGA in a streaming manner, the FPGA should process it in a streaming manner and transfer it back from the FPGA in this manner.
The convolution algorithm shown below embraces this style of coding. At this level of
abstraction a concise view of the code is shown. However, there are now intermediate buffers,
hconv
and vconv
, between each loop. Because these are
accessed in a streaming manner, they are optimized into single registers in the final
implementation.
template<typename T, int K>
static void convolution_strm(
int width,
int height,
T src[TEST_IMG_ROWS][TEST_IMG_COLS],
T dst[TEST_IMG_ROWS][TEST_IMG_COLS],
const T *hcoeff,
const T *vcoeff)
{
T hconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
T vconv_buffer[MAX_IMG_COLS*MAX_IMG_ROWS];
T *phconv, *pvconv;
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
// Horizontal convolution
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
HConv:for(int i = 0; i < K; i++) {
}
}
}
// Vertical convolution
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
VConv:for(int i = 0; i < K; i++) {
}
}
}
Border:for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
}
}
All three processing loops now embrace conditional branching to ensure the continuous processing of data.
Optimal Horizontal Convolution
The algorithm must use the K previous samples to compute the convolution
result. It therefore copies the sample into a temporary cache hwin
. This use of local storage means there is no need to re-read values from the
PS and interrupt the flow of data. For the first calculation, there are not enough values in
hwin to compute a result, so conditionally, no output values are written. To perform the
calculation in a more efficient manner for FPGA implementation, the horizontal convolution is
computed.
The algorithm keeps reading input samples and caching them into hwin
. Each time it reads a new sample, it pushes an unneeded sample
out of hwin
. The first time an output value can be written is
after the Kth input has been read. An output value can now be written. The algorithm proceeds
in this manner along the rows until the final sample has been read. At that point, only the
last K samples are stored in hwin
; all that is required to
compute the convolution.
As shown below, the code to perform these operations uses both local storage to prevent re-reads from the PL (the reads from local storage can be performed in parallel in the final implementation), and the extensive use of conditional branching to ensure each new data sample can be processed in a different manner.
// Horizontal convolution
phconv=hconv_buffer; // set / reset pointer to start of buffer
// These assertions let HLS know the upper bounds of loops
assert(height < MAX_IMG_ROWS);
assert(width < MAX_IMG_COLS);
assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
HConvH:for(int col = 0; col < height; col++) {
HConvW:for(int row = 0; row < width; row++) {
#pragma HLS PIPELINE
T in_val = *src++;
// Reset pixel value on-the-fly - eliminates an O(height*width) loop
T out_val = 0;
HConv:for(int i = 0; i < K; i++) {
hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
out_val += hwin[i] * hcoeff[i];
}
if (row >= K - 1) {
*phconv++=out_val;
}
}
}
An interesting point to note in the code above is the use of the temporary variable out_val
to perform the convolution calculation. This variable is set to zero before the calculation is performed, negating the need to spend two million clock cycles to reset the values, as in the previous example.
Throughout the entire process, the samples in the src input are processed in a raster-streaming manner. Every sample is read in turn. The outputs from the task are either discarded or used, but the task keeps constantly computing. This represents a difference from code written to perform on a CPU.
Optimal Vertical Convolution
The vertical convolution represents a challenge to the streaming data model preferred by an FPGA. The data must be accessed by column but you do not wish to store the entire image. The solution is to use line buffers, as shown in the following figure.
Again, the samples are read in a streaming manner, this time from the local
buffer hconv
. The algorithm requires at least K-1 lines of
data before it can process the first sample. All the calculations performed before this are
discarded through the use of conditionals.
A line buffer allows K-1 lines of data to be stored. Each time a new sample is read, another sample is pushed out the line buffer. An interesting point to note here is that the newest sample is used in the calculation, and then the sample is stored into the line buffer and the old sample ejected out. This ensures that only K-1 lines are required to be cached rather than an unknown number of lines, and minimizes the use of local storage. Although a line buffer does require multiple lines to be stored locally, the convolution kernel size K is always much less than the 1080 lines in a full video image.
The first calculation can be performed when the first sample on the Kth line is read. The algorithm then proceeds to output values until the final pixel is read.
// Vertical convolution
phconv=hconv_buffer; // set/reset pointer to start of buffer
pvconv=vconv_buffer; // set/reset pointer to start of buffer
VConvH:for(int col = 0; col < height; col++) {
VConvW:for(int row = 0; row < vconv_xlim; row++) {
#pragma HLS DEPENDENCE variable=linebuf inter false
#pragma HLS PIPELINE
T in_val = *phconv++;
// Reset pixel value on-the-fly - eliminates an O(height*width) loop
T out_val = 0;
VConv:for(int i = 0; i < K; i++) {
T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
out_val += vwin_val * vcoeff[i];
if (i > 0)
linebuf[i - 1][row] = vwin_val;
}
if (col >= K - 1) {
*pvconv++ = out_val;
}
}
The code processes all the samples in the design in a streaming manner. The task is constantly running. Following a coding style where you minimize the number of re-reads (or re-writes) forces you to cache the data locally. This is an ideal strategy when targeting an FPGA.
Optimal Border Pixel Convolution
The final step in the algorithm is to replicate the edge pixels into the border region. To ensure the constant flow of data and data reuse, the algorithm makes use of local caching. The following figure shows how the border samples are aligned into the image.
- Each sample is read from the
vconv
output from the vertical convolution. - The sample is then cached as one of four possible pixel types.
- The sample is then written to the output stream.
The code for determining the location of the border pixels is shown here.
// Border pixels
pvconv=vconv_buffer; // set/reset pointer to start of buffer
Border:for (int i = 0; i < height; i++) {
for (int j = 0; j < width; j++) {
T pix_in, l_edge_pix, r_edge_pix, pix_out;
#pragma HLS PIPELINE
if (i == 0 || (i > border_width && i < height - border_width)) {
// read a pixel out of the video stream and cache it for
// immediate use and later replication purposes
if (j < width - (K - 1)) {
pix_in = *pvconv++;
borderbuf[j] = pix_in;
}
if (j == 0) {
l_edge_pix = pix_in;
}
if (j == width - K) {
r_edge_pix = pix_in;
}
}
// Select output value from the appropriate cache resource
if (j <= border_width) {
pix_out = l_edge_pix;
} else if (j >= width - border_width - 1) {
pix_out = r_edge_pix;
} else {
pix_out = borderbuf[j - border_width];
}
*dst++=pix_out;
}
}
A notable difference with this new code is the extensive use of conditionals inside the tasks. This allows the task, after it is pipelined, to continuously process data. The result of the conditionals does not impact the execution of the pipeline. The result will impact the output values, but the pipeline with keep processing as long as input samples are available.
Optimal Data Access Patterns
The following list summarizes how to ensure your data access patterns result in the most optimal performance on an FPGA.
- Minimize data input reads. After data has been read into the block, it can easily feed many parallel paths but the inputs to the hardware function can be bottlenecks to performance. Read data once and use a local cache if the data must be reused.
- Minimize accesses to arrays, especially large arrays. Arrays are implemented in block RAM which like I/O ports only have a limited number of ports and can be bottlenecks to performance. Arrays can be partitioned into smaller arrays and even individual registers but partitioning large arrays will result in many registers being used. Use small localized caches to hold results such as accumulations and then write the final result to the array.
- Seek to perform conditional branching inside pipelined tasks rather than conditionally execute tasks, even pipelined tasks. Conditionals are implemented as separate paths in the pipeline. Allowing the data from one task to flow into the next task with the conditional performed inside the next task will result in a higher performing system.
- Minimize output writes for the same reason as input reads, namely, that ports are bottlenecks. Replicating additional accesses only pushes the issue further back into the system.
For C code which processes data in a streaming manner, consider employing a coding style that promotes read-once/write-once to function arguments because this ensures the function can be efficiently implemented in an FPGA. It is much more productive to design an algorithm in C that results in a high-performance FPGA implementation than debug why the FPGA is not operating at the performance required.