Real-World Examples
This chapter describes some real-world examples and shows the following:
- How these examples are optimized using both the top-down flow and the bottom-up flow
- The top-down flow is demonstrated using a Lucas-Kanade (LK) Optical Flow algorithm.
- The bottom-up flow is demonstrated using a stereo vision block matching algorithm.
- What optimization directives were applied
- Why those directives were chosen
Top-Down: Optical Flow Algorithm
The Lucas-Kanade (LK) method is a widely used differential method for optical flow estimation, that is, the estimation of the movement of pixels between two related images. In this example system, the related images are the current and previous images of a video stream. The LK method is a compute-intensive algorithm that works over a window of neighboring pixels, using the least-squares difference to find matching pixels.
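As an illustration of the least-squares step, the following sketch solves the 2x2 system that the LK method forms for a single window. It assumes the five window sums have already been accumulated. The struct `Flow`, the function name `solve_lk_window`, the sign convention, and the singularity threshold are assumptions made for illustration and are not taken from the design example.

```cpp
#include <cmath>

// Result of solving one LK window; 'valid' is false when the system is singular.
struct Flow {
    float fx;
    float fy;
    bool valid;
};

// Solves the 2x2 least-squares system with Cramer's rule:
//   [ixix ixiy] [fx]   [-dix]
//   [ixiy iyiy] [fy] = [-diy]
// ixix, ixiy, iyiy are assumed to be sums of gradient products over the window;
// dix, diy are assumed to be sums of gradient times frame difference.
Flow solve_lk_window(float ixix, float ixiy, float iyiy, float dix, float diy)
{
    const float det = ixix * iyiy - ixiy * ixiy;
    if (std::fabs(det) < 1e-6f) {
        return Flow{0.0f, 0.0f, false};  // flat or edge-only window: no unique solution
    }
    const float fx = (-dix * iyiy + diy * ixiy) / det;
    const float fy = (-diy * ixix + dix * ixiy) / det;
    return Flow{fx, fy, true};
}
```

For example, with `ixix = 2`, `ixiy = 0`, `iyiy = 2`, `dix = -4`, and `diy = 2`, the sketch returns a flow of (2, -1).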
The following code example shows how to implement this algorithm, where two input files are read in, processed through function fpga_optflow, and the results written to an output file.
int main()
{
  FILE *f;
  pix_t *inY1 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
  yuv_t *inCY1 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
  pix_t *inY2 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
  yuv_t *inCY2 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
  yuv_t *outCY = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
  printf("allocated buffers\n");

  // read two consecutive frames from the input file
  f = fopen(FILEINAME,"rb");
  if (f == NULL) {
    printf("failed to open file %s\n", FILEINAME);
    return -1;
  }
  printf("opened file %s\n", FILEINAME);
  read_yuv_frame(inY1, WIDTH, WIDTH, HEIGHT, f);
  printf("read 1st %dx%d frame\n", WIDTH, HEIGHT);
  read_yuv_frame(inY2, WIDTH, WIDTH, HEIGHT, f);
  printf("read 2nd %dx%d frame\n", WIDTH, HEIGHT);
  fclose(f);
  printf("closed file %s\n", FILEINAME);

  // convert both frames from 8-bit luma to the 16-bit format
  convert_Y8toCY16(inY1, inCY1, HEIGHT*WIDTH);
  printf("converted 1st frame to 16bit\n");
  convert_Y8toCY16(inY2, inCY2, HEIGHT*WIDTH);
  printf("converted 2nd frame to 16bit\n");

  fpga_optflow(inCY1, inCY2, outCY, HEIGHT, WIDTH, WIDTH, 10.0);
  printf("computed optical flow\n");

  // write optical flow data image to disk
  write_yuv_file(outCY, WIDTH, WIDTH, HEIGHT, ONAME);

  // release the buffers allocated with sds_alloc
  sds_free(inY1); sds_free(inCY1);
  sds_free(inY2); sds_free(inCY2);
  sds_free(outCY);
  printf("freed buffers\n");
  return 0;
}
This method is typical for a top-down design flow using standard C/C++ data types.
Function fpga_optflow is shown in the following code example, together with the sub-functions it calls:
int fpga_optflow (yuv_t *frame0, yuv_t *frame1, yuv_t *framef, int height, int width, int stride, float clip_flowmag)
{
  // The two declarations of img_pix_count cannot coexist in one scope.
  // They are separated here by a build guard; the macro name COMPILEFORSW
  // is an assumption.
#ifdef COMPILEFORSW
  int img_pix_count = height*width;   // software build: full-frame buffers
#else
  int img_pix_count = 10;             // hardware build: small placeholder buffers
#endif

  // allocate the intermediate buffers on first use
  if (f0Stream == NULL) f0Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (f1Stream == NULL) f1Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (ffStream == NULL) ffStream = (yuv_t *) malloc(sizeof(yuv_t) * img_pix_count);
  if (ixix == NULL) ixix = (int *) malloc(sizeof(int) * img_pix_count);
  if (ixiy == NULL) ixiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (iyiy == NULL) iyiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (dix == NULL) dix = (int *) malloc(sizeof(int) * img_pix_count);
  if (diy == NULL) diy = (int *) malloc(sizeof(int) * img_pix_count);
  if (fx == NULL) fx = (float *) malloc(sizeof(float) * img_pix_count);
  if (fy == NULL) fy = (float *) malloc(sizeof(float) * img_pix_count);

  readMatRows (frame0, f0Stream, height, width, stride);
  readMatRows (frame1, f1Stream, height, width, stride);
  computeSum (f0Stream, f1Stream, ixix, ixiy, iyiy, dix, diy, height, width);
  computeFlow (ixix, ixiy, iyiy, dix, diy, fx, fy, height, width);
  getOutPix (fx, fy, ffStream, height, width, clip_flowmag);
  writeMatRows (ffStream, framef, height, width, stride);
  return 0;
}
In this example, all of the functions in fpga_optflow process live video data and can benefit from hardware acceleration, with DMAs used to transfer the data to and from the PS. If all five functions are annotated as hardware functions, the topology of the system is as shown in the following figure.
Figure: System Topology
The system can be compiled into hardware and event tracing used to analyze the performance in detail.
The issue here is that the system takes a long time to complete: approximately 15 seconds for a single frame. To process HD video at 60 frames per second, the system must process one frame every 16.7 ms. You can use optimization directives, as described below, to ensure the system meets the target performance.
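The size of the gap can be worked out directly. The following sketch computes the per-frame time budget and the speedup needed to reach it; the function names are hypothetical and exist only for this illustration.

```cpp
// Time available per frame, in milliseconds, at a given frame rate.
constexpr double frame_budget_ms(double frames_per_second)
{
    return 1000.0 / frames_per_second;
}

// Factor by which a measured frame time must shrink to fit the budget.
constexpr double required_speedup(double measured_ms, double frames_per_second)
{
    return measured_ms / frame_budget_ms(frames_per_second);
}
```

At 60 frames per second the budget is 1000/60, or about 16.7 ms, so a 15-second frame time needs to improve by a factor of roughly 900.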
Optical Flow Memory Access Optimization
The first task is to optimize the transfer of data. In this case, because the system will process streaming video, where each sample is processed in consecutive order, the memory transfer optimization is used to ensure the SDSoC™ environment interprets all accesses as sequential in nature.
This is performed by adding SDS pragmas before the function signatures for all functions involved.
#pragma SDS data access_pattern(matB:SEQUENTIAL, pixStream:SEQUENTIAL)
#pragma SDS data mem_attribute(matB:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(matB[0:stride*height])
void readMatRows (yuv_t *matB, pix_t* pixStream,
int height, int width, int stride);
#pragma SDS data access_pattern(pixStream:SEQUENTIAL, dst:SEQUENTIAL)
#pragma SDS data mem_attribute(dst:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(dst[0:stride*height])
void writeMatRows (yuv_t* pixStream, yuv_t *dst,
int height, int width, int stride);
#pragma SDS data access_pattern(f0Stream:SEQUENTIAL, f1Stream:SEQUENTIAL)
#pragma SDS data access_pattern(ixix_out:SEQUENTIAL, ixiy_out:SEQUENTIAL, iyiy_out:SEQUENTIAL)
#pragma SDS data access_pattern(dix_out:SEQUENTIAL, diy_out:SEQUENTIAL)
void computeSum(pix_t* f0Stream, pix_t* f1Stream,
int* ixix_out, int* ixiy_out, int* iyiy_out,
int* dix_out, int* diy_out,
int height, int width);
#pragma SDS data access_pattern(ixix:SEQUENTIAL, ixiy:SEQUENTIAL, iyiy:SEQUENTIAL)
#pragma SDS data access_pattern(dix:SEQUENTIAL, diy:SEQUENTIAL)
#pragma SDS data access_pattern(fx_out:SEQUENTIAL, fy_out:SEQUENTIAL)
void computeFlow(int* ixix, int* ixiy, int* iyiy,
int* dix, int* diy,
float* fx_out, float* fy_out,
int height, int width);
#pragma SDS data access_pattern(fx:SEQUENTIAL, fy:SEQUENTIAL, out_pix:SEQUENTIAL)
void getOutPix (float* fx, float* fy, yuv_t* out_pix,
int height, int width, float clip_flowmag);
For the readMatRows and writeMatRows function arguments, which interface to the processor, the memory transfers are specified as sequential accesses from physically contiguous memory, and the data should be copied to and from the hardware function, and not simply accessed from the accelerator. This ensures the data is copied efficiently. The following options are used:
- Sequential: The data is transferred in the same sequential manner as it is processed. This type of transfer requires the least amount of hardware overhead for high data processing rates and means an area-efficient datamover is used.
- Contiguous: The data is accessed from contiguous memory. This ensures there is no scatter-gather overhead in the data transfer rate and an efficient, fast hardware datamover is used. This directive is supported by the associated sds_alloc library call in the main() function, which ensures data for these arguments is stored in contiguous memory.
- Copy: The data is copied to and from the accelerator, negating the need for data accesses back to the CPU or DDR memory. Because pointers are used, the size of the data to be copied is specified.
For the remaining hardware functions, the data transfers are specified as sequential, allowing the most efficient hardware to be used to connect the functions in the programmable logic (PL) fabric.
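To make the meaning of a sequential access pattern concrete, the following software sketch reads every element of its input exactly once, in increasing address order, which is the behavior the SEQUENTIAL setting describes. The function `read_rows_sequential` is hypothetical and is not the readMatRows implementation from the design example; the 8-bit and 16-bit widths chosen for `pix_t` and `yuv_t` are also assumptions.

```cpp
#include <cstdint>

typedef std::uint8_t  pix_t;  // assumed 8-bit luma sample
typedef std::uint16_t yuv_t;  // assumed 16-bit sample, luma in the low byte

// Hypothetical model of a sequential reader: consumes stride*height input
// elements in order and emits the luma of the first 'width' samples per row.
void read_rows_sequential(const yuv_t* src, pix_t* dst,
                          int height, int width, int stride)
{
    int src_index = 0;
    int dst_index = 0;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < stride; ++c) {
            const yuv_t sample = src[src_index++];  // each element read once, in order
            if (c < width) {
                dst[dst_index++] = static_cast<pix_t>(sample & 0x00FF);  // keep luma
            }
        }
    }
}
```

Because neither index ever moves backward or skips ahead, the same loop structure can be fed by a simple streaming datamover rather than by random-access memory.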
Optical Flow Hardware Function Optimization
The hardware functions also require optimization directives to execute at the highest level of performance. These are already present in the design example. Reviewing these highlights the lessons learned from Understanding the Hardware Function Optimization Methodology. Most of the hardware functions in this design example are optimized using primarily the PIPELINE directive, in a manner similar to the getOutPix function.
Review of the getOutPix function shows:
- The sub-functions have an INLINE optimization applied to ensure the logic from these functions is merged with the function above. This automatically occurs for small functions, but the use of this directive ensures the sub-functions are always inlined, and there is no need to pipeline the sub-functions.
- The inner loop of the getOutPix function is the loop that processes data at the level of each pixel and is optimized with the PIPELINE directive to ensure it processes one pixel per clock.
pix_t getLuma (float fx, float fy, float clip_flowmag)
{
#pragma HLS inline
  float rad = sqrtf (fx*fx + fy*fy);
  if (rad > clip_flowmag) rad = clip_flowmag; // clamp to MAX
  rad /= clip_flowmag; // convert 0..MAX to 0.0..1.0
  pix_t pix = (pix_t) (255.0f * rad);
  return pix;
}

pix_t getChroma (float f, float clip_flowmag)
{
#pragma HLS inline
  if (f > clip_flowmag ) f = clip_flowmag; // clamp big positive f to MAX
  if (f < (-clip_flowmag)) f = -clip_flowmag; // clamp big negative f to -MAX
  f /= clip_flowmag; // convert -MAX..MAX to -1.0..1.0
  pix_t pix = (pix_t) (127.0f * f + 128.0f); // convert -1.0..1.0 to -127..127 to 1..255
  return pix;
}

void getOutPix (float* fx,
                float* fy,
                yuv_t* out_pix,
                int height, int width, float clip_flowmag)
{
  int pix_index = 0;
  for (int r = 0; r < height; r++) {
    for (int c = 0; c < width; c++) {
#pragma HLS PIPELINE
      float fx_ = fx[pix_index];
      float fy_ = fy[pix_index];
      pix_t outLuma = getLuma (fx_, fy_, clip_flowmag);
      // alternate chroma between fx (even columns) and fy (odd columns)
      pix_t outChroma = (c&1)? getChroma (fy_, clip_flowmag) : getChroma (fx_, clip_flowmag);
      yuv_t yuvpix;
      yuvpix = ((yuv_t)outChroma << 8) | outLuma;
      out_pix[pix_index++] = yuvpix;
    }
  }
}
If you examine the computeSum function, you will find examples of the ARRAY_PARTITION and DEPENDENCE directives. In this function, the ARRAY_PARTITION directive is used on array img1Win. Because img1Win is an array, it is implemented by default in a block RAM, which has a maximum of two ports. However, as the following code summary shows, the array is:
- Used in a for-loop that is pipelined to process 1 sample per clock cycle.
- Read from 8 + (KMEDP1-1) + (KMEDP1-1) times within the for-loop.
- Written to (KMEDP1-1) + (KMEDP1-1) times within the for-loop.
void computeSum(pix_t* f0Stream,
pix_t* f1Stream,
int* ixix_out,
int* ixiy_out,
int* iyiy_out,
int* dix_out,
int* diy_out)
static pix_t img1Win [2 * KMEDP1], img2Win [1 * KMEDP1];
#pragma HLS ARRAY_PARTITION variable=img1Win complete dim=0
for (int r = 0; r < MAX_HEIGHT; r++) {
for (int c = 0; c < MAX_WIDTH; c++) {
int cIxTopR = (img1Col_ [wrt] - img1Win [wrt*2 + 2-2]) /2 ;
int cIyTopR = (img1Win [ (wrt+1)*2 + 2-1] - img1Win [ (wrt-1)*2 + 2-1]) /2;
int delTopR = img1Win [wrt*2 + 2-1] - img2Win [wrt*1 + 1-1];
int cIxBotR = (img1Col_ [wrb] - img1Win [wrb*2 + 2-2]) /2 ;
int cIyBotR = (img1Win [ (wrb+1)*2 + 2-1] - img1Win [ (wrb-1)*2 + 2-1]) /2;
int delBotR = img1Win [wrb*2 + 2-1] - img2Win [wrb*1 + 1-1];
// shift windows
for (int i = 0; i < KMEDP1; i++) {
img1Win [i * 2] = img1Win [i * 2 + 1];
for (int i=0; i < KMEDP1; ++i) {
img1Win [i*2 + 1] = img1Col_ [i];
} // for c
} // for r
Because a block RAM only supports a maximum of two accesses per clock cycle, all of these accesses cannot be made in one clock cycle. As noted previously in the methodology, the ARRAY_PARTITION directive is used to partition the array into smaller blocks, in this case into individual elements, by using the complete option. This enables parallel access to all elements of the array at the same time and ensures that the for-loop processes data every clock cycle.
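The same pattern can be shown on a much smaller scale. The following sketch is not taken from the design example: the function `window_sum` and the window length `TAPS` are hypothetical. It shifts a small window and reads every element of it in each loop iteration, which is more than a two-port memory allows per clock cycle, so the window is completely partitioned. A standard C++ compiler ignores the HLS pragmas (possibly with an unknown-pragma warning), so the function can also be run as ordinary software.

```cpp
constexpr int TAPS = 4;  // window length, chosen only for illustration

// Writes to out[i] the sum of the most recent TAPS input samples.
void window_sum(const int* in, int* out, int n)
{
    int win[TAPS] = {0, 0, 0, 0};
#pragma HLS ARRAY_PARTITION variable=win complete dim=0
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        // Shift the window: TAPS-1 reads and TAPS-1 writes in one iteration.
        for (int k = 0; k < TAPS - 1; ++k) {
            win[k] = win[k + 1];
        }
        win[TAPS - 1] = in[i];

        // Read every element of the window in the same iteration.
        int sum = 0;
        for (int k = 0; k < TAPS; ++k) {
            sum += win[k];
        }
        out[i] = sum;
    }
}
```

With the input 1, 2, 3, 4, 5 the function produces 1, 3, 6, 10, 14.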
The final optimization directive worth reviewing is the DEPENDENCE directive. The csIxix array has a DEPENDENCE directive applied to it. The array is read from and then written to using different indices within a pipelined loop, as shown in the following code example.
void computeSum(pix_t* f0Stream,
pix_t* f1Stream,
int* ixix_out,
int* ixiy_out,
int* iyiy_out,
int* dix_out,
int* diy_out)
static int csIxix [MAX_WIDTH], csIxiy [MAX_WIDTH], csIyiy [MAX_WIDTH], csDix [MAX_WIDTH], csDiy [MAX_WIDTH];
#pragma HLS DEPENDENCE variable=csIxix inter WAR false
int zIdx= - (KMED-2);
int nIdx = zIdx + KMED-2;
for (int r = 0; r < MAX_HEIGHT; r++) {
for (int c = 0; c < MAX_WIDTH; c++) {
if (zIdx >= 0) {
csIxixL = csIxix [zIdx];
csIxix [nIdx] = csIxixR;
if (zIdx == MAX_WIDTH) zIdx = 0;
if (nIdx == MAX_WIDTH) nIdx = 0;
} // for c
} // for r
When a loop is pipelined in hardware, the accesses to the array overlap in time. The compiler analyzes all accesses to an array and issues a warning if any condition exists where the write in iteration N overwrites the data for iteration N + K, thus changing the value. The warning prevents implementing a pipeline with II = 1.
The following example shows read and write operations for a loop over multiple iterations for an array with indices 0 through 9. As in the code above, it is possible for the address counters to differ between the read and write operations and to return to zero, before all loop iterations are complete. The operations are shown overlapped in time, just like a pipelined implementation.
In sequential C code, where each iteration completes before the next starts, it is clear what order the reads and writes occur. However, in a concurrent hardware pipeline, the accesses can overlap and occur in different orders. As can be seen clearly above, it is possible for the read from index 8, as noted by R8, to occur in time before the write to index 8 (W8) which is meant to occur some iterations before R8.
The compiler warns of this condition, and the DEPENDENCE directive is used with the setting false to tell the compiler that there is no dependence on read-after-write, allowing the compiler to create pipelined hardware that performs with II = 1.
The DEPENDENCE directive is typically used to inform the compiler of algorithm behaviors and conditions external to the function, of which it is unaware from static analysis of the code. If a DEPENDENCE directive is set incorrectly, the issue is discovered in hardware emulation, when the results from the hardware differ from those achieved with the software.
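A reduced version of this access pattern is sketched below. It is not the computeSum code: the function `delayed_copy` and the constants `BUF_LEN` and `DELAY` are hypothetical. The read and write indices stay a fixed distance apart and both wrap to zero, so an element is always read several iterations after it was written and long before it is overwritten. That spacing is the kind of algorithm knowledge the directive passes to the compiler, using the same pragma form as the example above. Run as ordinary software, the pragmas are ignored.

```cpp
constexpr int BUF_LEN = 8;  // circular buffer length (illustrative)
constexpr int DELAY   = 3;  // fixed distance between write and read index (illustrative)

// Copies 'in' to 'out' delayed by DELAY samples; the first DELAY outputs are zero.
void delayed_copy(const int* in, int* out, int n)
{
    int buf[BUF_LEN] = {0};
#pragma HLS DEPENDENCE variable=buf inter WAR false
    int rd = 0;      // read index
    int wr = DELAY;  // write index, always DELAY ahead of rd (modulo BUF_LEN)
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = buf[rd];  // value written DELAY iterations earlier
        buf[wr] = in[i];   // never the same address as rd in this iteration
        if (++rd == BUF_LEN) rd = 0;
        if (++wr == BUF_LEN) wr = 0;
    }
}
```

With the input 1, 2, 3, 4, 5 the output is 0, 0, 0, 1, 2. If the spacing between the indices were not guaranteed, asserting a false dependence could produce hardware results that differ from this software behavior.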
Optical Flow Results
With both the data transfers and hardware functions optimized, the hardware functions are recompiled, and the performance is analyzed using event traces. The figure below shows the start of the event traces, and clearly shows that the pipelined hardware functions do not wait for the previous function to complete before they start executing. Each hardware function begins to process data as soon as data becomes available.
Figure: Trace Result
The complete view of the event traces shows all hardware functions and data transfers executing in parallel for the highest performing system, as shown in the following figure.
Figure: Event Traces
To get the duration, hover over one of the lanes to display a popup window that shows the accelerator run time. The execution time is just under 15.5 ms, which meets the 16.7 ms target required to achieve 60 frames per second. The following figure shows the trace legend for the AXI State View:
Figure: AXI State View Trace Legend
- Software: Execution done on the Arm® processor core.
- Accelerator: Execution done in the accelerator(s).
- Transfer: Data being transferred from the Arm processor core.
- Receive: Data being received by the Arm processor core.
Bottom-Up: Stereo Vision Algorithm
The stereo vision algorithm uses images from two cameras horizontally displaced from each other. This provides two different views of the scene from different vantage points, similar to human vision. To obtain the relative depth information from the scene, compare the two images to build a disparity map. The disparity map encodes the relative positions of objects in the horizontal coordinates such that the values are inversely proportional to the scene depth at the corresponding pixel location.
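The inverse relationship between disparity and depth can be written as depth = focal length x baseline / disparity for a rectified camera pair. The following sketch applies that standard relation; the function name, parameter names, and units are chosen for illustration and do not appear in the design example.

```cpp
// Depth of a point from its disparity, for a rectified stereo pair.
//   focal_length_px : focal length in pixels
//   baseline_m      : horizontal distance between the two cameras, in meters
//   disparity_px    : horizontal shift of the point between the images, in pixels
// Returns the depth in meters, or -1.0 when the disparity is not positive.
double depth_from_disparity(double focal_length_px, double baseline_m,
                            double disparity_px)
{
    if (disparity_px <= 0.0) {
        return -1.0;  // no valid match, or point at infinity
    }
    return focal_length_px * baseline_m / disparity_px;
}
```

For a focal length of 1000 pixels and a baseline of 0.1 m, a disparity of 50 pixels corresponds to a depth of about 2 m, and halving the disparity to 25 pixels doubles the depth to about 4 m.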
The bottom-up methodology starts with a fully optimized hardware design that is already synthesized using the Vivado® High-Level Synthesis (HLS) tool and then integrates the pre-optimized hardware function with software in the SDSoC environment.
This flow allows hardware designers who are already familiar with the HLS tool to build and optimize the entire hardware function first, using advanced HLS tool features, and then allows software programmers to leverage this existing work.
The following section uses the stereo vision design example to take you through the steps of starting with an optimized hardware function in the HLS tool and building an application that integrates the full system, with hardware and software running on the board, using the SDSoC environment. The following figure shows the final system to be realized, and highlights the existing stereo_remap_bm hardware function to be incorporated into the SDSoC environment.
Figure: Block Diagram of System
In the bottom-up flow, the general optimization methodology for the SDSoC environment, as detailed in this guide, is reversed. By definition, you would start with an optimized hardware function, and then seek to incorporate it into the SDSoC environment and optimize the data transfers.
Stereo Vision Hardware Function Optimization
The following code example shows the existing stereo_remap_bm hardware function with its optimization pragmas. Before reviewing the optimization directives, note the following details about the function:
- The hardware function contains sub-functions, including readLRinput and writeDispOut, that have also been optimized.
- The hardware function also uses pre-optimized functions, prefixed with the hls namespace, from the Vivado HLS tool video library, hls_video.h. These sub-functions use their own data type, Mat.
#include "hls_video.h"
#include "top.h"
#include "transform.h"
void readLRinput (yuv_t *inLR,
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& img_l,
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& img_r,
int height, int dual_width, int width, int stride)
for (int i=0; i < height; ++i) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
for (int j=0; j < stride; ++j) {
#pragma HLS loop_tripcount min=1920 max=1920 avg=1920
yuv_t tmpData = inLR [i*stride + j]; // from yuv_t array: consume height*stride
if (j < width)
img_l.write (tmpData & 0x00FF); // to HLS_8UC1 stream
else if (j < dual_width)
img_r.write (tmpData & 0x00FF); // to HLS_8UC1 stream
void writeDispOut(hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& img_d,
yuv_t *dst,
int height, int width, int stride)
pix_t tmpOut;
yuv_t outData;
for (int i=0; i < height; ++i) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
for (int j=0; j < stride; ++j) {
#pragma HLS loop_tripcount min=960 max=960 avg=960
if (j < width) {
tmpOut = img_d.read().val[0];
outData = ((yuv_t) 0x8000) | ((yuv_t)tmpOut);
dst [i*stride +j] = outData;
else {
outData = (yuv_t) 0x8000;
dst [i*stride +j] = outData;
namespace hls {
void SaveAsGray(
int height = src.rows;
int width = src.cols;
for (int i = 0; i < height; i++) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
for (int j = 0; j < width; j++) {
#pragma HLS loop_tripcount min=960 max=960 avg=960
#pragma HLS pipeline II=1
Scalar<1, short> s;
Scalar<1, unsigned char> d;
src >> s;
short uval = (short) (abs ((int)s.val[0]));
// Scale to avoid overflow. The right scaling here for a
// good picture depends on the NDISP parameter during
// block matching.
d.val[0] = (unsigned char)(uval >> 1);
//d.val[0] = (unsigned char)(s.val[0] >> 1);
dst << d;
} // namespace hls
int stereo_remap_bm_new(
yuv_t *img_data_lr,
yuv_t *img_data_disp,
hls::Window<3, 3, param_T > &lcameraMA_l,
hls::Window<3, 3, param_T > &lcameraMA_r,
hls::Window<3, 3, param_T > &lirA_l,
hls::Window<3, 3, param_T > &lirA_r,
param_T (&ldistC_l)[5],
param_T (&ldistC_r)[5],
int height, // 1080
int dual_width, // 1920 (two 960x1080 images side by side)
int stride_in, // 1920 (two 960x1080 images side by side)
int stride_out) // 960
int width = dual_width/2; // 960
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_l(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_r(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_l_remap(height, width); // remapped left image
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_r_remap(height, width); // remapped right image
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_d(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC2> map1_l(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC2> map1_r(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16UC2> map2_l(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16UC2> map2_r(height, width);
hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC1> img_disp(height, width);
hls::StereoBMState<15, 32, 32> state;
// ddr -> kernel streams: extract luma from left and right yuv images
// store it in single channel HLS_8UC1 left and right Mat's
readLRinput (img_data_lr, img_l, img_r, height, dual_width, width, stride_in);
//////////////////////// remap left and right images, all types are HLS_8UC1 //////////
hls::InitUndistortRectifyMapInverse(lcameraMA_l, ldistC_l, lirA_l, map1_l, map2_l);
hls::Remap<8>(img_l, img_l_remap, map1_l, map2_l, HLS_INTER_LINEAR);
hls::InitUndistortRectifyMapInverse(lcameraMA_r, ldistC_r, lirA_r, map1_r, map2_r);
hls::Remap<8>(img_r, img_r_remap, map1_r, map2_r, HLS_INTER_LINEAR);
////////// find disparity of remapped images //////////
hls::FindStereoCorrespondenceBM(img_l_remap, img_r_remap, img_disp, state);
hls::SaveAsGray(img_disp, img_d);
// kernel stream -> ddr : output single wide
writeDispOut (img_d, img_data_disp, height, width, stride_out);
return 0;
int stereo_remap_bm(
yuv_t *img_data_lr,
yuv_t *img_data_disp,
int height, // 1080
int dual_width, // 1920 (two 960x1080 images side by side)
int stride_in, // 1920 (two 960x1080 images side by side)
int stride_out) // 960
//#pragma HLS interface m_axi port=img_data_lr depth=2073600
//#pragma HLS interface m_axi port=img_data_disp depth=2073600
hls::Window<3, 3, param_T > lcameraMA_l;
hls::Window<3, 3, param_T > lcameraMA_r;
hls::Window<3, 3, param_T > lirA_l;
hls::Window<3, 3, param_T > lirA_r;
param_T ldistC_l[5];
param_T ldistC_r[5];
for (int i=0; i<3; i++) {
for (int j=0; j<3; j++) {
for (int i=0; i<5; i++) {
ldistC_l[i] = distC_l[i];
ldistC_r[i] = distC_r[i];
int ret = stereo_remap_bm_new(img_data_lr,
return ret;
As noted in Understanding the Hardware Function Optimization Methodology, the primary optimization directives used are the PIPELINE and DATAFLOW directives. Additionally, the LOOP_TRIPCOUNT directive is used.
Based on the recommendations for optimizing hardware functions, which process frames of data, the PIPELINE directives are all applied to for-loops that process data at the sample level, or in this case, the pixel level. This ensures hardware pipelining is used to achieve the highest performing design.
The LOOP_TRIPCOUNT directives are used on for-loops for which the upper bound of the loop index is defined by a variable whose exact value is unknown at compile time. The estimated tripcount, or loop iteration count, allows the reports generated by the HLS tool to include expected values for latency and initiation interval (II), instead of unknowns. This directive has no impact on the hardware created; it only impacts reporting.
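As a minimal illustration of the two directives working together, the sketch below uses loops whose bounds are run-time arguments. The function `sum_rows` is hypothetical, and the tripcount values simply assume a 1920x1080 frame; the function gives the same result in software for any bounds, because the LOOP_TRIPCOUNT values are used for reporting only.

```cpp
#include <cstdint>

// Sums the first 'width' samples of each row; all bounds are run-time values.
std::uint32_t sum_rows(const std::uint8_t* img, int height, int width, int stride)
{
    std::uint32_t total = 0;
    for (int r = 0; r < height; ++r) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
        for (int c = 0; c < width; ++c) {
#pragma HLS loop_tripcount min=1920 max=1920 avg=1920
#pragma HLS pipeline II=1
            total += img[r * stride + c];  // one sample per loop iteration
        }
    }
    return total;
}
```

The PIPELINE directive sits on the innermost, per-pixel loop, in line with the recommendation above, while each LOOP_TRIPCOUNT directive sits on the loop whose bound the tool cannot determine.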
The top-level stereo_remap_bm function is composed of the optimized sub-functions and a number of functions from the HLS tool video library (hls_video.h). For details about the library functions provided by the HLS tool video library, refer to Vivado Design Suite User Guide: High-Level Synthesis (UG902).
The functions provided in the HLS tool video library are already pre-optimized and contain all the optimization directives to ensure they are implemented with the highest possible performance. The top-level function is therefore composed of sub-functions that are all optimized, and it only requires the DATAFLOW directive to ensure each sub-function starts to execute in hardware as soon as data becomes available.
int stereo_remap_bm(..) {
#pragma HLS DATAFLOW
  readLRinput (img_data_lr, img_l, img_r, height, dual_width, width, stride);
  hls::InitUndistortRectifyMapInverse(lcameraMA_l, ldistC_l, lirA_l, map1_l, map2_l);
  hls::Remap<8>(img_l, img_l_remap, map1_l, map2_l, HLS_INTER_LINEAR);
  hls::InitUndistortRectifyMapInverse(lcameraMA_r, ldistC_r, lirA_r, map1_r, map2_r);
  hls::Remap<8>(img_r, img_r_remap, map1_r, map2_r, HLS_INTER_LINEAR);
  hls::Duplicate(img_l_remap, img_l_remap_bm, img_l_remap_pt);
  hls::FindStereoCorrespondenceBM(img_l_remap_bm, img_r_remap, img_disp, state);
  hls::SaveAsGray(img_disp, img_d);
  writeDispOut (img_l_remap_pt, img_d, img_data_disp, height, dual_width, width, stride);
}
In general, the DATAFLOW optimization is not required because the SDSoC™ environment automatically ensures that data is passed from one hardware function to the next as soon as it becomes available; however, in this example, the functions within stereo_remap_bm are using the HLS tool data type hls::stream, which cannot be compiled on the Arm® processor and cannot be used in the hardware function interface in the SDSoC environment. For this reason, the top-level hardware function must be stereo_remap_bm, and thus the DATAFLOW directive is used to achieve high-performance transfers between the sub-functions. If this were not the case, the DATAFLOW directive could be removed and each sub-function within stereo_remap_bm could be specified as a hardware function.
The hardware functions in this design example use the data type Mat, which is based on the HLS tool data type hls::stream. The hls::stream type can only be accessed in a sequential manner: data is pushed on and popped off.
- In software simulation, the hls::stream data type has infinite size.
- In hardware, the hls::stream data type is implemented as a single register and can only store one data value at a time, because it is expected that the streaming data is consumed before the previous value is overwritten.
By specifying the top-level stereo_remap_bm function as the hardware function, the effects of these hardware types can be ignored in the software environment; however, when these functions are incorporated into the SDSoC environment, they cannot be compiled on the Arm processor, and the system can only be verified through hardware emulation, executing on the target platform, or both.
The hls::stream data type is designed for use within the HLS tool, but is unsuitable for running software on embedded CPUs. Therefore, this type should not be part of the top-level hardware function interface. If any of the arguments of the hardware function use any HLS tool specific data types, the function must be enclosed by a top-level C/C++ wrapper function that exposes only native C/C++ types in the function argument list.
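The wrapper idea can be sketched without the HLS headers. In the following example, `FifoStream` is a stand-in written for this illustration, not the hls::stream class, and the functions `load`, `keep_luma`, `store`, and `process_frame` are hypothetical. The internal stages communicate through the streaming type, while the top-level function exposes only pointers and an integer in its argument list.

```cpp
#include <queue>

// Stand-in for a sequential-access streaming type; NOT the HLS tool class.
template <typename T>
class FifoStream {
public:
    void write(const T& v) { q_.push(v); }            // push a value on
    T read() { T v = q_.front(); q_.pop(); return v; } // pop the oldest value off
private:
    std::queue<T> q_;
};

// Internal stages: these use the streaming type between them.
static void load(const unsigned short* src, FifoStream<unsigned short>& s, int n)
{
    for (int i = 0; i < n; ++i) s.write(src[i]);
}

static void keep_luma(FifoStream<unsigned short>& in,
                      FifoStream<unsigned short>& out, int n)
{
    for (int i = 0; i < n; ++i) out.write(in.read() & 0x00FF);
}

static void store(FifoStream<unsigned short>& s, unsigned short* dst, int n)
{
    for (int i = 0; i < n; ++i) dst[i] = s.read();
}

// Top-level wrapper: only native C/C++ types appear in the argument list.
int process_frame(const unsigned short* src, unsigned short* dst, int n)
{
    FifoStream<unsigned short> a;
    FifoStream<unsigned short> b;
    load(src, a, n);
    keep_luma(a, b, n);
    store(b, dst, n);
    return 0;
}
```

A caller sees only `process_frame(src, dst, n)` and never needs to include or compile the streaming type, which is the property the wrapper requirement is meant to guarantee.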
Optimizing the Data Motion Network
After importing the pre-optimized hardware function into a project in the SDSoC environment, the first task is to remove any interface optimizations. Based on the data types of the hardware function and data access, the interface between the PS and the hardware function is managed and automatically optimized. See Data Motion Optimization.
- Remove any INTERFACE directives present in the hardware function.
- Remove any DATA_PACK directives that reference variables present in the hardware function argument list.
- Remove any of the Vivado HLS tool hardware data types by enclosing the top-level function in wrappers that only use native C/C++ types for the function arguments.
In this example, the functions to be accelerated are captured inside a single top-level hardware function, stereo_remap_bm:
int main() {
unsigned char *inY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
unsigned short *inCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
unsigned short *outCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
unsigned char *outY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
// read double wide image from disk
if (read_yuv_file(inY, DUALWIDTH, DUALWIDTH, HEIGHT, FILEINAME) != 0)
return -1;
convert_Y8toCY16(inY, inCY, HEIGHT*DUALWIDTH);
stereo_remap_bm(inCY, outCY, HEIGHT, DUALWIDTH, DUALWIDTH);
// convert the 16-bit result back to 8-bit
convert_CY16toY8(outCY, outY, HEIGHT*DUALWIDTH);
// write single wide image to disk
return 0;
}
The key to optimizing the memory accesses to the hardware is to review the data types passed into the hardware function. Reviewing the function signature shows the key variable names to optimize: the input and output data streams img_data_lr and img_data_disp.
int stereo_remap_bm(
yuv_t *img_data_lr,
yuv_t *img_data_disp,
int height,
int dual_width,
int stride);
Because the data is transferred in a sequential manner, first ensure that the access pattern is defined as SEQUENTIAL for both arguments. For the next optimization, ensure the data transfer is not interrupted by a scatter-gather DMA operation by specifying the mem_attribute as PHYSICAL_CONTIGUOUS|NON_CACHEABLE. This also requires that the memory is allocated with sds_alloc from sds_lib.
#include "sds_lib.h"
int main() {
unsigned char *inY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
unsigned short *inCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
unsigned short *outCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
unsigned char *outY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
Finally, the copy directive is used to ensure the data is explicitly copied to the accelerator, and that the data is not accessed from shared memory.
#pragma SDS data access_pattern(img_data_lr:SEQUENTIAL)
#pragma SDS data mem_attribute(img_data_lr:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
#pragma SDS data copy(img_data_lr[0:stride*height])
#pragma SDS data access_pattern(img_data_disp:SEQUENTIAL)
#pragma SDS data mem_attribute(img_data_disp:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
#pragma SDS data copy(img_data_disp[0:stride*height])
int stereo_remap_bm(
yuv_t *img_data_lr,
yuv_t *img_data_disp,
int height,
int dual_width,
int stride);
With these optimization directives, the memory access between the PS and PL is optimized for the most efficient transfers.
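The size given in each copy pragma is an element count, and it is useful to check it against the buffer allocation. The following sketch performs that arithmetic; the helper names are hypothetical, and the 2-byte element size assumes `yuv_t` is a 16-bit type, as the unsigned short buffers in main() suggest.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements covered by a copy range of the form arg[0:stride*height].
constexpr std::size_t copy_elements(std::size_t stride, std::size_t height)
{
    return stride * height;
}

// Size of the same range in bytes, for a given element size.
constexpr std::size_t copy_bytes(std::size_t stride, std::size_t height,
                                 std::size_t element_size)
{
    return copy_elements(stride, height) * element_size;
}
```

For a stride of 1920 and a height of 1080, the range covers 2,073,600 elements, or 4,147,200 bytes with 2-byte elements, which matches an allocation of HEIGHT*DUALWIDTH*2 bytes for those dimensions.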
Stereo Vision Results
The hardware function optimized with the Vivado HLS tool is first wrapped, as in this example, to ensure the HLS tool hardware data types are not exposed at the interface of the hardware function. Any interface directives are then removed and the data transfers are optimized. Finally, the hardware functions are recompiled, and the performance is analyzed using event traces.
The following figure shows the complete view of the event traces, and all hardware functions and data transfers executing in parallel for the highest performing system.
Figure: Event Traces
To get the duration, hover over one of the lanes to display a popup window that shows the accelerator run time. The execution time is 15.86 ms, which meets the 16.7 ms target required to achieve 60 frames per second for live video.