Real-World Examples

This chapter describes some real-world examples and shows the following:

  • How these examples are optimized using both the top-down flow and bottom-up flow
    • The top-down flow is demonstrated using a Lucas-Kanade (LK) Optical Flow algorithm.
    • The bottom-up flow is demonstrated using a stereo vision block matching algorithm.
  • What optimization directives were applied
  • Why those directives were chosen

Top-Down: Optical Flow Algorithm

The Lucas-Kanade (LK) method is a widely used differential method for optical flow estimation, that is, the estimation of the movement of pixels between two related images. In this example system, the related images are the current and previous images of a video stream. The LK method is a compute-intensive algorithm that works over a window of neighboring pixels and uses a least-squares difference to find matching pixels.
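
For a single window, the least-squares step reduces to solving a 2x2 linear system. The following sketch is not the code used in this design example. It assumes the common LK formulation, in which the inputs are windowed sums of gradient products. The names Flow and solve_lk_window, the sign convention, and the singularity threshold are hypothetical; the argument names only mirror the ixix, ixiy, iyiy, dix, and diy buffers used later in this chapter.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: a flow vector plus a flag that is false when
// the window is too uniform to give a unique solution.
struct Flow {
    float fx;
    float fy;
    bool  valid;
};

// Solves the 2x2 system
//   [ixix ixiy] [fx]     [dix]
//   [ixiy iyiy] [fy] = - [diy]
// with Cramer's rule. The sign convention and the 1e-6 threshold are
// assumptions made for this sketch.
Flow solve_lk_window(float ixix, float ixiy, float iyiy, float dix, float diy)
{
    // Sums of squared gradients cannot be negative.
    assert(ixix >= 0.0f && iyiy >= 0.0f);

    Flow result = {0.0f, 0.0f, false};
    const float det = ixix * iyiy - ixiy * ixiy;
    if (std::fabs(det) < 1e-6f) {
        return result;              // singular system: no reliable flow
    }
    result.fx = (ixiy * diy - iyiy * dix) / det;
    result.fy = (ixiy * dix - ixix * diy) / det;
    result.valid = true;
    return result;
}
```

For example, the sums ixix = 2, ixiy = 0, iyiy = 4, dix = -2, diy = -8 give a flow vector of (1, 2), while a window in which the determinant is zero is reported as invalid.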

The following code example shows how to implement this algorithm, where two input files are read in and processed through function fpga_optflow, and the results are written to an output file.

int main()
{
	FILE *f;
	pix_t *inY1 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
	yuv_t *inCY1 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
	pix_t *inY2 = (pix_t *)sds_alloc(HEIGHT*WIDTH);
	yuv_t *inCY2 = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
	yuv_t *outCY = (yuv_t *)sds_alloc(HEIGHT*WIDTH*2);
	printf("allocated buffers\n");

	f = fopen(FILEINAME,"rb");
	if (f == NULL) {
		printf("failed to open file %s\n", FILEINAME);
		return -1;
	}
	printf("opened file %s\n", FILEINAME);

	read_yuv_frame(inY1, WIDTH, WIDTH, HEIGHT, f);
	printf("read 1st %dx%d frame\n", WIDTH, HEIGHT);
	read_yuv_frame(inY2, WIDTH, WIDTH, HEIGHT, f);
	printf("read 2nd %dx%d frame\n", WIDTH, HEIGHT);

	fclose(f);
	printf("closed file %s\n", FILEINAME);

	convert_Y8toCY16(inY1, inCY1, HEIGHT*WIDTH);
	printf("converted 1st frame to 16bit\n");
	convert_Y8toCY16(inY2, inCY2, HEIGHT*WIDTH);
	printf("converted 2nd frame to 16bit\n");

	fpga_optflow(inCY1, inCY2, outCY, HEIGHT, WIDTH, WIDTH, 10.0);
	printf("computed optical flow\n");

	// write optical flow data image to disk
	write_yuv_file(outCY, WIDTH, WIDTH, HEIGHT, ONAME);

	sds_free(inY1);
	sds_free(inCY1);
	sds_free(inY2);
	sds_free(inCY2);
	sds_free(outCY);
	printf("freed buffers\n");

	return 0;
}

This method is typical for a top-down design flow using standard C/C++ data types.

Function fpga_optflow is shown in the following code example and contains the following sub-functions:

  • readMatRows
  • computeSum
  • computeFlow
  • getOutPix
  • writeMatRows

int fpga_optflow (yuv_t *frame0, yuv_t *frame1, yuv_t *framef, int height, int width, int stride, float clip_flowmag)
{
#ifdef COMPILEFORSW
	  int img_pix_count = height*width;
#else
	  int img_pix_count = 10;
#endif

  if (f0Stream == NULL) f0Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (f1Stream == NULL) f1Stream = (pix_t *) malloc(sizeof(pix_t) * img_pix_count);
  if (ffStream == NULL) ffStream = (yuv_t *) malloc(sizeof(yuv_t) * img_pix_count);

  if (ixix == NULL) ixix = (int *) malloc(sizeof(int) * img_pix_count);
  if (ixiy == NULL) ixiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (iyiy == NULL) iyiy = (int *) malloc(sizeof(int) * img_pix_count);
  if (dix == NULL) dix = (int *) malloc(sizeof(int) * img_pix_count);
  if (diy == NULL) diy = (int *) malloc(sizeof(int) * img_pix_count);

  if (fx == NULL) fx = (float *) malloc(sizeof(float) * img_pix_count);
  if (fy == NULL) fy = (float *) malloc(sizeof(float) * img_pix_count);

  readMatRows (frame0, f0Stream, height, width, stride);
  readMatRows (frame1, f1Stream, height, width, stride);

  computeSum (f0Stream, f1Stream, ixix, ixiy, iyiy, dix, diy, height, width);
  computeFlow (ixix, ixiy, iyiy, dix, diy, fx, fy, height, width);
  getOutPix (fx, fy, ffStream, height, width, clip_flowmag);

  writeMatRows (ffStream, framef, height, width, stride);

  return 0;
}

In this example, all of the functions in fpga_optflow are processing live video data, and can benefit from hardware acceleration with DMAs used to transfer the data to and from the PS. If all five functions are annotated to be hardware functions, the topology of the system is shown in the following figure:

Figure: System Topology

The system can be compiled into hardware and event tracing used to analyze the performance in detail.

The issue here is that the system takes a long time to complete: approximately 15 seconds for a single frame. To process HD video, the system should process 60 frames per second, or one frame every 16.7 ms. You can use optimization directives, as described below, to ensure the system meets the target performance.
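
The size of the gap can be worked out with a short calculation. The following sketch uses only the figures quoted above; the function names are hypothetical.

```cpp
#include <cassert>

// Time available per frame, in milliseconds, at a given frame rate.
double frame_budget_ms(double frames_per_second)
{
    assert(frames_per_second > 0.0);
    return 1000.0 / frames_per_second;
}

// Factor by which a measured frame time must shrink to fit the budget.
double required_speedup(double measured_ms, double frames_per_second)
{
    return measured_ms / frame_budget_ms(frames_per_second);
}
```

At 60 frames per second the budget is 1000 / 60, or about 16.7 ms. A frame time of 15 seconds (15000 ms) therefore has to improve by a factor of about 900.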

Optical Flow Memory Access Optimization

The first task is to optimize the transfer of data. In this case, because the system will process streaming video, where each sample is processed in consecutive order, the memory transfer optimization is used to ensure the SDSoC™ environment interprets all accesses as sequential in nature.

This is performed by adding SDS pragmas before the function signatures for all functions involved.

#pragma SDS data access_pattern(matB:SEQUENTIAL, pixStream:SEQUENTIAL)
#pragma SDS data mem_attribute(matB:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(matB[0:stride*height])
void readMatRows (yuv_t *matB, pix_t* pixStream,
		int height, int width, int stride);

#pragma SDS data access_pattern(pixStream:SEQUENTIAL, dst:SEQUENTIAL)
#pragma SDS data mem_attribute(dst:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(dst[0:stride*height])
void writeMatRows (yuv_t* pixStream, yuv_t *dst,
		int height, int width, int stride);

#pragma SDS data access_pattern(f0Stream:SEQUENTIAL, f1Stream:SEQUENTIAL)
#pragma SDS data access_pattern(ixix_out:SEQUENTIAL, ixiy_out:SEQUENTIAL, iyiy_out:SEQUENTIAL)
#pragma SDS data access_pattern(dix_out:SEQUENTIAL, diy_out:SEQUENTIAL)
void computeSum(pix_t* f0Stream, pix_t* f1Stream,
		int* ixix_out, int* ixiy_out, int* iyiy_out,
		int*  dix_out, int* diy_out,
		int height, int width);

#pragma SDS data access_pattern(ixix:SEQUENTIAL, ixiy:SEQUENTIAL, iyiy:SEQUENTIAL)
#pragma SDS data access_pattern(dix:SEQUENTIAL, diy:SEQUENTIAL)
#pragma SDS data access_pattern(fx_out:SEQUENTIAL, fy_out:SEQUENTIAL)
void computeFlow(int* ixix, int* ixiy, int* iyiy,
		int* dix, int* diy,
		float* fx_out, float* fy_out,
		int height, int width);

#pragma SDS data access_pattern(fx:SEQUENTIAL, fy:SEQUENTIAL, out_pix:SEQUENTIAL)
void getOutPix (float* fx, float* fy, yuv_t* out_pix,
		int height, int width, float clip_flowmag);

For the readMatRows and writeMatRows function arguments, which interface to the processor, the memory transfers are specified as sequential accesses from physically contiguous memory, and the data should be copied to and from the hardware function, and not simply accessed from the accelerator. This ensures the data is copied efficiently. The following options are available:

Sequential
The data is transferred in the same sequential manner as it is processed. This type of transfer requires the least amount of hardware overhead for high data processing rates and means an area-efficient datamover is used.
Contiguous
The data is accessed from contiguous memory. This ensures there is no scatter-gather overhead in the data transfer rate and an efficient fast hardware datamover is used. This directive is supported by the associated sds_alloc library call in the main() function, which ensures data for these arguments is stored in contiguous memory.
Copy
The data is copied to and from the accelerator, negating the need for data accesses back to the CPU or DDR memory. Because pointers are used, the size of the data to be copied is specified.

For the remaining hardware functions, the data transfers are specified as sequential, allowing the most efficient hardware to be used to connect the functions in the programmable logic (PL) fabric.
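
The three options can be seen together on a small function. The following is a minimal sketch, not part of the design example: invert_luma, frame_alloc, and frame_free are hypothetical names, and yuv_t is assumed to be a 16-bit type. The allocation helpers fall back to malloc when the code is built outside the SDSoC compilers, so the function can also be checked as plain software.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef __SDSCC__
#include "sds_lib.h"            // sds_alloc / sds_free
#endif

typedef unsigned short yuv_t;   // assumption: 16 bits per pixel

// Hypothetical hardware function. It reads src once, in index order, and
// writes dst once, in index order, which is what SEQUENTIAL declares.
#pragma SDS data access_pattern(src:SEQUENTIAL, dst:SEQUENTIAL)
#pragma SDS data mem_attribute(src:PHYSICAL_CONTIGUOUS)
#pragma SDS data mem_attribute(dst:PHYSICAL_CONTIGUOUS)
#pragma SDS data copy(src[0:stride*height])
#pragma SDS data copy(dst[0:stride*height])
void invert_luma(const yuv_t *src, yuv_t *dst, int height, int stride)
{
    for (int i = 0; i < height * stride; i++) {
        yuv_t pix = src[i];
        // Keep the upper byte, invert the lower (luma) byte.
        dst[i] = (yuv_t)((pix & 0xFF00) | (0x00FF - (pix & 0x00FF)));
    }
}

// Hypothetical helpers: contiguous buffers under SDSoC, malloc elsewhere.
void *frame_alloc(size_t bytes)
{
#ifdef __SDSCC__
    return sds_alloc(bytes);
#else
    return std::malloc(bytes);
#endif
}

void frame_free(void *p)
{
#ifdef __SDSCC__
    sds_free(p);
#else
    std::free(p);
#endif
}
```

The length in each copy pragma is given in array elements, and it is written in terms of the function's own arguments (stride*height), as in the pragmas shown above.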

Optical Flow Hardware Function Optimization

The hardware functions also require optimization directives to execute at the highest level of performance. These are already present in the design example. Reviewing these highlights the lessons learned from Understanding the Hardware Function Optimization Methodology. Most of the hardware functions in this design example are optimized using primarily the PIPELINE directive, in a manner similar to the getOutPix function.

Review of the getOutPix function shows:

  • The sub-functions have an INLINE optimization applied to ensure the logic from these functions is merged with the function above. This automatically occurs for small functions, but the use of this directive ensures the sub-functions are always inlined, and there is no need to pipeline the sub-functions.
  • The inner loop of the getOutPix function is the loop that processes data at the level of each pixel and is optimized with the PIPELINE directive to ensure it processes one pixel per clock cycle.

pix_t getLuma (float fx, float fy, float clip_flowmag)
{
#pragma HLS inline
  float rad = sqrtf (fx*fx + fy*fy);

  if (rad > clip_flowmag) rad = clip_flowmag; // clamp to MAX
  rad /= clip_flowmag;			    // convert 0..MAX to 0.0..1.0
  pix_t pix = (pix_t) (255.0f * rad);

  return pix;
}

pix_t getChroma (float f, float clip_flowmag)
{
#pragma HLS inline
  if (f >   clip_flowmag ) f =  clip_flowmag; // clamp big positive f to  MAX
  if (f < (-clip_flowmag)) f = -clip_flowmag; // clamp big negative f to -MAX
  f /= clip_flowmag;				// convert -MAX..MAX to -1.0..1.0
  pix_t pix = (pix_t) (127.0f * f + 128.0f);  // convert -1.0..1.0 to -127..127 to 1..255

  return pix;
}

void getOutPix (float* fx, 
                float* fy, 
                yuv_t* out_pix,
				int height, int width, float clip_flowmag)
{
  int pix_index = 0;
  for (int r = 0; r < height; r++) {
    for (int c = 0; c < width; c++) {
      #pragma HLS PIPELINE
      float fx_ = fx[pix_index];
      float fy_ = fy[pix_index];

      pix_t outLuma = getLuma (fx_, fy_, clip_flowmag);
      pix_t outChroma = (c&1)? getChroma (fy_, clip_flowmag) : getChroma (fx_, clip_flowmag);
      yuv_t yuvpix;

      yuvpix = ((yuv_t)outChroma << 8) | outLuma;

      out_pix[pix_index++] = yuvpix;
    }
  }
}

If you examine the computeSum function, you will find examples of the ARRAY_PARTITION and DEPENDENCE directives. In this function, the ARRAY_PARTITION directive is used on array img1Win. Because img1Win is an array, it is implemented by default in a block RAM, which has a maximum of two ports. The following summary shows how img1Win is accessed in the code:

  • img1Win is used in a for-loop that is pipelined to process 1 sample per clock cycle.
  • img1Win is read from 8 + (KMEDP1-1) + (KMEDP1-1) times within the for-loop.
  • img1Win is written to (KMEDP1-1) + (KMEDP1-1) times within the for-loop.

void computeSum(pix_t* f0Stream, 
                 pix_t* f1Stream, 
                 int*   ixix_out, 
                 int*   ixiy_out, 
                 int*   iyiy_out, 
                 int*   dix_out, 
                 int*   diy_out)
{

   static pix_t img1Win [2 * KMEDP1], img2Win [1 * KMEDP1];
   #pragma HLS ARRAY_PARTITION variable=img1Win complete dim=0
    ...
   for (int r = 0; r < MAX_HEIGHT; r++) {
     for (int c = 0; c < MAX_WIDTH; c++) {
        #pragma HLS PIPELINE
        ...
        int cIxTopR = (img1Col_ [wrt] - img1Win [wrt*2 + 2-2]) /2 ;
        int cIyTopR = (img1Win [ (wrt+1)*2 + 2-1] - img1Win [ (wrt-1)*2 + 2-1])  /2;
        int delTopR = img1Win [wrt*2 + 2-1] - img2Win [wrt*1 + 1-1];
        ...
        int cIxBotR = (img1Col_ [wrb] - img1Win [wrb*2 + 2-2]) /2 ;
        int cIyBotR = (img1Win [ (wrb+1)*2 + 2-1] - img1Win [ (wrb-1)*2 + 2-1]) /2;
        int delBotR = img1Win [wrb*2 + 2-1] - img2Win [wrb*1 + 1-1];
        ...
        // shift windows
        for (int i = 0; i < KMEDP1; i++) {
          img1Win [i * 2] = img1Win [i * 2 + 1];
        }
        for (int i=0; i < KMEDP1; ++i) {
          img1Win  [i*2 + 1] = img1Col_ [i];
          ...
        }
        ...
      } // for c
   }  // for r
   ...
}

Because a block RAM only supports a maximum of two accesses per clock cycle, all of these accesses cannot be made in one clock cycle. As noted previously in the methodology, the ARRAY_PARTITION directive is used to partition the array into smaller blocks, in this case into individual elements, by using the complete option. This enables parallel access to all elements of the array at the same time and ensures that the for-loop processes data every clock cycle.
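
The same pattern can be shown on a much smaller function. The following sketch is not taken from the design example: moving_sum and TAPS are hypothetical names. Every element of the window is read and written in each iteration, so a two-port memory could not sustain one sample per clock cycle, whereas a completely partitioned array can. An ordinary C++ compiler ignores the HLS pragmas, so the function can be checked in software first.

```cpp
#include <cassert>

const int TAPS = 4;   // hypothetical window size

// Writes to out[i] the sum of the last TAPS input samples up to in[i].
void moving_sum(const int *in, int *out, int n)
{
    assert(n >= 0);

    int win[TAPS] = {0, 0, 0, 0};
#pragma HLS ARRAY_PARTITION variable=win complete dim=1

    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        // Shift the window. Inside a pipelined loop the HLS tool unrolls
        // these inner loops, so all TAPS elements are accessed at once.
        for (int t = TAPS - 1; t > 0; t--) {
            win[t] = win[t - 1];
        }
        win[0] = in[i];

        int sum = 0;
        for (int t = 0; t < TAPS; t++) {
            sum += win[t];
        }
        out[i] = sum;
    }
}
```

With the input 1, 2, 3, 4, 5, the function produces 1, 3, 6, 10, 14.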

The final optimization directive worth reviewing is the DEPENDENCE directive. The csIxix array has a DEPENDENCE directive applied to it. The array is read from and then written to using different indices, as shown in the following code example, and performs these reads and writes within a pipelined loop.

 void computeSum(pix_t* f0Stream, 
                 pix_t* f1Stream, 
                 int*   ixix_out, 
                 int*   ixiy_out, 
                 int*   iyiy_out, 
                 int*   dix_out, 
                 int*   diy_out)
{
  ...
   static int csIxix [MAX_WIDTH], csIxiy [MAX_WIDTH], csIyiy [MAX_WIDTH], csDix [MAX_WIDTH], csDiy [MAX_WIDTH];
   ...
   #pragma HLS DEPENDENCE variable=csIxix inter WAR false
   ...
   int zIdx= - (KMED-2);
   int nIdx = zIdx + KMED-2;

   for (int r = 0; r < MAX_HEIGHT; r++) {
     for (int c = 0; c < MAX_WIDTH; c++) {
        #pragma HLS PIPELINE
        ...
        if (zIdx >= 0) {
          csIxixL = csIxix [zIdx];
          ...
       }
       ...
        csIxix [nIdx] = csIxixR;
        ...
        zIdx++;
        if (zIdx == MAX_WIDTH) zIdx = 0;
        nIdx++;
        if (nIdx == MAX_WIDTH) nIdx = 0;
        ...
      } // for c
   }  // for r
   ...
}

When a loop is pipelined in hardware, the accesses to the array overlap in time. The compiler analyzes all accesses to an array and issues a warning if any condition exists where the write in iteration N overwrites the data for iteration N + K, thus changing the value. The warning prevents implementing a pipeline with II = 1.

The following example shows read and write operations for a loop over multiple iterations for an array with indices 0 through 9. As in the code above, it is possible for the address counters to differ between the read and write operations and to return to zero, before all loop iterations are complete. The operations are shown overlapped in time, just like a pipelined implementation.


R4---------W8
  R5---------W9
    R6---------W0
      R7---------W1
        R8---------W2
          R9--------W3
            R0--------W4
              R1--------W5
                R2--------W6

In sequential C code, where each iteration completes before the next starts, it is clear what order the reads and writes occur. However, in a concurrent hardware pipeline, the accesses can overlap and occur in different orders. As can be seen clearly above, it is possible for the read from index 8, as noted by R8, to occur in time before the write to index 8 (W8) which is meant to occur some iterations before R8.
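
The index sequence in the diagram can be reproduced with a small software model, which is sometimes a convenient way to reason about how far apart the read and the write of one index are. This is only an illustrative sketch; index_trace and write_to_read_distance are hypothetical names and do not appear in the design example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns the (read index, write index) pair for each loop iteration of a
// circular buffer with `depth` entries. Both counters advance by one per
// iteration and wrap to zero, as the address counters in the diagram do.
std::vector<std::pair<int, int> > index_trace(int depth, int first_read,
                                              int first_write, int iterations)
{
    assert(depth > 0);
    assert(first_read >= 0 && first_read < depth);
    assert(first_write >= 0 && first_write < depth);

    std::vector<std::pair<int, int> > trace;
    int rd = first_read;
    int wr = first_write;
    for (int i = 0; i < iterations; i++) {
        trace.push_back(std::make_pair(rd, wr));
        if (++rd == depth) rd = 0;
        if (++wr == depth) wr = 0;
    }
    return trace;
}

// Number of iterations between the write to an index and the later read
// of that same index.
int write_to_read_distance(int depth, int first_read, int first_write)
{
    assert(depth > 0);
    return ((first_write - first_read) % depth + depth) % depth;
}
```

For the diagram above, index_trace(10, 4, 8, 9) returns the nine pairs from (4, 8) to (2, 6), and write_to_read_distance(10, 4, 8) is 4: W8 belongs to the first iteration and R8 to the fifth.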

The compiler warns of this condition, and the DEPENDENCE directive is used with the setting false to tell the compiler that there is no dependence on read-after-write, allowing the compiler to create the pipelined hardware which performs with II=1.

The DEPENDENCE directive is typically used to inform the compiler of algorithm behaviors and conditions external to the function, of which it is unaware from static analysis of the code. If a DEPENDENCE directive is set incorrectly, the issue will be discovered in hardware emulation, if the results from the hardware are different from those achieved with the software.
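
Such a comparison can be as simple as an element-by-element check of the accelerator output against a software reference. The helper below is a hypothetical sketch and is not part of the design example.

```cpp
#include <cassert>
#include <cstddef>

// Counts the positions at which the hardware result differs from the
// software reference. A return value of zero means the two runs agree.
template <typename T>
std::size_t count_mismatches(const T *hw_out, const T *sw_ref, std::size_t n)
{
    assert(hw_out != NULL && sw_ref != NULL);

    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (hw_out[i] != sw_ref[i]) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A nonzero count after a DEPENDENCE directive is added is a reason to revisit the directive before looking elsewhere.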

Optical Flow Results

With both the data transfers and hardware functions optimized, the hardware functions are recompiled, and the performance is analyzed using event traces. The figure below shows the start of the event traces, and clearly shows that the pipelined hardware functions do not wait for the previous function to complete before they execute. Each hardware function begins to process data as soon as data becomes available.

Figure: Trace Result

The complete view of the event traces shows all hardware functions and data transfers executing in parallel for the highest performing system, as shown in the following figure.

Figure: Event Traces

To get the duration time, hover over one of the lanes to display a popup window that shows the duration of the accelerator runtime. The execution time is just under 15.5 ms, which meets the target of 16.7 ms per frame necessary to achieve 60 frames per second. The following figure shows the trace legend for the AXI State View:

Figure: AXI State View Trace Legend

Software
Execution done on the Arm® processor core.
Accelerator
Execution done in the accelerator(s).
Transfer
Data being transferred from the Arm processor core.
Receive
Data being received by the Arm processor core.

Bottom-Up: Stereo Vision Algorithm

The stereo vision algorithm uses images from two cameras horizontally displaced from each other. This provides two different views of the scene from different vantage points, similar to human vision. To obtain the relative depth information from the scene, compare the two images to build a disparity map. The disparity map encodes the relative positions of objects in the horizontal coordinates such that the values are inversely proportional to the scene depth at the corresponding pixel location.
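
The inverse relationship can be made concrete with the standard relation for a rectified stereo pair, depth = focal length x baseline / disparity. The following sketch assumes that model; the function name and the example values are hypothetical and are not taken from the design example.

```cpp
#include <cassert>

// Depth of a point for a rectified stereo pair, assuming a pinhole camera
// model: Z = f * B / d, with the focal length and disparity in pixels and
// the baseline in meters.
double depth_from_disparity(double focal_px, double baseline_m,
                            double disparity_px)
{
    assert(disparity_px > 0.0);   // zero disparity means infinite depth
    return focal_px * baseline_m / disparity_px;
}
```

With a focal length of 1000 pixels and a baseline of 0.1 m, a disparity of 50 pixels corresponds to a depth of 2 m, and halving the disparity to 25 pixels doubles the depth to 4 m.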

The bottom-up methodology starts with a fully optimized hardware design that is already synthesized using the Vivado® High-Level Synthesis (HLS) tool and then integrates the pre-optimized hardware function with software in the SDSoC environment.

This flow allows hardware designers who are already familiar with the HLS tool to build and optimize the entire hardware function first, using advanced HLS tool features, and then allows software programmers to leverage this existing work.

The following section uses the stereo vision design example to take you through the steps of starting with an optimized hardware function in the HLS tool and building an application that integrates the full system with hardware and software running on the board using the SDSoC environment. The following figure shows the final system to be realized, and highlights the existing stereo_remap_bm hardware function to be incorporated into the SDSoC environment.

Figure: Block Diagram of System



In the bottom-up flow, the general optimization methodology for the SDSoC environment, as detailed in this guide, is reversed. By definition, you would start with an optimized hardware function, and then seek to incorporate it into the SDSoC environment and optimize the data transfers.

Stereo Vision Hardware Function Optimization

The following code example shows the existing stereo_remap_bm hardware function with the optimization pragmas highlighted. Before reviewing the optimization directives, note the following details about the function:

  • The hardware function contains sub-functions readLRinput, writeDispOut, and SaveAsGray that have also been optimized.
  • The hardware function also uses pre-optimized functions, prefixed with the namespace hls, from the Vivado HLS tool video library, hls_video.h. These sub-functions use their own data type, Mat.

#include "hls_video.h"
#include "top.h"
#include "transform.h"

void readLRinput (yuv_t *inLR,
	    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& img_l,
	    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& img_r,
		int height, int dual_width, int width, int stride)
{

  for (int i=0; i < height; ++i) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
    for (int j=0; j < stride; ++j) {
#pragma HLS loop_tripcount min=1920 max=1920 avg=1920
    #pragma HLS PIPELINE
      yuv_t tmpData = inLR [i*stride + j];	  // from yuv_t array: consume height*stride
      if (j < width)
    	  img_l.write (tmpData & 0x00FF);	// to HLS_8UC1 stream
      else if (j < dual_width)
    	  img_r.write (tmpData & 0x00FF);	// to HLS_8UC1 stream
    }
  }
}

void writeDispOut(hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& img_d,
					yuv_t *dst,
					int height, int width, int stride)
{
  pix_t tmpOut;
  yuv_t outData;

  for (int i=0; i < height; ++i) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
    for (int j=0; j < stride; ++j) {
#pragma HLS loop_tripcount min=960 max=960 avg=960
#pragma HLS PIPELINE
	  if (j < width) {
		tmpOut = img_d.read().val[0];
		outData = ((yuv_t) 0x8000) | ((yuv_t)tmpOut);
		dst [i*stride +j] = outData;
	  }
	  else {
		outData = (yuv_t) 0x8000;
		dst [i*stride +j] = outData;
	  }
    }
  }
}

namespace hls {
void SaveAsGray(
            Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC1>& src,
            Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1>& dst)
{
    int height = src.rows;
    int width = src.cols;
    for (int i = 0; i < height; i++) {
#pragma HLS loop_tripcount min=1080 max=1080 avg=1080
        for (int j = 0; j < width; j++) {
#pragma HLS loop_tripcount min=960 max=960 avg=960
#pragma HLS pipeline II=1
            Scalar<1, short> s;
            Scalar<1, unsigned char> d;
            src >> s;

            short uval = (short) (abs ((int)s.val[0]));

            // Scale to avoid overflow.  The right scaling here for a
            // good picture depends on the NDISP parameter during
            // block matching.
            d.val[0] = (unsigned char)(uval >> 1);
            //d.val[0] = (unsigned char)(s.val[0] >> 1);
            dst << d;
        }
    }
}
} // namespace hls

int stereo_remap_bm_new(
        yuv_t *img_data_lr,
        yuv_t *img_data_disp,
        hls::Window<3, 3, param_T > &lcameraMA_l,
        hls::Window<3, 3, param_T > &lcameraMA_r,
        hls::Window<3, 3, param_T > &lirA_l,
        hls::Window<3, 3, param_T > &lirA_r,
        param_T (&ldistC_l)[5],
        param_T (&ldistC_r)[5],
        int height,		 // 1080
        int dual_width,	   // 1920 (two 960x1080 images side by side)
        int stride_in,	    // 1920 (two 960x1080 images side by side)
        int stride_out)	   // 960
{
	int width = dual_width/2; // 960

#pragma HLS DATAFLOW

    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_l(height, width);
    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_r(height, width);

    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_l_remap(height, width);	// remapped left image
    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_r_remap(height, width);	// remapped right image
    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_8UC1> img_d(height, width);

    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC2> map1_l(height, width);
    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC2> map1_r(height, width);
    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16UC2> map2_l(height, width);
    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16UC2> map2_r(height, width);

    hls::Mat<IMG_HEIGHT, IMG_WIDTH, HLS_16SC1> img_disp(height, width);
    hls::StereoBMState<15, 32, 32> state;

// ddr -> kernel streams: extract luma from left and right yuv images
// store it in single channel HLS_8UC1 left and right Mat's
    readLRinput (img_data_lr, img_l, img_r, height, dual_width, width, stride_in);

//////////////////////// remap left and right images, all types are HLS_8UC1 //////////
    hls::InitUndistortRectifyMapInverse(lcameraMA_l, ldistC_l, lirA_l, map1_l, map2_l);
    hls::Remap<8>(img_l, img_l_remap, map1_l, map2_l, HLS_INTER_LINEAR);
    hls::InitUndistortRectifyMapInverse(lcameraMA_r, ldistC_r, lirA_r, map1_r, map2_r);
    hls::Remap<8>(img_r, img_r_remap, map1_r, map2_r, HLS_INTER_LINEAR);

////////// find disparity of remapped images //////////
    hls::FindStereoCorrespondenceBM(img_l_remap, img_r_remap, img_disp, state);
    hls::SaveAsGray(img_disp, img_d);

  // kernel stream -> ddr : output single wide
  writeDispOut (img_d, img_data_disp, height, width, stride_out);

  return 0;
}

int stereo_remap_bm(
        yuv_t *img_data_lr,
        yuv_t *img_data_disp,
        int height,		// 1080
        int dual_width,	// 1920 (two 960x1080 images side by side)
        int stride_in,	// 1920 (two 960x1080 images side by side)
        int stride_out)	// 960
{
//1920*1080
//#pragma HLS interface m_axi port=img_data_lr depth=2073600
//#pragma HLS interface m_axi port=img_data_disp depth=2073600

    hls::Window<3, 3, param_T > lcameraMA_l;
    hls::Window<3, 3, param_T > lcameraMA_r;
    hls::Window<3, 3, param_T > lirA_l;
    hls::Window<3, 3, param_T > lirA_r;
    param_T ldistC_l[5];
    param_T ldistC_r[5];

    for (int i=0; i<3; i++) {
        for (int j=0; j<3; j++) {
            lcameraMA_l.val[i][j]=cameraMA_l[i*3+j];
            lcameraMA_r.val[i][j]=cameraMA_r[i*3+j];
            lirA_l.val[i][j]=irA_l[i*3+j];
            lirA_r.val[i][j]=irA_r[i*3+j];
        }
    }
    for (int i=0; i<5; i++) {
        ldistC_l[i] = distC_l[i];
        ldistC_r[i] = distC_r[i];
    }

    int ret = stereo_remap_bm_new(img_data_lr,
                                img_data_disp,
                                lcameraMA_l,
                                lcameraMA_r,
                                lirA_l,
                                lirA_r,
                                ldistC_l,
                                ldistC_r,
                                height,
                                dual_width,
                                stride_in,
                                stride_out);
    return ret;
}

As noted in Understanding the Hardware Function Optimization Methodology, the primary optimization directives used are the PIPELINE and DATAFLOW directives. Additionally, the LOOP_TRIPCOUNT directive is used.

Based on the recommendations for optimizing hardware functions, which process frames of data, the PIPELINE directives are all applied to for-loops that process data at the sample level, or in this case, the pixel level. This ensures hardware pipelining is used to achieve the highest performing design.

The LOOP_TRIPCOUNT directives are used on for-loops for which the upper bound of the loop index is defined by a variable whose exact value is unknown at compile time. The estimated tripcount, or loop iteration count, allows the reports generated by the HLS tool to include expected values for latency and initiation interval (II), instead of unknowns. This directive has no impact on the hardware created; it only impacts reporting.
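
As a minimal illustration, the following hypothetical function (sum_row is not part of the design example) has a loop bound that is only known at run time. The tripcount value of 960 is an assumed typical row width, used only so that the reports can show a latency estimate.

```cpp
#include <cassert>

// Sums one row of 8-bit pixels. The loop bound `width` is a run-time
// value, so the HLS tool cannot derive the iteration count by itself.
int sum_row(const unsigned char *row, int width)
{
    assert(row != 0 && width >= 0);

    int sum = 0;
    for (int j = 0; j < width; ++j) {
#pragma HLS loop_tripcount min=960 max=960 avg=960
#pragma HLS PIPELINE
        sum += row[j];
    }
    return sum;
}
```

The function still works for any width; the pragma affects only the numbers shown in the reports.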

The top-level stereo_remap_bm function is composed of the optimized sub-functions and a number of functions from the HLS tool video library (hls_video.h). For details about the library functions provided by the HLS tool video library, refer to Vivado Design Suite User Guide: High-Level Synthesis (UG902).

The functions provided in the HLS tool video library are already pre-optimized and contain all the optimization directives to ensure they are implemented with the highest possible performance. The top-level function is therefore composed of sub-functions that are all optimized, and it only requires the DATAFLOW directive to ensure each sub-function starts to execute in hardware as soon as data becomes available.

int stereo_remap_bm(..) {

#pragma HLS DATAFLOW
  readLRinput (img_data_lr, img_l, img_r, height, dual_width, width, stride);
  hls::InitUndistortRectifyMapInverse(lcameraMA_l, ldistC_l, lirA_l, map1_l, map2_l);
  hls::Remap<8>(img_l, img_l_remap, map1_l, map2_l, HLS_INTER_LINEAR);
  hls::InitUndistortRectifyMapInverse(lcameraMA_r, ldistC_r, lirA_r, map1_r, map2_r);
  hls::Remap<8>(img_r, img_r_remap, map1_r, map2_r, HLS_INTER_LINEAR);
  hls::Duplicate(img_l_remap, img_l_remap_bm, img_l_remap_pt);
  hls::FindStereoCorrespondenceBM(img_l_remap_bm, img_r_remap, img_disp, state);
  hls::SaveAsGray(img_disp, img_d);
  writeDispOut (img_l_remap_pt, img_d, img_data_disp, height, dual_width, width, stride);

}

In general, the DATAFLOW optimization is not required because the SDSoC™ environment automatically ensures that data is passed from one hardware function to the next, as soon as it becomes available; however, in this example, the functions within stereo_remap_bm are using the HLS tool data type hls::stream, which cannot be compiled on the Arm® processor and cannot be used in the hardware function interface in the SDSoC environment. For this reason, the top-level hardware function must be stereo_remap_bm and thus, the DATAFLOW directive is used to achieve high-performance transfers between the sub-functions. If this were not the case, the DATAFLOW directive could be removed and each sub-function within stereo_remap_bm could be specified as a hardware function.

The hardware functions in this design example use the data type Mat, which is based on the HLS tool data type hls::stream. The hls::stream data type can only be accessed in a sequential manner. Data is pushed on and popped off.

  • In software simulation, the hls::stream data type has infinite size.
  • In hardware, the hls::stream data type is implemented as a single register and can only store one data value at a time, because it is expected that the streaming data is consumed before the previous value is overwritten.

By specifying the top-level stereo_remap_bm function as the hardware function, the effects of these hardware types can be ignored in the software environment; however, when these functions are incorporated into the SDSoC environment, they cannot be compiled on the Arm processor, and the system can only be verified through hardware emulation, executing on the target platform, or both.

IMPORTANT: When incorporating hardware functions that contain the HLS tool hardware data types into the SDSoC environment, ensure the functions have been fully verified through C compilation and hardware simulation within the HLS tool environment.
IMPORTANT: The hls::stream data type is designed for use within the HLS tool, but is unsuitable for running software on embedded CPUs. Therefore, this type should not be part of the top-level hardware function interface.

If any of the arguments of the hardware function use any HLS tool specific data types, the function must be enclosed by a top-level C/C++ wrapper function that exposes only native C/C++ types in the function argument list.
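
The wrapper pattern can be sketched as follows. To keep the example compilable without the HLS headers, SimpleStream is a hypothetical stand-in that mimics only the read(), write(), and empty() calls of hls::stream; in a real design the stages would use the HLS tool types. All function names here are illustrative and do not appear in the design example.

```cpp
#include <cassert>
#include <queue>

// Hypothetical stand-in for hls::stream, for software checking only.
template <typename T>
class SimpleStream {
public:
    void write(const T &v) { q_.push(v); }
    T read()
    {
        assert(!q_.empty());
        T v = q_.front();
        q_.pop();
        return v;
    }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};

typedef unsigned char pix_t;   // assumption: 8-bit pixels

// Internal stages exchange data through streams.
static void read_input(const pix_t *in, SimpleStream<pix_t> &s, int n)
{
    for (int i = 0; i < n; i++) s.write(in[i]);
}

static void halve(SimpleStream<pix_t> &src, SimpleStream<pix_t> &dst, int n)
{
    for (int i = 0; i < n; i++) dst.write((pix_t)(src.read() >> 1));
}

static void write_output(SimpleStream<pix_t> &s, pix_t *out, int n)
{
    for (int i = 0; i < n; i++) out[i] = s.read();
}

// Top-level wrapper: the argument list contains only native C/C++ types,
// so the stream type never appears at the hardware function interface.
void halve_pixels(const pix_t *in, pix_t *out, int n)
{
#pragma HLS DATAFLOW
    SimpleStream<pix_t> s0;
    SimpleStream<pix_t> s1;
    read_input(in, s0, n);
    halve(s0, s1, n);
    write_output(s1, out, n);
}
```

Only halve_pixels would be marked as the hardware function; the stream-based stages stay hidden behind it, in the same way that stereo_remap_bm hides the Mat-based functions.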

Optimizing the Data Motion Network

After importing the pre-optimized hardware function into a project in the SDSoC environment, the first task is to remove any interface optimizations. Based on the data types of the hardware function and data access, the interface between the PS and the hardware function is managed and automatically optimized. See Data Motion Optimization.

  • Remove any INTERFACE directives present in the hardware function.
  • Remove any DATA_PACK directives that reference variables present in the hardware function argument list.
  • Remove any of the Vivado HLS tool hardware data types by enclosing the top-level function in wrappers that only use native C/C++ types for the function arguments.

In this example, the functions to be accelerated are captured inside a single top-level hardware function, stereo_remap_bm.

int main() {

  unsigned char *inY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
  unsigned short *inCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
  unsigned short *outCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
  unsigned char *outY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);

  // read double wide image from disk
  if (read_yuv_file(inY, DUALWIDTH, DUALWIDTH, HEIGHT, FILEINAME) != 0)
    return -1;

  convert_Y8toCY16(inY, inCY, HEIGHT*DUALWIDTH);

  stereo_remap_bm(inCY, outCY, HEIGHT, DUALWIDTH, DUALWIDTH);

  // write single wide image to disk
  convert_CY16toY8(outCY, outY, HEIGHT*DUALWIDTH);
  write_yuv_file(outY, DUALWIDTH, DUALWIDTH, HEIGHT, ONAME);

  // free the buffers
  sds_free(inY);
  sds_free(inCY);
  sds_free(outCY);
  sds_free(outY);
  return 0;
}

The key to optimizing the memory accesses to the hardware is to review the data types passed into the hardware function. Reviewing the function signature shows the key variable names to optimize: the input and output data streams img_data_lr and img_data_disp.

int stereo_remap_bm( 
        yuv_t *img_data_lr,
        yuv_t *img_data_disp,
        int height,
        int dual_width,
        int stride);

Because the data is transferred in a sequential manner, first ensure that the access pattern is defined as SEQUENTIAL for both arguments. For the next optimization, ensure the data transfer is not interrupted by a scatter-gather DMA operation by specifying the mem_attribute as PHYSICAL_CONTIGUOUS|NON_CACHEABLE. This also requires that the memory is allocated with sds_alloc from sds_lib.

#include "sds_lib.h"
int main() {

  unsigned char *inY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);
  unsigned short *inCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
  unsigned short *outCY = (unsigned short *)sds_alloc(HEIGHT*DUALWIDTH*2);
  unsigned char *outY = (unsigned char *)sds_alloc(HEIGHT*DUALWIDTH);

}

Finally, the copy directive is used to ensure the data is explicitly copied to the accelerator, and that the data is not accessed from shared memory.

#pragma SDS data access_pattern(img_data_lr:SEQUENTIAL)
#pragma SDS data mem_attribute(img_data_lr:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
#pragma SDS data copy(img_data_lr[0:stride*height])
#pragma SDS data access_pattern(img_data_disp:SEQUENTIAL)
#pragma SDS data mem_attribute(img_data_disp:PHYSICAL_CONTIGUOUS|NON_CACHEABLE)
#pragma SDS data copy(img_data_disp[0:stride*height])
int stereo_remap_bm( 
        yuv_t *img_data_lr,
        yuv_t *img_data_disp,
        int height,
        int dual_width,
        int stride);

With these optimization directives, the memory access between the PS and PL is optimized for the most efficient transfers.

Stereo Vision Results

After the hardware function optimized with the Vivado HLS tool is wrapped (as in this example, to ensure that the HLS tool hardware data types are not exposed at the interface of the hardware function), any interface directives are removed, and the data transfers are optimized, the hardware functions are recompiled and the performance is analyzed using event traces.

The following figure shows the complete view of the event traces, and all hardware functions and data transfers executing in parallel for the highest performing system.

Figure: Event Traces



To get the duration time, hover over one of the lanes to display a popup window that shows the duration of the accelerator runtime. The execution time is 15.86 ms, which meets the target of 16.7 ms per frame necessary to achieve 60 frames per second for live video.