Kernel Optimization
One of the key advantages of an FPGA is its flexibility and capacity to create customized designs specifically for your algorithm. This enables various implementation choices to trade off algorithm throughput vs. power consumption. The downside of creating custom logic is that the design needs to go through the traditional FPGA design flow.
The following guidelines help manage the design complexity and achieve the desired design goals.
Interface Attributes (Detailed Kernel Trace)
The detailed kernel trace provides easy access to the AXI transactions and their properties. The AXI transactions are presented for the global memory, as well as the Kernel side (Kernel "pass" 1:1:1) of the AXI interconnect. The following figure illustrates a typical kernel trace of a newly accelerated algorithm.
The following fields are the most important in respect to performance:
- Burst Length: Describes how many beats are sent within one transaction.
- Burst Size: Describes the number of bytes being transferred as part of one beat.
Given a burst length of 1 and just 4 bytes per beat, transferring any reasonable amount of data requires many individual AXI transactions. Therefore, small burst lengths, as well as burst sizes considerably smaller than 512 bits, are good opportunities for optimizing interface performance. The following sections show improved implementations:
Top Level Dataflow
Top-level dataflow optimization enables kernels to execute in a parallel processing architecture by overlapping multiple kernel executions. This optimization is especially useful for kernels with long latency and improves the performance of the overall application. Xilinx recommends enabling this optimization only for kernels that use dataflow at the top-level function, by applying the interface pragma ap_ctrl_chain to the return port (along with s_axilite).
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
The following example illustrates that both the ap_ctrl_chain and s_axilite options are required on the return port to enable top-level dataflow.
void N_stage_Adders(int *input, int *output, int incr, int size)
{
…
#pragma HLS INTERFACE s_axilite port=return bundle=control
#pragma HLS INTERFACE ap_ctrl_chain port=return bundle=control
…
…
#pragma HLS dataflow
  read_input(input, streamArray[0], size);
  compute_loop: for (int i = 0 ; i < STAGES ; i++)
  {
#pragma HLS UNROLL
    adder(streamArray[i], streamArray[i+1], incr, size);
  }
  write_result(output, streamArray[STAGES], size);
}
Using Burst Data Transfers
Transferring data in bursts hides the memory access latency and improves bandwidth usage and efficiency of the memory controller.
If burst data transfers occur, the detailed kernel trace will reflect the higher burst rate as a larger burst length number:
In the previous figure, it is also possible to observe that the memory data transfers following the AXI interconnect are implemented rather differently (shorter transaction time). If you hover over these transactions, you would see that the AXI interconnect has packed the 16 x 4-byte transactions into a single burst transaction of 1 x 64 bytes. This makes much more effective use of the AXI4 bandwidth. The next section focuses on this optimization technique in more detail.
Burst inference is heavily dependent on coding style and access pattern. To avoid potential modeling pitfalls, refer to the SDAccel Environment Programmers Guide (UG1277). However, you can ease burst detection and improve performance by isolating data transfer and computation, as shown in the following code snippet:
void kernel(T in[1024], T out[1024]) {
  T tmpIn[1024];
  T tmpOut[1024];
  read(in, tmpIn);        // burst read from the AXI interface into local memory
  process(tmpIn, tmpOut); // computation operating only on local memories
  write(tmpOut, out);     // burst write from local memory to the AXI interface
}
In short, the function read is responsible for reading from the AXI input into an internal variable (tmpIn). The computation is implemented by the function process, working on the internal variables tmpIn and tmpOut. The function write takes the produced output and writes it to the AXI output.
Isolating the read and write functions from the computation results in:
- Simple control structures (loops) in the read/write functions, which makes burst detection simpler (see the sketch after this list).
- Isolation of the computational function from the AXI interfaces, which simplifies potential kernel optimization. See the Kernel Optimization chapter for more information.
- Internal variables mapped to on-chip memory, which allows much faster access than AXI transactions. Acceleration platforms supported in the SDAccel environment can have as much as 10 MB of on-chip memory that can be used as pipes, local memories, and private memories. Using these resources effectively can greatly improve the efficiency and performance of your applications.
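As an illustration, a minimal sketch of such a burst-friendly read function, assuming the hypothetical read() helper and the placeholder type T from the snippet above, is simply a pipelined copy loop with a linear access pattern:
void read(T in[1024], T tmpIn[1024]) {
  read_loop: for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
    // Sequential, unconditional accesses allow the tool to infer burst transfers.
    tmpIn[i] = in[i];
  }
}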
Using AXI4 Data Width
The user data width between the kernel and the memory controller can be configured by the SDAccel environment compiler based on the data types of the kernel arguments. To maximize data throughput, Xilinx recommends that you choose data types that map to the full data width of the memory controller. The memory controller in all supported acceleration cards supports a 512-bit user interface, which can be mapped to OpenCL™ vector data types, such as int16, or the C/C++ arbitrary precision data type ap_int<512>.
As shown on the following figure, you can observe burst AXI transactions (Burst Length 16) and a 512-bit beat size (Burst Size 64 Bytes).
This example shows good interface configuration as it maximizes AXI data width and also shows actual burst transactions.
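For illustration only, a C/C++ kernel interface using the full 512-bit data width might be sketched as follows; the function and argument names are placeholders, and the interface pragmas reflect a typical m_axi/s_axilite setup:
#include "ap_int.h"

// Each ap_int<512> word carries 16 packed 32-bit values.
void wide_copy(ap_int<512> *in, ap_int<512> *out, int num_words) {
#pragma HLS INTERFACE m_axi port=in offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=num_words bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
  for (int i = 0; i < num_words; i++) {
#pragma HLS PIPELINE II=1
    out[i] = in[i]; // one full 512-bit word transferred per loop iteration
  }
}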
Complex structs or classes, used to declare interfaces, can lead to very complex hardware interfaces due to memory layout and data packing differences. This can introduce potential issues that are very difficult to debug in a complex system.
OpenCL API Attributes
The OpenCL API provides attributes to support a more automatic approach to increasing the AXI data width usage. Changing the interface data types, as described above, is supported in the API as well, but requires the same code changes as in C/C++ to accommodate the larger input vector.
To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm. A detailed description can be found in the SDx Pragma Reference Guide (UG1253).
- vec_type_hint
- reqd_work_group_size
- xcl_zero_global_work_offset
The following example illustrates their combined functionality:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
size_t idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
In this case, the hard-coded interface is a 32-bit wide data path (int *c, int *a, int *b), which drastically limits the memory throughput if implemented directly. However, automatic widening and transformation is applied based on the values of the three attributes.
- __attribute__((vec_type_hint(int))): Declares that int is the main type used for computation and memory transfer (32-bit). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32 bits. This implies that, in theory, 16 values could be processed if vectorization can be applied.
- __attribute__((reqd_work_group_size(X, Y, Z))): Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, therefore defining the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.
The actual vectorization factor to be applied is the greatest common divisor of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.
The quotient of the maximum possible vectorization factor divided by the actual vectorization factor provides the remaining loop count of the OpenCL description. As this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation. This is especially true if the vectorized OpenCL code has long latency.
One additional attribute is highly recommended for performance optimization of OpenCL interfaces:
- __attribute__((xcl_zero_global_work_offset)): Instructs the compiler that no global offset parameter is used at runtime, and all accesses are aligned. This gives the compiler valuable information regarding the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (less hardware).
Note that applying these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually behaves well, but can interact poorly in rare situations.
For example:
- For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
- The resulting access requires many multiplexers and creates a difficult problem for the scheduler (it might severely increase memory usage and compilation time). Xilinx recommends that you use partitioning factors that are powers of two (as the vectorization factor is always a power of two).
- If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the II (initiation interval) is not met:
- This is not necessarily correlated with a loss of performance (usually the design still performs better) because the II is computed on the unrolled loop (which therefore has a multiplied throughput per iteration).
- The scheduler informs you of the possible resource constraints; resolving those will further improve the performance.
- Note that a common occurrence is that a local array does not get automatically reshaped (usually because it is accessed in a later section of the code in a non-vectorizable way).
Reducing Kernel-to-Kernel Communication Latency with OpenCL Pipes
This section specifically applies to OpenCL kernels. For C++ kernels, kernel-to-kernel streaming is provided. This is discussed in Memory Data Transfer Types.
The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe. Pipe objects are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without having to use the external memory, which greatly improves the overall system latency.
In the SDAccel development environment, pipes must be statically defined outside of all kernel functions; dynamic pipe allocation using the OpenCL 2.x clCreatePipe API is not currently supported. The depth of a pipe must be specified by using the xcl_reqd_pipe_depth attribute in the pipe declaration.
The valid depth values are as follows:
- 16
- 32
- 64
- 128
- 256
- 512
- 1024
- 2048
- 4096
- 8192
- 16384
- 32768
A given pipe can have one and only one producer and consumer in different kernels.
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
Pipes can be accessed using the standard OpenCL read_pipe() and write_pipe() built-in functions in non-blocking mode, or using the Xilinx extended read_pipe_block() and write_pipe_block() functions in blocking mode. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions. See the OpenCL C Specification, Version 2.0 from the Khronos Group for more details on these built-in functions.
The following function signatures are the currently supported pipe functions, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types.
int read_pipe_block (pipe gentype p, gentype *ptr)
int write_pipe_block (pipe gentype p, const gentype *ptr)
The following "Blocking Pipes Example" from the SDAccel Getting Started Examples on GitHub uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));

// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
  __attribute__((xcl_pipeline_loop))
  mem_rd: for (int i = 0 ; i < size ; i++)
  {
    //blocking Write command to pipe P0
    write_pipe_block(p0, &input[i]);
  }
}

// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
  __attribute__((xcl_pipeline_loop))
  execute: for(int i = 0 ; i < size ; i++)
  {
    int input_data, output_data;
    //blocking read command to Pipe P0
    read_pipe_block(p0, &input_data);
    output_data = input_data + inc;
    //blocking write command to Pipe P1
    write_pipe_block(p1, &output_data);
  }
}

// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
  __attribute__((xcl_pipeline_loop))
  mem_wr: for (int i = 0 ; i < size ; i++)
  {
    //blocking read command to Pipe P1
    read_pipe_block(p1, &output[i]);
  }
}
The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.
Optimizing Computational Parallelism
By default, C/C++ does not model computational parallelism, as it always executes any algorithm sequentially. Conversely, the OpenCL API does model computational parallelism with respect to work groups, but it does not use any additional parallelism within the algorithm description. However, fully configurable computational engines, like FPGAs, allow more freedom to exploit computational parallelism.
Coding Data Parallelism
To leverage computational parallelism during the implementation of an algorithm on the FPGA, the synthesis tool must first be able to recognize computational parallelism in the source code. Loops and functions are prime candidates for reflecting computational parallelism and compute units in the source description. However, even in this case, it is key to verify that the implementation takes advantage of the computational parallelism, as in some cases the SDAccel tool might not be able to apply the desired transformation due to the structure of the source code.
It is quite common that some computational parallelism might not be reflected in the source code to begin with. In this case, it needs to be added. A typical example is a kernel that might be described to operate on a single input value, while the FPGA implementation might execute computations more efficiently in parallel on multiple values. This kind of parallel modeling is described in the Using Full AXI Data Width section. A 512-bit interface can be created using OpenCL vector data types such as int16 or the C/C++ arbitrary precision data type ap_int<512>. Refer to the Median Filter Example in the vision category at SDAccel Getting Started Examples on GitHub for the recommended way to use vectors.
Loop Parallelism
Loops are the basic C/C++/OpenCL™ API method of representing repetitive algorithmic code. The following example illustrates various implementation aspects of a loop structure:
for(int i = 0; i<255; i++) {
out[i] = in[i]+in[i+1];
}
out[255] = in[255];
This code iterates over an array of values and adds consecutive values, except for the last value. If this loop is implemented as written, each loop iteration requires two cycles, which results in a total latency of 510 cycles. This can be analyzed in detail in the Schedule Viewer in the HLS Project:
This can also be analyzed in terms of total numbers and latency through the Vivado synthesis results:
The key numbers here are the latency numbers and the total LUT usage. For example, depending on the configuration, you could get a latency of 511 and a total LUT usage of 47. As you will see, these values can vary widely based on the implementation choices. While this implementation requires very little area, it results in significant latency.
Unrolling Loops
Unrolling a loop instructs the tool to create independent hardware for the loop iterations so that they can execute in parallel. An OpenCL loop can be unrolled with the attribute __attribute__((opencl_unroll_hint)), while a C/C++ loop can use the UNROLL pragma: #pragma HLS UNROLL
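Applied to the loop from the previous section, a C/C++ sketch would look like this:
unroll_loop: for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
  // All 255 additions are instantiated as parallel hardware.
  out[i] = in[i] + in[i+1];
}
out[255] = in[255];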
With the unroll applied to this specific example, the Schedule Viewer in the HLS Project shows the following:
With an estimated performance of:
As you can see, the total latency was considerably improved to 127 cycles, and as expected, the computational hardware was increased to 4845 LUTs to perform the same computation in parallel.
However, if you analyze the for-loop, you might ask why this algorithm cannot be implemented in a single cycle, as each addition is completely independent of the previous loop iteration. The reason is the memory interface to be used for the out variable. The SDAccel environment uses dual-port memory by default for an array. However, this implies that at most two values can be written to the memory per cycle. Therefore, to see a fully parallel implementation, you must specify that the out variable is kept in registers, as shown in the following example:
#pragma HLS array_partition variable=out complete dim=0
For more information see the pragma HLS array_partition section in SDx Pragma Reference Guide.
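A sketch of how the pragma might be placed is shown below; the function name and the int element type are assumptions for illustration:
void adder(int in[256], int out[256]) {
  // Complete partitioning implements out as individual registers,
  // removing the two-writes-per-cycle memory bottleneck.
#pragma HLS array_partition variable=out complete dim=0
  unroll_loop: for (int i = 0; i < 255; i++) {
#pragma HLS UNROLL
    out[i] = in[i] + in[i+1];
  }
  out[255] = in[255];
}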
The results of this transformation can be observed in the following Schedule Viewer:
The associated estimates are:
As you can see, this code can be implemented as a combinatorial function requiring only a fraction of a clock cycle to complete.
Pipelining Loops
Pipelining loops allows you to overlap iterations of a loop in time. Allowing iterations to operate concurrently is often a good compromise, as resources can be shared between iterations (less resource usage), while requiring less execution time compared to loops that are not unrolled.
Pipelining is enabled in C/C++ through the following pragma:
#pragma HLS PIPELINE
While the OpenCL API uses the following attributes, for loops and work items respectively:
__attribute__((xcl_pipeline_loop))
__attribute__((xcl_pipeline_workitems))
More details to any of these specifications are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide.
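Applied to the same example loop, a C/C++ sketch would be:
pipeline_loop: for (int i = 0; i < 255; i++) {
#pragma HLS PIPELINE II=1
  // Iterations overlap: a new iteration can start while the previous
  // one is still completing.
  out[i] = in[i] + in[i+1];
}
out[255] = in[255];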
In this example, the Schedule Viewer in the HLS Project produces the following information:
With the overall estimates being:
Because each iteration of the loop consumes only two cycles of latency, there can only be a single iteration of overlap. This enables the total latency to be cut roughly in half compared to the original, resulting in 257 cycles of total latency. However, when compared to unrolling, this reduction in latency was achieved using fewer resources.
In most cases, loop pipelining by itself can improve overall performance. However, the effectiveness of the pipelining will depend on the structure of the loop. Some common limitations are:
- Resources with limited availability, such as memory ports or process channels, can limit the overlap of the iterations (initiation interval, or II).
- Similarly, loop-carried dependencies, such as those created by variables or conditions computed in one iteration affecting the next, might increase the initiation interval of the pipeline.
These are reported by the tool during high-level synthesis and can be observed and examined in the Schedule Viewer. For the best possible performance, the code might have to be modified to eliminate these limiting factors, or the tool needs to be instructed to eliminate some dependency by restructuring the memory implementation of an array or breaking the dependencies altogether.
Task Parallelism
Task parallelism allows you to take advantage of data flow parallelism. In contrast to loop parallelism, when task parallelism is deployed, full execution units (tasks) are allowed to operate in parallel taking advantage of extra buffering introduced between the tasks.
Look at the following example:
void run (ap_uint<16> in[1024],
          ap_uint<16> out[1024]
         ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<8; i++) {
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}
When this code is executed, the functions processA and processB are executed sequentially 128 times in a row. Given that the combined latency of processA and processB in the loop is 278, the total latency can be estimated as:
The extra cycle is due to loop setup and can be observed in the Schedule Viewer.
For C/C++ code, task parallelism is achieved by adding the DATAFLOW pragma into the for-loop:
#pragma HLS DATAFLOW
For OpenCL API code, the equivalent transformation is performed with the following attribute:
__attribute__ ((xcl_dataflow))
Refer to SDx Pragma Reference Guide and SDAccel Environment Programmers Guide for more details regarding the specifics and limitations of these modifiers.
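For the example above, a sketch of the pragma placement inside the body of the for-loop (the code is unchanged except for the added pragma and comment):
void run (ap_uint<16> in[1024],
          ap_uint<16> out[1024]
         ) {
  ap_uint<16> tmp[128];
  for(int i = 0; i<8; i++) {
#pragma HLS DATAFLOW
    // processA and processB of successive iterations can now overlap,
    // communicating through automatically inserted ping-pong buffers.
    processA(&(in[i*128]), tmp);
    processB(tmp, &(out[i*128]));
  }
}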
As illustrated by the estimates in the HLS Report, applying the transformation considerably improves the overall performance by effectively using a double (ping-pong) buffering scheme between the tasks:
The overall latency of the design has almost halved in this case due to concurrent execution of the different tasks of the different iterations. Given the 139 cycles per processing function and the full overlap of the 128 iterations, this allows the total latency to be:
(1x only processA + 127x both processes + 1x only processB) * 139 cycles = 17931 cycles
Using task parallelism is a very powerful way to improve performance when it comes to implementation. However, the effectiveness of applying the DATAFLOW pragma to a specific and arbitrary piece of code might vary vastly. The coding guidelines for applying DATAFLOW effectively are provided in the SDx Pragma Reference Guide and the SDAccel Environment Programmers Guide. However, to understand the final implementation of the DATAFLOW pragma, it is often necessary to actually look at the execution pattern of the individual tasks. Towards that end, the SDAccel environment provides the Detailed Kernel Trace, which nicely illustrates concurrent execution.
For this detailed kernel trace, the tool displays the start of the dataflowed loop, as shown in the previous figure. It illustrates how processA starts up right away at the beginning of the loop, while processB waits until the completion of processA before it can start its first iteration. However, while processB completes the first iteration of the loop, processA already begins operating on the second iteration, and so forth.
A more abstract representation of this information is presented in the Application Timeline (Host & Device) and Device Hardware Transaction View (device-only during hardware emulation).
Optimizing Compute Units
Data Width
One aspect for performance is the data width required for the implementation. The tool propagates port widths throughout the algorithm. In some cases, especially when starting out with an algorithmic description, the C/C++/OpenCL™ API code might only use large data types such as integers even at the ports of the design. However, as the algorithm gets mapped to a fully configurable implementation, smaller data types such as 10- or 12-bit might often suffice. Towards that end it is beneficial to check the size of basic operations in the HLS Synthesis report during optimization. In general, when the SDAccel environment maps an algorithm onto the FPGA, much processing is required to comprehend the C/C++/OpenCL API structure and extract operational dependencies. Therefore, to perform this mapping the SDAccel environment generally partitions the source code into operational units which are then mapped onto the FPGA. Several aspects influence the number and size of these operational units (ops) as seen by the tool.
In the following figure, the basic operations and their bitwidth are reported.
Look for bit widths of 16, 32, and 64 bits commonly used in algorithmic descriptions, and verify that the associated operation from the C/C++/OpenCL API source actually requires the bit width to be this large. This can considerably improve the implementation of the algorithm, as smaller operations require less computation time.
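As a hypothetical illustration, if an accumulation is known to fit in 14 bits, an arbitrary precision type can replace a full 32-bit int; the function name and widths below are assumptions:
#include "ap_int.h"

// Summing 16 10-bit samples needs at most 14 bits (16 * 1023 < 2^14),
// so a 14-bit adder tree is generated instead of 32-bit hardware.
ap_uint<14> accumulate(ap_uint<10> samples[16]) {
  ap_uint<14> acc = 0;
  for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
    acc += samples[i];
  }
  return acc;
}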
Fixed Point Arithmetic
Some applications use floating-point computation only because they were optimized for other hardware architectures. As explained in Deep Learning with INT8 Optimization on Xilinx Devices, using fixed-point arithmetic for applications like deep learning can significantly improve power efficiency and area while keeping the same level of accuracy.
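A minimal sketch of moving a computation from float to the HLS fixed-point type; the chosen bit widths and names are assumptions and must be validated against the accuracy requirements of the application:
#include "ap_fixed.h"

// ap_fixed<16,4>: 16 bits total, 4 integer bits, 12 fractional bits.
typedef ap_fixed<16,4> fixed_t;

fixed_t weighted_sum(const fixed_t data[8], const fixed_t coeff[8]) {
  fixed_t acc = 0;
  for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE II=1
    acc += data[i] * coeff[i];
  }
  return acc;
}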
Macro Operations
It is sometimes advantageous to think in terms of larger computational elements. The tool will operate on the source code independently of the remaining source code, effectively mapping the algorithm onto the FPGA without consideration of surrounding operations. When applied, the SDAccel tool keeps operational boundaries, effectively creating macro operations for specific code. This uses the following principles:
- Operational locality to the mapping process.
- Reduction in complexity for the heuristics.
In C/C++, a macro operation can be created by preventing function inlining:
#pragma HLS inline off
In the OpenCL API, the same kind of macro operation is obtained by not applying the following attribute when defining the function:
__attribute__((always_inline))
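A sketch of creating a macro operation in C/C++ by disabling inlining; the function name and body are placeholders:
// The inline-off pragma keeps filter_stage as a distinct operational unit
// instead of dissolving it into the surrounding code.
void filter_stage(int in[64], int out[64]) {
#pragma HLS inline off
  for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
    out[i] = (in[i] * 3) >> 2;
  }
}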
Using Optimized Libraries
The OpenCL specification provides many math built-in functions. All math built-in functions with the native_ prefix are mapped to one or more native device instructions and typically have better performance than the corresponding functions (without the native_ prefix). The accuracy, and in some cases the input ranges, of these functions is implementation-defined. In the SDAccel™ environment, these native_ built-in functions use the equivalent functions in the Vivado® HLS tool Math library, which are already optimized for Xilinx® FPGAs in terms of area and performance. Xilinx recommends using the native_ built-in functions or the HLS tool Math library if the accuracy meets the application requirements.
Optimizing Memory Architecture
A key aspect of implementation is memory architecture. Because of the limited access bandwidth, it can heavily impact the overall performance, as shown in the following example:
void run (ap_uint<16> in[256][4],
ap_uint<16> out[256]
) {
...
ap_uint<16> inMem[256][4];
ap_uint<16> outMem[256];
... Preprocess input to local memory
for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
ap_uint<16> sum = 0;
for( int i = 0; i<4; i++) {
sum += inMem[j][i];
}
outMem[j] = sum;
}
... Postprocess write local memory to output
}
This code adds the four values associated with the inner dimension of the two dimensional input array. If implemented without any additional modifications, it results in the following estimates:
The overall latency of 4608 (Loop 2) is due to 256 iterations of 18 cycles (16 cycles spent in the inner loop, plus the reset of sum, plus the output being written). This can be observed in the Schedule Viewer in the HLS Project. The estimates become considerably better when unrolling the inner loop.
However, this improvement is largely due to the fact that this process uses both ports of a dual port memory. This can be seen from the Schedule Viewer in the HLS Project:
As you can see, two read operations are performed per cycle to access all the values from the memory to calculate the sum. This is often an undesired result, as it completely blocks access to the memory. To further improve the results, the memory can be split into four smaller memories along the second dimension:
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
This results in four array reads, all executed on different memories using a single port:
This uses a total of 256 * 4 cycles = 1024 cycles for loop 2.
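For reference, a sketch of the loop nest with the partitioning pragma applied; it mirrors the reshape version shown at the end of this section, with only the pragma changed:
void run (ap_uint<16> in[256][4],
          ap_uint<16> out[256]
         ) {
  ...
  ap_uint<16> inMem[256][4];
  ap_uint<16> outMem[256];
#pragma HLS ARRAY_PARTITION variable=inMem complete dim=2
  ... Preprocess input to local memory
  for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
    ap_uint<16> sum = 0;
    for( int i = 0; i<4; i++) {
#pragma HLS UNROLL
      sum += inMem[j][i];
    }
    outMem[j] = sum;
  }
  ... Postprocess write local memory to output
}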
Alternatively, the memory can be reshaped into a single memory with four words in parallel. This is performed through the following pragma:
#pragma HLS array_reshape variable=inMem complete dim=2
This results in the same latency as the array partitioning, but with a single memory using a single port:
Although either solution creates comparable results with respect to overall latency and utilization, reshaping the array results in cleaner interfaces and less routing congestion, making this the preferred solution. The complete code for this version is:
void run (ap_uint<16> in[256][4],
ap_uint<16> out[256]
) {
...
ap_uint<16> inMem[256][4];
ap_uint<16> outMem[256];
#pragma HLS array_reshape variable=inMem complete dim=2
... Preprocess input to local memory
for( int j=0; j<256; j++) {
#pragma HLS PIPELINE OFF
ap_uint<16> sum = 0;
for( int i = 0; i<4; i++) {
#pragma HLS UNROLL
sum += inMem[j][i];
}
outMem[j] = sum;
}
... Postprocess write local memory to output
}