# The Future of Machine Learning Acceleration

Jeff Fifield, Xilinx Labs, Nov 2018

Slides from Michaela Blott, Hot Chips 2018 Tutorial, "Overview of Deep Learning and Computer Architectures for Accelerating DNNs"



- Neural Networks
- > Computation & Memory Requirements
- > Algorithmic Optimization Techniques
- > Hardware Architectures



### **Neural Networks**



### A.I. – Machine Learning - Neural Networks





# Convolutional Neural Networks (CNNs): Why are they so popular?

- > Requires little or no domain expertise
- > NNs are a "universal approximation function"
- > If you make it big enough and train it enough
  - >> Can outperform humans on specific tasks



- > Will increasingly replace other algorithms
  - unless for example simple rules can describe the problem
- Solve problems previously unsolved by computers
- > And solve completely unsolved problems



### **Increasing Range of Applications**



**Image Classification** 



**Object Detection** 



**Semantic Segmentation** 

Computer Vision: CNNs



Speaker Diarization



Speech Recognition

Speech Recognition: RNNs, LSTMs



**Translation** 



**Sentiment Analysis** 

Natural Language Processing: Sequence to sequence



Recommender



**GamePlay** 

Many more emerging...

**Others** 

### **Popular Neural Networks**



# Convolutional Neural Networks (CNNs): from a computational point of view

- > CNNs are usually feed forward\* computational graphs constructed from one or more layers
  - >> Up to 1000s of layers

Synapse with weight wji

Neuron ni

- Each layer consists of neurons ni which are interconnected with synapses, associated with weights wij
- n0 = Act(w00\*i0 + w10\*i1)

- > Each neuron computes:
  - >> Typically linear transform (dot-product of receptive field)
  - >> Followed by a non-linear "activation" function
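The per-neuron computation above can be sketched in a few lines of Python. This is a minimal illustration, assuming ReLU as the activation; the function names and the input and weight values are made up for the example.

```python
def relu(x):
    """Rectified linear unit: keeps positive values, clamps negatives to zero."""
    return max(0.0, x)


def neuron(inputs, weights, act=relu):
    """Dot product of the receptive field with the weights, then the activation."""
    pre_activation = sum(w * i for w, i in zip(weights, inputs))
    return act(pre_activation)


# n0 = Act(w00*i0 + w10*i1) with illustrative values for i0, i1, w00, w10
n0 = neuron(inputs=[1.0, 2.0], weights=[0.5, 0.25])
print(n0)
```

With a negative pre-activation the same neuron outputs zero, which is the "firing" behaviour that the activation function models.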




### From Training to Inference



#### **Training**

Process for a machine to *learn* by optimizing models (weights) from labeled data.

Typically computed in the cloud



#### Inference

Using trained models to predict or estimate outcomes from new inputs.

**Deployment at the edge** 

### **Example: ResNet50**

### Forward Pass (Inference)

[Figure: forward pass. Input image -> neural network with (initialized) weights -> "Cat?"]

#### For ResNet50:

70 Layers

7.7 Billion operations

25.5 MBytes of weight storage\*

10.1 MBytes for activations\*

\*Assuming int8



### **NNs in More Detail**



**Activation & Batch Normalization** 



### **Activation Functions**

- > Implements the concept of "Firing"
  - >> Non-linear so we can approximate more complex functions
- Most popular for CNN: rectified linear unit (ReLU)\*\*
  - Popular because it propagates gradients better than bounded activation functions and is easy to compute
  - >> However, recent work suggests that with proper initialization, training works even with bounded activation functions\*





#### > Implementation:

>> Support for special functions as well as some level of flexibility

\*Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S.S. and Pennington, J. "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks." arXiv preprint arXiv:1806.05393 (2018).

\*\*Nair, V. and Hinton, G.E., 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807-814).



### **Batch Normalization**

- Normalizes the statistics of activation values across layers
- Significantly reduces the training time of networks, can improve accuracy, and makes training less sensitive to initialization

#### > Compute:

- >> Lightweight at inference
- >> Heavy duty during training
  - Subtract mean, divide by standard deviation to achieve zero-centered distribution with unit variance
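A minimal sketch of the normalization step, in plain Python. It assumes one scalar activation per sample and leaves out the learned scale and offset that batch normalization normally adds; `eps` and all function names are assumptions for the example. It also illustrates why inference is lightweight: once the statistics are fixed, normalization folds into a single multiply-add.

```python
import math


def batch_stats(values):
    """Mean and (population) variance over one batch of activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def normalize(values, eps=1e-5):
    """Training-time path: statistics are recomputed for every batch."""
    mean, var = batch_stats(values)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in values]


def fold_for_inference(mean, var, eps=1e-5):
    """Inference-time path: fixed statistics become one scale and one shift."""
    scale = 1.0 / math.sqrt(var + eps)
    shift = -mean * scale
    return scale, shift


batch = [1.0, 2.0, 3.0, 4.0]
normalized = normalize(batch)
scale, shift = fold_for_inference(*batch_stats(batch))
folded = [scale * v + shift for v in batch]  # same result, one multiply-add each
```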



https://en.wikipedia.org/wiki/Normal\_distribution



### **Fully Connected Layers**

(aka inner product or dense layers)

- > Each input activation is connected to every output activation
  - >> Receptive field encompasses the full input



- Can be written as a matrix-vector product with an elementwise non-linearity applied afterwards.
- > Implementation Challenges
  - >> Connectivity
  - >> High weight memory requirement: #IN \* #OUT \* BITS
  - >> Low arithmetic intensity assuming weights off-chip: (2 \* #IN \* #OUT) / (#IN \* #OUT \* BITS/8) = 16/BITS OPS:Byte

$$\begin{bmatrix} i_0 & i_1 & i_2 \end{bmatrix} \times \begin{bmatrix} W_{00} & W_{01} & W_{02} & W_{03} \\ W_{10} & W_{11} & W_{12} & W_{13} \\ W_{20} & W_{21} & W_{22} & W_{23} \end{bmatrix} = \begin{bmatrix} n_0' & n_1' & n_2' & n_3' \end{bmatrix}$$

$$(n_0, n_1, n_2, n_3) = \mathrm{Act}(n_0', n_1', n_2', n_3')$$

| MODEL    | CONV WEIGHTS (M) | FC WEIGHTS (M) |
|----------|------------------|----------------|
| ResNet50 | 23.454912        | 2.048          |
| AlexNet  | 2.332704         | 58.621952      |
| VGG16    | 14.710464        | 123.633664     |

IN: number of input channels

OUT: number of output channels

BITS: bit precision in data types
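The fully connected layer and its cost expressions can be sketched as follows. This is an illustrative plain-Python version, assuming `weights[i][j]` connects input `i` to output `j` and ReLU as the non-linearity; the function names are not from the slides. The 2048 x 1000 shape matches the 2.048 M FC weights listed for ResNet50.

```python
def fc_layer(inputs, weights, act=lambda x: max(0.0, x)):
    """Matrix-vector product followed by an elementwise non-linearity."""
    n_out = len(weights[0])
    pre = [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        for j in range(n_out)
    ]
    return [act(p) for p in pre]


def fc_weight_bits(n_in, n_out, bits):
    """Weight memory requirement: #IN * #OUT * BITS."""
    return n_in * n_out * bits


def fc_arithmetic_intensity(n_in, n_out, bits):
    """Operations per byte of weights fetched, assuming weights are off-chip."""
    ops = 2 * n_in * n_out                  # one multiply and one add per weight
    weight_bytes = n_in * n_out * bits / 8
    return ops / weight_bytes


outputs = fc_layer([1, 2, 3], [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, -3]])
intensity_int8 = fc_arithmetic_intensity(2048, 1000, bits=8)
```

Because every weight is used exactly once per input, the intensity depends only on the precision: 2 OPS:Byte at int8, regardless of layer size.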



### **Convolutional Layers: Example 2D Convolution**

- > Convolutions capture some kind of locality, spatial or temporal, that we know exists in the domain
- > Receptive field of each neuron reduced
  - >> Applying convolution to all images in the previous layer
- > Weights represent the filters used for convolutions




### **2D Convolutional Layers**

- > Slide the window till one feature map is complete
  - >> With a given stride size
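The sliding-window computation for one feature map can be sketched as below. This is a naive reference version for a single input channel and a single filter, assuming no padding and a square filter; as in most deep learning frameworks, the "convolution" is computed without flipping the filter.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square filter over one input channel to produce one feature map."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # Dot product between the filter and the current window
            for dr in range(k):
                for dc in range(k):
                    acc += image[r * stride + dr][c * stride + dc] * kernel[dr][dc]
            row.append(acc)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
feature_map = conv2d(image, kernel=[[1, 1], [1, 1]], stride=2)
```

A full convolutional layer repeats this for every output channel and sums the contributions from all input channels.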





### **2D Convolutional Layers**

#### > Compute next channel



### **Convolutions**

### Challenges

#### > Channel connectivity issue

>> Information from every input channel is broadcast to every output channel



100s to 1000 channels

#### > Huge amounts of compute

>> Dense convolutions account for the majority of the compute

| MODEL    | CONV [GOPS] | FC [GOPS] |
|----------|-------------|-----------|
| ResNet50 | 7.712       | 0.004     |
| AlexNet  | 1.332       | 0.044     |
| VGG16    | 30.693      | 0.247     |
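A rough way to see where these operation counts come from is to count multiply-accumulates per layer. The sketch below assumes one image, square filters and feature maps, `fm_dim` as the output feature map dimension, and two operations per multiply-accumulate; the convolution layer shape is hypothetical.

```python
def conv_ops(in_ch, out_ch, f_dim, fm_dim):
    """Operations for one dense convolutional layer on one image."""
    macs = out_ch * in_ch * f_dim * f_dim * fm_dim * fm_dim
    return 2 * macs  # one multiply plus one add per MAC


def fc_ops(n_in, n_out):
    """Operations for one fully connected layer on one input."""
    return 2 * n_in * n_out


# Hypothetical 3x3 layer: 64 input channels, 64 output channels, 56x56 output
one_conv_layer = conv_ops(in_ch=64, out_ch=64, f_dim=3, fm_dim=56)
# A 2048 x 1000 fully connected layer (2.048 M weights)
one_fc_layer = fc_ops(2048, 1000)
print(one_conv_layer / 1e9, one_fc_layer / 1e9)
```

The single illustrative convolution layer already costs tens of times more than the fully connected layer, because each weight is reused at every feature map position.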

#### > Novel (Non-Dense) Convolutions

- >> Fewer spatial convolutions (1x1) (SqueezeNet's FireModules)
- >> Connectivity reduction between in and out channels (Shuffle, Shift layers)

=> Optimizations




### **Convolutions**

### **Challenges**

- > Parallelization of compute across layers reduces memory bandwidth required for buffering activations in between layers
- > Pyramid-shaped data dependency between activations across layers





### **Pooling Layer**

- > Down-samplers of images
- > Reduces compute in subsequent layers
- > May use MAX or AVERAGE
- > Compute:
  - >> Low amount of compute
  - >> Potentially replaceable with larger strides in previous convolution

#### Max pool with 2x2 filters and stride of 2:



$$n00 = Max(i00, i01, i10, i11)$$
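The same max pooling operation over a whole feature map, as a minimal Python sketch. It assumes the height and width are even; the function name and the input values are illustrative.

```python
def max_pool_2x2(fmap):
    """Max pooling with 2x2 filters and a stride of 2."""
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, len(fmap[0]), 2)
        ]
        for r in range(0, len(fmap), 2)
    ]


pooled = max_pool_2x2([
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])
```

Each output covers four inputs, so the following layer sees a quarter of the activations.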



### **Recurrent Layer Types**

- > Contain state for processing sequences
  - For example needed in speech or optical character recognition
  - >> "Apocal???"
- > Uni-directional or bi-directional
  - >> "I ???? You"
- More sophisticated types to address the vanishing gradients problem for learning more than 5-10 timesteps
  - GRU (gated recurrent unit)
  - >> LSTM (long short term memory)









### **Recurrent Layers**

### Challenges in Additional Data Dependencies

#### > Input sequence

>> Unlike batch, additional data dependencies between inputs of the same sequence and state

#### > Bi-directional NNs

>> Full sequence needs to be completed before the next layer
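These dependencies can be made concrete with a toy recurrent layer. The sketch below assumes a scalar vanilla RNN with a tanh activation, which is simpler than the GRU and LSTM cells mentioned earlier; all names and weight values are illustrative.

```python
import math


def rnn_forward(sequence, w_in, w_state, state=0.0):
    """Each timestep needs the state from the previous one, so steps run in order."""
    outputs = []
    for x in sequence:
        state = math.tanh(w_in * x + w_state * state)
        outputs.append(state)
    return outputs


def bidirectional(sequence, w_in, w_state):
    """The backward pass starts at the last element, so the whole sequence
    must be available before any output can go to the next layer."""
    forward = rnn_forward(sequence, w_in, w_state)
    backward = rnn_forward(sequence[::-1], w_in, w_state)[::-1]
    return list(zip(forward, backward))


uni = rnn_forward([1.0, 0.0], w_in=1.0, w_state=1.0)
bi = bidirectional([1.0, 0.0], w_in=1.0, w_state=1.0)
```

Unlike samples in a batch, the iterations of the loop in `rnn_forward` cannot be executed in parallel.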





### **Meta-Layers**

- > Residual layers (ResNets \*)
  - >> Introduced to make larger networks more trainable
  - Better gradient propagation through skip connections during training
  - Plus 1x1 convolutions to reduce dimensionality and save compute
- > Inception layers (GoogleNet\*\*)
  - Huge variation in spatial features => combining different size convolutions in one layer
  - Plus 1x1 convolutions to reduce dimensionality and save compute
  - >> Later on additional factorization to reduce compute
    - 3x3 = 1x3 and 3x1
- > Many more...
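The savings from 1x1 convolutions and from factorization can be estimated by counting weights. This is a sketch under assumed channel counts: 64 and 256 channels are illustrative, and the comparison baseline of two full 3x3 layers at 256 channels is an assumption, not a figure from the slides.

```python
def conv_weights(in_ch, out_ch, k_h, k_w):
    """Number of weights in one dense convolutional layer (biases ignored)."""
    return in_ch * out_ch * k_h * k_w


# Factorization: one 3x3 convolution versus a 1x3 followed by a 3x1
full_3x3 = conv_weights(64, 64, 3, 3)
factorized = conv_weights(64, 64, 1, 3) + conv_weights(64, 64, 3, 1)

# Bottleneck: 1x1 reduce to 64 channels, 3x3, then 1x1 expand back to 256
bottleneck = (
    conv_weights(256, 64, 1, 1)
    + conv_weights(64, 64, 3, 3)
    + conv_weights(64, 256, 1, 1)
)
plain = 2 * conv_weights(256, 256, 3, 3)

print(full_3x3, factorized, bottleneck, plain)
```

Since compute scales with the number of weights times the feature map size, the same ratios apply to the operation count for a fixed feature map.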


> Implementation: support for non-linear topologies!



CNV 3x3, 64, Relu

CNV 3x3, 64, Relu

CNV 1x1, 64, Relu

CNV 3x3, 64, Relu

CNV 1x1, 256, Relu







# Computation & Memory Requirements



### **Compute and Memory Requirements**

### Architecture Neutral, Per Layer




- IN, IN\_CH: number of inputs and input channels
- OUT, OUT\_CH: number of outputs and output channels
- F\_DIM, FM\_DIM: filter and feature map dimensions (assumed square)
- BATCH: batch size
- BITS: bit precision in data types
- GATES: number of gates in RNNs
- STATES: worst case
- SEQ: sequence length
- HID: hidden size (state + output from each state)
- DIRS: 1 for unidirectional and 2 for bidirectional RNN



Memory Requirements:

 $A_{total} = \sum A_i$ 

# **Inference Compute and Memory Across a Spectrum of Neural Networks**

\*architecture independent
\*\*1 image forward
\*\*\* batch = 1
\*\*\*\* int8



### **Rooflines\***







\*\* with respect to weights assuming weights are off-chip

### **Arithmetic Intensity**

### Across a Spectrum of Neural Networks

- > Memory requirements for weights and activations exceed typically available on-chip memory
- > This yields low arithmetic intensity
  - >> For example, for inference, assuming weights off-chip and a naïve implementation, the majority of networks are below 6 OPS:Byte
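The consequence of low arithmetic intensity can be read off a roofline model: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The sketch below uses a hypothetical device (10 TOPS peak, 20 GB/s of off-chip bandwidth); neither number comes from the slides.

```python
def attainable_ops_per_sec(peak_ops_per_sec, bandwidth_bytes_per_sec, ops_per_byte):
    """Roofline model: limited by either compute or memory bandwidth."""
    return min(peak_ops_per_sec, bandwidth_bytes_per_sec * ops_per_byte)


PEAK = 10 * 10**12       # hypothetical compute peak, operations per second
BANDWIDTH = 20 * 10**9   # hypothetical off-chip bandwidth, bytes per second

at_6_ops_per_byte = attainable_ops_per_sec(PEAK, BANDWIDTH, 6)
ridge_point = PEAK / BANDWIDTH  # intensity needed to reach the compute peak
print(at_6_ops_per_byte / 1e9, ridge_point)
```

On this assumed device a network at 6 OPS:Byte would reach only a small fraction of the peak, which is why keeping weights on-chip or raising the intensity matters.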





Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A. and Boyle, R., 2017, June. In-datacenter performance analysis of a tensor processing unit. ISCA'2017



### In Summary: CNNs are associated with...

- Significant amounts of memory and computation
- > Huge variation between topologies and within them
- > Fast changing algorithms
- > Special functions, non-linear topologies
- > However, incredibly parallel!
  - >> For convolutions: filter dimensions, feature map dimensions, input & output channels, batches, layers, and even precisions (discussed later)



Adapted from Ce Zhang, ETH Zurich, Systems Group Retreat



### **Architectural Challenges/ Pain Points**




Requires algorithmic & architectural innovation



### **Algorithmic Optimization Techniques**



#### **Optimization Techniques**

[Figure: NN inference/training accelerator with DRAM, weight buffer, activation buffering, input samples, input & results, partial sums, and activation functions/pooling]

Pain points:

- Weight & activation fetching: bandwidth throttles performance
- Power consumption for embedded
- Latency in real-time
- Huge amount of memory processing spilling into DRAM
- Huge amount of compute
- Limited scalability with new technology nodes

Techniques:

- Loop transformations to minimize memory access\*
- **Pruning**
- Compression
- Winograd, Strassen and FFT
- Novel layer types (squeeze, shuffle, shift)
- **Numerical Representations & Reducing Precision**



### **Example: Reducing Bit-Precision**

- > Linear reduction in memory footprint
  - >> Reduces weight fetching memory bandwidth
  - >> NN model may even stay on-chip
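The linear relationship between precision and footprint can be sketched directly. The weight count below is an approximation implied by the 25.5 MB int8 figure quoted for ResNet50 earlier; MB is taken as 10^6 bytes, and the function name is illustrative.

```python
def model_size_mb(num_weights, bits):
    """Weight storage in MB (10^6 bytes) at a given bit precision."""
    return num_weights * bits / 8 / 1e6


RESNET50_WEIGHTS = 25.5e6  # approximate: 25.5 MB of weights at int8

for bits in (32, 16, 8, 2, 1):
    print(f"{bits:>2}-bit: {model_size_mb(RESNET50_WEIGHTS, bits):.1f} MB")
```

At a few bits per weight the whole model drops to a few MB, which is the range where it may fit in on-chip memory.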

- > Reducing precision shrinks inherent arithmetic cost in both ASICs and FPGAs

>> Instantiate 100x more compute within the same fabric and thereby scale performance

[Figure: LUT costs (0 to 1800), with legend entries 1.1\*C and 1.6\*C]

| Model size [MB] (ResNet50) |
|----------------------------|
| 3.2                        |
| 25.5                       |
| 102.5                      |

C= size of accumulator \* size of weight \* size of activation (to appear in ACM TRETS SE on DL, FINN-R)



## Reducing Precision Provides Performance Scalability: Example ResNet50, ResNet152 and TinyYolo



Theoretical peak performance for a VU13P with different precision operations

Assumptions: application can fill device to 90% (fully parallelizable), 710 MHz

Reducing precision (RP) reduces model size => model can stay on-chip



### **Reducing Precision Inherently Saves Power**

#### FPGA:



- Target device: ZU7EV
- Ambient temperature: 25 °C
- Toggle rate: 12.5%
- Static probability: 0.5
- Power reported for PL accelerated block only

#### **ASIC:**

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017
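The table can be turned into a rough per-operation comparison. The sketch below copies a subset of the energy values from the table and assumes that a multiply-accumulate costs one multiply plus one add at the same precision, which ignores wider accumulators and data movement inside the datapath.

```python
# Energy per operation in pJ, copied from the table above (subset)
ENERGY_PJ = {
    "8b Add": 0.03,
    "32b FP Add": 0.9,
    "8b Mult": 0.2,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5,
    "32b DRAM Read": 640,
}


def mac_energy_pj(mult, add):
    """Energy of one multiply-accumulate, modelled as multiply plus add."""
    return ENERGY_PJ[mult] + ENERGY_PJ[add]


int8_mac = mac_energy_pj("8b Mult", "8b Add")
fp32_mac = mac_energy_pj("32b FP Mult", "32b FP Add")
dram_reads_in_int8_macs = ENERGY_PJ["32b DRAM Read"] / int8_mac

print(f"FP32 MAC costs about {fp32_mac / int8_mac:.0f}x an int8 MAC")
print(f"One DRAM read costs about {dram_reads_in_int8_macs:.0f} int8 MACs")
```

Under these assumptions, both the arithmetic itself and, even more so, the avoided DRAM traffic contribute to the power savings of reduced precision.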



### **RPNNs: Closing the Accuracy Gap**





### **Design Space Trade-Offs**



Hardware Architectures and their Specialization Towards CNN Workloads

Exciting Times in Computer Architecture Research!



### **Spectrum of New Architectures for Deep Learning**



Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N. and Temam, O., 2014, December. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 609-622). IEEE Computer Society.



\*Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, R.S. and Srikumar, V., 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH

Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y. and Xie, Y., 2016, June. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. In ACM SIGARCH

### **Architectural Choices – Macro-Architecture**

**\*** 





\*Chung, E., Fowers, J., Ovtcharov, K., Papamichael, M., Caulfield, A., Massengill, T., Liu, M., Lo, D., Alkalay, S., Haselman, M. and Abeydeera, M. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro, 38(2)

https://www.microsoft.com/en-us/research/uploads/prod/2018/06/ISCA18-Brainwave-CameraReady.pdf

\*\*Umuroglu, Y., Fraser, N.J., Gambardella, G., Blott, M., Leong, P., Jahre, M. and Vissers, K. "FINN: A framework for fast, scalable binarized neural network inference." ISFPGA'2017

# Synchronous Dataflow (SDF) vs Matrix of Processing Elements (MPE)



**Spectrum of Options** 



End points are pure layer-by-layer compute and feed-forward dataflow architecture





Lin, X., Yin, S., Tu, F., Liu, L., Li, X. and Wei, S. LCP: a layer clusters paralleling mapping method for accelerating inception and residual networks on FPGA. DAC'2016

Alwani, M., Chen, H., Ferdman, M. and Milder, P. Fused-layer CNN accelerators. MICRO'2016

## Synchronous Dataflow (SDF) vs Matrix of Processing Elements (MPE)



Degree of parallelization across layers



Synchronous dataflow (SDF):

- Requires less activation buffering
- Higher compute and memory efficiency due to custom-tailored hardware design
- Less flexibility
- Less latency (reduced buffering)
- No control flow (static schedule)

Matrix of processing elements (MPE):

- Requires less on-chip weight memory, but more activation buffers
- Efficiency of memory for weights and activations depends on how well balanced the topology is
- Flexible hardware, which can scale to arbitrarily large networks
- Compute efficiency is a scheduling problem => requires sophisticated scheduling algorithms

### **Architectural Choices – Micro-Architecture**



Judd, P., Albericio, J., Hetherington, T., Aamodt, T.M. and Moshovos, A., 2016, October. Stripes: Bit-serial deep neural network computing. MICRO'2016

Moons, B., Bankman, D., Yang, L., Murmann, B. and Verhelst, M. BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS. CICC'2018

Lin, X., Yin, S., Tu, F., Liu, L., Li, X. and Wei, S. LCP: a layer clusters paralleling mapping method for accelerating inception and residual networks on FPGA. DAC'2016



### Micro-Architecture:

### Customized Arithmetic for Specific Numerical Representations

- Customizing arithmetic makes it possible to maximize performance at minimal accuracy loss
  - >> Flexpoint, Microsoft Floating Point formats, Binary & Ternary, Bfloat16



- > Which do we support?
  - >> Perhaps too risky to support numerous formats, and too risky to fix on one?
- > What's more, non-uniform arithmetic can yield more efficient hardware implementations for a fixed accuracy\*
  - >> Run-time programmable precision: Bit-Serial
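One way to picture run-time programmable precision is a bit-serial dot product, where the weights are consumed one bit-plane per step and the number of steps sets the precision. The sketch below is a software illustration under simplifying assumptions (unsigned integer weights and activations); it is not a description of any specific accelerator.

```python
def bit_serial_dot(activations, weights, weight_bits):
    """Dot product that processes one bit of every weight per step."""
    total = 0
    for bit in range(weight_bits):
        # Add up the activations whose weight has this bit set (no multiplier)
        partial = sum(a for a, w in zip(activations, weights) if (w >> bit) & 1)
        # Weight the partial sum by the significance of the bit
        total += partial << bit
    return total


activations = [3, 1, 4]
weights = [5, 2, 7]  # fit in 3 bits

full_precision = bit_serial_dot(activations, weights, weight_bits=3)
reduced_precision = bit_serial_dot(activations, weights, weight_bits=2)
```

Running fewer steps uses only the low-order bits here; the point is that latency scales with the chosen number of bits, so precision becomes a run-time parameter rather than a fixed property of the hardware.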

|           | DEC   | INC   | CONCAVE | CONVEX |
|-----------|-------|-------|---------|--------|
| Top-1 [%] | 53.79 | 50.35 | 54.45   | 54.33  |
| Top-5 [%] | 77.59 | 74.89 | 76.43   | 78.20  |

Table 2. Accuracy comparison of our approach under different styles of layer-wise quantization.






### **Summary**

- CNNs are increasingly being adopted for new workloads and are key to the current industrial revolution and perhaps the next
- > Associated with significant challenges
- > Requires algorithmic and architectural innovation (co-designed)
- > Emerging: Huge spectrum of algorithms and increasingly diverse & heterogeneous hardware architectures
- > Clear metrics for comparison needed
  - Hardware performance should always tie back to application performance (accuracy) to allow for algorithmic optimizations
  - Ideally in the form of Pareto curves: accuracy, performance (TOPS), response time (1 input), power consumption
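A Pareto comparison can be sketched as follows. For brevity this example uses only two of the metrics, accuracy and throughput, both treated as higher-is-better; the design points and their numbers are entirely hypothetical.

```python
def pareto_front(points):
    """Names of the design points that no other point beats on both metrics."""
    front = []
    for name, acc, tops in points:
        dominated = any(
            a >= acc and t >= tops and (a > acc or t > tops)
            for _, a, t in points
        )
        if not dominated:
            front.append(name)
    return front


# (name, top-1 accuracy in %, throughput in TOPS): hypothetical values
designs = [
    ("A", 76.0, 5.0),
    ("B", 72.0, 20.0),
    ("C", 70.0, 15.0),
    ("D", 60.0, 60.0),
]
front = pareto_front(designs)
```

Design C drops out because B is both more accurate and faster; the remaining points are the trade-offs worth comparing across architectures.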





### **Exciting Times for our Community:**

### Many New Architectures Evolving - Programmable and Hardened

