## Why Xilinx for Machine Learning?

Craig Abramson, Senior Technical Marketing Engineer, Longmont, CO





**XILINX**

## **Machine Learning Challenges**



- The rate of AI innovation
- Performance at low latency
- Low power consumption
- Whole app acceleration





### **Inference is Moving to Lower Precision**



#### RELATIVE ENERGY COST

| Operation  | Relative energy cost |
|------------|----------------------|
| 8b Add     | 0.03                 |
| 16b Add    | 0.05                 |
| 32b Add    | 0.1                  |
| 16b FP Add | 0.4                  |
| 32b FP Add | 0.9                  |

Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017
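The ratios implied by the table can be worked out directly. The following is a small illustrative sketch: the dictionary only transcribes the values above, and the names `RELATIVE_ENERGY` and `cost_ratio` are invented for this example.

```python
# Relative energy costs transcribed from the table above (Dally, 2017).
RELATIVE_ENERGY = {
    "8b Add": 0.03,
    "16b Add": 0.05,
    "32b Add": 0.1,
    "16b FP Add": 0.4,
    "32b FP Add": 0.9,
}


def cost_ratio(op: str, baseline: str = "8b Add") -> float:
    """Return how many times more energy `op` costs than `baseline`."""
    return RELATIVE_ENERGY[op] / RELATIVE_ENERGY[baseline]


if __name__ == "__main__":
    for op in RELATIVE_ENERGY:
        print(f"{op:>10}: {cost_ratio(op):5.1f}x the cost of an 8b add")
```

On these figures, a 32-bit floating-point add costs roughly 30 times the energy of an 8-bit integer add, which is the motivation for moving inference to lower precision.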



### **Machine Learning Inference is Xilinx Focus**



**Training**: The process by which a machine "learns" and optimizes a model from data

**Inference** (Xilinx focus): Using trained models to predict or estimate outcomes from new observations in efficient deployments, for example labeling an input image "dog"
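The split between the two phases can be illustrated with a toy classifier. This is only a sketch under simple assumptions: it uses a nearest-centroid model on made-up 2-D feature vectors, and the function names `train` and `infer` are hypothetical, not part of any Xilinx tool flow.

```python
from collections import defaultdict
from math import dist


def train(samples):
    """Training: learn one centroid per label from (features, label) pairs."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def infer(model, features):
    """Inference: predict the label whose centroid is closest to the input."""
    return min(model, key=lambda label: dist(model[label], features))


if __name__ == "__main__":
    # Made-up feature vectors; the labels echo the slide's "dog" example.
    data = [((1.0, 1.0), "dog"), ((1.2, 0.8), "dog"),
            ((5.0, 5.0), "cat"), ((4.8, 5.2), "cat")]
    model = train(data)               # done once, offline
    print(infer(model, (0.9, 1.1)))   # done for every new observation
```

Training runs once over the whole data set, while inference runs for every new input, which is why deployment efficiency matters most on the inference side.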

## Why ML Inference? It's Where the Market is going to be...



Source: Barclays Research, Company Reports, May 2018

## **Delivering Adaptable ML Compute Acceleration**



### Why GPUs in the First Place?

GPU Graphics Pipeline: Converts 3D representations of images into 2D space



"Can we apply GPUs to other problems?"

## **CPU / GPU Architecture**

![](_page_9_Figure_1.jpeg)

### **Data Flow and Data Precision Matters**

### GPU (Basically a SW Engine)

![](_page_10_Figure_2.jpeg)

#### Software-Defined Data Flow

- Major overhead (memory, comms, power)
- Non-deterministic behavior (latency)

#### Fixed Data Precision Support

- Floating point / integer units
- Native precisions defined at tape-out

![](_page_10_Figure_9.jpeg)

### FPGA / ACAP

#### Hardware-Defined Data Flow

- Minimum overhead, custom compute / memory
- Deterministic behavior (latency)
- Reconfigurable to current / future workloads

#### Variable Data Precision Support

- Optimize for memory, power, cost
- Future proof, adapts to future precision trends
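To make the precision trade-off concrete, the sketch below quantizes floating-point weights to a chosen integer width and compares storage footprints. It assumes a simple symmetric quantization scheme picked for illustration; the function names are hypothetical and this is not a description of any specific Xilinx flow.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers of the given width (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0      # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


def footprint_bytes(num_weights, bits):
    """Storage needed for `num_weights` values at `bits` bits each (rounded up)."""
    return (num_weights * bits + 7) // 8


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]                # made-up example weights
    codes, scale = quantize_symmetric(weights, bits=8)
    print(codes, dequantize(codes, scale))
    for bits in (32, 16, 8, 4):
        print(f"{bits:>2}-bit: {footprint_bytes(1_000_000, bits):>9,} bytes per million weights")
```

Each halving of the bit width halves the memory needed for the weights, at the price of a rounding error of at most half a quantization step per weight.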

## Memory Hierarchy: Very Fundamental FPGA Advantage

![](_page_11_Figure_1.jpeg)

- Rigid memory hierarchy & data duplication
- High "data locality" required for workload efficiency

Fixed Memory Hierarchy & Shared Interconnect: Robs Bandwidth / Capacity & Stalls Compute

![](_page_11_Figure_5.jpeg)

- Adaptable memory hierarchy & datapath
- ~5X more on-chip memory / less off-chip required

Match Memory Hierarchy & Bandwidth to Compute Requirements

## What About Batching?

### **Fundamental to GPU Architecture**

(Software Defined Data Flow)

### Batching: Loading up lots of similar Data Sets

- Keep compute cores busy
- Hide some memory latency
- Create better SIMT efficiency

### Not Required for FPGA / ACAP

(Hardware Defined Data Flow)

### Independent of Data Set count

- Custom HW kernels
- Custom Memory Hierarchy
- HW pipeline data flow

![](_page_12_Figure_13.jpeg)

![](_page_12_Figure_14.jpeg)

GPU: High Throughput OR Low Latency

FPGA / ACAP: High Throughput AND Low Latency

Batching doesn't even really apply in an FPGA
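The throughput-versus-latency trade-off can be sketched with a simple timing model. Everything here is an assumption made for illustration: the model (a fixed per-batch overhead plus a per-item cost for the batched case, a fixed pipeline latency and initiation interval for the streaming case), the function names, and the numbers, none of which are measurements.

```python
def batched(batch_size, overhead_ms, per_item_ms):
    """Batch model: a fixed per-batch overhead is amortized over the batch.

    Every result only becomes available once the whole batch has finished.
    """
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    latency_ms = batch_time_ms
    throughput_per_s = 1000.0 * batch_size / batch_time_ms
    return latency_ms, throughput_per_s


def pipelined(pipeline_latency_ms, initiation_interval_ms):
    """Streaming model: a new input enters the pipeline every initiation interval."""
    latency_ms = pipeline_latency_ms
    throughput_per_s = 1000.0 / initiation_interval_ms
    return latency_ms, throughput_per_s


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the trade-off.
    for size in (1, 8, 64):
        lat, thr = batched(size, overhead_ms=8.0, per_item_ms=1.0)
        print(f"batch={size:>2}: latency {lat:5.1f} ms, throughput {thr:6.1f}/s")
    lat, thr = pipelined(pipeline_latency_ms=2.0, initiation_interval_ms=1.0)
    print(f"pipelined: latency {lat:5.1f} ms, throughput {thr:6.1f}/s")
```

In the batched model, raising the batch size improves throughput only by increasing latency, whereas in the pipelined model the two are set independently of how many inputs are queued.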

## A Batching Example: Image Classification

GPU

![](_page_13_Figure_2.jpeg)

### A Batching Example: Image Classification

![](_page_14_Figure_1.jpeg)

![](_page_14_Figure_2.jpeg)

"Dog" "Cat" "Truck" "Duck" "Cat" "Dog" "Duck" "Truck"

![](_page_14_Figure_4.jpeg)

![](_page_14_Figure_5.jpeg)

![](_page_14_Figure_6.jpeg)

## **The Benefits of Pruning & Compression**

#### **Before Pruning**

![](_page_15_Figure_2.jpeg)

![](_page_15_Figure_3.jpeg)

### Pruning Benefits:

- Smaller, "lighter" networks
- Less mem capacity & b/w req'd
- Reduced compute requirement
- Higher performance
- Lower power

Applies to both GPUs and FPGAs, but:

GPU:

- Issues w/ sparse & small matrices
- Compute efficiency degrades
- Still more off-chip mem req'd

FPGA / ACAP:

- Better w/ sparse matrices
- Single-chip solutions possible
- Device scales w/ compute/resources

Up to 30+% Better Compression
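As a rough illustration of what pruning does to a layer, the sketch below applies magnitude pruning, a common technique that zeroes the weights with the smallest absolute values. The slides do not say which pruning method is used, so this choice, the function names, and the example weights are all assumptions.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


def nonzero_count(weights):
    """Number of weights that still need to be stored and multiplied."""
    return sum(1 for w in weights if w != 0.0)


if __name__ == "__main__":
    layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.001]  # made-up weights
    sparse = prune_by_magnitude(layer, sparsity=0.5)
    print(sparse)
    print(f"{nonzero_count(sparse)} of {len(layer)} weights remain")
```

The remaining non-zero weights form a sparse matrix, which is where the memory and compute savings listed above come from, and also why hardware that handles sparse data well benefits most.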


## **Only Adaptable Hardware Addresses Inference Challenges**

Custom data flow (Address new architectures)

Custom memory hierarchy (Address power/performance challenges)

![](_page_16_Figure_4.jpeg)

Custom precision (Address power/cost)

![](_page_16_Picture_6.jpeg)

Domain Specific Architectures (DSAs) on Adaptable Platforms

![](_page_16_Picture_8.jpeg)

### **Do TOPs/FLOPs Matter?**

![](_page_17_Picture_1.jpeg)

![](_page_17_Picture_2.jpeg)

## **Putting Metrics & Benchmarks in Focus**

![](_page_18_Figure_1.jpeg)

### Focus on Application Level Performance Where Xilinx Solutions Shine

![](_page_18_Picture_3.jpeg)

### **Bi-directional LSTM Performance: Speech to Text**

![](_page_19_Figure_1.jpeg)

## **Xilinx Machine Learning Customer Successes**

![](_page_20_Figure_1.jpeg)

- USB Camera + CV + ML + Display
- 5x better Perf/Watt than GPU / SSD (no pruning!)
- ML benchmarks (pruning)
- **93% reduction** of resources within 1% of initial precision
- Camera + CV + 12 SSD + Display
- 12-channel object detection in 1 ZU9 (with pruning)

### **Video Success Story**


## Conclusion

![](_page_22_Picture_1.jpeg)

## FPGAs Address the Machine Learning Challenges . . .

![](_page_23_Picture_1.jpeg)

The rate of AI innovation

![](_page_23_Picture_3.jpeg)

Performance at low latency

![](_page_23_Picture_5.jpeg)

![](_page_23_Picture_6.jpeg)

... and today's activities will show you exactly how.

![](_page_23_Picture_9.jpeg)

## Thank you.

![](_page_24_Picture_1.jpeg)

# Adaptable Advantage

![](_page_25_Picture_1.jpeg)

![](_page_25_Picture_2.jpeg)