# Versal Portfolio Product Overview

Bill Allaire Mar 20, 2019





# Agenda

- Introducing Versal: The First ACAP
- Heterogeneous Acceleration Engines
- Key Architectural Blocks (Focus on AI Engines)
- Product Portfolio





# The Technology Conundrum ... and the Need for a New Compute Paradigm

Processing Architectures are Not Scaling



A Single Architecture Can't Do It Alone

- WP505: "Versal: The First Adaptive Compute Acceleration Platform."
- WP506: "Xilinx AI Engines and Their Applications."



# New Device Category: Adaptive Compute Acceleration Platform



© Copyright 2018 Xilinx


# **Breakthrough Performance for Cloud, Network, and Edge**





- Networking: Multi-terabit Throughput
- Cloud Compute: Breakthrough AI Inference
- 5G Wireless: Compute for Massive MIMO
- Edge Compute: AI Inference at Low Power



# **Versal Architecture Overview**

Scalar Engines

- Platform Control
- Edge Compute

Protocol Engines

- Integrated 600G cores
- 4X encrypted bandwidth

Programmable I/O

- Any interface or sensor
- Includes 3.2 Gb/s MIPI





# Hardware Adaptable: Accelerating the Whole Application




# **Adaptable Engines**

# Adaptable Hardware Engines

Programmable logic for fine-grained parallel processing, data aggregation, and sensor fusion

Programmable memory hierarchy to optimize compute efficiency

High bandwidth, low latency data movement between engines and I/O



# Adaptable Engines: Greater Compute Density for Any Workload

#### Re-Architected Hardware Fabric

- 4X density per logic block for more compute
- Less external routing $\rightarrow$ greater performance
- Code and IP compatible with 16nm devices



- Three operating voltages to choose from
- Balance power/performance for target app
- Equivalent to 3 speed grades in one device

#### Adaptable to any Workload

- Bit-level precision (1 $\rightarrow$ 1,000) for any algorithm
- Improves ML efficiency (compression, pruning)

ML Inference and Optimizations (e.g., pruning)

- Forward-compatible to lower-precision neural networks, e.g., BNN
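To illustrate why very low-precision networks such as BNNs map well onto bit-level logic, a binarized dot product reduces to an XOR followed by a population count. The sketch below is a plain-Python model, not Xilinx code: the names `pack_bits` and `binary_dot` are hypothetical, and the encoding (a set bit means +1, a clear bit means -1) is an assumption chosen for this example.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i is set when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("binarized values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors stored as packed bits.

    Positions where the bits differ contribute -1, matching positions contribute +1,
    so the result is n - 2 * (number of differing bits).
    """
    mismatches = bin(a_bits ^ w_bits).count("1")
    return n - 2 * mismatches


if __name__ == "__main__":
    activations = [1, -1, 1, 1]
    weights = [1, 1, -1, 1]
    fast = binary_dot(pack_bits(activations), pack_bits(weights), len(activations))
    reference = sum(a * w for a, w in zip(activations, weights))
    print(fast, reference)
```

The multiply-accumulate loop disappears entirely; what remains is the kind of fine-grained bit manipulation that programmable logic handles natively.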







For Any Workload, e.g., ...


# Adaptable Memory Hierarchy: The Right Memory for the Right Job

[Figure: memory hierarchy across the Scalar Engines (Arm Cortex-A72, Arm Cortex-R5), Adaptable Engines, and Intelligent Engines (AI Engines), alongside the I/O blocks (MIPI, PCIe, CCIX, network cores, SerDes, GPIO). Tiers are ordered by increasing bandwidth and decreasing density on an axis running from 1 Tb/s to 1,000 Tb/s.]

- LUTRAM: Distributed low-latency memory
- Block RAM & UltraRAM: Embedded configurable SRAM
- (New) Accelerator RAM: 4 MB sharable across engines
- HBM: In-package DRAM
- DDR External Memory: DDR4-3200; LPDDR4-4266
- Local data memory in AI Engines

# **Intelligent Engines**

# Intelligent Engines for Diverse Compute

#### **DSP** Engines

- High-precision floating point & low latency
- Granular control for customized data paths

#### **AI Engines**

- High throughput, low latency, and power efficient
- Ideal for AI inference and advanced signal processing






# NoC for Ease of Use, Guaranteed Bandwidth, and Power Efficiency

### High bandwidth terabit network-on-chip

- Memory mapped access to all resources
- Built-in arbitration between engines and memory

### High Bandwidth, Low Latency, Low Power

- Guaranteed QoS
- 8X power efficiency vs. FPGA implementations

### **Eases Kernel Placement**

- Easily swap kernels at NoC port boundaries
- Simplifies connectivity between kernels



# Introducing the "Integrated Shell"

#### 'Shell': Pre-Built Core Infrastructure & System Connectivity

- External host interface
- Memory subsystem
- Basic interfaces (e.g., JTAG, USB, GbE)

#### Key Architectural Elements of the Shell

- Platform Management Controller (PMC)
- Integrated host interfaces: PCIe & CCIX, DMA
- Scalable Memory Subsystem: DDR4 & LPDDR4
- Network-on-Chip for connectivity and arbitration

#### Greater Performance, Device Utilization, and Productivity

- More of the platform available for application's workload(s)
- Target application runs faster with less device congestion
- Turn-key, pre-engineered timing closure; no debug



# **AI Engines**



# AI Engines: Massive AI Inference Throughput and Wireless Compute

#### 1.3 GHz VLIW / SIMD vector processors

- Versatile core for ML and other advanced DSP workloads

#### Massive array of interconnected cores

- Instantiate multiple tiles (10s to 100s) for scalable compute

#### Terabytes/sec of interface bandwidth to other engines

- Direct, massive throughput to adaptable HW engines
- Implement core application with AI for "Whole App Acceleration"

#### SW programmable for any developer

- C programmable, compile in minutes
- Library-based design for ML framework developers



# AI Engine: Scalar Unit, Vector Unit, Load Units and Memory



# Up to 128 MACs / Clock Cycle per Core (INT8)
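Combining this per-core figure with the 1.3 GHz clock and the "10s to 100s" of tiles quoted earlier gives a rough theoretical ceiling. The sketch below is back-of-envelope arithmetic only, not a measured or published device rating: the function name `peak_int8_tops` is hypothetical, counting each MAC as two operations is an assumption, and the core counts are illustrative rather than taken from any product table.

```python
def peak_int8_tops(num_cores, clock_ghz=1.3, macs_per_cycle=128):
    """Theoretical peak INT8 throughput in tera-operations per second.

    Assumes every core issues the full 128 MACs on every cycle, which real
    workloads will not sustain.
    """
    ops_per_mac = 2  # assumption: one multiply plus one accumulate
    ops_per_second = num_cores * macs_per_cycle * ops_per_mac * clock_ghz * 1e9
    return ops_per_second / 1e12


if __name__ == "__main__":
    for cores in (1, 10, 100):
        print(f"{cores:4d} cores: {peak_int8_tops(cores):.2f} peak TOPS")
```

Sustained throughput depends on how well data movement keeps the vector units fed, which is the subject of the data movement slides.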







# Signal Processing Data Types



# Data Movement Architecture





# Streaming Communication (non-neighbor)









# AI Engine Integration with Versal™ ACAP



## TB/s of Interface Bandwidth

- AI Engine to Programmable Logic
- AI Engine to NoC

## Leveraging NoC connectivity

- PS manages Config / Debug / Trace
- AI Engine to DRAM (no PL req'd)



# **AI Engine Delivers High Compute Efficiency**

## Adaptable, non-blocking interconnect

- Flexible data movement architecture
- Avoids interconnect "bottlenecks"

## Adaptable memory hierarchy

- Local, distributed, shareable = extreme bandwidth
- No cache misses or data replication
- Extend to PL memory (BRAM, URAM)

## Transfer data while AI Engine computes



Overlap Compute and Communication
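One common way to overlap compute and communication is double (ping-pong) buffering: while the core works on one buffer, the next block is loaded into the other. The sketch below is a simplified Python model under stated assumptions, not the AI Engine programming interface. The function names are hypothetical, transfer and compute times are assumed constant per block, and the "concurrent" load is modeled sequentially.

```python
def serial_cycles(num_blocks, transfer, compute):
    """Total cycles when each block is transferred, then computed, one after another."""
    return num_blocks * (transfer + compute)


def overlapped_cycles(num_blocks, transfer, compute):
    """Total cycles with two buffers: loading block i+1 overlaps computing block i."""
    if num_blocks == 0:
        return 0
    # The first load and the last compute cannot be hidden; every step in between
    # is bounded by the slower of the two activities.
    return transfer + (num_blocks - 1) * max(transfer, compute) + compute


def process_ping_pong(blocks, kernel):
    """Functional model of ping-pong buffering over a list of data blocks."""
    results = []
    if not blocks:
        return results
    buffers = [None, None]
    buffers[0] = list(blocks[0])  # prefetch the first block
    for i in range(len(blocks)):
        current, following = i % 2, (i + 1) % 2
        if i + 1 < len(blocks):
            # In hardware this load would proceed while the kernel below is running.
            buffers[following] = list(blocks[i + 1])
        results.append(kernel(buffers[current]))
    return results


if __name__ == "__main__":
    print(serial_cycles(8, 100, 100), overlapped_cycles(8, 100, 100))
    print(process_ping_pong([[1, 2], [3, 4], [5]], sum))
```

When transfer and compute times are balanced, the overlapped schedule approaches half the serial time as the number of blocks grows.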

### **Vector Processor Efficiency**




# AI Inference on Versal™ ACAP



\*Figure credit: https://en.wikipedia.org/wiki/Convolutional\_neural\_network


# AI Inference Mapping on Versal™ ACAP

#### A = Activations, W = Weights








## Custom memory hierarchy

- Buffer on-chip vs. off-chip; reduce latency and power
- Stream multicast on AI interconnect
  - Weights and activations
  - Read once: reduce memory bandwidth
- AI-optimized vector instructions (128 INT8 mults/cycle)
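The "read once" point can be illustrated by counting memory reads when several cores need the same weights. The sketch below is a toy Python model, not a description of the AI Engine interconnect: `CountingMemory`, `unicast_dot`, and `multicast_dot` are hypothetical names, and each core is assumed to hold its own activations locally.

```python
class CountingMemory:
    """Stand-in for a shared weight store that counts every read."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def read(self, index):
        self.reads += 1
        return self.data[index]


def unicast_dot(memory, activations_per_core):
    """Each core fetches every weight itself, so reads scale with the core count."""
    results = []
    for activations in activations_per_core:
        acc = 0
        for i, a in enumerate(activations):
            acc += a * memory.read(i)
        results.append(acc)
    return results


def multicast_dot(memory, activations_per_core):
    """Each weight is read once and delivered to every core's accumulator."""
    accumulators = [0] * len(activations_per_core)
    for i in range(len(memory.data)):
        w = memory.read(i)  # single read, multicast to all cores
        for core, activations in enumerate(activations_per_core):
            accumulators[core] += activations[i] * w
    return accumulators


if __name__ == "__main__":
    weights = [1, 2, 3]
    activations = [[1, 0, 1], [2, 2, 2]]
    uni, multi = CountingMemory(weights), CountingMemory(weights)
    print(unicast_dot(uni, activations), uni.reads)
    print(multicast_dot(multi, activations), multi.reads)
```

Both versions produce the same results; the multicast version issues one read per weight regardless of how many cores consume it.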

# Frameworks for Any Developer



Domain Specific Architecture (e.g., AI Inference)

Architecture Overlay

**Data Flow** w/ Xilinx libraries

Kernel Program Data Flow w/ user defined libraries

## Target Domain Specific Architectures – No HW Design Experience Required



# **Unified Tool Chain for Device Programming**





# **Versal Roadmap**



2H 2019

# **Getting Started**



#### Visit www.xilinx.com/versal

- Watch ACAP Intro video
- Subscribe to mailing list for the latest news



#### View documentation and resources

- Data Sheet Overview
- Product Tables
- Versal Architecture and AI Engine White Papers



# **Key Take-Aways**

# Versal: The First ACAP

- Heterogeneous Acceleration
- For Any Application
- For Any Developer

# Announcing Two Device Series

- Versal Prime Series for Broad Application
- Versal AI Core Series for Highest AI Throughput

# Availability

- Early Access Program for SW and tools
- Devices Available 2H 2019



# Building the Adaptable, Intelligent World

# Thank you!

Contact Info: jasonv@xilinx.com

