

# Configurable Cloud-Scale DNN Processor for Real-Time AI

Speaker: Bita Rouhani, Sr. Researcher

## AI/ML ubiquitously fuels our technology



**Bing Visual Search** 

## Model sizes growing exponentially





## Dominant state-of-the-art models also evolving rapidly



Figure sources

1. Han et al., "Pre-Trained AlexNet Architecture with Pyramid Pooling and Supervision for High Spatial Resolution Remote Sensing Image Scene Classification"
2. Vaswani et al., "Attention is all you need"
3. https://tkipf.github.io/graph-convolutional-networks/

#### Silicon alternatives for AI models









FLEXIBILITY

#### The power of AI on FPGA

#### **Flexibility**

- FPGAs ideal for adapting to rapidly evolving AI/DL
- Adaptive numerical precision and custom operators
- CNNs, LSTMs, MLPs, transformers, reinforcement learning, feature extraction, etc.
- Exploit sparsity, etc.

#### **Performance**

- Excellent inference performance at low batch sizes
- Ultra-low latency serving on modern DNNs
- Scale to many FPGAs in single DNN service

#### **Scale**

- Microsoft has the world's largest cloud investment in FPGAs
- Multiple Exa-Ops of aggregate AI capacity
- BrainWave runs on Microsoft's scale infrastructure

# Project Catapult + Brainwave history

# Field Programmable Gate Arrays



2011: Project Catapult Launched

2013: Bing pilot runs decision trees 40X faster

2015: Bing ranking throughput increased 2X

2016: Azure Accelerated Networking delivers industry-leading cloud performance

2017: Over 1M servers deployed with FPGAs at hyperscale

2017: Hardware Microservices harness FPGAs for distributed computing

2017: FPGAs enable real-time AI, ultra-low latency inferencing without batching; Bing launches first FPGA-accelerated Deep Neural Network

2018: Project Brainwave launched in Azure Machine Learning

## Brainwave runs on a configurable cloud at massive scale



#### Scalable hardware microservice



## Conventional acceleration approach

#### Local offload and streaming



Model Parameters Initialized in DRAM



For memory-intensive DNNs with low compute-to-data ratios (e.g., LSTM), HW utilization limited by off-chip DRAM bandwidth
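The bandwidth limit can be sketched with a simple roofline-style estimate. This is a minimal sketch, not a model of any specific accelerator: the function name and all hardware numbers (10 TFLOPS peak, 100 GB/s DRAM, 16-bit weights) are assumptions chosen for illustration. It assumes every weight is streamed from DRAM once per batch and contributes one multiply and one add per request, as in a matrix-vector product.

```python
def dram_bound_utilization(peak_flops, dram_bw_bytes, bytes_per_weight, batch=1):
    """Upper bound on compute utilization when all weights stream from DRAM.

    Assumption: each weight fetched is used for 2 FLOPs (multiply + add)
    per request in the batch, as in a matrix-vector product.
    """
    flops_per_byte = 2.0 * batch / bytes_per_weight
    attainable_flops = min(peak_flops, dram_bw_bytes * flops_per_byte)
    return attainable_flops / peak_flops


# Hypothetical device: 10 TFLOPS peak, 100 GB/s DRAM, 2-byte weights, batch 1.
util = dram_bound_utilization(10e12, 100e9, 2)
print(f"{util:.1%}")  # prints 1.0%
```

Under these assumed numbers the compute units sit idle about 99% of the time at batch size 1, which is the situation the slide describes for LSTM-style workloads.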

## Improving HW utilization with batching







Batching improves HW utilization but also increases latency

Ideally want high HW utilization at low batch sizes
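The utilization/latency tension can be illustrated with a toy model. This is a sketch under stated assumptions, not a measurement: the function name, the model size (50M parameters), the hardware numbers, and the request arrival rate are all hypothetical. Weights are assumed to be streamed from DRAM once per batch, and the first request in a batch is assumed to wait until the batch fills.

```python
def batch_tradeoff(batch, params=50e6, bytes_per_weight=2,
                   peak_flops=10e12, dram_bw=100e9, arrivals_per_s=1000.0):
    """Return (utilization, worst-case latency in seconds) for one batch.

    All default values are illustrative assumptions.
    """
    mem_s = params * bytes_per_weight / dram_bw     # stream weights once per batch
    compute_s = 2.0 * params * batch / peak_flops   # 2 FLOPs per weight per request
    service_s = max(mem_s, compute_s)               # slower of the two dominates
    wait_s = (batch - 1) / arrivals_per_s           # first request waits for batch to fill
    utilization = compute_s / service_s
    return utilization, wait_s + service_s


for b in (1, 8, 64):
    u, lat = batch_tradeoff(b)
    print(f"batch={b:3d}  utilization={u:6.1%}  latency={lat * 1e3:6.1f} ms")
```

In this model utilization rises roughly linearly with batch size while the first request's latency rises with it, which is why high utilization at batch size 1 is the goal.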





#### **Observations**

State-of-the-art FPGAs have O(10K) distributed Block RAMs, O(10 MB) in total

→ Tens of TB/sec of memory BW
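A back-of-the-envelope check of the bandwidth figure: if every Block RAM can be read in parallel each cycle, aggregate bandwidth is the number of RAMs times the port width times the clock rate. The port width (40 bits), single read port, and 500 MHz clock below are illustrative assumptions, not values from the slides or from a vendor datasheet; only the O(10K) RAM count comes from the slide.

```python
def aggregate_bram_bandwidth(num_brams, port_width_bits, ports_per_bram, clock_hz):
    """Aggregate on-chip read bandwidth in bytes/s, assuming all ports are active every cycle."""
    bytes_per_cycle = num_brams * ports_per_bram * port_width_bits / 8
    return bytes_per_cycle * clock_hz


# 10K Block RAMs (from the slide); width, port count, and clock are assumed.
bw = aggregate_bram_bandwidth(10_000, 40, 1, 500e6)
print(f"{bw / 1e12:.0f} TB/s")  # prints 25 TB/s
```

With these assumed parameters the result lands in the "tens of TB/sec" range, two orders of magnitude above the 100 GB/s class of off-chip DRAM bandwidth.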

Large-scale cloud services and DNN models run persistently

Solution: persist all model parameters in FPGA on-chip memory during service lifetime





When a single request arrives, all chip resources (on-chip memories and compute units) are used to process that one query (no batching required)
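Whether a model can be persisted on chip is a capacity question. The sketch below estimates how many FPGAs would be needed to hold all weights on chip. The function name, the 80% usable fraction, and the 8-bit and 4-bit weight widths are assumptions for illustration; the O(10 MB) per-device capacity and the "tens of millions of parameters" model size echo figures given elsewhere in the talk.

```python
import math


def fpgas_needed(num_params, bits_per_weight, on_chip_bytes_per_fpga,
                 usable_fraction=0.8):
    """Number of FPGAs needed to keep every weight in on-chip memory.

    usable_fraction is an assumed allowance for memory that is not
    available for weights (activations, buffers, and so on).
    """
    model_bytes = num_params * bits_per_weight / 8
    usable_bytes = on_chip_bytes_per_fpga * usable_fraction
    return math.ceil(model_bytes / usable_bytes)


# 50M parameters on devices with 10 MB of on-chip memory each.
print(fpgas_needed(50e6, 8, 10e6))  # 8-bit weights
print(fpgas_needed(50e6, 4, 10e6))  # 4-bit weights
```

The estimate shows why narrow precision matters for persistence: halving the bits per weight roughly halves the number of devices a model must span.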

#### BrainWave System

Layers of the stack, top to bottom ("General Infrastructure" appears as a separate label on the slide):

Compiler & Runtime

A framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft NPUs

Architecture

- Adaptive ISA for narrow precision DNN inference
- Flexible and extensible to support fast-changing AI algorithms

Microarchitecture

- BrainWave Soft NPU microarchitecture
- Highly optimized for narrow precision and low batch

Persistency at Scale

- Persist model parameters entirely in FPGA on-chip memories
- Support large models by scaling across many FPGAs

HW Microservices on Intel FPGAs
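The architecture and microarchitecture layers above are described as targeting narrow precision, but the slides do not name a numeric format. As one possible illustration only, the sketch below quantizes a block of values to small signed integers that share a single power-of-two scale. The scheme, the function names, and the 4-bit width are assumptions and should not be read as BrainWave's actual format.

```python
import math


def quantize_block(values, mantissa_bits=4):
    """Quantize floats to signed integers sharing one power-of-two exponent."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    qmax = 2 ** (mantissa_bits - 1) - 1  # largest magnitude, e.g. 7 for 4 bits
    # Smallest exponent such that max_abs / 2**exp still fits within qmax.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exp


def dequantize_block(ints, exp):
    """Reconstruct approximate floats from integers and the shared exponent."""
    return [i * 2.0 ** exp for i in ints]


weights = [0.5, -1.5, 0.25, 1.0]
ints, exp = quantize_block(weights)
print(ints, exp, dequantize_block(ints, exp))
```

Sharing one exponent across a block lets the multiply-accumulate units work on small integers, which is the kind of trade a soft NPU on an FPGA can make per model; the rounding error per value is bounded by half the shared scale.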

#### Deployed in production datacenters



Deployment of LSTM-based NLP model (tens of millions of parameters)

Takes tens of milliseconds to serve on well-tuned CPU implementations

Tail latencies in BrainWave-powered DNN models appear negligible in E2E software pipelines

# Closing Thoughts ...

## Bending the AI ambition-cost curve



### We are hiring ...

Check out Azure AIArch for our open positions:

- Data & Applied Scientist
- Software Engineering
- Hardware Engineering

Send Resumes To:
<a href="mailto:hiring4azurehardware@microsoft.com">hiring4azurehardware@microsoft.com</a>
<a href="mailto:bita.rouhani@microsoft.com">bita.rouhani@microsoft.com</a>

