



# Enabling Continuous Learning through Synaptic Plasticity in Hardware

Tushar Krishna, Georgia Tech

EMC<sup>2</sup> Workshop, June 23, 2019

### The Dream!



# **Deep Learning Applications**

#### "AI is the new electricity" – Andrew Ng

#### **Object Detection**



#### Image Segmentation



#### **Medical Imaging**



#### **Speech Recognition**



#### **Text to Speech**

Speech

Text

#### Recommendations

#### Games



## Deep Learning Landscape





### **Computation Platforms**



# Efficiency of Deep Learning Systems



### What is Continuous Learning?





### Efficiency of Continuous Learning Systems



#### Outline of Talk

Ananda Samajdar, Parth Mannan, Kartikay Garg, and Tushar Krishna, **GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware**, **MICRO 2018** 



- Continuous Learning Template
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

### **Continuous Learning in Brains**



Constant synapse formation and pruning

# **Template for Continuous Learning**



### **Conventional RL: Challenges**

Deep NNs used internally

Manual hyperparameter tuning

Each update requires **backpropagation**

- High compute requirement at every update
- High memory overhead
- Not scalable

#### Not viable for continuous learning on the edge

#### Outline of Talk



- Continuous Learning Template
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# Neuro-Evolutionary (NE) Algorithm



# Properties of NE algorithms



#### **Systems**

Too much compute!

**Convergence time?** 

déjà vu! Looks like Deep Neural Networks in the 90s



#### Outline of Talk



- Continuous Learning
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

# Characterization of NEAT



Ran each environment till convergence, multiple times

Only changed fitness function between workloads

EMC2 Workshop, ISCA 2019

**NEAT Python: https://github.com/CodeReclaimers/neat-python**

# Characterization of NEAT

#### **Computations**



#### **Inference:**

**Population level parallelism (PLP)** 

#### **Evolution:**

#### **Gene level parallelism (GLP)**

#### **Distribution of Operations/Generation**

All operations are independent

**Large operation level Parallelism** 
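
The two kinds of parallelism named above can be illustrated with a minimal sketch. This is not NEAT itself (there is no topology mutation or speciation) and it is not the neat-python API: the flat-list genome, the toy `fitness` function, and all helper names are assumptions made for illustration. Inference maps one independent task per genome (PLP), while crossover and mutation map one independent task per gene (GLP).

```python
import random
from concurrent.futures import ThreadPoolExecutor

def make_genome(n_genes, rng):
    # Hypothetical genome: a flat list of connection weights ("genes").
    return [rng.uniform(-1.0, 1.0) for _ in range(n_genes)]

def fitness(genome):
    # Toy fitness (assumption): weights closer to 0.5 score higher.
    return -sum((w - 0.5) ** 2 for w in genome)

def crossover_gene(args):
    # Each child gene is chosen from one parent, independently of other genes.
    gene_a, gene_b, take_a = args
    return gene_a if take_a else gene_b

def mutate_gene(args):
    # Each gene is perturbed independently of other genes.
    weight, delta = args
    return weight + delta

def next_generation(population, rng, pool):
    # Inference phase: one independent task per genome (PLP).
    scores = list(pool.map(fitness, population))
    order = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    ranked = [population[i] for i in order]
    parents = ranked[: max(2, len(ranked) // 2)]

    # Evolution phase: keep the best genome, then breed the rest.
    children = [ranked[0]]
    while len(children) < len(population):
        p1, p2 = rng.sample(parents, 2)
        # Random draws happen up front so the per-gene work is a pure function.
        picks = [(a, b, rng.random() < 0.5) for a, b in zip(p1, p2)]
        child = list(pool.map(crossover_gene, picks))        # GLP
        noisy = [(w, rng.gauss(0.0, 0.05)) for w in child]
        children.append(list(pool.map(mutate_gene, noisy)))  # GLP
    return children, max(scores)
```

A caller would create a `ThreadPoolExecutor`, build an initial population with `make_genome`, and call `next_generation` in a loop. Because the best genome is carried over unchanged, the best score reported per generation never decreases in this sketch.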

### **Operations in NEAT**



# Characterization of NEAT

#### Memory





#### **Distribution of Memory footprint/Generation**

#### **Entire population can fit on-chip**

Only need to store the weights and node info
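
A back-of-the-envelope check of the on-chip claim can be sketched as follows. The population size, genome sizes, bytes per gene, and SRAM word width used here are hypothetical values chosen only for illustration; only the bank count (48) and depth (4096) come from the GeneSys Parameters table in this talk.

```python
def population_footprint_bytes(pop_size, nodes_per_genome, conns_per_genome,
                               bytes_per_node=8, bytes_per_conn=8):
    # Only node info and connection weights are stored per genome.
    # The byte sizes are assumptions, not measured values.
    per_genome = nodes_per_genome * bytes_per_node + conns_per_genome * bytes_per_conn
    return pop_size * per_genome

def fits_on_chip(footprint_bytes, sram_banks, sram_depth, word_bytes):
    # Capacity = banks x entries per bank x bytes per entry.
    capacity = sram_banks * sram_depth * word_bytes
    return footprint_bytes <= capacity, capacity

# Hypothetical workload: 150 genomes, 20 nodes and 60 connections each.
footprint = population_footprint_bytes(150, 20, 60)
# 48 banks and depth 4096 are from the talk; the 8-byte word is an assumption.
ok, capacity = fits_on_chip(footprint, sram_banks=48, sram_depth=4096, word_bytes=8)
print(footprint, capacity, ok)
```

Under these assumed numbers the population occupies a small fraction of the SRAM; the real ratio depends on the actual gene encoding and word width.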

# **Properties of NE algorithms**



#### Motivating Hardware Solution



#### Outline of Talk



- Continuous Learning
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations





#### Evolution Engine: EvE Microarchitecture



### PE Microarchitecture



#### Inference Engine: ADAM Microarchitecture



#### Outline of Talk



- Continuous Learning
- Neuro-Evolutionary Algorithms
  - Algorithm Description
  - Characterizing NEAT
- Microarchitecture
- Evaluations

### Implementation

#### **GeneSys Parameters**

| Parameter    | Value              |
|--------------|--------------------|
| Tech node    | 15 nm              |
| Num EvE PE   | 256                |
| Num ADAM PE  | 1024               |
| EvE Area     | 0.89 mm<sup>2</sup> |
| ADAM Area    | 0.25 mm<sup>2</sup> |
| GeneSys Area | 2.45 mm<sup>2</sup> |
| Power        | 947.5 mW           |
| Frequency    | 200 MHz            |
| Voltage      | 1.0 V              |
| SRAM banks   | 48                 |
| SRAM depth   | 4096               |
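
A few derived quantities follow directly from the table. The sketch below is simple arithmetic on the listed values. It assumes the power figure is an average at the listed frequency and that the EvE and ADAM areas are totals for all PEs. Neither assumption is stated in the table, and the function names are illustrative.

```python
# Values copied from the GeneSys Parameters table.
POWER_W = 947.5e-3
FREQ_HZ = 200e6
AREA_MM2 = {"EvE": 0.89, "ADAM": 0.25, "GeneSys": 2.45}

def energy_per_cycle_joules(power_w, freq_hz):
    # Energy per clock cycle = power / frequency.
    return power_w / freq_hz

def remaining_area_mm2(areas):
    # Area not attributed to EvE or ADAM (SRAM and everything else).
    return areas["GeneSys"] - areas["EvE"] - areas["ADAM"]

def power_density_mw_per_mm2(power_w, area_mm2):
    return power_w * 1e3 / area_mm2

print(energy_per_cycle_joules(POWER_W, FREQ_HZ))               # about 4.74 nJ
print(remaining_area_mm2(AREA_MM2))                            # about 1.31 mm^2
print(power_density_mw_per_mm2(POWER_W, AREA_MM2["GeneSys"]))  # about 387 mW/mm^2
```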



[Figure: GeneSys area breakdown. Legend: EvE area, SRAM area, ADAM area, MO area]



### **Evaluations**

| Legend  | Inference | Evolution | Platform        |
|---------|-----------|-----------|-----------------|
| CPU_a   | Serial    | Serial    | 6th gen i7      |
| CPU_b   | PLP       | Serial    | 6th gen i7      |
| GPU_a   | BSP       | PLP       | Nvidia GTX 1080 |
| GPU_b   | BSP + PLP | PLP       | Nvidia GTX 1080 |
| CPU_c   | Serial    | Serial    | ARM Cortex A57  |
| CPU_d   | PLP       | Serial    | ARM Cortex A57  |
| GPU_c   | BSP       | PLP       | Nvidia Tegra    |
| GPU_d   | BSP + PLP | PLP       | Nvidia Tegra    |
| GENESYS | PLP       | PLP + GLP | GENESYS         |

PLP (GLP): Population (Gene) Level Parallelism

BSP: Bulk Synchronous Parallelism (GPU)

### **Evaluations: Energy**



### **Evaluations: Runtime**

[Figure: runtime comparison. Legend: CPU\_a, CPU\_c, GPU\_a, GPU\_c, GeneSys]



## Summary for GeneSys

- Robust, Scalable and Energy efficient solutions needed for continuous learning
  - Look beyond DL and RL
- NEs offer promise
  - Parallelism
  - Low-memory Footprint
  - HW friendly
- GeneSys: 100x to 100,000x better energy efficiency and performance
  - More deployable compute
  - Enables AI solutions for a large gamut of problems

### Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation

- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations

Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna, **MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects**, **ASPLOS 2018**, IEEE Micro Top Picks 2019 Honorable Mention



### Myriad Dataflows in DNN Accelerators

#### • DNN Topologies

- Layer size / shape
- Layer types: Convolution / Pool / FC / LSTM
- New sub-structure: e.g., Inception in GoogLeNet

### Compiler/Mapper

- Loop Scheduling
  - Reordering and Tiling
- Mapping
  - Output/Weight/Input/Row-stationary
- Algorithmic Optimization (e.g., Sparsity)
  - Weight pruning
  - GeneSys
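
To make "reordering and tiling" concrete, here is a small sketch on a matrix multiply rather than a full convolution. The function names and the tile size are illustrative assumptions. The tiled version visits the same multiply-accumulate operations in a different order, so the result is identical while the data access pattern, and hence the dataflow the hardware sees, changes.

```python
def matmul_naive(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C

def matmul_tiled(A, B, tile=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # Outer loops walk over tiles; the reduction loop (p) is reordered
    # ahead of the column loop (j) so each A[i][p] is reused across a tile row.
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            C[i][j] += a * B[p][j]
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(A, B) == matmul_tiled(A, B))
```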



Can we have one architectural solution that can handle arbitrary dataflows and provides ~100% utilization?

## What is the computation in a DNN?



#### **Accumulation of partial products**

Our Key insight: Each DNN/dataflow translates into neurons of different sizes
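
The insight can be read as follows: a neuron is a sum of weight-input products, and the number of products depends on the layer type and on pruning. The sketch below uses illustrative helper names and example shapes that are assumptions, not figures from the talk.

```python
def neuron_output(weights, inputs):
    # A neuron accumulates partial products: sum of w * x.
    return sum(w * x for w, x in zip(weights, inputs))

def conv_neuron_size(kernel_h, kernel_w, in_channels):
    # One output pixel of a CONV filter needs one product per filter weight.
    return kernel_h * kernel_w * in_channels

def fc_neuron_size(n_inputs):
    # A fully connected neuron needs one product per input.
    return n_inputs

def pruned_neuron_size(weights):
    # After weight pruning, only nonzero weights contribute a product.
    return sum(1 for w in weights if w != 0)

print(conv_neuron_size(3, 3, 64))                      # 3x3 filter over 64 channels
print(fc_neuron_size(1000))
print(pruned_neuron_size([0.5, 0.0, 0.0, -1.2, 0.3]))
```

The same accumulation therefore shows up with very different group sizes, which is what the hardware has to accommodate.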

### Irregular Dataflow: Pruning



#### Our Key insight: Each DNN/dataflow translates into neurons of different sizes

### Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation

- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



### The MAERI Abstraction





How to enable flexible grouping?

Need flexible connectivity!
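
Why flexible grouping matters can be sketched with a toy utilization model. Both functions below are illustrative assumptions rather than MAERI's actual mapper: the first packs neurons of arbitrary size onto contiguous multipliers, and the second models a hypothetical fixed-cluster design in which each neuron occupies one whole cluster.

```python
def map_flexible(neuron_sizes, num_multipliers):
    # Greedy contiguous packing: each neuron gets exactly as many
    # multipliers as it needs, placed right after the previous one.
    groups, start = [], 0
    for size in neuron_sizes:
        if size > num_multipliers - start:
            break  # no room left in this pass
        groups.append((start, start + size))
        start += size
    return groups, start / num_multipliers

def map_fixed_clusters(neuron_sizes, cluster_size, num_multipliers):
    # Hypothetical baseline: one neuron per fixed-size cluster.
    if any(size > cluster_size for size in neuron_sizes):
        raise ValueError("neuron does not fit in one cluster")
    num_clusters = num_multipliers // cluster_size
    used = sum(neuron_sizes[:num_clusters])
    return used / num_multipliers

sizes = [3, 3, 3, 3, 4]
print(map_flexible(sizes, 16))           # all 16 multipliers busy
print(map_fixed_clusters(sizes, 4, 16))  # 12 of 16 multipliers busy
```

In this toy case the flexible mapping keeps every multiplier busy, while the fixed clusters leave a quarter of them idle.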

### Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation

- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations

















### Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation

- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



# Example: Computing a CONV layer

- [Communication] Distribute weights and inputs (image pixels) to multiplier switches
  - Assume: weight stationary, conv reuse of inputs via local links
- [Computation] Compute partial sums
- [Computation] Reduce partial sums
- [Communication] Collect outputs to buffer
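
The four steps above can be sketched in software on a 1D convolution. This is a functional model under stated assumptions, not a description of the MAERI hardware: the list `window` stands in for the inputs held at the multiplier switches, the slice-and-append stands in for forwarding over local links, and `tree_reduce` stands in for the reduction network.

```python
def tree_reduce(values):
    # Pairwise (tree-style) reduction of partial sums.
    vals = list(values)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # odd element passes through to the next level
        vals = nxt
    return vals[0]

def conv1d_weight_stationary(weights, inputs):
    k = len(weights)
    # Step 1 [Communication]: weights are distributed once and stay put.
    stationary_weights = list(weights)
    window = list(inputs[:k])
    outputs = []
    for pos in range(len(inputs) - k + 1):
        if pos > 0:
            # Input reuse: existing inputs shift to a neighbor,
            # and only one new input is fetched.
            window = window[1:] + [inputs[pos + k - 1]]
        # Step 2 [Computation]: each multiplier forms one partial product.
        partial = [w * x for w, x in zip(stationary_weights, window)]
        # Step 3 [Computation]: reduce the partial sums.
        # Step 4 [Communication]: collect the output into the buffer.
        outputs.append(tree_reduce(partial))
    return outputs

print(conv1d_weight_stationary([1, 0, -1], [1, 2, 3, 4, 5]))
```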













### Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation

- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



## Example Mapping – Dense CNN

Our Key insight: Each DNN/dataflow translates into neurons of different sizes



## Example Mapping – Sparse DNN

Our Key insight: Each DNN/dataflow translates into neurons of different sizes



# Example Mapping – LSTM/FC

Our Key insight: Each DNN/dataflow translates into neurons of different sizes



### Searching optimal dataflows for MAERI



### Outline of Talk

- Motivation
  - Irregular Dataflows
  - DNN Computation

- MAERI
  - Abstraction
  - Implementation
  - Operation Example
  - Mapping Strategies
- Evaluations



### End-to-End Performance



### **Energy with Convolution Layers**



\* Normalized to MAERI energy with AlexNet C1

MAERI reduces energy by up to 57% (28% on average) compared to Row-Stationary (dense dataflow), and by 7.1% on average compared to Systolic Array (sparse dataflow).

## Summary of MAERI

- DNN models evolving rapidly
  - Multiple layer types
  - Sparsity Optimizations
  - Myriad dataflows for scheduling and mapping
- MAERI enables dynamic grouping of an arbitrary number of MACCs (a "Virtual Neuron") via reconfigurable, non-blocking interconnects, providing
  - Future-proofing against evolving DNN models and dataflows
  - Near 100% compute unit utilization


