# Mixed-Signal Techniques for Embedded Machine Learning Systems



Boris Murmann, June 16, 2019





#### Applications





#### **Speed of response**

#### **Bandwidth utilized**

#### **Privacy**

#### **Power consumed**

### Task Complexity, Memory and Classification Energy






### Edge Inference System




# **Opportunities for Analog/Mixed-Signal Design**



#### Outline

#### Data-Compressive Imager for Object Detection

> Omid-Zohoor & Young, TCSVT 2018 & ISSCC 2019

#### Mixed-Signal ConvNet

> Bankman, ISSCC 2018 & JSSC 2019

#### RRAM-based ConvNet with In-Memory Compute

> Ongoing work



#### Wake-Up Detector with Hand-Crafted Features



#### **Analog Feature Extractor**



- Low-rate and/or low-resolution ADC
- Low data rate digital I/O
- Reduced memory requirements



Low-dimensional representation

#### Histogram of Oriented Gradients
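As a reminder of what the feature computes, here is a minimal software sketch of one HOG cell histogram. It is not the chip's implementation: the central-difference gradients, the 9 unsigned orientation bins, and the function name `hog_cell_histogram` are all assumptions chosen for illustration.

```python
import math

def hog_cell_histogram(cell, num_bins=9):
    """Magnitude-weighted orientation histogram for one cell (a list of pixel rows).

    Assumptions: central-difference gradients, unsigned orientation (0..180 degrees),
    border pixels skipped.
    """
    hist = [0.0] * num_bins
    bin_width = 180.0 / num_bins
    rows, cols = len(cell), len(cell[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gh = cell[r][c + 1] - cell[r][c - 1]   # horizontal gradient
            gv = cell[r + 1][c] - cell[r - 1][c]   # vertical gradient
            magnitude = math.hypot(gh, gv)
            angle = math.degrees(math.atan2(gv, gh)) % 180.0
            hist[int(angle // bin_width) % num_bins] += magnitude
    return hist

# A purely horizontal intensity ramp puts all votes into the first bin.
ramp = [[c for c in range(4)] for _ in range(4)]
print(hog_cell_histogram(ramp))
```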



### Analog Gradient Computation





**Bright patch** 

$$G_H = 400\,\text{mV} - 100\,\text{mV} = 300\,\text{mV}$$

**Dark patch**

$$G_H = \left(\frac{1}{4}\right) 400\,\text{mV} - \left(\frac{1}{4}\right) 100\,\text{mV} = 75\,\text{mV}$$

# High Dynamic Range Images



- Gradient magnitude varies significantly across image
- Would require high-resolution ADCs (≥ 9b) to digitize computed gradients
  - But we want to produce as little data as possible

### Ratio-Based ("Log") Gradients
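A short sketch of why a ratio helps, reusing the 400 mV / 100 mV patch values and the 1/4 illumination factor from the gradient example. The function names are hypothetical; the point is only that scaling both pixels by the same factor changes the difference but not the ratio.

```python
import math

def linear_gradient(a_mv, b_mv):
    """Difference-based gradient: scales with illumination."""
    return a_mv - b_mv

def log_gradient(a_mv, b_mv):
    """Ratio-based gradient, log(a/b) = log(a) - log(b): illumination cancels."""
    return math.log(a_mv / b_mv)

for illumination in (1.0, 0.25):            # bright patch, dark patch
    a, b = 400 * illumination, 100 * illumination
    print(illumination, linear_gradient(a, b), log_gradient(a, b))
```

The linear gradient drops from 300 mV to 75 mV, while the ratio stays at 4 in both cases.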





#### **Ratio Quantization**
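A 1.5b code has three levels. One way to picture a three-level ratio quantizer is to compare the pixel ratio against a threshold and its reciprocal. This is a sketch only: the threshold value of 1.25 and the function name are assumptions, not the chip's actual decision levels.

```python
def quantize_ratio_1p5b(a, b, threshold=1.25):
    """Map the ratio a/b to one of three levels: +1, 0, or -1.

    The threshold is a hypothetical value for illustration; using T and 1/T
    keeps the quantizer symmetric when a and b are swapped.
    """
    ratio = a / b
    if ratio > threshold:
        return 1       # a is clearly brighter than b
    if ratio < 1.0 / threshold:
        return -1      # b is clearly brighter than a
    return 0           # no significant edge

print(quantize_ratio_1p5b(400, 100), quantize_ratio_1p5b(100, 105), quantize_ratio_1p5b(100, 400))
```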



### HOG Feature Compression with 1.5b Gradients



#### Log vs. Linear Gradients





Less Illumination


# Prototype Chip



### Row Buffers with Pixel Binning (Image Pyramid)
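A software analogue of square pixel binning, as a rough sketch. Averaging each k x k block (rather than summing) and the function name `bin_pixels` are assumptions; the chip performs binning in its row buffers, which this code does not model.

```python
def bin_pixels(image, k):
    """Average non-overlapping k x k blocks of a 2-D image (list of rows)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(k) for dc in range(k)) / (k * k)
            for c in range(0, cols - k + 1, k)
        ]
        for r in range(0, rows - k + 1, k)
    ]

image = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
# An image pyramid is the same frame at several bin sizes.
pyramid = [bin_pixels(image, k) for k in (1, 2, 4)]
print(pyramid[1])
```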




#### Ratio-to-Digital Converter (RDC)



#### **Data-Driven Spec Derivation**



# Chip Summary

- 0.13 µm CIS 1P4M
- 5µm 4T pixels
- QVGA 320(V) x 240(H)
- 229 μW @ 30 FPS

- Supply voltages: pixel 2.5 V; analog 1.5 V and 2.5 V; digital 0.9 V









Results using Deformable Parts Model detection & custom database (PascalRAW)

### Comparison to State of the Art

|                   | This Work                                      | [Choi, ISSCC'13]  | [Katic, Sens.J.'15]               | [Bong, ISSCC'17]                     |
|-------------------|------------------------------------------------|-------------------|-----------------------------------|--------------------------------------|
| Technology        | 0.13 μm 1P4M                                   | 0.18 μm 1P4M      | 0.18 μm                           | 65 nm 1P8M                           |
| Resolution        | 320x240                                        | 256x256           | 32x32                             | 320x240                              |
| Pixel Size        | 5 μm x 5 μm                                    | 5.9 μm x 5.9 μm   | 31 μm x 26 μm                     | 7 μm x 7 μm                          |
| Fill Factor       | 60.4%                                          | 30%               | 24%                               | -                                    |
| Feature Type      | log-gradients                                  | linear HOGs       | relative ratios<br>between pixels | linear Haar-like<br>w/ face-detector |
| Frame Rate        | 30 fps nom.<br>207 fps max                     | 15 fps - reported | 9756 fps nom.<br>24000 fps max    | 1 fps - reported                     |
| Dynamic Range     | 59.3 dB <sup>1</sup>                           | 54.8 dB           | 43 dB <sup>2</sup>                | -                                    |
| Power Consumption | 229 µW @ 30fps                                 | 51 µW @ 15 fps    | 4 mW @ 9765 fps                   | 24-96 µW @ 1fps                      |
| Energy Efficiency | 1.5-bit: 99 pJ/pixel<br>2.75-bit: 114 pJ/pixel | 52 pJ/pixel       | 404 pJ/pixel                      | 312 - 1250 pJ/pixel                  |
| Multi-Scale       | Yes -<br>arbitrary square<br>bins              | No                | No                                | Yes - three scales                   |

1. At output of cyclic row buffer, without RDC
2. Pixel-to-pixel dynamic range
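The energy-efficiency row can be cross-checked from the power, frame rate, and resolution rows, assuming energy per pixel is simply power divided by pixels processed per second. The helper name is hypothetical; the input numbers are taken from the table.

```python
def energy_per_pixel_pj(power_w, fps, width, height):
    """Energy per pixel per frame in picojoules: P / (fps * pixels)."""
    return power_w / (fps * width * height) * 1e12

designs = {
    "This work (1.5-bit)": (229e-6, 30, 320, 240),
    "Choi, ISSCC'13": (51e-6, 15, 256, 256),
    "Bong, ISSCC'17 (24 uW)": (24e-6, 1, 320, 240),
    "Bong, ISSCC'17 (96 uW)": (96e-6, 1, 320, 240),
}
for name, spec in designs.items():
    print(f"{name}: {energy_per_pixel_pj(*spec):.1f} pJ/pixel")
```

Under this assumption the results round to the tabulated 99, 52, and 312 to 1250 pJ/pixel.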

#### **Information Preservation**

#### **Raw Pixels**



#### 1.5-bit Log Gradients



\*truncated from 2.75-bit

#### Reconstruction



\*courtesy Julien Martel

### Use Log Gradients as ConvNet Input?



Ongoing work; comparable performance using ResNet-10 (PascalRAW dataset)

### Can We Play Mixed-Signal Tricks in a ConvNet?




### **BinaryNet**

- Courbariaux et al., NIPS 2016
- Weights and activations constrained to +1 and -1, multiplication becomes XNOR
- Minimizes D/A and A/D overhead
- Nice option for small/medium-size problems and mixed-signal exploration
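To make the XNOR claim concrete, the sketch below checks that a dot product of +1/-1 vectors equals `2 * popcount(XNOR) - N` when +1 is encoded as bit 1 and -1 as bit 0. The helper names and the bit-packing order are assumptions made for this example.

```python
def binary_dot(weights, activations):
    """Reference: ordinary dot product of two +1/-1 vectors."""
    return sum(w * x for w, x in zip(weights, activations))

def encode(values):
    """Pack a +1/-1 vector into an integer, +1 -> bit 1, -1 -> bit 0."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(weight_bits, activation_bits, n):
    """Same dot product using XNOR and a population count over n bits."""
    mask = (1 << n) - 1
    xnor = ~(weight_bits ^ activation_bits) & mask   # 1 where the signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n                            # agreements minus disagreements

w = [1, -1, 1, 1]
x = [1, 1, -1, 1]
print(binary_dot(w, x), xnor_popcount_dot(encode(w), encode(x), len(w)))
```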



# Mixed-Signal Binary CNN Processor

1. Binary CNN with "CMOS-inspired" topology, engineered for minimal circuit-level path loading
2. Hardware architecture amortizes memory access across many computations, with all memory on chip (328 KB)
3. Energy-efficient switched-capacitor neuron for wide vector summation, replacing digital adder tree



Bankman et al., ISSCC 2018

# **Original BinaryNet Topology**



- 1.67 MB weight memory (68% FC layers)
- 27.9 mJ/classification with FPGA



Zhao et al., FPGA 2017

## Mixed-Signal BinaryNet Topology




- Sacrificed accuracy for regularity and energy efficiency
- 86.05% accuracy on CIFAR-10
- 328 KB weight memory
- 3.8 μJ per classification

Neuron




### Naïve Sequential Computation



#### Weight-Stationary
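One generic way to see the benefit of a weight-stationary dataflow is to count weight-memory reads. This is a simplified accounting model, not the chip's actual schedule: 1024 weights per filter matches the 1024 weights x inputs of the neuron, while the filter count and number of output positions are hypothetical.

```python
def weight_fetches_naive(num_filters, weights_per_filter, num_positions):
    """Every output value re-reads its filter's weights from memory."""
    return num_filters * weights_per_filter * num_positions

def weight_fetches_stationary(num_filters, weights_per_filter, num_positions):
    """Each filter is loaded once and held while the input window slides."""
    return num_filters * weights_per_filter

# Hypothetical layer: 256 filters, 1024 weights each, 100 output positions.
naive = weight_fetches_naive(256, 1024, 100)
stationary = weight_fetches_stationary(256, 1024, 100)
print(naive, stationary, naive // stationary)
```

In this model the saving equals the number of output positions over which each loaded weight is reused.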



#### Weight-Stationary and Data-Parallel



#### **Complete Architecture**



#### **Neuron Function**



#### Switched-Capacitor Implementation



 Batch normalization folded into weight signs and bias
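The folding can be sketched as follows, assuming a standard batch-normalization form `gamma * (z - mean) / std + beta` followed by a sign activation, with `gamma != 0` and `std > 0`. The function names are hypothetical, and the real-valued bias here ignores the quantization a hardware bias would have.

```python
def sign(v):
    return 1 if v >= 0 else -1

def neuron_with_batchnorm(weights, inputs, gamma, beta, mean, std):
    """Binary neuron: weighted sum, batch norm, then sign."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return sign(gamma * (z - mean) / std + beta)

def fold_batchnorm(weights, gamma, beta, mean, std):
    """Fold batch norm into weight signs and a single bias term.

    gamma/std only scales the argument of sign(); its sign s flips the weights,
    and the remaining offset becomes the bias.
    """
    s = 1 if gamma > 0 else -1
    folded_weights = [s * w for w in weights]
    folded_bias = s * (beta * std / gamma - mean)
    return folded_weights, folded_bias

def folded_neuron(folded_weights, folded_bias, inputs):
    return sign(sum(w * x for w, x in zip(folded_weights, inputs)) + folded_bias)

w, x = [1, -1, 1, 1], [1, 1, -1, 1]
fw, fb = fold_batchnorm(w, gamma=-0.5, beta=0.3, mean=0.2, std=2.0)
print(neuron_with_batchnorm(w, x, -0.5, 0.3, 0.2, 2.0), folded_neuron(fw, fb, x))
```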

Recoverable labels from the neuron schematic:

- Weights x inputs: 1024b thermometer section; unit capacitors $C_u$ driven by $W_0 X_0, W_1 X_1, W_2 X_2, \ldots, W_{1023} X_{1023}$
- Bias & offset cal.: binary-weighted section; capacitors $2^{-1}C_u, \ldots, 2^{6}C_u$ with control signals $m_0 s, m_0 \overline{s}, \ldots, m_7 s, m_7 \overline{s}$
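An idealized, single-ended view of the summation in the weights x inputs section: with equal unit capacitors, charge redistribution gives a top-plate voltage equal to the average of the bottom-plate voltages, `sum(C_u * V_i) / sum(C_u)`. The sketch ignores the bias section, parasitics, noise, offset, and mismatch, and the function names are hypothetical. The 0.6 V default is the neuron-array supply quoted in the supply-voltage list.

```python
def sc_neuron(xnor_bits, v_dd=0.6):
    """Return (top-plate voltage, decision) for ideal, equal unit capacitors.

    Each bit is one XNOR result (weight x input); bit 1 drives its capacitor
    to v_dd, bit 0 drives it to ground.
    """
    fraction_high = sum(xnor_bits) / len(xnor_bits)
    v_top = v_dd * fraction_high                    # charge-weighted average
    decision = 1 if v_top >= v_dd * 0.5 else -1     # compare to mid-rail; ties -> +1
    return v_top, decision

def digital_neuron(xnor_bits):
    """Reference: sign of the +1/-1 sum that an adder tree would compute."""
    total = sum(1 if b else -1 for b in xnor_bits)
    return 1 if total >= 0 else -1

bits = [1, 1, 0, 1]
print(sc_neuron(bits), digital_neuron(bits))
```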

#### **Behavioral Simulations**



Significant margin in noise, offset, and mismatch (V<sub>FS</sub> = 460 mV)

### "Memory-Cell-Like" Processing Element





Standard-cell-based, 42 transistors, 24107 F<sup>2</sup>



1 fF metal-oxide-metal fringe capacitor



#### **Die Photo**

- TSMC 28nm HPL 1P8M
- 6 mm<sup>2</sup> area
- 328 KB SRAM
- 10 MHz clock

# Supply Voltages

- $V_{DD}$: digital logic, 0.6 V to 1.0 V
- $V_{MEM}$: SRAM, 0.53 V to 1.0 V
- $V_{NEU}$: neuron array, 0.6 V
- $V_{COMP}$: comparators, 0.8 V



#### **Measured Classification Accuracy**

- 10 chips, 180 runs each through 10,000 CIFAR-10 test images
- $V_{DD} = 0.8V, V_{MEM} = 0.8V$
- 3.8 µJ/classification
- 237 FPS, 899 μW
- 0.43 µJ in 1.8V I/O
- Mean accuracy µ = 86.05%, same as perfect digital model
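The energy, power, and frame-rate figures in this list are mutually consistent if energy per classification is taken as average power divided by classifications per second. A one-line check, with a hypothetical helper name:

```python
def energy_per_classification_uj(power_uw, fps):
    """Energy per classification in microjoules: average power / frame rate."""
    return power_uw / fps

print(f"{energy_per_classification_uj(899, 237):.2f} uJ")
```

This gives roughly 3.8 µJ, matching the quoted value.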



# Comparison to Synthesized Digital

#### Synthesized Digital



#### BinarEye (Moons et al., CICC 2018)

#### **Mixed-Signal**



## Digital vs. Mixed-Signal Binary CNN Processor



# CIFAR-10 Energy vs. Accuracy



- Neuromorphic
  - > [1] TrueNorth, Esser PNAS 2016
- GPU
  - > [2] Zhao FPGA 2017
- FPGA
  - > [2] Zhao FPGA 2017
  - > [3] Umuroglu FPGA 2017
- MCU
  - > [4] CMSIS-NN, Lai arXiv 2018
- Memory-like, mixed-signal
  - > [5] Bankman ISSCC 2018
- BinarEye, digital
  - > [6] Moons CICC 2018
- In-memory, mixed-signal
  - > [7] Jia arXiv 2018

\*energy excludes off-chip DRAM

### Limitations of Mixed-Signal BinaryNet

- Poor programmability
- Relatively limited accuracy (even on CIFAR-10) due to 1b arithmetic
- Energy advantage over customized digital is not revolutionary
  - > Same SRAM, essentially same dataflow
- Need a more "analog" memory system to unleash larger gains
  - > In-memory computing


# BinaryNet Synapse versus Resistive RAM





**BinaryNet synapse**

- 0.93 fJ per 1b-MAC in 28 nm
- 24107 F<sup>2</sup>
- Single-bit

**Resistive RAM**

- Energy per MAC: TBD
- 25 F<sup>2</sup>
- Multi-bit (?)



#### Matrix-Vector Multiplication with Resistive Cells

NVM




Typically use two cells to achieve pos/neg weights (other schemes possible)
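A minimal sketch of the two-cell scheme under ideal assumptions (linear cells, no wire resistance, no read noise): each weight is stored as a conductance pair, and each column current is the sum over rows of `V_i * (G+ - G-)`. The conductance range and function names are hypothetical.

```python
def weights_to_conductances(weights, g_min=1e-6, g_max=1e-5):
    """Map weights in [-1, 1] to (G+, G-) pairs so that G+ - G- = w * (g_max - g_min)."""
    span = g_max - g_min
    g_pos = [[g_min + max(w, 0.0) * span for w in row] for row in weights]
    g_neg = [[g_min + max(-w, 0.0) * span for w in row] for row in weights]
    return g_pos, g_neg

def crossbar_mvm(g_pos, g_neg, voltages):
    """Column currents I_j = sum_i V_i * (G+_ij - G-_ij), from Ohm's and Kirchhoff's laws."""
    num_cols = len(g_pos[0])
    return [
        sum(v * (g_pos[i][j] - g_neg[i][j]) for i, v in enumerate(voltages))
        for j in range(num_cols)
    ]

weights = [[1.0, -1.0], [0.5, 0.0]]      # rows = inputs, columns = outputs
g_pos, g_neg = weights_to_conductances(weights)
print(crossbar_mvm(g_pos, g_neg, [0.1, 0.2]))
```

Dividing the column currents by `g_max - g_min` recovers the ordinary matrix-vector product of the weights and input voltages.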

Tsai, 2018



### **Ongoing Research**

- What is the best architecture?
- How many levels can be stored in each cell?
- What is the most efficient readout?
- Can we cope with nonidealities using training techniques?











### VGG-7 Experiment (4.8 Million Parameters)



#### Energy Model for Column in Conv6 Layer





#### Summary

- Analog feature extraction is attractive for wake-up detectors
- Adding analog compute in ConvNets can be beneficial when it simultaneously lets us reduce data movement
  - > In-memory analog compute looks most promising
  - > Can consider SRAM or emerging memories (e.g. RRAM)
- Expect significant progress as more application drivers for "machine learning at the edge" emerge