# Advances and Prospects for In-memory Computing



Naveen Verma (<u>nverma@princeton.edu</u>), L.-Y. Chen, P. Deaville, H. Jia, J. Lee, M. Ozatay, R. Pathak, Y. Tang, H. Valavi, B. Zhang, J. Zhang

Dec. 13, 2019

## The memory wall

Separating memory from compute fundamentally raises a communication cost



More data  $\rightarrow$  bigger array  $\rightarrow$  larger comm. distance  $\rightarrow$  more comm. energy


## So, we should <u>amortize</u> data movement

- **Reuse accessed data for compute operations**
- **Specialized (memory-compute integrated) architectures**






## In-memory computing (IMC)





- In SRAM mode, matrix A stored in bit cells <u>row-by-row</u>
- In IMC mode, many WLs driven <u>simultaneously</u>
  - $\rightarrow$  amortize comm. cost inside array
- Can apply to different memory technologies
  - $\rightarrow$  enhanced scalability
  - $\rightarrow$  embedded non-volatility

[J. Zhang, VLSI'16][J. Zhang, JSSC'17]
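The difference between the two access modes can be sketched in a few lines of Python. This is an idealized, noise-free illustration that only counts array accesses; the function names `sram_mode_mvm` and `imc_mode_mvm` and the small example matrix are hypothetical and do not model any circuit from the cited work.

```python
def sram_mode_mvm(A, x):
    """Read matrix A one row (word line) at a time, then compute outside the array."""
    accesses = 0
    y = [0] * len(A[0])
    for row, x_i in zip(A, x):          # one array access per stored row
        accesses += 1
        for j, a_ij in enumerate(row):
            y[j] += a_ij * x_i
    return y, accesses


def imc_mode_mvm(A, x):
    """Drive all word lines with x at once; each column accumulates a_ij * x_i."""
    n_rows, n_cols = len(A), len(A[0])
    y = [sum(A[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]
    return y, 1                         # a single array access yields every column sum


A = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]                         # 4 word lines (rows) x 3 bit lines (columns)
x = [1, 0, 1, 1]                        # input vector applied on the word lines

y_sram, n_sram = sram_mode_mvm(A, x)
y_imc, n_imc = imc_mode_mvm(A, x)
print(y_sram, n_sram)
print(y_imc, n_imc)
```

Both modes return the same column sums; the sketch differs only in how many array accesses are needed, which is the amortization the bullets describe.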

## The basic tradeoffs

<u>CONSIDER</u>: Accessing *D* bits of data associated with computation, from array with  $\sqrt{D}$  columns  $\times \sqrt{D}$  rows.



[Figure: Memory & Computation ($D^{1/2} \times D^{1/2}$ array)]

| Metric    | Traditional | In-memory         |
|-----------|-------------|-------------------|
| Bandwidth | $1/D^{1/2}$ | $1$               |
| Latency   | $D$         | $1$               |
| Energy    | $D^{3/2}$   | $\sim D$          |
| SNR       | $1$         | $\sim 1/D^{1/2}$  |

- IMC benefits energy/delay at cost of SNR
- SNR-focused systems design is critical (circuits, architectures, algorithms)
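To make the scaling concrete, the sketch below evaluates the relative (unitless) expressions from the table for two array sizes. The helper `access_scaling` is a hypothetical name, and all constants of proportionality are ignored, so only the ratios between the two columns are meaningful.

```python
import math


def access_scaling(D):
    """Relative scaling of each metric for D bits in a sqrt(D) x sqrt(D) array."""
    root = math.sqrt(D)
    return {
        "Bandwidth": {"traditional": 1 / root, "in_memory": 1.0},
        "Latency":   {"traditional": float(D), "in_memory": 1.0},
        "Energy":    {"traditional": D ** 1.5, "in_memory": float(D)},
        "SNR":       {"traditional": 1.0,      "in_memory": 1 / root},
    }


for D in (1024, 1024 * 1024):
    t = access_scaling(D)
    energy_gain = t["Energy"]["traditional"] / t["Energy"]["in_memory"]  # ~ sqrt(D)
    snr_loss = t["SNR"]["traditional"] / t["SNR"]["in_memory"]           # ~ sqrt(D)
    print(f"D={D}: energy gain ~{energy_gain:.0f}x, SNR reduction ~{snr_loss:.0f}x")
```

Under these expressions the energy advantage and the SNR penalty both grow as $D^{1/2}$, which is why the tradeoff sharpens as arrays get larger.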

## **IMC** as a spatial architecture

### Assume:

- 1k dimensionality
- 4-b multiplies
- 45nm CMOS



| Operation      | Digital-PE Energy (fJ) | Bit-cell Energy (fJ) |
|----------------|------------------------|----------------------|
| Storage        | 250                    |                      |
| Multiplication | 100                    | 50                   |
| Accumulation   | 200                    |                      |
| Communication  | 40                     | 5                    |
| Total          | 590                    | 55                   |
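The totals in the table can be checked with a short calculation. The dictionaries below simply copy the table's entries; treating the empty bit-cell entries (storage, accumulation) as not separately itemized is an assumption of this sketch.

```python
# Per-operation energies in fJ, copied from the table (1k dimensionality, 4-b, 45nm).
digital_pe = {"storage": 250, "multiplication": 100, "accumulation": 200, "communication": 40}
# Assumption: empty bit-cell entries are omitted rather than treated as measured zeros.
bit_cell = {"multiplication": 50, "communication": 5}

total_digital = sum(digital_pe.values())
total_bit_cell = sum(bit_cell.values())
ratio = total_digital / total_bit_cell

print(total_digital, total_bit_cell, round(ratio, 1))
```

With the table's numbers this gives 590 fJ versus 55 fJ, a ratio of roughly 10.7×.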

## Where does IMC stand today?

- Potential for 10× higher efficiency & throughput
- Limited scale, robustness, configurability



## **IMC Challenge (1):** analog computation

Need analog to 'fit' compute in bit cells (SNR limited by analog non-idealities)
 → Must be feasible/competitive @ 16/12/7nm



## **IMC Challenge (2):** heterogeneous computing

Matrix-vector multiply is only 70-90% of operations
 → IMC must integrate in programmable, heterogeneous architectures



## **IMC Challenge (3):** efficient application mappings

### IMC engines must be 'virtualized'

- $\rightarrow$  IMC amortizes MVM costs, not weight loading. But...
- $\rightarrow$  Need new mapping algorithms (physical tradeoffs very different than digital engines)

#### **Activation Accessing**

- $E_{\text{DRAM} \rightarrow \text{IMC}}$/4-bit: 40 pJ
- Reuse:  $N \times I \times J$  (10-20 layers)
- $E_{\text{MAC,4-b}}$: 50 fJ

#### **Weight Accessing**

- $E_{\text{DRAM} \rightarrow \text{IMC}}$/4-bit: 40 pJ
- Reuse:  $X \times Y$
- $E_{\text{MAC,4-b}}$: 50 fJ
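The energy figures above imply how much reuse is needed before loading stops dominating. The sketch below is a simplified model: it treats reuse as a single number (the slide's factors $N \times I \times J$ and $X \times Y$ are not expanded here) and assumes one 4-bit operand load is amortized evenly over that many MACs. The name `energy_per_mac_fj` is hypothetical.

```python
E_LOAD_FJ = 40_000.0   # E_DRAM->IMC per 4-bit operand: 40 pJ, expressed in fJ
E_MAC_FJ = 50.0        # E_MAC,4-b: 50 fJ


def energy_per_mac_fj(reuse):
    """Effective energy per MAC when one loaded operand is reused `reuse` times."""
    return E_MAC_FJ + E_LOAD_FJ / reuse


# Reuse at which the amortized load energy equals the MAC energy itself.
break_even = E_LOAD_FJ / E_MAC_FJ
print(f"break-even reuse: {break_even:.0f}")

for reuse in (1, 100, 800, 10_000):
    print(reuse, energy_per_mac_fj(reuse))
```

Under this model, each loaded operand must be reused on the order of 800 times before the load energy falls to the level of the MAC energy, which is why the mapping of layers onto the array matters.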



## Low-SNR computation via algorithmic co-design



## **Emerging non-volatile memory (NVM)**

- 2-terminal resistive memory provides better scaling at advanced nodes



[Figure: x-axis: CMOS Technology Node (nm); source: https://www.spinmemory.com/technologies/mram-overview/]

- ... BUT, leads to significantly reduced SNR





#### **Magnetic RAM (MRAM)**

Globalfoundries, 22nm [D. Shum, VLSI'17]

#### **Resistive RAM (RRAM)**

TSMC 16nm [H. W. Pan, IEDM'15]

[Figure: HRS and LRS states vs. Read Current (A)]

## **Chip-specific parameter training**







## **Chip-generalized parameter training**



## High-SNR, charge-domain IMC



- Capacitors provide much better controllability & technology scalability  $\rightarrow$  tens of thousands of IMC rows before SNR is capacitor limited

[H. Valavi, VLSI'18] [H. Valavi, JSSC'19]

[Figure: die photo (8x8 neuron tiles, 4.3 mm dimension) and measured neuron transfer function / activation function. Axes: batch-normalized pre-activation value (-4608 to +4608) and 6-b switching threshold (6'd0 to 6'd63). Error bars show sigma over 512 (3x3x512) on-chip neurons.]

|                       | Moons, ISSCC'17 | Bang, ISSCC'17 | Ando, VLSI'17 | Bankman, ISSCC'18 | Valavi, VLSI'18 |
|-----------------------|-----------------|----------------|---------------|-------------------|-----------------|
| Technology            | 28nm            | 40nm           | 65nm          | 28nm              | 65nm            |
| Area (mm<sup>2</sup>) | 1.87            | 7.1            | 12            | 6                 | 17.6            |
| Operating VDD         | 1               | 0.63-0.9       | 0.55-1        | 0.8/0.8 (0.6/0.5) | 0.94/0.68/1.2   |
| Bit precision         | 4-16b           | 6-32b          | 1b            | 1b                | 1b              |
| On-chip memory        | 128kB           | 270kB          | 100kB         | 328kB             | 295kB           |
| Throughput (GOPS)     | 400             | 108            | 1264          | 400 (60)          | 18,876          |
| TOPS/W                | 10              | 0.384          | 6             | 532 (772)         | 866             |

- 10-layer CNN demos for MNIST/CIFAR-10/SVHN at energies of 0.8/3.55/3.55 µJ/image
- Equivalent performance to software implementation

[H. Valavi, VLSI'18]

## 2.4Mb, 64-tile IMC

## **Programmable heterogeneous IMC processor**



[H. Jia, arXiv:1811.04047] [H. Jia, HotChips'19]

## **Bit-Parallel/Bit-Serial (BP/BS) Multi-bit IMC**
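The slide names the BP/BS scheme without spelling it out. The following is a generic bit-slicing sketch, not the chip's implementation: it assumes unsigned operands, assigns weight bits to parallel columns and input bits to serial steps, and ignores ADC quantization and analog noise. The function `bpbs_dot` is hypothetical; `nb_weight` and `nb_input` echo the parameter names shown on the software-libraries slide.

```python
def bit(v, k):
    """Return bit k (0 = LSB) of the non-negative integer v."""
    return (v >> k) & 1


def bpbs_dot(weights, inputs, nb_weight=4, nb_input=4):
    """Dot product built from 1-b column sums, recombined by shift-and-add."""
    total = 0
    for kx in range(nb_input):          # bit-serial: one input bit plane per step
        for kw in range(nb_weight):     # bit-parallel: one column per weight bit
            column_sum = sum(bit(w, kw) & bit(x, kx) for w, x in zip(weights, inputs))
            total += column_sum << (kw + kx)   # scale by the combined bit significance
    return total


weights = [3, 15, 7, 0, 9]              # 4-b unsigned example values
inputs = [1, 2, 14, 5, 8]

result = bpbs_dot(weights, inputs)
reference = sum(w * x for w, x in zip(weights, inputs))
print(result, reference)
```

Each inner `column_sum` is a 1-b by 1-b accumulation, the kind of operation a single column of bit cells could produce in one step; the multi-bit result comes entirely from the digital shift-and-add around it.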



## **Development board**




## **Software libraries**

| <u>1. Deep-learning Training Libraries</u> (Keras) | <u>2. Deep-learning Inference Libraries</u> (Python, MATLAB, C) |
|----------------------------------------------------|------------------------------------------------------------------|
| Standard Keras libs:<br><pre>Dense(units, ...)<br/>Conv2D(filters, kernel_size, ...)</pre> | High-level network build (Python):<br><pre>chip_mode = True<br/>outputs = QuantizedConv2D(inputs, layer_params)</pre> |
| Custom libs (INT/CHIP quant.):<br><pre>QuantizedDense(units, nb_input=4, nb_weight=4, ...)<br/>...<br/>QuantizedDense(units, nb_input=4, nb_weight=4, ...)</pre> | Function calls to chip (Python):<br><pre>chip.load_config(num_tiles, nb_input=4, ...)</pre> |
|  | Embedded C:<br><pre>chip_command = get_uart_word();<br/>chip_config();<br/>load_weights(); load_image();<br/>image_filter(chip_command);<br/>read_dotprod_result(image_filter_command);</pre> |

## **Demonstrations**





**Neural-Network Demonstrations**

|                                   | Network A (4/4-b activations/weights) | Network B (1/1-b activations/weights) |
|-----------------------------------|---------------------------------------|---------------------------------------|
| Accuracy of chip (vs. ideal)      | 92.4% (vs. 92.7%)                     | 89.3% (vs. 89.8%)                     |
| Energy/10-way Class.<sup>1</sup>  | 105.2 µJ                              | 5.31 µJ                               |
| Throughput<sup>1</sup>            | 23 images/sec.                        | 176 images/sec.                       |
| Neural Network Topology           | L1: 128 CONV3 – Batch norm.<br>L2: 128 CONV3 – POOL – Batch norm.<br>L3: 256 CONV3 – Batch norm.<br>L4: 256 CONV3 – POOL – Batch norm.<br>L5: 256 CONV3 – Batch norm.<br>L6: 256 CONV3 – POOL – Batch norm.<br>L7-8: 1024 FC – Batch norm.<br>L9: 10 FC – Batch norm. | L1: 128 CONV3 – Batch norm.<br>L2: 128 CONV3 – POOL – Batch norm.<br>L3: 256 CONV3 – Batch norm.<br>L4: 256 CONV3 – POOL – Batch norm.<br>L5: 256 CONV3 – Batch norm.<br>L6: 256 CONV3 – POOL – Batch norm.<br>L7-8: 1024 FC – Batch norm.<br>L9: 10 FC – Batch norm. |

## **Conclusions & summary**

- Matrix-vector multiplies (MVMs) are a little different than other computations → high-dimensionality operands lead to data movement / memory accessing
- Bit cells make for dense, energy-efficient PEs in spatial array → but require analog operation to fit compute, and impose SNR tradeoff
- Must focus on SNR tradeoff to enable scaling (technology/platform levels) and architectural integration
- In-memory computing greatly affects the architectural tradeoffs, requiring new strategies for mapping applications

Acknowledgements: funding provided by ADI, DARPA, SRC/STARnet