Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment

Jingyang Zhang<sup>1</sup>, Huanrui Yang<sup>1</sup>, Fan Chen<sup>1</sup>, Yitu Wang<sup>2</sup>, Hai Li<sup>1</sup>

<sup>1</sup>Duke University, <sup>2</sup>Fudan University

EMC2 Workshop @ NeurIPS 2019

## Motivation: ReRAM-based DNN accelerator



- **Two orders of magnitude advantage** in energy, performance, and chip footprint

- High-resolution ADCs account for >60% of power and >30% of area
  - ADC resolution is dictated by the accumulated currents on the bitlines: need sparsity in the conductance matrix G
  - Limited cell bit density: each crossbar (XB) holds only 2 bits (one bit-slice) of the weight
  - Need higher sparsity among the bit-slices

$\begin{bmatrix} 0 & w_1 & 0 \\ 0 & 0 & w_2 \\ w_0 & 0 & 0 \end{bmatrix}$ (weight sparsity) $\Rightarrow$ $\begin{bmatrix} 11 & 00 & 10 & 00 \end{bmatrix}_2$ (bit-slice sparsity)

- A. Canziani, A. Paszke, and E. Culurciello. "An analysis of deep neural network models for practical applications." *arXiv preprint arXiv:1605.07678*, 2016.
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars." In *Proceedings of ISCA*, 2016.

#### Bit-slice L1 for dynamic fixed-point quantization

• Dynamic range scaling (to [0,1])

$$S(W_l) = \lceil \log_2(\max_{w_l^i \in W_l}(|w_l^i|)) \rceil$$

• $n$-bit uniform quantization

$$Q_{step} = 2^{S(W_l) - n}, \quad B(w_l^i) = \lfloor \frac{w_l^i}{Q_{step}} \rfloor.$$
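The two formulas above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `quantize` and the default `n_bits=8` are assumptions (8 bits corresponds to four 2-bit slices), and the sketch applies the floor formula literally, so sign handling and clipping of the largest code are not addressed.

```python
import numpy as np

def quantize(weights, n_bits=8):
    """Return (codes, q_step) for one layer, with codes = floor(w / q_step).

    Hypothetical helper: follows S(W_l) and Q_step as written above.
    """
    weights = np.asarray(weights, dtype=np.float64)
    max_abs = np.max(np.abs(weights))
    if max_abs == 0:
        raise ValueError("layer has no non-zero weight; log2(0) is undefined")
    scale_exp = int(np.ceil(np.log2(max_abs)))       # S(W_l)
    q_step = 2.0 ** (scale_exp - n_bits)             # Q_step = 2^(S(W_l) - n)
    codes = np.floor(weights / q_step).astype(np.int64)  # B(w_l^i)
    return codes, q_step

codes, q_step = quantize([0.5, -0.25, 0.1, 0.75])
print(codes, q_step)
```

With these inputs the largest magnitude is 0.75, so $S(W_l) = 0$ and $Q_{step} = 2^{-8}$.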



• L1 regularization over all bit-slices

$$B(w_l^i) = \sum_{k=0}^3 \hat{B}_l^{i,k} \cdot 2^{2k}, \quad B\ell_1(W_l) := \sum_{i,k} \hat{B}_l^{i,k}.$$

#### Training routine

- Dynamic range recovery: $Q(w_l^i) = B(w_l^i) \cdot Q_{step}$
- Each training step
  - Forward and backward passes use the quantized weights
  - The gradient update is applied to the full-precision weights
  - The bit-slice $\ell_1$ term is added to the objective

$$q^{(t)} = Q(w_l^{(t)}),$$

$$w_l^{(t+1)} = q^{(t)} - lr \times (\nabla_q \mathcal{L}_{CE}(q^{(t)}) + \alpha \nabla_q B\ell_1(q^{(t)}))$$

[Diagram: full-precision weight $w_l^{(t)}$ → quantized weight $Q(w_l^{(t)})$ → bit-slice $\ell_1$ term $B\ell_1(W_l)$]
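One step of this routine can be sketched as below, following the update as written above. This is an illustrative sketch under stated assumptions, not the released training code: `grad_ce` and `grad_bl1` are hypothetical callables standing in for the gradients of the cross-entropy loss and of the bit-slice $\ell_1$ term at the quantized weight, and the default hyperparameter values are placeholders.

```python
import numpy as np

def quantize_dequantize(w, n_bits=8):
    """Q(w) = floor(w / Q_step) * Q_step, with Q_step = 2^(S(W_l) - n)."""
    w = np.asarray(w, dtype=np.float64)
    scale_exp = int(np.ceil(np.log2(np.max(np.abs(w)))))  # S(W_l)
    q_step = 2.0 ** (scale_exp - n_bits)
    return np.floor(w / q_step) * q_step

def train_step(w, grad_ce, grad_bl1, lr=0.1, alpha=1e-4, n_bits=8):
    """Return the next full-precision weight w^(t+1).

    grad_ce and grad_bl1 are hypothetical callables that return gradients
    evaluated at the quantized weight q.
    """
    q = quantize_dequantize(w, n_bits)                   # q^(t) = Q(w^(t))
    return q - lr * (grad_ce(q) + alpha * grad_bl1(q))   # update rule above

# Toy gradients for illustration only: a quadratic loss and a sign-based
# surrogate for the regularizer gradient (both are assumptions).
w_next = train_step(np.array([0.5, -0.25, 0.75]),
                    grad_ce=lambda q: 2.0 * q,
                    grad_bl1=lambda q: np.sign(q),
                    lr=0.1, alpha=0.01)
print(w_next)
```

The returned weight is kept in full precision and is quantized again at the start of the next step.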

## Improving the bit-slice sparsity

• Up to 2× fewer non-zero bit-slices than traditional $\ell_1$

Accuracy and ratio of non-zero weights in each bit-slice ($\hat{B}^3$ to $\hat{B}^0$):

| Method    | Accuracy | $\hat{B}^3$ | $\hat{B}^2$ | $\hat{B}^1$ | $\hat{B}^0$ | Average    |
|-----------|----------|-------------|-------------|-------------|-------------|------------|
| Pruned    | 97.99%   | 1.08%       | 5.87%       | 8.42%       | 17.42%      | 8.20±5.94% |
| $\ell_1$  | 97.99%   | 1.19%       | 5.21%       | 7.01%       | 11.36%      | 6.19±3.65% |
| $B\ell_1$ | 97.67%   | 0.84%       | 4.02%       | **4.27%**   | 9.58%       | 4.68±3.14% |

Table 2: Results on CIFAR-10 (bit-slice columns give the ratio of non-zero weights)

| Model     | Method    | Accuracy   | $\hat{B}^3$ | $\hat{B}^2$ | $\hat{B}^1$ | $\hat{B}^0$ | Average      |
|-----------|-----------|------------|-------------|-------------|-------------|-------------|--------------|
| VGG-11    | Pruned    | 88.93%     | 0.86%       | 28.30%      | 34.14%      | 33.39%      | 24.17±13.65% |
| VGG-11    | $\ell_1$  | **89.39%** | 0.39%       | 9.37%       | 18.43%      | 22.19%      | 12.59±8.45%  |
| VGG-11    | $B\ell_1$ | 89.33%     | 0.21%       | **3.57%**   | **7.09%**   | 10.71%      | 5.40±3.92%   |
| ResNet-20 | Pruned    | 89.22%     | 1.10%       | 8.07%       | 21.92%      | 43.96%      | 18.76±16.36% |
| ResNet-20 | $\ell_1$  | **90.62%** | 0.44%       | 4.71%       | 14.37%      | 33.16%      | 13.17±12.60% |
| ResNet-20 | $B\ell_1$ | 89.66%     | 0.31%       | 3.34%       | 11.99%      | 31.39%      | 11.76±12.12% |



Figure 2: Bit-slice sparsity of VGG-11 on CIFAR-10 during training.

• Code available at: https://github.com/zjysteven/bitslice_sparsity_Duke

#### Reducing ADC overhead

- High sparsity in bit-slices enables the use of low-resolution ADC
- Low resolution reduces ADC overhead
- Simulation results for mapping to 128x128 ReRAM XBs

Table 3: ADC overhead saving with bit-slice sparsity

| Crossbar     | ADC resolution w/o bit-slice sparsity | ADC resolution w/ bit-slice sparsity | Energy saving | Speedup       | Area saving |
|--------------|---------------------------------------|--------------------------------------|---------------|---------------|-------------|
| $XB_3$       | 8 bit                                 | 1 bit                                | $28.4 \times$ | $8 \times$    | $2 \times$  |
| $XB_{2,1,0}$ | 8 bit                                 | 3 bit                                | $14.2 \times$ | $2.67 \times$ | $2 \times$  |
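The link between bit-slice sparsity and ADC resolution can be illustrated with a rough worst-case estimate. This sketch is not the simulation behind Table 3: the function `required_adc_bits` is hypothetical, it assumes 1-bit inputs that are all active at once, and it ignores any accelerator-specific encoding, so it should be read only as an upper bound on the bitline sum per 128×128 crossbar tile.

```python
import numpy as np

def required_adc_bits(slice_matrix, xb_size=128):
    """Worst-case ADC bits needed over all crossbar tiles of one bit-slice.

    slice_matrix holds the slice values (0..3) of one bit-slice, laid out
    as rows x columns of the weight matrix. Assumes every input is 1, so a
    bitline accumulates the plain column sum of its tile.
    """
    slice_matrix = np.asarray(slice_matrix, dtype=np.int64)
    rows, cols = slice_matrix.shape
    worst = 0
    for r in range(0, rows, xb_size):
        for c in range(0, cols, xb_size):
            tile = slice_matrix[r:r + xb_size, c:c + xb_size]
            worst = max(worst, int(tile.sum(axis=0).max()))
    # Bits needed to represent the largest accumulated value (at least 1).
    return max(1, worst.bit_length())

sparse = np.zeros((256, 130), dtype=np.int64)
sparse[0:7, 5] = 1                  # one bitline accumulates at most 7
print(required_adc_bits(sparse))
```

Under these assumptions, a slice whose bitlines each see at most one unit cell needs a single bit, which is the intuition behind the lower resolutions listed for the sparse case.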


