

## Algorithm-Hardware Co-design for Deformable Convolution

**Qijing Huang**\*, Dequan Wang\*, Yizhao Gao<sup>†</sup>, Yaohui Cai<sup>‡</sup>, Zhen Dong, Bichen Wu, Kurt Keutzer, John Wawrzynek

University of California, Berkeley; <sup>†</sup>University of Chinese Academy of Sciences; <sup>‡</sup>Peking University



EMC2 Workshop @ NeurIPS 2019

### Motivation

- **Deformable Convolution** is an input-adaptive dynamic operation that samples inputs from variable spatial locations
- Its sampling locations vary with:
  - Different inputs
  - Different output pixels
- It captures variations in:
  - Scales
  - Aspect ratios
  - Rotation angles
- Challenges:
  - Increased …
  - Irregular …
    - Not friendly …
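To make the sampling behaviour described above concrete, here is a minimal pure-Python sketch of one output pixel of a single-channel 3×3 deformable convolution. It is an illustration under stated assumptions, not the authors' implementation: the function names, the per-tap `(dy, dx)` offset layout, and the zero padding outside the image are all hypothetical choices.

```python
import math

def bilinear_sample(img, y, x):
    """Sample a 2-D list `img` at fractional (y, x); out-of-range taps read as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0
    value = 0.0
    # Four neighbouring pixels, each weighted by its closeness to (y, x).
    for yy, wy in ((y0, 1.0 - dy), (y0 + 1, dy)):
        for xx, wx in ((x0, 1.0 - dx), (x0 + 1, dx)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * img[yy][xx]
    return value

def deform_conv3x3_at(img, weights, offsets, oy, ox):
    """One output pixel of a single-channel 3x3 deformable convolution.

    offsets[k] = (dy, dx) displaces kernel tap k for this output pixel
    (hypothetical layout; real layers predict these from the input).
    """
    acc = 0.0
    for k in range(9):
        ky, kx = divmod(k, 3)  # position of tap k on the regular 3x3 grid
        dy, dx = offsets[k]
        acc += weights[k] * bilinear_sample(img, oy + ky - 1 + dy, ox + kx - 1 + dx)
    return acc

# A 5x5 ramp image: img[y][x] = 5*y + x.
img = [[5 * y + x for x in range(5)] for y in range(5)]
ones = [1.0] * 9
regular = deform_conv3x3_at(img, ones, [(0.0, 0.0)] * 9, 2, 2)   # plain 3x3 sum
shifted = deform_conv3x3_at(img, ones, [(0.5, 0.5)] * 9, 2, 2)   # every tap moved
```

With all offsets at zero the function reduces to a regular 3×3 convolution. Non-zero offsets move each tap independently, which is what makes the memory access pattern depend on the data.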



Figure: Sampling locations (in red) for different output pixels.



### Algorithm-Hardware Codesign

#### Algorithm Modification:



0. Original Deformable

Accuracy<sup>1</sup> (mIoU ↑): **79.9**

#### Hardware Optimization:



- Preloads weights to on-chip buffer
- Loads input and offsets directly from DRAM



### Algorithm-Hardware Codesign

#### Algorithm Modification:

Accuracy<sup>1</sup> (mIoU ↑): **79.6**

#### Hardware Optimization:

- Reduces the computation for bilinear interpolation
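The poster text does not say how the interpolation cost is reduced. One common way, assumed here purely for illustration, is to round each sampling coordinate to the nearest integer, so a tap becomes a single lookup instead of a four-pixel weighted sum. The function name and the zero-padding convention below are hypothetical.

```python
import math

def sample_rounded(img, y, x):
    """Nearest-integer sampling: one read, no interpolation weights.

    Uses round-half-up (floor(v + 0.5)) rather than Python's round(),
    which rounds halves to the nearest even integer.
    """
    h, w = len(img), len(img[0])
    yy, xx = math.floor(y + 0.5), math.floor(x + 0.5)
    if 0 <= yy < h and 0 <= xx < w:
        return img[yy][xx]
    return 0.0  # assumed zero padding outside the image

# A 5x5 ramp image: img[y][x] = 5*y + x.
img = [[5 * y + x for x in range(5)] for y in range(5)]
value = sample_rounded(img, 1.4, 2.6)  # reads img[1][3]
```

Under this assumption the sampled value is an approximation of the bilinear result, traded for fewer reads and multiplies per tap.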



### Algorithm-Hardware Codesign

#### Algorithm Modification:

#### Hardware Optimization:

- Buffers inputs in the on-chip line buffer to allow spatial reuse
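The line-buffer idea can be modelled in a few lines of Python. This is a software sketch, not the hardware design: it assumes integer offsets bounded to `[-bound, bound]`, so that every tap of a `kernel`-row window falls within `kernel + 2 * bound` consecutive input rows. The class name, the `dram_reads` counter, and the `rows_needed` helper are hypothetical.

```python
from collections import deque

def rows_needed(kernel=3, bound=2):
    """Rows that must stay resident so every bounded integer tap hits the buffer."""
    return kernel + 2 * bound

class LineBuffer:
    """Keeps the most recent `depth` input rows on chip (software model)."""

    def __init__(self, depth):
        self.rows = deque(maxlen=depth)  # the oldest row is evicted automatically
        self.first_row = 0               # image row index held in rows[0]
        self.dram_reads = 0              # pixels fetched from off-chip memory

    def push(self, row):
        if len(self.rows) == self.rows.maxlen:
            self.first_row += 1          # appending will evict the oldest row
        self.rows.append(list(row))
        self.dram_reads += len(row)      # each pixel is fetched exactly once

    def read(self, y, x):
        idx = y - self.first_row
        if not 0 <= idx < len(self.rows):
            raise IndexError("row is not resident in the line buffer")
        return self.rows[idx][x]

# Stream a 7-row, 4-column image through a buffer sized for bound = 1.
lb = LineBuffer(rows_needed(kernel=3, bound=1))
for y in range(7):
    lb.push([10 * y + x for x in range(4)])
```

Because each input pixel enters the buffer once and can then be read by any tap that lands inside the resident rows, neighbouring output pixels reuse the same on-chip data instead of issuing separate DRAM requests.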



### Results

#### Hardware Performance

| Operation | Original | Deformable | Bound (buffered) | Square (multi-ported) | Latency without LLC (ms) | GOPs without LLC | Latency with LLC (ms) | GOPs with LLC |
|-----------|----------|------------|------------------|-----------------------|--------------------------|------------------|-----------------------|---------------|
| Full 3×3 Conv | $\checkmark$ | | | | 43.1 | 112.0 | 41.6 | 116.2 |
| | | $\checkmark$ | | | 59.0 | 81.8 | 42.7 | 113.1 |
| | | $\checkmark$ | $\checkmark$ | | 43.4 | 111.5 | 41.8 | 115.5 |
| | | $\checkmark$ | $\checkmark$ | $\checkmark$ | 43.4 | 111.5 | 41.8 | 115.6 |
| Depthwise 3×3 Conv | $\checkmark$ | | | | 1.9 | 9.7 | 2.0 | 9.6 |
| | | $\checkmark$ | | | 20.5 | 0.9 | 17.8 | 1.1 |
| | | $\checkmark$ | $\checkmark$ | | 3.0 | 6.2 | 3.4 | 5.5 |
| | | $\checkmark$ | $\checkmark$ | $\checkmark$ | 2.1 | 9.2 | 2.3 | 8.2 |
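The latencies in the table above can be compared directly. The short calculation below takes the "without LLC" latencies of the unmodified deformable rows and of the rows with both Bound and Square enabled, and computes the ratio. The dictionary keys are labels chosen for this sketch.

```python
# Latencies in ms, copied from the "without LLC" column of the table above.
latency_ms = {
    "full":      {"deformable": 59.0, "bound_square": 43.4},
    "depthwise": {"deformable": 20.5, "bound_square": 2.1},
}

def speedup(op):
    """Ratio of unmodified deformable latency to the Bound + Square latency."""
    row = latency_ms[op]
    return row["deformable"] / row["bound_square"]

print(round(speedup("full"), 2))       # 1.36
print(round(speedup("depthwise"), 2))  # 9.76
```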

The co-design methodology for the deformable convolution achieves 1.36× and 9.76× speedup for the full and depthwise deformable convolution on FPGA, respectively.

ShuffleNetV2, Deform Conv, Depthwise: 68.0

Email: qijing.huang@berkeley.edu

