

# Algorithm-Hardware Co-design for **Deformable Convolution**



Qijing Huang\*, Dequan Wang\*, Yizhao Gao<sup>†</sup>, Yaohui Cai<sup>‡</sup>, Zhen Dong, Bichen Wu, Kurt Keutzer, John Wawrzynek

University of California, Berkeley, <sup>†</sup>University of Chinese Academy of Sciences, <sup>‡</sup>Peking University

## Motivation

- **Inefficient Model Designs**: many CV tasks use large, inefficient models and operations optimized solely for accuracy
- **Limited Hardware Resources**: embedded devices have limited compute resources and strict power budgets
- **Real-time Requirements**: accelerators must guarantee a response within certain time constraints

Co-design algorithms and accelerators that satisfy embedded system constraints and fall on the Pareto curve of the accuracy-latency tradeoff.

## **Deformable Convolution**

**Deformable Convolution** is a <u>dynamic, input-adaptive</u> operation that samples inputs from variable spatial locations

- Its sampling locations vary with:
  - Different input images
  - Different output pixel locations

Figure 1. Deformable Convolution Example
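As a rough illustration of how a deformable convolution samples its inputs, the sketch below computes a single output pixel of a 3×3 deformable convolution on one channel. It assumes bilinear interpolation for fractional offsets (as in the standard deformable convolution formulation) and zero padding outside the feature map. The function names `bilinear_sample` and `deform_conv3x3_at` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at a fractional (y, x); zero outside the map."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    # Blend the four integer neighbours with bilinear weights.
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * feat[yy, xx]
    return val

def deform_conv3x3_at(feat, weight, offsets, oy, ox):
    """One output pixel of a single-channel 3x3 deformable convolution.

    offsets has shape (9, 2): one (dy, dx) pair per kernel tap. In a real
    network these would be predicted from the input feature map, so they
    differ per image and per output pixel.
    """
    out = 0.0
    for k in range(9):
        ky, kx = k // 3 - 1, k % 3 - 1  # regular 3x3 grid, values in {-1, 0, 1}
        dy, dx = offsets[k]
        out += weight[k // 3, k % 3] * bilinear_sample(feat, oy + ky + dy, ox + kx + dx)
    return out

feat = np.arange(25, dtype=float).reshape(5, 5)
weight = np.ones((3, 3))
regular = deform_conv3x3_at(feat, weight, np.zeros((9, 2)), 2, 2)
shifted = deform_conv3x3_at(feat, weight, np.tile([0.5, 0.0], (9, 1)), 2, 2)
```

With all offsets set to zero the result matches a regular 3×3 convolution at that pixel; non-zero offsets move each tap independently.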





Figure 2. Distance Distribution on 5000 images from COCO (axis label: distance)

- It captures the spatial variance of objects with different:
  - Scales



Figure 3. Deformable Convolution (panel a: sampling locations for lawn)

## **Algorithm Modifications**











Figure 4. Major Algorithm Changes

- a. **Deformable Convolution** samples inputs from variable offsets generated from the input feature
- b. **Rounded Offsets** rounds the fractional offsets to integers
- c. **Bounded Range** restricts the range of the offsets
- d. **Rectangle Shape** limits the sampling geometry to a rectangular shape
- e. **Efficient Feature Extractor** uses ShuffleNetV2 as the backbone
- f. **Depthwise Convolution** replaces the full deformable conv with a 3×3 depthwise deformable conv and a 1×1 conv
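The offset constraints in steps b to d can be sketched as simple transformations on a `(9, 2)` array of per-tap `(dy, dx)` offsets. This is a minimal sketch under assumptions: the bound value is hypothetical, and the rectangle is parameterized here by two tap spacings `(dh, dw)`, which is one plausible reading of "rectangle shape" rather than the authors' exact formulation.

```python
import numpy as np

# Regular 3x3 tap grid: one (ky, kx) pair per tap, values in {-1, 0, 1}.
GRID = np.array([(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)])

def round_offsets(offsets):
    """Step b: round fractional offsets to integers (np.rint rounds halves to even)."""
    return np.rint(offsets).astype(int)

def bound_offsets(offsets, bound):
    """Step c: restrict every offset component to [-bound, bound]."""
    return np.clip(offsets, -bound, bound)

def rectangle_offsets(dh, dw):
    """Step d (assumed form): offsets that stretch the 3x3 grid into a
    rectangle whose taps are spaced dh rows and dw columns apart.
    dh = dw = 1 gives the regular grid, i.e. all-zero offsets."""
    return GRID * np.array([dh - 1, dw - 1])

raw = np.array([[0.4, -1.6], [3.7, -5.2]])
rounded = round_offsets(raw)
bounded = bound_offsets(rounded, bound=3)   # bound=3 is a hypothetical value
rect = rectangle_offsets(2, 3)
```

Integer offsets remove the need for interpolation, a bound limits how far any tap can reach, and a rectangular shape means the nine taps are described by two numbers instead of eighteen.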

Figure 5. Hardware Engine

- a. **Baseline** loads input features with dynamic offsets from **DRAM** directly
- b. **Caching** adds an LLC to leverage temporal and spatial locality
- c. **Buffering** uses on-chip BRAM to buffer all inputs from the limited range
- d. **Parallel Ports** increases on-chip bandwidth with the constrained shape
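To see why a bounded offset range makes on-chip buffering feasible, the sketch below estimates how many input elements an output tile can touch once offsets are bounded. It assumes stride 1 and a square kernel; the tile size, bound, channel count, and element width are hypothetical values chosen for illustration, not parameters of the authors' accelerator.

```python
def buffered_window(tile_h, tile_w, bound, kernel=3):
    """Input region (rows, cols) reachable from a tile_h x tile_w output tile
    when every tap may move at most `bound` pixels from its regular position."""
    halo = kernel // 2 + bound  # regular kernel reach plus maximum offset
    return tile_h + 2 * halo, tile_w + 2 * halo

def buffer_bytes(tile_h, tile_w, bound, channels, bytes_per_elem=1, kernel=3):
    """On-chip storage needed to hold that input region for all channels."""
    rows, cols = buffered_window(tile_h, tile_w, bound, kernel)
    return rows * cols * channels * bytes_per_elem

window = buffered_window(8, 8, bound=3)
size = buffer_bytes(8, 8, bound=3, channels=64)
```

Without a bound, any tap could land anywhere in the feature map, so no fixed-size buffer would be guaranteed to hold every sampled input.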

- Results show 1.36× and 9.76× speedups for the full and depthwise deformable conv, respectively, on FPGA (Ultra96, Xilinx Zynq-MPSoC)

Table 1. Accuracy<sup>1</sup> with DLA as Feature Extractor

| Deformable   | Round        | Bound        | Square       | mIoU↑ |
|--------------|--------------|--------------|--------------|-------|
| $\checkmark$ |              |              |              | 79.9  |
| $\checkmark$ | $\checkmark$ |              |              | 79.6  |
| $\checkmark$ | $\checkmark$ | $\checkmark$ |              | 79.4  |
| $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | 78.7  |

Table 2. Accuracy<sup>1</sup> with Different Feature Extractors

| Feature Extractor        | Operation              | mIoU↑ |
|--------------------------|------------------------|-------|
| DLA                      | DeformConv             | 79.9  |
| ShuffleNetV2             | DeformConv             | 70.1  |
| ShuffleNetV2             | DeformConv + Depthwise | 68.0  |

<sup>1</sup> Accuracy for Semantic Segmentation on CityScapes

- The modifications cost 1.2 mIoU and 2.1 mIoU overall on the CityScapes semantic segmentation task for the full and depthwise deformable conv, respectively

Table 3. Codesigned Hardware Performance Comparison

| Operation                   | Original     | Deformable   | Bound (buffered) | Square (multi-ported) | Latency without LLC (ms) | GOPs without LLC | Latency with LLC (ms) | GOPs with LLC |
|-----------------------------|--------------|--------------|------------------|-----------------------|--------------------------|------------------|-----------------------|---------------|
| Full $3 \times 3$ Conv      | $\checkmark$ |              |                  |                       | 43.1                     | 112.0            | 41.6                  | 116.2         |
| Full $3 \times 3$ Conv      |              | $\checkmark$ |                  |                       | 59.0                     | 81.8             | 42.7                  | 113.1         |
| Full $3 \times 3$ Conv      |              | $\checkmark$ | $\checkmark$     |                       | 43.4                     | 111.5            | 41.8                  | 115.5         |
| Full $3 \times 3$ Conv      |              | $\checkmark$ | $\checkmark$     | $\checkmark$          | 43.4                     | 111.5            | 41.8                  | 115.6         |
| Depthwise $3 \times 3$ Conv | $\checkmark$ |              |                  |                       | 1.9                      | 9.7              | 2.0                   | 9.6           |
| Depthwise $3 \times 3$ Conv |              | $\checkmark$ |                  |                       | 20.5                     | 0.9              | 17.8                  | 1.1           |
| Depthwise $3 \times 3$ Conv |              | $\checkmark$ | $\checkmark$     |                       | 3.0                      | 6.2              | 3.4                   | 5.5           |
| Depthwise $3 \times 3$ Conv |              | $\checkmark$ | $\checkmark$     | $\checkmark$          | 2.1                      | 9.2              | 2.3                   | 8.2           |
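The reported 1.36× and 9.76× speedups can be reproduced from Table 3 by dividing the latency of the unmodified deformable operation by the latency of the fully co-designed variant (bound and square), both without LLC. That pairing of rows is an inference from the numbers rather than something stated explicitly; the short check below only uses values from the table.

```python
# Latencies in ms from Table 3, "without LLC" column.
latency_ms = {
    "full": {"deformable": 59.0, "codesigned": 43.4},
    "depthwise": {"deformable": 20.5, "codesigned": 2.1},
}

def speedup(baseline_ms, optimized_ms):
    """Speedup of the optimized design over the baseline."""
    return baseline_ms / optimized_ms

speedups = {
    op: round(speedup(v["deformable"], v["codesigned"]), 2)
    for op, v in latency_ms.items()
}
for op, s in speedups.items():
    print(f"{op}: {s}x")
```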



The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019