## Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

Partha Maji, Andrew Mundy, Ganesh Dasika, Jesse Beu, Matthew Mattina, Robert Mullins

Arm Research, University of Cambridge

## ML and the Rise of the Edge

- VR/AR/MR
- Robotics
- Home, surveillance & analytics
- Drones
- Shipping & logistics
- IoT
- Automotive
- Mobile


2 © 2019 Arm Limited

## Contributions of this work

- We discuss what Winograd convolution can offer in terms of performance
- We break down the instruction-level implications and memory-layout tradeoffs of different flavors of Winograd kernel, so that its full potential can be realized
- We demonstrate how general matrix multiply (GEMM) can further optimize Winograd
- We present performance results for Winograd vs. the conventional im2row + GEMM solution
  - More than a 2x performance boost on real hardware today!

Ultimately enable more efficient ML compute at the edge through Winograd in the Arm Compute Library (ArmCL).

# Convolution and Winograd

## What is Winograd and why should I care?

- Convolutional Neural Networks (CNNs)
  - Common type of deep learning model employed in a variety of domains
  - Convolve filter bank (weights) over a field (input activations) to produce a response (output)
  - Push response through an activation function (typically ReLU) and feed to the next layer
- Winograd Convolution
  - Based on the Chinese Remainder Theorem and modular arithmetic
  - Produces mathematically equivalent results to naïve convolution\*
  - Similar to using Fourier: transform into 'Winograd domain', do simpler math, transform result back

\*Assuming infinite precision
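To make the "transform, do simpler math, transform back" idea concrete, here is a minimal sketch of the smallest one-dimensional case, F(2, 3): two outputs of a 3-tap filter computed from four inputs. The function names are hypothetical and the code is illustrative only, not the ArmCL implementation. The direct form uses 6 multiplications; the Winograd form uses 4, at the cost of extra additions in the transforms.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter g slid over 4 inputs d: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs using 4 multiplications in the transformed domain."""
    # Filter transform: depends only on the weights, so it can be precomputed.
    u = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Input transform: additions and subtractions only.
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # The "simpler math": one multiplication per transformed element.
    m = [ui * vi for ui, vi in zip(u, v)]
    # Output transform: additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


if __name__ == "__main__":
    d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]
    print(direct_f23(d, g), winograd_f23(d, g))
```

With floating-point data the two results can differ by rounding error, which is the practical meaning of the "assuming infinite precision" caveat.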


## Objective: To (quickly) explain for a CPU context:

$$f = Z^{T} \left[ \left( W w W^{T} \right) \odot \left( X^{T} x X \right) \right] Z$$




Standard CNN Configuration


#### **Input Region Transform**

$(2 \times 2) = (2 \times 4) \left[ (4 \times 3)(3 \times 3)(3 \times 4) \odot (4 \times 4)(4 \times 4)(4 \times 4) \right] (4 \times 2)$

$f = Z^{T} \left[ \left( W w W^{T} \right) \odot \left( X^{T} x X \right) \right] Z$

#### **Filter Transform**

$(2 \times 2) = (2 \times 4) \left[ (4 \times 3)(3 \times 3)(3 \times 4) \odot (4 \times 4)(4 \times 4)(4 \times 4) \right] (4 \times 2)$

$f = Z^{T} \left[ \left( W w W^{T} \right) \odot \left( X^{T} x X \right) \right] Z$

#### **Output Channel Transform**

$(2 \times 2) = (2 \times 4) \left[ (4 \times 3)(3 \times 3)(3 \times 4) \odot (4 \times 4)(4 \times 4)(4 \times 4) \right] (4 \times 2)$

$f = Z^{T} \left[ \left( W w W^{T} \right) \odot \left( X^{T} x X \right) \right] Z$

#### **Elementwise Multiplication**
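The element-wise product sits between the input and filter transforms on one side and the output transform on the other. The sketch below puts the steps together for a single channel and a single 4x4 input region, assuming the commonly used F(2x2, 3x3) transform matrices. Here `W` (4x3) transforms the 3x3 filter `w`, `XT` (4x4) transforms the 4x4 input region `x`, and `ZT` (2x4) produces the 2x2 output. The helper names are hypothetical, and the code is written for clarity rather than speed.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Commonly used F(2x2, 3x3) transform matrices (an assumption of this sketch).
W = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]           # 4x3
XT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]        # 4x4
ZT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                     # 2x4


def winograd_2x2_3x3(x, w):
    """2x2 output from a 4x4 input region x and a 3x3 filter w."""
    filt = matmul(matmul(W, w), transpose(W))        # filter transform, 4x4
    region = matmul(matmul(XT, x), transpose(XT))    # input region transform, 4x4
    # Element-wise multiplication: 16 multiplies instead of 36.
    prod = [[filt[i][j] * region[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(ZT, prod), transpose(ZT))   # output transform, 2x2


def direct_2x2_3x3(x, w):
    """Reference: slide the 3x3 filter over the 4x4 region (CNN convention)."""
    return [[sum(x[i + a][j + b] * w[a][b] for a in range(3) for b in range(3))
             for j in range(2)] for i in range(2)]
```

For a multi-channel layer, the element-wise products of all input channels are summed in the transformed domain before a single output transform is applied.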



#### **Transform Cost**
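As a rough way to reason about this cost, the sketch below counts multiplications per output tile and the number of 2D transforms a layer needs. It assumes the usual amortization: each filter is transformed once, each input region once per input channel, and each output once per output channel after accumulation over input channels. The helper names are hypothetical, and the additions inside the transforms are deliberately ignored.

```python
def multiplies_per_tile(m, r):
    """Multiplications for one m x m output tile of an r x r filter.

    Returns (direct, winograd): direct convolution needs m*m*r*r, while the
    Winograd form needs one multiply per element of the (m+r-1)^2 input tile.
    """
    tile = m + r - 1
    return m * m * r * r, tile * tile


def transform_counts(regions, in_channels, out_channels):
    """Number of 2D transforms for one layer under the stated assumptions."""
    return {
        "filter": in_channels * out_channels,    # once per filter, reusable across inputs
        "input": regions * in_channels,          # reused by every output channel
        "output": regions * out_channels,        # after summing over input channels
        "elementwise_tiles": regions * in_channels * out_channels,
    }


if __name__ == "__main__":
    for m, r in [(2, 3), (4, 3), (2, 5)]:
        direct, wino = multiplies_per_tile(m, r)
        print(f"F({m}x{m}, {r}x{r}): {direct} vs {wino} multiplies")
    print(transform_counts(regions=100, in_channels=64, out_channels=128))
```

Under these assumptions the transform counts grow more slowly than the element-wise work as the channel counts increase, so the transform overhead matters most for layers with few channels.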




# Multi-Channel Filters, Memory Layout, Vectorization, and GEMM

#### NCHW vs NHWC, data layout

**Tensor Ordering** 

- N = batch
- C = channel
- H = height
- W = width



## NCHW vs NHWC, data layout

- Layout ultimately dictates how contiguous vector-load operations will populate registers
  - Under NCHW, registers will be filled entirely from a single channel
  - Under NHWC, registers will hold multiple channels for a single coordinate
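- In the Armv8 architecture (with 128-bit SIMD registers) and FP32 data, each register holds four values, so this means either four adjacent values from one row of a single channel (NCHW) or the values of four channels at a single coordinate (NHWC)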
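The effect of layout on a contiguous vector load can be sketched by listing tensor coordinates in memory order and taking the first four, which is what one 128-bit load of FP32 values would fetch. The function names and the small tensor shape are assumptions made for illustration, with batch size 1.

```python
LANES = 4  # 128-bit register / 32-bit floats


def memory_order(layout, C, H, W):
    """(channel, row, column) coordinates in the order they sit in memory."""
    if layout == "NCHW":
        return [(c, h, w) for c in range(C) for h in range(H) for w in range(W)]
    if layout == "NHWC":
        return [(c, h, w) for h in range(H) for w in range(W) for c in range(C)]
    raise ValueError(f"unknown layout: {layout}")


def first_register(layout, C=8, H=4, W=4):
    """Coordinates held by one contiguous load from the start of the tensor."""
    return memory_order(layout, C, H, W)[:LANES]


if __name__ == "__main__":
    print("NCHW:", first_register("NCHW"))  # one channel, four columns
    print("NHWC:", first_register("NHWC"))  # one coordinate, four channels
```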



## Advantages to NHWC layout for CPUs

- Reasonably optimized transforms exist for both NCHW and NHWC at F(2x2, 3x3, 4x4)
- Convolution filters and Winograd are not restricted to F(2x2, 3x3, 4x4)
  - Larger regions can yield higher performance, e.g., F(3x3, 3x3, 5x5)
  - 5x5 and 7x7 filters are found in Inception networks, e.g., F(2x2, 5x5, 6x6)
  - Dimension-to-register capacity mismatch results in wasted register capacity and/or alignment complexity under NCHW
  - NHWC only experiences increased register pressure

## F(2x2, 5x5, 6x6) Example
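A back-of-the-envelope sketch of the register arithmetic for a 6x6 input region, assuming FP32 data, 128-bit registers (four lanes), and the 32 SIMD registers available in AArch64. Under NCHW a six-element row does not divide evenly into four-lane registers. Under NHWC every lane is used, but holding a whole region for four channels takes one register per coordinate. The helper names are hypothetical.

```python
import math

SIMD_REGISTERS = 32  # 128-bit SIMD registers in AArch64


def nchw_row_usage(tile_width, lanes=4):
    """Registers needed for one tile row under NCHW, and the fraction of lanes used."""
    regs = math.ceil(tile_width / lanes)
    return regs, tile_width / (regs * lanes)


def nhwc_tile_registers(tile_height, tile_width):
    """Registers needed to hold a whole tile under NHWC (one per coordinate)."""
    return tile_height * tile_width


if __name__ == "__main__":
    regs, used = nchw_row_usage(6)
    print(f"NCHW: {regs} registers per row, {used:.0%} of lanes used")
    need = nhwc_tile_registers(6, 6)
    print(f"NHWC: {need} registers per region, {SIMD_REGISTERS} available")
```

With a 4x4 region the same functions give one fully used register per row under NCHW and 16 registers under NHWC, which is why the mismatch only shows up for larger regions.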




## Advantages to NHWC layout for CPUs

- Reasonably optimized transforms exist for both NCHW and NHWC at F(2x2, 3x3, 4x4)
- Convolution filters and Winograd are not restricted to F(2x2, 3x3, 4x4)
  - Larger regions can yield higher performance, e.g., F(4x4, 3x3, 6x6)
  - 5x5 and 7x7 filters are found in Inception networks, e.g., F(2x2, 5x5, 6x6)
  - Dimension-to-register capacity mismatch results in wasted register capacity and alignment complexity under NCHW
  - NHWC only experiences increased register pressure
- Wider registers or lower precision also add challenges for NCHW
  - 256-bit or FP16 means 8 values per register, or 2 rows per register under NCHW
  - Loss of the 1:1 register-row mapping complicates the assembly sequence for an efficient NCHW transpose
  - NHWC simply doubles the number of channels stored per register

#### Vectorization over channels is more portable and performant!

## Use of GEMM to further optimize

- General Matrix-Matrix Multiply is a common, highly optimized operation for most architectures, including Arm
- Inspection of the full Winograd convolution algorithm (Listing 1 in paper) shows:
  - The fundamental operation is a multiply-accumulate
  - There are two axes of data reuse:
    - weight tile reuse over all input regions and
    - input region reuse over all output channels
  - Opportunity to leverage GEMM to do the computation in a highly parallel manner
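One way to see the GEMM opportunity: for each position in the transformed tile (16 positions for a 4x4 tile), the multiply-accumulate over input channels for all regions and all output channels is exactly one matrix product. The sketch below assumes hypothetical arrays `U[t][p][c]` (transformed input regions) and `V[t][c][m]` (transformed filters) and checks that a batch of small matrix products matches the scalar loop nest. A real implementation would call an optimized GEMM routine rather than the plain `matmul` helper used here.

```python
def matmul(a, b):
    """Plain matrix product, standing in for an optimized GEMM call."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def scalar_mac(U, V):
    """Loop nest: accumulate over input channels c for every (t, p, m)."""
    T, P, C, M = len(U), len(U[0]), len(V[0]), len(V[0][0])
    out = [[[0.0] * M for _ in range(P)] for _ in range(T)]
    for t in range(T):              # position within the transformed tile
        for p in range(P):          # input region
            for m in range(M):      # output channel
                for c in range(C):  # input channel
                    out[t][p][m] += U[t][p][c] * V[t][c][m]
    return out


def batched_gemm(U, V):
    """Same result as one (P x C) by (C x M) product per tile position."""
    return [matmul(U[t], V[t]) for t in range(len(U))]


if __name__ == "__main__":
    T, P, C, M = 16, 3, 2, 5  # small made-up sizes
    U = [[[float(t + p - c) for c in range(C)] for p in range(P)] for t in range(T)]
    V = [[[float(t * m + c) for m in range(M)] for c in range(C)] for t in range(T)]
    print(batched_gemm(U, V) == scalar_mac(U, V))
```

Each row of `V[t]` is reused by every region and each row of `U[t]` by every output channel, which matches the two reuse axes listed above.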



Post-Transform Filter Tiles (C x 4 x 4 x M)

Output Transforms

# Results

#### **Experimental Setup**

- Platform: Huawei HiKey960 Development Platform, 4x Cortex-A73 cluster
- Networks: VGG19, VGG16, GoogLeNet, Inception-v3, SqueezeNet
- Other: FP32, batch size 1, 4x multi-threaded through Arm Compute Library (ArmCL)

Measured individual per-layer performance as well as end-to-end run-time, compared with highly optimized conventional 'im2row GEMM' convolution strategy

#### **Benchmark Results**



#### Conclusion

- ML is coming to the edge, hard and fast
- Arm CPUs are already widely deployed at the edge, so optimizing for performance here has immediate impact
- Winograd convolution is an alternative to conventional im2row + GEMM convolution that reduces the number of multiplications, but it requires care to fully realize the benefit
- When done properly, it can provide as much as a 2.5x speedup on real hardware for end-to-end model inference

#### Benefits now available in ArmCL!

#### Thank You

Danke, Merci, 谢谢, ありがとう, Gracias, Kiitos, 감사합니다, धन्यवाद, شکرًا, תודה

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks