The 2nd EMC2 - Energy Efficient Training and Inference of Transformer Based Models

Co-located with the 25th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2019)

Sunday, February 17, 2019
Washington D.C.
Room: Dogwood Room
Full Day

Workshop Objective

Transformers are the foundation of large deep learning language models. Recent successes of Transformer-based models in image classification and action prediction indicate their wide applicability. In this workshop, we want to focus on leading ideas that use Transformer models, such as PaLM from Google. We will learn about the key observations on model performance, optimizations for inference, and the power consumption of both mixed-precision inference and training.

Call for Papers

The goal of this Workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of machine learning and deep learning as they are practiced today and as they will evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on the latest solutions, including works-in-progress seeking directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
08:45 - 09:00

Introduction and Opening Remarks

09:00 - 10:00

Quantizing Deep Convolutional Networks for Efficient Inference

Raghuraman Krishnamoorthi, Facebook

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. We discuss different quantization schemes and show that simple techniques provide very good performance (4x reduction in model size, 2x speedup on CPUs) for classification use cases, with a 1-2% accuracy drop.

Modeling quantization during training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. Quantization-aware training also allows for reducing the precision of weights to four bits with accuracy losses ranging from 2% to 10%, with higher accuracy drop for smaller networks.
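As a rough illustration of what "modeling quantization during training" can mean, the sketch below rounds values to a low-precision integer grid and immediately maps them back to floats, so later computation sees the quantization error. This is a minimal, assumed formulation (asymmetric, min/max-based range), not necessarily the scheme used in the talk; the function name `fake_quantize` and the sample values are hypothetical. In a real training loop, the rounding step is usually treated as identity when computing gradients (a straight-through estimator).

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to a num_bits unsigned grid, then dequantize.

    Assumption: an asymmetric scheme whose range is taken from the
    min/max of the data and always includes 0.0.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    # Step size between adjacent integer levels; fall back to 1.0 if all zeros.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    # Integer level that represents the real value 0.0 exactly.
    zero_point = round(qmin - lo / scale)
    out = []
    for v in values:
        q = round(v / scale) + zero_point
        q = max(qmin, min(qmax, q))           # clamp to the integer range
        out.append((q - zero_point) * scale)  # map back to a float
    return out


# Hypothetical sample weights: compare the error at 8 and 4 bits.
weights = [-1.0, -0.5, 0.0, 0.3, 1.0]
for bits in (8, 4):
    approx = fake_quantize(weights, bits)
    worst = max(abs(w - a) for w, a in zip(weights, approx))
    print(f"{bits}-bit worst-case error: {worst:.4f}")
```

The worst-case error grows as the bit width shrinks, which is consistent with the larger accuracy losses reported above for four-bit weights.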

We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support:

  • Precisions of 4, 8, and 16 bits for computation
  • Per-channel quantization of weights
  • Per-layer selection of bit widths for weights and activations
  • On-the-fly weight compression techniques for memory bandwidth efficiency
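To make the per-channel recommendation concrete, the following sketch compares one shared scale for a whole weight tensor with one scale per output channel. It assumes symmetric int8 quantization with scales derived from the maximum absolute value; the helper names and the toy weight matrix are hypothetical and are not taken from the talk.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to signed integers and clamp to the symmetric range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def quantize_per_tensor(weights, num_bits=8):
    """One scale shared by every channel (row) of the weight matrix."""
    flat = [v for row in weights for v in row]
    scale = symmetric_scale(flat, num_bits)
    return [quantize(row, scale, num_bits) for row in weights], [scale] * len(weights)


def quantize_per_channel(weights, num_bits=8):
    """A separate scale for each output channel (row)."""
    scales = [symmetric_scale(row, num_bits) for row in weights]
    return [quantize(row, s, num_bits) for row, s in zip(weights, scales)], scales


def channel_error(row, q_row, scale):
    """Worst-case reconstruction error for one channel."""
    return max(abs(w - q * scale) for w, q in zip(row, q_row))


# Toy weights: channel 0 has a wide range, channel 1 a narrow one.
weights = [[10.0, -8.0, 4.0], [0.02, -0.01, 0.005]]
q_pt, s_pt = quantize_per_tensor(weights)
q_pc, s_pc = quantize_per_channel(weights)
print("per-tensor  channel 1:", q_pt[1], channel_error(weights[1], q_pt[1], s_pt[1]))
print("per-channel channel 1:", q_pc[1], channel_error(weights[1], q_pc[1], s_pc[1]))
```

With a single shared scale, the narrow-range channel collapses to zeros because the wide-range channel dictates the step size; a per-channel scale keeps its values distinguishable.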

Raghuraman Krishnamoorthi is a software engineer on the PyTorch team at Facebook, where he leads the effort to develop and optimize quantized deep networks for inference. Prior to that, he was part of the TensorFlow team at Google, working on quantization for mobile inference as part of TensorFlow Lite.

From 2001 to 2017, Raghu was at Qualcomm Research, working on several generations of wireless technologies. His work experience also includes computer vision for AR, ultra-low-power always-on vision, and hardware/software co-design for inference on mobile platforms. He is an inventor on more than 90 issued and filed patents. Raghu has a master's degree in EE from the University of Illinois, Urbana-Champaign and a bachelor's degree from the Indian Institute of Technology, Madras.

10:00 - 11:00

Efficient Machine Learning Architectures

T. N. Vijaykumar, Purdue University

Advances in machine learning (ML) are resulting in highly accurate recognition (e.g., image and speech recognition). ML models, however, impose high computational demands during both training and inference, requiring efficient architectures. The models' computations are fine-grained, regular, and highly parallel, have high data reuse, and use low-precision arithmetic for inference (e.g., int8). Modern ML architectures (e.g., GPGPUs, TPU, and FPGA-based) exploit these characteristics to achieve high performance and energy efficiency.

Recently, ML models have been shown to be sparse, prompting creative proposals for sparse architectures. Emerging technology trends of processing-in/near-memory match some ML workloads well, providing an opportunity for architectural innovation based on these technologies. In this talk, I will explore these exciting aspects of machine learning architectures.
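To show why sparsity is attractive to exploit, here is a small software sketch of a matrix-vector product that stores and multiplies only the nonzero weights, using the compressed sparse row (CSR) layout. It is only an illustration of the general idea, not a description of any architecture covered in the talk; the function names and the example matrix are hypothetical.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)    # keep only nonzero entries
                col_idx.append(j)   # remember which column each came from
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, counting multiply-accumulates."""
    y, macs = [], 0
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
            macs += 1
        y.append(acc)
    return y, macs


# Hypothetical 3x4 weight matrix in which 9 of 12 entries are zero.
dense = [[0, 2, 0, 0],
         [1, 0, 0, 3],
         [0, 0, 0, 0]]
x = [1, 2, 3, 4]
y, macs = csr_matvec(*to_csr(dense), x)
dense_macs = len(dense) * len(dense[0])
print(f"result={y}, sparse MACs={macs}, dense MACs={dense_macs}")
```

The work scales with the number of nonzeros rather than the matrix size, at the cost of index storage and irregular memory access, which is part of what makes sparse hardware design challenging.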

T. N. Vijaykumar is a Professor in the School of Electrical and Computer Engineering at Purdue University. His research interests are in computer architecture targeting machine learning architectures, secure high-performance microprocessors, and verifiable architectures. He is also interested in hardware for data center networks and software-programmable microfluidics. His work has been recognized with an NSF CAREER Award in 1999 and IEEE Micro's Top Picks in 2003 and 2005. He is listed in the International Symposium on Computer Architecture (ISCA) Hall of Fame. With his colleagues, he received the first prize in the 2009 Burton D. Morgan Business Plan Competition for a business plan on commercializing software-programmable lab-on-a-chip technology. He received a Ph.D. in computer science from the University of Wisconsin-Madison in 1997.

11:00 - 12:00
Paper Session #1
Paper Presentation

Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

Partha Maji, Andrew Mundy, Ganesh Dasika, Jesse Beu, Matthew Mattina, Robert Mullins
University of Cambridge and ARM ML Research

Paper Presentation

On Merging MobileNets for Efficient Multitask Inference

Cheng-En Wu, Yi-Ming Chan and Chu-Song Chen
Institute of Information Science, Academia Sinica

13:30 - 14:15
Invited Talk

Tensilica DNA 100 Processor: A High-Performance, Power-Efficient DNN Processor for On-Device Inference

Megha Daga, Cadence

Deep learning is influencing not only the technology itself but also our everyday lives. With the increasing demand for mobile artificial intelligence (AI), conventional hardware solutions are struggling because of their low energy efficiency on such power-hungry applications. For the past few years, dedicated DNN accelerators for inference have been under the spotlight. However, with the rising emphasis on privacy and personalization, the ability to learn on mobile platforms is becoming the second hurdle for “on-device AI.” The Cadence® Tensilica® DNA 100 Processor IP is the first deep neural-network accelerator (DNA) AI processor IP to deliver both high performance and power efficiency across a full range of compute from 0.5 TeraMAC (TMAC) to 100 TMACs. As a result, the DNA 100 processor is well suited for on-device neural network inference applications spanning autonomous vehicles (AVs), ADAS, surveillance, robotics, drones, augmented reality (AR)/virtual reality (VR), smartphones, smart home, and IoT.

Megha Daga works at Cadence Design Systems, Inc. as Sr Manager, Product Marketing and Management in the AI group. Megha’s focus and passion is to research the latest trends and requirements in AI and to create industry-leading solutions on Cadence AI IPs. Megha enjoys learning from customers’ experiences and from fellow researchers in AI. Her R&D background coupled with her current marketing role gives her a unique perspective on the AI industry.

14:15 - 15:00
Invited Talk

Beyond IPS: Toward A Wholistic Measure of Machine Learning Performance

Saurabh Tangri, Intel

Saurabh Tangri is a senior SW architect at Intel and leads AI enabling efforts across Intel HW for Microsoft solutions. His focus area is to make AI accessible and performant for everyone in a seamless manner. He has been with Intel for nearly 15 years.

15:30 - 17:00
Paper Session #2
Paper Presentation

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Farzad Farshchi, Qijing Huang and Heechul Yun
University of Kansas, University of California, Berkeley

Paper Presentation

Bootstrapping Deep Neural Networks from Approximate Image Processing Pipelines

Sek Chai, Kilho Son and Jesse Hostetler
SRI International

Paper Presentation

NNBench-X: A Benchmarking Methodology for Neural Network Accelerator Designs

Xinfeng Xie, Xing Hu, Peng Gu, Shuangchen Li, Yu Ji and Yuan Xie
University of California, Santa Barbara

17:00 - 17:30
Invited Talk

Hardware Acceleration Opportunities in Bioinformatics and Computational Biology

Leonid Yavits, Technion

Advances in genomics have triggered a revolution in healthcare and our understanding of life. Recent years have seen an exponential increase in genomic data, far outpacing Moore’s Law. Coupled with the prohibitively high computational costs of bioinformatics tasks, this presents a challenge but also a great opportunity for hardware acceleration.

I will describe a typical genomic assembly pipeline and discuss the latest developments in the field of DNA sequencing, with an emphasis on hardware acceleration opportunities. Afterwards, I will make a brief excursion into the world of existing bioinformatics accelerators. I will conclude with insights from the Accelerator Architecture for Computational Biology and Bioinformatics (AACBB) 2019 workshop.

Leonid Yavits received his MSc (1996) and PhD in Electrical Engineering (2015) from the Technion, Israel Institute of Technology. After completing the MSc program, he co-founded VisionTech, where he co-designed the world’s first single-chip MPEG2 codec. Following VisionTech’s acquisition by Broadcom, he managed Broadcom Israel R&D and co-developed a number of video compression products. Later, Leonid co-founded Horizon Semiconductors, where he co-designed a Set Top Box-on-chip for cable and satellite TV. Horizon’s Set Top Box-on-chip was among the world’s earliest heterogeneous MPSoCs.

Leonid is a postdoctoral research fellow in Electrical Engineering at the Technion. He has co-authored a number of patents and research papers. His research interests include non-von Neumann computer architectures, processing in memory and resistive-memory-based computing, and architectures for computational biology and bioinformatics. Leonid’s research work has earned several awards, among them IEEE Computer Architecture Letters Best Paper Awards for 2015 and 2017 and best poster awards at ISC High Performance in 2017 and the ACM/IEEE Supercomputing Conference in 2018.