The 3rd EMC2 - Energy Efficient Training and Inference of Transformer Based Models

Co-located with the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2019

Sunday, June 16, 2019
Long Beach, CA
Room: Hyatt Shoreline A
Half Day (Morning Session)

Workshop Objective

Transformers are the foundation of large deep learning language models. Recent successes of Transformer-based models in image classification and action prediction indicate their wide applicability. In this workshop, we focus on leading ideas built on Transformer models, such as PaLM from Google. We will learn about the key observations on model performance, inference optimizations, and the power consumption of both mixed-precision inference and training.

Call for Papers

The goal of this Workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of machine learning and deep learning as it is practiced today and as it will evolve in the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems

08:00 - 08:10

Introduction and Opening Remarks

08:10 - 09:00
Invited Talk

Mixed-signal Techniques for Embedded Machine Learning Systems

Boris Murmann, Stanford University

Over the past decade, machine learning algorithms have been deployed in many cloud-centric applications. However, as the application space continues to grow, various algorithms are now being embedded “closer to the sensor,” eliminating the latency, privacy and energy penalties associated with cloud access. In this talk, I will review mixed-signal circuit techniques that can improve the efficiency of moderate-complexity, low-power inference algorithms. Specific examples include analog feature extraction for image and audio processing, mixed-signal compute circuits for convolutional neural networks, as well as compute-in-memory using resistive RAM.

Boris Murmann is a Professor of Electrical Engineering at Stanford University. He joined Stanford in 2004 after completing his Ph.D. degree in electrical engineering at the University of California, Berkeley in 2003. Since 2004, he has worked as a consultant with numerous Silicon Valley companies. Dr. Murmann’s research interests are in mixed-signal integrated circuit design, with special emphasis on sensor interfaces, data converters and custom circuits for embedded machine learning. He is a Fellow of the IEEE.

09:00 - 09:50
Invited Talk

Balancing Efficiency and Flexibility for DNN Acceleration

Vivienne Sze, MIT

There has been a significant amount of research on the topic of efficient processing of DNNs, from the design of efficient DNN algorithms to the design of efficient DNN accelerators. The wide range of techniques used for efficient DNN algorithm design has resulted in a more diverse set of DNNs; this creates a new challenge for DNN accelerators, as they now need to be sufficiently flexible to support a wide range of DNN workloads efficiently. However, many of the existing DNN accelerators rely on certain properties of the DNN which cannot be guaranteed (e.g., fixed weight sparsity, large number of channels, large batch size). In this talk, we will briefly discuss recent techniques that have been used to design efficient DNN algorithms and important properties to consider when applying these techniques. We will then present a systematic approach called Eyexam to identify the sources of inefficiencies in DNN accelerator designs for different DNN workloads. Finally, we will introduce a flexible accelerator called Eyeriss v2 that is computationally efficient across a wide range of diverse DNNs.

Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video processing/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI in the JCT-VC committee of the ITU-T and ISO/IEC standards bodies during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book entitled “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).

Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degrees from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.

For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit:

09:50 - 10:00

Short Break

10:00 - 10:50
Invited Talk

Hardware Efficiency Aware Neural Architecture Search

Song Han, MIT

Efficient deep learning computing requires algorithm and hardware co-design to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the algorithm creates a much larger design space: it’s not only about designing the hardware but also about how to change the algorithm to best fit the hardware. Human engineers can hardly exhaust the design space with heuristics; the process is labor-intensive and sub-optimal. We propose design automation techniques for efficient neural networks. We investigate automatically designing small and fast models (ProxylessNAS), automated channel pruning (AMC), and automated mixed-precision quantization (HAQ). We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.

Song Han is an assistant professor at MIT EECS. Dr. Han received the Ph.D. degree in Electrical Engineering from Stanford, advised by Prof. Bill Dally. Dr. Han’s research focuses on efficient deep learning computing. He proposed “Deep Compression” and the “EIE Accelerator,” which impacted the industry. His work received best paper awards at ICLR’16 and FPGA’17. He is the co-founder and chief scientist of DeePhi Tech (a leading efficient deep learning solution provider), which was acquired by Xilinx in 2018.

10:50 - 11:40
Invited Talk

What Can In-memory Computing Deliver, and What Are the Barriers?

Naveen Verma, Princeton University

Inference based on deep-learning models is being employed pervasively in applications today. In many such applications, state-of-the-art models can easily push the platforms to their limits of performance and energy efficiency. To address this, digital acceleration has been widely exploited. However, deep-learning computations exhibit critical attributes that limit the gains achievable by traditional digital acceleration. In particular, computations are dominated by high-dimensionality matrix-vector multiplications (MVMs), where the precision requirements of elements have been decreasing (from FP32 a few years ago, to INT8/4/2 now and in the future). In this scenario, in-memory computing (IMC) offers distinct advantages, which have been demonstrated through recent prototypes leading to roughly 10x higher energy efficiency and area-normalized throughput, compared to optimized digital accelerators. This arises from the structural alignment of dense 2D memory arrays with the dataflow of MVMs. While digital spatial architectures (e.g., systolic arrays) also exploit such alignment, IMC can do so more aggressively, minimizing data movement and amortizing compute into highly-efficient, highly-parallel analog operations. However, IMC also raises critical challenges at each level (need for analog compute at the circuit level, need for high-bandwidth hardware infrastructure at the architectural level, constrained configurability/virtualization at the software-mapping level). Recent research advances have shown remarkable promise in addressing many of these challenges, making IMC more of a reality than ever. These advances, their potential implications, and key questions remaining will be reviewed.
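To make the reduced-precision MVM mentioned in the abstract concrete, here is a minimal software sketch of a matrix-vector multiplication carried out with integer operands. It is an illustration only and does not model in-memory computing hardware or the speaker's prototypes. The function names (`quantize_symmetric`, `quantized_mvm`, `float_mvm`) and the choice of one shared symmetric scale per tensor are assumptions made for this example.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def float_mvm(matrix, vector):
    """Reference matrix-vector product in floating point."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def quantized_mvm(matrix, vector, num_bits=8):
    """Approximate matrix @ vector with integer multiply-accumulate."""
    flat = [w for row in matrix for w in row]
    q_flat, w_scale = quantize_symmetric(flat, num_bits)
    q_vec, x_scale = quantize_symmetric(vector, num_bits)
    cols = len(vector)
    result = []
    for r in range(len(matrix)):
        q_row = q_flat[r * cols:(r + 1) * cols]
        acc = sum(qw * qx for qw, qx in zip(q_row, q_vec))  # integer accumulation
        result.append(acc * w_scale * x_scale)              # rescale back to float
    return result


if __name__ == "__main__":
    W = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
    x = [1.0, 2.0, -1.0]
    print("float:", float_mvm(W, x))
    print("int8 :", quantized_mvm(W, x, num_bits=8))
    print("int4 :", quantized_mvm(W, x, num_bits=4))
```

Running the script prints the floating-point result next to the INT8 and INT4 approximations, which gives a rough feel for how the error grows as the bit width shrinks.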

Naveen Verma received the B.A.Sc. degree in Electrical and Computer Engineering from UBC, Vancouver, Canada in 2003, and the M.S. and Ph.D. degrees in Electrical Engineering from MIT in 2005 and 2009, respectively. Since July 2009 he has been a faculty member at Princeton University. His research focuses on advanced sensing systems, exploring how systems for learning, inference, and action planning can be enhanced by algorithms that exploit new sensing and computing technologies. This includes research on large-area, flexible sensors, energy-efficient statistical-computing architectures and circuits, and machine-learning and statistical-signal-processing algorithms. Prof. Verma has served as a Distinguished Lecturer of the IEEE Solid-State Circuits Society, and currently serves on the technical program committees for ISSCC, VLSI Symp., DATE, and IEEE Signal-Processing Society (DISPS).

11:40 - 12:30
Invited Talk

Speeding up Deep Neural Networks with Adaptive Computation and Efficient Multi-Scale Architectures

Rogerio Feris, IBM Research

Very deep convolutional neural networks have shown remarkable success in many computer vision tasks, yet their computational expense limits their impact in domains where fast inference is essential, particularly in delay-sensitive and real-time scenarios such as autonomous driving, robotic navigation, or user-interactive applications on mobile devices. In this talk, I will describe two complementary approaches for speeding up deep neural networks. The first approach, called BlockDrop, learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. The second approach, called Big-Little Net, relies on a multi-branch network architecture that has different computational complexities for different branches, with feature fusion at multiple scales. The model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. Finally, I will conclude the talk describing ongoing efforts at IBM for energy efficient deep learning.
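The idea of choosing which layers to execute at inference time can be sketched in a few lines. The example below is a toy illustration, not the BlockDrop implementation: according to the abstract, BlockDrop learns the keep/skip decisions, whereas here the mask is supplied by hand. The names `make_block` and `run_residual_network` and the scalar "blocks" are hypothetical.

```python
def make_block(weight):
    """Return a toy residual branch f(x) = weight * x (hypothetical stand-in)."""
    return lambda x: [weight * v for v in x]


def run_residual_network(x, blocks, keep_mask):
    """Apply residual blocks, skipping those whose mask entry is falsy.

    Each executed block computes x + f(x). A skipped block reduces to the
    identity, so dropping it avoids the cost of evaluating f.
    """
    if len(blocks) != len(keep_mask):
        raise ValueError("need one keep/skip decision per block")
    executed = 0
    for block, keep in zip(blocks, keep_mask):
        if keep:
            x = [xi + fi for xi, fi in zip(x, block(x))]
            executed += 1
    return x, executed


if __name__ == "__main__":
    blocks = [make_block(0.5), make_block(0.1), make_block(0.01)]
    full, n_full = run_residual_network([1.0, 2.0], blocks, [1, 1, 1])
    fast, n_fast = run_residual_network([1.0, 2.0], blocks, [1, 0, 1])
    print("all blocks :", full, "executed", n_full)
    print("one skipped:", fast, "executed", n_fast)
```

Because every block is residual, skipping one leaves the tensor shape unchanged, which is what makes per-input layer selection possible without modifying the remaining layers.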

Rogerio Schmidt Feris is the head of computer vision and multimedia research at IBM T.J. Watson Research Center. He joined IBM in 2006 after receiving a Ph.D. from the University of California, Santa Barbara. He has also worked as an Affiliate Associate Professor at the University of Washington and as an Adjunct Associate Professor at Columbia University. He has authored over 100 technical papers and has over 40 issued patents in the areas of computer vision, multimedia, and machine learning. Rogerio is a principal investigator of several projects within the MIT-IBM Watson AI Lab, and leads the IBM-MIT-Purdue team as part of the IARPA DIVA program. His work has not only been published in top AI conferences (NeurIPS, CVPR, ICLR, ICCV, ECCV, SIGGRAPH), but has also been integrated into multiple IBM products, including Watson Visual Recognition, Watson Media, and Intelligent Video Analytics. He led the development of an attribute-based people search system used by many police departments around the world, as well as a system to produce auto-curated highlights for the US Open, Wimbledon, and Masters tournaments, which were seen by millions of people. Rogerio’s work has been covered by the New York Times, ABC News, CBS 60 Minutes, and many other media outlets. He currently serves as an Associate Editor of TPAMI, has served as a Program Chair of WACV 2017, and as an Area Chair of top AI conferences, such as NeurIPS, CVPR, and ICCV. Rogerio is an IBM Master Inventor, has received an IBM Outstanding Innovation Award, and was part of the team that recently achieved top results in highly competitive benchmarks such as the KITTI and TrecVid evaluations. In addition to working on core research, he had a one-year assignment at IBM Global Technology Services as a senior software engineer to help with the productization of the IBM Smart Surveillance System.

12:30 - 12:35

Closing Remarks