The 4th EMC2 - Energy Efficient Training and Inference of Transformer-Based Models
Co-located with the 46th International Symposium on Computer Architecture (ISCA 2019)
Workshop Objective
Transformers are the foundational building blocks of large deep learning language models. Recent successes of Transformer-based models in image classification and action prediction indicate their wide applicability. In this workshop, we focus on leading Transformer-based models such as PALM from Google, examining key observations on model performance, optimizations for inference, and the power consumption of both mixed-precision inference and training.
Call for Papers
The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of machine learning and deep learning as it is practiced today and as it evolves over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its applications. We have tailored our program to best serve participants in a fully digital setting. Our forum facilitates an active exchange of ideas through:
- Keynotes, invited talks and discussion panels by leading researchers from industry and academia
- Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
- Independent publication of proceedings through IEEE CPS
We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent quality assessment of the papers.
Topics for the Workshop
- Neural network architectures for resource-constrained applications
- Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
- Power and performance efficient memory architectures suited for neural networks
- Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
- Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
- Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
- Optimizations to improve performance of training techniques including on-device and large-scale learning
- Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
- Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
09:00 - 09:10
Introduction and Opening Remarks
09:10 - 10:00
Efficient Deep Learning: Quantizing Models Without Using Re-training
Harris Teague, Qualcomm AI Research
In this talk we’ll cover post-training quantization techniques that can significantly improve model accuracy for 8-bit quantization. These techniques are especially useful when training or fine-tuning is not possible, a case that arises frequently in commercial applications: no training pipeline, optimized hyperparameters, or full training dataset is needed. We show the effectiveness of these techniques on popular models used for inference on resource-constrained devices.
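To give a flavor of what post-training quantization involves, here is a minimal NumPy sketch (an illustrative baseline, not the techniques from the talk): symmetric per-tensor 8-bit min-max quantization of a weight tensor, whose round-trip error is bounded by half a quantization step.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization: map floats onto [-127, 127]."""
    scale = max(np.max(np.abs(w)), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

The per-tensor scale is the simplest choice; methods like those discussed in the talk go further (e.g., per-channel scaling and bias correction) precisely because this naive scheme loses accuracy on weight distributions with outliers.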
Harris Teague is a Principal Engineer/Manager at Qualcomm and leads the Platform Systems group at Qualcomm AI Research. Prior to focusing on machine learning, he worked on a range of wireless projects including Bluetooth, WCDMA, ultra-wideband, and an OFDMA-based precursor to LTE called UMB. Harris is an inventor on over 60 granted US patents. He holds a BS in Aerospace Engineering from Virginia Tech, and an MS and PhD in Aerospace from Stanford University.
10:00 - 10:50
Machine Learning at Scale
Carole-Jean Wu, Arizona State University and Facebook
Machine learning systems are being widely deployed in production datacenter infrastructure and over billions of edge devices. This talk seeks to address key system design challenges when scaling machine learning solutions to billions of people. What are key similarities and differences between cloud and edge infrastructure? The talk will conclude with open system research directions for deploying machine learning at scale.
Carole-Jean Wu is a Research Scientist at Facebook AI Infrastructure Research. She is also a tenured Associate Professor of CSE at Arizona State University. Carole-Jean’s research focuses on computer and system architecture; more recently, it has pivoted to designing systems for machine learning. She is the lead author of “Machine Learning at Facebook: Understanding Inference at the Edge,” which presents the unique design challenges faced when deploying ML solutions at scale to the edge, from billions of smartphones to Facebook’s virtual reality platforms. Carole-Jean received her M.A. and Ph.D. from Princeton and her B.Sc. from Cornell.
10:50 - 11:10
Break
11:10 - 12:00
Enabling Continuous Learning through Synaptic Plasticity in Hardware
Tushar Krishna, Georgia Institute of Technology
Ever since modern computers were invented, the dream of creating artificial intelligence (AI) has captivated humanity. We are fortunate to live in an era when, thanks to deep learning (DL), computer programs have paralleled, and in many cases even surpassed, human-level accuracy in tasks like visual perception and speech synthesis. However, we are still far away from realizing general-purpose AI. The problem lies in the fact that the development of supervised-learning-based DL solutions today is mostly open loop. A typical DL model is created by hand-tuning the deep neural network (DNN) topology by a team of experts over multiple iterations, followed by training over petabytes of labeled data. Once trained, the DNN provides high accuracy for the task at hand; if the task changes, however, the DNN model needs to be re-designed and re-trained before it can be deployed. A general-purpose AI system, in contrast, needs to constantly interact with the environment and learn by adding and removing connections within the DNN autonomously, just like our brain does. This is known as synaptic plasticity.
In this talk we present our research efforts toward enabling general-purpose AI. First, we present GeneSys (MICRO 2018), a HW-SW prototype of a closed-loop learning system that continuously evolves the structure and weights of a DNN for the task at hand using genetic algorithms, providing 100-10000x higher performance and energy efficiency than state-of-the-art embedded and desktop CPU and GPU systems. Next, we present a DNN accelerator substrate called MAERI (ASPLOS 2018), built from light-weight, non-blocking, reconfigurable interconnects, that supports efficient mapping of regular and irregular DNNs with arbitrary dataflows, providing ~100% utilization of all compute units and a 3X improvement in speed and energy efficiency over state-of-the-art DNN accelerators.
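The closed-loop idea behind evolving a network with genetic algorithms can be sketched in a few lines; the toy example below (an illustration of the select/reproduce/mutate loop, not the GeneSys system itself) evolves the weights of a linear model rather than a full DNN topology:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: recover a hidden linear mapping from observations.
x = rng.standard_normal((64, 3))
w_true = rng.standard_normal(3)
y = x @ w_true

def fitness(w):
    """Higher is better: negative mean squared error on the task."""
    return -np.mean((x @ w - y) ** 2)

pop = rng.standard_normal((32, 3))               # initial random population
for _ in range(200):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-8:]]       # selection: keep the fittest
    children = parents[rng.integers(0, 8, size=24)]              # reproduction
    children = children + 0.1 * rng.standard_normal(children.shape)  # mutation
    pop = np.vstack([parents, children])         # elitism: parents survive unchanged

best = max(pop, key=fitness)
```

Because the fittest parents are carried over unchanged, the best fitness never degrades across generations; GeneSys closes this same loop in hardware, evaluating and mutating candidate networks continuously as the task evolves.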
Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering at Georgia Tech. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), a M.S.E in Electrical Engineering from Princeton University (2009), and a B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a post-doctoral researcher at Intel, Massachusetts.
Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC) and deep learning accelerators - with a focus on optimizing data movement in modern computing systems. He has 42 publications in leading conferences and journals, which have amassed over 5000 citations to date. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and two have won best paper awards. He has received the National Science Foundation (NSF) CRII award, a Google Faculty Award, and a Facebook Faculty Award.
12:00 - 13:30
Lunch
13:30 - 15:00
Paper Session #1
Run-Time Efficient RNN Compression for Inference on Edge Devices
Urmish Thakker, Dibakar Gope, Jesse Beu, Ganesh Dasika and Matthew Mattina
ARM ML Research
PyRTLMatrix: an Object-Oriented Hardware Design Pattern for Prototyping ML Accelerators
Dawit Aboye, Dylan Kupsh, Maggie Lim, Jacqueline Mai, Deeksha Dangwal, Diba Mirza and Timothy Sherwood
UC Santa Barbara
Accelerated CNN Training Through Gradient Approximation
Ziheng Wang, Sree Harsha Nelaturu and Saman Amarasinghe
Massachusetts Institute of Technology and SRM Institute of Science and Technology
15:00 - 15:10
Break
15:10 - 16:00
Structured and Systematic Approach for Energy Efficient DNN Acceleration
Xuehai Qian, University of Southern California
Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. In this talk, I first present our principled approaches to performing model compression and acceleration using structured matrices. Compared to unstructured pruning, our methods (CirCNN and PermDNN) achieve significant model storage and computation reduction while maintaining accuracy. Thanks to the regular structure, the accelerators can achieve better performance and energy efficiency compared to the state-of-the-art designs. I will also present a unified solution framework for both unstructured and structured pruning and quantization based on Alternating Direction Method of Multipliers (ADMM). It ensures high solution quality while guaranteeing solution feasibility, consistently outperforming previous results. Finally, I will present HyPar, a systematic approach to search the best tensor partition for a given multi-layer DNN with an accelerator array. It optimizes performance and energy efficiency by reducing data movement between accelerators. We believe that the structured and systematic approach and algorithm/hardware co-design are crucial for designing energy efficient DNN accelerators.
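To give a flavor of the constrained formulation behind such pruning methods (an illustrative simplification, not the full ADMM solver described in the talk): ADMM alternates gradient steps on the loss with a Euclidean projection onto the sparsity constraint set, and that projection simply zeroes all but the k largest-magnitude weights.

```python
import numpy as np

def project_to_sparsity(w, k):
    """Euclidean projection onto the set {w : ||w||_0 <= k}:
    keep the k largest-magnitude entries, zero the rest."""
    if k >= w.size:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    return w * (np.abs(w) >= thresh)

w = np.array([[0.1, -2.0, 0.3],
              [1.5, -0.2, 0.05]])
z = project_to_sparsity(w, 2)   # only the two largest magnitudes survive
```

This per-layer projection is what makes the constraint tractable inside the ADMM loop: the loss term is handled by ordinary SGD, while feasibility (the exact sparsity budget) is restored by this closed-form step.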
Xuehai Qian is an assistant professor at the University of Southern California. His research interests include domain-specific systems and architecture with a focus on machine learning and graph processing, performance tuning and resource management of cloud systems, and parallel computer architecture. He received his Ph.D. from the University of Illinois at Urbana-Champaign and was a postdoc at UC Berkeley. He is the recipient of the W.J. Poppelbaum Memorial Award at UIUC, NSF CRII and CAREER Awards, and the inaugural ACSIC (American Chinese Scholar In Computing) Rising Star Award.
16:00 - 16:50
Balancing Efficiency and Flexibility for DNN Acceleration
Vivienne Sze, MIT
There has been a significant amount of research on the topic of efficient processing of DNNs, from the design of efficient DNN algorithms to the design of efficient DNN accelerators. The wide range of techniques used for efficient DNN algorithm design has resulted in a more diverse set of DNNs; this creates a new challenge for DNN accelerators, as they now need to be sufficiently flexible to support a wide range of DNN workloads efficiently. However, many existing DNN accelerators rely on properties of the DNN that cannot be guaranteed (e.g., fixed weight sparsity, large numbers of channels, large batch sizes). In this talk, we will briefly discuss recent techniques that have been used to design efficient DNN algorithms and important properties to consider when applying these techniques. We will then present a systematic approach called Eyexam to identify the sources of inefficiency in DNN accelerator designs for different DNN workloads. Finally, we will introduce a flexible accelerator called Eyeriss v2 that is computationally efficient across a wide range of diverse DNNs.
Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video processing/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI on the JCT-VC committee of the ITU-T and ISO/IEC standards bodies during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).
Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degree from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.
For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/