EMC^2: EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Sunday, June 23, 2019 Room: 102B (West Building) Full Day

Program
CFP

In the Eleventh edition of EMC2 workshop, we plan to facilitate conversation about the sustainability of large-scale AI computing systems being developed to meet the ever-increasing demands of generative AI. This involves discussions spanning multiple interrelated areas. First, we continue to serve as the leading forums for discussing the energy-efficiency aspect of GenAI workloads which directly impact the overall viability and economic value of AI technology. Second, we reassess the scaling laws of AI with the prevalence of agentic, multi-modal, and reasoning-based models in conjunction with novel techniques such as a highly sparse expert architecture and disaggregated computation. Finally, we discuss sustainable and high-performance computing paradigms towards efficient datacenters and hybrid computing models that can cater to the exponential growth in model sizes, application areas, anduser base. This would allow us to explore ideas to build the hardware, software, systems, and scaling infrastructure, as well as model architectures that make AI technology even more prevalent and accessible.

The goal of this Workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools and techniques to improve the energy efficiency of MLLMs as it is practised today and would evolve in the next decade. We envision that only through close collaboration between industry and the academia we will be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

Keynotes, invited talks and discussion panels by leading researchers from industry and academia
Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to the ones listed below:

Neural network architectures for resource constrained applications.
Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs.
Power and performance efficient memory architectures suited for neural networks.
Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration.
Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization.
Performance potential, limit studies, bottleneck analysis, profiling, and synthesis of workloads.
Explorations and architctures aimed to promote sustainable computing.
Simulation and emulation techniques, frameworks, tools, and platforms for machine learning.
Optimizations to improve performance of training techniques including on-device and large-scale learning.
Load balancing and efficient task distribution, communication and computation overlapping for optimal performance.
Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems.
Efficient deployment strategies for edge and distributed environments.
Model compression and optimization techniques that preserve reasoning and problem-solving capabilities.
Architectures and frameworks for multi-agent systems and retrieval-augmented generation (RAG) pipelines.
Systems-level approaches for scaling future foundation models (e.g., Llama 4, GPT-5 and beyond).

09:00 - 09:10

Welcome

Introduction and Opening Remarks

09:10 - 10:00

Invited Talk

Presentation

Efficient Deep Learning: Quantizing Models Without Using Re-training

Harris Teague, Qualcomm AI Research

In this talk we’ll cover techniques to do post-training quantization that can improve model accuracy for 8-bit quantization significantly. These techniques are especially useful when training/fine-tuning is not possible, a case that arises very frequently in commercial applications. No training pipeline, optimized hyperparameters, nor full training datasets are needed. We show the effectiveness of these techniques for popular models used for inference on resource constrained devices.

Harris Teague is Principal Engineer/Manager at Qualcomm and leads the Platform Systems group at Qualcomm AI Research. Prior to focusing on machine learning, he worked on a range of wireless projects including Bluetooth, WCDMA, ultra-wideband, and an OFDMA-based precursor to LTE called UMB. Harris is an inventor on over 60 granted US patents. He holds a BS degree in Aerospace Engineering from Virginia Tech, and MS and PhD in Aerospace from Stanford University.

10:00 - 10:50

Invited Talk

Presentation

Machine Learning at Scale

Carole-Jean Wu, Arizona State University and Facebook

Machine learning systems are being widely deployed in production datacenter infrastructure and over billions of edge devices. This talk seeks to address key system design challenges when scaling machine learning solutions to billions of people. What are key similarities and differences between cloud and edge infrastructure? The talk will conclude with open system research directions for deploying machine learning at scale.

Carole-Jean Wu is a Research Scientist at Facebook’s AI Infrastructure Research. She is also a tenured Associate Professor of CSE in Arizona State University. Carole-Jean’s research focuses in Computer and System Architectures. More recently, her research has pivoted into designing systems for machine learning. She is the leading author of “Machine Learning at Facebook: Understanding Inference at the Edge” that presents unique design challenges faced when deploying ML solutions at scale to the edge, from over billions of smartphones to Facebook’s virtual reality platforms. Carole-Jean received her M.A. and Ph.D. from Princeton and B.Sc. from Cornell.

10:50 - 11:10

Break

Short Break

11:10 - 12:00

Invited Talk

Presentation

Enabling Continuous Learning through Synaptic Plasticity in Hardware

Tushar Krishna, Georgia Institue of Technology

Ever since modern computers were invented, the dream of creating artificial intelligence (AI) has captivated humanity. We are fortunate to live in an era when, thanks to deep learning (DL), computer programs have paralleled, and in many cases even surpassed human level accuracy in tasks like visual perception and speech synthesis. However, we are still far away from realizing general-purpose AI. The problem lies in the fact that the development of supervised learning based DL solutions today is mostly open loop. A typical DL model is created by hand-tuning the deep neural network (DNN) topology by a team of experts over multiple iterations, followed by training over petabytes of labeled data. Once trained, the DNN provides high accuracy for the task at hand; if the task changes, however, the DNN model needs to be re-designed and re-trained before it can be deployed. A general-purpose AI system, in contrast, needs to have the ability to constantly interact with the environment and learn by adding and removing connections within the DNN autonomously, just like our brain does. This is known as synaptic plasticity.

In this talk we present our research efforts towards enabling general-purpose AI. First, we present GeneSys (MICRO 2018), a HW-SW prototype of a closed loop learning system for continuously evolving the structure and weights of a DNN for the task at hand using genetic algorithms, providing 100-10000x higher performance and energy-efficiency over state-of-the-art embedded and desktop CPU and GPU systems. Next, we present a DNN accelerator substrate called MAERI (ASPLOS 2018), built using light-weight, non-blocking, reconfigurable interconnects, that supports efficient mapping of regular and irregular DNNs with arbitrary dataflows, providing ~100% utilization of all compute units, resulting in 3X speedup and energy-efficiency over state-of-the-art DNN accelerators.

Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering at Georgia Tech. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), a M.S.E in Electrical Engineering from Princeton University (2009), and a B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a post-doctoral researcher at Intel, Massachusetts.

Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC) and deep learning accelerators - with a focus on optimizing data movement in modern computing systems. He has 42 publications in leading conferences and journals, which have amassed over 5000 citations to date. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and two have won best paper awards. He has received the National Science Foundation (NSF) CRII award, a Google Faculty Award, and a Facebook Faculty Award.

12:00 - 13:30

Lunch Break

Lunch

13:30 - 15:00

Paper Session #1

Paper

Run-Time Efficient RNN Compression for Inference on Edge Device

Urmish Thakker, Dibakar Gope, Jesse Beu, Ganesh Dasika and Matthew Mattina

ARM ML Research

Paper

PyRTLMatrix: an Object-Oriented Hardware Design Pattern for Prototyping ML Accelerators

Dawit Aboye, Dylan Kupsh, Maggie Lim, Jacqueline Mai, Deeksha Dangwal, Diba Mirza and Timothy Sherwood

UC Santa Barbara

Paper

Accelerated CNN Training Through Gradient Approximation

Ziheng Wang, Sree Harsha Nelaturu and Saman Amarasinghe

Massachusetts Institute of Technology and SRM Institute of Science and Technology

15:00 - 15:10

Break

Short Break

15:10 - 16:00

Invited Talk

Structured and Systematic Approach for Energy Efficient DNN Acceleration

Xuehai Qian, University of Southern California

Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. In this talk, I first present our principled approaches to performing model compression and acceleration using structured matrices. Compared to unstructured pruning, our methods (CirCNN and PermDNN) achieve significant model storage and computation reduction while maintaining accuracy. Thanks to the regular structure, the accelerators can achieve better performance and energy efficiency compared to the state-of-the-art designs. I will also present a unified solution framework for both unstructured and structured pruning and quantization based on Alternating Direction Method of Multipliers (ADMM). It ensures high solution quality while guaranteeing solution feasibility, consistently outperforming previous results. Finally, I will present HyPar, a systematic approach to search the best tensor partition for a given multi-layer DNN with an accelerator array. It optimizes performance and energy efficiency by reducing data movement between accelerators. We believe that the structured and systematic approach and algorithm/hardware co-design are crucial for designing energy efficient DNN accelerators.

Xuehai Qian is an assistant professor at University of Southern California. His research interests include domain-specific system and architecture with focuses on machine learning and graph processing, performance tuning and resource management of Cloud systems, and parallel computer architecture. He got his Ph.D from University of Illinois Urbana Champaign and was a postdoc at UC Berkeley. He is the recipient of W.J Poppelbaum Memorial Award at UIUC, NSF CRII and CAREER Award, and the inaugural ACSIC (American Chinese Scholar In Computing) Rising Star Award.

16:00 - 16:50

Invited Talk

Presentation

Balancing Efficiency and Flexibility for DNN Acceleration

Vivienne Sze, MIT

There has been a significant amount of research on the topic of efficient processing of DNNs, from the design of efficient DNN algorithms to the design of efficient DNN accelerators. The wide range techniques used for efficient DNN algorithm design has resulted in a more diverse set of DNNs; this creates a new challenge for the DNN accelerators, as they now need to be sufficiently flexible to support a wide range of DNN workloads efficiently. However, many of the existing DNN accelerators rely on certain properties of the DNN which cannot be guaranteed (e.g., fixed weight sparsity, large number of channels, large batch size). In this talk, we will briefly discuss recent techniques that have been used to design efficient DNN algorithms and important properties to consider when applying these techniques. We will then present a systematic approach called Eyexam to identify the sources of inefficiencies in DNN accelerator designs for different DNN workloads. Finally, we will introduce a flexible accelerator called Eyeriss v2 that is computationally efficient across a wide range of diverse DNNs.

Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video process/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI in the JCT-VC committee of ITU-T and ISO/IEC standards body during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book entitled “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).

Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degree from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.

For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/

Energy Efficient Machine Learning and Cognitive Computing

4th Edition

Co-located with ISCA 2019 in Phoenix, AZ

description Workshop Objective

chat Call for Papers

format_list_bulleted Topics for the Workshop

We will follow that same formatting guidelines and duplicate submission policies as ASPLOS.

09:00 - 09:10

Welcome

Introduction and Opening Remarks

09:10 - 10:00

Invited Talk

Efficient Deep Learning: Quantizing Models Without Using Re-training

Harris Teague, Qualcomm AI Research

10:00 - 10:50

Invited Talk

Machine Learning at Scale

Carole-Jean Wu, Arizona State University and Facebook link

10:50 - 11:10

Break

Short Break

11:10 - 12:00

Invited Talk

Enabling Continuous Learning through Synaptic Plasticity in Hardware

Tushar Krishna, Georgia Institue of Technology link

12:00 - 13:30

Lunch Break

Lunch

13:30 - 15:00

Paper Session #1

Run-Time Efficient RNN Compression for Inference on Edge Device

Urmish Thakker, Dibakar Gope, Jesse Beu, Ganesh Dasika and Matthew Mattina

ARM ML Research

PyRTLMatrix: an Object-Oriented Hardware Design Pattern for Prototyping ML Accelerators

Dawit Aboye, Dylan Kupsh, Maggie Lim, Jacqueline Mai, Deeksha Dangwal, Diba Mirza and Timothy Sherwood

UC Santa Barbara

Accelerated CNN Training Through Gradient Approximation

Ziheng Wang, Sree Harsha Nelaturu and Saman Amarasinghe

Massachusetts Institute of Technology and SRM Institute of Science and Technology

15:00 - 15:10

Break

Short Break

15:10 - 16:00

Invited Talk

Structured and Systematic Approach for Energy Efficient DNN Acceleration

Xuehai Qian, University of Southern California link

16:00 - 16:50

Invited Talk

Balancing Efficiency and Flexibility for DNN Acceleration

Vivienne Sze, MIT link

16:50 - 17:00

Close

Closing Remarks

Workshop Objective

Call for Papers

Topics for the Workshop

Carole-Jean Wu, Arizona State University and Facebook

Tushar Krishna, Georgia Institue of Technology

Xuehai Qian, University of Southern California

Vivienne Sze, MIT