The 7th EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Saturday, August 28, 2021
Virtual (from San Jose, California)

Workshop Objective

Artificial intelligence (AI) continues to proliferate into everyday life, aided by advances in automation, algorithms, and innovative hardware and software technologies. With the growing prominence of AI, the Multimodal Large Language Model (MLLM) has emerged as a new foundational model architecture that pairs a powerful Large Language Model (LLM) with efficient handling of multimodal tasks. MLLMs achieve surprising capabilities across text and images, suggesting a path toward Artificial General Intelligence (AGI). With these new frontiers in MLLM execution come new challenges across the software/hardware co-design ecosystem. There is a growing realization of the energy cost of developing and deploying MLLMs: training and inference for the most successful models have become exceedingly power-hungry, often dwarfing the energy needs of entire households for years. At the edge, applications that use these models for inference are ubiquitous in cell phones, appliances, smart sensors, vehicles, and even wildlife monitors, where efficiency is paramount for practical reasons.

Call for Papers

The goal of this Workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of MLLMs as it is practised today and as it will evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates an active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on the latest solutions, including works-in-progress, to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent quality assessment of the papers.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems

08:00 - 08:15
Welcome

Welcome and Opening Remarks

Kushal Datta, Nvidia Corporation

08:15 - 09:00
Invited Talk

Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

Shigang Li, Department of Computer Science, ETH Zurich

Training large deep learning models at scale is very challenging. We propose Chimera, a novel pipeline parallelism scheme that combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, making it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera also has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves training throughput by 1.16x-2.34x over state-of-the-art synchronous and asynchronous pipeline approaches.
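
To make the bubble arithmetic concrete, here is a minimal, hypothetical Python sketch; the bubble-count formulas are simplified illustrations in the spirit of the talk's analysis, not an exact reproduction of Chimera's schedule.

```python
# Minimal, illustrative bubble arithmetic (simplified assumption, not the
# exact Chimera schedule). D pipeline stages; bubbles are idle time slots.

def bubbles_synchronous_1f1b(num_stages: int) -> int:
    # GPipe/1F1B-style synchronous schedules idle roughly (D - 1) slots while
    # the pipeline fills and another (D - 1) while it drains.
    return 2 * (num_stages - 1)

def bubbles_bidirectional(num_stages: int) -> int:
    # Two pipelines running in opposite directions fill each other's idle
    # slots, leaving roughly (D - 2) bubbles per iteration.
    return num_stages - 2

for d in (4, 8, 16, 32):
    base, bi = bubbles_synchronous_1f1b(d), bubbles_bidirectional(d)
    print(f"stages={d:2d}  1F1B={base:3d}  bidirectional={bi:3d}  "
          f"reduction={1 - bi / base:.0%}")
```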

Dr. Shigang Li is currently a Postdoctoral Researcher in the Department of Computer Science at ETH Zurich. His research interests include parallel computing, high performance computing, and parallel and distributed deep learning systems. He received the Bachelor's degree in Computer Science and the Ph.D. degree in Computer Architecture from the University of Science and Technology Beijing, in 2009 and 2014, respectively. He was a joint Ph.D. student in the Department of Computer Science at the University of Illinois at Urbana-Champaign from Sep. 2011 to Sep. 2013, and an Assistant Professor in the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences from June 2014 to Aug. 2018. He received Best Paper Nominations at SC'21, PPoPP'20, and HPDC'13, and an Outstanding Paper Award at MLSys'21.

09:00 - 10:00
Keynote

Cognitive AI - a blueprint for sustainability, efficiency and higher intelligence

Gadi Singer, Intel Labs

Will AI move beyond statistical correlation and mapping towards a true comprehension of the world? Is deep structured knowledge a path to higher machine intelligence? In his keynote talk, Gadi will discuss the principles of Cognitive AI and its potential in establishing a categorically new level of reasoning, explainability, and adaptability. He will share his vision for AI architecture of the future and introduce Thrill-K, a knowledge layering framework that promises to substantially improve energy efficiency of intelligent systems.

One technology after another, Gadi has been pushing the leading edge of computing for the past four decades. He has made key contributions to Intel's computer architectures, hardware and software development, AI technologies, and more. Currently, Gadi is Vice President at Intel Labs, leading Cognitive Computing Research and the development of third-wave AI capabilities.

10:00 - 10:45
Invited Talk

Accelerating Transformers: How Hugging Face Delivers 100x Speedup to Industrialize State of the Art Machine Learning

Jeff Boudier, Hugging Face

From Research to Industrialization: learn how Hugging Face is putting Transformers to work! Transformers conquered NLP and are now eating away at all Machine Learning domains. These complex architectures and very large models are creating new engineering challenges for companies to apply them in efficient, scalable production settings. In this session, Jeff will detail how Hugging Face is developing new state-of-the-art efficiency techniques and collaborating with hardware and open-source partners, including ONNX Runtime (ORT), Intel, and NVIDIA, to deliver up to 100x speedups to its customers.
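
As a rough, hypothetical illustration of the kind of deployment pipeline such speedups target (not Hugging Face's actual optimization stack), the sketch below exports a Transformers model to ONNX and runs it with ONNX Runtime; the model name, opset, and export settings are assumptions for demonstration.

```python
# Hypothetical sketch: exporting a Transformers model to ONNX and running it
# with ONNX Runtime (ORT). Model name, opset, and settings are illustrative.
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumption
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.config.return_dict = False  # export plain tuples instead of ModelOutput
model.eval()

inputs = tokenizer("Energy-efficient ML is paramount.", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
)

# Run the exported graph with ONNX Runtime.
session = ort.InferenceSession("model.onnx")
logits = session.run(None, {"input_ids": inputs["input_ids"].numpy(),
                            "attention_mask": inputs["attention_mask"].numpy()})[0]
print(logits.shape)
```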

Jeff leads Product and Growth at Hugging Face, creator of Transformers, working to democratize state-of-the-art Machine Learning. He was previously in Corporate and Business Development at GoPro and co-founded Stupeflix (acquired by GoPro).

10:45 - 11:00
Break

Break

11:00 - 11:30
Sponsored Talk

Challenges in accelerating transformers for efficient inference of giant NLP models at scale

Sudeep Bhoja, d-Matrix.ai

Sponsored Talk: Introduction to d-Matrix.ai

Sudeep Bhoja is the co-founder and CTO of d-Matrix, focused on next-generation efficient AI inference accelerators using in-memory computing. Previously, he was Chief Technology Officer of the Networking Business Unit at Inphi/Marvell. He brings more than 20 years of experience in defining and architecting groundbreaking products in the semiconductor industry. Prior to Inphi, he was Technical Director in the Infrastructure and Networking Group at Broadcom and played an instrumental role in the design, development, and commercialization of 10G DSP physical layer products for Ethernet applications. He was also Chief Architect at a startup, Big Bear Networks, a mixed-signal networking IC and optical transceiver company. Sudeep also held R&D positions at Lucent Technologies and Texas Instruments, working on Digital Signal Processors. He is the named inventor of over 40 pending and approved patents.

11:30 - 12:15
Invited Talk

HPC and AI

CJ Newburn, Nvidia Corporation

Energy efficiency is at the heart of powering the future. It begins with accelerated computing, but it is applied in system design all the way up to the data center management level. In this talk, we explore what it takes to be power efficient, scalable, partitionable, and programmable. We highlight efficiency at the data center level, and advocate improvements over time in tools that offer guidance to users and automate the process of more efficient and secure performance.

Chris J. Newburn, who goes by CJ, is a Principal Architect in the Compute Software Group at NVIDIA, where he leads HPC strategy and the software product roadmap, with a special focus on systems and programming models for scale. CJ is the architect of Magnum IO and the co-architect of GPUDirect Storage, heads the Summit Dev2Dev Series with the Department of Energy, and leads the HPC Containers Advisory Council. CJ has contributed to both hardware and software technologies for the past 20 years and has over 100 patents. He’s a community builder with a passion for extending the core capabilities of hardware and software platforms from HPC into AI, data science, and visualization. Before getting his Ph.D. at Carnegie Mellon University, CJ did stints at a couple of startups, working on a voice recognizer and a VLIW supercomputer. He’s delighted to have worked on volume products that his mom used.

12:15 - 12:30
Lunch Break

Lunch Break

12:30 - 13:00
Invited Talk

Machine Learning for Near-Term Computing

Damian Podareanu, SURF.NL

The place-and-route stage of digital design is in fact a combinatorial optimization problem that has been solved for many devices in several ways over the past decades. Applications of this theoretical problem have also been showcased on quantum simulators and devices. We will go over the specific obstacles encountered in state preparation, mapping high-level code to a quantum device, a classical reinforcement learning solution for placing SWAP gates, as well as ongoing efforts to adopt a quantum exploration policy. We will also discuss the importance of scheduling in a hybrid, near-term execution environment, in which classical HPC resources should be used together with quantum devices.
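
For readers unfamiliar with the SWAP-placement problem mentioned above, the following hypothetical sketch shows a naive greedy baseline that routes a two-qubit gate on a linear qubit chain; a reinforcement learning policy would learn where to place such SWAPs instead of following this fixed rule.

```python
# Hypothetical baseline (not the speaker's method): greedily insert SWAP gates
# so a two-qubit gate can execute on a device whose qubits form a linear chain
# and only neighbouring physical qubits may interact.

def route_two_qubit_gate(mapping, q0, q1):
    """mapping[logical_qubit] = physical slot on the chain; returns SWAPs."""
    mapping = list(mapping)
    swaps = []
    while abs(mapping[q0] - mapping[q1]) > 1:
        step = 1 if mapping[q1] > mapping[q0] else -1
        other = mapping.index(mapping[q0] + step)  # logical qubit next door
        mapping[q0], mapping[other] = mapping[other], mapping[q0]
        swaps.append((q0, other))
    return swaps, mapping

# Four logical qubits initially mapped in order; route a CNOT between 0 and 3.
swaps, final_mapping = route_two_qubit_gate([0, 1, 2, 3], q0=0, q1=3)
print(swaps)          # [(0, 1), (0, 2)] -- two SWAPs make the qubits adjacent
print(final_mapping)  # [2, 0, 1, 3]
```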

Damian Podareanu is a Senior Machine Learning Consultant at SURF.NL and has contributed to multiple publications on accelerating deep learning training on CPU and GPU platforms.

13:00 - 13:45
Invited Talk

Efficient Neural Architecture Search at Scale

Mi Zhang, Michigan State University

Neural architecture search (NAS) has attracted significant attention in the machine learning community in recent years. Although architectures found by NAS achieve superior performance to manually designed neural architectures, the key bottleneck of neural architecture search is its extremely high search cost. In this talk, I want to shed light on the importance of architecture encoding in enhancing the efficiency of NAS. In particular, I will talk about two works on architecture encoding. The first work, dubbed arch2vec, proposes an unsupervised learning-based architecture encoding method that decouples architecture representation learning and architecture search into two separate processes. The second work, dubbed CATE, proposes to encode computations instead of structures of neural architectures via a transformer-based encoder. Our experimental results on widely used benchmarks demonstrate the effectiveness, scalability, and generalization ability of our proposed methods.
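
As a loose illustration of what decoupled architecture encoding can look like (a sketch under simplifying assumptions, not the arch2vec or CATE implementation), the snippet below embeds a NAS cell, described by an adjacency matrix and per-node operations, with a small unsupervised autoencoder whose latent space a search procedure could then explore.

```python
# Sketch under simplifying assumptions (not arch2vec/CATE): encode a NAS cell
# as a flattened adjacency matrix plus one-hot operation labels and learn an
# unsupervised embedding; search would then operate in this latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_NODES, NUM_OPS = 7, 5                     # cell size and op vocabulary
INPUT_DIM = NUM_NODES * NUM_NODES + NUM_NODES * NUM_OPS

class ArchAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(INPUT_DIM, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, INPUT_DIM))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def encode_cell(adjacency, op_labels):
    """adjacency: (N, N) 0/1 tensor; op_labels: (N,) integer op ids."""
    ops = F.one_hot(op_labels, NUM_OPS).float()
    return torch.cat([adjacency.flatten(), ops.flatten()])

model = ArchAutoencoder()
x = encode_cell(torch.randint(0, 2, (NUM_NODES, NUM_NODES)).float(),
                torch.randint(0, NUM_OPS, (NUM_NODES,))).unsqueeze(0)
recon, z = model(x)
loss = F.mse_loss(recon, x)     # reconstruction objective; no accuracy labels
```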

Dr. Mi Zhang is an Associate Professor and the Director of the Machine Learning Systems Lab at Michigan State University. He received his Ph.D. from University of Southern California and B.S. from Peking University. Before joining MSU, he was a postdoctoral scholar at Cornell University. His research lies at the intersection of systems and machine learning, spanning areas including Efficient Deep Learning, Automated Machine Learning (AutoML), Federated Learning, Systems for Machine Learning, and Machine Learning for Systems. He is the 4th Place Winner of the 2019 Google MicroNet Challenge, the Third Place Winner of the 2017 NSF Hearables Challenge, and the champion of the 2016 NIH Pill Image Recognition Challenge. He is the recipient of six best paper awards and nominations. He is also the recipient of the Facebook Faculty Research Award, Amazon Machine Learning Research Award, and MSU Innovation of the Year Award.

13:45 - 14:30
Invited Talk

Towards Best Possible Deep Learning Acceleration on the Edge – A Compression-Compilation Co-Design Framework

Yanzhi Wang, Northeastern University

Mobile and embedded computing devices have become key carriers of deep learning to facilitate the widespread adoption of machine intelligence. However, there is a widely recognized challenge in achieving real-time DNN inference on edge devices, due to the limited computation and storage resources of such devices. Model compression of DNNs, including weight pruning and weight quantization, has been investigated to overcome this challenge. However, current work on DNN compression suffers from the limitation that accuracy and hardware performance are somewhat conflicting goals that are difficult to satisfy simultaneously.

We present our recent work CoCoPIE, representing Compression-Compilation co-design, to overcome this limitation and move towards the best possible DNN acceleration on edge devices. We propose novel fine-grained structured pruning schemes, including pattern-based pruning and block-based pruning. With compiler support, they simultaneously achieve high hardware performance (similar to filter/channel pruning) while maintaining zero accuracy loss, which is beyond the capability of prior work. Similarly, we present a novel quantization scheme that achieves ultra-high hardware performance close to 2-bit weight quantization, with almost no accuracy loss. Through the CoCoPIE framework, we are able to achieve real-time on-device execution of a number of DNN tasks, including object detection, pose estimation, activity detection, and speech recognition, using just an off-the-shelf mobile device, with up to 180X speedup compared with prior work. Our comprehensive demonstrations are available at https://www.youtube.com/channel/UCCKVDtg2eheRTEuqIJ5cD8A
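
The following hypothetical sketch illustrates the flavor of pattern-based pruning described above (the pattern set and selection rule are illustrative assumptions, not the CoCoPIE implementation): each 3x3 kernel keeps only the positions allowed by one of a few predefined patterns, yielding a regular sparsity structure that a compiler can exploit.

```python
# Illustrative pattern-based pruning (assumed pattern set and selection rule,
# not the CoCoPIE code): each 3x3 kernel keeps only 4 positions chosen from a
# small library of patterns, so the sparsity is regular and compiler-friendly.
import torch

PATTERNS = [torch.tensor(p, dtype=torch.float32) for p in (
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
)]

def pattern_prune(weight: torch.Tensor) -> torch.Tensor:
    """weight: (out_ch, in_ch, 3, 3) conv weight; returns a pruned copy."""
    pruned = weight.clone()
    for o in range(weight.shape[0]):
        for i in range(weight.shape[1]):
            kernel = weight[o, i]
            # Keep the pattern that preserves the most weight magnitude.
            scores = torch.stack([(kernel.abs() * p).sum() for p in PATTERNS])
            pruned[o, i] = kernel * PATTERNS[int(scores.argmax())]
    return pruned

w = torch.randn(8, 4, 3, 3)
w_pruned = pattern_prune(w)
print((w_pruned == 0).float().mean())   # exactly 5/9 of the weights are zeroed
```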

Dr. Yanzhi Wang is currently an assistant professor in the Department of ECE at Northeastern University, Boston, MA. He received the B.S. degree from Tsinghua University in 2009 and the Ph.D. degree from the University of Southern California in 2014. His research interests focus on model compression and platform-specific acceleration of deep learning applications. His research has maintained the highest model compression rates on representative DNNs since 09/2018, and his work on AQFP superconducting-based DNN acceleration achieves by far the highest energy efficiency among all hardware devices. His recent research achievement, CoCoPIE, can achieve real-time performance on almost all deep learning applications using off-the-shelf mobile devices, outperforming competing frameworks by up to 180X.

His work has been published broadly in top conference and journal venues (e.g., ASPLOS, ISCA, MICRO, HPCA, PLDI, DAC, ICCAD, ICS, PACT, Mobicom, ISSCC, AAAI, ICML, CVPR, ICLR, IJCAI, ECCV, ICDM, ACM MM, FPGA, LCTES, CCS, VLDB, ICDCS, Infocom, C-ACM, JSSC, TComputer, TCAS-I, TCAD, JSAC, TNNLS, etc.), and has been cited over 9,400 times. He has received six Best Paper and Top Paper Awards, 11 additional Best Paper Nominations, and four Popular Paper Awards. He has received the U.S. Army Young Investigator Program Award (YIP), the Massachusetts Acorn Innovation Award, the IEEE TCSDM Early Career Award, the Martin Essigman Excellence in Teaching Award, the Ming Hsieh Scholar Award, and other research awards from Google, MathWorks, and others. Four of his former Ph.D./postdoc students became tenure-track faculty at the University of Connecticut, Clemson University, Texas A&M University, Corpus Christi, and Cleveland State University.

14:30 - 15:15
Invited Talk

On-Device, Always-On AI: Challenges and Opportunities

Debajyoti Pal, CTO, Ambient Scientific Inc.

AI has now touched most of our lives. Voice assistants and smart speakers such as Amazon Alexa, Google Home, or Apple Siri are used by most of us many times daily. While these are consumer devices at home for personal use, they rely on AI compute done in the cloud or in the data center. However, issues around privacy, 24/7 connectivity, latency, etc. are driving the recent push towards on-device AI. While many of today's edge devices are plugged into the wall or use rechargeable batteries that are charged daily, there are a large number of emerging applications in speech/audio, vision, sensor fusion, etc. where AI must run without internet connectivity on portable battery-operated devices for a long time without recharging. These devices need a very small form factor and must be extremely energy efficient, since in many cases the AI engine must always be on. On one hand this poses technical challenges and a need for much innovation; on the other hand it opens up new opportunities and business growth. In this talk we will focus on the challenges, possible ways to solve them, and briefly discuss a few possible opportunities.

Dr. Debajyoti (Debu) Pal currently serves as the CTO of Ambient Scientific, the Always-On AI processor company. He is a renowned expert in AI/ML algorithms and architectures, with deep knowledge of training and inference, having developed algorithms, architectures, silicon, and systems for Machine Learning/Deep Learning since 2010. A 30-year veteran of the communications, networking, and semiconductor industry, he is a well-respected technologist and has been an IEEE Fellow since 2002 for seminal contributions to Digital Communications. Debu is a successful entrepreneur and a highly regarded executive in the broadband access industry, known for his leadership in technology, product development, and worldwide commercialization of chipsets for broadband access over copper. He co-founded (with Prof. Thomas Kailath of Stanford University) Excess Bandwidth Corp. (acquired by Virata Corp.), where he was the founding CEO, CTO, and VP of Engineering. He was also a founding member of the well-known startup Amati Communications Corp., which pioneered ADSL and developed ITU standards; Amati was acquired by Texas Instruments.

15:15 - 16:00
Invited Talk

Memory Bandwidth Optimizations -- from Data Structure to Memory Array

Mahdi Nazm Bojnordi, Qualcomm Inc.

As the demand for video and machine learning workloads increases, the memory bandwidth wall and data movement problems escalate in all forms of computing systems, from edge devices and mobile nodes to data centers. Despite recent technological enhancements in memory systems, emerging data-intensive applications (e.g., point clouds and ML workloads) are still bottlenecked by the limited bandwidth and energy efficiency of today's memory hierarchies. This talk examines a few examples of our architectural solutions for efficient data processing. First, we introduce a novel data structure based on geometric arrays (G-Arrays) that allows fast and energy-efficient point cloud processing in a compressed format. In addition to a 50% reduction in memory footprint, the proposed G-Arrays format achieves a significant speedup over the state-of-the-art PCL and MPEG libraries for point operations such as kNN search, point merge, and projection. Next, we examine a novel mechanism for index-independent data ranking in memory. The proposed algorithm is capable of eliminating all pairwise comparisons for data ranking, thereby reducing memory bandwidth significantly. When applied to off-chip non-volatile memory arrays, the proposed mechanism achieves two orders of magnitude performance gains and a 90% energy reduction compared to existing methods.

Dr. Mahdi Nazm Bojnordi is an Assistant Professor of the School of Computing at the University of Utah. He received his Ph.D. degree from the University of Rochester, Rochester, NY, USA, in 2016 in electrical and computer engineering. Currently, he leads the Unconventional Computer Architecture Laboratory (UCAL) at the School of Computing, University of Utah. Recently, his research interests have included energy-efficient architectures, low-power memory systems, and the application of emerging memory technologies to computer systems. Professor Bojnordi’s research has been recognized by two IEEE Micro Top Picks Awards, an HPCA 2016 Distinguished Paper Award, and a Samsung Best Paper Award.

16:00 - 16:45
Invited Talk

Project Fiddle: Fast and Efficient Infrastructure for Distributed Deep Learning

Amar Phanishayee, Microsoft Research

The goal of Project Fiddle is to build efficient systems infrastructure for fast distributed DNN training; we aim to support 100x more efficient training. To achieve this goal, we take a broad view of training: from a single GPU, to multiple GPUs on a machine, all the way to training on large clusters. Our innovations cut across the systems stack: memory management, structuring parallel computation across GPUs and machines, speeding up communication between accelerators and across machines, optimizing the data ingest and output pipelines, and schedulers for DNN training on large multi-tenant clusters. In this talk, I'll give you an overview of Project Fiddle and focus on a couple of recent ideas to speed up data loading (input) and checkpointing (output).
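
As a minimal sketch of the checkpointing idea mentioned at the end (an assumed design for illustration, not Project Fiddle's implementation), the snippet below takes a fast in-memory snapshot and writes it to disk on a background thread so training is not stalled by I/O.

```python
# Assumed design for illustration (not Project Fiddle's code): snapshot model
# state to host memory quickly, then persist it on a background thread so the
# training loop is not blocked by slow disk I/O.
import threading
import torch

def async_checkpoint(model, optimizer, step, path):
    # Fast phase: copy tensors off the GPU; this is the only pause training sees.
    snapshot = {
        "step": step,
        "model": {k: v.detach().cpu().clone()
                  for k, v in model.state_dict().items()},
        # A complete version would also move optimizer tensors to the CPU.
        "optimizer": optimizer.state_dict(),
    }
    # Slow phase: write to disk concurrently with continued training.
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer   # join() before exit to guarantee the checkpoint is on disk

# Illustrative use inside a training loop:
# if step % 1000 == 0:
#     pending = async_checkpoint(model, optimizer, step, f"ckpt_{step}.pt")
```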

Dr. Amar Phanishayee is a Senior Principal Researcher at Microsoft Research in Redmond. The goal of his research is to enable the creation of high-performance and efficient networked systems for large-scale data-intensive computing. His research efforts center around radically rethinking the design of datacenter-based systems: from infrastructure for compute, storage, and networking to distributed systems and protocols that are scalable, robust to failures, and use resources efficiently. His recent focus has been on leading Project Fiddle at MSR. Amar received his Ph.D. in Computer Science from Carnegie Mellon University in 2012. His research work has been recognized by awards such as the ACM SOSP Best Paper Award and Carnegie Mellon's Allen Newell Award for Research Excellence.

16:45 - 17:30
Panel

Energy Efficiency at Cloud Scale

As AI unlocks new frontiers in automation, science and human understanding, the complexity of AI models continues to grow. So does the need for computational power to train and deploy these models. While much attention has been given to improve energy efficiency in low-power edge devices for inference, the energy cost of AI is equally relevant, if not more, in large scale AI datacenters. With firsthand insights into the world of scalable compute and efficient AI in the cloud, our distinguished panelists discuss the technical and economic aspects of cloud-scale AI and the opportunities and challenges to lower the energy footprint of AI.

Moderator: Satyam Srivastava

Debu Pal, CTO, Ambient Scientific

Geeta Chauhan, AI/PyTorch Partner Engineering Head, Facebook

Lizy John, University of Texas, Austin

Sumit Gupta, Sr. Director ML Infrastructure, Google

17:30 - 17:40
Close

Workshop Sponsors

d-Matrix