EMC2: Energy Efficient Machine Learning and Cognitive Computing

Sunday, February 17, 2019 Room: Dogwood Room Full Day

In the eleventh edition of the EMC2 workshop, we plan to facilitate conversation about the sustainability of large-scale AI computing systems being developed to meet the ever-increasing demands of generative AI. This involves discussions spanning multiple interrelated areas. First, we continue to serve as a leading forum for discussing the energy-efficiency aspect of GenAI workloads, which directly impacts the overall viability and economic value of AI technology. Second, we reassess the scaling laws of AI given the prevalence of agentic, multi-modal, and reasoning-based models in conjunction with novel techniques such as highly sparse expert architectures and disaggregated computation. Finally, we discuss sustainable and high-performance computing paradigms towards efficient datacenters and hybrid computing models that can cater to the exponential growth in model sizes, application areas, and user base. This would allow us to explore ideas to build the hardware, software, systems, and scaling infrastructure, as well as model architectures, that make AI technology even more prevalent and accessible.

The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of MLLMs as they are practiced today and as they will evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

Keynotes, invited talks and discussion panels by leading researchers from industry and academia
Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed below:

Neural network architectures for resource constrained applications.
Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs.
Power and performance efficient memory architectures suited for neural networks.
Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration.
Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization.
Performance potential, limit studies, bottleneck analysis, profiling, and synthesis of workloads.
Explorations and architectures aimed at promoting sustainable computing.
Simulation and emulation techniques, frameworks, tools, and platforms for machine learning.
Optimizations to improve performance of training techniques including on-device and large-scale learning.
Load balancing and efficient task distribution, communication and computation overlapping for optimal performance.
Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems.
Efficient deployment strategies for edge and distributed environments.
Model compression and optimization techniques that preserve reasoning and problem-solving capabilities.
Architectures and frameworks for multi-agent systems and retrieval-augmented generation (RAG) pipelines.
Systems-level approaches for scaling future foundation models (e.g., Llama 4, GPT-5 and beyond).

08:45 - 09:00

Welcome

Introduction and Opening Remarks

09:00 - 10:00

Keynote

Quantizing Deep Convolutional Networks for Efficient Inference

Raghuraman Krishnamoorthi, Facebook

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. We discuss different quantization schemes and show that simple techniques provide very good performance (4x reduction in model size, 2x speedup on CPUs) for classification use cases, with a 1-2% accuracy drop.
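As a rough illustration of where the 4x model-size figure comes from, the sketch below maps 32-bit floating-point weights to 8-bit integers with an affine (scale plus zero-point) scheme. It is a minimal sketch, not the speaker's implementation: the function names, the synthetic Gaussian weights, and the choice of an unsigned 8-bit range are all assumptions made for the example.

```python
import random

def quantize_uint8(values):
    """Map floats to 8-bit integers with an affine (scale + zero-point) scheme."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)    # keep 0.0 inside the representable range
    scale = (hi - lo) / 255.0 or 1.0       # fall back to 1.0 for an all-zero input
    zero_point = round(-lo / scale)        # integer that represents the real value 0.0
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floating-point values from the integers."""
    return [(qi - zero_point) * scale for qi in q]

random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]  # hypothetical layer weights

q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)

# Worst-case rounding error, and storage ratio of float32 (4 bytes) to uint8 (1 byte).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
size_ratio = (len(weights) * 4) / (len(q) * 1)
print(f"max error: {max_err:.6f}, scale: {scale:.6f}, size ratio: {size_ratio}")
```

The storage ratio follows directly from the byte widths. The accuracy and speed figures quoted in the abstract depend on the network and hardware, and this toy example does not reproduce them.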

Modeling quantization during training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. Quantization-aware training also allows for reducing the precision of weights to four bits with accuracy losses ranging from 2% to 10%, with higher accuracy drop for smaller networks.
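Modeling quantization during training is commonly done by inserting a quantize-then-dequantize ("fake quantization") step into the forward pass, so the network sees rounding error while values remain in floating point. The sketch below shows only that forward step, under assumptions of our own: a symmetric per-tensor scheme, a hypothetical `fake_quantize` helper, and synthetic weights. It is not taken from the talk.

```python
def fake_quantize(values, num_bits):
    """Quantize then dequantize, so the output carries the rounding error
    of `num_bits`-bit integers but stays in floating point."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic weights covering [-0.5, 0.5] in steps of 0.01 (illustrative only).
weights = [((i * 37) % 101 - 50) / 100.0 for i in range(101)]

err8 = mean_abs_error(weights, fake_quantize(weights, 8))
err4 = mean_abs_error(weights, fake_quantize(weights, 4))
print(f"mean abs error at 8 bits: {err8:.5f}, at 4 bits: {err4:.5f}")
```

In a training framework the backward pass would typically treat the rounding as an identity function (a straight-through estimator), so gradients still reach the underlying floating-point weights. The larger error at four bits is consistent with the abstract's observation that 4-bit weights cost more accuracy than 8-bit weights.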

We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support:

Precisions of 4, 8, and 16 bits for computation
Per-channel quantization of weights
Per-layer selection of bit widths for weights and activations
On-the-fly weight compression techniques for memory bandwidth efficiency
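The preference for per-channel weight quantization can be illustrated with a small sketch. When output channels have very different value ranges, a single scale for the whole layer wastes most of the integer range on the small-valued channel, while one scale per channel does not. The helper names and the two-channel weight values below are hypothetical, and the symmetric 8-bit scheme is an assumption made for the example.

```python
QMAX = 127  # largest magnitude of a signed 8-bit symmetric quantizer

def scale_for(values):
    """Scale that maps the largest magnitude in `values` to QMAX."""
    return (max(abs(v) for v in values) or 1.0) / QMAX

def quantize_dequantize(values, scale):
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]

def sum_sq_error(original, approx):
    return sum((a - b) ** 2 for a, b in zip(original, approx))

# Two output channels whose weights differ in range by a factor of 100.
channels = [
    [0.9, -0.7, 0.5, -0.3],
    [0.009, -0.007, 0.005, -0.003],
]

# Per-layer: one scale shared by every weight in the layer.
shared_scale = scale_for([w for ch in channels for w in ch])
per_layer_err = sum(
    sum_sq_error(ch, quantize_dequantize(ch, shared_scale)) for ch in channels
)

# Per-channel: each output channel gets its own scale.
per_channel_err = sum(
    sum_sq_error(ch, quantize_dequantize(ch, scale_for(ch))) for ch in channels
)

print(f"per-layer error: {per_layer_err:.3e}, per-channel error: {per_channel_err:.3e}")
```

With the shared scale, the small-valued channel collapses onto only a few integer levels, which is where most of the extra error comes from.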

Raghuraman Krishnamoorthi is a software engineer on the PyTorch team at Facebook, where he leads the effort to develop and optimize quantized deep networks for inference. Prior to that, he was part of the TensorFlow team at Google, working on quantization for mobile inference as part of TensorFlow Lite.

From 2001 to 2017, Raghu was at Qualcomm Research, working on several generations of wireless technologies. His work experience also includes computer vision for AR, ultra-low-power always-on vision, and hardware/software co-design for inference on mobile platforms. He is an inventor on more than 90 issued and filed patents. Raghu has a master's degree in EE from the University of Illinois, Urbana-Champaign and a bachelor's degree from the Indian Institute of Technology, Madras.

10:00 - 11:00

Keynote

Efficient Machine Learning Architectures

T. N. Vijaykumar, Purdue University

Advances in machine learning (ML) are resulting in highly accurate recognition (e.g., image and speech recognition). ML models, however, place high computational demands during both training and inference, requiring efficient architectures. The models' computations are fine-grained, regular, and highly parallel, have high data reuse, and use low-precision arithmetic for inference (e.g., int8). Modern ML architectures (e.g., GPGPUs, TPU, and FPGA-based) exploit these characteristics to achieve high performance and energy efficiency.

Recently, ML models have been shown to be sparse, prompting creative proposals for sparse architectures. Emerging technology trends of processing-in/near-memory match some of the ML workloads well, providing an opportunity for architectural innovation based on these technologies. In this talk, I will explore these exciting aspects of machine learning architectures.
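To make the appeal of sparsity concrete, the sketch below stores a small weight matrix in compressed sparse row (CSR) form and counts the multiply-accumulate operations needed for a matrix-vector product. It is an illustrative software analogy only, not a description of any architecture from the talk. The matrix, the vector, and the function names are made up for the example.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays, keeping only nonzeros."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))   # where the next row starts in `values`
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, counting multiply-accumulates performed."""
    y, macs = [], 0
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
            macs += 1
        y.append(acc)
    return y, macs

dense = [
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 4, 0],
]
x = [1, 2, 3, 4]

y, sparse_macs = csr_matvec(*to_csr(dense), x)
dense_macs = len(dense) * len(dense[0])
print(y, sparse_macs, dense_macs)   # [4, 0, 13, 12] 4 16
```

The sparse product touches 4 nonzeros instead of 16 entries, at the cost of index bookkeeping and irregular access to `x`. That trade-off between skipped work and lost regularity is one reason sparse models call for dedicated architectural support.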

T. N. Vijaykumar is a Professor in the School of Electrical and Computer Engineering at Purdue University. His research interests are in computer architecture targeting machine learning architectures, secure high-performance microprocessors, and verifiable architectures. He is also interested in hardware for data center networks and software-programmable microfluidics. His work has been recognized with an NSF CAREER Award in 1999 and IEEE Micro's Top Picks in 2003 and 2005. He is listed in the International Symposium on Computer Architecture (ISCA) Hall of Fame at http://research.cs.wisc.edu/arch/www/iscabibhall. With his colleagues, he received the first prize in the 2009 Burton D. Morgan Business Plan Competition for a business plan on commercializing software-programmable lab-on-a-chip technology. He received a Ph.D. in computer science from the University of Wisconsin-Madison in 1997.

11:00 - 12:00

Paper Session #1

Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs

Partha Maji, Andrew Mundy, Ganesh Dasika, Jesse Beu, Matthew Mattina, Robert Mullins

University of Cambridge and ARM ML Research

On Merging MobileNets for Efficient Multitask Inference

Cheng-En Wu, Yi-Ming Chan and Chu-Song Chen

Institute of Information Science, Academia Sinica

13:30 - 14:15

Invited Talk

Tensilica DNA 100 Processor: A High-Performance, Power-Efficient DNN Processor for On-Device Inference

Megha Daga, Cadence

Deep learning is influencing not only the technology itself but also our everyday lives. With the increasing demand for mobile artificial intelligence (AI), conventional hardware solutions struggle because of their low energy efficiency on such power-hungry applications. For the past few years, dedicated DNN inference accelerators have been in the spotlight. However, with the rising emphasis on privacy and personalization, the ability to learn on mobile platforms is becoming the second hurdle for “on-device AI.” The Cadence® Tensilica® DNA 100 Processor IP is the first deep neural-network accelerator (DNA) AI processor IP to deliver both high performance and power efficiency across a full range of compute from 0.5 TeraMAC (TMAC) to 100 TMACs. As a result, the DNA 100 processor is well suited for on-device neural network inference applications spanning autonomous vehicles (AVs), ADAS, surveillance, robotics, drones, augmented reality (AR)/virtual reality (VR), smartphones, smart home, and IoT.

Megha Daga works at Cadence Design Systems, Inc. as Sr. Manager, Product Marketing and Management in the AI group. Megha’s focus and passion is to research the latest trends and requirements in AI and to create industry-leading solutions on Cadence AI IPs. Megha enjoys learning from customers’ experiences and from fellow researchers in AI. Her R&D background, coupled with her current marketing role, gives her a unique perspective on the AI industry.

14:15 - 15:00

Invited Talk

Beyond IPS: Toward A Wholistic Measure of Machine Learning Performance

Saurabh Tangri, Intel

Saurabh Tangri is a senior SW architect at Intel and leads AI enabling efforts across Intel HW for Microsoft solutions. His focus area is to make AI accessible and performant for everyone in a seamless manner. He has been with Intel for nearly 15 years.

15:30 - 17:00

Paper Session #2

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Farzad Farshchi, Qijing Huang and Heechul Yun

University of Kansas, University of California, Berkeley

Bootstrapping Deep Neural Networks from Approximate Image Processing Pipelines

Sek Chai, Kilho Son and Jesse Hostetler

SRI International

NNBench-X: A Benchmarking Methodology for Neural Network Accelerator Designs

Xinfeng Xie, Xing Hu, Peng Gu, Shuangchen Li, Yu Ji and Yuan Xie

University of California, Santa Barbara

17:00 - 17:30

Invited Talk

Hardware Acceleration Opportunities in Bioinformatics and Computational Biology

Leonid Yavits, Technion

Advances in genomics have triggered a revolution in healthcare and our understanding of life. Recent years saw an exponential increase in genomic data, far outpacing Moore’s Law. Coupled with the prohibitively high computational costs of bioinformatics tasks, this presents a challenge but also a great opportunity for hardware acceleration.

I will describe a typical genomic assembly pipeline and discuss the latest developments in the field of DNA sequencing, with an emphasis on hardware acceleration opportunities. Afterwards, I will make a brief excursion into the world of existing bioinformatics accelerators. I will conclude with insights from the Accelerator Architecture for Computational Biology and Bioinformatics (AACBB) 2019 workshop.

Leonid Yavits received his MSc (1996) and PhD in Electrical Engineering (2015) from the Technion, Israel Institute of Technology. After completing the MSc program, he co-founded VisionTech, where he co-designed the world’s first single-chip MPEG2 codec. Following VisionTech’s acquisition by Broadcom, he managed Broadcom Israel R&D and co-developed a number of video compression products. Later, Leonid co-founded Horizon Semiconductors, where he co-designed a Set Top Box-on-chip for cable and satellite TV. Horizon’s Set Top Box-on-chip was among the world’s earliest heterogeneous MPSoCs.

Leonid is a postdoctoral research fellow in Electrical Engineering at the Technion. He has co-authored a number of patents and research papers. His research interests include non-von Neumann computer architectures; processing in memory and resistive-memory-based computing; and architectures for computational biology and bioinformatics. Leonid’s research work has earned several awards, among them IEEE Computer Architecture Letters Best Paper Awards for 2015 and 2017 and best poster awards at ISC High Performance in 2017 and the ACM/IEEE Supercomputing Conference in 2018.

Energy Efficient Machine Learning and Cognitive Computing

2nd Edition

Co-located with HPCA 2019 in Washington D.C.

We will follow the same formatting guidelines and duplicate submission policies as ASPLOS.

