EMC^2: EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Friday, December 13, 2019 Room: West 306 Full Day

In the eleventh edition of the EMC2 workshop, we plan to facilitate conversation about the sustainability of large-scale AI computing systems being developed to meet the ever-increasing demands of generative AI. This involves discussions spanning multiple interrelated areas. First, we continue to serve as a leading forum for discussing the energy efficiency of GenAI workloads, which directly impacts the overall viability and economic value of AI technology. Second, we reassess the scaling laws of AI given the prevalence of agentic, multi-modal, and reasoning-based models in conjunction with novel techniques such as highly sparse expert architectures and disaggregated computation. Finally, we discuss sustainable and high-performance computing paradigms for efficient datacenters and hybrid computing models that can cater to the exponential growth in model sizes, application areas, and user base. This will allow us to explore ideas for building the hardware, software, systems, and scaling infrastructure, as well as model architectures, that make AI technology even more prevalent and accessible.

The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of MLLMs as they are practised today and as they will evolve in the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

Keynotes, invited talks and discussion panels by leading researchers from industry and academia
Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to the ones listed below:

Neural network architectures for resource constrained applications.
Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs.
Power and performance efficient memory architectures suited for neural networks.
Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration.
Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization.
Performance potential, limit studies, bottleneck analysis, profiling, and synthesis of workloads.
Explorations and architectures aimed at promoting sustainable computing.
Simulation and emulation techniques, frameworks, tools, and platforms for machine learning.
Optimizations to improve performance of training techniques including on-device and large-scale learning.
Load balancing and efficient task distribution, communication and computation overlapping for optimal performance.
Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems.
Efficient deployment strategies for edge and distributed environments.
Model compression and optimization techniques that preserve reasoning and problem-solving capabilities.
Architectures and frameworks for multi-agent systems and retrieval-augmented generation (RAG) pipelines.
Systems-level approaches for scaling future foundation models (e.g., Llama 4, GPT-5 and beyond).

08:00 - 08:45

Grand Keynote

Presentation

What DL Hardware Will We Need?

Yann LeCun, New York University and Facebook

Yann LeCun is VP & Chief AI Scientist at Facebook and Silver Professor at NYU affiliated with the Courant Institute of Mathematical Sciences & the Center for Data Science. He was the founding Director of Facebook AI Research and of the NYU Center for Data Science. He received an Engineering Diploma from ESIEE (Paris) and a PhD from Sorbonne Université. After a postdoc in Toronto he joined AT&T Bell Labs in 1988, and AT&T Labs in 1996 as Head of Image Processing Research. He joined NYU as a professor in 2003 and Facebook in 2013. His interests include AI, machine learning, computer perception, robotics and computational neuroscience. He is a member of the National Academy of Engineering and the recipient of the 2018 ACM Turing Award (with Geoffrey Hinton and Yoshua Bengio) for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”.

08:45 - 09:30

Keynote

Presentation

Efficient Computing for AI and Robotics

Vivienne Sze, MIT

Computing near the sensor is preferred over the cloud due to privacy and/or latency concerns for a wide range of applications including robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to the throughput and accuracy requirements of the application. In this talk, we will describe how joint algorithm and hardware design can be used to reduce energy consumption while delivering real-time and robust performance for applications including deep learning, computer vision, autonomous navigation/exploration and video/image processing. We will show how energy-efficient techniques that exploit correlation and sparsity to reduce compute, data movement and storage costs can be applied to various tasks including image classification, depth estimation, super-resolution, localization and mapping.

Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video processing/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI in the JCT-VC committee of ITU-T and ISO/IEC standards body during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book entitled “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).

Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degrees from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2018 Symposium on VLSI Circuits Best Student Paper Award, the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.

For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/

09:30 - 10:00

Invited Talk

Presentation

Putting the “Machine” Back in Machine Learning: The Case for Hardware-ML Model Co-design

Diana Marculescu, University of Texas at Austin

Machine learning (ML) applications have entered and impacted our lives unlike any other technology advance from the recent past. Indeed, almost every aspect of how we live or interact with others relies on or uses ML for applications ranging from image classification and object detection, to processing multi-modal and heterogeneous datasets. While the holy grail for judging the quality of a ML model has largely been serving accuracy, and only recently its resource usage, neither of these metrics translates directly to energy efficiency, runtime, or mobile device battery lifetime. This talk will uncover the need for building accurate, platform-specific power and latency models for convolutional neural networks (CNNs) and efficient hardware-aware CNN design methodologies, thus allowing machine learners and hardware designers to identify not just the best accuracy NN configuration, but also those that satisfy given hardware constraints. Our proposed modeling framework is applicable to both high-end and mobile platforms and achieves 88.24% accuracy for latency, 88.34% for power, and 97.21% for energy prediction. Using similar predictive models, we demonstrate a novel differentiable neural architecture search (NAS) framework, dubbed Single-Path NAS, that uses one single-path over-parameterized CNN to encode all architectural decisions based on shared convolutional kernel parameters. Single-Path NAS achieves state-of-the-art top-1 ImageNet accuracy (75.62%), outperforming existing mobile NAS methods for similar latency constraints (∼80ms) and finds the final configuration up to 5,000× faster compared to prior work. Combined with our quantized CNNs (Flexible Lightweight CNNs or FLightNNs) that customize precision level in a layer-wise fashion and achieve almost iso-accuracy at 5-10x energy reduction, such a modeling, analysis, and optimization framework is poised to lead to true co-design of hardware and ML model, orders of magnitude faster than state of the art, while satisfying both accuracy and latency or energy constraints.

Diana Marculescu is Department Chair, Cockrell Family Chair for Engineering Leadership #5, and Professor, Motorola Regents Chair in Electrical and Computer Engineering #2, at the University of Texas at Austin. Before joining UT Austin in December 2019, she was the David Edward Schramm Professor of Electrical and Computer Engineering, the Founding Director of the College of Engineering Center for Faculty Success (2015-2019) and has served as Associate Department Head for Academic Affairs in Electrical and Computer Engineering (2014-2018), all at Carnegie Mellon University. She received the Dipl.Ing. degree in computer science from the Polytechnic University of Bucharest, Bucharest, Romania (1991), and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA (1998). Her research interests include energy- and reliability-aware computing, hardware aware machine learning, and computing for sustainability and natural science applications. Diana was a recipient of the National Science Foundation Faculty Career Award (2000-2004), the ACM SIGDA Technical Leadership Award (2003), the Carnegie Institute of Technology George Tallman Ladd Research Award (2004), and several best paper awards. She was an IEEE Circuits and Systems Society Distinguished Lecturer (2004-2005) and the Chair of the Association for Computing Machinery (ACM) Special Interest Group on Design Automation (2005-2009). Diana chaired several conferences and symposia in her area and is currently an Associate Editor for IEEE Transactions on Computers. She was selected as an ELATE Fellow (2013-2014), and is the recipient of an Australian Research Council Future Fellowship (2013-2017), the Marie R. Pistilli Women in EDA Achievement Award (2014), and the Barbara Lazarus Award from Carnegie Mellon University (2018). Diana is an IEEE Fellow and an ACM Distinguished Scientist.

10:00 - 10:30

Poster Session #1

Paper Poster

Bit Efficient Quantization for Deep Neural Networks

Prateeth Nayak, David Zhang and Sek Chai

SRI International and Latent AI

Paper Poster

Supported-BinaryNet: Bitcell Array-based Weight Supports for Dynamic Accuracy-Latency Trade-offs in SRAM-based Binarized Neural Network

Shamma Nasrin, Srikanth Ramakrishna, Theja Tulabandhula and Amit Trivedi

University of Illinois at Chicago

Paper Poster

Dynamic Channel Execution: on-device Learning Method for Finding Compact Networks

Simeon Spasov and Pietro Lio

University of Cambridge

Paper Poster

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf

Hugging Face

Paper Poster

QPyTorch: A Low-Precision Arithmetic Simulation Framework

Tianyi Zhang, Zhiqiu Lin, Guandao Yang and Christopher De Sa

Cornell University

Paper Poster

Separable Convolutions for Multiscale Dense Networks for Efficient Anytime Image Classification

Sven Peter, Nasim Rahaman, Ferran Diego and Fred Hamprecht

Heidelberg University and Telefonica Research

10:30 - 11:00

Invited Talk

Presentation

Abandoning the Dark Arts: New Directions in Efficient DNN Design

Kurt Keutzer, UC Berkeley

Deep Neural Net models have provided the most accurate solutions to a very wide variety of problems in vision, language, and speech; however, the design, training, and optimization of efficient DNNs typically requires resorting to the “dark arts” of ad hoc methods and extensive hyperparameter tuning. In this talk we present our progress on abandoning these dark arts by using Differential Neural Architecture Search to guide the design of efficient DNNs and by using Hessian-based methods to guide the processes of training and quantizing those DNNs.

Kurt Keutzer’s research at University of California, Berkeley, focuses on computational problems in Deep Learning. In particular, Kurt has worked to reduce the training time of ImageNet to minutes and, with the SqueezeNet family, to develop a family of Deep Neural Networks suitable for mobile and IoT applications.

Before joining Berkeley as a Full Professor in 1998, Kurt was CTO and SVP at Synopsys. Kurt’s contributions to Electronic Design Automation were recognized at the 50th Design Automation Conference where he was noted as a Top 10 most cited author, as an author of a Top 10 most cited paper, and as one of only three people to have won four Best Paper Awards at that conference. Kurt was named a Fellow of the IEEE in 1996.

11:00 - 11:30

Invited Talk

Presentation

Hardware-aware Neural Architecture Design for Small and Fast Models: from 2D to 3D

Song Han, MIT

Efficient deep learning computing requires algorithm and hardware co-design to enable specialization. However, the extra degree of freedom creates a much larger design space. We propose AutoML techniques to architect efficient neural networks. We investigate automatically designing small and fast models (ProxylessNAS), auto channel pruning (AMC), and auto mixed-precision quantization (HAQ). We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared to previous work when searching for efficient models, so that we can afford to design specialized neural network models for different hardware platforms. We accelerate computation-intensive AI applications including TSM for efficient video recognition and PVCNN for efficient 3D recognition on point clouds. Finally, we’ll describe scalable distributed training and the potential security issues of efficient deep learning [1] [2].

Song Han is an assistant professor at MIT EECS. Dr. Han received the Ph.D. degree in Electrical Engineering from Stanford advised by Prof. Bill Dally. Dr. Han’s research focuses on efficient deep learning computing. He proposed “Deep Compression” and “EIE Accelerator” that impacted the industry. His work received the best paper award in ICLR’16 and FPGA’17. He was the co-founder and chief scientist of DeePhi Tech which was acquired by Xilinx.

11:30 - 12:30

Paper Session #1

Paper Presentation Poster

AutoSlim: Towards One-Shot Architecture Search for Channel Numbers

Jiahui Yu and Thomas Huang

University of Illinois at Urbana-Champaign

Paper Presentation

YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection

Alexander Wong, Mahmoud Famuori, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl and Jonathan Chung

University of Waterloo and DarwinAI Corp

Paper Presentation Poster

Progressive Stochastic Binarization of Deep Networks

David Hartmann and Michael Wand

Johannes Gutenberg-University of Mainz

Paper Presentation Poster

Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment

Jingyang Zhang, Huanrui Yang, Fan Chen, Yitu Wang and Hai Li

Duke University and Fudan University

Paper Presentation Poster

Trained Rank Pruning for Efficient Deep Neural Networks

Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Wenrui Dai, Yingyong Qi, Yiran Chen, Weiyao Lin and Hongkai Xiong

Shanghai Jiao Tong University, Qualcomm and Duke University

Paper Presentation Poster

Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization

Meng Li, Yilei Li, Pierce Chuang, Liangzhen Lai and Vikas Chandra

Facebook

Paper Presentation Poster

Q8BERT: Quantized 8Bit BERT

Ofir Zafrir, Guy Boudoukh, Peter Izsak and Moshe Wasserblat

Intel AI Lab

14:00 - 14:45

Keynote

Presentation

Cheap, Fast, and Low Power Deep Learning: I need it now!

Edward Delp, Purdue University

In this talk I will describe the need for low power machine learning systems. I will motivate this by describing several current projects at Purdue University that have a need for energy efficient deep learning and in some cases the real deployment of these methods will not be possible without lower power solutions. The applications include precision farming, health care monitoring, and edge-based surveillance.

Edward J. Delp is currently The Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering at Purdue University. His research interests include image and video processing, image analysis, computer vision, machine learning, image and video compression, multimedia security, medical imaging, multimedia systems, communication and information theory. Dr. Delp is a Life Fellow of the IEEE. In 2004 Dr. Delp received the Technical Achievement Award from the IEEE Signal Processing Society for his work in image and video compression and multimedia security. In 2008 Dr. Delp received the Society Award from the IEEE Signal Processing Society.

14:45 - 15:15

Invited Talk

Presentation

Advances and Prospects for In-memory Computing

Naveen Verma, Princeton University

Edge AI applications retain the need for high-performing inference models, while driving platforms beyond their limits of energy efficiency and throughput. Digital hardware acceleration, enabling 10-100x gains over general-purpose architectures, is already widely deployed, but is ultimately restricted by the data movement and memory accesses that dominate deep-learning computations. In-memory computing, based on both SRAM and emerging memory, offers fundamentally new tradeoffs for overcoming these barriers, with the potential for 10x higher energy efficiency and area-normalized throughput demonstrated in recent designs. However, those tradeoffs introduce new challenges, especially affecting scaling to the level of computations required, integration in practical heterogeneous architectures, and mapping of diverse software. This talk examines those tradeoffs to characterize the challenges. It then explores recent research that provides promising paths forward, making in-memory computing more of a practical reality than ever before.

Naveen Verma received the B.A.Sc. degree in Electrical and Computer Engineering from the UBC, Vancouver, Canada in 2003, and the M.S. and Ph.D. degrees in Electrical Engineering from MIT in 2005 and 2009 respectively. Since July 2009 he has been a faculty member at Princeton University. His research focuses on advanced sensing systems, exploring how systems for learning, inference, and action planning can be enhanced by algorithms that exploit new sensing and computing technologies. This includes research on large-area, flexible sensors, energy-efficient statistical-computing architectures and circuits, and machine-learning and statistical-signal-processing algorithms. Prof. Verma has served as a Distinguished Lecturer of the IEEE Solid-State Circuits Society, and currently serves on the technical program committees for ISSCC, VLSI Symp., DATE, and IEEE Signal-Processing Society (DISPS).

15:15 - 15:45

Invited Talk

Presentation

Algorithm-Accelerator Co-Design for Neural Network Specialization

Zhiru Zhang, Cornell University

In recent years, machine learning (ML) with deep neural networks (DNNs) has been widely deployed in diverse application domains. However, the growing complexity of DNN models, the slowdown of technology scaling, and the proliferation of edge devices are driving a demand for higher DNN performance and energy efficiency. ML applications have shifted from general-purpose processors to dedicated hardware accelerators in both academic and commercial settings. In line with this trend, there has been an active body of research on both algorithms and hardware architectures for neural network specialization.

This talk presents our recent investigation into DNN optimization and low-precision quantization, using a co-design approach featuring contributions to both algorithms and hardware accelerators. First, we review static network pruning techniques and show a fundamental link between group convolutions and circulant matrices – two previously disparate lines of research in DNN compression. Then we discuss channel gating, a dynamic, fine-grained, and trainable technique for DNN acceleration. Unlike static approaches, channel gating exploits input-dependent dynamic sparsity at run time. This results in a significant reduction in compute cost with a minimal impact on accuracy. Finally, we present outlier channel splitting, a technique to improve DNN weight quantization by removing outliers from the weight distribution without retraining.
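The channel-splitting idea in the last sentence above can be illustrated with a minimal sketch, assuming a single fully connected layer without bias. This is a toy example in plain Python, not the speaker's implementation; the helper names `linear` and `split_outlier_channel` and the sample weights are hypothetical. The input channel holding the largest-magnitude weight is duplicated and both copies are halved, so the layer output is unchanged while the weight range that a quantizer has to cover shrinks.

```python
def linear(weights, x):
    """Compute y = W x for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def split_outlier_channel(weights, x):
    """Duplicate the input channel with the largest |weight| and halve both copies.

    Returns a widened copy of the weight matrix and the matching widened input.
    The original matrix is left untouched.
    """
    n_in = len(weights[0])
    # Column index of the input channel that contains the largest-magnitude weight.
    col = max(range(n_in), key=lambda j: max(abs(row[j]) for row in weights))
    # Append a halved copy of that column to every row (creates new row lists).
    new_weights = [row + [row[col] / 2.0] for row in weights]
    # Halve the original column as well, so the two copies sum to the old weight.
    for row in new_weights:
        row[col] = row[col] / 2.0
    # The duplicated channel reads the same input value as the original one.
    new_x = x + [x[col]]
    return new_weights, new_x


# Hypothetical 2x3 layer with one outlier weight (-8.0) in input channel 1.
W = [[0.5, -8.0, 0.25],
     [1.0, 2.0, -0.5]]
x = [1.0, 2.0, 4.0]

W_split, x_split = split_outlier_channel(W, x)
y_before = linear(W, x)
y_after = linear(W_split, x_split)
max_before = max(abs(w) for row in W for w in row)
max_after = max(abs(w) for row in W_split for w in row)
```

In this toy case the output stays the same and the largest weight magnitude drops from 8.0 to 4.0, at the cost of one extra input channel.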

Zhiru Zhang is an Associate Professor in the School of ECE at Cornell University. His current research investigates new algorithms, design methodologies, and automation tools for heterogeneous computing. His research has been recognized with a Google Faculty Research Award (2018), the DAC Under-40 Innovators Award (2018), the Rising Professional Achievement Award from the UCLA Henry Samueli School of Engineering and Applied Science (2018), a DARPA Young Faculty Award (2015), and the IEEE CEDA Ernest S. Kuh Early Career Award (2015), an NSF CAREER Award (2015), the Ross Freeman Award for Technical Innovation from Xilinx (2012), and multiple best paper awards and nominations. Prior to joining Cornell, he was a co-founder of AutoESL, a high-level synthesis start-up later acquired by Xilinx.

15:45 - 16:15

Poster Session #2

Paper Poster

Pushing the limits of RNN Compression

Urmish Thakker, Igor Fedorov, Jesse Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika and Matthew Mattina

Arm ML Research and AMD Research

Paper Poster

On hardware-aware probabilistic frameworks for resource constrained embedded applications

Laura Isabel Galindez Olascoaga, Wannes Meert, Nimish Shah, Guy Van den Broeck and Marian Verhelst

KU Leuven, and UCLA

Paper Poster

Neural Networks Weights Quantization: Target None-retraining Ternary (TNT)

Tianyu Zhang, Lei Zhu, Qian Zhao and Kilho Shin

WeBank, Harbin Engineering University, University of Hyogo and Gakushuin University

Paper Poster

Regularized Binary Network Training

Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux and Vahid Partovi Nia

UCLA, Université de Montréal and Huawei

Paper Poster

Fully Quantized Transformer for Improved Translation

Gabriele Prato, Ella Charlaix and Mehdi Rezagholizadeh

Université de Montréal and Huawei

16:15 - 16:45

Invited Talk

Kernel and Graph Optimization for DL Model Execution

Jinwon Lee, Qualcomm

There is increasing demand to deploy diverse deep learning models on edge devices. However, fully optimizing the execution of such models on resource-constrained hardware (e.g., CPU, DSP, NPU) is intrinsically challenging and often requires significant manual effort. In this talk, we introduce our Morpheus team’s efforts to address these challenges. First, we optimize the performance of DL model execution at the kernel level (e.g., a convolution operator). From a large number of possible kernel configurations (e.g., tiling, unrolling, vectorization), the fastest kernel is quickly identified through machine learning algorithms we developed, while binary code is automatically generated by the TVM or Halide compilers. Second, we further optimize the performance of DL model execution at the graph level (e.g., an end-to-end network). Since kernels or operators are connected as a graph in deep learning models, the compute scheduling of such graphs significantly affects the end-to-end performance, especially memory I/O. We solve two problems in this context. First, for potentially complex topologies on edge devices with limited total memory, we solve the minimum memory usage problem, thus characterizing and enabling deployment of all feasible networks on a given device. Second, for any hardware with combined Tightly Coupled Memory (TCM) and more expensive external memory (e.g., DRAM), we solve the minimum external memory access problem, which optimizes hardware usage efficiency in I/O-bound conditions. For both problems we show efficient algorithms that are complete solutions, and improved results over heuristic methods. Finally, we will discuss our future directions to optimize deep learning model execution.
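To make the minimum memory usage problem concrete, here is a minimal sketch that brute-forces every valid execution order of a tiny operator graph and keeps the one with the lowest peak memory. It is not the algorithm presented in the talk. The graph, the tensor sizes, the function names, and the memory model are all assumptions made for illustration: an operator's output is allocated while its inputs are still live, and a tensor is freed once its last consumer has run.

```python
from itertools import permutations

# Hypothetical toy graph: op -> (output size in KB, list of producer ops).
GRAPH = {
    "a": (4, []),
    "b": (8, ["a"]),
    "c": (2, ["a"]),
    "d": (1, ["b"]),
    "e": (1, ["c", "d"]),
}


def is_topological(order, graph):
    """Return True if every op appears after all of its producers."""
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[p] < pos[op] for op in order for p in graph[op][1])


def peak_memory(order, graph):
    """Peak sum of live tensor sizes when ops run in the given order."""
    # Number of consumers of each tensor that have not run yet.
    consumers_left = {op: 0 for op in graph}
    for op in graph:
        for p in graph[op][1]:
            consumers_left[p] += 1
    live = 0
    peak = 0
    for op in order:
        size, producers = graph[op]
        live += size  # the output is allocated while the inputs are still live
        peak = max(peak, live)
        for p in producers:
            consumers_left[p] -= 1
            if consumers_left[p] == 0:
                live -= graph[p][0]  # last consumer done, free the tensor
    return peak


def best_schedule(graph):
    """Exhaustively search all topological orders (only viable for tiny graphs)."""
    candidates = [o for o in permutations(graph) if is_topological(o, graph)]
    return min(candidates, key=lambda o: peak_memory(o, graph))


schedule = best_schedule(GRAPH)
```

For this toy graph, running `d` right after `b` frees the large `b` tensor before `c` is allocated, giving a peak of 13 KB instead of 14 KB. Exhaustive search grows factorially with the number of operators, which is why dedicated algorithms are needed for real networks.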

Jinwon Lee is a Senior Staff Engineer at the Qualcomm AI Research lab, where he designs state-of-the-art deep learning models for edge devices. He received his Ph.D. in Computer Science from Korea Advanced Institute of Science and Technology (KAIST) in 2009. Jinwon is currently focused on deep learning model optimizations for edge devices, including kernel/graph compiler optimization, model compression/quantization, and HW-aware neural architecture search. Previously, he developed a deep learning-based on-device speech enhancement/recognition engine for Qualcomm SoCs. He also developed low-power context-aware engines for mobile use cases based on GPS, WiFi, and motion sensors.

16:45 - 17:15

Invited Talk

Presentation

Adaptive Multi-Task Neural Networks for Efficient Inference

Rogerio Feris, IBM Research

Very deep convolutional neural networks have shown remarkable success in many computer vision tasks, yet their computational expense limits their impact in domains where fast inference is essential. While there has been significant progress on model compression and acceleration, most methods rely on a one-size-fits-all network, where the same set of features is extracted for all images or tasks, no matter their complexity. In this talk, I will first describe an approach called BlockDrop, which learns to dynamically choose which layers of a deep network to execute during inference, depending on the image complexity, so as to best reduce total computation without degrading prediction accuracy. Then, I will show how this approach can be extended to design compact multi-task networks, where a different set of layers is executed depending on the task complexity, and the level of feature sharing across tasks is automatically determined to maximize both the accuracy and efficiency of the model. Finally, I will conclude the talk by presenting an efficient multi-scale neural network model, which achieves state-of-the-art results in terms of accuracy and FLOPS reduction on standard benchmarks such as the ImageNet dataset.

Rogerio Schmidt Feris is the head of computer vision and multimedia research at IBM T.J. Watson Research Center. He joined IBM in 2006 after receiving a Ph.D. from the University of California, Santa Barbara. He has also worked as an Affiliate Associate Professor at the University of Washington and as an Adjunct Associate Professor at Columbia University. His work has not only been published in top AI conferences, but has also been integrated into multiple IBM products, including Watson Visual Recognition, Watson Media, and Intelligent Video Analytics. He currently serves as an Associate Editor of TPAMI, has served as a Program Chair of WACV 2017, and as an Area Chair of conferences such as NeurIPS, CVPR, and ICCV.

17:15 - 17:45

Invited Talk

Presentation

Configurable Cloud-Scale DNN Processor for Real-Time AI

Bita Rouhani, Microsoft

Growing computational demands from deep neural networks (DNNs), coupled with diminishing returns from general-purpose architectures, have led to a proliferation of Neural Processing Units (NPUs). In this talk, we will discuss Project Brainwave, a production-scale system for real-time (low latency and high throughput) DNN inference. The Brainwave NPU is reconfigurable and deployed at scale in production. This reconfigurability, in turn, eliminates costly silicon updates to accommodate evolving state-of-the-art models while enabling orders of magnitude performance improvement compared to highly optimized software solutions.

Bita Rouhani is a senior researcher at Microsoft Azure AI and Advanced Architecture group. Bita received her Ph.D. in Computer Engineering from the University of California San Diego in 2018. Bita’s research interests include algorithm and hardware co-design for succinct and assured deep learning, real-time data analysis, and safe machine learning. Her work has been published at top-tier computer architecture, electronic design, machine learning, and security conferences and journals including ISCA, ASPLOS, ISLPED, DAC, ICCAD, FPGA, FCCM, SIGMETRICS, S&P magazine, and ACM TRETS.

17:45 - 18:45

Paper Session #2

Paper Presentation Poster

Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models

Peter Izsak, Shira Guskin and Moshe Wasserblat

Intel AI Lab

Paper Presentation

Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference

Shun Liao, Ting Chen, Tian Lin, Denny Zhou and Chong Wang

University of Toronto, Google and ByteDance

Paper Presentation Poster

Algorithm-hardware Co-design for Deformable Convolution

Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer and John Wawrzynek

UC Berkeley, Peking University and University of Chinese Academy of Science

Paper Presentation

Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Inference

Jeffrey McKinstry, Steven Esser, Rathinakumar Appuswamy, Deepika Bablani, John Arthur, Izzet Yildiz and Dharmendra Modha

IBM Research

Paper Presentation Poster

Instant Quantization of Neural Networks using Monte Carlo Methods

Gonçalo Mordido, Matthijs Van Keirsbilck and Alexander Keller

Hasso Plattner Institute and NVIDIA

Paper Presentation Poster

Spoken Language Understanding on the Edge

Alaa Saade, Alice Coucke, Alexandre Caulier, Joseph Dureau, Adrien Ball, Théodore Bluche, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril and Maël Primet

Energy Efficient Machine Learning and Cognitive Computing

5th Edition

Co-located with NeurIPS 2019 in Vancouver BC, Canada


We will follow the same formatting guidelines and duplicate submission policies as ASPLOS.
