The 5th EMC2 - Energy Efficient Training and Inference of Transformer Based Models
Co-located with the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)
Workshop Objective
Transformers are the foundation of large deep learning language models. Recent successes of Transformer-based models in image classification and action prediction indicate their wide applicability. In this workshop, we want to focus on the leading ideas using Transformer models such as PaLM from Google. We will learn about the key observations on model performance, optimizations for inference, and the power consumption of both mixed-precision inference and training.
Call for Papers
The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of machine learning and deep learning as it is practiced today and as it will evolve in the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:
- Keynotes, invited talks and discussion panels by leading researchers from industry and academia
- Peer-reviewed papers on the latest solutions, including works-in-progress seeking directed feedback from experts
- Independent publication of proceedings through IEEE CPS
We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.
Topics for the Workshop
- Neural network architectures for resource constrained applications
- Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
- Power and performance efficient memory architectures suited for neural networks
- Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
- Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
- Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
- Optimizations to improve performance of training techniques including on-device and large-scale learning
- Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
- Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
08:00 - 08:45
What DL Hardware Will We Need?
Yann LeCun, New York University and Facebook
Yann LeCun is VP & Chief AI Scientist at Facebook and Silver Professor at NYU affiliated with the Courant Institute of Mathematical Sciences & the Center for Data Science. He was the founding Director of Facebook AI Research and of the NYU Center for Data Science. He received an Engineering Diploma from ESIEE (Paris) and a PhD from Sorbonne Université. After a postdoc in Toronto he joined AT&T Bell Labs in 1988, and AT&T Labs in 1996 as Head of Image Processing Research. He joined NYU as a professor in 2003 and Facebook in 2013. His interests include AI, machine learning, computer perception, robotics and computational neuroscience. He is a member of the National Academy of Engineering and the recipient of the 2018 ACM Turing Award (with Geoffrey Hinton and Yoshua Bengio) for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”.
08:45 - 09:30
Efficient Computing for AI and Robotics
Vivienne Sze, MIT
Computing near the sensor is preferred over the cloud due to privacy and/or latency concerns for a wide range of applications including robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to the throughput and accuracy requirements of the application. In this talk, we will describe how joint algorithm and hardware design can be used to reduce energy consumption while delivering real-time and robust performance for applications including deep learning, computer vision, autonomous navigation/exploration and video/image processing. We will show how energy-efficient techniques that exploit correlation and sparsity to reduce compute, data movement and storage costs can be applied to various tasks including image classification, depth estimation, super-resolution, localization and mapping.
Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video processing/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI in the JCT-VC committee of ITU-T and ISO/IEC standards body during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book entitled “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).
Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degrees from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2018 Symposium on VLSI Circuits Best Student Paper Award, the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.
For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/
09:30 - 10:00
Putting the “Machine” Back in Machine Learning: The Case for Hardware-ML Model Co-design
Diana Marculescu, University of Texas at Austin
Machine learning (ML) applications have entered and impacted our lives unlike any other technology advance from the recent past. Indeed, almost every aspect of how we live or interact with others relies on or uses ML for applications ranging from image classification and object detection to processing multi-modal and heterogeneous datasets. While the holy grail for judging the quality of an ML model has largely been serving accuracy, and only recently its resource usage, neither of these metrics translates directly to energy efficiency, runtime, or mobile device battery lifetime. This talk will uncover the need for building accurate, platform-specific power and latency models for convolutional neural networks (CNNs) and efficient hardware-aware CNN design methodologies, thus allowing machine learners and hardware designers to identify not just the best accuracy NN configuration, but also those that satisfy given hardware constraints. Our proposed modeling framework is applicable to both high-end and mobile platforms and achieves 88.24% accuracy for latency, 88.34% for power, and 97.21% for energy prediction. Using similar predictive models, we demonstrate a novel differentiable neural architecture search (NAS) framework, dubbed Single-Path NAS, that uses one single-path over-parameterized CNN to encode all architectural decisions based on shared convolutional kernel parameters. Single-Path NAS achieves state-of-the-art top-1 ImageNet accuracy (75.62%), outperforming existing mobile NAS methods for similar latency constraints (∼80ms), and finds the final configuration up to 5,000× faster compared to prior work.
Combined with our quantized CNNs (Flexible Lightweight CNNs or FLightNNs) that customize precision level in a layer-wise fashion and achieve almost iso-accuracy at 5-10x energy reduction, such a modeling, analysis, and optimization framework is poised to lead to true co-design of hardware and ML model, orders of magnitude faster than state of the art, while satisfying both accuracy and latency or energy constraints.
Diana Marculescu is Department Chair, Cockrell Family Chair for Engineering Leadership #5, and Professor, Motorola Regents Chair in Electrical and Computer Engineering #2, at the University of Texas at Austin. Before joining UT Austin in December 2019, she was the David Edward Schramm Professor of Electrical and Computer Engineering and the Founding Director of the College of Engineering Center for Faculty Success (2015-2019), and served as Associate Department Head for Academic Affairs in Electrical and Computer Engineering (2014-2018), all at Carnegie Mellon University. She received the Dipl.Ing. degree in computer science from the Polytechnic University of Bucharest, Bucharest, Romania (1991), and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA (1998). Her research interests include energy- and reliability-aware computing, hardware-aware machine learning, and computing for sustainability and natural science applications. Diana was a recipient of the National Science Foundation Faculty Career Award (2000-2004), the ACM SIGDA Technical Leadership Award (2003), the Carnegie Institute of Technology George Tallman Ladd Research Award (2004), and several best paper awards. She was an IEEE Circuits and Systems Society Distinguished Lecturer (2004-2005) and the Chair of the Association for Computing Machinery (ACM) Special Interest Group on Design Automation (2005-2009). Diana chaired several conferences and symposia in her area and is currently an Associate Editor for IEEE Transactions on Computers. She was selected as an ELATE Fellow (2013-2014), and is the recipient of an Australian Research Council Future Fellowship (2013-2017), the Marie R. Pistilli Women in EDA Achievement Award (2014), and the Barbara Lazarus Award from Carnegie Mellon University (2018). Diana is an IEEE Fellow and an ACM Distinguished Scientist.
10:00 - 10:30
Poster Session #1
Bit Efficient Quantization for Deep Neural Networks
Prateeth Nayak, David Zhang and Sek Chai
SRI International and Latent AI
Supported-BinaryNet: Bitcell Array-based Weight Supports for Dynamic Accuracy-Latency Trade-offs in SRAM-based Binarized Neural Network
Shamma Nasrin, Srikanth Ramakrishna, Theja Tulabandhula and Amit Trivedi
University of Illinois at Chicago
Dynamic Channel Execution: on-device Learning Method for Finding Compact Networks
Simeon Spasov and Pietro Lio
University of Cambridge
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf
QPyTorch: A Low-Precision Arithmetic Simulation Framework
Tianyi Zhang, Zhiqiu Lin, Guandao Yang and Christopher De Sa
Separable Convolutions for Multiscale Dense Networks for Efficient Anytime Image Classification
Sven Peter, Nasim Rahaman, Ferran Diego and Fred Hamprecht
Heidelberg University and Telefonica Research
10:30 - 11:00
Abandoning the Dark Arts: New Directions in Efficient DNN Design
Kurt Keutzer, UC Berkeley
Deep Neural Net models have provided the most accurate solutions to a very wide variety of problems in vision, language, and speech; however, the design, training, and optimization of efficient DNNs typically require resorting to the “dark arts” of ad hoc methods and extensive hyperparameter tuning. In this talk we present our progress on abandoning these dark arts by using Differentiable Neural Architecture Search to guide the design of efficient DNNs and by using Hessian-based methods to guide the processes of training and quantizing those DNNs.
Kurt Keutzer’s research at University of California, Berkeley, focuses on computational problems in Deep Learning. In particular, Kurt has worked to reduce the training time of ImageNet to minutes and, with the SqueezeNet family, to develop a family of Deep Neural Networks suitable for mobile and IoT applications.
Before joining Berkeley as a Full Professor in 1998, Kurt was CTO and SVP at Synopsys. Kurt’s contributions to Electronic Design Automation were recognized at the 50th Design Automation Conference where he was noted as a Top 10 most cited author, as an author of a Top 10 most cited paper, and as one of only three people to have won four Best Paper Awards at that conference. Kurt was named a Fellow of the IEEE in 1996.
11:00 - 11:30
Hardware-aware Neural Architecture Design for Small and Fast Models: from 2D to 3D
Song Han, MIT
Efficient deep learning computing requires algorithm and hardware co-design to enable specialization. However, the extra degree of freedom creates a much larger design space. We propose AutoML techniques to architect efficient neural networks. We investigate automatically designing small and fast models (ProxylessNAS), auto channel pruning (AMC), and auto mixed-precision quantization (HAQ). We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared with previous work when searching for efficient models, so that we can afford to design specialized neural network models for different hardware platforms. We accelerate computation-intensive AI applications including TSM for efficient video recognition and PVCNN for efficient 3D recognition on point clouds. Finally, we’ll describe scalable distributed training and the potential security issues of efficient deep learning.
Song Han is an assistant professor at MIT EECS. Dr. Han received the Ph.D. degree in Electrical Engineering from Stanford, advised by Prof. Bill Dally. Dr. Han’s research focuses on efficient deep learning computing. He proposed “Deep Compression” and “EIE Accelerator” that impacted the industry. His work received the best paper award at ICLR’16 and FPGA’17. He was the co-founder and chief scientist of DeePhi Tech, which was acquired by Xilinx.
11:30 - 12:30
Paper Session #1
AutoSlim: Towards One-Shot Architecture Search for Channel Numbers
Jiahui Yu and Thomas Huang
University of Illinois at Urbana-Champaign
YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection
Alexander Wong, Mahmoud Famuori, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl and Jonathan Chung
University of Waterloo and DarwinAI Corp
Progressive Stochastic Binarization of Deep Networks
David Hartmann and Michael Wand
Johannes Gutenberg-University of Mainz
Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment
Jingyang Zhang, Huanrui Yang, Fan Chen, Yitu Wang and Hai Li
Duke University and Fudan University
Trained Rank Pruning for Efficient Deep Neural Networks
Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Wenrui Dai, Yingyong Qi, Yiran Chen, Weiyao Lin and Hongkai Xiong
Shanghai Jiao Tong University, Qualcomm and Duke University
Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization
Meng Li, Yilei Li, Pierce Chuang, Liangzhen Lai and Vikas Chandra
Q8BERT: Quantized 8Bit BERT
Ofir Zafrir, Guy Boudoukh, Peter Izsak and Moshe Wasserblat
Intel AI Lab
14:00 - 14:45
Cheap, Fast, and Low Power Deep Learning: I need it now!
Edward Delp, Purdue University
In this talk I will describe the need for low-power machine learning systems. I will motivate this by describing several current projects at Purdue University that need energy-efficient deep learning; in some cases, real deployment of these methods will not be possible without lower-power solutions. The applications include precision farming, health care monitoring, and edge-based surveillance.
Edward J. Delp is currently The Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering at Purdue University. His research interests include image and video processing, image analysis, computer vision, machine learning, image and video compression, multimedia security, medical imaging, multimedia systems, communication and information theory. Dr. Delp is a Life Fellow of the IEEE. In 2004 Dr. Delp received the Technical Achievement Award from the IEEE Signal Processing Society for his work in image and video compression and multimedia security. In 2008 Dr. Delp received the Society Award from the IEEE Signal Processing Society.
14:45 - 15:15
Advances and Prospects for In-memory Computing
Naveen Verma, Princeton University
Edge AI applications retain the need for high-performing inference models, while driving platforms beyond their limits of energy efficiency and throughput. Digital hardware acceleration, enabling 10-100x gains over general-purpose architectures, is already widely deployed, but is ultimately restricted by the data movement and memory accesses that dominate deep-learning computations. In-memory computing, based on both SRAM and emerging memory, offers fundamentally new tradeoffs for overcoming these barriers, with the potential for 10x higher energy efficiency and area-normalized throughput demonstrated in recent designs. But those tradeoffs introduce new challenges, especially affecting scaling to the level of computations required, integration in practical heterogeneous architectures, and mapping of diverse software. This talk examines those tradeoffs to characterize the challenges. It then explores recent research that provides promising paths forward, making in-memory computing more of a practical reality than ever before.
Naveen Verma received the B.A.Sc. degree in Electrical and Computer Engineering from UBC, Vancouver, Canada in 2003, and the M.S. and Ph.D. degrees in Electrical Engineering from MIT in 2005 and 2009, respectively. Since July 2009 he has been a faculty member at Princeton University. His research focuses on advanced sensing systems, exploring how systems for learning, inference, and action planning can be enhanced by algorithms that exploit new sensing and computing technologies. This includes research on large-area, flexible sensors, energy-efficient statistical-computing architectures and circuits, and machine-learning and statistical-signal-processing algorithms. Prof. Verma has served as a Distinguished Lecturer of the IEEE Solid-State Circuits Society, and currently serves on the technical program committees for ISSCC, VLSI Symp., DATE, and IEEE Signal-Processing Society (DISPS).
15:15 - 15:45
Algorithm-Accelerator Co-Design for Neural Network Specialization
Zhiru Zhang, Cornell University
In recent years, machine learning (ML) with deep neural networks (DNNs) has been widely deployed in diverse application domains. However, the growing complexity of DNN models, the slowdown of technology scaling, and the proliferation of edge devices are driving a demand for higher DNN performance and energy efficiency. ML applications have shifted from general-purpose processors to dedicated hardware accelerators in both academic and commercial settings. In line with this trend, there has been an active body of research on both algorithms and hardware architectures for neural network specialization.
This talk presents our recent investigation into DNN optimization and low-precision quantization, using a co-design approach featuring contributions to both algorithms and hardware accelerators. First, we review static network pruning techniques and show a fundamental link between group convolutions and circulant matrices – two previously disparate lines of research in DNN compression. Then we discuss channel gating, a dynamic, fine-grained, and trainable technique for DNN acceleration. Unlike static approaches, channel gating exploits input-dependent dynamic sparsity at run time. This results in a significant reduction in compute cost with a minimal impact on accuracy. Finally, we present outlier channel splitting, a technique to improve DNN weight quantization by removing outliers from the weight distribution without retraining.
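As a rough illustration of the outlier channel splitting idea mentioned above, the sketch below duplicates the input channel that holds the largest-magnitude weight and halves both copies, so the layer output is unchanged while the weight range shrinks. It is a minimal pure-Python reading of the idea for a single linear layer, not the speaker's implementation; the function names `split_outlier_channel`, `linear`, and `max_abs` are hypothetical.

```python
def linear(weights, inputs):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def max_abs(weights):
    """Largest weight magnitude, which sets the quantization range."""
    return max(abs(w) for row in weights for w in row)


def split_outlier_channel(weights, inputs):
    """Split the input channel containing the largest-magnitude weight.

    Assumption: splitting is shown on one layer only; in a real network the
    duplicated input would come from duplicating a channel of the previous layer.
    """
    n_in = len(inputs)
    # Column (input channel) that holds the outlier weight.
    col = max(range(n_in), key=lambda j: max(abs(row[j]) for row in weights))
    new_weights = []
    for row in weights:
        half = row[col] / 2.0
        # Keep one halved copy in place and append the other at the end.
        new_weights.append(row[:col] + [half] + row[col + 1:] + [half])
    # The duplicated channel sees the same input activation.
    new_inputs = inputs + [inputs[col]]
    return new_weights, new_inputs


weights = [[0.5, -4.0, 0.25], [1.0, 2.0, -0.5]]
inputs = [1.0, 2.0, 3.0]
split_w, split_x = split_outlier_channel(weights, inputs)
print(linear(weights, inputs), linear(split_w, split_x))
print(max_abs(weights), max_abs(split_w))
```

The price of the narrower weight range is one extra channel per split, which is why only a small number of outlier channels would be split in practice.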
Zhiru Zhang is an Associate Professor in the School of ECE at Cornell University. His current research investigates new algorithms, design methodologies, and automation tools for heterogeneous computing. His research has been recognized with a Google Faculty Research Award (2018), the DAC Under-40 Innovators Award (2018), the Rising Professional Achievement Award from the UCLA Henry Samueli School of Engineering and Applied Science (2018), a DARPA Young Faculty Award (2015), the IEEE CEDA Ernest S. Kuh Early Career Award (2015), an NSF CAREER Award (2015), the Ross Freeman Award for Technical Innovation from Xilinx (2012), and multiple best paper awards and nominations. Prior to joining Cornell, he was a co-founder of AutoESL, a high-level synthesis start-up later acquired by Xilinx.
15:45 - 16:15
Poster Session #2
Pushing the limits of RNN Compression
Urmish Thakker, Igor Fedorov, Jesse Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika and Matthew Mattina
Arm ML Research and AMD Research
On hardware-aware probabilistic frameworks for resource constrained embedded applications
Laura Isabel Galindez Olascoaga, Wannes Meert, Nimish Shah, Guy Van den Broeck and Marian Verhelst
KU Leuven and UCLA
Neural Networks Weights Quantization: Target None-retraining Ternary (TNT)
Tianyu Zhang, Lei Zhu, Qian Zhao and Kilho Shin
WeBank, Harbin Engineering University, University of Hyogo and Gakushuin University
Regularized Binary Network Training
Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux and Vahid Partovi Nia
UCLA, Université de Montréal and Huawei
Fully Quantized Transformer for Improved Translation
Gabriele Prato, Ella Charlaix and Mehdi Rezagholizadeh
Université de Montréal and Huawei
16:15 - 16:45
Kernel and Graph Optimization for DL Model Execution
Jinwon Lee, Qualcomm
There is increasing demand to deploy diverse deep learning models on edge devices. However, fully optimizing the execution of such models on resource-constrained hardware (e.g., CPU, DSP, NPU) is intrinsically challenging and often requires significant manual effort. In this talk, we introduce our Morpheus team’s efforts to address these challenges. First, we optimize the performance of DL model execution at the kernel level (e.g., a convolution operator). From a large number of possible kernel configurations (e.g., tiling, unrolling, vectorization), the fastest kernel is quickly identified through machine learning algorithms we developed, while binary code is automatically generated by the TVM or Halide compilers. Second, we further optimize the performance of DL model execution at the graph level (e.g., end-to-end network). Since kernels or operators are connected as a graph in deep learning models, the compute scheduling of such graphs significantly affects the end-to-end performance, especially memory I/O. We solve two problems in this context. First, for potentially complex topologies on edge devices with limited total memory, we solve the minimum memory usage problem, thus characterizing and enabling deployment of all feasible networks on a given device. Second, for any hardware with combined Tightly Coupled Memory (TCM) and more expensive external memory (e.g., DRAM), we solve the minimum external memory access problem, which optimizes hardware usage efficiency in I/O-bound conditions. For both problems we show efficient algorithms that are complete solutions, and improved results over heuristic methods. Finally, we will discuss our future directions to optimize deep learning model execution.
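To make the minimum memory usage problem concrete, the sketch below brute-forces every valid execution order of a tiny operator graph and reports the order with the lowest peak memory. This is only an illustration under simple assumptions (each operator produces one tensor, which stays in memory until all of its consumers have run); it is exhaustive search rather than the efficient algorithms the talk refers to, and the names `peak_memory`, `is_topological`, and `min_peak_schedule` are hypothetical.

```python
from itertools import permutations


def is_topological(order, deps):
    """True if every op appears after all ops whose outputs it reads."""
    seen = set()
    for op in order:
        if any(src not in seen for src in deps.get(op, [])):
            return False
        seen.add(op)
    return True


def peak_memory(order, sizes, deps):
    """Peak total size of live tensors when ops execute in `order`."""
    consumers = {op: set() for op in sizes}
    for op, srcs in deps.items():
        for src in srcs:
            consumers[src].add(op)
    live, done, peak = set(), set(), 0
    for op in order:
        live.add(op)  # the output is allocated while inputs are still held
        peak = max(peak, sum(sizes[t] for t in live))
        done.add(op)
        # Free tensors whose consumers have all run; keep final outputs.
        live = {t for t in live
                if not consumers[t] or not (consumers[t] <= done)}
    return peak


def min_peak_schedule(sizes, deps):
    """Exhaustive search; only feasible for very small graphs."""
    best = None
    for order in permutations(sizes):
        if is_topological(order, deps):
            cost = peak_memory(order, sizes, deps)
            if best is None or cost < best[0]:
                best = (cost, order)
    return best


# Hypothetical graph: two branches from `a`, each with a large intermediate.
sizes = {"a": 1, "b": 4, "c": 4, "d": 1, "f": 1, "e": 1}
deps = {"b": ["a"], "c": ["a"], "d": ["b"], "f": ["c"], "e": ["d", "f"]}
print(peak_memory(("a", "b", "c", "d", "f", "e"), sizes, deps))
print(min_peak_schedule(sizes, deps))
```

In this toy graph, finishing one branch before starting the other avoids holding both large intermediates at once, which is the kind of effect a graph-level scheduler exploits.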
Jinwon Lee is a Senior Staff Engineer at the Qualcomm AI Research lab, where he designs state-of-the-art deep learning models for edge devices. He received his Ph.D. in Computer Science from Korea Advanced Institute of Science and Technology (KAIST) in 2009. Jinwon is currently focused on deep learning model optimizations for edge devices, including kernel/graph compiler optimization, model compression/quantization, and HW-aware neural architecture search. Previously, he developed a deep learning-based on-device speech enhancement/recognition engine for Qualcomm SoCs. He also developed low-power context-aware engines for mobile use cases based on GPS, WiFi, and motion sensors.
16:45 - 17:15
Adaptive Multi-Task Neural Networks for Efficient Inference
Rogerio Feris, IBM Research
Very deep convolutional neural networks have shown remarkable success in many computer vision tasks, yet their computational expense limits their impact in domains where fast inference is essential. While there has been significant progress on model compression and acceleration, most methods rely on a one-size-fits-all network, where the same set of features is extracted for all images or tasks, no matter their complexity. In this talk, I will first describe an approach called BlockDrop, which learns to dynamically choose which layers of a deep network to execute during inference, depending on the image complexity, so as to best reduce total computation without degrading prediction accuracy. Then, I will show how this approach can be extended to design compact multi-task networks, where a different set of layers is executed depending on the task complexity, and the level of feature sharing across tasks is automatically determined to maximize both the accuracy and efficiency of the model. Finally, I will conclude the talk by presenting an efficient multi-scale neural network model, which achieves state-of-the-art results in terms of accuracy and FLOPS reduction on standard benchmarks such as the ImageNet dataset.
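The dynamic layer selection described above can be pictured with a toy residual network in which a per-input binary policy decides which blocks run. The sketch below is a minimal illustration on scalars: the policy is hard-coded as an assumption, whereas the abstract describes it as learned, and the name `run_with_policy` and the example blocks are hypothetical.

```python
def run_with_policy(x, blocks, policy):
    """Apply residual blocks, skipping those the policy switches off.

    blocks: list of callables computing the residual branch.
    policy: list of 0/1 flags, one per block (assumed given, not learned here).
    Returns the output and the number of blocks actually executed.
    """
    executed = 0
    for block, keep in zip(blocks, policy):
        if keep:
            x = x + block(x)  # standard residual update
            executed += 1
        # A skipped block falls back to the identity shortcut, costing nothing.
    return x, executed


# Hypothetical residual branches operating on a scalar "feature".
blocks = [lambda v: 0.5 * v, lambda v: 1.0, lambda v: -0.25 * v]

print(run_with_policy(2.0, blocks, [1, 1, 1]))  # full network
print(run_with_policy(2.0, blocks, [1, 0, 1]))  # middle block dropped
```

Because every block has an identity shortcut, dropping a block never changes tensor shapes, which is what makes per-input skipping practical in residual architectures.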
Rogerio Schmidt Feris is the head of computer vision and multimedia research at IBM T.J. Watson Research Center. He joined IBM in 2006 after receiving a Ph.D. from the University of California, Santa Barbara. He has also worked as an Affiliate Associate Professor at the University of Washington and as an Adjunct Associate Professor at Columbia University. His work has not only been published in top AI conferences, but has also been integrated into multiple IBM products, including Watson Visual Recognition, Watson Media, and Intelligent Video Analytics. He currently serves as an Associate Editor of TPAMI, has served as a Program Chair of WACV 2017, and as an Area Chair of conferences such as NeurIPS, CVPR, and ICCV.
17:15 - 17:45
Configurable Cloud-Scale DNN Processor for Real-Time AI
Bita Rouhani, Microsoft
Growing computational demands from deep neural networks (DNNs), coupled with diminishing returns from general-purpose architectures, have led to a proliferation of Neural Processing Units (NPUs). In this talk, we will discuss Project Brainwave, a production-scale system for real-time (low latency and high throughput) DNN inference. The Brainwave NPU is reconfigurable and deployed at production scale. This reconfigurability, in turn, eliminates costly silicon updates to accommodate evolving state-of-the-art models while enabling orders of magnitude performance improvement compared to highly optimized software solutions.
Bita Rouhani is a senior researcher in the Microsoft Azure AI and Advanced Architecture group. Bita received her Ph.D. in Computer Engineering from the University of California San Diego in 2018. Bita’s research interests include algorithm and hardware co-design for succinct and assured deep learning, real-time data analysis, and safe machine learning. Her work has been published at top-tier computer architecture, electronic design, machine learning, and security conferences and journals including ISCA, ASPLOS, ISLPED, DAC, ICCAD, FPGA, FCCM, SIGMETRICS, S&P magazine, and ACM TRETS.
17:45 - 18:45
Paper Session #2
Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models
Peter Izsak, Shira Guskin and Moshe Wasserblat
Intel AI Lab
Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference
Shun Liao, Ting Chen, Tian Lin, Denny Zhou and Chong Wang
University of Toronto, Google and ByteDance
Algorithm-hardware Co-design for Deformable Convolution
Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer and John Wawrzynek
UC Berkeley, Peking University and University of Chinese Academy of Sciences
Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Inference
Jeffrey McKinstry, Steven Esser, Rathinakumar Appuswamy, Deepika Bablani, John Arthur, Izzet Yildiz and Dharmendra Modha
Instant Quantization of Neural Networks using Monte Carlo Methods
Gonçalo Mordido, Matthijs Van Keirsbilck and Alexander Keller
Hasso Plattner Institute and NVIDIA
Spoken Language Understanding on the Edge
Alaa Saade, Alice Coucke, Alexandre Caulier, Joseph Dureau, Adrien Ball, Théodore Bluche, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril and Maël Primet