The 8th EMC2 - Energy Efficient Training and Inference of Transformer Based Models

Co-located with the 37th AAAI Conference on Artificial Intelligence (AAAI 2023)

Tuesday, February 14, 2023
Washington DC, USA
Room: 146A
Click Here for: Slides from the Workshop

Workshop Objective

Transformers are the foundation of large deep learning language models. Recent successes of Transformer-based models in image classification and action prediction indicate their wide applicability. This workshop focuses on leading ideas that use Transformer models, such as PaLM from Google. We will learn about the key observations from this work on model performance, optimizations for inference, and the power consumption of both mixed-precision inference and training.

Call for Papers

The goal of this Workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of machine learning and deep learning as it is practiced today and as it will evolve in the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on latest solutions including works-in-progress to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
08:30 - 08:45

Welcome and Opening Remarks

Fanny Nina Paravecino, Microsoft

08:45 - 09:45

Efficient Learning in Single- and Multi-Modal Vision Transformers

Diana Marculescu, UT Austin

Transformers have revolutionized the way we approach reasoning and learning tasks in the field of computer vision, in both single- and multi-modal settings. Self-supervised pre-training methods, such as the Masked Autoencoder (MAE), have emerged as a solution to maximize the potential of vision transformers, although MAE requires a large number of epochs to pre-train, making it expensive in practice. Our work introduces a supervised pre-training approach called SupMAE, which is more efficient, robust, and effective in transfer learning than MAE and other supervised pre-training methods. SupMAE achieves similar performance to MAE on ImageNet with the ViT-B/16 model while using only 30% of the compute cost, and outperforms MAE on ImageNet variants. In addition, techniques have been developed to reduce the computational cost of vision transformers through post-training quantization, often using mixed precision schemes or partitioning the model. We propose CPT-V, a contrastive loss-based approach using block-based evolutionary search for quantization scales that results in 1.5% better accuracy for 3, 4, and 8-bit models in less time than existing methods. In the case of multi-modal tasks such as audio-video event localization, effective multi-modal feature correspondence is necessary to understand the various temporal interactions. Existing approaches struggle in this regard due to ineffective multi-modal training strategies. We introduce AVE-CLIP, a framework that combines AudioCLIP, a model pre-trained on large-scale audio-visual data, with a multi-window temporal transformer to effectively handle different temporal scales of video frames. AVE-CLIP improves performance on the AVE dataset by 5.9% compared to previous work, demonstrating its effectiveness in practice. Taken together, SupMAE, CPT-V, and AVE-CLIP demonstrate how to improve the efficiency and effectiveness of vision transformers in a variety of tasks.

09:45 - 10:15
Invited Talk

Efficient transformers - From supercomputers to smartphones

Torsten Hoefler, ETH Zurich

Billion-parameter artificial intelligence models have shown exceptional performance across a large variety of tasks ranging from natural language processing, computer vision, and image generation to mathematical reasoning and algorithm generation. Those models usually require large parallel computing systems, often called ‘AI Supercomputers’, to be trained initially. We will outline several techniques ranging from data ingestion and parallelization to accelerator optimization that improve the efficiency of such training systems. Yet, training large models is only a small fraction of practical artificial intelligence computations. Efficient inference is even more challenging: models with hundreds of billions of parameters are expensive to use. We continue by discussing model compression and optimization techniques such as fine-grained sparsity as well as quantization to reduce model size and significantly improve efficiency during inference. These techniques may eventually enable inference with powerful models on hand-held devices.

10:15 - 10:45
Invited Talk

Faster Neural Network Training, Algorithmically

Jonathan Frankle, MosaicML

Training modern neural networks is time-consuming, expensive, and energy-intensive. As neural network architectures double in size every few months, it is difficult for researchers and businesses without immense budgets to keep up, especially as hardware improvements stagnate. In this talk, I will describe one approach for managing this challenge: changing the training algorithm itself. I will discuss how we have put this approach into practice at MosaicML, including the dozens of algorithmic changes we have studied (which are freely available open source), the science behind how these changes interact with each other (the composition problem), and how we evaluate whether these changes have been effective. I will also detail several surprises we have encountered and lessons we have learned along the way. In the months since we began this work, we have reduced the training times of standard computer vision models by 5-7x and standard language models by 2-3x on publicly available cloud instances, and we’re just scratching the surface.

10:45 - 10:55


10:55 - 12:10
Paper Session #1

Enabling Faster Vision Transformers via Soft Token Pruning

Zhenglun Kong

TangoBERT: A Cascaded Architecture for Reduced Inference Cost

Jonathan Mamou

MBR-Sim: A Speed-of-Light Model to Explore Machine Learning Accelerator Architectures

Ben Maydan

Input-length-shortening and text generation via attention values

Michael Witbrock

Block Format Error Bounds and Optimal Block Size Selection

Ilya Soloveychik

12:10 - 12:30
Lightning Session #1

Pruning Compact ConvNets For Efficient Inference

Karthik Prasad

A Reinforcement Learning Based Compiler Framework For In-Memory Compute Accelerator

Tristan Trouwen

12:05 - 12:30
Poster Session #1
12:30 - 14:00
Lunch Break


14:00 - 14:30
Invited Talk

Neural Network Design and Training for Efficient On-Device Learning

Tien-Ju Yang, Google Research

Edge devices generate a large amount of data that can be used to improve the performance of neural networks. However, the private nature of such data prevents them from being uploaded to servers and requires on-device learning. This presents a challenge pertaining to the tight resource budget associated with edge devices. To address this challenge, we will present approaches to neural network design and training in this talk. To design efficient network architectures that require low resources to train, we propose the hardware-aware neural architecture search algorithm, NetAdaptV2. NetAdaptV2 automatically and rapidly discovers an efficient network architecture in terms of the given metrics on the target hardware. It uses empirical measurements to guide the search so that no hardware knowledge is required. According to our experiments, NetAdaptV2 discovers network architectures with better accuracy-latency/accuracy-MAC trade-offs than related works and reduces the total search time by up to 5.8x on ImageNet. To improve the efficiency of training under the setting of federated learning, we propose the efficient training algorithm, Partial Variable Training (PVT). PVT reduces memory usage and communication cost by training only a small subset of variables on edge devices. With PVT, we show that network accuracy can be maintained by utilizing more local training steps and devices, which is favorable for federated learning involving a large population of devices. According to our experiments on state-of-the-art neural networks for speech recognition and two different datasets, PVT can reduce memory usage by up to 1.9x and communication cost by up to 593x while attaining comparable accuracy when compared with full network training.
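To make the communication saving concrete, here is a minimal sketch of one federated round in which each client trains and uploads only a subset of variables. It is not the PVT algorithm from the talk: the function names, the dictionary model layout, the plain averaging rule, and the choice of which variables are trainable are all assumptions made for illustration.

```python
def client_update(global_model, trainable, local_grads, lr=0.1):
    """Apply one gradient step to the trainable variables only.

    Returns just those variables, which is all the client uploads.
    Frozen variables are never sent back to the server.
    """
    return {
        name: [w - lr * g for w, g in zip(global_model[name], local_grads[name])]
        for name in trainable
    }


def server_aggregate(global_model, uploads):
    """Average each uploaded variable across the clients that sent it."""
    new_model = {name: list(values) for name, values in global_model.items()}
    for name in global_model:
        sent = [upload[name] for upload in uploads if name in upload]
        if sent:
            new_model[name] = [sum(column) / len(sent) for column in zip(*sent)]
    return new_model


def upload_cost(upload):
    """Number of scalar values a client sends to the server."""
    return sum(len(values) for values in upload.values())


# Hypothetical model: a large frozen encoder and a small trainable head.
model = {"encoder": [0.0] * 1000, "head": [0.0] * 10}
grads_a = {"encoder": [1.0] * 1000, "head": [1.0] * 10}
grads_b = {"encoder": [3.0] * 1000, "head": [3.0] * 10}

uploads = [
    client_update(model, {"head"}, grads_a),
    client_update(model, {"head"}, grads_b),
]
model = server_aggregate(model, uploads)
print(upload_cost(uploads[0]), "values uploaded instead of", 1000 + 10)
```

In this toy setup each client uploads 10 values rather than 1010, and the frozen encoder stays unchanged on the server. The actual savings reported in the talk depend on the model and on how the trainable subset is chosen.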

14:30 - 15:00
Invited Talk

Hardware/Software Codesign to Accelerate Vision Transformers on FPGAs

Zhenman Fang, Simon Fraser University

Vision transformers (ViTs) have recently emerged with significantly improved accuracy in computer vision tasks, surpassing state-of-the-art convolutional neural networks. However, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-constrained edge devices. In this talk, I will present hardware-efficient quantization and pruning techniques to accelerate ViTs on embedded FPGAs. First, I will present Auto-ViT-Acc [FPL 2022], an FPGA-aware ViT acceleration framework that automatically explores a mix of fixed-point and power-of-two quantization schemes to fully leverage heterogeneous FPGA on-chip resources while maximally retaining the model accuracy. Compared with the baseline floating-point FPGA accelerator, our accelerator achieves around 5.6× improvement in frame rate on the AMD-Xilinx ZCU102 FPGA with 0.71% accuracy drop on the ImageNet dataset for DeiT-base. Second, I will present HeatViT [HPCA 2023], a hardware-efficient image-adaptive token pruning framework (together with 8-bit quantization) for efficient yet accurate ViT acceleration on FPGAs. In HeatViT, we adopt an effective, hardware-efficient, and learnable head-evaluation token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate the non-informative tokens from input images. Compared to existing ViT pruning studies, under a similar computation cost, HeatViT can achieve 0.7%∼8.9% higher accuracy for various widely used ViTs on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the AMD-Xilinx ZCU102 FPGA achieve 3.46×∼4.89× speedup with a trivial resource utilization overhead of 8%∼11% more DSPs and 5%∼8% more LUTs.
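The two quantization schemes named in the abstract can be contrasted with a small sketch. This is a generic illustration, not the Auto-ViT-Acc implementation: the function names, the 4-bit default, and the clipping range `max_abs` are assumptions. A common motivation for power-of-two levels is that multiplying by a power of two reduces to a bit shift, while fixed-point levels need a real multiplier.

```python
import math


def quantize_fixed_point(w, bits=4, max_abs=1.0):
    """Uniform symmetric quantization: evenly spaced levels in [-max_abs, max_abs]."""
    levels = 2 ** (bits - 1) - 1           # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels
    q = max(-levels, min(levels, round(w / scale)))
    return q * scale


def quantize_power_of_two(w, bits=4, max_abs=1.0):
    """Map w to a signed power of two, or to zero if it is too small.

    The exponent is rounded in the log2 domain, so this is the nearest
    level on a logarithmic scale, not necessarily on a linear scale.
    """
    if w == 0:
        return 0.0
    n_exponents = 2 ** (bits - 1) - 1      # number of distinct magnitudes
    max_exp = math.floor(math.log2(max_abs))
    min_exp = max_exp - n_exponents + 1
    exp = round(math.log2(abs(w)))
    if exp < min_exp:
        return 0.0                         # underflow: too small to represent
    return math.copysign(2.0 ** min(exp, max_exp), w)


for w in (0.3, -0.6, 0.001):
    print(w, quantize_fixed_point(w), quantize_power_of_two(w))
```

Power-of-two levels are dense near zero and sparse near the clipping range, while fixed-point levels are evenly spaced, which is one reason a framework might mix the two.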

15:00 - 15:15


15:15 - 16:15

On-Device Training under 256KB Memory

Song Han, MIT

On-device training enables the model to adapt to new data collected from the sensors. Users can benefit from customized AI models without having to transfer the data to the cloud, preserving privacy. However, the training memory footprint is prohibitive for IoT devices. I’ll present “Tiny Transfer Learning” (NeurIPS’20) and “On-Device Training Under 256KB Memory” (NeurIPS’22) to solve this issue. I’ll first analyze the memory bottleneck, showing that we should reduce the activations, not just trainable parameters, for efficient on-device learning. I’ll then introduce Quantization-Aware Scaling (QAS) to calibrate the gradient scales and stabilize 8-bit quantized training, and “sparse update” to skip the gradient computation of less important layers and sub-tensors to save activation memory. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads the runtime auto-differentiation to compile time. Deployed on an STM32H746 microcontroller, our framework uses less than 1/1000 of the training memory of TensorFlow and PyTorch while matching the accuracy. Our study enables IoT devices to not only perform inference but also continuously adapt to new data for on-device lifelong learning.
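A rough way to see why activations dominate is to count what the backward pass must keep. For a linear or convolutional layer, the weight gradient is computed from the layer's input activation, so a layer whose weights are frozen does not need to store that input. The sketch below is a simplified accounting under that assumption only: it ignores nonlinearities, bias-only updates, and sub-tensor updates, and the layer names and byte sizes are hypothetical.

```python
def saved_activation_bytes(layers, updated):
    """Sum the input-activation sizes of layers whose weights are updated.

    layers:  list of (name, input_activation_bytes) in forward order
    updated: set of layer names whose weights receive gradients
    """
    return sum(size for name, size in layers if name in updated)


# Hypothetical per-layer input activation sizes, in bytes.
LAYERS = [("conv1", 96_000), ("conv2", 48_000), ("conv3", 24_000), ("fc", 1_000)]

full_update = saved_activation_bytes(LAYERS, {name for name, _ in LAYERS})
sparse_update = saved_activation_bytes(LAYERS, {"conv3", "fc"})

print("full update:  ", full_update, "bytes of saved activations")
print("sparse update:", sparse_update, "bytes of saved activations")
```

In this toy network, early layers have the largest activations, so skipping their weight updates removes most of the saved-activation memory even though the number of trainable parameters may change far less.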

16:15 - 16:45
Invited Talk

Accelerating Recommender Model Training by Leveraging Popular Choices

Divya Mahajan, Microsoft

Recommender models are an important class of applications. However, the transformative effects of these large models are predicated on high-performance compute capabilities for their learning algorithms. As new-age data centers become heterogeneous with emerging domain-specific hardware, we must rethink both the architecture and the corresponding system stack. In this talk, I will describe our work in creating novel architectures, systems, and frameworks that leverage statistical properties of data to best utilize the heterogeneous compute and memory resources for recommender model training.

16:45 - 17:45

ML ASIC, how specific is too specific?

Growth in popularity and complexity of AI has fueled massive investment in ASIC designs for AI. These are often targeted towards a sub-class of AI/ML such as transformers or conv nets. This panel will discuss the implications of such focused designs for HW vendors, data centers, end users, standards groups, and data scientists. Are there niches that justify such specializations, or are we creating hardware solutions for a field which will likely continue to evolve rapidly?

Moderator: Satyam Srivastava

Jonathan Frankle, Chief Scientist, MosaicML

Sudeep Bhoja, CTO, d-Matrix

Russel Hewitt, Senior SW Engineer, Microsoft

David Kanter, Executive Director, MLCommons

17:45 - 17:55