The 9th EMC2 - Energy Efficient Machine Learning and Cognitive Computing

Co-located with the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2024)

Saturday, April 27, 2024
San Diego, CA, USA

Workshop Objective

Artificial intelligence (AI) continues to permeate everyday life, aided by advances in automation, algorithms, and innovative hardware and software technologies. With the growing prominence of AI, the Multimodal Large Language Model (MLLM) has emerged as a new foundational model architecture that pairs a powerful Large Language Model (LLM) with the ability to handle multimodal tasks efficiently. MLLMs achieve surprising capabilities on text and images, suggesting a path toward Artificial General Intelligence (AGI). As new frontiers open in the execution of MLLMs, we face new challenges in the software/hardware co-design ecosystem. There is a growing realization of the energy cost of developing and deploying MLLMs. Training and inference for the most successful MLLMs have become exceedingly power-hungry, often dwarfing the energy needs of entire households for years. At the edge, applications that use these LLMs for inference are ubiquitous in cell phones, appliances, smart sensors, vehicles, and even wildlife monitors, where efficiency is paramount for practical reasons.

Call for Papers

The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of MLLMs as they are practiced today and as they will evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants in a fully digital setting. Our forum facilitates active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on the latest solutions, including works-in-progress, to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects about efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed on this page. The proceedings from previous instances have been published through the prestigious IEEE Conference Publishing Services (CPS) and are available to the community via IEEE Xplore. In each instance, IEEE conducted an independent assessment of the papers for quality.

Topics for the Workshop

  • Neural network architectures for resource constrained applications
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs
  • Power and performance efficient memory architectures suited for neural networks
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning
  • Optimizations to improve performance of training techniques including on-device and large-scale learning
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems
08:15 - 08:30

Welcome and Opening Remarks

Satyam Srivastava, d-Matrix

08:30 - 09:00
Invited Talk

Dense-and-Sparse Quantization Methods for Efficient LLM Serving

Amir Gholami, UC Berkeley

The availability of unprecedented amounts of unsupervised training data, along with neural scaling laws, has resulted in a surge in model size and compute requirements for serving and training LLMs. However, the main performance bottleneck for serving these models is increasingly shifting from compute to memory bandwidth. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, I will discuss our ongoing work on a new type of quantization scheme called dense-and-sparse quantization, which enables lossless compression to ultra-low precisions, as low as 2 bits, while achieving state-of-the-art accuracy. Dense-and-sparse quantization does this by decomposing the parameters and KV cache values into two components: a sparse component that holds the outliers and sensitive values in the network, and a dense component that is amenable to low-precision compression. This allows us to achieve lossless compression of model parameters down to 3 bits and of KV cache values down to 2 bits, enabling serving of a LLaMA-7B model on a single A100 GPU even with a context length of 1M tokens.
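The decomposition described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the magnitude-based outlier rule, and the uniform quantizer are stand-ins, and the abstract does not specify how the speaker's method selects sensitive values or quantizes the dense part.

```python
def dense_and_sparse(weights, outlier_fraction=0.05, bits=3):
    """Split weights into exact sparse outliers and a dense low-bit remainder.

    Hypothetical helper: outliers are picked by magnitude and the dense
    part is quantized uniformly, purely to illustrate the decomposition.
    Assumes at least one value is left in the dense part.
    """
    n_outliers = max(1, round(len(weights) * outlier_fraction))
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    # Sparse component: index -> full-precision value.
    sparse = {i: weights[i] for i in by_magnitude[:n_outliers]}

    # Dense component: the remaining values now span a much narrower range.
    dense_vals = [w for i, w in enumerate(weights) if i not in sparse]
    lo, hi = min(dense_vals), max(dense_vals)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero step for constant input
    codes = [0 if i in sparse else round((w - lo) / scale)
             for i, w in enumerate(weights)]
    return sparse, codes, lo, scale


def reconstruct(sparse, codes, lo, scale):
    """Rebuild the weights: exact outliers, dequantized dense values."""
    return [sparse.get(i, lo + c * scale) for i, c in enumerate(codes)]


if __name__ == "__main__":
    w = [0.1, -0.2, 0.05, 8.0, 0.15, -0.1, 0.0, -7.5, 0.2, -0.05]
    sparse, codes, lo, scale = dense_and_sparse(w, outlier_fraction=0.2)
    rec = reconstruct(sparse, codes, lo, scale)
    print("outliers kept exactly:", sparse)
    print("max error:", max(abs(a - b) for a, b in zip(w, rec)))
```

Removing the two large values before choosing the quantization range shrinks the step size of the dense part, which is why a few bits can suffice for the bulk of the values in this toy setting.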

09:00 - 09:30
Invited Talk

Tri Dao, Princeton University

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. In the first half, we describe FlashAttention, a fast and memory-efficient exact attention algorithm. By making attention algorithms IO-aware (accounting for reads and writes between levels of GPU memory), FlashAttention is 4-8x faster than optimized baselines, enabling 4-16x longer context in Transformers and yielding higher quality models. We will also describe optimizations for long-context LLM inference, leading to 2-8x faster end-to-end inference time. In the second half, we focus on subquadratic-time architectures such as structured state space models (SSMs). We identify that a key weakness of such models is their inability to perform content-based reasoning, and propose a selection mechanism to address this shortcoming. Though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks. The resulting Mamba architecture matches or exceeds the performance of strong modern Transformers on language modeling, validated at 1B and 3B scales on both pretraining and downstream evaluation, while enjoying 5x higher inference throughput and linear scaling in sequence length.
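Exact attention can be computed one block of keys and values at a time if a running maximum and a running softmax denominator are carried along. The sketch below shows only that numerical idea for a single query, in plain Python. It is not the FlashAttention kernel, which is a fused GPU implementation; the function names and block size are assumptions, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T) V with all scores materialized at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting keys/values block by block (online softmax)."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(tiled_attention(q, keys, values, block_size=2))
```

Because only one block of scores exists at any time, the full score vector never has to be stored, which is the property an IO-aware implementation exploits to keep working data in fast on-chip memory.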

10:00 - 10:15


10:15 - 10:45
Invited Talk
10:45 - 11:15
Invited Talk

Put LLMs on device? Challenges and new opportunities

Zechun Liu, Meta Reality Labs

Large language models (LLMs) are permeating various facets of human life, influencing not only communication and work but also shaping everyday entertainment experiences. There is an increasing demand to deploy LLMs on smartphones and mobile devices, but limitations in memory size and computational cost make this difficult. To address these challenges, a new research direction has emerged that focuses on downsizing LLMs for on-device inference. This includes techniques such as model compression, deployment acceleration, and small model design. In this talk, we will discuss the constraints and solutions for optimizing models in on-device use cases as well as practical methods for LLM quantization.

11:15 - 11:45
Invited Talk

Rapid LLM deployments: with great power comes great responsibility

Esha Choukse, Microsoft Research

With the ubiquitous use cases of modern LLMs, the deployment scale of these models is unprecedented. This has led to a large-scale datacenter expansion with GPUs, which is currently running into an energy wall worldwide. This talk will focus on the properties of generative LLMs that can be used to make the deployment of these models more power-efficient. The talk will also introduce POLCA and Splitwise, two techniques to reduce the power consumption of LLM serving.

11:45 - 12:15
Paper Session #1
12:15 - 13:30
Lunch Break

13:30 - 14:00
Paper Session #2
14:00 - 15:00

Efficient Multi-modal LLM

Song Han, MIT, NVIDIA

This talk presents efficient multi-modal LLM innovations and system implementations. I’ll first present VILA, a visual language model pre-training recipe that goes beyond visual instruction tuning, enabling multi-image reasoning and in-context learning, followed by SmoothQuant and AWQ for LLM quantization and the TinyChat inference library. AWQ and TinyChat make VILA 2.7B deployable on Jetson Orin Nano, bringing new opportunities for mobile vision applications. Second, I’ll present efficient representation learning, including EfficientViT for high-resolution vision, which accelerates SAM by 48x without performance loss, and condition-aware neural networks (CAN), a novel way to add control to diffusion models. Third, I’ll present StreamingLLM, a KV cache optimization technique for long conversations, and LongLoRA, which uses sparse, shifted attention for long-context LLMs. Finally, I’ll present PockEngine for efficient LLM fine-tuning. Many of these techniques have been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM.
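As a rough illustration of the kind of KV cache policy StreamingLLM is known for, the sketch below keeps the first few tokens (often called attention sinks) plus a sliding window of the most recent tokens, so the cache size stays bounded during a long conversation. The class name, default sizes, and the use of plain integers in place of key/value tensors are assumptions made for this example and are not taken from the talk.

```python
from collections import deque


class SinkWindowCache:
    """Bounded KV cache: a few initial 'sink' entries plus a recent window.

    Hypothetical illustration; real entries would be per-layer key/value
    tensors rather than arbitrary Python objects.
    """

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # first tokens, never evicted
        self.recent = deque(maxlen=window)   # rolling window of latest tokens

    def append(self, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            # A full deque silently drops its oldest entry.
            self.recent.append(kv)

    def entries(self):
        """Entries that attention would see at the current step."""
        return self.sinks + list(self.recent)


if __name__ == "__main__":
    cache = SinkWindowCache(num_sinks=2, window=3)
    for token_position in range(10):
        cache.append(token_position)
    print(cache.entries())
```

The cache never holds more than `num_sinks + window` entries, regardless of how many tokens have been generated.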

15:00 - 16:00

The Path to AGI: Directions and Challenges

Recent advances in AI have profoundly affected our perception of the technology. Not only are there domains where AI has reached superhuman performance levels, it is also starting to influence our interaction with machines (courtesy of LLMs). At the same time, observations of emergent properties have excited many in the community as a potential existence proof of future AGI. How and when we get there is an open question, though. In this panel we discuss the various elements across algorithm innovation, computing infrastructure, training and data generation, and even ethics and governance that need to come together for us to create truly intelligent systems.

Moderator: Raj Parihar

16:00 - 16:15


16:15 - 16:45
Invited Talk
16:45 - 17:15
Invited Talk

Efficient AI Programming with Mojo and Max

Tatiana Shpeisman, Modular

As AI grows in its capabilities and ubiquity, it becomes increasingly important to improve the efficiency of AI applications. At the same time, developing and deploying new applications requires the ability to iterate quickly on the design and the flexibility to modify it as required by the deployment scenario. In this talk, we will describe how Mojo and the Modular MAX platform help to achieve this goal. We will give a brief overview of Mojo and MAX and illustrate by example how AI programmers can benefit from MAX and its extensibility via Mojo.

17:15 - 17:45
Invited Talk

CHAI: Clustered Head Attention for Efficient LLM Inference

Bilge Acun, Meta

Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive: a single request can require multiple GPUs and tens of gigabytes of memory. Multi-Head Attention is one of the key components of LLMs and can account for over 50% of an LLM's memory and compute requirements. We observe that there is a high amount of redundancy across heads in which tokens they pay attention to. Based on this insight, we propose Clustered Head Attention (CHAI). CHAI combines heads with a high amount of correlation for self-attention at runtime, thus reducing both memory and compute. In our experiments, we show that CHAI is able to reduce the memory requirements for storing the K,V cache by up to 21.4% and inference time latency by up to 1.73x without any fine-tuning required. CHAI achieves this with a maximum 3.2% deviation in accuracy across 3 different models (i.e. OPT-66B, LLAMA-7B, LLAMA-33B) and 5 different evaluation datasets.
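The idea of grouping heads that attend to the same tokens can be sketched in a few lines. The example below measures Pearson correlation between per-head attention score vectors and greedily groups heads whose correlation with a cluster's first member exceeds a threshold. The greedy rule, the threshold, and the function names are assumptions for illustration only; the abstract does not state which clustering procedure CHAI uses.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_heads(head_scores, threshold=0.9):
    """Greedily group heads whose attention patterns are highly correlated.

    head_scores[h] is the attention distribution of head h over the tokens.
    The first head in each cluster acts as its representative.
    """
    clusters = []
    for h, scores in enumerate(head_scores):
        for cluster in clusters:
            representative = head_scores[cluster[0]]
            if pearson(representative, scores) >= threshold:
                cluster.append(h)
                break
        else:
            clusters.append([h])  # no similar cluster found: start a new one
    return clusters


if __name__ == "__main__":
    head_scores = [
        [0.70, 0.10, 0.10, 0.10],  # head 0 focuses on the first token
        [0.68, 0.12, 0.10, 0.10],  # head 1 behaves almost identically
        [0.10, 0.10, 0.10, 0.70],  # head 2 focuses on the last token
        [0.12, 0.10, 0.08, 0.70],  # head 3 behaves like head 2
    ]
    print(cluster_heads(head_scores))
```

In this toy setting, attention scores would only need to be computed for one representative per cluster and shared with the other members, which is where the memory and compute savings would come from.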

17:45 - 18:15
Invited Talk

Teaching LLMs to Use Tools at Scale

Shishir Patil, UC Berkeley

In this talk, we will explore our innovative approach to integrating Large Language Models (LLMs) with various tools via APIs. Bridging LLMs with APIs presents a significant challenge, primarily because of the models’ struggles to generate precise input arguments and their propensity to hallucinate API calls. Gorilla LLM, trained with our novel Retriever-Aware-Training (RAT), surpasses the performance of all open-sourced LLMs on writing API calls. Gorilla also introduces a novel PL-inspired metric to measure hallucination, which is commonly encountered in LLMs. Gorilla is an open-source project that has served hundreds of thousands of user requests, with enterprise adoption and an energetic community supporting it. We’ll also spotlight the Berkeley Function Calling Leaderboard, which evaluates an LLM’s ability to call functions (tools) accurately. We’ll conclude with lessons from our deployment experiences and present open research questions to enable wider integration of LLMs in applications.

18:15 - 18:30

Closing Remarks