Energy Efficient Machine Learning and Cognitive Computing

11th Edition

Co-located with ASPLOS 2026 in Pittsburgh, PA

Sunday, March 22, 2026 · Room: Ft. Pitt

EMC2 Competition: AI Infrastructure Demos, sponsored by Runara.ai

EMC2 Social: Join us after the workshop! 6:30 – 8:30 PM at Drum Bar, Rivers Casino.

EMC2 Competition: AI Infrastructure Demos

Sponsored by Runara.ai

Overview

The EMC2 Competition invites researchers, practitioners, and students to present production-ready AI infrastructure systems demonstrating measurable improvements in efficiency, observability, and operational robustness. The competition is designed to surface systems that operate under real-world constraints, including scale, reliability, cost, and sustainability.

The primary goal is to move beyond toy benchmarks, isolated micro-optimizations, and paper-only abstractions. We want to see working demos, toolkits, and integrated systems that can be realistically deployed in modern AI infrastructure environments. Submissions should emphasize live execution, actionable telemetry, and end-to-end impact rather than synthetic evaluations.

Competition Tracks

Track 1: Efficient Inference

Systems improving cost, performance, and energy efficiency of inference workloads:

  • Runtime inference optimization
  • Scheduling, batching, parallelism
  • Hardware-aware execution
  • Cost/energy-aware pipelines

Track 2: Infrastructure Observability

Visibility, monitoring, and diagnostics across the AI stack:

  • Live telemetry and visualization
  • Utilization and bottleneck analysis
  • Cross-layer observability
  • Actionable operator insights

Hardware and Model Flexibility

Participants are free to choose any hardware platform and any inference model: GPUs or custom accelerators; cloud-based, on-premise, or edge environments; and open-source or proprietary models. There are no restrictions on vendors, architectures, or model families, provided the submission demonstrates real execution, live metrics, and measurable impact.

Submission Requirements

  • Team size: Up to 2 participants
  • Format: Single-page proposal (PDF) describing the problem, system architecture, hardware/model used, live metrics captured, and how efficiency or observability improvements are demonstrated
  • Originality: The idea does not need to be novel, but the implementation must be original. Prior work may be extended with clear attribution.
  • Demo requirement (mandatory): Submissions must include a working demo that tracks live metrics, demonstrates real execution, and clearly shows improvements in cost, performance, or sustainability. Simulation-only or slide-only submissions will not be considered.
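
For instance, the "live metrics" requirement could be met by sampling GPU telemetry during the demo. The sketch below is a minimal illustration, not a required interface: it assumes an NVIDIA GPU with `nvidia-smi` on the PATH, and `parse_gpu_sample`/`poll_gpu` are hypothetical helper names.

```python
import subprocess

QUERY = "utilization.gpu,power.draw,memory.used"

def parse_gpu_sample(csv_line: str) -> dict:
    """Parse one '--format=csv,noheader,nounits' line into floats."""
    util, power, mem = (field.strip() for field in csv_line.split(","))
    return {"util_pct": float(util), "power_w": float(power), "mem_mib": float(mem)}

def poll_gpu() -> dict:
    """Query the first GPU once via nvidia-smi (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_sample(out.splitlines()[0])

# Example with a captured sample line (no GPU needed to parse):
sample = parse_gpu_sample("87, 243.17, 40532")
```

Samples like these, appended to a time series and plotted live, would satisfy the "tracks live metrics" requirement.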

Submission deadline: March 15

Selection & Presentation

Accepted teams will display their demos during the workshop lunch break and engage directly with judges during in-person evaluations. Based on judging scores, the top 3 teams from each track will be selected for oral presentations.

Final Presentations: 4:00 - 5:00 PM, 10 minutes per team, live demo encouraged

At the conclusion of the session, one winning team per track will be selected.

Awards

Each winning team receives:

$500 cash prize + Certificate of Recognition

Winners eligible for Summer Internship opportunities at Runara.ai

Evaluation Criteria

Since this is an open-scope, applied competition, proposals will be evaluated based on their real-world application potential instead of rigid benchmark metrics. Judging prioritizes: production readiness and robustness, clarity and usefulness of metrics, real measurable end-to-end impact, clean system design, and practical relevance to modern AI infrastructure.

For any questions, please contact raj@runara.ai.


Accepted Papers

  • piPE-SA: Enabling Deeply Pipelined Processing Elements in Systolic Arrays

    Jiayi Wang, Chenyi Wang and Ang Li

  • Fast NF4 Dequantization Kernels for Large Language Model Inference

    Xiangbo Qi, Chaoyi Jiang and Murali Annavaram

  • I-Fuse: Profile-Guided Speculative Load Micro-op Fusion for Data Center Applications

    Deepanjali Mishra, Tanvir Ahmed Khan, Gilles Pokam, Heiner Litz and Akshitha Sriraman

  • CacheFlex: Explicitly Control What You Need in Your Cache

    Jingqun Zhang, Weihang Li, Maohua Nie, Yung-Jen Cheng, Jiayi Wang, Shwet Chitnis and Ang Li

  • Store or Recompute? Characterizing the Carbon Tradeoff of KV Cache Retention in LLM Inference

    Abnash Bassi, Jaylen Wang, Fiodar Kazhamiaka, Daniel S. Berger and Akshitha Sriraman

  • IEI: A Composite Infrastructure Efficiency Index for GPU Inference

    Raj Parihar

EMC2 Social

Sponsored by d-Matrix

Join us after the workshop for an informal social! This is a great opportunity to connect with fellow researchers, speakers, and organizers in a relaxed setting.

Workshop Objective

In the eleventh edition of the EMC2 workshop, we plan to facilitate conversation about the sustainability of the large-scale AI computing systems being developed to meet the ever-increasing demands of generative AI. This involves discussions spanning multiple interrelated areas. First, we continue to serve as the leading forum for discussing the energy efficiency of GenAI workloads, which directly impacts the overall viability and economic value of AI technology. Second, we reassess the scaling laws of AI given the prevalence of agentic, multi-modal, and reasoning-based models, in conjunction with novel techniques such as highly sparse expert architectures and disaggregated computation. Finally, we discuss sustainable, high-performance computing paradigms for efficient datacenters and hybrid computing models that can cater to the exponential growth in model sizes, application areas, and user base. This allows us to explore ideas for building the hardware, software, systems, and scaling infrastructure, as well as the model architectures, that make AI technology even more prevalent and accessible.

Call for Papers

The goal of this workshop is to provide a forum for researchers and industry experts who are exploring novel ideas, tools, and techniques to improve the energy efficiency of MLLMs as they are practiced today and as they will evolve over the next decade. We envision that only through close collaboration between industry and academia will we be able to address the difficult challenges and opportunities of reducing the carbon footprint of AI and its uses. We have tailored our program to best serve the participants. Our forum facilitates an active exchange of ideas through:

  • Keynotes, invited talks and discussion panels by leading researchers from industry and academia
  • Peer-reviewed papers on the latest solutions, including works-in-progress, to seek directed feedback from experts
  • Independent publication of proceedings through IEEE CPS

We invite full-length papers describing original, cutting-edge, and even work-in-progress research projects on efficient machine learning. Suggested topics for papers include, but are not limited to, the ones listed below:

Topics for the Workshop

  • Neural network architectures for resource-constrained applications.
  • Efficient hardware designs to implement neural networks including sparsity, locality, and systolic designs.
  • Power and performance efficient memory architectures suited for neural networks.
  • Network reduction techniques – approximation, quantization, reduced precision, pruning, distillation, and reconfiguration.
  • Exploring interplay of precision, performance, power, and energy through benchmarks, workloads, and characterization.
  • Performance potential, limit studies, bottleneck analysis, profiling, and synthesis of workloads.
  • Explorations and architectures aimed at promoting sustainable computing.
  • Simulation and emulation techniques, frameworks, tools, and platforms for machine learning.
  • Optimizations to improve performance of training techniques including on-device and large-scale learning.
  • Load balancing and efficient task distribution, communication and computation overlapping for optimal performance.
  • Verification, validation, determinism, robustness, bias, safety, and privacy challenges in AI systems.
  • Efficient deployment strategies for edge and distributed environments.
  • Model compression and optimization techniques that preserve reasoning and problem-solving capabilities.
  • Architectures and frameworks for multi-agent systems and retrieval-augmented generation (RAG) pipelines.
  • Systems-level approaches for scaling future foundation models (e.g., Llama 4, GPT-5 and beyond).

We will follow the same formatting guidelines and duplicate-submission policies as ASPLOS.
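
To make one of the network-reduction topics above concrete, symmetric post-training int8 quantization of a weight vector can be sketched in a few lines (pure-Python and illustrative only; real kernels operate on tensors with per-channel or per-group scales):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]

w = [0.42, -1.27, 0.003, 0.9]
q, s = quantize_int8(w)      # q == [42, -127, 0, 90]
w_hat = dequantize(q, s)
```

The round-trip error is bounded by half the scale, which is the knob such techniques trade against memory footprint and bandwidth.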

08:00 - 08:15
Welcome

Welcome and Opening Remarks

Sushant Kondguli, Meta

08:15 - 09:00
Keynote

Future of Energy-Efficient Cognitive Computing: A Six-Word Story on Sentience, Systems, and Sustainability

Parthasarathy Ranganathan, Google link

We are at a pivotal inflection point in the design of computing systems. On one hand, demand for computing is accelerating at phenomenal rates, fueled by the AI revolution and increasingly deep processing on massive data volumes. On the other hand, Moore’s Law is slowing down. This widening supply-demand gap is forcing us to revisit traditional assumptions around systems design. In this talk, we will discuss Google’s experience designing and deploying large-scale AI systems optimized for efficiency, reliability, and velocity. Building on these lessons, we will identify key challenges and opportunities for future innovation, highlighting how the next generation of energy-efficient cognitive computing will be AI-driven, vertically integrated, and uncomfortably exciting!

09:00 - 09:30
Invited Talk

Architecting Responsible AI: Efficient and Carbon-Aware Caching for AI and Its Security Implications

Sihang Liu, University of Waterloo link

While generative AI offers transformative potential, its rapid scaling has introduced significant challenges in sustainability and responsible deployment. This talk explores the role of caching in AI systems, including both performance and its broader implications for sustainability and security. First, I will present our recent work on improving the efficiency of video generation systems through caching techniques. I will then discuss our context caching system that explores the trade-off between operational carbon savings and the increased embodied carbon associated with storage. Finally, I will highlight our recent findings on novel security vulnerabilities arising from caching mechanisms used in image generation services.
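
The operational-versus-embodied carbon trade-off mentioned above can be framed as a break-even comparison. The sketch below is illustrative only: every constant (grid intensity, embodied-carbon rate per GB-hour, recompute energy) is a made-up placeholder, not a figure from the talk.

```python
def recompute_carbon_g(energy_kwh: float, grid_g_per_kwh: float) -> float:
    """Operational carbon of regenerating a cached artifact on demand."""
    return energy_kwh * grid_g_per_kwh

def storage_carbon_g(gb: float, hours: float, embodied_g_per_gb_hour: float) -> float:
    """Embodied carbon attributed to retaining the artifact in storage."""
    return gb * hours * embodied_g_per_gb_hour

def should_store(energy_kwh, gb, hours, grid=400.0, embodied=0.002):
    """Store iff retention is expected to emit less than recomputing."""
    return storage_carbon_g(gb, hours, embodied) < recompute_carbon_g(energy_kwh, grid)

# Hypothetical numbers: a 2 GB cached context costing 0.05 kWh to regenerate.
keep_for_a_day = should_store(energy_kwh=0.05, gb=2.0, hours=24)
```

The interesting cases are near the break-even point, where retention time and grid carbon intensity flip the decision.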

09:30 - 10:00
Invited Talk

Energy Consumption in AI Datacenters: Can We Address this Challenge?

Josep Torrellas, UIUC link

The global electricity demand from data centers, AI, and cryptocurrencies is expected to be around 800 TWh by the end of 2026. Currently, companies like Oracle are building 500 MW AI campuses. Training a single model takes 100+ MW over many months, with clusters of about 100K GPUs, and projected AI cluster sizes are expected to increase several times over in the next few years. Clearly, power will be the primary limiter to the growth of AI. Given this scenario, what can we, as researchers, do? In this talk, I will discuss the problem and suggest some techniques that we can use to try to mitigate it. Some of the ideas are being developed in the context of the ACE Center for Evolvable Computing, a center funded by SRC and DARPA for efficient distributed computing.
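
A back-of-envelope check of the scale implied by the abstract (assuming the 800 TWh figure refers to annual consumption):

```python
ANNUAL_TWH = 800.0
HOURS_PER_YEAR = 8760

avg_power_gw = ANNUAL_TWH * 1e12 / HOURS_PER_YEAR / 1e9   # TWh/year -> average GW
campuses_500mw = avg_power_gw * 1000 / 500                 # equivalent 500 MW campuses
```

That works out to roughly 91 GW of continuous draw, i.e. the equivalent of about 180 of the 500 MW campuses mentioned in the abstract.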

10:00 - 10:30
Break

Break

10:30 - 11:15
Paper Session #1

TBD

TBD

11:15 - 11:45
Invited Talk

Efficient and Scalable Agentic AI With Heterogeneous Systems

Zain Asgar, Gimlet AI link

11:45 - 12:15
Invited Talk

Private Neural Recommendation with Homomorphic Encryption and Orion

Brandon Reagen, NYU link

AI is great. It powers many of our favorite services and drives industries. However, today’s solutions pose a tradeoff between utility and privacy where receiving the highest quality service often requires disclosing private information. In this talk I will show how things don’t have to be this way. Fully Homomorphic Encryption (FHE) is a cryptographic method that enables computation directly on encrypted data, never disclosing sensitive inputs while still enabling access to high-quality services. I will then cover the challenges of using this technology, which include both extreme performance overhead and programming difficulty, and how our Orion framework addresses them. Finally, I will highlight our most recent work that demonstrates how FHE can be applied to neural recommendation. Time permitting, a demonstration will be given.
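
To illustrate the kind of structure homomorphic encryption builds on, the toy Paillier scheme below demonstrates additive homomorphism: multiplying ciphertexts adds the underlying plaintexts. It uses insecurely small demo primes and is for exposition only; real FHE schemes like those targeted by Orion are far more involved.

```python
from math import gcd
import random

# Toy Paillier keypair (insecure demo primes; do not use in practice).
p, q = 61, 53
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # lcm(p-1, q-1)
g = n + 1
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)      # inverse of L(g^lam mod n^2) mod n

def encrypt(m: int) -> int:
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Homomorphic addition: multiplying ciphertexts adds plaintexts (mod n).
c = (encrypt(3) * encrypt(4)) % n2
```

Here `decrypt(c)` recovers 3 + 4 without either addend ever being decrypted individually, which is the core property privacy-preserving inference exploits.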

12:15 - 13:30
Lunch Break

Lunch Break

13:30 - 13:45
Invited Talk

Sponsor Talk

Max Sbabo, d-Matrix

14:15 - 14:45
Invited Talk

Making the Accuracy-Efficiency Trade-Off in Agentic Systems

Esha Choukse, Microsoft link

For decades, systems researchers have balanced competing objectives: latency versus throughput, performance versus power, and speed versus cost. Traditionally, accuracy was a fixed requirement—non-negotiable and external to the systems equation. But as modern computing increasingly hosts AI-driven workloads, accuracy itself has become a tunable system variable. In this talk, I’ll argue that the next frontier in systems design lies in treating accuracy as a first-class performance knob, to be traded for efficiency in a principled way. Drawing from our recent work, I will show how software–hardware co-design can explicitly expose, quantify, and manage the accuracy-efficiency tradeoff.
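
One simple reading of "accuracy as a first-class performance knob" is variant selection under a resource budget. The sketch below is illustrative; the variant table and its accuracy/energy numbers are invented, not results from the talk.

```python
# Hypothetical (accuracy, joules-per-request) profiles for model variants.
VARIANTS = {
    "fp16":      (0.842, 9.0),
    "int8":      (0.838, 5.5),
    "int4":      (0.815, 3.2),
    "distilled": (0.790, 1.8),
}

def pick_variant(energy_budget_j: float) -> str:
    """Most accurate variant whose per-request energy fits the budget."""
    feasible = {k: v for k, v in VARIANTS.items() if v[1] <= energy_budget_j}
    if not feasible:
        raise ValueError("no variant fits the budget")
    return max(feasible, key=lambda k: feasible[k][0])

choice = pick_variant(6.0)   # under 6 J/request the int8 variant wins
```

A scheduler with such profiles can trade accuracy for efficiency explicitly, rather than treating accuracy as fixed and external.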

14:45 - 15:30
Paper Session #2

TBD

TBD

15:30 - 16:00
Break

Break / Competition Demos

16:00 - 17:00
Panel

Architecting an energy-first stack for the AI age

As the cumulative demand for AI computing continues to grow, new bottlenecks in technology deployment are emerging. These range from compute capacity to memory supply to cluster reliability to simply finding enough power to supply the infrastructure. Since these aren’t easily solved with brute force addition of more chips, potential solutions encompass new model architectures, new computing architectures, high performance cooling designs, robust multi-device software stacks, and even orbital datacenters. In this panel, we will aim to identify the most pressing requirements for the computing stack of coming years. We will discuss both the promising technologies and the hypes to define a blueprint of the computing architecture where energy is the currency of success.

Moderator: Satyam Srivastava, d-Matrix

Panelists:
  • Nishil Talati, UIUC
  • Esha Choukse, Microsoft
  • Josep Torrellas, UIUC

17:00 - 17:30
Invited Talk

Agile and evolvable software construction in the era of rapidly evolving hardware accelerator designs

Charith Mendis, UIUC link

Modern AI workloads have become exceedingly abundant and important in the current computing landscape. As a result, there have been numerous software and hardware innovations aimed at accelerating these workloads. However, we observe a subtle disconnect between the software and hardware communities. Most software innovations target well-established hardware platforms such as CPUs (e.g., x86, ARM) and GPUs (e.g., NVIDIA GPUs), while hardware innovations produce plenty of other tensor accelerator designs (e.g., Gemmini, Feather, Trainium) each year.

We asked the question: why isn't the software community using these accelerators, or even evaluating on them? The simple yet undeniable reason is the lack of standardized software tooling compared to CPUs and GPUs. For an architecture to be used, properly designed compiler backends and correctness and performance testing tools must be abundant (e.g., the CUDA ecosystem).

In this talk, I will describe how we bridge this gap by automatically generating the necessary software tools for a large class of accelerators through the Accelerator Compiler Toolkit (ACT) ecosystem. Central to ACT is an ISA definition language, TAIDL, that for the first time standardizes the hardware-software interfaces for a large class of accelerators. Departing from the traditional approach of manually constructing test oracles, performance models, or retargetable compiler backends, we instead introduce agile and evolvable methodologies to automatically generate such necessary tooling using both formal methods and machine learning techniques for any TAIDL-defined accelerator interface. I will show how such automation enables rapid software prototyping, making rapidly evolving accelerator designs usable by the software community.

18:00 - 18:05
Close

Closing Remarks

Satyam Srivastava, d-Matrix

18:30 - 20:30
Social

EMC2 Social