MIT AI Hardware Program
2025 Symposium
Monday, March 31, 2025 | 10:00 AM - 3:30 PM ET
MIT Schwarzman College of Computing
51 Vassar St, 8th Floor
Cambridge, MA 02139

About
The MIT AI Hardware Program is an academia-industry initiative between the MIT School of Engineering and MIT Schwarzman College of Computing. We work with industry to define and bootstrap the development of translational technologies in hardware and software for the AI and quantum age.
Our annual symposium included a keynote talk from Professor Elsa Olivetti, reviews of the current project portfolio, presentations on new projects, and a networking reception featuring a poster session and interactive demos.

Register
Registration for this event is now closed.
Contact program manager Emily Goldman (ediamond@mit.edu) with questions.

Agenda
9:30 – 10:00
Registration and Breakfast
10:00 – 10:10
Year in Review & the Year Ahead
Program Co-Leads
Jesús del Alamo, Donner Professor; Professor, Electrical Engineering and Computer Science; MacVicar Faculty Fellow
Aude Oliva, Director of Strategic Industry Engagement, MIT Schwarzman College of Computing; CSAIL Senior Research Scientist
10:10 – 11:40
Project Reviews
Updates on the current research portfolio of the MIT AI Hardware Program.
Increasing Architectural Resilience to Small Delay Faults
Peter Deutsch and Vincent Ulitzsch, PhD Candidates, Electrical Engineering and Computer Science
This research creates models of new fault modes in processors, addressing reliability challenges in large-scale data centers, and develops methods for designing resilient hardware that guide cost-effective protection strategies at scale.
In collaboration with Mengjia Yan, Assistant Professor of Electrical Engineering and Computer Science, and Joel S. Emer, Professor of the Practice, Electrical Engineering and Computer Science
Wafer-Scale 2D Transition Metal Dichalcogenides for Neuromorphic Applications
Jiadi Zhu, Research Affiliate, MIT Research Laboratory of Electronics
This project aims to explore the use of two-dimensional transition metal dichalcogenides (TMDs), such as MoS2 and WSe2, as neuromorphic devices. We will leverage the extremely low leakage current of wide-bandgap TMD materials to develop floating-gate field-effect transistors (FGFETs), where changes in the charge stored at the floating gate alter the MoS2 channel conductance. Floating-gate structures will first be simulated and then experimentally demonstrated on TMDs grown by metal-organic chemical vapor deposition (MOCVD) at back-end-of-line (BEOL)-compatible temperatures and integrated on a standard silicon CMOS process. For this, we will build on the MoS2 low-temperature 200 mm wafer-scale MOCVD growth technology recently demonstrated by our group, and we will fabricate highly scaled heterostructure-based devices to ensure reproducible neuromorphic devices with record-low power consumption. We will also explore the impact of different processing steps, such as lithography and deposition conditions, on device performance and stability.
In collaboration with Tomás Palacios, Clarence J. Lebel Professor in Electrical Engineering, Electrical Engineering and Computer Science; Director, Microsystems Technology Laboratories, and Jing Kong, Jerry McAfee (1940) Professor in Engineering, Electrical Engineering and Computer Science
CIRCUIT: A Benchmark for Circuit Interpretation and Reasoning Capabilities of LLMs
Yan Xu, PhD Candidate, Electrical Engineering and Computer Science
The role of Large Language Models (LLMs) has not been extensively explored in analog circuit design, which could benefit from a reasoning-based approach that transcends traditional optimization techniques. In particular, despite their growing relevance, there are no benchmarks to assess LLMs’ reasoning capability about circuits. We therefore created the CIRCUIT dataset, consisting of 510 question-answer pairs spanning various levels of analog-circuit-related subjects. The best-performing model on our dataset, GPT-4o, achieves 48.04% accuracy when evaluated on the final numerical answer. To evaluate the robustness of LLMs on our dataset, we introduced a unique feature that enables unit-test-like evaluation by grouping questions into unit tests. In this setting, GPT-4o passes only 27.45% of the unit tests, highlighting that even the most advanced LLMs still struggle to understand circuits, which requires multi-level reasoning, particularly when circuit topologies are involved. This circuit-specific benchmark highlights LLMs’ limitations, offering valuable insights for advancing their application in analog integrated circuit design.
In collaboration with Ruonan Han, Associate Professor, Electrical Engineering and Computer Science
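For readers unfamiliar with the unit-test framing, the sketch below illustrates the grading scheme the abstract describes: a model passes a unit test only if it answers every question grouped under that test correctly. The record layout, tolerance, and function names are our illustrative assumptions, not the CIRCUIT evaluation code.

    from collections import defaultdict

    def grade(records, tol=1e-2):
        # records: iterable of (unit_test_id, model_answer, reference_answer)
        groups = defaultdict(list)
        for test_id, pred, ref in records:
            # a numerical answer counts as correct within a relative tolerance
            ok = abs(pred - ref) <= tol * max(abs(ref), 1e-12)
            groups[test_id].append(ok)
        question_acc = sum(ok for g in groups.values() for ok in g) / sum(
            len(g) for g in groups.values())
        unit_test_pass = sum(all(g) for g in groups.values()) / len(groups)
        return question_acc, unit_test_pass

    # per-question accuracy can be high while the unit-test pass rate stays low
    records = [("rc_filter", 0.50, 0.50), ("rc_filter", 2.00, 1.00), ("op_amp", 10.0, 10.0)]
    print(grade(records))  # -> (0.666..., 0.5)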
Efficient Large Language Models and Generative AI
Song Han, Associate Professor, Electrical Engineering and Computer Science
The rapid advancement of generative AI, particularly large language models (LLMs), presents unprecedented computational challenges. The autoregressive nature of LLMs makes inference memory-bound, and generating long sequences further compounds the memory demand. Our research addresses these challenges through quantization (SmoothQuant, AWQ, SVDQuant) and KV cache optimization (StreamingLLM, QUEST, DuoAttention). We then present two efficient model architectures: HART, a hybrid autoregressive transformer for efficient visual generation, and VILA-U, a unified foundation model that seamlessly integrates visual understanding and generation in a single model.
A 14-nm Energy-Efficient and Reconfigurable Analog Current-Domain In-Memory Compute SRAM Accelerator
Aya Amer, Postdoctoral Associate, Research Laboratory of Electronics
This work presents a low-power reconfigurable 12T-SRAM current-domain analog in-memory computing (IMC) SRAM macro design that addresses non-linearities, process variations, and limited throughput. The proposed design features a time-domain subthreshold multiply-and-accumulate (MAC) operation with a differential output-current sensing technique, and its reconfigurable current-controlled design supports different precisions and speeds. A 1-kbit macro prototyped in a 14-nm CMOS process achieves a measured bitwise energy efficiency of 580 TOPS/W while obtaining highly linear MAC operations, the highest energy efficiency reported for current-domain IMC methods. In addition, simulation results and estimates based on block and 1-kbit macro measurements show that increasing the macro size to 16 kbit can achieve 2128 TOPS/W, which is comparable to charge-domain computing methods. Finally, a fully analog MLP classifier for voice-activity detection (VAD) is prototyped with 3 cascaded analog IMC macros, achieving ~90% classification accuracy at 5 dB SNR while consuming 0.58 nJ/classification.
In collaboration with Anantha Chandrakasan, Dean of the School of Engineering and Vannevar Bush Professor of Electrical Engineering and Computer Science
Photonics for AI | AI for Photonics
Dirk Englund, Professor, Electrical Engineering and Computer Science
The hardware limitations of conventional electronics in deep neural network (DNN) applications have spurred explorations into alternative architectures, including architectures using optical- and/or quantum-domain signal-processing subroutines. This work investigates the scalability and performance metrics (throughput, energy consumption, and latency) of various such architectures, with a focus on recently developed hardware error-correction techniques, in-situ training methods, initial field trials, and extensions to DNN-based inference on quantum signals with reversible, quantum-coherent resources.
11:40 – 12:00
UROP (Undergraduate Research Opportunities Program) Pitches
Project pitches from undergraduate students funded by the MIT AI Hardware Program.
PERE-Chains: AI-Supported Discovery of Privilege Escalation and Remote Exploit Chains
Cristián Colón, Undergraduate, Electrical Engineering and Computer Science
We’re developing PERE-Chains, a new tool that helps discover vulnerabilities in computer networks. It focuses on finding “exploit chains,” which attackers often use to escalate privileges or gain control of multiple computers remotely. PERE-Chains leverages LLMs and AI planning to quickly identify these chains. By pinpointing vulnerabilities that attackers might exploit, our method allows network security teams to prioritize fixes effectively. This not only simplifies security management but also significantly improves network protection by clearly showing which vulnerabilities should be addressed first.
In collaboration with Una-May O’Reilly, Principal Research Scientist, Computer Science & Artificial Intelligence Lab
Computing with Heat
Caio Silva, Undergraduate, Physics
Heat is often regarded as a waste byproduct of physical processes, something to be minimized and dissipated. However, by carefully designing thermal devices, such as metal alloys in specific shapes and structures, it is possible to control heat flow in ways that enable novel applications across various fields. One such application is Computing with Heat, where temperature and thermal currents serve as carriers of information for performing computational operations. Using topology optimization and differentiable programming, we have developed inverse-designed 2D metal metastructures capable of receiving temperature inputs and executing matrix multiplications through heat conduction. This work lays the foundation for leveraging thermal transport as a computational medium, opening possibilities for energy-efficient analog computing.
In collaboration with Giuseppe Romano, Research Scientist, Institute for Soldier Nanotechnologies
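To make the idea concrete, here is a toy steady-state conduction model (our assumption-laden sketch, not the authors’ inverse-design pipeline): in a linear thermal network, each output port’s temperature is a fixed weighted combination of the input temperatures, so a designed structure realizes a fixed matrix-vector product.

    import numpy as np

    # Toy model: conductance G[j, i] couples input port i to floating output
    # port j. Energy balance at output j, sum_i G[j, i] * (T_in[i] - T_out[j]) = 0,
    # gives T_out = (G @ T_in) / G.sum(axis=1): a fixed linear map of the inputs.
    G = np.array([[3.0, 1.0, 0.5],
                  [0.5, 2.0, 2.5]])

    def thermal_matvec(G, T_in):
        return G @ T_in / G.sum(axis=1)

    T_in = np.array([300.0, 310.0, 305.0])  # input temperatures (kelvin)
    print(thermal_matvec(G, T_in))          # two output-port temperatures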
Simulation of Optical Phase Change Modulator for Analog Photonic Applications
Anthony Donegan, Undergraduate
Thermal and optical simulations of optical phase-change material (PCM) modulator geometries for a neuromorphic photonic chip were performed using Lumerical and COMSOL. Optical simulations identified the ideal thickness and length of the PCM modulator and verified reasonable attenuation through the device. Thermal simulations identified the device’s operating parameters. Device fabrication and experimental verification of the results are currently underway.
In collaboration with Juejun Hu, Professor, Materials Science and Engineering
DEXO: Hand Exoskeleton System for Teaching Robot Dexterous Manipulation In-The-Wild
Juan Alvarez, Undergraduate, Aeronautics and Astronautics
We introduce DEXO, a novel hand exoskeleton system designed to teach robots dexterous manipulation in-the-wild. Unlike traditional teleoperation systems, which are limited by the lack of haptic feedback and scalability, DEXO enables natural and intuitive control through kinematic mirroring and force transparency. The system’s passive exoskeleton design allows human users to directly control a robot’s dexterous hand, transmitting precise motion and force data for learning complex tasks in real-world environments. Equipped with integrated tactile sensors, DEXO captures high-fidelity interaction data, facilitating manipulation learning without the need for costly hardware or careful engineering. We evaluate the system across multiple dexterous tasks, demonstrating its capability to replicate human-level manipulation and its potential to scale the collection of high-quality demonstration data for training advanced robot learning models. Our experiments show significant improvements in task success rates compared to existing teleoperation methods, making DEXO a powerful tool for advancing robot dexterity.
In collaboration with Pulkit Agrawal, Associate Professor, Electrical Engineering & Computer Science
Ferroelectric Memory Devices for AI Hardware
Tyra Espedal, Undergraduate, Physics
Ferroelectric (FE) memory based on CMOS-compatible Hf0.5Zr0.5O2 (HZO) has emerged as a promising non-volatile memory (NVM) technology for AI hardware due to its potential for low-voltage and fast switching, long data retention, and high memory endurance. In this work, we systematically investigate the wake-up behavior of TiN- and W-based FE-HZO capacitors under repeated triangular sweeps at frequencies ranging from 1.4 Hz to 1 MHz. We find that wake-up is more effective with slow triangular-sweep cycling. High-frequency cycling, on the other hand, limits the wake-up effect as a result of domain pinning through high-voltage-induced defect generation.
In collaboration with Jesús del Alamo, Donner Professor; Professor, Electrical Engineering and Computer Science; MacVicar Faculty Fellow
12:00 – 1:00
Lunch & Networking
1:00 – 1:30
Keynote
The Climate and Sustainability Implications of Generative AI
Elsa Olivetti, Professor, Materials Science and Engineering
Generative AI’s meteoric rise and explosive data center growth offer a unique opportunity to pioneer sustainable, strategic AI deployment coupled with leadership in energy infrastructure modernization and decarbonization. Given the scale of the challenge, meeting unprecedented demand must be done with a mission-driven, holistic, and collaborative outlook. To address this challenge, MIT is linking research on energy supply and compute demand, integrating efforts across the entire computing lifecycle, from chip design to workflow management to data center architecture to building footprint to power generation, with performance tradeoffs and replacement cycles in mind, including how sustainable AI can drive broader societal decarbonization.
1:30 – 2:30
Highlights: Prospective New Projects
Presentations covering new research projects on AI and hardware.
Hardware-efficient Neural Architectures for Language Modeling
Lucas Torroba Hennigen, PhD Candidate, Computer Science and Artificial Intelligence Laboratory
The Transformer architecture has proven effective for modeling many structured domains, including language, images, proteins, and more. A Transformer processes an input sequence through a series of “Transformer blocks,” each of which consists of an attention layer followed by a feed-forward network (FFN) layer. Both types of layers employ highly parallelizable matrix operations, and thus Transformers can take advantage of specialized hardware such as GPUs. However, the complexity of the attention layer is quadratic in sequence length, and that of the FFN layer is quadratic in hidden-state size; making Transformers more efficient thus requires alternatives to these fundamental primitives. This proposal will develop efficient variants of attention and FFN layers that (1) enable scaling to longer sequences and larger hidden states, and (2) can make existing models more efficient to deploy in resource-constrained environments. We will couple these layers with hardware-efficient implementations that take advantage of the device on which the models will be trained and deployed.
In collaboration with Yoon Kim, Assistant Professor, Electrical Engineering and Computer Science
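The scaling argument can be made concrete with back-of-the-envelope FLOP counts (our rough constants; softmax and normalization costs are omitted): attention cost grows quadratically in sequence length n, while FFN cost grows quadratically in hidden size d.

    def attention_flops(n, d):
        # QKV + output projections (~8*n*d^2) plus score and value matmuls (~4*n^2*d)
        return 8 * n * d * d + 4 * n * n * d

    def ffn_flops(n, d, expansion=4):
        # up- and down-projection matmuls, each ~2*n*d*(expansion*d)
        return 2 * 2 * n * d * expansion * d

    d = 4096
    for n in (1_024, 8_192, 65_536):
        print(f"n={n:>6}: attention {attention_flops(n, d):.2e} FLOPs, "
              f"FFN {ffn_flops(n, d):.2e} FLOPs")
    # at small n the FFN dominates; at large n the n^2 attention term takes over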
Ferroelectric AI Hardware: Overcoming Conventional Paradigms and Scalability Limits
Suraj Cheema, Assistant Professor, Materials Science and Engineering & Electrical Engineering and Computer Science
In-memory computing (IMC) paradigms composed of two-terminal memristor-based crossbar arrays of nonvolatile memory elements have emerged as a promising solution to the growing demand for data-intensive computing and its exponentially rising energy consumption. However, these solutions suffer from poor array scalability due to a lack of self-rectifying behavior, resulting in sneak-path issues and the need for additional selector devices. Furthermore, the best-performing memristors are often based on emerging materials (2D van der Waals, complex oxides, electrolyte-based) that are not yet compatible with complementary metal-oxide-semiconductor (CMOS) and very large-scale integration (VLSI) processes, impeding high-density array integration. Here, we demonstrate the experimental realization of a self-rectifying memristor combining the ideal switching and rectification behavior of tunnel junctions and diodes, respectively (a hybrid ferroelectric-ionic tunnel diode, HTD), in a scalable fabrication flow using the CMOS-compatible materials and VLSI processes employed in modern microelectronics. From a materials perspective, we harness the collective (ferroelectric-antiferroelectric polymorphism) and defective (ionic) switching character of HfO2-ZrO2 (HZO) to synergistically enhance both its electroresistance and its rectifying behavior. From a device perspective, we leverage the conformal growth capability of atomic layer deposition (ALD) to integrate three-dimensional (3D) HTD structures, improving both array density and electrostatic control and yielding record-high on/off and rectification ratios across all two-terminal paradigms. From an array perspective, the enhanced self-rectifying behavior leads to the highest array scalability and storage capacity reported for any memristive system. Overall, the unprecedented memristive performance, achieved with the same materials and processes used in modern microelectronics, not only positions the HTD as an ideal hardware building block for future 3D IMC platforms, but also highlights the potential of engineering breakthrough properties in conventional CMOS materials to accelerate the “lab-to-fab” translation of novel functional devices.
Analog Computing with Inverse-Designed Metastructures
Giuseppe Romano, Research Scientist, Institute for Soldier Nanotechnologies
The increasing demand for AI is spurring the development of innovative computing platforms, which often employ single analog complex operations rather than traditional Boolean logic. We propose inverse-designed metastructures that perform matrix-vector multiplications, leveraging heat as the signal carrier. Furthermore, we investigate optimized spatiotemporal-modulated structures for processing time-dependent signals.
Magnetic Tunnel Junction for Stochastic Neuromorphic Computing
Luqiao Liu, Associate Professor, Electrical Engineering and Computer Science
The rapidly growing demand for more efficient artificial intelligence (AI) hardware accelerators remains a pressing challenge. Crossbar arrays have been widely proposed as a promising in-memory computing architecture, but conventional nonvolatile-memory-based crossbar arrays inherently require a large number of analog-to-digital converters (ADCs), leading to significant area and energy inefficiencies. Here, we demonstrate three-terminal stochastic magnetic tunnel junctions (sMTJs) operated by spin-orbit torque (SOT) as novel interfacial components between the analog and digital domains for next-generation AI accelerators. By harnessing the intrinsic analog-current-to-digital-signal conversion of sMTJs, we replace conventional bulky, energy-hungry, and slow ADCs with compact, low-power, and rapid stochastic current digitizers. Furthermore, a partial-sum approach is introduced to break down large matrix operations, optimizing computational efficiency and achieving high accuracy on the MNIST handwritten digit dataset. This work paves the way for future AI hardware designs that leverage device-level innovations to overcome the limitations of current in-memory computing systems.
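As a conceptual illustration only (the sigmoid switching model, chunk size, and sampling scheme are our assumptions, not measured device behavior), the sketch below digitizes each analog partial-sum current by sampling a stochastic bit many times and inverting its mean:

    import numpy as np

    rng = np.random.default_rng(0)

    def smtj_digitize(current, samples=512, beta=1.0):
        # assumed device model: switching probability is a sigmoid of the current
        p = 1.0 / (1.0 + np.exp(-beta * current))
        bits = rng.random(samples) < p                    # stochastic binary outputs
        p_hat = np.clip(bits.mean(), 1 / samples, 1 - 1 / samples)
        return np.log(p_hat / (1 - p_hat)) / beta         # invert the sigmoid

    def dot_with_partial_sums(w, x, chunk=8):
        # split the row into chunks so each analog partial sum stays in range
        return sum(smtj_digitize(float(w[i:i + chunk] @ x[i:i + chunk]))
                   for i in range(0, len(w), chunk))

    w, x = 0.1 * rng.normal(size=64), rng.normal(size=64)
    print(dot_with_partial_sums(w, x), float(w @ x))      # noisy estimate vs exact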
Compressing (in) the Wild: Continual Fine-tuning of Autoencoders for Camera Trap Image Compression
Timm Haucke, PhD Candidate, Electrical Engineering and Computer Science
This talk introduces an on-device fine-tuning strategy for autoencoder-based image compression, designed to achieve high compression ratios for camera trap images. Camera traps are an essential tool in ecology but are often deployed in remote areas and are therefore frequently bandwidth-constrained. We exploit the fact that camera traps are static cameras, which results in high temporal redundancy in the image background, and fine-tune an autoencoder-based compression model to specific sites, thereby achieving higher compression ratios than general-purpose compression models.
In collaboration with Sara Beery, Assistant Professor, Electrical Engineering and Computer Science
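A minimal PyTorch-style sketch of the site-specific fine-tuning loop (our own illustration with a toy model and random stand-in frames; a real learned-compression model would also optimize a rate term, omitted here):

    import torch
    import torch.nn as nn

    class TinyAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 8, 4, stride=2, padding=1))
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            return self.decoder(self.encoder(x))

    def finetune_on_site(model, site_frames, steps=100, lr=1e-4):
        # further train a (pre-trained) model on one static camera's frames so
        # it spends fewer bits on that site's largely unchanging background
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            batch = site_frames[torch.randint(len(site_frames), (8,))]
            loss = nn.functional.mse_loss(model(batch), batch)  # distortion only
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model

    site_frames = torch.rand(32, 3, 64, 64)  # stand-in for one camera's images
    finetune_on_site(TinyAutoencoder(), site_frames)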
Declarative Optimization for AI Workloads
Michael Cafarella, Research Scientist, Computer Science and Artificial Intelligence Laboratory (CSAIL)
Today’s AI engineer must make a huge number of narrow technical decisions: which models best suit which problems, which prompting methods to use, which test-time compute patterns to employ, whether to substitute conventional code for “easy” problems, and so on. These decisions are crucial for good performance, quality, and cost but are tedious and time-consuming to make. Moreover, they must be revisited when new models are released or existing prices are changed.
We propose a declarative AI programming framework that automatically optimizes the program on the user’s behalf. Like a relational database, it can marshal a wide range of optimization strategies, with the goal of making AI programs that are as fast, inexpensive, and high-quality as possible. Our prototype Palimpzest system can currently obtain AI programs that are up to 3.3x faster and 2.9x cheaper than a baseline method, while achieving equal or greater quality. Palimpzest is open source and is designed to integrate new optimization methods that should permit even greater future gains.
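To illustrate the declarative idea in miniature (a purely hypothetical sketch, not Palimpzest’s actual API; all plan names, costs, and quality numbers are invented): the user states a quality target, and an optimizer picks the cheapest candidate plan that meets it, much as a relational database chooses among query plans.

    from dataclasses import dataclass

    @dataclass
    class Plan:
        model: str
        strategy: str
        est_cost: float      # assumed dollars per 1k records
        est_quality: float   # assumed offline-estimated quality score

    CANDIDATES = [
        Plan("small-llm", "zero-shot", est_cost=0.02, est_quality=0.78),
        Plan("small-llm", "few-shot",  est_cost=0.05, est_quality=0.85),
        Plan("large-llm", "zero-shot", est_cost=0.40, est_quality=0.93),
    ]

    def optimize(min_quality):
        # cost-based plan selection subject to a declared quality constraint
        feasible = [p for p in CANDIDATES if p.est_quality >= min_quality]
        return min(feasible, key=lambda p: p.est_cost) if feasible else None

    print(optimize(min_quality=0.8))  # -> cheapest plan meeting the target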
2:30 – 3:30
Research Showcase
This in-person session will feature posters, videos, and demos. Refreshments will be provided.
A Unified Framework for Sparse Plus Low-Rank Matrix Decomposition for LLMs
Mehdi Makni, PhD Candidate, MIT Operations Research Center
The impressive capabilities of large foundation models come at the cost of substantial computing resources to serve them. Compressing these pre-trained models is of practical interest, as it can democratize their deployment across the machine learning community by lowering the costs associated with inference. A promising compression scheme is to decompose a foundation model’s dense weights into a sum of sparse plus low-rank matrices. In this work, we design a unified framework, coined HASSLE-free, for (semi-structured) sparse plus low-rank matrix decomposition of foundation models.
We introduce the local layer-wise reconstruction error objective for this decomposition and demonstrate that prior work solves an approximation of this optimization problem. We provide efficient and scalable methods to obtain good solutions to the exact optimization program.
HASSLE-free substantially outperforms state-of-the-art methods in terms of layer-wise reconstruction error and across a wide range of LLM evaluation benchmarks. For the Llama3-8B model with a 2:4 sparsity component plus a rank-64 component, a compression scheme for which recent work shows impressive inference acceleration on GPUs, our method reduces test perplexity on the WikiText-2 dataset by 18% and narrows the gap to the dense model, averaged over eight popular zero-shot tasks, by 28% compared to existing methods.
In collaboration with Rahul Mazumder, Associate Professor, Sloan School of Management
Additional Authors: Kayhan Behdin, Zheng Xu, Natalia Ponomareva
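A simplified sketch of the decomposition (our toy alternating scheme with a plain Frobenius objective, not the HASSLE-free algorithm, which optimizes the layer-wise reconstruction error described above): alternately fit a 2:4 sparse component on the residual and a rank-r component by truncated SVD.

    import numpy as np

    def two_to_four_sparsify(M):
        # keep the 2 largest-magnitude entries in each contiguous group of 4
        flat = M.reshape(-1, 4)
        mask = np.zeros_like(flat, dtype=bool)
        idx = np.argsort(-np.abs(flat), axis=1)[:, :2]
        np.put_along_axis(mask, idx, True, axis=1)
        return (flat * mask).reshape(M.shape)

    def sparse_plus_low_rank(W, rank=64, iters=10):
        L = np.zeros_like(W)
        for _ in range(iters):
            S = two_to_four_sparsify(W - L)              # sparse step on the residual
            U, s, Vt = np.linalg.svd(W - S, full_matrices=False)
            L = (U[:, :rank] * s[:rank]) @ Vt[:rank]     # low-rank step on the residual
        return S, L

    W = np.random.randn(128, 256)
    S, L = sparse_plus_low_rank(W)
    print(np.linalg.norm(W - S - L) / np.linalg.norm(W))  # relative residual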
Hardware-Aware Algorithms for Large Neural Network Compression
Xiang Meng and Ryan Lucas, PhD Candidates, MIT Operations Research Center
Despite excellent accuracy, most modern neural networks contain an enormous number of parameters (e.g., LLaMA3 with up to 70B parameters), creating significant storage and deployment challenges. We present novel one-shot pruning techniques that compress networks by identifying and removing redundant weights using a small amount of calibration data, without retraining, while preserving model accuracy. We formalize network compression within an optimization framework and propose advanced optimization algorithms to solve it. Our approach scales to LLMs and large-scale vision networks and achieves state-of-the-art compression-accuracy trade-offs. Additionally, we support hardware-friendly sparsity patterns, including 2:4 sparsity and FLOPs-aware pruning, which enable acceleration on specialized hardware platforms. Our experiments on the LLaMA3-8B model at 70% sparsity demonstrate a 29% reduction in test perplexity compared to previous approaches, and we can prune ViT-B/16 (one of the largest Vision Transformer networks) with minimal loss in test accuracy on benchmark computer vision datasets.
In collaboration with Rahul Mazumder, Associate Professor, Sloan School of Management
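The one-shot, layer-wise setting can be sketched as follows (a toy magnitude-masking baseline of our own, not the authors’ algorithms): prune a layer with a magnitude mask, then refit the surviving weights against a few calibration activations, with no end-to-end retraining.

    import numpy as np

    def one_shot_prune(W, X, sparsity=0.7):
        # magnitude mask: zero out the smallest 70% of weights
        mask = np.abs(W) >= np.quantile(np.abs(W), sparsity)
        W_pruned = np.zeros_like(W)
        G = X.T @ X + 1e-4 * np.eye(W.shape[0])   # calibration statistics (ridge)
        for j in range(W.shape[1]):
            keep = mask[:, j]
            if keep.any():
                # refit kept weights to minimize ||X W_j - X_keep w||^2
                A = G[np.ix_(keep, keep)]
                b = (X.T @ (X @ W[:, j]))[keep]
                W_pruned[keep, j] = np.linalg.solve(A, b)
        return W_pruned

    X = np.random.randn(64, 32)    # a few calibration activations
    W = np.random.randn(32, 16)
    W_hat = one_shot_prune(W, X)
    err = np.linalg.norm(X @ W - X @ W_hat) / np.linalg.norm(X @ W)
    print(f"relative reconstruction error: {err:.3f}")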
Extracting Cellular Automaton Rules from Biological Systems with Generative Pretrained Transformers
Jaime Berkovich, PhD Candidate, Materials Science and Engineering
AI-accelerated modeling of nonlinear natural systems, ranging from biological dynamics to weather patterns, has historically relied on large empirical datasets for training, which incur high monetary and temporal costs. To remedy this problem, we look toward cellular automata (CA), discrete, grid-based computational models in which simple, local rules give rise to intricate global behaviors, as key mathematical frameworks for exploring the emergence of complex dynamics. For decades, CA have found applications ranging from traffic modeling and ecological systems to tissue development, fluid dynamics, and crystal growth. Despite their potential, the formalization of CA for quantitative, physics-aware, predictive modeling has been hindered by the challenge of deriving rules that precisely capture the underlying dynamics of a given system. Inspired by recent advances in attention-based models, we present AutomataGPT, a decoder-only generative pretrained transformer trained on extensive datasets of CA systems. We show that by learning from larger sections of ‘rule space,’ AutomataGPT achieves higher accuracy in orbit forecasting and rule inference for two-dimensional binary deterministic CA on toroidal grids governed by previously unseen rules. AutomataGPT demonstrates significant generalization capabilities, addressing challenges in both forward prediction and inverse rule extraction. These findings lend credence to the possibility of abstracting real-world dynamical systems as CA models, enabling advances across disciplines such as biology, tissue engineering, physics, and more.
In collaboration with Markus Buehler, McAfee Professor of Engineering, Civil and Environmental Engineering
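For context, here is one step of the kind of system AutomataGPT models: a two-dimensional binary outer-totalistic CA on a toroidal grid (Game of Life rules shown as an example; the rule representation is our simplification of the general setting).

    import numpy as np

    def ca_step(grid, born={3}, survive={2, 3}):
        # toroidal neighbour count via periodic shifts in all 8 directions
        n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0))
        return np.where(grid == 1,
                        np.isin(n, list(survive)),
                        np.isin(n, list(born))).astype(np.uint8)

    rng = np.random.default_rng(1)
    grid = rng.integers(0, 2, size=(16, 16), dtype=np.uint8)
    orbit = [grid]
    for _ in range(4):                 # a short orbit, as used in forecasting
        orbit.append(ca_step(orbit[-1]))
    print(orbit[-1].sum(), "live cells after 4 steps")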
How Much Potential is There for Further GPU Progress?
Emanuele Del Sozzo, Research Scientist, Computer Science & Artificial Intelligence Lab
Graphics Processing Units (GPUs) represent the state-of-the-art architecture for a variety of tasks, ranging from rendering 2D/3D graphics to accelerating workloads in supercomputing centers and, of course, training and running Artificial Intelligence (AI) models. As GPUs continue to evolve to keep pace with the ever-increasing demand for performance, charting the past and current progress of these architectures becomes paramount to determining what to expect in the near future for fields such as AI, which relies heavily on this technology and sits at the center of global competition. Given this context, our project examines NVIDIA GPU advancements, focusing on computing, memory, price, and power trends, and assesses U.S. export controls on AI chips. Our ultimate goal is to explore GPU characteristics to unveil their remaining potential and guide forthcoming investments in high-performance technologies.
In collaboration with Neil Thompson, Research Scientist, Computer Science & Artificial Intelligence Lab
In-Memory Sparsity: Enabling Efficient Unstructured Element-Wise Sparse Training from the Bottom Up
Hongkai Ning, Postdoctoral Associate, Research Laboratory of Electronics
The fine-grained dynamic sparsity of biological synapses is an important element of the energy efficiency of the human brain. Emulating such sparsity in an artificial system requires off-chip memory indexing, which carries considerable energy and latency overhead. Here, we report an in-memory sparsity architecture in which index memory is moved next to individual synapses, creating a sparse neural network without external memory indexing. We use a compact building block consisting of two non-volatile ferroelectric field-effect transistors, one acting as a digital sparsity bit and the other as an analogue weight. The network is formulated as the Hadamard product of the sparsity and weight matrices, and the hardware, comprising 900 ferroelectric field-effect transistors, is based on wafer-scale chemical-vapour-deposited molybdenum disulfide integrated through back-end-of-line processes. With this system, we demonstrate key synaptic processes, including pruning, weight update, and regrowth, in an unstructured and fine-grained manner. We also develop a vectorial approximate update algorithm and optimize training scheduling. Through this software-hardware co-optimization, we achieve 98.4% accuracy on an EMNIST letter recognition task at 75% sparsity. Simulations on large neural networks show a 10-fold reduction in latency and a 9-fold reduction in energy consumption compared with a dense network of the same performance.
In collaboration with Suraj Cheema, Assistant Professor, Materials Science and Engineering & Electrical Engineering and Computer Science
Additional Authors: Hengdi Wen, Yuan Meng, Xinran Wang
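A schematic of the formulation described above (our own NumPy illustration; the actual system realizes this in ferroelectric hardware): the effective weight is the Hadamard product S ⊙ W of a binary sparsity matrix and an analogue weight matrix, and pruning/regrowth only flip bits in S, leaving W in place.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(30, 30))                    # analogue weights (FeFET states)
    S = (rng.random((30, 30)) < 0.25).astype(float)  # sparsity bits, 75% sparse

    def forward(x):
        return (S * W) @ x                           # Hadamard product, then matvec

    def prune_and_regrow(grads, k=10):
        global S
        active = np.argwhere(S == 1)
        weakest = active[np.argsort(np.abs(W[tuple(active.T)]))[:k]]
        S[tuple(weakest.T)] = 0                      # prune smallest active weights
        inactive = np.argwhere(S == 0)
        strongest = inactive[np.argsort(-np.abs(grads[tuple(inactive.T)]))[:k]]
        S[tuple(strongest.T)] = 1                    # regrow where gradients are large

    prune_and_regrow(grads=rng.normal(size=(30, 30)))
    print(f"density after update: {S.mean():.2f}")   # unchanged overall density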
Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding
Tian Jin, PhD Candidate, Electrical Engineering and Computer Science
Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these methods rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise.
We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses, while the language interpreter acts on these annotations to orchestrate parallel decoding on the fly at inference time.
Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality. Our results demonstrate geometric mean speedups ranging from 1.21× to 1.93×, with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against the sequential decoding baseline.
In collaboration with Michael Carbin, Associate Professor, Electrical Engineering and Computer Science and Jonathan Ragan-Kelley, Associate Professor, Electrical Engineering and Computer Science
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Muyang Li and Yujun Lin, PhD Candidates, Electrical Engineering and Computer Science
Diffusion models can effectively generate high-quality images. However, as they scale, rising memory demands and higher latency pose substantial deployment challenges. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, and existing post-training quantization methods like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights. Then, we use a high-precision, low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD), while a low-bit quantized branch handles the residuals. This process eases the quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without re-quantization. Extensive experiments on SDXL, PixArt-Σ, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 model by 3.5×, achieving a 3.0× speedup over the 4-bit weight-only quantization (W4A16) baseline on a 16 GB laptop RTX 4090 GPU with INT4 precision. On the latest RTX 5090 desktop with Blackwell architecture, we achieve a 3.1× speedup compared to the W4A16 model using NVFP4 precision.
In collaboration with Song Han, Associate Professor, Electrical Engineering and Computer Science
Additional Authors: Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu
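A simplified weight-side sketch of the decomposition (our toy INT4 quantizer and rank choice, not the Nunchaku kernels; SVDQuant also quantizes activations and first migrates their outliers into the weights): W is split into a high-precision low-rank branch plus a 4-bit quantized residual, so x @ W ≈ (x @ L1) @ L2 + x @ dequant(Q).

    import numpy as np

    def quantize_int4(R):
        scale = np.abs(R).max() / 7.0              # symmetric 4-bit range [-8, 7]
        q = np.clip(np.round(R / scale), -8, 7)
        return q, scale

    def svdquant(W, rank=32):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        L1 = U[:, :rank] * s[:rank]                # low-rank branch absorbs outliers
        L2 = Vt[:rank]
        q, scale = quantize_int4(W - L1 @ L2)      # 4-bit branch takes the residual
        return L1, L2, q, scale

    W = np.random.randn(256, 256)
    W[0, 0] = 40.0                                 # outlier that would wreck naive INT4
    L1, L2, q, scale = svdquant(W)
    W_hat = L1 @ L2 + q * scale
    print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))  # relative error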
LLMs for PICs: From Sketches and Text to Tapeout-Ready Designs
Kenaish Al Qubaisi, Postdoctoral Associate, Research Laboratory of Electronics
This short demo presents two emerging tools that harness the power of large language models (LLMs) to streamline photonic integrated circuit (PIC) design. The first tool, developed by our group, transforms hand-drawn sketches into DRC-clean GDS layouts—enabling a more intuitive and rapid transition from concept to fabrication-ready designs. The second tool, developed by our collaborators in Prof. Joyce Poon’s group at the Max Planck Institute of Microstructure Physics in Germany, takes natural language descriptions of photonic circuits and generates both the corresponding GDS layouts and system-level simulations. We are currently testing and integrating this tool in our design workflow. Together, these tools demonstrate the transformative potential of LLMs in photonic design automation.
In collaboration with Dirk Englund, Professor, Electrical Engineering and Computer Science
Computing with Heat
Caio Silva, Undergraduate, Physics
Heat is often regarded as a waste byproduct of physical processes, something to be minimized and dissipated. However, by carefully designing thermal devices, such as metal alloys in specific shapes and structures, it is possible to control heat flow in ways that enable novel applications across various fields. One such application is Computing with Heat, where temperature and thermal currents serve as carriers of information for performing computational operations. Using topology optimization and differentiable programming, we have developed inverse-designed 2D metal metastructures capable of receiving temperature inputs and executing matrix multiplications through heat conduction. This work lays the foundation for leveraging thermal transport as a computational medium, opening possibilities for energy-efficient analog computing.
In collaboration with Giuseppe Romano, Research Scientist, Institute for Soldier Nanotechnologies