Principal Investigators: Song Han, Anantha Chandrakasan

Natural language processing (NLP) models based on transformers and the attention mechanism have become widely used for recommendation systems, language modeling, question answering, sentiment analysis, and machine translation. In recent years, much of the work on hardware acceleration has focused on convolutional neural networks (CNNs) for image processing applications. Several works have accelerated recurrent neural networks (RNNs), but transformer models and the attention mechanism remained largely neglected until recently. As more devices rely on voice commands, it becomes increasingly important to develop efficient processors that run language processing directly on the edge device, ensuring privacy, low latency, and extended battery life.

Our goal is to accelerate the entire transformer model (as opposed to just the attention mechanism) to further reduce data movement across layers. We will exploit domain-specific properties of NLP models through techniques such as token pruning, piecewise linear quantization, and low-precision softmax, reducing memory and computation requirements so that high-performance NLP models can be deployed on edge devices.
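To make the token-pruning idea concrete, the following is a minimal NumPy sketch of attention-based token pruning: tokens that receive little attention are dropped before being passed to the next layer. The scoring rule (mean attention received across heads and query positions) and the fixed keep-ratio heuristic are illustrative assumptions for this sketch, not the design this project implements.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_importance(attn_probs):
    """Score each key token by the attention it receives,
    averaged over heads and query positions.
    attn_probs: (heads, queries, keys) attention probabilities."""
    return attn_probs.mean(axis=(0, 1))      # shape: (keys,)

def prune_tokens(hidden, attn_probs, keep_ratio=0.5):
    """Keep only the top `keep_ratio` fraction of tokens.
    hidden: (tokens, d_model) activations fed to the next layer."""
    scores = token_importance(attn_probs)
    k = max(1, int(round(keep_ratio * hidden.shape[0])))
    keep = np.sort(np.argsort(scores)[-k:])  # retained token indices, in order
    return hidden[keep], keep

# Toy example: 4 heads, 8 tokens, model dimension 16
rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 16))
attn = softmax(rng.standard_normal((4, 8, 8)), axis=-1)
pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.5)
print(kept, pruned.shape)  # indices of the 4 retained tokens and (4, 16)
```

In a hardware accelerator, pruning of this kind shrinks the sequence length seen by subsequent layers, which cuts both the quadratic attention cost and the activation traffic between layers.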

In collaboration with: Alex Ji, Hanrui Wang