Team effort leads to significantly higher LLM execution speed and high-profile industrial adoption

Written by Jenny Applequist

Many leading AI solutions are built on large language models (LLMs), which generate text by statistically modeling huge amounts of existing textual data. While many people are excited about the innumerable potential uses of these LLM-based solutions, the solutions have been enormously costly to run, in part because of inefficient use of computational hardware.

In January of this year, a team that includes Deming Chen, his then-student Yuhong Li, and Princeton University collaborators Tianle Cai and Tri Dao, among others, introduced an open-source solution called Medusa, which was published at the 2024 International Conference on Machine Learning (ICML). It can increase the speed of LLM inference, the process by which a trained model is used to make predictions, by up to 3.6 times. Medusa rapidly attracted strong industry interest and has already been adopted by NVIDIA (as discussed in an NVIDIA blog post), among other companies.

“LLMs employ auto-regressive decoding that requires sequential computation, with each step relying on the previous one’s output. This unfortunately creates a bottleneck because each step needs to move the full model parameters from memory to the accelerator’s cache,” explained Chen, who is Abel Bliss Professor of Engineering in electrical & computer engineering and the Coordinated Science Lab as well as the new Illinois co-director of the IBM-Illinois Discovery Accelerator Institute (IIDAI).

“Because data do not come fast enough, the computation units are left waiting for the data,” he added.

Thus, it can happen that only a fraction of the available computing units are active, while the majority sit idle. Among other drawbacks, that can translate into huge financial costs for users. For example, an organization might pay large sums for a cloud computing resource and be able to use only a fraction of the paid-for computational power.
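To see why this is a bottleneck, consider a rough sketch (in PyTorch, with a toy stand-in model rather than a real LLM) of a plain auto-regressive decoding loop, where each new token requires its own full pass over the model's weights:

```python
# Toy illustration of standard auto-regressive decoding (a stand-in model, not a real LLM).
# Each new token needs its own full forward pass, so the weights are streamed from memory
# once per token while much of the compute sits idle waiting for data.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
toy_lm = nn.Sequential(nn.Embedding(vocab_size, d_model),
                       nn.Linear(d_model, vocab_size))   # stand-in for an LLM

tokens = [42]                                            # hypothetical prompt token id
for _ in range(8):                                       # generate 8 tokens, strictly one at a time
    last = torch.tensor([tokens[-1]])                    # step t can only start once step t-1 is done
    logits = toy_lm(last)                                # one full pass over the weights -> one token
    tokens.append(int(logits.argmax(dim=-1)))            # greedy pick feeds the next iteration
print(tokens)
```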

Medusa is based on the insight that this waste of computational power can be avoided if multiple tokens (the units of text that a model processes) are generated simultaneously rather than one by one, because the hardware is then kept fully occupied performing multiple tasks in parallel on the same data it has already loaded. Medusa thereby also slashes the time it takes to complete inference tasks.
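Continuing the toy sketch above (with illustrative tensor sizes and candidate tokens that are assumptions, not Medusa's actual mechanics), the contrast is that a single forward pass can score several candidate tokens at once, so each trip through the model's weights does several tokens' worth of work:

```python
# Continuation of the toy sketch: scoring several candidate tokens in ONE forward pass,
# so a single trip through the model's weights serves many tokens (illustrative only).
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
toy_lm = nn.Sequential(nn.Embedding(vocab_size, d_model),
                       nn.Linear(d_model, vocab_size))   # same stand-in model as above

candidates = torch.tensor([17, 93, 5, 402])              # hypothetical candidate next tokens
logits = toy_lm(candidates)                              # one weight read scores all four candidates
print(logits.argmax(dim=-1))                             # per-candidate results, computed in parallel
```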

So how did they do it? Yuhong Li, one of the two lead authors of the ICML Medusa paper, said that the development of Medusa began with months of trial-and-error failures to achieve the goal of rapid speedup without sacrificing quality or accuracy. He credited team member Tianle Cai, the other lead author and a student at Princeton, with a “simple yet brilliant” idea that bubbled up in a casual conversation—a “light bulb moment”—whose exciting potential was instantly recognized. Li recalled that the team immediately flew into action, working day and night to produce the resulting successful Medusa solution.

Tri Dao, who is on the Princeton faculty and also the chief scientist of the Together AI company, said that the innovation was inspired by classical parallel decoding techniques. “By adding multiple decoding heads [that create tokens] to the model, Medusa can predict several tokens in parallel, thus reducing the dependency on sequential generation,” he explained. “This allows the model to process multiple tokens simultaneously, effectively speeding up inference and unlocking the computational potential of modern accelerators.”
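As a minimal sketch of the decoding-head idea (with assumed sizes, not the team's actual implementation), a handful of small prediction heads can sit on top of the backbone's last hidden state, with each head guessing a token further ahead:

```python
# Minimal sketch of the multiple-decoding-heads idea (illustrative sizes; not the Medusa code).
import torch
import torch.nn as nn

d_model, vocab_size, num_heads = 256, 1000, 4       # assumed sizes for illustration

# One lightweight head per lookahead position: head k proposes the token k+1 steps ahead.
medusa_heads = nn.ModuleList([
    nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, vocab_size))
    for _ in range(num_heads)
])

hidden = torch.randn(1, d_model)                    # stand-in for the backbone's last hidden state
proposals = [head(hidden).argmax(dim=-1) for head in medusa_heads]
print(torch.cat(proposals))                         # several future tokens proposed in parallel
```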

Chen added that Medusa can be applied in two ways, suitable for different scenarios. On the one hand, it can be fine-tuned directly on top of a frozen backbone LLM, gaining a speedup of up to 2.2×. The other option is to fine-tune it together with the backbone LLM, enabling better prediction accuracy for the Medusa heads and a higher speedup (up to 3.6×). The latter option, however, requires a special training recipe that preserves the backbone model’s capabilities, and it takes more time and effort.
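In rough code terms, the two recipes differ mainly in which parameters are allowed to change during fine-tuning. The sketch below uses stand-in modules and hypothetical learning rates to illustrate the distinction; it is not the team's actual training recipe:

```python
# Illustrative sketch of the two fine-tuning setups (stand-in modules, not the team's recipe).
import torch
import torch.nn as nn

backbone = nn.Linear(256, 256)                                    # stand-in for the backbone LLM
heads = nn.ModuleList([nn.Linear(256, 1000) for _ in range(4)])   # stand-in Medusa heads

# Option 1: freeze the backbone and train only the extra heads (reported up to 2.2x speedup).
for p in backbone.parameters():
    p.requires_grad = False
optimizer_frozen = torch.optim.Adam(heads.parameters(), lr=1e-3)  # hypothetical learning rate

# Option 2: fine-tune heads and backbone jointly (reported up to 3.6x speedup); this needs a
# careful recipe so the backbone's original capabilities are preserved.
for p in backbone.parameters():
    p.requires_grad = True
optimizer_joint = torch.optim.Adam(
    list(heads.parameters()) + list(backbone.parameters()), lr=1e-4  # hypothetical learning rate
)
```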

Dao noted that Medusa “has been embraced by major inference libraries across the AI industry,” including vLLM and TGI as well as NVIDIA’s TensorRT-LLM. “This wide-scale adoption demonstrates the method’s versatility and effectiveness in optimizing LLM inference, paving the way for more responsive and efficient AI applications,” he said.

In July, Li completed his Ph.D. at the U. of I. and joined Apple. In addition to Chen, Li, Cai, and Dao, the Medusa co-authors and developers include Zhengyang Geng, a student at Carnegie Mellon University; Hongwu Peng, a student at the University of Connecticut; and Princeton faculty member Jason D. Lee.

Chen said that the team is continuing to work on Medusa, exploring other ways to enhance it and apply it to different hardware. His student Selin Yildirim recently ported Medusa to AMD GPUs and observed a similar speedup. Chen added that others have been reaching out to the team to request ports of Medusa to other types of computing platforms, including CPUs, Google TPUs, and AMD AIE.

“This is a very exciting development. We aim for Medusa to become a widely adopted technology, enabling more efficient and faster execution of LLMs across a diverse range of accelerators,” he said.


This story was published November 21, 2024.