Karu Sankaralingam
Abstract Title
Serving Intelligence: The Unseen Limits of LLM Inference and the Hardware Race to Overcome Them
Large Language Models (LLMs) are transforming every corner of the technology landscape, from code generation and content creation to customer support and scientific discovery. But as models grow ever larger, the infrastructure required to serve them is straining under the weight of complexity, cost, and energy consumption. In this talk, we examine the true system-level bottlenecks that constrain the deployment and scalability of LLM inference today and into the near future. We identify five non-negotiable challenges for LLM inference hardware, with compute throughput, memory capacity, memory bandwidth, and collective communication chief among the barriers to performance. These findings suggest that achieving significant performance gains beyond 10,000 tokens per second will require not just hardware evolution but also fundamental algorithmic advances.
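To make the bandwidth barrier concrete, the following is a minimal back-of-envelope sketch, illustrative only and not drawn from the talk; the model size, weight precision, and bandwidth figures are assumptions. It shows why per-request decode throughput is bounded by memory bandwidth rather than compute:

    # Back-of-envelope bound on single-request LLM decode throughput.
    # All numbers below are assumptions for illustration.
    PARAMS = 70e9            # assumed 70B-parameter model
    BYTES_PER_PARAM = 1.0    # assumed FP8 weights, 1 byte per parameter
    HBM_BW = 3.35e12         # assumed HBM bandwidth in bytes/s (H100-class)

    # At batch size 1, generating each token streams every weight through
    # the memory system once, so bandwidth, not FLOPs, sets the ceiling.
    bytes_per_token = PARAMS * BYTES_PER_PARAM
    tokens_per_s = HBM_BW / bytes_per_token
    print(f"bandwidth-bound decode rate: {tokens_per_s:.0f} tokens/s")  # ~48

    # Closing the gap to thousands of tokens per second therefore requires
    # batching, multi-device parallelism (and with it fast collective
    # communication), or algorithmic advances such as speculative decoding.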
Biography
Karu Sankaralingam is a Principal Research Scientist at NVIDIA Research and a Professor at UW-Madison. He founded SimpleMachines in 2017, building the Mozart chip to advance AI hardware using dataflow computing. He has led three chip projects: Mozart, MIAOW (an open-source GPU), and TRIPS. His work, featured in the New York Times, Wired, and IEEE Spectrum, focuses on architecture, microarchitecture, and compilers. He has published over 100 papers, holds 21 patents, and has mentored 9 PhD students. He is an IEEE Fellow, and his work has been recognized with 9 best paper awards and invitations to industry forums.