Team behind Blue Waters study wins Test of Time award

7/2/2024 Cassandra Smith

A Coordinated Science Laboratory researcher and his teammates received a prestigious award recognizing their research on the resilience of the Blue Waters petascale supercomputer. 

Ravishankar Iyer (CSL) and five other researchers studied the U.S. National Science Foundation’s Blue Waters, a supercomputer famous as the most advanced of its time, with a hybrid architecture of central processing units (CPUs) and graphics processing units (GPUs). The team analyzed system failures in the supercomputer, which was housed at the University of Illinois’ National Center for Supercomputing Applications (NCSA). The team comprised Catello Di Martino (CE), Zbigniew Kalbarczyk (CSL, ECE, ITI), Ravishankar K. Iyer (CS, ECE, CSL, ITI, NCSA), Fabio Baccanico, Joseph Fullop (CS, NCSA) and William Kramer (CS, NCSA).

The study resulted in a paper titled “Lessons Learned From the Analysis of System Failures at Petascale: The Case of Blue Waters.” Written 10 years ago, the paper recently won a Test of Time Award from the IEEE/IFIP International Conference on Dependable Systems and Networks. Papers that win this prestigious award are chosen for their impact over time.

Lelio Di Martino (CS Alumni), a co-author of the paper, is a department head at Nokia Bell Labs. He said he was grateful for the opportunity to conduct this research.

“Blue Waters offered us a unique opportunity to delve into the complexity of large-scale computing systems,” he said. “Our goal was to uncover the hidden challenges and to provide insights that could shape the future of resilient computing.” 

Blue Waters was in continuous operation from 2013 to 2021 and could process more than 13 quadrillion calculations per second. To put that power in perspective, it was three million times faster than a laptop. It also featured massive capacity: its more than 1.5 petabytes of memory could hold 300 million digital camera photos, and its more than 25 petabytes of disk storage were enough to store all the printed documents in all the libraries in the world.

“Dealing with over 4TB of system data from one of the most complex computing machines ever built, we faced immense challenges,” said Di Martino. “Transferring data alone took more than two days and we even had to write a patch to use Blue Waters’ own nodes for analysis—essentially needing Blue Waters to analyze Blue Waters.” 

The findings influenced many areas of industry, and the work contributed to later studies of AI systems, autonomous vehicles and smart grids.

One of the co-authors, Joseph Fullop (Computer Science Alumni), worked with NCSA as a student in the late 1990s. He later joined full time and worked on several supercomputers. He left the organization in 2016 to continue working with high-performance computing systems at Los Alamos National Laboratory. He said that as a scientist at the lab, he is helping bring in the Nvidia Grace-Hopper-based AI supercomputer, which he said is similar to, yet much larger than, NCSA's Delta AI.

“Today, as we stand at the cusp of exascale computing, the lessons learned from Blue Waters are more relevant than ever,” said Di Martino, who noted that the award not only recognizes past efforts but also highlights the continuing importance of understanding system failures to innovate and improve. “Our findings on GPU resilience and the need for rapid recovery techniques have influenced the design and implementation of the next generation of supercomputers, including the Frontier machine.”

Di Martino said the true value of their work was in identifying several corner cases that invalidated practical and theoretical assumptions of the time, many of which persist today. He said the work is a testament to the importance of challenging current ideas, learning from the past and continuing to push the boundaries of research.


This story was published July 2, 2024.