Protecting supercomputing environments with artificial intelligence

11/19/2018 Allie Arp, CSL


When a personal computer gets a bug or a virus, the owner can take it to Geek Squad or a techie friend to get it fixed. When a supercomputer has an issue, a fairly common occurrence for these massive machines, it can take a whole team of engineers to fix even everyday problems. University of Illinois researchers are teaming up with Sandia National Laboratories to use artificial intelligence (AI) to help resolve and prevent these problems.

“Increasingly we have big systems like the Google Cloud and the Microsoft suite merging with high performance computing systems that run weather simulations and scientific computing,” said CSL Professor Ravi Iyer, a George and Ann Fisher Distinguished Professor of Engineering. “When you have such large systems, things are constantly failing and conventional solutions don’t always scale up. There is strong interest in building systems that can continue to perform in the presence of failures and other kinds of problems, and as far as possible without human interference.”
Ravi Iyer

Iyer and doctoral student Saurabh Jha are analyzing data from the National Center for Supercomputing Applications (NCSA) to see what humans have done to fix problems on supercomputers such as NCSA’s petascale system, Blue Waters. The goal of the project, “AI-driven Continuous Assessment of High-Performance Data Centers,” is to understand the monitoring and maintenance needs of a supercomputer and then use machine-learning-based artificial intelligence to not only recover from problems but also restore the system without the need for human intervention.

Jha and Iyer are currently deploying algorithms to fix identified problems within the Blue Waters supercomputer system. This involves deliberately inserting problems to ensure the algorithms can resolve both common and uncommon issues.

“Not only were we able to clear the problems we planned for, but we were able to detect other issues automatically and take corrective actions,” Jha said. “Currently every solution is a custom AI algorithm, but can we come up with an abstraction that can cover 80 to 90 percent of the issues, not only for this system but across systems? Potentially across technological generations?”

As Iyer points out, this is not a new problem, but the magnitude does bring about new challenges, and with them, opportunities to collaborate.

“Building systems that are fault tolerant is as old as computers themselves, but when it comes to supercomputers, the scale is what presents a whole bunch of new problems. It requires theoretical work as well as system design to come together, and that’s unique,” said Iyer. “There is a lot of interest in solving this issue and it’s been an important focus for many national laboratories.”

Partners in this research are computer scientists Jim Brandt and Ann Gentile of Sandia National Laboratories. They worked successfully with Iyer and Jha on a previous project and saw the opportunity to continue to work together on taking an AI approach to solving these problems.

“One of their [Iyer and Jha’s] strengths is that they understand the issues that come with large-scale systems,” said Jim Brandt.
Saurabh Jha

Both sides of the partnership look forward to continuing to build the relationship. For one of Sandia’s scientists, it’s a way to give current University of Illinois at Urbana-Champaign students the opportunity to make a practical difference in the field, the same way she did when she was a student at the University.

“When I was a student one of the things that was important for our group was working with companies who had technical need for the work we were developing,” said Ann Gentile. “For me, it was wonderful to make something that would be used in the real world. I am very happy that this project is a way for the students to see this isn’t just theoretical, it is going to make a difference in how computing is done.”

This is a one-year project funded by Sandia at $162,500.



This story was published November 19, 2018.