Iyer works to update monitoring of computer systems

7/2/2021 Lizzie Roehrs

CSL professor Ravi Iyer is working with Sandia National Laboratories to automate the control and management of supercomputers.

Written by Lizzie Roehrs

Emerging supercomputing applications and simulations that use machine learning and artificial intelligence demand increased computational capacity, performance, and resilience. As next-generation supercomputers are tailored to meet those demands, it has become increasingly difficult to control and manage such systems using human-driven heuristics.

Addressing these challenges means fundamentally reinventing the monitoring and control plane of computing systems using fast, low-cost machine learning techniques. Working alongside Sandia National Laboratories (SNL), CSL professor Ravi Iyer and his team aim to automate the control and management of supercomputers in order to improve the performance and resilience of supercomputing applications. Examples of such control and management tasks include scheduling, congestion control, performance anomaly detection, and failure mitigation.

“Large-scale high-performance computing systems and cloud computing systems are increasingly being used to solve critical societal problems such as drug synthesis, genome studies, and weather forecasting,” says Iyer. “Working with our partners at Sandia National Laboratory, we want to automate system management tasks to provide optimal performance and resilience for applications and increase the overall science throughput.”

Ann Gentile, manager of the HPC Development Department at Sandia National Labs and a UIUC alum, says that the advantage computing centers like SNL and UIUC have is that they can apply their approaches to some of the largest and most advanced computing technologies in existence.

“Significant expertise goes into designing computing systems and applications, but we will not realize the benefits of that design if the system is managed without consideration of the architecture and the applications running on it,” says Gentile. “A challenge in our work with Prof. Iyer and his team is how to take machine learning techniques developed in other areas and apply them to computing systems management.”

Saurabh Jha, a senior graduate student and now an IBM PhD fellow, says, “This research will benefit small-to-large scale system owners such as universities, cloud computing vendors, and national laboratories. The goal is to alleviate the need to control and manage a system manually.” Automating these tasks would improve overall system utilization and throughput and reduce the dependence on humans for error-prone tasks. “We have developed and demonstrated an ML-driven automation framework on large-scale systems such as Blue Waters and IBM Cloud,” says Jha. Results from the team's current work have been published at top-tier systems conferences such as Supercomputing, NSDI, OSDI, and ICS, and at machine learning conferences such as ICML.

“Today integration of innovation in hardware architecture, operating systems, network interconnects, and storage is based on handcrafted heuristics, which have become challenging to generate given the variations across deployment environments and applications,” says Iyer. “The goal of our research is to reinvent the design of the system from hardware to application using inflight analytics to control and optimize large-scale heterogeneous computer systems to meet the performance and resiliency requirements of the emergent applications.”

“I am really excited to work in this area, which uniquely combines measurements, machine learning, and systems to design computing systems of the future,” says Archit Patke, a PhD student who recently started working on this project.

The automation of computer systems is a highly challenging task that requires solving many fundamental problems in modeling computer systems and their tasks, dealing with noisy and incomplete monitoring datasets, and adapting the latest innovations across the software and hardware stack. This research is broad enough to impact stakeholders across industry and academia.

“The general applicability of this research will lead to groundbreaking insights and innovations,” says Iyer.

Sandia National Laboratories is funding this research with $100,000 over nine months.


This story was published July 2, 2021.