CSL professors lead multi-disciplinary efforts to improve computing platforms

12/17/2020 Allie Arp, CSL

Large-scale computing platforms such as clouds and supercomputers are becoming more common and more complex, and as a result extremely expensive to build and operate. Three CSL professors and two computer science professors have teamed up on an NSF-funded grant to plan proactively for automating resource and resiliency management in such systems. The team will leverage machine learning (ML) to tackle problems such as scheduling, server/VM health monitoring, distributed failure detection, real-time intrusion detection, and power management, among other management functions.

“The broad idea is that there will continue to be advances in the technology and when you combine that with the demand for major cloud large systems we must have a better process to manage these systems than exists today,” said Ravi Iyer, George and Ann Fisher Distinguished Professor of Engineering. “We will let the innards of such a system evolve as they do, however, we’ll put around this is a blanket of ML algorithms which will work together to automatically manage these future systems.”

In the project, “Inflight Analytics to Control Large-Scale Heterogeneous Systems,” Iyer, along with CSL Director Klara Nahrstedt, Professor Emeritus Wen-mei Hwu, and CS Professors Tianyin Xu and William Kramer, is building an ensemble of ML models that use domain-knowledge-driven artificial intelligence techniques to make better decisions by directly incorporating dynamic and contextual measurements made on these large systems. This approach will relieve current and future datacenters of the need to painstakingly build human-engineer-derived static policies or heuristics. The ML algorithms the team hopes to develop would automate such decision making and significantly ease the integration of heterogeneous computing elements like accelerators, non-volatile memories, and high-speed interconnects into these computing platforms.

“Today, tight vertical integration across a system stack is handled with painstakingly built hand-crafted average-case heuristics,” said Iyer, professor in electrical and computer engineering. “To meet these application demands, the next generation systems are rapidly evolving by incorporating innovations in architecture, interconnects, operating systems, and large-scale distributed systems. There is a significant potential to use machine-learning-generated heuristics to replace hand-crafted heuristics.”

Using ML, the group plans to build the next generation of large-scale computing systems, in which monitoring data collected throughout the system stack (e.g., performance counters from processors, telemetry data from operating systems and interconnects, detailed error logs, and application-level trace data) is used to automatically generate real-time decisions for a variety of system resource management tasks. These ML heuristics would take into account much more contextual information than is possible in hand-crafted heuristics, allowing them to adapt more readily to a system’s architecture and actual usage patterns, rather than being constructed for the average case.
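As a rough illustration of the difference between an average-case rule and a context-aware one, the sketch below compares a single fixed threshold with thresholds derived from historical telemetry for each node type. It is only a toy stand-in for the ML models the project describes, not the team's method: the `Sample` fields, the node types, the numbers, and the midpoint rule are all assumptions made up for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One hypothetical telemetry record (all fields are illustrative)."""
    node_type: str         # context, e.g. "cpu" or "gpu"
    corrected_errors: int  # corrected errors logged in the last hour
    failed: bool           # did the node fail shortly afterwards?


def static_heuristic(sample, threshold=50):
    """Hand-crafted, average-case rule: one threshold for every node."""
    return sample.corrected_errors > threshold


def fit_thresholds(history):
    """Derive one threshold per node type from labeled history.

    The threshold is the midpoint between the mean error count of healthy
    samples and the mean error count of failing samples for that node type.
    """
    thresholds = {}
    for node_type in {s.node_type for s in history}:
        healthy = [s.corrected_errors for s in history
                   if s.node_type == node_type and not s.failed]
        failing = [s.corrected_errors for s in history
                   if s.node_type == node_type and s.failed]
        thresholds[node_type] = (mean(healthy) + mean(failing)) / 2
    return thresholds


def learned_heuristic(sample, thresholds):
    """Context-aware rule: compare against the threshold for this node type."""
    return sample.corrected_errors > thresholds[sample.node_type]


# Made-up history: in this toy data, GPU nodes fail at far lower error counts.
HISTORY = [
    Sample("cpu", 10, False), Sample("cpu", 20, False),
    Sample("cpu", 80, True), Sample("cpu", 100, True),
    Sample("gpu", 2, False), Sample("gpu", 4, False),
    Sample("gpu", 20, True), Sample("gpu", 30, True),
]

if __name__ == "__main__":
    thresholds = fit_thresholds(HISTORY)
    probe = Sample("gpu", 25, False)  # outcome unknown; label is a placeholder
    print("static rule flags node: ", static_heuristic(probe))
    print("learned rule flags node:", learned_heuristic(probe, thresholds))
```

In this made-up data, the fixed threshold of 50 misses a GPU node reporting 25 errors, while the per-node-type threshold flags it. A production system would draw on far richer signals and models than this single counter.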

“One of the major questions is at the network level how do you connect all these missions?” said Nahrstedt, Ralph and Catherine Fisher Professor of Computer Science. “In our framework we will exploit the treasure trove of data available from monitors across the system stack to manage next generation computing systems as dictated by emerging applications. It’s amazing work.”

This story was published December 17, 2020.