12/17/2020 Allie Arp, CSL
Large-scale computing platforms like clouds and supercomputers are becoming more common and more complex. As a result, these platforms are becoming extremely expensive to build and operate. Three CSL professors and two computer science professors have teamed up on an NSF-funded grant to plan for the need to automate resource and resiliency management in such systems. The team will leverage machine learning (ML) to tackle problems such as scheduling, server/VM health monitoring, distributed failure detection, real-time intrusion detection, and power management, among other management functions.
“The broad idea is that there will continue to be advances in the technology and when you combine that with the demand for major cloud large systems w…
“Today, tight vertical integration across a system stack is handled with painstakingly built, hand-crafted, average-case heuristics,” said Iyer, professor in electrical and computer engineering. “To meet these application demands, the next-generation systems are rapidly evolving by incorporating innovations in architecture, interconnects, operating systems, and large-scale distributed systems. There is a significant potential to use machine-learning-generated heuristics to replace hand-crafted heuristics.”
Using ML, the group plans to build the next generation of large-scale computing systems that would allow for monitoring data collected throughout the system stack (e.g., performance counters from processors, telemetry data…
“One of the major questions is, at the network level, how do you connect all these missions?” said Nahrstedt, Ralph and Catherine Fisher Professor of Computer Science. “In our framework we will exploit the treasure trove of data available from monitors across the system stack to manage next-generation computing systems as dictated by emerging applications. It’s amazing work.”