12/4/2020 Allie Arp, CSL
Good things take time. When the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR) was formed as a collaboration between CSL and IBM in 2016, one of the first areas the center tackled was productive tooling and methodologies for AI research. Four years later, one of the projects, titled MLModelScope, has produced multiple award-winning papers and PhD theses.
One of the core issues in the field of AI is the complex interplay among hardware stacks, software stacks, machine learning models, and data sets. It is often hard for AI researchers to reproduce other researchers’ work, even when the machine learning models are published, because the documentation is rarely detailed enough to specify all the required interdependencies among the hardware and software stacks. Even if the authors of a model write a detailed document with the best of intentions, a future researcher may have a different hardware configuration and different versions of software libraries that are incompatible with what the authors describe. All of this makes reproducing a published model problematic.

It is also impractical for users outside computer science to evaluate published AI work to see whether those models would be as effective when applied to their own proprietary data sets. Even AI researchers themselves are sometimes limited by the available hardware configurations and deep learning frameworks when developing their deep neural network models, because there are few tools to help them pinpoint the location of a performance issue across the entire software/hardware stack. The C3SR team set out to address these issues through the MLModelScope project.
“This project has a long history and is a really great success story for the center and our students,” said Jinjun Xiong, IBM researcher and co-director of C3SR. “Our goal was to build a common platform that would help AI researchers easily publish their models, and allow other AI researchers and non-computer scientists to easily evaluate models’ performance on both public and proprietary datasets on any combination of deep learning frameworks and machine configurations.”
Hundreds of models have been published to the MLModelScope platform, which supports all major deep learning frameworks (including TensorFlow, PyTorch, ONNX, and MXNet), all major computing platforms (x86, POWER, and IBM mainframes), and all modern GPU series. The platform also provides productive tooling to run evaluations under various model/software/hardware combinations in a scalable fashion, producing a rich set of results for data analytics. One of the tools from the platform was recently presented at the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS) in the paper “XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs,” which received the Best Paper Award.

The overall length of this project, and the amount of time before there would be recognition, made it a hard sell for students, who are often looking for research projects that will allow them to publish papers quickly.
“One of the focuses of our center is on building functional AI systems so people can use them to solve real pains, but this is a bit difficult in the existing academic setting, where the success of a PhD student is often measured by his or her number of publications, rather than functional systems that he or she develops,” said Xiong. “I often had a hard time convincing some students to work on this project because there was a lot of system development effort.”
It may have taken a bit of convincing, but once students like Cheng Li and Abdul Dakkak began working on the project, they realized the advantages of developing functional AI systems. Now that they have started receiving long-awaited recognition, they know it was worth the wait.
"This piece of work took quite a long time and became an essential part of my PhD thesis,” said Li, who earned her doctorate in computer science in the spring of 2020. “Receiving this recognition is a very big encouragement for my future research and I am continuing this line of work in my current job.”In addition, the group wrote another paper on MLMModelScope design, titled “The Design and Implementation of a Scalable DL Benchmarking Platform,” which received the Best Student Paper Award at the recent 2020 IEEE CLOUD Conference. In addition to these awards, the researchers have been invited to conduct joint tutorials about the work at various premier international conferences by the MLPerf (a well-known machine learning benchmarking community) leadership team, where they attracted a lot of industry attention. As an outcome of these tutorials, Li was offered a summer internship with Alibaba USA Inc. in 2019, to use the developed technology to run experiments on its internal systems and models.
“Abdul and Cheng have done a tremendous job in deeply understanding the source of performance variation and measurement errors, proposing an elegant solution, and building a practical system to offer an indisputable proof of their concept,” said Wen-mei Hwu, the PhD advisor of both Dakkak and Li. “Perhaps even more importantly, they created an artifact that has advanced the state of the practice, which is the hallmark of outstanding PhD thesis research in computing systems.”

All of these recognitions and experiences led to job offers for both Li and Dakkak right after graduation. Li is currently a senior researcher at Microsoft, and Dakkak is a principal research software engineer working on machine learning at Microsoft Research.
“It’s satisfying to see our research vision, laid out more than five years ago, starting to bear fruit,” said Xiong. “I hope the MLModelScope project will become a success story that inspires future students not to be lured by short-term success, but to really target long-term success by building practical, impactful, and functional AI systems to address the real pains of the community.”