AI workloads differ significantly from conventional cloud workloads (e.g., big-data OLAP or OLTP workloads). In early 2017, I began to look into this problem and tried to understand its implications. With my colleagues, I investigated a large volume of AI workloads on Philly, Microsoft's early GPU cluster management system designed for deep learning training. We shared our findings in (Jeon et al., 2019). We presented our thoughts on scheduling primitives for training jobs (Xiao et al., 2018) and emphasized the importance of topology-aware scheduling (Zhao et al., 2020) in the AI era. Meanwhile, we also discovered several interesting opportunities in the coexistence of gaming and training workloads (Zhang et al., 2022), the co-design of caching and scheduling (Zhao et al., 2023), and elastic training (Gu et al., 2023).
Given the strategic importance of GPU cluster management, I led an engineering group to develop OpenPAI, a Kubernetes-based open-source cluster management platform for deep learning training and inference. OpenPAI is one of the earliest Kubernetes-based systems capable of managing GPU clusters, and it integrates several of the techniques mentioned above. Key components such as FrameworkController have been adopted by Azure AI products. As far as I know, several external organizations have also built their training infrastructure on OpenPAI.
ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
Diandian Gu and 9 more authors
In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS), 2023