Scheduling for AI Workloads

AI workloads differ significantly from conventional cloud workloads (e.g., big data OLAP or OLTP workloads). In early 2017, I began to look into this problem and tried to understand its implications. My colleagues and I investigated a massive number of AI workloads in Philly, Microsoft’s early GPU cluster management system designed for deep learning training, and shared our findings in (Jeon et al., 2019). We presented our thoughts on scheduling primitives for training jobs (Xiao et al., 2018) and emphasized the importance of topology-aware scheduling (Zhao et al., 2020) in the AI era. Meanwhile, we also discovered several interesting opportunities in the coexistence of gaming and training workloads (Zhang et al., 2022), the co-design of caching and scheduling (Zhao et al., 2023), and elastic training (Gu et al., 2023).

Given the strategic importance of GPU cluster management, I led an engineering group to develop OpenPAI, a Kubernetes-based open-source cluster management platform for deep learning training and inference. OpenPAI is one of the earliest Kubernetes systems capable of managing GPU clusters, and it integrated several of the techniques mentioned above. Key components such as its framework controller have been adopted by Azure AI products. As far as I know, several external organizations have also built their training infrastructure on top of OpenPAI.
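To give a flavor of what Kubernetes-based GPU management involves, the sketch below shows a generic pod spec requesting GPUs through the NVIDIA device plugin's `nvidia.com/gpu` extended resource. It is illustrative only and is not OpenPAI's own job specification; the pod name, image, and entry point are hypothetical.

```yaml
# Generic Kubernetes pod requesting two GPUs via the NVIDIA device
# plugin's extended resource. Illustrative sketch, not the OpenPAI
# job format.
apiVersion: v1
kind: Pod
metadata:
  name: train-job                     # hypothetical job name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: pytorch/pytorch:latest   # hypothetical training image
      command: ["python", "train.py"] # hypothetical entry point
      resources:
        limits:
          nvidia.com/gpu: 2           # scheduler places the pod only on a node with 2 free GPUs
```

The scheduler treats `nvidia.com/gpu` as an opaque countable resource, which is precisely why GPU-specific concerns such as topology-aware placement required the additional mechanisms discussed above.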

References

2023

  1. SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters
    Hanyu Zhao, and 11 more authors
    In Proceedings of the Eighteenth European Conference on Computer Systems, EuroSys, 2023
  2. ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
    Diandian Gu, and 9 more authors
    In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS, 2023

2022

  1. PilotFish: Harvesting Free Cycles of Cloud Gaming with Deep Learning Training
    Wei Zhang, and 8 more authors
    In 2022 USENIX Annual Technical Conference, USENIX ATC, 2022

2020

  1. HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
    Hanyu Zhao, and 10 more authors
    In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2020

2019

  1. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
    Myeongjae Jeon, and 5 more authors
    In 2019 USENIX Annual Technical Conference, USENIX ATC, 2019

2018

  1. Gandiva: Introspective Cluster Scheduling for Deep Learning
    Wencong Xiao, and 11 more authors
    In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2018