AI compiler

An AI compiler translates neural networks into low-level device code, e.g., CUDA. It plays a critical role in ensuring that neural networks scale efficiently.
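For illustration, the sketch below shows the kind of CUDA device code an AI compiler might emit when lowering a single fused bias-add + ReLU node of a neural network graph. The kernel name, shapes, and launch configuration are hypothetical assumptions for this sketch, not the output of any particular compiler.

```cuda
// Illustrative only: hand-written CUDA resembling what a compiler could emit
// for a fused "bias-add + ReLU" operator. All names and sizes are made up.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void bias_relu_kernel(const float* __restrict__ x,
                                 const float* __restrict__ bias,
                                 float* __restrict__ y,
                                 int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx < rows * cols) {
        float v = x[idx] + bias[idx % cols];          // broadcast bias across rows
        y[idx] = v > 0.0f ? v : 0.0f;                 // fused ReLU, no extra kernel
    }
}

int main() {
    const int rows = 128, cols = 256, n = rows * cols;
    float *x, *bias, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&bias, cols * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i % 7) - 3.0f;
    for (int j = 0; j < cols; ++j) bias[j] = 0.5f;

    int threads = 256, blocks = (n + threads - 1) / threads;
    bias_relu_kernel<<<blocks, threads>>>(x, bias, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);

    cudaFree(x); cudaFree(bias); cudaFree(y);
    return 0;
}
```

Fusing the bias add and the activation into a single kernel avoids an extra round trip through global memory; making such decisions automatically, across an entire model, is what the compiler is for.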

In the past, we developed a series of compiler techniques that advocate a tile-based abstraction for canonical deep learning compilation on SIMT-based AI hardware (e.g., GPUs), including Rammer (Ma et al., 2020), Roller (Zhu et al., 2022), Welder (Shi et al., 2023), and Cocktailer (Zhang et al., 2023). These techniques were covered in a Microsoft Research blog post, and the tile abstraction is now well recognized in the systems community.
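A minimal sketch of the tile abstraction, assuming a plain shared-memory tiled matrix multiply: each thread block owns one TILE x TILE tile of the output, stages the matching input tiles in shared memory, and accumulates over the reduction dimension. The tile size and kernel below are illustrative assumptions; a tile-based compiler would pick tile shapes to match the hardware, and this is not the actual Rammer, Roller, Welder, or Cocktailer code.

```cuda
// A sketch of tile-level computation: one thread block computes one
// TILE x TILE tile of C = A (M x K) * B (K x N), streaming matching tiles
// of A and B through shared memory. The tile size is arbitrary here.
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C owned by this thread
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Cooperatively stage one tile of A and one tile of B in shared memory.
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)           // multiply the two resident tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Launching with matmul_tiled<<<dim3((N + TILE - 1) / TILE, (M + TILE - 1) / TILE), dim3(TILE, TILE)>>>(A, B, C, M, N, K) assigns one output tile per thread block; reasoning about programs at this tile granularity, rather than thread by thread, is what the abstraction refers to.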

In addition, we recognized early the importance of taking model sparsity into account in compiler techniques and developed the first sparsity-aware compilers, SparTA (Zheng et al., 2022), PIT (Zheng et al., 2023), and nmSPARSE (Lin et al., 2023), as well as compilers for low-bit neural models, e.g., Ladder (Wang et al., 2024). All of them are unified under the tile abstraction.
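To illustrate how sparsity can be exploited at the same tile granularity, the hypothetical kernel below extends the tiled matmul with a per-tile nonzero mask for the weight matrix B and skips reduction steps whose weight tile is entirely zero. The mask layout and kernel are assumptions made for this sketch only; they are not the SparTA, PIT, or nmSPARSE implementations.

```cuda
// Illustrative block-sparse matmul: B is stored densely, but a per-tile
// mask (k_tiles x n_tiles, row-major) marks which TILE x TILE blocks of B
// contain any nonzero value; all-zero tiles are skipped. Hypothetical sketch.
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmul_block_sparse(const float* A, const float* B,
                                    const unsigned char* b_tile_nonzero,
                                    float* C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int k_tiles = (K + TILE - 1) / TILE;
    int n_tiles = (N + TILE - 1) / TILE;
    float acc = 0.0f;

    for (int t = 0; t < k_tiles; ++t) {
        // Skip this reduction step entirely if the (t, blockIdx.x) tile of B is zero.
        if (!b_tile_nonzero[t * n_tiles + blockIdx.x])
            continue;

        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

Because the skip condition depends only on tile indices shared by the whole thread block, every thread takes the same branch and the barriers inside the loop remain safe.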

Our next focus is compiler techniques designed for AI hardware with distributed memory architectures (i.e., non-SIMT), which we believe will be the future. Meanwhile, the programming interface of neural networks is an important topic closely related to compiler techniques. We will continue to investigate new programming paradigms such as FractalTensor (Liu et al., 2024), which is designed for next-generation neural networks.

Interestingly, we have observed that compiler techniques are also useful in distributed deep learning training and automated machine learning. Based on this observation, we developed nnScaler (Lin et al., 2024), a flexible and efficient distributed training framework, and NNI, a popular AutoML toolkit (Zhang et al., 2020).

References

2024

  1. Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
    Lei Wang, and 11 more authors
    In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2024
  2. Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor
    Siran Liu, and 7 more authors
    In Proceedings of the 30th Symposium on Operating Systems Principles, SOSP, 2024
  3. nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
    Zhiqi Lin, and 13 more authors
    In 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2024

2023

  1. Welder: Scheduling Deep Learning Memory Access via Tile-graph
    Yining Shi, and 8 more authors
    In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2023
  2. Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning
    Chen Zhang, and 8 more authors
    In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2023
  3. PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
    Ningxin Zheng, and 10 more authors
    In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP, 2023
  4. Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning
    Bin Lin, and 10 more authors
    In Proceedings of Machine Learning and Systems, MLSys, 2023

2022

  1. ROLLER: Fast and Efficient Tensor Compilation for Deep Learning
    Hongyu Zhu, and 14 more authors
    In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2022
  2. SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute
    Ningxin Zheng, and 8 more authors
    In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2022

2020

  1. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
    Lingxiao Ma, and 9 more authors
    In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2020
  2. Retiarii: A Deep Learning Exploratory-Training Framework
    Quanlu Zhang, and 6 more authors
    In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2020