An AI compiler translates a neural network into low-level device code, e.g., CUDA. It plays a critical role in ensuring that neural networks scale efficiently.
In the past, we developed a series of compiler techniques that advocate a tile-based abstraction for canonical deep learning compilation on SIMT-based AI hardware (e.g., GPUs): Rammer (Ma et al., 2020), Roller (Zhu et al., 2022), Welder (Shi et al., 2023), and Cocktailer (Zhang et al., 2023). These techniques were covered in a Microsoft Research blog post, and the tile abstraction is now well recognized in the systems community.
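To make the tile abstraction concrete, here is a minimal NumPy sketch, purely an illustration rather than the actual code generation of Rammer or Roller: a matrix multiplication is decomposed into independent output tiles, each of which is a unit of work that a compiler can map onto a GPU execution unit.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute C = A @ B tile by tile. Each (i, j) output tile is an
    independent piece of work; the k loop is the per-tile reduction."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):            # rows of output tiles
        for j in range(0, N, tile):        # columns of output tiles
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):    # accumulate over K in tile-sized steps
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

Choosing tile sizes that match the hardware's memory hierarchy and scheduling the independent tiles onto execution units are the kinds of decisions these compilers automate.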
In addition, we envisioned early on the importance of taking model sparsity into account in compilation, and developed the first sparsity-aware compilers, SparTA (Zheng et al., 2022), PIT (Zheng et al., 2023), and nmSPARSE (Lin et al., 2023), as well as compilers for low-bit neural models, e.g., Ladder (Wang et al., 2024). All of them are unified under the tile abstraction, as sketched below.
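As a toy illustration of how sparsity awareness composes with the tile abstraction, in the spirit of (but not taken from) SparTA and nmSPARSE, the sketch below skips weight tiles that a hypothetical block-level sparsity attribute marks as all zero, so pruned blocks cost neither memory traffic nor compute.

```python
import numpy as np

def block_sparse_matmul(X, W, block_mask, tile=32):
    """Compute Y = X @ W, skipping tiles of W whose entries are all zero
    according to `block_mask` (a hypothetical per-tile sparsity attribute)."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((M, N), dtype=X.dtype)
    for k in range(0, K, tile):
        for j in range(0, N, tile):
            if not block_mask[k // tile, j // tile]:
                continue                   # pruned tile: skipped entirely
            Y[:, j:j+tile] += X[:, k:k+tile] @ W[k:k+tile, j:j+tile]
    return Y

K, N, tile = 96, 64, 32
mask = np.random.rand(K // tile, N // tile) < 0.5              # keep roughly half the tiles
W = np.random.rand(K, N).astype(np.float32)
W *= np.kron(mask, np.ones((tile, tile), dtype=np.float32))    # zero out pruned tiles
X = np.random.rand(8, K).astype(np.float32)
assert np.allclose(block_sparse_matmul(X, W, mask, tile), X @ W, atol=1e-4)
```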
Our next focus is compiler techniques for AI hardware with distributed memory architectures (i.e., non-SIMT) (He et al., 2025), which we believe will be the future. Meanwhile, the programming interface of neural networks is an important topic closely related to compilation. We will continue to investigate new programming paradigms such as FractalTensor (Liu et al., 2024), which is designed for next-generation neural networks.
Interestingly, we observe that compiler techniques are also useful in distributed deep learning training and automated machine learning. Based on this observation, we developed nnScaler (Lin et al., 2024), a flexible and efficient distributed training framework, and NNI, a popular AutoML toolkit (Zhang et al., 2020).
@article{waferllm25,title={WaferLLM: A Wafer-Scale LLM Inference System},author={He, Congjie and Huang, Yeqi and Mu, Pei and Miao, Ziming and Xue, Jilong and Ma, Lingxiao and Yang, Fan and Mai, Luo},year={2025},journal={ArXiv},}
@inproceedings{ladder24,title={Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation},author={Wang, Lei and Ma, Lingxiao and Cao, Shijie and Zhang, Quanlu and Xue, Jilong and Shi, Yining and Zheng, Ningxin and Miao, Ziming and Yang, Fan and Cao, Ting and Yang, Yuqing and Yang, Mao},year={2024},booktitle={18th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
To speed up computation, deep neural networks (DNNs) usually rely on highly optimized tensor operators. Despite their effectiveness, tensor operators are often defined empirically with ad hoc semantics, which hinders analysis and optimization across operator boundaries. FractalTensor is a programming framework that addresses this challenge. At its core, FractalTensor is a nested list-based abstract data type (ADT), where each element is a tensor with a static shape or another FractalTensor (i.e., nested). DNNs are then defined by high-order compute operators like map/reduce/scan and data access operators like window/stride on FractalTensor. This way of defining DNNs explicitly exposes nested data parallelism and fine-grained data access patterns, opening new opportunities for whole-program analysis and optimization. To exploit these opportunities, the compiler extracts from the FractalTensor-based code a nested multi-dimensional dataflow graph called the Extended Task Dependence Graph (ETDG), which provides a holistic view of data dependencies across different granularities. The ETDG is then transformed into an efficient implementation through graph coarsening, data reordering, and access materialization. Evaluation on six representative DNNs, such as RNNs and FlashAttention, on an NVIDIA A100 shows that FractalTensor achieves speedups of up to 5.44x, and 1.97x on average, through a unified solution for diverse optimizations.
@inproceedings{FractalTensorSosp24,title={Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor},author={Liu, Siran and Qi, Chengxiang and Cao, Ying and Yang, Chao and Hu, Weifang and Shi, Xuanhua and Yang, Fan and Yang, Mao},year={2024},booktitle={{SOSP}},}
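As a rough Python rendering of the programming style the FractalTensor abstract above describes (plain Python with NumPy, not FractalTensor's actual API), a vanilla RNN over a batch of variable-length sequences can be written as a map over the independent sequences and a scan over the timesteps within each sequence, making the nested data parallelism and the loop-carried dependence explicit.

```python
import numpy as np

def scan(f, init, xs):
    """Fold f over xs sequentially, yielding every intermediate state
    (the scan high-order operator mentioned in the abstract)."""
    state, out = init, []
    for x in xs:
        state = f(state, x)
        out.append(state)
    return out

def rnn_cell(W_h, W_x):
    return lambda h, x: np.tanh(h @ W_h + x @ W_x)

# A nested list: a batch (outer list) of sequences (inner lists) of
# fixed-shape tensors, mirroring the nested-list ADT described above.
hidden, feat = 16, 8
W_h = 0.1 * np.random.rand(hidden, hidden).astype(np.float32)
W_x = 0.1 * np.random.rand(feat, hidden).astype(np.float32)
batch = [[np.random.rand(feat).astype(np.float32) for _ in range(t)]
         for t in (5, 3, 7)]               # variable-length sequences

h0 = np.zeros(hidden, dtype=np.float32)
# map over independent sequences (data parallel); scan within each
# sequence (loop-carried dependence). Both patterns are explicit.
outputs = [scan(rnn_cell(W_h, W_x), h0, seq) for seq in batch]
print([len(o) for o in outputs])           # prints [5, 3, 7]
```

Because map exposes that the sequences are independent while scan exposes the dependence pattern within each sequence, a compiler can in principle parallelize, reorder, and fuse across these operators instead of treating each operator as an opaque kernel.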
@inproceedings{nnscaler24,title={nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training},author={Lin, Zhiqi and Miao, Youshan and Zhang, Quanlu and Yang, Fan and Zhu, Yi and Li, Cheng and Maleki, Saeed and Cao, Xu and Shang, Ning and Yang, Yilei and Xu, Weijiang and Yang, Mao and Zhang, Lintao and Zhou, Lidong},year={2024},booktitle={18th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/00010XMXMG0Z23,title={Welder: Scheduling Deep Learning Memory Access via Tile-graph},author={Shi, Yining and Yang, Zhi and Xue, Jilong and Ma, Lingxiao and Xia, Yuqing and Miao, Ziming and Guo, Yuxiao and Yang, Fan and Zhou, Lidong},year={2023},booktitle={17th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/ZhangMXSM0Z0Y23,title={Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning},author={Zhang, Chen and Ma, Lingxiao and Xue, Jilong and Shi, Yining and Miao, Ziming and Yang, Fan and Zhai, Jidong and Yang, Zhi and Yang, Mao},year={2023},booktitle={17th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/sosp/ZhengJZHM0YZQYZ23,title={{PIT:} Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation},author={Zheng, Ningxin and Jiang, Huiqiang and Zhang, Quanlu and Han, Zhenhua and Ma, Lingxiao and Yang, Yuqing and Yang, Fan and Zhang, Chengruidong and Qiu, Lili and Yang, Mao and Zhou, Lidong},year={2023},booktitle={Proceedings of the 29th Symposium on Operating Systems Principles, {SOSP}},}
@inproceedings{MLSYS2023_a10deb4d,title={Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning},author={Lin, Bin and Zheng, Ningxin and Wang, Lei and Cao, Shijie and Ma, Lingxiao and Zhang, Quanlu and Zhu, Yi and Cao, Ting and Xue, Jilong and Yang, Yuqing and Yang, Fan},year={2023},booktitle={Proceedings of Machine Learning and Systems},publisher={Curran},volume={5},pages={513--525},editor={Song, D. and Carbin, M. and Chen, T.},}
@inproceedings{DBLP:conf/osdi/ZhuWDKLZXMXC0YZ22,title={{ROLLER:} Fast and Efficient Tensor Compilation for Deep Learning},author={Zhu, Hongyu and Wu, Ruofan and Diao, Yijia and Ke, Shanbin and Li, Haoyu and Zhang, Chen and Xue, Jilong and Ma, Lingxiao and Xia, Yuqing and Cui, Wei and Yang, Fan and Yang, Mao and Zhou, Lidong and Cidon, Asaf and Pekhimenko, Gennady},year={2022},booktitle={16th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/ZhengLZMY0WYZ22,title={Spar{TA}: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute},author={Zheng, Ningxin and Lin, Bin and Zhang, Quanlu and Ma, Lingxiao and Yang, Yuqing and Yang, Fan and Wang, Yang and Yang, Mao and Zhou, Lidong},year={2022},booktitle={16th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/MaXYXMCHYZZ20,title={Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks},author={Ma, Lingxiao and Xie, Zhiqiang and Yang, Zhi and Xue, Jilong and Miao, Youshan and Cui, Wei and Hu, Wenxiang and Yang, Fan and Zhang, Lintao and Zhou, Lidong},year={2020},booktitle={14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/ZhangHYZLYZ20,title={Retiarii: {A} Deep Learning Exploratory-Training Framework},author={Zhang, Quanlu and Han, Zhenhua and Yang, Fan and Zhang, Yuge and Liu, Zhe and Yang, Mao and Zhou, Lidong},year={2020},booktitle={14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}