An AI compiler translates a neural network into low-level device code, e.g., CUDA. It plays a critical role in ensuring that neural networks scale efficiently.
In the past, we developed a series of compiler techniques that advocate a tile-based abstraction for canonical deep learning compilation on SIMT-based AI hardware (e.g., GPUs): Rammer (Ma et al., 2020), Roller (Zhu et al., 2022), Welder (Shi et al., 2023), and Cocktailer (Zhang et al., 2023). These techniques were covered in a Microsoft Research blog post, and the tile abstraction is now well recognized in the systems community.
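To make the tile abstraction concrete, here is a minimal NumPy sketch, purely an illustration rather than the actual code generation of Rammer or Roller: a matrix multiplication is decomposed into independent output tiles, each of which is a unit of work that a compiler can map onto a GPU execution unit.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Compute C = A @ B tile by tile. Each (i, j) output tile is an
    independent piece of work; the k loop is the per-tile reduction."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):            # rows of output tiles
        for j in range(0, N, tile):        # columns of output tiles
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):    # accumulate over K in tile-sized steps
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```

Choosing tile sizes that match the hardware's memory hierarchy and scheduling the independent tiles onto execution units are the kinds of decisions these compilers automate.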
In addition, we envisioned early on the importance of taking model sparsity into account in compilation, and developed the first sparsity-aware compilers, SparTA (Zheng et al., 2022), PIT (Zheng et al., 2023), and nmSPARSE (Lin et al., 2023), as well as compilers for low-bit neural models, e.g., Ladder (Wang et al., 2024). All of them are unified under the tile abstraction, as sketched below.
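As a toy illustration of how sparsity awareness composes with the tile abstraction, in the spirit of (but not taken from) SparTA and nmSPARSE, the sketch below skips weight tiles that a hypothetical block-level sparsity attribute marks as all zero, so pruned blocks cost neither memory traffic nor compute.

```python
import numpy as np

def block_sparse_matmul(X, W, block_mask, tile=32):
    """Compute Y = X @ W, skipping tiles of W whose entries are all zero
    according to `block_mask` (a hypothetical per-tile sparsity attribute)."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((M, N), dtype=X.dtype)
    for k in range(0, K, tile):
        for j in range(0, N, tile):
            if not block_mask[k // tile, j // tile]:
                continue                   # pruned tile: skipped entirely
            Y[:, j:j+tile] += X[:, k:k+tile] @ W[k:k+tile, j:j+tile]
    return Y

K, N, tile = 96, 64, 32
mask = np.random.rand(K // tile, N // tile) < 0.5              # keep roughly half the tiles
W = np.random.rand(K, N).astype(np.float32)
W *= np.kron(mask, np.ones((tile, tile), dtype=np.float32))    # zero out pruned tiles
X = np.random.rand(8, K).astype(np.float32)
assert np.allclose(block_sparse_matmul(X, W, mask, tile), X @ W, atol=1e-4)
```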
Our next focus is compiler techniques for AI hardware with distributed memory architectures (i.e., non-SIMT) (He et al., 2025), which we believe will be the future. Meanwhile, the programming interface of neural networks is an important topic closely related to compilation. We will continue to investigate new programming paradigms such as FractalTensor (Liu et al., 2024), which is designed for next-generation neural networks.
Interestingly, we observe that compiler techniques are also useful in distributed deep learning training and automated machine learning. Based on this observation, we developed nnScaler (Lin et al., 2024), a flexible and efficient distributed training framework, and NNI, a popular AutoML toolkit (Zhang et al., 2020).
@article{waferllm25,title={WaferLLM: A Wafer-Scale LLM Inference System},author={He, Congjie and Huang, Yeqi and Mu, Pei and Miao, Ziming and Xue, Jilong and Ma, Lingxiao and Yang, Fan and Mai, Luo},year={2025},journal={ArXiv},}
@inproceedings{ladder24,title={Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation},author={Wang, Lei and Ma, Lingxiao and Cao, Shijie and Zhang, Quanlu and Xue, Jilong and Shi, Yining and Zheng, Ningxin and Miao, Ziming and Yang, Fan and Cao, Ting and Yang, Yuqing and Yang, Mao},year={2024},booktitle={18th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
To speed up computation, deep neural networks (DNNs) usually rely on highly optimized tensor operators. Despite their effectiveness, tensor operators are often defined empirically with ad hoc semantics, which hinders analysis and optimization across operator boundaries. FractalTensor is a programming framework that addresses this challenge. At its core, FractalTensor is a nested list-based abstract data type (ADT), where each element is a tensor with a static shape or another FractalTensor (i.e., nested). DNNs are then defined by high-order compute operators like map/reduce/scan and data access operators like window/stride on FractalTensor. This way of defining DNNs explicitly exposes nested data parallelism and fine-grained data access patterns, opening new opportunities for whole-program analysis and optimization. To exploit these opportunities, the compiler extracts from the FractalTensor-based code a nested multi-dimensional dataflow graph called the Extended Task Dependence Graph (ETDG), which provides a holistic view of data dependencies across different granularities. The ETDG is then transformed into an efficient implementation through graph coarsening, data reordering, and access materialization. Evaluation on six representative DNNs, such as RNNs and FlashAttention, on an NVIDIA A100 shows that FractalTensor achieves speedups of up to 5.44x, and 1.97x on average, through a unified solution for diverse optimizations.
@inproceedings{FractalTensorSosp24,title={Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor},author={Liu, Siran and Qi, Chengxiang and Cao, Ying and Yang, Chao and Hu, Weifang and Shi, Xuanhua and Yang, Fan and Yang, Mao},year={2024},booktitle={{SOSP}},}
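As a rough Python rendering of the programming style the FractalTensor abstract above describes (plain Python with NumPy, not FractalTensor's actual API), a vanilla RNN over a batch of variable-length sequences can be written as a map over the independent sequences and a scan over the timesteps within each sequence, making the nested data parallelism and the loop-carried dependence explicit.

```python
import numpy as np

def scan(f, init, xs):
    """Fold f over xs sequentially, yielding every intermediate state
    (the scan high-order operator mentioned in the abstract)."""
    state, out = init, []
    for x in xs:
        state = f(state, x)
        out.append(state)
    return out

def rnn_cell(W_h, W_x):
    return lambda h, x: np.tanh(h @ W_h + x @ W_x)

# A nested list: a batch (outer list) of sequences (inner lists) of
# fixed-shape tensors, mirroring the nested-list ADT described above.
hidden, feat = 16, 8
W_h = 0.1 * np.random.rand(hidden, hidden).astype(np.float32)
W_x = 0.1 * np.random.rand(feat, hidden).astype(np.float32)
batch = [[np.random.rand(feat).astype(np.float32) for _ in range(t)]
         for t in (5, 3, 7)]               # variable-length sequences

h0 = np.zeros(hidden, dtype=np.float32)
# map over independent sequences (data parallel); scan within each
# sequence (loop-carried dependence). Both patterns are explicit.
outputs = [scan(rnn_cell(W_h, W_x), h0, seq) for seq in batch]
print([len(o) for o in outputs])           # prints [5, 3, 7]
```

Because map exposes that the sequences are independent while scan exposes the dependence pattern within each sequence, a compiler can in principle parallelize, reorder, and fuse across these operators instead of treating each operator as an opaque kernel.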
@inproceedings{nnscaler24,title={nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training},author={Lin, Zhiqi and Miao, Youshan and Zhang, Quanlu and Yang, Fan and Zhu, Yi and Li, Cheng and Maleki, Saeed and Cao, Xu and Shang, Ning and Yang, Yilei and Xu, Weijiang and Yang, Mao and Zhang, Lintao and Zhou, Lidong},year={2024},booktitle={18th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/00010XMXMG0Z23,title={Welder: Scheduling Deep Learning Memory Access via Tile-graph},author={Shi, Yining and Yang, Zhi and Xue, Jilong and Ma, Lingxiao and Xia, Yuqing and Miao, Ziming and Guo, Yuxiao and Yang, Fan and Zhou, Lidong},year={2023},booktitle={17th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/ZhangMXSM0Z0Y23,title={Cocktailer: Analyzing and Optimizing Dynamic Control Flow in Deep Learning},author={Zhang, Chen and Ma, Lingxiao and Xue, Jilong and Shi, Yining and Miao, Ziming and Yang, Fan and Zhai, Jidong and Yang, Zhi and Yang, Mao},year={2023},booktitle={17th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/sosp/ZhengJZHM0YZQYZ23,title={{PIT:} Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation},author={Zheng, Ningxin and Jiang, Huiqiang and Zhang, Quanlu and Han, Zhenhua and Ma, Lingxiao and Yang, Yuqing and Yang, Fan and Zhang, Chengruidong and Qiu, Lili and Yang, Mao and Zhou, Lidong},year={2023},booktitle={Proceedings of the 29th Symposium on Operating Systems Principles, {SOSP}},}
@inproceedings{MLSYS2023_a10deb4d,title={Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning},author={Lin, Bin and Zheng, Ningxin and Wang, Lei and Cao, Shijie and Ma, Lingxiao and Zhang, Quanlu and Zhu, Yi and Cao, Ting and Xue, Jilong and Yang, Yuqing and Yang, Fan},year={2023},booktitle={Proceedings of Machine Learning and Systems},publisher={Curran},volume={5},pages={513--525},editor={Song, D. and Carbin, M. and Chen, T.},}
@inproceedings{DBLP:conf/osdi/ZhuWDKLZXMXC0YZ22,title={{ROLLER:} Fast and Efficient Tensor Compilation for Deep Learning},author={Zhu, Hongyu and Wu, Ruofan and Diao, Yijia and Ke, Shanbin and Li, Haoyu and Zhang, Chen and Xue, Jilong and Ma, Lingxiao and Xia, Yuqing and Cui, Wei and Yang, Fan and Yang, Mao and Zhou, Lidong and Cidon, Asaf and Pekhimenko, Gennady},year={2022},booktitle={16th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/ZhengLZMY0WYZ22,title={Spar{TA}: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute},author={Zheng, Ningxin and Lin, Bin and Zhang, Quanlu and Ma, Lingxiao and Yang, Yuqing and Yang, Fan and Wang, Yang and Yang, Mao and Zhou, Lidong},year={2022},booktitle={16th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/MaXYXMCHYZZ20,title={Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks},author={Ma, Lingxiao and Xie, Zhiqiang and Yang, Zhi and Xue, Jilong and Miao, Youshan and Cui, Wei and Hu, Wenxiang and Yang, Fan and Zhang, Lintao and Zhou, Lidong},year={2020},booktitle={14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}
@inproceedings{DBLP:conf/osdi/ZhangHYZLYZ20,title={Retiarii: {A} Deep Learning Exploratory-Training Framework},author={Zhang, Quanlu and Han, Zhenhua and Yang, Fan and Zhang, Yuge and Liu, Zhe and Yang, Mao and Zhou, Lidong},year={2020},booktitle={14th {USENIX} Symposium on Operating Systems Design and Implementation, {OSDI}},}