An AI compiler translates neural networks into low-level device code, e.g., CUDA. It plays a critical role in ensuring that neural networks scale efficiently.
In the past, we developed a series of compiler techniques that advocate a tile-based abstraction for canonical deep learning compilation on SIMT-based AI hardware (e.g., GPUs), including Rammer (Ma et al., 2020), Roller (Zhu et al., 2022), Welder (Shi et al., 2023), and Cocktailer (Zhang et al., 2023). These techniques were covered in an MSR Research blog, and the tile abstraction is now well recognized in the systems community.
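For readers unfamiliar with the tile abstraction, the core idea is to decompose a tensor computation into fixed-shape tiles that map onto the parallel execution units and memory hierarchy of SIMT hardware. The following NumPy sketch of a tiled matrix multiply is purely illustrative of the concept; it is not the Rammer/Roller implementation, and the function name and parameters are our own.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Illustrative tile-based matmul. Each (i, j) output tile is an
    independent unit of work a compiler can map to a SIMT core, and
    each k-step tile pair is a unit of data movement and reuse."""
    M, K = A.shape
    K2, N = B.shape
    # For simplicity, assume all dimensions divide evenly by the tile size.
    assert K == K2 and M % tile == N % tile == K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):          # output tiles: parallel across cores
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for k in range(0, K, tile):  # reduction over K in tile-sized steps
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C
```

Because every operator is expressed over the same tile granularity, decisions such as fusion, scheduling, and memory placement can be made uniformly at the tile level rather than per operator.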
In addition, we envisioned early the importance of accounting for model sparsity in compiler techniques and developed the first sparsity-aware compilers, SparTA (Zheng et al., 2022), PIT (Zheng et al., 2023), and nmSPARSE (Lin et al., 2023), as well as compilers for low-bit neural models, e.g., Ladder (Wang et al., 2024). All of these are unified under the tile abstraction.
Our next focus is compiler techniques for AI hardware with distributed memory architectures (i.e., non-SIMT), which we believe will be the future. Meanwhile, the programming interface of neural networks is an important topic closely related to compiler techniques. We will continue to investigate new programming paradigms such as FractalTensor (Liu et al., 2024), which is designed for next-generation neural networks.
Interestingly, we have observed that compiler techniques are also useful in distributed deep learning training and automated machine learning. Based on this observation, we developed nnScaler (Lin et al., 2024), a flexible and efficient distributed training framework, and NNI (Zhang et al., 2020), a popular AutoML toolkit.
To speed up computation, deep neural networks (DNNs) usually rely on highly optimized tensor operators. Despite their effectiveness, tensor operators are often defined empirically with ad hoc semantics, which hinders analysis and optimization across operator boundaries. FractalTensor is a programming framework that addresses this challenge. At its core, FractalTensor is a nested list-based abstract data type (ADT), in which each element is either a tensor with a static shape or another FractalTensor (i.e., it is nested). DNNs are then defined by high-order compute operators like map/reduce/scan and data access operators like window/stride on FractalTensors. This way of defining DNNs explicitly exposes nested data parallelism and fine-grained data access patterns, opening new opportunities for whole-program analysis and optimization. To exploit these opportunities, the compiler extracts from FractalTensor-based code a nested multi-dimensional dataflow graph called the Extended Task Dependence Graph (ETDG), which provides a holistic view of data dependencies at different granularities. The ETDG is then transformed into an efficient implementation through graph coarsening, data reordering, and access materialization. Evaluation of six representative DNNs, such as RNNs and FlashAttention, on an NVIDIA A100 shows that FractalTensor achieves speedups of up to 5.44x, and 1.97x on average, through a unified solution for diverse optimizations.
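To make this programming model concrete, here is a minimal Python sketch of the nested-ADT style described above. The class and method names (FractalTensor, map, scan) and the RNN example are illustrative assumptions for exposition, not the actual FractalTensor API.

```python
# A minimal, hypothetical sketch of the FractalTensor programming style;
# names here are illustrative, not the real FractalTensor interface.
import numpy as np

class FractalTensor:
    """A nested list whose leaves are fixed-shape tensors (here: np.ndarray)."""
    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        # map exposes data parallelism: f is independent across elements.
        return FractalTensor([f(x) for x in self.items])

    def scan(self, f, init):
        # scan exposes a sequential dependence chain: each step consumes
        # the previous state, as in an RNN over time steps.
        state, out = init, []
        for x in self.items:
            state = f(state, x)
            out.append(state)
        return FractalTensor(out)

# An RNN cell over one time step: h' = tanh(x @ W + h @ U).
def rnn_cell(W, U):
    return lambda h, x: np.tanh(x @ W + h @ U)

# A batch of variable-length sequences: a FractalTensor of FractalTensors
# whose leaves are fixed-shape input vectors.
batch = FractalTensor([
    FractalTensor([np.random.randn(16) for _ in range(seq_len)])
    for seq_len in (5, 8, 3)
])

W, U = np.random.randn(16, 32), np.random.randn(32, 32)
h0 = np.zeros(32)

# The outer map is parallel across sequences; the inner scan is the
# sequential recurrence within each sequence.
hidden = batch.map(lambda seq: seq.scan(rnn_cell(W, U), h0))
```

Because map carries no cross-element dependence while scan carries an explicit one, the nested parallelism and fine-grained access patterns of such a program can be read directly off its structure (the information an ETDG-style analysis captures) rather than rediscovered from the ad hoc semantics of opaque operator libraries.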