Vector store
A vector store is an important storage system in the AI era. I believe it not only serves as a cache for retrieving precomputed results from neural networks, but is also becoming (or will become) a fundamental component at the core of neural models. For example, we found that attention, a fundamental mechanism in Transformer-like neural architectures, can be viewed as vector index traversal (Liu et al., 2024). This makes the computation of sparse attention much more efficient; for LLMs with a long context window, the speedup can reach one or more orders of magnitude.
Here are some of our thoughts on vector stores.
- Vector indices can be integrated with relational databases using relaxed monotonicity. (Zhang et al., 2023).
- A vector index can be updated incrementally. (Xu et al., 2023).
- Vector indices can be dense or sparse. Instead of being served by separate solutions, the two can be unified under one generic design. (Chen et al., 2024).
- Attention can be reformulated as vector retrieval, making sparse attention significantly more efficient. (Liu et al., 2024).
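To make the attention-as-retrieval view concrete, here is a minimal toy sketch (my own illustration, not the implementation from Liu et al., 2024): a query attends only to the top-k keys ranked by query-key similarity, which is precisely the lookup that a vector index accelerates. With k much smaller than the context length, the softmax and value aggregation touch only k rows instead of all of them.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Approximate single-query attention by attending only to the
    k keys most similar to q (as if retrieved from a vector index).

    q: (d,) query vector; K, V: (n, d) key and value matrices.
    """
    # Similarity search step: score all keys, keep the top-k indices.
    # A vector index would return these without scanning every key.
    scores = K @ q / np.sqrt(q.shape[0])          # (n,) scaled dot products
    idx = np.argpartition(scores, -k)[-k:]        # indices of the top-k keys

    # Softmax restricted to the retrieved keys (numerically stabilized).
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()

    # Aggregate only the retrieved values.
    return w @ V[idx]                             # (d,) attention output
```

Setting k equal to the context length recovers exact dense attention, so the approximation error is controlled entirely by how well the retrieval step finds the high-scoring keys.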
References
2024
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. In NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV), 2024.