Efficient Construction of RAG Systems with Massive Data

February 11, 2025

Using only the CPU of a mini PC, it took 16 hours to add 8 million titles to MyVectorDB, producing a 31GB file. With traditional methods, the result would likely have exceeded 100GB. This is only a preliminary check that the approach can handle large amounts of data reasonably quickly, and there is still significant room for improvement in actual performance.

What are the key challenges when scaling RAG systems to millions of documents?

How does the use of Faiss improve search performance in large datasets?

What are alternative approaches to enhance the quality of content retrieval in RAG systems?

Building a Retrieval-Augmented Generation (RAG) system efficiently is a significant challenge when the data runs to millions of entries, such as a database of 8 million titles with their full texts. Traditional methods, which split content into sentences and add them to a vector database one by one, are inefficient and demand vast storage and memory resources.

An alternative approach combines a preliminary similarity search with real-time analysis and synthesis. This strategy balances performance, storage, and retrieval quality. In practice, using only a CPU, one can index 8 million titles within 16 hours into a file of about 31GB, whereas conventional techniques would likely need more than 100GB.
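To put those figures in perspective, the short calculation below derives the implied ingestion rate and the average storage per title. It uses only the numbers quoted in this post; the one assumption is that "GB" means 10^9 bytes.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the text.
TITLES = 8_000_000
HOURS = 16
FILE_SIZE_GB = 31  # assumption: 1 GB = 10**9 bytes

titles_per_second = TITLES / (HOURS * 3600)
bytes_per_title = FILE_SIZE_GB * 10**9 / TITLES

print(f"{titles_per_second:.1f} titles/s")      # 138.9 titles/s
print(f"{bytes_per_title:.0f} bytes per title")  # 3875 bytes per title
```

So the run averaged roughly 139 titles per second and a little under 4KB per title, which covers both the stored record and its index entry.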

The implementation uses MyVectorDB, a custom vector database built on Faiss, which significantly speeds up search responses. Although the quality of the retrieved content still needs improvement, the method shows promise for delivering fast retrieval within practical storage and compute limits.

The accompanying code, particularly MyVectorDB, uses SQLite for storage and Faiss for indexing. Items are added in batches, which reduces the memory footprint and improves throughput, and concurrent processing further speeds up the data ingestion phase.
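The batching and concurrency pattern described above might look roughly like the following sketch. It is not the MyVectorDB source: the table layout, the `ingest` and `batched` helpers, and the `fake_embed` function (a hash standing in for a real embedding model) are all hypothetical, and the Faiss step is left out so the example runs with the standard library alone.

```python
import hashlib
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fake_embed(text: str) -> bytes:
    """Hypothetical stand-in for an embedding model: 32 deterministic bytes."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest(conn, titles, batch_size=1000, workers=4):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(id INTEGER PRIMARY KEY, title TEXT, vec BLOB)"
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(titles, batch_size):
            # Embed the batch concurrently; pool.map preserves input order.
            vecs = list(pool.map(fake_embed, batch))
            # One transaction per batch keeps memory use and commit overhead low.
            with conn:
                conn.executemany(
                    "INSERT INTO items (title, vec) VALUES (?, ?)",
                    zip(batch, vecs),
                )


conn = sqlite3.connect(":memory:")
titles = [f"title {i}" for i in range(2500)]
ingest(conn, titles, batch_size=1000)
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count)  # 2500
```

Only one batch of titles and vectors is held in memory at a time, so memory use stays flat as the dataset grows. In a full pipeline the same batch of vectors would also be added to the Faiss index before moving on.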

Testing reveals that while the initial setup is resource-intensive, the search performance is notably fast, and the conceptual tree generation is quick. Future improvements might focus on refining the embedding quality and exploring more sophisticated indexing strategies to further enhance retrieval accuracy and speed.