
How Can AI Handle My Large Codebase?


“My codebase is huge, I have thousands of repos. How can you really handle it for AI use cases, like with RAG? Won’t that be really slow?”

We often get questions about how to work with large codebases and feed the appropriate context to LLMs. Should you use RAG? Should you train or fine-tune a model? Are there other approaches for these vast data sets?

Managing large codebases efficiently for AI use cases is crucial. This post looks at Retrieval-Augmented Generation (RAG) and other retrieval techniques, and compares them with training models on massive code repositories.

The Challenge: Context and Scale

When dealing with hundreds or even thousands of repositories, understanding the context of the code, including its relationships and dependencies, is essential for accelerating development and ensuring quality. For AI, the challenge lies in retrieving relevant context quickly and accurately from such a vast amount of data.

RAG (Retrieval-Augmented Generation)

Some people question whether RAG can be fast across large codebases. In practice, it can be surprisingly fast. Qdrant, an open-source vector database commonly used for RAG, publishes benchmarks (link below) demonstrating impressive speeds: it can serve on the order of 20-50 GB of code and return results within about 10 milliseconds. Practical experiences vary, often landing around 50-100 milliseconds, but that is still remarkably fast for datasets of this size.
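As a rough illustration of what that retrieval step looks like, here is a minimal sketch using the qdrant-client Python package against an in-memory Qdrant instance. The collection name, vector size, and payload fields are placeholders, and in a real pipeline the vectors would come from a code embedding model rather than being hard-coded.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# In-memory instance for illustration; a real deployment would point at a Qdrant server.
client = QdrantClient(":memory:")

# Hypothetical collection of embedded code chunks (the 384-dim vector size is an assumption).
client.create_collection(
    collection_name="code_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Upsert a couple of placeholder chunks; real payloads would carry repo, file path, and line range.
client.upsert(
    collection_name="code_chunks",
    points=[
        PointStruct(id=1, vector=[0.11] * 384, payload={"path": "src/auth.py", "start_line": 10}),
        PointStruct(id=2, vector=[0.42] * 384, payload={"path": "src/billing.py", "start_line": 88}),
    ],
)

# Retrieval: embed the query (hard-coded here) and ask for the nearest code chunks.
hits = client.search(collection_name="code_chunks", query_vector=[0.12] * 384, limit=5)
for hit in hits:
    print(hit.payload["path"], hit.score)
```

The expensive part in practice is usually embedding and chunking the codebase up front, not the query itself, which is why the per-query latencies above stay in the tens of milliseconds.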

In many cases, symbol search can be even faster than RAG. A trigram-based symbol search is a notable example. It can search through 30GB of code in just 9 milliseconds (link below). Symbol search focuses on identifying relevant symbols, such as function names and class names, which can provide a more targeted and efficient way to navigate large codebases.
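To make the trigram idea concrete, here is a minimal sketch in Python. It is not how Zoekt is implemented (Zoekt is a heavily optimized Go engine); it only shows the core trick: index every three-character substring, then intersect posting lists to narrow candidate files before an exact scan.

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    """All 3-character substrings of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Build the index: trigram -> set of file paths containing it.
index: dict[str, set[str]] = defaultdict(set)
files = {
    "src/auth.py": "def check_password(user, password): ...",
    "src/billing.py": "def charge_card(card_number): ...",
}
for path, content in files.items():
    for gram in trigrams(content):
        index[gram].add(path)

def search(query: str) -> list[str]:
    """Intersect posting lists for the query's trigrams, then verify with an exact scan."""
    grams = trigrams(query)
    if not grams:
        return []
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return [p for p in candidates if query in files[p]]

print(search("check_password"))  # ['src/auth.py']
```

Because the index maps fixed-length substrings to files, lookups stay fast even as the corpus grows; the exact scan only runs over the small candidate set.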

Context Accuracy: Beyond Speed

While RAG and embeddings provide some context, the results depend on similarity search and on how the code is chunked, which can lead to incomplete or irrelevant matches. A dynamic symbol search approach, combined with Abstract Syntax Tree (AST) indexes, improves accuracy. This method involves:

  1. Dynamic Symbol Search: Quickly identifying relevant symbols and line numbers within milliseconds.
  2. AST Indexes: Using these indexes to find the relevant context, such as locating the class a function belongs to.

This dual approach ensures a comprehensive understanding of the code structure within a few hundred milliseconds, which is adequate for code review tasks. Although there is room for further optimization, the current performance is quite good.
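As a rough sketch of the second step, the snippet below uses Python's built-in ast module to index symbols together with their line numbers and enclosing class, so that a hit from a fast symbol search can be expanded into its surrounding context. The source string and symbol names are placeholders; this illustrates the idea rather than any particular product's implementation.

```python
import ast

source = '''
class PaymentService:
    def charge(self, amount):
        return amount * 1.02
'''

# Build a tiny AST index: symbol name -> (kind, line number, enclosing class).
tree = ast.parse(source)
symbol_index = {}
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        symbol_index[node.name] = ("class", node.lineno, None)
        for child in node.body:
            if isinstance(child, ast.FunctionDef):
                symbol_index[child.name] = ("function", child.lineno, node.name)
    elif isinstance(node, ast.FunctionDef) and node.name not in symbol_index:
        symbol_index[node.name] = ("function", node.lineno, None)

# A fast symbol search gives us a name; the AST index supplies the enclosing context.
kind, line, enclosing_class = symbol_index["charge"]
print(f"{kind} 'charge' at line {line}, defined in class {enclosing_class}")
```

In a multi-language setting the same pattern applies with a parser such as tree-sitter in place of Python's ast module.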

Looking Ahead: Expanding the Scope

One significant limitation is the context capacity of Large Language Models (LLMs): they can only attend to so much information at once, so choosing just the right context makes a real difference.
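One simple way to respect that limit is to rank candidate context and pack it greedily into a fixed token budget. The scores and the rough characters-per-token estimate below are assumptions for illustration; a real pipeline would use the model's tokenizer.

```python
def pack_context(chunks: list[tuple[float, str]], max_tokens: int = 8000) -> list[str]:
    """Greedily pack the highest-scoring chunks into a rough token budget.

    chunks: (relevance_score, text) pairs, e.g. from symbol search or RAG.
    Uses a crude ~4 characters-per-token estimate for sizing.
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = max(1, len(text) // 4)
        if used + cost > max_tokens:
            continue
        selected.append(text)
        used += cost
    return selected

# Example: two highly relevant chunks and one marginal one competing for a small budget.
context = pack_context([
    (0.92, "class PaymentService: ..."),
    (0.87, "def charge(self, amount): ..."),
    (0.30, "README boilerplate ..."),
], max_tokens=50)
print(context)
```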

Training and Fine-Tuning: Viability Concerns

Training models on massive codebases or fine-tuning them for specific contexts is not always cost-effective or practical, especially given how frequently the code changes. Instead, optimizing existing search methods, such as symbol search and AST indexes, offers a more viable solution for handling large codebases.

Conclusion: Balancing Promise and Reality

The integration of AI in managing large codebases holds immense promise, but it also comes with practical challenges. Balancing the long-term potential of AI with the immediate needs for enterprise-quality solutions requires a nuanced approach. By focusing on efficient search methods and understanding the limitations and capabilities of current AI models, we can make significant strides in optimizing code review and development processes.

Exciting times lie ahead as we continue to explore and refine these technologies, bridging the gap between AI’s potential and its real-world applications.

References

  1. Qdrant Benchmarks
  2. Zoekt Code Search Engine
  3. Bito’s AI Understanding Code
Amar Goel

Bito’s Co-founder and CEO. Dedicated to helping developers innovate to lead the future. A serial entrepreneur, Amar previously founded PubMatic, a leading infrastructure provider for the digital advertising industry, in 2006, serving as the company’s first CEO. PubMatic went public in 2020 (NASDAQ: PUBM). He holds a master’s degree in Computer Science and a bachelor’s degree in Economics from Harvard University.
