
How does Bito’s “AI that understands your code” work?


In most software development organizations, managing and understanding a complex codebase is challenging, and that’s just for humans. AI faces its own challenges that need to be thought through before it can understand code. At Bito, we believe that understanding code well and completely is one of the most important things we can enable for the AI-driven products we build. It’s critical for our developers to be able to access these capabilities as they write code, create tests, or track down bugs.

To help, Bito introduced an innovative approach with its Code Context Retrieval module. This tool is engineered to provide developers with a deeper understanding of their code by supplying relevant context about the code being analyzed. Its functionality is particularly useful for a wide range of AI operations, enhancing code comprehension and facilitating more informed coding decisions.

Bito’s Code Context Retrieval Module

The objective of Bito’s Code Context Retrieval module is to improve the language model’s understanding of code by providing detailed context. This is achieved through a combination of technologies including Abstract Syntax Tree (AST) parsing, Symbol indexing, and embedding vectors. Each of these components plays a vital role in breaking down the codebase into manageable, understandable pieces.

Symbol Indexing, Abstract Syntax Trees, and Embeddings

Symbol Indexing

Symbol indexing involves creating a searchable database of the symbols (variables, functions, classes, etc.) used in a codebase. This index allows developers and tools to quickly find where symbols are defined and where they are used, which underpins features like code navigation, refactoring, and understanding code dependencies. For example, when integrating a new feature or modifying existing code, such as updating the authenticateUser() function, it’s crucial to understand its usage and impact across the entire codebase. Symbol indexing pinpoints every place where authenticateUser() is used, even in less obvious files and modules.
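To make the idea concrete, here is a minimal sketch of a symbol index built with Python’s standard ast module. It is an illustration only, not Bito’s implementation: the two in-memory files, the build_symbol_index() helper, and the shape of the index are all assumptions made for this example.

```python
import ast
from collections import defaultdict

# Hypothetical two-file codebase, kept in memory so the sketch is self-contained.
SOURCES = {
    "auth.py": (
        "def authenticateUser(name, password):\n"
        "    return bool(name and password)\n"
    ),
    "views.py": (
        "from auth import authenticateUser\n"
        "\n"
        "def login(request):\n"
        "    if authenticateUser(request.user, request.password):\n"
        "        return 'ok'\n"
        "    return 'denied'\n"
    ),
}


def build_symbol_index(sources):
    """Map each symbol name to the places where it is defined and referenced."""
    index = defaultdict(lambda: {"definitions": [], "references": []})
    for path, code in sources.items():
        for node in ast.walk(ast.parse(code, filename=path)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                # A def or class statement introduces the symbol.
                index[node.name]["definitions"].append((path, node.lineno))
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                # A bare name being read counts as a reference.
                index[node.id]["references"].append((path, node.lineno))
    return index


index = build_symbol_index(SOURCES)
print(index["authenticateUser"])
```

For this input, the entry for authenticateUser holds one definition at auth.py line 1 and one reference at views.py line 4. The sketch only tracks bare names, so attribute access such as obj.authenticateUser() or aliased imports would need extra handling in a real index.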

Abstract Syntax Trees (AST)

AST parsing converts source code into an abstract syntax tree (see image), a tree that represents the syntactic structure of the code. Each node in the tree represents a construct occurring in the source code. Compilers and other tools use this representation to analyze, transform, and generate code. For authenticateUser(), AST parsing provides insight into the surrounding logical constructs, such as conditional blocks and function calls, offering the detailed view needed to assess the impact of changes.

Source: Wikipedia, https://en.wikipedia.org/wiki/Abstract_syntax_tree
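As a small illustration of what an AST exposes, the sketch below uses Python’s standard ast module to list the constructs that enclose each call to authenticateUser(). The sample code and the enclosing_constructs() helper are assumptions made for this example, not part of Bito’s module.

```python
import ast

# Hypothetical snippet to analyze.
CODE = """
def login(request):
    if authenticateUser(request.user, request.password):
        return "ok"
    return "denied"
"""


def enclosing_constructs(tree, target):
    """Return, for each call to `target`, the node types that enclose it."""
    results = []

    def visit(node, ancestors):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == target
        ):
            results.append([type(a).__name__ for a in ancestors])
        for child in ast.iter_child_nodes(node):
            visit(child, ancestors + [node])

    visit(tree, [])
    return results


tree = ast.parse(CODE)
print(enclosing_constructs(tree, "authenticateUser"))
# prints [['Module', 'FunctionDef', 'If']]
```

The result shows that the call sits inside an if statement within a function, which is the kind of structural context that helps when judging how a change to authenticateUser() would affect its callers.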

Embeddings

Embeddings, or vector representations of data, are a fundamental concept in machine learning and natural language processing that allow computers to process and understand complex data structures, like text or code, in a more human-like manner. By mapping high-dimensional data (such as words, sentences, or even entire code snippets) to vectors in a lower-dimensional space, embeddings capture semantic relationships and patterns within the data that aren’t immediately obvious.

The distance and direction between vectors reflect the relationships and similarities between their corresponding entities. For instance, in the context of code, embeddings can capture syntactic and semantic similarities between snippets, enabling a machine to understand code in a way that mirrors developer intuition. Using a similarity measure such as cosine similarity, you can find vectors that are close to each other, which indicates that the items they represent are similar.

A simplified plot of vector embeddings projected into 2 dimensions – Image Source: https://partee.io/2022/08/11/vector-embeddings/

In code search, embeddings can help developers find relevant code snippets across vast codebases by capturing the semantics of both the search query and the code, beyond simple keyword matching. Bito uses embeddings to interpret a natural language query and find relevant, similar snippets or code suggestions based on learned best practices. Returning to the earlier authentication example, if a user asks “how does my authentication system work?”, Bito turns the question into a vector and compares it to other vectors representing functions, chunks of code, and other objects. The vector for the authenticateUser() function would likely show up as similar to the vector for the user’s question, and the function would then be passed as context to the LLM. We say “likely show up” because embeddings work well but, unlike a technique like symbol search, don’t provide a complete guarantee of a match.
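The ranking step can be sketched in a few lines of Python. This is a toy example under stated assumptions: the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce (real embeddings have far more dimensions), and the function names hashPassword() and renderChart() as well as the top_matches() helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hand-written stand-ins for vectors an embedding model would produce.
SNIPPET_VECTORS = {
    "authenticateUser()": [0.9, 0.1, 0.0],
    "hashPassword()": [0.7, 0.3, 0.1],
    "renderChart()": [0.0, 0.2, 0.9],
}

# Stand-in vector for the question "how does my authentication system work?"
QUERY_VECTOR = [0.8, 0.2, 0.0]


def top_matches(query_vector, snippet_vectors, k=2):
    """Return the names of the k snippets most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in snippet_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


print(top_matches(QUERY_VECTOR, SNIPPET_VECTORS))
# prints ['authenticateUser()', 'hashPassword()']
```

Note that the result is a ranking, not an exact lookup: which snippets appear depends on how close the vectors happen to be, which is why embedding search offers a likely match and symbol search a guaranteed one.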

From Code Snippets to Semantic Insights

Bito’s workflow integrates AST parsing, Symbol indexing, and embedding vector databases for a multifaceted code analysis. The process starts with Symbol search identifying references to specific code snippets.

AST parsing then provides the structural and syntactical context of these snippets, such as their usage and dependency structure. Embedding vector databases add a semantic layer, identifying similar functionalities and suggesting improvements by analyzing the codebase semantically.

This integrated approach ensures a thorough understanding of code functionalities and their dependencies, which is essential for developers making informed modifications and improvements to the codebase.

Query Processing and APIs

The Code Context Retrieval module processes queries through a streamlined workflow. A query from Bito’s AI Code Review Agent is sent to the Code Context Retrieval module, which then distributes it to the Symbol, AST, and Embedding sub-modules. These sub-modules fetch relevant data, which is compiled into a unified response and returned via the Query API. This process ensures developers receive comprehensive insights based on their specific queries.
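The fan-out and merge described above might look roughly like the following sketch. Everything in it is hypothetical: the ContextResponse type, the three stub lookup functions, and retrieve_context() are invented for illustration and do not reflect Bito’s actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ContextResponse:
    """Unified response handed back to the caller."""
    query: str
    results: dict = field(default_factory=dict)


# Stub sub-modules. Each returns canned data so the sketch runs on its own;
# real ones would consult a symbol index, parsed ASTs, and a vector database.
def symbol_lookup(query):
    return [f"definition and references of {query}"]


def ast_lookup(query):
    return [f"constructs enclosing {query}"]


def embedding_lookup(query):
    return [f"snippets semantically similar to {query}"]


SUB_MODULES = {
    "symbol": symbol_lookup,
    "ast": ast_lookup,
    "embedding": embedding_lookup,
}


def retrieve_context(query, sub_modules=SUB_MODULES):
    """Send the query to every sub-module and merge the answers."""
    response = ContextResponse(query=query)
    for name, lookup in sub_modules.items():
        response.results[name] = lookup(query)
    return response


response = retrieve_context("authenticateUser")
print(sorted(response.results))
# prints ['ast', 'embedding', 'symbol']
```

Keeping each sub-module behind the same call shape means the results can be combined into one response without the caller needing to know which technique produced which piece of context.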

Bito offers two APIs to facilitate interaction with the Code Context Retrieval (CCR) module:

  1. Index API: allows for the indexing of codebases, providing a foundation for the retrieval of context.
  2. Query API: enables querying the indexed codebase to retrieve specific code contexts, including direct code snippets, dependency information, and semantic matches.
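The index-then-query lifecycle can be pictured with a tiny in-memory stand-in. This is not Bito’s API: the ContextStore class and its index() and query() methods are hypothetical names chosen for this sketch, and it only stores function source text rather than dependency information or semantic matches.

```python
import ast


class ContextStore:
    """Hypothetical in-memory stand-in for an index/query pair."""

    def __init__(self):
        self._functions = {}  # function name -> (path, source snippet)

    def index(self, path, code):
        """Index step: record the source of every function defined in `code`."""
        for node in ast.walk(ast.parse(code, filename=path)):
            if isinstance(node, ast.FunctionDef):
                snippet = ast.get_source_segment(code, node)
                self._functions[node.name] = (path, snippet)

    def query(self, symbol):
        """Query step: return (path, snippet), or None if never indexed."""
        return self._functions.get(symbol)


store = ContextStore()
store.index(
    "auth.py",
    "def authenticateUser(name, password):\n"
    "    return bool(name and password)\n",
)

path, snippet = store.query("authenticateUser")
print(path)
print(snippet)
print(store.query("renderChart"))  # never indexed, so this prints None
```

The point of the split is that querying is only as good as what has been indexed: a symbol that was never indexed simply has no context to return.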

Practical Applications

Bito’s Code Context Retrieval module has practical applications across various aspects of software development, such as:

  • Direct Code Context Retrieval: for modifications within functions or specific symbols, the module can identify and provide detailed context about the changes, aiding in understanding their scope and impact. 
  • Dependency Analysis: it identifies both internal and external dependencies, showing how pieces of code are interconnected within a project. 
  • Test Code Analysis: the module locates and analyzes test files related to specific modules, providing insights into test coverage and specifics. 
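Two of the applications above, dependency analysis and locating related tests, can be approximated with a short Python sketch. It rests on assumptions that are ours, not Bito’s: dependencies are read from import statements only, a module counts as internal if its top-level name is in a supplied set, and tests follow a test_<module>.py naming convention. All helper names are hypothetical.

```python
import ast
from pathlib import PurePosixPath


def imported_modules(code):
    """Return the sorted module names imported by `code`."""
    modules = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return sorted(modules)


def split_dependencies(code, project_modules):
    """Split imports into (internal, external) using the top-level name."""
    internal, external = [], []
    for module in imported_modules(code):
        top_level = module.split(".")[0]
        (internal if top_level in project_modules else external).append(module)
    return internal, external


def related_test_files(module_path, all_paths):
    """Find tests by the assumed naming convention test_<module>.py."""
    stem = PurePosixPath(module_path).stem
    wanted = f"test_{stem}.py"
    return [p for p in all_paths if PurePosixPath(p).name == wanted]


VIEWS_CODE = "import os\nfrom auth import authenticateUser\nimport requests\n"
PROJECT_FILES = ["src/auth.py", "src/views.py", "tests/test_auth.py", "tests/test_views.py"]

print(split_dependencies(VIEWS_CODE, {"auth", "views"}))
# prints (['auth'], ['os', 'requests'])
print(related_test_files("src/auth.py", PROJECT_FILES))
# prints ['tests/test_auth.py']
```

A production tool would go further, for example by resolving dynamic imports or by matching tests through the symbols they exercise rather than file names, but the sketch shows the kind of information these analyses surface.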

Conclusion

Bito’s Code Context Retrieval module is a powerful tool for developers navigating complex codebases. By integrating precise indexing, structural parsing, and semantic analysis, it offers a comprehensive understanding of code and its dependencies.

It enables developers to make informed decisions, ensuring code modifications and integrations are done with a full understanding of their implications.

Our next steps are to make these capabilities available across your entire codebase and to give any developer access to them for building their own AI features.

Anand Das


Amar Goel


Amar is the Co-founder and CEO of Bito. With a background in software engineering and economics, Amar is a serial entrepreneur and has founded multiple companies including the publicly traded PubMatic and Komli Media.

Written by developers for developers

This article was handcrafted by the Bito team.
