Let AI lead your code reviews

Published April 25, 2024

Evaluating AI Recall Accuracy: A Test of Various LLMs from OpenAI to Claude to Google’s Gemini 1.5M Context window

AI is increasingly tasked with dissecting vast amounts of data to find specific information related to a user’s request, effectively finding a digital needle in a haystack. At the same time, LLM providers have been steadily increasing their context window sizes and marketing them heavily as a major reason to use their models. Google’s latest Gemini 1.5 Pro currently takes the prize, boasting a 1,000,000 (1M) token context window, or about 1,500 pages of text. That is a lot of info!

The current popular thinking is that a huge context window means you can give the AI model a whole codebase or book and ask it a bunch of questions. At Bito, we are very focused on developers and their exacting needs related to code. For example, they might want to know every place in their codebase where a particular function is mentioned. So do these longer context windows actually deliver what everyone expects? We were hopeful, but wanted to get some data.

To explore how well different AI models and some traditional tools manage this challenge, I devised a simple test: have various tools count the occurrences of a word in a blog post. The newly launched Claude 3 family of models from Anthropic has a nice introductory blog post at https://www.anthropic.com/news/claude-3-family. I gave this blog post (leaving out the title and the footnotes) to the tools and asked them to count the occurrences of the word “Claude” in it.

The Test Setup

The blog post at https://www.anthropic.com/news/claude-3-family is moderately sized, roughly 5-6 pages of text with generous spacing. The entire post, along with my instructions to the LLM to count the uses of the word “Claude”, came to 2,023 tokens. That is a very small number of tokens when OpenAI’s GPT-4 Turbo offers a maximum 128k token context window, Claude 3 offers 200k, and Gemini 1.5 Pro offers a stunning 1M token context window.

Tools and Models in the Test

OpenAI GPT-4 Turbo: Reported 25 occurrences of the word “Claude” initially. After I asked again, it said 21 times.

Claude 3 Opus: Counted the word 18 times on its first attempt. A second request gave the same answer of 18.

Google Gemini-1.5-Pro-Preview 0409: Counted the word 27 times. The second time I asked, it said 28.

Google Docs: Used a symbol search, which leverages a keyword index, to provide the exact and correct answer: the word “Claude” appeared 37 times.

Microsoft Word: Using a symbol search method similar to Google Docs, it also found 37 occurrences.
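
For comparison, the kind of exact count the document editors return can be reproduced deterministically in a few lines. This is only a minimal sketch, not a description of how Google Docs or Microsoft Word work internally; the function name `count_word` and the sample text are illustrative assumptions.

```python
import re


def count_word(text: str, word: str) -> int:
    """Count whole-word, case-sensitive occurrences of `word` in `text`."""
    # \b anchors the match at word boundaries, so "Claude's" counts
    # but "Claudette" does not.
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return len(pattern.findall(text))


# Hypothetical sample text, not the actual blog post.
sample = "Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku. Claude's family."
print(count_word(sample, "Claude"))  # → 4
```

Because the match is a plain scan over the text, asking twice always gives the same number, which is exactly the property the LLMs lacked in this test.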

To scale the challenge and test recall reliability, I replicated the text six times, creating a document with 12,083 tokens to see how well these tools could manage a larger context window. Unfortunately,… well, read on to see what happened.

Scaled Test Results

Google Docs and Microsoft Word: Consistently accurate, each reported 222 occurrences, or 6 copies of the blog post times 37 occurrences of the word Claude per post.
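
The expected total follows directly from the replication, since an exact counter scales linearly with the number of copies. A small sketch of that arithmetic, using a hypothetical stand-in document rather than the real post:

```python
COPIES = 6
PER_POST = 37  # occurrences reported by Google Docs and Word for one copy

# Stand-in for the blog post: one sentence repeated so that
# "Claude" appears exactly PER_POST times.
post = " ".join(["Claude is mentioned here."] * PER_POST)
scaled = "\n".join([post] * COPIES)

expected = COPIES * PER_POST
actual = scaled.split().count("Claude")
print(expected, actual)  # → 222 222
```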

Google Gemini-1.5-Pro-Preview 0409: Showed inconsistency with counts of 44, 79, and 61 in successive tests.

OpenAI GPT-4 Turbo: Demonstrated some fluctuation with counts of 59 and 55.

Claude 3 Opus: Remained consistent in its own results, each time reporting 34 occurrences.

The LLMs performed very poorly on this test. Every model undercounted, and their results were roughly 65% to 85% below the true count of 222. That is a lot. Would you feel confident using a RAG-based approach to gather a lot of information and feed it to an LLM when you are looking for a precise answer?
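
The size of the miss can be checked from the counts reported above. The sketch below assumes "percent off" means the absolute difference from the true count divided by the true count; the helper name `percent_off` is illustrative.

```python
TRUE_COUNT = 222  # 6 copies x 37 occurrences

# Counts reported by each model in the scaled test.
reported = {
    "Gemini 1.5 Pro": [44, 79, 61],
    "GPT-4 Turbo": [59, 55],
    "Claude 3 Opus": [34],  # same answer on every attempt
}


def percent_off(count: int, truth: int = TRUE_COUNT) -> float:
    """Relative error as a percentage of the true count."""
    return abs(truth - count) / truth * 100


for model, counts in reported.items():
    errors = ", ".join(f"{percent_off(c):.1f}%" for c in counts)
    print(f"{model}: {errors}")
# Gemini 1.5 Pro: 80.2%, 64.4%, 72.5%
# GPT-4 Turbo: 73.4%, 75.2%
# Claude 3 Opus: 84.7%
```

Under that definition, the best single run (Gemini's 79) is still about 64% short, and the most consistent model (Claude 3 Opus) is also the furthest off.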

Insights and Takeaways

This exercise revealed several insights:

Precision of Traditional Tools: Tools like Google Docs and Microsoft Word, though not AI-based, proved highly reliable for this type of data recall, showing no variation in their results. Their use of symbol search was critical.
Variability Among AI Models: The AI models tested exhibited varying levels of recall accuracy. This variation might be attributed to differences in how models interpret and analyze large blocks of text.
Implications for Users: For tasks requiring high precision and consistency, traditional data processing tools currently outperform more complex AI models. However, the capabilities of AI extend beyond simple recall, potentially providing richer analysis and insights in less straightforward scenarios.

Through this experiment, we see a snapshot of where AI stands in its ability to sift through data and recall specific information. As AI technologies evolve, their capacity to handle such recall tasks with precision is likely to improve, enhancing their utility in more complex analytical applications. But the models of today, while technically capable of accepting longer context windows, do not process all that text well in situations that require precise recall.

At Bito, we have been working to take this into account in the agents we are building, and have been devising different tools to augment AI. For example, our AI that understands your code builds multiple different indexes of your code to provide exacting recall.
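
The article does not describe how those indexes are built. As a rough illustration of why an index gives exact recall, here is a minimal inverted index that maps each identifier to every file and line where it appears. The function `build_index`, the regular expression, and the two sample files are all hypothetical and are not Bito's implementation.

```python
import re
from collections import defaultdict

# Simplified notion of an identifier; real languages need a proper parser.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each identifier to every (filename, line number) where it appears."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for name, source in files.items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            for token in IDENTIFIER.findall(line):
                index[token].append((name, lineno))
    return index


# Hypothetical two-file codebase.
files = {
    "billing.py": "def charge(user):\n    return total(user)\n",
    "report.py": "from billing import charge\nprint(charge('amy'))\n",
}
index = build_index(files)
print(index["charge"])
# → [('billing.py', 1), ('report.py', 1), ('report.py', 2)]
```

A lookup in such an index returns every mention of a function by construction, so the model only has to reason over the results instead of finding them itself.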

Amar Goel

Bito’s Co-founder and CEO. Dedicated to helping developers innovate to lead the future. A serial entrepreneur with a background in software engineering and economics, Amar has founded multiple companies, including Komli Media and PubMatic, a leading infrastructure provider for the digital advertising industry, which he founded in 2006, serving as the company’s first CEO. PubMatic went public in 2020 (NASDAQ: PUBM). He holds a master’s degree in Computer Science and a bachelor’s degree in Economics from Harvard University.

Written by developers for developers

This article was handcrafted by the Bito team.
