Summary
This case study is drawn from the SWE-Bench Pro Evaluation, an independent benchmark conducted by The Context Lab that tests AI coding agents on real-world codebases. It shows how Tutanota’s encryption entropy refactor, a task whose correctness depends on updating every call site across 9 files, failed under Claude Sonnet 4.5, the baseline agent, because the agent missed references scattered through unrelated parts of the codebase. The same task succeeded when Claude Opus 4.6, augmented with deep codebase context from Bito’s AI Architect, ran one structured query that returned every entropy reference and then applied 31 edits across all 9 files.
The challenge
Tutanota’s encryption entropy management lived across three classes. WorkerImpl, LoginFacade, and EntropyCollector each carried pieces of entropy state and logic, and EntropyCollector was tightly coupled to WorkerClient through direct RPC calls. The refactor required centralizing this logic into a new EntropyFacade, following the existing facade pattern in the codebase, and updating every call site so nothing broke.
The correctness of this kind of refactor depends entirely on completeness. Miss one call site and tests fail.
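To make the target shape concrete, here is a minimal sketch of what centralizing entropy handling behind one facade could look like. It is not Tutanota’s actual EntropyFacade: the EntropyChunk and Randomizer shapes, the buffering behavior, and the class name EntropyFacadeSketch are all assumptions invented for illustration.

```typescript
// Hypothetical shapes, invented for illustration only.
interface EntropyChunk {
  source: string;
  entropyBits: number;
  data: number;
}

// Hypothetical sink that consumes collected entropy.
interface Randomizer {
  addEntropy(chunks: EntropyChunk[]): void;
}

// One owner for entropy state, instead of pieces spread over three classes.
class EntropyFacadeSketch {
  private pending: EntropyChunk[] = [];
  private readonly randomizer: Randomizer;

  constructor(randomizer: Randomizer) {
    this.randomizer = randomizer;
  }

  // Single entry point that every former call site would use.
  addEntropy(chunks: EntropyChunk[]): void {
    this.pending.push(...chunks);
  }

  // Hand buffered chunks to the randomizer, clear the buffer,
  // and report how many chunks were forwarded.
  flush(): number {
    const count = this.pending.length;
    if (count > 0) {
      this.randomizer.addEntropy(this.pending);
      this.pending = [];
    }
    return count;
  }
}
```

Once a class like this exists, every caller that previously reached into the old owners has to be redirected to it, which is why the refactor is only as good as the list of call sites behind it.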
Why the baseline agent failed
Claude Sonnet 4.5 opened the run with a TodoWrite and three Globs to identify the obvious owners.
TodoWrite → Glob("**/EntropyCollector.ts") → Glob("**/WorkerImpl.ts") → Glob("**/LoginFacade.ts")
→ Grep("entropy", -i, files_with_matches)
→ Read EntropyCollector.ts → Read WorkerImpl.ts → Read LoginFacade.ts
→ Grep("class.*Facade", path="src/api/worker/facades")
→ Read WorkerLocator.ts → Read CounterFacade.ts (as a sibling pattern reference)
That opening looks reasonable on the surface. The agent located the three classes carrying entropy state along with one sibling facade as a pattern reference, then wrote a new EntropyFacade.ts and edited WorkerImpl.ts, LoginFacade.ts, WorkerLocator.ts, EntropyCollector.ts, MainLocator.ts, and the test file.
The patch looked complete. The agent’s own final summary read:
TypeScript compilation passes without errors. Architecture follows existing patterns in the codebase. All entropy functionality preserved.
The verification step ran the test suite, and it failed. Examining the trace shows why. The agent never built a complete map of every place entropy is referenced. Entropy threads through worker.ts, WorkerClient.ts, MainLocator.ts, and types.d.ts in ways the file-by-file Read pattern failed to surface. The patch missed call sites, left the MainLocator.ts wiring only partially updated, and produced an inconsistent state.
The failure mode here is structural. Without a single query that returns every entropy reference in the codebase, the agent has to assemble that map by reading files in sequence, and it stops once it believes the work is done.
⚠ ROOT CAUSE. The baseline agent patched the obvious owners and trusted file-by-file Reads to surface the rest. In refactors whose correctness depends on touching every call site, a flat grep dump and a sequential read pattern leave a long tail of references untouched, and tests catch the gap.
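The gap described above can be expressed as a simple set difference between the files that contain references and the files a patch edited. The sketch below is a hypothetical helper, not part of the benchmark harness or of Bito’s tooling, and the file lists are bare names taken from this case study rather than full paths.

```typescript
// Hypothetical helper, not part of the benchmark harness or of Bito's tooling.
// Returns every file that contains a reference but received no edit.
function findUntouchedFiles(referencedFiles: string[], editedFiles: string[]): string[] {
  const edited = new Set(editedFiles);
  const uniqueReferenced = Array.from(new Set(referencedFiles)); // drop duplicates
  return uniqueReferenced.filter((file) => !edited.has(file)).sort();
}

// Files this case study names as carrying entropy references (bare names only).
const filesWithEntropyRefs = [
  "EntropyCollector.ts",
  "WorkerImpl.ts",
  "LoginFacade.ts",
  "worker.ts",
  "WorkerClient.ts",
  "MainLocator.ts",
  "types.d.ts",
];

// Source files the baseline run wrote or edited, as listed earlier in this section.
const baselineEditedFiles = [
  "EntropyFacade.ts",
  "WorkerImpl.ts",
  "LoginFacade.ts",
  "WorkerLocator.ts",
  "EntropyCollector.ts",
  "MainLocator.ts",
];

const untouchedByBaseline = findUntouchedFiles(filesWithEntropyRefs, baselineEditedFiles);
console.log(untouchedByBaseline);
```

A check like this only flags files that received no edit at all. It would not catch a partial edit such as the incomplete MainLocator.ts wiring, and not every file that mentions entropy necessarily needs a change, so it serves as a checklist rather than a verdict.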
How Bito’s AI Architect solved it
Claude Opus 4.6 with Bito’s AI Architect made 14 architect calls during the run. The first five established the complete reference map the baseline pattern failed to assemble.
Call 1, listRepositories. listRepositories(includeSummary=true) returned a 1.4 KB summary of four repositories with descriptions and primary languages. The agent confirmed tutanota was the right scope. The testbed sandbox provides no way to enumerate available repositories from inside, so this call has no local equivalent.
Call 2, searchSymbols on Entropy. searchSymbols(pattern="Entropy", caseSensitive=false) returned 0.8 KB listing every symbol whose name contained Entropy across the codebase, with file paths and line numbers. The results included EntropyCollector.ts, the indexed EntropyFacade.ts that was missing from /testbed, EntropyCollectorTest.ts, and several related symbols. The local equivalent would chain glob, grep, and read across multiple steps, and would often miss matches inside vendored or generated paths.
Call 3, searchCode on entropy. This call decided the task. searchCode(pattern="entropy", maxResults=100) returned 18 KB containing every code reference to entropy across the entire codebase, with file path, line number, and line content for each match. After this single call, the agent held a structured table of every place entropy was mentioned, including the references inside worker.ts, WorkerClient.ts, MainLocator.ts, and types.d.ts that the baseline agent missed.
A baseline agent has no way to produce this kind of evidence in one move. The closest grep equivalent is:
grep -r "entropy" /testbed --include="*.ts"
That returns a flat text dump. The agent then has to parse it, deduplicate, decide which matches matter, and read each referenced file. In practice, the baseline agent grepped for entropy once early in the run and never looped back to enumerate the matches. It pursued the three obvious class files and stopped.
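The parsing and grouping work described above can be sketched in a few lines. This is a hypothetical post-processing step, not something either agent ran. It assumes the grep command was given the additional -n flag so each row has the form path:line:content, and the sample rows in the usage example are made up for illustration.

```typescript
// Hypothetical helper types and functions; none of these names come from the
// benchmark or from Bito's tooling.
interface CodeReference {
  file: string;
  line: number;
  text: string;
}

// Parse `grep -rn` output ("path:line:content") into structured records.
function parseGrepOutput(raw: string): CodeReference[] {
  const refs: CodeReference[] = [];
  for (const row of raw.split("\n")) {
    const first = row.indexOf(":");
    const second = row.indexOf(":", first + 1);
    if (first < 0 || second < 0) continue; // skip blank or malformed rows
    const line = Number(row.slice(first + 1, second));
    if (!Number.isInteger(line) || line <= 0) continue; // not a line number
    refs.push({ file: row.slice(0, first), line, text: row.slice(second + 1).trim() });
  }
  return refs;
}

// Group references by file so each file becomes one checklist entry.
function groupByFile(refs: CodeReference[]): Map<string, CodeReference[]> {
  const byFile = new Map<string, CodeReference[]>();
  for (const ref of refs) {
    const bucket = byFile.get(ref.file) ?? [];
    bucket.push(ref);
    byFile.set(ref.file, bucket);
  }
  return byFile;
}

// Made-up sample rows, for illustration only.
const sampleGrepOutput = [
  "src/example/a.ts:12:this.entropy(chunks)",
  "src/example/a.ts:40:// flush entropy",
  'src/example/b.d.ts:7:| "entropy"',
].join("\n");

const referenceMap = groupByFile(parseGrepOutput(sampleGrepOutput));
console.log(Array.from(referenceMap.keys()));
```

Even with a helper like this, the agent still has to decide to build the map and then work through every entry, which is the step the baseline run skipped.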
Call 4, getCode on the indexed EntropyFacade.ts. getCode(file="src/common/api/worker/facades/EntropyFacade.ts") returned the full 86-line existing EntropyFacade.ts from the architect’s index. This file was missing from /testbed, so the architect surfaced the target state from the indexed version of the codebase. The agent used this as the implementation pattern to write into the new file. The local Read tool sees only /testbed and has no way to surface what a file should look like once a refactor lands.
Call 5, searchSymbols on facade convention. searchSymbols(pattern="Facade", kind="class", filePattern="*Facade*.ts", maxResults=50) returned 11 KB of every facade class definition in the codebase, including PQFacade, KyberFacade, BookingFacade, LoginFacade, and others. The agent then held the project’s facade convention, with sibling examples to model after for naming, constructor pattern, and method shape.
After call 5, the agent’s next message read, “Good, I have a clear picture of the old vs. new architecture. Let me now read the remaining key files.” That text turn marked the transition from mapping to implementation.
The remaining nine architect calls were targeted reads using getCode with explicit line ranges. For example, a call with startLine=270 and endLine=300 returned only the entropy command handler in WorkerImpl rather than the whole file.
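A ranged read is easy to picture as a slice over a file’s lines. The sketch below is a hypothetical local analogue, not the architect’s implementation, and it assumes that line numbers are 1-indexed and that the end line is inclusive, which the case study does not state.

```typescript
// Hypothetical local analogue of a ranged read; not the architect's implementation.
// Assumes 1-indexed line numbers with an inclusive end line.
function readLineRange(source: string, startLine: number, endLine: number): string {
  if (startLine < 1 || endLine < startLine) {
    throw new RangeError("expected 1 <= startLine <= endLine");
  }
  const lines = source.split("\n");
  // slice() takes a 0-indexed start and an exclusive end, hence the offsets.
  return lines.slice(startLine - 1, endLine).join("\n");
}
```

Under those assumptions, a request for lines 270 to 300 yields at most 31 lines of context instead of the full file, which keeps each response small.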
How the architect evidence shaped the patch
The treatment patch touched 9 files with 31 edits. Compared with the baseline patch of 7 files and 19 edits, it covered the long tail in four ways, each traceable to a specific architect call.
| File the baseline missed or under-patched | Why the baseline missed it | Architect call that surfaced it |
|---|---|---|
| src/api/worker/worker.ts | Lacked any Entropy in its name and fell outside the obvious owner set | Call 3, searchCode("entropy"), returned the matches in this file |
| src/api/main/WorkerClient.ts, where the entropy() method was removed | Baseline edited consumers without removing the now-dead RPC method | Call 3 located the method definition, and a follow-up getCode confirmed the deletion |
| src/types.d.ts, where "entropy" was removed from WorkerRequestType | Type-level reference, easy to miss with file-by-file reads | Call 3 surfaced the type-level reference |
| MainLocator.ts wiring, where the destructuring update needed to be complete | Baseline made a partial edit that failed to wire entropyFacade through | Call 12, getCode on LoginFacade.ts lines 220 to 230, showed the target wiring pattern |
Both runs created EntropyFacade.ts, with the baseline writing 2929 characters and the treatment writing 2630 characters, and the new file content was similar across runs. The outcome flipped because the treatment run correctly updated every place that used to talk to the old entropy logic, and that correctness came from the complete reference map returned by architect call 3.
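The type-level reference is worth a closer look, because a leftover union member lets stale code keep compiling. The sketch below illustrates that general TypeScript behavior under loud assumptions: only "entropy" is named in this case study, "otherRequest" is a placeholder member, and the handler table is invented rather than taken from Tutanota.

```typescript
// Placeholder union: only "entropy" is named in the case study,
// "otherRequest" stands in for the members not shown there.
type RequestTypeBefore = "otherRequest" | "entropy";

// The refactored union, with the dead member removed.
type RequestTypeAfter = Exclude<RequestTypeBefore, "entropy">;

// Typing the table with Record makes the compiler demand exactly one
// handler per member of the union and reject any extra key.
const handlersAfter: Record<RequestTypeAfter, () => string> = {
  otherRequest: () => "handled",
  // entropy: () => "stale", // compile error once "entropy" leaves the union
};

// Runtime guard derived from the same table.
function isKnownRequest(name: string): name is RequestTypeAfter {
  return Object.prototype.hasOwnProperty.call(handlersAfter, name);
}

console.log(isKnownRequest("entropy"));
```

With the member still present in the union, a stale handler or a stale sender would type-check without complaint. That is one way a patch can report a clean compile and still leave the code in an inconsistent state.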
KEY ARCHITECTURAL INSIGHT
Refactors whose correctness depends on completeness need a structured reference map before any edits land. Bito’s AI Architect’s searchCode returned every entropy reference across the codebase in one call, including the long tail inside files that share no naming convention with the refactor target. That single call turned a partial fix into a complete one.
Head-to-head comparison
| | Claude Sonnet 4.5, the baseline agent | Claude Opus 4.6 with Bito’s AI Architect |
|---|---|---|
| Code exploration | File-by-file reads of the obvious owner classes, with the long tail of references missed | One structured query returned every entropy reference across the codebase |
| Architect calls | None | 14, with the first five establishing the complete reference map |
| Files modified | 7 | 9 |
| Edits applied | 19 | 31 |
| Long tail coverage | Missed worker.ts, the entropy() method in WorkerClient.ts, and WorkerRequestType in types.d.ts, and left the MainLocator.ts wiring partial | All four covered correctly |
| Task outcome | FAILED, test suite caught the inconsistent state | PASSED, complete patch across every call site |
A note on the behavioral shift
Three things shifted together to produce this result. The Opus 4.6 model uses tools more precisely and chains calls more readily than Sonnet 4.5. The tool descriptions in Bito’s AI Architect guide the agent toward more appropriate calls. The architect’s results return targeted context rather than a single large dump, with median responses of 2.4 KB across treatment runs compared with the 24 KB orientation dumps that earlier agents pulled in one shot. Together, these shifts move the agent from treating Bito’s AI Architect as a one-shot orientation call to treating it as a working code graph it queries throughout the run.
Conclusion
In this Tutanota refactor, the difference between a passing and a failing test suite came down to whether the agent held a complete reference map before editing. The baseline agent patched the obvious owners and trusted file-by-file Reads to surface the rest. The long tail of references inside worker.ts, WorkerClient.ts, types.d.ts, and MainLocator.ts stayed invisible to that pattern, and tests caught the gap.
Bito’s AI Architect changed the agent’s behavior in two ways. First, one searchCode call returned a structured reference list that no chain of glob and grep can produce in a single move. Second, getCode on the indexed version of EntropyFacade.ts surfaced the target state of a file missing from the working tree, giving the agent an exact implementation pattern.
For engineering teams running coding agents on production codebases, refactors like this one are common and the failure mode is predictable. As call sites multiply across files that share no naming convention with the refactor target, the agent’s ability to build a complete map matters more than the model’s raw code generation quality. Bito’s AI Architect closes that gap by giving the agent a working code graph to query, with structured queries that return complete reference lists in one call.