Summary
This case study is drawn from the SWE-Bench Pro Evaluation, an independent benchmark conducted by The Context Lab that tests AI coding agents on real world codebases. It shows how a ProtonMail webclients task that depended on a helper module missing from the local checkout failed under Claude Sonnet 4.5, the baseline agent, after twelve minutes of fruitless attempts to bootstrap Jest. The same task succeeded when Claude Opus 4.6 augmented with deep codebase context from Bito’s AI Architect made 23 architect calls, retrieved the missing module’s reference implementation directly from the indexed codebase, and shipped a clean two file patch.
The challenge
The task required standardizing the mail metrics helper functions in the webclients monorepo. The work spanned applications/mail/src/app/metrics/mailMetricsHelper.ts and the shared utilities in packages/mail/helpers/location.ts, with consistent normalization of mailbox identifiers, page sizes, and metric labels across both. The shared utilities file was missing entirely from the local checkout.
The dependency chain made this hard. The helpers in mailMetricsHelper.ts depend on functions that the test file imports from @proton/mail/helpers/location, including isCategoryLabel, isCustomLabelOrFolder, getLabelNameAnonymised, and isStringHumanLabelID. The location.ts file containing them was absent from /testbed. The agent had to figure out what those functions were supposed to do with no reference to work from.
Why the baseline agent failed
Claude Sonnet 4.5 located mailMetricsHelper.ts, read it, and identified the gap. It made five Edits to that single file, adjusting getPageSizeString, getLabelID, pathnameToLabelName, and the test file. The patch shape was small and looked sensible on first read.
The work derailed at verification. The agent tried to run jest mailMetricsHelper.test.ts and got sh: 1: jest: not found. Twelve minutes and 21 Bash commands followed, all spent trying to bootstrap Jest.
yarn install → "Unrecognized configuration: enableLegacyPeerDeps"
cat /testbed/.yarnrc.yml → search for the legacy flag
yarn workspace @proton/components test → same error
npm install → tail of failures
find /testbed -name "jest.config*" → locate config
which yarn → bootstrap loop
None of those commands produced a working test runner. The agent ended by writing a custom Node script at /testbed/test-mail-metrics.js to verify the function logic manually, ran it, deleted it, and stopped.
The patch the baseline agent produced was incomplete. It edited mailMetricsHelper.ts, and never created the missing packages/mail/helpers/location.ts that the test file imports from. Tests failed with Cannot find module '@proton/mail/helpers/location'.
The structural failure is clear. Baseline had no way to know the missing file’s expected shape because the file was absent locally. Without a reference, the agent had no anchor for writing it from scratch, so it patched what it could see and left the import broken.
⚠ ROOT CAUSE. The baseline agent patched the visible code and trusted Jest verification to catch errors. The missing helper module fell outside what local file reads and grep could surface, and the agent had no reference implementation to follow when generating it. The import stayed broken, the test suite failed, and twelve minutes of tooling bootstrap produced nothing.
How Bito’s AI Architect solved it
Claude Opus 4.6 with Bito’s AI Architect made 23 architect calls during the run, the heaviest single use across the cohort of flipped tasks. The calls split into two phases.
Phase 1, discovery and target state retrieval, calls 1 to 16
The agent established what files needed to exist and what they should contain.
Calls 1 to 3, one searchSymbols and two searchCode, mapped the metrics module structure and located every consumer of the helper functions. Calls 4 to 8, all getCode, pulled the existing test file and the three consumer hooks named useMailPTTMetric, useMailELDTMetric, and useMailECRTMetric. These reads taught the agent the API surface the helpers needed to provide. Calls 9 and 10, both searchCode, searched for getLabelNameAnonymised and isCategoryLabel definitions in the indexed codebase.
Call 11 was the load bearing call. getCode(file="packages/mail/helpers/location.ts") returned the full content of the missing file from the indexed reference state. Every function the test imports lives there with its complete implementation. Baseline had no way to retrieve this content because the file is absent from the testbed checkout.
Calls 12 to 16, one searchCode, one searchSymbols, and three more getCode, pulled MAILBOX_LABEL_IDS from packages/shared/lib/constants.ts at specific line ranges. The reads confirmed which IDs existed in the system, including the CATEGORY_* series that affects normalization.
After call 16, the agent held every function signature it needed to provide, every consumer’s expected import path, and the constants those functions depend on.
Phase 2, convention lookup and spec generation, calls 17 to 23
The agent then queried for project conventions before writing any code.
Call 17, searchRepositories(query="webclients proton mail"), disambiguated the repo identifier for subsequent calls. Call 18 supplied the second irreplaceable piece of evidence. getFieldPath(repository="webclients", fieldPath="coding_standards") returned the project’s coding standards as a structured object, with accessibility rules, CI and CD setup, ESLint and Prettier config, testing patterns, and naming conventions. There is no local tool equivalent for this. Coding standards are usually implicit, scattered across files and config. Bito’s AI Architect surfaces them as queryable structured data.
Calls 19 to 23, one searchSymbols and four getCode, re-pulled the original files at specific line ranges to embed exact reference snippets into the implementation specs.
The agent then wrote two files.
packages/mail/helpers/location.ts came in at 1549 characters and contained isCustomLabelOrFolder, isCategoryLabel, getHumanLabelID, getLabelNameAnonymised, and isStringHumanLabelID. The module is self contained, defining its own constants locally to avoid importing from other yet missing shared modules. applications/mail/src/app/metrics/mailMetricsHelper.ts came in at 1484 characters with getPageSizeString, getLabelID, pathnameToLabelName, and the LabelType type export.
Both files are clean reimplementations whose interfaces match what the test expects, because the architect calls had retrieved the test’s import expectations and the consumers’ usage patterns before any code was written.
How the architect evidence shaped the patch
The contrast between the two runs is in what each side knew before writing code.
| Knowledge needed | Baseline source | Architect source |
|---|---|---|
Which functions does mailMetricsHelper.test.ts import? | Read the test file, which was present locally | Read the test file via call 5 getCode |
What is the expected signature of isCategoryLabel? | Unknown, the file was missing locally | Call 11 getCode on packages/mail/helpers/location.ts returned the indexed reference implementation |
Which MAILBOX_LABEL_IDS exist, including categories? | Search via grep, slow and scattered | Calls 13 and 16 getCode on the constants file at exact line ranges |
| What naming and style convention should the new code follow? | Inferred from sibling files | Call 18 getFieldPath on coding_standards returned the project’s standards as structured data |
Baseline had no substitute for call 11. The reference implementation of the missing helper module was unreachable through any local tool. That single architect call, combined with the consumer hook pulls in calls 6 to 8, gave the agent enough information to generate the missing module from scratch with a correct interface.
The patch is small, two files and roughly 3 KB of new code, and every byte of it depends on evidence the architect produced.
KEY ARCHITECTURAL INSIGHT
Tasks involving a missing or significantly diverged module in the local working copy depend on access to the reference state of the codebase. Bito’s AI Architect’s index reflects that reference state, and getCode against the index serves content that local Read has no way to retrieve. Call 11 returned the missing helper module’s complete implementation. Without it, the agent had no anchor for writing the file. With it, the rewrite became a straightforward translation of the reference into the working tree.
Head-to-head comparison
| Claude Sonnet 4.5, the baseline agent | Claude Opus 4.6 with Bito’s AI Architect | |
|---|---|---|
| Code exploration | File reads of the visible mailMetricsHelper.ts and test file | 23 architect calls, including reference reads of the missing module and the consumer hooks |
| Architect calls | None | 23, the heaviest single use across the cohort |
| Files written | 1 file edited, plus a custom Node script written and deleted | 2 files written, both with clean interfaces |
| Verification approach | 21 Bash commands across 12 minutes attempting to bootstrap Jest, ending in a custom verification script | Architect calls supplied the target state, so test runner bootstrap never reached the critical path |
| Convention awareness | Inferred from sibling files | getFieldPath returned project coding standards as structured data |
| Task outcome | FAILED, missing module left tests with Cannot find module errors | PASSED, both files written with interfaces matching test imports |
A note on the behavioral shift
The augmented run used Claude Opus 4.6 and the baseline used Claude Sonnet 4.5. Three things shifted together to produce this result. The Opus 4.6 model uses tools more precisely and chains calls more readily than Sonnet 4.5. Bito’s AI Architect’s tool descriptions guide the agent toward more appropriate calls. The architect’s results return targeted context, with median responses sitting at 2.4 KB across augmented runs compared to 24 KB orientation dumps that earlier agents pulled in one shot.
Eleven of the augmented cohort’s 138 architect calls were getFieldPath. Their value shows up exactly in cases like this one, where new code has to fit project conventions that are implicit in the codebase rather than written down in any single file.
Conclusion
In this webclients task, the difference between a passing and failing test suite came down to one fact. The agent needed access to a file that was absent from the local working tree. The baseline agent edited the visible file, spent twelve minutes trying to bootstrap a test runner that lacked the configuration it needed, and stopped without ever creating the missing module that consumers imported from.
Bito’s AI Architect changed the behavior in two ways. First, getCode on the missing file path returned its indexed reference implementation, giving the agent an anchor for writing the helper module from scratch. Second, getFieldPath on coding standards returned the project’s conventions as structured data, removing the need to infer style from scattered sibling files.
For engineering teams running coding agents on production codebases, missing or diverged modules show up often in real refactor work. As branches age and packages move between repositories, agents working from a partial checkout will reach situations where the file they need has no local copy. Bito’s AI Architect closes that gap by giving the agent a working code graph to query, with structured queries that return reference state content and project conventions in a single call.