Summary
This case study is drawn from the SWE-Bench Pro Evaluation, an independent benchmark conducted by The Context Lab that tests AI coding agents against real-world engineering tasks on production codebases. The task here required coordinated changes across 9 files to enforce a single security constraint. Claude Sonnet 4.5 (the baseline agent) fell into a 96-turn test loop, patched a handful of files inconsistently, and left the codebase exposed; the same agent, augmented with deep codebase context from Bito’s AI Architect, mapped every vulnerable location upfront, applied uniform protection across all 9 files, and passed on the first attempt.
The challenge
Unbounded reading of HTTP request and response bodies in Teleport created resource exhaustion risks: memory exhaustion from large payloads, DoS attack vectors, and CPU consumption from oversized data. The fix required a ReadAtMost utility that enforces a 10MB limit across every HTTP body read location in the codebase.
The scope was the real challenge: 9 files across authentication services, OAuth/OIDC providers, SAML handling, database AWS operations, and core HTTP helpers, all needing consistent error handling and import management.
Why the baseline agent failed
Claude Sonnet 4.5 got stuck in a repetitive testing loop: 96 conversation turns running the same failing tests without diagnosing the root cause. It modified 3–4 files partially, left unused imports that broke compilation, and never achieved consistent error handling across all vulnerable locations.
The agent’s trial-and-error pattern (Attempt → Test → Fail → Retry Same Approach) consumed all its capacity without systematic progress.
⚠ ROOT CAUSE: The coding agent lacked scope awareness to identify all 9 vulnerable HTTP body read locations and fell into a test loop trap, repeating failed approaches without diagnosing root causes or systematically covering all files.
How Bito’s AI Architect solved it
Bito’s AI Architect approaches a task by first building a knowledge graph of the codebase before writing a single line of code. This knowledge graph is a structured, queryable map of the system: which modules exist, what functions they expose, where those functions are called, how data flows between them, and what patterns the codebase uses for error handling, imports, and utility abstractions. Rather than reading files in sequence or relying on keyword search, the knowledge graph lets the agent reason about the system as a whole, tracing dependencies and identifying patterns across module boundaries.
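Purely as an illustration of the idea (these types, names, and paths are invented for this sketch and are not Bito’s actual data model), a queryable codebase knowledge graph can be modeled as symbol nodes joined by relationship edges:

```go
package main

// Node represents a symbol in the codebase (illustrative model only).
type Node struct {
	ID   string // e.g. a package-qualified function name
	Kind string // "function", "module", "type", ...
}

// Edge represents a relationship between two nodes.
type Edge struct {
	From, To string // node IDs
	Rel      string // "calls", "imports", "defines", ...
}

// Graph is the whole-system map the text describes.
type Graph struct {
	Nodes map[string]Node
	Edges []Edge
}

// Callers returns every node with a "calls" edge into target: the kind
// of query that surfaces all HTTP body read sites in one pass, instead
// of file-by-file exploration or keyword search.
func (g *Graph) Callers(target string) []string {
	var out []string
	for _, e := range g.Edges {
		if e.Rel == "calls" && e.To == target {
			out = append(out, e.From)
		}
	}
	return out
}
```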
Bito’s AI Architect enabled a structured, comprehensive approach: first implement the core ReadAtMost utility with dual-layer error handling (I/O errors vs. limit violations), then systematically apply it across all 9 files with consistent patterns.
The agent maintained an internal coverage matrix, mapping every HTTP body read location (httplib, apiserver, clt, github, oidc, saml, aws, conn) and checking off each as it was protected. Clean import management removed the deprecated io/ioutil package once the refactoring made it unnecessary.
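For context on the import cleanup: `io/ioutil` has been deprecated since Go 1.16, and its unbounded `ReadAll` is exactly the pattern the task replaces. A bounded substitute can be sketched like this (`readBounded` is an illustrative stand-in, not the task’s actual utility):

```go
package main

import (
	"io"
	"strings"
)

// Pre-Go-1.16 code often read bodies with ioutil.ReadAll(resp.Body),
// which is both deprecated and unbounded. readBounded never buffers
// more than limit bytes: io.ReadAll over a LimitReader stops at the cap.
func readBounded(r io.Reader, limit int64) ([]byte, error) {
	return io.ReadAll(io.LimitReader(r, limit))
}

// demo exercises readBounded on an in-memory payload.
func demo() (string, error) {
	data, err := readBounded(strings.NewReader("oversized body"), 4)
	return string(data), err
}
```

Note that plain `io.LimitReader` silently truncates rather than reporting a violation, which is why a ReadAtMost-style wrapper with an explicit limit error is the stronger pattern for security fixes.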
KEY ARCHITECTURAL INSIGHT
Security fixes demand complete coverage: one missed location is a vulnerability. The AI Architect’s systematic approach (Analyze → Plan All Changes → Implement All → Test → Pass) replaced the baseline agent’s trial-and-error loop with first-pass completeness.
Head-to-head comparison
| | Claude Sonnet 4.5 (baseline agent) | Bito’s AI Architect |
| --- | --- | --- |
| Code Exploration | Partial — found some HTTP read locations, missed others | Complete — Bito mapped all 9 vulnerable locations upfront |
| Methodology | Trial-and-error loop — 96 turns running same failing tests | Systematic coverage matrix — one pass, all files protected |
| Files Modified | 3–4 files (partial, with import errors) | 9 files (complete, clean imports) |
| Error Handling | Single error type — treats all errors the same | Dual-layer — distinguishes I/O errors from limit violations |
| Task Outcome | FAILED — security gaps remained after 96 turns | PASSED — complete protection across all locations |
Conclusion
In this Teleport task, the difference between failure after 96 turns and first-pass success wasn’t model capability. It was system understanding. Security hardening is only as strong as its weakest point, and when a fix requires changes across 9 files, partial coverage is no coverage.
The baseline agent (Claude Sonnet 4.5) worked file by file without a map of the full system, so it could never know what it had missed. Bito’s AI Architect builds a knowledge graph of the codebase before making any changes, giving the agent complete scope awareness upfront. That’s what turned a trial-and-error loop into a single, systematic pass.
For engineering teams, the implication extends beyond security: as codebases grow, the bottleneck in cross-cutting changes isn’t writing the code. It’s knowing where every relevant location is and coordinating changes consistently across the system. That requires a model of the system, not just access to individual files.