The 9-File Security Hardening That Coding Agents Missed and AI Architect Nailed

Summary 

This case study is drawn from the SWE-Bench Pro Evaluation, an independent benchmark conducted by The Context Lab that tests AI coding agents against real-world engineering tasks on production codebases. The task was to enforce a security constraint that required coordinated changes across 9 files. Claude Sonnet 4.5 (the baseline agent) fell into a 96-turn test loop, patched a handful of files inconsistently, and left the codebase exposed. The same agent, augmented with deep codebase context from Bito’s AI Architect, mapped every vulnerable location upfront, applied uniform protection across all 9 files, and passed on the first attempt.

The challenge 

Unbounded reading of HTTP request and response bodies in Teleport created resource exhaustion risks: memory exhaustion from large payloads, DoS attack vectors, and CPU consumption from oversized data. The fix required a ReadAtMost utility that enforces a 10MB limit across every HTTP body read location in the codebase. 

The scope was the real challenge: 9 files across authentication services, OAuth/OIDC providers, SAML handling, database AWS operations, and core HTTP helpers, all needing consistent error handling and import management. 

Why the baseline agent failed 

Claude Sonnet 4.5 got stuck in a repetitive testing loop: 96 conversation turns running the same failing tests without diagnosing the root cause. It modified 3–4 files partially, left unused imports that broke compilation, and never achieved consistent error handling across all vulnerable locations. 

The agent’s trial-and-error pattern (Attempt → Test → Fail → Retry Same Approach) consumed all its capacity without systematic progress.

⚠ ROOT CAUSE: The coding agent lacked scope awareness to identify all 9 vulnerable HTTP body read locations and fell into a test loop trap, repeating failed approaches without diagnosing root causes or systematically covering all files.

How Bito’s AI Architect solved it 

Bito’s AI Architect builds a knowledge graph of the codebase before writing a single line of code. This knowledge graph is a structured, queryable map of the system: which modules exist, what functions they expose, where those functions are called, how data flows between them, and what patterns the codebase uses for error handling, imports, and utility abstractions. Rather than reading files in sequence or relying on keyword search, the agent can use the knowledge graph to reason about the system as a whole, tracing dependencies and identifying patterns across module boundaries.

Bito’s AI Architect enabled a structured, comprehensive approach: first implement the core ReadAtMost utility with dual-layer error handling (I/O errors vs. limit violations), then systematically apply it across all 9 files with consistent patterns. 

The agent used a mental coverage matrix, mapping every HTTP body read location (httplib, apiserver, clt, github, oidc, saml, aws, conn) and checking off each as it was protected. Clean import management removed deprecated io/ioutil after refactoring. 

KEY ARCHITECTURAL INSIGHT 

Security fixes demand complete coverage: one missed location is a vulnerability. The systematic approach enabled by Bito’s AI Architect (Analyze → Plan All Changes → Implement All → Test → Pass) replaced the coding agent’s trial-and-error loop with first-pass completeness.

Head-to-head comparison 

| | Claude Sonnet 4.5 (baseline agent) | Bito’s AI Architect |
|---|---|---|
| Code Exploration | Partial: found some HTTP read locations, missed others | Complete: Bito mapped all 9 vulnerable locations upfront |
| Methodology | Trial-and-error loop: 96 turns running the same failing tests | Systematic coverage matrix: one pass, all files protected |
| Files Modified | 3–4 files (partial, with import errors) | 9 files (complete, clean imports) |
| Error Handling | Single error type: treats all errors the same | Dual-layer: distinguishes I/O errors from limit violations |
| Task Outcome | FAILED: security gaps remained after 96 turns | PASSED: complete protection across all locations |

Conclusion 

In this Teleport task, the difference between failure after 96 turns and first-pass success wasn’t model capability. It was system understanding. Security hardening is only as strong as its weakest point, and when a fix requires changes across 9 files, partial coverage is no coverage. 

The baseline agent (Claude Sonnet 4.5) worked file by file without a map of the full system, so it could never know what it had missed. Bito’s AI Architect builds a knowledge graph of the codebase before making any changes, giving the agent complete scope awareness upfront. That’s what turned a trial-and-error loop into a single, systematic pass. 

For engineering teams, the implication extends beyond security: as codebases grow, the bottleneck in cross-cutting changes isn’t writing the code. It’s knowing where every relevant location is and coordinating changes consistently across the system. That requires a model of the system, not just access to individual files. 

Anand Das

Anand is Co-founder and CTO of Bito. He leads technical strategy and engineering, and is our biggest user! Formerly, Anand was CTO of Eyeota, a data company acquired by Dun & Bradstreet. He is co-founder of PubMatic, where he led the building of an ad exchange system that handles over 1 Trillion bids per day.

Amar Goel

Amar is the Co-founder and CEO of Bito. With a background in software engineering and economics, Amar is a serial entrepreneur and has founded multiple companies including the publicly traded PubMatic and Komli Media.

This article is brought to you by the Bito team.