Bito’s AI Architect tops SWE-Bench Pro Evaluation for Long-Horizon Software Tasks 

Bito’s AI Architect achieves the highest resolve rate on SWE-Bench Pro, raising task success from 43.6% to 60.8% (a 39.4% relative improvement) while completing tasks faster and at no additional cost. The gains concentrate on large repositories and multi-file tasks, where coding agents typically fail.

This evaluation was conducted in collaboration with The Context Lab, an independent third party that runs agent evaluations in a tightly controlled, measurement-driven environment.  

The same coding agent was evaluated on SWE-Bench Pro with and without AI Architect’s system-level codebase context, while holding the model class, agent setup, tools, and execution constraints constant. Access to structured codebase context was the only variable. 

SWE-Bench Pro: Model Performance Comparison

For teams using coding agents on real-world codebases, this result shows that system-level codebase intelligence directly improves reliability on complex engineering work. This post presents a high-level summary of the evaluation. 

The problem: Why coding agents fail on large codebases 

Coding agents handle local edits well. They fail when tasks require coordinated changes across files, components, and layers. 

The failure pattern is consistent. Agents rely on unguided exploration and shallow dependency signals to infer system behavior. On long-horizon, multi-file work, this leads to partial fixes, missed propagation paths, and stalled execution. 

This is a system-level limitation. Improving models alone does not address it. 

Why SWE-Bench Pro and what we evaluated 

SWE-Bench Pro reflects professional-grade software engineering work, where success depends on long-horizon reasoning, multi-file coordination, and preserving system-wide behavior under strict test criteria. 

We evaluated performance on: 

  • 293 SWE-Bench Pro tasks 
  • From the five largest public repositories by size and file count 
  • Using the same Claude Sonnet 4.5–class coding agent 

We ran the agent twice. One run used standard repository access. The other used AI Architect’s system-level codebase intelligence via MCP. 

Evaluation methodology and success criteria 

We held the model, agent scaffold, tools, instructions, execution limits, and environment constant across both runs. The structured codebase context via AI Architect was the only variable. 

All tasks follow SWE-Bench Pro criteria. A task resolves only when all failing tests pass, and all previously passing tests remain passing. Resolve rate serves as the primary metric, with execution time, tool calls, and AI cost used to assess efficiency. 
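The resolve criterion above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation harness itself: the function names, the `fail_to_pass` and `pass_to_pass` argument names, and the test identifiers are all hypothetical, and a test with no recorded result is assumed to count as a failure.

```python
from typing import Dict, Iterable, List


def is_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    results: Dict[str, bool],
) -> bool:
    """Return True only if every originally failing test now passes
    and every originally passing test still passes.

    `results` maps a test id to True (passed) or False (failed).
    A test missing from `results` is treated as failed (an assumption).
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    preserved = all(results.get(test, False) for test in pass_to_pass)
    return fixed and preserved


def resolve_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty list."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical task: the target bug is fixed, but an old test regressed.
regressed = is_resolved(
    fail_to_pass=["test_bug"],
    pass_to_pass=["test_existing"],
    results={"test_bug": True, "test_existing": False},
)
print(regressed)  # False: a partial fix does not count as resolved
```

The all-or-nothing check is what makes partial fixes count as failures: an agent that repairs the target behavior but breaks an unrelated test scores the same as one that changes nothing.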

Results: AI Architect improves task success by 39.4% 

Under identical execution conditions, AI Architect materially improves coding agent performance on SWE-Bench Pro by increasing task success, expanding the complexity of solvable problems, and improving execution efficiency without added cost. 

1. Improvement in task success rate 

AI Architect increases task success from 43.6% to 60.8%, a 39.4% relative improvement on SWE-Bench Pro. This lift reflects a higher ability to complete long-horizon tasks that require coordinated changes across multiple files while preserving system-wide correctness under strict test criteria. 
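To make the arithmetic explicit, the sketch below separates the absolute gain (in percentage points) from the relative gain (in percent), using the two resolve rates reported above. The helper names are illustrative assumptions, not part of the evaluation tooling.

```python
def absolute_gain(baseline_pct: float, treated_pct: float) -> float:
    """Difference between two rates, in percentage points."""
    return treated_pct - baseline_pct


def relative_gain(baseline_pct: float, treated_pct: float) -> float:
    """Improvement expressed as a percentage of the baseline rate."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    return 100.0 * (treated_pct - baseline_pct) / baseline_pct


baseline, with_context = 43.6, 60.8  # resolve rates from the evaluation
print(f"absolute gain: {absolute_gain(baseline, with_context):.1f} percentage points")
print(f"relative gain: {relative_gain(baseline, with_context):.1f}%")
# absolute gain: 17.2 percentage points
# relative gain: 39.4%
```

The 39.4% figure is therefore the relative lift; in absolute terms the resolve rate rises by 17.2 percentage points.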

2. Complexity ceiling breakthrough 

AI Architect enables agents to successfully resolve classes of tasks that baseline executions fail to complete. 

  • 4.5× increase in resolved tasks involving 10+ file changes 
  • Zero baseline successes on 15+ file changes, versus four completed with AI Architect 
  • 3.8× higher resolve-rate lift in repositories larger than 1.5M LOC 

These results indicate an expanded capability ceiling rather than an incremental gain, with the largest improvements appearing in large, highly interconnected codebases. 

3. Execution efficiency and scaling with complexity 

Higher correctness is accompanied by more directed execution. 

  • 19.6% faster task completion on average 
  • 25.4% fewer tool calls, indicating reduced unguided exploration 

Efficiency gains increase with task complexity, suggesting that structured codebase intelligence helps agents converge earlier on correct solution paths instead of exhaustively searching the repository. 

4. Cost impact 

Performance gains do not come with additional compute overhead. 

  • Average AI cost per task remains statistically flat (−1.4%) 
  • On complex, multi-file tasks, cost often decreases due to fewer retries and faster convergence 

This indicates that AI Architect improves both effectiveness and efficiency, particularly on the tasks that dominate agent failure modes in large systems. 

Standout example: AI Architect completes a 58,000-line refactor that the baseline agent fails 

One SWE-Bench Pro task required refactoring calendar functionality in one of the repositories, a nearly 1 GB codebase. The change touched 412 files and 58,000+ lines, spanning utilities, recurrence logic, alarms, encryption, and mail integrations. 

This task failed under baseline execution. With AI Architect enabled, the agent completed the refactor successfully, preserving system behavior across all tests. 

Why this task is hard for coding agents 

Architectural refactors differ from bug fixes or localized changes. They require: 

  • Identifying all files involved, not just the obvious ones 
  • Understanding module boundaries and cross-cutting dependencies 
  • Planning the refactor holistically before writing code 
  • Avoiding partial fixes that break behavior elsewhere 

Baseline agents attempt to infer this structure through file-by-file exploration, which rarely converges on large systems. 

Results for this task, relative to the baseline run 

  • 27% faster execution 
  • 50% fewer tool calls 
  • 44% lower cost 
  • 412 files modified, 58,666 lines changed 

This example shows the difference between local exploration and system-level codebase intelligence. AI Architect enables large-scale architectural changes that baseline agents cannot complete. 

What this evaluation establishes 

This evaluation shows that system-level codebase intelligence is a primary driver of coding agent performance on large software systems. On SWE-Bench Pro, AI Architect improves correctness, speed, and reliability under identical model and execution settings. 

The impact increases with scale. As tasks span more files and larger repositories, agents without architectural context fail more often. With AI Architect, those same tasks succeed. 

For teams using coding agents in production, the takeaway is simple. Model quality alone is not enough. Reliable execution on real codebases requires system-level codebase intelligence. 

Amar Goel

Bito’s Co-founder and CEO. Dedicated to helping developers innovate to lead the future. A serial entrepreneur, Amar previously founded PubMatic, a leading infrastructure provider for the digital advertising industry, in 2006, serving as the company’s first CEO. PubMatic went public in 2020 (NASDAQ: PUBM). He holds a master’s degree in Computer Science and a bachelor’s degree in Economics from Harvard University.

