Bito’s AI Architect achieves the highest resolve rate on SWE-Bench Pro, raising task success from 43.6% to 60.8% (a 39.4% relative improvement) while completing tasks faster and at no additional cost. The gains concentrate on large repositories and multi-file tasks, where coding agents typically fail.
This evaluation was conducted in collaboration with The Context Lab, an independent third party that runs agent evaluations in a tightly controlled, measurement-driven environment.
The same coding agent was evaluated on SWE-Bench Pro with and without AI Architect’s system-level codebase context, while holding the model class, agent setup, tools, and execution constraints constant. Access to structured codebase context was the only variable.

For teams using coding agents on real-world codebases, this result shows that system-level codebase intelligence directly improves reliability on complex engineering work. This post presents a high-level summary of the evaluation.
The problem: Why coding agents fail on large codebases
Coding agents handle local edits well. They fail when tasks require coordinated changes across files, components, and layers.
The failure pattern is consistent. Agents rely on unguided exploration and shallow dependency signals to infer system behavior. On long-horizon, multi-file work, this leads to partial fixes, missed propagation paths, and stalled execution.
This is a system-level limitation. Improving models alone does not address it.
Why SWE-Bench Pro and what we evaluated
SWE-Bench Pro reflects professional-grade software engineering work, where success depends on long-horizon reasoning, multi-file coordination, and preserving system-wide behavior under strict test criteria.
The evaluation covered:
- 293 SWE-Bench Pro tasks
- Drawn from the five largest public repositories by size and file count
- Run with the same Claude Sonnet 4.5–class coding agent
We ran the agent twice. One run used standard repository access. The other used AI Architect’s system-level codebase intelligence via MCP.
Evaluation methodology and success criteria
We held the model, agent scaffold, tools, instructions, execution limits, and environment constant across both runs. The structured codebase context via AI Architect was the only variable.
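To make the single-variable design concrete, here is a minimal sketch of how two run configurations could be checked so that they differ in exactly one field. Every field name and value below is a hypothetical placeholder chosen for illustration; none of them are the actual settings used in the evaluation.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one evaluation run (illustrative fields only)."""
    model: str
    agent_scaffold: str
    tools: tuple[str, ...]
    instructions: str
    max_steps: int
    environment: str
    codebase_context: bool  # the one variable under test


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Return the names of the fields whose values differ between two configs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values: assumptions for the sketch, not the real setup.
baseline = RunConfig(
    model="sonnet-4.5-class",
    agent_scaffold="same-scaffold",
    tools=("shell", "editor"),
    instructions="same-instructions",
    max_steps=100,
    environment="same-container",
    codebase_context=False,
)

# The treatment run copies the baseline and flips only the context flag.
treatment = replace(baseline, codebase_context=True)

# Guard against accidental drift between the two runs.
assert differing_fields(baseline, treatment) == ["codebase_context"]
```

Deriving the treatment config from the baseline with `replace`, rather than writing it out a second time, makes it harder for an unrelated setting to change unnoticed.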
All tasks follow SWE-Bench Pro criteria. A task counts as resolved only when every previously failing test passes and every previously passing test still passes. Resolve rate is the primary metric; execution time, tool calls, and AI cost are used to assess efficiency.
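The resolve criterion above can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's actual harness: the function names and the two dictionaries (test name mapped to whether it passed after the agent's patch) are hypothetical.

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Return True only if both groups of tests pass after the agent's patch.

    fail_to_pass: tests that failed before the patch and are expected to pass now.
    pass_to_pass: tests that passed before the patch and must keep passing.
    Both arguments are hypothetical structures used for illustration.
    """
    target_tests_fixed = all(fail_to_pass.values())
    no_regressions = all(pass_to_pass.values())
    return target_tests_fixed and no_regressions


def resolve_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved, given one boolean outcome per task."""
    if not outcomes:
        raise ValueError("resolve rate is undefined for an empty task list")
    return sum(outcomes) / len(outcomes)


# A patch that fixes the target test but breaks an existing one is not a resolve.
partial_fix = is_resolved(
    fail_to_pass={"test_new_behavior": True},
    pass_to_pass={"test_existing_behavior": False},
)
assert partial_fix is False
```

Because both conditions must hold, a partial fix that makes the target tests pass but breaks behavior elsewhere counts as a failure.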
Results: AI Architect improves task success by 39.4%
Under identical execution conditions, AI Architect materially improves coding agent performance on SWE-Bench Pro by increasing task success, expanding the complexity of solvable problems, and improving execution efficiency without added cost.
1. Improvement in task success rate
AI Architect increases task success from 43.6% to 60.8%, a gain of 17.2 percentage points and a 39.4% relative improvement on SWE-Bench Pro. This lift reflects a greater ability to complete long-horizon tasks that require coordinated changes across multiple files while preserving system-wide correctness under strict test criteria.
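For readers who want to check the arithmetic, the short sketch below computes the absolute and relative lift from the two reported resolve rates. The helper names are illustrative; the only inputs are the 43.6% and 60.8% figures stated above.

```python
def absolute_lift(baseline_pct: float, treated_pct: float) -> float:
    """Difference between two rates, in percentage points."""
    return treated_pct - baseline_pct


def relative_lift(baseline_pct: float, treated_pct: float) -> float:
    """Improvement expressed as a percentage of the baseline rate."""
    return (treated_pct - baseline_pct) / baseline_pct * 100


baseline_rate = 43.6  # resolve rate without AI Architect (%)
treated_rate = 60.8   # resolve rate with AI Architect (%)

points = round(absolute_lift(baseline_rate, treated_rate), 1)
relative = round(relative_lift(baseline_rate, treated_rate), 1)

print(points)    # 17.2
print(relative)  # 39.4
```

The 39.4% figure is relative to the baseline rate, so it is larger than the 17.2-point difference between the two rates.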
2. Complexity ceiling breakthrough
AI Architect enables agents to successfully resolve classes of tasks that baseline executions fail to complete.
- 4.5× increase in resolved tasks involving 10+ file changes
- Zero baseline successes on 15+ file changes, versus four completed with AI Architect
- 3.8× higher resolve-rate lift in repositories larger than 1.5M LOC
These results show a capability ceiling expansion rather than incremental tuning, with the largest gains appearing in large, highly interconnected codebases.
3. Execution efficiency and scaling with complexity
Higher correctness is accompanied by more directed execution.
- 19.6% faster task completion on average
- 25.4% fewer tool calls, indicating reduced unguided exploration
Efficiency gains increase with task complexity, suggesting that structured codebase intelligence helps agents converge earlier on correct solution paths instead of exhaustively searching the repository.
4. Cost impact
Performance gains do not come with additional AI cost.
- Average AI cost per task remains statistically flat (−1.4%)
- On complex, multi-file tasks, cost often decreases due to fewer retries and faster convergence
This indicates that AI Architect improves both effectiveness and efficiency, particularly on the tasks that dominate agent failure modes in large systems.
Standout example: AI Architect enables a 58,000-line refactor that the baseline agent fails to complete
One SWE-Bench Pro task required refactoring calendar functionality in one of the repositories, a nearly 1 GB codebase. The change touched 412 files and 58,000+ lines, spanning utilities, recurrence logic, alarms, encryption, and mail integrations.
This task failed under baseline execution. With AI Architect enabled, the agent completed the refactor successfully, preserving system behavior across all tests.
Why this task is hard for coding agents
Architectural refactors differ from bug fixes or localized changes. They require:
- Identifying all files involved, not just the obvious ones
- Understanding module boundaries and cross-cutting dependencies
- Planning the refactor holistically before writing code
- Avoiding partial fixes that break behavior elsewhere
Baseline agents attempt to infer this structure through file-by-file exploration, which rarely converges on large systems.
Results
- 27% faster execution
- 50% fewer tool calls
- 44% lower cost
- 412 files modified, 58,666 lines changed
This example shows the difference between local exploration and system-level codebase intelligence. AI Architect enables large-scale architectural changes that baseline agents cannot complete.
What this evaluation establishes
This evaluation shows that system-level codebase intelligence is a primary driver of coding agent performance on large software systems. On SWE-Bench Pro, AI Architect improves correctness, speed, and reliability under identical model and execution settings.
The impact increases with scale. As tasks span more files and larger repositories, agents without architectural context fail more often. With AI Architect, far more of those tasks succeed.
For teams using coding agents in production, the takeaway is simple. Model quality alone is not enough. Reliable execution on real codebases requires system-level codebase intelligence.