Bito’s AI Architect drives Claude Opus 4.6 to a 70% task success rate on SWE-Bench Pro, a 35% relative lift over the 51.9% that Opus 4.6 scores running standalone. This sets a new top mark on the benchmark, well ahead of the next best result of 45.2% from Claude Sonnet 4.5 with no deep context layer in place.
AI Architect is now more capable, with sharper tool descriptions and schemas, more precise context retrieval, and a new capability that returns indexed reference implementations directly from the knowledge graph. The lift comes from these capability gains in AI Architect, combined with Claude Opus 4.6 chaining MCP tool calls more reliably than earlier model generations.
This result follows our first SWE-Bench Pro evaluation, which established AI Architect as a code intelligence layer for coding agents.
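The 35% figure is a relative lift over the standalone score, not a gap in percentage points. A quick check of the arithmetic, using only the two scores quoted above (the function name is ours, for illustration):

```python
def relative_lift(with_tool: float, baseline: float) -> float:
    """Relative improvement of with_tool over baseline, as a fraction."""
    return (with_tool - baseline) / baseline


with_architect = 70.0  # % of tasks resolved, Opus 4.6 with AI Architect
standalone = 51.9      # % of tasks resolved, Opus 4.6 running standalone

points = with_architect - standalone              # absolute gap in percentage points
lift = relative_lift(with_architect, standalone)  # gap relative to the baseline

print(f"{points:.1f} points, {lift:.1%} relative lift")
# prints: 18.1 points, 34.9% relative lift
```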

How the coding agent used AI Architect in this evaluation
AI Architect builds a knowledge graph of your entire codebase, with each repository analyzed across 160+ data points spanning 15+ dimensions, from service topology and dependency graphs to architectural patterns, implementation standards, and deployment context. Every function, class, interface, and type is indexed and cross-referenced, with millisecond symbol resolution.
Coding agents call AI Architect through MCP when a task needs this kind of system-level understanding: the context that does not surface from local file reads.
In this evaluation, the coding agent uses AI Architect as a working knowledge graph it returns to throughout the run. Across all the SWE-Bench Pro tasks drawn from the five largest public repositories, the agent makes a median of 8 MCP calls per task.
Each response averages 2.4 KB, scoped to the specific question the agent asked. The agent chains those calls into multi-step lookup sequences that build a complete map of the task before any code gets written.
The chaining shape stays consistent across tasks:
- Confirm which repository the work is happening in
- Locate where a class or function is defined
- Find every reference to that symbol across the codebase
- Pull the exact code at each location through the new code retrieval capability
- Read the project conventions the new code needs to follow
By the time the agent begins editing, it holds the full call graph, the target patterns to extend, and the conventions the new code needs to fit. This is what produces the 70% resolve rate. The agent walks into each edit with system-level context, not a partial picture inferred from local exploration.
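To make the chaining shape concrete, here is a minimal sketch of the first four steps run against an in-memory stub. The tool names mirror the calls quoted elsewhere in this post. The stub class, its return shapes, the sample files, and the `build_task_map` helper are all assumptions for illustration, not AI Architect's actual API. The conventions step is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class StubArchitect:
    """Hypothetical in-memory stand-in for an MCP client (not Bito's API)."""
    files: dict[str, list[str]]                     # path -> source lines
    calls: list[str] = field(default_factory=list)  # log of tool names invoked

    # Method names are camelCase only to mirror the tool names in the post.
    def listRepositories(self) -> list[str]:
        self.calls.append("listRepositories")
        return ["demo-repo"]

    def searchSymbols(self, pattern: str) -> list[tuple[str, int]]:
        """(path, line) for each class/def line that names the pattern."""
        self.calls.append("searchSymbols")
        return [(path, n) for path, lines in self.files.items()
                for n, text in enumerate(lines, start=1)
                if text.lstrip().startswith(("class ", "def ")) and pattern in text]

    def searchCode(self, pattern: str) -> list[tuple[str, int]]:
        """(path, line) for every line that mentions the pattern."""
        self.calls.append("searchCode")
        return [(path, n) for path, lines in self.files.items()
                for n, text in enumerate(lines, start=1) if pattern in text]

    def getCode(self, file: str, startLine: int, endLine: int) -> list[str]:
        self.calls.append("getCode")
        return self.files[file][startLine - 1:endLine]


def build_task_map(client: StubArchitect, symbol: str) -> dict:
    """Chain the lookups: repo -> definitions -> references -> exact code."""
    repo = client.listRepositories()[0]
    definitions = client.searchSymbols(symbol)
    references = client.searchCode(symbol)
    snippets = {(path, n): client.getCode(path, n, n) for path, n in references}
    return {"repo": repo, "definitions": definitions,
            "references": references, "snippets": snippets}


client = StubArchitect(files={
    "entropy.py": ["class EntropyFacade:", "    def add(self, n): ..."],
    "worker.py": ["from entropy import EntropyFacade", "facade = EntropyFacade()"],
})
task_map = build_task_map(client, "EntropyFacade")
print(task_map["definitions"])  # [('entropy.py', 1)]
print(task_map["references"])   # [('entropy.py', 1), ('worker.py', 1), ('worker.py', 2)]
```

The point of the sketch is the ordering: each call narrows the next one, so the map is complete before any edit starts.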
Three forces behind the 35% lift
The lift comes from three reinforcing factors. Together they explain the result.
Sharper tool callability
- AI Architect’s tool descriptions and schemas are now sharper, so the coding agent picks the right tool more reliably. Tool descriptions and schemas drive whether the coding agent invokes a tool at all.
- When the description fits the agent’s reasoning pattern, the agent reaches for AI Architect through MCP rather than falling back to native code exploration with grep and glob.
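As an illustration of what a tool description and schema look like, the sketch below uses the general shape MCP defines for tools: a name, a natural-language description, and a JSON Schema for the arguments. The parameter names come from the calls quoted in this post. The description text, the choice of `pattern` as the only required argument, and the small `check_arguments` helper are our assumptions, not AI Architect's published schema.

```python
# Hypothetical tool descriptor; wording and "required" list are assumptions.
SEARCH_SYMBOLS_TOOL = {
    "name": "searchSymbols",
    "description": (
        "Find where classes, functions, interfaces, and types are defined. "
        "Use this before editing a symbol to get file paths and line numbers."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string", "description": "Symbol name or fragment."},
            "caseSensitive": {"type": "boolean"},
            "kind": {"type": "string", "description": "For example 'class'."},
            "filePattern": {"type": "string", "description": "Glob such as '*Facade*.ts'."},
        },
        "required": ["pattern"],
    },
}

# Only the two JSON Schema types used above are handled in this sketch.
_JSON_TYPES = {"string": str, "boolean": bool}


def check_arguments(tool: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call fits the schema."""
    schema = tool["inputSchema"]
    problems = [f"missing required argument: {name}"
                for name in schema.get("required", []) if name not in arguments]
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown argument: {name}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            problems.append(f"{name} should be {spec['type']}")
    return problems


print(check_arguments(SEARCH_SYMBOLS_TOOL, {"pattern": "Entropy", "caseSensitive": False}))
# prints: []
print(check_arguments(SEARCH_SYMBOLS_TOOL, {"caseSensitive": "no"}))
# prints: ['missing required argument: pattern', 'caseSensitive should be boolean']
```

The description is the part the agent reads when deciding whether to call the tool at all, which is why its wording matters as much as the schema.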
More precise context retrieval
- When the coding agent calls AI Architect, the MCP responses are now scoped to the specific question the agent asked. A symbol search returns a definitions index. A code search returns a complete reference list. A code retrieval call returns exactly the lines requested.
- Sharper retrieval produces a more useful answer to the same agent question, the way a stronger search engine returns better results for the same query.
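A rough sketch of what "exactly the lines requested" means for a code retrieval call. We assume 1-indexed, inclusive line ranges here; the post does not specify the indexing, and `get_code` is a hypothetical local function, not the real tool.

```python
def get_code(source: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line of source (1-indexed, inclusive)."""
    if start_line < 1 or end_line < start_line:
        raise ValueError("expected 1 <= start_line <= end_line")
    lines = source.splitlines()
    return "\n".join(lines[start_line - 1:end_line])


# A 500-line stand-in file; real responses would carry actual source code.
source = "\n".join(f"line {n}" for n in range(1, 501))
snippet = get_code(source, 270, 300)

print(len(snippet.splitlines()), "of", len(source.splitlines()), "lines returned")
# prints: 31 of 500 lines returned
```

Returning the slice instead of the whole file is what keeps each response small enough to chain many of them in one task.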
Stronger model tool chaining
- Claude Opus 4.6 chains tool calls more reliably than earlier model generations. The coding agent uses six AI Architect tools in regular rotation, builds a structured map across MCP calls, and follows up on partial information instead of stopping at the first response.
- Some of the lift is attributable to the model itself becoming more skilled at exercising the infrastructure available to it.
Standout example: The tutanota entropy refactor
One SWE-Bench Pro task asked the coding agent to centralize encryption entropy management into a new facade in the tutanota repository, updating every call site so nothing breaks. Correctness depends on completeness. A single missed call site fails tests.
The coding agent made 14 AI Architect MCP calls before the first edit. Five of those calls did the load-bearing work.
- listRepositories(includeSummary=true) returned 1.4 KB of repo summaries, confirming tutanota as the right scope
- searchSymbols(pattern="Entropy", caseSensitive=false) returned every entropy-named definition in the codebase with file paths and line numbers
- searchCode(pattern="entropy", maxResults=100) returned 18 KB containing every code reference to entropy across the entire codebase, the structured map that decided the task
- getCode(file="src/common/api/worker/facades/EntropyFacade.ts") returned the indexed facade as the target pattern to extend
- searchSymbols(pattern="Facade", kind="class", filePattern="*Facade*.ts") returned 11 KB of every facade class as sibling examples for project convention
The remaining nine MCP calls were targeted getCode reads with specific line ranges, pulling exactly the sections the agent needed to edit, for example startLine=270, endLine=300 for the entropy command handler. The agent shipped the patch with all tests passing.
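Because a single missed call site fails the task, a complete reference list doubles as a checklist. The sketch below shows one way such a check could work, assuming search hits arrive as (file, line) pairs. The function, the data shape, and the file paths are hypothetical placeholders, not tutanota paths or the agent's actual procedure.

```python
def unvisited_files(reference_hits, edited_files):
    """Files that a code search flagged but the patch never touched, sorted."""
    referenced = {path for path, _line in reference_hits}
    return sorted(referenced - set(edited_files))


# Placeholder search hits and edit list, for illustration only.
hits = [("worker/a.ts", 12), ("worker/a.ts", 40), ("main/b.ts", 7), ("main/c.ts", 91)]
edited = ["worker/a.ts", "main/b.ts"]

print(unvisited_files(hits, edited))
# prints: ['main/c.ts']
```

Not every reference needs an edit (a comment that mentions entropy, for instance), so a non-empty result is a prompt to look, not proof of a bug.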
What this evaluation establishes
AI Architect becomes more valuable as coding agents get better. The deeper AI Architect tools were available to earlier model generations. Claude Opus 4.6 used them, chained them through MCP, and resolved 35% more tasks because of it.
For engineering teams evaluating AI tooling, the practical implication runs against the common assumption. Better models do not reduce the need for context infrastructure. Better models exercise the infrastructure they have access to, and the gains compound when that infrastructure returns precise answers to specific questions.
The full evaluation details, repository selection, task distribution, and complete results live on our SWE-Bench Pro benchmarks page.