Bito’s AI Architect Boosts Claude Opus 4.6 by 35% on SWE-Bench Pro 

Bito’s AI Architect drives Claude Opus 4.6 to a 70% task success rate on SWE-Bench Pro, a 35% relative lift (18.1 percentage points) over Opus 4.6 running standalone at 51.9%. This sets a new top mark on the benchmark, well ahead of the next best result of 45.2% from Claude Sonnet 4.5 with no deep context layer in place.
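For readers checking the headline number, here is a minimal sketch of the arithmetic. It uses only the two success rates quoted above; the function name is ours and purely illustrative.

```python
def relative_lift(with_context: float, baseline: float) -> float:
    """Relative improvement of with_context over baseline, in percent."""
    return (with_context - baseline) / baseline * 100.0


# Success rates quoted in this post, in percent.
with_architect = 70.0
standalone = 51.9

points = with_architect - standalone
lift = relative_lift(with_architect, standalone)
print(f"{points:.1f} points absolute, {lift:.1f}% relative")
```

The 35% figure is the relative lift, rounded; the absolute gap is 18.1 percentage points.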

AI Architect is now more capable, with sharper tool descriptions and schemas, more precise context retrieval, and a new capability that returns indexed reference implementations directly from the knowledge graph. The lift comes from these gains combined with Claude Opus 4.6 chaining MCP tool calls more reliably than earlier model generations.

This result follows our first SWE-Bench Pro evaluation, which established AI Architect as a code intelligence layer for coding agents. 

Coding agent performance on SWE-Bench Pro

How the coding agent used AI Architect in this evaluation 

AI Architect builds a knowledge graph of your entire codebase, analyzing each repository across 160+ data points spanning 15+ dimensions, from service topology and dependency graphs to architectural patterns, implementation standards, and deployment context. Every function, class, interface, and type is indexed and cross-referenced, with millisecond symbol resolution.

Coding agents call AI Architect through MCP when a task needs system-level understanding: context that does not surface from local file reads.

In this evaluation, the coding agent uses AI Architect as a working knowledge graph it returns to throughout the run. Across the SWE-Bench Pro tasks drawn from the five largest public repositories, the agent makes a median of 8 MCP calls per task.

Each response averages 2.4 KB, scoped to the specific question the agent asked. The agent chains those calls into multi-step lookup sequences that build a complete map of the task before any code gets written.

The chaining shape stays consistent across tasks: 

  • Confirm which repository the work is happening in 
  • Locate where a class or function is defined 
  • Find every reference to that symbol across the codebase 
  • Pull the exact code at each location through the new code retrieval capability 
  • Read the project conventions the new code needs to follow 

By the time the agent begins editing, it holds the full call graph, the target patterns to extend, and the conventions the new code needs to fit. This is what produces the 70% resolve rate: the agent walks into each edit with system-level context, not a partial picture inferred from local exploration.
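The chaining shape above can be sketched as a short lookup routine. This is an illustration under stated assumptions, not AI Architect’s client code: the tool names and arguments are the ones quoted in this post, but `call_tool`, `gather_context`, the response shapes, and the stub with its placeholder path are all hypothetical stand-ins for a real MCP client.

```python
from typing import Any, Callable

# Hypothetical signature for "send one MCP tool call, return its result".
ToolCall = Callable[[str, dict[str, Any]], Any]


def gather_context(call_tool: ToolCall, symbol: str, family: str) -> dict[str, Any]:
    """Run the five-step lookup chain before any edit is attempted."""
    repos = call_tool("listRepositories", {"includeSummary": True})
    definitions = call_tool("searchSymbols", {"pattern": symbol, "caseSensitive": False})
    references = call_tool("searchCode", {"pattern": symbol.lower(), "maxResults": 100})
    # Assumed response shape: each definition is a dict with a "file" key.
    sources = [call_tool("getCode", {"file": d["file"]}) for d in definitions]
    siblings = call_tool("searchSymbols", {"pattern": family, "kind": "class"})
    return {
        "repos": repos,
        "definitions": definitions,
        "references": references,
        "sources": sources,
        "siblings": siblings,
    }


# Stub that records call order and returns canned data (not real responses).
calls: list[str] = []


def fake_tool(name: str, args: dict[str, Any]) -> Any:
    calls.append(name)
    if name == "searchSymbols":
        return [{"file": "example/Placeholder.ts", "line": 1}]
    if name == "getCode":
        return f"// contents of {args['file']}"
    return []


context = gather_context(fake_tool, "Entropy", "Facade")
print(calls)
```

The point of the sketch is the ordering: scope first, then definitions, then references, then exact code, then sibling examples for conventions. In a real run the agent decides each next call from the previous response, so the sequence is not hard-coded like this.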

Three forces behind the 35% lift 

The lift comes from three reinforcing factors. Together they explain the result. 

Sharper tool callability 

  • AI Architect’s tool descriptions and schemas are now sharper, so the coding agent picks the right tool more reliably. Tool descriptions and schemas drive whether the coding agent invokes a tool at all.  
  • When the description fits the agent’s reasoning pattern, the agent reaches for AI Architect through MCP rather than falling back to native code exploration with grep and glob. 
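To make "descriptions and schemas" concrete, here is a rough sketch of what a tool definition looks like in MCP, where each tool advertises a `name`, a `description`, and a JSON Schema `inputSchema`. The description text and the schema details below are our own illustration, not AI Architect’s actual definitions; only the tool name and argument names come from the calls quoted later in this post. The helper `argument_problems` is hypothetical.

```python
from typing import Any

# Illustrative tool definition in the shape MCP uses for tool listings.
SEARCH_SYMBOLS_TOOL: dict[str, Any] = {
    "name": "searchSymbols",
    "description": (
        "Find where classes, functions, interfaces, and types are defined. "
        "Returns file paths and line numbers for each match."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string"},
            "caseSensitive": {"type": "boolean"},
            "kind": {"type": "string"},
            "filePattern": {"type": "string"},
        },
        "required": ["pattern"],
    },
}


def argument_problems(tool: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """List missing required arguments and arguments the schema does not declare."""
    schema = tool["inputSchema"]
    problems = [
        f"missing required argument: {name}"
        for name in schema.get("required", [])
        if name not in arguments
    ]
    problems += [
        f"unknown argument: {name}"
        for name in arguments
        if name not in schema["properties"]
    ]
    return problems


print(argument_problems(SEARCH_SYMBOLS_TOOL, {"pattern": "Entropy", "caseSensitive": False}))
print(argument_problems(SEARCH_SYMBOLS_TOOL, {"query": "Entropy"}))
```

A description that states what the tool returns, paired with a schema that names its arguments plainly, gives the agent less to guess about when it decides whether and how to call the tool.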

More precise context retrieval 

  • When the coding agent calls AI Architect, the MCP responses are now scoped to the specific question the agent asked. A symbol search returns a definitions index. A code search returns a complete reference list. A code retrieval call returns exactly the lines requested.  
  • Sharper retrieval produces a more useful answer to the same agent question, much as a stronger search engine returns better results for the same query.
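As a small illustration of "exactly the lines requested", the sketch below slices a line range out of a source string. It is not how AI Architect implements code retrieval. The function name is ours, and treating the range as 1-indexed and inclusive is an assumption.

```python
def get_code(source: str, start_line: int, end_line: int) -> str:
    """Return lines start_line..end_line of source (assumed 1-indexed, inclusive)."""
    if start_line < 1 or end_line < start_line:
        raise ValueError("expected 1 <= start_line <= end_line")
    lines = source.splitlines()
    # Slicing past the end of the file simply returns the lines that exist.
    return "\n".join(lines[start_line - 1:end_line])


# A ten-line stand-in for a real source file.
sample = "\n".join(f"line {n}" for n in range(1, 11))
print(get_code(sample, 3, 5))
```

A scoped response like this keeps the payload small, which is consistent with the 2.4 KB average response size reported above.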

Stronger model tool chaining 

  • Claude Opus 4.6 chains tool calls more reliably than earlier model generations. The coding agent uses six AI Architect tools in regular rotation, builds a structured map across MCP calls, and follows up on partial information instead of stopping at the first response.  
  • Some of the lift is attributable to the model itself becoming more skilled at using the infrastructure available to it.

Standout example: The tutanota entropy refactor 

One SWE-Bench Pro task asked the coding agent to centralize encryption entropy management into a new facade in the tutanota repository, updating every call site so nothing breaks. Correctness depends on completeness. A single missed call site fails tests. 

The coding agent made 14 AI Architect MCP calls before the first edit. Five of those calls did the load-bearing work.

  • listRepositories(includeSummary=true) returned 1.4 KB of repo summaries, confirming tutanota as the right scope 
  • searchSymbols(pattern="Entropy", caseSensitive=false) returned every entropy-named definition in the codebase with file paths and line numbers 
  • searchCode(pattern="entropy", maxResults=100) returned 18 KB containing every code reference to entropy across the entire codebase, the structured map that decided the task 
  • getCode(file="src/common/api/worker/facades/EntropyFacade.ts") returned the indexed facade as the target pattern to extend 
  • searchSymbols(pattern="Facade", kind="class", filePattern="*Facade*.ts") returned 11 KB of every facade class as sibling examples for project convention 

The remaining nine MCP calls were targeted getCode reads with specific line ranges, pulling exactly the sections the agent needed to edit, for example startLine=270, endLine=300 for the entropy command handler. The agent shipped the patch with all tests passing. 
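For readers unfamiliar with what these calls look like on the wire, here is a hedged sketch of the five load-bearing calls as MCP `tools/call` requests, which are JSON-RPC 2.0 messages. The tool names and arguments are copied from the list above. The `tools_call` helper and the request ids are ours, and the exact payloads sent during the evaluation are not part of this post.

```python
import json
from itertools import count
from typing import Any

_ids = count(1)  # simple incrementing JSON-RPC request ids


def tools_call(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Build one MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


requests = [
    tools_call("listRepositories", {"includeSummary": True}),
    tools_call("searchSymbols", {"pattern": "Entropy", "caseSensitive": False}),
    tools_call("searchCode", {"pattern": "entropy", "maxResults": 100}),
    tools_call("getCode", {"file": "src/common/api/worker/facades/EntropyFacade.ts"}),
    tools_call(
        "searchSymbols",
        {"pattern": "Facade", "kind": "class", "filePattern": "*Facade*.ts"},
    ),
]

print(json.dumps(requests[1], indent=2))
```

The nine follow-up reads would take the same shape, with `getCode` arguments such as `startLine` and `endLine` added to narrow the range.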

What this evaluation establishes 

AI Architect becomes more valuable as coding agents get better. The deeper AI Architect tools were available to earlier model generations. Claude Opus 4.6 used them, chained them through MCP, and resolved 35% more tasks because of it. 

For engineering teams evaluating AI tooling, the practical implication runs against the common assumption. Better models do not reduce the need for context infrastructure. Better models exercise the infrastructure they have access to, and the gains compound when that infrastructure returns precise answers to specific questions. 

The full evaluation details, repository selection, task distribution, and complete results live on our SWE-Bench Pro benchmarks page.

Amar Goel

Bito’s Co-founder and CEO. Dedicated to helping developers innovate to lead the future. A serial entrepreneur, Amar previously founded PubMatic, a leading infrastructure provider for the digital advertising industry, in 2006, serving as the company’s first CEO. PubMatic went public in 2020 (NASDAQ: PUBM). He holds a master’s degree in Computer Science and a bachelor’s degree in Economics from Harvard University.
