We ran one real engineering epic through Claude Code plan mode, then through the same agent with Bito’s AI Architect feeding it system context, and had Claude itself grade both plans against the same rubric. Plan mode scored 41%. With full context, the same model scored 87%.
That gap is the most useful number I have seen in a while. It says the ceiling on what your coding agent produces stopped being the model and became how much of your system the agent can see before it plans.
Plan mode is where most teams hit that ceiling. It reads your code and writes a plan before it touches a file, and on a real codebase that plan looks finished while skipping the services it never found. Here is why that happens, and what closes the gap.
What Claude Code plan mode is and how it works
Let me level-set first, because the term gets used loosely. Claude Code plan mode is a read-only research mode inside Claude Code, Anthropic’s terminal-based coding agent. You switch to plan mode with a keystroke, and it holds off on writing code until you sign off.
In a single session, it:
- Runs read-only, opening files and grepping the working directory for the symbols, imports, and call sites it expects to matter
- Infers your architecture from what it reads in that one session, with no index of anything outside it
- Emits a plan you can edit or reject before it generates a line of code
People keep describing Claude plan mode as if it understands your system. What it does is read your system, and reading is a much shallower operation than understanding. On small work you never notice. On real work it is the whole game.
Where plan mode earns its keep
I want to be fair, because plan mode is good and I use it every day. Anthropic’s own best practices draw the line well: reach for planning when a change spans multiple files or the code is unfamiliar, and skip it when you could describe the diff in one sentence.
The work it nails looks like this:
- A bug fix scoped to a single service
- A new endpoint in a module it already has open
- A refactor contained to a handful of files in the current repository
In all of these the plan depends only on code the agent reads directly, so it holds up. The moment the work crosses that boundary, the story changes.
Why plan mode breaks down on large, multi-repo codebases
Here is where it comes apart, and it is the part that should worry you. Your system does not live in one repository. A single feature reaches into a search service, a sync pipeline, a shared data model, and three downstream consumers that quietly depend on the API you are about to change.
Plan mode breaks on a real polyrepo system in two specific, repeatable ways:
- Cold-start retrieval: It rediscovers your architecture from scratch every session using grep over the current repository, so it cannot reach a service it does not already know to open.
- No operational memory: It has no record of what already broke, what is fragile, or what is being deprecated, because that history lives outside the files it reads.
Both come down to retrieval rather than reasoning. The model is more than capable. What you feed it is where this falls apart.
It rediscovers your architecture from scratch every session
Every session starts cold. Plan mode holds no precomputed map of your repositories, no dependency graph, no call graph. Its retrieval is grep and file reads over the directory you happened to open.
That is fine in one repo. Across a polyrepo estate it fails on discovery, because the agent cannot trace a dependency into a service it never knew to clone.
Picture it concretely:
A team wires more than four hundred repositories into the agent over an MCP connection to GitHub, then asks it to plan a change to the payments API. The agent scopes the PaymentsService change correctly and hands back a plan that reads as finished:
Illustrative plan output
Plan: extend PaymentsService.charge() to support partial refunds.
Repos touched: payments-service.
Risk: low. No downstream impact identified.
Three services consume that same API, and none of them appear in the plan:
- notifications
- billing reconciliation
- the merchant dashboard
The change ships and breaks all three on the first deploy. Access to four hundred repositories changed nothing. Grep finds strings. It cannot tell you which repositories hold the downstream consumers of the contract you are editing.
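To make the discovery gap concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: the miniature `REPOS` contents, the `payments_client` import, and the `CONSUMERS` index are invented for illustration and are not output from any real tool. It contrasts a text search scoped to the open repository with a lookup against a precomputed consumer index.

```python
# Hypothetical miniature polyrepo: repo name -> {file path: file contents}.
REPOS = {
    "payments-service": {
        "service.py": "class PaymentsService:\n    def charge(self, amount): ...\n",
    },
    "notifications": {"handlers.py": "from payments_client import charge\n"},
    "billing-reconciliation": {"reconcile.py": "from payments_client import charge\n"},
    "merchant-dashboard": {"views.py": "from payments_client import charge\n"},
}

# Hypothetical precomputed index: API symbol -> repos that consume it.
CONSUMERS = {
    "PaymentsService.charge": [
        "billing-reconciliation",
        "merchant-dashboard",
        "notifications",
    ],
}


def grep_open_repo(open_repo: str, needle: str) -> set[str]:
    """Text search limited to the repo that is open in the session."""
    return {
        open_repo
        for contents in REPOS[open_repo].values()
        if needle in contents
    }


def lookup_consumers(symbol: str) -> set[str]:
    """Lookup against the precomputed index; no search involved."""
    return set(CONSUMERS.get(symbol, []))


found_by_grep = grep_open_repo("payments-service", "charge")
found_by_lookup = lookup_consumers("PaymentsService.charge")
missed = found_by_lookup - found_by_grep

print(sorted(found_by_grep))
print(sorted(missed))
```

In this toy setup the scoped search only ever reports the repository it was pointed at, while the three consumers show up only through the index. The search is not wrong; it was simply never aimed at the repositories that matter.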
It has no memory of what already broke
A plan is only as good as the operational history behind it, and plan mode carries none of yours. It has no way to see:
- Four hotfixes that landed on a module last month, a signal to touch it carefully
- A high reopen rate on a service’s issues, a marker that the area is fragile
- A component scheduled for deprecation next quarter
- A downstream dependency with a P95 you cannot afford to add latency to
That signal lives in your issue tracker, your commit history, and your incident record, all outside the files the agent reads. So the plan optimizes the happy path and walks into the failure modes your senior engineers already route around.
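One way to picture that missing signal is as a small record per module. The sketch below is an assumption-heavy illustration: the `ModuleHistory` fields, the thresholds, and the sample values are all hypothetical, and a real system would derive them from the issue tracker, commit history, and incident record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModuleHistory:
    """Operational signals for one module. Fields and thresholds are illustrative."""
    name: str
    hotfixes_last_30_days: int
    issue_reopen_rate: float            # fraction of closed issues later reopened
    deprecation_planned: bool
    downstream_p95_budget_ms: Optional[float] = None  # latency budget, if one applies


def risk_flags(m: ModuleHistory) -> list[str]:
    """Turn raw history into the warnings a planner should see before proposing changes."""
    flags = []
    if m.hotfixes_last_30_days >= 3:        # hypothetical threshold
        flags.append("recent hotfixes: touch carefully")
    if m.issue_reopen_rate >= 0.2:          # hypothetical threshold
        flags.append("high reopen rate: fragile area")
    if m.deprecation_planned:
        flags.append("scheduled for deprecation")
    if m.downstream_p95_budget_ms is not None:
        flags.append(f"latency-sensitive: P95 budget {m.downstream_p95_budget_ms:g} ms")
    return flags


payments = ModuleHistory(
    name="payments-service",
    hotfixes_last_30_days=4,
    issue_reopen_rate=0.05,
    deprecation_planned=False,
)
flags = risk_flags(payments)
print(flags)
```

None of these fields can be recovered by reading source files, which is the point: a planner that sees only the code has no input from which to produce these flags.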
The same blindness compounds inside one long session. As the conversation grows, Claude Code compacts earlier turns to stay inside its token budget. The constraints you set at the start get summarized away, and the agent drifts back toward a generic approach.
Teams call this context rot, and it is why a plan degrades the longer you run the session.
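A toy model shows why early constraints are the first to go. This is not how Claude Code implements compaction. It is a deliberately crude sketch in which word counts stand in for tokens, a fixed stub stands in for a model-written summary, and the `compact` function and sample session are hypothetical.

```python
def compact(turns: list[str], budget_words: int) -> list[str]:
    """Keep the most recent turns that fit the budget; collapse the rest into a stub."""
    kept: list[str] = []
    used = 0
    # Walk backwards so the newest turns claim the budget first.
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > budget_words:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    dropped = len(turns) - len(kept)
    if dropped:
        kept.insert(0, f"[summary of {dropped} earlier turns]")
    return kept


session = [
    "Constraint: do not change the public signature of charge()",
    "Explored payments-service and listed call sites",
    "Drafted step one of the plan",
    "Drafted step two of the plan",
]
compacted = compact(session, budget_words=12)
constraint_survives = any("Constraint" in turn for turn in compacted)

print(compacted)
print(constraint_survives)
```

Because the oldest turns are collapsed first, the constraint stated at the start of the session is exactly what falls out of the window in this model, while the most recent drafting steps survive intact.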
What a side-by-side plan comparison actually showed
So I stopped speculating and we ran it properly. One real epic, roughly two pages of requirements, touching more than a hundred repositories. We ran it through Claude Code plan mode alone, then through Claude Code with AI Architect feeding it system context. To keep grading constant, we asked Claude Code itself to score both plans against the same weighted rubric.
Plan mode scored 41%, 120 of 290 points. The same epic with AI Architect scored 87%, 252 of 290. The rubric weighted the dimensions that decide whether a plan survives production.

| Dimension | Plan mode | With AI Architect |
| --- | --- | --- |
| Product completeness | 4 | 9 |
| Risk and stability analysis | 2 | 10 |
| Strategic alignment | 3 | 9 |
| Defensive design | 2 | 10 |

I have shown this to enough engineering leaders to know the reaction. The scores are rarely the surprise. The surprise is that the same model produced both plans, and the only thing that changed was what it knew when it walked in.
To be fair, plan mode scored higher on readability and team allocation. Those dimensions matter on a real team. They do little to prevent rework, and they do not write code.
This is not a one-epic fluke. On our independent SWE-Bench Pro evaluation, the baseline model (Claude Opus 4.6) improved sharply on the same tasks once it had system context:
- Resolve rate climbed from 51.9% to 70.1%, a gain of 18.2 percentage points, or a 35% relative lift on Claude Opus 4.6
- The largest gains landed on big codebases and changes spanning ten or more files
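For readers who want to check the arithmetic, the short Python sketch below recomputes the percentages from the figures quoted in this section. It uses only numbers already stated in the text; the variable names are mine.

```python
# Figures quoted in the text; nothing here is new data.
rubric_total = 290
plan_mode_points = 120
with_context_points = 252

baseline_resolve = 51.9      # percent, Claude Opus 4.6 without system context
with_context_resolve = 70.1  # percent, same model with system context

plan_mode_pct = plan_mode_points / rubric_total * 100
with_context_pct = with_context_points / rubric_total * 100

absolute_gain = with_context_resolve - baseline_resolve   # percentage points
relative_lift = absolute_gain / baseline_resolve * 100    # percent of the baseline

print(f"rubric: {plan_mode_pct:.0f}% vs {with_context_pct:.0f}%")
print(f"resolve rate: +{absolute_gain:.1f} points, {relative_lift:.0f}% relative lift")
```

The distinction matters when you quote the result: the resolve rate rose by 18.2 percentage points, which is a 35% improvement relative to the 51.9% baseline.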
The full SWE-Bench Pro evaluation has the methodology and the breakdown by complexity, and it is worth reading.
What it takes to give a coding agent real system context
When the plans keep coming back thin, teams start hunting for Claude Code alternatives, Cursor alternatives, or a replacement for whichever coding agent they happen to use.
I understand the instinct, and it usually aims at the wrong layer. The agent is capable. What it knows about your system before it starts is the problem.
The real fix has little to do with a sharper prompt. I have watched teams burn weeks tuning prompts and hit the same wall, because phrasing was never the issue. The agent needs a map of your system that already exists when the session opens, so it resolves dependencies by lookup instead of by grep.
That map is what we built Bito’s AI Architect to be. It indexes your code, business context, and tribal knowledge into a knowledge graph that holds:
- A dependency and call graph across every repository
- The API contracts and schemas between services
- The operational history that marks where the system is fragile
The same graph that powers technical design and scoping feeds your coding agent through MCP. Claude Code then plans grounded in your code, commits, issues, docs, and past decisions, with the cross-repo blast radius resolved before it writes the first line.
The difference shows up the moment the agent starts a task.
| | Plan mode on its own | Claude Code with AI Architect |
| --- | --- | --- |
| How it retrieves context | Grep and file reads over the open repo | Lookup against a pre-built graph of every repo |
| What it knows about risk | Whatever sits in the current files | Hotfix history, reopen rates, deprecations, latency |
| Cross-repo blast radius | Found by luck, if at all | Resolved before the plan is written |
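"Resolved before the plan is written" can be read as a graph traversal. The sketch below is a generic breadth-first search over reverse-dependency edges, not a description of how AI Architect works internally. The `CALLED_BY` topology is hypothetical and deliberately makes one consumer indirect to show why the traversal has to be transitive.

```python
from collections import deque

# Hypothetical reverse-dependency edges: service -> services that call it directly.
CALLED_BY = {
    "payments-service": ["billing-reconciliation", "notifications"],
    "billing-reconciliation": ["merchant-dashboard"],
    "notifications": [],
    "merchant-dashboard": [],
}


def blast_radius(changed: str, called_by: dict[str, list[str]]) -> list[str]:
    """Return every service that depends on `changed`, directly or indirectly, in BFS order."""
    seen = {changed}
    order: list[str] = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in called_by.get(current, []):
            if caller not in seen:      # guard against cycles and duplicates
                seen.add(caller)
                order.append(caller)
                queue.append(caller)
    return order


affected = blast_radius("payments-service", CALLED_BY)
print(affected)
```

The traversal is cheap once the edges exist. The expensive part, and the part a cold-start session cannot do, is building `CALLED_BY` across every repository in the first place.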
Your CLAUDE.md files are a sign you already feel this
If your team writes CLAUDE.md files or runs /init to brief the agent on your architecture, you have already diagnosed this problem yourselves. Those files, along with the expanding set of Claude Code plugins teams bolt on, are hand-built attempts to give the agent the context it lacks.
The limit is structural. A flat text file cannot hold a full dependency graph, shared contracts, and cross-repo relationships, and someone has to keep it current by hand every time the system moves.
A knowledge graph computes that picture for you and refreshes as your code and issues change. It is the same context that flows into grounded coding once a plan is approved.
I will give plan mode its due one more time. For a contained change in a single repository, it is fast and capable, and I work that way myself. The moment a change reaches across your system, the agent needs a map it has no way to draw on its own.
Frequently asked questions
Does Claude Code plan mode work on large codebases?
It works, but with a sharp ceiling. On a large polyrepo codebase, plan mode reasons only about files it can reach in the current repository, so it routinely misses cross-repo dependencies and returns plans that look complete while skipping affected services. It performs best on changes contained to one repository.
How do you switch to plan mode in Claude Code?
You switch to plan mode in Claude Code with a keystroke during a session: pressing Shift+Tab cycles through the permission modes until plan mode is active. The agent then runs read-only, exploring and proposing a plan without editing files, and waits for your approval before it writes code.
Why does Claude Code plan mode miss files and dependencies?
It explores your code from a cold start every session with no pre-built map of your system, so it cannot read a service it does not know exists. It also has no memory of operational signals like hotfix history and reopen rates, which is why its plans miss known-fragile areas.
Can MCP fix plan mode on its own?
Not by itself. An MCP connection, whether to GitHub or any of the other Claude Code MCP servers, gives the agent access to more repositories, but access alone does not solve discovery. The agent still has to know which repositories matter for the task. Supplying a pre-built knowledge graph through MCP, rather than raw access, is what closes the gap.
If you are betting your roadmap on autonomous engineering, the lesson is that model quality stopped being your bottleneck. The plan your agent writes is a mirror of what it knows about your system before it begins, and on a real codebase you have to put that knowledge in front of it.
The teams that pull ahead will be the ones that hand their agents a real map of the system, instead of asking them to redraw it from memory on every task. See the full SWE-Bench Pro evaluation to dig into the numbers and the method behind it.