Token tax is real, but you are solving the wrong problem

Coding agents consume 1 to 3.5 million tokens per task. Independent tracking of real agent sessions shows that 70% of those tokens are waste, spent on reading irrelevant files, backtracking through dead ends, and retrying failed attempts. This is the token tax, and it is now the fastest-growing line item in engineering budgets at companies scaling AI coding tools. 

The industry is racing to make each token cheaper through caching, compression, and model routing. At Bito, we think these LLM cost optimizations address the wrong variable. The reason agents consume so many tokens is that they spend most of their compute exploring codebases they should already understand. Making each token cheaper helps. Making fewer tokens wasteful is where the real leverage sits. 

The token tax everyone is talking about 

Uber burned through its entire 2026 AI coding budget in four months. AI spending across Ramp’s customer base has grown 13x in a year, and nobody knows how to budget for it. Developers on Hacker News report $1,400 monthly Cursor bills. Individual engineers using Claude Code as a daily agent report $500 to $2,000 per month in API costs. 

Experienced developers now use an average of 2.3 AI coding tools simultaneously, spending $150 to $400 per month on AI assistance during active development. And these numbers keep climbing. 

The pricing models are straining under the weight. Anthropic moved away from flat-rate enterprise pricing toward per-token billing because subscriptions designed for conversation could not sustain agentic workloads. OpenAI’s head of ChatGPT acknowledged that “having an unlimited plan is like having an unlimited electricity plan.” Cursor replaced fixed request allotments with usage-based credit pools after developers reported $350 in overages in a single week. 

Most of the responses so far (usage caps, model routing, prompt compression, batch processing) focus on making each unit of compute cheaper. The harder question is why agents consume so much compute in the first place.

Where the tokens actually go 

A coding agent asked to refactor a microservice or trace a bug across services does something very specific before it writes a single line of code. It explores. It greps across files, reads functions sequentially, backtracks when it hits dead ends, and builds a mental model of the system through trial and error. 

One independent analysis found that 87% of tokens in coding agent sessions went to finding code, not writing it. In one documented case, a one-line typo fix consumed over 21,000 input tokens because the agent followed its full workflow of opening issues, posting checklists, and creating branches for a single character change. 

The economics get worse because every API call resends the full conversation history, and that context grows with every turn. This waste compounds in three ways:

  • Context accumulation: The cost per turn increases as the session progresses because each call carries the full history of every previous call. 
  • Retry loops: A failed attempt at turn 40 carries the full inflated context. On Opus-tier pricing, that retry costs 10x what it would on a cheaper model. 
  • Context rot: As context grows, agent quality degrades, which leads to more failures, which adds more tokens, which makes the next call more expensive. 
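
The compounding described above can be made concrete with a toy model. The sketch below assumes a session that starts with a fixed context and grows by a constant number of tokens per turn. The function names and the sample figures (5,000 starting tokens, 2,000 added per turn) are illustrative assumptions, not measurements from any real agent.

```python
def context_at_turn(turn: int, base_context: int, tokens_per_turn: int) -> int:
    """Input tokens sent on a 1-indexed turn when the full history is resent."""
    return base_context + tokens_per_turn * (turn - 1)


def cumulative_input_tokens(turns: int, base_context: int, tokens_per_turn: int) -> int:
    """Total input tokens billed across a session of `turns` calls."""
    return sum(
        context_at_turn(t, base_context, tokens_per_turn)
        for t in range(1, turns + 1)
    )


# Hypothetical session shape, chosen only to illustrate the growth curve.
BASE_CONTEXT = 5_000
TOKENS_PER_TURN = 2_000

for turns in (10, 20, 40):
    total = cumulative_input_tokens(turns, BASE_CONTEXT, TOKENS_PER_TURN)
    print(f"{turns} turns: {total:,} input tokens billed in total")

# A retry late in the session pays for the whole accumulated history.
print(f"retry at turn 40 resends {context_at_turn(40, BASE_CONTEXT, TOKENS_PER_TURN):,} tokens")
```

Under these assumptions, doubling a session from 20 to 40 turns raises the total from 480,000 to 1,760,000 input tokens, and a retry at turn 40 resends 83,000 tokens compared with 5,000 on the first turn. The total grows roughly with the square of the turn count, which is why long sessions dominate the bill.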

At Opus-tier pricing with five parallel agents, this translates to $35 to $130 per hour in wasted tokens. Over an eight-hour workday, that is $280 to $1,040 in avoidable LLM spend per team. 
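
The arithmetic behind those figures is easy to check. The short sketch below only rearranges the numbers quoted above (five agents, $35 to $130 per hour, eight hours); the function names are made up for illustration.

```python
def per_agent_hourly(team_hourly: float, agents: int) -> float:
    """Hourly waste attributable to one agent, assuming an even split."""
    return team_hourly / agents


def daily_waste(team_hourly: float, hours: float) -> float:
    """Waste over a workday at a constant hourly rate."""
    return team_hourly * hours


AGENTS = 5                    # parallel agents, as quoted above
HOURS = 8                     # eight-hour workday, as quoted above
LOW, HIGH = 35.0, 130.0       # dollars of wasted tokens per hour, as quoted above

print(per_agent_hourly(LOW, AGENTS), per_agent_hourly(HIGH, AGENTS))
print(daily_waste(LOW, HOURS), daily_waste(HIGH, HOURS))
```

Split evenly, that is $7 to $26 per agent per hour, and $280 to $1,040 per day for the five agents together.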

The optimization everyone is pursuing 

The industry has converged on a standard set of agentic workflow cost optimizations, and most of them are worth implementing: 

  • Model routing sends simple tasks to cheaper models like Haiku. Saves 30 to 40%. 
  • Prompt caching stores static parts of the prompt so you do not reprocess them every call. Saves up to 90% on repeated prefixes. 
  • Context compaction summarizes conversation history to keep the window manageable. 
  • Batch processing offers 50% discounts for non-urgent work. 
  • Instruction files reduce output token usage by 17% and runtime by 29% on PR-sized tasks because the model wastes fewer tokens figuring out what you could have told it upfront. 
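
As a rough illustration of how a per-prefix discount turns into an overall saving, the sketch below models prompt caching on a single call. It assumes cached prefix tokens are billed at a flat discount and ignores any surcharge a provider may apply when first writing the cache. The price and token counts are hypothetical, and the function names are invented for this example.

```python
def input_cost(prefix_tokens: int, fresh_tokens: int,
               price_per_mtok: float, cached_discount: float = 0.0) -> float:
    """Dollar cost of one call's input when the prefix is billed at a discount."""
    billable = prefix_tokens * (1.0 - cached_discount) + fresh_tokens
    return billable * price_per_mtok / 1_000_000


def overall_saving(prefix_tokens: int, fresh_tokens: int, cached_discount: float) -> float:
    """Fraction of input cost saved on the whole call, not just on the prefix."""
    return prefix_tokens * cached_discount / (prefix_tokens + fresh_tokens)


# Hypothetical call: 20,000 static prefix tokens plus 5,000 new tokens.
PREFIX, FRESH = 20_000, 5_000
PRICE_PER_MTOK = 15.0  # assumed dollars per million input tokens

uncached = input_cost(PREFIX, FRESH, PRICE_PER_MTOK)
cached = input_cost(PREFIX, FRESH, PRICE_PER_MTOK, cached_discount=0.90)
print(f"uncached ${uncached:.3f}, cached ${cached:.3f}")
print(f"overall saving {overall_saving(PREFIX, FRESH, 0.90):.0%}")
```

In this example a 90% discount on the prefix yields a 72% saving on the call, because only 80% of the input is a repeated prefix. The saving shrinks as the share of fresh tokens, such as newly read files and tool output, grows.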

These techniques can reduce costs by 40 to 70% according to multiple reports. They are real engineering optimizations and every team scaling AI coding tools should have them in place. 

They also all share the same limitation. They make each individual turn of an agent session cheaper or shorter. They do not reduce the number of turns an agent takes to understand a system, reduce the failure rate on complex tasks, or eliminate the exploration phase that precedes productive code generation. 

The variable that changes the equation 

The per-token optimizations above reduce the cost of each agent action. The question they leave unanswered is whether those actions produce useful output. Salesforce recently introduced a metric it calls “agentic work units” to track exactly this: the work AI completes rather than the tokens it burns.

On SWE-Bench Pro, a benchmark of real-world software engineering tasks, we measured this directly. Agents with AI Architect’s knowledge graph of the codebase made 25% fewer tool calls per task, completed tasks 20% faster, and succeeded 35% more often, all at the same average cost per task. The improvement came from eliminating the exploration phase: the agent started with the architectural map and skipped straight to targeted work.

The gains concentrated in exactly the task categories where exploration waste is highest: 

  • 4.5x higher success on tasks spanning 10 or more files 
  • 3.8x higher success on codebases with 1.5 million or more lines 

These are the tasks that consume the most tokens and fail the most often, and they are exactly where enterprise engineering work concentrates. Agents that understand the architecture of a codebase before they start working produce more output, fail less often, and consume fewer tokens doing it.
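
One way to read the benchmark figures above is as cost per successful task rather than cost per task. The sketch below assumes that “succeeded 35% more often” means a 35% relative increase in success rate at an unchanged cost per task. The baseline success rate and dollar cost are hypothetical placeholders, not results from the benchmark.

```python
def cost_per_success(cost_per_task: float, success_rate: float) -> float:
    """Average spend needed to obtain one successful task."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_task / success_rate


def relative_cost_per_success(success_uplift: float) -> float:
    """New cost per success divided by the old one, at unchanged cost per task."""
    return 1.0 / (1.0 + success_uplift)


UPLIFT = 0.35            # relative increase in success rate, as quoted above
BASELINE_RATE = 0.40     # hypothetical baseline success rate
COST_PER_TASK = 2.00     # hypothetical dollars per attempted task

before = cost_per_success(COST_PER_TASK, BASELINE_RATE)
after = cost_per_success(COST_PER_TASK, BASELINE_RATE * (1.0 + UPLIFT))
print(f"before ${before:.2f}, after ${after:.2f} per successful task")
print(f"ratio {relative_cost_per_success(UPLIFT):.3f}")
```

The ratio does not depend on the placeholder values: at the same cost per task, a 35% relative gain in success rate lowers the cost of each successful task by about 26%.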

What this means for engineering organizations 

Every engineering team scaling agentic coding workflows is running this cost calculation right now. The token tax conversation will only intensify as agent adoption grows, and the cost controls the industry is building are all necessary.  

The real token tax, the one that scales with team size and codebase complexity, is that agents without architectural context waste the majority of their compute on exploration. Solving the token tax at the root means reducing compute waste by giving agents the context they need so that every token spent produces working code. 

The organizations that figure this out will spend less per engineer and ship more per dollar. The ones that only optimize per-token pricing will keep paying for exploration that produces nothing. 

Amar Goel

Bito’s Co-founder and CEO. Dedicated to helping developers innovate to lead the future. A serial entrepreneur with a background in software engineering and economics, Amar has founded multiple companies, including Komli Media and PubMatic, a leading infrastructure provider for the digital advertising industry, which he founded in 2006 and led as its first CEO. PubMatic went public in 2020 (NASDAQ: PUBM). He holds a master’s degree in Computer Science and a bachelor’s degree in Economics from Harvard University.

This article is brought to you by the Bito team.
