Summary
This case study is part of the SWE-Bench Pro Evaluation, an independent benchmark conducted by The Context Lab that tests AI coding agents on real-world codebases. It examines a security-critical task in Teleport, an open-source infrastructure access platform: masking provisioning tokens in log output to prevent plaintext secret exposure.
Where a baseline coding agent (Claude Sonnet 4.5) added roughly 30 lines of error-wrapping code that left tokens visible in plaintext, the same baseline agent augmented with deep codebase context from Bito’s AI Architect traced the actual data flow, identified the real exposure point, and resolved the vulnerability completely with a 6-line function.
The challenge
Provisioning tokens were being logged in plaintext in Teleport, exposing sensitive secrets to anyone with log file access. The fix required masking token values in all log output while preserving enough visibility for debugging (e.g., abc1******** instead of abc123456789).
The key question wasn’t how to mask a token; it was where tokens actually enter the logging system. Fixing the wrong interception point means the vulnerability persists.
Why the baseline agent failed
Claude Sonnet 4.5 took an indirect approach: it built a complex error-wrapping function (~30 lines) that attempted to sanitize tokens after they appeared in error messages. This included type preservation logic for trace.NotFound, trace.AccessDenied, and other error types.
But the actual exposure came from log.Debugf() calls that directly output validateRequest.Token. Error wrapping never intercepts debug log statements. The coding agent’s entire approach targeted the wrong layer, like installing a water filter on the wrong pipe.
⚠ ROOT CAUSE: The coding agent targeted error messages instead of log statements. The actual token exposure came from log.Debugf() calls that directly print token values, an interception point that error wrapping cannot reach.
How Bito’s AI Architect solved it
Bito’s AI Architect does not explore a codebase file by file. Before generating any code, it constructs a knowledge graph of the repository: a structured representation of how modules, functions, and data flows relate to one another across the entire system. This graph captures not just what each file contains, but how values move between them, where they originate, and where they are consumed.
That architectural understanding revealed that the log.Debugf() calls were the actual exposure points. The treatment agent implemented a simple, public MaskToken() function (~6 lines) and applied it directly at the two log call sites in trustedcluster.go.
Prevention at the source (masking before logging) replaced remediation after the fact (cleaning up error messages). The result was simpler code, complete coverage, and a fix that actually addresses the vulnerability.
KEY ARCHITECTURAL INSIGHT
Prevention beats remediation. Masking tokens at the log.Debugf() call site is both simpler (~6 lines vs. ~30) and more complete than trying to sanitize error messages downstream. Bito’s AI Architect traced the actual data flow to find the real exposure point.
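The case study doesn't reproduce the actual patch, but a ~6-line helper of the kind described might look like the sketch below. The name `MaskToken` comes from the text; the exact masking behavior (first four characters kept visible, matching the abc1******** example) is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// MaskToken keeps the first four characters of a token for debuggability
// and replaces the remainder with asterisks; tokens of four characters or
// fewer are masked entirely. (Sketch only; upstream behavior is assumed.)
func MaskToken(token string) string {
	if len(token) <= 4 {
		return strings.Repeat("*", len(token))
	}
	return token[:4] + strings.Repeat("*", len(token)-4)
}

func main() {
	// Applied at the log call site, the token is masked before it is
	// ever formatted into the log line.
	fmt.Printf("DEBUG: cluster token: %v\n", MaskToken("abc123456789")) // abc1********
}
```

Masking at the call site means the plaintext value never reaches the logging layer at all, which is what makes the coverage complete.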
Head-to-head comparison
| | Claude Sonnet 4.5 (baseline agent) | Bito’s AI Architect |
| --- | --- | --- |
| Code Exploration | Focused on error handling — missed log.Debugf() as the exposure point | Traced token data flow — identified the exact log statements |
| Approach | Indirect — error wrapping after tokens reach error messages (~30 lines) | Direct — MaskToken() at call site before tokens reach logs (~6 lines) |
| Coverage | Incomplete — error wrapping doesn’t intercept debug log statements | Complete — both log.Debugf() calls masked at source |
| Complexity | High — error type preservation, incomplete switch cases | Low — single public function, applied at 2 call sites |
| Task Outcome | FAILED — tokens still logged in plaintext | PASSED — tokens masked as abc1******** |
Conclusion
The right fix at the wrong layer is no fix at all. This task comes down to a single question: not how to mask a token, but where in the system that masking needs to happen. Claude Sonnet 4.5 answered a plausible version of that question and got it wrong, producing a complex 30-line error-wrapping approach that never touched the actual exposure point. Bito’s AI Architect traced the token’s data flow through the full system context, identified the log.Debugf() call sites as the real vulnerability, and resolved it with a 6-line function that actually closes the gap.
In a security context, fixing the wrong layer is indistinguishable from doing nothing. For engineering teams working on complex codebases, the bottleneck in catching subtle vulnerabilities is rarely the ability to write a fix. It is the system understanding to know exactly where that fix belongs. That is the gap Bito’s AI Architect closes.