Introduction: The Wrong Question
GitHub's shift from premium requests to usage-based billing has triggered a wave of anxiety across engineering teams. The question echoing through Slack channels and leadership meetings is some variation of: "How do we reduce our token spend?"
It's the wrong question.
Focusing purely on cost diminishes the value you get from agents. A better framing is: "How do we get the most out of the tokens we spend?" That subtle reframing changes everything — from how you write prompts, to which model you reach for, to how you architect your codebase, to how you organize your team's workflows.
This article walks through the full case for quality-first token optimization, the foundational mental models you need to reason about it, and the concrete controls and techniques that move the needle.
Part 1: Why Agent Quality Is the Better Lens
Agent Gambling Is No Longer Sustainable
When tokens were effectively free, agent accuracy didn't really matter. The dominant pattern became what's best described as "agent gambling": throw together a lazy prompt with minimal context, fire off an agent, and if it fails, fire off another one. Think of it as the NASA Artemis problem in reverse — if rockets were cheap, you'd send 20 in the general direction of the moon and hope one lands.
That worked when each developer ran a handful of agents per day. It stops working the moment developers — and especially AI engineers orchestrating fleets — are running dozens or hundreds of agents per day. The economics invert. The cost of misfires dwarfs the cost of doing the work properly.
The fix isn't to send fewer rockets blindly. It's to make sure each rocket actually lands. Higher per-agent quality means fewer retries, fewer wasted tokens, and better ROI on every dollar of usage.
The ROI Mental Model
The guiding equation for thinking about agent economics:
Agent ROI = (Value of Agent Output − Token Cost) / Token Cost × 100%
You can't calculate this precisely, but it's a directionally useful lens. Two things follow immediately:
- Optimizing cost when value is zero is meaningless. Cutting your spend by 50% on outputs that don't ship anything useful is just losing money more slowly.
- Increasing value often means decreasing tokens. Developers routinely stuff irrelevant text into prompts, let conversations compound with stale context, and pile on documentation the model doesn't need. Trimming that context usually improves both quality and cost simultaneously. They're the same lever.
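To make the two points above concrete, here is a minimal sketch of the ROI equation in Python. The function name and all dollar figures are hypothetical and chosen only for illustration; in practice the value side can only be estimated.

```python
def agent_roi_percent(output_value: float, token_cost: float) -> float:
    """Agent ROI in percent: (value of output - token cost) / token cost * 100."""
    if token_cost <= 0:
        raise ValueError("token_cost must be positive")
    return (output_value - token_cost) / token_cost * 100.0


# Hypothetical numbers, for illustration only.
wasted_run = agent_roi_percent(output_value=0.0, token_cost=4.0)          # nothing useful shipped
cheaper_wasted_run = agent_roi_percent(output_value=0.0, token_cost=2.0)  # same, at half the spend
useful_run = agent_roi_percent(output_value=40.0, token_cost=4.0)
leaner_useful_run = agent_roi_percent(output_value=40.0, token_cost=2.0)  # same value, trimmed context

print(wasted_run, cheaper_wasted_run, useful_run, leaner_useful_run)
```

With zero value, halving the cost leaves ROI unchanged at its floor. With real value, the same trimming roughly doubles it.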
The Compound Error Problem
Here's the math that should haunt anyone running multi-step agent workflows: errors compound multiplicatively.
- At 99% accuracy per step, a 50-step workflow lands at ~60% overall success.
- At 95% accuracy per step, the same workflow drops to ~8%.
LLMs are non-deterministic. Every step in an inner agent loop, every hop in an orchestrated workflow, every tool call — they all multiply against each other. This means every percentage point of per-step quality buys you a disproportionate improvement in overall reliability. And every miss isn't just a wasted token call — it triggers fix cycles, review overhead, reruns, debugging sessions, and burned human attention.
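The arithmetic behind those two figures is easy to check. This is a small sketch that assumes every step succeeds or fails independently with the same probability, which is a simplification of real agent loops.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    return per_step_accuracy ** steps


at_99 = workflow_success_rate(0.99, 50)  # about 0.605
at_95 = workflow_success_rate(0.95, 50)  # about 0.077

print(f"99% per step over 50 steps: {at_99:.1%}")
print(f"95% per step over 50 steps: {at_95:.1%}")
```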
The takeaway: apply the same "shift-left" mindset to agents that you apply to quality, testing, and security in traditional engineering.
The Mantra
The whole philosophy collapses into one line worth pinning to your monitor:
Instead of counting tokens, make every token count.
Reduce token usage as a consequence of pursuing quality — not as a goal in itself. Send fewer, better-targeted rockets. The fuel savings follow automatically.
Part 2: Foundations — LLMs, Agents, and Context Windows
Before you can optimize anything, you need to internalize a few mechanical truths about how this technology actually works.
LLMs Are Pure Word Probability Machines
Strip away the marketing and what you have is a text-in, text-out system that predicts the next word given an input plus the patterns from its training data. When you type "GitHub Copilot is the world's most widely…" the model assigns probabilities to candidate next words — used, adopted, deployed, and so on — and picks one. In a coding context, it's predicting the next instruction.
Models have gotten dramatically better, but the underlying mechanism hasn't changed. This matters because the math doesn't distinguish hallucination from fact. A made-up function name and a real one occupy the same probability space. The model isn't "lying" when it hallucinates — it's just doing what it always does with insufficient signal.
The Core Principle of Context
Which leads to the single most important principle in this entire discipline:
Provide as little context as possible, but as much as required.
Two failure modes flank this principle:
- Too much context biases the model toward irrelevant patterns and dilutes the signal. Stuffing in five files when one is relevant makes the model worse, not better.
- Too little context forces the model to hallucinate to fill gaps. It has to predict something, and without grounding, it guesses.
Context engineering — the discipline of finding that sweet spot — is the fundamental skill of working with agents.
What an Agent Actually Is
An agent is not magic. It's an app — code that sits between you and the LLM. The architecture is simple:
You and your project ↔ The agent (harness) ↔ The LLM
Harnesses are things like VS Code Chat, Copilot CLI, Copilot Cloud Agent, Claude Code, and OpenAI Codex. Models are things like GPT-5.5, Claude Opus 4.7, and Gemini Pro. The harness is the orchestrator; the model is the inference engine.
Two things are crucial to understand here:
- The LLM is stateless. What feels like a "conversation" is actually the harness re-sending every prior input and output on every turn. There's no memory inside the model — only an ever-growing transcript being shipped back and forth.
- Tokens compound. Every loop drags the previous loops along with it. Your levers are the things that go into the context: your prompt, the files you reference, and the agent configs (instructions, skills, MCPs) the harness injects.
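A rough way to see the compounding is to count what the harness ships on each turn. This sketch assumes every turn adds a fixed number of tokens to the transcript and that the whole transcript is re-sent each time. It ignores prompt caching and compaction, so treat the numbers as an upper-bound illustration, not a billing model.

```python
def cumulative_input_tokens(tokens_added_per_turn: list[int]) -> int:
    """Total tokens sent to a stateless model when the transcript is re-sent every turn."""
    total_sent = 0
    transcript = 0
    for added in tokens_added_per_turn:
        transcript += added       # the transcript grows by this turn's prompt and response
        total_sent += transcript  # the harness ships the entire transcript again
    return total_sent


# Ten turns of 1,000 tokens each in one sprawling session...
one_long_session = cumulative_input_tokens([1000] * 10)
# ...versus the same ten turns as ten independent one-turn sessions.
ten_fresh_sessions = sum(cumulative_input_tokens([1000]) for _ in range(10))

print(one_long_session, ten_fresh_sessions)
```

The growth is quadratic in the number of turns, which is why the size of each prompt, file, and response matters more than individual words.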
Context Window Mechanics
A token is roughly ¾ of an English word. Smaller models offer 50K–200K token windows; larger ones like Opus and GPT-5.5 push toward 1M tokens. For scale: 1M tokens is roughly the entire Lord of the Rings trilogy plus The Hobbit.
Don't obsess over token counting at the character level. Think at the level of prompts, files, and responses — those are the units that compound on each loop.
Context Rot: The Hidden Failure Mode
Even with a huge window, models don't treat all positions equally. Two well-documented effects govern how attention is distributed:
- Lost in the Middle (below ~50% window fill): Models bias toward content at the beginning and the end of the context. Middle content gets less weight.
- Recency Bias (above ~50% window fill): As the window fills up, attention skews heavily toward the end. System prompts and custom instructions sitting at the beginning start getting effectively ignored.
The practical implications are significant:
- The beginning of context is prime real estate for instructions and goals.
- The end is where current work lives.
- The middle is where past work decays in influence.
- Just because you can fill the context window doesn't mean you should. Try to keep it under 60–70%.
- If you switch tasks mid-session, the model may revert to the original task, because that's where the strongest signal still lives.
- Above 50% fill, you start losing your own guardrails to recency bias.
The fix isn't compaction (which trades tokens for potential information loss). It's a new context window per task: use /clear liberally, divide work into discrete sessions, and don't let conversations sprawl.
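As a back-of-the-envelope aid, the ¾-word rule of thumb and the fill guideline can be turned into a quick check. The function names and the 60% default are assumptions for this sketch. Real token counts depend on the model's tokenizer, so this is an estimate only.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: one token is roughly 3/4 of an English word


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def window_fill(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window already in use."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens


def needs_fresh_session(used_tokens: int, window_tokens: int, max_fill: float = 0.6) -> bool:
    """True once the session has grown past the chosen fill threshold."""
    return window_fill(used_tokens, window_tokens) > max_fill


# Hypothetical session: about 105,000 words of transcript in a 200K-token window.
session_tokens = estimate_tokens(105_000)
print(session_tokens, needs_fresh_session(session_tokens, 200_000))
```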
Part 3: Quality and Token Controls — The Practical Playbook
Now to the controls themselves, ordered roughly by leverage.
Where You Are on the Maturity Curve Matters
Two archetypes exist on the agent maturity spectrum:
- AI-assisted engineers work mostly synchronously with one agent at a time. If you're sending ten agents per day and spending $20/month, saving 50% on tokens just gets you to $10. The juice isn't worth the squeeze.
- AI engineers orchestrate fleets of asynchronous agents. Every percentage point compounds across hundreds of runs. The compound error problem hits hardest here, and optimization pays back enormously.
Calibrate effort accordingly.
The Two Biggest Levers
Two controls vastly outweigh everything else: model choice and relevant context.
Model choice is the single highest-leverage decision. The cost gap between top-tier reasoning models (Claude Opus 4.7) and small models (GPT-5.4 mini) is roughly 24x. Match the model to the task:
- Reasoning models (Opus, GPT-5.5) for synchronous planning, architecture, debugging, and any work involving large context.
- Mid-tier models (Sonnet, GPT-5.4) for asynchronous implementation work.
- Low-tier models (Haiku, GPT-mini) for small refactors, repetitive tasks, and documentation updates.
A reasoning model on a trivial task isn't just expensive — it can actively make things worse, second-guessing tight specifications and "going rogue." Conversely, a small model on a planning task will produce shallow, brittle output.
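The tiering above can be written down as a small lookup so the choice is explicit instead of habitual. The task labels, the `Tier` enum, and the rule that large context always escalates to a reasoning model are assumptions made for this sketch. They are not part of any harness API.

```python
from enum import Enum


class Tier(Enum):
    REASONING = "reasoning"
    MID = "mid-tier"
    LOW = "low-tier"


# Hypothetical task labels mapped to the tiers described above.
TASK_TIERS = {
    "planning": Tier.REASONING,
    "architecture": Tier.REASONING,
    "debugging": Tier.REASONING,
    "implementation": Tier.MID,
    "small_refactor": Tier.LOW,
    "repetitive_task": Tier.LOW,
    "documentation_update": Tier.LOW,
}


def pick_tier(task_kind: str, large_context: bool = False) -> Tier:
    """Pick a model tier for a task; large-context work goes to a reasoning model."""
    if large_context:
        return Tier.REASONING
    if task_kind not in TASK_TIERS:
        raise ValueError(f"unknown task kind: {task_kind!r}")
    return TASK_TIERS[task_kind]


print(pick_tier("planning").value, pick_tier("documentation_update").value)
```

Raising on unknown task kinds is a deliberate choice here: it forces a decision instead of silently defaulting to the most expensive model.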
Auto Mode (rolling out from June) detects task intent and selects the model for you. It's the lazy default for anyone who doesn't want to think about it — and it's usually right.
Relevant context is the other half of the equation. Don't stuff prompts with "might need" information. Let the agent discover what it needs. Compacting sessions trades tokens for potential info loss — use it cautiously. And use /clear often — tokens don't carry across sessions, so a clean slate is free.
Your Prompt
The prompt is always-on. It sits at the beginning of the context window, so it has outsized influence: the lost-in-the-middle effect favors content at the start of the context.
A few rules:
- Don't optimize prompts for fewer tokens. Optimize them to steer correctly.
- Be precise and descriptive. "Fix the bug" is useless. "Issue #45 describes a bug where X happens — fix it" actually goes somewhere.
- Add stop signals. Phrases like "Stop after you've written the fix. Do not commit or push." prevent agents from running past your intent.
- Add known context upfront. Relevant file paths, doc URLs, skills to invoke. Don't make the agent rediscover what you already know.
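One way to make these rules habitual is to assemble prompts from named parts. The `AgentPrompt` class below is a hypothetical helper, and the file path and skill name in the example are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPrompt:
    task: str                                               # precise, descriptive instruction
    known_context: list[str] = field(default_factory=list)  # paths, URLs, skills you already know
    stop_signals: list[str] = field(default_factory=list)   # explicit boundaries for the agent

    def render(self) -> str:
        lines = [self.task]
        if self.known_context:
            lines.append("Relevant context:")
            lines.extend(f"- {item}" for item in self.known_context)
        lines.extend(self.stop_signals)
        return "\n".join(lines)


prompt = AgentPrompt(
    task="Issue #45 describes a bug where X happens. Fix it.",
    known_context=["src/example/module.py (hypothetical path)", "skill: tdd (hypothetical)"],
    stop_signals=["Stop after you've written the fix. Do not commit or push."],
)
print(prompt.render())
```

The rendered prompt is longer than "Fix the bug", and that is the point: the extra tokens steer the agent.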
Divide and Conquer: Research → Plan → Implement
A single context window doing research, planning, and implementation drags irrelevant files and stale reasoning through every phase. Quality degrades.
The pattern that works:
- Research (e.g., Gemini 2.5 Pro): "I want to change X. What files are relevant?"
- Plan (e.g., Opus 4.7): Take the research output and produce a precise specification.
- Implement (e.g., GPT-5.4, often in parallel): Multiple agents split by architecture layer (frontend, backend, database) with clearly defined contracts between them.
Each phase gets a fresh context window. The spec is the artifact that carries information across the boundary — clean, distilled, free of noise. This saves both time and tokens, and produces far higher-quality output than one monolithic session.
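The shape of that pipeline can be sketched as follows. `run_agent` stands in for whatever call starts a fresh agent session in your harness. It is an assumption of this sketch, as are the model labels and the example goal. The implementation phase runs sequentially here for simplicity, although the pattern above often runs it in parallel.

```python
from typing import Callable

# Stand-in for a harness call: (model label, prompt for a fresh session) -> output text.
AgentFn = Callable[[str, str], str]


def research_plan_implement(goal: str, layers: list[str], run_agent: AgentFn) -> dict[str, str]:
    """Run three phases, each in a fresh context; only distilled artifacts cross boundaries."""
    relevant_files = run_agent(
        "research-model", f"I want to change: {goal}. What files are relevant?"
    )
    spec = run_agent(
        "planning-model",
        f"Goal: {goal}\nRelevant files:\n{relevant_files}\nWrite a precise specification.",
    )
    # Implementation agents see only the spec, never the research transcript.
    return {
        layer: run_agent("implementation-model", f"Implement the {layer} part of this spec:\n{spec}")
        for layer in layers
    }


# A fake agent that records calls, so the sketch runs without any real model.
calls: list[tuple[str, str]] = []


def fake_agent(model: str, prompt: str) -> str:
    calls.append((model, prompt))
    return f"<{model} output #{len(calls)}>"


results = research_plan_implement(
    "add rate limiting (hypothetical goal)", ["frontend", "backend", "database"], fake_agent
)
print(len(calls), sorted(results))
```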
Deterministic Controls: The Compound Error Antidote
Tests, linters, security scanners, type checkers — anything code-enforced and deterministic — are essential context engineering tools. A test either fails or passes. There's no probability. Every passing test resets the compounding error rate to zero for the property it covers.
The contrast is stark:
- With tests: buggy change → failing test → correction → passing test. Done.
- Without tests: buggy change → buggy change on top of it → another one → incident → debugging session → burned CI/CD minutes, review cycles, human time.
The Copilot CLI team ships roughly 500 PRs per week. Roughly 53% of their codebase is tests. That's not overhead — that's the moat that lets them move that fast without burning down the production system.
Cheap in the short term means expensive in the medium term. Guardrails pay back many times over.
Agent Configs: The Context Engineering Surface
Modern agent harnesses pick up a stack of markdown files automatically. These are the surface you work with as a context engineer:
- Persistent instructions: copilot-instructions.md. Always loaded.
- Custom agents: ./.github/agents/*.agent.md. Role-based, manually invoked.
- Skills: ./.github/skills/*/SKILL.md. Conditionally loaded.
- MCPs: external tool integrations.
- Subagents: separate context windows spawned by the main session.
- Scoped instructions: ./.github/instructions/*.instructions.md. Path-pattern based.
- Prompt files: ./.github/prompts/*.prompt.md. Manual starting points.
- Copilot Memory: small always-on instructions learned from behavior.
Each has a place. Let's go through the high-leverage ones.
Persistent Instructions
These are your always-on guidance, the proactive human-in-the-loop signal. Three things belong in them:
- Project non-negotiables (architecture rules, conventions that can't be inferred).
- A log of recurring agent misses (wrong test framework? wrong build command? Write it down.).
- Output-trimming statements ("be concise"). Output tokens are the most expensive — trim them aggressively.
Critical rules: keep them small, don't use AI to generate them, and recreate them often. Research shows that "be concise" performs nearly as well as a 50-line "caveman" skill. AI-generated instructions bloat. Write them yourself, iterate, throw them away. The Copilot CLI team rewrites their entire instructions file every three months as a living document.
Custom Agents
A custom agent forces the model into a specific role or workflow — for example, a /tdd-red agent that only writes failing tests. The harness retrieves the agent file, injects the definition, restricts the available tools, and appends your prompt.
The token savings are modest (input is cached). The real win is preventing wrong paths. Restricting an agent to read-only access on GitHub issues, for instance, eliminates an entire class of mistakes.
Skills
Skills are conditionally loaded markdown. The harness puts the description of every skill into context; the LLM tells the harness when it needs the full skill loaded.
Two pitfalls:
- Don't overdo it. Hundreds of skill descriptions bloat context for marginal benefit.
- Avoid redundant skills. A "React skill" is wasted if the model already knows React fluently. Skills should add capabilities the agent wouldn't otherwise have. And maintain them as models evolve — what was needed last year may be built-in now.
MCPs
MCPs add external tools and API calls. The harness offers tool descriptions to the LLM, which invokes them when needed.
Be rigorous. MCPs bloat tool descriptions and can lead to undesired tool calls. Deactivate MCPs you don't always need, or wrap them inside custom agents that scope when they're active.
The Playwright MCP is the canonical example: powerful for frontend work, but expensive (screenshots, page reads, full DOM parsing). If always-on, it triggers unnecessary work for trivial CSS changes. Pair it with a custom agent that only activates it when you're doing real UI work.
Subagents
A subagent opens a second context window for a specific task — research, document summarization, etc. — and returns a compact summary to the main session. This keeps the main context clean.
The trade-off: more tokens are spent inside the subagent. It's a conditional optimization. Use it when the alternative is polluting your main session with hundreds of irrelevant files.
The Rest
- Scoped instructions are useful in monoliths with distinct sections (e.g., one set of rules for the auth module, another for billing). Start with static persistent instructions first; reach for scoped only when needed.
- Prompt files are manually invoked, can trim the toolset, and serve as good standardized starting points. (Not supported in Copilot CLI at the moment.)
- Copilot Memory learns from your behavior automatically. Check it periodically to make sure it's learned the right things.
Power User Techniques
For orchestrators running hundreds or thousands of agents, additional levers exist — though they trade quality for token savings and require careful testing:
- Think in code. Prefer scripts to analyze files over feeding them to the LLM. A 200-line file analyzed by a Python script consumes near-zero tokens versus thousands in context.
- CLI over MCP. Models already know how to use tools like gh. A CLI invocation can be leaner than the equivalent MCP, because the model doesn't need static tool descriptions injected.
- Trim shell outputs. Tools like rtk strip CLI output down to agent-relevant information.
- Run /chronicle tip regularly in Copilot CLI to surface optimization opportunities from your actual session logs.
- Collapse tool calls. Plugins like copilot-codeact-plugin batch multiple calls into single operations.
- Model-specific context tweaks. Only worth it for fleet orchestrators with thousands of runs. Risky given how fast models change.
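As an illustration of "think in code", the sketch below summarizes a Python file with the standard-library `ast` module, so only a few lines of structured output need to reach the model instead of the whole file. The sample source and the summary fields are assumptions for this example.

```python
import ast


def summarize_python_source(source: str) -> dict:
    """Return a compact structural summary of Python source code."""
    tree = ast.parse(source)
    functions = sorted(
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    classes = sorted(node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef))
    return {"lines": len(source.splitlines()), "functions": functions, "classes": classes}


# Hypothetical file contents; in practice you would read them from disk.
SAMPLE = '''\
class Invoice:
    def total(self):
        return 0


def load_invoice(path):
    return Invoice()
'''

summary = summarize_python_source(SAMPLE)
print(summary)
```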
Part 4: Long-Term Guidance — The Skills That Will Matter
Zooming out from the tactical playbook, three durable traits separate developers who'll thrive in the agent era from those who won't.
Build Analytical Skills
Coding itself was never the true source of developer value. Analytical thinking and deep domain proficiency were. Agents can write code; they can't decide what should be built, in what domain language, with what trade-offs. The ability to tell an agent precisely what to do, in the language of the domain, is the most valuable skill. Invest there.
Apply Good Architecture
Domain-Driven Design, Hexagonal Architecture, CQRS, Event-Driven Design — these matter more now, not less. Good architecture:
- Makes agent discovery faster (clear file organization, predictable patterns).
- Provides guardrails that prevent agents from putting code in the wrong place.
- Reduces the fix/debug cycles that come from architectural drift.
The old debates — five-line functions versus ten, semicolons, comment style — are noise. Architecture is signal.
Iterate on Prompts and Agent Configs
Treat this with an engineering mindset. Keep configs fresh. Treat every agent miss like an incident — log it, fix the underlying instruction or skill, prevent recurrence. Use /chronicle regularly in the CLI to surface patterns. This is continuous engineering work, not a one-time setup.
You are now a context engineer. That's the job.
Part 5: Five Things to Start Doing Today
If you take nothing else from this, take these five:
- Choose the right model for the right task. Reasoning models for planning and debugging; mid-tier for implementation; small models for trivial work. Let Auto Mode pick when in doubt.
- Provide clear guidance in your prompts. Be precise. Add stop signals. Provide known context upfront. Don't be terse for the sake of saving tokens.
- Research → Plan → Implement. Separate context windows per phase. Distill a precise spec between them. Parallelize implementation across architecture layers.
- Provide deterministic guardrails. Tests, linters, security scans — anything code-enforced. These reset the compound error rate.
- Maintain a concise, human-written copilot-instructions.md. Use it as an agent-miss log and to trim outputs. Keep it small. Rewrite it often. Don't let AI generate it.
Summary
The whole discipline reduces to one principle:
Provide as little context as possible, but as much as required.
Token cost optimization isn't really about tokens. It's about quality, precision, and engineering rigor applied to a new substrate. The teams that internalize this — that stop counting tokens and start making every token count — will out-ship, out-quality, and out-economize everyone still gambling with cheap agents.
I'm happy to answer your questions and to help your team or organization with agent quality and token optimization techniques. Send me a message on LinkedIn.