A practitioner's guide to getting more value out of AI coding: agent quality & token optimization
Maxim Salnik · 2026-05-26 · via DEV Community

Introduction: The Wrong Question

GitHub's shift from premium requests to usage-based billing has triggered a wave of anxiety across engineering teams. The question echoing through Slack channels and leadership meetings is some variation of: "How do we reduce our token spend?"

It's the wrong question.

Focusing purely on cost diminishes the value you get from agents. A better framing is: "How do we get the most out of the tokens we spend?" That subtle reframing changes everything — from how you write prompts, to which model you reach for, to how you architect your codebase, to how you organize your team's workflows.

This article walks through the full case for quality-first token optimization, the foundational mental models you need to reason about it, and the concrete controls and techniques that move the needle.


Part 1: Why Agent Quality Is the Better Lens

Agent Gambling Is No Longer Sustainable

When tokens were effectively free, agent accuracy didn't really matter. The dominant pattern became what's best described as "agent gambling": throw together a lazy prompt with minimal context, fire off an agent, and if it fails, fire off another one. Think of it as the NASA Artemis problem in reverse — if rockets were cheap, you'd send 20 in the general direction of the moon and hope one lands.

That worked when each developer ran a handful of agents per day. It stops working the moment developers — and especially AI engineers orchestrating fleets — are running dozens or hundreds of agents per day. The economics invert. The cost of misfires dwarfs the cost of doing the work properly.

The fix isn't to send fewer rockets blindly. It's to make sure each rocket actually lands. Higher per-agent quality means fewer retries, fewer wasted tokens, and better ROI on every dollar of usage.

The ROI Mental Model

The guiding equation for thinking about agent economics:

Agent ROI = (Value of Agent Output − Token Cost) / Token Cost × 100%

You can't calculate this precisely, but it's a directionally useful lens. Two things follow immediately:

  1. Optimizing cost when value is zero is meaningless. Cutting your spend by 50% on outputs that don't ship anything useful is just losing money more slowly.
  2. Increasing value often means decreasing tokens. Developers routinely stuff irrelevant text into prompts, let conversations compound with stale context, and pile on documentation the model doesn't need. Trimming that context usually improves both quality and cost simultaneously. They're the same lever.
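As a rough illustration of the equation above, here is a minimal Python sketch. The helper name `agent_roi` and the dollar figures are made-up placeholders, not measurements; the point is only that halving the cost of a worthless run leaves its ROI unchanged.

```python
def agent_roi(output_value: float, token_cost: float) -> float:
    """Return agent ROI as a percentage: (value - cost) / cost * 100."""
    if token_cost <= 0:
        raise ValueError("token_cost must be positive")
    return (output_value - token_cost) / token_cost * 100


# Illustrative numbers only: a run that ships nothing useful has the same
# ROI whether it cost $4 or $2.
print(agent_roi(0.0, 4.0))   # -100.0
print(agent_roi(0.0, 2.0))   # -100.0
print(agent_roi(30.0, 4.0))  # 650.0
```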

The Compound Error Problem

Here's the math that should haunt anyone running multi-step agent workflows: errors compound multiplicatively.

  • At 99% accuracy per step, a 50-step workflow lands at ~60% overall success.
  • At 95% accuracy per step, the same workflow drops to ~8%.

LLMs are non-deterministic. Every step in an inner agent loop, every hop in an orchestrated workflow, every tool call — they all multiply against each other. This means every percentage point of per-step quality buys you a disproportionate improvement in overall reliability. And every miss isn't just a wasted token call — it triggers fix cycles, review overhead, reruns, debugging sessions, and burned human attention.
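The two figures above can be reproduced with a few lines of Python. This sketch assumes every step succeeds or fails independently with the same probability, which is a simplification; `workflow_success` is a hypothetical helper name.

```python
def workflow_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


for accuracy in (0.99, 0.95):
    rate = workflow_success(accuracy, 50)
    print(f"{accuracy:.0%} per step over 50 steps: {rate:.1%}")
# 99% per step over 50 steps: 60.5%
# 95% per step over 50 steps: 7.7%
```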

The takeaway: apply the same "shift-left" mindset to agents that you apply to quality, testing, and security in traditional engineering.

The Mantra

The whole philosophy collapses into one line worth pinning to your monitor:

Instead of counting tokens, make every token count.

Reduce token usage as a consequence of pursuing quality — not as a goal in itself. Send fewer, better-targeted rockets. The fuel savings follow automatically.


Part 2: Foundations — LLMs, Agents, and Context Windows

Before you can optimize anything, you need to internalize a few mechanical truths about how this technology actually works.

LLMs Are Pure Word Probability Machines

Strip away the marketing and what you have is a text-in, text-out system that predicts the next word given an input plus the patterns from its training data. When you type "GitHub Copilot is the world's most widely…" the model assigns probabilities to candidate next words — used, adopted, deployed, and so on — and picks one. In a coding context, it's predicting the next instruction.

Models have gotten dramatically better, but the underlying mechanism hasn't changed. This matters because the math doesn't distinguish hallucination from fact. A made-up function name and a real one occupy the same probability space. The model isn't "lying" when it hallucinates — it's just doing what it always does with insufficient signal.

The Core Principle of Context

Which leads to the single most important principle in this entire discipline:

Provide as little context as possible, but as much as required.

Two failure modes flank this principle:

  • Too much context biases the model toward irrelevant patterns and dilutes the signal. Stuffing in five files when one is relevant makes the model worse, not better.
  • Too little context forces the model to hallucinate to fill gaps. It has to predict something, and without grounding, it guesses.

Context engineering — the discipline of finding that sweet spot — is the fundamental skill of working with agents.

What an Agent Actually Is

An agent is not magic. It's an app — code that sits between you and the LLM. The architecture is simple:

You and your project ↔ The agent (harness) ↔ The LLM

Harnesses are things like VS Code Chat, Copilot CLI, Copilot Cloud Agent, Claude Code, OpenAI Codex. Models are things like GPT-5.5, Claude Opus 4.7, Gemini Pro. The harness is the orchestrator; the model is the inference engine.

Two things are crucial to understand here:

  1. The LLM is stateless. What feels like a "conversation" is actually the harness re-sending every prior input and output on every turn. There's no memory inside the model — only an ever-growing transcript being shipped back and forth.
  2. Tokens compound. Every loop drags the previous loops along with it. Your levers are the things that go into the context: your prompt, the files you reference, and the agent configs (instructions, skills, MCPs) the harness injects.
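To make the compounding concrete, here is a small simulation of a harness that re-sends the full transcript on every turn. It is a sketch under simplifying assumptions: the turn sizes are invented, and it ignores system prompts, injected configs, and any input caching a real harness may apply.

```python
def cumulative_input_tokens(turn_sizes: list[tuple[int, int]]) -> int:
    """Total input tokens sent to the model across a session.

    Each tuple is (prompt_tokens, response_tokens) for one turn. On every
    turn the harness sends the whole transcript so far plus the new prompt.
    """
    transcript = 0
    total_input = 0
    for prompt_tokens, response_tokens in turn_sizes:
        total_input += transcript + prompt_tokens
        transcript += prompt_tokens + response_tokens
    return total_input


# Ten identical turns: 500-token prompt, 1,500-token response (made-up sizes).
session = [(500, 1500)] * 10
print(cumulative_input_tokens(session))           # 95000
print(cumulative_input_tokens(session[:1]) * 10)  # 5000 if every turn started fresh
```

In this toy model, ten turns in one session send 19 times the input of ten fresh single-turn sessions, which is why the contents of each loop matter more than the size of any single prompt.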

Context Window Mechanics

A token is roughly ¾ of an English word. Smaller models offer 50K–200K token windows; larger ones like Opus and GPT-5.5 push toward 1M tokens. For scale: 1M tokens is roughly the entire Lord of the Rings trilogy plus The Hobbit.

Don't obsess over token counting at the character level. Think at the level of prompts, files, and responses — those are the units that compound on each loop.

Context Rot: The Hidden Failure Mode

Even with a huge window, models don't treat all positions equally. Two well-documented effects govern how attention is distributed:

  • Lost in the Middle (below ~50% window fill): Models bias toward content at the beginning and the end of the context. Middle content gets less weight.
  • Recency Bias (above ~50% window fill): As the window fills up, attention skews heavily toward the end. System prompts and custom instructions sitting at the beginning start getting effectively ignored.

The practical implications are significant:

  • The beginning of context is prime real estate for instructions and goals.
  • The end is where current work lives.
  • The middle is where past work decays in influence.
  • Just because you can fill the context window doesn't mean you should. Try to keep it under 60–70%.
  • If you switch tasks mid-session, the model may revert to the original task, because that's where the strongest signal still lives.
  • Above 50% fill, you start losing your own guardrails to recency bias.

The fix isn't compaction (which trades tokens for potential information loss). It's to start a new context window per task: clear liberally, divide work into discrete sessions, and don't let conversations sprawl.
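A minimal sketch of how you might keep an eye on window fill, using the ¾-word-per-token rule of thumb from above. The function names and the 50% default threshold are illustrative assumptions; a real harness would report exact token counts rather than estimates.

```python
def estimate_tokens(word_count: int) -> int:
    """Rough token estimate using ~3/4 of an English word per token."""
    return round(word_count / 0.75)


def window_fill(tokens_used: int, window_size: int) -> float:
    """Fraction of the context window currently occupied."""
    return tokens_used / window_size


def should_start_fresh(tokens_used: int, window_size: int, threshold: float = 0.5) -> bool:
    """True once fill passes the threshold where recency bias starts to bite."""
    return window_fill(tokens_used, window_size) > threshold


used = estimate_tokens(90_000)               # 90K words of transcript
print(used)                                  # 120000
print(should_start_fresh(used, 200_000))     # True (60% full)
print(should_start_fresh(used, 1_000_000))   # False (12% full)
```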


Part 3: Quality and Token Controls — The Practical Playbook

Now to the controls themselves, ordered roughly by leverage.

Where You Are on the Maturity Curve Matters

Two archetypes exist on the agent maturity spectrum:

  • AI-assisted engineers work mostly synchronously with one agent at a time. If you're sending ten agents per day and spending $20/month, saving 50% on tokens just gets you to $10. The juice isn't worth the squeeze.
  • AI engineers orchestrate fleets of asynchronous agents. Every percentage point compounds across hundreds of runs. The compound error problem hits hardest here, and optimization pays back enormously.

Calibrate effort accordingly.

The Two Biggest Levers

Two controls vastly outweigh everything else: model choice and relevant context.

Model choice is the single highest-leverage decision. The cost gap between top-tier reasoning models (Claude Opus 4.7) and small models (GPT-5.4 mini) is roughly 24x. Match the model to the task:

  • Reasoning models (Opus, GPT-5.5) for synchronous planning, architecture, debugging, and any work involving large context.
  • Mid-tier models (Sonnet, GPT-5.4) for asynchronous implementation work.
  • Low-tier models (Haiku, GPT-mini) for small refactors, repetitive tasks, and documentation updates.

A reasoning model on a trivial task isn't just expensive — it can actively make things worse, second-guessing tight specifications and "going rogue." Conversely, a small model on a planning task will produce shallow, brittle output.
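The matching rules above could be written down as a small lookup. This is a sketch, not how Auto Mode or any harness actually routes: the task labels are hypothetical, and defaulting unknown tasks to the mid tier is an assumption.

```python
from enum import Enum


class Tier(Enum):
    REASONING = "reasoning"  # e.g. Opus, GPT-5.5
    MID = "mid"              # e.g. Sonnet, GPT-5.4
    LOW = "low"              # e.g. Haiku, GPT-mini


# Hypothetical task labels mapped to the tiers listed above.
TASK_TIERS = {
    "planning": Tier.REASONING,
    "architecture": Tier.REASONING,
    "debugging": Tier.REASONING,
    "implementation": Tier.MID,
    "small_refactor": Tier.LOW,
    "repetitive": Tier.LOW,
    "docs_update": Tier.LOW,
}


def pick_tier(task_type: str, large_context: bool = False) -> Tier:
    """Pick a model tier; large-context work goes to a reasoning model."""
    if large_context:
        return Tier.REASONING
    # Unknown task types fall back to mid-tier (an assumption, not a rule from the text).
    return TASK_TIERS.get(task_type, Tier.MID)


print(pick_tier("docs_update"))                      # Tier.LOW
print(pick_tier("docs_update", large_context=True))  # Tier.REASONING
```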

Auto Mode (rolling out from June) detects task intent and selects the model for you. It's the lazy default for anyone who doesn't want to think about it — and it's usually right.

Relevant context is the other half of the equation. Don't stuff prompts with "might need" information. Let the agent discover what it needs. Compacting sessions trades tokens for potential info loss — use it cautiously. And use /clear often — tokens don't carry across sessions, so a clean slate is free.

Your Prompt

The prompt is always-on. It sits at the beginning of the context window and has outsized influence due to lost-in-the-middle effects.

A few rules:

  • Don't optimize prompts for fewer tokens. Optimize them to steer correctly.
  • Be precise and descriptive. "Fix the bug" is useless. "Issue #45 describes a bug where X happens — fix it" actually goes somewhere.
  • Add stop signals. Phrases like "Stop after you've written the fix. Do not commit or push." prevent agents from running past your intent.
  • Add known context upfront. Relevant file paths, doc URLs, skills to invoke. Don't make the agent rediscover what you already know.
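One way to make these rules habitual is a tiny prompt template. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, and the file path in the example is invented.

```python
def build_prompt(task: str, known_context: list[str], stop_signals: list[str]) -> str:
    """Assemble a prompt: precise task first, then known context, then stop signals."""
    lines = [task.strip()]
    if known_context:
        lines.append("")
        lines.append("Known context:")
        lines.extend(f"- {item}" for item in known_context)
    if stop_signals:
        lines.append("")
        lines.extend(stop_signals)
    return "\n".join(lines)


prompt = build_prompt(
    task="Issue #45 describes a bug where X happens. Fix it.",
    known_context=["src/billing/invoice.py (hypothetical path)"],
    stop_signals=["Stop after you've written the fix. Do not commit or push."],
)
print(prompt)
```

The template spends a few extra tokens on purpose: the goal is steering, not brevity.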

Divide and Conquer: Research → Plan → Implement

A single context window doing research, planning, and implementation drags irrelevant files and stale reasoning through every phase. Quality degrades.

The pattern that works:

  1. Research (e.g., Gemini 2.5 Pro): "I want to change X. What files are relevant?"
  2. Plan (e.g., Opus 4.7): Take the research output and produce a precise specification.
  3. Implement (e.g., GPT-5.4, often in parallel): Multiple agents split by architecture layer (frontend, backend, database) with clearly defined contracts between them.

Each phase gets a fresh context window. The spec is the artifact that carries information across the boundary — clean, distilled, free of noise. This saves both time and tokens, and produces far higher-quality output than one monolithic session.
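The hand-off between phases can be sketched as plain function composition. Here an "agent" is any callable that takes a prompt and returns text; how it wraps a real harness is assumed, not shown, and the implementers run one after another rather than in parallel to keep the example short.

```python
from typing import Callable

# Assumption: each call to an Agent starts with a fresh context window.
Agent = Callable[[str], str]


def research_plan_implement(
    goal: str,
    researcher: Agent,
    planner: Agent,
    implementers: dict[str, Agent],
) -> dict[str, str]:
    """Run three phases, passing only the distilled artifact between them."""
    # Phase 1: only the goal goes in.
    relevant_files = researcher(f"I want to change: {goal}. What files are relevant?")
    # Phase 2: only the goal and the research output go in.
    spec = planner(
        f"Goal: {goal}\nRelevant files:\n{relevant_files}\nWrite a precise spec."
    )
    # Phase 3: one context per layer; each sees the spec, not earlier transcripts.
    return {
        layer: agent(f"Implement the {layer} part of this spec:\n{spec}")
        for layer, agent in implementers.items()
    }
```

The implementers never see the research transcript, only the spec, which is the property the pattern depends on.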

Deterministic Controls: The Compound Error Antidote

Tests, linters, security scanners, type checkers — anything code-enforced and deterministic — are essential context engineering tools. A test either fails or passes. There's no probability. Every passing test resets the compounding error rate to zero for the property it covers.

The contrast is stark:

  • With tests: buggy change → failing test → correction → passing test. Done.
  • Without tests: buggy change → buggy change on top of it → another one → incident → debugging session → burned CI/CD minutes, review cycles, human time.

The Copilot CLI team ships roughly 500 PRs per week. Roughly 53% of their codebase is tests. That's not overhead — that's the moat that lets them move that fast without burning down the production system.

Cheap in the short term means expensive in the medium term. Guardrails pay back many times over.
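A minimal sketch of the "with tests" loop: the agent's change is only accepted once deterministic checks pass, and the failure output is fed back on each retry. The `attempt_change` callable stands in for whatever invokes your agent, and the check commands (for example `["pytest", "-q"]`) are assumptions about your project.

```python
import subprocess
from typing import Callable


def run_checks(commands: list[list[str]]) -> tuple[bool, str]:
    """Run each check command; return (all_passed, output of the first failure)."""
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, ""


def gated_change(
    attempt_change: Callable[[str], None],
    commands: list[list[str]],
    max_attempts: int = 3,
) -> bool:
    """Let an agent retry until the deterministic checks pass or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        attempt_change(feedback)  # agent edits the code, seeing the last failure
        passed, feedback = run_checks(commands)
        if passed:
            return True
    return False
```

The attempt cap matters as much as the gate: without it, a failing check turns back into agent gambling.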

Agent Configs: The Context Engineering Surface

Modern agent harnesses pick up a stack of markdown files automatically. These are the surface you work with as a context engineer:

  • Persistent instructions: copilot-instructions.md. Always loaded.
  • Custom agents: .github/agents/*.agent.md. Role-based, manually invoked.
  • Skills: .github/skills/*/skill.md. Conditionally loaded.
  • MCPs: external tool integrations.
  • Subagents: separate context windows spawned by the main session.
  • Scoped instructions: .github/instructions/*.instructions.md. Path-pattern based.
  • Prompt files: .github/prompts/*.prompt.md. Manual starting points.
  • Copilot Memory: small always-on instructions learned from behavior.

Each has a place. Let's go through the high-leverage ones.

Persistent Instructions

These are your always-on guidance, the proactive human-in-the-loop signal. Three things belong in them:

  1. Project non-negotiables (architecture rules, conventions that can't be inferred).
  2. A log of recurring agent misses (wrong test framework? wrong build command? Write it down.).
  3. Output-trimming statements ("be concise"). Output tokens are the most expensive — trim them aggressively.

Critical rules: keep them small, don't use AI to generate them, and recreate them often. Research shows that "be concise" performs nearly as well as a 50-line "caveman" skill. AI-generated instructions bloat. Write them yourself, iterate, throw them away. The Copilot CLI team rewrites their entire instructions file every three months as a living document.

Custom Agents

A custom agent forces the model into a specific role or workflow — for example, a /tdd-red agent that only writes failing tests. The harness retrieves the agent file, injects the definition, restricts the available tools, and appends your prompt.

The token savings are modest (input is cached). The real win is preventing wrong paths. Restricting an agent to read-only access on GitHub issues, for instance, eliminates an entire class of mistakes.

Skills

Skills are conditionally loaded markdown. The harness puts the description of every skill into context; the LLM tells the harness when it needs the full skill loaded.

Two pitfalls:

  • Don't overdo it. Hundreds of skill descriptions bloat context for marginal benefit.
  • Avoid redundant skills. A "React skill" is wasted if the model already knows React fluently. Skills should add capabilities the agent wouldn't otherwise have. And maintain them as models evolve — what was needed last year may be built-in now.
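The loading behavior described above can be modeled in a few lines. This is a toy model of the idea, not the actual harness implementation; the `Skill` class, function names, and the example skill are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str  # always placed in context
    body: str         # loaded only when the model asks for it


def initial_context(skills: list[Skill]) -> str:
    """What gets injected up front: one description line per skill."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)


def load_skill(skills: list[Skill], requested: str) -> str:
    """Return the full body once the model requests a skill by name."""
    for skill in skills:
        if skill.name == requested:
            return skill.body
    raise KeyError(f"unknown skill: {requested}")


skills = [Skill("release-notes", "How this repo formats release notes.", "Step 1 ...")]
print(initial_context(skills))  # release-notes: How this repo formats release notes.
```

Every description line is paid on every turn, so the cost of a skill library scales with the number of skills even when none is loaded.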

MCPs

MCPs add external tools and API calls. The harness offers tool descriptions to the LLM, which invokes them when needed.

Be rigorous. MCPs bloat tool descriptions and can lead to undesired tool calls. Deactivate MCPs you don't always need, or wrap them inside custom agents that scope when they're active.

The Playwright MCP is the canonical example: powerful for frontend work, but expensive (screenshots, page reads, full DOM parsing). If always-on, it triggers unnecessary work for trivial CSS changes. Pair it with a custom agent that only activates it when you're doing real UI work.

Subagents

A subagent opens a second context window for a specific task — research, document summarization, etc. — and returns a compact summary to the main session. This keeps the main context clean.

The trade-off: more tokens are spent inside the subagent. It's a conditional optimization. Use it when the alternative is polluting your main session with hundreds of irrelevant files.
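The trade-off can be put into rough numbers. The sketch below is a deliberately simple accounting model with invented sizes: it counts one read of each file and ignores the re-sending of the main context on later turns, which is where the smaller footprint tends to pay off.

```python
def research_task_tokens(
    file_tokens: list[int], summary_tokens: int, use_subagent: bool
) -> dict[str, int]:
    """Tokens added to the main session vs. spent overall for one research task."""
    read_cost = sum(file_tokens)
    if use_subagent:
        # The subagent reads everything in its own window; only the summary comes back.
        return {"main_context": summary_tokens, "total_spent": read_cost + summary_tokens}
    return {"main_context": read_cost, "total_spent": read_cost}


files = [2_000] * 40  # forty files of ~2K tokens each (made-up sizes)
print(research_task_tokens(files, 800, use_subagent=True))
# {'main_context': 800, 'total_spent': 80800}
print(research_task_tokens(files, 800, use_subagent=False))
# {'main_context': 80000, 'total_spent': 80000}
```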

The Rest

  • Scoped instructions are useful in monoliths with distinct sections (e.g., one set of rules for the auth module, another for billing). Start with static persistent instructions first; reach for scoped only when needed.
  • Prompt files are manually invoked, can trim the toolset, and serve as good standardized starting points. (Not supported in Copilot CLI at the moment.)
  • Copilot Memory learns from your behavior automatically. Check it periodically to make sure it's learned the right things.

Power User Techniques

For orchestrators running hundreds or thousands of agents, additional levers exist — though they trade quality for token savings and require careful testing:

  • Think in code. Prefer scripts to analyze files over feeding them to the LLM. A 200-line file analyzed by a Python script consumes near-zero tokens versus thousands in context.
  • CLI over MCP. Models already know how to use tools like gh. A CLI invocation can be leaner than the equivalent MCP, because the model doesn't need static tool descriptions injected.
  • Trim shell outputs. Tools like rtk strip CLI output down to agent-relevant information.
  • Run /chronicle tip regularly in Copilot CLI to surface optimization opportunities from your actual session logs.
  • Collapse tool calls. Plugins like copilot-codeact-plugin batch multiple calls into single operations.
  • Model-specific context tweaks. Only worth it for fleet orchestrators with thousands of runs. Risky given how fast models change.
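As an example of "think in code", here is a sketch of a script an agent could run instead of reading a whole file into context. It is illustrative only: `summarize_python_file` is a hypothetical helper, and the regex only catches top-level `def` and `class` statements.

```python
import re
from pathlib import Path


def summarize_python_file(path: str) -> str:
    """Return a compact summary: line count plus top-level def/class names."""
    text = Path(path).read_text(encoding="utf-8")
    names = re.findall(r"^(?:def|class)\s+(\w+)", text, flags=re.MULTILINE)
    defined = ", ".join(names) or "nothing"
    return f"{path}: {len(text.splitlines())} lines; defines {defined}"
```

The agent then sees a one-line summary per file and can decide which files, if any, are worth opening in full.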

Part 4: Long-Term Guidance — The Skills That Will Matter

Zooming out from the tactical playbook, three durable traits separate developers who'll thrive in the agent era from those who won't.

Build Analytical Skills

Coding itself was never the true source of developer value. Analytical thinking and deep domain proficiency were. Agents can write code; they can't decide what should be built, in what domain language, with what trade-offs. The ability to tell an agent precisely what to do, in the language of the domain, is the most valuable skill. Invest there.

Apply Good Architecture

Domain-Driven Design, Hexagonal Architecture, CQRS, Event-Driven Design — these matter more now, not less. Good architecture:

  • Makes agent discovery faster (clear file organization, predictable patterns).
  • Provides guardrails that prevent agents from putting code in the wrong place.
  • Reduces the fix/debug cycles that come from architectural drift.

The old debates — five-line functions versus ten, semicolons, comment style — are noise. Architecture is signal.

Iterate on Prompts and Agent Configs

Treat this with an engineering mindset. Keep configs fresh. Treat every agent miss like an incident — log it, fix the underlying instruction or skill, prevent recurrence. Use /chronicle regularly in the CLI to surface patterns. This is continuous engineering work, not a one-time setup.

You are now a context engineer. That's the job.


Part 5: Five Things to Start Doing Today

If you take nothing else from this, take these five:

  1. Choose the right model for the right task. Reasoning models for planning and debugging; mid-tier for implementation; small models for trivial work. Let Auto Mode pick when in doubt.
  2. Provide clear guidance in your prompts. Be precise. Add stop signals. Provide known context upfront. Don't be terse for the sake of saving tokens.
  3. Research → Plan → Implement. Separate context windows per phase. Distill a precise spec between them. Parallelize implementation across architecture layers.
  4. Provide deterministic guardrails. Tests, linters, security scans — anything code-enforced. These reset the compound error rate.
  5. Maintain a concise, human-written copilot-instructions.md. Use it as an agent-miss log and to trim outputs. Keep it small. Rewrite it often. Don't let AI generate it.

Summary

The whole discipline reduces to one principle:

Provide as little context as possible, but as much as required.

Token cost optimization isn't really about tokens. It's about quality, precision, and engineering rigor applied to a new substrate. The teams that internalize this — that stop counting tokens and start making every token count — will out-ship, out-quality, and out-economize everyone still gambling with cheap agents.

I'm happy to answer your questions and to help your team or organization with agent quality and token optimization techniques. Send me a message on LinkedIn.