token-warden

A Claude Code plugin that makes coding agents measurably cheaper over time.

Most "agent memory" accumulates advice nobody ever verifies. token-warden treats agent memory as an engineering problem: every rule that wants space in an agent's context must prove, on a fixed benchmark, that it saves more tokens than it costs — or it gets evicted. The result is a per-agent memory file containing only rules with measured, positive return.

Measured, not vibes — every rule carries a token delta from real benchmark runs
Self-funding — rules must save ≥ 2× their own context rent to stay
Self-auditing — active rules are re-benchmarked round-robin and evicted when they stop earning
Zero session overhead — collection runs in a Stop hook that never blocks or fails your work

How it works
Getting started
Commands
The benchmark system
Architecture
The agents
Inter-agent approval gate
Design invariants
A recorded demonstration
Testing
Data layout
Security notes
Roadmap

How it works

The optimizer is a four-stage, feed-forward loop. Lessons are extracted from finished sessions and applied to future ones — past work is never re-done.

                  agent session (any project, any repo)
                                  │
                                  │  Stop hook · parses the transcript:
                                  │  tokens, tool calls, file re-reads, completion
                                  ▼
                ┌─────────────────────────────────────┐
                │  1 · COLLECT                        │
                │  one row per session in SQLite      │
                └─────────────────────────────────────┘
                                  │
                                  │  fires only when a run exceeds the
                                  │  agent's rolling p75 token cost
                                  ▼
                ┌─────────────────────────────────────┐
                │  2 · DISTILL                        │
                │  one haiku call over the waste      │
                │  stats → 0–2 candidate rules        │
                └─────────────────────────────────────┘
                                  │
                                  │  candidates wait in SQLite —
                                  │  never injected until measured
                                  ▼
                ┌─────────────────────────────────────┐
                │  3 · BENCH                          │
                │  golden suite on a frozen fixture,  │
                │  run with vs. without the candidate │
                └─────────────────────────────────────┘
                                  │
                                  │  measured delta vs. context rent
                                  ▼
                ┌─────────────────────────────────────┐
                │  4 · SELECT                         │
                │  keep if savings ≥ 2× rent, else    │
                │  evict · re-audit the oldest rule   │
                └─────────────────────────────────────┘
                                  │
                                  ▼
              ~/.claude/agent-memory/<agent>/MEMORY.md
        compiled wholesale from surviving rules and injected
            into the agent's system prompt next session

1 · Collect. Stop and SubagentStop hooks fire after every turn (main session and subagent work respectively) and parse the session transcript into one ledger row: input/output/cache tokens (deduplicated by API message id — the transcript repeats usage per streamed block), tool-call count, files read more than once, and whether the session completed. The hook is hard-capped under the 2-second budget, wraps every failure, and exits 0 regardless — it can never break your session.
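The deduplication step can be sketched as follows. This is a minimal illustration, not the plugin's parser: the transcript line shape (`message.id`, `message.usage.input_tokens`, `message.usage.output_tokens`) is an assumption, and `sumUsage` and `asNumber` are hypothetical names. It counts usage once per message id and skips malformed lines instead of failing.

```typescript
// Assumed shape of one transcript line. The real transcript format is not
// shown in this README, so every field name here is illustrative.
interface UsageTotals {
  input: number;
  output: number;
}

// Treat anything that is not a finite number as zero.
function asNumber(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) ? value : 0;
}

// Sum usage once per API message id. Streamed blocks of the same message
// repeat the usage object, so a repeat overwrites instead of adding.
function sumUsage(jsonl: string): UsageTotals {
  const byMessage = new Map<string, UsageTotals>();
  for (const raw of jsonl.split("\n")) {
    if (raw.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // tolerate malformed lines rather than failing the hook
    }
    const message = (parsed as { message?: unknown } | null)?.message;
    if (typeof message !== "object" || message === null) continue;
    const { id, usage } = message as { id?: unknown; usage?: unknown };
    if (typeof id !== "string" || typeof usage !== "object" || usage === null) continue;
    const u = usage as { input_tokens?: unknown; output_tokens?: unknown };
    byMessage.set(id, { input: asNumber(u.input_tokens), output: asNumber(u.output_tokens) });
  }
  const totals: UsageTotals = { input: 0, output: 0 };
  byMessage.forEach((u) => {
    totals.input += u.input;
    totals.output += u.output;
  });
  return totals;
}
```

Keying on the message id rather than on line position means the result does not depend on how many blocks a response was streamed in.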

2 · Distill. Collection is cheap, analysis is not, so analysis is rationed: only runs above the agent's rolling 75th-percentile cost (minimum 5 prior runs) are distilled. A single detached haiku-tier call receives the waste statistics plus an 8 KB action trace and must return strict JSON: at most two one-sentence, generalizable rules. Invalid output is dropped, never retried. Near-duplicates of any existing rule — including evicted ones — are rejected by trigram similarity, so a falsified rule cannot be re-proposed.
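A minimal sketch of the rationing rule, assuming a nearest-rank percentile and assuming the caller passes in whatever window of prior runs counts as "rolling" (the window size is not stated here). `percentile` and `shouldDistill` are hypothetical names.

```typescript
const MIN_PRIOR_RUNS = 5; // from the text: minimum 5 prior runs

// Nearest-rank percentile. This is one of several common definitions and
// is an assumption; the plugin may interpolate differently.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? 0;
}

// True when the current run costs more than the 75th percentile of the
// agent's prior runs and there is enough history to trust that figure.
function shouldDistill(priorTokens: number[], currentTokens: number): boolean {
  if (priorTokens.length < MIN_PRIOR_RUNS) return false;
  return currentTokens > percentile(priorTokens, 75);
}
```

With eight prior runs costing 10 to 80 (in steps of 10), the nearest-rank p75 is 60, so a run costing 61 triggers distillation and a run costing 60 does not.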

3 · Bench. Candidates are measured on a golden task suite against a frozen fixture repository (see The benchmark system). Each configuration runs the suite headlessly in a throwaway copy with the candidate compiled into a temporary, fully isolated agent memory.

4 · Select. A rule's verdict is the spec inequality: with delta = mean tokens saved per completed golden run and rent = the rule's own size in tokens, the rule goes active iff delta × sessions/week ≥ 2 × rent × sessions/week. The sessions/week factor appears on both sides and cancels, so the test reduces to delta ≥ 2 × rent. Failing a previously-passing task is instant eviction regardless of tokens. Every selector run also re-benchmarks the least-recently-audited active rule, because memory must keep earning its place. Survivors are compiled into MEMORY.md, which Claude Code injects into the agent's system prompt.
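The verdict can be written down in a few lines. This is a sketch of the inequality as stated above, with the sessions/week factor dropped because it appears on both sides; the `Measurement` type and `decide` function are hypothetical names, not the plugin's API.

```typescript
type Verdict = "active" | "evicted";

interface Measurement {
  delta: number;      // mean tokens saved per completed golden run
  rent: number;       // the rule's own size in tokens
  regressed: boolean; // a previously passing golden task now fails
}

const RENT_MULTIPLE = 2; // from the text: savings must be at least 2x rent

function decide(m: Measurement): Verdict {
  // A regression evicts immediately, whatever the token numbers say.
  if (m.regressed) return "evicted";
  return m.delta >= RENT_MULTIPLE * m.rent ? "active" : "evicted";
}
```

For a rule of 27 tokens, the threshold is 54: a measured delta of 622 keeps the rule, a delta of 50 evicts it.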

Getting started

Prerequisites

Node.js 22+
Claude Code v2.1+ (claude --version)
macOS or Linux (Windows via WSL — benchmarks need a POSIX shell)

1 · Clone and install

git clone https://github.com/vukkt/token-warden.git
cd token-warden
npm install        # the hooks run via the plugin's own tsx + better-sqlite3

2 · Load the plugin

For the current session:

claude --plugin-dir /path/to/token-warden

Or install persistently — this repository is also its own marketplace:

/plugin marketplace add vukkt/token-warden
/plugin install token-warden@vukkt-plugins

Marketplace installs are copied to ~/.claude/plugins/cache without node_modules. The Stop hook bootstraps its own dependencies on first run (one-time npm install, silent); collection begins from the second session at the latest.

3 · Verify collection

Work normally for a turn or two, then open the status report (/warden-status, described under Commands).

You should see a runs count for main. Every session in every project is now being measured into ~/.token-warden/warden.db.

4 · Freeze the baselines (one-time, ~20 min per agent)

npm run bench -- --agent all      # or one agent at a time

This runs each agent's three golden tasks twice and freezes run1_tokens — the permanent denominator of every future improvement claim. Do this once, before any rules exist.

5 · Let the loop run

Use the four subagents (frontend, backend, sql, testing) for real work. Expensive sessions distill into candidates automatically. When /warden-status shows candidates pending, measure them:

npx tsx src/select.ts --agent sql

Active rules land in the agent's memory; the next session starts cheaper.

Commands

Command	What it does
`/warden-status`	Read-only report: per-agent run/rule counts, suite total vs. frozen baseline (absolute + %), learning curve over time, active rules with measured deltas and provenance, recent evictions with reasons, real-work tokens by project, cross-agent question volume
`/warden-bench <agent|all> [--runs N] [--task id]`	Runs the golden suite, compares against `run1` and `best`, and reports benchmarking meta-cost (warns above 10% of the week's real-work tokens)
`/warden-select <agent> [--runs N] [--top-up N]`	Measures pending candidates, evicts or activates them, re-audits the oldest active rule, and recompiles the agent's memory
`/warden-modelbench <agent> --model <id> [--baseline <id>] [--runs N]`	Runs the agent's golden suite under two models (candidate vs. the agent's current model, rules held constant) and reports which uses fewer tokens for that workload
`/warden-promptbench <agent> --variant <file.md> [--runs N]`	Runs the agent's golden suite under two prompts (a variant agent definition vs. the shipped one, rules and model held constant) and reports which uses fewer tokens
`/warden-evolve <agent> [--runs N]`	Proposes a token-cheaper rewrite of the agent's prompt (model call), benchmarks it, and recommends it only if it provably wins — never auto-applied

When candidate rules are waiting, a lightweight SessionStart hook injects a one-line nudge into new sessions — selection itself always stays a user decision, because it spends real benchmark tokens.

Headless or when names collide, use the namespaced forms (/token-warden:warden-status). CLI equivalents:

npx tsx src/status.ts                              # status report
npm run bench -- --agent sql [--rule N]            # benchmark runner
npx tsx src/select.ts --agent sql                  # selector (measure + evict + compile)
npx tsx src/modelbench.ts --agent sql --model haiku  # compare a model against the agent's default
npx tsx src/promptbench.ts --agent sql --variant v.md  # compare a prompt variant against the shipped one
npx tsx src/evolve.ts --agent sql                      # propose + measure a cheaper prompt variant

The benchmark system

Measurement is only as good as its control variables. token-warden controls them aggressively:

The fixture (benchmarks/fixture/) is a small but realistic full-stack TypeScript project — Express routes → services → repositories over SQLite, a React admin UI, a partial vitest suite — frozen at Phase 2 and never modified, so baselines stay comparable across months. It ships with documented, deliberate flaws (BUGS.md, which agents never see: the benchmark runner excludes it from every copy) that the golden tasks target.

Golden tasks (benchmarks/<agent>/golden-NN.md) — three per agent, each a frontmatter file with a one-sentence prompt and a shell success_check (greps and/or a full vitest run). A run only counts as completed if its check passes: a cheap failed run is worse than an expensive successful one, and incomplete runs are excluded from all savings math.

A benchmark run, end to end:

Copy the fixture to a temp dir (node_modules symlinked; BUGS.md excluded).
Install the agent definition into the copy with its memory scope rewritten to project, so the compiled MEMORY.md under test resolves inside the temp dir — real agent memory is never read or written by benchmarks.
Compile the rule set under test (active rules ± one candidate) into that memory.
Run claude -p --agent <name> headlessly with scoped permissions: acceptEdits plus a Bash allowlist of test commands only — never bypassPermissions.
Run the success_check; parse the transcript; record one runs row.
First-ever completed run per (agent, task) freezes baselines.run1_tokens forever; later completed runs only ratchet best_tokens downward.
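The last step, freezing and ratcheting, is easy to get subtly wrong, so here is a minimal sketch of it. The `Baseline` and `RunResult` types and the `updateBaseline` function are hypothetical; only the behavior (freeze `run1_tokens` once, move `best_tokens` downward only, ignore incomplete runs) comes from the steps above.

```typescript
interface Baseline {
  run1Tokens: number; // frozen at the first completed run, never updated
  bestTokens: number; // only ever moves downward
}

interface RunResult {
  tokens: number;
  completed: boolean; // true when the task's success_check passed
}

// Returns the baseline for one (agent, task) pair after recording a run.
function updateBaseline(current: Baseline | undefined, run: RunResult): Baseline | undefined {
  if (!run.completed) return current; // incomplete runs never touch baselines
  if (current === undefined) {
    return { run1Tokens: run.tokens, bestTokens: run.tokens };
  }
  return {
    run1Tokens: current.run1Tokens,
    bestTokens: Math.min(current.bestTokens, run.tokens),
  };
}
```

Because the first completed run fixes the denominator, a cheap failed run that happens to come first cannot lower the bar for every later comparison.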

Variance and honesty. Each configuration runs twice and pairs of runs differing by more than 25% are flagged in the output. LLM variance is the dominant error source at small effect sizes — the recorded demonstration below shows it evicting a rule. The selector is variance-aware: it computes the standard error of the per-task savings, and when a verdict sits within one standard error of the keep/evict threshold it spends one bounded top-up pass (extra suite runs of the measured configuration, budget configurable via --top-up, default 1) before deciding; verdicts that remain within noise are recorded with an explicit low-confidence annotation. The benchmark also reports its own meta-cost after every invocation: when benchmarking exceeds 10% of the week's collected real-work tokens, it tells you to bench less.
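The three checks in this paragraph can be sketched as small pure functions. Several details are assumptions made for illustration: the 25% spread is measured relative to the cheaper run of the pair, the standard error uses the sample standard deviation of per-task savings, and all function names are hypothetical.

```typescript
const PAIR_SPREAD_LIMIT = 0.25; // from the text: pairs differing by more than 25%
const META_COST_LIMIT = 0.1;    // from the text: 10% of the week's real-work tokens

// Assumption: spread is measured relative to the cheaper run of the pair.
function pairFlagged(a: number, b: number): boolean {
  const low = Math.min(a, b);
  if (low <= 0) return true; // a zero-cost run is suspicious in itself
  return Math.abs(a - b) / low > PAIR_SPREAD_LIMIT;
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Standard error of the mean, from the sample standard deviation.
function standardError(xs: number[]): number {
  if (xs.length < 2) return Number.POSITIVE_INFINITY; // one sample carries no noise estimate
  const m = mean(xs);
  const variance = xs.reduce((sum, x) => sum + (x - m) * (x - m), 0) / (xs.length - 1);
  return Math.sqrt(variance / xs.length);
}

// True when the mean saving sits within one standard error of the
// keep/evict threshold (2 x rent), i.e. when a top-up pass is warranted.
function withinNoise(perTaskSavings: number[], rent: number): boolean {
  const threshold = 2 * rent;
  return Math.abs(mean(perTaskSavings) - threshold) <= standardError(perTaskSavings);
}

// True when benchmarking spent more than 10% of the week's real-work tokens.
function benchTooCostly(benchTokens: number, weekRealTokens: number): boolean {
  return benchTokens > META_COST_LIMIT * weekRealTokens;
}
```

Under these definitions, per-task savings of 100, 200, and 300 tokens have a mean of 200 and a standard error of about 57.7, so a threshold of 150 counts as within noise while a threshold of 100 does not.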

Architecture

Module	Responsibility
`src/db.ts`	SQLite schema, versioned migrations (`PRAGMA user_version`), typed query helpers
`src/transcript.ts`	Pure transcript JSONL parser — usage dedup, tool calls, re-reads, completion heuristic, distiller digest
`src/collect.ts`	Stop-hook entrypoint; p75 trigger; spawns the distiller detached
`src/distill.ts`	Waste analysis → 0–2 strict-JSON candidate rules; trigram dedupe
`src/bench.ts`	Golden-suite runner; baseline freezing; meta-cost accounting
`src/select.ts`	Keep/evict verdicts; round-robin re-audit; `MEMORY.md` compiler
`src/status.ts`	Read-only reporting behind `/warden-status`
`src/gate.ts`	Inter-agent `SendMessage` approval gate (Agent Teams)
`src/notify.ts`	SessionStart nudge when candidates await measurement
`src/compare.ts`	Generic A/B comparison engine (processing-token verdict, variance top-up, `runComparison` orchestration) shared by model, prompt, and prompt-evolution benchmarking
`src/modelbench.ts`	Model-migration benchmarking: candidate model vs. agent default
`src/promptbench.ts`	Prompt A/B benchmarking: variant agent definition vs. shipped
`src/evolve.ts`	Automated prompt evolution: propose a cheaper prompt (model call) → measure → recommend

Data model (~/.token-warden/warden.db): runs (one row per session or golden run, tagged real/active/candidate/audit), rules (the ledger — candidates, active rules with measured deltas, and evicted rules kept as the negative dataset), baselines (frozen run1_tokens, ratcheting best_tokens), ruleset_versions, and questions (the inter-agent ledger). Every deviation from the original specification is documented in DECISIONS.md.

The agents

frontend, backend, sql, and testing (agents/*.md) are standard Claude Code subagents with memory: user and domain-scoped prompts seeded with efficiency behaviors (Grep before Read, never re-read a file, one-line plan before editing). Use them like any subagent — the optimizer extends each one's memory independently. Per-agent isolation is deliberate: a rule that pays rent for the sql agent is never charged to the frontend agent's context.

Inter-agent approval gate (experimental)

With CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, a PreToolUse hook intercepts every SendMessage between agents and escalates to you:

[frontend → backend] "What does the orders service return on partial failure?" — approve?

Every question is logged to the questions table — approved sends are confirmed by a PostToolUse hook; denied ones stay pending — and per-agent question volume surfaces in /warden-status. An agent that asks a lot is an agent whose memory is missing something. Without the env flag the gate is structurally inert and everything else works untouched. The gate fails open: an internal error defers to the normal permission flow rather than blocking team messaging.
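The fail-open behavior can be illustrated with a small decision function. This is a sketch only: it does not use the real Claude Code hook payload or output format, and `GateDecision`, `SendMessageCall`, and `gate` are hypothetical names. The point it shows is that a disabled gate and a broken gate both defer to the normal permission flow.

```typescript
type GateDecision =
  | { kind: "ask"; prompt: string } // escalate to the user
  | { kind: "defer" };              // hand back to the normal permission flow

// Hypothetical shape of an intercepted inter-agent message.
interface SendMessageCall {
  from: string;
  to: string;
  text: string;
}

function gate(call: unknown, enabled: boolean): GateDecision {
  if (!enabled) return { kind: "defer" }; // inert without the env flag
  try {
    const c = call as SendMessageCall;
    if (typeof c.from !== "string" || typeof c.to !== "string" || typeof c.text !== "string") {
      throw new Error("malformed SendMessage call");
    }
    // \u2192 is the arrow and \u2014 the dash used in the prompt shown above.
    return { kind: "ask", prompt: `[${c.from} \u2192 ${c.to}] "${c.text}" \u2014 approve?` };
  } catch {
    return { kind: "defer" }; // fail open: an internal error never blocks messaging
  }
}
```

Passing `null` or a malformed object lands in the `catch` branch and returns `defer`, which is the property the paragraph above describes.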

Design invariants

Candidate rules are never injected until measured. Unverified rules get no context space; candidates live only in SQLite.
MEMORY.md is a build artifact — compiled from the rule ledger, overwritten wholesale, never hand-edited or agent-appended.
Fitness = tokens per completed task. Incomplete runs are excluded from savings math.
Golden tasks run against the frozen fixture, never a live codebase.
First-run baselines are frozen forever. run1_tokens is the permanent denominator of every improvement claim.
The optimizer never re-does past work — all learning is feed-forward.
Eviction is mandatory. Rules must earn at least 2× their context rent, and active rules are re-audited round-robin.

A recorded demonstration

Recorded 2026-06-12; every number is from real headless runs.

A candidate is born. Run #13, an sql golden run, cost 61,003 tokens — above the agent's rolling p75. The distiller proposed two candidates:

rule	body	rent
#3	"Consolidate file discovery into single queries instead of multiple find/ls operations across related paths."	27
#4	"Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them."	28

The selector measures them (24 headless runs: shared baseline, one configuration per candidate, one re-audit). Mean completed tokens per task:

configuration	sql-01	sql-02	sql-03	delta
baseline (active set)	39,572	70,762 ⚠	50,304	—
+ rule #3	39,541	67,114	52,116	+622 saved/run
+ rule #4	39,664	54,244	49,538	+5,731 saved/run
− rule #1 (re-audit)	39,671	49,006	44,315 ⚠	rule #1 worth −9,215

⚠ = the two same-configuration runs differed by >25%.
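The delta column can be reproduced from the per-task means in the table. The sketch below assumes the delta is the mean of the per-task differences between two configurations, rounded to the nearest token; `meanSaved` is a hypothetical helper, and the numbers are copied from the table above.

```typescript
// Per-task mean tokens (sql-01, sql-02, sql-03), copied from the table.
const baseline = [39572, 70762, 50304];
const withRule3 = [39541, 67114, 52116];
const withRule4 = [39664, 54244, 49538];
const withoutRule1 = [39671, 49006, 44315];

// Mean tokens saved per task when moving from `reference` to `variant`.
// Positive means the variant is cheaper.
function meanSaved(reference: number[], variant: number[]): number {
  if (reference.length !== variant.length || reference.length === 0) {
    throw new Error("configurations must cover the same non-empty task list");
  }
  let total = 0;
  for (let i = 0; i < reference.length; i++) {
    total += (reference[i] ?? 0) - (variant[i] ?? 0);
  }
  return total / reference.length;
}

const rule3Delta = Math.round(meanSaved(baseline, withRule3));    // 622
const rule4Delta = Math.round(meanSaved(baseline, withRule4));    // 5731
// Rule #1's worth is what the suite saves by having it: the cost without
// the rule minus the cost with it (the baseline includes rule #1).
const rule1Worth = Math.round(meanSaved(withoutRule1, baseline)); // -9215
```

Most of each figure comes from sql-02, the task whose baseline pair was flagged for variance.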

Verdicts (threshold: savings ≥ 2× rent):

rule #3 → ACTIVE (622 ≥ 54)
rule #4 → ACTIVE (5,731 ≥ 56)
rule #1 ("Use Grep to locate symbols before reading any file."), active since the previous selector run at +3,673, was EVICTED on re-audit at −9,215: with the two new rules present, removing it made the suite cheaper. This is mandatory eviction working as designed — and an honest illustration that run-to-run variance dominates at small effect sizes. Evicted rules are retained as the negative dataset, and trigram dedupe prevents a falsified rule from being re-proposed.
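Trigram deduplication against the full ledger, evicted rules included, might look like the sketch below. The exact metric and cutoff are not given here, so this uses Jaccard similarity over character trigrams with a hypothetical threshold of 0.6; `trigrams`, `similarity`, and `isNearDuplicate` are illustrative names.

```typescript
// Distinct character trigrams of a normalized string.
function trigrams(text: string): Set<string> {
  const norm = text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
  const out = new Set<string>();
  for (let i = 0; i + 3 <= norm.length; i++) {
    out.add(norm.slice(i, i + 3));
  }
  return out;
}

// Jaccard similarity: shared trigrams divided by all distinct trigrams.
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// The ledger should include evicted rule bodies, so a falsified rule
// cannot come back under slightly different wording.
function isNearDuplicate(candidate: string, ledger: string[], threshold = 0.6): boolean {
  return ledger.some((body) => similarity(candidate, body) >= threshold);
}
```

Checking candidates against evicted bodies as well as active ones is what turns the eviction history into a negative dataset rather than a log.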

The compiled memory (~/.claude/agent-memory/sql/MEMORY.md, ruleset v2):

<!-- GENERATED BY token-warden — do not hand-edit -->
# Efficiency rules

- Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them.
- Consolidate file discovery into single queries instead of multiple find/ls operations across related paths.
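A compiler that produces a file of this shape could be as small as the sketch below. It is an illustration under assumptions: the `ActiveRule` type and `compileMemory` function are hypothetical, and ordering rules by measured delta (largest first) is a guess that happens to match the sample above.

```typescript
interface ActiveRule {
  id: number;
  body: string;
  delta: number; // measured tokens saved per completed golden run
}

// \u2014 is the dash that appears in the generated header shown above.
const HEADER = "<!-- GENERATED BY token-warden \u2014 do not hand-edit -->";

// Builds the whole file from the ledger every time. Nothing is appended
// to an existing file, so hand edits cannot survive a recompile.
function compileMemory(rules: ActiveRule[]): string {
  // Assumption: largest measured saving first.
  const ordered = rules.slice().sort((a, b) => b.delta - a.delta);
  const bullets = ordered.map((rule) => `- ${rule.body}`);
  return [HEADER, "# Efficiency rules", "", ...bullets, ""].join("\n");
}
```

Writing the result over the old file wholesale is what makes MEMORY.md a build artifact: the ledger stays the single source of truth.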

Testing

npm run typecheck && npm run lint && npm run test

The unit suite (count in the CI badge above — hard-coding it here rotted once already) spans every module. The transcript parser carries the densest coverage (usage dedup, completion heuristics, malformed-line tolerance, a 5 MB / 2 s performance budget) against committed anonymized fixtures. The hook entrypoints (collect.ts, gate.ts) are tested as real child processes against temp databases, including corrupt-input and fail-open paths. The selector core is tested with an injected fake suite-runner, so verdict logic, regression eviction, re-audit, and memory compilation are verified without spending model tokens. Strict TypeScript (noUncheckedIndexedAccess), Biome for lint/format, vitest for tests.

The fixture has its own independent suite (cd benchmarks/fixture && npm test) and is excluded from plugin CI — its deliberate flaws are benchmark material, not bugs.

Data layout

Path	Contents
`~/.token-warden/warden.db`	The ledger (override with `TOKEN_WARDEN_DB`)
`~/.token-warden/{collect,distill,gate}.log`	Component logs — hooks never surface errors into sessions
`~/.claude/agent-memory/<agent>/MEMORY.md`	Compiled rules (generated; do not hand-edit)
`benchmarks/fixture/`	The frozen benchmark codebase

Security notes

The ledger contains untrusted text: rule bodies and eviction reasons are model-generated, project paths and question senders come from the environment. Defenses, in order:

The distiller rejects rule bodies containing control characters or newlines at the source.
renderStatus sanitizes every untrusted string it displays (ANSI/control characters stripped, newlines collapsed, length clamped), so collected data cannot forge report sections.
The /warden-status command instructs the relaying Claude to treat report contents as data, never as instructions.
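The first two defenses can be sketched as a validator and a display sanitizer. This is not the plugin's `renderStatus`; the function names and the 200-character default clamp are assumptions for illustration.

```typescript
// Defense 1 (sketch): reject a rule body that contains any control
// character, which includes newlines, carriage returns, and ESC.
function isCleanRuleBody(body: string): boolean {
  return !/[\u0000-\u001f\u007f-\u009f]/.test(body);
}

// Defense 2 (sketch): make an untrusted string safe to print in a report.
// maxLength is a hypothetical default, not the plugin's value.
function sanitize(text: string, maxLength = 200): string {
  // Strip ANSI CSI sequences: ESC [ parameters, intermediates, final byte.
  const noAnsi = text.replace(/\u001b\[[0-?]*[ -\/]*[@-~]/g, "");
  // Collapse line breaks so one value cannot span report lines.
  const oneLine = noAnsi.replace(/[\r\n]+/g, " ");
  // Drop any remaining control characters.
  const noControl = oneLine.replace(/[\u0000-\u001f\u007f-\u009f]/g, "");
  if (noControl.length <= maxLength) return noControl;
  return noControl.slice(0, maxLength - 1) + "\u2026"; // clamp with an ellipsis
}
```

Collapsing newlines matters most here: a value that cannot start a new line cannot imitate a report heading.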

The inter-agent gate is an observability and approval layer, not a security boundary — it fails open by design so a broken gate can never block team messaging.

Roadmap

Shipped since v0.1.0:

✅ Subagent collection — Stop and SubagentStop hooks, so the four domain agents' real work reaches the ledger (previously only the main session was collected and the learning loop could not engage on real work at all).
✅ Variance-aware verdicts — standard-error analysis of per-task savings with a bounded top-up pass when a verdict is within noise of the threshold (--top-up).
✅ Selection nudge — a SessionStart hook surfaces pending candidates; /warden-select runs the measurement on demand.
✅ Question-driven distillation — an agent's recent cross-agent questions are fed to the distiller as a memory-gap signal.
✅ Per-project tracking — real-work sessions record their project; status breaks down token volume per project.
✅ Rule provenance — active rules show the run they were distilled from.
✅ Cross-project learning curves — /warden-status charts average completed real-work session cost per ruleset version, per agent and per project (domain agents only; main never has compiled rules). This is the test of the system's core thesis: golden-suite gains must show up in real work.
✅ Model-migration benchmarking — /warden-modelbench runs an agent's golden suite under a candidate model vs. its current one (rules held constant) and reports which uses fewer tokens for that workload. The frozen suite is the fixed workload you need when a new model ships. The verdict uses processing tokens (input + output + cache_creation); cache-read is reported separately because it skews raw cross-model totals, and token counts are never converted to dollars (models are priced differently per token).
✅ Prompt / agent-definition A/B testing — /warden-promptbench runs an agent's golden suite under a variant agent definition vs. the shipped one (rules and model held constant), so a proposed prompt edit can be kept or rejected on measured token savings rather than vibes. The comparison engine (compare.ts) is shared with model benchmarking: the discipline generalizes from "rule selection" to "any context change."
✅ Automated prompt evolution — /warden-evolve proposes a token-cheaper rewrite of an agent's prompt (a model call, like the rule distiller proposes rules), measures it through the shared engine, and recommends a winner to a proposals file. Deliberately not auto-applied: agents/<name>.md is committed source, and three golden tasks can't fully capture an agent's behavior, so a human reviews and applies. Protected frontmatter (name/tools/model/memory) is enforced unchanged before measurement.
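The processing-token verdict described for model benchmarking above comes down to one sum. The sketch below assumes suite-level totals are already aggregated; `SuiteUsage`, `processingTokens`, and `cheaperSide` are hypothetical names, and the comparison ignores the variance handling that the shared engine adds.

```typescript
interface SuiteUsage {
  input: number;
  output: number;
  cacheCreation: number;
  cacheRead: number; // reported separately, never part of the verdict
}

// From the text: processing tokens = input + output + cache_creation.
function processingTokens(u: SuiteUsage): number {
  return u.input + u.output + u.cacheCreation;
}

// Which side of an A/B comparison used fewer processing tokens.
function cheaperSide(candidate: SuiteUsage, baseline: SuiteUsage): "candidate" | "baseline" | "tie" {
  const c = processingTokens(candidate);
  const b = processingTokens(baseline);
  if (c === b) return "tie";
  return c < b ? "candidate" : "baseline";
}
```

Leaving cache reads out of the sum keeps a configuration from looking expensive merely because it re-read a large cached prefix.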

Near-term:

Golden suite growth — heavy tasks (testing-02 ≈ 150k tokens/run) deserve splitting into new tasks (existing baselines stay frozen; replacing a task would invalidate its denominator, so growth means adding task files, never editing them).
Fully scheduled selection — auto-running the selector on a cron/routine once variance handling has earned trust; today it deliberately stays a user decision.
Transcript provenance — link a rule's born-of run to its archived transcript digest for post-hoc review.

Bigger directions — the reusable asset here is the frozen-benchmark + measured-verdict discipline, which generalizes well beyond efficiency rules:

Model-migration benchmarking (since shipped as /warden-modelbench, see above): the frozen golden suite is exactly the fixed workload you need when a new model ships, so it can answer "is the new model cheaper on my workload" with the rigor rules already get.
Prompt / agent-definition A/B testing (since shipped as /warden-promptbench, see above): the benchmark measures any context change, not just rules; treat an agent's system prompt as a candidate and keep only the edits that earn their place.
Team-shared rule ledgers — commit measured rules to a repo (via project-scoped memory) with token-warden as the CI gate, so a PR adding a rule must carry its measured delta. Memory review becomes code review.
Real-time cost anomaly alerting — the p75 machinery already detects expensive sessions; a Stop-hook message could tell the session itself where its tokens went.

License

MIT
