惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

博客园_首页
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
量子位
博客园 - Franky
罗磊的独立博客
月光博客
月光博客
酷 壳 – CoolShell
酷 壳 – CoolShell
博客园 - 聂微东
人人都是产品经理
人人都是产品经理
Hugging Face - Blog
Hugging Face - Blog
宝玉的分享
宝玉的分享
腾讯CDC
D
Docker
N
Netflix TechBlog - Medium
Y
Y Combinator Blog
V
V2EX
Microsoft Azure Blog
Microsoft Azure Blog
Latest news
Latest news
C
CERT Recently Published Vulnerability Notes
G
GRAHAM CLULEY
C
Cisco Blogs
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
T
Threatpost
Simon Willison's Weblog
Simon Willison's Weblog
GbyAI
GbyAI
S
SegmentFault 最新的问题
Blog — PlanetScale
Blog — PlanetScale
L
Lohrmann on Cybersecurity
I
Intezer
博客园 - 叶小钗
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Last Week in AI
Last Week in AI
Cisco Talos Blog
Cisco Talos Blog
Hacker News: Ask HN
Hacker News: Ask HN
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
B
Blog
Microsoft Security Blog
Microsoft Security Blog
AI
AI
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
S
Schneier on Security
V
Visual Studio Blog
The Register - Security
The Register - Security
AWS News Blog
AWS News Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
F
Fortinet All Blogs
博客园 - 司徒正美
WordPress大学
WordPress大学
Jina AI
Jina AI
T
Tor Project blog

Hacker News - Newest: "AI"

AI can't read an investor deck AI as an attorney? Student uses ChatGPT, Gemini to sue UW over alleged racial discrimination Hacking MCP Servers in AI Systems – The Rug Pull: Tool Changes After Approval GitHub - MeepCastana/KubeezCut: Free Web based video editor GitHub - GenAI-Gurus/awesome-eu-ai-act: Curated tools, official sources, OSS, templates, and guides for EU AI Act compliance. Can AI judge journalism? A Thiel-backed startup says yes, even if it risks chilling whistleblowers Coming soon: 10 Things That Matter in AI Right Now DARPA built an AI to fact-check enemy weapons claims What explains heterogeneity in AI adoption? When AI Meets Muscle: Context-Aware Electrical Stimulation Promises a New Way to Guide Human Movements - Department of Computer Science AI Changed How We Build. It Did Not Change What Matters. Linux rules on using AI-generated code - Copilot is OK, but humans must take 'full responsibility for the… Meta spins up AI version of Mark Zuckerberg to engage with employees Code Mode: Let Your AI Write Programs, Not Just Call Tools | TanStack Blog GitHub - Delavalom/graft: Go framework for building AI agents. Type-safe tools, multi-provider (OpenAI, Anthropic, Gemini, Bedrock), zero vendor SDKs. India's TCS tops estimates, says new AI models did not dent services demand Gen Z's fading AI hype Strong feeling: we are in a folded AI reality GitHub - machinarii/total-recall-catalog: A reference catalog of latest knowledge retrieval, memory & RAG systems GitHub - mensfeld/code-on-incus: Give each AI agent its own isolated machine with root, Docker, and systemd. Active defense detects and stops threats automatically.. Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI Iran war: We spoke to the man making Lego-style AI videos that experts say are powerful propaganda Powell, Bessent discussed Anthropic's Mythos AI cyber threat with major U.S. banks GitHub - immartian/bellamem: Persistent belief-graph memory for AI agents. Retrieves decisive context by importance — not recency, not RAG, not /compact. recursive-mode: The Repo-Native Operating System for AI Engineering After the attack on Sam Altman's home, will AI CEO's go on the offensive? The biggest advance in AI since the LLM Opus 4.6 vs GPT 5.4 One Prompt Unity World Generation Test “AI polls” are fake polls Client Challenge Can AI be a 'child of God'? Inside Anthropic's meeting with Christian leaders How to Switch AI Chatbots and Why You Might Want To GitHub - MattMessinger1/agentic_refund_guardrail: Safe refund policy layer for AI agents — Python + TypeScript. Same behavior, shared tests. Adam/papers/emergent_values_whitepaper.md at master · strangeadvancedmarketing/Adam Ask HN: How do you stop playing 20 questions with your AI coding tools How far can automation and AI support psychotherapy? - @theU GitHub - stagas/rtdiff: realtime git diff gui and AI-assisted commits A Mac Studio for Local AI — 6 Months Later A History of the Early Years of AI at the University of Edinburgh Why AI Coding Tools Still Feel Stuck on Localhost MSN AI Datacenters Are Becoming Strategic Targets twitter.com Penn Researchers Use AI to Surface Unreported GLP-1 Side Effects in Reddit Posts Show HN: MoodSense AI (ML and FastAPI and Gradio, Deployed on Hugging Face) Moodsense Ai - a Hugging Face Space by aman179102 AI models are terrible at betting on soccer—especially xAI Grok GitHub - xialeistudio/echoic GitHub - HimashaHerath/github-dev-wrapped: AI-powered weekly GitHub activity reports deployed to GitHub Pages GitHub - alejandrobalderas/claude-code-from-source: Architecture, patterns & internals of Anthropic's AI coding agent — reverse-engineered from source maps AI and Tech brief: Ireland ascendant GitHub - Titovilal/context0: Context0 - Never Surrender Training for a Marathon with an AI Coach: What Worked and What Didn't Cyber Pulse: Agentic Intel - Apps on Google Play I Built an AI PR Reviewer That Catches Bugs by Not Looking for Bugs Gen Z workers are so fearful AI will take their job they’re intentionally sabotaging their company’s AI rollout | Fortune How AI Is Reimagining the Game of Golf–For Both Players and Courses GitHub - nattergabriel/reseed: A CLI tool for managing and distributing agent skills across projects Is SVG the final frontier? My AI workflow evolved from prompts to a near-autonomous workflow MLSharp Help - 3DGS Viewer & Generator I put my cognitive field based AI's runtime on GitHub Is Numble the first AI-proof game? A3: Kubernetes for autonomous AI agent fleets | Emergent Principles Deepali Vyas ("The Elite Recruiter") GitHub - msmarkgu/RelayFreeLLM: A restful API designed to route user prompts to various AI model providers. Unionized ProPublica staff are on strike over AI, layoffs, and wages Unleashing the Advantage of Quantum AI We're heading for an AI-fueled 'dementia crisis,' brain scientist warns The AI-Assisted Breach of Mexico's Government Infrastructure [pdf] GitHub - stef41/lmscan: 🔍 Detect AI-generated text and fingerprint which LLM wrote it. Open-source GPTZero alternative. Zero dependencies, works offline. MSN GitHub - visionscaper/collabmem: Enabling long-term collaboration with Agentic AI - building up episodic and world model memory over time with in-context awareness We gave an AI a 3 year retail lease in SF and asked it to make a profit | Andon Labs AI Code is Hollowing Out Open Source, and Maintainers are Looking the Other Way What leaked "SteamGPT" files could mean for the PC gaming platform's use of AI AI is the boss at this retail store. What could go wrong? GitHub - Wuzu11517/agentic-proxy: Local proxy meant to help reduce With Drones, Geophysics and ArtificiaI Intelligence, Researchers Prepare to Do Battle Against Land Mines A Single Operator, Two AI Platforms, Nine Government Agencies: The Full Technical Report 在 Steam 上购买 FriedrichAI: Offline AI 立省 10% GitHub - inevolin/resume-cli: Hit Claude usage limits? Resume any AI coding session elsewhere. Switch tools at zero friction. GitHub - atripati/ark: AI Runtime Kernel — a context operating system for AI agents. Eliminates tool bloat, loads only what’s needed, and gives LLMs their reasoning space back. How to Build a Secure AI PR Reviewer with Claude, GitHub Actions, and JavaScript This Startup Wants You to Pay Up to Talk With AI Versions of Human Experts Intel Arc Pro B70 Brings 32GB VRAM to Local AI for $949 WordPress 7.0: The Good, the AI, and the Still Missing AI on the couch: Anthropic gives Claude 20 hours of psychiatry IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures AI Agents Know About Supabase. They Don't Always Use It Right. The history and future of AI at Google, with Sundar Pichai Inside an AI‑enabled device code phishing campaign How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines AI for Systems: Using LLMs to Optimize Database Query Execution Forecasting the Economic Effects of AI Introducing Tinker: Play with AI, bring your ideas to life AI sheds light on an ancient gaming mystery People really hate AI but not as much as Iran—or Democrats | Fortune What is an AI Product Engineer? Phoebe Gates wants her $185 million AI startup to succeed with 'no ties to my privilege or my last name': 'I have a chip on my shoulder' | Fortune
GitHub - vukkt/token-warden: Claude Code plugin that makes coding agents measurably cheaper over time: collect token costs, distill candidate rules, benchmark them on a frozen golden suite, and keep only rules that earn their context rent.
vukkt · 2026-06-15 · via Hacker News - Newest: "AI"

CI License: MIT

A Claude Code plugin that makes coding agents measurably cheaper over time.

Most "agent memory" accumulates advice nobody ever verifies. token-warden treats agent memory as an engineering problem: every rule that wants space in an agent's context must prove, on a fixed benchmark, that it saves more tokens than it costs — or it gets evicted. The result is a per-agent memory file containing only rules with measured, positive return.

  • Measured, not vibes — every rule carries a token delta from real benchmark runs
  • Self-funding — rules must save ≥ 2× their own context rent to stay
  • Self-auditing — active rules are re-benchmarked round-robin and evicted when they stop earning
  • Zero session overhead — collection runs in a Stop hook that never blocks or fails your work

Table of contents

  • How it works
  • Getting started
  • Commands
  • The benchmark system
  • Architecture
  • The agents
  • Inter-agent approval gate
  • Design invariants
  • A recorded demonstration
  • Testing
  • Data layout
  • Security notes
  • Roadmap

How it works

The optimizer is a four-stage, feed-forward loop. Lessons are extracted from finished sessions and applied to future ones — past work is never re-done.

                  agent session (any project, any repo)
                                  │
                                  │  Stop hook · parses the transcript:
                                  │  tokens, tool calls, file re-reads, completion
                                  ▼
                ┌─────────────────────────────────────┐
                │  1 · COLLECT                        │
                │  one row per session in SQLite      │
                └─────────────────────────────────────┘
                                  │
                                  │  fires only when a run exceeds the
                                  │  agent's rolling p75 token cost
                                  ▼
                ┌─────────────────────────────────────┐
                │  2 · DISTILL                        │
                │  one haiku call over the waste      │
                │  stats → 0–2 candidate rules        │
                └─────────────────────────────────────┘
                                  │
                                  │  candidates wait in SQLite —
                                  │  never injected until measured
                                  ▼
                ┌─────────────────────────────────────┐
                │  3 · BENCH                          │
                │  golden suite on a frozen fixture,  │
                │  run with vs. without the candidate │
                └─────────────────────────────────────┘
                                  │
                                  │  measured delta vs. context rent
                                  ▼
                ┌─────────────────────────────────────┐
                │  4 · SELECT                         │
                │  keep if savings ≥ 2× rent, else    │
                │  evict · re-audit the oldest rule   │
                └─────────────────────────────────────┘
                                  │
                                  ▼
              ~/.claude/agent-memory/<agent>/MEMORY.md
        compiled wholesale from surviving rules and injected
            into the agent's system prompt next session

1 · Collect. Stop and SubagentStop hooks fire after every turn (main session and subagent work respectively) and parse the session transcript into one ledger row: input/output/cache tokens (deduplicated by API message id — the transcript repeats usage per streamed block), tool-call count, files read more than once, and whether the session completed. The hook is hard-capped under the 2-second budget, wraps every failure, and exits 0 regardless — it can never break your session.

2 · Distill. Collection is cheap, analysis is not, so analysis is rationed: only runs above the agent's rolling 75th-percentile cost (minimum 5 prior runs) are distilled. A single detached haiku-tier call receives the waste statistics plus an 8 KB action trace and must return strict JSON: at most two one-sentence, generalizable rules. Invalid output is dropped, never retried. Near-duplicates of any existing rule — including evicted ones — are rejected by trigram similarity, so a falsified rule cannot be re-proposed.

3 · Bench. Candidates are measured on a golden task suite against a frozen fixture repository (see The benchmark system). Each configuration runs the suite headlessly in a throwaway copy with the candidate compiled into a temporary, fully isolated agent memory.

4 · Select. A rule's verdict is the spec inequality: with delta = mean tokens saved per completed golden run and rent = the rule's own size in tokens, the rule goes active iff delta × sessions/week ≥ 2 × rent × sessions/week. Failing a previously-passing task is instant eviction regardless of tokens. Every selector run also re-benchmarks the least-recently-audited active rule — memory must keep earning its place. Survivors are compiled into MEMORY.md, which Claude Code injects into the agent's system prompt.


Getting started

Prerequisites

  • Node.js 22+
  • Claude Code v2.1+ (claude --version)
  • macOS or Linux (Windows via WSL — benchmarks need a POSIX shell)

1 · Clone and install

git clone https://github.com/vukkt/token-warden.git
cd token-warden
npm install        # the hooks run via the plugin's own tsx + better-sqlite3

2 · Load the plugin

For the current session:

claude --plugin-dir /path/to/token-warden

Or install persistently — this repository is also its own marketplace:

/plugin marketplace add vukkt/token-warden
/plugin install token-warden@vukkt-plugins

Marketplace installs are copied to ~/.claude/plugins/cache without node_modules. The Stop hook bootstraps its own dependencies on first run (one-time npm install, silent); collection begins from the second session at the latest.

3 · Verify collection

Work normally for a turn or two, then:

You should see a runs count for main. Every session in every project is now being measured into ~/.token-warden/warden.db.

4 · Freeze the baselines (one-time, ~20 min per agent)

npm run bench -- --agent all      # or one agent at a time

This runs each agent's three golden tasks twice and freezes run1_tokens — the permanent denominator of every future improvement claim. Do this once, before any rules exist.

5 · Let the loop run

Use the four subagents (frontend, backend, sql, testing) for real work. Expensive sessions distill into candidates automatically. When /warden-status shows candidates pending, measure them:

npx tsx src/select.ts --agent sql

Active rules land in the agent's memory; the next session starts cheaper.


Commands

Command What it does
/warden-status Read-only report: per-agent run/rule counts, suite total vs. frozen baseline (absolute + %), learning curve over time, active rules with measured deltas and provenance, recent evictions with reasons, real-work tokens by project, cross-agent question volume
/warden-bench <agent|all> [--runs N] [--task id] Runs the golden suite, compares against run1 and best, and reports benchmarking meta-cost (warns above 10% of the week's real-work tokens)
/warden-select <agent> [--runs N] [--top-up N] Measures pending candidates, evicts or activates them, re-audits the oldest active rule, and recompiles the agent's memory
/warden-modelbench <agent> --model <id> [--baseline <id>] [--runs N] Runs the agent's golden suite under two models (candidate vs. the agent's current model, rules held constant) and reports which uses fewer tokens for that workload
/warden-promptbench <agent> --variant <file.md> [--runs N] Runs the agent's golden suite under two prompts (a variant agent definition vs. the shipped one, rules and model held constant) and reports which uses fewer tokens
/warden-evolve <agent> [--runs N] Proposes a token-cheaper rewrite of the agent's prompt (model call), benchmarks it, and recommends it only if it provably wins — never auto-applied

When candidate rules are waiting, a lightweight SessionStart hook injects a one-line nudge into new sessions — selection itself always stays a user decision, because it spends real benchmark tokens.

Headless or when names collide, use the namespaced forms (/token-warden:warden-status). CLI equivalents:

npx tsx src/status.ts                              # status report
npm run bench -- --agent sql [--rule N]            # benchmark runner
npx tsx src/select.ts --agent sql                  # selector (measure + evict + compile)
npx tsx src/modelbench.ts --agent sql --model haiku  # compare a model against the agent's default
npx tsx src/promptbench.ts --agent sql --variant v.md  # compare a prompt variant against the shipped one
npx tsx src/evolve.ts --agent sql                      # propose + measure a cheaper prompt variant

The benchmark system

Measurement is only as good as its control variables. token-warden controls them aggressively:

The fixture (benchmarks/fixture/) is a small but realistic full-stack TypeScript project — Express routes → services → repositories over SQLite, a React admin UI, a partial vitest suite — frozen at Phase 2 and never modified, so baselines stay comparable across months. It ships with documented, deliberate flaws (BUGS.md, which agents never see: the benchmark runner excludes it from every copy) that the golden tasks target.

Golden tasks (benchmarks/<agent>/golden-NN.md) — three per agent, each a frontmatter file with a one-sentence prompt and a shell success_check (greps and/or a full vitest run). A run only counts as completed if its check passes: a cheap failed run is worse than an expensive successful one, and incomplete runs are excluded from all savings math.

A benchmark run, end to end:

  1. Copy the fixture to a temp dir (node_modules symlinked; BUGS.md excluded).
  2. Install the agent definition into the copy with its memory scope rewritten to project, so the compiled MEMORY.md under test resolves inside the temp dir — real agent memory is never read or written by benchmarks.
  3. Compile the rule set under test (active rules ± one candidate) into that memory.
  4. Run claude -p --agent <name> headlessly with scoped permissions: acceptEdits plus a Bash allowlist of test commands only — never bypassPermissions.
  5. Run the success_check; parse the transcript; record one runs row.
  6. First-ever completed run per (agent, task) freezes baselines.run1_tokens forever; later completed runs only ratchet best_tokens downward.

Variance and honesty. Each configuration runs twice and pairs of runs differing by more than 25% are flagged in the output. LLM variance is the dominant error source at small effect sizes — the recorded demonstration below shows it evicting a rule. The selector is variance-aware: it computes the standard error of the per-task savings, and when a verdict sits within one standard error of the keep/evict threshold it spends one bounded top-up pass (extra suite runs of the measured configuration, budget configurable via --top-up, default 1) before deciding; verdicts that remain within noise are recorded with an explicit low-confidence annotation. The benchmark also reports its own meta-cost after every invocation: when benchmarking exceeds 10% of the week's collected real-work tokens, it tells you to bench less.


Architecture

Module Responsibility
src/db.ts SQLite schema, versioned migrations (PRAGMA user_version), typed query helpers
src/transcript.ts Pure transcript JSONL parser — usage dedup, tool calls, re-reads, completion heuristic, distiller digest
src/collect.ts Stop-hook entrypoint; p75 trigger; spawns the distiller detached
src/distill.ts Waste analysis → 0–2 strict-JSON candidate rules; trigram dedupe
src/bench.ts Golden-suite runner; baseline freezing; meta-cost accounting
src/select.ts Keep/evict verdicts; round-robin re-audit; MEMORY.md compiler
src/status.ts Read-only reporting behind /warden-status
src/gate.ts Inter-agent SendMessage approval gate (Agent Teams)
src/notify.ts SessionStart nudge when candidates await measurement
src/compare.ts Generic A/B comparison engine (processing-token verdict, variance top-up, runComparison orchestration) shared by model, prompt, and prompt-evolution benchmarking
src/modelbench.ts Model-migration benchmarking: candidate model vs. agent default
src/promptbench.ts Prompt A/B benchmarking: variant agent definition vs. shipped
src/evolve.ts Automated prompt evolution: propose a cheaper prompt (model call) → measure → recommend

Data model (~/.token-warden/warden.db): runs (one row per session or golden run, tagged real/active/candidate/audit), rules (the ledger — candidates, active rules with measured deltas, and evicted rules kept as the negative dataset), baselines (frozen run1_tokens, ratcheting best_tokens), ruleset_versions, and questions (the inter-agent ledger). Every deviation from the original specification is documented in DECISIONS.md.


The agents

frontend, backend, sql, and testing (agents/*.md) are standard Claude Code subagents with memory: user and domain-scoped prompts seeded with efficiency behaviors (Grep before Read, never re-read a file, one-line plan before editing). Use them like any subagent — the optimizer extends each one's memory independently. Per-agent isolation is deliberate: a rule that pays rent for the sql agent is never charged to the frontend agent's context.


Inter-agent approval gate (experimental)

With CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, a PreToolUse hook intercepts every SendMessage between agents and escalates to you:

[frontend → backend] "What does the orders service return on partial failure?" — approve?

Every question is logged to the questions table — approved sends are confirmed by a PostToolUse hook; denied ones stay pending — and per-agent question volume surfaces in /warden-status. An agent that asks a lot is an agent whose memory is missing something. Without the env flag the gate is structurally inert and everything else works untouched. The gate fails open: an internal error defers to the normal permission flow rather than blocking team messaging.


Design invariants

  1. Candidate rules are never injected until measured. Unverified rules get no context space; candidates live only in SQLite.
  2. MEMORY.md is a build artifact — compiled from the rule ledger, overwritten wholesale, never hand-edited or agent-appended.
  3. Fitness = tokens per completed task. Incomplete runs are excluded from savings math.
  4. Golden tasks run against the frozen fixture, never a live codebase.
  5. First-run baselines are frozen forever. run1_tokens is the permanent denominator of every improvement claim.
  6. The optimizer never re-does past work — all learning is feed-forward.
  7. Eviction is mandatory. Rules must earn at least 2× their context rent, and active rules are re-audited round-robin.

A recorded demonstration

Recorded 2026-06-12; every number is from real headless runs.

A candidate is born. Run #13, an sql golden run, cost 61,003 tokens — above the agent's rolling p75. The distiller proposed two candidates:

rule body rent
#3 "Consolidate file discovery into single queries instead of multiple find/ls operations across related paths." 27
#4 "Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them." 28

The selector measures them (24 headless runs: shared baseline, one configuration per candidate, one re-audit). Mean completed tokens per task:

configuration sql-01 sql-02 sql-03 delta
baseline (active set) 39,572 70,762 ⚠ 50,304
+ rule #3 39,541 67,114 52,116 +622 saved/run
+ rule #4 39,664 54,244 49,538 +5,731 saved/run
− rule #1 (re-audit) 39,671 49,006 44,315 ⚠ rule #1 worth −9,215

⚠ = the two same-configuration runs differed by >25%.

Verdicts (threshold: savings ≥ 2× rent):

  • rule #3 → ACTIVE (622 ≥ 54)
  • rule #4 → ACTIVE (5,731 ≥ 56)
  • rule #1 ("Use Grep to locate symbols before reading any file."), active since the previous selector run at +3,673, was EVICTED on re-audit at −9,215: with the two new rules present, removing it made the suite cheaper. This is mandatory eviction working as designed — and an honest illustration that run-to-run variance dominates at small effect sizes. Evicted rules are retained as the negative dataset, and trigram dedupe prevents a falsified rule from being re-proposed.

The compiled memory (~/.claude/agent-memory/sql/MEMORY.md, ruleset v2):

<!-- GENERATED BY token-warden — do not hand-edit -->
# Efficiency rules

- Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them.
- Consolidate file discovery into single queries instead of multiple find/ls operations across related paths.

Testing

npm run typecheck && npm run lint && npm run test

The unit suite (count in the CI badge above — hard-coding it here rotted once already) spans every module. The transcript parser carries the densest coverage (usage dedup, completion heuristics, malformed-line tolerance, a 5 MB / 2 s performance budget) against committed anonymized fixtures. The hook entrypoints (collect.ts, gate.ts) are tested as real child processes against temp databases, including corrupt-input and fail-open paths. The selector core is tested with an injected fake suite-runner, so verdict logic, regression eviction, re-audit, and memory compilation are verified without spending model tokens. Strict TypeScript (noUncheckedIndexedAccess), Biome for lint/format, vitest for tests.

The fixture has its own independent suite (cd benchmarks/fixture && npm test) and is excluded from plugin CI — its deliberate flaws are benchmark material, not bugs.


Data layout

Path Contents
~/.token-warden/warden.db The ledger (override with TOKEN_WARDEN_DB)
~/.token-warden/{collect,distill,gate}.log Component logs — hooks never surface errors into sessions
~/.claude/agent-memory/<agent>/MEMORY.md Compiled rules (generated; do not hand-edit)
benchmarks/fixture/ The frozen benchmark codebase

Security notes

The ledger contains untrusted text: rule bodies and eviction reasons are model-generated, project paths and question senders come from the environment. Defenses, in order:

  1. The distiller rejects rule bodies containing control characters or newlines at the source.
  2. renderStatus sanitizes every untrusted string it displays (ANSI/control characters stripped, newlines collapsed, length clamped), so collected data cannot forge report sections.
  3. The /warden-status command instructs the relaying Claude to treat report contents as data, never as instructions.

The inter-agent gate is an observability and approval layer, not a security boundary — it fails open by design so a broken gate can never block team messaging.

Roadmap

Shipped since v0.1.0:

  • Subagent collectionStop and SubagentStop hooks, so the four domain agents' real work reaches the ledger (previously only the main session was collected and the learning loop could not engage on real work at all).
  • Variance-aware verdicts — standard-error analysis of per-task savings with a bounded top-up pass when a verdict is within noise of the threshold (--top-up).
  • Selection nudge — a SessionStart hook surfaces pending candidates; /warden-select runs the measurement on demand.
  • Question-driven distillation — an agent's recent cross-agent questions are fed to the distiller as a memory-gap signal.
  • Per-project tracking — real-work sessions record their project; status breaks down token volume per project.
  • Rule provenance — active rules show the run they were distilled from.
  • Cross-project learning curves/warden-status charts average completed real-work session cost per ruleset version, per agent and per project (domain agents only; main never has compiled rules). This is the test of the system's core thesis: golden-suite gains must show up in real work.
  • Model-migration benchmarking/warden-modelbench runs an agent's golden suite under a candidate model vs. its current one (rules held constant) and reports which uses fewer tokens for that workload. The frozen suite is the fixed workload you need when a new model ships. The verdict uses processing tokens (input + output + cache_creation); cache-read is reported separately because it skews raw cross-model totals, and token counts are never converted to dollars (models are priced differently per token).
  • Prompt / agent-definition A/B testing/warden-promptbench runs an agent's golden suite under a variant agent definition vs. the shipped one (rules and model held constant), so a proposed prompt edit can be kept or rejected on measured token savings rather than vibes. The comparison engine (compare.ts) is shared with model benchmarking: the discipline generalizes from "rule selection" to "any context change."
  • Automated prompt evolution/warden-evolve proposes a token-cheaper rewrite of an agent's prompt (a model call, like the rule distiller proposes rules), measures it through the shared engine, and recommends a winner to a proposals file. Deliberately not auto-applied: agents/<name>.md is committed source, and three golden tasks can't fully capture an agent's behavior, so a human reviews and applies. Protected frontmatter (name/tools/model/memory) is enforced unchanged before measurement.

Near-term:

  • Golden suite growth — heavy tasks (testing-02 ≈ 150k tokens/run) deserve splitting into new tasks (existing baselines stay frozen; replacing a task would invalidate its denominator, so growth means adding task files, never editing them).
  • Fully scheduled selection — auto-running the selector on a cron/routine once variance handling has earned trust; today it deliberately stays a user decision.
  • Transcript provenance — link a rule's born-of run to its archived transcript digest for post-hoc review.

Bigger directions — the reusable asset here is the frozen-benchmark + measured-verdict discipline, which generalizes well beyond efficiency rules:

  • Model-migration benchmarking — the frozen golden suite is exactly the fixed workload you need when a new model ships: warden-bench could answer "is the new model cheaper on my workload" with the rigor rules already get.
  • Prompt / agent-definition A/B testing — the benchmark measures any context change, not just rules; treat an agent's system prompt as a candidate and let the selector keep edits that earn their place.
  • Team-shared rule ledgers — commit measured rules to a repo (via project-scoped memory) with token-warden as the CI gate, so a PR adding a rule must carry its measured delta. Memory review becomes code review.
  • Real-time cost anomaly alerting — the p75 machinery already detects expensive sessions; a Stop-hook message could tell the session itself where its tokens went.

License

MIT