The 12-Phase Workflow That Actually Made AI Coding Useful for Me
A practitioner's account — not a tutorial, not a sales pitch.
Quick screen: if you're writing throwaway scripts or solo prototypes, this workflow is overkill — skip to the Cons and Who This Is For sections first.
I've been using a 12-phase workflow, refined over time, across free-context-hub, lore-weave, and a handful of private internal systems. Both public projects are built almost entirely by AI agents, with me acting as the gatekeeper: approving specs, reviewing diffs, unblocking decisions. Across all of them, the workflow has accumulated 2,500+ commits and a trail of written specs and audit logs I can still query months after the sessions that produced them.
free-context-hub is a self-hosted persistent memory and semantic search layer for AI agents — MCP server, REST API, RAG pipelines, and a full Next.js review UI. 15 development phases delivered end-to-end.
lore-weave is a cloud-hosted multi-agent platform for multilingual novel workflows: translation, knowledge graph construction, glossary management, and AI-assisted writing. 19 microservices across Go, Python, and TypeScript.
I'm sharing the workflow because it's worked better than anything else I've tried, and because the honest trade-offs are worth knowing before you adopt it.
The files are in the repository:
- WORKFLOW.md: standalone 12-phase template to copy into any project
- CLAUDE.md.snippet: the live project spec with project-specific tooling and AMAW wiring
- AMAW.md: opt-in multi-agent extension spec
The Core Problem This Solves
AI coding assistants are very good at generating plausible-looking code. They're much worse at:
- Knowing when they're operating on stale assumptions
- Catching their own scope creep
- Connecting a code change to its downstream contract obligations
- Stopping themselves when a "small fix" turns into a refactor
The standard advice is "just review the diff." But reviewing a diff without having tracked the intent of the change is almost useless — you're comparing code to code, not code to requirements. The 12-phase workflow forces intent to be written down before the first line of code is written, which is what makes the diff review actually meaningful.
Where It Came From
The workflow is an evolution of two ideas:
Superpowers — a coding agent discipline framework that introduced TDD protocol, the evidence gate (run verification fresh before claiming success), and the debugging protocol (no fix without root cause). I absorbed these directly. If you haven't read Superpowers, it's worth your time.
Human-in-the-loop gatekeeping — my own addition. The core insight: a human reading a short spec + a single diff catches dramatically more than a human reading code cold. The workflow structures every task to produce exactly those artifacts, at exactly the right moment.
The combination took multiple iterations to stabilize. What's here is v2.2 (default mode) with an optional AMAW (Autonomous Multi-Agent Workflow) extension for high-stakes work.
The 12 Phases
Phase │ Role (default v2.2) │ What Happens
───────────────┼───────────────────────┼──────────────────────────────────────────
1. CLARIFY │ Architect + Human │ Read context, write spec, expose assumptions
2. DESIGN │ Lead │ API contract / data flow → DESIGN.md
3. REVIEW │ Adversarial self │ Find gaps / contract holes in spec
4. PLAN │ Lead + Developer │ Decompose into 2–5 min tasks → PLAN.md
5. BUILD │ Developer │ TDD: red → green → refactor
6. VERIFY │ Developer │ Run tests fresh, capture exit code + output
7. REVIEW │ Lead │ Code vs spec — find exactly 3 divergences
8. QC │ Main session │ Spec fingerprint vs implementation, AC coverage
9. POST-REVIEW │ Human checkpoint │ Final gate — blocked on any unresolved issue
10. SESSION │ Scribe │ SESSION_PATCH.md + DEFERRED.md + AUDIT_LOG
11. COMMIT │ Developer │ Git commit
12. RETRO │ All │ Record lessons + finalize audit log
The phases look heavy on paper. In practice, for an XS task (single file, one logic change, no side effects) you're allowed to skip CLARIFY and PLAN and go straight to BUILD — the workflow is explicit about this via a mandatory task size classification step.
Task Size Classification: The Thing That Actually Prevents Drift
Before any work starts, you count three things:
| Metric | What you count |
|---|---|
| Files touched | How many files will be created or modified? |
| Logic changes | How many functions/handlers change behavior? (not formatting) |
| Side effects | API contract, DB schema, config, external behavior, types used by other files? |

Those three counts map to a size:

| Size | Files | Logic | Side effects | Allowed skips |
|---|---|---|---|---|
| XS | 1 | 0–1 | None | CLARIFY + PLAN |
| S | 1–2 | 2–3 | None | PLAN only |
| M | 3–5 | 4+ | Maybe | None |
| L | 6–9 | Any | Yes | None |
| XL | 10+ | Any | Yes | None |
You state the classification explicitly before work begins:
Task: Fix pagination off-by-one
Size: XS (1 file: src/api/routes/lessons.ts, 1 logic change: offset calc, 0 side effects)
Skipping: CLARIFY, PLAN → straight to BUILD
The hard rule: if you haven't read the code yet, you don't know the size. Agents routinely call things XS that turn out to be M or L once you look. The classification forces the read to happen before the label is applied.
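The size table can be read as a small decision function. What follows is a minimal sketch, not code from the repository. It assumes two things the table does not state: when the metrics disagree, the largest size wins, and any side effect rules out XS and S. The names `TaskCounts`, `classify`, and `ALLOWED_SKIPS` are hypothetical.

```python
from dataclasses import dataclass

# Sizes in ascending order; the list index doubles as a rank for comparison.
SIZES = ["XS", "S", "M", "L", "XL"]

# Phases each size may skip, copied from the "Allowed skips" column.
ALLOWED_SKIPS = {
    "XS": ["CLARIFY", "PLAN"],
    "S": ["PLAN"],
    "M": [],
    "L": [],
    "XL": [],
}


@dataclass
class TaskCounts:
    files_touched: int
    logic_changes: int
    has_side_effects: bool


def size_from_files(n: int) -> str:
    """Map the number of files touched to the smallest size that allows it."""
    if n >= 10:
        return "XL"
    if n >= 6:
        return "L"
    if n >= 3:
        return "M"
    if n == 2:
        return "S"
    return "XS"


def size_from_logic(n: int) -> str:
    """Map the number of logic changes to the smallest size that allows it."""
    if n >= 4:
        return "M"
    if n >= 2:
        return "S"
    return "XS"


def classify(counts: TaskCounts) -> str:
    """Return the largest size demanded by any single metric (assumed rule)."""
    candidates = [
        size_from_files(counts.files_touched),
        size_from_logic(counts.logic_changes),
    ]
    # Assumption: the table lists "None" for XS and S, so a side effect
    # raises the floor to M.
    if counts.has_side_effects:
        candidates.append("M")
    return max(candidates, key=SIZES.index)


if __name__ == "__main__":
    task = TaskCounts(files_touched=1, logic_changes=1, has_side_effects=False)
    size = classify(task)
    print(size, "may skip:", ALLOWED_SKIPS[size])
```

The function can only be called with real counts, which is the point of the hard rule: the numbers come from reading the code, not from guessing.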
The Anti-Skip Rules (The Most Underrated Part)
Every popular AI workflow has phases that agents skip "to save time." This workflow names the skip patterns explicitly and treats them as violations. The only exceptions are the skips that the task size classification allows:
| Skip pattern | Why agents do it | Why it's forbidden |
|---|---|---|
| Skip CLARIFY, jump to BUILD | "Task seems obvious" | Unexamined assumptions cause rework |
| Skip PLAN, jump to BUILD | "It's a small change" | Small changes grow; no plan = no checkpoint |
| Skip VERIFY after BUILD | "Tests passed earlier" | Stale results are not evidence |
| Skip REVIEW after VERIFY | "I wrote it, I know it's correct" | Author blindness is real |
| Skip POST-REVIEW | "I reviewed in phase 7" | Phase 7 is code review; POST-REVIEW is the final conservative gate — different scope |
| Skip SESSION before COMMIT | "I'll update later" | You won't. Context is lost. |
| Combine multiple phases | "CLARIFY+DESIGN+PLAN in one go" | Each phase boundary is a deliberate pause point; merging phases removes the checkpoint |
Naming these patterns and treating them as violations changes the conversation. When the agent tries to jump phases, you have a handle to point at.
The Evidence Gate (Absorbed from Superpowers)
Phase 6 (VERIFY) has a 5-step gate that runs before any completion claim:
- Identify the verification command
- Run it fresh — not from memory, not from cache
- Read complete output including exit codes
- Confirm output matches the claim
- Only then state the result with evidence
Red flags — stop immediately if you catch yourself:
- Using "should work", "probably passes", "seems fine"
- Feeling satisfied before running verification
- About to commit without a fresh test run
- Trusting prior output without re-running
This sounds obvious. It is not obvious when you're deep in a session and the previous test run was 20 minutes ago.
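The gate is easy to mechanize. Here is a minimal sketch, assuming the verification command is an ordinary process whose success shows up as exit code 0 plus a recognizable string in its output. `Evidence`, `run_fresh`, and `claim_passes` are hypothetical names, and the demo command is a stand-in for a real test runner.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Evidence:
    command: list[str]
    exit_code: int
    output: str


def run_fresh(command: list[str]) -> Evidence:
    """Run the verification command now and capture everything it prints."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so no output is missed
        text=True,
    )
    return Evidence(command, result.returncode, result.stdout)


def claim_passes(evidence: Evidence, expected_marker: str) -> bool:
    """A claim holds only if the exit code is 0 and the output shows the marker."""
    return evidence.exit_code == 0 and expected_marker in evidence.output


if __name__ == "__main__":
    # Stand-in for a real test command such as the project's test runner.
    evidence = run_fresh([sys.executable, "-c", "print('3 passed')"])
    print("exit code:", evidence.exit_code)
    print("output:", evidence.output.strip())
    print("claim holds:", claim_passes(evidence, "3 passed"))
```

Checking both the exit code and the output is deliberate: a runner that exits 0 after collecting no tests would pass the first check and fail the second.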
The Human's Role: Gatekeeper, Not Reviewer
In v2.2 (default mode), there are two mandatory human checkpoints:
- After CLARIFY — human reads the spec and approves the scope before any design or code starts
- After POST-REVIEW — human reviews the AUDIT_LOG, the spec, and the diff before SESSION commits anything
These are not optional. The whole model is that the human reads a short spec, not a long codebase. The AI builds the spec; the human approves it; the AI builds the code against the approved spec. The POST-REVIEW diff is then code-vs-approved-spec, which is a comparison a human can actually do.
AMAW: The Opt-In Multi-Agent Extension
For high-stakes work — data migrations, new service boundaries, security-critical paths — there's an optional extension: AMAW (Autonomous Multi-Agent Workflow). In AMAW mode, cold-start sub-agents replace or augment the human review gates:
- Adversary — finds exactly 3 things that could go wrong. Why 3? Enough to surface real issues, few enough to force prioritization rather than a laundry list. Never says what's good.
- Scope Guard — compares spec fingerprint vs implementation, checks AC coverage, issues CLEAR or BLOCKED
- Scribe — records decisions, writes session summaries, detects deferred items
- Audit Logger — finalizes the audit trail at RETRO
The key insight is cold-start: each agent is spawned fresh with only file access. It cannot inherit the main session's context rot or biases. It reads what's written; it can't be influenced by what was discussed in chat.
Note: AMAW removes the human from all review gates, including POST-REVIEW, which is held by the Scope Guard instead. The spec no longer gets human approval at CLARIFY; the Adversary challenges it in the next phase instead. In practice this means AMAW sessions can run with minimal human interaction, but they still require a human to kick off the task and review the final audit log. Pure fire-and-forget is not the design intent.
AMAW costs roughly $1–5 in sub-agent tokens and ~30 extra minutes per task. I use it for schema migrations and multi-system contracts. For everyday work, the human-in-loop default catches the same issues faster and cheaper.
What Gets Recorded: The Audit Log
Every phase transition and agent verdict appends to docs/audit/AUDIT_LOG.jsonl — one JSON line per event:
{"ts":"2026-05-15T17:42:00Z","task":"phase-14-model-swap","phase":"review-design","agent":"adversary","action":"review","status":"REJECTED","findings_count":3,"block_count":2,"warn_count":1,"note":"..."}
Append-only. Never modified. The main session and the sub-agents all write to it; none of them deletes or edits existing lines.
This becomes the durable record of what was decided and why — something that doesn't exist in most AI coding setups where everything lives in ephemeral chat.
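Writing and querying a log like this takes very little code. The sketch below uses field names from the example event; `append_event` and `read_events` are hypothetical helpers, and the `PASSED` status in the demo is an assumed value, since only `REJECTED` appears in the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, task: str, phase: str, agent: str,
                 action: str, status: str, note: str = "") -> dict:
    """Append one event as a single JSON line and return it."""
    event = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "task": task,
        "phase": phase,
        "agent": agent,
        "action": action,
        "status": status,
        "note": note,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" writes at the end of the file; existing lines are never touched.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event, separators=(",", ":")) + "\n")
    return event


def read_events(log_path: Path, **filters: str) -> list[dict]:
    """Return the events whose fields equal every given filter value."""
    events = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            event = json.loads(line)
            if all(event.get(key) == value for key, value in filters.items()):
                events.append(event)
    return events


if __name__ == "__main__":
    log = Path("docs/audit/AUDIT_LOG.jsonl")
    append_event(log, task="demo-task", phase="review-design",
                 agent="adversary", action="review", status="REJECTED")
    print(read_events(log, task="demo-task", status="REJECTED"))
```

One JSON object per line is what keeps the log queryable months later: each line parses on its own, so a filter never has to load or trust the whole file.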
What I've Shipped With This
free-context-hub
On free-context-hub I've delivered 15 development phases covering:
- Core backend: MCP server (36 tools), REST API (70+ endpoints), background worker
- Frontend: Next.js 16 + React 19, 20+ pages, human-in-loop review UI
- RAG pipeline: tiered search (ripgrep → FTS → semantic), 8-model embedding benchmark, reranking benchmarks with reproducible reports
- Multi-agent coordination: artifact leases with TTL/fencing, pending-review state, taxonomy profiles
- Knowledge portability: zip+JSONL bundle format, streaming import/export, cross-instance pull with SSRF hardening
- Tenant-scoped access control: authz model, 3-tier routing, event log, collective decisions
LoreWeave
On lore-weave I've delivered 5 full vertical modules and am mid-way through a sixth, accumulating 1,497 commits since March 2026 across 19 microservices. The modules completed so far cover:
- Identity & Auth — JWT issuance, refresh rotation, multi-device session management (Go/Chi + NestJS gateway)
- Books & Sharing — book and chapter lifecycle, visibility policy, public catalog browse (Go/Chi, Postgres, MinIO)
- Provider Registry — BYOK AI provider credential vault, platform model catalog, streaming proxy, budget pre-flight (Go/Chi + worker-ai)
- Raw Translation Pipeline — async chunk-level translation job lifecycle, job queue via Redis Streams, per-chapter result storage, BYOK + platform model routing (Go/Chi + Python/FastAPI + worker-infra)
- Glossary & Lore Management — multilingual entity management, chapter M:N evidence linking, wiki article generation, RAG-ready glossary export (Go/Chi, Postgres, glossary-service + knowledge-service two-layer pattern)
The current Phase 6 work spans usage-billing and a hierarchical book extraction engine — the kind of multi-service, cross-cutting work where the workflow's cross-phase checkpoints earn their keep.
That's 400+ commits on free-context-hub and 1,497 on lore-weave. The rest come from private team projects that also run this workflow, for a total of 2,500+ commits with a live audit trail I can query across sessions that ran months apart.
The hardest part was Phase 10 (SESSION) — keeping the session patch updated after every sprint without skipping it. Once that became a habit, sessions started to feel continuous rather than amnesia-punctuated.
The Real Pros
You understand your own system deeply. Because you write the spec and approve it, you can't hide behind "the AI built it." You actually know what was built and why the trade-offs were made. This is the biggest practical advantage for me — not velocity, but comprehension.
Architectural decisions have a paper trail. Every trade-off is in a spec file that was approved before code was written. When a future session revisits a design choice, the rationale is readable, not reconstructed from diff archaeology.
Context drift is visible. When an AI starts building something that wasn't in the spec, the spec fingerprint comparison at POST-REVIEW catches it. Without a written spec, you'd never notice until integration time.
Deferred items don't get lost. The workflow forces any "we'll do this later" to be written in DEFERRED.md with a specific trigger condition. Nothing lives only in chat — chat is ephemeral, files are truth.
It's incrementally adoptable. You can start with just CLARIFY + VERIFY and get substantial value. Add phases as your trust in the workflow grows.
The Real Cons
Token usage is genuinely high. Each phase generates artifacts: spec files, plan files, audit events. AMAW mode multiplies this by spawning sub-agents. A single M-sized task with AMAW can burn 5,000–10,000 tokens before a line of code is written. At scale, this is a real budget consideration.
You clarify constantly, and it takes real time. Phase 1 (CLARIFY) is not a quick preamble. For any task with real ambiguity (architecture decisions, new API contracts, trade-off calls) you're in a back-and-forth that can run 20–40 minutes before design starts. At a medium-sized project cadence (10–20 above-XS tasks per sprint), this adds up to multiple hours per sprint spent purely on scoping: even 10 tasks at 20 minutes each is over 3 hours. This is actually the point of the workflow, but if you're used to "just build it," the overhead feels significant early on.
Human approval gates limit automation. Every architecture decision, trade-off, and scope call requires your explicit approval. You cannot queue up a batch of tasks and walk away. If you need fully autonomous overnight runs, this workflow is the wrong tool.
The discipline needs enforcement tooling to hold. Left to their own devices, agents will skip phases. The workflow holds together because of workflow-gate.sh (a pre-commit gate that blocks commits if VERIFY and SESSION aren't done) and the append-only AUDIT_LOG.jsonl. If you copy docs/WORKFLOW.md into your project without also setting up the enforcement layer, expect phases to get skipped within a few sessions. The tooling is in the repository — it's not hidden — but it's a real setup step, not just copy-paste.
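For a sense of what such a gate does, here is a minimal sketch of the idea in Python. It is not the contents of workflow-gate.sh. It assumes the gate can decide from the audit log alone, and that VERIFY and SESSION events are recorded with the phase labels `verify` and `session`; those labels and both function names are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical phase labels; use whatever your audit events actually record.
REQUIRED_PHASES = ("verify", "session")


def missing_phases(log_path: Path, task: str) -> list[str]:
    """List the required phases that have no recorded event for this task."""
    seen = set()
    if log_path.exists():
        for line in log_path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            if event.get("task") == task:
                seen.add(event.get("phase"))
    return [phase for phase in REQUIRED_PHASES if phase not in seen]


def gate_exit_code(log_path: Path, task: str) -> int:
    """Return 0 to allow the commit and 1 to block it.

    Git aborts a commit when its pre-commit hook exits with a non-zero status.
    """
    missing = missing_phases(log_path, task)
    for phase in missing:
        print(f"BLOCKED: no '{phase}' event recorded for task '{task}'")
    return 1 if missing else 0
```

A hook script would call `gate_exit_code` with the current task name and pass the result to `sys.exit`. A missing log file blocks the commit too, so the gate fails closed rather than open.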
Cold-start sub-agents (AMAW only) miss things said in chat. Because each AMAW sub-agent reads files from scratch, anything that was decided verbally in the session but never written to a file is invisible to them. This is a feature for preventing bias, but it means you must be disciplined about writing things down as you go. The Scribe sub-agent helps, but it can only record what's already in files.
Who This Is For
Worth the overhead if:
- You're building production systems — not prototypes — that will be maintained and extended
- You care about knowing why each decision was made, not just that it compiles today
- You find yourself surprised by what the AI built, in ways that cost you rework later
- Sessions run over weeks or months and you need continuity across context windows
Overkill if:
- You're doing exploratory coding, one-shot scripts, or time-boxed experiments
- Your sessions are short and the full context fits in one window
- You don't need an audit trail or human-approved architectural decisions
- Speed of iteration matters more than correctness of decision-making
The workflow is designed for the first category. Using it for the second is just friction.
How to Use It
All workflow files live in the agentic-workflow/ folder of the free-context-hub repository.
Start with the template:
- Copy WORKFLOW.md into your project root or paste the relevant sections into your CLAUDE.md / agent instructions. This is the full 12-phase spec.
- Customize the [CUSTOMIZE] sections for your stack (verification commands, test runner, any MCP tools you use). MCP is the Model Context Protocol, an interface for giving AI agents access to external tools and knowledge stores; the workflow works without it.
- Add workflow-gate.sh from the same folder to enforce the phase gates mechanically. Without this, agents will skip phases.
- For high-stakes tasks, see amaw-workflow.md for the AMAW multi-agent extension.
- Start with just task size classification + VERIFY. Those two alone change how you work with agents.
The workflow is model-agnostic. I use it with Claude Code but nothing in the spec requires it.
Final Thought
The 12-phase workflow is not magic. It's a way of making explicit things that were always implicit: what are we building, how big is it, what's the verification evidence, who approved it, what did we learn? The AI does most of the work. The human stays in control of the decisions that actually matter.
The cost is real — more tokens, more time spent clarifying, more things requiring your approval before the AI proceeds. The benefit is also real: you end up with a system you understand deeply, and a trail of why it was built the way it was.
For me, after 2,500+ commits across multiple projects, that trade-off is still worth it.
Repositories: letuhao/free-context-hub · letuhao/lore-weave
Workflow files: WORKFLOW.md · AMAW.md · CLAUDE.md