Make AI coding agents prove when work is done.
Proof Loop is a repo-local verification protocol for AI coding agents. It freezes acceptance criteria before the build, separates builder and verifier roles, records durable proof artifacts in the repo, and refuses to call work done until every acceptance criterion has a fresh PASS verdict.
Use it when an agent, team, or multi-agent sprint needs a clear boundary between “looks done” and verified work. Because the protocol is just files plus role discipline, it works with OpenClaw, Hermes, Codex, OpenCode, Claude Code, or any other harness that can read and write a repository.
Use Cases
- keep AI coding agents honest when they claim a task is done
- freeze acceptance criteria before implementation starts
- separate builder and verifier roles in multi-agent coding work
- leave proof artifacts in the repo for future review
Proof artifacts and role-brief examples are indexed in examples/README.md.
20-second demo
git clone https://github.com/LeoStehlik/proof-loop.git cd proof-loop make test tmp=$(mktemp -d) bin/proof-loop-init hn-demo --title "Prove this task before done" --root "$tmp" bin/proof-loop-check "$tmp/.agent/tasks/hn-demo"
The last command fails on purpose because the generated task has not been verified yet. Proof Loop only returns success after a fresh verifier records PASS for every acceptance criterion and problems.md is empty.
A completed passing example is included:
bin/proof-loop-check examples/example-task/.agent/tasks/ui-language-fix bin/proof-loop doctor bin/proof-loop report examples/demo-repo/.agent/tasks/nav-labels-proof --format md
Why It Exists
AI coding agents often fail in predictable ways:
- they claim completion without durable proof
- the same session builds and judges its own work
- acceptance criteria drift while implementation is underway
- verification is a prose summary instead of a live check
- future sessions cannot tell what was actually tested
Proof Loop makes completion auditable. A task is done only when a fresh verifier has checked each AC and the repo contains the artifacts to prove it.
What You Get
- a clear sprint protocol: spec freeze -> build -> evidence -> fresh verify -> fix loop
- role boundaries for orchestrator, spec-freezer, builder, verifier, and fixer
- helper scripts to initialize and check task proof folders
- a complete example task with passing artifacts
- copy-paste role briefs for OpenClaw, Hermes, Codex, OpenCode, Claude Code, or any agent setup
- a documented boundary with Loopsmith for recurring behaviour improvement
CLI
bin/proof-loop init TASK_ID --title "Task title"
bin/proof-loop check TASK_ID
bin/proof-loop status TASK_ID
bin/proof-loop list
bin/proof-loop doctor
bin/proof-loop report TASK_ID --format md
bin/proof-loop install-guides --dry-run --harness codex --harness claudeQuick Start
Clone the repo or copy it into the project where you want to run the protocol.
Create a task proof folder from this repo or from another repository:
bin/proof-loop-init ui-language-fix --title "Fix German navigation labels" --root .
This creates:
.agent/tasks/ui-language-fix/
spec.md
verdict.json
problems.md
evidence.md
Fill spec.md with explicit acceptance criteria before implementation starts.
After the build and verifier pass, check whether the task is allowed to be called done:
bin/proof-loop-check .agent/tasks/ui-language-fix
The check exits non-zero unless:
verdict.jsonhasoverall: PASS- every AC has
status: PASS problems.mdis empty or absent
What This Is Not
- not an agent framework
- not a benchmark suite
- not a replacement for tests
- not tied to one model, vendor, or harness
Proof Loop is deliberately small: a protocol, a few files, and a mechanical done gate.
The Protocol
spec freeze -> build -> evidence -> fresh verify -> fix -> fresh verify
^ |
|____________________|
repeat until all ACs PASS
Roles
| Role | Does | Never |
|---|---|---|
| Orchestrator | Keeps the loop intact and refuses weak completion | Accepts narrative-only proof |
| Spec-Freezer | Writes frozen spec.md with explicit ACs |
Edits production code |
| Builder | Implements against the frozen spec | Verifies own work as final |
| Verifier | Fresh session that checks each AC | Edits production code |
| Fixer | Applies minimal fixes for verifier findings | Signs off on completion |
The verifier must be a fresh session. The agent that built the change does not judge whether the change is done.
Acceptance Criteria
Good ACs are specific and testable by a third party.
AC1: A user with locale=de sees all navigation labels in German after saving language preference.
Verify: browser check against a German-locale test user.
AC2: The language preference survives page reload.
Verify: reload the page and confirm the saved locale and labels remain German.
AC3: Existing English navigation remains unchanged for locale=en.
Verify: switch back to English and confirm the original labels render.
Weak ACs are task descriptions, not proof conditions:
AC1: Translate the UI.
AC2: Make language switching work.
AC3: Fix the bugs.
Artifacts
Every task stores proof under .agent/tasks/<TASK_ID>/.
.agent/tasks/<TASK_ID>/
spec.md frozen ACs, constraints, non-goals, verification approach
evidence.md build summary and checks run
verdict.json structured verifier result: PASS / FAIL / UNKNOWN per AC
problems.md specific open failures, empty when no problems remain
See references/artifacts.md for schemas.
Real Demo
Run a small failing-to-passing demo:
make demo
The demo intentionally breaks a tiny navigation-label fixture, shows the check failing, applies the fix, reruns the check, and renders a proof report.
Examples
A complete passing example lives at:
examples/example-task/.agent/tasks/ui-language-fix/
Role prompts live at:
examples/role-briefs/
orchestrator.md
spec-freezer.md
builder.md
verifier.md
fixer.md
Proof Loop vs Loopsmith
Proof Loop governs a single task.
Loopsmith improves repeated agent behaviour over time.
Use Proof Loop when you need a specific task to finish with evidence. Use Loopsmith when the same failure pattern keeps coming back and you want to improve the agent, prompt, policy, or evaluator itself.
See references/loopsmith-bridge.md.
When To Use Which Repo
Use this repo when a specific coding task needs evidence before anyone is allowed to call it done. Proof Loop freezes the spec, separates builder and verifier roles, requires proof artifacts, and records verdicts in the repo.
Use the neighbouring tools at different points in the workflow:
| Need | Use |
|---|---|
| Turn a fuzzy request into an executable agent brief | Brief Master |
| Prove one coding task is actually done | Proof Loop |
| Improve repeated agent behaviour with evals | Loopsmith |
| Keep source-backed memory for long-running agents | Sovereign Brain |
| Stop frontend agents producing generic UI sludge | no-slop-ui |
A practical chain looks like this: messy request -> Brief Master brief -> Proof Loop task -> Loopsmith eval if the same failure keeps recurring -> Sovereign Brain records the durable decision.
Related Tools
- Loopsmith - use when Proof Loop exposes a repeated agent behaviour problem that should become an eval and promotion loop.
- Sovereign Brain - source-backed memory for long-running agents; useful when proof artifacts, decisions, and synthesis need durable context.
- Brief Master - helps write sharper task briefs and acceptance criteria before a Proof Loop starts.
Installation As A Skill
OpenClaw
Add your skills directory to openclaw.json:
{
"skills": {
"load": {
"extraDirs": ["/path/to/your/skills"]
}
}
}Clone this repo into that directory:
git clone https://github.com/LeoStehlik/proof-loop.git /path/to/your/skills/proof-loop
Codex / Claude Code
Copy the proof-loop folder into your agent skills directory, or reference SKILL.md directly in your task brief. For harnesses without a formal skill system, use the README, role briefs, and scripts directly from the repo.
Repository Map
proof-loop/
SKILL.md skill trigger and core operating rules
bin/
proof-loop unified CLI
proof-loop-init compatibility wrapper
proof-loop-check compatibility wrapper
scripts/
init_task.py create .agent/tasks/<TASK_ID>/ skeletons
check_task.py mechanical done gate
schemas/ JSON schemas for verdict and evidence bundles
templates/ opt-in harness guide templates
tests/ stdlib unittest coverage for CLI behavior
.github/workflows/test.yml CI running make test
references/
workflow.md full phase-by-phase protocol
brief-template.md reusable sprint and role prompts
artifacts.md artifact schemas
loopsmith-bridge.md when to escalate repeated failures to Loopsmith
examples/
example-task/ complete passing proof artifact example
role-briefs/ copy-paste role prompts
Status
Usable protocol skill and small toolkit. The scripts are intentionally stdlib-only so they can run inside almost any repository without packaging ceremony.
License
MIT - see LICENSE.
Attribution
Inspired by repo-task-proof-loop, adapted for practical multi-agent coding work and public agent-operation skills.





















