How to Build a Production-Safe Agent Loop: From Exit Conditions to Audit Trails
Daniel Nwaneri · 2026-06-16 · via freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

In July 2025, a Claude Code recursion loop burned between 16,000 USD and 50,000 USD in five hours. There was no crash or error, just agents doing exactly what they were told, indefinitely, because nobody told them when to stop.

Four months later, a four-agent LangChain loop ran for eleven days and cost 47,000 USD. Nobody noticed until the invoice arrived. The pipeline worked correctly in testing, and the agents were doing exactly what they were told. Same pattern.

This tutorial is about that missing instruction.

You'll build five small Python primitives that catch most agent loop failures before they ship:

  • A spec writer that forces you to define done before the loop starts

  • A circuit breaker that kills the loop when it exceeds hard limits

  • A ledger that records every turn in an append-only SQLite audit trail

  • An agent loop that ties all three together

  • A review surface that forces human attestation before downstream systems receive anything

By the end you'll have a working repo you can drop into any agent project. The full code is at github.com/dannwaneri/production-safe-agent-loop.

Table of Contents

  1. Why This Keeps Happening

  2. Prerequisites

  3. Phase 1: Define Done Before You Build

  4. Phase 2: Enforce Done at Runtime

  5. Phase 3: Record Everything

  6. Phase 4: The Loop That Respects Its Boundaries

  7. Phase 5: The Review Surface

  8. Phase 6: A Real Example, SEO Audit Agent

  9. Pluggable LLM Client

  10. Running the Tests

  11. What You've Built

  12. Next Steps

Why This Keeps Happening

The math that got companies into trouble was simple. A chatbot costs roughly 0.04 USD per interaction. An orchestrated multi-agent workflow costs 1.20 USD. That's a 30x multiplier — and production benchmarks show it can reach 70x on complex tasks.

The problem isn't that agents are expensive. The problem is that most teams budgeted for chatbot costs and deployed agent architectures. Gartner found the token consumption gap between pilot chatbots and production agent workflows sits at 5-30x. The FinOps Foundation's 2026 State of FinOps report found 73% of enterprises say AI costs exceeded original projections.

The mechanism is straightforward once you see it. When an agent fails a task and retries, it doesn't start fresh. It re-reads the entire context window, including every prior failed attempt, before trying again. Iteration one costs 100 tokens. Iteration two costs 200. By iteration ten, a single attempt costs around 1,000 tokens and the run has consumed several thousand in total. You're paying for every failure, over and over, in milliseconds.

# This is the entire problem in three lines
while True:
    result = agent.run(task)
    # done when...?

That question mark is where the money goes.
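The retry arithmetic described above can be made concrete with a short sketch. It assumes, purely for illustration, that every attempt adds a fixed 100 tokens to the context and that each retry re-reads everything before it. The function name and the 100-token figure are hypothetical, not measurements.

```python
def cumulative_tokens(iterations: int, tokens_per_attempt: int = 100) -> int:
    """Total tokens billed when each retry re-reads all prior attempts."""
    total = 0
    context = 0
    for _ in range(iterations):
        context += tokens_per_attempt  # the context grows by one attempt
        total += context               # the whole context is billed again
    return total


for n in (1, 2, 10, 50):
    print(n, cumulative_tokens(n))
```

Under these assumptions the per-iteration cost grows linearly, so the cumulative bill grows quadratically: ten iterations total 5,500 tokens and fifty total 127,500.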

The other thing making it worse: agents don't fail loudly. Traditional code hits an undefined state and crashes. An LLM hits ambiguity and tries to be helpful. It retries. It reformats the tool call. It spins up a verification agent. The verification agent finds something. A correction agent fires. Nobody defined what "correct" means. The loop looks beautiful on every dashboard you have — activity, tool calls, completion rate — while quietly burning through your budget.

Gartner predicts that 40% of agentic projects will be scrapped by 2027 due to economic failure. Most of that failure is preventable. Not with better models, but with exit conditions.

Prerequisites

  • Python 3.10+

  • An Anthropic API key (or any provider — more on that later)

  • Basic familiarity with Python classes and SQLite

git clone https://github.com/dannwaneri/production-safe-agent-loop
cd production-safe-agent-loop
pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-...

Phase 1: Define Done Before You Build

The most expensive mistake in agent development isn't a bad model choice or a missing retry limit. It's starting the build before you can answer one question in one sentence:

What does done look like?

Most teams can't answer it. Not because they're careless, but because nothing forces them to before they open the terminal. The spec writer is that forcing function.

# spec_writer.py
from spec_writer import SpecWriter

spec = SpecWriter(db_path="spec.db").run()

When you call .run(), it won't return until you've answered three questions:

  1. What does this do?

  2. What does this NOT do?

  3. What does done look like in one sentence?

The third question is the one that matters. It's also the hardest. "The agent audits the site" is not an answer. "The agent crawls the target URL, extracts all <title> and <meta description> tags, flags any missing or over-length, and stops" is an answer. One of those gives the circuit breaker something to enforce.

The spec stores to SQLite and returns a SpecResult dataclass with a session_id. That ID becomes the thread connecting your spec, your ledger rows, and your loop result. One session, traceable end to end.

@dataclass(frozen=True)
class SpecResult:
    what_it_does: str
    what_it_does_not: str
    done_looks_like: str
    session_id: str

frozen=True matters. The spec is a commitment, not a draft. Once it's written, the loop runs against it. No mid-run revisions.

For testing, SpecWriter accepts injectable input_fn and output_fn callables. No stdin monkey-patching required. See tests/test_spec_writer.py for working examples — the suite uses a small scripted_input helper that returns answers from a generator, and writes to a per-test SQLite file via pytest's tmp_path fixture. SQLite's :memory: isn't safe here, because SpecWriter opens a fresh connection per method and each :memory: connection is its own isolated database.
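The :memory: isolation is easy to see with the standard sqlite3 module alone. This is a minimal sketch, not code from the repo: the single-column spec table here is a hypothetical stand-in for whatever SpecWriter actually creates.

```python
import sqlite3

# Connection A creates a table in its own in-memory database.
conn_a = sqlite3.connect(":memory:")
conn_a.execute("CREATE TABLE spec (session_id TEXT)")
conn_a.execute("INSERT INTO spec VALUES ('abc')")
conn_a.commit()

# Connection B opens ":memory:" again and gets a different, empty database.
conn_b = sqlite3.connect(":memory:")
tables_in_b = conn_b.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(tables_in_b)  # connection B sees no tables at all

conn_a.close()
conn_b.close()
```

A component that opens a new connection for each method call would therefore write to one database and read from another, which is why a file under tmp_path is the safer choice for tests.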

Phase 2: Enforce Done at Runtime

Defining the exit condition upstream is discipline. The circuit breaker is enforcement.

# circuit_breaker.py
from circuit_breaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
breaker.check(turn_count, accumulated_tokens)  # raises on breach

Two ceilings. Both hard.

turn_limit caps how many times the loop can call the LLM. token_limit caps total token consumption across all turns. Either one tripping raises CircuitBreakerError immediately.

The boundary is strict: turn_count == turn_limit is allowed. turn_count == turn_limit + 1 trips. No grace periods or warnings. A hard stop forces a human checkpoint.

from dataclasses import dataclass


@dataclass
class CircuitBreakerError(Exception):
    reason: str          # "turn_ceiling" or "token_ceiling"
    turn_count: int
    accumulated_tokens: int

    def __post_init__(self) -> None:
        super().__init__(
            f"circuit breaker tripped: {self.reason} "
            f"(turn={self.turn_count}, tokens={self.accumulated_tokens})"
        )


class CircuitBreaker:
    def __init__(self, turn_limit: int = 5, token_limit: int = 15000) -> None:
        self.turn_limit = turn_limit
        self.token_limit = token_limit

    def check(self, turn_count: int, accumulated_tokens: int) -> None:
        if turn_count > self.turn_limit:
            self._trip("turn_ceiling", turn_count, accumulated_tokens)
        if accumulated_tokens > self.token_limit:
            self._trip("token_ceiling", turn_count, accumulated_tokens)

    def _trip(self, reason: str, turn_count: int, accumulated_tokens: int) -> None:
        print(
            "\n=== CIRCUIT BREAKER CHECKPOINT ===\n"
            f"reason         : {reason}\n"
            f"turn_count     : {turn_count} / limit {self.turn_limit}\n"
            f"tokens_used    : {accumulated_tokens} / limit {self.token_limit}\n"
            "action         : halt loop, surface to human reviewer\n"
            "=================================="
        )
        raise CircuitBreakerError(
            reason=reason,
            turn_count=turn_count,
            accumulated_tokens=accumulated_tokens,
        )

CircuitBreakerError is an exception, not a return code. That's intentional. A return code can be ignored by accident; an exception has to be caught on purpose. The human-readable checkpoint banner is printed to stdout by _trip() before the exception is raised, so even if a caller swallows the exception, the operator still sees the state at the moment of the breach.

The critical rule: call .check() before every LLM call, not after. Post-flight checking means you've already burned the tokens before you knew the limit was exceeded.

# Wrong — post-flight
result = client.messages.create(...)
breaker.check(turn_count, accumulated_tokens)  # too late

# Right — pre-flight
breaker.check(turn_count, accumulated_tokens)  # raises before any spend
result = client.messages.create(...)

The defaults (5 turns, 15,000 tokens) match a tight tutorial demo. Your production budget is different. Tune at instantiation:

# Production example: more turns, larger token budget
breaker = CircuitBreaker(turn_limit=10, token_limit=50000)

Phase 3: Record Everything

The circuit breaker protects your bank account. The ledger protects your understanding of what happened.

Most teams log for debugging — they want to know what went wrong after it went wrong. The ledger has a different purpose. It's governance. Every row is proof that the loop stayed within its boundaries, or didn't, and exactly when.

# ledger.py
from ledger import Ledger

ledger = Ledger(db_path="ledger.db")
ledger.write(
    session_id=spec.session_id,
    turn_count=1,
    state_origin="llm",
    input_str=task,
    token_delta=523,
    execution_time_ms=1240,
    pass_fail=True,
)

One row per turn. Append-only, no updates, and no deletes. The immutability is the point: a ledger you can edit isn't a ledger, it's a notebook.

The schema:

CREATE TABLE IF NOT EXISTS ledger (
    id                 INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id         TEXT    NOT NULL,
    turn_count         INTEGER NOT NULL,
    state_origin       TEXT    NOT NULL,
    input_hash         TEXT    NOT NULL,
    token_delta        INTEGER NOT NULL,
    execution_time_ms  INTEGER NOT NULL,
    pass_fail          INTEGER NOT NULL,  -- 1=pass, 0=fail
    breach_reason      TEXT,              -- NULL unless circuit breaker fired
    created_at         TEXT    NOT NULL   -- ISO 8601, UTC
);
CREATE INDEX IF NOT EXISTS idx_ledger_session ON ledger(session_id);

The index lets get_session(session_id), the primary read path, use a B-tree lookup on session_id instead of scanning the whole table, so reads stay fast (logarithmic rather than linear) as the ledger grows.
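One way to check that the session lookup actually uses the index is SQLite's EXPLAIN QUERY PLAN. The sketch below uses a trimmed-down copy of the schema (three columns only, an assumption made for brevity) and a placeholder session ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ledger (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id TEXT    NOT NULL,
        turn_count INTEGER NOT NULL
    );
    CREATE INDEX idx_ledger_session ON ledger(session_id);
    """
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM ledger WHERE session_id = ?",
    ("demo-session",),
).fetchall()

# Each plan row has four columns; the last one is a human-readable detail
# string that names the index when the planner decides to use it.
plan_details = [row[3] for row in plan]
print(plan_details)
conn.close()
```

The exact wording of the detail string varies between SQLite versions, so it is safer to look for the index name in it than to match the whole string.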

Three decisions worth explaining:

  1. input_hash not input_text. The raw input string never persists. Only its SHA-256 hash does. There are two benefits to this: identical inputs across runs are detectable, and PII never enters the audit trail.

  2. pass_fail as INTEGER not BOOLEAN. SQLite has no boolean type. 1 and 0 are canonical. Clean Python ergonomics at the API edge, correct SQL types on disk.

  3. created_at as datetime.now(timezone.utc).isoformat(). datetime.utcnow() was deprecated in Python 3.12. Timezone-aware timestamps avoid the footgun in any system that crosses timezones.
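The three decisions above can be sketched in a few lines of standard-library Python. prepare_row is a hypothetical helper written for this illustration; the repo's Ledger.write may organise the same steps differently.

```python
import hashlib
from datetime import datetime, timezone


def prepare_row(input_str: str, pass_fail: bool) -> dict:
    """Build the storage-safe values for one ledger row (hypothetical helper)."""
    return {
        # Only the digest is stored; the raw input never reaches disk.
        "input_hash": hashlib.sha256(input_str.encode("utf-8")).hexdigest(),
        # SQLite stores booleans as integers: 1 = pass, 0 = fail.
        "pass_fail": int(pass_fail),
        # Timezone-aware UTC timestamp in ISO 8601 form.
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


row_a = prepare_row("Audit https://example.com", True)
row_b = prepare_row("Audit https://example.com", False)

# Identical inputs produce identical hashes, so repeats are detectable.
print(row_a["input_hash"] == row_b["input_hash"])
print(row_a["pass_fail"], row_b["pass_fail"], row_a["created_at"])
```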

Retrieve by session:

rows = ledger.get_session(spec.session_id)
for row in rows:
    print(f"Turn {row.turn_count}: {'PASS' if row.pass_fail else 'FAIL'} "
          f"| {row.token_delta} tokens | {row.execution_time_ms}ms")

Phase 4: The Loop That Respects Its Boundaries

The agent loop wires the three primitives together. It's the only component that calls the LLM. Everything else is local.

# agent_loop.py
from agent_loop import AgentLoop

loop = AgentLoop(spec, breaker, ledger, client)
result = loop.run(task)
# LoopResult(success, turns, total_tokens, session_id, breach_reason)

The anatomy of a turn, in order:

  1. circuit_breaker.check(turn_count, accumulated_tokens) — raises if either ceiling is exceeded

  2. client.messages.create(...) — the actual LLM call

  3. ledger.write(...) — one row, append-only

  4. If stop_reason == "end_turn", return. Otherwise loop.

Pre-flight checking before every LLM call, with no exceptions.

def run(self, task: str) -> LoopResult:
    session_id = self.spec.session_id
    messages: list[dict] = [{"role": "user", "content": task}]
    turn = 0
    total_tokens = 0

    try:
        while True:
            turn += 1
            self.circuit_breaker.check(turn, total_tokens)

            started = time.perf_counter()
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self._system_prompt(),
                messages=messages,
            )
            elapsed_ms = int((time.perf_counter() - started) * 1000)

            turn_tokens = (
                getattr(response.usage, "input_tokens", 0)
                + getattr(response.usage, "output_tokens", 0)
            )
            total_tokens += turn_tokens

            text = self._text_from(response)
            messages.append({"role": "assistant", "content": text})

            self.ledger.write(
                session_id=session_id,
                turn_count=turn,
                state_origin="llm",
                input_str=task,
                token_delta=turn_tokens,
                execution_time_ms=elapsed_ms,
                pass_fail=True,
            )

            if getattr(response, "stop_reason", "end_turn") == "end_turn":
                return LoopResult(
                    success=True,
                    turns=turn,
                    total_tokens=total_tokens,
                    session_id=session_id,
                )

            messages.append({"role": "user", "content": "continue"})

    except CircuitBreakerError as err:
        self.ledger.write(
            session_id=session_id,
            turn_count=turn,
            state_origin="circuit_breaker",
            input_str=task,
            token_delta=0,
            execution_time_ms=0,
            pass_fail=False,
            breach_reason=err.reason,
        )
        return LoopResult(
            success=False,
            turns=turn,
            total_tokens=total_tokens,
            session_id=session_id,
            breach_reason=err.reason,
        )

def _system_prompt(self) -> str:
    return (
        "You are an agent working on a tightly-scoped task.\n\n"
        f"What this does: {self.spec.what_it_does}\n"
        f"What this does NOT do: {self.spec.what_it_does_not}\n"
        f"Done looks like: {self.spec.done_looks_like}\n"
    )

@staticmethod
def _text_from(response) -> str:
    content = getattr(response, "content", None)
    if not content:
        return ""
    block = content[0]
    return getattr(block, "text", "") or ""

A few choices worth calling out in this body:

  • The whole while True: is wrapped in one try/except CircuitBreakerError. The check happens at the top of every turn, so a breach is caught the same way whether it fires on turn 1 or turn 6.

  • input_str=task on every ledger row — the original task, not the last assistant message. The input_hash column then groups rows that share the same starting input across the run.

  • pass_fail=True for every LLM turn that returns, False only on breach. The pass/fail flag tracks whether the loop reached the row legitimately, not whether the model's output was good. Quality scoring is a separate concern.

  • _system_prompt() uses all three spec fields, not just done_looks_like. The model needs the negative scope (what_it_does_not) at least as much as the positive scope.

  • time.perf_counter() not time.time() — monotonic, immune to wall-clock adjustments mid-run.

LoopResult.session_id is inherited from spec.session_id. The ledger rows tie back to the spec without a join. One session ID, one traceable run, start to finish.

Phase 5: The Review Surface

The circuit breaker protects your bank account. The ledger records what happened. But neither tells you whether what happened matched what you promised.

That gap is where bad loops get approved. Polished output, green dashboard, missed commitment. A reviewer sees the artifact, decides it looks acceptable, and signs off. Nobody asked whether the original promise was kept.

The review surface closes that gap. It reads the session from SQLite, assembles the five-element frame, and forces a comparison before anything downstream receives the output.

from review_surface import ReviewSurface

rs = ReviewSurface(spec_db_path="spec.db", ledger_db_path="ledger.db")
print(rs.render(session_id))

Here's the five-element frame, in order:

  1. Original promise — pulled from the spec table: what it does, what it doesn't do, what done looks like

  2. Acceptance criteria — the done_looks_like field rendered as the explicit benchmark

  3. Diff — first turn input vs final turn output, turns completed, total tokens, whether the loop breached

  4. Evidence — all ledger rows for the session: turn-by-turn pass/fail, token delta, execution time

  5. Unresolved assumptions — derived from breach rows and failed turns. Empty when clean.

When the reviewer is satisfied, they attest:

attestation = rs.attest(
    session_id=result.session_id,
    reviewer="daniel",
    notes="Output matches spec. Approved."
)
print(attestation.frame_hash)

.attest() writes to the attestations table in ledger.db. The frame_hash is a SHA-256 of the canonical frame data — deterministic across reviewers attesting the same session. It's the audit receipt. It proves the reviewer saw the exact frame as rendered, not a summary or a paraphrase.
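A deterministic frame hash needs a canonical byte encoding of the frame. The sketch below shows one common way to get that, sorted-key JSON fed into SHA-256. The encoding scheme and the field names are assumptions for illustration; the repo's ReviewSurface may canonicalise the frame differently.

```python
import hashlib
import json


def frame_hash(frame: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the frame (hypothetical scheme)."""
    # sort_keys and fixed separators make the encoding independent of
    # dictionary ordering and whitespace.
    canonical = json.dumps(frame, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


frame = {
    "session_id": "demo-session",
    "acceptance_criteria": "flags missing or over-length tags, then stops",
    "turns": 2,
    "total_tokens": 1046,
}
reordered = dict(reversed(list(frame.items())))  # same data, different key order

print(frame_hash(frame) == frame_hash(reordered))
```

For the hash to match across reviewers, the hashed data has to leave out anything reviewer-specific, such as the reviewer name or the attestation time. Those belong on the attestation row, not in the frame.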

Approval confirms the process ran. Attestation confirms the reviewer compared output to commitment. When the loop touches something regulated, those are different legal documents.

@dataclass(frozen=True)
class ReviewFrame:
    session_id: str
    original_promise: SpecResult
    acceptance_criteria: str
    diff: DiffResult
    evidence: tuple  # tuple[LedgerRow, ...]
    unresolved_assumptions: tuple  # tuple[str, ...]
    created_at: str

ReviewFrame is frozen for the same reason SpecResult is: the frame is evidence, not a draft. evidence and unresolved_assumptions are tuples because a frozen dataclass derives its hash from its fields, and a list field would make the frame unhashable while still being mutable in place.
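The difference between tuple and list fields shows up as soon as a frozen dataclass instance is hashed. TupleFrame and ListFrame below are hypothetical one-field stand-ins for ReviewFrame, used only to isolate the behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleFrame:
    evidence: tuple


@dataclass(frozen=True)
class ListFrame:
    evidence: list


# A frozen dataclass hashes the tuple of its fields, so tuple fields work.
tuple_frame_hash = hash(TupleFrame(evidence=("turn-1", "turn-2")))
print(isinstance(tuple_frame_hash, int))

# The same call fails when a field holds a list.
try:
    hash(ListFrame(evidence=["turn-1", "turn-2"]))
    list_frame_hashable = True
except TypeError:
    list_frame_hashable = False

print(list_frame_hashable)
```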

The full end-to-end flow with the review surface lives in examples/review_example.py in the repo. Run it after any completed session: it renders the five-element frame, prompts for attestation, and writes the receipt if you approve.

The loop runs to you. Downstream systems get nothing until someone signs.

Phase 6: A Real Example — SEO Audit Agent

The pattern only makes sense against a real problem. This is the same agent architecture behind my seo-agent project.

SEO audits have a natural cadence: crawl, surface what's broken, fix, wait for reindex. Running the agent continuously doesn't change that cadence. It just burns tokens in the empty space between the moments that matter. A cron job wired to the loop is the honest architecture.

# examples/seo_audit_example.py
import requests
from bs4 import BeautifulSoup
import anthropic
from spec_writer import SpecWriter
from circuit_breaker import CircuitBreaker
from ledger import Ledger
from agent_loop import AgentLoop

def crawl_url(url: str) -> str:
    """Fetch a page and summarise its title, meta description, and H1 tags."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")
    meta_desc = soup.find("meta", attrs={"name": "description"})
    h1_tags = soup.find_all("h1")
    # A description tag with no content attribute counts as missing
    # instead of raising KeyError.
    description = meta_desc.get("content") if meta_desc else None
    return (
        f"URL: {url}\n"
        f"Title: {title.text if title else 'MISSING'}\n"
        f"Meta description: {description or 'MISSING'}\n"
        f"H1 count: {len(h1_tags)}\n"
        f"H1 tags: {[h.text[:50] for h in h1_tags]}"
    )

def run_seo_audit(url: str) -> None:
    # Step 1: Define done before the loop starts
    spec = SpecWriter(db_path="spec.db").run()

    # Step 2: Initialise circuit breaker and ledger
    breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
    ledger = Ledger(db_path="ledger.db")
    client = anthropic.Anthropic()

    # Step 3: Crawl the URL
    site_data = crawl_url(url)

    # Step 4: Run the loop
    # AgentLoop catches CircuitBreakerError internally and returns
    # LoopResult(success=False, breach_reason=...). Branch on the
    # result — do NOT wrap loop.run() in try/except CircuitBreakerError.
    loop = AgentLoop(spec, breaker, ledger, client)
    result = loop.run(
        f"Audit this page for SEO issues:\n\n{site_data}"
    )

    # Step 5: Print the ledger
    print(f"\nResult: {'SUCCESS' if result.success else 'BREACH'}")
    if not result.success:
        print(f"Breach reason: {result.breach_reason}")
    print(f"Turns: {result.turns} | Tokens: {result.total_tokens}")
    print("\nAudit trail:")
    for row in ledger.get_session(result.session_id):
        status = "PASS" if row.pass_fail else "FAIL"
        print(f"  Turn {row.turn_count}: {status} | "
              f"{row.token_delta} tokens | {row.execution_time_ms}ms")

if __name__ == "__main__":
    import sys
    run_seo_audit(sys.argv[1] if len(sys.argv) > 1 else "https://example.com")

Run it:

python examples/seo_audit_example.py https://yourdomain.com

The spec writer prompts you. The loop runs, the circuit breaker fires if the limits are exceeded, and the ledger records every turn. The output lands in front of you and you decide what to fix.

The loop runs to you, not into a void.

Pluggable LLM Client

The loop works with any client that satisfies the LLMClient protocol (Anthropic by default). Bring your own via a ~20-line adapter.

# agent_loop.py
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int,
               system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint

messages is an instance attribute (not a nested class) because that's how the real Anthropic SDK exposes it — anthropic.Anthropic().messages.create(...). Modeling it as a nested class would mean the real client wouldn't satisfy the Protocol. The @runtime_checkable decorator lets you sanity-check conformance with isinstance(client, LLMClient), and the repo's test suite uses exactly that assertion against the FakeClient test double.

Here's an illustrative OpenAI adapter. A production adapter would also map streaming, tool-use, and error shapes.

# openai_adapter.py — illustrative pseudocode, not production-ready.
from openai import OpenAI as _OpenAI


class _MessagesAdapter:
    def __init__(self, client):
        self._client = client

    def create(self, *, model, max_tokens, system, messages):
        completion = self._client.chat.completions.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "system", "content": system}] + messages,
        )
        # Reshape OpenAI's response into the Anthropic-shaped surface
        # AgentLoop reads: response.usage.{input,output}_tokens,
        # response.content[0].text, response.stop_reason.
        return _adapt_response(completion)


class OpenAIAdapter:
    def __init__(self, api_key: str):
        self._client = _OpenAI(api_key=api_key)
        self.messages = _MessagesAdapter(self._client)  # instance attr, not a nested class

The adapter pattern is worth teaching explicitly. Provider APIs don't share a shape. Anthropic puts system at the top level. OpenAI puts it inside the messages array. An adapter shim is ~20 lines and makes the loop provider-agnostic without rewriting anything. Note that self.messages is assigned in __init__ so it's a real attribute on each adapter instance, the same shape as the actual SDK.
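The adapter above calls _adapt_response without defining it. A minimal sketch of that helper follows, built on the standard chat-completion fields (choices[0].message.content, choices[0].finish_reason, usage.prompt_tokens, usage.completion_tokens). The stop-reason mapping table is an assumption made for this illustration, and a stand-in object replaces a real API response so the block runs without a network call.

```python
from types import SimpleNamespace

# Hypothetical mapping from OpenAI finish reasons to Anthropic-style stop reasons.
_STOP_REASONS = {"stop": "end_turn", "length": "max_tokens"}


def _adapt_response(completion):
    """Reshape a chat completion into the surface AgentLoop reads."""
    choice = completion.choices[0]
    return SimpleNamespace(
        content=[SimpleNamespace(text=choice.message.content or "")],
        # Unmapped finish reasons pass through unchanged.
        stop_reason=_STOP_REASONS.get(choice.finish_reason, choice.finish_reason),
        usage=SimpleNamespace(
            input_tokens=completion.usage.prompt_tokens,
            output_tokens=completion.usage.completion_tokens,
        ),
    )


# Stand-in object with the same attribute layout as a chat completion.
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content="No issues found."),
            finish_reason="stop",
        )
    ],
    usage=SimpleNamespace(prompt_tokens=120, completion_tokens=8),
)

adapted = _adapt_response(fake_completion)
print(adapted.stop_reason, adapted.usage.input_tokens, adapted.content[0].text)
```

Mapping "stop" to "end_turn" matters because the loop only returns when it sees "end_turn". Without the mapping, every OpenAI response would look unfinished and the run would continue until the circuit breaker trips.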

Running the Tests

python -m pytest tests/

With coverage:

python -m coverage run --source=circuit_breaker,ledger,spec_writer,agent_loop,review_surface -m pytest tests/
python -m coverage report -m

80 tests, 100% coverage on all five core modules. The loop is exercised against a FakeClient test double defined inline in tests/test_agent_loop.py. It satisfies the LLMClient protocol via duck typing: messages is set to self, so client.messages.create(...) routes back to the same object, which returns a scripted response for each test scenario. Clone the repo and run pytest to see all 80 tests pass without touching the network or needing an API key.
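The messages = self trick can be sketched as follows. This is a hypothetical reconstruction for illustration, not the repo's actual FakeClient; the scripted_response helper and the token counts are made up.

```python
from types import SimpleNamespace
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int,
               system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint


class FakeClient:
    """Hypothetical test double: replays scripted responses, no network."""

    def __init__(self, scripted):
        self._scripted = list(scripted)
        self.messages = self  # client.messages.create(...) routes back here

    def create(self, *, model, max_tokens, system, messages):
        return self._scripted.pop(0)


def scripted_response(text, stop_reason="end_turn"):
    """Build an object with the attributes the loop reads from a response."""
    return SimpleNamespace(
        content=[SimpleNamespace(text=text)],
        stop_reason=stop_reason,
        usage=SimpleNamespace(input_tokens=10, output_tokens=5),
    )


client = FakeClient([scripted_response("done")])
print(isinstance(client, LLMClient))

reply = client.messages.create(model="fake", max_tokens=16, system="", messages=[])
print(reply.content[0].text)
```

The runtime isinstance check only confirms that a messages attribute exists. It does not validate the signature of create, so the scripted tests still carry the real burden of proof.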

circuit_breaker.py has 100% coverage — no untested paths. It's the financial safety component. Every path through it is exercised.

What You've Built

In this tutorial, you've built five small primitives, each independently usable.

| Module | Role | Lines |
| --- | --- | --- |
| spec_writer.py | Forces three answers before the loop runs | 104 |
| circuit_breaker.py | Hard ceilings on turns and tokens | 41 |
| ledger.py | Append-only SQLite audit trail | 113 |
| agent_loop.py | The loop that respects both | 128 |
| review_surface.py | Assembles the five-element frame, records human attestation | 114 |

The pattern: upstream discipline defines the boundaries. Downstream enforcement breaks the circuit. Neither trusts the model to police itself.

A loop that runs without an exit condition isn't autonomous. It's a billing event waiting to happen.

Define what done looks like before you start. That's the job, and always has been.

Next Steps

The repo is at github.com/dannwaneri/production-safe-agent-loop.

There are three natural extensions if you want to go further:

1. Graduation to Distributed Systems

The SQLite ledger works for isolated sequential loops. The moment you run multiple agents against shared state, you need serializable isolation — concurrent writes to flat JSON corrupt silently. The README documents the three tipping points where a flat ledger needs to graduate.

2. Cryptographic Signing

For compliance-scale systems where the auditor wasn't present when the loop ran, SQLite rows aren't enough. A database admin can run an UPDATE query. Ed25519 signing wraps each ledger row in a receipt that proves the log wasn't altered after execution. But that's a different tutorial.

3. Wiring a Cron Job

The honest architecture for the SEO audit agent isn't 24/7 autonomous operation. It's a cron job that runs on schedule, surfaces what's broken, and stops. The crontab entry 0 3 * * 2 python examples/seo_audit_example.py https://yourdomain.com, which fires at 03:00 every Tuesday, is the whole thing. The loop runs to you, not into a void.

If you need this architecture built for your own stack (circuit breakers, audit trails, production-safe agent loops), I do freelance work. dannwaneri.com/ai-agents/


