惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Google DeepMind News
Google DeepMind News
F
Fortinet All Blogs
阮一峰的网络日志
阮一峰的网络日志
Apple Machine Learning Research
Apple Machine Learning Research
爱范儿
爱范儿
WordPress大学
WordPress大学
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
J
Java Code Geeks
罗磊的独立博客
S
SegmentFault 最新的问题
V
V2EX
V
Visual Studio Blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
美团技术团队
博客园 - 三生石上(FineUI控件)
Stack Overflow Blog
Stack Overflow Blog
Y
Y Combinator Blog
MyScale Blog
MyScale Blog
D
Docker
Google DeepMind News
Google DeepMind News
Blog — PlanetScale
Blog — PlanetScale
M
Microsoft Research Blog - Microsoft Research
Martin Fowler
Martin Fowler
S
Secure Thoughts
B
Blog
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
www.infosecurity-magazine.com
www.infosecurity-magazine.com
Recent Announcements
Recent Announcements
MongoDB | Blog
MongoDB | Blog
C
Cisco Blogs
C
CERT Recently Published Vulnerability Notes
T
True Tiger Recordings
GbyAI
GbyAI
P
Proofpoint News Feed
P
Privacy International News Feed
Jina AI
Jina AI
The Cloudflare Blog
I
Intezer
AWS News Blog
AWS News Blog
Hacker News - Newest:
Hacker News - Newest: "LLM"
S
Security Archives - TechRepublic
NISL@THU
NISL@THU
The Register - Security
The Register - Security
Recent Commits to openclaw:main
Recent Commits to openclaw:main
P
Palo Alto Networks Blog
S
Schneier on Security
L
LINUX DO - 热门话题
C
CXSECURITY Database RSS Feed - CXSecurity.com
Security Latest
Security Latest
C
Cybersecurity and Infrastructure Security Agency CISA

DEV Community

Sixteen TUI components, copy-paste, no dependency The Boring Reliability Layer Every Autonomous Agent Needs Nven - Secret manager Building Multi-Tenant Row-Level Security in PostgreSQL: A Production Pattern Building Vylo — Looking for Collaborators, Partners & Early Support I Thought Memory Fades With Time. It Actually Fades With Information. ORA-00064 오류 원인과 해결 방법 완벽 가이드 I registered an AI agent at 1 AM and something cracked open in my head Pitch: Nven - Sync secrets. Ship faster. Why y=mx+b is the heart of AI From Routines to a Crew — Building a System That Plans Its Own Work & executes it 25 React Interview Questions 2026 (With Answers) — Hooks, React 19, Concurrent Mode An open source LLM eval tool with two independent quality signals Using Dashboard Filtering to Get Customer Usage in Seconds from TBs of Data Skills, Java 17, And Theme Accents 4 Hard Lessons on Optimizing AI Coding Agents Arctype: Cross-Platform Database GUI for LLM Artifacts Your robots.txt says GPTBot is welcome. Your server says 403. Organizing How to Use AWS Glue Workflow 5 n8n Automations Every Digital Agency Should Be Running (Bill More, Work Less) Getting Started with TorchGeo — Remote Sensing with PyTorch Designing a Scalable Cross-Platform Appium Framework Google Antigravity 2.0 & Slash Commands Building a Unified Adaptive Learning Intelligence with Gemma 4, Flutter, and Multi-Model Orchestration Looking for beta testers for a £60 server management application The Disk-Pressure Incident That Taught Me to Always Set LimitRanges and Other Lessons from Mirroring EKS Locally. Why AI Should Not Write SQL Against ERP Databases Vibe coding works until it doesn't. The debt is real. Shipping at the Edge: Migrating a Coffee Subscription Platform to Cloudflare Workers Stop Tab-Switching: A Developer's Guide to Color Tools That Actually Fit the Workflow DevOps vs MLOps vs AIOps: What Changes, What Stays, and a Simple Roadmap to Get Started Run Powerful AI Coding Locally on a Normal Laptop 5 n8n Automations Every WooCommerce Store Needs (Save 10+ Hours/Week) What I Learned Building My Own AI Harness Hytale Servers Will Fail Treasure Hunts Until We Fix Our Event Handling Redux in React: Managing Global State Like a Pro Unfreezing Your GitHub Actions: Troubleshooting Stuck Deployments and Protecting Your Git Repo Statistics Unlocking Project Discoverability on GHES: A Key to Software Engineering Productivity When the Cleanup Code Becomes the Project Rockpack 8.0 - A React Scaffolder Built for the Age of AI-Assisted Development Mismanaging the Treasure Hunt Engine in Hytale Servers Will Get You Killed Why Hardcoded Automations Fail AI Agents Stop Calling It an AI Assistant. It’s Already Managing Your Company Why I built a post-quantum signing API (and why JWT is on borrowed time) Weekend Thought: Frontend Build Tools Suffer From Work Amnesia A 10-Line Playwright Trick That Saved Me Hours on Every Sephora Run AI Is Changing Engineering Culture More Than We Realize Everyone Was Focused on Gemini, But Infinite Scaler Was the Real Twister "Gemma 4 Analyzed My Bank Statements – Apparently I 'Have a Problem' with Coffee and Late-Night Apps" #css #webdev #beginners #codenewbie The Hidden Layer Every AI Developer Must Learn AlphaEvolve: Google DeepMind's Gemini-Powered Evolutionary Coding Agent RDS Reserved Instance Pricing: Every Engine, Every Rule, Real Dollar Savings How To Build An AI-Powered MVP Without Burning Your Startup Budget In 2026 Reading a Psychrometric Chart Without Getting Lost LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025) How to turn text into colors (without AI) Building Real-Time Apps in Node.js with Rivalis: WebSockets, Rooms, Actors, and a Binary Wire This Week In React #282 : Security, Fate, TanStack, Redux, Jotai | Hermes-node, Expo, Rozenite, Harness | TC39, Bun, pnpm, npm, Yarn, Node AI Copilot vs AI Agent Architecture - What's Actually Different (And Why It Matters) Smart Contract Security: NEAR's Futures Surge and AI Token Risks Database Maintenance: Tracing Production Incidents to Their Root Cause Stop juggling AI SDKs in PHP — meet Prisma Google Quietly Changed What “Apps” Mean at I/O 2026 The Infrastructure Team Is the Real Single Point of Failure Building SQLite from Scratch: 740 Lines of C++23 to Understand Every Byte of a .db File The 4 Levels of Hermes Agent Scaling Framework: From One Hermes Agent to a Fully Automated Team Your AI Has a Memory. It Just Doesn’t Know What to Remember. Claprec: Engineering Tradeoffs - Limited time vs. Perfection (6/6) Building a Daily Google News API Monitor in Python Building RookDuel Avikal: From Chess Steganography to Post-Quantum Archival Security Google I/O e IA: o que realmente muda na vida do dev? Color Contrast Failures: The Number One Accessibility Issue and How to Fix It # I Watched 15 Hours of Hermes Agent Videos So You Don't Have To Cómo solucionar el bucle infinito en useEffect con objetos y arrays en React The First Agent-Centric Cloud Security Platform — And Why We Didn't Build It That Way On Purpose Most Treasure Hunts Engines on Hytale Servers Are Built to Fail - Lessons from a Burned Database GhostScan v3.0 — From Closed-Source EXE to Open-Source Pentest Framework De hojas de cálculo a IA: construyendo una plataforma SRM moderna When is AI fine in education? Python Tools for Managing API Rate Limits in Data Pipelines How to Implement Exponential Backoff for Rate-Limited APIs in Python "My Web Chat Wasn't a Real Channel. That Broke My Agent Pipeline" next-advanced-sitemap v1.0.7 — safer URL ingestion & automatic trimming for Next.js sitemap generation I keep seeing people build an AI lead processing agent when they really need a 6-step rules engine AI Powered Student Learning Assistant Using Gemma 4 How I Built a Drop-In Proxy to Slash My OpenAI Bills by 20%+ Automatically Building a Sarcastic AI English Tutor with Persona-as-Code and Gemini Audio Input for Pronunciation Correction Five Years Later, I Finally Have 96GB VRAM — What It Actually Unlocks for Agent Loops Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution AWS Savings Plan Buying Strategy: How to Layer, Size, and Time Commitments application.properties I built a macro tracker powered by AI + attitude Solace: A Global Mental Health First Responder Built with Gemma 4 Why Blocking Prompt Injection Is Wrong — and What to Do Instead
Long-running agents need more than memory
Theo Valmis · 2026-05-18 · via DEV Community

Anthropic's managed-agent harness solves one hard problem: continuity. Progress logs, feature lists, git checkpoints, and startup scripts give each new session a map of what happened. But continuity is not governance. As agents work across more sessions, the question changes from "did the agent remember?" to "did the agent stay within its architectural constraints?"

In May 2026, Anthropic published a detailed look at how their internal engineering teams use Claude Code as a long-running managed agent. The infrastructure pattern they describe is worth reading carefully: initializer agents that prepare the workspace, feature lists that define remaining work, progress files that record what happened, git commits that preserve recoverable state, startup checks that orient each new session, and end-to-end tests that stop agents from declaring victory prematurely.

This is not prompt engineering. It is operational infrastructure for agents working across many sessions on the same codebase. The problems it solves are real, and the solutions are well-reasoned.

But the pattern solves continuity. It does not solve governance. Those are two different problems, and conflating them is the most expensive mistake a team can make when designing long-running agent workflows.

Agents as shift workers

The framing that makes the managed-agent pattern click is the relay team metaphor. A long-running agent workflow looks less like one developer with a prompt and more like a team of engineers handing work across shifts.

Each shift worker arrives, reads the handoff notes, picks up where the last person stopped, makes progress, and leaves a record for the next person. The work continues across interruptions. The codebase evolves across sessions. No single session owns the full context.

That framing makes the continuity infrastructure obvious. You need handoff notes that are authoritative (progress files), a work queue that persists across shifts (feature lists), recoverable state at every checkpoint (git commits), orientation scripts so each shift starts correctly (startup checks), and pass/fail criteria that the work must satisfy (E2E tests).

Anthropic's harness provides all five. What it does not provide is the architectural contract that defines what kind of work each shift is allowed to do.

In a real engineering team, that contract exists in ADRs, architecture review boards, code review standards, and the accumulated institutional knowledge of senior engineers. In a long-running agent loop, none of that is automatically present. The harness tells the agent what happened. It does not tell the agent what must remain true.

What the harness gets right

Before addressing the gap, it is worth being precise about what the harness actually solves:

  • Initializer agent — prepares the workspace before the main agent session begins.
  • Feature list — a durable queue of remaining work, written as discrete, completable items.
  • Progress file — a running record of what each session changed, decided, and left incomplete.
  • Git commits as checkpoints — every meaningful unit of work lands as a recoverable commit.
  • E2E tests as the victory condition — agents cannot declare a feature complete until the tests pass.

The pattern is good engineering. Each piece of infrastructure corresponds to a real failure mode that long-running agents encounter in practice.

The remaining gap: continuity is not governance

A progress file can tell the next agent: "Here is what I changed."

It cannot reliably tell the agent: "This architecture boundary must not be crossed. This dependency is forbidden. This ADR supersedes that older decision. This pattern is allowed only in this scope."

That distinction matters in practice because the questions a progress file answers and the questions a governance layer answers are different in kind, not just degree:

Layer Question it answers
Progress log What happened?
Feature list What remains?
Git history What changed?
Test harness Does it work?
Governance layer Is this allowed?

The first four layers are all answered by the managed-agent harness. The fifth is not. A test suite can verify that the output is functionally correct. It cannot verify that the output is architecturally compliant. Those are different properties, and a codebase can be full of passing tests while being full of architectural violations.

Agent harnesses preserve continuity. Governance preserves intent.

Why this gets harder as agents run longer

Over many sessions, a long-running agent loop may:

  • Infer outdated patterns from old code. If earlier sessions used a deprecated pattern, the new session infers that pattern is correct and continues it.
  • Reintroduce forbidden dependencies. A dependency was removed for a documented architectural reason. A later session adds it back because it solves the immediate problem and the prohibition is not in any artifact the agent reads.
  • Bypass undocumented conventions. Architecture that exists in institutional memory but not in enforceable documents is invisible to the agent.
  • Optimize locally while violating system-level constraints. Each session makes a locally reasonable change. The cumulative effect crosses an architectural boundary that no single session was responsible for maintaining.

None of these failures show up in a progress file. None of them cause a test suite to fail. They accumulate silently across sessions and become visible only when the codebase is far enough from its architectural intent that the cost of correction is high.

The role of governance

Governance sits beside the harness. It does not replace progress logs, tests, or git. It gives the agent a deterministic way to check architectural compatibility at each session boundary and at each commit boundary.

The managed-agent startup sequence, extended with governance:

pwd
git log --oneline -20
cat claude-progress.txt
cat feature_list.json
mneme check --mode warn

Enter fullscreen mode Exit fullscreen mode

Before commit or PR:

mneme check --mode strict

Enter fullscreen mode Exit fullscreen mode

In CI, on every push:

mneme check --mode strict --ci

Enter fullscreen mode Exit fullscreen mode

The framing is important: the harness tells the agent where it is. Governance tells it what boundaries it must respect. Both are necessary. Neither substitutes for the other.

ADRs as durable intent, not documentation

The governance layer requires a source of architectural authority. In well-run engineering teams, that source is the ADR corpus: Architecture Decision Records that capture not just what was decided, but why, what alternatives were rejected, and what constraints the decision implies.

For most teams, ADRs sit in /docs/adr and are read only when someone thinks to look. They are documentation, not enforcement. A long-running agent will not read them at session start. A commit hook will not check against them.

A governance layer changes this. Rather than reading the ADR folder as a documentation corpus, it compiles the ADR corpus into a decision graph with declared properties:

  • Which decisions are active, superseded, or deprecated?
  • Which decision applies to which file, service, or scope?
  • Which decision is newer and overrides an older one?
  • Which dependencies or patterns does each decision forbid or require?
  • When two decisions conflict on the same scope, which one wins?

A long-running agent operating under that system can answer: which decision applies to this change, and am I compliant with it? That is a different question from what did the progress file say? and it requires a different infrastructure to answer.

Where governance checkpoints belong

Governance is not a single check at a single moment. The right enforcement points correspond to the moments of highest leverage:

  1. Session start (warn mode) — before any code is written, load constraints and surface existing violations without blocking work.
  2. Pre-tool execution — block actions that are obviously forbidden before they happen.
  3. Pre-commit (strict mode) — the primary enforcement gate, catching architectural drift before it becomes branch history.
  4. Pre-PR — produces an explainable report of which rules applied, which passed, which failed, and why.
  5. CI — the backstop that enforces team-level architectural contracts on every push.

The harness ensures the agent knows where it is. Governance ensures the agent knows where it must not go. Both are infrastructure. Neither is a nice-to-have for long-running loops.

Conclusion: memory is not enough

Anthropic's managed-agent harness is well-designed infrastructure for a real problem. Teams building on Claude Code or similar agent systems should study and adopt this pattern.

But a progress file is descriptive, not prescriptive. It records what happened. It does not enforce what must remain true. And as agent loops grow longer, the gap between those two things grows wider.

The next phase of agent infrastructure needs a governance layer — one that resolves competing ADRs deterministically, produces explainable audit traces, and enforces architectural contracts at the boundaries where agents make consequential changes.

Long-running agents need memory to continue work. They need governance to continue work safely. The next generation of agent infrastructure will not just preserve context. It will preserve intent.

That is the layer Mneme is built for.


Originally published at https://mnemehq.com/insights/long-running-agents-need-governance/