Three Failures My AI Memory System Caught — And the Flaw It Revealed in Itself
Self-Correct · 2026-05-26 · via DEV Community

This is not proof. It is early, messy evidence from my own workflow: three failures, one small comparison, and one schema bug I missed.

I'd spent a week arguing, in public, that AI memory should be built on discipline before infrastructure: preserve corrections, preserve unresolved questions, decide which record wins. The framework has three layers: summary memory for continuity, correction memory for repeated mistakes, and unresolved memory for questions that should not be settled yet.

Good theory.

Then three days of unplanned failure tested every claim without asking my permission — and then one comment forced me to test it more carefully.

Here's what held, what the numbers actually said, and the flaw I didn't see coming.

Part 1 — The accidental stress test

These aren't dramatic stories. They're the boring failure modes every long-running agent setup eventually hits. That's the point.

The session died. Twice. Mid-build, the machine went down, twice in two days. Everything held only in live context vanished instantly: the day's decisions, the current state, the thread. Yet nothing was actually lost, because none of it lived only in the session. It was on disk, mirrored. The failure mode is universal: crashes, timeouts, and context limits aren't edge cases, they're guarantees. The lesson is the one the whole system rests on: memory persists only to the degree you write it down. The wire broke. The record held.
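A minimal sketch of what "on disk, mirrored" can look like, assuming JSON records and two hypothetical file locations. The function names `write_record` and `read_record` are illustrative and not taken from the setup described here.

```python
import json
import os
import tempfile
from pathlib import Path


def write_record(record: dict, primary: Path, mirror: Path) -> None:
    """Persist one memory record to a primary file and a mirror copy."""
    payload = json.dumps(record, indent=2, sort_keys=True)
    for target in (primary, mirror):
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename it over the
        # target, so a crash mid-write never leaves a half-written record.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)


def read_record(primary: Path, mirror: Path) -> dict:
    """Load the record, falling back to the mirror if the primary is unusable."""
    for source in (primary, mirror):
        try:
            return json.loads(source.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue
    raise FileNotFoundError("no readable memory record in primary or mirror")
```

The write-then-rename step is the design choice that matters: a session that dies halfway through a write leaves the previous record intact instead of a truncated one.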

An agent came back confidently wrong. After the crash, one of my agents restarted, re-anchored to a state about two days stale, and reported it as current — with complete confidence. It wasn't lying. It was sure — about a world that no longer existed. This is the quiet killer, worse than forgetting: an agent that recovers into a stale state and narrates it as settled fact. It got caught only because the real state was written down to compare against. The drift was visible because the truth was on disk.
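The comparison that exposed the drift can be sketched as a timestamp check against the durable record. Everything here is an assumption made for illustration: the `Snapshot` shape, the `updated_at` field, and the five-minute tolerance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    """A view of project state plus when it was last written (hypothetical shape)."""
    summary: str
    updated_at: datetime


def check_recovery(claimed: Snapshot, on_disk: Snapshot,
                   tolerance: timedelta = timedelta(minutes=5)) -> str:
    """Classify an agent's post-restart claim against the durable record."""
    lag = on_disk.updated_at - claimed.updated_at
    if lag > tolerance:
        return f"stale: agent is {lag} behind the record; reload before acting"
    if claimed.summary != on_disk.summary:
        return "conflict: content differs from the record; review before acting"
    return "current"
```

The agent's confidence never enters the check. Only the record's timestamp and content do, which is why a confidently wrong recovery is still detectable.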

The wrong version almost shipped. Two "final" versions of the same document existed, and the weaker one was about to go out. It got caught because the evidence didn't match the claim — the backup file was larger and contained sections the "final" was missing. Two files both said final. Only one could prove it. The lesson: memory is not what the agent claims to know; it's what the record can still prove. You need a rule for which record wins before the conflict, not after.
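A rule for which record wins can be written down before any conflict happens. The sketch below assumes one possible rule (completeness first, then length) and a hypothetical list of required section titles. It is not the rule actually used here, only an instance of deciding by evidence instead of by file name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    text: str


def missing_sections(doc: Candidate, required: list[str]) -> list[str]:
    """Return the required section titles that do not appear in the document."""
    return [title for title in required if title not in doc.text]


def pick_winner(candidates: list[Candidate], required: list[str]) -> Candidate:
    """Apply a fixed precedence rule: completeness first, then length."""
    complete = [c for c in candidates if not missing_sections(c, required)]
    if not complete:
        raise ValueError("no candidate contains every required section")
    # Among complete candidates, prefer the longest; the file name is ignored.
    return max(complete, key=lambda c: len(c.text))
```

A file called "final" gets no credit for its name. It wins only if it contains what the record says a final version must contain.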

Three failures, three layers of the system catching them. Real — but I'll be honest about what kind of proof that is. It's builder-lived. It shows the system helped me recover. It doesn't measure anything.

The turn

A reader left a comment with the one question that actually changes work like this:

Do you have a baseline to compare it to?

That's the right question, and I didn't have a real answer. Field stories show survival; they don't show that the discipline beats the obvious alternative — just summarizing everything. So I built the test.

Part 2 — The deliberate test

I set up a small A/B comparison:

  • System A — summary-only memory: clean project summaries, recent decisions, preferences, current direction. No correction history, no uncertainty, no source-of-truth rules.
  • System B — layered memory: summary plus correction memory, unresolved questions, source-of-truth rules, verification triggers, and epistemic status.

The metric was deliberately narrow: false-certainty errors after a context reset. A false-certainty error is when the agent treats something as settled even though the record says it was stale, contested, unresolved, unverified, or dependent on live checking.

This was not testing whether "more text is better." It was testing whether memory that preserves epistemic status — stale, unresolved, verified, contested, priority, status — reduces false certainty after a reset.
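The difference between the two memory packets, and the definition of a false-certainty error, can be sketched as follows. The status names mirror the ones listed above; the class names, the packet text format, and the `is_false_certainty` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Epistemic(Enum):
    VERIFIED = "verified"
    STALE = "stale"
    CONTESTED = "contested"
    UNRESOLVED = "unresolved"
    UNVERIFIED = "unverified"
    NEEDS_LIVE_CHECK = "needs_live_check"


@dataclass(frozen=True)
class MemoryItem:
    claim: str
    status: Epistemic


def summary_packet(items: list[MemoryItem]) -> str:
    """System A style: claims only, with epistemic status dropped."""
    return "\n".join(f"- {item.claim}" for item in items)


def layered_packet(items: list[MemoryItem]) -> str:
    """System B style: every claim carries its epistemic status."""
    return "\n".join(f"- [{item.status.value}] {item.claim}" for item in items)


def is_false_certainty(item: MemoryItem, treated_as_settled: bool) -> bool:
    """True when the agent asserted a claim the record does not mark verified."""
    return treated_as_settled and item.status is not Epistemic.VERIFIED
```

Both packets hold the same claims. Only the layered one tells the model which of them it is allowed to treat as settled.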

Method snapshot:

  • 6 reset/recovery scenarios
  • 1 local model: llama3.2:latest
  • same task prompt for both systems
  • different memory packets only
  • manually scored against predefined expected behavior
  • scored on task success, epistemic handling, and false-certainty errors
  • not benchmark-grade and not externally blind-scored yet

The first version of the test gave me a tempting number. Then I audited the method and found two problems: a couple of summary baselines were too easy to fail, and the blind packet leaked too much structural information. So I corrected the baselines, regenerated the packet, reran the local model, and split the score into three parts: task success, epistemic handling, and false certainty.

The corrected first-pass result looked like this:

  • Task success: summary-only 1/6, layered memory 6/6
  • Epistemic handling: summary-only 4/12, layered memory 12/12
  • False-certainty errors: summary-only 2, layered memory 0
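A sketch of how per-scenario scores could roll up into totals of this shape. It assumes epistemic handling is scored 0 to 2 points per scenario, which is inferred from the 12-point maximum across 6 scenarios and is not stated in the method. The `ScenarioScore` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioScore:
    scenario: str
    task_success: bool           # did the agent do the expected thing?
    epistemic_points: int        # assumed rubric: 0, 1, or 2
    false_certainty_errors: int  # count of claims wrongly treated as settled


def tally(scores: list[ScenarioScore]) -> dict[str, str]:
    """Aggregate per-scenario scores into the three reported totals."""
    for score in scores:
        if not 0 <= score.epistemic_points <= 2:
            raise ValueError(f"epistemic_points out of range: {score.scenario}")
    count = len(scores)
    return {
        "task_success": f"{sum(s.task_success for s in scores)}/{count}",
        "epistemic_handling": f"{sum(s.epistemic_points for s in scores)}/{2 * count}",
        "false_certainty_errors": str(sum(s.false_certainty_errors for s in scores)),
    }
```

Keeping the three totals separate means a system cannot hide false certainty behind a good task-success number.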

Those numbers are clean enough to be suspicious. They are clean because the scenarios came from known failures in my own workflow — the same environment the framework was built to handle. A fairer test would use scenarios written by someone else, multiple models, multiple runs, and external blind scoring.

Now the honesty, because this is exactly where pieces like this usually cheat: this is a first, small, local-model A/B test — an early signal, not a benchmark. Six scenarios. One local model. One run. Internal scoring. The scenarios came from my own workflow. I am not going to call this proven. The right framing is: the early signal supports the direction, and now there is a method to test it more honestly.

One compact example of the scoring shape:

| Scenario | Expected behavior | Summary-only behavior | Layered behavior |
| --- | --- | --- | --- |
| Wrong "final" version | Use the current send file, not an older file that only looked canonical | Invented a generic packaging process and never identified the current file | Identified the current send version and preserved the copyedit-only boundary |
| Agent health after reset | Separate process health from local-model availability | Said no health information was available | Listed the verification checks instead of treating process state as full health |
| Ready vs next action | Separate status from priority | Collapsed the next move into a stale/simple answer | Failed in the first version, which forced the schema correction |

Part 3 — The part I didn't expect

The test didn't only support the framework. It corrected it.

The first version of the layered system still failed one scenario. It knew one article was marked ready. It also knew a different article was supposed to be the next thing written. And it chose wrong — because it overweighted the word "ready" and treated readiness as if it were priority.

That exposed a real flaw in my own schema, not the model's: readiness is status, next action is priority, and a memory system has to keep them in separate fields. "Ready" answers "is it done?" "Next" answers "what do I do now?" Collapse them and even a disciplined memory will confidently do the wrong thing. So the schema now separates status, priority, confidence, epistemic_status, and verification_required as distinct fields.
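A minimal sketch of a record with those five fields kept separate. The field names come from the schema above. The types, the example values in the comments, the `WorkItem` name, and the `next_action` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    title: str
    status: str                  # e.g. "draft" or "ready"; answers "is it done?"
    priority: int                # 1 = do first; answers "what do I do now?"
    confidence: float            # 0.0 to 1.0
    epistemic_status: str        # e.g. "verified", "stale", "unresolved"
    verification_required: bool


def next_action(items: list[WorkItem]) -> WorkItem:
    """Choose what to work on from priority alone; status is never consulted."""
    if not items:
        raise ValueError("no work items in memory")
    return min(items, key=lambda item: item.priority)
```

With the fields split this way, an item marked "ready" cannot jump the queue just because of that word. The selection code has no way to read status as priority.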

That correction matters more to me than the score. A memory framework whose entire thesis is "preserve where you were wrong" caught itself being wrong in a test and got more precise. A system that can only confirm itself proves nothing. One that can correct itself under pressure is at least worth continuing to test.

The rule underneath all of it

That comment never became an argument. It became a test scenario, then a schema fix, then this article. That's a standing rule now: serious criticism becomes a memory input type — a correction, an unresolved question, a test, or a future piece. Smart criticism becomes durable memory when the system knows where to store it.

The close

I don't trust this memory system because I designed it. I trust it more because failure hit it three times and the record still let me recover, compare, and correct — and because, when I finally measured it, the test showed me where the framework still needed work.

That is the bar I care about. Not how much a memory system remembers on a good day. Whether it can still prove what's real on a bad one — and still tell you where it might be wrong.

The next version of the test should not come from me alone. It should use more scenarios, multiple models, repeated runs, and at least one external blind scorer. If the framework only works on my own failures, that is useful to know. If it still holds on someone else's scenarios, then the evidence gets more interesting.


This is part of a short series on treating AI memory as judgment infrastructure: the zero-budget foundation, why correction memory compounds, and why systems should preserve what's unresolved instead of forcing premature closure.