I tested 4 AI agent-governance tools against an open spec - here's the matrix
Sunil Prakas · 2026-05-23 · via DEV Community

The scenario

Your AI agent just deleted a customer record. Three months later, an auditor asks you to prove:

  • What tool actually ran (not "the agent made a deletion call" — the precise tool, version, and capability)
  • With what arguments (the exact customer ID, scoped fields, options — byte-for-byte)
  • Who approved it (which human, or which automated policy rule)
  • Against which version of which policy (the literal policy bundle the runtime evaluated, not "the policy at the time, probably")
  • Whether it actually succeeded (not "we said allow", but "the downstream system confirmed the row is gone")

You open your audit log.

It says: delete_customer approved, run_id=xyz, decision=allow. The arguments are in a different table. The policy version isn't recorded anywhere — you'd have to git log your settings file. The execution outcome lives in your application logs, which roll over after 30 days. And the auditor has no way to verify any of this without an engineer walking them through every join.

This gap shows up the moment an agent does something consequential and a non-engineer needs to understand what happened. It's the same gap regardless of which framework you used. Approval is not proof.

What's actually missing

The pattern across every agent-governance tool I looked at is the same: they're built around the decision (allow / deny / require-approval) and treat the action itself as an implementation detail. So the audit log records "the policy fired" but not a single record carrying everything a third party needs to reconstruct what actually happened.

A useful audit artifact has to survive the following:

  1. It can be verified without trusting the runtime that produced it. If your auditor has to call your engineers to interpret the log, the log is testimony, not evidence.
  2. The arguments and the decision are cryptographically bound. If args mutate between approval and execution, the audit must show it.
  3. The policy version is in the record. Not "the policy at the time" — the literal bundle identifier.
  4. The execution outcome is in the record. Approval ≠ execution. Both belong in the same artifact.
  5. The chain of receipts is tamper-evident. Deleting a row from history must break something a verifier can detect.

A receipt that does all five becomes a single evidence record you can hand to an auditor, regulator, insurer, or a compliance team six months later — without them needing access to your database, your cloud creds, or your engineering team.
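Property 2 is the easiest of the five to make concrete. The following is a minimal sketch, not the spec's algorithm: it assumes sorted-key, compact JSON as a stand-in for whatever canonical form the spec defines, and the `delete_customer` argument names are hypothetical.

```python
import hashlib
import json


def canonical_bytes(value) -> bytes:
    # Assumption: sorted keys + compact separators stand in for the spec's canonical form.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def arguments_hash(arguments: dict) -> str:
    # SHA-256 over the canonical bytes, hex-encoded (64 characters).
    return hashlib.sha256(canonical_bytes(arguments)).hexdigest()


# Hypothetical arguments for the delete_customer scenario, hashed at approval time.
approved_args = {"customer_id": "cus_123", "fields": ["email", "address"], "hard_delete": False}
approved_hash = arguments_hash(approved_args)

# Key order does not matter, so a re-serialised copy of the same arguments still matches.
reordered = {"hard_delete": False, "fields": ["email", "address"], "customer_id": "cus_123"}
same_after_reorder = arguments_hash(reordered) == approved_hash

# If one option flips between approval and execution, the hashes no longer agree.
executed_args = dict(approved_args, hard_delete=True)
mutation_detected = arguments_hash(executed_args) != approved_hash

print(same_after_reorder, mutation_detected)  # True True
```

The point of hashing a canonical form rather than the raw request body is that harmless differences (key order, whitespace) do not trigger false alarms, while any change in value does.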

What I built

AgentBoundary is an open spec for that kind of receipt. v0.1 is stable; v0.2-alpha (draft) adds the optional provenance block and singly-linked chain shown in the example below. Same JSON document, deterministic schema, hash-bound to its arguments.

Here's one emitted on 2026-05-21 by a Discord agent I run in production, which files GitHub issues on behalf of users:

{
  "version":      "agentboundary/v0.2-alpha",
  "receipt_id":   "f04df972-f9fc-4624-83cb-0ed3682297cf",
  "issued_at":    "2026-05-21T06:54:39.251Z",

  "actor": {
    "type":         "agent",
    "id":           "agent:jambot:discord:user:aa74fa40751b528f"
  },

  "tool":   { "name": "github-rest", "version": "2022-11-28", "capability": "github.issues.create" },
  "target": { "system": "github.com/jamjet-labs/jamjet-discord-bot", "environment": "prod" },

  "arguments_hash":  "2d257d4e72f62afa112766154b9b5ac0dd98ae79ee7c2758563a4363a0fb4bdf",
  "policy":          { "name": "jambot.file-issue.v1", "version": "1", "decision": "allow" },
  "execution":       { "status": "success", "completed_at": "2026-05-21T06:54:40.103Z", "result_ref": "github://issues/1" },

  "prior_receipt":      { "receipt_id": "cab5eff7-…", "receipt_hash": "3e7f5a93…" },
  "completeness_score": 0.913,
  "receipt_hash":       "..."
}


A verifier with only this JSON — no database, no Fly.io credentials, no GitHub token, no Discord session — can run six independent checks:

  1. Tamper-evidence. Re-canonicalise the body without receipt_hash, take SHA-256, confirm it matches the stored hash.
  2. Argument binding. Re-canonicalise the arguments separately, take SHA-256, confirm it matches arguments_hash. If anything mutated between approval and execution, this fails.
  3. Spec compliance. Fetch the public JSON Schema, validate the receipt structurally.
  4. Chain integrity. Fetch the receipt at prior_receipt.receipt_id and confirm its hash matches the link.
  5. Emitter honesty. Recompute completeness_score from the provenance block using the deterministic formula in the spec. Catches an emitter that lies about how confident it was in each field.
  6. Execution proof. Follow execution.result_ref to a real downstream artifact (in this case, a public GitHub issue) and read it.
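Checks 1 and 4 above can be sketched in a few lines. This is an illustration under stated assumptions rather than the reference verifier: sorted-key compact JSON stands in for the spec's canonicalisation, `seal` is a hypothetical emitter-side helper, and the prior receipt is passed in as a dict instead of being fetched.

```python
import hashlib
import json


def canonical_bytes(value) -> bytes:
    # Assumption: sorted keys + compact separators stand in for the spec's canonical form.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def compute_receipt_hash(receipt: dict) -> str:
    # Hash every field except receipt_hash itself.
    body = {k: v for k, v in receipt.items() if k != "receipt_hash"}
    return hashlib.sha256(canonical_bytes(body)).hexdigest()


def seal(body: dict) -> dict:
    # Hypothetical emitter-side helper: attach the hash of the body.
    return {**body, "receipt_hash": compute_receipt_hash(body)}


def check_tamper_evidence(receipt: dict) -> bool:
    # Check 1: the stored hash must match the recomputed one.
    return receipt.get("receipt_hash") == compute_receipt_hash(receipt)


def check_chain_link(receipt: dict, prior: dict) -> bool:
    # Check 4: the link must name the prior receipt and carry its current, valid hash.
    link = receipt["prior_receipt"]
    return (
        link["receipt_id"] == prior["receipt_id"]
        and link["receipt_hash"] == prior["receipt_hash"]
        and check_tamper_evidence(prior)
    )


first = seal({"receipt_id": "r-1", "policy": {"decision": "allow"}})
second = seal({
    "receipt_id": "r-2",
    "policy": {"decision": "allow"},
    "prior_receipt": {"receipt_id": first["receipt_id"], "receipt_hash": first["receipt_hash"]},
})

print(check_tamper_evidence(second))    # True
print(check_chain_link(second, first))  # True

# Editing any field without recomputing the hash is detectable.
tampered = {**second, "policy": {"decision": "deny"}}
print(check_tamper_evidence(tampered))  # False
```

Neither check needs anything beyond the JSON documents themselves, which is what lets a third party run them without access to the emitting system.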

How existing tools measure up against the bar

I built one adapter per vendor — translating their normative artifact (or, where they don't have one, the developer-recommended capture shape) into an AgentBoundary v0.2-alpha receipt. Then I ran all 40 conformance scenarios against the adapter-produced receipts.

| Vendor | PASS | PARTIAL | DOCS-ONLY | NOT COVERED | N/A |
| JamJet reference | 40 | 0 | 0 | 0 | 0 |
| Anthropic permission_policy | 12 | 9 | 3 | 14 | 2 |
| Cloudflare HITL Agents | 5 | 7 | 1 | 25 | 2 |
| LangSmith Gateway | 15 | 14 | 1 | 8 | 2 |
| Microsoft AGT | 17 | 5 | 1 | 15 | 2 |

Reference implementation first; vendors alphabetical. Not ranked. The PASS counts collapse meaningful categorical differences. Each vendor is solving for a different layer of the stack:

  • Anthropic's permission_policy is the richest runtime evaluation pipeline of the four — layered hooks, scoped tool patterns, permission modes, the canUseTool callback. But the audit log from Anthropic's Managed Agents Console isn't a published schema, so there's no portable artifact a third party can verify. That's why 3 DOCS-ONLY (highest of any vendor) and 14 NOT COVERED.
  • Cloudflare HITL is a workflow primitive: durable approval gates with multi-day windows and external notifications. It's deliberately not an emitted-artifact format. The 25 NOT COVERED reflects that their recommended audit table has only 6 columns and doesn't model the things the conformance scenarios ask about.
  • LangSmith is an observability platform. The Run object captures the data, but where in the Run varies by team convention — one team puts the decision in tags, another in feedback_stats. A cross-team auditor can't reliably extract it. That's why 14 PARTIAL.
  • Microsoft AGT is the closest peer — also an artifact format, also designed for verifiable evidence, with a Merkle-chained audit log that's structurally stronger than AgentBoundary's current singly-linked design. The 15 NOT COVERED rows are deliberate scoping decisions, not bugs.

Per-vendor breakdowns with structural reasoning live in adapters/<vendor>/results.md in the public repo.

Where AgentBoundary itself currently falls short

The reference implementation scoring 40/40 against its own spec is the implementation grading itself. That's meaningful but not sufficient.

  1. JamBot's emitter mutates receipts on approval-finalize. When a maintainer approves a held action, the existing row's execution.status is updated in place and receipt_hash is recomputed — which breaks chain links from any later receipt whose prior_receipt.receipt_hash was captured before the mutation. Fix queued for v0.2.
  2. The chain is singly-linked, not Merkle. AGT's design (every entry commits to every preceding one) catches arbitrary-entry-reordering attacks that v0.2-alpha would miss. v0.3 candidate.
  3. provenance is a 3-value enum where AGT has a float [0.0, 1.0]. Simpler to reason about, coarser in practice. v0.3 candidate if practitioner feedback warrants it.
  4. No second, non-reference implementation yet. There is only one production deployment (JamBot). A second emitter in Rust, Go, or Java would validate that the spec is implementation-portable.

These are also in the report's §8.
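The first limitation is easy to reproduce. The sketch below assumes sorted-key compact JSON as the canonical form and uses "pending" as a hypothetical status for a held action; the append-only variant at the end is one possible alternative shown for contrast, not the fix queued for v0.2.

```python
import hashlib
import json


def receipt_hash(body: dict) -> str:
    # Assumption: sorted-key compact JSON stands in for the spec's canonical form.
    payload = {k: v for k, v in body.items() if k != "receipt_hash"}
    raw = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def seal(body: dict) -> dict:
    # Attach (or replace) receipt_hash so it matches the current body.
    return {**body, "receipt_hash": receipt_hash(body)}


def link_to(prior: dict) -> dict:
    return {"receipt_id": prior["receipt_id"], "receipt_hash": prior["receipt_hash"]}


def link_holds(receipt: dict, prior: dict) -> bool:
    return receipt["prior_receipt"] == link_to(prior)


# A held action, then a later receipt that links to it.
held = seal({"receipt_id": "r-1", "execution": {"status": "pending"}})
later = seal({"receipt_id": "r-2", "prior_receipt": link_to(held)})
print(link_holds(later, held))       # True

# Approval-finalize as described above: update the same row and recompute its hash.
finalized = seal({**held, "execution": {"status": "success"}})
print(link_holds(later, finalized))  # False: r-2 still carries the old hash of r-1

# Assumed alternative: leave r-1 untouched and append the outcome as a new receipt.
approval = seal({
    "receipt_id": "r-3",
    "execution": {"status": "success"},
    "prior_receipt": link_to(later),
})
print(link_holds(later, held) and link_holds(approval, later))  # True
```

Both versions of r-1 are internally consistent, which is why the break only shows up when a verifier walks the chain.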

Run the suite yourself

npx agentboundary run scenarios/
# or
uvx agentboundary run scenarios/


The run takes 60 seconds on a clean machine. No signup, no Docker, no account. Scenarios are at jamjet-labs/agentboundary/scenarios. If your results disagree, open an issue with the exact command and your environment. The suite is meant to be reproducible; if it isn't on your machine, that's a bug.

What I want from this post

  • If you maintain an agent-governance product and any of the per-scenario mappings are wrong: open a PR against adapters/<your-product>/. Right-to-respond issues are filed against all four vendors; the response windows close between 2026-05-28 and 2026-05-30, and corrections are folded in inline.
  • If you're integrating agents into a regulated stack (finance, healthcare, infrastructure ops): try the suite against your own runtime. Emitting an AgentBoundary receipt from your existing audit log is usually a few hundred lines.
  • If you already have an audit format: map one of your real audit rows to the conformance scenarios and open an issue where the suite misrepresents your model. Concrete corrections are far more useful than general feedback. AGT and AgentBoundary's design centres are complementary; the two specs could reasonably converge.
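As a rough idea of what emitting a receipt from an existing audit log involves, here is a sketch that joins the three scattered sources from the opening scenario into one record. The field names on the receipt are copied from the example receipt earlier in this post; the input row shapes, the `build_receipt` helper, and the sorted-key canonical form are all assumptions, and the result is not validated against the spec's JSON Schema.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def sha256_hex(value) -> str:
    # Assumption: sorted-key compact JSON stands in for the spec's canonical form.
    raw = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def build_receipt(decision_row: dict, arguments: dict, outcome: dict) -> dict:
    # Hypothetical adapter: merge decision, arguments, and outcome into one record.
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    body = {
        "version": "agentboundary/v0.2-alpha",
        "receipt_id": str(uuid.uuid4()),
        "issued_at": now.replace("+00:00", "Z"),
        "actor": {"type": "agent", "id": decision_row["agent_id"]},
        "tool": {
            "name": decision_row["tool"],
            "version": decision_row["tool_version"],
            "capability": decision_row["capability"],
        },
        "target": {"system": decision_row["system"], "environment": decision_row["environment"]},
        "arguments_hash": sha256_hex(arguments),
        "policy": {
            "name": decision_row["policy_name"],
            "version": decision_row["policy_version"],
            "decision": decision_row["decision"],
        },
        "execution": {"status": outcome["status"], "completed_at": outcome["completed_at"]},
    }
    return {**body, "receipt_hash": sha256_hex(body)}


# Hypothetical rows standing in for the audit table, the arguments table, and the app log.
decision_row = {
    "agent_id": "agent:example", "tool": "crm-api", "tool_version": "1",
    "capability": "customers.delete", "system": "crm.example.internal",
    "environment": "prod", "policy_name": "example.delete-customer",
    "policy_version": "7", "decision": "allow",
}
arguments = {"customer_id": "cus_123"}
outcome = {"status": "success", "completed_at": "2026-05-21T06:54:40.103Z"}

receipt = build_receipt(decision_row, arguments, outcome)
print(json.dumps(receipt, indent=2))
```

Most of the real work is in the lookups this sketch takes for granted: finding the arguments, the policy version, and the outcome for a given decision before your application logs roll over.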

Full report with the per-vendor deep-dives at jamjet.dev/blog/agent-action-control-40-tests. Canonical archive on the spec microsite at agentboundary.jamjet.dev/reports/2026-05-comparative.

Spec is Apache 2.0. Implementations welcome.