

How 12 AI agent frameworks handle human approval (most badly)
Adewole Baba · 2026-05-25 · via DEV Community

I keep watching teams ship agent systems into production and then discover, on day three, that "the agent needs to wait for a human sometimes" breaks every assumption in their stack. Not because they didn't see it coming; every team plans for HITL (human-in-the-loop). It breaks because most popular agent frameworks reduce "human in the loop" to "block the Python process on input() and hope for the best."

I spent a day auditing the twelve most popular AI-agent frameworks against a strict production rubric. The results aren't kind. Two frameworks pass. Ten are one production deploy away from breaking.

This post is the receipts.


The rubric (and why it's strict)

A production HITL primitive isn't "the agent can pause for input." That's a 1980s primitive. A production HITL primitive needs six properties:

| Axis | What it measures |
| --- | --- |
| Durability | Does the agent survive a worker restart during a pending await? Is the paused state stored in durable storage (Postgres), not in-process memory? |
| Idempotency | If the agent retries after a crash, can the same approval resolve once without double-acting? |
| Typed I/O | Is the request payload AND the human's response a typed schema (Pydantic / Zod)? Or just str? |
| Channel abstraction | Can you swap channel (terminal → Slack → email → dashboard) without rewriting the agent? |
| Verifier hook | Is there a built-in slot for an AI quality check on the human's response before resuming? |
| Default UI | Does the framework ship an admin UI to view, claim, resolve in-flight tasks? |

Score 1 (absent / broken) to 5 (production-ready primitive in core). Max 30.

If you think six axes is too strict: the alternative is your agent process dying because a worker rotated mid-await, your retry double-charging a customer, your "approve y/n" prompt happening on stdin in a Slack channel only you can see, and you discovering all of this two weeks after you shipped.
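To make the first two axes concrete, here is a minimal sketch of a pending approval that survives a worker restart and resolves exactly once. It is my own illustration, not code from any framework in this audit: sqlite3 stands in for Postgres, and the table layout and the open_store / request_approval / resolve names are all assumptions.

```python
import json
import os
import sqlite3
import tempfile
import uuid


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the table that holds pending approvals."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS approvals (
               id TEXT PRIMARY KEY,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               response TEXT
           )"""
    )
    conn.commit()
    return conn


def request_approval(conn: sqlite3.Connection, payload: dict) -> str:
    """Persist the question before waiting, so a restart cannot lose it."""
    approval_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO approvals (id, payload) VALUES (?, ?)",
        (approval_id, json.dumps(payload)),
    )
    conn.commit()
    return approval_id


def resolve(conn: sqlite3.Connection, approval_id: str, response: dict) -> bool:
    """Record the human's answer. Returns True only for the first caller."""
    cur = conn.execute(
        "UPDATE approvals SET status = 'resolved', response = ? "
        "WHERE id = ? AND status = 'pending'",
        (json.dumps(response), approval_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical walk-through: one worker asks, dies, and another one resolves.
db_path = os.path.join(tempfile.mkdtemp(), "approvals.db")
conn = open_store(db_path)
approval_id = request_approval(conn, {"question": "approve transfer?", "amount": 500})
conn.close()  # simulate the worker rotating mid-await

conn = open_store(db_path)  # a different worker picks the task up
first = resolve(conn, approval_id, {"approved": True})
second = resolve(conn, approval_id, {"approved": True})  # duplicate delivery
status = conn.execute(
    "SELECT status FROM approvals WHERE id = ?", (approval_id,)
).fetchone()[0]
```

The conditional UPDATE is the whole idempotency trick here: only the caller that flips the row from pending to resolved gets True back, so a retried or duplicated approval cannot trigger the action a second time.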


The scorecard

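| Rank | Framework | Durability | Idempotency | Typed I/O | Channel | Verifier | UI | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | LangGraph | 5 | 3 | 3 | 1 | 1 | 2 | 15 |
| 2 | Pydantic AI | 4 | 4 | 5 | 1 | 1 | 0 | 15 |
| 3 | Mastra | 4 | 3 | 4 | 2 | 1 | 1 | 15 |
| 4 | OpenAI Agents SDK | 3 | 3 | 3 | 1 | 1 | 0 | 11 |
| 5 | LlamaIndex | 3 | 2 | 3 | 1 | 1 | 0 | 10 |
| 6 | Haystack | 2 | 1 | 2 | 2 | 1 | 1 | 9 |
| 7 | Semantic Kernel | 2 | 2 | 2 | 1 | 1 | 0 | 8 |
| 8 | CrewAI | 2 | 1 | 1 | 1 | 1 | 0 | 6 |
| 9 | Claude Agent SDK | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 10 | LangChain (legacy) | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 11 | AutoGen | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 12 | smolagents | 1 | 1 | 1 | 1 | 1 | 0 | 5 |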

Nobody scores above 15/30. Four frameworks tie at the bottom on 5/30, each for a different reason, but the headline number is the same.
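If you want to check the arithmetic yourself, the totals can be recomputed from the per-axis scores. This is a small sketch that only re-adds the numbers from the scorecard; the SCORES dictionary name is my own.

```python
# Per-axis scores in scorecard order:
# (durability, idempotency, typed I/O, channel, verifier, UI)
SCORES = {
    "LangGraph": (5, 3, 3, 1, 1, 2),
    "Pydantic AI": (4, 4, 5, 1, 1, 0),
    "Mastra": (4, 3, 4, 2, 1, 1),
    "OpenAI Agents SDK": (3, 3, 3, 1, 1, 0),
    "LlamaIndex": (3, 2, 3, 1, 1, 0),
    "Haystack": (2, 1, 2, 2, 1, 1),
    "Semantic Kernel": (2, 2, 2, 1, 1, 0),
    "CrewAI": (2, 1, 1, 1, 1, 0),
    "Claude Agent SDK": (1, 1, 1, 1, 1, 0),
    "LangChain (legacy)": (1, 1, 1, 1, 1, 0),
    "AutoGen": (1, 1, 1, 1, 1, 0),
    "smolagents": (1, 1, 1, 1, 1, 0),
}

totals = {name: sum(axes) for name, axes in SCORES.items()}
best = max(totals.values())
worst = min(totals.values())
top = [name for name, total in totals.items() if total == best]
bottom = [name for name, total in totals.items() if total == worst]

for name, total in totals.items():
    print(f"{name:20s} {total:2d}/30")
```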


The top: LangGraph, Pydantic AI, Mastra

These three are the closest things to production-ready. They still leave 50% of the rubric to you, but at least the durable pause primitive works.

LangGraph — best-in-class durability, BYO everything else

LangGraph's interrupt() pauses a graph node; resuming is graph.invoke(Command(resume=value), config={...}). Critically, and this is what separates it from the field, when paired with a PostgresSaver checkpointer, the paused thread state lives in Postgres. Any worker can resume it. The docs explicitly walk through worker-restart resilience.

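from langgraph.types import interrupt, Command

def approval_node(state):
    # Pauses the graph here; the dict is the payload surfaced to the human.
    decision = interrupt({"question": "approve transfer?", "amount": state["amount"]})
    # On resume, interrupt() returns the value passed as Command(resume=...).
    return {"approved": decision}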


What you'll discover the hard way: the docs warn that code before interrupt() runs twice on resume. "Place pure computation before, side effects after." That's the developer's idempotency burden, not the framework's. There's also no channel abstraction: the interrupt() payload is just a dict, and how that becomes a Slack message is entirely on you.
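The runs-twice behavior is easier to see in a toy model. The sketch below is not LangGraph code and does not claim to mirror its internals; Runner, Interrupt, and both node functions are hypothetical names I made up to simulate "re-execute the node from the top on resume."

```python
class Interrupt(Exception):
    """Raised to pause a node; carries the payload shown to the human."""

    def __init__(self, payload):
        super().__init__(payload)
        self.payload = payload


class Runner:
    """Toy runner: on resume it re-executes the node from its first line."""

    def __init__(self):
        self.has_resume = False
        self.resume_value = None

    def interrupt(self, payload):
        if self.has_resume:
            return self.resume_value  # second pass: hand back the answer
        raise Interrupt(payload)      # first pass: stop the node here

    def run(self, node, state):
        try:
            return node(self, state)
        except Interrupt as pause:
            return {"paused": pause.payload}

    def resume(self, node, state, value):
        self.has_resume, self.resume_value = True, value
        return node(self, state)      # replay from the top


emails_sent = []
transfers = []


def risky_node(runner, state):
    emails_sent.append("notify")  # side effect BEFORE the pause: replayed
    decision = runner.interrupt({"question": "approve transfer?"})
    return {"approved": decision}


def safe_node(runner, state):
    cents = state["amount"] * 100  # pure computation before the pause
    decision = runner.interrupt({"question": "approve transfer?", "cents": cents})
    if decision:
        transfers.append(cents)    # side effect AFTER the pause: runs once
    return {"approved": decision}


risky = Runner()
risky.run(risky_node, {})
risky.resume(risky_node, {}, True)

safe = Runner()
paused = safe.run(safe_node, {"amount": 5})
result = safe.resume(safe_node, {"amount": 5}, True)
```

After both runs, emails_sent holds two entries while transfers holds one, which is the practical meaning of "pure computation before, side effects after."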

LangGraph Platform ships a task UI. The OSS package does not.

Verdict: Best durability story in the survey. Everything above the storage layer is BYO.

Pydantic AI — best typed API, weakest UI story

Pydantic AI's Deferred Tools is the cleanest API I saw. A tool marked requires_approval=True causes the run to end with a DeferredToolRequests output. You collect approvals as DeferredToolResults and resume by passing both into the next agent run.

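from pydantic_ai import Agent
from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolApproved

# The model name is a placeholder; the output type must allow DeferredToolRequests.
agent = Agent("openai:gpt-4o", output_type=[str, DeferredToolRequests])

# tool_plain is for tools that take no RunContext argument.
@agent.tool_plain(requires_approval=True)
async def transfer_funds(amount: int, to: str) -> str:
    return f"sent {amount} to {to}"

# Run this part inside an async function.
result = await agent.run("send $500 to Bob")
if isinstance(result.output, DeferredToolRequests):
    approvals = {call.tool_call_id: ToolApproved() for call in result.output.approvals}
    result = await agent.run(
        message_history=result.all_messages(),
        deferred_tool_results=DeferredToolResults(approvals=approvals),
    )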


Everything is Pydantic-typed. Durability comes via integration with Restate (or Temporal, Prefect), which journals every step including the await. That's powerful, but "we ship a primitive plus a separate runtime you also have to learn" is not zero-config.

Verdict: The typing is exactly right. The rest of the stack is a separate framework.

Mastra — best TypeScript story, channels don't compose

Mastra workflows expose suspend() / resume(). When a step suspends, the workflow snapshot is persisted (PostgreSQL, Upstash Redis, etc.), and resume() can be called from any HTTP endpoint with the matching run ID. Both suspend and resume payloads are Zod-typed.

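import { createStep } from "@mastra/core/workflows";
import { z } from "zod";

const approvalStep = createStep({
  id: "approval",
  inputSchema: z.object({ amount: z.number() }),
  suspendSchema: z.object({ requestId: z.string() }),
  resumeSchema: z.object({ approved: z.boolean() }),
  outputSchema: z.object({ approved: z.boolean() }),
  execute: async ({ resumeData, suspend }) => {
    if (!resumeData) {
      // First pass: persist a snapshot and stop until resume() is called.
      return await suspend({ requestId: crypto.randomUUID() });
    }
    return { approved: resumeData.approved };
  },
});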


The catch: Mastra ships "Channels" (Slack, Discord, Telegram) — but for agents, not workflows. You can't say "suspend this step and route it to #ops on Slack" without writing the glue yourself. The community-maintained assistant-ui/mastra-hitl repo exists precisely because there's no canonical built-in approval UI for production.

Verdict: The best TS option. The last mile is still on you.


The middle: OpenAI Agents SDK, LlamaIndex, Haystack, Semantic Kernel

These four ship a recognizable HITL primitive but trip over a single major axis each.

  • OpenAI Agents SDK (11/30) has clean needs_approval semantics and a RunState you can serialize. But the SDK assumes you'll bring your own queue, storage, and UI. No channel. No idempotency guarantee beyond what serialize-and-resume gives you.
  • LlamaIndex (10/30) has elegant event-driven HITL via wait_for_event — but the docs warn: "the runtime pauses by throwing an internal control-flow exception and replays the entire step when the event arrives." Any side effects before the await run twice. Footgun.
  • Haystack (9/30) has the best-designed confirmation taxonomy (AlwaysAskPolicy, BlockingConfirmationStrategy, RichConsoleUI) — and ships only console UIs. Both built-in implementations block the Python process.
  • Semantic Kernel (8/30) has two half-built HITL mechanisms (IFunctionInvocationFilter and Process Framework KernelFunction parameters). The community is openly confused about which to use — issue #10832 literally asks "How to support HITL in Agent and Process frameworks."

The bottom: CrewAI, Claude Agent SDK, LangChain legacy, AutoGen, smolagents

This is where it gets bleak. Five frameworks. Each ships some pause-for-human primitive. Each one is input().

CrewAI: human_input=True — and then what?

The API is one boolean:

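from crewai import Task

# `researcher` is a crewai Agent defined elsewhere.
task = Task(
    description="Research the latest AI advancements...",
    expected_output="A comprehensive report",
    agent=researcher,
    human_input=True,  # asks for feedback on the terminal before finishing
)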


The CrewAI community forum has an officially acknowledged thread titled "Human in the loop - workaround" that opens:

"HITL input is handled through the terminal, which does not work for a production web environment."

The recommended workaround is Streamlit or Chainlit — wrap your CrewAI process behind a UI and hijack stdin. The entire production HITL story for CrewAI is "wrap stdin."

LangChain (legacy), AutoGen, smolagents: the input() family

  • LangChain ships HumanInputRun and HumanApprovalCallbackHandler. Both wrap input(). The maintainers' standing recommendation for production HITL is "move to LangGraph" — Discussion #21524, Discussion #28217.
  • AutoGen's human_input_mode="ALWAYS" is a 2023 primitive still shipping in 2026. Issues #2358 (persistence roadmap) and #5806 (mid-run checkpointing) are both open.
  • smolagents has step_callbacks and agent.interrupt(). Memory is in-process. Issue #364 tracks a state-serialization request — open, unfilled.

Claude Agent SDK: not actually HITL

The SDK's canUseTool is permission gating, not human-in-the-loop. The callback is synchronous within the agent loop. If you want to pause for a human review, you make canUseTool block on a Promise that resolves when the human answers. There's no built-in story for "human two timezones away approves via Slack." The SDK is honest about this — it's designed for Claude Code's interactive CLI approval, not for distributed agent ops.
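Here is roughly what "block the permission callback until a human answers" looks like, sketched in Python with asyncio. The callback signature, the pending registry, and human_answers are all hypothetical and are not the SDK's actual API; the point is only to show where the waiting state lives.

```python
import asyncio

# In-process registry of unanswered requests. This is the weak spot:
# if the process restarts, every pending approval in here is gone.
pending: dict[str, asyncio.Future] = {}


async def can_use_tool(tool_name: str, tool_input: dict) -> bool:
    """Hypothetical permission callback that waits for a human decision."""
    loop = asyncio.get_running_loop()
    request_id = f"{tool_name}:{len(pending)}"
    pending[request_id] = loop.create_future()
    try:
        return await pending[request_id]  # the agent loop is parked here
    finally:
        pending.pop(request_id, None)


def human_answers(request_id: str, approved: bool) -> None:
    """Called by whatever receives the human's reply (webhook, CLI, ...)."""
    future = pending.get(request_id)
    if future is not None and not future.done():
        future.set_result(approved)


async def demo() -> bool:
    gate = asyncio.create_task(can_use_tool("transfer_funds", {"amount": 500}))
    await asyncio.sleep(0)  # let the callback register its future
    human_answers("transfer_funds:0", True)
    return await gate


approved = asyncio.run(demo())
```

It works for an interactive session, and it also shows the limitation: the only record of the pending request is a future held in memory by one process.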


The pattern: three universal gaps

Across all twelve frameworks, three axes are universally missing or near-missing:

1. Channel abstraction. Only Mastra ships anything resembling a channel adapter system. Even there, channels live on agents, not on workflows. Every other framework hands you a payload and walks away.

2. Verifier hook. Zero out of twelve frameworks ship a slot for "the human's response goes through an AI quality check before resuming the agent." The idea that a junior approver clicking "approve" should be pre-validated by an LLM before the agent trusts it — fully absent from the field.

3. Default UI. Zero OSS dashboards for in-flight HITL tasks. LangGraph Platform has one (paid). Mastra has a dev playground (not production). The rest assume you'll build it.
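The first two gaps are small in code terms, which is what makes their absence notable. Below is a minimal sketch of a channel adapter interface and a verifier slot; every name in it (Channel, RecordingChannel, ask_human, handle_response) is my own assumption, and the verifier is a plain rule standing in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Protocol


@dataclass
class ApprovalRequest:
    request_id: str
    question: str
    amount: int


class Channel(Protocol):
    """Anything that can carry a request to a human."""

    def deliver(self, request: ApprovalRequest) -> None: ...


@dataclass
class RecordingChannel:
    """Stand-in for a Slack or email adapter: records what would be sent."""

    name: str
    outbox: list = field(default_factory=list)

    def deliver(self, request: ApprovalRequest) -> None:
        self.outbox.append((self.name, request.request_id, request.question))


# A verifier returns None to accept the response, or a reason to reject it.
Verifier = Callable[[ApprovalRequest, dict], Optional[str]]


def note_required_for_large_amounts(request: ApprovalRequest, response: dict):
    """Rule-based stand-in for a model-backed quality check."""
    if response.get("approved") and request.amount > 1000 and not response.get("note"):
        return "large transfers need a justification note"
    return None


def ask_human(channel: Channel, request: ApprovalRequest) -> None:
    """Agent-side code depends only on the Channel protocol."""
    channel.deliver(request)


def handle_response(
    request: ApprovalRequest, response: dict, verifier: Optional[Verifier] = None
) -> dict:
    """Run the optional verifier before letting the agent resume."""
    if verifier is not None:
        objection = verifier(request, response)
        if objection is not None:
            return {"resumed": False, "reason": objection}
    return {"resumed": True, "approved": bool(response.get("approved"))}


request = ApprovalRequest("req-1", "approve transfer?", amount=5000)
slack = RecordingChannel("slack:#ops")
email = RecordingChannel("email:ops@example.com")
for channel in (slack, email):  # swapping channels needs no agent changes
    ask_human(channel, request)

rejected = handle_response(request, {"approved": True}, note_required_for_large_amounts)
accepted = handle_response(
    request, {"approved": True, "note": "invoice 114"}, note_required_for_large_amounts
)
```

A real adapter would call the Slack or email API inside deliver(), and a real verifier would call a model, but the agent-side call sites would stay the same.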

The aggregate pattern: the industry has reduced HITL to "block on stdin," and ten of twelve frameworks are one production deploy away from breaking.


What "good" actually looks like

A production HITL primitive should hit all six axes of the rubric above:

  1. Persist to durable storage by default — Postgres or equivalent. No in-process state.
  2. Emit a typed payload describing what the human is being asked. Pydantic / Zod.
  3. Route through a pluggable channel adapter — Slack, email, dashboard, SMS. One config knob. Audit row written on every delivery + response.
  4. Accept a typed response from any channel, idempotent on duplicate deliveries.
  5. Optionally pipe the response through a verifier model before resuming the agent.
  6. Expose an OSS admin UI so ops teams can see what's in flight.
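Items 2 and 4 in the list above come down to refusing untyped answers at the boundary. A minimal sketch, using a stdlib dataclass as a stand-in for a Pydantic or Zod schema; the TransferApproval name and its fields are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferApproval:
    """Typed shape of the human's answer, whichever channel it came from."""

    approved: bool
    approver: str

    @classmethod
    def parse(cls, raw: dict) -> "TransferApproval":
        """Validate an untrusted channel payload before the agent sees it."""
        approved = raw.get("approved")
        approver = raw.get("approver")
        if not isinstance(approved, bool):
            raise ValueError("approved must be a bool")
        if not isinstance(approver, str) or not approver:
            raise ValueError("approver must be a non-empty string")
        return cls(approved=approved, approver=approver)


ok = TransferApproval.parse({"approved": True, "approver": "dana"})

try:
    # A free-text "yes" from a chat reply is rejected instead of coerced.
    TransferApproval.parse({"approved": "yes", "approver": "dana"})
    rejected = False
except ValueError:
    rejected = True
```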

No framework in this audit ships all six. So I built one: awaithumans. One function call (await_human / awaitHuman), all six properties in the box, channel adapters for Slack and email shipping today, dashboard included, audit trail by default, optional Claude/OpenAI/Gemini/Azure verifier. Apache 2.0. Python + TypeScript. Adapters for the two frameworks that got close — LangGraph and the broader Pydantic AI / Temporal combo — built in.

It is not the final answer. It's an honest answer to the gap this audit exposes.


Methodology

For each framework I read the official docs, the README, the examples folder, and grepped open GitHub issues for "human", "approval", "interrupt", "input", "pause". Code samples are copied from canonical docs; I didn't make any of them up. Scores are subjective on a 1-5 scale per axis but the scorecard is reproducible — pull the framework, grep for the same terms, you'll find the same gaps. The full per-framework deep dive (with source URLs for every claim) lives in the research notes. Audit date: 2026-05-24; the field is moving fast and at least three of these frameworks have HITL work in flight — see linked issues per section. If your framework is on this list and you've shipped one of the missing axes since the audit date, I'd love to update the scorecard — open an issue at github.com/awaithumans/awaithumans or DM @awaithumans on X.


Sources

Selected, in order of relevance. The full list of ~50 source URLs is in the research repo.


If you maintain one of these frameworks and want to argue for a higher score on any axis, please do: open an issue with evidence and I'll update the scorecard. The point of the audit isn't to dunk; it's to make the gap legible.

P.S. — the browser-agent + awaithumans template repo shows what this looks like wired up: a browser-use agent that calls await_human() before clicking Place Order. Slack DM with cart screenshot, one-tap approve, agent resumes. ~90 lines.