I Built an AI System That Makes 1,000 Decisions a Day. Here's Where I Drew the Line.
Venkata Mani · 2026-05-25 · via DEV Community

CostGuard's proxy endpoint makes an autonomous decision on every LLM call that passes through it. It scores the response, compares it against a threshold, and either accepts or rejects in about 1 millisecond, with no human involved.

At first that felt like the right design. Fast, automated, scalable. Exactly what an LLM reliability layer should do.

Then I looked at what it was actually catching and, more importantly, what it was missing, and I had to rethink where automation ends and human judgment needs to begin.

This is what I learned building a system that sits in the hot path of production LLM pipelines, and why I now think human-in-the-loop design is an engineering decision, not just an ethical one.

What CostGuard Actually Does

CostGuard is an HTTP proxy that wraps your LLM calls. You route your agent's requests through it instead of directly to the provider. On every call it:

  1. Checks the provider's circuit breaker state
  2. Makes the LLM call with a 30-second timeout
  3. Scores the response with a heuristic validity scorer (~1ms)
  4. Rejects the response and falls back to the next model if the score is below your threshold
  5. Logs cost, latency, validity score, and whether fallback was used
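
The five steps above can be sketched roughly as follows. This is a minimal illustration, not CostGuard's actual code: `proxy_call`, `CallRecord`, and the injected callables (`call_model`, `score`, `is_circuit_open`) are hypothetical names, the default threshold of 0.5 is an assumption, and cost tracking is left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CallRecord:
    """One logged attempt (hypothetical shape; cost omitted for brevity)."""
    model: str
    latency_ms: float
    validity_score: float
    accepted: bool
    used_fallback: bool


def proxy_call(
    prompt: str,
    models: List[str],                             # preferred model first, fallbacks after
    call_model: Callable[[str, str, float], str],  # (model, prompt, timeout_s) -> response text
    score: Callable[[str], float],                 # fast heuristic scorer returning a value in [0, 1]
    is_circuit_open: Callable[[str], bool],        # True if this provider should be skipped
    threshold: float = 0.5,                        # assumed default, not from the article
    timeout_s: float = 30.0,
) -> Tuple[Optional[str], List[CallRecord]]:
    records: List[CallRecord] = []
    for position, model in enumerate(models):
        # Step 1: skip providers whose circuit breaker is open.
        if is_circuit_open(model):
            continue
        # Step 2: make the call with a timeout; any exception counts as an empty response.
        start = time.perf_counter()
        try:
            response = call_model(model, prompt, timeout_s)
        except Exception:
            response = ""
        latency_ms = (time.perf_counter() - start) * 1000.0
        # Step 3: score the response.
        validity = score(response)
        accepted = validity >= threshold
        # Step 5: log every attempt, whether it was accepted or not.
        records.append(CallRecord(model, latency_ms, validity, accepted, position > 0))
        # Step 4: accept the response, or fall through to the next model.
        if accepted:
            return response, records
    # Every model was skipped or rejected.
    return None, records
```

Passing the model call, scorer, and breaker check in as callables keeps the sketch runnable without any provider SDK, and makes the accept-or-fallback logic easy to test in isolation.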

Every one of those decisions is automated. No human is involved. At production scale that's the right call: you cannot have a human reviewing every LLM response in a real-time pipeline.

But the automation is only as good as what the scorer can actually detect.

The Flaw I Documented in My Own README

The heuristic scorer in CostGuard's /proxy endpoint works by rewarding statistical markers (confidence intervals, p-values, uncertainty language) and penalizing failure signals like empty outputs, error tracebacks, and refusal phrases.

It catches obvious failures reliably. A model that returns an empty string, an error message, or 'I cannot help with that' gets caught every time.

What it cannot catch: a model that generates fluent, confident, statistically unsound analysis.
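
To make the limitation concrete, here is a toy scorer in the same spirit. The patterns, weights, and the name `heuristic_validity` are assumptions for illustration only, not CostGuard's real rules. It rewards surface markers and zeroes out obvious failures, which is exactly why it cannot tell careful analysis from fluent overclaiming.

```python
import re

# Hypothetical reward patterns: one per category of statistical marker.
REWARD_PATTERNS = [
    r"confidence interval|\b\d{2}% CI\b",             # confidence intervals
    r"\bp\s*[<=>]\s*0?\.\d+",                         # p-values such as "p = 0.01"
    r"\b(may|might|approximately|uncertain|roughly)\b",  # uncertainty language
]

# Hypothetical failure signals: tracebacks and refusal phrases.
PENALTY_PATTERNS = [
    r"Traceback \(most recent call last\)",
    r"\bI (cannot|can't|am unable to) help\b",
]


def heuristic_validity(text: str) -> float:
    """Return a score in [0, 1] based only on surface patterns in the text."""
    if not text.strip():
        return 0.0  # empty output is an obvious failure
    if any(re.search(p, text, re.IGNORECASE) for p in PENALTY_PATTERNS):
        return 0.0  # error traceback or refusal
    hits = sum(1 for p in REWARD_PATTERNS if re.search(p, text, re.IGNORECASE))
    # Baseline of 0.4 plus 0.2 per marker category found (weights are arbitrary).
    return (2 + hits) / 5


careful = "Feature A ranks first (importance 0.41, 95% CI 0.35 to 0.47 across 5 folds, p = 0.01)."
overclaiming = "Feature A ranks first with a 95% CI of 0.35 to 0.47 (p = 0.01), so A causes the outcome."

# Both texts contain the same marker categories, so they receive the same score,
# even though the second one makes an unsupported causal claim.
same_score = heuristic_validity(careful) == heuristic_validity(overclaiming)
```

No threshold separates the two example texts, because the scorer never looks at whether the methodology behind the markers is sound.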

"A model generating plausible-sounding confidence intervals with the wrong methodology will pass the heuristic filter at any threshold." (CostGuard README, Known Limitations)

I wrote that into the documentation before shipping. Not as a future improvement, but as a hard constraint that shapes how the system should be used.

Because here's what the benchmark data shows. Across 1,412 runs in RealDataAgentBench, the most common failure pattern wasn't models refusing or producing errors. It was models producing correct-looking outputs with broken reasoning underneath.

A model computes the right feature importances. Ranks them correctly. Then stops: no confidence intervals, no stability check across folds, no acknowledgment of overfitting risk.

Correctness score: 1.0. Statistical validity score: 0.25.

The heuristic scorer in CostGuard's hot path cannot distinguish these. And that's not a bug I can fix with a better regex. It's a fundamental limit of what can be checked in 1 millisecond without running a full evaluation.

The Two-Layer Design That Actually Works

The solution wasn't to make the automated scorer smarter. It was to accept that two different problems need two different tools, and to be explicit about which one handles what.

The /proxy layer runs on every call. It's autonomous because it has to be: you can't block a real-time pipeline for 3 minutes on every request.

The /evaluate layer is where human judgment comes back in. You upload your dataset, run the full benchmark, and a human reviews the results before making a model selection or routing decision.

That's the line I drew. Autonomous for low-stakes, real-time filtering. Human-reviewed for high-stakes model decisions.

The Rule I Use for Everything Else

Building CostGuard pushed me toward a cleaner general rule that I now apply to any AI system I design:

Automate when the cost of being wrong is low and reversible. Require human review when the cost of being wrong is high or irreversible.

In CostGuard's case: rejecting a response that was actually fine? Low cost: the fallback model handles it, and the user sees a slightly slower response. Missing a fluent-but-wrong response? Potentially high cost: it depends entirely on what the downstream agent does with that output.

That asymmetry is why the /proxy threshold is explicitly documented as a conservative pre-filter, not a quality gate. The words matter. A quality gate implies it catches quality failures. A pre-filter implies it catches obvious failures. These are different claims.

What This Looks Like Across Risk Levels

The same principle applies everywhere I've seen AI systems deployed. The technology looks similar at every layer: models, prompts, scores, thresholds. What changes is the cost of getting it wrong.

The mistake I see most often: teams apply the 'low risk' pattern to medium or high-risk decisions because the demo worked. The demo always works — it uses clean data, expected inputs, and carefully chosen examples. Production doesn't.

Three Things a Real Human-in-the-Loop System Needs

Adding a 'review' button at the end of an AI workflow is not human-in-the-loop design. It's human-in-the-loop theater. A system that actually works needs three things:

1. Explicit escalation rules — not just confidence thresholds

Confidence scores tell you how certain the model is. They don't tell you how much the decision matters. I escalate to human review based on two independent signals: model confidence AND task risk category. A high-confidence output on a high-risk task still goes to review. An uncertain output on a low-risk task goes to fallback, not human review.
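
A minimal sketch of that two-signal rule is below. The names `Risk`, `Route`, and `escalate`, the 0.8 confidence cutoff, and the handling of medium-risk tasks are all assumptions; the text above only pins down the high-risk and low-risk cases.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Route(Enum):
    ACCEPT = "accept"
    FALLBACK = "fallback"
    HUMAN_REVIEW = "human_review"


def escalate(confidence: float, risk: Risk, min_confidence: float = 0.8) -> Route:
    """Route an output using confidence and task risk as independent signals."""
    # High-risk tasks always go to a human, no matter how confident the model is.
    if risk is Risk.HIGH:
        return Route.HUMAN_REVIEW
    if confidence < min_confidence:
        # Uncertain output on a low-risk task: retry with the fallback model.
        # Assumption: uncertain output on a medium-risk task goes to a human.
        return Route.FALLBACK if risk is Risk.LOW else Route.HUMAN_REVIEW
    # Confident output on a low- or medium-risk task is accepted automatically.
    return Route.ACCEPT
```

The point of the structure is that confidence can never override the risk category: the high-risk check runs first and does not read the confidence value at all.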

2. Audit logs that capture why, not just what

CostGuard logs the validity score, the model used, the fallback chain, and whether the response was accepted or rejected. Every call. Without that, you can't debug failures or learn from them. In RealDataAgentBench (RDAB) I take this further: SCORING_SPEC.md documents every formula and threshold, so any score is reproducible without reading source code. The audit trail is the system's credibility.
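
One way to capture the "why" alongside the "what" is to store the threshold and a human-readable reason next to each decision. The record below is an illustrative sketch: the field names, `make_record`, and `to_json_line` are hypothetical and do not reflect CostGuard's actual log schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class AuditRecord:
    request_id: str
    model: str
    fallback_chain: List[str]  # models tried, in order
    validity_score: float
    threshold: float           # stored so the decision can be re-derived later
    accepted: bool
    reason: str                # the "why" in plain text


def make_record(request_id: str, model: str, fallback_chain: List[str],
                validity_score: float, threshold: float) -> AuditRecord:
    """Build a record whose decision and explanation come from the same comparison."""
    accepted = validity_score >= threshold
    comparison = ">=" if accepted else "<"
    reason = f"validity_score {validity_score:.2f} {comparison} threshold {threshold:.2f}"
    return AuditRecord(request_id, model, list(fallback_chain),
                       validity_score, threshold, accepted, reason)


def to_json_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line, suitable for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)
```

Logging the threshold with each call matters because thresholds change over time; without it, an old rejection cannot be explained from the log alone.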

3. A feedback loop that closes

Human corrections should improve the system. If reviewers are overriding the same model failure repeatedly, that pattern should feed back into prompt updates, threshold adjustments, or evaluation dataset expansion. In CostGuard this is what the /replay endpoint is for: you capture production traces with Tether, replay them against alternate models, and use the quality delta to make better routing decisions next time. Human judgment doesn't just fix the present mistake. It trains the system to make fewer of them.
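
Spotting a repeated override can be as simple as counting reviewer corrections by model and failure type. The sketch below assumes each override is recorded as a `(model, failure_tag)` pair; the function name, the tags, and the cutoff of three are made up for illustration and are not part of CostGuard or the /replay endpoint.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Override = Tuple[str, str]  # (model, failure_tag), a hypothetical record shape


def recurring_overrides(overrides: Iterable[Override],
                        min_count: int = 3) -> List[Tuple[Override, int]]:
    """Return override patterns seen at least min_count times, most frequent first."""
    counts = Counter(overrides)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Example input: three reviewers corrected the same failure on the same model.
example = [("model-a", "missing_ci")] * 3 + [("model-b", "causal_overclaim")]
flagged = recurring_overrides(example)
```

Anything that shows up in `flagged` is a candidate for a prompt update, a threshold change, or a new case in the evaluation dataset.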

Why This Gets More Important as Models Get Better

There's a counterintuitive implication here. As frontier models improve, the case for human-in-the-loop in high-stakes domains gets stronger, not weaker.

Here's why. Across 1,412 benchmark runs, the correctness scores for 12 frontier models ranged from 0.84 to 0.99. A tight cluster. Most models look similar on correctness.

Statistical validity ranged from 0.52 to 0.85. A much wider spread, and this is where the actual failure modes live.

As correctness improves toward 1.0, the remaining failures become harder to detect. The model sounds more confident. The outputs look more polished. The errors become more subtle: a wrong methodology stated fluently, a causal claim buried in a valid correlation, a confidence interval computed with the right formula on the wrong data.

A human reviewer looking at a 2024-era model output could often spot something was off. A 2026-era model output may be indistinguishable from correct reasoning to anyone who isn't an expert in that specific domain.

That's not an argument against using better models. It's an argument for keeping human domain experts in the loop on decisions that matter, precisely because the failure modes become harder to catch automatically.

The Question Worth Asking Before You Automate

Before any AI system goes fully autonomous on a decision, I run through four questions:

  1. What's the cost when the model is wrong? Is it reversible?

  2. Can my evaluation system actually detect the failure modes that matter, or just the obvious ones?

  3. If a human reviews this output, what judgment are they adding that the model can't provide?

  4. When the model fails, does my system learn from it, or does the failure disappear into a log nobody reads?

If the answer to question 2 is 'only the obvious ones' and the answer to question 1 is 'high and irreversible', that's where human-in-the-loop design belongs, regardless of how good the model is.

The Real Lesson

The strongest AI systems aren't always the most autonomous ones. The best design is the one that puts automation where it belongs (repetitive, low-risk, reversible decisions) and keeps human judgment where it belongs: anywhere the cost of being wrong is high, the failure modes are subtle, or the impact touches people's actual lives.

CostGuard's /proxy filter makes 1,000 autonomous decisions a day. I'm comfortable with that because I know exactly what it can and can't catch, and I've documented both. The /evaluate endpoint requires human review because the decisions it informs (which model to use, which threshold to trust, which routing logic to change) affect everything downstream.

That's not a limitation I'm trying to engineer away. It's the design.

The full benchmark, evaluation stack, and scoring methodology are open source: github.com/patibandlavenkatamanideep/RealDataAgentBench

Where in your production pipeline have you drawn the line between automation and human judgment, and how did you decide where to draw it?