I Built an AI System That Makes 1,000 Decisions a Day. Here's Where I Drew the Line.

CostGuard's proxy endpoint makes an autonomous decision on every LLM call that passes through it. It scores the response, compares it against a threshold, and either accepts or rejects in about 1 millisecond, with no human involved.

At first that felt like the right design. Fast, automated, scalable. Exactly what an LLM reliability layer should do.

Then I looked at what it was actually catching and, more importantly, what it was missing, and I had to rethink where automation ends and human judgment needs to begin.

This is what I learned building a system that sits in the hot path of production LLM pipelines, and why I now think human-in-the-loop design is an engineering decision, not just an ethical one.

What CostGuard Actually Does

CostGuard is an HTTP proxy that wraps your LLM calls. You route your agent's requests through it instead of directly to the provider. On every call it:

1. Checks the provider's circuit breaker state
2. Makes the LLM call with a 30-second timeout
3. Scores the response with a heuristic validity scorer (~1ms)
4. Rejects the response and falls back to the next model if the score is below your threshold
5. Logs cost, latency, validity score, and whether fallback was used

Every one of those decisions is automated. No human is involved. At production scale that's the right call: you cannot have a human reviewing every LLM response in a real-time pipeline.
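The per-call steps listed above can be sketched in Python roughly as follows. This is a minimal illustration, not CostGuard's actual code: `proxy_call`, `CallRecord`, the three callables, and the default threshold are all hypothetical names and values, and the 30-second timeout is assumed to be enforced inside `call_model`. Cost tracking is left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CallRecord:
    """One log entry per attempted model call (hypothetical schema)."""
    model: str
    validity_score: float
    latency_ms: float
    accepted: bool
    used_fallback: bool


def proxy_call(
    prompt: str,
    models: list[str],
    call_model: Callable[[str, str], str],  # assumed: (model, prompt) -> text, enforces its own timeout
    score: Callable[[str], float],          # assumed: heuristic validity scorer returning 0.0 to 1.0
    breaker_open: Callable[[str], bool],    # assumed: True if the provider's circuit breaker is open
    threshold: float = 0.5,                 # illustrative default
) -> tuple[Optional[str], list[CallRecord]]:
    log: list[CallRecord] = []
    for i, model in enumerate(models):
        if breaker_open(model):
            continue  # skip providers that are currently tripped
        start = time.perf_counter()
        try:
            response = call_model(model, prompt)
        except Exception:
            continue  # provider error or timeout: try the next model
        latency_ms = (time.perf_counter() - start) * 1000
        validity = score(response)
        accepted = validity >= threshold
        log.append(CallRecord(model, validity, latency_ms, accepted, used_fallback=i > 0))
        if accepted:
            return response, log
    return None, log  # every model was skipped, failed, or rejected
```

Nothing in this loop waits on a person, which is the point: each branch is a cheap, reversible decision.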

But the automation is only as good as what the scorer can actually detect.

The Flaw I Documented in My Own README

The heuristic scorer in CostGuard's /proxy endpoint works by rewarding statistical markers (confidence intervals, p-values, uncertainty language) and penalizing failure signals like empty outputs, error tracebacks, and refusal phrases.

It catches obvious failures reliably. A model that returns an empty string, an error message, or 'I cannot help with that' gets caught every time.

What it cannot catch: a model that generates fluent, confident, statistically unsound analysis.

"A model generating plausible-sounding confidence intervals with the wrong methodology will pass the heuristic filter at any threshold." (CostGuard README, Known Limitations)

I wrote that into the documentation before shipping. Not as a future improvement, but as a hard constraint that shapes how the system should be used.

Because here's what the benchmark data shows. Across 1,412 runs in RealDataAgentBench, the most common failure pattern wasn't models refusing or producing errors. It was models producing correct-looking outputs with broken reasoning underneath.

A model computes the right feature importances. Ranks them correctly. Then stops: no confidence intervals, no stability check across folds, no acknowledgment of overfitting risk.

Correctness score: 1.0. Statistical validity score: 0.25.

The heuristic scorer in CostGuard's hot path cannot distinguish these. And that's not a bug I can fix with a better regex. It's a fundamental limit of what can be checked in 1 millisecond without running a full evaluation.
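To make that limit concrete, here is a toy marker-based scorer in Python. The patterns and the equal weighting are my own illustrative assumptions, not CostGuard's real rules; the sketch only shows the general technique of rewarding statistical markers and zeroing out obvious failures.

```python
import re

# Hypothetical reward markers: statistical language the scorer looks for.
REWARD_PATTERNS = [
    r"confidence interval",
    r"p-value|\bp\s*[<=]\s*0?\.\d+",
    r"\b(may|might|uncertain|approximately)\b",
]

# Hypothetical failure signals: tracebacks and refusal phrases.
PENALTY_PATTERNS = [
    r"Traceback \(most recent call last\)",
    r"\bI can(?:not|'t) help\b",
]


def heuristic_score(text: str) -> float:
    """Return the fraction of reward markers present, or 0.0 on an obvious failure."""
    if not text.strip():
        return 0.0  # empty output
    if any(re.search(p, text, re.IGNORECASE) for p in PENALTY_PATTERNS):
        return 0.0  # error or refusal
    hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in REWARD_PATTERNS)
    return hits / len(REWARD_PATTERNS)


# Same surface markers, different methodology.
sound = ("Bootstrapped over 5 folds: 95% confidence interval [0.12, 0.31], "
         "p = 0.003; the ranking may shift on new data.")
unsound = ("Computed on the training rows only: 95% confidence interval [0.12, 0.31], "
           "p = 0.003; the ranking may shift on new data.")
```

Both `sound` and `unsound` receive the same score here, because the only difference between them is the methodology, and a pattern match has no way to see that.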

The Two-Layer Design That Actually Works

The solution wasn't to make the automated scorer smarter. It was to accept that two different problems need two different tools, and to be explicit about which one handles what.

The /proxy layer runs on every call. It's autonomous because it has to be: you can't block a real-time pipeline for 3 minutes on every request.

The /evaluate layer is where human judgment comes back in. You upload your dataset, run the full benchmark, and a human reviews the results before making a model selection or routing decision.

That's the line I drew. Autonomous for low-stakes, real-time filtering. Human-reviewed for high-stakes model decisions.

The Rule I Use for Everything Else

Building CostGuard pushed me toward a cleaner general rule that I now apply to any AI system I design:

Automate when the cost of being wrong is low and reversible. Require human review when the cost of being wrong is high or irreversible.

In CostGuard's case: rejecting a response that was actually fine? Low cost: the fallback model handles it, and the user sees a slightly slower response. Missing a fluent-but-wrong response? Potentially high cost: it depends entirely on what the downstream agent does with that output.

That asymmetry is why the /proxy threshold is explicitly documented as a conservative pre-filter, not a quality gate. The words matter. A quality gate implies it catches quality failures. A pre-filter implies it catches obvious failures. These are different claims.

What This Looks Like Across Risk Levels

The same principle applies everywhere I've seen AI systems deployed. The technology looks similar at every layer: models, prompts, scores, thresholds. What changes is the cost of getting it wrong.

The mistake I see most often: teams apply the 'low risk' pattern to medium or high-risk decisions because the demo worked. The demo always works — it uses clean data, expected inputs, and carefully chosen examples. Production doesn't.

Three Things a Real Human-in-the-Loop System Needs

Adding a 'review' button at the end of an AI workflow is not human-in-the-loop design. It's human-in-the-loop theater. A system that actually works needs three things:

1. Explicit escalation rules — not just confidence thresholds

Confidence scores tell you how certain the model is. They don't tell you how much the decision matters. I escalate to human review based on two independent signals: model confidence AND task risk category. A high-confidence output on a high-risk task still goes to review. An uncertain output on a low-risk task goes to fallback, not human review.
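A minimal Python sketch of that two-signal rule follows. The enum names, the `route` function, and the 0.8 confidence cutoff are hypothetical. The high-risk and low-risk branches follow the rule described above; the handling of medium-risk tasks is an assumption I added to make the example complete.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Route(Enum):
    ACCEPT = "accept"
    FALLBACK = "fallback"
    HUMAN_REVIEW = "human_review"


def route(confidence: float, risk: Risk, min_confidence: float = 0.8) -> Route:
    """Pick a route from two independent signals: model confidence and task risk."""
    if risk is Risk.HIGH:
        # High-risk tasks go to review no matter how confident the model is.
        return Route.HUMAN_REVIEW
    if confidence < min_confidence:
        # Uncertain output: low-risk tasks fall back to another model.
        # Sending medium-risk tasks to review is an assumption of this sketch.
        return Route.FALLBACK if risk is Risk.LOW else Route.HUMAN_REVIEW
    return Route.ACCEPT
```

The risk check comes first on purpose: confidence can never buy a high-risk task out of review.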

2. Audit logs that capture why, not just what

CostGuard logs the validity score, the model used, the fallback chain, and whether the response was accepted or rejected. Every call. Without that, you can't debug failures or learn from them. In RealDataAgentBench (RDAB) I take this further: SCORING_SPEC.md documents every formula and threshold so any score is reproducible without reading source code. The audit trail is the system's credibility.
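As an illustration, one audit entry could be serialized as a JSON line like this. The `AuditRecord` schema and field names are hypothetical, not CostGuard's actual log format; the `reason` field is where the "why" would live.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """Hypothetical per-call audit entry."""
    model: str
    fallback_chain: list[str]
    validity_score: float
    threshold: float
    accepted: bool
    reason: str  # human-readable explanation of the decision
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    model="model-b",
    fallback_chain=["model-a", "model-b"],
    validity_score=0.7,
    threshold=0.5,
    accepted=True,
    reason="score 0.70 >= threshold 0.50 after model-a was rejected",
)
line = to_log_line(record)
```

Storing the threshold next to the score means an old entry can still be interpreted after the threshold changes.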

3. A feedback loop that closes

Human corrections should improve the system. If reviewers are overriding the same model failure repeatedly, that pattern should feed back into prompt updates, threshold adjustments, or evaluation dataset expansion. In CostGuard this is what the /replay endpoint is for: you capture production traces with Tether, replay them against alternate models, and use the quality delta to make better routing decisions next time. Human judgment doesn't just fix the present mistake. It trains the system to make fewer of them.
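One simple way to surface repeated overrides is to count them by model and failure type. This Python sketch is independent of the /replay endpoint and Tether; the override record shape (`model`, `failure_type`) and the `min_count` cutoff are assumptions made for illustration.

```python
from collections import Counter


def recurring_overrides(overrides: list[dict], min_count: int = 3) -> list[tuple[tuple[str, str], int]]:
    """Return (model, failure_type) pairs that reviewers overrode at least min_count times."""
    counts = Counter((o["model"], o["failure_type"]) for o in overrides)
    # most_common() orders by count, highest first
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical reviewer overrides collected over a week.
overrides = [
    {"model": "model-a", "failure_type": "missing_confidence_interval"},
    {"model": "model-a", "failure_type": "missing_confidence_interval"},
    {"model": "model-a", "failure_type": "missing_confidence_interval"},
    {"model": "model-b", "failure_type": "causal_overclaim"},
]
flagged = recurring_overrides(overrides)
```

Anything that lands in `flagged` is a candidate for a prompt change, a threshold adjustment, or a new evaluation case.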

Why This Gets More Important as Models Get Better

There's a counterintuitive implication here. As frontier models improve, the case for human-in-the-loop in high-stakes domains gets stronger, not weaker.

Here's why. Across 1,412 benchmark runs, correctness scores for 12 frontier models ranged from 0.84 to 0.99. Tight cluster. Most models look similar on correctness.

Statistical validity ranged from 0.52 to 0.85. A much wider spread, and this is where the actual failure modes live.

As correctness improves toward 1.0, the remaining failures become harder to detect. The model sounds more confident. The outputs look more polished. The errors become more subtle: a wrong methodology stated fluently, a causal claim buried in a valid correlation, a confidence interval computed with the right formula on the wrong data.

A human reviewer looking at a 2024-era model output could often spot something was off. A 2026-era model output may be indistinguishable from correct reasoning to anyone who isn't an expert in that specific domain.

That's not an argument against using better models. It's an argument for keeping human domain experts in the loop on decisions that matter, precisely because the failure modes become harder to catch automatically.

The Question Worth Asking Before You Automate

Before any AI system goes fully autonomous on a decision, I run through four questions:

1. What's the cost when the model is wrong? Is it reversible?
2. Can my evaluation system actually detect the failure modes that matter, or just the obvious ones?
3. If a human reviews this output, what judgment are they adding that the model can't provide?
4. When the model fails, does my system learn from it, or does the failure disappear into a log nobody reads?

If the answer to question 2 is 'only the obvious ones' and the answer to question 1 is 'high and irreversible', that's where human-in-the-loop design belongs, regardless of how good the model is.

The Real Lesson

The strongest AI systems aren't always the most autonomous ones. The best design is the one that puts automation where it belongs (repetitive, low-risk, reversible decisions) and keeps human judgment where it belongs: anywhere the cost of being wrong is high, the failure modes are subtle, or the impact touches people's actual lives.

CostGuard's /proxy filter makes 1,000 autonomous decisions a day. I'm comfortable with that because I know exactly what it can and can't catch, and I've documented both. The /evaluate endpoint requires human review because the decisions it informs (which model to use, which threshold to trust, which routing logic to change) affect everything downstream.

That's not a limitation I'm trying to engineer away. It's the design.

The full benchmark, evaluation stack, and scoring methodology are open source: github.com/patibandlavenkatamanideep/RealDataAgentBench

Where in your production pipeline have you drawn the line between automation and human judgment, and how did you decide where to draw it?
