Hallucination Detection at the Trace Layer: 4 Detectors You Can Ship Today
Gabriel Anha · 2026-05-24 · via DEV Community

You can't catch every hallucination at runtime. The cost of running a fact-check model inline doubles your p99 and burns budget on requests that were fine anyway. But you can catch four classes of hallucination cheaply, after the fact, in the trace pipeline. The detectors run on spans your app already emits. The user gets the answer at normal latency. The detector flags the bad ones a few seconds later and a Slack ping shows up before the support ticket does.

This post walks through four detectors that have paid for themselves on real LLM products: citation grounding, confidence anomaly, schema violation, and self-consistency divergence. Each one is 30-80 lines of Python. Then we wire them into an OpenTelemetry SpanProcessor so they run async on every trace your app emits. Then we calibrate them, because every one of them lies if you don't.

Why trace-layer detection beats inline checks

Inline detection means: before you return the LLM response to the user, you run another model to score whether it's hallucinated. NeMo Guardrails, Lynx, and most "evaluator-as-a-service" pitches work this way. The math doesn't favor you.

Pretend each user request is one LLM call (~400ms). Inline detection adds another LLM call to score grounding (~600ms because grounding models tend to be slower and chew the full context). Your p50 goes from 400ms to 1000ms. Your p99 doubles to 2s. Your token bill triples. And you're paying that on 100% of requests when maybe 2% are actually hallucinations worth catching.

Trace-layer detection runs after the response goes out. The user sees the answer at native latency. Your span processor picks up the finished span, runs detectors on it asynchronously, and writes flags back to your observability backend or a separate "incident" queue. If a detector fires, you get a notification, not a degraded user experience.

The trade-off is honest: you can't block the bad response. The user already saw it. That sounds bad until you compare it to the alternative. Slowing down 100% of traffic so you can occasionally catch a hallucination is worse than catching 80% of hallucinations after the fact and apologizing for the rest. For most products, sub-second latency wins over zero-hallucination guarantees.

Detector 1 — Citation grounding

The most common hallucination in a RAG system: the model cites a source that doesn't actually support the claim. It looks correct (there's a citation) but the cited passage says something else, or nothing at all about the claim.

Citation grounding is a string/embedding check between each cited source passage and the sentence that cites it. No second LLM call needed for the basic version.

import re
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, util

# load once at module level, not per-request
_embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

@dataclass
class GroundingResult:
    sentence: str
    citation_id: str
    similarity: float
    grounded: bool

CITATION_RE = re.compile(r"\[(\d+)\]")

def check_grounding(
    answer: str,
    sources: dict[str, str],
    threshold: float = 0.55,
) -> list[GroundingResult]:
    """For each sentence with [n] citations, check that the cited
    source actually contains the claim. Returns one row per (sentence,
    citation) pair so you can see which citation in a multi-cite
    sentence is the weak one."""
    results: list[GroundingResult] = []
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())

    for sentence in sentences:
        cite_ids = CITATION_RE.findall(sentence)
        if not cite_ids:
            continue
        # strip citation markers before embedding — they're noise
        clean = CITATION_RE.sub("", sentence).strip()
        sent_emb = _embedder.encode(clean, convert_to_tensor=True)

        for cid in cite_ids:
            src_text = sources.get(cid, "")
            if not src_text:
                results.append(GroundingResult(clean, cid, 0.0, False))
                continue
            src_emb = _embedder.encode(src_text, convert_to_tensor=True)
            sim = float(util.cos_sim(sent_emb, src_emb).item())
            results.append(
                GroundingResult(clean, cid, sim, sim >= threshold)
            )
    return results

The threshold of 0.55 is a starting point, not a law. You'll move it after calibration. Cosine sim of BGE embeddings on factual claim vs supporting passage tends to sit between 0.4 and 0.85 in practice. Below 0.4 is almost always fabrication; above 0.7 is almost always supported. The grey zone needs labelled traces to tune (we'll get there).

Gotcha: if your sources are very long, embedding the whole chunk dilutes the signal. The cited claim might match one paragraph in a 2000-token source. The cosine sim against the whole blob comes back middling and you miss the hit. Fix: chunk the source into sentence-windows, embed each, and take the max similarity across chunks.
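
The chunk-and-take-the-max fix can be sketched as below. This is a minimal sketch under assumptions: `sentence_windows`, `cosine`, and `max_window_similarity` are hypothetical helper names, the window size of 3 sentences is an arbitrary starting point, and `embed` is an injected callable standing in for something like `_embedder.encode`, so the windowing logic can be exercised without loading a model.

```python
import math
import re
from typing import Callable, Sequence

Vector = Sequence[float]

def sentence_windows(text: str, size: int = 3) -> list[str]:
    """Split text into overlapping windows of `size` sentences (stride 1)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i : i + size])
        for i in range(len(sentences) - size + 1)
    ]

def cosine(a: Vector, b: Vector) -> float:
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def max_window_similarity(
    claim: str,
    source: str,
    embed: Callable[[list[str]], Sequence[Vector]],  # assumption: batch embedder
    size: int = 3,
) -> float:
    """Best similarity between the claim and any sentence window of the source."""
    windows = sentence_windows(source, size)
    if not windows:
        return 0.0
    claim_vec = embed([claim])[0]
    return max(cosine(claim_vec, vec) for vec in embed(windows))
```

In `check_grounding`, the returned value would take the place of the single whole-source `sim`. Smaller windows sharpen the match but multiply the number of embeddings per source, so caching window embeddings per source is probably worth it.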

Detector 2 — Confidence anomaly via logprobs

When a model hallucinates a fact it's never seen, the per-token logprob distribution often deviates from what you see on well-grounded answers. Confident-but-wrong is the dangerous case, and one flavor of it shows up as unusually low per-token entropy: the model is committing to tokens it would normally hedge on.

You need logprobs in the response. OpenAI exposes them via logprobs=True, top_logprobs=5. Anthropic doesn't, last I checked, so this detector only works on providers that surface them.

import math
from statistics import mean

def token_entropy(top_logprobs: list[dict]) -> float:
    """Shannon entropy over the top-k token distribution at one
    position. High entropy = model is unsure. Low entropy = model
    is committed."""
    probs = [math.exp(t["logprob"]) for t in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    norm = [p / total for p in probs]
    return -sum(p * math.log(p) for p in norm if p > 0)

def confidence_anomaly_score(
    tokens: list[dict],
    baseline_mean_entropy: float,
    baseline_stdev: float,
) -> float:
    """Z-score of this response's mean token entropy against
    the baseline you computed on labelled good traces. Returns
    abs(z). Above 2.5 is worth flagging."""
    if not tokens:
        return 0.0
    per_token = [
        token_entropy(t.get("top_logprobs", []))
        for t in tokens
        if t.get("top_logprobs")
    ]
    if not per_token:
        return 0.0
    response_entropy = mean(per_token)
    if baseline_stdev == 0:
        return 0.0
    return abs(response_entropy - baseline_mean_entropy) / baseline_stdev

You compute baseline_mean_entropy and baseline_stdev once a week from your labelled-good traces. Store them in Redis or just a YAML file your pipeline reloads. Then the per-trace score is a single z-score.
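
A minimal sketch of that weekly baseline job, assuming you have already reduced each labelled-good response to its mean token entropy (for example by averaging `token_entropy` over its tokens). `compute_baseline` is a hypothetical name. The returned keys are chosen to match the `{"mean": ..., "stdev": ...}` shape the span processor later in this post expects.

```python
from statistics import mean, stdev

def compute_baseline(mean_entropies: list[float]) -> dict[str, float]:
    """Summarize labelled-good traffic as a mean and a sample standard deviation.

    mean_entropies: one value per labelled-good response, each being
    that response's mean per-token entropy.
    """
    if len(mean_entropies) < 2:
        # stdev is undefined for fewer than two data points
        raise ValueError("need at least two labelled-good responses")
    return {
        "mean": mean(mean_entropies),
        "stdev": stdev(mean_entropies),  # sample (n - 1) standard deviation
    }
```

With only a few dozen good traces the standard deviation is itself noisy, so the z-score cutoff should be treated as provisional until the labelled set grows.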

The signal is noisier than citation grounding. A z-score of 2.5 means "this response's average token confidence is 2.5 standard deviations from your baseline." That catches the confident hallucinations but also catches valid-but-unusual responses (a question that just happens to have a more deterministic answer than average, like "what's 2+2"). Use this detector as a tie-breaker or as a filter on top of the others, not as a primary alarm.

Detector 3 — Schema and format violation

If your prompt asks for JSON with a specific shape, anything off-shape is hallucination of intent. Even when the content is correct, a schema break means the model didn't follow the contract, and downstream code probably crashed or silently dropped fields.

This one is the cheapest of the four and has the lowest false-positive rate. Always ship it.

import json
from dataclasses import dataclass

from jsonschema import Draft202012Validator

@dataclass
class SchemaResult:
    valid: bool
    errors: list[str]
    parse_failed: bool

def check_schema(raw_output: str, schema: dict) -> SchemaResult:
    """Validate the model's raw text against a JSON Schema. Both
    parse-fail and validation-fail count as hallucinations of
    intent — the model didn't follow the contract."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return SchemaResult(False, [f"json parse: {e}"], True)

    validator = Draft202012Validator(schema)
    errors = [
        f"{'.'.join(str(p) for p in e.absolute_path) or '<root>'}: "
        f"{e.message}"
        for e in validator.iter_errors(parsed)
    ]
    return SchemaResult(len(errors) == 0, errors, False)

Pair this with response_format={"type": "json_schema", ...} if you're on OpenAI. That should prevent schema violations at generation time. In practice, structured outputs still occasionally drop optional fields or hallucinate enum values on long-running streams that get truncated mid-object. The detector catches those.

A common pattern: a tool-calling agent emits {"tool": "search_docs", "args": {"q": "..."}}. The model gets cute and emits {"tool": "search_documents", "args": {"q": "..."}}. JSON is valid, schema rejects because tool isn't in the enum, your tool-dispatcher silently returns empty results, the user gets a hallucinated answer based on no retrieval. Schema check fires, you see it.
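
To make the failure concrete, here is a standard-library sketch of the same enum check applied to the tool call above. It is a hand-rolled stand-in for illustration, not the `jsonschema` path used by `check_schema`. `ALLOWED_TOOLS` and `check_tool_call` are hypothetical names.

```python
import json

ALLOWED_TOOLS = {"search_docs"}  # hypothetical tool registry

def check_tool_call(raw_output: str, allowed_tools: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the call is dispatchable."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return [f"json parse: {e.msg}"]
    if not isinstance(call, dict):
        return ["<root>: expected an object"]

    errors: list[str] = []
    tool = call.get("tool")
    # equivalent of an "enum" constraint on the tool name
    if not isinstance(tool, str) or tool not in allowed_tools:
        errors.append(f"tool: {tool!r} is not one of {sorted(allowed_tools)}")
    if not isinstance(call.get("args"), dict):
        errors.append("args: expected an object")
    return errors

good = '{"tool": "search_docs", "args": {"q": "refund policy"}}'
bad = '{"tool": "search_documents", "args": {"q": "refund policy"}}'
print(check_tool_call(good, ALLOWED_TOOLS))  # []
print(check_tool_call(bad, ALLOWED_TOOLS))
```

Running the same check inside the dispatcher, and raising instead of returning empty results, would stop the silent failure at its source. The trace-layer detector then acts as the backstop.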

Detector 4 — Self-consistency divergence

Sample the same prompt N times at temperature > 0. If the answers disagree past a threshold, the model is likely making things up. If they converge, the model is committed to its answer (which doesn't prove correctness, but shows it's not flipping coins).

This costs N× tokens, so don't run it on every trace. Run it as a sampling detector. Pick 1% of traces tagged with a "high-stakes" attribute (medical, financial, legal queries) and run consistency on those.
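
One way to implement that sampling decision is to hash the trace ID, so the same trace always gets the same answer no matter which worker sees it. This is a sketch under assumptions: `should_run_consistency` is a hypothetical name, and `high_stakes` stands for whatever span attribute your app uses to tag medical, financial, or legal queries.

```python
import hashlib

def should_run_consistency(
    trace_id: str,
    high_stakes: bool,
    rate: float = 0.01,
) -> bool:
    """Deterministically select roughly `rate` of high-stakes traces."""
    if not high_stakes:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and keep
    # the trace if it falls in the lowest `rate` fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Hashing rather than calling `random.random()` keeps the selection reproducible, which helps when you later want to re-run the detector on exactly the traces that were sampled.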

import asyncio
from openai import AsyncOpenAI

_client = AsyncOpenAI()

async def _sample_once(messages: list[dict], model: str) -> str:
    resp = await _client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
        max_tokens=400,
    )
    return resp.choices[0].message.content or ""

async def consistency_score(
    messages: list[dict],
    n: int = 4,
    model: str = "gpt-4o-mini",
) -> float:
    """Sample N times, embed each sample, return mean pairwise
    cosine similarity. 1.0 = identical, <0.65 = divergent."""
    samples = await asyncio.gather(
        *(_sample_once(messages, model) for _ in range(n))
    )
    embs = _embedder.encode(samples, convert_to_tensor=True)
    sims: list[float] = []
    for i in range(n):
        for j in range(i + 1, n):
            sims.append(float(util.cos_sim(embs[i], embs[j]).item()))
    return sum(sims) / len(sims) if sims else 0.0

Gotcha: self-consistency catches fabrication on factual questions but misses consistent-but-wrong answers. Two samples can both be wrong and both phrase the error the same way (because they share the same prior). It's good for "what year did X happen" questions, weak for "summarize this document" questions where consistent-but-wrong is the failure mode.

Wiring the detectors into an OTel span processor

The point of trace-layer detection is that it runs on spans the app already emits. The cleanest way to bolt this on is a custom SpanProcessor that fires after each LLM span finishes, runs the detectors in a background thread, and records the results on a follow-up span in the same trace.

import json
import logging
from concurrent.futures import ThreadPoolExecutor
from opentelemetry.sdk.trace import SpanProcessor, ReadableSpan
from opentelemetry import trace

log = logging.getLogger("hallucination-detector")
_pool = ThreadPoolExecutor(max_workers=8)

class HallucinationSpanProcessor(SpanProcessor):
    def __init__(self, schema_registry: dict, baseline: dict):
        self.schema_registry = schema_registry  # operation -> schema
        self.baseline = baseline  # {"mean": float, "stdev": float}

    def on_start(self, span, parent_context=None):
        pass

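    def on_end(self, span: ReadableSpan):
        # only run on LLM spans. Convention: name starts with "llm."
        if not span.name.startswith("llm."):
            return
        # skip our own "llm.detector.*" result spans; otherwise every
        # result span would trigger another detector run, forever
        if span.name.startswith("llm.detector."):
            return
        _pool.submit(self._run_detectors, span)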

    def _run_detectors(self, span: ReadableSpan):
        try:
            attrs = dict(span.attributes or {})
            answer = attrs.get("llm.response.content", "")
            sources_json = attrs.get("rag.sources_json", "{}")
            tokens_json = attrs.get("llm.response.tokens_json", "[]")
            operation = attrs.get("llm.operation", "")
            sources = json.loads(sources_json)
            tokens = json.loads(tokens_json)

            findings: dict[str, object] = {}

            if sources:
                g = check_grounding(answer, sources)
                ungrounded = [r for r in g if not r.grounded]
                findings["grounding.ungrounded_count"] = len(ungrounded)
                findings["grounding.min_sim"] = (
                    min((r.similarity for r in g), default=1.0)
                )

            if tokens:
                z = confidence_anomaly_score(
                    tokens,
                    self.baseline["mean"],
                    self.baseline["stdev"],
                )
                findings["confidence.zscore"] = z

            schema = self.schema_registry.get(operation)
            if schema:
                s = check_schema(answer, schema)
                findings["schema.valid"] = s.valid
                if not s.valid:
                    findings["schema.errors"] = "; ".join(s.errors[:3])

            self._publish(span, findings)
        except Exception:
            log.exception("detector failed for span %s", span.name)

    def _publish(self, span: ReadableSpan, findings: dict):
        # write back as a follow-up span: you can't mutate a
        # finished span, but you CAN emit a child span that shares
        # its trace_id and uses the LLM span as its parent.
        tracer = trace.get_tracer("hallucination-detector")
        ctx = trace.set_span_in_context(
            trace.NonRecordingSpan(span.get_span_context())
        )
        with tracer.start_as_current_span(
            "llm.detector.result", context=ctx
        ) as result_span:
            for k, v in findings.items():
                result_span.set_attribute(k, v)
            # fire your alerting hook here if thresholds are crossed
            if findings.get("grounding.ungrounded_count", 0) > 0:
                result_span.set_attribute("alert.fired", True)

    def shutdown(self):
        _pool.shutdown(wait=True)

    def force_flush(self, timeout_millis: int = 30_000):
        return True

The trick is the _publish step. You can't mutate a finished span in OTel. Once on_end is called, the span is read-only. So you emit a child span in the same trace, parented to the LLM span, and attach the detector results to it. Your backend (Honeycomb, Grafana Tempo, Jaeger, Langfuse) will show it nested under the original LLM span. Querying for alert.fired = true gives you the hallucination dashboard.
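
The processor only works if the app's LLM spans carry the attributes it reads: `llm.response.content`, `rag.sources_json`, `llm.response.tokens_json`, and `llm.operation`. OTel attribute values must be primitives or flat sequences of primitives, so the nested structures have to be JSON-encoded. A minimal sketch of the producing side follows. `llm_span_attributes` is a hypothetical helper name, and the attribute keys are simply the ones the processor above looks up.

```python
import json

def llm_span_attributes(
    answer: str,
    sources: dict,
    tokens: list[dict],
    operation: str,
) -> dict[str, str]:
    """Pack one LLM call's outputs into string-valued span attributes."""
    return {
        "llm.response.content": answer,
        # json.dumps turns integer keys into strings, which matches the
        # string citation IDs that CITATION_RE.findall produces
        "rag.sources_json": json.dumps(sources),
        "llm.response.tokens_json": json.dumps(tokens),
        "llm.operation": operation,
    }
```

On the app side this would typically be applied with `span.set_attributes(llm_span_attributes(...))` inside the span named `llm.*`. Per-token logprobs can make `llm.response.tokens_json` large, so check it against whatever attribute size limits your backend enforces.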

Register the processor on your TracerProvider at startup:

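from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# ANSWER_SCHEMA: the JSON Schema dict for your "answer_with_citations" operation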

provider = TracerProvider()
provider.add_span_processor(
    HallucinationSpanProcessor(
        schema_registry={"answer_with_citations": ANSWER_SCHEMA},
        baseline={"mean": 1.42, "stdev": 0.38},
    )
)
trace.set_tracer_provider(provider)

The gotcha: calibrate against labelled traces or every alert is noise

Every detector here has a knob: a threshold, a z-score cutoff, a minimum similarity. Ship them with the defaults and you'll either page yourself constantly or never. Both are useless.

The calibration loop is straightforward but you have to do it:

  1. Sample 200-500 traces from production over the last 2 weeks.
  2. Have a human (you, or a labeler) classify each as clean or hallucinated.
  3. Run all four detectors on every labelled trace.
  4. For each detector, sweep the threshold and compute precision/recall at each setting.
  5. Pick the threshold that gives you the precision your alerting can tolerate (usually 0.7, because you can handle roughly 1 false positive in every 3 alerts before fatigue kicks in).

from dataclasses import dataclass

@dataclass
class LabelledTrace:
    trace_id: str
    is_hallucination: bool
    grounding_min_sim: float

def calibrate_grounding(
    traces: list[LabelledTrace],
    candidate_thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, precision, recall) for
    treating min_sim < threshold as a hallucination flag."""
    rows = []
    for t in candidate_thresholds:
        tp = sum(
            1 for x in traces
            if x.grounding_min_sim < t and x.is_hallucination
        )
        fp = sum(
            1 for x in traces
            if x.grounding_min_sim < t and not x.is_hallucination
        )
        fn = sum(
            1 for x in traces
            if x.grounding_min_sim >= t and x.is_hallucination
        )
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall))
    return rows

Run that against your labelled set, look at the precision/recall curve, pick the elbow. Then store the chosen thresholds in a versioned config file (the same one your detector reads at startup). When you change a model version, re-calibrate, because the baselines move.
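
If you would rather pick the threshold mechanically than eyeball the curve, one option is to take the highest-recall row that still meets your precision floor. A minimal sketch, assuming rows in the `(threshold, precision, recall)` shape returned by `calibrate_grounding`. `pick_threshold` is a hypothetical name, and the 0.7 default mirrors the precision target from step 5.

```python
from typing import Optional

Row = tuple[float, float, float]  # (threshold, precision, recall)

def pick_threshold(rows: list[Row], min_precision: float = 0.7) -> Optional[Row]:
    """Return the row with the best recall among those meeting the
    precision floor, or None if no threshold is precise enough."""
    eligible = [r for r in rows if r[1] >= min_precision]
    if not eligible:
        return None
    # maximize recall; break ties in favor of higher precision
    return max(eligible, key=lambda r: (r[2], r[1]))

rows = [(0.4, 0.90, 0.30), (0.5, 0.75, 0.60), (0.6, 0.55, 0.85)]
print(pick_threshold(rows))  # (0.5, 0.75, 0.6)
```

A `None` result is informative in itself: it means the detector cannot reach the precision you need on this labelled set at any candidate threshold, so it should not page anyone yet.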

False-positive rates compound across detectors. If each of your four detectors has a 5% false positive rate, the detectors fail independently, and you alert on any of them firing, your trace-level false positive rate is about 18.5% (1 - 0.95^4). Either alert only when 2+ detectors fire, or weight them: citation grounding and schema check fire as alerts, confidence anomaly and consistency divergence fire as Slack signals nobody pages on.
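
The arithmetic behind that trade-off is short enough to check directly. This sketch assumes the detectors produce false positives independently of each other, which real detectors only approximate. `p_at_least_k` is a hypothetical helper name.

```python
from math import comb

def p_at_least_k(p: float, n: int, k: int) -> float:
    """Probability that at least k of n independent detectors fire
    falsely on a clean trace, each with false positive rate p."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )

print(round(p_at_least_k(0.05, 4, 1), 4))  # 0.1855, alert on any detector
print(round(p_at_least_k(0.05, 4, 2), 4))  # 0.014, alert on 2+ detectors
```

Requiring two detectors to agree cuts the false positive rate by more than an order of magnitude under this assumption, at the cost of missing hallucinations that only one detector can see.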

The detectors are cheap. The discipline of calibrating them is what makes the whole pipeline useful instead of another source of noise.

Which detector would you ship first in your stack, and what's stopped you from shipping it already?


If this was useful

Hallucination detection is one slice of a larger trace-layer evals story. LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team walks through the trace schema choices that make detectors like these possible, the calibration workflow in full, and the tradeoffs across Langfuse, Arize Phoenix, Honeycomb, and rolling your own on OTel. The "online evals" chapter is the closest match to what's in this post.
