Hallucination Detection at the Trace Layer: 4 Detectors You Can Ship Today

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You can't catch every hallucination at runtime. Running a fact-check model inline doubles your p99 and burns budget on requests that were fine anyway. But you can catch four classes of hallucination cheaply, after the fact, in the trace pipeline. The detectors run on spans your app already emits. The user gets the answer at normal latency. The detector flags the bad ones a few seconds later and a Slack ping shows up before the support ticket does.

This post walks through four detectors that have paid for themselves on real LLM products: citation grounding, confidence anomaly, schema violation, and self-consistency divergence. Each one is 30-80 lines of Python. Then we wire them into an OpenTelemetry SpanProcessor so they run async on every trace your app emits. Then we calibrate them, because every one of them lies if you don't.

Why trace-layer detection beats inline checks

Inline detection means: before you return the LLM response to the user, you run another model to score whether it's hallucinated. NeMo Guardrails, Lynx, and most "evaluator-as-a-service" pitches work this way. The math doesn't favor you.

Pretend each user request is one LLM call (~400ms). Inline detection adds another LLM call to score grounding (~600ms because grounding models tend to be slower and chew the full context). Your p50 goes from 400ms to 1000ms. Your p99 doubles to 2s. Your token bill triples. And you're paying that on 100% of requests when maybe 2% are actually hallucinations worth catching.
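The arithmetic above is easy to rerun with your own numbers. A minimal sketch, assuming the inline check runs sequentially after the main call and that `halluc_rate` is the fraction of responses that are actually hallucinated; the function and field names are illustrative, not from any library:

```python
def inline_overhead(
    base_ms: float, check_ms: float, halluc_rate: float
) -> dict[str, float]:
    """Back-of-envelope cost of a sequential inline grounding check."""
    if base_ms <= 0:
        raise ValueError("base_ms must be positive")
    if not 0 < halluc_rate <= 1:
        raise ValueError("halluc_rate must be in (0, 1]")
    total_ms = base_ms + check_ms
    return {
        "latency_ms": total_ms,
        "latency_multiplier": total_ms / base_ms,
        # every request pays for a check, so this many checks are
        # bought for each hallucination actually present in traffic
        "checks_per_hallucination": 1.0 / halluc_rate,
    }


# the figures used in this section: 400ms call, 600ms check, 2% bad
overhead = inline_overhead(base_ms=400, check_ms=600, halluc_rate=0.02)
```

With those inputs the median request is 2.5x slower, and you pay for 50 checks per hallucination present in traffic.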

Trace-layer detection runs after the response goes out. The user sees the answer at native latency. Your span processor picks up the finished span, runs detectors on it asynchronously, and writes flags back to your observability backend or a separate "incident" queue. If a detector fires, you get a notification, not a degraded user experience.

The trade-off is honest: you can't block the bad response. The user already saw it. That sounds bad until you compare it to the alternative. Gating 100% of traffic behind a second model call so you can occasionally catch a hallucination is worse than catching 80% of hallucinations and apologizing for the rest. For most products, sub-second latency wins over zero-hallucination guarantees.

Detector 1 — Citation grounding

The most common hallucination in a RAG system: the model cites a source that doesn't actually support the claim. It looks correct (there's a citation) but the cited passage says something else, or nothing at all about the claim.

Citation grounding is a string/embedding check between each cited source passage and the sentence that cites it. No second LLM call needed for the basic version.

import re
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, util

# load once at module level, not per-request
_embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

@dataclass
class GroundingResult:
    sentence: str
    citation_id: str
    similarity: float
    grounded: bool

CITATION_RE = re.compile(r"\[(\d+)\]")

def check_grounding(
    answer: str,
    sources: dict[str, str],
    threshold: float = 0.55,
) -> list[GroundingResult]:
    """For each sentence with [n] citations, check that the cited
    source actually contains the claim. Returns one row per (sentence,
    citation) pair so you can see which citation in a multi-cite
    sentence is the weak one."""
    results: list[GroundingResult] = []
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())

    for sentence in sentences:
        cite_ids = CITATION_RE.findall(sentence)
        if not cite_ids:
            continue
        # strip citation markers before embedding — they're noise
        clean = CITATION_RE.sub("", sentence).strip()
        sent_emb = _embedder.encode(clean, convert_to_tensor=True)

        for cid in cite_ids:
            src_text = sources.get(cid, "")
            if not src_text:
                results.append(GroundingResult(clean, cid, 0.0, False))
                continue
            src_emb = _embedder.encode(src_text, convert_to_tensor=True)
            sim = float(util.cos_sim(sent_emb, src_emb).item())
            results.append(
                GroundingResult(clean, cid, sim, sim >= threshold)
            )
    return results

The threshold of 0.55 is a starting point, not a law. You'll move it after calibration. Cosine sim of BGE embeddings on factual claim vs supporting passage tends to sit between 0.4 and 0.85 in practice. Below 0.4 is almost always fabrication; above 0.7 is almost always supported. The grey zone needs labelled traces to tune (we'll get there).

Gotcha: if your sources are very long, embedding the whole chunk dilutes the signal. The cited claim might match one paragraph in a 2000-token source. The cosine sim against the whole blob comes back middling and you miss the hit. Fix: chunk the source into sentence-windows, embed each, and take the max similarity across chunks.
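The max-over-windows fix can be sketched as follows. This is a minimal illustration, not a drop-in replacement for check_grounding: the helper names are made up, the embedder is passed in as a plain function, and the bag-of-words embedder exists only so the example runs without downloading a model. In practice you would pass a wrapper around your real embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable, Sequence

Vector = Sequence[float]
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def sentence_windows(text: str, size: int = 3) -> list[str]:
    """Overlapping windows of `size` consecutive sentences, step one."""
    sentences = [s for s in _SENTENCE_SPLIT.split(text.strip()) if s]
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + size])
        for i in range(len(sentences) - size + 1)
    ]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_window_similarity(
    claim: str,
    source: str,
    embed: Callable[[str], Vector],
    size: int = 3,
) -> float:
    """Best similarity between the claim and any window of the source."""
    windows = sentence_windows(source, size)
    if not windows:
        return 0.0
    claim_vec = embed(claim)
    return max(cosine(claim_vec, embed(w)) for w in windows)


def make_toy_embedder(corpus: list[str]) -> Callable[[str], list[float]]:
    """Word-count vectors over a fixed vocabulary. Demo only (assumption):
    it stands in for a real embedding model."""
    def words(text: str) -> list[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    vocab = sorted({w for t in corpus for w in words(t)})

    def embed(text: str) -> list[float]:
        counts = Counter(words(text))
        return [float(counts[w]) for w in vocab]

    return embed


claim = "The warranty lasts two years."
source = (
    "Our store opened in 1990. We sell laptops and phones. "
    "The warranty lasts two years from purchase. "
    "Returns are accepted within thirty days. "
    "Shipping is free on large orders."
)
embed = make_toy_embedder([claim, source])
whole_sim = cosine(embed(claim), embed(source))
window_sim = max_window_similarity(claim, source, embed, size=1)
```

On this toy input the whole-source similarity lands below the 0.55 starting threshold while the best single-sentence window lands above it, which is the dilution effect described above. The absolute numbers from a real embedding model will differ.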

Detector 2 — Confidence anomaly via logprobs

When a model hallucinates a fact it's never seen, the logprob distribution often looks weird. Confident-but-wrong is the dangerous case, and one flavor of it shows up as unusually low per-token entropy: the model is committing to tokens it would normally hedge on.

You need logprobs in the response. OpenAI exposes them via logprobs=True, top_logprobs=5. Anthropic doesn't, last I checked, so this detector only works on providers that surface them.

import math
from statistics import mean

def token_entropy(top_logprobs: list[dict]) -> float:
    """Shannon entropy over the top-k token distribution at one
    position. High entropy = model is unsure. Low entropy = model
    is committed."""
    probs = [math.exp(t["logprob"]) for t in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    norm = [p / total for p in probs]
    return -sum(p * math.log(p) for p in norm if p > 0)

def confidence_anomaly_score(
    tokens: list[dict],
    baseline_mean_entropy: float,
    baseline_stdev: float,
) -> float:
    """Z-score of this response's mean token entropy against
    the baseline you computed on labelled good traces. Returns
    abs(z). Above 2.5 is worth flagging."""
    if not tokens:
        return 0.0
    per_token = [
        token_entropy(t.get("top_logprobs", []))
        for t in tokens
        if t.get("top_logprobs")
    ]
    if not per_token:
        return 0.0
    response_entropy = mean(per_token)
    if baseline_stdev == 0:
        return 0.0
    return abs(response_entropy - baseline_mean_entropy) / baseline_stdev

You compute baseline_mean_entropy and baseline_stdev once a week from your labelled-good traces. Store them in Redis or just a YAML file your pipeline reloads. Then the per-trace score is a single z-score.
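A minimal sketch of that weekly baseline job, assuming you already have one mean token entropy per labelled-good response (for example, the mean of token_entropy over each response). The function name is illustrative, and it uses the sample standard deviation, which is a choice rather than a requirement:

```python
from statistics import mean, stdev


def entropy_baseline(response_entropies: list[float]) -> dict[str, float]:
    """Baseline for the z-score, from per-response mean entropies
    measured on traces a human labelled as good."""
    if len(response_entropies) < 2:
        raise ValueError("need at least two labelled-good responses")
    return {
        "mean": mean(response_entropies),
        "stdev": stdev(response_entropies),  # sample standard deviation
    }


# toy numbers, not real measurements
baseline = entropy_baseline([1.0, 2.0, 3.0])
```

The returned dict has the same {"mean": ..., "stdev": ...} shape the span processor later in this post takes as its baseline argument, so it can be serialized to whatever store you reload from.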

The signal is noisier than citation grounding. A z-score of 2.5 means "this response's average token confidence is 2.5 standard deviations from your baseline." That catches the confident hallucinations but also catches valid-but-unusual responses (a question that just happens to have a more deterministic answer than average, like "what's 2+2"). Use this detector as a tie-breaker or as a filter on top of the others, not as a primary alarm.
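The detector expects tokens as a list of dicts, each carrying a top_logprobs list. A sketch of flattening a provider response into that shape is below. It assumes the response has been converted to a plain dict and that logprobs sit under choice["logprobs"]["content"], which is how OpenAI's Chat Completions responses are laid out as far as I know; check your SDK version. The helper name is made up.

```python
import json


def tokens_from_choice(choice: dict) -> list[dict]:
    """Flatten one completion choice (as a plain dict) into the
    list-of-dicts shape confidence_anomaly_score consumes."""
    content = (choice.get("logprobs") or {}).get("content") or []
    return [
        {
            "token": item["token"],
            "top_logprobs": [
                {"token": alt["token"], "logprob": alt["logprob"]}
                for alt in (item.get("top_logprobs") or [])
            ],
        }
        for item in content
    ]


# hand-written sample in the assumed shape, not a real API response
sample_choice = {
    "logprobs": {
        "content": [
            {
                "token": "Paris",
                "logprob": -0.01,
                "top_logprobs": [
                    {"token": "Paris", "logprob": -0.01},
                    {"token": "Lyon", "logprob": -4.6},
                ],
            }
        ]
    }
}
tokens = tokens_from_choice(sample_choice)
# serialized form, suitable for a string span attribute
tokens_json = json.dumps(tokens)
```

Serializing to JSON keeps the payload in a single string attribute, which is what the span processor later in this post reads from llm.response.tokens_json.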

Detector 3 — Schema and format violation

If your prompt asks for JSON with a specific shape, anything off-shape is hallucination of intent. Even when the content is correct, a schema break means the model didn't follow the contract, and downstream code probably crashed or silently dropped fields.

This one is the cheapest of the four and has the lowest false-positive rate. Always ship it.

import json
from dataclasses import dataclass
from jsonschema import Draft202012Validator

@dataclass
class SchemaResult:
    valid: bool
    errors: list[str]
    parse_failed: bool

def check_schema(raw_output: str, schema: dict) -> SchemaResult:
    """Validate the model's raw text against a JSON Schema. Both
    parse-fail and validation-fail count as hallucinations of
    intent — the model didn't follow the contract."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return SchemaResult(False, [f"json parse: {e}"], True)

    validator = Draft202012Validator(schema)
    errors = [
        f"{'.'.join(str(p) for p in e.absolute_path) or '<root>'}: "
        f"{e.message}"
        for e in validator.iter_errors(parsed)
    ]
    return SchemaResult(len(errors) == 0, errors, False)

Pair this with response_format={"type": "json_schema", ...} if you're on OpenAI. That should prevent schema violations at generation time. In practice, structured outputs still occasionally drop optional fields or hallucinate enum values on long-running streams that get truncated mid-object. The detector catches those.

A common pattern: a tool-calling agent emits {"tool": "search_docs", "args": {"q": "..."}}. The model gets cute and emits {"tool": "search_documents", "args": {"q": "..."}}. JSON is valid, schema rejects because tool isn't in the enum, your tool-dispatcher silently returns empty results, the user gets a hallucinated answer based on no retrieval. Schema check fires, you see it.

Detector 4 — Self-consistency divergence

Sample an answer to the same prompt N times at temperature > 0. If the answers disagree past a threshold, the model is making things up. If they converge, the model is committed to its answer (which doesn't prove correctness, but proves it's not flipping coins).

This costs N× tokens, so don't run it on every trace. Run it as a sampling detector. Pick 1% of traces tagged with a "high-stakes" attribute (medical, financial, legal queries) and run consistency on those.
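One way to make that 1% pick deterministic is to hash the trace ID, so the same trace always gets the same decision across workers and replays. A minimal sketch; the function name and the high_stakes flag are assumptions about how your traces are tagged:

```python
import hashlib


def should_run_consistency(
    trace_id: str, high_stakes: bool, rate: float = 0.01
) -> bool:
    """Deterministically select roughly `rate` of high-stakes traces."""
    if not high_stakes:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # first 8 bytes as an integer in [0, 2**64), compared against
    # the matching fraction of that range
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


selected = [
    f"trace-{i}"
    for i in range(20_000)
    if should_run_consistency(f"trace-{i}", high_stakes=True)
]
```

Hashing instead of calling random() means a replayed or re-exported trace does not get sampled twice with different outcomes.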

import asyncio
from openai import AsyncOpenAI

_client = AsyncOpenAI()

async def _sample_once(messages: list[dict], model: str) -> str:
    resp = await _client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
        max_tokens=400,
    )
    return resp.choices[0].message.content or ""

async def consistency_score(
    messages: list[dict],
    n: int = 4,
    model: str = "gpt-4o-mini",
) -> float:
    """Sample N times, embed each sample, return mean pairwise
    cosine similarity. 1.0 = identical, <0.65 = divergent."""
    samples = await asyncio.gather(
        *(_sample_once(messages, model) for _ in range(n))
    )
    embs = _embedder.encode(samples, convert_to_tensor=True)
    sims: list[float] = []
    for i in range(n):
        for j in range(i + 1, n):
            sims.append(float(util.cos_sim(embs[i], embs[j]).item()))
    return sum(sims) / len(sims) if sims else 0.0

Gotcha: self-consistency catches fabrication on factual questions but misses hallucinations the model repeats consistently. Two responses can both be wrong and both phrase the lie the same way (because they share the same prior). It's good for "what year did X happen" questions, weak for "summarize this document" questions where consistent-but-wrong is the failure mode.

Wiring the detectors into an OTel span processor

The point of trace-layer detection is that it runs on spans the app already emits. The cleanest way to bolt this on is a custom SpanProcessor that fires after each LLM span finishes, runs the detectors in a background thread, and writes the results to a follow-up span in the same trace.

import json
import logging
from concurrent.futures import ThreadPoolExecutor
from opentelemetry.sdk.trace import SpanProcessor, ReadableSpan
from opentelemetry import trace

log = logging.getLogger("hallucination-detector")
_pool = ThreadPoolExecutor(max_workers=8)

class HallucinationSpanProcessor(SpanProcessor):
    def __init__(self, schema_registry: dict, baseline: dict):
        self.schema_registry = schema_registry  # operation -> schema
        self.baseline = baseline  # {"mean": float, "stdev": float}

    def on_start(self, span, parent_context=None):
        pass

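    def on_end(self, span: ReadableSpan):
        # only run on LLM spans. Convention: name starts with "llm."
        if not span.name.startswith("llm."):
            return
        # skip our own "llm.detector.*" result spans; otherwise each
        # result span would trigger another detector run, forever
        if span.name.startswith("llm.detector."):
            return
        _pool.submit(self._run_detectors, span)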

    def _run_detectors(self, span: ReadableSpan):
        try:
            attrs = dict(span.attributes or {})
            answer = attrs.get("llm.response.content", "")
            sources_json = attrs.get("rag.sources_json", "{}")
            tokens_json = attrs.get("llm.response.tokens_json", "[]")
            operation = attrs.get("llm.operation", "")
            sources = json.loads(sources_json)
            tokens = json.loads(tokens_json)

            findings: dict[str, object] = {}

            if sources:
                g = check_grounding(answer, sources)
                ungrounded = [r for r in g if not r.grounded]
                findings["grounding.ungrounded_count"] = len(ungrounded)
                findings["grounding.min_sim"] = (
                    min((r.similarity for r in g), default=1.0)
                )

            if tokens:
                z = confidence_anomaly_score(
                    tokens,
                    self.baseline["mean"],
                    self.baseline["stdev"],
                )
                findings["confidence.zscore"] = z

            schema = self.schema_registry.get(operation)
            if schema:
                s = check_schema(answer, schema)
                findings["schema.valid"] = s.valid
                if not s.valid:
                    findings["schema.errors"] = "; ".join(s.errors[:3])

            self._publish(span, findings)
        except Exception:
            log.exception("detector failed for span %s", span.name)

    def _publish(self, span: ReadableSpan, findings: dict):
        # write back as a follow-up span. You can't mutate a
        # finished span, but you CAN emit a child span that shares
        # its trace_id and uses the LLM span's span_id as parent.
        tracer = trace.get_tracer("hallucination-detector")
        ctx = trace.set_span_in_context(
            trace.NonRecordingSpan(span.get_span_context())
        )
        with tracer.start_as_current_span(
            "llm.detector.result", context=ctx
        ) as result_span:
            for k, v in findings.items():
                result_span.set_attribute(k, v)
            # fire your alerting hook here if thresholds are crossed
            if findings.get("grounding.ungrounded_count", 0) > 0:
                result_span.set_attribute("alert.fired", True)

    def shutdown(self):
        _pool.shutdown(wait=True)

    def force_flush(self, timeout_millis: int = 30_000):
        return True

The trick is the _publish step. You can't mutate a finished span in OTel. Once on_end is called, the span is read-only. So you emit a follow-up span in the same trace, parented to the original LLM span, and attach the detector results to it. Your backend (Honeycomb, Grafana Tempo, Jaeger, Langfuse) will show it nested under the original LLM span. Querying for alert.fired = true gives you the hallucination dashboard.

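from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# ANSWER_SCHEMA: your JSON Schema dict for this operation, defined elsewhere
provider = TracerProvider()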
provider.add_span_processor(
    HallucinationSpanProcessor(
        schema_registry={"answer_with_citations": ANSWER_SCHEMA},
        baseline={"mean": 1.42, "stdev": 0.38},
    )
)
trace.set_tracer_provider(provider)

The gotcha: calibrate against labelled traces or every alert is noise

Every detector here has a knob: a threshold, a z-score cutoff, a minimum similarity. Ship them with the defaults and you'll either page yourself constantly or never. Both are useless.

The calibration loop is straightforward but you have to do it:

1. Sample 200-500 traces from production over the last 2 weeks.
2. Have a human (you, or a labeler) classify each as clean or hallucinated.
3. Run all four detectors on every labelled trace.
4. For each detector, sweep the threshold and compute precision/recall at each setting.
5. Pick the threshold that gives you the precision your alerting can tolerate (usually 0.7, because you can handle 1 false positive per 3 alerts before fatigue kicks in).

from dataclasses import dataclass

@dataclass
class LabelledTrace:
    trace_id: str
    is_hallucination: bool
    grounding_min_sim: float

def calibrate_grounding(
    traces: list[LabelledTrace],
    candidate_thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, precision, recall) for
    treating min_sim < threshold as a hallucination flag."""
    rows = []
    for t in candidate_thresholds:
        tp = sum(
            1 for x in traces
            if x.grounding_min_sim < t and x.is_hallucination
        )
        fp = sum(
            1 for x in traces
            if x.grounding_min_sim < t and not x.is_hallucination
        )
        fn = sum(
            1 for x in traces
            if x.grounding_min_sim >= t and x.is_hallucination
        )
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall))
    return rows

Run that against your labelled set, look at the precision/recall curve, pick the elbow. Then store the chosen thresholds in a versioned config file (the same one your detector reads at startup). When you change a model version, re-calibrate, because the baselines move.
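If you'd rather not eyeball the curve, one mechanical rule is to take the highest-recall threshold that still meets your precision floor. A minimal sketch, assuming rows in the (threshold, precision, recall) shape that calibrate_grounding returns; pick_threshold is an illustrative helper, and this rule is one reasonable choice rather than the only way to find the elbow:

```python
from typing import Optional

Row = tuple[float, float, float]  # (threshold, precision, recall)


def pick_threshold(rows: list[Row], min_precision: float = 0.7) -> Optional[Row]:
    """Highest-recall row whose precision meets the floor.
    Ties on recall go to the higher precision. None if nothing qualifies."""
    eligible = [r for r in rows if r[1] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r[2], r[1]))


# toy sweep, not real calibration data
sweep = [
    (0.40, 0.90, 0.30),
    (0.50, 0.80, 0.55),
    (0.60, 0.65, 0.80),
]
chosen = pick_threshold(sweep)
```

Here 0.60 has the best recall but falls below the 0.7 precision floor, so 0.50 is chosen. A None result means no threshold is usable at that floor, and the detector probably should not page anyone yet.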

False-positive rates compound across detectors. If each of your four detectors has a 5% false positive rate and you alert on any of them firing, your trace-level false positive rate is about 18.5% (1 - 0.95^4), assuming the detectors misfire independently. Either alert only when 2+ detectors fire, or weight them: citation grounding and schema check fire as alerts, confidence anomaly and consistency divergence fire as Slack signals nobody pages on.
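The compounding is a binomial tail, so it is easy to check for your own detector count and rates. The sketch below assumes the detectors misfire independently and at the same rate, which real detectors only approximate. The detector names and the route helper are illustrative.

```python
from math import comb


def p_at_least_k(n: int, p: float, k: int) -> float:
    """Probability that at least k of n independent detectors, each
    with false positive rate p, fire on a clean trace."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


any_fires = p_at_least_k(4, 0.05, 1)  # alert on any detector
two_plus = p_at_least_k(4, 0.05, 2)   # alert only when 2+ agree

# weighted routing: low-noise detectors page, noisy ones go to Slack
PAGE_DETECTORS = {"grounding", "schema"}


def route(fired: set[str]) -> str:
    if fired & PAGE_DETECTORS:
        return "page"
    if fired:
        return "slack"
    return "none"
```

Under these assumptions, requiring two detectors to agree drops the clean-trace alert rate from about 18.5% to about 1.4%, at the cost of missing hallucinations that only one detector can see.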

The detectors are cheap. The discipline of calibrating them is what makes the whole pipeline useful instead of another source of noise.

Which detector would you ship first in your stack, and what's stopped you from shipping it already?

If this was useful

Hallucination detection is one slice of a larger trace-layer evals story. LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team walks through the trace schema choices that make detectors like these possible, the calibration workflow in full, and the tradeoffs across Langfuse, Arize Phoenix, Honeycomb, and rolling your own on OTel. The "online evals" chapter is the closest match to what's in this post.
