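- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub: an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub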
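You can't catch every hallucination at runtime. Running a fact-checking model inline multiplies your cost, pushes p99 up, and burns budget on requests that were fine to begin with. What you can do is catch four classes of hallucination cheaply, after the fact, in the trace stream. The detectors run on spans your app already emits. The user gets the answer at normal latency. The detector flags the bad ones a few seconds later, and the Slack ping arrives before the support ticket does.
This post walks through four detectors that have paid off in real LLM products: citation grounding, confidence anomaly, schema violation, and self-consistency divergence. Each is 30 to 80 lines of Python. Then we wire them into an OpenTelemetry SpanProcessor so they run asynchronously on the traces your app already emits. Then we calibrate them, because without calibration every one of them lies.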
Why trace-layer detection beats inline checks
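The inline approach: before returning the LLM response to the user, run a second model to judge whether it is hallucinated. NeMo Guardrails, Lynx, and most "evals-as-a-service" offerings work this way. The math is not in your favor.
Say each user request is one LLM call (about 400 ms). Inline detection adds a second LLM call to score grounding (about 600 ms, because grounding models tend to be slower and take the full context). Your median latency goes from 400 ms to 1,000 ms. Your p99 doubles, to around two seconds. Token cost triples. You pay this on 100% of requests, when perhaps 2% of them are hallucinations worth catching.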
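To make that arithmetic concrete, here is a minimal sketch using the figures above (400 ms generation, 600 ms grounding call, 2% hallucination rate). The function name and the assumption that the two calls run strictly one after the other are mine; plug in your own numbers.

```python
def inline_overhead(base_ms: float, check_ms: float, halluc_rate: float) -> dict:
    """Latency and wasted-work figures for a sequential inline check."""
    total_ms = base_ms + check_ms
    return {
        # Assumes the check runs after generation, so latencies add up.
        "median_ms": total_ms,
        "slowdown": total_ms / base_ms,
        # Share of checks spent on responses that were fine.
        "wasted_check_share": 1.0 - halluc_rate,
        # Checks you pay for per hallucination actually present.
        "checks_per_hallucination": 1.0 / halluc_rate,
    }


figures = inline_overhead(base_ms=400, check_ms=600, halluc_rate=0.02)
for name, value in figures.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the median is 2.5 times slower, and roughly 50 grounding calls are paid for each hallucination that is actually there to find.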
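Trace-layer detection runs afterwards. The response has already shipped. The user saw the answer at native speed. Your span processor picks up the finished span, runs the detectors asynchronously, and writes flags back to your observability backend or to a separate "incidents" queue. If a detector fires, you find out, and the user experience takes no hit.
The honest trade-off is that you can't block the bad response. The user has already seen it. That sounds bad at first, but compared with the alternative it isn't. Catching 80% of hallucinations and apologizing for them beats throttling 100% of your traffic in the hope of catching the occasional one. For most products, low latency is worth more than a promise of zero hallucinations.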
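Detector 1: Citation grounding
The most common hallucination in RAG systems: the model cites a source that does not actually support the claim. It looks plausible (there is a citation), but the cited passage says something different, or has nothing to do with the claim at all.
Citation grounding is a text/embedding check between each cited passage and the sentence that cites it. The basic version needs no additional LLM call.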
import re
from dataclasses import dataclass

from sentence_transformers import SentenceTransformer, util

# Load once at module level, not per request.
_embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")


@dataclass
class GroundingResult:
    sentence: str
    citation_id: str
    similarity: float
    grounded: bool


CITATION_RE = re.compile(r"\[(\d+)\]")


def check_grounding(
    answer: str,
    sources: dict[str, str],
    threshold: float = 0.55,
) -> list[GroundingResult]:
    """For each sentence with [n] citations, check that the cited
    source actually contains the claim. Returns one row per (sentence,
    citation) pair so you can see which citation in a multi-cite
    sentence is the weak one."""
    results: list[GroundingResult] = []
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    for sentence in sentences:
        cite_ids = CITATION_RE.findall(sentence)
        if not cite_ids:
            continue
        # Strip citation markers before embedding; they are noise.
        clean = CITATION_RE.sub("", sentence).strip()
        sent_emb = _embedder.encode(clean, convert_to_tensor=True)
        for cid in cite_ids:
            src_text = sources.get(cid, "")
            if not src_text:
                results.append(GroundingResult(clean, cid, 0.0, False))
                continue
            src_emb = _embedder.encode(src_text, convert_to_tensor=True)
            sim = float(util.cos_sim(sent_emb, src_emb).item())
            results.append(
                GroundingResult(clean, cid, sim, sim >= threshold)
            )
    return results
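The 0.55 threshold is a starting point, not a rule. You will move it after calibration. With BGE embeddings, cosine similarity between a factual claim and a passage that supports it typically lands between 0.4 and 0.85 in practice. Below 0.4 is almost always fabricated; above 0.7 is almost always grounded. The grey zone in between needs labelled traces to tune (we'll get to that).
Gotcha: if the cited passage is long, embedding the whole thing dilutes the signal. The claim may match only one section of a 2,000-word source. Cosine similarity against the full document then gives a middling number and you miss the match. A better approach: split the source into sentence windows, embed each window, and take the maximum similarity across windows.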
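The windowing fix can be sketched as follows. This is a minimal illustration, not the post's implementation: `sentence_windows`, `max_window_similarity`, and `toy_embed` are names invented here, and `toy_embed` is a hashed bag-of-words stand-in so the example runs without downloading a model. In real use you would pass something like `lambda t: _embedder.encode(t)` as `embed`.

```python
import math
import re
import zlib
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sentence_windows(text: str, size: int = 3) -> list[str]:
    """Overlapping windows of `size` consecutive sentences (stride 1)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= size:
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + size])
        for i in range(len(sentences) - size + 1)
    ]


def max_window_similarity(
    claim: str,
    source: str,
    embed: Callable[[str], Vector],
    size: int = 3,
) -> float:
    """Best claim-to-window similarity; 0.0 for an empty source."""
    claim_vec = embed(claim)
    return max(
        (cosine(claim_vec, embed(w)) for w in sentence_windows(source, size)),
        default=0.0,
    )


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in embedder (hashed word counts), for demonstration only."""
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vec


source = (
    "The cache is flushed nightly. Billing runs on the first of the month. "
    "Refunds are issued within 14 days. Support is open on weekdays."
)
claim = "Refunds are issued within 14 days."
print("whole document:", cosine(toy_embed(claim), toy_embed(source)))
print("best window:   ", max_window_similarity(claim, source, toy_embed, size=1))
```

The cost is one embedding per window instead of one per source, so cache window embeddings per source if the same documents are cited often.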
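Detector 2: Confidence anomaly from logprobs
When a model fabricates a fact it has never seen, its log-probability distribution often looks odd. Confident-and-wrong is the worse problem, but confident-and-wrong has a signature of its own: flat per-token entropy. The model commits hard to tokens it would normally hedge on.
You need logprobs in the response. OpenAI exposes them via logprobs=True, top_logprobs=5. Anthropic's API, last I checked, does not, so this detector only works with providers that expose them.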
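The scoring code in this section works on plain dicts, so the provider response has to be flattened first. A minimal sketch, assuming the OpenAI Python SDK's chat completions response, where per-token logprobs live under `choices[0].logprobs.content`. The helper name `logprobs_to_tokens` is hypothetical, and the fake response at the bottom exists only so the example runs offline.

```python
import json
from types import SimpleNamespace


def logprobs_to_tokens(content) -> list[dict]:
    """Flatten per-token logprob objects into JSON-serialisable dicts."""
    tokens = []
    for item in content or []:
        tokens.append({
            "token": item.token,
            "logprob": item.logprob,
            "top_logprobs": [
                {"token": alt.token, "logprob": alt.logprob}
                for alt in (item.top_logprobs or [])
            ],
        })
    return tokens


# With a real client the call would look roughly like this:
#   resp = client.chat.completions.create(
#       model=..., messages=..., logprobs=True, top_logprobs=5)
#   tokens = logprobs_to_tokens(resp.choices[0].logprobs.content)

# Offline stand-in that mimics the attribute layout of one token entry.
fake_content = [
    SimpleNamespace(
        token="Paris",
        logprob=-0.1,
        top_logprobs=[
            SimpleNamespace(token="Paris", logprob=-0.1),
            SimpleNamespace(token="Lyon", logprob=-2.5),
        ],
    )
]
tokens = logprobs_to_tokens(fake_content)
print(json.dumps(tokens))
```

The JSON string is what you would store on the span (the processor later in this post reads it from the `llm.response.tokens_json` attribute).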
import math
from statistics import mean


def token_entropy(top_logprobs: list[dict]) -> float:
    """Shannon entropy over the top-k token distribution at one
    position. High entropy = model is unsure. Low entropy = model
    is committed."""
    probs = [math.exp(t["logprob"]) for t in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    norm = [p / total for p in probs]
    return -sum(p * math.log(p) for p in norm if p > 0)


def confidence_anomaly_score(
    tokens: list[dict],
    baseline_mean_entropy: float,
    baseline_stdev: float,
) -> float:
    """Z-score of this response's mean token entropy against
    the baseline you computed on labelled good traces. Returns
    abs(z). Above 2.5 is worth flagging."""
    if not tokens:
        return 0.0
    per_token = [
        token_entropy(t.get("top_logprobs", []))
        for t in tokens
        if t.get("top_logprobs")
    ]
    if not per_token:
        return 0.0
    response_entropy = mean(per_token)
    if baseline_stdev == 0:
        return 0.0
    return abs(response_entropy - baseline_mean_entropy) / baseline_stdev
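Compute baseline_mean_entropy and baseline_stdev weekly from your labelled-good traces. Store them in Redis or in a YAML file your pipeline loads. Each trace's score is then a single z-score.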
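The weekly baseline job might look like the sketch below. It assumes each labelled-good trace is a list of token dicts in the same shape the scoring function takes (each with a `top_logprobs` list). `compute_baseline` is a name introduced here, and the entropy helper is repeated only to keep the example self-contained.

```python
import math
from statistics import mean, stdev


def _entropy(top_logprobs: list[dict]) -> float:
    """Shannon entropy of the renormalised top-k distribution."""
    probs = [math.exp(t["logprob"]) for t in top_logprobs]
    total = sum(probs)
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def compute_baseline(good_traces: list[list[dict]]) -> dict[str, float]:
    """Mean and sample stdev of per-response mean token entropy."""
    per_response = []
    for tokens in good_traces:
        entropies = [
            _entropy(t["top_logprobs"])
            for t in tokens
            if t.get("top_logprobs")
        ]
        if entropies:
            per_response.append(mean(entropies))
    if len(per_response) < 2:
        raise ValueError("need at least two usable traces to estimate a stdev")
    return {"mean": mean(per_response), "stdev": stdev(per_response)}


# Two tiny synthetic responses: a 50/50 token and a fully certain token.
two_way = [{"top_logprobs": [{"logprob": math.log(0.5)}, {"logprob": math.log(0.5)}]}]
certain = [{"top_logprobs": [{"logprob": 0.0}]}]
print(compute_baseline([two_way, certain]))
```

The returned keys match the `{"mean": ..., "stdev": ...}` baseline dict the span processor takes, so the result can be serialised as-is to Redis or YAML.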
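The signal-to-noise ratio is worse here than for citation grounding. A z of 2.5 means "this response's mean token entropy is 2.5 standard deviations away from baseline." That catches confident hallucinations, but it also catches responses that are legitimate and merely unusual (the answer to "2+2" is more certain than a typical answer). Use this detector as a tie-breaker or as a filter over the other detectors, not as a primary alert.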
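Detector 3: Schema and format violations
If your prompt asks for structured JSON, any output that does not match the schema is a hallucination of intent. Even when the content is correct, a broken schema shows the model did not follow the contract, and downstream code will either crash or silently drop the field.
This is the cheapest of the four and has the lowest false-positive rate. Always run it.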
import json
from dataclasses import dataclass

from jsonschema import Draft202012Validator


@dataclass
class SchemaResult:
    valid: bool
    errors: list[str]
    parse_failed: bool


def check_schema(raw_output: str, schema: dict) -> SchemaResult:
    """Validate the model's raw text against a JSON Schema. Both
    parse-fail and validation-fail count as hallucinations of
    intent: the model didn't follow the contract."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return SchemaResult(False, [f"json parse: {e}"], True)
    validator = Draft202012Validator(schema)
    # One "path: message" string per violation; "<root>" when the
    # error is at the top level of the document.
    errors = [
        f"{'.'.join(str(p) for p in e.absolute_path) or '<root>'}: "
        f"{e.message}"
        for e in validator.iter_errors(parsed)
    ]
    return SchemaResult(len(errors) == 0, errors, False)
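If you are on OpenAI, pair this with response_format={"type": "json_schema", ...}. That should prevent schema violations at generation time. In practice, structured outputs still occasionally drop optional fields, or hallucinate enum values when a long stream is truncated. The detector catches those.
A common pattern: a tool-calling prompt is supposed to emit {"tool": "search_docs", "args": {"q": "..."}}. The model gets creative and emits {"tool": "search_documents", "args": {"q": "..."}} instead. The JSON is valid, but the schema rejects it because the tool value is not in the enum. Without the check, your tool dispatcher silently returns nothing and the user gets a hallucinated answer built on no results. With it, the schema check fires and you can see what happened.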
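A schema along these lines would catch that drift. Treat it as an illustration: the second tool name (`get_ticket`) and the `enum_violation` helper are hypothetical. The helper checks only the `enum` keyword so the sketch runs on the standard library alone; `check_schema` above, which uses jsonschema, validates the schema as a whole.

```python
import json
from typing import Optional

# Hypothetical tool-call schema: `tool` must be one of the registered names.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["tool", "args"],
    "additionalProperties": False,
    "properties": {
        "tool": {"enum": ["search_docs", "get_ticket"]},
        "args": {"type": "object"},
    },
}


def enum_violation(raw_output: str, schema: dict, field: str) -> Optional[str]:
    """Return an error message if `field` holds a value outside its enum."""
    value = json.loads(raw_output).get(field)
    allowed = schema["properties"][field]["enum"]
    if value not in allowed:
        return f"{field}: {value!r} is not one of {allowed}"
    return None


good = '{"tool": "search_docs", "args": {"q": "refund policy"}}'
drifted = '{"tool": "search_documents", "args": {"q": "refund policy"}}'
print(enum_violation(good, TOOL_CALL_SCHEMA, "tool"))
print(enum_violation(drifted, TOOL_CALL_SCHEMA, "tool"))
```

Passing `TOOL_CALL_SCHEMA` to `check_schema` should flag the drifted call in the same way, with the error path pointing at `tool`.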
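Detector 4: Self-consistency divergence
Ask the same question N times at temperature > 0. If the answers diverge beyond a threshold, the model is making things up. If they converge, the model is committed to its answer (which does not prove the answer is right, but does show it is not a coin flip).
This costs N× the tokens, so do not run it on every trace. Run it as a sampled probe: take 1% of the traces that carry a "high-risk" attribute (medical, financial, or legal questions) and run the consistency check on those.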
import asyncio

from openai import AsyncOpenAI

# Reuses `_embedder` and `util` from the citation-grounding detector.
_client = AsyncOpenAI()


async def _sample_once(messages: list[dict], model: str) -> str:
    resp = await _client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.7,
        max_tokens=400,
    )
    return resp.choices[0].message.content or ""


async def consistency_score(
    messages: list[dict],
    n: int = 4,
    model: str = "gpt-4o-mini",
) -> float:
    """Sample N times, embed each sample, return mean pairwise
    cosine similarity. 1.0 = identical, <0.65 = divergent."""
    samples = await asyncio.gather(
        *(_sample_once(messages, model) for _ in range(n))
    )
    embs = _embedder.encode(samples, convert_to_tensor=True)
    sims: list[float] = []
    for i in range(n):
        for j in range(i + 1, n):
            sims.append(float(util.cos_sim(embs[i], embs[j]).item()))
    return sum(sims) / len(sims) if sims else 0.0
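Caveat: self-consistency catches hallucinations on factual questions but misses systematic ones, where every sample is wrong and states the same falsehood in the same way (because the samples share the same priors). It works well for "what year did X happen" and poorly for "summarize this document", where the failure mode is being wrong but consistent.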
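The sampled probe can be gated with a deterministic hash of the trace ID, so the same trace always gets the same decision and no shared random state is needed. A minimal sketch, assuming your spans carry some risk-topic attribute; the `HIGH_RISK_TOPICS` values and the `should_probe` helper are names made up for illustration.

```python
import hashlib
from typing import Optional

# Hypothetical values of a "risk topic" span attribute.
HIGH_RISK_TOPICS = {"medical", "financial", "legal"}


def should_probe(
    trace_id: str,
    risk_topic: Optional[str],
    rate: float = 0.01,
) -> bool:
    """True for roughly `rate` of the high-risk traces, chosen by hash."""
    if risk_topic not in HIGH_RISK_TOPICS:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


hits = sum(should_probe(f"trace-{i}", "medical") for i in range(20_000))
print(f"probed {hits} of 20000 high-risk traces")
```

Only when `should_probe` returns True would you pay for the N extra samples in `consistency_score`.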
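Wiring the detectors into an OTel span processor
The whole point of trace-layer detection is that it runs on spans your application already emits. The cleanest way to do that is a custom SpanProcessor that fires when each LLM span ends, runs the detectors on a background thread, and records the results as span attributes on a follow-up span.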
import json
import logging
from concurrent.futures import ThreadPoolExecutor

from opentelemetry import trace
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

# check_grounding, confidence_anomaly_score and check_schema are the
# detector functions defined earlier in this post.

log = logging.getLogger("hallucination-detector")
_pool = ThreadPoolExecutor(max_workers=8)
RESULT_SPAN_NAME = "llm.detector.result"


class HallucinationSpanProcessor(SpanProcessor):
    def __init__(self, schema_registry: dict, baseline: dict):
        self.schema_registry = schema_registry  # operation -> schema
        self.baseline = baseline  # {"mean": float, "stdev": float}

    def on_start(self, span, parent_context=None):
        pass

    def on_end(self, span: ReadableSpan):
        # Only run on LLM spans (convention: name starts with "llm.").
        # Skip our own result spans: they share the prefix and would
        # otherwise re-trigger this processor in a loop.
        if not span.name.startswith("llm.") or span.name == RESULT_SPAN_NAME:
            return
        _pool.submit(self._run_detectors, span)

    def _run_detectors(self, span: ReadableSpan):
        try:
            attrs = dict(span.attributes or {})
            answer = attrs.get("llm.response.content", "")
            sources_json = attrs.get("rag.sources_json", "{}")
            tokens_json = attrs.get("llm.response.tokens_json", "[]")
            operation = attrs.get("llm.operation", "")
            sources = json.loads(sources_json)
            tokens = json.loads(tokens_json)
            findings: dict[str, object] = {}
            if sources:
                g = check_grounding(answer, sources)
                ungrounded = [r for r in g if not r.grounded]
                findings["grounding.ungrounded_count"] = len(ungrounded)
                findings["grounding.min_sim"] = (
                    min((r.similarity for r in g), default=1.0)
                )
            if tokens:
                z = confidence_anomaly_score(
                    tokens,
                    self.baseline["mean"],
                    self.baseline["stdev"],
                )
                findings["confidence.zscore"] = z
            schema = self.schema_registry.get(operation)
            if schema:
                s = check_schema(answer, schema)
                findings["schema.valid"] = s.valid
                if not s.valid:
                    findings["schema.errors"] = "; ".join(s.errors[:3])
            # Nothing ran, nothing to report: don't emit an empty span.
            if findings:
                self._publish(span, findings)
        except Exception:
            log.exception("detector failed for span %s", span.name)

    def _publish(self, span: ReadableSpan, findings: dict):
        # Write back as a follow-up span. You can't mutate a finished
        # span, but you CAN emit a child span that shares its trace_id
        # and uses the finished span as its parent.
        tracer = trace.get_tracer("hallucination-detector")
        ctx = trace.set_span_in_context(
            trace.NonRecordingSpan(span.get_span_context())
        )
        with tracer.start_as_current_span(
            RESULT_SPAN_NAME, context=ctx
        ) as result_span:
            for k, v in findings.items():
                result_span.set_attribute(k, v)
            # Fire your alerting hook here if thresholds are crossed.
            if findings.get("grounding.ungrounded_count", 0) > 0:
                result_span.set_attribute("alert.fired", True)

    def shutdown(self):
        _pool.shutdown(wait=True)

    def force_flush(self, timeout_millis: int = 30_000):
        return True
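The subtle part is the _publish step. Once on_end has been called, the span is read-only and cannot be mutated. So you emit a new span that shares the trace ID and carries the detection results. Your backend (Honeycomb, Grafana Tempo, Jaeger, Langfuse) will show it next to the original LLM span. Query for alert.fired = true and you have a hallucination dashboard.
Register the processor on your TracerProvider at startup: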
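from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# ANSWER_SCHEMA is your own JSON Schema dict for that operation.
# Replace the baseline numbers with the values from your calibration run.
provider = TracerProvider()
provider.add_span_processor(
    HallucinationSpanProcessor(
        schema_registry={"answer_with_citations": ANSWER_SCHEMA},
        baseline={"mean": 1.42, "stdev": 0.38},
    )
)
trace.set_tracer_provider(provider)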
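The trick: calibrate on labelled traces, or every alert is a false alarm
Every detector here has a knob: a threshold, a z-score cutoff, a minimum similarity. Ship with the defaults and you will either page yourself constantly or never catch anything. Neither is useful.
The calibration loop is simple, but it is not optional.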
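- Sample 200 to 500 traces from the last two months of production.
- Have a human (you or a labeller) classify each one as clean or hallucinated.
- Run all four detectors over the labelled traces.
- For each detector, sweep the threshold and compute precision and recall at each setting.
- Pick the threshold that meets the precision your alerting can tolerate (usually around 0.7, since roughly one wrong alert in three is bearable before fatigue sets in).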
from dataclasses import dataclass


@dataclass
class LabelledTrace:
    trace_id: str
    is_hallucination: bool
    grounding_min_sim: float


def calibrate_grounding(
    traces: list[LabelledTrace],
    candidate_thresholds: list[float],
) -> list[tuple[float, float, float]]:
    """For each threshold, return (threshold, precision, recall) for
    treating min_sim < threshold as a hallucination flag."""
    rows = []
    for t in candidate_thresholds:
        # Flagged and actually hallucinated.
        tp = sum(
            1 for x in traces
            if x.grounding_min_sim < t and x.is_hallucination
        )
        # Flagged but clean.
        fp = sum(
            1 for x in traces
            if x.grounding_min_sim < t and not x.is_hallucination
        )
        # Hallucinated but not flagged.
        fn = sum(
            1 for x in traces
            if x.grounding_min_sim >= t and x.is_hallucination
        )
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, precision, recall))
    return rows
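Run this over your labelled set, look at the precision/recall curve, and pick the knee. Then store the chosen threshold in a versioned config file that the detector reads at startup. Recalibrate every time you change model versions, because the baseline shifts too.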
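If you would rather not eyeball the curve, one simple rule is to take the highest-recall threshold that still meets your precision floor. Note that this is a different, cruder rule than finding the knee. The sketch below assumes rows in the `(threshold, precision, recall)` shape that `calibrate_grounding` returns; `pick_threshold` is a hypothetical helper, and the example rows are invented placeholders, not measurements.

```python
from typing import Optional

Row = tuple[float, float, float]  # (threshold, precision, recall)


def pick_threshold(rows: list[Row], min_precision: float = 0.7) -> Optional[Row]:
    """Highest-recall row whose precision meets the floor, else None."""
    eligible = [r for r in rows if r[1] >= min_precision]
    if not eligible:
        return None
    # Prefer recall; break ties on precision.
    return max(eligible, key=lambda r: (r[2], r[1]))


# Invented numbers, only to show the selection logic.
example_rows = [
    (0.40, 0.90, 0.35),
    (0.50, 0.78, 0.60),
    (0.55, 0.71, 0.72),
    (0.60, 0.58, 0.85),
]
print(pick_threshold(example_rows))
```

A `None` result means no candidate threshold is precise enough, which usually says more about the detector or the labels than about the sweep.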
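False-positive rates compound across detectors. If each of your four detectors has a 5% false-positive rate and you alert when any one of them fires, your per-trace false-positive rate is about 18% (1 − 0.95⁴, assuming the detectors err independently). Either alert only when two or more detectors agree, or weight them: citation grounding and schema checks page someone, while confidence anomaly and consistency divergence post to a Slack channel that nobody is on call for.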
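Both options can be worked through in a few lines. This is a sketch under the independence assumption, which real detectors will only approximate; the detector keys and the `route_alert` helper are hypothetical names.

```python
from itertools import product
from math import prod


def at_least_k_fpr(rates: list[float], k: int) -> float:
    """P(at least k detectors fire on a clean trace), assuming the
    detectors' false positives are independent."""
    total = 0.0
    for outcome in product([False, True], repeat=len(rates)):
        if sum(outcome) >= k:
            total += prod(
                r if fired else 1.0 - r
                for r, fired in zip(rates, outcome)
            )
    return total


# Hypothetical detector keys, split by how loudly they should alert.
PAGE_DETECTORS = {"grounding", "schema"}
NOTIFY_DETECTORS = {"confidence", "consistency"}


def route_alert(fired: set[str]) -> str:
    """Page for the high-precision detectors, Slack for the noisy ones."""
    if fired & PAGE_DETECTORS:
        return "page"
    if fired & NOTIFY_DETECTORS:
        return "slack"
    return "none"


rates = [0.05, 0.05, 0.05, 0.05]
print(f"alert on any one: {at_least_k_fpr(rates, 1):.4f}")
print(f"alert on two or more: {at_least_k_fpr(rates, 2):.4f}")
print(route_alert({"confidence"}), route_alert({"schema", "confidence"}))
```

Under these assumptions, requiring agreement between two detectors drops the false-positive rate from about 18.5% to about 1.4%, at the price of missing hallucinations that only one detector sees.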
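The detectors are cheap. Calibration is what makes the pipeline useful rather than just noisier.
Which detector would you run first in your stack? And why aren't you running it already?
If this was useful
Hallucination detection is one slice of the larger trace-layer evaluation picture. The LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team covers the tracing conventions, how the detectors compose, the calibration workflow, and the trade-offs between Langfuse, Arize Phoenix, Honeycomb, and rolling your own on OTel. The chapter on "online evals" pairs most directly with this post.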