Eval-Set Drift: How to Know Your Golden Set Has Gone Stale
Gabriel Anha · 2026-05-24 · via DEV Community

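Your golden set was great in March. It is now December. Half the queries you see in production are not in the golden set. The dashboard still shows a 98% pass rate, but that number is a lie, because the set you are testing against stopped representing the real workload long ago.

This is eval-set drift. It is the silent mode of model regression: nothing breaks, the green checkmarks keep landing, and you are flying blind.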

How eval sets go stale

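Three things shift underneath you, often at the same time:

Domain shift. New product surfaces, new user segments, new integrations. An eval set built in Q1 for a support bot has nothing to say about the refund-automation flow that shipped in Q3. The eval set is not wrong; the world it covers has shrunk.

Distribution shift. Same domain, different mix. Maybe enterprise traffic grew from 10% to 45% because sales had a good quarter. Enterprise queries are longer, multi-clause, and heavy on insider jargon. Your eval set is still 90% short single-turn questions, because that is what March looked like.

Feature drift. The features being embedded have changed. You upgraded the embedding model. You changed the retrieval strategy. You updated the system prompt, so answers now depend on different context. The eval inputs are the same, but the vector representation (what your retrieval and reranking actually compute on) has moved.

The first is usually the one you notice, because people file tickets. The second and third are the dangerous ones. The pass rate stays flat. User satisfaction drops. The two metrics disagree because they measure different populations.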

A 60-line drift detector

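You don't need a research-grade tool. You need a number that goes up when production stops looking like the eval set. Maximum Mean Discrepancy (MMD) on embeddings is the right tool. It compares the distributions of two sets of vectors without assuming they are Gaussian, and the squared statistic has a clear interpretation: close to zero when the two samples come from the same source, large when they do not.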
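For reference, the unbiased estimator of squared MMD with an RBF kernel is commonly written as below. The symbols are notation chosen here for illustration, not taken from the article: $x_1, \dots, x_n$ are the eval-set embeddings, $y_1, \dots, y_m$ the production embeddings, and $\gamma$ the kernel bandwidth parameter.

```latex
\widehat{\mathrm{MMD}}^2_u(X, Y)
  = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j)
  + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j)
  - \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(x_i, y_j),
\qquad
k(a, b) = \exp\!\left(-\gamma \lVert a - b \rVert^2\right)
```

Because the diagonal terms are excluded, this estimator can come out slightly negative when both samples really do share a distribution; that is expected and should be read as "no detectable difference".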

Here is a working detector. Sentence-Transformers for the embeddings, an RBF kernel, about 60 lines:

# eval_drift.py
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import rbf_kernel

MODEL = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> np.ndarray:
    # normalize so kernel bandwidth is well-behaved across batches
    return MODEL.encode(texts, normalize_embeddings=True, batch_size=64)

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    # unbiased MMD^2 with an RBF kernel. zero = same distribution.
    kxx = rbf_kernel(x, x, gamma=gamma)
    kyy = rbf_kernel(y, y, gamma=gamma)
    kxy = rbf_kernel(x, y, gamma=gamma)
    n, m = len(x), len(y)
    # drop diagonal terms for the unbiased estimator
    np.fill_diagonal(kxx, 0); np.fill_diagonal(kyy, 0)
    return (kxx.sum() / (n * (n - 1))
            + kyy.sum() / (m * (m - 1))
            - 2 * kxy.mean())

def permutation_pvalue(x, y, gamma=1.0, n_perm=200) -> float:
    # how often does a random split give an MMD this large?
    pooled = np.vstack([x, y])
    observed = mmd_rbf(x, y, gamma)
    n = len(x)
    count = 0
    rng = np.random.default_rng(42)
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd_rbf(pooled[:n], pooled[n:], gamma) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

def detect_drift(eval_queries: list[str], prod_queries: list[str]):
    x = embed(eval_queries)
    y = embed(prod_queries)
    # median heuristic for kernel bandwidth — works for normalized embeddings
    from scipy.spatial.distance import pdist
    sigma = np.median(pdist(np.vstack([x, y]))) or 1.0
    gamma = 1.0 / (2 * sigma ** 2)
    mmd2 = mmd_rbf(x, y, gamma)
    pval = permutation_pvalue(x, y, gamma, n_perm=200)
    return {"mmd2": float(mmd2), "p_value": float(pval), "n_eval": len(x), "n_prod": len(y)}

if __name__ == "__main__":
    import json, sys
    eval_q = json.load(open(sys.argv[1]))      # golden set queries
    prod_q = json.load(open(sys.argv[2]))      # last 7 days of sampled prod queries
    print(json.dumps(detect_drift(eval_q, prod_q), indent=2))

A sample run on a real dataset:

$ python eval_drift.py golden.json prod_week.json
{
  "mmd2": 0.0382,
  "p_value": 0.005,
  "n_eval": 500,
  "n_prod": 1200
}

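A p-value below 0.05 together with a non-trivial MMD² means the two distributions are statistically distinguishable. That is your signal: the golden set is no longer a fair sample of production. If you want to avoid kernels, the alternative is to bin the embedding coordinates and compute KL divergence on the histograms. It is cheaper, noisier, and easier to explain to a PM. MMD is the better default.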
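The histogram alternative could look roughly like the sketch below. It is one possible reading of "bin the coordinates and compute KL", not the author's implementation: the function name `binned_kl`, the choice of 20 bins, the additive smoothing constant `eps`, and averaging the per-coordinate divergences are all assumptions made for illustration.

```python
import numpy as np

def binned_kl(x: np.ndarray, y: np.ndarray, bins: int = 20, eps: float = 1e-6) -> float:
    """Mean per-coordinate KL(P_x || P_y) over histograms with shared bin edges."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    kls = []
    for d in range(x.shape[1]):
        lo = min(x[:, d].min(), y[:, d].min())
        hi = max(x[:, d].max(), y[:, d].max())
        if hi == lo:
            # constant coordinate in both samples: no information, no divergence
            kls.append(0.0)
            continue
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(x[:, d], bins=edges)
        q, _ = np.histogram(y[:, d], bins=edges)
        # additive smoothing so empty bins do not produce log(0) or division by zero
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.mean(kls))
```

Treating each coordinate independently ignores correlations between embedding dimensions, which is part of why this variant is noisier than MMD.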

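There is a practical way to calibrate the number. Run the detector on two random halves of your golden set. This gives you your noise floor: the MMD² between two samples of roughly the same distribution. Drift gets interesting when the production-versus-eval value exceeds the noise floor by 3-5x, not when it crosses some arbitrary 0.01 threshold.

Slice-level drift: where the shift is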

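A single global MMD² number can tell you that something drifted, but not what. If your assistant handles billing, onboarding, refunds, API debugging, and password resets, the number may be high because one slice moved while the rest are fine. Refresh the whole eval set on that signal and you rework four slices to fix one.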

Run the detector per slice. The cheapest approach: tag every eval query and every sampled production query with its intent, then loop:

slices = {}
for slice_name in {q["intent"] for q in prod_queries}:
    eval_slice = [q["text"] for q in eval_queries if q["intent"] == slice_name]
    prod_slice = [q["text"] for q in prod_queries if q["intent"] == slice_name]
    if len(eval_slice) < 30 or len(prod_slice) < 30:
        slices[slice_name] = {"status": "insufficient_samples"}
        continue
    slices[slice_name] = detect_drift(eval_slice, prod_slice)

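Sort the slices by MMD². The top of the list is your refresh target. A common pattern: one slice at MMD² 0.08 with p-value 0.001, another at 0.04 with p-value 0.02, and the rest buried in the noise floor. Refresh the top two slices and leave the rest alone.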

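Secondary axes worth probing: query-length bucket (short/medium/long) and user tier (free/pro/enterprise). Drift often shows up here even when the intent distribution looks stable. A long enterprise query exercises a different retrieval path than a short free-tier "how do I reset my password" query, and your eval set probably under-samples the former.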

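A slice with no production samples is also a signal, but it means the opposite. If your eval set has a "legacy_v1_api" slice and production has not sent a query of that type in three months, the slice is dead weight. Archive it; stop scoring against it.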
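Ranking the slices and spotting dead ones can be sketched as follows. The helper names `rank_slices` and `dead_slices` are hypothetical; the input shapes assume the per-slice results dictionary and the `intent`-tagged query dictionaries used in the loop above. Note that the loop iterates over production intents only, so an eval-only slice never shows up in its output and needs a separate check like this one.

```python
def rank_slices(results: dict[str, dict]) -> list[tuple[str, float, float]]:
    """Return (slice, mmd2, p_value) sorted by MMD² descending.

    Slices marked insufficient_samples carry no "mmd2" key and are skipped.
    """
    scored = [
        (name, r["mmd2"], r["p_value"])
        for name, r in results.items()
        if "mmd2" in r
    ]
    return sorted(scored, key=lambda row: row[1], reverse=True)


def dead_slices(eval_queries: list[dict], prod_queries: list[dict]) -> list[str]:
    """Eval intents that never appear in the production sample: archive candidates."""
    eval_intents = {q["intent"] for q in eval_queries}
    prod_intents = {q["intent"] for q in prod_queries}
    return sorted(eval_intents - prod_intents)


# Illustrative values only; the slice names and numbers are made up for the example.
example_results = {
    "billing": {"mmd2": 0.04, "p_value": 0.02},
    "refunds": {"mmd2": 0.08, "p_value": 0.001},
    "onboarding": {"status": "insufficient_samples"},
}
refresh_order = [name for name, _, _ in rank_slices(example_results)]
```

Here `refresh_order` starts with the slice that has the largest MMD², which is the first refresh target.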

The refresh cadence

On the first of every month, block 90 minutes on the calendar.

A workable routine:

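  1. Pull 500 to 2,000 production queries from the last 30 days, stratified if you can, random if you cannot. Strip PII before anything else.
  2. Run the drift detector globally and per slice. Store the numbers, not to chase a one-off signal but to build a trend line.
  3. For every slice with MMD² above 3x the noise floor and p-value < 0.05, sample 30 to 50 new queries from that slice's production sample. Have a human (or a strong LLM as a first pass, followed by human review) write the reference answers.
  4. Append to the golden set; do not replace. Tag new rows with the month added and the slice. Keep the old rows.
  5. Re-run the full eval suite on the augmented set. Store the score. This is your new baseline going forward.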

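The append-don't-replace rule is what keeps the history traceable. If you replace 40% of the eval set every month, every point on the pass-rate chart is measured against a different population. You cannot tell whether a two-point drop means the model got worse or the eval set got harder. Append, tag the new rows, and compare the new model on the old and new segments separately, and you can attribute the change.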
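A minimal sketch of the append-and-tag rule, assuming each golden row is a dictionary. The field names `id`, `added`, and `slice`, and the helpers `append_rows` and `pass_rate_by_cohort`, are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def append_rows(golden: list[dict], new_rows: list[dict], month: str, slice_name: str) -> list[dict]:
    """Return a new golden set with tagged rows appended; existing rows are untouched."""
    tagged = [{**row, "added": month, "slice": slice_name} for row in new_rows]
    return golden + tagged


def pass_rate_by_cohort(rows: list[dict], passed: dict[str, bool]) -> dict[str, float]:
    """Pass rate per 'added' month, so old and new rows are never averaged together."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["added"]] += 1
        hits[row["added"]] += int(passed[row["id"]])
    return {month: hits[month] / totals[month] for month in totals}
```

Reporting one pass rate per cohort is what lets a drop be attributed either to the model (the old cohort falls too) or to the harder new rows (only the new cohort is low).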

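When to rewrite and when to extend

Extend every month. Perhaps twice a year, rewrite a slice from scratch. It is time to rewrite when: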

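  • The domain changed so much that the expected behavior on the old slice is wrong, not merely unrepresentative. If the refund policy changed, last year's golden answers are now inaccurate; stop grading against them.
  • You switched model families (3.5 → 4 → 5, or moved to another vendor) and the shape and structure of the responses differ markedly. The old golden outputs were written for a model with a different default verbosity, and the judge prompt no longer grades them fairly.
  • The slice's MMD² keeps creeping up despite months of additions. That is a sign the slice definition itself has fragmented and the slice should be split.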

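Even when you rewrite, freeze the old slice as slice_v1_archived. Re-run it every quarter. This is your canary. If slice_v1_archived drops sharply, something fundamental changed in the model or the prompt that the new eval set did not catch.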

The trap: refresh too often and baselines stop being comparable

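The biggest mistake teams make when they first measure drift is overcorrecting. Drift is detected, the whole eval set gets refreshed, a new baseline is set, and the old one is forgotten. Three months later leadership asks, "Are we better than we were in Q2?" and nobody can answer, because the Q2 numbers were measured against an eval set that no longer exists.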

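The fix is boring, but it works. Version the eval set (golden_2026_03, golden_2026_04, ...). On every model release, score it against this month's set and against the set from six months ago. Two numbers per release. The current number tells you how you are doing on today's traffic. The six-month number tells you whether you regressed against the original validation baseline. If both go up, you win. If the current number goes up and the old one goes down, you traded old traffic for new; that may be the right call, but it should be a conscious tradeoff, not an accident.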
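The two-numbers-per-release rule can be sketched as below, assuming version names follow the golden_YYYY_MM pattern shown above. The helpers `version_months_back` and `read_release` are hypothetical, and the "investigate" fallback covers cases the article does not discuss.

```python
def version_months_back(version: str, months: int) -> str:
    """'golden_2026_04' with months=6 -> 'golden_2025_10'."""
    prefix, year, month = version.rsplit("_", 2)
    # count months from year 0 so the subtraction handles year boundaries
    index = int(year) * 12 + (int(month) - 1) - months
    return f"{prefix}_{index // 12:04d}_{index % 12 + 1:02d}"


def read_release(current_delta: float, old_delta: float) -> str:
    """Interpret score changes on this month's set and on the six-month-old set."""
    if current_delta > 0 and old_delta > 0:
        return "win"
    if current_delta > 0 and old_delta < 0:
        return "tradeoff: better on new traffic, worse on old"
    # assumption: anything else (flat or falling current score) needs a human look
    return "investigate"
```

A release pipeline would score the model on `version` and on `version_months_back(version, 6)`, then pass the two score changes to `read_release`.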

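The other temptation is setting the drift threshold so tight that it fires every week. The detector's job is to flag meaningful distribution change, not noise. Calibrate it against the noise floor from two random eval splits. Set the alert threshold at 3-5x that. Any tighter and you land in an unstable regime where the eval set is a slightly different population every month and the trend line means nothing.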
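The calibration described above can be sketched as follows. This is a numpy-only illustration rather than the article's detector: `mmd2_unbiased` re-implements the same unbiased RBF estimator so the block runs without scikit-learn, and `noise_floor`, `should_alert`, the use of 20 random splits, the 95th percentile, and the clamp at zero are assumptions made for the example.

```python
import numpy as np


def mmd2_unbiased(x: np.ndarray, y: np.ndarray, gamma: float) -> float:
    """Unbiased MMD² with an RBF kernel exp(-gamma * ||a - b||^2)."""
    def kernel(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

    kxx, kyy, kxy = kernel(x, x), kernel(y, y), kernel(x, y)
    np.fill_diagonal(kxx, 0.0)  # drop self-similarity terms
    np.fill_diagonal(kyy, 0.0)
    n, m = len(x), len(y)
    return float(kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1)) - 2.0 * kxy.mean())


def noise_floor(golden: np.ndarray, gamma: float, n_splits: int = 20, seed: int = 0) -> float:
    """95th percentile of MMD² between random halves of the golden-set embeddings."""
    rng = np.random.default_rng(seed)
    half = len(golden) // 2
    values = []
    for _ in range(n_splits):
        idx = rng.permutation(len(golden))
        values.append(mmd2_unbiased(golden[idx[:half]], golden[idx[half:]], gamma))
    # the unbiased estimator can dip below zero; a negative floor would be meaningless
    return max(float(np.percentile(values, 95)), 0.0)


def should_alert(mmd2: float, floor: float, factor: float = 3.0) -> bool:
    """Flag drift only when MMD² exceeds the noise floor by the given factor."""
    return mmd2 > factor * floor
```

Using several splits instead of one makes the floor less sensitive to a lucky or unlucky split; `factor` corresponds to the 3-5x range suggested above.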

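A stale golden set fails silently: a green dashboard covering work that nobody is verifying. A refreshed one tells you the truth every month, and that is worth the 90 minutes.

Pick one slice this week. Pull 200 queries, run the detector, and write down the MMD² and the noise floor from two random eval splits. By the afternoon you will know whether you are flying with your eyes open or measuring fossils.

What is the longest you have gone before refreshing an eval set, and what finally forced the change?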


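If this was useful

The drift-detection workflow above is one piece of a larger eval system: sampling strategy, judge prompts, slice taxonomy, dashboard wiring. The evals chapter of the LLM Observability Pocket Guide walks the full pipeline, including how to choose between offline batch evals, online A/B tests, and continuous shadow scoring. If you are building your own evals, or trying to make an existing setup trustworthy, that chapter is the one to read first.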

LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team