























Abstract: Selective prediction with distribution-free risk control promises that, with confidence 1-delta over the calibration draw, the error rate on accepted inputs stays below a user-specified budget alpha. We audit this promise on signal-domain detectors, namely machine anomalous-sound detection (ASD) and AI-generated-image forensics, for four calibration rules: uncertified empirical thresholding (NAIVE) and three certified upper confidence bounds, Hoeffding, Clopper-Pearson (CP), and betting (WSR). We report three findings. (i) NAIVE thresholding, common in practice, exceeds its declared budget in 49-73% of synthetic trials (n=200 calibration points) and in up to 68% of real-data splits. This is a false sense of safety rather than a broken theorem, since the rule never carried a certificate. (ii) Tightness matters: CP and WSR certify substantial coverage where Hoeffding certifies none, with zero observed budget overruns under exchangeable splits. (iii) Under grouped deployment (unseen machine types or generators), certified rules overrun in 9-30% of trials, far above delta. The failure therefore lies in the broken exchangeability premise, not in the bounds; a conservative per-group threshold restores validity at a severe coverage cost.
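To make the difference between the calibration rules concrete, the following is a minimal sketch, not the authors' code, that compares the NAIVE empirical error rate with the Hoeffding and Clopper-Pearson upper confidence bounds at a single, pre-specified acceptance threshold. The function names and the values k=5, delta=0.1, and alpha=0.05 are illustrative assumptions; only n=200 appears in the abstract. The WSR betting bound is omitted, and the abstract does not say how the paper selects among candidate thresholds, so no threshold search is shown.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly with the standard library."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def naive_estimate(k, n):
    """Uncertified rule: the raw empirical error rate among accepted points."""
    return k / n


def hoeffding_ucb(k, n, delta):
    """One-sided Hoeffding upper bound, valid with probability >= 1 - delta."""
    return min(1.0, k / n + math.sqrt(math.log(1 / delta) / (2 * n)))


def clopper_pearson_ucb(k, n, delta, tol=1e-10):
    """Exact binomial upper bound: the p at which P(Bin(n, p) <= k) equals delta.

    Found by bisection, since the binomial CDF is decreasing in p.
    """
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid  # mid is still plausible, so the bound lies above it
        else:
            hi = mid
    return hi  # return the upper end so rounding errs on the conservative side


if __name__ == "__main__":
    # Illustrative numbers (assumed): 5 errors among 200 accepted calibration points.
    k, n, delta, alpha = 5, 200, 0.1, 0.05
    for name, bound in [
        ("NAIVE", naive_estimate(k, n)),
        ("Hoeffding", hoeffding_ucb(k, n, delta)),
        ("Clopper-Pearson", clopper_pearson_ucb(k, n, delta)),
    ]:
        print(f"{name:16s} bound={bound:.4f} within budget: {bound <= alpha}")
```

In this assumed setting the Hoeffding bound lands above the budget while the Clopper-Pearson bound stays below it, which mirrors finding (ii). The NAIVE value also passes, but it is a point estimate with no confidence guarantee attached.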
From: Mingzhe Wang
[v1] Sat, 13 Jun 2026 06:48:50 UTC (65 KB)