

















Abstract: Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift, and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (obtained from top-$k$ logprobs) and summarize it with several statistics. A lightweight classifier predicts instance-level correctness from these summaries, and averaging its predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), comparing several classifier models and feature sets across nine LLMs from six families (3B--20B). The estimates often track held-out benchmark accuracy, and several models show a near-monotonic ordering of domains, which suggests that output-entropy profiles are an accessible signal for scalable monitoring and targeted data acquisition.
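The pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Several details are assumptions made for illustration: renormalizing the top-$k$ probability mass before computing entropy, the particular summary statistics (mean, max, min), the use of a small hand-rolled logistic regression as the "lightweight classifier", and every function name below. Reading the $\binom{10}{k}$ compositions as $k$-benchmark subsets is also an interpretation of the abstract.

```python
import math
from itertools import combinations
from statistics import mean


def token_entropy(top_logprobs):
    """Entropy (in nats) of one next-token distribution given its top-k logprobs.

    Assumption: the top-k mass is renormalized to sum to 1; the abstract does
    not say how the truncated tail is handled.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def entropy_profile(response_logprobs):
    """Per-token entropies for one response (one list of top-k logprobs per token)."""
    return [token_entropy(step) for step in response_logprobs]


def summarize(profile):
    """Hypothetical summary features; the paper's exact statistics may differ."""
    return [mean(profile), max(profile), min(profile)]


def predict_proba(weights, features):
    """Predicted probability that a response is correct (logistic model)."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain SGD logistic regression, a stand-in for the 'lightweight classifier'."""
    weights = [0.0] * (len(X[0]) + 1)  # bias followed by one weight per feature
    for _ in range(epochs):
        for features, label in zip(X, y):
            err = predict_proba(weights, features) - label
            weights[0] -= lr * err
            for j, x in enumerate(features):
                weights[j + 1] -= lr * err * x
    return weights


def estimate_domain_accuracy(weights, domain_features):
    """Average the predicted correctness probabilities over a domain's responses."""
    return mean(predict_proba(weights, f) for f in domain_features)


def benchmark_compositions(benchmarks, max_k=4):
    """All k-benchmark subsets for k = 1..max_k (assumed reading of the abstract)."""
    return [c for k in range(1, max_k + 1) for c in combinations(benchmarks, k)]
```

With ten benchmarks, `benchmark_compositions` yields $10 + 45 + 120 + 210 = 385$ subsets. In use, one would fit the classifier on summaries from the benchmarks in a composition, then call `estimate_domain_accuracy` on the summaries of a held-out benchmark and compare the result with that benchmark's measured accuracy.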
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2601.09001 [cs.CL] (or arXiv:2601.09001v4 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2601.09001 (arXiv-issued DOI via DataCite)
From: Pedro Memoli Buffa
[v1] Tue, 13 Jan 2026 21:54:38 UTC (646 KB)
[v2] Mon, 19 Jan 2026 18:44:17 UTC (646 KB)
[v3] Tue, 3 Mar 2026 16:43:33 UTC (695 KB)
[v4] Mon, 25 May 2026 21:40:47 UTC (820 KB)