

















Abstract: Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses, and even expert-curated challenge sets saturate quickly after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and to formalize the search as a multi-armed bandit problem, in which each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space, a 100-fold cost reduction over exhaustive evaluation. We validate the framework on machine translation and knowledge question answering, confirming that the discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.
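The bandit formulation in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `epsilon_greedy_topic_search`, the `evaluate` callback, the `budget` and `epsilon` parameters, and the toy difficulty function are all assumptions made for illustration. Each topic is treated as an arm, and one call to `evaluate` stands in for one expensive sample-and-evaluate query that returns a difficulty score (higher means harder).

```python
import random
from typing import Callable, Dict, List, Tuple


def epsilon_greedy_topic_search(
    topics: List[str],
    evaluate: Callable[[str], float],
    budget: int,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Hypothetical epsilon-greedy search over topics.

    Returns (mean difficulty per sampled topic, number of pulls per topic).
    """
    rng = random.Random(seed)
    means: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for _ in range(budget):
        if not means or rng.random() < epsilon:
            # Explore: pick a topic uniformly at random.
            topic = rng.choice(topics)
        else:
            # Exploit: re-sample the topic that currently looks hardest.
            topic = max(means, key=means.get)
        score = evaluate(topic)  # one expensive sample-and-evaluate query
        n = counts.get(topic, 0) + 1
        counts[topic] = n
        prev = means.get(topic, 0.0)
        # Incremental update of the running mean difficulty.
        means[topic] = prev + (score - prev) / n
    return means, counts


# Toy setup (assumed, not from the paper): 50 topics with fixed difficulty.
TOPICS = [f"topic_{i}" for i in range(50)]
TRUE_DIFFICULTY = {t: i / 50 for i, t in enumerate(TOPICS)}


def toy_evaluate(topic: str) -> float:
    """Stand-in for scoring model outputs on documents from a topic."""
    return TRUE_DIFFICULTY[topic]


means, counts = epsilon_greedy_topic_search(
    TOPICS, toy_evaluate, budget=200, epsilon=0.2
)
hardest_found = max(means, key=means.get)
```

In this sketch the budget caps the total number of queries, so only the topics that were actually sampled appear in `means`; a real evaluator would be noisy, which is why the running mean is kept per topic rather than a single observation.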
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2509.26619 [cs.CL] (or arXiv:2509.26619v3 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2509.26619 (arXiv-issued DOI via DataCite)
From: Wenda Xu
[v1] Tue, 30 Sep 2025 17:55:47 UTC (9,778 KB)
[v2] Wed, 6 May 2026 22:17:29 UTC (9,873 KB)
[v3] Mon, 25 May 2026 19:22:54 UTC (9,874 KB)