
























Abstract: The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.
| Subjects: | Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.22568 [cs.CR] (or arXiv:2605.22568v1 [cs.CR] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.22568 (arXiv-issued DOI via DataCite, pending registration) |
From: Konrad Rieck
[v1] Thu, 21 May 2026 14:47:54 UTC (11 KB)