

















Abstract: Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility into how an agent reaches its result and often miss defects that arise during execution. We present ProcCtrlBench, a benchmark for execution-process evaluation in LLM coding agents. ProcCtrlBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcCtrlBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcCtrlBench uses control preservation to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable, reversible, and able to hand back authority when needed. We evaluate ProcCtrlBench on 200 cases sampled from three benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Results show that ProcCtrlBench can be instantiated with useful reliability, provides more stable semantics than direct thresholding, and reveals meaningful differences in execution quality that conventional outcome-based evaluation often overlooks.
| Comments: | 22 pages, 8 figures |
| Subjects: | Software Engineering (cs.SE); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.20251 [cs.SE] |
| | (or arXiv:2605.20251v4 [cs.SE] for this version) |
| | https://doi.org/10.48550/arXiv.2605.20251 (arXiv-issued DOI via DataCite) |
From: Jiawei He
[v1] Mon, 18 May 2026 08:34:48 UTC (951 KB)
[v2] Thu, 21 May 2026 09:33:40 UTC (951 KB)
[v3] Mon, 25 May 2026 10:08:39 UTC (951 KB)
[v4] Tue, 26 May 2026 09:44:14 UTC (951 KB)