





















Abstract: Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.
| Comments: | Work in Progress; 32 pages, 10 figures, preprint |
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.24110 [cs.AI] |
| | (or arXiv:2605.24110v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.24110 (arXiv-issued DOI via DataCite, pending registration) |
From: Haiyang Shen
[v1]
Fri, 22 May 2026 18:17:28 UTC (2,866 KB)