

















Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We introduce a controlled diagnostic benchmark for procedural execution, in which models are given a step-wise arithmetic procedure and two numeric inputs and must return the final computed value. Complexity is varied through procedure length and look-back dependencies over intermediate variables. Average first-answer accuracy drops from 63% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, and under-executed traces. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful long-horizon procedural execution.
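To make the task format in the abstract concrete, the following is a minimal sketch of what a step-wise arithmetic procedure with look-back dependencies might look like, together with a reference executor that computes the ground-truth final value. It is an illustration only and not the paper's generator: the variable naming (`v0`, `v1`, ...), the operator set, the `max_lookback` parameter, and the `modulus` used to keep values bounded are all assumptions introduced here.

```python
import random

# Hypothetical operator set; the abstract only says "arithmetic".
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_procedure(num_steps, max_lookback, seed=0):
    """Build a random procedure of `num_steps` steps.

    v0 and v1 hold the two inputs. Each step defines the next variable from
    two earlier variables that lie at most `max_lookback` (>= 1) positions
    back, which is one possible reading of "look-back dependencies".
    """
    rng = random.Random(seed)
    steps = []
    for target in range(2, num_steps + 2):
        lo = max(0, target - max_lookback)
        left = rng.randrange(lo, target)
        right = rng.randrange(lo, target)
        op = rng.choice(sorted(OPS))
        steps.append((target, left, op, right))
    return steps


def execute(steps, x, y, modulus=1000):
    """Run every step in order and return the value of the last variable.

    The modulus is an assumption made here to keep intermediate values small.
    """
    values = [x, y]
    for target, left, op, right in steps:
        assert target == len(values), "steps must define variables in order"
        values.append(OPS[op](values[left], values[right]) % modulus)
    return values[-1]


def render(steps):
    """Format the procedure as text that could be placed in a prompt."""
    return "\n".join(f"v{t} = v{l} {op} v{r}" for t, l, op, r in steps)


# A hand-written 3-step procedure with inputs 3 and 4.
demo = [(2, 0, "+", 1), (3, 2, "*", 0), (4, 3, "-", 1)]
print(render(demo))
print(execute(demo, 3, 4))  # 17
```

Under these assumptions, a model's first answer could be scored by comparing it with `execute(steps, x, y)`, while `num_steps` and `max_lookback` would correspond to the two complexity axes the abstract mentions.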
| Comments: | 86 pages, 124 figures, 4 tables |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.00817 [cs.CL] (or arXiv:2605.00817v3 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.00817 (arXiv-issued DOI via DataCite) |
From: Sailesh Panda
[v1] Fri, 1 May 2026 17:55:47 UTC (2,011 KB)
[v2] Thu, 21 May 2026 09:54:19 UTC (4,642 KB)
[v3] Sun, 24 May 2026 07:30:17 UTC (4,934 KB)