





















Abstract: Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To investigate this question systematically, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at this https URL.
| Subjects: | Machine Learning (cs.LG); Computation and Language (cs.CL) |
| Cite as: | arXiv:2502.11167 [cs.LG] (or arXiv:2502.11167v5 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2502.11167 (arXiv-issued DOI via DataCite) |
| Journal reference: | Proceedings of The 2025 Conference on Empirical Methods in Natural Language Processing |
From: Bohan Lyu
[v1] Sun, 16 Feb 2025 15:38:19 UTC (8,314 KB)
[v2] Mon, 3 Mar 2025 08:26:12 UTC (1,501 KB)
[v3] Thu, 3 Apr 2025 09:54:20 UTC (1,501 KB)
[v4] Sun, 28 Sep 2025 10:36:20 UTC (8,342 KB)
[v5] Mon, 25 May 2026 05:58:21 UTC (8,331 KB)