






















Authors:Guangrui Li, Yaochen Xie, Yi Liu, Ziwei Dong, Xingyuan Pan, Tianqi Zheng, Jason Choi, Michael J. Morais, Binit Jha, Shaunak Mishra, Bingrou Zhou, Chen Luo, Monica Xiao Cheng, Dawn Song
Abstract:LLM-powered tool-calling agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks evaluate these systems under static environment interfaces, with fixed schemas and toolsets, making it difficult to assess how agents behave as environments evolves -- when capabilities are added, reorganized, or deprecated across successive environment versions. In this paper, we study structured environment evolution as a benchmark-construction problem for tool-calling agents. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment - data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve supports (1) automatic generation of evolved executable environments through explicit graph transformations, and (2) graph-grounded construction of task sandboxes via subgraph sampling and instantiation. We validate ProEvolve in two tool-calling domains, e-commerce and airline booking, in terms of quality, implementation validity, and failure modes. Finally, we use the generated benchmark as a downstream diagnostic to study how representative agents behave under structured environment evolution.
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2603.05910 [cs.AI] |
| (or arXiv:2603.05910v2 [cs.AI] for this version) | |
| https://doi.org/10.48550/arXiv.2603.05910 arXiv-issued DOI via DataCite |
From: Guangrui Li [view email]
[v1]
Fri, 6 Mar 2026 04:56:18 UTC (571 KB)
[v2]
Tue, 19 May 2026 17:15:08 UTC (726 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。