





















Abstract: Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at this https URL.
| Comments: | Technical Report |
| Subjects: | Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) |
| Cite as: | arXiv:2502.08047 [cs.AI] |
| (or arXiv:2502.08047v5 [cs.AI] for this version) | |
| https://doi.org/10.48550/arXiv.2502.08047 arXiv-issued DOI via DataCite |
From: Zhao Hengyuan
[v1] Wed, 12 Feb 2025 01:06:10 UTC (7,833 KB)
[v2] Wed, 19 Feb 2025 23:27:05 UTC (7,811 KB)
[v3] Mon, 9 Jun 2025 06:58:38 UTC (8,891 KB)
[v4] Sun, 22 Feb 2026 18:05:26 UTC (19,005 KB)
[v5] Sat, 23 May 2026 20:19:43 UTC (8,826 KB)