

















Abstract: Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16,000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.
| Comments: | Accepted at ICML 2026 |
| Subjects: | Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.27134 [cs.AI] (or arXiv:2605.27134v1 [cs.AI] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.27134 (arXiv-issued DOI via DataCite, pending registration) |
From: Pengzhi Gao
[v1]
Tue, 26 May 2026 15:03:56 UTC (11,865 KB)