



























Desktop cleaning demands open-vocabulary recognition and precise manipulation for heterogeneous debris. We propose a hierarchical framework integrating reflective Vision-Language Model (VLM) planning with dual-arm execution via structured scene representation. Grounded-SAM2 facilitates open-vocabulary detection, while a memory-augmented VLM generates, critiques, and revises manipulation sequences. These sequences are converted into parametric trajectories for five primitives executed by coordinated Franka arms. Evaluated in simulated scenarios, our system achieving 87.2% task completion, a 28.8% improvement over static VLM and 36.2% over single-arm baselines. Structured memory integration proves crucial for robust, generalizable manipulation while maintaining real-time control performance.
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。