





















Abstract: A robot resolving "put the cup on that one" must fuse gesture, language, and scene geometry, yet 3D grounding benchmarks only partially capture this regime: descriptions are written post-hoc, gestures are templated, or pointing is staged for the camera. MM-Conv captures natural co-speech gesture from dyadic VR interaction alongside full-body motion capture and 3D scene graphs. We use it to evaluate pose-language fusion with a decoupled late-fusion architecture in which pose and text pathways share no learned parameters. Together, the dataset and the decoupled architecture make category, pose, and text contributions easier to isolate through controlled ablations. Fusion with frozen MiniLM category embeddings exceeds both pose alone and the best text-only pathway on every reference type, reaching 31.9% top-1. The learned scalar gate flips between opposing policies depending on whether the text pathway has category access. This is a reliability diagnostic: fusion-accuracy claims for semantic grounding systems are indistinguishable from category-representation artifacts unless pathways are architecturally decoupled.
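As a rough illustration of what a scalar-gated late fusion of two decoupled pathways could look like, the sketch below blends per-candidate probabilities from a pose pathway and a text pathway using a single gate value. The abstract does not specify the gating form, so the sigmoid-weighted convex combination, the function names (`gated_late_fusion`, `top1`), and the toy scores are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Convert raw candidate scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_late_fusion(pose_scores, text_scores, gate_logit):
    """Blend two independently computed candidate distributions.

    Assumption: the gate is one scalar logit, squashed by a sigmoid into
    g in (0, 1). g near 1 trusts the pose pathway, g near 0 trusts text.
    The pathways only meet here, so they share no parameters.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    p_pose = softmax(pose_scores)
    p_text = softmax(text_scores)
    return [g * a + (1.0 - g) * b for a, b in zip(p_pose, p_text)]


def top1(probs):
    """Index of the highest-probability candidate object."""
    return max(range(len(probs)), key=probs.__getitem__)


# Toy scores over three candidate objects (hypothetical values).
pose_scores = [2.0, 0.5, 0.1]   # pose pathway prefers candidate 0
text_scores = [0.0, 3.0, 0.2]   # text pathway prefers candidate 1

fused_pose_heavy = gated_late_fusion(pose_scores, text_scores, gate_logit=4.0)
fused_text_heavy = gated_late_fusion(pose_scores, text_scores, gate_logit=-4.0)
print(top1(fused_pose_heavy), top1(fused_text_heavy))
```

With the pathways kept separate until this final blend, ablating one of them amounts to fixing the gate at an extreme, which is one way the sign of a learned gate can be read as a policy preference.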
| Comments: | ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction |
| Subjects: | Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV) |
| ACM classes: | I.2.9; I.2.10 |
| Cite as: | arXiv:2605.24622 [cs.RO] (or arXiv:2605.24622v1 [cs.RO] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.24622 (arXiv-issued DOI via DataCite, pending registration) |
From: Anna Deichler
[v1] Sat, 23 May 2026 15:20:04 UTC (322 KB)