Authors: Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu
Abstract: 3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: this https URL
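As a rough illustration of the two failure causes named in the abstract, the following Python sketch shows what a simple 3D augmentation step and a batch-independent normalization can look like. It is not the authors' pipeline. The function names, the choice of transforms (rotation about the z-axis, uniform scaling, Gaussian jitter), and all parameter values are assumptions made for this example. The abstract does not state what replaces Batch Normalization, so layer normalization appears here only as a common alternative that does not rely on batch statistics.

```python
import math
import random


def augment_point_cloud(points, rng, max_angle=math.pi / 12,
                        scale_range=(0.9, 1.1), jitter_std=0.005):
    """Apply a random z-rotation, uniform scale, and per-point jitter.

    `points` is a list of (x, y, z) tuples. All parameter values are
    illustrative assumptions, not settings taken from the paper.
    """
    angle = rng.uniform(-max_angle, max_angle)
    c, s = math.cos(angle), math.sin(angle)
    scale = rng.uniform(*scale_range)
    augmented = []
    for x, y, z in points:
        # Rotate about the z-axis; z itself is unchanged by the rotation.
        rx, ry = c * x - s * y, s * x + c * y
        # Scale each coordinate, then add small Gaussian noise.
        augmented.append(tuple(scale * v + rng.gauss(0.0, jitter_std)
                               for v in (rx, ry, z)))
    return augmented


def layer_norm(features, eps=1e-5):
    """Normalize one feature vector using only its own mean and variance.

    Unlike Batch Normalization, the result does not depend on which
    other samples share the batch.
    """
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]


rng = random.Random(0)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(0, 1))
         for _ in range(4)]
augmented = augment_point_cloud(cloud, rng)
normalized = layer_norm([0.5, 1.5, -2.0, 4.0])
```

In an imitation-learning setting, any actions expressed in the same coordinate frame as the point cloud would presumably need the same rotation and scale applied. The sketch leaves that step out.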
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) |
| Cite as: | arXiv:2604.15281 [cs.CV] |
| | (or arXiv:2604.15281v1 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2604.15281 (arXiv-issued DOI via DataCite) |
From: Zhengdong Hong
[v1] Thu, 16 Apr 2026 17:50:37 UTC (9,246 KB)