






















Abstract: We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach with two consecutive levels of vector quantization: the lower level associates skeletons with fine-grained subactions, while the higher level aggregates subactions into action-level representations. This hierarchical approach outperforms the non-hierarchical baseline, even though it primarily exploits spatial cues by reconstructing input skeletons. Next, we extend the approach to leverage both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme that performs multi-level clustering while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach achieves new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
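The two-level assignment described in the abstract can be illustrated with a minimal sketch. It makes several assumptions not stated in the source: the codebooks are fixed and hand-picked rather than learned, frames are already embedded as short vectors, nearest-neighbour lookup uses squared Euclidean distance, and the higher level quantizes the selected subaction code. The names `nearest_code`, `hierarchical_quantize`, `sub_codebook`, and `action_codebook` are hypothetical. The reconstruction of skeletons and timestamps that the paper uses for training is not shown.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(x: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
    return dists.index(min(dists))


def hierarchical_quantize(
    frames: Sequence[Vector],
    sub_codebook: Sequence[Vector],
    action_codebook: Sequence[Vector],
) -> Tuple[List[int], List[int]]:
    """Assign each frame a subaction index, then an action index.

    Level 1 quantizes the frame embedding against the subaction codebook.
    Level 2 quantizes the chosen subaction code against the action codebook
    (an assumption made for this sketch).
    """
    sub_ids = [nearest_code(f, sub_codebook) for f in frames]
    action_ids = [nearest_code(sub_codebook[s], action_codebook) for s in sub_ids]
    return sub_ids, action_ids


# Toy 2-D frame embeddings and codebooks (hypothetical values, not from any dataset).
frames = [(0.1, 0.0), (0.9, 0.1), (5.0, 5.1), (5.9, 5.0)]
sub_codebook = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
action_codebook = [(0.5, 0.0), (5.5, 5.0)]

sub_ids, action_ids = hierarchical_quantize(frames, sub_codebook, action_codebook)
print(sub_ids)     # [0, 1, 2, 3]
print(action_ids)  # [0, 0, 1, 1]
```

In this toy case, four distinct subactions collapse into two actions, so consecutive frames with the same action index form one segment.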
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2604.15196 [cs.CV] (or arXiv:2604.15196v1 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2604.15196 (arXiv-issued DOI via DataCite) |
From: Quoc-Huy Tran
[v1] Thu, 16 Apr 2026 16:24:40 UTC (883 KB)