

















Abstract:Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion.
We introduce \regName, a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, \regName follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, \regName combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead.
Experiments across audio-visual, image-text, and RF-based benchmarks show that \regName consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| Cite as: | arXiv:2601.21670 [cs.CV] |
| (or arXiv:2601.21670v3 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2601.21670 arXiv-issued DOI via DataCite |
From: Zixuan Xia [view email]
[v1]
Thu, 29 Jan 2026 13:03:50 UTC (3,547 KB)
[v2]
Sun, 5 Apr 2026 13:19:29 UTC (3,663 KB)
[v3]
Fri, 22 May 2026 23:30:27 UTC (4,575 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。