

















Abstract:Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.
| Comments: | Accepted at IEEE FG 2026 Workshops. Final system ranked 2nd in the BlEmoRE Challenge. 9 pages including appendix, 8 figures |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2605.21417 [cs.CV] |
| (or arXiv:2605.21417v2 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.21417 arXiv-issued DOI via DataCite |
From: Junghyun Lee [view email]
[v1]
Wed, 20 May 2026 17:12:55 UTC (2,712 KB)
[v2]
Sun, 24 May 2026 12:32:06 UTC (2,712 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。