

















Abstract: We investigate Counterfactual Video Foley Generation, which aims to generate audio that adopts a sound-source identity contradicting the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this task: when video and text disagree, they often remain anchored to the visually implied sound source. We present ConterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. ConterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric that uses a text-audio co-embedding space to measure both target-prompt evidence and residual leakage of the visually implied source. Video demonstrations and code are available at this https URL
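
The two phases and the evaluation idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the velocity field, the embeddings, the switch point, the guidance scale, and all function names (`toy_velocity`, `guided_velocity`, `dual_phase_sample`, `replacement_scores`) are hypothetical stand-ins. In particular, it assumes that suppression of the visually implied source is done with a classifier-free-guidance-style negative branch, and that the metric reduces to two cosine similarities; the abstract specifies neither.

```python
import numpy as np

# Toy stand-ins (hypothetical): a real VT2A model would supply learned
# embeddings and a learned velocity network instead of these vectors.
TEXT_EMB = {
    "target": np.array([1.0, 0.0, 0.0]),  # counterfactual target prompt
    "source": np.array([0.0, 1.0, 0.0]),  # visually implied sound source
}
VIDEO_EMB = np.array([0.0, 0.0, 1.0])     # stands in for video timing cues


def toy_velocity(x, t, text, video):
    """Toy velocity field: points from x toward the conditioning embedding."""
    goal = TEXT_EMB[text].copy()
    if video is not None:
        goal = goal + video
    return goal - x


def guided_velocity(v_fn, x, t, pos, neg, scale):
    """Guidance-style extrapolation: v_neg + scale * (v_pos - v_neg)."""
    v_pos = v_fn(x, t, *pos)
    v_neg = v_fn(x, t, *neg)
    return v_neg + scale * (v_pos - v_neg)


def dual_phase_sample(v_fn, x0, video, steps=20, switch=0.5, scale=3.0):
    """Euler integration from t=0 to t=1 with a conditioning switch."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        if t < switch:
            # Phase 1 (assumed form): keep video conditioning for timing.
            pos = ("target", video)
        else:
            # Phase 2 (assumed form): drop video, follow the target text only.
            pos = ("target", None)
        # Assumption: the visually implied source acts as the negative branch.
        neg = ("source", None)
        x = x + dt * guided_velocity(v_fn, x, t, pos, neg, scale)
    return x


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def replacement_scores(audio_emb, target_emb, source_emb):
    """Return (target-prompt evidence, residual source leakage) as cosines."""
    return cosine(audio_emb, target_emb), cosine(audio_emb, source_emb)


sample = dual_phase_sample(toy_velocity, np.zeros(3), VIDEO_EMB)
target_score, leak_score = replacement_scores(
    sample, TEXT_EMB["target"], TEXT_EMB["source"]
)
print(target_score, leak_score)
```

In this toy setting the sample ends up more similar to the target embedding than to the source embedding. With a real model, `toy_velocity` would be replaced by the pretrained flow-matching network and the embeddings by a text-audio co-embedding model.
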
| Comments: | Accepted to CVPR 2026 Workshop on Sight and Sound |
| Subjects: | Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS) |
| Cite as: | arXiv:2605.18916 [cs.MM] (or arXiv:2605.18916v2 [cs.MM] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.18916 (arXiv-issued DOI via DataCite) |
From: Gyubin Lee
[v1] Mon, 18 May 2026 05:42:06 UTC (1,242 KB)
[v2] Mon, 25 May 2026 12:15:23 UTC (980 KB)