Abstract: Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at this https URL.
| Comments: | Accepted by ICME2026 |
| Subjects: | Sound (cs.SD); Multimedia (cs.MM) |
| Cite as: | arXiv:2605.23201 [cs.SD] |
| | (or arXiv:2605.23201v1 [cs.SD] for this version) |
| | https://doi.org/10.48550/arXiv.2605.23201 (arXiv-issued DOI via DataCite, pending registration) |
From: Qingcao Li
[v1] Fri, 22 May 2026 03:33:36 UTC (455 KB)