
























Abstract: While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors caused by imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: this https URL
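To make the phrase "length-preserving repeat and skip latent augmentations" concrete, the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a latent sequence shaped `(frames, channels)`; the function names `repeat_augment` and `skip_augment`, the segment parameters `start` and `span`, and the choice to pad a skipped sequence by repeating its last frame are all hypothetical. The abstract does not specify how the corrupted sequences are built or how they enter the contrastive flow matching objective.

```python
import numpy as np


def _check_segment(num_frames: int, start: int, span: int) -> None:
    # Hypothetical validity rule: the segment must lie strictly inside the sequence
    # so that at least one frame remains after a skip.
    if start < 0 or span <= 0 or start + span >= num_frames:
        raise ValueError("segment must satisfy 0 <= start and start + span < num_frames")


def repeat_augment(latent: np.ndarray, start: int, span: int) -> np.ndarray:
    """Simulate a repeat error: play latent[start:start+span] twice, then truncate
    the tail so the output has the same number of frames as the input."""
    num_frames = latent.shape[0]
    _check_segment(num_frames, start, span)
    segment = latent[start:start + span]
    stretched = np.concatenate(
        [latent[:start + span], segment, latent[start + span:]], axis=0
    )
    return stretched[:num_frames]


def skip_augment(latent: np.ndarray, start: int, span: int) -> np.ndarray:
    """Simulate a skip error: drop latent[start:start+span], then pad the tail
    (assumption: by repeating the last remaining frame) to restore the length."""
    num_frames = latent.shape[0]
    _check_segment(num_frames, start, span)
    shortened = np.concatenate([latent[:start], latent[start + span:]], axis=0)
    padding = np.repeat(shortened[-1:], num_frames - shortened.shape[0], axis=0)
    return np.concatenate([shortened, padding], axis=0)


if __name__ == "__main__":
    # Toy latent: 8 frames, 1 channel, with values equal to the frame index.
    toy = np.arange(8, dtype=np.float32).reshape(8, 1)
    print(repeat_augment(toy, start=2, span=2).ravel())
    print(skip_augment(toy, start=2, span=2).ravel())
```

Because both functions return a sequence of the original length, the corrupted latent can be paired frame-by-frame with the clean one, which is presumably what lets such sequences act as negatives without an external aligner.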
| Comments: | Submitted to INTERSPEECH 2026 |
| Subjects: | Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS) |
| Cite as: | arXiv:2605.22083 [cs.SD] |
| | (or arXiv:2605.22083v1 [cs.SD] for this version) |
| | https://doi.org/10.48550/arXiv.2605.22083 (arXiv-issued DOI via DataCite, pending registration) |
From: Jinhyeok Yang
[v1] Thu, 21 May 2026 07:22:28 UTC (66 KB)