

















Abstract: Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse semantic content and acoustic style inseparably, or achieve only incomplete semantic-acoustic disentanglement. To improve disentanglement, we propose DSA-Tokenizer, which explicitly separates speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. We further introduce a hierarchical Flow Matching decoder and a joint reconstruction-context inpainting training strategy, allowing the model to support both high-fidelity reconstruction and cross-utterance voice cloning. To speed up inference, we distill the DiT decoder to reduce the number of sampling steps to 4, and we improve synthesis quality with GAN fine-tuning. Experiments demonstrate that DSA-Tokenizer provides strong semantic-acoustic disentanglement, reliable and controllable voice cloning, and efficient high-fidelity generation with low WER/CER. Moreover, our results suggest that disentangled tokenization provides a more effective interface for downstream large-model speech generation. Audio samples are available at this https URL.
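As a rough illustration of what few-step sampling from a flow-matching model involves, the sketch below integrates a velocity field from noise (t = 0) to data (t = 1) with four fixed Euler steps. It is not the authors' decoder: `euler_sample` and `make_toy_velocity` are hypothetical names, and the velocity field is a closed-form toy (a straight-line path toward a single target vector) standing in for a trained, distilled DiT.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 4,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # current time in [0, 1)
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned velocity network (an assumption, not the paper's model).

    For the linear path x_t = (1 - t) * x0 + t * target, the velocity that
    carries any point x at time t to the target at t = 1 is
    (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


target = [0.5, -1.0, 2.0]   # stands in for a mel-spectrogram frame
noise = [0.1, 0.3, -0.7]    # stands in for the initial noise sample
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
```

With this particular toy field the path is a straight line, so even a coarse four-step Euler solve reaches the target up to floating-point error. A learned field generally has curved trajectories, which is why few-step sampling typically depends on distillation or similar techniques.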
| Comments: | Submitted to ACL ARR, May 2026 |
| Subjects: | Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS) |
| Cite as: | arXiv:2601.09239 [cs.SD] (or arXiv:2601.09239v3 [cs.SD] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2601.09239 (arXiv-issued DOI via DataCite) |
From: Hanlin Zhang
[v1] Wed, 14 Jan 2026 07:22:24 UTC (392 KB)
[v2] Thu, 15 Jan 2026 04:00:34 UTC (392 KB)
[v3] Tue, 26 May 2026 12:05:42 UTC (396 KB)