LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization
Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim · 2025-06-20 · via cs.AI updates on arXiv.org

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy and capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25 Hz, 12.5 Hz, and 6.25 Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performance on speech-to-text tasks and consistently outperform baselines on text-to-speech tasks.
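
The contrast the abstract draws between direct pooled feature matching and the indirect, reconstruction-based distillation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the mean-squared-error discrepancy, the toy stand-in for a frozen ASR encoder, and the 50 Hz teacher frame rate (chosen because it reproduces the 25 Hz, 12.5 Hz, and 6.25 Hz rates with pooling factors 2, 4, and 8).

```python
from typing import Callable, List, Sequence

Frame = List[float]


def average_pool(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Average each run of `factor` consecutive frames (the last run may be shorter)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frames), factor):
        chunk = frames[start:start + factor]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def mse(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Mean squared error between two equally shaped frame sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    total, count = 0.0, 0
    for fa, fb in zip(a, b):
        for xa, xb in zip(fa, fb):
            total += (xa - xb) ** 2
            count += 1
    return total / count


def direct_distillation_loss(teacher: Sequence[Frame],
                             student: Sequence[Frame],
                             factor: int) -> float:
    """Baseline idea: pool teacher features down to the student rate, then match."""
    return mse(average_pool(teacher, factor), student)


def indirect_distillation_loss(original_wave: Sequence[float],
                               reconstructed_wave: Sequence[float],
                               frozen_encoder: Callable[[Sequence[float]], List[Frame]]) -> float:
    """Indirect idea: compare encoder outputs of the original and reconstructed audio."""
    return mse(frozen_encoder(original_wave), frozen_encoder(reconstructed_wave))


def toy_encoder(wave: Sequence[float]) -> List[Frame]:
    """Hypothetical stand-in for a frozen ASR encoder: one 1-D frame per 2 samples."""
    frames = []
    for i in range(0, len(wave), 2):
        chunk = wave[i:i + 2]
        frames.append([sum(chunk) / len(chunk)])
    return frames


if __name__ == "__main__":
    teacher_rate_hz = 50.0  # assumed teacher frame rate, not stated in the abstract
    for factor in (2, 4, 8):
        print(f"pooling x{factor}: {teacher_rate_hz / factor} Hz")

    # Two different frame pairs collapse to the same pooled value, which is
    # one way pooling can dilute structure.
    print(average_pool([[0.0], [2.0]], 2) == average_pool([[1.0], [1.0]], 2))
```

In this toy setup the direct loss depends on a fixed pooling rule, while the indirect loss only requires that the reconstructed waveform look the same as the original to the frozen encoder, so no frame-level alignment between teacher and student is imposed.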