Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevo · 2022-03-10 · via eess.AS updates on arXiv.org

Language model fusion helps smart assistants recognize words that are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high-frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.
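
The three selection strategies named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The abstract does not give the exact soft log function, the rarity threshold, the scoring models, or the order in which the filters are combined, so everything below is an assumption made for illustration: the form `threshold * log(1 + count / threshold)`, the parameter names `threshold`, `max_count`, and `min_score`, and the add-one-smoothed unigram scorers that stand in for real target-domain and background language models.

```python
import math
from collections import Counter


def soft_log_count(count, threshold):
    """Assumed soft-log form: near-linear for small counts, logarithmic for large ones."""
    return max(1, round(threshold * math.log1p(count / threshold)))


def downsample(sentences, threshold=10.0):
    """Map each distinct sentence to its reduced (soft-log) count."""
    return {s: soft_log_count(c, threshold) for s, c in Counter(sentences).items()}


def contains_rare_word(sentence, acoustic_word_counts, max_count=5):
    """True if any word occurs at most max_count times in the acoustic transcripts."""
    return any(acoustic_word_counts.get(w, 0) <= max_count for w in sentence.split())


def make_unigram_logprob(corpus):
    """Toy add-one-smoothed unigram scorer; a stand-in for a real language model."""
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def logprob(sentence):
        return sum(math.log((counts[w] + 1) / (total + vocab)) for w in sentence.split())

    return logprob


def contrastive_score(sentence, target_logprob, background_logprob):
    """Per-word log-likelihood difference: positive means closer to the target domain."""
    n = max(1, len(sentence.split()))
    return (target_logprob(sentence) - background_logprob(sentence)) / n


def select(sentences, acoustic_word_counts, target_logprob, background_logprob,
           threshold=10.0, max_count=5, min_score=0.0):
    """Apply the three filters in one (assumed) order and return sentence -> kept count."""
    kept = {}
    for sentence, count in downsample(sentences, threshold).items():
        if not contains_rare_word(sentence, acoustic_word_counts, max_count):
            continue  # no rare-word exposure
        if contrastive_score(sentence, target_logprob, background_logprob) < min_score:
            continue  # looks more like the background than the target domain
        kept[sentence] = count
    return kept


# Toy data, invented for this example only.
queries = (["weather"] * 1000
           + ["play zydeco music"] * 3
           + ["buy zydeco tickets online"] * 2)
acoustic_counts = {"weather": 5000, "play": 1000, "music": 800,
                   "buy": 500, "tickets": 300, "online": 400}
target_lm = make_unigram_logprob(["play zydeco music", "play jazz music"])
background_lm = make_unigram_logprob(["buy tickets online", "buy shoes online", "weather"])

print(select(queries, acoustic_counts, target_lm, background_lm))
```

In this toy run the head query "weather" is dropped because it contains no rare word, the out-of-domain rare-word query is dropped by the contrastive score, and only the in-domain rare-word query survives. Comparing per-word log-likelihoods under two models is equivalent to comparing their perplexities, which is why the score is normalized by sentence length.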