

Employing Real Training Data for Deep Noise Suppression
Ziyi Xu, Marvin Sach, Jan Pirklbauer, Tim Fingscheidt · 2023-09-06 · via eess.AS updates on arXiv.org

Most deep noise suppression (DNS) models are trained with reference-based losses requiring access to clean speech. However, sometimes an additive microphone model is insufficient for real-world applications. Accordingly, ways to use real training data in supervised learning for DNS models promise to reduce a potential training/inference mismatch. Employing real data for DNS training requires either generative approaches or a reference-free loss that does not need access to the corresponding clean speech. In this work, we propose to employ an end-to-end non-intrusive deep neural network (DNN), named PESQ-DNN, to estimate perceptual evaluation of speech quality (PESQ) scores of enhanced real data. It provides a reference-free perceptual loss for employing real data during DNS training, maximizing the PESQ scores. Furthermore, we use an epoch-wise alternating training protocol, updating the DNS model on real data, followed by updating PESQ-DNN on synthetic data. The DNS model trained with the PESQ-DNN employing real data outperforms all reference methods employing only synthetic training data. On synthetic test data, our proposed method exceeds the Interspeech 2021 DNS Challenge baseline by a significant 0.32 PESQ points. On both synthetic and real test data, the proposed method beats the baseline by 0.05 DNSMOS points, although PESQ-DNN optimizes for a different perceptual metric.
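The epoch-wise alternating protocol described in the abstract can be sketched as a plain control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `alternating_training` and the two callbacks are hypothetical, and keeping one network fixed while the other is updated is an assumed reading of "alternating". The abstract states only the order of the two steps and which data each one uses.

```python
from typing import Callable, Dict, List


def alternating_training(
    num_epochs: int,
    update_dns_on_real: Callable[[int], float],
    update_pesq_dnn_on_synthetic: Callable[[int], float],
) -> List[Dict[str, float]]:
    """Run a hypothetical epoch-wise alternating schedule.

    update_dns_on_real(epoch): assumed to train the DNS model for one epoch
        on real noisy recordings, using the (assumed fixed) PESQ-DNN output
        as a reference-free loss, and to return the mean predicted PESQ.
    update_pesq_dnn_on_synthetic(epoch): assumed to train PESQ-DNN for one
        epoch on synthetic data, where clean references exist so that true
        PESQ targets can be computed, and to return the mean estimation error.
    """
    history: List[Dict[str, float]] = []
    for epoch in range(num_epochs):
        # Step 1: update the DNS model on real data (no clean speech needed).
        predicted_pesq = update_dns_on_real(epoch)
        # Step 2: update PESQ-DNN on synthetic data, so the estimator
        # keeps tracking the outputs of the just-updated DNS model.
        estimation_error = update_pesq_dnn_on_synthetic(epoch)
        history.append(
            {
                "epoch": float(epoch),
                "predicted_pesq": predicted_pesq,
                "estimation_error": estimation_error,
            }
        )
    return history


if __name__ == "__main__":
    # Placeholder callbacks standing in for real training steps.
    log = alternating_training(
        num_epochs=2,
        update_dns_on_real=lambda epoch: 2.5,
        update_pesq_dnn_on_synthetic=lambda epoch: 0.2,
    )
    for entry in log:
        print(entry)
```

In an actual training setup, the two callbacks would wrap the optimizer steps of the respective networks. The values returned by the placeholder lambdas are arbitrary and carry no experimental meaning.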