Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Towards Improving Speaker Distance Estimation through Generative Impulse Response Augmentation

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

Alethia: A Foundational Encoder for Voice Deepfakes

From Birdsong to Rumbles: Classifying Elephant Calls with Out-of-Species Embeddings

Beyond the Baseband: Adaptive Multi-Band Encoding for Full-Spectrum Bioacoustics Classification

Predicting Upcoming Stuttering Events from Three-Second Audio: Stratified Evaluation Reveals Severity-Selective Precursors, and the Model Deploys Fully On-Device

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

DiffAnon: Diffusion-based Prosody Control for Voice Anonymization

Recurrence-Based Nonlinear Vocal Dynamics as Digital Biomarkers for Depression Detection from Conversational Speech

One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

ML-SAN: Multi-Level Speaker-Adaptive Network for Emotion Recognition in Conversations

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

Speech Enhancement Based on Drifting Models

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis

Voice Mapping of Text-to-Speech Systems: A Metric-Based Approach for Voice Quality Assessment

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

Qwen3.5-Omni Technical Report

Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model

The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

SongBench: A Fine-Grained Multi-Aspect Benchmark for Song Quality Assessment

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

Enhancing ASR Performance in the Medical Domain for Dravidian Languages

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Disentangled Dual-Branch Graph Learning for Conversational Emotion Recognition

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

HARNESS: Lightweight Distilled Arabic Speech Foundation Models

KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness

Explainable Speech Emotion Recognition: Weighted Attribute Fairness to Model Demographic Contributions to Social Bias

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation

WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models

BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

Efficient Test-Time Adaptation through Latent Subspace Coefficients Search

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

Direct Simultaneous Translation Activation for Large Audio-Language Models

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Computational Narrative Understanding for Expressive Text-to-Speech

Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Not that Groove: Zero-Shot Symbolic Music Editing

Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

AudioX: A Unified Framework for Anything-to-Audio Generation

S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

Throat and acoustic paired speech dataset for deep learning-based speech enhancement

Dementia classification from spontaneous speech using wrapper-based feature selection

DASB - Discrete Audio and Speech Benchmark

Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

Recommended feeds

eess.AS updates on arXiv.org