Pronunciation recognition of English phonemes /ə/, /æ/, /ɑː/ and /ʌ/ using Formants and Mel Frequency Cepstral Coefficients

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Korean aegyo speech shows systematic F1 increase to signal childlike qualities

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

Speech Enhancement Based on Drifting Models

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

Spectro-Temporal Modulation Representation Framework for Human-Imitated Speech Detection

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

Materialistic RIR: Material Conditioned Realistic RIR Generation

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

Enhancing Speaker Verification with Whispered Speech via Post-Processing

Environmental Sound Deepfake Detection Using Deep-Learning Framework

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Tadabur: A Large-Scale Quran Audio Dataset

Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features

Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation

Virtual boundary integral neural network for three-dimensional exterior acoustic problems

Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

Breakout-picker: Reducing false positives in deep learning-based borehole breakout characterization from acoustic image logs

Hierarchical Codec Diffusion for Video-to-Speech Generation

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

The Acoustic Camouflage Phenomenon: Re-evaluating Speech Features for Financial Risk Prediction

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR

Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals

In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions

Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models

HHL with a Coherent Fourier Oracle: A Proof-of-Concept Quantum Architecture for Joint Melody-Harmony Generation

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

Efficient Training for Cross-lingual Speech Language Models

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation

BlasBench: An Open Benchmark for Irish Speech Recognition

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Disentangled Dual-Branch Graph Learning for Conversational Emotion Recognition

Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

Woosh: A Sound Effects Foundation Model

KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness

From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation

Diagnostic-Driven Layer-Wise Compensation for Post-Training Quantization of Encoder-Decoder ASR Models

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Real-Time Streamable Generative Speech Restoration with Flow Matching

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Protecting Bystander Privacy via Selective Hearing in Audio LLMs

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores

MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation

VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

MARS: Sound Generation via Multi-Channel Autoregression on Spectrograms

Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance

RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

Computational Narrative Understanding for Expressive Text-to-Speech

Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping

Histogram-based Parameter-efficient Tuning for Passive and Active Sonar Classification

Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

AudioX: A Unified Framework for Anything-to-Audio Generation

Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

Throat and acoustic paired speech dataset for deep learning-based speech enhancement

Dementia classification from spontaneous speech using wrapper-based feature selection

DASB - Discrete Audio and Speech Benchmark

Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

Recommended feeds

cs.SD updates on arXiv.org