Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning


[Submitted on 16 Jun 2026] · 2026-06-17 · via eess.AS updates on arXiv.org
