Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction
Doyeop Kwak, Suyeon Lee, Joon Son Chung · 2026-03-20 · via cs.SD updates on arXiv.org

This paper offers a new perspective on audio-visual target speaker extraction (AV-TSE) by decoupling separation from target selection. Conventional AV-TSE systems typically fuse audio and visual features deeply and re-learn the entire separation process, which can impose a fidelity ceiling because in-the-wild audio-visual datasets are noisy. To address this, we propose Plug-and-Steer, which assigns high-fidelity separation to a frozen audio-only backbone and limits the role of the visual modality strictly to target selection. We introduce the Latent Steering Matrix (LSM), a minimalist linear transformation that re-routes latent features within the backbone to anchor the target speaker to a designated channel. Experiments across four representative architectures show that our method preserves the acoustic priors of diverse backbones, achieving perceptual quality comparable to the original backbones. Audio samples are available at: https://plugandsteer.github.io
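
To make the idea of "re-routing latent features to a designated channel" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the LSM's shape, where in the backbone it is applied, or how it is derived from visual features. The sketch assumes the frozen backbone exposes one latent feature vector per speaker stream, and that a target index has already been chosen (in the paper's setting, presumably from the visual cue). The names `steering_matrix` and `apply_steering` are hypothetical, and the matrix here is a simple permutation, the most restricted case of a linear transformation.

```python
from typing import List

Matrix = List[List[float]]


def steering_matrix(target_index: int, num_streams: int) -> Matrix:
    """Build a permutation matrix that swaps the target stream into channel 0.

    Assumption: a hard permutation stands in for the learned linear
    transformation described in the abstract.
    """
    if not 0 <= target_index < num_streams:
        raise ValueError("target_index out of range")
    # Start from the identity, then swap row 0 with the target row.
    m = [[1.0 if i == j else 0.0 for j in range(num_streams)]
         for i in range(num_streams)]
    m[0], m[target_index] = m[target_index], m[0]
    return m


def apply_steering(m: Matrix, latents: Matrix) -> Matrix:
    """Compute m @ latents, where latents has one row per speaker stream."""
    num_streams = len(latents)
    feature_dim = len(latents[0])
    return [
        [sum(m[i][k] * latents[k][d] for k in range(num_streams))
         for d in range(feature_dim)]
        for i in range(len(m))
    ]


# Toy latents: 3 streams with 2 features each (hypothetical values).
latents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
steered = apply_steering(steering_matrix(2, 3), latents)
print(steered[0])  # stream 2 now sits in channel 0
```

Because the steering step is a single matrix product over the stream axis, the backbone's weights stay untouched. That is the property the abstract emphasizes: separation quality comes from the frozen audio-only model, and the added component only decides which stream ends up in the output channel.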