

Hear "No Evil", See "Kenansville": Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems
Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Logan · 2019-10-11 · via cs.SD updates on arXiv.org

Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state-of-the-art systems, with minimal impact on human comprehension. Processing pipelines for modern systems consist of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box and transferable, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy. We also find that certain English-language phonemes (in particular, vowels) are significantly more susceptible to our attack. We show that the attacks remain effective when mounted over cellular networks, where signals are subject to degradation due to transcoding, jitter, and packet loss.
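
The abstract does not state which transformation the attacks apply to the audio. The following is a minimal sketch of what a signal-level perturbation confined to a few frames could look like, assuming (purely for illustration) that low-magnitude DFT components are discarded in the chosen frames. The function `perturb_frames`, its parameters, and the toy signal are all hypothetical and are not taken from the paper.

```python
import numpy as np

def perturb_frames(signal, frame_len, frame_ids, keep_ratio):
    """Zero out low-magnitude DFT bins in the selected frames only.

    Assumption: a magnitude threshold relative to the strongest bin stands in
    for whatever perturbation the authors actually use.
    """
    out = np.asarray(signal, dtype=float).copy()
    for i in frame_ids:
        start = i * frame_len
        frame = out[start:start + frame_len]
        if frame.size == 0:
            continue  # frame index lies beyond the end of the signal
        spectrum = np.fft.rfft(frame)
        magnitude = np.abs(spectrum)
        # Drop every component weaker than keep_ratio times the strongest one.
        spectrum[magnitude < keep_ratio * magnitude.max()] = 0.0
        out[start:start + frame.size] = np.fft.irfft(spectrum, n=frame.size)
    return out

# Toy input: a strong 440 Hz tone plus a weak 3 kHz component at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
audio = np.sin(2 * np.pi * 440 * t) + 0.01 * np.sin(2 * np.pi * 3000 * t)

# Perturb two 25 ms frames (samples 1200 to 1999) and leave the rest untouched.
attacked = perturb_frames(audio, frame_len=400, frame_ids=[3, 4], keep_ratio=0.1)
changed = np.flatnonzero(~np.isclose(audio, attacked))
```

In this sketch only the two selected frames differ from the input, and the change is limited to the weak component, which loosely mirrors the abstract's claim that a few modified frames with little audible impact can suffice. Whether such a perturbation actually causes mistranscription depends on the target system and is not shown here.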