Quantizing Whisper-small: How design choices affect ASR performance
Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal · 2025-11-11 · via eess.AS updates on arXiv.org

Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
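To make two of the axes named in the abstract (bit-width and granularity) concrete, the following is a minimal sketch of symmetric round-to-nearest weight quantization in plain Python. It is not the authors' code and uses none of the four libraries studied. The helper names (`quantize_symmetric`, `per_tensor_error`, `per_channel_error`, `ideal_size_reduction`), the toy weight matrix, and the 32-bit baseline are all assumptions made for illustration.

```python
import random


def quantize_symmetric(values, bits):
    """Symmetric round-to-nearest quantization of a list of floats.

    Returns (integer codes, scale). Hypothetical helper, not from the paper.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 3 for 3 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def per_tensor_error(matrix, bits):
    """Quantization error when one scale is shared by the whole matrix."""
    flat = [v for row in matrix for v in row]
    codes, scale = quantize_symmetric(flat, bits)
    return mse(flat, dequantize(codes, scale))


def per_channel_error(matrix, bits):
    """Quantization error when every row (channel) gets its own scale."""
    total, count = 0.0, 0
    for row in matrix:
        codes, scale = quantize_symmetric(row, bits)
        total += mse(row, dequantize(codes, scale)) * len(row)
        count += len(row)
    return total / count


def ideal_size_reduction(bits, baseline_bits=32):
    """Upper bound on size reduction if every weight were stored in `bits`.

    Assumes a 32-bit float baseline and ignores scales and other metadata.
    """
    return 1.0 - bits / baseline_bits


random.seed(0)
# Toy weight matrix (assumption): rows differ in magnitude by up to 100x,
# which is the situation where finer granularity is expected to help.
weights = [
    [random.gauss(0.0, 0.01 * 10 ** (i % 3)) for _ in range(64)]
    for i in range(8)
]

for bits in (8, 4, 3):
    print(
        f"{bits}-bit  per-tensor MSE={per_tensor_error(weights, bits):.3e}  "
        f"per-channel MSE={per_channel_error(weights, bits):.3e}  "
        f"ideal reduction={ideal_size_reduction(bits):.0%}"
    )
```

The ideal reduction printed here is only a bound for uniformly quantized weights. Whole-model figures such as those in the abstract are typically lower, since some tensors may stay in higher precision and the scales themselves need storage. The sketch covers integer formats only, so it does not model nf4, and it measures weight reconstruction error rather than word error rate.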