The DKU-DukeECE System for the Self-Supervision Speaker Verification Task of the 2021 VoxCeleb Speaker Recognition Challenge
Danwei Cai, Ming Li · 2021-09-07 · via eess.AS updates on arXiv.org

This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn self-supervised speaker representations with a deep neural network (DNN). The framework starts by training a self-supervised speaker embedding network that maximizes agreement between different segments of the same utterance via a contrastive loss. Taking advantage of a DNN's ability to learn from data with label noise, we cluster the speaker embeddings obtained from the previous speaker network and use the resulting cluster assignments as pseudo labels to train a new DNN. We then iteratively retrain the speaker network with pseudo labels generated in the previous step to bootstrap the discriminative power of the DNN. Visual modality data is also incorporated into this self-labeling framework: the visual and audio pseudo labels are fused with a cluster ensemble algorithm to generate a robust supervisory signal for representation learning. Our submission achieves an equal error rate (EER) of 5.58% and 5.59% on the challenge development and test sets, respectively.
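The iterative labeling loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm or the network, so k-means with a deterministic farthest-point initialization is assumed here, and `train_centroid_model` is a hypothetical stand-in for DNN training that simply fits one mean vector per pseudo class. All function names and parameters are illustrative.

```python
import numpy as np


def farthest_point_init(x, k):
    """Pick k initial centroids deterministically (assumed init, not from the paper)."""
    chosen = [0]
    min_d = ((x - x[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(min_d.argmax())  # point farthest from all chosen centroids
        chosen.append(nxt)
        min_d = np.minimum(min_d, ((x - x[nxt]) ** 2).sum(axis=1))
    return x[chosen].copy()


def assign(x, centroids):
    """Index of the nearest centroid for every row of x."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans_pseudo_labels(embeddings, num_clusters, num_iters=10):
    """Cluster embeddings; the cluster ids serve as pseudo speaker labels."""
    centroids = farthest_point_init(embeddings, num_clusters)
    for _ in range(num_iters):
        labels = assign(embeddings, centroids)
        for k in range(num_clusters):
            members = embeddings[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assign(embeddings, centroids)


def train_centroid_model(features, labels):
    """Hypothetical stand-in for training a DNN on pseudo labels."""
    means = np.stack([features[labels == c].mean(axis=0)
                      for c in np.unique(labels)])

    def embed(x):
        # "Embedding" = negative squared distance to each pseudo-class mean.
        return -((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)

    return embed


def iterative_pseudo_labeling(features, embed_fn, train_fn,
                              num_clusters, num_rounds):
    """Alternate between clustering embeddings and retraining on the clusters."""
    labels = None
    for _ in range(num_rounds):
        embeddings = embed_fn(features)                  # current speaker network
        labels = kmeans_pseudo_labels(embeddings, num_clusters)
        embed_fn = train_fn(features, labels)            # new network from pseudo labels
    return embed_fn, labels
```

In this sketch the initial `embed_fn` plays the role of the contrastively trained network; for a toy run it can be the identity, e.g. `iterative_pseudo_labeling(feats, lambda x: x, train_centroid_model, num_clusters=3, num_rounds=2)`. The audio-visual cluster ensemble step is left out.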