Synergizing Zero-Shot Cross-Lingual Alzheimer Detection with Language-Invariant Multimodal Bi-Geometric Adversarial Learning
[Submitted on 15 Jun 2026] · 2026-06-17 · via eess.AS updates on arXiv.org


Abstract: In this work, we study zero-shot cross-lingual speech-based Alzheimer's disease detection (SADD). We hypothesize that learning language-invariant multimodal representations by fusing multilingual speech and text pretrained models is essential for reliable transfer to unseen languages, as the two modalities capture complementary acoustic and linguistic markers of cognitive impairment while adversarial learning suppresses language-specific confounds. Empirical results in zero-shot cross-lingual evaluation substantiate the hypothesis, showing that multimodal fusion consistently outperforms unimodal baselines. To this end, we propose ORBIT, a novel framework that combines cross-attentive fusion, multi-tap language adversaries, and complementary spherical-hyperbolic geometric learning with consensus clustering. Across settings, ORBIT achieves the strongest performance compared to unimodal models and simple concatenation-based fusion baselines.
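The abstract names cross-attentive fusion of speech and text representations but does not describe ORBIT's architecture. The following is only a minimal sketch of generic scaled dot-product cross-attention between two modalities, written in plain Python. The function names (`cross_attend`, `fuse`), the mean pooling, the toy embeddings, and the absence of learned projections are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Assumes queries and keys share one dimension; a real model would
    apply learned projections first (omitted in this sketch).
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output per query.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def mean_pool(vectors):
    """Average a sequence of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def fuse(speech, text):
    """Hypothetical bidirectional fusion: each modality queries the other."""
    speech_ctx = cross_attend(speech, text, text)   # speech frames query text tokens
    text_ctx = cross_attend(text, speech, speech)   # text tokens query speech frames
    # Concatenate the two pooled, cross-attended summaries.
    return mean_pool(speech_ctx) + mean_pool(text_ctx)


# Toy embeddings standing in for pretrained speech and text features.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(speech, text)
```

Unlike plain concatenation of pooled features, each modality's summary here is conditioned on the other modality before the two are joined. In a domain-adversarial setup, a language classifier would typically be trained on such fused features through a gradient-reversal step; that part needs automatic differentiation and is left out of this sketch.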

Submission history

From: Mohd Akhtar Mujtaba
[v1] Mon, 15 Jun 2026 19:55:41 UTC (211 KB)