

Can We Hear from Events? Generating Speech from Event Camera
Jingping Fan · 2026-05-27 · via cs updates on arXiv.org


Abstract: Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at this https URL.
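The Temporal Granularity Mismatch described in the abstract can be illustrated with a small sketch. This is not the EventSpeech pipeline: the event timestamps, the roughly 30 fps frame period, and the 10 ms audio hop are all assumptions chosen for illustration, and `bin_events` is a hypothetical helper. It only shows that microsecond-stamped events can be grouped at an audio-like rate, whereas a single video frame merges the same activity into one bin.

```python
from collections import Counter


def bin_events(timestamps_us, hop_us):
    """Count events per consecutive bin of width hop_us (microseconds)."""
    counts = Counter(t // hop_us for t in timestamps_us)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical event timestamps (microseconds): a short burst near 12-14 ms,
# standing in for a fast articulatory movement.
events_us = [500, 12_100, 12_350, 12_900, 13_400, 13_950, 31_000]

# Assumed ~30 fps camera: one frame spans about 33,333 microseconds.
frame_bins = bin_events(events_us, hop_us=33_333)

# Assumed 10 ms audio feature hop.
audio_bins = bin_events(events_us, hop_us=10_000)

print(frame_bins)  # [7]
print(audio_bins)  # [1, 5, 0, 1]
```

At the frame rate, the burst is indistinguishable from the surrounding activity; at the audio hop, it stays localized in the second bin.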
Subjects: Multimedia (cs.MM); Sound (cs.SD)
Cite as: arXiv:2605.26672 [cs.MM]
  (or arXiv:2605.26672v1 [cs.MM] for this version)
  https://doi.org/10.48550/arXiv.2605.26672

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Lin Chen
[v1] Tue, 26 May 2026 08:11:27 UTC (34,426 KB)