BEAT: Rhythm-Elastic Alignment for Agentic Music-guided Movie Trailer Generation
Yutong Wang · 2026-05-27 · via cs.CV updates on arXiv.org


Abstract: Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.
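
The abstract does not specify how Bar-DP works, so the following is only a toy sketch of the general idea of an energy-adaptive, many-to-one alignment, not the authors' algorithm. It assumes a hypothetical similarity matrix `sim[shot, bar]` (for example, from some cross-modal encoder), a hypothetical per-bar energy in [0, 1], and candidate shots that are already in playback order. Holding a shot across a bar boundary is penalized in proportion to that bar's energy, so quiet bars tend to share one sustained shot while loud bars tend to trigger cuts. The function name `elastic_align` and the `hold_cost` parameter are illustrative.

```python
import numpy as np


def elastic_align(sim, energy, hold_cost=1.0):
    """Toy monotonic DP that assigns one shot index to every bar.

    sim[i, j] : assumed similarity between candidate shot i and bar j
    energy[j] : assumed normalized energy of bar j, in [0, 1]
    hold_cost : penalty scale for keeping the same shot into bar j
    Returns a non-decreasing list of shot indices, one per bar.
    """
    sim = np.asarray(sim, dtype=float)
    energy = np.asarray(energy, dtype=float)
    n_shots, n_bars = sim.shape

    score = np.full((n_shots, n_bars), -np.inf)
    back = np.zeros((n_shots, n_bars), dtype=int)
    score[:, 0] = sim[:, 0]  # any shot may open the sequence

    for j in range(1, n_bars):
        best_prev, best_idx = -np.inf, -1  # best score among earlier shots
        for i in range(n_shots):
            # Option 1: hold shot i over the boundary (costly on loud bars).
            hold = score[i, j - 1] - hold_cost * energy[j]
            # Option 2: cut to shot i from the best strictly earlier shot.
            cut = best_prev
            if hold >= cut:
                score[i, j], back[i, j] = hold + sim[i, j], i
            else:
                score[i, j], back[i, j] = cut + sim[i, j], best_idx
            # Extend the running maximum so that it covers shots 0..i.
            if score[i, j - 1] > best_prev:
                best_prev, best_idx = score[i, j - 1], i

    # Backtrack from the best final state.
    i = int(np.argmax(score[:, -1]))
    path = [i]
    for j in range(n_bars - 1, 0, -1):
        i = int(back[i, j])
        path.append(i)
    return path[::-1]


# Two quiet bars followed by two loud bars (made-up numbers).
sim = [[0.9, 0.8, 0.1, 0.1],
       [0.1, 0.2, 0.9, 0.6],
       [0.1, 0.1, 0.5, 0.9]]
energy = [0.1, 0.1, 0.9, 0.9]
print(elastic_align(sim, energy))  # [0, 0, 1, 2]
```

In this made-up example, shot 0 is held across both quiet bars, and each loud bar gets its own shot. A real system would also need shot selection from the full film, duration constraints, and learned features, none of which are modeled here.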
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2605.27067 [cs.CV]
  (or arXiv:2605.27067v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2605.27067

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yutong Wang
[v1] Tue, 26 May 2026 14:18:13 UTC (4,338 KB)