Spatial-aware Vision Language Model for Autonomous Driving
Weijie Wei · 2026-05-26 · via cs.CV updates on arXiv.org

Abstract: While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance that disparate 3D data introduces into pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, preserving the stability of the VLM and its existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.
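
The abstract does not say how the Gradual Fusion Q-Former injects LiDAR features incrementally. As a rough illustration of the general idea only, the sketch below uses a tanh-gated residual, a common way to add a new modality without disturbing a pre-trained model at the start of training. It is not the authors' implementation: the function name `gated_inject`, the gate parameter, and the toy feature values are all hypothetical, and in a real model the gate would typically be a learned parameter rather than a value set by hand.

```python
import math
from typing import List

Vector = List[float]


def gated_inject(visual: Vector, lidar: Vector, gate_param: float) -> Vector:
    """Add LiDAR features to visual features through a tanh gate.

    Hypothetical helper, not taken from the paper. With gate_param = 0.0,
    tanh(0) = 0, so the output equals `visual` and the pre-trained model
    sees exactly the input it was trained on.
    """
    if len(visual) != len(lidar):
        raise ValueError("visual and lidar features must have the same length")
    gate = math.tanh(gate_param)
    return [v + gate * l for v, l in zip(visual, lidar)]


# Toy feature vectors (made-up values, same dimension for simplicity).
visual_tokens = [0.5, -1.0, 2.0]
lidar_tokens = [1.0, 1.0, -1.0]

# Increasing the gate parameter lets more of the LiDAR signal through.
for gate_param in (0.0, 0.5, 2.0):
    fused = gated_inject(visual_tokens, lidar_tokens, gate_param)
    print(f"gate={math.tanh(gate_param):.3f} fused={fused}")
```

With the gate at zero the fused features are identical to the visual features, which is the property that makes this kind of injection "gradual": the LiDAR contribution grows only as the gate opens.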
Comments: Accepted to CVPR AutoPilot Workshop 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2512.24331 [cs.CV]
  (or arXiv:2512.24331v2 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2512.24331

arXiv-issued DOI via DataCite

Submission history

From: Weijie Wei
[v1] Tue, 30 Dec 2025 16:35:00 UTC (945 KB)
[v2] Mon, 25 May 2026 12:17:46 UTC (945 KB)