Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

On the Robustness of Watermarking for Autoregressive Image Generation

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits

Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance

Retinal Cyst Detection from Optical Coherence Tomography Images

LoViF 2026 The First Challenge on Weather Removal in Videos

STORM: End-to-End Referring Multi-Object Tracking in Videos

Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

Rethinking the Diffusion Model from a Langevin Perspective

Zero-shot World Models Are Developmentally Efficient Learners

Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer

Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

Cross-Cultural Value Awareness in Large Vision-Language Models

I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers

GLEaN: A Text-to-image Bias Detection Approach for Public Comprehension

From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Efficient Personalization of Generative User Interfaces

Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

PhysInOne: Visual Physics Learning and Reasoning in One Suite

Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Neural Distribution Prior for LiDAR Out-of-Distribution Detection

Adding Another Dimension to Image-based Animal Detection

Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

Detecting Diffusion-generated Images via Dynamic Assembly Forests

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion

Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift

Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios

Adaptive Dual Residual U-Net with Attention Gate and Multiscale Spatial Attention Mechanisms (ADRUwAMS)

MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification

BIAS: A Biologically Inspired Algorithm for Video Saliency Detection

DeFakeQ: Enabling Real-Time Deepfake Detection on Edge Devices via Adaptive Bidirectional Quantization

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

CatalogStitch: Dimension-Aware and Occlusion-Preserving Object Compositing for Catalog Image Generation

Post-Hoc Guidance for Consistency Models by Joint Flow Distribution Learning

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII

State Space Models are Effective Sign Language Learners: Exploiting Phonological Compositionality for Vocabulary-Scale Recognition

Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring

Deep Learning-Based Tracking and Lineage Reconstruction of Ligament Breakup

Unified Multimodal Uncertain Inference

Unsupervised Local Plasticity in a Multi-Frequency VisNet Hierarchy

EfficientSign: An Attention-Enhanced Lightweight Architecture for Indian Sign Language Recognition

InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

On Semiotic-Grounded Interpretive Evaluation of Generative Art

Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting

From Selection to Scheduling: Federated Geometry-Aware Correction Makes Exemplar Replay Work Better under Continual Dynamic Heterogeneity

ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

Needle in a Haystack: One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology

A Semi-Automated Framework for 3D Reconstruction of Medieval Manuscript Miniatures

Detection of Hate and Threat in Digital Forensics: A Case-Driven Multimodal Approach

Orthogonal Quadratic Complements for Vision Transformer Feed-Forward Networks

HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models

Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search

Assessing Privacy Preservation and Utility in Online Vision-Language Models

R3PM-Net: Real-time, Robust, Real-world Point Matching Network

Tipiano: Cascaded Piano Hand Motion Synthesis via Fingertip Priors

Belief-Aware VLM Model for Human-like Reasoning

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition

FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation

CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Agentic Exploration of PDE Spaces using Latent Foundation Models for Parameterized Simulations

MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers

When & How to Write for Personalized Demand-aware Query Rewriting in Video Search

Relational Visual Similarity

Enhancing Geo-localization for Crowdsourced Flood Imagery via LLM-Guided Attention

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

OmniPrism: Learning Disentangled Visual Concept for Image Generation

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions
