CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation

Neural 3D Reconstruction of Planetary Surfaces from Descent-Phase Wide-Angle Imagery

Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)

Towards Patient-Specific Deformable Registration in Laparoscopic Surgery

GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization

3DRealHead: Few-Shot Detailed Head Avatar

PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction

Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation

StarVLA-$α$: Reducing Complexity in Vision-Language-Action Systems

On the Robustness of Watermarking for Autoregressive Image Generation

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction

A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Towards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

Panoptic Pairwise Distortion Graph

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Towards Automated Solar Panel Integrity: Hybrid Deep Feature Extraction for Advanced Surface Defect Identification

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits

ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance

Product Review Based on Optimized Facial Expression Detection

Retinal Cyst Detection from Optical Coherence Tomography Images

Lung Cancer Detection Using Deep Learning

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Architecture-Agnostic Modality-Isolated Gated Fusion for Robust Multi-Modal Prostate MRI Segmentation

Camyla: Scaling Autonomous Research in Medical Image Segmentation

LoViF 2026 The First Challenge on Weather Removal in Videos

A Lightweight Multi-Metric No-Reference Image Quality Assessment Framework for UAV Imaging

COREY: Entropy-Guided Runtime Chunk Scheduling for Selective Scan Kernels

GeoMeld: Toward Semantically Grounded Foundation Models for Remote Sensing

STORM: End-to-End Referring Multi-Object Tracking in Videos

Data-Efficient Surgical Phase Segmentation in Small-Incision Cataract Surgery: A Controlled Study of Vision Foundation Models

UDAPose: Unsupervised Domain Adaptation for Low-Light Human Pose Estimation

Rethinking the Diffusion Model from a Langevin Perspective

Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection

IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception

Multinex: Lightweight Low-light Image Enhancement via Multi-prior Retinex

Zero-shot World Models Are Developmentally Efficient Learners

Class-Adaptive Cooperative Perception for Multi-Class LiDAR-based 3D Object Detection in V2X Systems

FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts

Semantic Manipulation Localization

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

A Dual Cross-Attention Graph Learning Framework For Multimodal MRI-Based Major Depressive Disorder Detection

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

MatRes: Zero-Shot Test-Time Model Adaptation for Simultaneous Matching and Restoration

LVSum: A Benchmark for Timestamp-Aware Long Video Summarization

FREE-Switch: Frequency-based Dynamic LoRA Switch for Style Transfer

Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation

Cross-Cultural Value Awareness in Large Vision-Language Models

I Walk the Line: Examining the Role of Gestalt Continuity in Object Binding for Vision Transformers

GLEaN: A Text-to-image Bias Detection Approach for Public Comprehension

From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

Not Your Stereo-Typical Estimator: Combining Vision and Language for Volume Perception

Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

Efficient Personalization of Generative User Interfaces

PAS: Estimating the target accuracy before domain adaptation

Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

F3G-Avatar: Face Focused Full-body Gaussian Avatar

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

ACCIDENT: A Benchmark Dataset for Vehicle Accident Detection from Traffic Surveillance Videos

MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

PhysInOne: Visual Physics Learning and Reasoning in One Suite

Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

Neural Distribution Prior for LiDAR Out-of-Distribution Detection

Adding Another Dimension to Image-based Animal Detection

Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

Detecting Diffusion-generated Images via Dynamic Assembly Forests

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Domain-generalizable Face Anti-Spoofing with Patch-based Multi-tasking and Artifact Pattern Conversion

Dynamic Class-Aware Active Learning for Unbiased Satellite Image Segmentation

Low-Data Supervised Adaptation Outperforms Prompting for Cloud Segmentation Under Domain Shift
