Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes
Yifan Jiang, · 2026-05-12 · via cs.CV updates on arXiv.org

Abstract: Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors: frequent visual grounding errors even with visual ground truth, shallow search harnessing, and VLMs' inability to retrieve long-tail, unstructured local information. This striking gap exposes the limitations of current models in assisting humans with real-world scenarios that demand overwhelming visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of vision-language agents that seamlessly integrate fine-grained perception with robust knowledge search.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2602.00593 [cs.CV]
  (or arXiv:2602.00593v2 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2602.00593

arXiv-issued DOI via DataCite

Submission history

From: Cong Zhang
[v1] Sat, 31 Jan 2026 08:18:34 UTC (7,856 KB)
[v2] Sun, 10 May 2026 05:04:01 UTC (22,176 KB)