Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding
[Submitted on 15 Jun 2026] · 2026-06-16 · via cs.CV updates on arXiv.org


Abstract: While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at this https URL.
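
The routing idea in the abstract (score uncertainty from the first generated token, then calibrate a threshold so that difficult samples are rarely skipped) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the "first-token statistic" is the Shannon entropy of the first-token softmax distribution, and that calibration is a standard split-conformal quantile over held-out samples the model answered incorrectly without grounding. The names `first_token_entropy`, `calibrate_threshold`, and `needs_grounding` are hypothetical.

```python
import math


def first_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax over first-token logits.

    Assumption: entropy stands in for the paper's first-token statistic.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def calibrate_threshold(hard_scores, alpha):
    """Split-conformal threshold on uncertainty scores of hard samples.

    hard_scores: scores of calibration samples answered incorrectly
    without grounding (hypothetical calibration set).
    Returns tau such that, if a new hard sample is exchangeable with the
    calibration ones, it satisfies score >= tau (and is therefore routed
    to grounding) with probability at least 1 - alpha.
    """
    n = len(hard_scores)
    k = min(math.floor(alpha * (n + 1)), n)
    if k == 0:
        # Too few calibration points for this alpha: route everything.
        return float("-inf")
    return sorted(hard_scores)[k - 1]  # k-th smallest score


def needs_grounding(logits, tau):
    """Route to the expensive grounding path only when uncertain."""
    return first_token_entropy(logits) >= tau


# Toy calibration set of seven hard-sample scores (made-up numbers).
tau = calibrate_threshold([0.9, 0.3, 1.1, 0.5, 0.7, 1.3, 1.5], alpha=0.25)
confident = needs_grounding([10.0, 0.0, 0.0, 0.0], tau)  # peaked distribution
uncertain = needs_grounding([0.0, 0.0, 0.0, 0.0], tau)   # uniform distribution
```

With these toy numbers, the peaked distribution falls below the threshold and bypasses grounding, while the uniform one is routed to it. A smaller `alpha` lowers the threshold, which trades more grounding calls for higher recall of difficult samples.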

Submission history

From: Yifan Wang
[v1] Mon, 15 Jun 2026 03:17:44 UTC (26,712 KB)