Hide to See: Reasoning-Prefix Masking for Visually Anchored Thinking in VLM Distillation
Seonghoon Yu · 2026-05-13 · via cs.CV updates on arXiv.org


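Abstract: Recent think-answer approaches in vision-language models (VLMs), such as Qwen3-VL-Thinking, improve reasoning performance by generating intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. When distilling this capability into compact think-answer VLMs, a key goal is to improve the student's ability to use visual evidence in its reasoning trajectory, because think-answer trajectories suffer from visual forgetting. To this end, we introduce a new think-answer distillation framework that encourages the student to rely on visual information during its thinking by masking the student's salient reasoning prefixes. To compensate for these masked textual cues, the student is encouraged during distillation to rely more on visual evidence as an alternative source of information. Our masking strategy consists of: 1) token-wise salient reasoning-prefix masking, which selectively masks high-influence reasoning prefixes for each next-token prediction; and 2) a self-paced masking budget schedule, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. During distillation, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues and replaces the standard causal mask used in autoregressive language models. Experimental results show that our method outperforms recent open-source VLMs, VLM distillation methods, and self-distillation methods on multimodal reasoning benchmarks, and further analysis confirms improved visual utilization in the student's thinking process.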
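The two masking components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how salience is scored, how the budget is computed, or in which direction it moves with the teacher-student discrepancy. The function names, the `salience` matrix, the `max_ratio` and `scale` parameters, and the exponential budget rule are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return max(kl, 0.0)  # clamp tiny negative values caused by eps smoothing


def masking_budget(teacher_probs, student_probs, max_ratio=0.5, scale=1.0):
    """Hypothetical self-paced budget: fraction of reasoning-prefix tokens to hide.

    Assumption: the budget grows as the teacher-student gap shrinks, so an
    easier distillation step receives more masking. The abstract only says
    the budget follows the teacher-student discrepancy.
    """
    gap = kl_divergence(teacher_probs, student_probs)
    return max_ratio * math.exp(-scale * gap)


def salient_prefix_mask(salience, reasoning_start, ratio):
    """Build allowed[t][j]; True means query position t may attend to key j.

    salience[t][j] is a hypothetical influence score of token j on the
    prediction made at position t (for example, an attention weight).
    Positions before reasoning_start (image and prompt tokens) are never
    hidden, so visual evidence always stays visible.
    """
    n = len(salience)
    # Start from the standard causal mask: no access to future tokens.
    allowed = [[j <= t for j in range(n)] for t in range(n)]
    for t in range(reasoning_start, n):
        prefix = list(range(reasoning_start, t))  # earlier reasoning tokens only
        k = int(ratio * len(prefix))  # number of prefix tokens to hide
        most_salient = sorted(prefix, key=lambda j: salience[t][j], reverse=True)[:k]
        for j in most_salient:
            allowed[t][j] = False
    return allowed
```

In this sketch, `masking_budget` would be called once per training step on the teacher and student next-token distributions, and its result passed as `ratio` to `salient_prefix_mask`, whose output would stand in for the causal attention mask of the student.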
Comments: Preprint
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2605.11651 [cs.CV]
  (or arXiv:2605.11651v4 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2605.11651

arXiv-issued DOI via DataCite

Submission history

From: Seonghoon Yu [view email]
[v1] Tue, 12 May 2026 07:14:04 UTC (6,058 KB)
[v2] Wed, 13 May 2026 01:49:55 UTC (6,058 KB)
[v3] Fri, 15 May 2026 06:49:33 UTC (6,058 KB)
[v4] Tue, 26 May 2026 04:36:02 UTC (6,058 KB)