Detect Before You Leap: Mirage Detection in Vision-Language Models
Sayeed Shafayet Chowdhury, Md. Shaown Miah · 2026-05-30 · via cs.CV updates on arXiv.org

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where plausible but visually ungrounded responses may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to determine whether a VLM should answer or abstain before producing a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, allowing the method to track whether question-relevant visual evidence emerges across vision layers. The resulting alignment trajectory is summarized using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy with mirage rates below 3%, while baseline mirage rates range from 21.7% to 66.6%.
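The four trajectory summaries named in the abstract (final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `alignment_features`, the defaults `k`, `n_early`, and `n_late`, and the use of a least-squares line for the slope are all assumptions made for illustration. The sketch also assumes the patch tokens of every layer have already been projected into the shared CLIP embedding space, so it takes plain arrays as input rather than calling any CLIP library.

```python
import numpy as np


def _unit(x, eps=1e-8):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def alignment_features(patch_emb, image_emb, text_emb, k=5, n_early=4, n_late=4):
    """Summarize a layer-wise patch-text alignment trajectory.

    patch_emb : (L, P, D) patch tokens for L layers, assumed to be already
                projected into the shared image-text embedding space.
    image_emb : (D,) final pooled image embedding.
    text_emb  : (D,) question embedding.
    k, n_early, n_late are hypothetical defaults, not values from the paper.
    """
    patch_emb = np.asarray(patch_emb, dtype=float)
    text = _unit(np.asarray(text_emb, dtype=float))

    # Cosine similarity of every patch in every layer to the question: (L, P).
    sims = _unit(patch_emb) @ text

    # Per-layer alignment: mean of the k most question-aligned patches.
    k = min(k, sims.shape[1])
    trajectory = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # shape (L,)

    early = trajectory[:n_early].mean()
    late = trajectory[-n_late:].mean()

    # Slope of a least-squares line fitted through the trajectory.
    if len(trajectory) > 1:
        slope = np.polyfit(np.arange(len(trajectory)), trajectory, 1)[0]
    else:
        slope = 0.0

    final = _unit(np.asarray(image_emb, dtype=float)) @ text

    return {
        "final_cosine": float(final),
        "late_topk_alignment": float(late),
        "early_to_late_gain": float(late - early),
        "layer_slope": float(slope),
    }
```

Under these assumptions, an image whose patches become more question-aligned in deeper layers yields a positive gain and slope, while a blank or unrelated image yields a flat, low trajectory. In the pipeline the abstract describes, features like these would be one input to the ensemble alongside blank/noise detection, domain routing, and VLM self-assessment.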