

















Abstract: Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.27315 [cs.CL] |
| | (or arXiv:2605.27315v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.27315 (arXiv-issued DOI via DataCite, pending registration) |
From: Yifan Jiang
[v1] Tue, 26 May 2026 17:24:59 UTC (2,787 KB)