





















Abstract: Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens.
We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages.
We find that the gap is highly sensitive to rendering choices such as font and resolution, and that natural document images often exhibit much smaller gaps, suggesting the performance difference partly reflects evaluation artifacts rather than fundamental limitations.
Through a grounded-theory error analysis of over 4,000 examples, we identify the primary cause: image input alone suppresses reasoning effort, with models producing 5-19x shorter outputs that skip step-by-step computation or reasoning.
The reluctance to reason, not a failure of perception or knowledge retrieval, drives the performance gap, particularly on tasks requiring multi-step reasoning.
We show that a simple, lightweight on-policy self-distillation method, which fine-tunes models on their own text-mode reasoning traces paired with image inputs, closes this gap: it raises image-mode accuracy to match or exceed text-mode performance, an improvement of over 50%, and the gains transfer to unseen benchmarks without catastrophic forgetting. Overall, our results and analyses provide a systematic understanding of the modality gap and suggest a practical path toward improving visual text understanding in multimodal language models.
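
The data-construction step behind the self-distillation idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Example`, `DistillationPair`, `build_pairs`, `generate_from_text`, `render_as_image`, and the optional `keep` filter are all hypothetical, and the abstract does not say whether traces are filtered before fine-tuning.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    prompt: str  # the task content, available as plain text


@dataclass
class DistillationPair:
    image: bytes  # the same content rendered as an image (input side)
    target: str   # the model's own text-mode reasoning trace (output side)


def build_pairs(
    examples: List[Example],
    generate_from_text: Callable[[str], str],   # hypothetical: model run in text mode
    render_as_image: Callable[[str], bytes],    # hypothetical: text-to-image renderer
    keep: Optional[Callable[[Example, str], bool]] = None,  # assumed optional filter
) -> List[DistillationPair]:
    """Pair each rendered image with the trace the model produced from text."""
    pairs = []
    for ex in examples:
        # On-policy: the target comes from the same model that will be fine-tuned.
        trace = generate_from_text(ex.prompt)
        if keep is not None and not keep(ex, trace):
            continue
        pairs.append(DistillationPair(image=render_as_image(ex.prompt), target=trace))
    return pairs
```

Fine-tuning on such pairs would then teach the model to produce its longer text-mode reasoning when it sees the image input; the training loop itself is omitted here because the abstract gives no details about it.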
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.09095 [cs.CL] (or arXiv:2603.09095v2 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2603.09095 (arXiv-issued DOI via DataCite)
From: Kaiser Sun
[v1] Tue, 10 Mar 2026 02:14:23 UTC (2,088 KB)
[v2] Sun, 24 May 2026 00:25:17 UTC (1,939 KB)