

















Abstract:Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
| Comments: | ICML 2026 |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2509.24621 [cs.CV] |
| (or arXiv:2509.24621v3 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2509.24621 arXiv-issued DOI via DataCite |
From: Yuhan Zhu [view email]
[v1]
Mon, 29 Sep 2025 11:28:42 UTC (1,900 KB)
[v2]
Thu, 30 Apr 2026 20:31:56 UTC (2,583 KB)
[v3]
Mon, 25 May 2026 10:53:36 UTC (2,585 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。