JinaVDR: New Visual Document Retrieval Benchmark with 95 Tasks in 20 Languages
Maximilian Werk, Alex C-G · 2025-07-26 · via Jina AI

GitHub: jina-ai/jina-vdr. Jina VDR is a multilingual, multi-domain benchmark for visual document retrieval.

Collection: JinaVDR (Visual Document Retrieval), a jinaai Collection (max. ~1000 images and OCR text included).

We're releasing JinaVDR (Visual Document Retrieval), a new benchmark for evaluating how well models retrieve visually complex documents. JinaVDR encompasses multilingual documents with intricate layouts—combining graphs, charts, tables, text, and images alongside scanned copies and screenshots. The benchmark pairs these diverse visual documents with targeted text queries, enabling comprehensive evaluation of retrieval performance across real-world document complexity and broader domain coverage.

| Benchmark | Task focus | Languages | Number of tasks |
| --- | --- | --- | --- |
| JinaVDR | Visually rich documents | 20 languages | 95 |
| MIEB | Mostly natural images | 38 languages | 130 |
| ViDoRe v1 | Visually rich documents | English | 5 |
| ViDoRe v2 | Visually rich documents | English, French, Spanish, German | 4 |
Figure: JinaVDR statistics, showing query/document languages, domains and document formats (four pie charts: query language, dataset domains, document formats, and language usage).

JinaVDR spans diverse languages, domains, and document formats to reflect real-world retrieval scenarios. While English remains predominant in both queries and documents, the benchmark incorporates over a dozen additional languages, providing significantly broader multilingual coverage. The domains encompass historic documents, software documentation, medical records, legal texts, and scientific papers, capturing varied professional use cases. Document formats range from web pages and PDFs to scanned materials, presentation slides, and standalone images. Many datasets intentionally mix languages and formats, creating realistic conditions that challenge models to handle the complexity they encounter in practical applications.

How We Build JinaVDR

The JinaVDR benchmark provides an evaluation framework spanning 95 tasks across 20 languages, including domain-diverse and layout-rich documents like charts, maps, traditional scanned documents, Markdown files, and complex tables. It evaluates models through both visual question answering (for example, “How many civil lawsuits were dismissed at the Valladolid audience in 1855?”) and keyword querying (for example, “growth of the LED market across different regions”), giving a clearer assessment of retrieval capabilities on different document types found in the real world.

We used several techniques to construct JinaVDR, with a focus on data diversity and task authenticity:

First, we repurposed existing benchmarks by converting OCR datasets into retrieval tasks using rule-based query templates (such as transforming MPMQA data) and reformatting question-answering datasets into retrieval scenarios:

Figure: Sample MPMQA document and query from JinaVDR benchmark (instructions for connecting a smartwatch to a smartphone, with a highlighted query).
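To make the reformatting step concrete, here is a minimal sketch of how question-answering records could be turned into a retrieval task. The record fields (`image`, `question`) and the ID scheme are assumptions made for illustration, not the layout of MPMQA or of the JinaVDR pipeline.

```python
from collections import defaultdict

def qa_to_retrieval(qa_records):
    """Turn QA records into retrieval-style queries, corpus, and relevance labels."""
    corpus = {}                # doc_id -> page image path
    queries = {}               # query_id -> query text
    qrels = defaultdict(dict)  # query_id -> {doc_id: relevance}
    doc_ids = {}               # page image path -> doc_id (so shared pages are stored once)
    for i, rec in enumerate(qa_records):
        image = rec["image"]
        if image not in doc_ids:
            doc_ids[image] = f"d{len(doc_ids)}"
            corpus[doc_ids[image]] = image
        qid = f"q{i}"
        queries[qid] = rec["question"]
        # The page the question was asked about becomes the single relevant document.
        qrels[qid][doc_ids[image]] = 1
    return queries, corpus, dict(qrels)

# Hypothetical records, for illustration only.
records = [
    {"image": "manual_p1.png", "question": "How do I pair the watch with a phone?"},
    {"image": "manual_p1.png", "question": "Which app is needed for pairing?"},
    {"image": "manual_p2.png", "question": "How do I charge the watch?"},
]
queries, corpus, qrels = qa_to_retrieval(records)
```

The answer text is not needed here: in a retrieval setting the model is only asked to find the page that contains the answer.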

Second, we manually annotated existing PDF datasets including StanfordSlides, TextbookQA, and ShanghaiMasterPlan, to create high-quality retrieval pairs:

Figure: Sample StanfordSlides document and query from JinaVDR benchmark (slide titled "Advantages of Perforated Metal" with a list of benefits).

Our third approach involved synthetic query and/or document generation, where we used existing document collections from sources like Europeana to create contextually relevant queries with Qwen2-VL-7B-Instruct, along with EasyOCR text descriptions:

Figure: Sample Europeana document and query from JinaVDR benchmark, including English query translation for reference (historical newspaper page from December 2, 1854, with German text).

We also rendered tabular datasets into visual tables and generated corresponding queries through templates derived from the original text data, as demonstrated in our AirBnBRetrieval task.
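As a rough sketch of this idea, the snippet below renders rows as an HTML table and fills a query template from the same row data. The column names, the template wording, and the choice of HTML as the intermediate format are all assumptions for illustration. A real pipeline would also rasterize the table to an image, which is omitted here.

```python
import html

def render_html_table(columns, rows):
    """Render rows as an HTML table string (the input to a later rasterization step)."""
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(row[c]))}</td>" for c in columns) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

# Hypothetical template; each placeholder must be a key of the row dict.
TEMPLATE = "{room_type} in {city} at {price} per night"

def query_from_row(row):
    """Build a text query from the same data that was rendered into the table."""
    return TEMPLATE.format(**row)

columns = ["room_type", "city", "price"]
rows = [
    {"room_type": "Private room", "city": "Lisbon", "price": 85},
    {"room_type": "Entire home", "city": "Porto", "price": 140},
]
table_markup = render_html_table(columns, rows)
row_queries = [query_from_row(r) for r in rows]
```

Because the query is derived from the row rather than from the rendered image, the relevance label comes for free: the table containing that row is the target document.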

Finally, we repurposed existing crawled datasets with article-chart pairs, where we use text snippets from the articles as queries and corresponding charts as target documents, as shown in our OWIDRetrieval dataset:

Figure: Sample OWIDRetrieval document and query from JinaVDR benchmark (color-coded world map showing the share of one-year-olds vaccinated against Hepatitis B in 2021, with accompanying text).

This multi-faceted approach gives us comprehensive coverage across document types, languages, and retrieval scenarios.

Existing Benchmarks

Developing truly multimodal models (that can handle visually complex documents) requires benchmarks that go beyond traditional text-only evaluation methods. Frameworks like MTEB (Massive Text Embedding Benchmark) may work well for evaluating text retrieval across different domains and languages, but they’re not built for searching through documents where accurate retrieval depends on visual layout, charts, tables, and formatting. This is where visual document retrieval benchmarks (like the ViDoRe series) and image retrieval benchmarks (like MIEB, the Massive Image Embedding Benchmark) come in.

The ColPali paper introduced ViDoRe v1, which combines five English-language datasets, both academic and synthetic. The benchmark focuses on single-page documents that work well with optical character recognition (OCR), covers narrow domains like scientific papers and healthcare, and uses extractive queries where search terms often appear directly in target documents.

Figure: Samples from ViDoRe v1 benchmark dataset (professional pages with topics like "Tax-efficient investing" and "Who we are?", plus graphs).

After models like ColPali hit a score of 90% nDCG@5 on ViDoRe v1, a new benchmark was needed. ViDoRe v2 improved upon v1 by supporting longer and cross-document queries, blind contextual querying, and more languages (French, German, and Spanish on top of English). Both benchmarks still have limited language diversity and narrow domain coverage, leaving gaps for evaluating new retrieval systems.

Figure: Samples from ViDoRe v2 benchmark dataset (a professional brochure layout with community support themes, such as an elderly woman baking).
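For readers unfamiliar with the nDCG@5 metric quoted for ViDoRe v1, here is a minimal sketch of how it can be computed for a single query. It uses the linear-gain form of DCG. Some implementations use an exponential gain instead, and the two agree when relevance is binary. The document IDs are made up for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, relevant, k=5):
    """nDCG@k for one query; `relevant` maps doc_id -> graded relevance (missing means 0)."""
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevant.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# The only relevant page is ranked second, so the score is 1 / log2(3).
score = ndcg_at_k(["a", "b", "c"], {"b": 1})
```

A benchmark score is then the mean of this value over all queries in a task, so 90% means the relevant pages almost always sit at or near the top of the first five results.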

MIEB takes a different approach by focusing on visual embeddings across 130 tasks, including tasks beyond retrieval. However, it mostly evaluates images without much text content, rather than visually rich documents. While the benchmark is excellent at testing visual understanding capabilities, it is less suited to evaluating retrieval that depends on both visual layout and textual content.

Figure: Samples from MIEB benchmark (Massive Image Embedding Benchmark with 130 tasks in 38 languages, including Retrieval, Clustering, and VisualSTS).

Our aim with the JinaVDR (Visual Document Retrieval) benchmark is to expand upon the work of these prior benchmarks by incorporating visually rich multilingual documents with complex layouts like graphs, charts, and tables (mixed with text and images), as well as adding real-world queries and questions.

Evaluating Embeddings on JinaVDR

You can run the benchmark for yourself with the code on our GitHub repo.

Our benchmarking results show that many recent embedding models struggle with JinaVDR’s wide range of visual document tasks, while OCR-based baselines and older models show even weaker results, especially on non-English and structured-document datasets. We included BM25 with OCR for all datasets where simple text extraction made such retrieval feasible.
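For context on the text baseline, the following is a minimal sketch of Okapi BM25 scoring over OCR text. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the sample strings are assumptions for illustration. This is not the implementation used to produce the benchmark numbers.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document's OCR text against the query (assumes at least one document)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical OCR output for three pages.
ocr_texts = [
    "led market growth by region",
    "civil lawsuits dismissed in 1855",
    "vaccination rates by country",
]
scores = bm25_scores("growth of the LED market", ocr_texts)
```

A baseline like this can only match terms that OCR actually extracted, so anything conveyed purely by layout, charts, or images is invisible to it.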

An exception to this is jina-embeddings-v4. Our results suggest that its multimodal embedding approach handles complex and multilingual document retrieval better than earlier-generation models or traditional OCR-based pipelines. The model's multi-vector mode gives the best average performance (70.89%, against 61.39% for its single-vector mode), although the single-vector mode scores higher on DonutVQA and github-readme-retrieval-ml-filtered (ru). The multi-vector mode avoids the compression limitations of single-vector approaches: a single vector has to cram an entire page's content into one representation, which makes it difficult to capture specific details, while the multi-vector approach maintains the granular information needed for precise retrieval of similar documents.
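The compression argument can be illustrated with a toy comparison between a pooled single-vector score and a late-interaction (MaxSim) score of the kind popularized by ColBERT-style models. Mean pooling is used here only as a stand-in for single-vector compression, and the two-dimensional vectors are invented. Nothing below is claimed to reflect how jina-embeddings-v4 computes either representation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(vectors):
    """Collapse a list of token or patch vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def single_vector_score(query_vecs, doc_vecs):
    """Compare one pooled query vector with one pooled document vector."""
    return cosine(mean_pool(query_vecs), mean_pool(doc_vecs))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query vector takes its best-matching document vector."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0]]
# Page A holds one patch that matches the query exactly, among unrelated patches.
page_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
# Page B is only loosely related to the query everywhere.
page_b = [[0.8, 0.6]] * 4

pooled = (single_vector_score(query, page_a), single_vector_score(query, page_b))
late = (maxsim_score(query, page_a), maxsim_score(query, page_b))
```

In this toy setup, pooling dilutes the one matching patch on page A, so the pooled score ranks page B higher, while MaxSim still finds the exact match and ranks page A first.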

Figure 9: Model performance on JinaVDR benchmark, averaged over all tasks (bar chart titled "Average Model Performance on Jina VDR Benchmark").
| Model | Average | medical-prescriptions | DonutVQA | TableVQA | europeana-de-news | europeana-es-news | europeana-it-scans | europeana-nl-legal | hindi-gov-vqa | jdocqa_jp | wikimedia-commons-documents (ar) | github-readme-retrieval-ml-filtered (ru) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BM25 + OCR | 26.67% | 38.18% | 19.39% | 35.64% | 11.26% | 51.99% | 39.11% | 34.97% | 1.83% | 1.64% | 19.60% | 39.78% |
| jina-embeddings-v3 + OCR | 27.49% | 37.25% | 2.60% | 34.24% | 12.05% | 44.03% | 38.69% | 29.07% | 7.52% | 7.79% | 38.06% | 51.07% |
| jina-clip-v2 | 17.79% | 15.66% | 1.63% | 21.06% | 11.19% | 13.14% | 16.23% | 9.79% | 5.02% | 19.91% | 45.29% | 36.80% |
| colpali-v1.2 | 46.44% | 83.91% | 32.53% | 54.66% | 34.64% | 44.74% | 54.32% | 30.89% | 13.04% | 39.45% | 41.96% | 80.67% |
| colqwen2-v0.1 | 58.26% | 77.72% | 46.34% | 57.52% | 53.42% | 74.28% | 71.23% | 46.13% | 20.53% | 74.38% | 36.94% | 82.39% |
| MrLight/dse-qwen2-2b-mrl-v1 | 47.95% | 38.22% | 25.31% | 57.39% | 44.75% | 60.58% | 53.92% | 29.50% | 9.80% | 66.73% | 62.47% | 78.77% |
| jina-embeddings-v4 (single-vector) | 61.39% | 81.17% | 78.48% | 58.90% | 49.05% | 60.10% | 57.88% | 37.14% | 15.40% | 75.57% | 72.07% | 89.55% |
| jina-embeddings-v4 (multivector) | 70.89% | 97.95% | 73.55% | 60.91% | 65.65% | 80.58% | 73.14% | 54.15% | 21.94% | 82.34% | 81.19% | 88.39% |

MTEB Integration

Pull request: "dataset: Add JinaVDR" by maximilianwerk, Pull Request #2942, embeddings-benchmark/mteb (GitHub).

Since MTEB has become the de facto standard for retrieval benchmarking, we're integrating JinaVDR directly into the MTEB framework to maximize adoption and ease of use. This makes it easier for researchers to run visual retrieval models on our benchmark using familiar evaluation infrastructure. However, migrating our data to the BEIR format did require some trade-offs, like not including OCR results in the MTEB version. This means that traditional text-based methods like BM25 can't be run directly on the MTEB version, which reinforces the focus on visual document understanding rather than falling back to text-based retrieval methods.

Limitations

To build a comprehensive benchmark from a wide range of sources, we had to preprocess the data carefully to ensure both practical usability and evaluation quality. We applied size normalization by subsampling each dataset to a maximum of 1,000 examples (down from thousands or tens of thousands), which keeps the benchmark practical to run while maintaining good coverage across tasks. This constraint was especially important given the high levels of compute needed to process high-resolution visual documents.
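A size cap like this can be sketched in a few lines. The fixed seed and the use of uniform random sampling are assumptions made for illustration. The text above does not say how the subsample was drawn.

```python
import random

def subsample(examples, max_size=1000, seed=42):
    """Return at most max_size examples, sampled reproducibly without replacement."""
    if len(examples) <= max_size:
        return list(examples)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(examples, max_size)

# A hypothetical dataset of 25,000 page IDs is cut down to 1,000.
page_ids = [f"page_{i}" for i in range(25_000)]
kept = subsample(page_ids)
```

In a retrieval dataset, the queries and relevance labels have to be filtered to match the documents that survive the cut, otherwise some queries would point at pages that are no longer in the corpus.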

We used quality filtering to address several challenges common in real-world document collections. While poor image quality in scanned documents often reflects realistic use cases, it made it tougher to control the quality of synthetic data. We implemented consistency filtering to remove duplicates (which are common across large document collections) and used LLMs to filter out low-quality queries that wouldn't provide useful evaluation signals, such as overly generic questions like "What can you see in the chart?". For synthetic data generation, we ran into limitations in query diversity despite using varied prompting strategies, and needed to perform manual curation to ensure enough evaluation coverage across different retrieval scenarios.
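The two filtering steps can be approximated with simple stand-ins. In the sketch below, exact duplicates are detected by hashing file contents, and a keyword rule replaces the LLM-based query filter described above. Both are simplifying assumptions, and the phrase list is hypothetical.

```python
import hashlib

def deduplicate(docs):
    """Keep the first doc_id for each distinct content; docs is an iterable of (doc_id, bytes)."""
    seen = set()
    kept = []
    for doc_id, content in docs:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc_id)
    return kept

# Hypothetical phrases that mark a query as too generic to be a useful test.
GENERIC_PATTERNS = ("what can you see", "what is shown", "describe the")

def is_generic(query):
    """Rule-based stand-in for an LLM judgment of query quality."""
    lowered = query.lower()
    return any(pattern in lowered for pattern in GENERIC_PATTERNS)

unique_ids = deduplicate([("a", b"scan-1"), ("b", b"scan-1"), ("c", b"scan-2")])
candidates = [
    "What can you see in the chart?",
    "growth of the LED market across different regions",
]
useful = [q for q in candidates if not is_generic(q)]
```

Exact hashing only catches byte-identical files. Near-duplicates such as rescans of the same page would need a perceptual or embedding-based comparison instead.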

Conclusion

Visual document retrieval evaluation has reached a point where traditional text-based benchmarks no longer capture how people actually search for and consume information. JinaVDR addresses this gap by providing evaluation across a range of tasks and languages far exceeding previous visual document retrieval benchmarks such as ViDoRe v1 and v2.

Moving forwards, the industry needs benchmarks that reflect genuine retrieval challenges rather than artificial constraints. As organizations come to rely more on visual document retrieval for tasks from legal research to medical diagnostics, evaluation frameworks have to evolve beyond narrow academic datasets toward the messy, multilingual, and visually complex documents that we find in the real world. JinaVDR is just the first step in building retrieval systems that truly understand how visual and textual information work together in practice.