JinaVDR: New Visual Document Retrieval Benchmark with 95 Tasks in 20 Languages

We're releasing JinaVDR (Visual Document Retrieval), a new benchmark for evaluating how well models retrieve visually complex documents. JinaVDR covers multilingual documents with intricate layouts that combine graphs, charts, tables, text, and images, alongside scanned copies and screenshots. The benchmark pairs these documents with targeted text queries, so retrieval performance can be evaluated on realistic document complexity and across a broader set of domains.

Benchmark	Task focus	Languages	Number of tasks
JinaVDR	Visually rich documents	20 languages	95
MIEB	Mostly natural images	38 languages	130
ViDoRe v1	Visually rich documents	English	5
ViDoRe v2	Visually rich documents	English, French, Spanish, German	4

Figure: JinaVDR statistics, showing query and document languages, domains, and document formats

JinaVDR spans diverse languages, domains, and document formats to reflect real-world retrieval scenarios. While English remains predominant in both queries and documents, the benchmark incorporates over a dozen additional languages, providing significantly broader multilingual coverage. The domains encompass historic documents, software documentation, medical records, legal texts, and scientific papers, capturing varied professional use cases. Document formats range from web pages and PDFs to scanned materials, presentation slides, and standalone images. Many datasets intentionally mix languages and formats, creating realistic conditions that challenge models to handle the complexity they encounter in practical applications.

How We Build JinaVDR

The JinaVDR benchmark provides an evaluation framework spanning 95 tasks across 20 languages, including domain-diverse and layout-rich documents like charts, maps, traditional scanned documents, Markdown files, and complex tables. It evaluates models through both visual question answering (for example, “How many civil lawsuits were dismissed at the Valladolid audience in 1855?”) and keyword querying (for example, “growth of the LED market across different regions”), giving a clearer assessment of retrieval capabilities on different document types found in the real world.

We used four techniques to construct JinaVDR with a focus on data diversity and task authenticity:

First, we repurposed existing benchmarks by converting OCR datasets into retrieval tasks using rule-based query templates (such as transforming MPMQA data) and reformatting question-answering datasets into retrieval scenarios:

Figure: Sample MPMQA document and query from JinaVDR benchmark

Second, we manually annotated existing PDF datasets including StanfordSlides, TextbookQA, and ShanghaiMasterPlan, to create high-quality retrieval pairs:

Figure: Sample StanfordSlides document and query from JinaVDR benchmark

Our third approach involved synthetic query and/or document generation, where we used existing document collections from sources like Europeana to create contextually relevant queries with Qwen2-VL-7B-Instruct, along with EasyOCR text descriptions:

Figure: Sample Europeana document and query from JinaVDR benchmark, including English query translation for reference

We also rendered tabular datasets into visual tables and generated corresponding queries through templates derived from the original text data, as demonstrated in our AirBnBRetrieval task.
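
As a rough illustration of template-based query generation from tabular data, the sketch below fills rule-based templates from the text fields of each row and pairs every query with the table it came from. The column names, document IDs, and templates are hypothetical; the source does not describe the actual AirBnBRetrieval schema or templates, and rendering the table to an image is left out.

```python
from string import Template

# Hypothetical listing rows; the column names are illustrative only and are
# not the actual schema behind AirBnBRetrieval.
rows = [
    {"doc_id": "table-001", "city": "Lisbon", "room_type": "Private room", "price": 45},
    {"doc_id": "table-002", "city": "Berlin", "room_type": "Entire home", "price": 120},
]

# Hypothetical rule-based templates, filled from each row's text fields.
TEMPLATES = [
    Template("$room_type listings in $city"),
    Template("Which $city listing costs $price per night?"),
]


def make_pairs(rows, templates):
    """Return (query, doc_id) pairs; each query targets the table it was derived from."""
    pairs = []
    for row in rows:
        # Everything except the ID is available to the templates as text.
        fields = {key: str(value) for key, value in row.items() if key != "doc_id"}
        for template in templates:
            pairs.append((template.substitute(fields), row["doc_id"]))
    return pairs


pairs = make_pairs(rows, TEMPLATES)
for query, doc_id in pairs:
    print(doc_id, "<-", query)
```

Because the queries are derived from the original text rather than from the rendered image, the relevance label for each pair is known by construction.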

Finally, we repurposed existing crawled datasets with article-chart pairs, where we use text snippets from the articles as queries and corresponding charts as target documents, as shown in our OWIDRetrieval dataset:

Figure: Sample OWIDRetrieval document and query from JinaVDR benchmark

This multi-faceted approach gives us comprehensive coverage across document types, languages, and retrieval scenarios.

Existing Benchmarks

Developing truly multimodal models (that can handle visually complex documents) requires benchmarks that go beyond traditional text-only evaluation methods. Frameworks like MTEB (Massive Text Embedding Benchmark) may work well for evaluating text retrieval across different domains and languages, but they’re not built for searching through documents where accurate retrieval depends on visual layout, charts, tables, and formatting. This is where visual document retrieval benchmarks (like the ViDoRe series) and image retrieval benchmarks (like MIEB, the Massive Image Embedding Benchmark) come in.

The ColPali paper introduced ViDoRe v1, which combines five English-language datasets, both academic and synthetic. The benchmark focuses on single-page documents that work well with optical character recognition (OCR), covers narrow domains like scientific papers and healthcare, and uses extractive queries where search terms often appear directly in target documents.

Figure: Samples from ViDoRe v1 benchmark dataset

After models like ColPali hit a score of 90% nDCG@5 on ViDoRe v1, a new benchmark was needed. ViDoRe v2 improved upon v1 by supporting longer and cross-document queries, blind contextual querying, and more languages (French, German, and Spanish on top of English). Both benchmarks still have limited language diversity and narrow domain coverage, leaving gaps for evaluating new retrieval systems.
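
For readers unfamiliar with the metric, nDCG@5 rewards rankings that place relevant documents near the top of the first five results. The sketch below uses the linear-gain form of DCG; it is a minimal illustration and not the evaluation code used by ViDoRe or JinaVDR. The function names are our own.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k entries (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k for one query.

    ranked_relevances holds the relevance label of every candidate document,
    in the order the model ranked them (0 = not relevant).
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page, ranked second instead of first.
score = ndcg_at_k([0, 1, 0, 0, 0], k=5)
print(f"nDCG@5 = {score:.4f}")
```

A benchmark score is then the mean of this value over all queries, so 90% nDCG@5 means the relevant pages are almost always at or near rank one.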

Figure: Samples from ViDoRe v2 benchmark dataset

MIEB takes a different approach by focusing on visual embeddings across 130 tasks, including tasks beyond retrieval. However, it mostly evaluates images without much text content, rather than visually rich documents. While the benchmark is excellent at testing visual understanding capabilities, it is less suited to evaluating retrieval that depends on both visual layout and textual content.

Figure: Samples from MIEB benchmark

Our aim with the JinaVDR (Visual Document Retrieval) benchmark is to expand upon the work of these prior benchmarks by incorporating visually rich multilingual documents with complex layouts like graphs, charts, and tables (mixed with text and images), as well as adding real-world queries and questions.

Evaluating Embeddings on JinaVDR

You can run the benchmark for yourself with the code on our GitHub repo.

Our benchmarking results show that many recent embedding models struggle with JinaVDR’s wide range of visual document tasks, while OCR-based baselines and older models show even weaker results, especially on non-English and structured-document datasets. We included BM25 with OCR for all datasets where simple text extraction made such retrieval feasible.

An exception to this is jina-embeddings-v4. Our results suggest that its multimodal embedding approach handles complex and multilingual document retrieval better than earlier-generation models or traditional OCR-based pipelines. The model’s multi-vector mode achieves the best average score (70.89%, versus 61.39% for its single-vector mode), although single-vector is slightly ahead on a few datasets. The likely reason is that multi-vector retrieval avoids the compression limits of single-vector approaches: a single vector has to cram an entire page’s content into one representation, which makes it difficult to capture specific details, while the multi-vector approach keeps the granular information needed for precise retrieval of similar documents.
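
The compression argument can be made concrete with a toy example. The sketch below assumes ColBERT-style late interaction (MaxSim) for multi-vector scoring and mean pooling for the single-vector case. Both are assumptions made for illustration, since the text above does not specify how jina-embeddings-v4 scores or pools its vectors, and the two-dimensional vectors are invented.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_pool(vectors):
    """Collapse several vectors into one by averaging each dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def single_vector_score(query_vec, doc_vec):
    """Cosine similarity between one query vector and one document vector."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: each query vector takes its best-matching document vector."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy page with two unrelated regions (say, a chart and a caption).
doc_patches = [[1.0, 0.0], [0.0, 1.0]]
# Toy query that only concerns the second region.
query_tokens = [[0.0, 1.0]]

pooled = single_vector_score(mean_pool(query_tokens), mean_pool(doc_patches))
multi = maxsim_score(query_tokens, doc_patches)
print(f"single-vector: {pooled:.3f}, multi-vector: {multi:.3f}")
```

In this toy case the pooled page vector blends both regions, so the match is diluted, while MaxSim still finds the exact region the query asks about.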

Figure 9: Model performance on JinaVDR benchmark, averaged over all tasks

Model	Average	medical-prescriptions	DonutVQA	TableVQA	europeana-de-news	europeana-es-news	europeana-it-scans	europeana-nl-legal	hindi-gov-vqa	jdocqa_jp	wikimedia-commons-documents (ar)	github-readme-retrieval-ml-filtered (ru)
BM25 + OCR	26.67%	38.18%	19.39%	35.64%	11.26%	51.99%	39.11%	34.97%	1.83%	1.64%	19.60%	39.78%
`jina-embeddings-v3 + OCR`	27.49%	37.25%	2.60%	34.24%	12.05%	44.03%	38.69%	29.07%	7.52%	7.79%	38.06%	51.07%
jina-clip-v2	17.79%	15.66%	1.63%	21.06%	11.19%	13.14%	16.23%	9.79%	5.02%	19.91%	45.29%	36.80%
`colpali-v1.2`	46.44%	83.91%	32.53%	54.66%	34.64%	44.74%	54.32%	30.89%	13.04%	39.45%	41.96%	80.67%
`colqwen2-v0.1`	58.26%	77.72%	46.34%	57.52%	53.42%	74.28%	71.23%	46.13%	20.53%	74.38%	36.94%	82.39%
`MrLight/dse-qwen2-2b-mrl-v1`	47.95%	38.22%	25.31%	57.39%	44.75%	60.58%	53.92%	29.50%	9.80%	66.73%	62.47%	78.77%
`jina-embeddings-v4 (single-vector)`	61.39%	81.17%	78.48%	58.90%	49.05%	60.10%	57.88%	37.14%	15.40%	75.57%	72.07%	89.55%
`jina-embeddings-v4 (multivector)`	70.89%	97.95%	73.55%	60.91%	65.65%	80.58%	73.14%	54.15%	21.94%	82.34%	81.19%	88.39%
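
The Average column appears to be the unweighted mean of the eleven per-dataset scores shown in each row. A quick check using the BM25 + OCR row, with values copied from the table:

```python
# Per-dataset scores (in percent) for the BM25 + OCR row, in column order.
bm25_scores = [38.18, 19.39, 35.64, 11.26, 51.99, 39.11, 34.97, 1.83, 1.64, 19.60, 39.78]

# Unweighted mean: every dataset counts equally, regardless of its size.
average = sum(bm25_scores) / len(bm25_scores)
print(f"{average:.2f}%")  # prints 26.67%, matching the Average column
```

Because the mean is unweighted, a low score on one hard dataset (such as hindi-gov-vqa) pulls the average down as much as a low score on any other.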

MTEB Integration

Since MTEB has become the de facto standard for retrieval benchmarking, we're integrating JinaVDR directly into the MTEB framework to maximize adoption and ease of use. This makes it easier for researchers to run visual retrieval models on our benchmark using familiar evaluation infrastructure. However, migrating our data to the BEIR format did require some trade-offs, like not including OCR results in the MTEB version. As a result, traditional text-based methods like BM25 can't be run directly within the MTEB version, which keeps the focus on visual document understanding rather than on text-based fallbacks.

Limitations

To build a comprehensive benchmark from a wide range of sources, we had to preprocess the data carefully to ensure both practical usability and evaluation quality. We applied size normalization by subsampling each dataset to a maximum of 1,000 examples (down from thousands or tens of thousands), which keeps the benchmark practical to run while maintaining good coverage across tasks. This constraint was especially important given the compute required to process high-resolution visual documents.
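
A minimal sketch of this kind of size normalization, assuming uniform random sampling with a fixed seed. The source states only the 1,000-example cap, so the sampling strategy, the seed, and the function name are our assumptions.

```python
import random


def subsample(examples, max_size=1000, seed=42):
    """Cap a dataset at max_size examples using reproducible uniform sampling.

    The cap of 1,000 comes from the text; uniform sampling and the seed
    value are assumptions for illustration.
    """
    if len(examples) <= max_size:
        return list(examples)  # small datasets are kept whole
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(examples, max_size)


large = list(range(25_000))   # stand-in for a dataset with tens of thousands of examples
small = list(range(300))      # stand-in for a dataset already under the cap

print(len(subsample(large)), len(subsample(small)))
```

Fixing the seed means every run of the benchmark evaluates on the same subset, which keeps scores comparable across models.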

We used quality filtering to address several challenges common in real-world document collections. While poor image quality in scanned documents often reflects realistic use cases, it made it tougher to control the quality of synthetic data. We implemented consistency filtering to remove duplicates (which are common across large document collections) and used LLMs to filter out low-quality queries that wouldn't provide useful evaluation signals, such as overly generic questions like "What can you see in the chart?". For synthetic data generation, we ran into limitations in query diversity despite using varied prompting strategies, and needed to perform manual curation to ensure enough evaluation coverage across different retrieval scenarios.
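
The duplicate-removal step could look roughly like the sketch below, which treats two examples as duplicates when both the normalized query text and the exact image bytes match. This is an assumed, simplified criterion: the source does not say how duplicates were detected, and near-duplicate images or the LLM-based query filter would need more than a hash comparison.

```python
import hashlib


def deduplicate(pairs):
    """Drop repeated (query, image_bytes) pairs, keeping the first occurrence.

    Exact-match hashing is an assumption for illustration; it does not
    catch near-duplicates such as re-scans of the same page.
    """
    seen = set()
    unique = []
    for query, image_bytes in pairs:
        key = (query.strip().lower(), hashlib.sha256(image_bytes).hexdigest())
        if key not in seen:
            seen.add(key)
            unique.append((query, image_bytes))
    return unique


# Invented examples; the byte strings stand in for image file contents.
examples = [
    ("growth of the LED market", b"page-1-bytes"),
    ("Growth of the LED market ", b"page-1-bytes"),  # same pair, different casing/spacing
    ("growth of the LED market", b"page-2-bytes"),   # same query, different page
]

cleaned = deduplicate(examples)
print(len(cleaned))
```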

Conclusion

Traditional text-based benchmarks no longer capture the complexity of how people actually search for and consume information, and visual document retrieval evaluation has to account for that. JinaVDR addresses this gap by providing evaluation across a range of tasks and languages that far exceeds previous visual document retrieval benchmarks such as ViDoRe v1 and v2.

Moving forward, the industry needs benchmarks that reflect genuine retrieval challenges rather than artificial constraints. As organizations come to rely more on visual document retrieval for tasks from legal research to medical diagnostics, evaluation frameworks have to evolve beyond narrow academic datasets toward the messy, multilingual, and visually complex documents that we find in the real world. JinaVDR is just the first step in building retrieval systems that truly understand how visual and textual information work together in practice.
