





















Abstract: We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.
| Comments: | under review |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2604.11632 [cs.CL] (or arXiv:2604.11632v2 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2604.11632 (arXiv-issued DOI via DataCite) |
From: Xuefeng Wei
[v1] Mon, 13 Apr 2026 15:44:02 UTC (5,261 KB)
[v2] Mon, 25 May 2026 15:43:45 UTC (5,261 KB)