




















Just because a language model nails a question about a PDF doesn't mean it actually found the answer where it claims to.
Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory built a new benchmark called CiteVQA to expose this gap between getting the right answer and pointing to the right source. They call it "attribution hallucination."

Standard document analysis tests like DocVQA or MMLongBench-Doc only grade the final answer. They can't tell whether a model actually pulled information from the document or just guessed based on what it already knew. In law, financial audits, or medicine, though, traceability is what makes an AI output usable in the first place, the paper argues.
CiteVQA makes models back up every statement with a precise marker in the document. They have to point to the exact paragraph, table, or figure. A page number alone won't do. The dataset covers 1,897 questions across 711 PDFs from seven subject areas, 451 of them in English and 260 in Chinese. The documents average 40.6 pages each, way longer than in most benchmarks.
Rather than hand-labeling everything, the team built an automated pipeline. It breaks documents into individual elements, has models like Gemini 3.0 Flash trace the chain of evidence, and then checks which pieces are truly needed. Each piece gets pulled out on a trial basis. If the model can't answer the question without it, that piece counts as essential.
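The trial-removal check can be pictured with a small sketch. This is not the researchers' code: the function name `essential_elements`, the element IDs, and the `toy_oracle` stand-in for a real model call are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def essential_elements(
    elements: Sequence[str],
    answers_correctly: Callable[[Sequence[str]], bool],
) -> list[str]:
    """Return the elements whose removal makes the question unanswerable."""
    essential = []
    for i, element in enumerate(elements):
        # Pull out one element on a trial basis and keep the rest.
        remaining = [e for j, e in enumerate(elements) if j != i]
        # If the answer breaks without it, the element is essential.
        if not answers_correctly(remaining):
            essential.append(element)
    return essential


# Hypothetical stand-in for a model call: the question is answerable
# only when both the revenue table and the currency footnote are present.
def toy_oracle(context: Sequence[str]) -> bool:
    return "table:revenue" in context and "footnote:currency" in context


evidence = ["table:revenue", "para:intro", "footnote:currency"]
print(essential_elements(evidence, toy_oracle))
# ['table:revenue', 'footnote:currency']
```

In a real pipeline the oracle would be a model call plus an answer check, which is where most of the cost and noise would come from.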

The core metric is called Strict Attributed Accuracy. A model only gets points when the answer is correct and the citation lands on the right spot. Twenty current models were put through the test.
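A rough sketch shows how such a metric differs from plain answer accuracy. The article does not spell out how the benchmark decides that a citation "lands on the right spot", so the exact-match rule on element IDs below is an assumption, as are the `Prediction` class and the sample data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    answer_correct: bool      # did the model give the right answer?
    cited: frozenset[str]     # element IDs the model cited
    gold: frozenset[str]      # element IDs that actually hold the evidence


def answer_accuracy(preds: list[Prediction]) -> float:
    """Share of correct answers, ignoring citations (0 to 100)."""
    return 100.0 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: list[Prediction]) -> float:
    """Share of answers that are correct AND correctly cited (0 to 100).

    Assumption: a citation counts only if it matches the gold set exactly.
    """
    hits = sum(p.answer_correct and p.cited == p.gold for p in preds)
    return 100.0 * hits / len(preds)


gold = frozenset({"p3:table1"})
preds = [
    Prediction(True, frozenset({"p3:table1"}), gold),   # right answer, right source
    Prediction(True, frozenset({"p7:para2"}), gold),    # right answer, wrong source
    Prediction(False, frozenset({"p3:table1"}), gold),  # wrong answer, right source
    Prediction(True, frozenset({"p3:table1"}), gold),   # right answer, right source
]
print(answer_accuracy(preds))             # 75.0
print(strict_attributed_accuracy(preds))  # 50.0
```

The second prediction is the "attribution hallucination" case: it counts toward answer accuracy but not toward the strict score.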
The best performer, Gemini-3.1-Pro-Preview, scored just 76 out of 100. GPT-5.4 often knew the right answer but couldn't show its work: 87.1 for raw answer quality, just 59 once correct citations were required.
Open-source models fared much worse. Qwen3-VL-235B-A22B, the strongest freely available system, managed 22.5 points. Smaller open models mostly landed below 10, making them "extremely risky" for regulated industries, the researchers say.
Many models can't even find the correct page. The Gemini 3 series gets there in over 87 percent of cases. Qwen3-VL-235B-A22B manages just under 58 percent. Harder tasks make things worse. Single-document questions still work okay, but when a model has to pull together info from multiple documents, recall for Gemini 3.1 Pro Preview drops from around 69 to 55 percent.
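Page-level retrieval can be scored in a similar spirit. The sketch below assumes recall is the share of gold evidence pages that appear among the cited pages, averaged over questions. The article does not give the benchmark's exact definition, and the function names and sample data are made up.

```python
def page_recall(cited_pages: set[int], gold_pages: set[int]) -> float:
    """Fraction of gold evidence pages that the model actually cited."""
    if not gold_pages:
        raise ValueError("each question needs at least one gold page")
    return len(cited_pages & gold_pages) / len(gold_pages)


def mean_page_recall(samples: list[tuple[set[int], set[int]]]) -> float:
    """Average page recall over (cited, gold) pairs, as a percentage."""
    return 100.0 * sum(page_recall(c, g) for c, g in samples) / len(samples)


samples = [
    ({3, 4}, {3}),     # gold page found, one extra page cited
    ({10}, {10, 12}),  # only one of two gold pages found
    ({1}, {7}),        # wrong page entirely
]
print(mean_page_recall(samples))  # 50.0
```

Questions with evidence spread over several pages or documents are harder under this measure, since every missing page lowers the score.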

Math tasks do fairly well because the logic demands obvious evidence. Things fall apart when a model first has to spot a document element by its color, position, or heading, then figure out what it means. Academic papers with tidy layouts score best. Newspapers and magazines with busy designs hold even the top models to around 63 points.
In an ablation study, the researchers narrowed the search space on purpose, feeding models only the relevant pages or the right document. Scores jumped quickly, by more than 13 points for Qwen3-VL-8B.
The not-so-surprising takeaway: models that know where to look also give better answers. Accurate source information directly improves answer quality and is not just about transparency. This also points to why context engineering matters so much: an AI model performs best when it gets exactly the information it needs for the task.

The researchers posted their code and details on GitHub, and the dataset is up for download on Hugging Face.
A different benchmark from the Shanghai AI Laboratory, one of the two institutions behind CiteVQA, showed back in 2024 that language models struggle with long documents across the board. Its bilingual NeedleBench tests how well models dig up relevant info in lengthy English and Chinese texts, with similarly grim results.
Google DeepMind goes after a related problem with FACTS Grounding, which measures whether answers come strictly from the provided document or whether the model sneaks in outside knowledge. Even Gemini 3 Pro and GPT-5.1 don't come close to reliable scores.
OpenAI recently looked at why models guess instead of saying "I don't know." In an analysis, the company framed hallucinations as a systemic incentive problem. Training and evaluation reward confident answers and punish hedging. That same dynamic likely fuels the "attribution hallucination" that CiteVQA now catches in source citations.