






















Parser disagreement, OCR drift, hidden layers and semantic divergence silently corrupt RAG pipelines and training data. Measure the gap before it reaches production.
Built for researchers, security teams and AI engineers investigating document integrity at scale.
24,824 PDFs analyzed
1 in 3: parser disagreement
18.6%: DOJ corpus semantic drift
A PDF isn't one document. The page a person reads and the text a machine extracts are not guaranteed to match — and your AI ingests the machine's version. When they differ, the model learns, retrieves and answers from a version of the document no human ever saw.
Retrieval pipelines index extracted text. If extraction diverges from the rendered page, your assistant cites content that isn't there — confidently.
Fine-tuning on parsed PDFs bakes in extraction errors, hidden layers and reading-order scrambles at scale — invisibly.
When the value stored differs from the value shown — on signed forms, contracts and filings — automated review reaches the wrong conclusion.
None of this throws an error. It degrades answer quality and audit integrity quietly, until someone downstream is wrong and can't say why.
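The signed-form case can be made concrete with a minimal sketch. It assumes the stored value and the displayed text of each form field have already been extracted by some other tool. `FormField`, `stored_value`, `rendered_text` and `mismatched_fields` are hypothetical names chosen for illustration, not part of the PQ PDF engine.

```python
from dataclasses import dataclass


@dataclass
class FormField:
    """One form field, as seen by a machine and by a person (hypothetical model)."""
    name: str
    stored_value: str   # what a parser reads from the field's stored value
    rendered_text: str  # what a person sees once the page is rendered


def _canon(text: str) -> str:
    # Ignore whitespace and case so cosmetic differences are not flagged.
    return " ".join(text.split()).casefold()


def mismatched_fields(fields: list[FormField]) -> list[str]:
    """Return the names of fields whose stored value differs from what is shown."""
    return [f.name for f in fields if _canon(f.stored_value) != _canon(f.rendered_text)]


fields = [
    FormField("wages", "52000", "52000"),
    FormField("tax_owed", "0", "4,310"),
]
print(mismatched_fields(fields))
```

A real check would also need to normalize number and date formatting, which this sketch deliberately leaves out.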
18.6%
Across the 16,971-PDF DOJ Epstein release, 18.6% of files read differently to a machine than to a human — the extracted text layer diverged from the rendered page.
In an adversarial corpus, 502 of 1,572 PDFs (~1 in 3) produced materially different results across parsers. Among IRS tax forms, 43 of 44 exhibited semantic drift.
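The proportions quoted above can be checked with a few lines of arithmetic. The helper name `share` is illustrative only. The DOJ file count is an approximation derived from the rounded 18.6% figure, not a number published in the source.

```python
def share(count: int, total: int) -> float:
    """Return count/total as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


adversarial = share(502, 1572)     # parser disagreement in the adversarial corpus
irs_forms = share(43, 44)          # semantic drift among IRS tax forms
doj_files = round(0.186 * 16971)   # approximate file count implied by 18.6%

print(adversarial, irs_forms, doj_files)
```

So "~1 in 3" corresponds to roughly 31.9%, the IRS figure to about 97.7%, and 18.6% of the DOJ release to roughly 3,157 files.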
The same engine behind the research runs as a live scanner — 47 forensic engines that measure parser disagreement, value-vs-appearance drift, hidden layers and OCR-vs-render divergence. Upload a PDF and see exactly where machine and human readings split. Zero retention: files are deleted immediately after analysis. No external APIs, no third-party processing — analysis runs entirely within the PQ PDF environment.
Whether the text your AI ingests from a document matches what a human actually sees on the page. When they diverge, models retrieve, learn and answer from content no human read.
OCR error is only one source of drift. We also measure parser disagreement, hidden layers, reading-order scrambles and value-vs-appearance divergence — the gaps OCR metrics miss.
No. Analysis runs entirely within the PQ PDF environment — no external APIs, no third-party processing, and zero retention: files are deleted immediately after analysis.
Parser disagreement across six independent parsers, OCR-vs-render divergence, hidden or invisible layers, reading order, and value-vs-appearance (V/AP) mismatches — pinpointing where machine and human readings split.
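As a rough illustration of what a parser-disagreement measurement can look like, the sketch below compares text already extracted by several parsers and scores each pair. It is not the PQ PDF method. The parser names, the `disagreement` and `is_flagged` helpers, and the 0.10 threshold are assumptions made for this example, and similarity is computed with Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout noise does not count as drift."""
    return re.sub(r"\s+", " ", text).strip().lower()


def disagreement(outputs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Return 1 - similarity for every pair of parser outputs."""
    norm = {name: normalize(text) for name, text in outputs.items()}
    return {
        (a, b): 1.0 - SequenceMatcher(None, norm[a], norm[b], autojunk=False).ratio()
        for a, b in combinations(sorted(norm), 2)
    }


def is_flagged(outputs: dict[str, str], threshold: float = 0.10) -> bool:
    """Flag a document when any pair of parsers disagrees beyond the threshold."""
    scores = disagreement(outputs)
    return max(scores.values(), default=0.0) > threshold


# Hypothetical extractions of the same page by three parsers.
outputs = {
    "parser_a": "Total due: $1,200",
    "parser_b": "Total  due:  $1,200",
    "parser_c": "Total due: $9,900 hidden clause applies",
}
scores = disagreement(outputs)
print(scores[("parser_a", "parser_b")], is_flagged(outputs))
```

Character-level similarity is a blunt instrument: it catches gross divergence but says nothing about reading order or whether a changed digit matters, which is why a pairwise score like this is only a starting point.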
Yes. The scanner runs on individual files now, and the engine can be integrated or licensed for batch and pipeline use. Contact us for a demo or licensing.
For teams running RAG, document AI or LLM training at scale — let's talk about checking your corpus, integrating the engine, or licensing the technology.