























Abstract:Document-to-LLM applications typically read uploaded PDFs by first translating them into text through a hidden extraction layer that users cannot observe or audit. We show that this layer enables split-view PDFs: one document can have two semantic views before model reasoning. By mining specification-permitted or implementation-tolerated representation gaps at the PDF render/extract boundary, we instantiate 25 extraction gaps (EG) in which extractors return attacker-controlled or extractor-dependent text while the rendered page shows benign or different content. The gaps form four families: semantic overrides, hidden semantic injection, reading-order splits, and font-decoding splits, and 14 gaps have no exact path/mechanism-level match in prior PDF-to-LLM attacks.
We evaluate these gaps on 16 PDF processing stacks and 7 commercial LLM services. Each gap causes render-extract divergence on at least one stack. Under a gap-level exposure criterion, every evaluated service exposes at least one gap, with 12/25 to 21/25 exposed gaps. Exposure is driven mainly by the ingestion stack -- not model identity alone. We further show that tested safety filters cover only selected hidden-text constructions. To support triage, we develop a static screening scanner whose rules trigger on all 25 benchmark gaps, and discuss dual-view consistency as a longer-term defense direction.
From: Side Liu [view email]
[v1]
Fri, 12 Jun 2026 23:30:56 UTC (292 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。