Why your AI agent hallucinates PDF table data

Inside Nutrient

A guide to the invisible work behind documents Introducing Nutrient Documents for Salesforce: Native document generation and signing Document AI vs. traditional OCR: Choosing between OCR, AI, and hybrid pipelines PDF SDK compliance and security evaluation checklist for enterprise teams (2026) Invariant Corp replaces paper processes with Nutrient Workflow and scales without limits What is process mapping? A complete guide Nutrient vs. Conga Composer for Salesforce document generation (2026) Document routing: How to automate document distribution The CTO’s AI playbook: Why accountability architecture beats orchestration Compliance workflow automation: Why built-in compliance is table stakes Workflow diagrams: Examples, symbols, and how to build one that actually runs Digital forms: Replace paper forms with automated workflows Approval workflow software: How to automate approvals Why document-centric automation is different The CEO’s AI playbook: Why decision architecture beats model selection Nutrient SDK product updates for Q1 2026 PDF redaction verification: How to prove sensitive data is permanently removed What is a VPAT? The complete guide to accessibility conformance reports What is PDF/UA? The accessible PDF standard explained Salesforce eSignatures: Generate, sign, and track documents in one flow Online document viewer: Options, tradeoffs, and how to embed one Document viewer for web apps: React, Vue, Angular (2026) Best document viewers in 2026: A buyer’s guide How to edit a PDF in Python: Add text, images, and annotations Nutrient advances Workflow platform with agentic AI for enterprise-grade speed and consistency in document-heavy operations How to create a Salesforce quote template from opportunity data The business case for accessibility: Five ways it drives enterprise value Python PDF library comparison (2026): 7 libraries for developers PDF.js limitations: When to upgrade to a commercial PDF SDK How Subject scaled 5× with Nutrient’s PDF SDK without rebuilding its document layer I replaced our sales training with an AI coach that runs in Slack — here’s what broke Redirecting to: https://securitybuzz.com/cybersecurity-news/why-enterprise-permissions-are-ais-most-dangerous-inheritance/ Nutrient .NET SDK vs. iText Core: Complete comparison for .NET developers DocuVieware: Support’s most frequently asked setup questions Introducing Nutrient Workflow How to convert PDF to Word in C# (.NET) When email and spreadsheets stop working: Work order approval workflows for field teams on the move Compliance with confidence: Why document-centric automation is the foundation of your mission Nutrient expands AI Assistant, automating multistep document workflows inside any application What is document generation? A developer’s guide to PDF generation Document Converter data flow and how real-time watermarks skip the queue PDF/UA compliance guide: Requirements, standards, and best practices Computers still can’t understand you How Athena Intelligence built AI agents for regulated enterprises with Nutrient’s document infrastructure How to convert HTML to PDF (2026): 4 methods from browser print to SDK How to build a document extraction pipeline with Nutrient Vision API OCR vs. intelligent document processing: Choosing the right document extraction engine Beyond OCR: How document intelligence eliminates manual processing in regulated industries Nutrient vs. IronPDF: Complete comparison for .NET developers Nutrient vs. Aspose.PDF: Complete comparison for .NET developers Redirecting to: https://fortune.com/2026/02/19/openclaw-who-is-peter-steinberger-openai-sam-altman-anthropic-moltbook/ Lufthansa Systems uses Nutrient to deliver reliable, scalable PDF rendering for pilots worldwide Nutrient vs. Syncfusion: Complete comparison for .NET developers React’s useTransition: The hook you’re probably using wrong First City Monument Bank streamlines banking processes with Nutrient Workflow Redirecting to: https://www.sdcexec.com/warehousing/automation/article/22957364/nutrient-workflow-automation-the-missing-link-in-supply-chain-efficiency The complete guide to digital signatures: PAdES, CAdES, and XAdES explained Nutrient Python SDK: Production-grade document processing for Python Introducing agentic document editing for web applications with AI Assistant Nutrient vs. QuestPDF: Complete comparison for .NET developers How we fixed the GdPicture license expiration (and what to do if you’re affected) Red team security testing with agentic AI The future of healthcare document automation Best healthcare workflow software compared Nutrient SDK product updates for Q4 2025 How Harvey scaled legal document workflows 50 percent MoM without rebuilding infrastructure HIPAA-compliant document management in hospitals How we optimized rendering performance while handling thousands of annotations in React — Part 2 Automated PII removal with Nutrient API Redirecting to: https://www.devopsdigest.com/2026-low-code-no-code-predictions Redirecting to: https://www.kmworld.com/Articles/Editorial/ViewPoints/Leaders-predict-AI-to-continue-permeating-all-aspects-of-KM-in-2026-172594.aspx What are deep agents and how do they solve complex problems? Whipping up document magic: Your easy-bake recipe for Vue and Nutrient Web SDK 🧁 What I’ve learned about product iteration planning while building SDKs Passwordless document signing: Three-layer security guide New zip folder functionality streamlines file management in Document Automation Server The keyboard shortcuts playbook: Taking control of keyboard events in Nutrient Web SDK From experienced engineer to AI beginner: My unexpected journey AI-assisted manual testing: Handling Safari’s PDF rendering and UI quirks How to keep a 20-year-old SDK up to date How we optimized rendering performance while handling thousands of annotations in React — Part 1 Nutrient announces new executive hires to accelerate next phase of growth High performance UI using web workers Automate document conversion at scale with Python and Nutrient DCS From curiosity to PLG (and AI): My journey to understanding product-led growth Prost to progress: One year as Nutrient Pigeon usage at Nutrient: Bridging native SDKs to Flutter Modernizing CI build servers: How to migrate from Chef to Ansible Unix man pages: AI-friendly documentation since 1971 Consistent hashing for even load distribution Best AI redaction APIs: Complete comparison guide for 2025 Why AI document redaction matters for modern security From coding to coordinating: How AI transformed my workflow What is intelligent document processing (IDP)? A complete guide Enterprise PDF SDKs: Best PSPDFKit (now Nutrient) alternatives Nutrient SDK product updates for Q3 2025 GdPicture support best practices Redacting sensitive data with Nutrient AI redaction API How AI is transforming the customer experience at Nutrient: From instant answers to intelligent support How manual QA uses PR testing between releases

Jonathan D. Rhyne · 2026-04-15 · via Inside Nutrient

Your agent answered confidently. It was also wrong. The problem isn’t the model — it’s the extraction layer.

You asked your agent: “How many Cabinet seats does Ramos have?” The PDF contains a table with four columns: Position, Seats, Aquino, Ramos. The Cabinet row reads 20 seats, Aquino 15 percent, Ramos 5 percent. The correct answer is 5 percent.

Your agent returned 15 percent. That’s Aquino’s number, not Ramos’s. The agent never flagged uncertainty; it just picked the wrong column.

This isn’t a model problem. It’s an extraction problem.

Most AI agent frameworks — OpenClaw(opens in a new tab) included — default to PDF.js for PDF extraction. PDF.js was built to render PDFs in a browser. It wasn’t designed to recover document structure. It reads character positions off the page and concatenates them into a string. That works fine for running prose, but it fails on tables.

Here’s what PDF.js produces from the table above:

Senate 24 8.3 16.7 House of Representatives 202 9.4 10.4 Cabinet 20 15.0 5.0 Governor 73 5.4 5.4 Provincial Board Member 626 9.9 10.9

No column headers. No row boundaries. No cell alignment. Just a flat sequence of words and numbers. The LLM receives this and has to guess which number belongs to which column. Sometimes it guesses right. Often it doesn’t.

The model isn’t hallucinating in the usual sense. Rather, it’s doing its best with garbage input. The extraction layer destroyed the information the model needed to answer correctly.

Tables are the most obvious failure, but PDF.js loses structure in at least three ways.

Tables become word soup. Column relationships vanish. A four-column table turns into an unpunctuated stream of tokens. The model doesn’t have a way to reconstruct grid alignment from positional guessing. In our benchmark of 200 real documents, PDF.js scored 0.000 on table structure recovery. Not low — zero.

Headings vanish into body text. PDFs encode heading level through font size, weight, and spacing — none of which survives text extraction. A section heading and the paragraph below it merge into one block. The model loses the document’s outline. When asked “what does the methodology section say,” it may pull text from the wrong section entirely. PDF.js scored 0.000 on heading detection across the same 200 documents.

Reading order gets scrambled. Multicolumn layouts, sidebars, and numbered lists depend on spatial position. PDF.js reads left-to-right, top-to-bottom from the raw content stream. A two-column page produces interleaved sentences. Numbered steps arrive out of order. The model cites step 3 when it means step 5.

What the benchmark shows

We tested two extraction pipelines across 200 real-world documents using three scoring methods:

Normalized information distance (NID) measures overall text fidelity
Tree-edit distance similarity (TEDS) measures table structure recovery
Markdown heading score (MHS) measures heading detection accuracy

Results:

Metric	PDF.js	Nutrient	Change
Overall accuracy (NID)	0.578	0.880	+52%
Table structure (TEDS)	0.000	0.662	0% → 66%
Heading fidelity (MHS)	0.000	0.811	0% → 81%
Reading order	0.871	0.924	+6%

The overall accuracy gain is significant. The table and heading numbers tell the real story: PDF.js doesn’t attempt structure recovery at all. The scores aren’t low. They’re zero.

With Nutrient’s pdf-to-markdown(opens in a new tab) extraction, that same Cabinet table becomes:

| Position                | Seats  | Aquino | Ramos |
|-------------------------|--------|--------|-------|
| Senate                  | 24     | 8.3    | 16.7  |
| House of Reps           | 202    | 9.4    | 10.4  |
| Cabinet                 | 20     | 15.0   | 5.0   |
| Governor                | 73     | 5.4    | 5.4   |
| Provincial Board Member | 626    | 9.9    | 10.9  |

Cabinet row, Ramos column: 5 percent. The LLM doesn’t need to guess. Row and column boundaries are explicit. The answer is a lookup, not an inference.

The fix is two commands

If you’re running OpenClaw, install the Nutrient plugin and set it as the default PDF extraction engine:

openclaw plugins install @nutrient-sdk/openclaw-nutrient-pdf

openclaw config set agents.defaults.pdfExtraction.engine auto

The auto setting uses Nutrient extraction for PDFs and falls back to PDF.js for anything the plugin cannot handle. Processing runs locally — no documents leave your machine. No API keys are required. The free tier covers 1,000 documents per month.

The underlying problem

AI agent PDF extraction is treated as a solved problem. It isn’t. Dumping raw text into a context window works until the document contains a table, a heading hierarchy, or a multicolumn layout. Then the model confabulates, and the agent delivers the wrong answer with full confidence.

The fix isn’t a better model. It’s a better extractor. Structure-aware PDF-to-Markdown conversion preserves the relationships models need to answer correctly. Until your extraction pipeline recovers tables as tables and headings as headings, your agent will keep hallucinating tabular data.

Resources

OpenClaw Nutrient PDF plugin(opens in a new tab) — Installation and configuration guide
pdf-to-markdown library(opens in a new tab) — The extraction engine underneath the plugin

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

Inside Nutrient

What the benchmark shows

The fix is two commands

The underlying problem

Resources