    Why your AI agent hallucinates PDF table data

    Your agent answered confidently. It was also wrong. The problem isn’t the model — it’s the extraction layer.

    You asked your agent: “How many Cabinet seats does Ramos have?” The PDF contains a table with four columns: Position, Seats, Aquino, Ramos. The Cabinet row reads 20 seats, Aquino 15 percent, Ramos 5 percent. The correct answer is 5 percent.

    Your agent returned 15 percent. That’s Aquino’s number, not Ramos’s. The agent never flagged uncertainty; it just picked the wrong column.

    This isn’t a model problem. It’s an extraction problem.

    PDF.js extracts text, not structure

    Most AI agent frameworks — OpenClaw included — default to PDF.js for PDF extraction. PDF.js was built to render PDFs in a browser. It wasn’t designed to recover document structure. It reads character positions off the page and concatenates them into a string. That works fine for running prose, but it fails on tables.

    Here’s what PDF.js produces from the table above:

    Senate 24 8.3 16.7 House of Representatives 202 9.4 10.4 Cabinet 20 15.0 5.0 Governor 73 5.4 5.4 Provincial Board Member 626 9.9 10.9

    No column headers. No row boundaries. No cell alignment. Just a flat sequence of words and numbers. The LLM receives this and has to guess which number belongs to which column. Sometimes it guesses right. Often it doesn’t.

    The model isn’t hallucinating in the usual sense. Rather, it’s doing its best with garbage input. The extraction layer destroyed the information the model needed to answer correctly.
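The flattening pattern is easy to reproduce with mock data. The sketch below loosely imitates the shape of the positioned text items a renderer-oriented extractor yields; the strings and coordinates are made up for illustration and are not real pdf.js output:

```python
# Each item is (string, x, y): a simplified stand-in for the positioned
# text items a PDF content stream yields. Mock data for the Cabinet row.
items = [
    ("Cabinet", 50, 400),
    ("20", 150, 400),
    ("15.0", 250, 400),
    ("5.0", 350, 400),
]

def flatten_text(items):
    """Concatenate item strings in stream order, discarding positions.

    This is the flat-extraction pattern: grid alignment is lost because
    the x/y coordinates never reach the output string.
    """
    return " ".join(s for s, _x, _y in items)

print(flatten_text(items))  # Cabinet 20 15.0 5.0
```

The column headers never appear in the output at all; the model has to infer from context that the fourth token is Ramos’s number.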

    Three ways flat extraction breaks documents

    Tables are the most obvious failure, but PDF.js loses structure in at least three ways.

    Tables become word soup. Column relationships vanish. A four-column table turns into an unpunctuated stream of tokens, and the model has no reliable way to reconstruct grid alignment from that flat sequence. In our benchmark of 200 real documents, PDF.js scored 0.000 on table structure recovery. Not low — zero.

    Headings vanish into body text. PDFs encode heading level through font size, weight, and spacing — none of which survives text extraction. A section heading and the paragraph below it merge into one block. The model loses the document’s outline. When asked “what does the methodology section say,” it may pull text from the wrong section entirely. PDF.js scored 0.000 on heading detection across the same 200 documents.
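To see why, here is a minimal, hypothetical sketch of heading recovery: the heading level lives in styling metadata that flat text extraction never sees. The thresholds below are illustrative, not taken from any real extractor:

```python
# Hypothetical heading classifier: heading level is inferred from styling
# metadata (font size, weight). Flat text extraction drops exactly these
# signals, so this mapping becomes impossible downstream.
def to_markdown(block_text: str, font_size: float, bold: bool) -> str:
    if font_size >= 20:
        return f"# {block_text}"      # top-level heading
    if font_size >= 16 or bold:
        return f"## {block_text}"     # subheading
    return block_text                 # body text

print(to_markdown("Methodology", 20, True))     # # Methodology
print(to_markdown("We sampled...", 11, False))  # We sampled...
```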

    Reading order gets scrambled. Multicolumn layouts, sidebars, and numbered lists depend on spatial position. PDF.js reads left-to-right, top-to-bottom from the raw content stream. A two-column page produces interleaved sentences. Numbered steps arrive out of order. The model cites step 3 when it means step 5.
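A toy example shows the interleaving. Assume a two-column page where the left column holds steps 1 and 2 and the right column holds steps 3 and 4; sorting spans top-to-bottom, then left-to-right produces the scrambled order:

```python
# Hypothetical two-column page. Spans are (text, x, y), with y growing
# upward as in PDF coordinate space.
spans = [
    ("Step 1", 50, 700),   # left column, top
    ("Step 2", 50, 650),   # left column, bottom
    ("Step 3", 300, 700),  # right column, top
    ("Step 4", 300, 650),  # right column, bottom
]

def naive_order(spans):
    """Sort top-to-bottom, then left-to-right: the flat-extraction order."""
    return [text for text, _x, _y in sorted(spans, key=lambda s: (-s[2], s[1]))]

print(naive_order(spans))  # ['Step 1', 'Step 3', 'Step 2', 'Step 4']
```

Correct column-aware ordering would require detecting the column boundary first, which a naive positional sort never does.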

    What the benchmark shows

    We tested two extraction pipelines across 200 real-world documents using three scoring methods:

    • Normalized information distance (NID) measures overall text fidelity
    • Tree-edit distance similarity (TEDS) measures table structure recovery
    • Markdown heading score (MHS) measures heading detection accuracy
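The benchmark’s scoring code isn’t shown here, but NID deserves a note: it is defined via Kolmogorov complexity and is uncomputable, so in practice it is approximated with a real compressor as the normalized compression distance (NCD). A minimal sketch, assuming zlib as the compressor:

```python
import zlib

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: a computable proxy for NID.

    Near 0 means the two texts carry almost the same information;
    values near 1 mean they share little.
    """
    c = lambda s: len(zlib.compress(s.encode()))
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

ground_truth = "| Cabinet | 20 | 15.0 | 5.0 |\n" * 10
flat_output = "Cabinet 20 15.0 5.0 " * 10

# Structured output that matches the ground truth scores closer to 0
# than a flat rendering of the same cells.
print(ncd(ground_truth, ground_truth) < ncd(ground_truth, flat_output))  # True
```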

    Results:

    | Metric                 | PDF.js | Nutrient | Change   |
    |------------------------|--------|----------|----------|
    | Overall accuracy (NID) | 0.578  | 0.880    | +52%     |
    | Table structure (TEDS) | 0.000  | 0.662    | 0% → 66% |
    | Heading fidelity (MHS) | 0.000  | 0.811    | 0% → 81% |
    | Reading order          | 0.871  | 0.924    | +6%      |

    The 52 percent gain in overall accuracy matters, but the table and heading numbers tell the real story: PDF.js doesn’t attempt structure recovery at all. The scores aren’t low. They’re zero.

    With Nutrient’s pdf-to-markdown extraction, that same Cabinet table becomes:

    | Position | Seats | Aquino | Ramos |
    |-------------------------|--------|--------|-------|
    | Senate | 24 | 8.3 | 16.7 |
    | House of Reps | 202 | 9.4 | 10.4 |
    | Cabinet | 20 | 15.0 | 5.0 |
    | Governor | 73 | 5.4 | 5.4 |
    | Provincial Board Member | 626 | 9.9 | 10.9 |

    Cabinet row, Ramos column: 5 percent. The LLM doesn’t need to guess. Row and column boundaries are explicit. The answer is a lookup, not an inference.
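To make “a lookup, not an inference” concrete, here is a small, hypothetical parser (`lookup` is our name, not part of any SDK) that answers the question deterministically from the Markdown:

```python
def lookup(markdown_table: str, row: str, col: str) -> str:
    """Answer a cell query against a Markdown table by pure lookup."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown_table.strip().splitlines()
        if not set(line.strip()) <= set("|- ")  # skip the |---| separator row
    ]
    header = rows[0]
    target = next(r for r in rows if r[0] == row)
    return target[header.index(col)]

table = """
| Position | Seats | Aquino | Ramos |
|----------|-------|--------|-------|
| Senate   | 24    | 8.3    | 16.7  |
| Cabinet  | 20    | 15.0   | 5.0   |
"""

print(lookup(table, "Cabinet", "Ramos"))  # 5.0
```

No probabilistic reasoning is involved: the row and column labels survive extraction, so the answer falls out of indexing.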

    The fix is two commands

    If you’re running OpenClaw, install the Nutrient plugin and set it as the default PDF extraction engine:

    ```shell
    openclaw plugins install @nutrient-sdk/openclaw-nutrient-pdf
    openclaw config set agents.defaults.pdfExtraction.engine auto
    ```

    The auto setting uses Nutrient extraction for PDFs and falls back to PDF.js for anything the plugin cannot handle. Processing runs locally — no documents leave your machine. No API keys are required. The free tier covers 1,000 documents per month.
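The fallback behavior the auto setting describes can be sketched as a simple dispatch. Both extractor callables below are stand-ins, not real OpenClaw or Nutrient APIs:

```python
# Hypothetical sketch of an "auto" engine: try the structure-aware
# extractor first; fall back to flat extraction for anything it
# cannot handle. Neither callable here is a real API.
def auto_extract(pdf_bytes, primary, fallback):
    try:
        return primary(pdf_bytes)
    except Exception:
        return fallback(pdf_bytes)

def failing_primary(_pdf):
    raise ValueError("unsupported feature")

result = auto_extract(b"%PDF-1.7", failing_primary, lambda _pdf: "flat text")
print(result)  # flat text
```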

    The underlying problem

    AI agent PDF extraction is treated as a solved problem. It isn’t. Dumping raw text into a context window works until the document contains a table, a heading hierarchy, or a multicolumn layout. Then the model confabulates, and the agent delivers the wrong answer with full confidence.

    The fix isn’t a better model. It’s a better extractor. Structure-aware PDF-to-Markdown conversion preserves the relationships models need to answer correctly. Until your extraction pipeline recovers tables as tables and headings as headings, your agent will keep hallucinating tabular data.

    Jonathan D. Rhyne

    Co-Founder and CEO

    Jonathan joined PSPDFKit in 2014. As Co-founder and CEO, Jonathan defines the company’s vision and strategic goals, bolsters the team culture, and steers product direction. When he’s not working, he enjoys being a dad, photography, and soccer.
