    Why your AI agent hallucinates PDF table data

    Your agent answered confidently. It was also wrong. The problem isn’t the model — it’s the extraction layer.

    You asked your agent: “How many Cabinet seats does Ramos have?” The PDF contains a table with four columns: Position, Seats, Aquino, Ramos. The Cabinet row reads 20 seats, Aquino 15 percent, Ramos 5 percent. The correct answer is 5 percent.

    Your agent returned 15 percent. That’s Aquino’s number, not Ramos’s. The agent never flagged uncertainty; it just picked the wrong column.

    This isn’t a model problem. It’s an extraction problem.

    PDF.js extracts text, not structure

    Most AI agent frameworks — OpenClaw included — default to PDF.js for PDF extraction. PDF.js was built to render PDFs in a browser. It wasn’t designed to recover document structure. It reads character positions off the page and concatenates them into a string. That works fine for running prose, but it fails on tables.

    Here’s what PDF.js produces from the table above:

    Senate 24 8.3 16.7 House of Representatives 202 9.4 10.4 Cabinet 20 15.0 5.0 Governor 73 5.4 5.4 Provincial Board Member 626 9.9 10.9

    No column headers. No row boundaries. No cell alignment. Just a flat sequence of words and numbers. The LLM receives this and has to guess which number belongs to which column. Sometimes it guesses right. Often it doesn’t.

    The model isn’t hallucinating in the usual sense. Rather, it’s doing its best with garbage input. The extraction layer destroyed the information the model needed to answer correctly.
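The flattening pattern is easy to reproduce with mock data. The sketch below loosely imitates the shape of the positioned text items a renderer-oriented extractor yields; the strings and coordinates are made up for illustration and are not real pdf.js output:

```python
# Each item is (string, x, y): a simplified stand-in for the positioned
# text items a PDF content stream yields. Mock data for the Cabinet row.
items = [
    ("Cabinet", 50, 400),
    ("20", 150, 400),
    ("15.0", 250, 400),
    ("5.0", 350, 400),
]

def flatten_text(items):
    """Concatenate item strings in stream order, discarding positions.

    This is the flat-extraction pattern: grid alignment is lost because
    the x/y coordinates never reach the output string.
    """
    return " ".join(s for s, _x, _y in items)

print(flatten_text(items))  # Cabinet 20 15.0 5.0
```

The column headers never appear in the output at all; the model has to infer from context that the fourth token is Ramos’s number.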

    Three ways flat extraction breaks documents

    Tables are the most obvious failure, but PDF.js loses structure in at least three ways.

    Tables become word soup. Column relationships vanish. A four-column table turns into an unpunctuated stream of tokens, and the model has no reliable way to reconstruct grid alignment from that flat sequence. In our benchmark of 200 real documents, PDF.js scored 0.000 on table structure recovery. Not low — zero.

    Headings vanish into body text. PDFs encode heading level through font size, weight, and spacing — none of which survives text extraction. A section heading and the paragraph below it merge into one block. The model loses the document’s outline. When asked “what does the methodology section say,” it may pull text from the wrong section entirely. PDF.js scored 0.000 on heading detection across the same 200 documents.
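To see why, here is a minimal, hypothetical sketch of heading recovery: the heading level lives in styling metadata that flat text extraction never sees. The thresholds below are illustrative, not taken from any real extractor:

```python
# Hypothetical heading classifier: heading level is inferred from styling
# metadata (font size, weight). Flat text extraction drops exactly these
# signals, so this mapping becomes impossible downstream.
def to_markdown(block_text: str, font_size: float, bold: bool) -> str:
    if font_size >= 20:
        return f"# {block_text}"      # top-level heading
    if font_size >= 16 or bold:
        return f"## {block_text}"     # subheading
    return block_text                 # body text

print(to_markdown("Methodology", 20, True))     # # Methodology
print(to_markdown("We sampled...", 11, False))  # We sampled...
```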

    Reading order gets scrambled. Multicolumn layouts, sidebars, and numbered lists depend on spatial position. PDF.js reads left-to-right, top-to-bottom from the raw content stream. A two-column page produces interleaved sentences. Numbered steps arrive out of order. The model cites step 3 when it means step 5.
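A toy example shows the interleaving. Assume a two-column page where the left column holds steps 1 and 2 and the right column holds steps 3 and 4; sorting spans top-to-bottom, then left-to-right produces the scrambled order:

```python
# Hypothetical two-column page. Spans are (text, x, y), with y growing
# upward as in PDF coordinate space.
spans = [
    ("Step 1", 50, 700),   # left column, top
    ("Step 2", 50, 650),   # left column, bottom
    ("Step 3", 300, 700),  # right column, top
    ("Step 4", 300, 650),  # right column, bottom
]

def naive_order(spans):
    """Sort top-to-bottom, then left-to-right: the flat-extraction order."""
    return [text for text, _x, _y in sorted(spans, key=lambda s: (-s[2], s[1]))]

print(naive_order(spans))  # ['Step 1', 'Step 3', 'Step 2', 'Step 4']
```

Correct column-aware ordering would require detecting the column boundary first, which a naive positional sort never does.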

    What the benchmark shows

    We tested two extraction pipelines across 200 real-world documents using three scoring methods:

    • Normalized information distance (NID) measures overall text fidelity
    • Tree-edit distance similarity (TEDS) measures table structure recovery
    • Markdown heading score (MHS) measures heading detection accuracy
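The benchmark’s scoring code isn’t shown here, but NID deserves a note: it is defined via Kolmogorov complexity and is uncomputable, so in practice it is approximated with a real compressor as the normalized compression distance (NCD). A minimal sketch, assuming zlib as the compressor:

```python
import zlib

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: a computable proxy for NID.

    Near 0 means the two texts carry almost the same information;
    values near 1 mean they share little.
    """
    c = lambda s: len(zlib.compress(s.encode()))
    ca, cb, cab = c(a), c(b), c(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

ground_truth = "| Cabinet | 20 | 15.0 | 5.0 |\n" * 10
flat_output = "Cabinet 20 15.0 5.0 " * 10

# Structured output that matches the ground truth scores closer to 0
# than a flat rendering of the same cells.
print(ncd(ground_truth, ground_truth) < ncd(ground_truth, flat_output))  # True
```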

    Results:

    | Metric                 | PDF.js | Nutrient | Change   |
    |------------------------|--------|----------|----------|
    | Overall accuracy (NID) | 0.578  | 0.880    | +52%     |
    | Table structure (TEDS) | 0.000  | 0.662    | 0% → 66% |
    | Heading fidelity (MHS) | 0.000  | 0.811    | 0% → 81% |
    | Reading order          | 0.871  | 0.924    | +6%      |

    The 52 percent gain in overall accuracy matters, but the table and heading numbers tell the real story: PDF.js doesn’t attempt structure recovery at all. The scores aren’t low. They’re zero.

    With Nutrient’s pdf-to-markdown extraction, that same Cabinet table becomes:

    | Position | Seats | Aquino | Ramos |
    |-------------------------|--------|--------|-------|
    | Senate | 24 | 8.3 | 16.7 |
    | House of Reps | 202 | 9.4 | 10.4 |
    | Cabinet | 20 | 15.0 | 5.0 |
    | Governor | 73 | 5.4 | 5.4 |
    | Provincial Board Member | 626 | 9.9 | 10.9 |

    Cabinet row, Ramos column: 5 percent. The LLM doesn’t need to guess. Row and column boundaries are explicit. The answer is a lookup, not an inference.
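To make “a lookup, not an inference” concrete, here is a small, hypothetical parser (`lookup` is our name, not part of any SDK) that answers the question deterministically from the Markdown:

```python
def lookup(markdown_table: str, row: str, col: str) -> str:
    """Answer a cell query against a Markdown table by pure lookup."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown_table.strip().splitlines()
        if not set(line.strip()) <= set("|- ")  # skip the |---| separator row
    ]
    header = rows[0]
    target = next(r for r in rows if r[0] == row)
    return target[header.index(col)]

table = """
| Position | Seats | Aquino | Ramos |
|----------|-------|--------|-------|
| Senate   | 24    | 8.3    | 16.7  |
| Cabinet  | 20    | 15.0   | 5.0   |
"""

print(lookup(table, "Cabinet", "Ramos"))  # 5.0
```

No probabilistic reasoning is involved: the row and column labels survive extraction, so the answer falls out of indexing.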

    The fix is two commands

    If you’re running OpenClaw, install the Nutrient plugin and set it as the default PDF extraction engine:

    ```shell
    openclaw plugins install @nutrient-sdk/openclaw-nutrient-pdf
    openclaw config set agents.defaults.pdfExtraction.engine auto
    ```

    The auto setting uses Nutrient extraction for PDFs and falls back to PDF.js for anything the plugin cannot handle. Processing runs locally — no documents leave your machine. No API keys are required. The free tier covers 1,000 documents per month.
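The fallback behavior the auto setting describes can be sketched as a simple dispatch. Both extractor callables below are stand-ins, not real OpenClaw or Nutrient APIs:

```python
# Hypothetical sketch of an "auto" engine: try the structure-aware
# extractor first; fall back to flat extraction for anything it
# cannot handle. Neither callable here is a real API.
def auto_extract(pdf_bytes, primary, fallback):
    try:
        return primary(pdf_bytes)
    except Exception:
        return fallback(pdf_bytes)

def failing_primary(_pdf):
    raise ValueError("unsupported feature")

result = auto_extract(b"%PDF-1.7", failing_primary, lambda _pdf: "flat text")
print(result)  # flat text
```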

    The underlying problem

    AI agent PDF extraction is treated as a solved problem. It isn’t. Dumping raw text into a context window works until the document contains a table, a heading hierarchy, or a multicolumn layout. Then the model confabulates, and the agent delivers the wrong answer with full confidence.

    The fix isn’t a better model. It’s a better extractor. Structure-aware PDF-to-Markdown conversion preserves the relationships models need to answer correctly. Until your extraction pipeline recovers tables as tables and headings as headings, your agent will keep hallucinating tabular data.

    Jonathan D. Rhyne

    Co-Founder and CEO

    Jonathan joined PSPDFKit in 2014. As Co-founder and CEO, Jonathan defines the company’s vision and strategic goals, bolsters the team culture, and steers product direction. When he’s not working, he enjoys being a dad, photography, and soccer.
