Computers still can’t understand you
I recently joined Nutrient, fully expecting to spend a few weeks just getting to know who my teammates are, where we work, and most importantly, who to ping when I inevitably break something.
On my second day, two articles landed in my feed, and together they made me wonder if the universe isn’t actually some kind of cosmic simulation after all.
The first was from The Tech Buzz, and it was entitled “AI’s Dirty Secret: It Still Can’t Read PDFs Properly.” Not an unusual topic for that particular source. The second, though, really forced a double-take. It was published in The Economist and it was titled “The war against PDFs is heating up.”
The Economist — that same publication that talks about global trade, interest rates, international conflict, and neoliberal macroeconomic theory, talking about a software document format? What in Turing’s good name is going on?
For those who haven’t been paying close attention to US domestic news, the federal Department of Justice just released more than three million pages of PDF documents for public review. Like many government documents, many of these weren’t originally software-generated; a great many were hand-typed — or worse, handwritten — documents that needed to be translated into a machine-readable format.
What emerged comes as no real surprise to anyone who’s ever had to deal with this before: It was a mess. The basic optical character recognition (OCR) technology used for processing those source documents produced garbled, unsearchable output. Not only did the public have reason to wonder at the technical competence of the DOJ staffers, but journalists and researchers couldn’t find what they were looking for in documents the government was legally required to make transparent.
And yet, here’s the thing that caught my attention: This wasn’t treated as a technical glitch. It was treated as a revelation — as if the industry had just discovered that extracting structured data from PDFs is still an unsolved problem.
I’ve spent my career at the intersection of development platforms and enterprise software, and this problem isn’t new. We’ve been quietly struggling to come up with solutions to this problem since the ‘90s. (I first ran into PDFs at Pacific Bell in 1997.) And what I found when I walked in the door at Nutrient is a team that’s been quietly solving it for more than a decade.
The Economist piece posed the question: Can developers build tools that handle PDFs properly? TL;DR: Yes, but like most formats, you have to know what you’re working with when you’re working with a PDF.
Why this is structurally difficult
To understand why document extraction is so hard, you need to understand what a PDF actually is — or, perhaps, what it isn’t.
Consider HTML for a moment. An HTML document carries an intentionally explicit structure to help renderers. This is why the HTML specification describes “semantic” markup: The tag language is designed to reflect meaning rather than just display characteristics. A bulleted list is bookended by opening and closing <ul> tags. A table is set off by a <table> tag, and each row and cell is wrapped in additional tag pairs. Even that most humble of structural elements — the <div> — tells the renderer that something is a new section of markup.
On the other hand, PDF, like its immediate predecessor, PostScript, is a set of printing instructions: code that tells a renderer where to place ink on a page or screen. It doesn’t encode what a table is, where a paragraph begins, or what reading order the content follows. A PDF generated from a Word document often captures none of the underlying structural elements, like headers, footers, or page numbers. The structure is obvious to the human eye, but to the PDF, it’s all just formatting commands that happen to look correct when rendered visually. Scanned PDFs are worse: They’re literally arrays of bytes that happen to form images, with no machine-readable text at all.
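To make that concrete, here’s what the text-drawing portion of a PDF content stream looks like. The operator syntax is real PDF; the coordinates and font name are arbitrary:

```
BT              % begin a text object
/F1 12 Tf       % select font resource F1 at 12 points
72 712 Td       % move the text position to (72, 712)
(Hello) Tj      % paint the string "Hello"
ET              % end the text object
```

Nothing here says “heading,” “paragraph,” or “table cell.” The renderer is simply told where to paint glyphs, and any higher-level structure has to be inferred after the fact.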
This is why the DOJ’s documents came out garbled. The OCR step converted pixels to characters just fine — the words were all there. But the extraction pipeline treated the result as a flat text stream, and every bit of structural context was lost.
This is a well-understood failure pattern, and it shows up in predictable ways:
Table extraction failure. Tables in PDFs are often just aligned text and drawn lines. Without structure-aware parsing, columns collapse, headers detach from data, and multi-row cells merge incorrectly.
Multicolumn layout failure. Government documents, academic papers, and annual reports use multicolumn layouts. Naive extraction reads left-to-right across both columns simultaneously, producing nonsense.
Downstream hallucination. When extraction produces partial or disordered text and that text is fed to an LLM, the model fills in the gaps. On financial documents, contracts, or medical records, that means plausible-but-wrong numbers and dates.
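The multicolumn failure is easy to reproduce. Here’s a deliberately tiny sketch — not anyone’s production extraction code — using made-up word coordinates to show how reading strictly top-to-bottom across the full page width interleaves two columns, while grouping by column first recovers the real text:

```python
# Words from a two-column page as (x, y, text) tuples; y grows downward.
# Left column says "The quick fox." and right column says "Budget figures follow."
words = [
    (40, 100, "The"),   (300, 100, "Budget"),
    (40, 115, "quick"), (300, 115, "figures"),
    (40, 130, "fox."),  (300, 130, "follow."),
]

def naive_extract(words):
    # Sort by vertical position, then horizontal: this reads straight across
    # the page and interleaves the two columns into nonsense.
    ordered = sorted(words, key=lambda t: (t[1], t[0]))
    return " ".join(text for _, _, text in ordered)

def column_aware_extract(words, split_x=200):
    # Group words into columns first, then read each column top-to-bottom.
    left = sorted((w for w in words if w[0] < split_x), key=lambda t: t[1])
    right = sorted((w for w in words if w[0] >= split_x), key=lambda t: t[1])
    return " ".join(text for _, _, text in left + right)

print(naive_extract(words))         # The Budget quick figures fox. follow.
print(column_aware_extract(words))  # The quick fox. Budget figures follow.
```

Real layout analysis has to detect the column boundary rather than hardcode it, handle columns that start and end mid-page, and cope with figures and captions that span both columns — which is exactly why this is a structural problem rather than a character recognition problem.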
The good news is that the industry has been making progress — both with specialized document extraction tools and with multimodal AI models that can process pages as images rather than raw text. The bad news is that each approach comes with its own tradeoffs in quality, infrastructure control, and cost.
Why “better OCR” isn’t the answer — and why multimodal AI isn’t either
The industry’s first instinct is to throw more OCR at the problem. More accuracy, more languages, more preprocessing. But as the DOJ case showed, character recognition accuracy wasn’t the issue — the words were all there. The problem is structural understanding: knowing that these cells form a table, that this paragraph precedes that one, that this handwritten annotation is a dosage linked to a patient field.
The second instinct is to skip the extraction pipeline entirely and point a multimodal AI model at the page. Claude, Gemini, and ChatGPT can all process PDF pages as images now, and they’re meaningfully better than naive text extraction. But they have their own limitations: They struggle with merged table cells, truncate multipage tables at page breaks, produce inconsistent results across runs, and provide no confidence scores to tell you what’s reliable and what isn’t. Real-world tests have found up to 42 percent of fields missing from LLM-extracted data on complex documents. And they return Markdown or text, not structured data with spatial coordinates you can trace back to the source page.
Here’s a useful framing. Think of document extraction as three tiers.
Tier 1: Character recognition. Converting pixels to text. Most tools do this reasonably well. It’s fast and lightweight, but it’s all you get — a flat string of characters with no structural context.
Tier 2: Structural understanding. Determining reading order, detecting columns, extracting tables with cell-level coordinates, recognizing form fields, preserving hierarchical relationships between document elements. This is where most tools stop — and where most failures originate. It’s also what makes extracted data actually usable downstream.
Tier 3: AI-enhanced analysis. Running OCR, intelligent content recognition (ICR), and a vision language model in parallel, and then merging the results — combining the spatial precision of specialized local models with the general document understanding of VLMs for the hardest documents: irregular table layouts, degraded scans, and complex handwriting.
So what’s left? Cloud-only platforms like Adobe’s Acrobat AI Assistant and Google Document AI are pushing into tiers 2 and 3 with dedicated document extraction APIs. These are more reliable than raw LLM processing — but they require every document to leave your infrastructure. For regulated industries — healthcare, legal, financial services, government — that’s often a non-starter before the conversation even begins. They also come with per-page pricing that scales linearly with volume, and as the recent outages at OpenAI and Microsoft have shown, dependency on external services brings a degree of fragility that many organizations cannot tolerate.
The question isn’t whether AI can help with document extraction — it can. The question is where that AI runs, who controls it, and whether your extraction pipeline works without it.
What structural document intelligence looks like
It’s always nice working with people who know what they’re doing. What I’ve found at Nutrient is, as one would expect, the people here know PDF really, really well. The engineering team has been building PDF processing and document understanding technology for more than a decade. That expertise now powers Vision API, which launched this week as part of our Python and Java SDKs.
Vision API isn’t an incremental improvement to OCR; it’s a different architecture. Three modes of operation — OCR, ICR, and VLM-enhanced ICR — are each designed for a different level of the extraction problem, and they’re all accessible through a single API. You pick the operation mode that matches the document.
The OCR mode handles tier 1: fast character recognition with word-level bounding boxes.
ICR is where it gets interesting. It uses specialized on-premises AI models — small models that are each optimized for a specific task, like segmentation, character recognition, or table detection — to handle tier 2: document layout detection, table extraction with cell-level coordinates, equation recognition, handwriting, and correct reading order. It’s all local, and nothing leaves your infrastructure. For most documents — standard tables, forms, two-column layouts, mixed print, and handwriting — ICR handles the job on its own. No cloud calls, no external dependencies.
For the most difficult documents — degraded scans, complex handwriting, unusual layouts — developers can opt into VLM-enhanced ICR (tier 3). Here’s how it works: All three engines — OCR, ICR, and a vision language model (Claude, OpenAI, or a locally hosted model) — run in parallel on the presegmented document. OCR contributes exact character recognition. ICR contributes spatial precision, bounding boxes, and structural layout. The VLM contributes general document understanding that specialized models lack — it’s better at handling unusual layouts where text appears in unexpected positions. The results are then merged into a single cohesive output that’s more accurate than any engine alone. You choose whether and when data leaves your environment — for full isolation, you can use a local VLM like Qwen instead of a cloud provider.
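The merge step can be pictured with a toy example. The field names, threshold, and merge rule below are illustrative assumptions for the sake of the sketch — they are not Vision API internals — but they show the division of labor: exact characters from OCR when it’s confident, a VLM fallback otherwise, and spatial data always from ICR:

```python
# Hypothetical per-segment outputs from three engines run in parallel.
ocr = {"text": "Dosage: 50mg", "confidence": 0.95}
icr = {"type": "form_field", "bbox": [100, 200, 260, 218]}
vlm = {"text": "Dosage: 50 mg", "confidence": 0.80}

def merge(ocr, icr, vlm, ocr_threshold=0.9):
    # Prefer OCR's exact character recognition when it is confident;
    # otherwise fall back to the VLM's reading. Structural data (element
    # type and bounding box) always comes from the ICR models.
    text = ocr["text"] if ocr["confidence"] >= ocr_threshold else vlm["text"]
    return {"type": icr["type"], "bbox": icr["bbox"], "text": text}

print(merge(ocr, icr, vlm))
# {'type': 'form_field', 'bbox': [100, 200, 260, 218], 'text': 'Dosage: 50mg'}
```

The point of the sketch is the shape of the result: every merged element keeps a type and a bounding box, so nothing in the final output loses its anchor to the source page.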
This is what separates Vision API from the general LLM approach described earlier. Where LLMs give you Markdown or text with no spatial context, Vision API returns structured JSON for every element — with its type, bounding box coordinates, and position in the reading order. You can trace any value back to the exact pixel region in the original page. That’s what makes audit trails, review UIs, and compliance reporting possible. And because the merged output combines ICR’s spatial precision and bounding boxes with VLM’s general document understanding, you avoid the positioning errors, hallucinated text, and missing fields that plague VLM-only extraction.
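What you can do with element-level JSON is worth spelling out. The schema below is a simplified illustration, not Vision API’s actual output format, but it captures the two properties that matter: a reading order you can reassemble, and a bounding box you can trace any value back to:

```python
import json

# Illustrative element-level extraction result for one page.
# Schema (type/text/bbox/order) is a stand-in, not a real API contract.
result = json.loads("""
[
  {"type": "heading",    "text": "Invoice #1042",    "bbox": [72, 60, 300, 84],   "order": 0},
  {"type": "paragraph",  "text": "Date: 2026-01-15", "bbox": [72, 100, 250, 116], "order": 1},
  {"type": "table_cell", "text": "Total: $4,200",    "bbox": [72, 400, 220, 418], "order": 2}
]
""")

def text_in_reading_order(elements):
    # Reassemble the page text using the structural reading order,
    # not raw stream order.
    return "\n".join(e["text"] for e in sorted(elements, key=lambda e: e["order"]))

def locate(elements, needle):
    # Trace a value back to its pixel region on the source page --
    # the basis for audit trails and review UIs.
    for e in elements:
        if needle in e["text"]:
            return e["bbox"]
    return None

print(text_in_reading_order(result))
print(locate(result, "$4,200"))  # [72, 400, 220, 418]
```

A flat text dump or a Markdown blob supports neither operation: once the spatial coordinates are gone, there’s no way to show a reviewer where a number came from.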
It’s important that developers know where and how the document is being processed, but it’s business-critical to the CEO and the CISO. For the healthcare company handling patient records, the bank processing loan documents, the law firm reviewing discovery materials, and the government agency managing classified files, sending every document to a cloud API isn’t viable. In fact, it’s often disqualifying at procurement.
The right approach isn’t to choose between local processing and cloud AI. It’s to run the right AI at the right layer. Vision API handles extraction — the structural engineering problem — using on-device AI that never leaves your infrastructure. Combining ICR’s spatial precision with VLM’s general document understanding produces better results than either engine alone — and the structured JSON output feeds downstream LLMs far better input than a flat text dump or Markdown. Better input produces better output, at every layer.
Get started with Vision API using our Python or Java guides, or contact our team to discuss your document processing pipeline.
The bigger picture
The Tech Buzz called document processing “the stuff that actually matters for day-to-day business operations.” It’s infrastructure. It’s not glamorous. It doesn’t generate headlines about artificial general intelligence. But an estimated 2.5 trillion PDFs exist in the world, and every industry depends on them. (And, as any reader of XKCD knows, introducing a new standard to replace all the other ones just adds a new one to the pile.)
The organizations that solve document intelligence at the extraction layer — with control over where it runs and what it costs — unlock everything downstream: search, compliance, automation, AI-powered analysis.
One startup profiled by The Economist is trying to build a new file type to replace the PDF entirely. That’s a bold bet against 30 years of institutional adoption. PDFs aren’t going anywhere, and the challenge before us isn’t building a new format. It’s building tools that understand PDFs structurally, run locally, and give developers control over the accuracy-privacy tradeoff.
That’s what the team here has been doing for more than a decade. And I’m glad I get to help tell the story. Now, if you’ll excuse me, I have a “Letter to the Editor” to write.