Build a document extraction pipeline for invoices and forms

This tutorial builds a Python pipeline that extracts structured data from invoices and forms using the Data Extraction API’s spatial element output. Unlike Markdown extraction, spatial output returns typed elements — tables with cell coordinates, key-value pairs from form fields, paragraphs with semantic roles — with bounding boxes and confidence scores.

What you’ll build

A Python script that:

  1. Sends a PDF invoice to the Data Extraction API
  2. Receives typed spatial elements with bounding boxes
  3. Extracts key-value pairs (invoice number, date, total, etc.)
  4. Extracts table data (line items) into structured rows
  5. Outputs a clean JSON summary ready for downstream systems

When to use spatial vs. Markdown

Use case                                 Output format
Invoice processing, form extraction      Spatial (output.format: "spatial")
RAG pipelines, search indexing           Markdown (output.format: "markdown")
Document analysis with spatial data      Spatial
Content migration                        Markdown

This tutorial uses spatial output. For Markdown-based RAG pipelines, see the guide on building a RAG ingestion pipeline.

Prerequisites

Project setup

mkdir invoice-extraction && cd invoice-extraction
python -m venv .venv && source .venv/bin/activate
pip install requests python-dotenv

Create a .env file:

NUTRIENT_API_KEY=your_data_extraction_api_key_here

Step 1 — Extract elements from a document

Send a PDF to the Data Extraction API with output.format: "spatial" to receive typed elements:

extract.py
import json
import os
import sys

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["NUTRIENT_API_KEY"]
ENDPOINT = "https://api.nutrient.io/extraction/parse"


def extract_elements(pdf_path: str, include_words: bool = False) -> dict:
    """Extract structured elements from a PDF."""
    instructions = {
        "mode": "understand",
        "output": {"format": "spatial", "includeWords": include_words},
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"instructions": json.dumps(instructions)},
        )
    return response.json()

Step 2 — Process key-value pairs

The API returns keyValueRegion elements for form fields. Each region contains pairs with a key (the label) and a value (the answer):

def extract_key_values(elements: list[dict]) -> dict[str, str]:
    """Pull key-value pairs from keyValueRegion elements."""
    fields = {}
    for el in elements:
        if el["type"] != "keyValueRegion":
            continue
        for pair in el["pairs"]:
            key_text = pair.get("key", {})
            val_text = pair.get("value", {})
            if key_text and val_text:
                k = str(key_text.get("value", "")).strip()
                v = str(val_text.get("value", "")).strip()
                if k:
                    fields[k] = v
    return fields

For an invoice, this might return:

{
  "Invoice Number": "INV-2024-0042",
  "Date": "2024-03-15",
  "Due Date": "2024-04-15",
  "Total": "$1,247.50"
}
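
The values in this dictionary are still strings. If a downstream system expects typed values, a small normalization pass can convert them. The following is a minimal sketch, not part of the API: `normalize_fields`, `parse_amount`, and `parse_iso_date` are hypothetical helpers, and the field labels and formats (US-style currency, ISO dates) are assumptions taken from the sample output above.

```python
from __future__ import annotations

from datetime import date
from decimal import Decimal, InvalidOperation


def parse_amount(text: str) -> Decimal | None:
    """Parse a US-style currency string such as "$1,247.50"; None on failure."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_iso_date(text: str) -> date | None:
    """Parse a YYYY-MM-DD date; None if the text uses another format."""
    try:
        return date.fromisoformat(text.strip())
    except ValueError:
        return None


def normalize_fields(fields: dict[str, str]) -> dict[str, object]:
    """Convert known invoice fields to typed values, leaving the rest as strings."""
    # ASSUMPTION: these labels match the ones printed on your invoices.
    amount_keys = {"Total"}
    date_keys = {"Date", "Due Date"}
    out: dict[str, object] = {}
    for key, value in fields.items():
        parsed: object = None
        if key in amount_keys:
            parsed = parse_amount(value)
        elif key in date_keys:
            parsed = parse_iso_date(value)
        # Keep the original string when the field is unknown or parsing fails.
        out[key] = parsed if parsed is not None else value
    return out
```

`Decimal` and `date` values are not JSON-serializable by default, so pass `default=str` to `json.dumps` when writing the normalized result.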

Step 3 — Extract table data

Table elements include cell-level data with row/column positions, spans, and text content:

def extract_tables(elements: list[dict]) -> list[list[list[str | None]]]:
    """Convert table elements into row/column arrays, expanding spans."""
    tables = []
    for el in elements:
        if el["type"] != "table":
            continue
        rows: list[list[str | None]] = [[None] * el["columnCount"] for _ in range(el["rowCount"])]
        for cell in el["cells"]:
            r, c = cell["row"], cell["column"]
            text = cell["text"]
            row_span = cell.get("rowSpan", 1)
            col_span = cell.get("colSpan", 1)
            for dr in range(row_span):
                for dc in range(col_span):
                    ri, ci = r + dr, c + dc
                    if ri < el["rowCount"] and ci < el["columnCount"]:
                        rows[ri][ci] = text
        tables.append(rows)
    return tables

For an invoice with line items, this produces:

[
  [
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "10", "$25.00", "$250.00"],
    ["Widget B", "5", "$99.50", "$497.50"],
    ["Service Fee", "1", "$500.00", "$500.00"]
  ]
]
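
Row/column arrays are convenient for display, but downstream systems usually prefer one record per line item. The sketch below assumes the first row of each table is the header row, which holds for the sample above but is not guaranteed for every document. `table_to_records` is a hypothetical helper name.

```python
from __future__ import annotations


def table_to_records(table: list[list[str | None]]) -> list[dict[str, str | None]]:
    """Turn a row/column array into one dict per data row, keyed by header text."""
    if not table:
        return []
    # ASSUMPTION: the first row holds the column headers.
    header, *body = table
    # Fall back to a positional name when a header cell is empty or missing.
    keys = [h.strip() if h else f"column_{i}" for i, h in enumerate(header)]
    return [dict(zip(keys, row)) for row in body]


line_items = table_to_records([
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "10", "$25.00", "$250.00"],
])
print(line_items)
# [{'Item': 'Widget A', 'Quantity': '10', 'Unit Price': '$25.00', 'Total': '$250.00'}]
```

If two header cells carry the same text, the later column overwrites the earlier one in each record, so deduplicate the keys first when that can happen.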

Step 4 — Classify paragraphs by role

Paragraph elements include a role field that identifies titles, section headers, captions, footers, and other semantic categories:

def extract_paragraphs_by_role(
    elements: list[dict],
) -> dict[str, list[str]]:
    """Group paragraph text by semantic role."""
    by_role: dict[str, list[str]] = {}
    for el in elements:
        if el["type"] != "paragraph":
            continue
        role = el.get("role") or "Unknown"
        by_role.setdefault(role, []).append(el["text"])
    return by_role

Step 5 — Put it all together

def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = extract_elements(pdf_path)
    if result.get("status") != 200:
        # Use .get() here too, so a missing status field does not raise KeyError.
        raise RuntimeError(
            f"API error {result.get('status')}: {result.get('errorMessage')}"
        )
    elements = result["output"]["elements"]
    return {
        "fields": extract_key_values(elements),
        "tables": extract_tables(elements),
        "paragraphs": extract_paragraphs_by_role(elements),
        "metrics": result["metrics"],
        "element_count": len(elements),
    }


if __name__ == "__main__":
    pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
    data = process_invoice(pdf)
    print(json.dumps(data, indent=2))

Run it:

python extract.py invoice.pdf

Example output:

{
  "fields": {
    "Invoice Number": "INV-2024-0042",
    "Date": "2024-03-15",
    "Total": "$1,247.50"
  },
  "tables": [
    [
      ["Item", "Quantity", "Unit Price", "Total"],
      ["Widget A", "10", "$25.00", "$250.00"],
      ["Widget B", "5", "$99.50", "$497.50"],
      ["Service Fee", "1", "$500.00", "$500.00"]
    ]
  ],
  "paragraphs": {
    "Title": ["Invoice"],
    "Text": ["Payment terms: Net 30 days."]
  },
  "metrics": {
    "processingTimeMs": 3800,
    "pagesProcessed": 1
  },
  "element_count": 12
}

Using word-level data

Set includeWords: true to get word-level bounding boxes inside paragraphs and table cells. This is useful for building document overlays or highlighting matched text.

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial","includeWords":true}}'
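
To illustrate the highlighting use case, the sketch below collects the bounding boxes of every word that matches a search term. The response shape is an assumption here: it supposes that paragraphs and table cells carry a `words` list whose entries have `text` and `bounds` keys. Check the element schema for the actual field names before relying on it. `find_word_boxes` is a hypothetical helper.

```python
def find_word_boxes(elements: list[dict], term: str) -> list[dict]:
    """Collect bounding boxes of words equal to `term`, ignoring case."""
    needle = term.casefold()
    matches = []
    for el in elements:
        # ASSUMPTION: words live under a "words" key on paragraphs and on
        # table cells, and each word has "text" and "bounds" keys.
        containers = [el] + list(el.get("cells", []))
        for container in containers:
            for word in container.get("words", []):
                text = str(word.get("text", ""))
                if text.casefold() == needle:
                    matches.append({"text": text, "bounds": word.get("bounds")})
    return matches
```

The returned boxes can then be passed to whatever overlay or annotation layer you use to draw highlights.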

Structure mode for cost-effective processing

Use structure mode when you need spatial elements at lower cost: 1.5 credits per page versus 9 credits per page in understand mode. It runs OCR-based segmentation without AI augmentation:

instructions = {
    "mode": "structure",
    "output": {"format": "spatial"},
}

Structure mode works well for simple invoices and forms with clear layouts. Use understand mode for complex documents with multicolumn layouts, nested tables, or formulas. See processing modes for a full comparison.
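
To make the cost difference concrete, here is a small estimate based only on the per-page rates quoted above. `estimate_credits` is a hypothetical helper, and it assumes billing is strictly per page with no minimums or volume discounts.

```python
# Per-page rates as quoted in this guide.
CREDITS_PER_PAGE = {"structure": 1.5, "understand": 9.0}


def estimate_credits(pages: int, mode: str = "understand") -> float:
    """Estimate credits for `pages` pages, assuming simple per-page billing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    return pages * CREDITS_PER_PAGE[mode]


print(estimate_credits(200, "structure"))   # 300.0
print(estimate_credits(200, "understand"))  # 1800.0
```

For a 200-page batch, that is 300 credits in structure mode against 1,800 in understand mode, so it can pay to try structure mode first on simple layouts.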

Processing documents from URLs

If your documents are hosted at public URLs, skip the file upload:

def extract_from_url(url: str) -> dict:
    """Extract elements from a document at a public URL."""
    response = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "mode": "understand",
            "output": {"format": "spatial"},
        },
    )
    response.raise_for_status()
    return response.json()

Batch processing

Process a folder of invoices and output results as JSONL:

import pathlib


def batch_process(folder: str = "invoices", output: str = "results.jsonl"):
    """Process all PDFs in a folder and write results as JSONL."""
    out_path = pathlib.Path(output)
    with open(out_path, "w") as out:
        for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
            print(f"Processing {pdf.name}...")
            try:
                data = process_invoice(str(pdf))
                data["source"] = pdf.name
                out.write(json.dumps(data) + "\n")
            except Exception as e:
                print(f"  Error: {e}")
    print(f"Results written to {out_path}")
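
Each line of the JSONL file is one standalone JSON object, so reading the results back takes only a short loop. The following sketch uses a hypothetical `load_results` helper and assumes the file was written by `batch_process` as shown above.

```python
import json


def load_results(path: str = "results.jsonl") -> list[dict]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Because `batch_process` only prints failures, comparing the `source` values in the loaded records with the PDF names in the folder is one way to find files that need a retry.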

Production considerations

  • Confidence filtering — Each element and cell includes a confidence score (0–1). Filter low-confidence elements for higher-accuracy results.
  • Spatial validation — Use bounds coordinates to verify extracted fields are in expected regions of the document (e.g. totals at the bottom, dates at the top).
  • Multilingual documents — Set options.language for non-English invoices. See multilingual extraction.
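
The first two considerations can be sketched in a few lines. Everything about the field layout here is an assumption: the score is taken to live under a `confidence` key, and `bounds` is taken to be a dict whose `y` value is the top edge measured from the top of the page. The 0.8 threshold and the 25% band are illustrative values, not recommendations from the API, and both helper names are hypothetical.

```python
def filter_by_confidence(elements: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep elements whose confidence score is at or above `threshold`."""
    kept = []
    for el in elements:
        # ASSUMPTION: the score is stored under a "confidence" key.
        score = el.get("confidence")
        # Elements without a score are dropped here; keep them if you prefer.
        if score is not None and score >= threshold:
            kept.append(el)
    return kept


def is_in_bottom_band(bounds: dict, page_height: float, band: float = 0.25) -> bool:
    """Return True if the box starts within the bottom `band` fraction of the page."""
    # ASSUMPTION: bounds["y"] is the top edge, measured from the top of the page,
    # in the same units as page_height.
    return bounds["y"] >= page_height * (1.0 - band)
```

A total found outside the bottom band, or a field below the confidence threshold, is a reasonable candidate for manual review rather than automatic rejection.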

Element types

Refer to the guide on how to extract document elements for the complete element type schema, including all fields, roles, and response structure.