Build a document extraction pipeline

This tutorial builds a Python pipeline that extracts structured data from invoices and forms using the Data Extraction API’s spatial element output. Unlike Markdown extraction, spatial output returns typed elements — tables with cell coordinates, key-value pairs from form fields, paragraphs with semantic roles — with bounding boxes and confidence scores.

What you’ll build

A Python script that:

Sends a PDF invoice to the Data Extraction API
Receives typed spatial elements with bounding boxes
Extracts key-value pairs (invoice number, date, total, etc.)
Extracts table data (line items) into structured rows
Outputs a clean JSON summary ready for downstream systems

When to use spatial vs. Markdown

Use case	Output format
Invoice processing, form extraction	Spatial (`output.format: "spatial"`)
RAG pipelines, search indexing	Markdown (`output.format: "markdown"`)
Document analysis with spatial data	Spatial
Content migration	Markdown

This tutorial uses spatial output. For Markdown-based RAG pipelines, refer to the guide on building a RAG ingestion pipeline.

Prerequisites

Python 3.10+
A Nutrient DWS account and Data Extraction API key (sign up at the Nutrient dashboard)

Project setup

mkdir invoice-extraction && cd invoice-extraction
python -m venv .venv && source .venv/bin/activate
pip install requests python-dotenv

Create a .env file:

NUTRIENT_API_KEY=your_data_extraction_api_key_here

Step 1 — Extract elements from a document

Send a PDF to the Data Extraction API with output.format: "spatial" to receive typed elements:

import json
import os
import sys

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["NUTRIENT_API_KEY"]
ENDPOINT = "https://api.nutrient.io/extraction/parse"


def extract_elements(pdf_path: str, include_words: bool = False) -> dict:
    """Extract structured elements from a PDF."""
    instructions = {
        "mode": "understand",
        "output": {"format": "spatial", "includeWords": include_words},
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"instructions": json.dumps(instructions)},
        )
    return response.json()
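
Before writing any extraction logic, it can help to see which element types a document actually produced. The sketch below assumes the response shape used later in this tutorial (a list at result["output"]["elements"], each item carrying a "type" field). The helper name count_element_types and the sample response are illustrative, not part of the API.

```python
from collections import Counter


def count_element_types(result: dict) -> Counter:
    """Count elements in a parse response by their "type" field."""
    # Assumption: elements live at result["output"]["elements"], as in Step 5.
    elements = result.get("output", {}).get("elements", [])
    return Counter(el.get("type", "unknown") for el in elements)


# Hand-written stand-in for an API response, for illustration only.
sample = {
    "output": {
        "elements": [
            {"type": "paragraph"},
            {"type": "table"},
            {"type": "paragraph"},
            {"type": "keyValueRegion"},
        ]
    }
}

for element_type, count in count_element_types(sample).items():
    print(f"{element_type}: {count}")
```

If a document yields no keyValueRegion or table elements, the later steps return empty results rather than failing, so a quick count like this is a cheap first diagnostic.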

Step 2 — Process key-value pairs

The API returns keyValueRegion elements for form fields. Each region contains pairs with a key (the label) and a value (the answer):

def extract_key_values(elements: list[dict]) -> dict[str, str]:
    """Pull key-value pairs from keyValueRegion elements."""
    fields = {}
    for el in elements:
        if el["type"] != "keyValueRegion":
            continue
        for pair in el["pairs"]:
            key_text = pair.get("key", {})
            val_text = pair.get("value", {})
            if key_text and val_text:
                k = str(key_text.get("value", "")).strip()
                v = str(val_text.get("value", "")).strip()
                if k:
                    fields[k] = v
    return fields

For an invoice, this might return:

{
  "Invoice Number": "INV-2024-0042",
  "Date": "2024-03-15",
  "Due Date": "2024-04-15",
  "Total": "$1,247.50"
}

Step 3 — Extract table data

Table elements include cell-level data with row/column positions, spans, and text content:

def extract_tables(elements: list[dict]) -> list[list[list[str | None]]]:
    """Convert table elements into row/column arrays, expanding spans."""
    tables = []
    for el in elements:
        if el["type"] != "table":
            continue
        rows: list[list[str | None]] = [[None] * el["columnCount"] for _ in range(el["rowCount"])]
        for cell in el["cells"]:
            r, c = cell["row"], cell["column"]
            text = cell["text"]
            row_span = cell.get("rowSpan", 1)
            col_span = cell.get("colSpan", 1)
            for dr in range(row_span):
                for dc in range(col_span):
                    ri, ci = r + dr, c + dc
                    if ri < el["rowCount"] and ci < el["columnCount"]:
                        rows[ri][ci] = text
        tables.append(rows)
    return tables

For an invoice with line items, this produces:

[
  [
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "10", "$25.00", "$250.00"],
    ["Widget B", "5", "$99.50", "$497.50"],
    ["Service Fee", "1", "$500.00", "$500.00"]
  ]
]

Step 4 — Classify paragraphs by role

Paragraph elements include a role field that identifies titles, section headers, captions, footers, and other semantic categories:

def extract_paragraphs_by_role(
    elements: list[dict],
) -> dict[str, list[str]]:
    """Group paragraph text by semantic role."""
    by_role: dict[str, list[str]] = {}
    for el in elements:
        if el["type"] != "paragraph":
            continue
        role = el.get("role") or "Unknown"
        by_role.setdefault(role, []).append(el["text"])
    return by_role

Step 5 — Put it all together

def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = extract_elements(pdf_path)

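    if result.get("status") != 200:
        # Use .get() here too, so a response without "status" is reported
        # as an API error instead of raising KeyError.
        raise RuntimeError(
            f"API error {result.get('status')}: {result.get('errorMessage')}"
        )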

    elements = result["output"]["elements"]

    return {
        "fields": extract_key_values(elements),
        "tables": extract_tables(elements),
        "paragraphs": extract_paragraphs_by_role(elements),
        "metrics": result["metrics"],
        "element_count": len(elements),
    }


if __name__ == "__main__":
    pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
    data = process_invoice(pdf)
    print(json.dumps(data, indent=2))

Save the code from Steps 1–5 in a single file named extract.py, then run it:

python extract.py invoice.pdf

Example output:

{
  "fields": {
    "Invoice Number": "INV-2024-0042",
    "Date": "2024-03-15",
    "Due Date": "2024-04-15",
    "Total": "$1,247.50"
  },
  "tables": [
    [
      ["Item", "Quantity", "Unit Price", "Total"],
      ["Widget A", "10", "$25.00", "$250.00"],
      ["Widget B", "5", "$99.50", "$497.50"],
      ["Service Fee", "1", "$500.00", "$500.00"]
    ]
  ],
  "paragraphs": {
    "Title": ["Invoice"],
    "Text": ["Payment terms: Net 30 days."]
  },
  "metrics": {
    "processingTimeMs": 3800,
    "pagesProcessed": 1
  },
  "element_count": 12
}
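
Once fields and tables sit side by side, a simple sanity check is to reconcile the line items against the invoice total. The sketch below uses the line items from Step 3 and the "Total" field from Step 2. It assumes a header row containing a "Total" column and dollar-formatted amounts. The names to_decimal and line_items_match_total are hypothetical.

```python
from decimal import Decimal


def to_decimal(text: str) -> Decimal:
    """Convert a string such as "$1,247.50" to a Decimal."""
    # Assumption: "$" prefix and "," thousands separators only.
    return Decimal(text.replace("$", "").replace(",", "").strip())


def line_items_match_total(
    table: list[list[str]], total_text: str, total_column: str = "Total"
) -> bool:
    """Check that the per-line totals in a table add up to the invoice total."""
    header, *body = table
    idx = header.index(total_column)
    line_sum = sum((to_decimal(row[idx]) for row in body), Decimal("0"))
    return line_sum == to_decimal(total_text)


# Line items from Step 3 and the "Total" field from Step 2.
line_items = [
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "10", "$25.00", "$250.00"],
    ["Widget B", "5", "$99.50", "$497.50"],
    ["Service Fee", "1", "$500.00", "$500.00"],
]

print(line_items_match_total(line_items, "$1,247.50"))  # True
```

A mismatch does not always mean an extraction error (taxes or discounts may appear outside the table), so a failed check is better treated as a flag for review than as a hard failure.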

Using word-level data

Set includeWords: true to get word-level bounding boxes inside paragraphs and table cells. This is useful for building document overlays or highlighting matched text.

With curl:

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial","includeWords":true}}'

With Python, reusing extract_elements from Step 1:

result = extract_elements("invoice.pdf", include_words=True)

# Access word-level data
for el in result["output"]["elements"]:
    if el["type"] == "paragraph" and el.get("words"):
        for word in el["words"]:
            print(
                f"  '{word['text']}' at ({word['bounds']['x']}, "
                f"{word['bounds']['y']}) confidence={word['confidence']}"
            )
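
For the highlighting use case, one possible approach is to collect the bounds of every word that matches a search term. This sketch relies only on the word fields used above ("text" and "bounds") and passes each bounds object through untouched. The function name find_word_bounds and the sample data are illustrative.

```python
def find_word_bounds(elements: list[dict], term: str) -> list[dict]:
    """Return the bounds of each paragraph word equal to term (case-insensitive)."""
    needle = term.casefold()
    matches = []
    for el in elements:
        if el.get("type") != "paragraph":
            continue
        for word in el.get("words") or []:
            # Strip surrounding punctuation so "Total:" matches "total".
            text = word.get("text", "").strip(".,:;()")
            if text.casefold() == needle:
                matches.append(word["bounds"])
    return matches


# Hand-written sample in the shape accessed above, for illustration only.
sample_elements = [
    {
        "type": "paragraph",
        "words": [
            {"text": "Total:", "bounds": {"x": 400, "y": 700}, "confidence": 0.98},
            {"text": "$1,247.50", "bounds": {"x": 450, "y": 700}, "confidence": 0.97},
        ],
    }
]

print(find_word_bounds(sample_elements, "total"))
```

This only matches single words. Matching a multiword phrase would require comparing consecutive words and merging their bounds.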

Structure mode for cost-effective processing

Use structure mode when you need spatial elements at lower cost (1.5 credits per page vs. 9 credits for understand mode). It runs OCR-based segmentation without AI augmentation:

instructions = {
    "mode": "structure",
    "output": {"format": "spatial"},
}

Structure mode works well for simple invoices and forms with clear layouts. Use understand mode for complex documents with multicolumn layouts, nested tables, or formulas. See processing modes for a full comparison.
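
To make the cost difference concrete, here is a small estimate based on the per-page rates quoted above (1.5 credits for structure mode, 9 for understand mode). The helper name estimate_credits is hypothetical, and actual billing may include factors not covered here.

```python
# Per-page credit rates as quoted in this section.
CREDITS_PER_PAGE = {"structure": 1.5, "understand": 9.0}


def estimate_credits(pages: int, mode: str = "understand") -> float:
    """Estimate the credits needed to process a number of pages in a mode."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"Unknown mode: {mode!r}")
    return pages * CREDITS_PER_PAGE[mode]


print(estimate_credits(200, "structure"))   # 300.0
print(estimate_credits(200, "understand"))  # 1800.0
```

For a 200-page batch, structure mode costs one sixth as much as understand mode, which is why it is worth trying first on documents with clear layouts.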

Processing documents from URLs

If your documents are hosted at public URLs, skip the file upload:

def extract_from_url(url: str) -> dict:
    """Extract elements from a document at a public URL."""
    response = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "mode": "understand",
            "output": {"format": "spatial"},
        },
    )
    response.raise_for_status()
    return response.json()

Batch processing

Process a folder of invoices and output results as JSONL:

import pathlib


def batch_process(folder: str = "invoices", output: str = "results.jsonl"):
    """Process all PDFs in a folder and write results as JSONL."""
    out_path = pathlib.Path(output)
    with open(out_path, "w") as out:
        for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
            print(f"Processing {pdf.name}...")
            try:
                data = process_invoice(str(pdf))
                data["source"] = pdf.name
                out.write(json.dumps(data) + "\n")
            except Exception as e:
                print(f"  Error: {e}")
    print(f"Results written to {out_path}")
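
JSONL stores one JSON object per line, so the results can be read back incrementally. The sketch below parses any iterable of lines, such as an open file handle, and skips blank lines. The function name parse_jsonl is hypothetical.

```python
import json
from collections.abc import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL content (one JSON object per line) into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # Tolerate blank lines, e.g. a trailing newline at end of file.
            records.append(json.loads(line))
    return records


# Two lines in the shape batch_process writes, shortened for illustration.
sample_lines = [
    '{"source": "a.pdf", "element_count": 12}\n',
    "\n",
    '{"source": "b.pdf", "element_count": 7}\n',
]

for record in parse_jsonl(sample_lines):
    print(record["source"], record["element_count"])
```

To read the file that batch_process wrote, open it and pass the handle: with open("results.jsonl", encoding="utf-8") as f, call parse_jsonl(f).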

Production considerations

Confidence filtering — Each element and cell includes a confidence score (0–1). Filter low-confidence elements for higher-accuracy results.
Spatial validation — Use bounds coordinates to verify extracted fields are in expected regions of the document (e.g. totals at the bottom, dates at the top).
Multilingual documents — Set options.language for non-English invoices. See multilingual extraction.
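
The first two considerations can be sketched as small filters. This example makes several assumptions that should be checked against the element schema: each element carries a "confidence" value and a "bounds" object with a "y" coordinate, y is measured from the top of the page, and the page height is known to the caller. The 0.8 threshold and both function names are hypothetical.

```python
def filter_by_confidence(elements: list[dict], min_confidence: float = 0.8) -> list[dict]:
    """Keep only elements whose confidence meets the threshold."""
    # Assumption: elements without a confidence value are treated as 0.0
    # and dropped. Adjust this if missing values should be kept.
    return [el for el in elements if el.get("confidence", 0.0) >= min_confidence]


def in_vertical_band(
    el: dict, page_height: float, start: float = 0.0, end: float = 1.0
) -> bool:
    """Check that an element's y coordinate falls within a fraction of the page."""
    # Assumption: bounds["y"] is measured from the top of the page, in the
    # same units as page_height.
    y = el["bounds"]["y"]
    return start * page_height <= y <= end * page_height


# Hand-written sample elements, for illustration only.
sample_elements = [
    {"type": "paragraph", "confidence": 0.95, "bounds": {"x": 40, "y": 700}},
    {"type": "paragraph", "confidence": 0.42, "bounds": {"x": 40, "y": 60}},
]

confident = filter_by_confidence(sample_elements)
# Is the remaining element in the bottom quarter of an 800-unit-tall page?
print(len(confident), in_vertical_band(confident[0], page_height=800, start=0.75))
```

In practice, low-confidence elements are often routed to manual review rather than discarded, so that no invoice data is lost silently.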

Element types

Refer to the guide on how to extract document elements for the complete element type schema, including all fields, roles, and response structure.