---
title: "Build a document extraction pipeline for invoices and forms"
canonical_url: "https://www.nutrient.io/guides/dws-data-extraction/examples/build-document-extraction-pipeline/"
md_url: "https://www.nutrient.io/guides/dws-data-extraction/examples/build-document-extraction-pipeline.md"
last_updated: "2026-05-26T22:37:31.557Z"
description: "Extract tables, key-value pairs, and structured elements from invoices and forms using the Data Extraction API’s spatial output."
---

This tutorial builds a Python pipeline that extracts structured data from invoices and forms using the Data Extraction API’s spatial element output. Unlike Markdown extraction, spatial output returns typed elements — tables with cell coordinates, key-value pairs from form fields, paragraphs with semantic roles — with bounding boxes and confidence scores.

## What you’ll build

A Python script that:

1. Sends a PDF invoice to the Data Extraction API

2. Receives typed spatial elements with bounding boxes

3. Extracts key-value pairs (invoice number, date, total, etc.)

4. Extracts table data (line items) into structured rows

5. Outputs a clean JSON summary ready for downstream systems

## When to use spatial vs. Markdown

| Use case                            | Output format                          |
| ----------------------------------- | -------------------------------------- |
| Invoice processing, form extraction | Spatial (`output.format: "spatial"`)   |
| RAG pipelines, search indexing      | Markdown (`output.format: "markdown"`) |
| Document analysis with spatial data | Spatial                                |
| Content migration                   | Markdown                               |

This tutorial uses spatial output. For Markdown-based RAG pipelines, see the guide on how to [build a RAG ingestion pipeline](https://www.nutrient.io/guides/dws-data-extraction/examples/build-rag-ingestion-pipeline.md).

## Prerequisites

- Python 3.10+

- A Nutrient DWS account and Data Extraction API key — sign up at the [Nutrient dashboard](https://dashboard.nutrient.io/sign_up/?product=data-extraction)

## Project setup

```shell

mkdir invoice-extraction && cd invoice-extraction
python -m venv .venv && source .venv/bin/activate
pip install requests python-dotenv

```

Create a `.env` file:

```shell

NUTRIENT_API_KEY=your_data_extraction_api_key_here

```

## Step 1 — Extract elements from a document

Send a PDF to the Data Extraction API with `output.format: "spatial"` to receive typed elements:

```python

# extract.py

import json
import os
import sys

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["NUTRIENT_API_KEY"]
ENDPOINT = "https://api.nutrient.io/extraction/parse"

def extract_elements(pdf_path: str, include_words: bool = False) -> dict:
    """Extract structured elements from a PDF."""
    instructions = {
        "mode": "understand",
        "output": {"format": "spatial", "includeWords": include_words},
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"instructions": json.dumps(instructions)},
        )
    return response.json()

```
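
Before writing extractors, it can help to check which element types a response actually contains. The following is a minimal sketch, assuming the response shape used in the rest of this tutorial (`result["output"]["elements"]`, with a `type` on each element). The helper name `count_element_types` and the sample data are illustrative and not part of the API.

```python
from collections import Counter


def count_element_types(elements: list[dict]) -> dict[str, int]:
    """Count how many elements of each type a response contains."""
    return dict(Counter(el.get("type", "unknown") for el in elements))


# Hand-written sample standing in for result["output"]["elements"].
sample_elements = [
    {"type": "paragraph", "text": "Invoice"},
    {"type": "keyValueRegion", "pairs": []},
    {"type": "table", "rowCount": 0, "columnCount": 0, "cells": []},
    {"type": "paragraph", "text": "Payment terms: Net 30 days."},
]

print(count_element_types(sample_elements))
# {'paragraph': 2, 'keyValueRegion': 1, 'table': 1}
```

With a real response, pass `extract_elements("invoice.pdf")["output"]["elements"]` instead of the sample list.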

## Step 2 — Process key-value pairs

The API returns `keyValueRegion` elements for form fields. Each region contains pairs with a key (the label) and a value (the answer):

```python

def extract_key_values(elements: list[dict]) -> dict[str, str]:
    """Pull key-value pairs from keyValueRegion elements."""
    fields = {}
    for el in elements:
        if el["type"] != "keyValueRegion":
            continue
        for pair in el["pairs"]:
            key_text = pair.get("key", {})
            val_text = pair.get("value", {})
            if key_text and val_text:
                k = str(key_text.get("value", "")).strip()
                v = str(val_text.get("value", "")).strip()
                if k:
                    fields[k] = v
    return fields

```

For an invoice, this might return:

```json

{
  "Invoice Number": "INV-2024-0042",
  "Date": "2024-03-15",
  "Due Date": "2024-04-15",
  "Total": "$1,247.50"
}

```
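
Labels vary between vendors ("Invoice Number", "Invoice No.", "Invoice #"), so downstream systems usually want stable keys. One possible approach is an alias table that maps extracted labels onto canonical names. In this sketch, `FIELD_ALIASES` and `normalize_fields` are hypothetical names and the alias lists are examples only; extend them with the labels your own documents use.

```python
# Hypothetical alias table: canonical key -> lowercase label variants.
FIELD_ALIASES = {
    "invoice_number": {"invoice number", "invoice no", "invoice #"},
    "date": {"date", "invoice date"},
    "due_date": {"due date", "payment due"},
    "total": {"total", "total due", "amount due"},
}


def normalize_fields(fields: dict[str, str]) -> dict[str, str]:
    """Map extracted labels onto canonical keys; unknown labels are kept as-is."""
    normalized = {}
    for label, value in fields.items():
        # Drop surrounding whitespace and trailing ":" or "." before matching.
        cleaned = label.strip().rstrip(":.").strip().casefold()
        canonical = next(
            (name for name, aliases in FIELD_ALIASES.items() if cleaned in aliases),
            label,
        )
        normalized[canonical] = value
    return normalized


print(
    normalize_fields(
        {"Invoice Number": "INV-2024-0042", "Due Date:": "2024-04-15", "PO": "7731"}
    )
)
# {'invoice_number': 'INV-2024-0042', 'due_date': '2024-04-15', 'PO': '7731'}
```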

## Step 3 — Extract table data

Table elements include cell-level data with row/column positions, spans, and text content:

```python

def extract_tables(elements: list[dict]) -> list[list[list[str | None]]]:
    """Convert table elements into row/column arrays, expanding spans."""
    tables = []
    for el in elements:
        if el["type"] != "table":
            continue
        rows: list[list[str | None]] = [[None] * el["columnCount"] for _ in range(el["rowCount"])]
        for cell in el["cells"]:
            r, c = cell["row"], cell["column"]
            text = cell["text"]
            row_span = cell.get("rowSpan", 1)
            col_span = cell.get("colSpan", 1)
            for dr in range(row_span):
                for dc in range(col_span):
                    ri, ci = r + dr, c + dc
                    if ri < el["rowCount"] and ci < el["columnCount"]:
                        rows[ri][ci] = text
        tables.append(rows)
    return tables

```

For an invoice with line items, this produces:

```json

[
  [
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "10", "$25.00", "$250.00"],
    ["Widget B", "5", "$99.50", "$497.50"],
    ["Service Fee", "1", "$500.00", "$500.00"]
  ]
]

```

## Step 4 — Classify paragraphs by role

Paragraph elements include a `role` field that identifies titles, section headers, captions, footers, and other semantic categories:

```python

def extract_paragraphs_by_role(
    elements: list[dict],
) -> dict[str, list[str]]:
    """Group paragraph text by semantic role."""
    by_role: dict[str, list[str]] = {}
    for el in elements:
        if el["type"] != "paragraph":
            continue
        role = el.get("role") or "Unknown"
        by_role.setdefault(role, []).append(el["text"])
    return by_role

```

## Step 5 — Put it all together

```python

def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = extract_elements(pdf_path)

    if result.get("status") != 200:
        # Use .get() here too, so a response without "status" does not raise KeyError.
        raise RuntimeError(
            f"API error {result.get('status')}: {result.get('errorMessage')}"
        )

    elements = result["output"]["elements"]

    return {
        "fields": extract_key_values(elements),
        "tables": extract_tables(elements),
        "paragraphs": extract_paragraphs_by_role(elements),
        "metrics": result["metrics"],
        "element_count": len(elements),
    }

if __name__ == "__main__":
    pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
    data = process_invoice(pdf)
    print(json.dumps(data, indent=2))

```

Run it:

```shell

python extract.py invoice.pdf

```

Example output:

```json

{
  "fields": {
    "Invoice Number": "INV-2024-0042",
    "Date": "2024-03-15",
    "Total": "$1,247.50"
  },
  "tables": [
    [
      ["Item", "Quantity", "Unit Price", "Total"],
      ["Widget A", "10", "$25.00", "$250.00"],
      ["Widget B", "5", "$99.50", "$497.50"],
      ["Service Fee", "1", "$500.00", "$500.00"]
    ]
  ],
  "paragraphs": {
    "Title": ["Invoice"],
    "Text": ["Payment terms: Net 30 days."]
  },
  "metrics": {
    "processingTimeMs": 3800,
    "pagesProcessed": 1
  },
  "element_count": 12
}

```
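
A simple sanity check on the combined result is to confirm that the line items add up to the stated total. The sketch below makes several assumptions: the last column of the first table holds line totals, amounts use a leading `$` with `,` as the thousands separator, and the total is stored under the `Total` label. `parse_amount` and `line_items_match_total` are illustrative helpers, not part of the API.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Parse a string such as "$1,247.50" into a Decimal.

    Assumes a leading "$" and "," as the thousands separator.
    """
    return Decimal(text.replace("$", "").replace(",", "").strip())


def line_items_match_total(data: dict, total_key: str = "Total") -> bool:
    """Check that the last column of the first table sums to the stated total."""
    stated = parse_amount(data["fields"][total_key])
    body_rows = data["tables"][0][1:]  # skip the header row
    computed = sum((parse_amount(row[-1]) for row in body_rows), Decimal("0"))
    return computed == stated


# Sample shaped like the output of process_invoice().
sample = {
    "fields": {"Total": "$1,247.50"},
    "tables": [
        [
            ["Item", "Quantity", "Unit Price", "Total"],
            ["Widget A", "10", "$25.00", "$250.00"],
            ["Widget B", "5", "$99.50", "$497.50"],
            ["Service Fee", "1", "$500.00", "$500.00"],
        ]
    ],
}
print(line_items_match_total(sample))  # True
```

`Decimal` is used instead of `float` so that currency sums compare exactly. A mismatch does not necessarily mean the extraction is wrong (taxes or discounts may sit outside the table), but it is a useful signal for manual review.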

## Using word-level data

Set `includeWords: true` to get word-level bounding boxes inside paragraphs and table cells. This is useful for building document overlays or highlighting matched text.

### curl

```shell

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial","includeWords":true}}'

```

### Python

```python

result = extract_elements("invoice.pdf", include_words=True)

# Access word-level data

for el in result["output"]["elements"]:
    if el["type"] == "paragraph" and el.get("words"):
        for word in el["words"]:
            print(
                f"  '{word['text']}' at ({word['bounds']['x']}, "
                f"{word['bounds']['y']}) confidence={word['confidence']}"
            )

```
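
To highlight matched text, you need the boxes of the words that match a search term. The following sketch collects them from paragraph elements. It relies only on the `words`, `text`, and `bounds` fields shown above; `find_word_boxes` is an illustrative helper, and the coordinates in the sample are made up.

```python
def find_word_boxes(elements: list[dict], term: str) -> list[dict]:
    """Return the bounds of every paragraph word that matches term, ignoring case."""
    needle = term.casefold()
    boxes = []
    for el in elements:
        if el.get("type") != "paragraph":
            continue
        for word in el.get("words") or []:
            # Strip common punctuation so "terms:" still matches "terms".
            if word["text"].strip(".,:;()").casefold() == needle:
                boxes.append(word["bounds"])
    return boxes


# Hand-written sample; the coordinate values are invented for illustration.
sample_elements = [
    {
        "type": "paragraph",
        "words": [
            {"text": "Payment", "bounds": {"x": 72, "y": 640}, "confidence": 0.99},
            {"text": "terms:", "bounds": {"x": 130, "y": 640}, "confidence": 0.98},
            {"text": "Net", "bounds": {"x": 175, "y": 640}, "confidence": 0.97},
        ],
    }
]
print(find_word_boxes(sample_elements, "Terms"))
# [{'x': 130, 'y': 640}]
```

The same loop could be extended to table cells if they carry a `words` list in your responses.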

## Structure mode for cost-effective processing

Use `structure` mode when you need spatial elements at lower cost (1.5 credits per page vs. 9 credits per page in `understand` mode). It runs OCR-based segmentation without AI augmentation:

```python

instructions = {
    "mode": "structure",
    "output": {"format": "spatial"},
}

```

Structure mode works well for simple invoices and forms with clear layouts. Use `understand` mode for complex documents with multicolumn layouts, nested tables, or formulas. See [processing modes](https://www.nutrient.io/guides/dws-data-extraction/parsing/processing-modes.md) for a full comparison.
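
To compare the two modes for a given volume, the per-page figures quoted above can be turned into a quick estimate. This is plain arithmetic on those two numbers; `CREDITS_PER_PAGE` and `estimate_credits` are illustrative names, and actual billing may depend on details not covered here.

```python
from decimal import Decimal

# Per-page credit costs as quoted in this section.
CREDITS_PER_PAGE = {"structure": Decimal("1.5"), "understand": Decimal("9")}


def estimate_credits(pages: int, mode: str) -> Decimal:
    """Estimate the credits needed to process a number of pages in a given mode."""
    return CREDITS_PER_PAGE[mode] * pages


for mode in CREDITS_PER_PAGE:
    print(f"{mode}: {estimate_credits(1000, mode)} credits for 1,000 pages")
```

For 1,000 pages, this gives 1,500 credits in `structure` mode and 9,000 in `understand` mode, a sixfold difference.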

## Processing documents from URLs

If your documents are hosted at public URLs, skip the file upload:

```python

def extract_from_url(url: str) -> dict:
    """Extract elements from a document at a public URL."""
    response = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "mode": "understand",
            "output": {"format": "spatial"},
        },
    )
    response.raise_for_status()
    return response.json()

```

## Batch processing

Process a folder of invoices and output results as JSONL:

```python

import pathlib

def batch_process(folder: str = "invoices", output: str = "results.jsonl"):
    """Process all PDFs in a folder and write results as JSONL."""
    out_path = pathlib.Path(output)
    with open(out_path, "w") as out:
        for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
            print(f"Processing {pdf.name}...")
            try:
                data = process_invoice(str(pdf))
                data["source"] = pdf.name
                out.write(json.dumps(data) + "\n")
            except Exception as e:
                print(f"  Error: {e}")
    print(f"Results written to {out_path}")

```

## Production considerations

- **Confidence filtering** — Each element and cell includes a `confidence` score (0–1). Filter low-confidence elements for higher-accuracy results.

- **Spatial validation** — Use `bounds` coordinates to verify extracted fields are in expected regions of the document (e.g. totals at the bottom, dates at the top).

- **Multilingual documents** — Set `options.language` for non-English invoices. See [multilingual extraction](https://www.nutrient.io/guides/dws-data-extraction/parsing/multilingual-extraction.md).
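
The first two points can be sketched as small filters over the element list. Several things here are assumptions: the `0.8` threshold is arbitrary, `bounds["y"]` is taken to be measured from the top of the page, and `page_height` must be supplied by the caller in the same units. Check all of these against the element schema before relying on them. `filter_by_confidence` and `in_lower_part` are illustrative helpers.

```python
def filter_by_confidence(elements: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep elements whose confidence is at or above threshold.

    Elements without a confidence score are dropped; change the default
    in .get() if you would rather keep them.
    """
    return [el for el in elements if el.get("confidence", 0.0) >= threshold]


def in_lower_part(el: dict, page_height: float, fraction: float = 0.5) -> bool:
    """Return True if the element's y coordinate lies in the lower part of the page.

    Assumes bounds["y"] grows downward from the top of the page.
    """
    return el["bounds"]["y"] >= page_height * (1 - fraction)


# Hand-written sample; confidence values and coordinates are invented.
sample_elements = [
    {"type": "paragraph", "text": "Invoice", "confidence": 0.99, "bounds": {"x": 72, "y": 40}},
    {"type": "paragraph", "text": "Tota1 due", "confidence": 0.41, "bounds": {"x": 72, "y": 700}},
    {"type": "paragraph", "text": "Total due", "confidence": 0.93, "bounds": {"x": 72, "y": 720}},
]

kept = filter_by_confidence(sample_elements)
print([el["text"] for el in kept])
# ['Invoice', 'Total due']
print([el["text"] for el in kept if in_lower_part(el, page_height=792)])
# ['Total due']
```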

## Element types

Refer to the guide on how to [extract document elements](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-document-elements.md) for the complete element type schema, including all fields, roles, and response structure.

---

## Related pages

- [Build a RAG ingestion pipeline](/guides/dws-data-extraction/examples/build-rag-ingestion-pipeline.md)
- [Examples](/guides/dws-data-extraction/examples.md)

