Build a document extraction pipeline
This tutorial builds a Python pipeline that extracts structured data from invoices and forms using the Data Extraction API’s spatial element output. Unlike Markdown extraction, spatial output returns typed elements — tables with cell coordinates, key-value pairs from form fields, paragraphs with semantic roles — with bounding boxes and confidence scores.
What you’ll build
A Python script that:
- Sends a PDF invoice to the Data Extraction API
- Receives typed spatial elements with bounding boxes
- Extracts key-value pairs (invoice number, date, total, etc.)
- Extracts table data (line items) into structured rows
- Outputs a clean JSON summary ready for downstream systems
When to use spatial vs. Markdown
| Use case | Output format |
|---|---|
| Invoice processing, form extraction | Spatial (output.format: "spatial") |
| RAG pipelines, search indexing | Markdown (output.format: "markdown") |
| Document analysis with spatial data | Spatial |
| Content migration | Markdown |
This tutorial uses spatial output. For Markdown-based RAG pipelines, see the guide on building a RAG ingestion pipeline.
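The table above can be reduced to a small lookup. The sketch below is illustrative only: `build_instructions` and the use-case keys are hypothetical names invented here, and it assumes that the `mode` and `output.format` fields used elsewhere in this tutorial apply to both output formats.

```python
# Hypothetical mapping from use case to output format, mirroring the table above.
FORMAT_BY_USE_CASE = {
    "invoice_processing": "spatial",
    "form_extraction": "spatial",
    "document_analysis": "spatial",
    "rag_pipeline": "markdown",
    "search_indexing": "markdown",
    "content_migration": "markdown",
}


def build_instructions(use_case: str, mode: str = "understand") -> dict:
    """Return an instructions payload for the given use case.

    Assumption: the same "mode" values are accepted for both formats.
    """
    if use_case not in FORMAT_BY_USE_CASE:
        raise ValueError(f"Unknown use case: {use_case!r}")
    return {"mode": mode, "output": {"format": FORMAT_BY_USE_CASE[use_case]}}
```

Calling `build_instructions("invoice_processing")` yields the same payload shape that this tutorial sends for spatial extraction.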
Prerequisites
- Python 3.10+
- A Nutrient DWS account and Data Extraction API key (sign up at the Nutrient dashboard)
Project setup
mkdir invoice-extraction && cd invoice-extraction
python -m venv .venv && source .venv/bin/activate
pip install requests python-dotenv

Create a .env file:

NUTRIENT_API_KEY=your_data_extraction_api_key_here

Step 1: Extract elements from a document
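Create a file named extract.py (the name used by the run command later in this tutorial). Start with a function that sends a PDF to the Data Extraction API with output.format: "spatial" to receive typed elements: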
import json
import os
import sys

import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["NUTRIENT_API_KEY"]
ENDPOINT = "https://api.nutrient.io/extraction/parse"


def extract_elements(pdf_path: str, include_words: bool = False) -> dict:
    """Extract structured elements from a PDF."""
    instructions = {
        "mode": "understand",
        "output": {"format": "spatial", "includeWords": include_words},
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"instructions": json.dumps(instructions)},
        )
    return response.json()

Step 2: Process key-value pairs
The API returns keyValueRegion elements for form fields. Each region contains pairs with a key (the label) and a value (the answer):
def extract_key_values(elements: list[dict]) -> dict[str, str]:
    """Pull key-value pairs from keyValueRegion elements."""
    fields = {}
    for el in elements:
        if el["type"] != "keyValueRegion":
            continue
        for pair in el["pairs"]:
            key_text = pair.get("key", {})
            val_text = pair.get("value", {})
            if key_text and val_text:
                k = str(key_text.get("value", "")).strip()
                v = str(val_text.get("value", "")).strip()
                if k:
                    fields[k] = v
    return fields

For an invoice, this might return:

{
  "Invoice Number": "INV-2024-0042",
  "Date": "2024-03-15",
  "Due Date": "2024-04-15",
  "Total": "$1,247.50"
}

Step 3: Extract table data
Table elements include cell-level data with row/column positions, spans, and text content:
def extract_tables(elements: list[dict]) -> list[list[list[str | None]]]:
    """Convert table elements into row/column arrays, expanding spans."""
    tables = []
    for el in elements:
        if el["type"] != "table":
            continue
        rows: list[list[str | None]] = [
            [None] * el["columnCount"] for _ in range(el["rowCount"])
        ]
        for cell in el["cells"]:
            r, c = cell["row"], cell["column"]
            text = cell["text"]
            row_span = cell.get("rowSpan", 1)
            col_span = cell.get("colSpan", 1)
            # Copy the cell text into every grid position the span covers.
            for dr in range(row_span):
                for dc in range(col_span):
                    ri, ci = r + dr, c + dc
                    if ri < el["rowCount"] and ci < el["columnCount"]:
                        rows[ri][ci] = text
        tables.append(rows)
    return tables

For an invoice with line items, this produces:

[
  [
    ["Item", "Quantity", "Unit Price", "Total"],
    ["Widget A", "10", "$25.00", "$250.00"],
    ["Widget B", "5", "$99.50", "$497.50"],
    ["Service Fee", "1", "$500.00", "$500.00"]
  ]
]

Step 4: Classify paragraphs by role
Paragraph elements include a role field that identifies titles, section headers, captions, footers, and other semantic categories:
def extract_paragraphs_by_role(
    elements: list[dict],
) -> dict[str, list[str]]:
    """Group paragraph text by semantic role."""
    by_role: dict[str, list[str]] = {}
    for el in elements:
        if el["type"] != "paragraph":
            continue
        role = el.get("role") or "Unknown"
        by_role.setdefault(role, []).append(el["text"])
    return by_role

Step 5: Put it all together
def process_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF."""
    result = extract_elements(pdf_path)

    # Use .get() in the message too, so a response without "status"
    # raises the intended RuntimeError rather than a KeyError.
    if result.get("status") != 200:
        raise RuntimeError(
            f"API error {result.get('status')}: {result.get('errorMessage')}"
        )

    elements = result["output"]["elements"]

    return {
        "fields": extract_key_values(elements),
        "tables": extract_tables(elements),
        "paragraphs": extract_paragraphs_by_role(elements),
        "metrics": result["metrics"],
        "element_count": len(elements),
    }


if __name__ == "__main__":
    pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
    data = process_invoice(pdf)
    print(json.dumps(data, indent=2))

Run it:
python extract.py invoice.pdf

Example output:

{
  "fields": {
    "Invoice Number": "INV-2024-0042",
    "Date": "2024-03-15",
    "Total": "$1,247.50"
  },
  "tables": [
    [
      ["Item", "Quantity", "Unit Price", "Total"],
      ["Widget A", "10", "$25.00", "$250.00"],
      ["Widget B", "5", "$99.50", "$497.50"]
    ]
  ],
  "paragraphs": {
    "Title": ["Invoice"],
    "Text": ["Payment terms: Net 30 days."]
  },
  "metrics": {
    "processingTimeMs": 3800,
    "pagesProcessed": 1
  },
  "element_count": 12
}

Using word-level data
Set includeWords: true to get word-level bounding boxes inside paragraphs and table cells. This is useful for building document overlays or highlighting matched text.
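As a sketch of the highlighting use case: assuming each word entry carries `text` and `bounds` (the field names used by the word-level loop in this section), the boxes for a search term can be collected as shown below. `find_word_boxes` is a hypothetical helper, and it only inspects elements that expose a top-level `words` list; how words are nested inside table cells is not shown in this tutorial, so that case is left out.

```python
def find_word_boxes(elements: list[dict], query: str) -> list[dict]:
    """Return the bounds of every word whose text equals `query`, ignoring case.

    Assumption: words live in a top-level "words" list on each element.
    """
    needle = query.casefold()
    matches = []
    for el in elements:
        # Elements without word-level data are skipped.
        for word in el.get("words") or []:
            if str(word.get("text", "")).casefold() == needle:
                matches.append(word["bounds"])
    return matches
```

The returned bounds can then be handed to whatever overlay or annotation layer draws the highlights.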
To request word-level data with curl:

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial","includeWords":true}}'

The same request in Python, using extract_elements from Step 1:

result = extract_elements("invoice.pdf", include_words=True)

# Access word-level data
for el in result["output"]["elements"]:
    if el["type"] == "paragraph" and el.get("words"):
        for word in el["words"]:
            print(
                f"  '{word['text']}' at ({word['bounds']['x']}, "
                f"{word['bounds']['y']}) confidence={word['confidence']}"
            )

Structure mode for cost-effective processing
Use structure mode when you need spatial elements at lower cost (1.5 credits per page vs. 9 credits for understand mode). It runs OCR-based segmentation without AI augmentation:
instructions = {
    "mode": "structure",
    "output": {"format": "spatial"},
}

Structure mode works well for simple invoices and forms with clear layouts. Use understand mode for complex documents with multicolumn layouts, nested tables, or formulas. See processing modes for a full comparison.
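The per-page rates quoted above (1.5 credits for structure mode, 9 for understand mode) make it easy to estimate the difference for a batch. The sketch below assumes that billing scales linearly with page count, with no minimums or rounding, which this tutorial does not confirm. `estimate_credits` is a hypothetical helper, not part of the API.

```python
# Per-page rates as stated in this section.
CREDITS_PER_PAGE = {"structure": 1.5, "understand": 9.0}


def estimate_credits(page_count: int, mode: str) -> float:
    """Estimate credits for a job, assuming simple linear per-page billing."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"Unknown mode: {mode!r}")
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * CREDITS_PER_PAGE[mode]
```

For 200 pages, this gives 300.0 credits in structure mode versus 1800.0 in understand mode.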
Processing documents from URLs
If your documents are hosted at public URLs, skip the file upload:
def extract_from_url(url: str) -> dict:
    """Extract elements from a document at a public URL."""
    response = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "mode": "understand",
            "output": {"format": "spatial"},
        },
    )
    response.raise_for_status()
    return response.json()

Batch processing
Process a folder of invoices and output results as JSONL:
import pathlib


def batch_process(folder: str = "invoices", output: str = "results.jsonl"):
    """Process all PDFs in a folder and write results as JSONL."""
    out_path = pathlib.Path(output)
    with open(out_path, "w") as out:
        for pdf in sorted(pathlib.Path(folder).glob("*.pdf")):
            print(f"Processing {pdf.name}...")
            try:
                data = process_invoice(str(pdf))
                data["source"] = pdf.name
                out.write(json.dumps(data) + "\n")
            except Exception as e:
                # Report the failure and continue with the next file.
                print(f"  Error: {e}")
    print(f"Results written to {out_path}")

Production considerations
- Confidence filtering: Each element and cell includes a confidence score (0–1). Filter low-confidence elements for higher-accuracy results.
- Spatial validation: Use bounds coordinates to verify extracted fields are in expected regions of the document (e.g. totals at the bottom, dates at the top).
- Multilingual documents: Set options.language for non-English invoices. See multilingual extraction.
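The first two considerations above could be sketched as follows. Several things here are assumptions rather than documented behavior: the 0.8 threshold is arbitrary, elements without a `confidence` value are kept rather than dropped, `bounds` is assumed to expose a `y` coordinate that grows downward from the top of the page, and the caller is assumed to know the page height. Both function names are hypothetical.

```python
def filter_by_confidence(elements: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep elements whose confidence meets the threshold.

    Assumption: an element with no confidence value is kept, not dropped.
    """
    kept = []
    for el in elements:
        confidence = el.get("confidence")
        if confidence is None or confidence >= threshold:
            kept.append(el)
    return kept


def in_lower_region(bounds: dict, page_height: float, fraction: float = 0.5) -> bool:
    """Return True if the box starts in the lower `fraction` of the page.

    Assumption: y is measured downward from the top edge of the page.
    """
    return bounds["y"] >= page_height * (1 - fraction)
```

A total whose bounds fail `in_lower_region` is not necessarily wrong, but it may be worth routing to manual review instead of accepting it automatically.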
Element types
Refer to the guide on how to extract document elements for the complete element type schema, including all fields, roles, and response structure.
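When exploring a new document type, a quick tally of the element types in a response can show which of the handlers in this tutorial apply (`keyValueRegion`, `table`, `paragraph`) and which types still need handling. The sketch below only assumes that each element has a `type` field, as in the steps above. `count_element_types` is a hypothetical helper.

```python
from collections import Counter


def count_element_types(elements: list[dict]) -> Counter:
    """Count elements by their "type" field; missing types count as "unknown"."""
    return Counter(el.get("type", "unknown") for el in elements)
```

Passing `result["output"]["elements"]` from Step 5 to this function returns a `Counter` keyed by element type.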