Extract invoice data with a schema

This tutorial shows how to build a Python pipeline that extracts invoice fields into a typed JSON object with the Nutrient DWS Data Extraction API extract endpoint.

Instead of returning the document’s full structure, the extract endpoint maps a document to a JSON Schema that you define. The response includes the fields you request and citations that ground each value to its source in the document.

What you’ll build

You’ll create a Python script that performs the following tasks:

Defines a JSON Schema for the invoice fields you need.
Sends a PDF invoice to the extract endpoint.
Receives a typed JSON object that matches the schema.
Reviews per-field citations and flags low-confidence fields.

Extract vs. parse

The document extraction pipeline guide uses the parse endpoint and reconstructs fields from spatial elements, such as tables and key-value regions. This tutorial uses the extract endpoint, which maps the document to your schema.

Use extract when you know which fields you want. Use parse when you need the full document structure.

Prerequisites

Before you start, make sure you have the following items:

Python 3.10 or later.
A Nutrient DWS account and Data Extraction API key. Sign up in the Nutrient dashboard.

Project setup

Create a project directory, set up a Python virtual environment, and install the required packages:

mkdir invoice-extract && cd invoice-extract
python -m venv .venv && source .venv/bin/activate
pip install requests python-dotenv

Create a .env file with your API key:

NUTRIENT_API_KEY=your_data_extraction_api_key_here

Step 1 — Define the schema

Describe the fields you want to extract in a file named schema.py, which the script in Step 2 imports. The root schema must be an object, and each field’s description should tell the extraction model what to find. For supported keywords and limits, refer to the define a schema guide. The schema for this tutorial looks like this:

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier, usually labeled 'Invoice #' or 'Invoice No.'",
        },
        "issue_date": {
            "type": "string",
            "format": "date",
            "description": "Date the invoice was issued",
        },
        "vendor_name": {
            "type": "string",
            "description": "Name of the company issuing the invoice",
        },
        "currency": {
            "type": "string",
            "description": "ISO 4217 currency code, e.g. USD or EUR",
        },
        "total_amount": {
            "type": "number",
            "description": "Final total after discounts and tax",
        },
        "line_items": {
            "type": "array",
            "description": "One entry per row in the line-item table",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}
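
Before sending a schema, it can help to check it against the limits this tutorial lists under Production considerations (32 KB, 500 fields, and five levels of nesting). The following is a minimal sketch, not part of the Nutrient API. The helpers schema_stats and check_schema_limits are hypothetical, and the way the limits are measured here (UTF-8 bytes of the serialized schema, declared properties, and nested object levels) is an assumption about how the service counts them:

```python
import json

# Limits quoted in this tutorial. How the service measures them is an
# assumption: serialized UTF-8 size, declared properties, object nesting.
MAX_BYTES = 32 * 1024
MAX_FIELDS = 500
MAX_DEPTH = 5


def schema_stats(schema: dict, depth: int = 0) -> tuple[int, int]:
    """Return (field_count, deepest_object_level) for a schema fragment."""
    fields, deepest = 0, depth
    props = schema.get("properties", {})
    if props:
        deepest = depth + 1  # this object adds one nesting level
    for child in props.values():
        fields += 1
        sub_fields, sub_depth = schema_stats(child, depth + 1)
        fields += sub_fields
        deepest = max(deepest, sub_depth)
    items = schema.get("items")
    if isinstance(items, dict):
        # An array does not add a level by itself; its item schema might.
        sub_fields, sub_depth = schema_stats(items, depth)
        fields += sub_fields
        deepest = max(deepest, sub_depth)
    return fields, deepest


def check_schema_limits(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means no limit was hit."""
    problems = []
    size = len(json.dumps(schema).encode("utf-8"))
    fields, depth = schema_stats(schema)
    if size > MAX_BYTES:
        problems.append(f"schema is {size} bytes (limit {MAX_BYTES})")
    if fields > MAX_FIELDS:
        problems.append(f"schema declares {fields} fields (limit {MAX_FIELDS})")
    if depth > MAX_DEPTH:
        problems.append(f"schema nesting is {depth} levels (limit {MAX_DEPTH})")
    return problems
```

Under these counting assumptions, the invoice schema above declares nine fields across two object levels, so check_schema_limits(INVOICE_SCHEMA) returns an empty list.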

Step 2 — Call the extract endpoint

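Save the following code as extract.py, the script you run in Step 5. It sends the document and JSON-serialized instructions to the extract endpoint as a multipart form upload. The instructions include the schema and can also include parseConfig and document-level guidance: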

import json
import os
import sys

import requests
from dotenv import load_dotenv

from schema import INVOICE_SCHEMA

load_dotenv()

API_KEY = os.environ["NUTRIENT_API_KEY"]
ENDPOINT = "https://api.nutrient.io/extraction/extract"


def extract_invoice(pdf_path: str) -> dict:
    """Extract invoice fields into a typed JSON object."""
    instructions = {
        "schema": INVOICE_SCHEMA,
        "instructions": "Treat shipping and handling as separate line items.",
        "parseConfig": {"mode": "understand"},
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"instructions": json.dumps(instructions)},
            timeout=120,  # example value; prevents hanging on a stalled connection
        )
    return response.json()

Step 3 — Read the extracted data

A successful response returns the typed values in output.data:

{
  "status": 200,
  "requestId": "req_x1y2z3w4",
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "issue_date": "2024-03-15",
      "vendor_name": "Acme Industrial Supplies Ltd.",
      "currency": "EUR",
      "total_amount": 1547.5,
      "line_items": [
        { "description": "Widget A", "quantity": 10, "unit_price": 25.0 },
        { "description": "Widget B", "quantity": 5, "unit_price": 99.5 }
      ]
    },
    "metadata": { },
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  },
  "metrics": { "processingTimeMs": 4800, "pagesProcessed": 1 },
  "usage": {
    "data_extraction_credits": { "cost": 27, "remainingCredits": 832 }
  }
}

Because the schema is closed, output.data contains only the fields you declared.
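
Even with a typed response, a quick local check can catch surprises before the data reaches downstream systems. The following is a minimal sketch that inspects only top-level fields. The names JSON_TYPES and check_top_level are hypothetical, and treating a missing or null required value as a problem is an assumption about how you want to handle fields the model could not find:

```python
# Maps JSON Schema type names to the Python types json.loads produces.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_top_level(data: dict, schema: dict) -> list[str]:
    """Return problems found in top-level fields of output.data."""
    problems = []
    properties = schema.get("properties", {})

    # Required fields must be present and non-null (assumption, see lead-in).
    for name in schema.get("required", []):
        if data.get(name) is None:
            problems.append(f"{name}: missing required value")

    for name, value in data.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"{name}: not declared in schema")
            continue
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        if value is None or expected is None:
            continue  # nothing to compare against
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool_as_number = isinstance(value, bool) and type_name in ("number", "integer")
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {type_name}, got {type(value).__name__}"
            )
    return problems
```

A typical call is check_top_level(result["output"]["data"], INVOICE_SCHEMA), with any returned problems routed to the same review queue as low-confidence fields.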

Step 4 — Flag fields for review with citations

By default, the API returns citations in output.metadata. The metadata mirrors the shape of output.data. Use the match label and confidence score to route uncertain fields to human review. For the full metadata reference, refer to the citations and confidence guide:

def fields_needing_review(metadata: dict, threshold: float = 0.7) -> list[str]:
    """Return scalar field names whose citation suggests manual review."""
    flagged = []
    for field, citation in metadata.items():
        if not isinstance(citation, dict):
            continue  # not a dict, e.g. the line_items array
        if citation.get("match") in ("fuzzy_match", "not_found"):
            flagged.append(field)
            continue
        confidence = citation.get("confidence")
        if confidence is not None and confidence < threshold:
            flagged.append(field)
    return flagged
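
The function above checks only top-level scalar fields, so entries inside line_items are never reviewed. A recursive variant can walk the whole structure. The following is a sketch under two assumptions: that output.metadata mirrors output.data for arrays and nested objects, as described above, and that a citation is a dict carrying a string match label or a numeric confidence. The names _is_citation and nested_fields_needing_review, and the path format, are hypothetical:

```python
def _is_citation(node: dict) -> bool:
    """Heuristic: a citation has a string 'match' or a numeric 'confidence'."""
    return isinstance(node.get("match"), str) or isinstance(
        node.get("confidence"), (int, float)
    )


def nested_fields_needing_review(
    metadata, threshold: float = 0.7, path: str = ""
) -> list[str]:
    """Return paths such as 'line_items[0].quantity' that need manual review."""
    flagged: list[str] = []
    if isinstance(metadata, list):
        # Arrays: recurse into each element, adding its index to the path.
        for index, item in enumerate(metadata):
            flagged += nested_fields_needing_review(
                item, threshold, f"{path}[{index}]"
            )
    elif isinstance(metadata, dict):
        if _is_citation(metadata):
            confidence = metadata.get("confidence")
            low = isinstance(confidence, (int, float)) and confidence < threshold
            if metadata.get("match") in ("fuzzy_match", "not_found") or low:
                flagged.append(path or "<root>")
        else:
            # Nested object: recurse into each property.
            for key, value in metadata.items():
                child = f"{path}.{key}" if path else key
                flagged += nested_fields_needing_review(value, threshold, child)
    return flagged
```

One limitation of this heuristic: a schema property that is itself named match or confidence could be mistaken for a citation, so check the citations and confidence guide for the exact metadata shape before relying on it.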

Step 5 — Put it all together

Combine the extraction call, response validation, and review routing in one function:

def process_invoice(pdf_path: str) -> dict:
    """Extract an invoice and report any fields needing review."""
    result = extract_invoice(pdf_path)

    if result.get("status") != 200:
        raise RuntimeError(
            f"API error {result.get('status')}: {result.get('errorMessage')}"
        )

    output = result["output"]
    return {
        "data": output["data"],
        "review": fields_needing_review(output["metadata"]),
        "metrics": result["metrics"],
    }


if __name__ == "__main__":
    pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
    summary = process_invoice(pdf)
    print(json.dumps(summary, indent=2))

Run the script with an invoice PDF:

python extract.py invoice.pdf

Processing documents from URLs

If your invoices are hosted at public URLs, send the schema and a url field as JSON instead of uploading a file:

def extract_from_url(url: str) -> dict:
    """Extract invoice fields from a document at a public URL."""
    response = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "schema": INVOICE_SCHEMA,
            "parseConfig": {"mode": "understand"},
        },
        timeout=120,  # example value; prevents hanging on a stalled connection
    )
    response.raise_for_status()
    return response.json()

Production considerations

Review these points before you use the pipeline in production:

Review routing — Check the citation match label first. fuzzy_match and not_found are the clearest review signals. Then check confidence. The score is relative and uncalibrated, so calibrate the threshold against a labeled sample. Refer to the citations and confidence guide.
Parse mode and cost — An extract request bills a parse component, set by parseConfig.mode, plus an extract component per page. Use structure mode for clean layouts, or use agentic mode for degraded scans. Refer to the pricing guide and the parse configuration guide.
Schema limits — Keep schemas within the documented size limits: 32 KB, 500 fields, and five levels of nesting. Refer to the define a schema guide.
Multilingual invoices — Set parseConfig.options.language for non-English documents. Refer to the supported languages guide.

Next steps

Use these guides to continue building your extraction workflow:

Define a schema — Review supported keywords, field types, and limits.
Citations and confidence — Ground extracted fields for review workflows.
Build a document extraction pipeline — Use the parse endpoint and spatial elements.