---
title: "Extract invoice data to JSON with a schema"
canonical_url: "https://www.nutrient.io/guides/dws-data-extraction/examples/extract-invoice-data-with-schema/"
md_url: "https://www.nutrient.io/guides/dws-data-extraction/examples/extract-invoice-data-with-schema.md"
last_updated: "2026-06-11T00:00:00.000Z"
description: "Build a Python pipeline that extracts invoice fields into typed JSON using the Data Extraction API extract endpoint, with per-field citations for review."
---

This tutorial shows how to build a Python pipeline that extracts invoice fields into a typed JSON object with the Nutrient DWS Data Extraction API [extract endpoint](https://www.nutrient.io/guides/dws-data-extraction/extract.md).

Instead of returning the document’s full structure, the extract endpoint maps a document to a [JSON Schema](https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema.md) that you define. The response includes the fields you request and [citations](https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence.md) that ground each value to its source in the document.

## What you’ll build

You’ll create a Python script that performs the following tasks:

1. Defines a JSON Schema for the invoice fields you need.

2. Sends a PDF invoice to the extract endpoint.

3. Receives a typed JSON object that matches the schema.

4. Reviews per-field citations and flags low-confidence fields.

## Extract vs. parse

The [document extraction pipeline](https://www.nutrient.io/guides/dws-data-extraction/examples/build-document-extraction-pipeline.md) guide uses the parse endpoint and reconstructs fields from spatial elements, such as tables and key-value regions. This tutorial uses the extract endpoint, which maps the document to your schema.

Use extract when you know which fields you want. Use parse when you need the full document structure.

## Prerequisites

Before you start, make sure you have the following items:

- Python 3.10 or later.

- A Nutrient DWS account and Data Extraction API key. Sign up in the [Nutrient dashboard](https://dashboard.nutrient.io/sign_up/?product=data-extraction).

## Project setup

Create a project directory, set up a Python virtual environment, and install the required packages:

```shell

mkdir invoice-extract && cd invoice-extract
python -m venv .venv && source .venv/bin/activate
pip install requests python-dotenv

```

Create a `.env` file with your API key:

```shell

NUTRIENT_API_KEY=your_data_extraction_api_key_here

```

## Step 1 — Define the schema

Describe the fields you want to extract. The root schema must be an `object`, and each field’s `description` should tell the extraction model what to find. For supported keywords and limits, refer to the [define a schema](https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema.md) guide:

```python

# schema.py

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier, usually labeled 'Invoice #' or 'Invoice No.'",

        },
        "issue_date": {
            "type": "string",
            "format": "date",
            "description": "Date the invoice was issued",
        },
        "vendor_name": {
            "type": "string",
            "description": "Name of the company issuing the invoice",
        },
        "currency": {
            "type": "string",
            "description": "ISO 4217 currency code, e.g. USD or EUR",
        },
        "total_amount": {
            "type": "number",
            "description": "Final total after discounts and tax",
        },
        "line_items": {
            "type": "array",
            "description": "One entry per row in the line-item table",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "quantity", "unit_price"],
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

```
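
Before you send a schema, a quick local check can catch one that exceeds the limits listed under Production considerations (32 KB, 500 fields, five levels of nesting). The following is a minimal sketch, not part of the API: `schema_stats` and `schema_problems` are hypothetical helpers. It assumes the size limit applies to the UTF-8 serialized schema, that a field is any entry under a `properties` key, and that each `properties` level adds one level of nesting. The define a schema guide remains the authority on how the service counts.

```python
import json

# Limits quoted in this guide; how the service counts them is an assumption.
MAX_BYTES = 32 * 1024
MAX_FIELDS = 500
MAX_DEPTH = 5


def schema_stats(schema: dict) -> dict:
    """Return serialized size, field count, and nesting depth of a schema."""

    def walk(node: dict, depth: int) -> tuple[int, int]:
        fields, deepest = 0, depth
        # Every entry under "properties" counts as one field, one level deeper.
        for child in node.get("properties", {}).values():
            sub_fields, sub_depth = walk(child, depth + 1)
            fields += 1 + sub_fields
            deepest = max(deepest, sub_depth)
        # Array items are walked at the array's own level.
        items = node.get("items")
        if isinstance(items, dict):
            sub_fields, sub_depth = walk(items, depth)
            fields += sub_fields
            deepest = max(deepest, sub_depth)
        return fields, deepest

    fields, depth = walk(schema, 0)
    size = len(json.dumps(schema).encode("utf-8"))
    return {"bytes": size, "fields": fields, "depth": depth}


def schema_problems(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means no issue found."""
    problems = []
    if schema.get("type") != "object":
        problems.append("root schema must have type 'object'")
    stats = schema_stats(schema)
    if stats["bytes"] > MAX_BYTES:
        problems.append(f"schema is {stats['bytes']} bytes (limit {MAX_BYTES})")
    if stats["fields"] > MAX_FIELDS:
        problems.append(f"schema has {stats['fields']} fields (limit {MAX_FIELDS})")
    if stats["depth"] > MAX_DEPTH:
        problems.append(f"schema nests {stats['depth']} levels (limit {MAX_DEPTH})")
    return problems
```

Under these counting assumptions, `schema_stats(INVOICE_SCHEMA)` reports nine fields and a depth of two, so `schema_problems(INVOICE_SCHEMA)` returns an empty list.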

## Step 2 — Call the extract endpoint

Send the document and JSON-serialized instructions to the extract endpoint. The instructions include the schema and can include `parseConfig` and document-level guidance:

```python

# extract.py

import json
import os
import sys

import requests
from dotenv import load_dotenv

from schema import INVOICE_SCHEMA

load_dotenv()

API_KEY = os.environ["NUTRIENT_API_KEY"]
ENDPOINT = "https://api.nutrient.io/extraction/extract"

def extract_invoice(pdf_path: str) -> dict:
    """Extract invoice fields into a typed JSON object."""
    instructions = {
        "schema": INVOICE_SCHEMA,
        "instructions": "Treat shipping and handling as separate line items.",
        "parseConfig": {"mode": "understand"},
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"instructions": json.dumps(instructions)},
        )
    return response.json()

```

## Step 3 — Read the extracted data

A successful response returns the typed values in `output.data`:

```json

{
  "status": 200,
  "requestId": "req_x1y2z3w4",
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "issue_date": "2024-03-15",
      "vendor_name": "Acme Industrial Supplies Ltd.",
      "currency": "EUR",
      "total_amount": 1547.5,
      "line_items": [
        { "description": "Widget A", "quantity": 10, "unit_price": 25.0 },
        { "description": "Widget B", "quantity": 5, "unit_price": 99.5 }
      ]
    },
    "metadata": { },
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  },
  "metrics": { "processingTimeMs": 4800, "pagesProcessed": 1 },
  "usage": {
    "data_extraction_credits": { "cost": 27, "remainingCredits": 832 }
  }
}

```

Because the schema is closed, `output.data` contains only the fields you declared.
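
A light local check can confirm that `output.data` has the shape your downstream code expects before you store it. The sketch below uses only the standard library and covers just the `type` and `required` keywords. `check_data` is a hypothetical helper, not part of the API, and it reports a `null` value as a type mismatch, which may or may not suit how you treat fields the model could not find. A full JSON Schema validator library is the more complete option.

```python
def _type_ok(value, expected: str) -> bool:
    """Check a Python value against a JSON Schema type name."""
    if expected == "string":
        return isinstance(value, str)
    if expected == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    if expected == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if expected == "boolean":
        return isinstance(value, bool)
    if expected == "array":
        return isinstance(value, list)
    if expected == "object":
        return isinstance(value, dict)
    return True  # unknown type name: do not judge


def check_data(data, schema: dict, path: str = "$") -> list[str]:
    """Return a list of mismatches between data and schema (empty if none)."""
    expected = schema.get("type")
    if isinstance(expected, str) and not _type_ok(data, expected):
        return [f"{path}: expected {expected}, got {type(data).__name__}"]

    problems = []
    if isinstance(data, dict):
        for name in schema.get("required", []):
            if name not in data:
                problems.append(f"{path}.{name}: required field missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in data:
                problems.extend(check_data(data[name], sub_schema, f"{path}.{name}"))
    elif isinstance(data, list) and isinstance(schema.get("items"), dict):
        for index, item in enumerate(data):
            problems.extend(check_data(item, schema["items"], f"{path}[{index}]"))
    return problems
```

For example, `check_data(result["output"]["data"], INVOICE_SCHEMA)` returns an empty list when every declared field has the expected type and no required field is missing.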

## Step 4 — Flag fields for review with citations

By default, the API returns citations in `output.metadata`. The metadata mirrors the shape of `output.data`. Use the `match` label and `confidence` score to route uncertain fields to human review. For the full metadata reference, refer to the [citations and confidence](https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence.md) guide:

```python

def fields_needing_review(metadata: dict, threshold: float = 0.7) -> list[str]:
    """Return scalar field names whose citation suggests manual review."""
    flagged = []
    for field, citation in metadata.items():
        if not isinstance(citation, dict):
            continue  # not a citation object (for example, the line_items array)

        if citation.get("match") in ("fuzzy_match", "not_found"):
            flagged.append(field)
            continue
        confidence = citation.get("confidence")
        if confidence is not None and confidence < threshold:
            flagged.append(field)
    return flagged

```
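
The function above checks top-level fields only, so citations under `line_items` are never inspected. Because the metadata mirrors the shape of `output.data`, a recursive walk can cover nested fields too. The following sketch is hypothetical and rests on an assumption about the metadata layout: a citation is a dictionary that has a `match` or `confidence` key, nested objects are dictionaries without those keys, and arrays are lists. Check the citations and confidence guide for the actual structure before relying on it.

```python
REVIEW_LABELS = ("fuzzy_match", "not_found")


def nested_fields_needing_review(
    metadata, threshold: float = 0.7, path: str = ""
) -> list[str]:
    """Return paths such as 'line_items[0].quantity' that need manual review."""
    flagged = []
    if isinstance(metadata, list):
        for index, item in enumerate(metadata):
            flagged += nested_fields_needing_review(
                item, threshold, f"{path}[{index}]"
            )
    elif isinstance(metadata, dict):
        # Assumption: a dict carrying "match" or "confidence" is a citation leaf.
        # A schema field literally named "match" would be misread by this test.
        if "match" in metadata or "confidence" in metadata:
            confidence = metadata.get("confidence")
            low = confidence is not None and confidence < threshold
            if metadata.get("match") in REVIEW_LABELS or low:
                flagged.append(path or "<root>")
        else:
            for name, child in metadata.items():
                child_path = f"{path}.{name}" if path else name
                flagged += nested_fields_needing_review(
                    child, threshold, child_path
                )
    return flagged
```

Returning full paths rather than bare field names lets a reviewer jump straight to the affected line item.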

## Step 5 — Put it all together

Combine the extraction call, response validation, and review routing in one function:

```python

def process_invoice(pdf_path: str) -> dict:
    """Extract an invoice and report any fields needing review."""
    result = extract_invoice(pdf_path)

    if result.get("status") != 200:
        # Use .get() so a response without a status does not raise KeyError
        raise RuntimeError(
            f"API error {result.get('status')}: {result.get('errorMessage')}"
        )

    output = result["output"]
    return {
        "data": output["data"],
        "review": fields_needing_review(output["metadata"]),
        "metrics": result["metrics"],
    }

if __name__ == "__main__":
    pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
    summary = process_invoice(pdf)
    print(json.dumps(summary, indent=2))

```

Run the script with an invoice PDF:

```shell

python extract.py invoice.pdf

```
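
To run the same pipeline over a folder of invoices, one option is a small batch wrapper. The sketch below is hypothetical: `process_folder` and its report layout are not part of the API. It takes the per-file function as a parameter, so you can pass `process_invoice` from Step 5, and it keeps going when one file fails.

```python
import json
from pathlib import Path
from typing import Callable


def process_folder(
    folder: str, out_dir: str, process: Callable[[str], dict]
) -> dict:
    """Process every PDF in a folder and write one JSON summary per file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = {"ok": [], "needs_review": [], "failed": []}

    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            summary = process(str(pdf))
        except Exception as exc:  # record the failure and continue the batch
            report["failed"].append({"file": pdf.name, "error": str(exc)})
            continue

        target = out / f"{pdf.stem}.json"
        target.write_text(json.dumps(summary, indent=2), encoding="utf-8")

        # Route files with any flagged field to the review bucket
        bucket = "needs_review" if summary.get("review") else "ok"
        report[bucket].append(pdf.name)

    return report
```

A call such as `process_folder("invoices", "results", process_invoice)` writes `results/<name>.json` for each PDF and returns the three lists for logging or alerting.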

## Processing documents from URLs

If your invoices are hosted at public URLs, send the schema and a `url` field as JSON instead of uploading a file:

```python

def extract_from_url(url: str) -> dict:
    """Extract invoice fields from a document at a public URL."""
    response = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "url": url,
            "schema": INVOICE_SCHEMA,
            "parseConfig": {"mode": "understand"},
        },
    )
    response.raise_for_status()
    return response.json()

```

## Production considerations

Review these points before you use the pipeline in production:

- **Review routing** — Check the citation `match` label first. `fuzzy_match` and `not_found` are the clearest review signals. Then check `confidence`. The score is relative and uncalibrated, so calibrate the threshold against a labeled sample. Refer to the [citations and confidence](https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence.md) guide.

- **Parse mode and cost** — An extract request bills a parse component, set by `parseConfig.mode`, plus an extract component per page. Use `structure` mode for clean layouts, or use `agentic` mode for degraded scans. Refer to the [pricing](https://www.nutrient.io/guides/dws-data-extraction/pricing.md) guide and the [parse configuration](https://www.nutrient.io/guides/dws-data-extraction/extract/parse-configuration.md) guide.

- **Schema limits** — Keep schemas within the documented size limits: 32 KB, 500 fields, and five levels of nesting. Refer to the [define a schema](https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema.md) guide.

- **Multilingual invoices** — Set `parseConfig.options.language` for non-English documents. Refer to the [supported languages](https://www.nutrient.io/guides/dws-data-extraction/supported-languages.md) guide.
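
Calibrating the review threshold against a labeled sample, as suggested under review routing, can be sketched as follows. The helper is hypothetical: `pick_threshold` assumes you have collected `(confidence, was_correct)` pairs by manually checking extracted fields, and it returns the lowest threshold at which auto-accepted fields (those with `confidence >= threshold`) still meet a target precision.

```python
from typing import Optional


def pick_threshold(
    samples: list[tuple[float, bool]], target_precision: float = 0.99
) -> Optional[float]:
    """Return the lowest threshold meeting the target precision, or None."""
    best = None
    # Try each observed confidence as a candidate threshold, highest first
    for threshold in sorted({conf for conf, _ in samples}, reverse=True):
        accepted = [ok for conf, ok in samples if conf >= threshold]
        precision = sum(accepted) / len(accepted)
        if precision >= target_precision:
            best = threshold  # keep lowering while the target still holds
    return best


# Hypothetical labeled sample: (confidence score, value was correct)
labeled = [(0.95, True), (0.9, True), (0.8, False), (0.6, True), (0.4, False)]
print(pick_threshold(labeled, target_precision=1.0))  # 0.9
```

A lower threshold sends fewer fields to review but accepts more errors, so the target precision is a business decision. Because the score is uncalibrated, repeat the calibration when the document mix or parse mode changes.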

## Next steps

Use these guides to continue building your extraction workflow:

- [Define a schema](https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema.md) — Review supported keywords, field types, and limits.

- [Citations and confidence](https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence.md) — Ground extracted fields for review workflows.

- [Build a document extraction pipeline](https://www.nutrient.io/guides/dws-data-extraction/examples/build-document-extraction-pipeline.md) — Use the parse endpoint and spatial elements.

---

## Related pages

- [Examples](/guides/dws-data-extraction/examples.md)
- [Build a RAG ingestion pipeline](/guides/dws-data-extraction/examples/build-rag-ingestion-pipeline.md)
- [Build a document extraction pipeline](/guides/dws-data-extraction/examples/build-document-extraction-pipeline.md)

