Extract invoice data with a schema
This tutorial shows how to build a Python pipeline that extracts invoice fields into a typed JSON object with the Nutrient DWS Data Extraction API extract endpoint.
Instead of returning the document’s full structure, the extract endpoint maps a document to a JSON Schema that you define. The response includes the fields you request and citations that ground each value to its source in the document.
What you’ll build
You’ll create a Python script that performs the following tasks:
- Defines a JSON Schema for the invoice fields you need.
- Sends a PDF invoice to the extract endpoint.
- Receives a typed JSON object that matches the schema.
- Reviews per-field citations and flags low-confidence fields.
Extract vs. parse
The document extraction pipeline guide uses the parse endpoint and reconstructs fields from spatial elements, such as tables and key-value regions. This tutorial uses the extract endpoint, which maps the document to your schema.
Use extract when you know which fields you want. Use parse when you need the full document structure.
Prerequisites
Before you start, make sure you have the following items:
- Python 3.10 or later.
- A Nutrient DWS account and Data Extraction API key. Sign up in the Nutrient dashboard.
Project setup
Create a project directory, set up a Python virtual environment, and install the required packages:

    mkdir invoice-extract && cd invoice-extract
    python -m venv .venv && source .venv/bin/activate
    pip install requests python-dotenv

Create a .env file with your API key:

    NUTRIENT_API_KEY=your_data_extraction_api_key_here

Step 1 — Define the schema
Describe the fields you want to extract. The root schema must be an object, and each field’s description should tell the extraction model what to find. Save the schema in schema.py so the script in Step 2 can import it. For supported keywords and limits, refer to the define a schema guide:

    INVOICE_SCHEMA = {
        "type": "object",
        "properties": {
            "invoice_number": {
                "type": "string",
                "description": "Invoice identifier, usually labeled 'Invoice #' or 'Invoice No.'",
            },
            "issue_date": {
                "type": "string",
                "format": "date",
                "description": "Date the invoice was issued",
            },
            "vendor_name": {
                "type": "string",
                "description": "Name of the company issuing the invoice",
            },
            "currency": {
                "type": "string",
                "description": "ISO 4217 currency code, e.g. USD or EUR",
            },
            "total_amount": {
                "type": "number",
                "description": "Final total after discounts and tax",
            },
            "line_items": {
                "type": "array",
                "description": "One entry per row in the line-item table",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                    },
                    "required": ["description", "quantity", "unit_price"],
                },
            },
        },
        "required": ["invoice_number", "total_amount"],
    }

Step 2 — Call the extract endpoint
Send the document and JSON-serialized instructions to the extract endpoint. The instructions include the schema and can include parseConfig and document-level guidance. Save this code in extract.py, the script that Steps 4 and 5 extend:

    import json
    import os
    import sys

    import requests
    from dotenv import load_dotenv

    from schema import INVOICE_SCHEMA

    load_dotenv()

    API_KEY = os.environ["NUTRIENT_API_KEY"]
    ENDPOINT = "https://api.nutrient.io/extraction/extract"


    def extract_invoice(pdf_path: str) -> dict:
        """Extract invoice fields into a typed JSON object."""
        instructions = {
            "schema": INVOICE_SCHEMA,
            "instructions": "Treat shipping and handling as separate line items.",
            "parseConfig": {"mode": "understand"},
        }
        with open(pdf_path, "rb") as f:
            response = requests.post(
                ENDPOINT,
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": f},
                data={"instructions": json.dumps(instructions)},
            )
        return response.json()

Step 3 — Read the extracted data
A successful response returns the typed values in output.data:

    {
      "status": 200,
      "requestId": "req_x1y2z3w4",
      "output": {
        "data": {
          "invoice_number": "INV-2024-0042",
          "issue_date": "2024-03-15",
          "vendor_name": "Acme Industrial Supplies Ltd.",
          "currency": "EUR",
          "total_amount": 1547.5,
          "line_items": [
            { "description": "Widget A", "quantity": 10, "unit_price": 25.0 },
            { "description": "Widget B", "quantity": 5, "unit_price": 99.5 }
          ]
        },
        "metadata": { },
        "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
      },
      "metrics": { "processingTimeMs": 4800, "pagesProcessed": 1 },
      "usage": {
        "data_extraction_credits": { "cost": 27, "remainingCredits": 832 }
      }
    }

Because the schema is closed, output.data contains only the fields you declared.
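Before passing output.data downstream, it can help to confirm locally that it has the shape you expect. The following is a minimal sketch, not part of the Nutrient API: check_against_schema and JSON_TYPES are hypothetical names, the check covers only top-level fields, and it assumes that a field the model could not find may come back as null.

```python
# Map JSON Schema type names to the Python types that json.loads() produces.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_against_schema(data: dict, schema: dict) -> list[str]:
    """Return a list of problems found in the top-level fields of data."""
    problems = []
    properties = schema.get("properties", {})

    # Every field listed under "required" should be present.
    for name in schema.get("required", []):
        if name not in data:
            problems.append(f"missing required field: {name}")

    for name, value in data.items():
        if name not in properties:
            problems.append(f"undeclared field: {name}")
            continue
        if value is None:
            continue  # assumption: null means "not extracted", not a type error
        type_name = properties[name].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # no type declared, or a type this sketch does not handle
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            problems.append(
                f"wrong type for {name}: expected {type_name}, "
                f"got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    # A trimmed-down stand-in for INVOICE_SCHEMA, used only for this demo.
    demo_schema = {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        },
        "required": ["invoice_number", "total_amount"],
    }
    print(check_against_schema({"total_amount": "1547.5"}, demo_schema))
```

In the tutorial script, the same function could be called as check_against_schema(result["output"]["data"], INVOICE_SCHEMA). For nested fields or keywords such as format, a full JSON Schema validator is a better fit than this sketch.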
Step 4 — Flag fields for review with citations
By default, the API returns citations in output.metadata. The metadata mirrors the shape of output.data. Use the match label and confidence score to route uncertain fields to human review. For the full metadata reference, refer to the citations and confidence guide:

    def fields_needing_review(metadata: dict, threshold: float = 0.7) -> list[str]:
        """Return scalar field names whose citation suggests manual review."""
        flagged = []
        for field, citation in metadata.items():
            if not isinstance(citation, dict):
                continue  # nested object or array
            if citation.get("match") in ("fuzzy_match", "not_found"):
                flagged.append(field)
                continue
            confidence = citation.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append(field)
        return flagged

Step 5 — Put it all together
Combine the extraction call, response validation, and review routing in one function:

    def process_invoice(pdf_path: str) -> dict:
        """Extract an invoice and report any fields needing review."""
        result = extract_invoice(pdf_path)

        # Use .get() so a response without a status key raises RuntimeError, not KeyError.
        if result.get("status") != 200:
            raise RuntimeError(
                f"API error {result.get('status')}: {result.get('errorMessage')}"
            )

        output = result["output"]
        return {
            "data": output["data"],
            "review": fields_needing_review(output["metadata"]),
            "metrics": result["metrics"],
        }


    if __name__ == "__main__":
        pdf = sys.argv[1] if len(sys.argv) > 1 else "invoice.pdf"
        summary = process_invoice(pdf)
        print(json.dumps(summary, indent=2))

Run the script with an invoice PDF:

    python extract.py invoice.pdf

Processing documents from URLs
If your invoices are hosted at public URLs, send the schema and a url field as JSON instead of uploading a file:

    def extract_from_url(url: str) -> dict:
        """Extract invoice fields from a document at a public URL."""
        response = requests.post(
            ENDPOINT,
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "url": url,
                "schema": INVOICE_SCHEMA,
                "parseConfig": {"mode": "understand"},
            },
        )
        response.raise_for_status()
        return response.json()

Production considerations
Review these points before you use the pipeline in production:
- Review routing — Check the citation match label first. fuzzy_match and not_found are the clearest review signals. Then check confidence. The score is relative and uncalibrated, so calibrate the threshold against a labeled sample. Refer to the citations and confidence guide.
- Parse mode and cost — An extract request bills a parse component, set by parseConfig.mode, plus an extract component per page. Use structure mode for clean layouts, or use agentic mode for degraded scans. Refer to the pricing guide and the parse configuration guide.
- Schema limits — Keep schemas within the documented size limits: 32 KB, 500 fields, and five levels of nesting. Refer to the define a schema guide.
- Multilingual invoices — Set parseConfig.options.language for non-English documents. Refer to the supported languages guide.
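The review-routing point above also applies to nested fields such as line_items, which a check limited to top-level scalar fields does not reach. The following is a minimal sketch of a recursive walk, under two assumptions that the citations and confidence guide should confirm: nested metadata mirrors nested data, and a citation is a dict that carries a match or confidence key. The name flag_nested_fields and the path format are hypothetical.

```python
REVIEW_LABELS = ("fuzzy_match", "not_found")


def flag_nested_fields(metadata, threshold: float = 0.7, path: str = "") -> list[str]:
    """Return paths such as 'line_items[0].quantity' for fields needing review."""
    flagged = []
    if isinstance(metadata, list):
        # Arrays: visit each element and record its index in the path.
        for index, item in enumerate(metadata):
            flagged += flag_nested_fields(item, threshold, f"{path}[{index}]")
    elif isinstance(metadata, dict):
        # Assumption: a dict with "match" or "confidence" is a citation (a leaf).
        if "match" in metadata or "confidence" in metadata:
            confidence = metadata.get("confidence")
            low = isinstance(confidence, (int, float)) and confidence < threshold
            if metadata.get("match") in REVIEW_LABELS or low:
                flagged.append(path)
        else:
            # Otherwise treat the dict as a nested object and descend into it.
            for key, value in metadata.items():
                child = f"{path}.{key}" if path else key
                flagged += flag_nested_fields(value, threshold, child)
    return flagged


if __name__ == "__main__":
    # Hypothetical metadata, shaped only to exercise the walk.
    sample = {
        "invoice_number": {"confidence": 0.98},
        "line_items": [{"quantity": {"confidence": 0.4}}],
    }
    print(flag_nested_fields(sample))
```

The returned paths identify the exact row and column to show a reviewer. As with the top-level check, the 0.7 default is only a starting point until you calibrate it against a labeled sample.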
Next steps
Use these guides to continue building your extraction workflow:
- Define a schema — Review supported keywords, field types, and limits.
- Citations and confidence — Ground extracted fields for review workflows.
- Build a document extraction pipeline — Use the parse endpoint and spatial elements.