---
title: "Extracting structured data from documents | Nutrient Python SDK"
canonical_url: "https://www.nutrient.io/guides/python/extraction/extract-structured-data/"
md_url: "https://www.nutrient.io/guides/python/extraction/extract-structured-data.md"
last_updated: "2026-06-09T19:34:32.777Z"
description: "Extract schema-shaped JSON data from documents using Nutrient Python SDK."
---

# Extracting structured data from documents

Most document workflows don't want a wall of recognized text — they want *fields*: the invoice total, the patient's date of birth, every line item as a row. *Structured extraction* turns a document into exactly the JSON you ask for: you supply a JSON Schema describing the fields, and an AI model fills it from the document's recognized content.

This sample shows how to extract schema-shaped data from a document using Nutrient Python SDK. The result reports not just the values but also *where* each value came from — per-field source locations and grounding labels you can use to verify the extraction against the original document.

[Download sample](https://www.nutrient.io/downloads/samples/python/extract-structured-data.zip)

## How Nutrient helps

Nutrient Python SDK runs the full structured extraction workflow behind a single method call. The SDK handles:

- Reading the document with the extraction pipeline selected by [Vision Settings](https://www.nutrient.io/api/python/settings/vision/vision-settings.md#engine) — text, tables, key-value regions, and form fields in reading order

- Sending the recognized content and your JSON Schema to the AI model as a structured-output request

- Retrying automatically when the model's response doesn't conform to the schema

- Grounding each extracted value back to its source location in the document

- Serializing the result to JSON

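The output is shaped to your schema: the same call with the same schema yields the same structure for your downstream code to consume. By default, conformance is best-effort and backed by automatic retries; enabling strict structured output (described later in this guide) turns it into a guarantee.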

## How extraction works

Two inputs shape the result:

- **Schema envelope** (required) — `{"schema": <JSON Schema>}` describing the fields to extract. Each schema property's `description` tells the extractor what belongs there — the better the description, the better the match.

- **Instructions** (optional) — Free-form guidance for the extraction: disambiguation rules, formatting preferences, or domain context — anything you'd tell a colleague doing the extraction by hand.

Extraction requires an AI model. Configure the provider through `ai_processing_settings` on the document's settings — a local OpenAI-compatible server keeps documents on your machine, or point it at a hosted provider. Structured extraction requires the vision data extraction feature in your license.

## Complete implementation

Import the classes used in the sample:

```python

from nutrient_sdk import Document, StructuredExtractionRequest, Vision, NutrientException

```

## Configuring the AI provider

Open the document in a [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers) so resources are cleaned up after processing, then point `ai_processing_settings` at your model server. This example uses a local OpenAI-compatible endpoint, so the document never leaves your machine:

```python

def main():
    try:
        with Document.open("input.pdf") as document:
            ai_processing = document.settings.ai_processing_settings
            ai_processing.provider = "local"
            ai_processing.endpoint = "http://localhost:1234/v1"
            ai_processing.model = "your-model-id"

```

For a hosted provider instead, set `provider` to `"openai"` or `"azure"` along with `api_key` (and `endpoint` for Azure).
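
One way to keep credentials out of source code is to read the provider settings from the environment and copy them onto the settings object. The sketch below is an illustration under assumptions: the helper functions and the environment variable names (`NUTRIENT_AI_PROVIDER`, `NUTRIENT_AI_MODEL`, `NUTRIENT_AI_ENDPOINT`, `NUTRIENT_AI_API_KEY`) are hypothetical and not part of the SDK. Only the attribute names `provider`, `model`, `endpoint`, and `api_key` come from this guide, and setting `model` for hosted providers is assumed rather than confirmed.

```python
import os


def provider_config_from_env(env=None):
    """Build a dict of AI provider settings from environment variables.

    The variable names are hypothetical; pick whatever fits your deployment.
    """
    env = os.environ if env is None else env
    provider = env.get("NUTRIENT_AI_PROVIDER", "local")
    config = {
        "provider": provider,
        "model": env.get("NUTRIENT_AI_MODEL", "your-model-id"),
    }

    if provider == "local":
        # Local OpenAI-compatible server: only an endpoint is needed.
        config["endpoint"] = env.get("NUTRIENT_AI_ENDPOINT", "http://localhost:1234/v1")
        return config

    # Hosted providers need an API key; fail early if it is missing.
    api_key = env.get("NUTRIENT_AI_API_KEY")
    if not api_key:
        raise ValueError(f"NUTRIENT_AI_API_KEY must be set for provider {provider!r}")
    config["api_key"] = api_key

    if provider == "azure":
        # Azure additionally needs an endpoint.
        endpoint = env.get("NUTRIENT_AI_ENDPOINT")
        if not endpoint:
            raise ValueError("NUTRIENT_AI_ENDPOINT must be set for provider 'azure'")
        config["endpoint"] = endpoint
    return config


def apply_provider_config(ai_processing, config):
    """Copy each setting onto the settings object as an attribute."""
    for name, value in config.items():
        setattr(ai_processing, name, value)
```

Inside the `with Document.open(...)` block, this could be called as `apply_provider_config(document.settings.ai_processing_settings, provider_config_from_env())`.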

## Building the request

Build a `StructuredExtractionRequest` carrying the schema envelope — a JSON object whose `schema` member is the JSON Schema to extract against. Give every property a `description` — that's what the extractor matches against the document:

```python

            request = StructuredExtractionRequest()
            request.schema = """
            {
              "schema": {
                "type": "object",
                "properties": {
                  "documentNumber": {"type": "string", "description": "The document's reference or invoice number"},
                  "issueDate": {"type": "string", "description": "The date the document was issued, as printed"},
                  "totalAmount": {"type": "number", "description": "The final total amount due"}
                }
              }
            }
            """
            request.instructions = "Amounts are plain numbers without currency symbols."

```

The envelope shape exists so extraction inputs can grow without breaking your code — a future `constraints` member (cross-field validation rules) will ride alongside `schema` in the same envelope.
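
If you would rather not maintain the envelope as a hand-written string, it can be assembled from Python data and serialized with the standard `json` module. This is a minimal sketch: `build_schema_envelope` is a hypothetical helper, not an SDK function, and it handles only flat schemas whose properties have a `type` and a `description`.

```python
import json


def build_schema_envelope(fields):
    """Wrap field definitions in the {"schema": <JSON Schema>} envelope.

    `fields` maps a property name to a (json_type, description) pair.
    Returns the envelope serialized as a JSON string.
    """
    properties = {}
    for name, (json_type, description) in fields.items():
        # Descriptions drive the match, so refuse to build a schema without them.
        if not description.strip():
            raise ValueError(f"Property {name!r} needs a description")
        properties[name] = {"type": json_type, "description": description}

    envelope = {"schema": {"type": "object", "properties": properties}}
    return json.dumps(envelope, indent=2)


schema_json = build_schema_envelope({
    "documentNumber": ("string", "The document's reference or invoice number"),
    "issueDate": ("string", "The date the document was issued, as printed"),
    "totalAmount": ("number", "The final total amount due"),
})
print(schema_json)
```

The returned string can be assigned to `request.schema` in place of the literal shown above. Building it from data makes it harder to ship a property without a description.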

## Confidence reporting

For per-field confidence signals, enable them on the settings before extracting. Each metadata entry then also carries the individual confidence components for the field — combined with the `match` grounding labels (refer to the output section below), this gives your pipeline a per-field basis for automatic acceptance versus human review:

```python

            ai_processing.include_confidence = True

```

## Strict structured output

By default, extraction runs in best-effort mode: the model is instructed to follow your schema, and the SDK retries when the response doesn't conform. For a hard guarantee instead, enable strict structured output — the model's response is grammar-constrained to the schema, so the result always matches it exactly:

```python

            ai_processing.strict_structured_output = True

```

Two things to know before enabling it:

- **You don't change your schema.** Strict mode has formal requirements (every object closed, every property accounted for), and the SDK normalizes your schema to satisfy them automatically.

- **Absent fields come back as `null`** (`None` once you parse the JSON result). In strict mode the model must emit every schema property, so a field the document doesn't contain is returned as `null` instead of being omitted — your downstream code can rely on every key being present. Fields you list in the schema's `required` array keep their declared type untouched — they can only be `null` if your own schema allows it.

Strict mode requires a model and endpoint that support grammar-constrained structured outputs (hosted providers do; check your local server's documentation).
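
Because strict mode returns absent fields as `null`, downstream code has to tell "present" from "absent" by value rather than by key. The sketch below shows one way to do that with the standard library. `split_present_and_absent` is a hypothetical helper, the sample result is made up for illustration (with the `metadata` node left out), and only top-level fields are inspected.

```python
import json


def split_present_and_absent(result_json):
    """Return (present, absent) from the `extraction` node of a result.

    present: dict of fields that have a non-null value
    absent:  sorted list of field names whose value is null
    """
    extraction = json.loads(result_json)["extraction"]
    present = {name: value for name, value in extraction.items() if value is not None}
    absent = sorted(name for name, value in extraction.items() if value is None)
    return present, absent


# Made-up result for illustration only.
sample = '{"extraction": {"documentNumber": "INV-1", "issueDate": null, "totalAmount": 12.5}}'
present, absent = split_present_and_absent(sample)
print(absent)  # ['issueDate']
```

Note that a value of `0` or an empty string counts as present here. Only `null` is treated as "the document doesn't contain this field".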

## Extracting the data

Create a vision instance bound to the document with `Vision.set(document)`, then call `extract_structured(request)`:

```python

            vision = Vision.set(document)
            result_json = vision.extract_structured(request)

            with open("output.json", "w", encoding="utf-8") as f:
                f.write(result_json)
    except NutrientException as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

```

To write the result directly to a file instead, call `extract_structured_to_file(request, "output.json")`.

## Understanding the output

`extract_structured(request)` returns JSON with two top-level nodes:

- **`extraction`** — The extracted fields, shaped exactly to your schema.

- **`metadata`** — One entry per extracted field, carrying where the value came from: a `match` grounding label and source location info (page and bounding box) so you can highlight the source in a viewer or route low-trust fields to human review.

The `match` label tells you how confidently the value was traced back to the document: an exact source match, a partial or multi-block match, a fuzzy match, or `not_found` when the value couldn't be located in the recognized content — the strongest signal that a field deserves review.

Source grounding is on by default. When only the extracted values matter, turn it off with `ai_processing.include_source_locations = False` to cut model token usage — grounding asks the model to also return per-field source references, which roughly doubles the schema sent with each request.
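
As a rough illustration of routing fields to human review, the sketch below collects every field whose `match` label is `not_found`. It rests on assumptions this guide doesn't confirm: that `metadata` is a JSON object keyed by field name, and that each entry stores its label under a `match` key. The `fields_needing_review` helper, the sample data, and the `"exact"` label string are placeholders, so check an actual result and adjust the lookups before relying on it.

```python
import json


def fields_needing_review(result_json, review_labels=("not_found",)):
    """List extracted fields that should be checked by a person.

    ASSUMPTION: metadata is an object keyed by field name, and each entry
    has a "match" key. Verify this against a real result from the SDK.
    """
    result = json.loads(result_json)
    metadata = result.get("metadata", {})
    review = []
    for field in result["extraction"]:
        entry = metadata.get(field)
        # A field with no metadata entry can't be verified, so review it too.
        if entry is None or entry.get("match") in review_labels:
            review.append(field)
    return review


# Made-up result; "exact" is a placeholder label, "not_found" is from this guide.
sample = json.dumps({
    "extraction": {"documentNumber": "INV-1", "totalAmount": 12.5},
    "metadata": {
        "documentNumber": {"match": "exact"},
        "totalAmount": {"match": "not_found"},
    },
})
print(fields_needing_review(sample))  # ['totalAmount']
```

With source grounding turned off, this sketch would flag every field, because no metadata entry would exist to vouch for it. Pass a longer `review_labels` tuple if weaker matches should also go to review.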

## Error handling

The Vision API raises `VisionException` (a kind of `NutrientException`) when extraction fails, so catching `NutrientException`, as the sample does, also covers it.

Common failure scenarios include:

- The document can't be read due to path or permission issues

- The JSON Schema is missing or malformed (validated before any model call)

- The model endpoint is unreachable, or the feature isn't licensed

In production code:

- Catch `NutrientException`.

- Return a clear error message.

- Log failure details for debugging.
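
The three recommendations above can be combined in a small wrapper. The following is a generic sketch rather than SDK code: `run_extraction` is a hypothetical helper that takes the extraction call as a function and the exception types to catch as a parameter, which keeps it runnable without the SDK installed.

```python
import logging

logger = logging.getLogger("extraction")


def run_extraction(extract, error_types=(Exception,)):
    """Call `extract()` and return (result, error_message).

    Exactly one of the two returned values is None.
    """
    try:
        return extract(), None
    except error_types as exc:
        # logger.exception records the message plus the full traceback.
        logger.exception("Structured extraction failed")
        return None, f"Extraction failed: {exc}"
```

With the SDK, the call might look like `run_extraction(lambda: vision.extract_structured(request), (NutrientException,))`. Passing the exception types explicitly means unrelated bugs still surface as ordinary exceptions instead of being reported as extraction failures.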

## Conclusion

The workflow for structured data extraction is:

1. Open the source document using a [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers) for automatic resource cleanup.

2. Configure the AI provider through `ai_processing_settings` — local for privacy, or a hosted provider.

3. Build a `StructuredExtractionRequest` with a schema envelope — `{"schema": <JSON Schema>}`, every property carrying a `description` — and optional instructions.

4. Create a vision instance with `Vision.set()`.

5. Call `extract_structured(request)` and consume the `extraction` node; use `metadata` to verify or review.

6. Handle `NutrientException` for robust error recovery.

For related extraction workflows, refer to the [Python SDK guides](https://www.nutrient.io/guides/python.md).

Download [this ready-to-use sample package](https://www.nutrient.io/downloads/samples/python/extract-structured-data.zip) to explore structured data extraction.

---

## Related pages

- [Speeding up first ICR operation by predownloading models](/guides/python/extraction/speed-up-first-icr-by-downloading-requirements.md)
- [Extracting text from PDF documents](/guides/python/extraction/pdf-to-text.md)
- [Extracting text from multilingual images](/guides/python/extraction/read-text-from-image-multi-language.md)
- [Generating image descriptions using Claude](/guides/python/extraction/describe-image-with-claude.md)
- [Extracting data from images using vision language models](/guides/python/extraction/extract-data-from-image-vlm.md)
- [Generating image descriptions using OpenAI](/guides/python/extraction/describe-image-with-openai.md)
- [Extracting text from images](/guides/python/extraction/read-text-from-image.md)
- [Generating image descriptions using local AI](/guides/python/extraction/describe-image-with-local-ai.md)
- [Nutrient Python SDK extraction guides](/guides/python/extraction.md)
- [Applying OCR to a PDF document](/guides/python/extraction/apply-ocr-to-pdf.md)
- [Extracting form fields from images](/guides/python/extraction/extract-form-fields-from-image.md)
- [Extracting data from images using OCR](/guides/python/extraction/extract-data-from-image-ocr.md)
- [Applying OCR to a PDF page](/guides/python/extraction/apply-ocr-to-pdf-page.md)
- [Labeling form fields with a vision language model](/guides/python/extraction/label-form-fields-with-vlm.md)
- [Extracting structured JSON data from PDF documents](/guides/python/extraction/json-data-extraction.md)
- [Extracting data from images using ICR](/guides/python/extraction/extract-data-from-image-icr.md)

