Extracting structured data from documents

Most document workflows don’t want a wall of recognized text — they want fields: the invoice total, the patient’s date of birth, every line item as a row. Structured extraction turns a document into exactly the JSON you ask for: you supply a JSON Schema describing the fields, and an AI model fills it from the document’s recognized content.

This sample shows how to extract schema-shaped data from a document using Nutrient Python SDK. The result reports not just the values but also where each value came from — per-field source locations and grounding labels you can use to verify the extraction against the original document.

How Nutrient helps

Nutrient Python SDK runs the full structured extraction workflow behind a single method call. The SDK handles:

Reading the document with the extraction pipeline selected by Vision Settings — text, tables, key-value regions, and form fields in reading order
Sending the recognized content and your JSON Schema to the AI model as a structured-output request
Retrying automatically when the model’s response doesn’t conform to the schema
Grounding each extracted value back to its source location in the document
Serializing the result to JSON

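The output is shaped to your schema: the same call with the same schema yields the same shape, ready for your downstream code to consume. By default that conformance is best effort with automatic retries; strict structured output, covered later in this guide, turns it into a hard guarantee.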

How extraction works

Two inputs shape the result:

Schema envelope (required) — {"schema": <JSON Schema>} describing the fields to extract. Each schema property’s description tells the extractor what belongs there — the better the description, the better the match.
Instructions (optional) — Free-form guidance for the extraction: disambiguation rules, formatting preferences, or domain context — anything you’d tell a colleague doing the extraction by hand.

Extraction requires an AI model. Configure the provider through ai_processing_settings on the document’s settings: use a local OpenAI-compatible server to keep documents on your machine, or point it at a hosted provider. Structured extraction also requires the vision data extraction feature in your license.

Complete implementation

Import the classes used in the sample:

from nutrient_sdk import Document, StructuredExtractionRequest, Vision, NutrientException

Configuring the AI provider

Open the document in a context manager so resources are cleaned up after processing, then point ai_processing_settings at your model server. This example uses a local OpenAI-compatible endpoint, so the document never leaves your machine:

def main():
    try:
        with Document.open("input.pdf") as document:
            ai_processing = document.settings.ai_processing_settings
            ai_processing.provider = "local"
            ai_processing.endpoint = "http://localhost:1234/v1"
            ai_processing.model = "your-model-id"

For a hosted provider instead, set provider to "openai" or "azure" along with api_key (and endpoint for Azure).
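
As a minimal sketch of the hosted setup, the helper below applies those settings and reads the key from an environment variable rather than hard-coding it. The helper name configure_hosted and the variable name NUTRIENT_AI_API_KEY are hypothetical; only the provider, api_key, and endpoint attributes come from the text above.

```python
import os


def configure_hosted(ai_processing, provider, endpoint=None,
                     env_var="NUTRIENT_AI_API_KEY"):
    """Apply hosted-provider settings to an ai_processing_settings object.

    configure_hosted and the NUTRIENT_AI_API_KEY variable name are
    illustrative assumptions, not part of the SDK.
    """
    if provider not in ("openai", "azure"):
        raise ValueError(f"Unsupported hosted provider: {provider!r}")
    if provider == "azure" and not endpoint:
        raise ValueError("Azure needs an endpoint as well as an API key")

    # Read the key from the environment so it never lands in source control.
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")

    ai_processing.provider = provider
    ai_processing.api_key = api_key
    if endpoint:
        ai_processing.endpoint = endpoint
    return ai_processing
```

Inside the with block, this could be called as configure_hosted(document.settings.ai_processing_settings, "openai").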

Building the request

Build a StructuredExtractionRequest carrying the schema envelope — a JSON object whose schema member is the JSON Schema to extract against. Give every property a description — that’s what the extractor matches against the document:

            request = StructuredExtractionRequest()
            request.schema = """
            {
              "schema": {
                "type": "object",
                "properties": {
                  "documentNumber": {"type": "string", "description": "The document's reference or invoice number"},
                  "issueDate": {"type": "string", "description": "The date the document was issued, as printed"},
                  "totalAmount": {"type": "number", "description": "The final total amount due"}
                }
              }
            }
            """
            request.instructions = "Amounts are plain numbers without currency symbols."

The envelope shape exists so extraction inputs can grow without breaking your code — a future constraints member (cross-field validation rules) will ride alongside schema in the same envelope.
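
If you would rather not maintain the schema as a hand-written string, one option is to build the envelope as a Python dictionary and serialize it with json.dumps. This is a sketch, not an SDK feature: build_envelope is a hypothetical helper, and the only thing it assumes about the SDK is what the sample above shows, namely that request.schema takes the envelope as a JSON string.

```python
import json


def build_envelope(properties, required=()):
    """Return a {"schema": <JSON Schema>} envelope as a JSON string.

    build_envelope is a hypothetical helper. `properties` maps each field
    name to a (json_type, description) pair.
    """
    schema = {"type": "object", "properties": {}}
    for name, (json_type, description) in properties.items():
        if not description:
            # Descriptions are what the extractor matches on, so insist on one.
            raise ValueError(f"Property {name!r} needs a description")
        schema["properties"][name] = {
            "type": json_type,
            "description": description,
        }
    if required:
        schema["required"] = list(required)
    return json.dumps({"schema": schema}, indent=2)


envelope = build_envelope({
    "documentNumber": ("string", "The document's reference or invoice number"),
    "issueDate": ("string", "The date the document was issued, as printed"),
    "totalAmount": ("number", "The final total amount due"),
})
```

The resulting string can then be assigned with request.schema = envelope. Serializing from a dictionary rules out syntax slips such as trailing commas or unescaped quotes in descriptions.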

Confidence reporting

For per-field confidence signals, enable them on the settings before extracting. Each metadata entry then also carries the individual confidence components for the field — combined with the match grounding labels (refer to the output section below), this gives your pipeline a per-field basis for automatic acceptance versus human review:

            ai_processing.include_confidence = True

Strict structured output

By default, extraction runs in best-effort mode: the model is instructed to follow your schema, and the SDK retries when the response doesn’t conform. For a hard guarantee instead, enable strict structured output — the model’s response is grammar-constrained to the schema, so the result always matches it exactly:

            ai_processing.strict_structured_output = True

Two things to know before enabling it:

You don’t change your schema. Strict mode has formal requirements (every object closed, every property accounted for), and the SDK normalizes your schema to satisfy them automatically.
Absent fields come back as null (None once you parse the JSON result). In strict mode the model must emit every schema property, so a field the document doesn’t contain is returned as null instead of being omitted — your downstream code can rely on every key being present. Fields you list in the schema’s required array keep their declared type untouched — they can only be null if your own schema allows it.

Strict mode requires a model and endpoint that support grammar-constrained structured outputs (hosted providers do; check your local server’s documentation).
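
Because strict mode returns every key, a value the document lacks shows up as None after parsing rather than as a missing key. The sketch below lists which fields came back empty. The sample JSON is made up for illustration, and missing_fields is a hypothetical helper, not part of the SDK; it assumes the extracted fields sit under the extraction node described in the output section.

```python
import json

# Hand-written stand-in for a strict-mode result; real output would come
# from extract_structured(request).
sample_result = """
{
  "extraction": {
    "documentNumber": "INV-0042",
    "issueDate": null,
    "totalAmount": 1250.5
  }
}
"""


def missing_fields(result_json):
    """Return the names of top-level extracted fields whose value is null."""
    extraction = json.loads(result_json)["extraction"]
    return [name for name, value in extraction.items() if value is None]


empty = missing_fields(sample_result)
```

A check like this only looks at top-level fields; nested objects or arrays in your schema would need a recursive walk.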

Extracting the data

Create a vision instance bound to the document with Vision.set(document), then call extract_structured(request):

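            # Bind a Vision instance to the open document and run the extraction.
            vision = Vision.set(document)
            result_json = vision.extract_structured(request)

            # The result is a JSON string; write it out as UTF-8.
            with open("output.json", "w", encoding="utf-8") as f:
                f.write(result_json)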
    except NutrientException as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()

To write the result directly to a file instead, call extract_structured_to_file(request, "output.json").

Understanding the output

extract_structured(request) returns JSON with two top-level nodes:

extraction — The extracted fields, shaped exactly to your schema.
metadata — One entry per extracted field, carrying where the value came from: a match grounding label and source location info (page and bounding box) so you can highlight the source in a viewer or route low-trust fields to human review.

The match label tells you how confidently the value was traced back to the document: an exact source match, a partial or multi-block match, a fuzzy match, or not_found when the value couldn’t be located in the recognized content — the strongest signal that a field deserves review.
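
To show how the label might drive a review queue, here is a minimal sketch that splits fields into accepted and needs-review groups. The exact key names inside metadata are not spelled out in this guide, so the code assumes metadata is keyed by field name and stores the label under a match key, with exact as a stand-in label value; only not_found is named in the text above. Inspect your real output and adjust the names. split_for_review is a hypothetical helper.

```python
import json

# Labels treated as trustworthy enough to accept automatically.
# "exact" is an assumed label name; only "not_found" appears in the guide.
AUTO_ACCEPT = frozenset({"exact"})


def split_for_review(result_json, auto_accept=AUTO_ACCEPT):
    """Split extracted fields into (accepted, needs_review) dictionaries.

    Assumes metadata maps field name -> {"match": <label>, ...}.
    A field with no metadata entry is treated as not_found.
    """
    result = json.loads(result_json)
    metadata = result.get("metadata", {})
    accepted, needs_review = {}, {}
    for name, value in result["extraction"].items():
        label = metadata.get(name, {}).get("match", "not_found")
        target = accepted if label in auto_accept else needs_review
        target[name] = {"value": value, "match": label}
    return accepted, needs_review


# Made-up result for illustration only.
sample = json.dumps({
    "extraction": {"documentNumber": "INV-0042", "totalAmount": 1250.5},
    "metadata": {
        "documentNumber": {"match": "exact"},
        "totalAmount": {"match": "not_found"},
    },
})
accepted, needs_review = split_for_review(sample)
```

Defaulting an absent entry to not_found is a deliberately cautious choice: anything the code cannot trace goes to a person rather than straight through.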

Source grounding is on by default. When only the extracted values matter, turn it off with ai_processing.include_source_locations = False to cut model token usage: grounding asks the model to also return per-field source references, which roughly doubles the size of the schema sent with each request.

Error handling

The Vision API raises VisionException (a kind of NutrientException) when extraction fails.

Common failure scenarios include:

The document can’t be read due to path or permission issues
The JSON Schema is missing or malformed (validated before any model call)
The model endpoint is unreachable, or the feature isn’t licensed

In production code:

Catch NutrientException.
Return a clear error message.
Log failure details for debugging.
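
One way to apply those three points is a thin wrapper around the extraction call. The sketch below is SDK-agnostic so that it runs on its own: safe_extract is a hypothetical helper, and in real code you would pass NutrientException as sdk_error along with a function that performs the extraction.

```python
import logging

logger = logging.getLogger("structured_extraction")


def safe_extract(extract, sdk_error=Exception):
    """Run extract() and return a (result, error_message) pair.

    On success the message is None; on failure the result is None.
    safe_extract is a hypothetical helper; pass the SDK's exception
    class as sdk_error to catch only SDK failures.
    """
    try:
        return extract(), None
    except sdk_error as exc:
        # The full traceback goes to the log; the caller gets a short message.
        logger.exception("Structured extraction failed")
        return None, f"Structured extraction failed: {exc}"
```

Returning a pair keeps the calling code free of try blocks, while logger.exception preserves the traceback for debugging.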

Conclusion

The workflow for structured data extraction is:

Open the source document using a context manager for automatic resource cleanup.
Configure the AI provider through ai_processing_settings — local for privacy, or a hosted provider.
Build a StructuredExtractionRequest with a schema envelope — {"schema": <JSON Schema>}, every property carrying a description — and optional instructions.
Create a vision instance with Vision.set().
Call extract_structured(request) and consume the extraction node; use metadata to verify or review.
Handle NutrientException for robust error recovery.

For related extraction workflows, refer to the Python SDK guides.

Download this ready-to-use sample package to explore structured data extraction.