Extracting structured data from documents

Most document workflows don’t want a wall of recognized text — they want fields: the invoice total, the patient’s date of birth, every line item as a row. Structured extraction turns a document into exactly the JSON you ask for: you supply a JSON Schema describing the fields, and an AI model fills it from the document’s recognized content.

This sample shows how to extract schema-shaped data from a document using Nutrient Java SDK. The result reports not just the values but also where each value came from — per-field source locations and grounding labels you can use to verify the extraction against the original document.

How Nutrient helps

Nutrient Java SDK runs the full structured extraction workflow behind a single method call. The SDK handles:

Reading the document with the extraction pipeline selected by Vision Settings — text, tables, key-value regions, and form fields in reading order
Sending the recognized content and your JSON Schema to the AI model as a structured-output request
Retrying automatically when the model’s response doesn’t conform to the schema
Grounding each extracted value back to its source location in the document
Serializing the result to JSON

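The output is shaped to your schema: the same call with the same schema yields the same structure for your downstream code to consume. In the default best-effort mode, the SDK retries when the model’s response doesn’t conform; strict structured output (covered later in this guide) turns that into a hard guarantee.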

How extraction works

Two inputs shape the result:

Schema envelope (required) — {"schema": <JSON Schema>} describing the fields to extract. Each schema property’s description tells the extractor what belongs there — the better the description, the better the match.
Instructions (optional) — Free-form guidance for the extraction: disambiguation rules, formatting preferences, or domain context — anything you’d tell a colleague doing the extraction by hand.

Extraction requires an AI model. Configure the provider through AiProcessingSettings on the document’s settings: use a local OpenAI-compatible server to keep documents on your machine, or point it at a hosted provider. Structured extraction also requires the vision data extraction feature in your license.

Complete implementation

Specify a package name and create a new class:

package io.nutrient.Sample;

Import the classes used in the sample:

import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.requests.StructuredExtractionRequest;
import io.nutrient.sdk.exceptions.NutrientException;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

Configuring the AI provider

Open the document in a try-with-resources block so resources are cleaned up after processing, then point AiProcessingSettings at your model server. This example uses a local OpenAI-compatible endpoint, so the document never leaves your machine:

public class ExtractStructuredData {
    public static void main(String[] args) {
        try (Document document = Document.open("input.pdf")) {

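            // Point the AI settings at a local OpenAI-compatible server.
            var aiProcessing = document.getSettings().getAiProcessingSettings();
            aiProcessing.setProvider("local");
            aiProcessing.setEndpoint("http://localhost:1234/v1");
            // Replace with the ID of a model your server actually serves.
            aiProcessing.setModel("your-model-id");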

For a hosted provider instead, set provider to "openai" or "azure" along with apiKey (and endpoint for Azure).

Building the request

Build a StructuredExtractionRequest carrying the schema envelope — a JSON object whose schema member is the JSON Schema to extract against. Give every property a description — that’s what the extractor matches against the document:

            StructuredExtractionRequest request = new StructuredExtractionRequest();
            request.setSchema(
                "{" +
                "  \"schema\": {" +
                "    \"type\": \"object\"," +
                "    \"properties\": {" +
                "      \"documentNumber\": {\"type\": \"string\", \"description\": \"The document's reference or invoice number\"}," +
                "      \"issueDate\": {\"type\": \"string\", \"description\": \"The date the document was issued, as printed\"}," +
                "      \"totalAmount\": {\"type\": \"number\", \"description\": \"The final total amount due\"}" +
                "    }" +
                "  }" +
                "}");
            request.setInstructions("Amounts are plain numbers without currency symbols.");

The envelope shape exists so extraction inputs can grow without breaking your code — a future constraints member (cross-field validation rules) will ride alongside schema in the same envelope.

If you need help drafting the JSON Schema from example documents, refer to the generate extraction schema guide.
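
The concatenated string in the request above is easy to break when the schema grows. As a standalone sketch, assuming Java 15 or later, equivalent JSON can be written as a text block. The class and method names here (SchemaEnvelope, invoiceEnvelope) are illustrative and not part of the SDK; the returned string is what you would pass to request.setSchema(...).

```java
// Standalone sketch: SchemaEnvelope and invoiceEnvelope() are illustrative
// names, not part of the Nutrient SDK. Requires Java 15+ for text blocks.
final class SchemaEnvelope {

    // Returns the schema envelope from this guide, written as a text block
    // so the JSON needs no escaping or string concatenation.
    static String invoiceEnvelope() {
        return """
            {
              "schema": {
                "type": "object",
                "properties": {
                  "documentNumber": {"type": "string", "description": "The document's reference or invoice number"},
                  "issueDate": {"type": "string", "description": "The date the document was issued, as printed"},
                  "totalAmount": {"type": "number", "description": "The final total amount due"}
                }
              }
            }
            """;
    }

    public static void main(String[] args) {
        // Print the envelope to check it by eye before using it in a request.
        System.out.println(invoiceEnvelope());
    }
}
```

Only whitespace differs from the concatenated version, so the envelope describes the same three fields.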

Confidence reporting

For per-field confidence signals, enable them on the settings before extracting. Each metadata entry then also carries the individual confidence components for the field. Combined with the match grounding labels (described under “Understanding the output”), this gives your pipeline a per-field basis for choosing between automatic acceptance and human review:

            aiProcessing.setIncludeConfidence(true);

Strict structured output

By default, extraction runs in best-effort mode: the model is instructed to follow your schema, and the SDK retries when the response doesn’t conform. For a hard guarantee instead, enable strict structured output — the model’s response is grammar-constrained to the schema, so the result always matches it exactly:

            aiProcessing.setStrictStructuredOutput(true);

Two things to know before enabling it:

You don’t change your schema. Strict mode has formal requirements (every object closed, every property accounted for), and the SDK normalizes your schema to satisfy them automatically.
Absent fields come back as null. In strict mode the model must emit every schema property, so a field the document doesn’t contain is returned as null instead of being omitted — your downstream code can rely on every key being present. Fields you list in the schema’s required array keep their declared type untouched — they can only be null if your own schema allows it.

Strict mode requires a model and endpoint that support grammar-constrained structured outputs (hosted providers do; check your local server’s documentation).
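
Because a missing value may show up as a null (strict mode) or as an omitted key (best-effort mode), downstream code can be written to treat both the same way. The following is a minimal sketch under the assumption that you have already parsed the extraction node into a Map with a JSON library of your choice; FieldAccess and field are hypothetical names, not SDK classes, and the sample value is made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Standalone sketch: FieldAccess is an illustrative helper, not an SDK class.
// It assumes the "extraction" node was already parsed into a Map.
final class FieldAccess {

    // Treats "key present with null" (strict mode) and "key omitted"
    // (possible in best-effort mode) alike: both yield an empty Optional.
    static Optional<Object> field(Map<String, Object> extraction, String name) {
        return Optional.ofNullable(extraction.get(name));
    }

    public static void main(String[] args) {
        Map<String, Object> strictResult = new HashMap<>();
        strictResult.put("documentNumber", "INV-001"); // made-up sample value
        strictResult.put("issueDate", null);           // field absent from the document

        System.out.println(field(strictResult, "documentNumber").isPresent());
        System.out.println(field(strictResult, "issueDate").isPresent());
    }
}
```

With this shape, switching between best-effort and strict mode does not require changes in the code that reads the fields.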

Extracting the data

Create a vision instance bound to the document with Vision.set(document), then call extractStructured(request):

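            // Bind a Vision instance to the open document and run the extraction.
            Vision vision = Vision.set(document);
            String resultJson = vision.extractStructured(request);

            // Save the returned JSON (extraction and metadata nodes) to disk.
            Files.writeString(Path.of("output.json"), resultJson);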
        } catch (NutrientException | IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}

To write the result directly to a file instead, call extractStructuredToFile(request, "output.json").

Understanding the output

extractStructured(request) returns JSON with two top-level nodes:

extraction — The extracted fields, shaped exactly to your schema.
metadata — One entry per extracted field, carrying where the value came from: a match grounding label and source location info (page and bounding box) so you can highlight the source in a viewer or route low-trust fields to human review.

The match label tells you how confidently the value was traced back to the document: an exact source match, a partial or multi-block match, a fuzzy match, or not_found when the value couldn’t be located in the recognized content — the strongest signal that a field deserves review.
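
One way to act on the match label is an allowlist: accept only the labels you explicitly trust and send everything else to review. This is a minimal sketch, not an SDK feature. ReviewRouter is a hypothetical class, and the string "exact" is a placeholder for whatever your output uses for an exact source match; only not_found is spelled out in this guide, so check the actual label values in your own results.

```java
import java.util.Set;

// Standalone sketch: ReviewRouter is illustrative, not part of the SDK.
final class ReviewRouter {

    // ASSUMPTION: "exact" is a placeholder label name. Only "not_found" is
    // named in this guide; verify the real strings in your metadata output.
    private static final Set<String> AUTO_ACCEPT = Set.of("exact");

    // Returns true when a field should go to human review. Any label that is
    // not explicitly trusted (including null or unknown labels) is reviewed.
    static boolean needsReview(String matchLabel) {
        return matchLabel == null || !AUTO_ACCEPT.contains(matchLabel);
    }

    public static void main(String[] args) {
        System.out.println(needsReview("not_found"));
        System.out.println(needsReview("exact"));
    }
}
```

Using an allowlist rather than a blocklist means a label you did not anticipate defaults to review instead of silent acceptance.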

Source grounding is on by default. When only the extracted values matter, turn it off with aiProcessing.setIncludeSourceLocations(false) to cut model token usage — grounding asks the model to also return per-field source references, which roughly doubles the schema sent with each request.

Error handling

The Vision API throws VisionException (a NutrientException) when extraction fails. Common failure scenarios include an unreadable document, a missing or malformed JSON Schema (validated before any model call), an unreachable model endpoint, and the feature not being licensed. In production code, catch NutrientException, return a clear error message, and log failure details for debugging.
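
The catch, log, and report pattern can be sketched without the SDK on the classpath. In this hedged example, SafeExtraction is a hypothetical wrapper and the Callable stands in for a call such as vision.extractStructured(request). It catches Exception only so that it compiles standalone; in real code you would catch NutrientException (and IOException where relevant) instead.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Standalone sketch: SafeExtraction is illustrative, not part of the SDK.
final class SafeExtraction {

    private static final Logger LOG = Logger.getLogger(SafeExtraction.class.getName());

    // `extraction` stands in for a call like vision.extractStructured(request).
    // Returns the result JSON, or an empty Optional if the call failed.
    static Optional<String> run(Callable<String> extraction) {
        try {
            return Optional.ofNullable(extraction.call());
        } catch (Exception e) {
            // In real code, narrow this to NutrientException.
            // Log the full stack trace for debugging; callers see only "empty".
            LOG.log(Level.SEVERE, "Structured extraction failed", e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> result = run(() -> {
            throw new IllegalStateException("model endpoint unreachable");
        });
        System.out.println(result.isPresent()
                ? "Extraction succeeded"
                : "Extraction failed; see the log for details");
    }
}
```

Returning an Optional keeps the failure visible to the caller while the detailed cause goes to the log rather than to the end user.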

Conclusion

The workflow for structured data extraction is:

Open the source document using try-with-resources for automatic resource cleanup.
Configure the AI provider through AiProcessingSettings — local for privacy, or a hosted provider.
Build a StructuredExtractionRequest with a schema envelope — {"schema": <JSON Schema>}, every property carrying a description — and optional instructions.
Create a vision instance with Vision.set().
Call extractStructured(request) and consume the extraction node; use metadata to verify or review.
Handle NutrientException for robust error recovery.

For related extraction workflows, refer to the Java SDK guides.

Download this ready-to-use sample package to explore structured data extraction.