---
title: "Extracting structured data from documents | Nutrient Java SDK"
canonical_url: "https://www.nutrient.io/guides/java/extraction/extract-structured-data/"
md_url: "https://www.nutrient.io/guides/java/extraction/extract-structured-data.md"
last_updated: "2026-06-09T21:11:56.021Z"
description: "Extract schema-shaped JSON data from documents using Nutrient Java SDK."
---

# Extracting structured data from documents

Most document workflows don't want a wall of recognized text — they want *fields*: the invoice total, the patient's date of birth, every line item as a row. *Structured extraction* turns a document into exactly the JSON you ask for: you supply a JSON Schema describing the fields, and an AI model fills it from the document's recognized content.

This sample shows how to extract schema-shaped data from a document using Nutrient Java SDK. The result reports not just the values but also *where* each value came from — per-field source locations and grounding labels you can use to verify the extraction against the original document.

[Download sample](https://www.nutrient.io/downloads/samples/java/extract-structured-data.zip)

## How Nutrient helps

Nutrient Java SDK runs the full structured extraction workflow behind a single method call. The SDK handles:

- Reading the document with the extraction pipeline selected by [Vision Settings](https://www.nutrient.io/api/java/settings/vision-settings/#engine) — text, tables, key-value regions, and form fields in reading order

- Sending the recognized content and your JSON Schema to the AI model as a structured-output request

- Retrying automatically when the model's response doesn't conform to the schema

- Grounding each extracted value back to its source location in the document

- Serializing the result to JSON

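The extracted data is shaped to your schema: the same call with the same schema yields the same structure, so downstream code can consume it with little defensive parsing. By default, conformance is best-effort with automatic retries; strict structured output, covered later in this guide, turns it into a hard guarantee.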

## How extraction works

Two inputs shape the result:

- **Schema envelope** (required) — `{"schema": <JSON Schema>}` describing the fields to extract. Each schema property's `description` tells the extractor what belongs there — the better the description, the better the match.

- **Instructions** (optional) — Free-form guidance for the extraction: disambiguation rules, formatting preferences, or domain context — anything you'd tell a colleague doing the extraction by hand.

Extraction requires an AI model. Configure the provider through `AiProcessingSettings` on the document's settings — a local OpenAI-compatible server keeps documents on your machine, or point it at a hosted provider. Structured extraction requires the vision data extraction feature in your license.

## Complete implementation

Specify a package name for the sample class:

```java

package io.nutrient.Sample;

```

Import the classes used in the sample:

```java

import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.requests.StructuredExtractionRequest;
import io.nutrient.sdk.exceptions.NutrientException;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

```

## Configuring the AI provider

Open the document in a try-with-resources block so resources are cleaned up after processing, then point `AiProcessingSettings` at your model server. This example uses a local OpenAI-compatible endpoint, so the document never leaves your machine:

```java

public class ExtractStructuredData {
    public static void main(String[] args) {
        try (Document document = Document.open("input.pdf")) {

            var aiProcessing = document.getSettings().getAiProcessingSettings();
            aiProcessing.setProvider("local");
            aiProcessing.setEndpoint("http://localhost:1234/v1");
            aiProcessing.setModel("your-model-id");

```

For a hosted provider instead, set `provider` to `"openai"` or `"azure"` along with `apiKey` (and `endpoint` for Azure).

## Building the request

Build a `StructuredExtractionRequest` carrying the schema envelope — a JSON object whose `schema` member is the JSON Schema to extract against. Give every property a `description` — that's what the extractor matches against the document:

```java

            StructuredExtractionRequest request = new StructuredExtractionRequest();
            request.setSchema(
                "{" +
                "  \"schema\": {" +
                "    \"type\": \"object\"," +
                "    \"properties\": {" +
                "      \"documentNumber\": {\"type\": \"string\", \"description\": \"The document's reference or invoice number\"}," +
                "      \"issueDate\": {\"type\": \"string\", \"description\": \"The date the document was issued, as printed\"}," +
                "      \"totalAmount\": {\"type\": \"number\", \"description\": \"The final total amount due\"}" +
                "    }" +
                "  }" +
                "}");
            request.setInstructions("Amounts are plain numbers without currency symbols.");

```

The envelope shape exists so extraction inputs can grow without breaking your code — a future `constraints` member (cross-field validation rules) will ride alongside `schema` in the same envelope.
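The concatenated literal above is easy to mistype. As a minimal sketch, assuming Java 15 or later, the same envelope can be written as a text block. The class and constant names here are illustrative and not part of the SDK:

```java
// Sketch: the same schema envelope as a Java text block (requires Java 15+).
// SchemaEnvelope and INVOICE_ENVELOPE are hypothetical names, not SDK types.
final class SchemaEnvelope {
    // The envelope is a JSON object whose "schema" member holds the JSON Schema.
    static final String INVOICE_ENVELOPE = """
        {
          "schema": {
            "type": "object",
            "properties": {
              "documentNumber": {"type": "string", "description": "The document's reference or invoice number"},
              "issueDate": {"type": "string", "description": "The date the document was issued, as printed"},
              "totalAmount": {"type": "number", "description": "The final total amount due"}
            }
          }
        }
        """;

    public static void main(String[] args) {
        // Print the envelope to check it by eye before using it.
        System.out.print(INVOICE_ENVELOPE);
    }
}
```

The constant can then be passed to `request.setSchema(...)` in place of the concatenated string.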

## Confidence reporting

For per-field confidence signals, enable them on the settings before extracting. Each metadata entry then also carries the individual confidence components for the field — combined with the `match` grounding labels (refer to the output section below), this gives your pipeline a per-field basis for automatic acceptance versus human review:

```java

            aiProcessing.setIncludeConfidence(true);

```

## Strict structured output

By default, extraction runs in best-effort mode: the model is instructed to follow your schema, and the SDK retries when the response doesn't conform. For a hard guarantee instead, enable strict structured output — the model's response is grammar-constrained to the schema, so the result always matches it exactly:

```java

            aiProcessing.setStrictStructuredOutput(true);

```

Two things to know before enabling it:

- **You don't change your schema.** Strict mode has formal requirements (every object closed, every property accounted for), and the SDK normalizes your schema to satisfy them automatically.

- **Absent fields come back as `null`.** In strict mode the model must emit every schema property, so a field the document doesn't contain is returned as `null` instead of being omitted — your downstream code can rely on every key being present. Fields you list in the schema's `required` array keep their declared type untouched — they can only be `null` if your own schema allows it.

Strict mode requires a model and endpoint that support grammar-constrained structured outputs (hosted providers do; check your local server's documentation).
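Downstream code may need to cope with both modes: a key that is present with a `null` value (strict mode) and, presumably, a key that is missing altogether (best-effort mode). The following is a minimal sketch, assuming the `extraction` node has already been parsed into a `Map` by a JSON library of your choice. The `ExtractedFields` helper and the sample values are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch: hypothetical helper, not part of the SDK.
final class ExtractedFields {
    // Returns the value for a field, treating a missing key and an
    // explicit null as the same "no value" case.
    static Optional<Object> value(Map<String, Object> extraction, String field) {
        return Optional.ofNullable(extraction.get(field));
    }

    public static void main(String[] args) {
        // Made-up parsed result; HashMap is used because it permits null values.
        Map<String, Object> strictResult = new HashMap<>();
        strictResult.put("documentNumber", "INV-001");
        strictResult.put("issueDate", null); // strict mode: key present, value null

        System.out.println(value(strictResult, "documentNumber").isPresent()); // true
        System.out.println(value(strictResult, "issueDate").isPresent());      // false
        System.out.println(strictResult.containsKey("issueDate"));             // true
    }
}
```

Funnelling every lookup through one helper means the calling code does not change when you switch between best-effort and strict mode.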

## Extracting the data

Create a vision instance bound to the document with `Vision.set(document)`, then call `extractStructured(request)`:

```java

            Vision vision = Vision.set(document);
            String resultJson = vision.extractStructured(request);

            Files.writeString(Path.of("output.json"), resultJson);
        } catch (NutrientException | IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}

```

To write the result directly to a file instead, call `extractStructuredToFile(request, "output.json")`.

## Understanding the output

`extractStructured(request)` returns JSON with two top-level nodes:

- **`extraction`** — The extracted fields, shaped exactly to your schema.

- **`metadata`** — One entry per extracted field, carrying where the value came from: a `match` grounding label and source location info (page and bounding box) so you can highlight the source in a viewer or route low-trust fields to human review.

The `match` label tells you how confidently the value was traced back to the document: an exact source match, a partial or multi-block match, a fuzzy match, or `not_found` when the value couldn't be located in the recognized content — the strongest signal that a field deserves review.
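One way to act on these labels is to accept fields with a trusted label and queue everything else for review. The sketch below assumes the `metadata` node has already been parsed into a map from field name to `match` label. Only `not_found` is named in this guide, so the `"exact"` label and the `ReviewRouter` class are placeholders to replace with the values your SDK version actually returns:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: hypothetical helper, not part of the SDK.
final class ReviewRouter {
    // Returns the fields whose match label is not in the trusted set.
    // A missing (null) label is treated as untrusted.
    static List<String> fieldsNeedingReview(Map<String, String> matchByField,
                                            Set<String> trustedLabels) {
        List<String> review = new ArrayList<>();
        for (Map.Entry<String, String> entry : matchByField.entrySet()) {
            String label = entry.getValue();
            if (label == null || !trustedLabels.contains(label)) {
                review.add(entry.getKey());
            }
        }
        return review;
    }

    public static void main(String[] args) {
        // "exact" is a placeholder label; "not_found" comes from this guide.
        Map<String, String> matchByField = new LinkedHashMap<>();
        matchByField.put("documentNumber", "exact");
        matchByField.put("totalAmount", "not_found");

        System.out.println(fieldsNeedingReview(matchByField, Set.of("exact"))); // [totalAmount]
    }
}
```

Using an allow-list of trusted labels, rather than a block-list, means any label the code does not recognize is sent to review by default.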

Source grounding is on by default. When only the extracted values matter, turn it off with `aiProcessing.setIncludeSourceLocations(false)` to cut model token usage — grounding asks the model to also return per-field source references, which roughly doubles the schema sent with each request.

## Error handling

The Vision API raises `VisionException` (a subtype of `NutrientException`) when extraction fails. Common causes include an unreadable document, a missing or malformed JSON Schema (validated before any model call), an unreachable model endpoint, and an unlicensed feature. In production code, catch `NutrientException`, return a clear error message to the caller, and log the failure details for debugging.
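The catch-and-log pattern can be sketched without the SDK on the classpath. In the example below, the `Callable` stands in for a call such as `vision.extractStructured(request)`, and `SafeExtraction` is a hypothetical helper. It catches `Exception` only so the sketch stays self-contained; real code would catch `NutrientException`:

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: hypothetical wrapper, not part of the SDK.
final class SafeExtraction {
    private static final Logger LOG = Logger.getLogger(SafeExtraction.class.getName());

    // Runs an extraction step. On failure, logs the full exception (with stack
    // trace) and returns an empty Optional so the caller can report a clear error.
    static Optional<String> run(Callable<String> extraction) {
        try {
            return Optional.ofNullable(extraction.call());
        } catch (Exception e) { // assumption: use NutrientException in real code
            LOG.log(Level.SEVERE, "Structured extraction failed", e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Simulated failure standing in for an unreachable model endpoint.
        Optional<String> result = run(() -> {
            throw new IllegalStateException("model endpoint unreachable");
        });
        System.out.println(result.isPresent()
            ? result.get()
            : "Extraction failed; see the log for details.");
    }
}
```

Keeping the detailed exception in the log and a short message in the return path separates what operators need from what end users see.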

## Conclusion

The workflow for structured data extraction is:

1. Open the source document using try-with-resources for automatic resource cleanup.

2. Configure the AI provider through `AiProcessingSettings` — local for privacy, or a hosted provider.

3. Build a `StructuredExtractionRequest` with a schema envelope — `{"schema": <JSON Schema>}`, every property carrying a `description` — and optional instructions.

4. Create a vision instance with `Vision.set()`.

5. Call `extractStructured(request)` and consume the `extraction` node; use `metadata` to verify or review.

6. Handle `NutrientException` for robust error recovery.

For related extraction workflows, refer to the [Java SDK guides](https://www.nutrient.io/guides/java.md).

Download [this ready-to-use sample package](https://www.nutrient.io/downloads/samples/java/extract-structured-data.zip) to explore structured data extraction.

---

## Related pages

- [Applying OCR to a PDF document](/guides/java/extraction/apply-ocr-to-pdf.md)
- [Applying OCR to a PDF page](/guides/java/extraction/apply-ocr-to-pdf-page.md)
- [Generating image descriptions using Claude](/guides/java/extraction/describe-image-with-claude.md)
- [Generating image descriptions using local AI](/guides/java/extraction/describe-image-with-local-ai.md)
- [Extracting data from images using OCR](/guides/java/extraction/extract-data-from-image-ocr.md)
- [Generating image descriptions using OpenAI](/guides/java/extraction/describe-image-with-openai.md)
- [Extracting data from images using ICR](/guides/java/extraction/extract-data-from-image-icr.md)
- [Extracting JSON data from a PDF document](/guides/java/extraction/json-data-extraction.md)
- [Extracting data from images using vision language models](/guides/java/extraction/extract-data-from-image-vlm.md)
- [Extracting form fields from images](/guides/java/extraction/extract-form-fields-from-image.md)
- [Extracting text from PDF documents](/guides/java/extraction/pdf-to-text.md)
- [Labeling form fields with a vision language model](/guides/java/extraction/label-form-fields-with-vlm.md)
- [Nutrient Java SDK extraction guides](/guides/java/extraction.md)
- [Extracting text from multilingual images](/guides/java/extraction/read-text-from-image-multi-language.md)
- [Extracting text from images](/guides/java/extraction/read-text-from-image.md)
- [Speeding up first ICR operation by predownloading models](/guides/java/extraction/speed-up-first-icr-by-downloading-requirements.md)

