---
title: "Extracting structured JSON data from PDF documents | Nutrient Python SDK"
canonical_url: "https://www.nutrient.io/guides/python/extraction/json-data-extraction/"
md_url: "https://www.nutrient.io/guides/python/extraction/json-data-extraction.md"
last_updated: "2026-05-25T12:14:42.960Z"
description: "Extract structured JSON data from PDF documents using OCR with Nutrient Python SDK."
---

# Extracting structured JSON data from PDF documents

Extract structured data from PDF files as JSON for storage, API workflows, or analytics pipelines. This approach reduces manual data entry and gives your application direct access to document content.

[Download sample](https://www.nutrient.io/downloads/samples/python/json-data-extraction.zip)

## How Nutrient supports this workflow

Nutrient Python SDK handles OCR-based extraction from PDF documents.

You don’t need to manage:

- Third-party OCR engine integration

- Document layout parsing

- Model download and initialization

- Conversion from OCR output to structured data

Use the SDK API to extract structured JSON in your application.

## Complete implementation

This example shows a complete PDF-to-JSON extraction flow.

Import the required Nutrient classes:

```python
from nutrient_sdk import Document, NutrientException, Vision, VisionEngine
```

Open the PDF with a Python [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers). The context manager closes the document automatically when the `with` block exits:

```python
def main():
    try:
        # The context manager closes the document when the block exits.
        with Document.open("input.pdf") as document:
```

Configure the OCR engine, extract JSON content, and write it to `output.json`. Catch `NutrientException` to handle SDK errors:

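```python
            # Select the OCR engine for content extraction.
            document.settings.vision_settings.set_engine(VisionEngine.OCR)

            # Extract the document content as a JSON string.
            vision = Vision.set(document)
            content_json = vision.extract_content()

            # Write the JSON string to disk as UTF-8.
            with open("output.json", "w", encoding="utf-8") as f:
                f.write(content_json)

            print("Successfully extracted content to output.json")
    except NutrientException as e:
        # SDK errors surface as NutrientException.
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
```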
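As a quick sanity check, you can parse the extracted output with Python's standard `json` module. The sketch below assumes only that `extract_content()` returns a JSON string, as the example implies by writing it straight to a file. This guide doesn't describe the JSON structure, so the helpers inspect only the top-level value. The names `describe_json` and `pretty` are hypothetical and not part of the SDK.

```python
import json


def describe_json(text: str) -> str:
    """Return a one-line description of the top-level JSON value in `text`."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed input
    if isinstance(data, dict):
        return f"object with keys: {', '.join(sorted(data))}"
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


def pretty(text: str) -> str:
    """Re-indent a JSON string for human review, keeping non-ASCII text readable."""
    return json.dumps(json.loads(text), indent=2, ensure_ascii=False)
```

For example, after running the extraction you could read `output.json` back and call `describe_json` on its contents to see what the top-level value looks like before writing code that depends on specific fields.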

## Summary

The extraction flow has four steps:

1. Open the PDF document.

2. Configure the OCR engine.

3. Extract content as JSON with `Vision`.

4. Write the JSON output to a file.

Nutrient handles OCR and content structuring, so you don’t need to implement PDF parsing or text recognition logic.
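If you need to run these steps over several files, one possible approach is to keep the file handling separate from the extraction call. The sketch below is an illustration, not part of the SDK: `output_path_for` and `extract_all` are hypothetical helpers, and `extract` is any callable you supply that takes a PDF path and returns a JSON string.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def output_path_for(pdf_path: Path, output_dir: Path) -> Path:
    """Map input.pdf to <output_dir>/input.json."""
    return output_dir / (pdf_path.stem + ".json")


def extract_all(
    pdf_paths: Iterable[Path],
    output_dir: Path,
    extract: Callable[[Path], str],
) -> List[Path]:
    """Run `extract` on each PDF and write one JSON file per input.

    Note: inputs that share a file name (for example a/x.pdf and b/x.pdf)
    map to the same output path, so the later one overwrites the earlier.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in pdf_paths:
        target = output_path_for(pdf_path, output_dir)
        target.write_text(extract(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

In practice, `extract` could wrap the `Document.open`, `set_engine`, `Vision.set`, and `extract_content` sequence from the example above. Passing it in as a parameter also lets you test the file handling without running OCR.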

You can download [this sample package](https://www.nutrient.io/downloads/samples/python/json-data-extraction.zip) to run the example locally.

---

## Related pages

- [Extracting data from images using ICR](/guides/python/extraction/extract-data-from-image-icr.md)
- [Generating image descriptions using Claude](/guides/python/extraction/describe-image-with-claude.md)
- [Generating image descriptions using OpenAI](/guides/python/extraction/describe-image-with-openai.md)
- [Generating image descriptions using local AI](/guides/python/extraction/describe-image-with-local-ai.md)
- [Extracting data from images using vision language models](/guides/python/extraction/extract-data-from-image-vlm.md)
- [Extracting data from images using OCR](/guides/python/extraction/extract-data-from-image-ocr.md)
- [Nutrient Python SDK extraction guides](/guides/python/extraction.md)
- [Speeding up first ICR operation by predownloading models](/guides/python/extraction/speed-up-first-icr-by-downloading-requirements.md)
- [Extracting text from images](/guides/python/extraction/read-text-from-image.md)
- [Extracting text from multilingual images](/guides/python/extraction/read-text-from-image-multi-language.md)

