Extracting structured JSON data from PDF documents | Nutrient Python SDK

Extract structured data from PDF files as JSON for storage, API workflows, or analytics pipelines. This approach reduces manual entry and gives your application direct access to document content.

How Nutrient supports this workflow

Nutrient Python SDK handles structured extraction from PDF documents, including digital-native PDFs and PDFs that mix digital text with scanned content.

In this sample, VisionEngine.ADAPTIVE_OCR uses an adaptive extraction pipeline that prefers native PDF text when available and falls back to OCR for image-based content when needed.

You don’t need to manage:

  • Third-party OCR engine integration
  • Switching between native-text extraction and OCR
  • Document layout parsing
  • Model download and initialization
  • Conversion from extracted output to structured data
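To make the "prefer native text, fall back to OCR" idea concrete, here is a minimal conceptual sketch of the kind of per-page decision you would otherwise write yourself. It is not Nutrient's implementation: the function name extract_page_text, the min_chars threshold, and the stub OCR function are all hypothetical and exist only for illustration.

```python
from typing import Callable, Optional, Tuple


def extract_page_text(
    native_text: Optional[str],
    run_ocr: Callable[[], str],
    min_chars: int = 1,
) -> Tuple[str, str]:
    """Return (text, source) for one page, preferring native text.

    Hypothetical helper for illustration; not part of the Nutrient SDK.
    """
    # Use the embedded text layer when it holds enough non-whitespace characters.
    if native_text is not None and len(native_text.strip()) >= min_chars:
        return native_text, "native"
    # Otherwise treat the page as image-based and fall back to OCR.
    return run_ocr(), "ocr"


def fake_ocr() -> str:
    """Stub standing in for a real OCR engine."""
    return "text recognized from image"


# A page with a text layer is read directly; an empty one triggers the fallback.
digital_page = extract_page_text("Invoice #123", fake_ocr)
scanned_page = extract_page_text("", fake_ocr)
```

With VisionEngine.ADAPTIVE_OCR, this decision is made inside the SDK, so application code does not need a helper like this.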

Use the SDK API to extract structured JSON in your application.

Complete implementation

This example shows a complete PDF-to-JSON extraction flow.

Import the required Nutrient classes:

from nutrient_sdk import Document
from nutrient_sdk import Vision
from nutrient_sdk import NutrientException
from nutrient_sdk import VisionEngine

Open the PDF with a Python context manager. The context manager closes the document automatically:

def main():
    try:
        with Document.open("input.pdf") as document:

Configure the Adaptive OCR engine, extract JSON content, and write it to output.json. Catch NutrientException to handle SDK errors:

            document.settings.vision_settings.engine = VisionEngine.ADAPTIVE_OCR
            vision = Vision.set(document)
            content_json = vision.extract_content()
            with open("output.json", "w", encoding="utf-8") as f:
                f.write(content_json)
            print("Successfully extracted content to output.json")
    except NutrientException as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
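
Once the JSON is extracted, a downstream step often needs to confirm that it parses and see what it contains. The sketch below uses only Python's standard json module and makes no assumption about the SDK's output schema, which this guide does not document. The summarize_json helper and the sample payload are hypothetical placeholders; the real structure returned by extract_content may differ.

```python
import json
from typing import Any, Dict


def summarize_json(content_json: str) -> Dict[str, Any]:
    """Parse a JSON string and describe its top-level shape.

    Hypothetical helper; it does not assume any particular schema.
    """
    # json.loads raises json.JSONDecodeError if the string is not valid JSON.
    data = json.loads(content_json)
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data.keys())}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    # Scalars (string, number, boolean, null) are reported by Python type name.
    return {"type": type(data).__name__}


# Made-up payload for illustration only; not the SDK's actual output format.
sample = '{"pages": [], "metadata": {}}'
summary = summarize_json(sample)
```

In the example above, you could pass content_json to a helper like this before writing the file, or read output.json back and inspect it the same way.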

Summary

The extraction flow has four steps:

  1. Open the PDF document.
  2. Configure the Adaptive OCR engine.
  3. Extract content as JSON with Vision.
  4. Write the JSON output to a file.

Nutrient handles adaptive extraction and content structuring, so you don’t need to implement PDF parsing, native-text detection, or OCR fallback logic.

You can download this sample package to run the example locally.