Extracting structured JSON data from PDF documents
Extract structured data from PDF files as JSON for storage, API workflows, or analytics pipelines. This approach reduces manual entry and gives your application direct access to document content.
How Nutrient supports this workflow
Nutrient Python SDK handles OCR-based extraction from PDF documents.
You don’t need to manage:
- Third-party OCR engine integration
- Document layout parsing
- Model download and initialization
- Conversion from OCR output to structured data
Use the SDK API to extract structured JSON in your application.
Complete implementation
This example shows a complete PDF-to-JSON extraction flow.
Import the required Nutrient classes:

from nutrient_sdk import Document
from nutrient_sdk import Vision
from nutrient_sdk import NutrientException
from nutrient_sdk import VisionEngine

Open the PDF with a Python context manager. The context manager closes the document automatically:

def main():
    try:
        with Document.open("input.pdf") as document:

Configure the OCR engine, extract JSON content, and write it to output.json. Catch NutrientException to handle SDK errors:

            document.settings.vision_settings.set_engine(VisionEngine.OCR)

            vision = Vision.set(document)
            content_json = vision.extract_content()

            with open("output.json", "w", encoding="utf-8") as f:
                f.write(content_json)

            print("Successfully extracted content to output.json")
    except NutrientException as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

Summary
The extraction flow has four steps:
- Open the PDF document.
- Configure the OCR engine.
- Extract content as JSON with Vision.
- Write the JSON output to a file.
Nutrient handles OCR and content structuring, so you don’t need to implement PDF parsing or text recognition logic.
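The write step can be made slightly more defensive. The sketch below assumes that extract_content() returns a JSON string, as the f.write(content_json) call in the example suggests. save_extracted_json is a hypothetical helper name, not part of the Nutrient SDK. It parses the string with the standard json module to confirm it is valid, then writes it pretty-printed:

```python
import json
from pathlib import Path
from typing import Any


def save_extracted_json(content_json: str, output_path: str) -> Any:
    """Validate a JSON string and write it, pretty-printed, to output_path.

    Hypothetical helper, not part of the Nutrient SDK. Assumes content_json
    is the string returned by vision.extract_content().
    """
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on
    # malformed input. Parsing happens before the file is opened, so a bad
    # payload never overwrites an existing output file.
    data = json.loads(content_json)

    # ensure_ascii=False keeps non-ASCII text from the document readable.
    Path(output_path).write_text(
        json.dumps(data, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return data
```

In the example, the with open(...) block could be swapped for save_extracted_json(content_json, "output.json"). A json.JSONDecodeError is not a NutrientException, so it would need its own except clause.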
You can download this sample package to run the example locally.
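After running the example, it can help to look at the shape of output.json before wiring it into storage or analytics code. The output schema is not covered here, so the sketch below makes no assumptions about key names and only walks whatever the file contains, using the standard json module. summarize_json and outline_file are hypothetical helpers, not part of the Nutrient SDK:

```python
import json
from typing import Any


def summarize_json(value: Any, max_depth: int = 2, _depth: int = 0) -> Any:
    """Return a compact outline of a parsed JSON value.

    Objects map each key to an outline of its value, arrays report their
    length plus an outline of their first element, and scalars report
    their Python type name. Containers at or below max_depth are cut off.
    """
    if isinstance(value, dict):
        if _depth >= max_depth:
            return "{...}"
        return {
            key: summarize_json(item, max_depth, _depth + 1)
            for key, item in value.items()
        }
    if isinstance(value, list):
        label = f"list[{len(value)}]"
        if _depth >= max_depth or not value:
            return label
        # Only the first element is outlined; later elements may differ.
        return {label: summarize_json(value[0], max_depth, _depth + 1)}
    return type(value).__name__


def outline_file(path: str, max_depth: int = 2) -> str:
    """Load a JSON file and return its outline as an indented string."""
    with open(path, encoding="utf-8") as f:
        return json.dumps(summarize_json(json.load(f), max_depth), indent=2)
```

Calling print(outline_file("output.json")) prints the outline. Raise max_depth to see deeper levels of nesting.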