Extracting text from images programmatically enables teams to build high-throughput document processing pipelines, real-time text extraction systems, and resource-efficient content indexing workflows. Optical character recognition (OCR) provides fast text extraction optimized for throughput over semantic analysis, whether you’re processing invoice batches on tight payment deadlines, indexing millions of document pages for full-text search, scanning receipts on mobile devices with limited processing power, or digitizing documents where speed takes priority over complex layout understanding. The OCR engine focuses on character recognition and word-level bounding boxes, without the computational overhead of full document layout analysis, table detection, or semantic element classification. This makes it well suited to linear documents and high-volume scenarios where processing speed and resource efficiency are critical constraints.

How Nutrient helps you achieve this

Nutrient Python SDK handles OCR engine configuration, text extraction, and JSON formatting. With the SDK, you don’t need to worry about:

  • Configuring OCR engines and language model selection for character recognition
  • Implementing word-level bounding box calculation and coordinate transformation
  • Handling text line detection and reading order determination
  • Implementing language detection algorithms and multi-language text processing

Instead, Nutrient provides an API that handles all the complexity behind the scenes, letting you focus on your business logic.

Prerequisites

Before following this guide, ensure you have:

  • Python 3.8 or higher installed
  • Nutrient Python SDK installed (pip install nutrient-sdk)
  • An image file to process (PNG, JPEG, or other supported formats)
  • Basic familiarity with Python context managers and the with statement

For initial SDK setup and configuration, refer to the getting started guide.

Complete implementation

Below is a complete working example that demonstrates extracting text from images using the OCR engine optimized for speed and throughput. The vision API processes images and returns JSON-formatted text data with word-level bounding boxes that can be used for search indexing, text analysis, or downstream processing. The import statement brings in the necessary classes from the Nutrient SDK:

from nutrient_sdk import Document, Vision, VisionEngine

Configuring OCR mode

Open the image file and configure the vision API to use the OCR engine for fast text extraction. The following code uses a context manager to open the document with automatic resource cleanup. The vision_settings.engine property is assigned the VisionEngine.OCR enumeration value to explicitly configure OCR-based text extraction. Unlike intelligent content recognition (ICR) mode, which performs full document layout analysis and semantic element detection, OCR mode focuses exclusively on character recognition and word extraction, skipping table detection, equation parsing, and hierarchical structure analysis. This streamlined processing approach minimizes computational overhead, reducing memory consumption and CPU utilization while maximizing throughput for high-volume document processing scenarios:

with Document.open("input_ocr_multiple_languages.png") as document:
    # Configure OCR engine for fast text extraction
    document.settings.vision_settings.engine = VisionEngine.OCR

Creating a vision instance and extracting content

Create a vision instance and extract the text content to a JSON string. The following code uses the Vision.set() method to create a vision instance bound to the opened document, enabling text extraction operations. The extract_content() method invokes the OCR engine, which performs character recognition on the image, detects individual words and text lines, calculates bounding boxes in pixel coordinates for each word, and generates a JSON-formatted string containing the extracted text with positional data. The OCR extraction process is optimized for speed, processing text sequentially without semantic analysis, making it suitable for simple documents, high-throughput pipelines, and resource-constrained environments:

    vision = Vision.set(document)
    content_json = vision.extract_content()

Write the extracted text content to a JSON file for downstream processing or search indexing. The following code uses Python’s built-in file handling with a context manager to automatically close the file after writing. The content_json string contains the extracted text with word-level bounding boxes in JSON format, enabling integration with search indexing systems (Elasticsearch, Solr), text analysis pipelines for natural language processing, and database storage for full-text search capabilities:

    with open("output.json", "w") as f:
        f.write(content_json)

Understanding the output

The extract_content() method in OCR mode returns a JSON structure optimized for text extraction and word-level positioning. The OCR engine generates streamlined output focused on character recognition:

  • Text content — Extracted text from the document with original line breaks and spacing preserved where possible
  • Bounding boxes — Position coordinates and dimensions of text regions in pixel units for word-level positioning
  • Word-level data — Individual words with precise coordinates enabling text highlighting, redaction targeting, or clickable text overlays
  • Language detection — Identified language(s) in the processed text for multi-language document handling and language-specific processing

The JSON format enables integration with search indexing systems, including Elasticsearch with full-text search, data extraction pipelines for invoice processing, text analysis tools for natural language processing, and database storage systems requiring full-text search capabilities. Unlike ICR mode, which provides semantic structure, OCR output focuses on text content and coordinates without table structures, heading hierarchies, or document organization metadata.
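To illustrate how this output can feed a search indexing or highlighting pipeline, here is a minimal sketch of parsing the extracted JSON with Python’s standard json module. The exact schema returned by extract_content() isn’t specified here, so the field names in this sample ("language", "words", "text", "bbox") are illustrative assumptions, not the SDK’s documented format:

```python
import json

# Hypothetical OCR output; the exact JSON schema depends on the SDK version,
# so the field names used here are illustrative assumptions.
sample_json = """
{
  "language": "en",
  "words": [
    {"text": "Invoice", "bbox": {"x": 40, "y": 32, "width": 96, "height": 18}},
    {"text": "Total:", "bbox": {"x": 40, "y": 410, "width": 60, "height": 16}},
    {"text": "$1,250.00", "bbox": {"x": 108, "y": 410, "width": 92, "height": 16}}
  ]
}
"""

content = json.loads(sample_json)

# Reassemble plain text, e.g. for a full-text search index.
plain_text = " ".join(word["text"] for word in content["words"])

# Map each word to its pixel coordinates, e.g. for highlighting search hits
# or targeting redactions on the source image.
positions = {word["text"]: word["bbox"] for word in content["words"]}

print(plain_text)           # Invoice Total: $1,250.00
print(positions["Total:"])  # {'x': 40, 'y': 410, 'width': 60, 'height': 16}
```

The same pattern extends to bulk indexing: iterate over the word list once, concatenate the text for the search document body, and store the bounding boxes alongside it for result highlighting.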

Error handling

The vision API raises VisionException if text extraction fails due to image processing errors or OCR resource loading failures. Exception handling ensures robust error recovery in production environments.

Common failure scenarios include:

  • The image file can’t be read due to file system permissions or invalid path errors
  • Image data is corrupted, truncated, or uses an unsupported encoding scheme preventing decoding
  • Required OCR models aren’t installed or accessible, or they have insufficient file permissions
  • Insufficient system memory for processing large images (OCR typically requires less memory than ICR but can still fail on extremely large images)
  • Unsupported image format or resolution (some OCR engines have minimum resolution requirements for accurate character recognition)

In production code, wrap the extraction operations in a try-except block to catch VisionException instances, providing appropriate error messages to users and logging failure details for debugging. This error handling pattern enables graceful degradation when text extraction fails, preventing application crashes and enabling retry logic with different processing parameters or fallback to alternative extraction methods.
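The pattern above can be sketched as a small generic wrapper. To keep the example self-contained, it takes the extraction callable and exception types as parameters and uses stand-ins; in production, the callable would wrap vision.extract_content() and the error types would include the SDK’s VisionException:

```python
import logging

def extract_with_fallback(extract_fn, error_types, fallback_fn=None):
    """Run an extraction callable; on failure, log the error and either
    retry with a fallback extraction method or re-raise."""
    try:
        return extract_fn()
    except error_types as exc:
        logging.error("Text extraction failed: %s", exc)
        if fallback_fn is not None:
            logging.info("Retrying with fallback extraction method")
            return fallback_fn()
        raise

# Stand-in for a failing vision.extract_content() call; in production,
# error_types would be (VisionException,) from the Nutrient SDK.
def flaky_extract():
    raise RuntimeError("simulated OCR model loading failure")

result = extract_with_fallback(
    flaky_extract,
    error_types=(RuntimeError,),
    fallback_fn=lambda: '{"words": []}',
)
print(result)  # {"words": []}
```

Keeping the error types and fallback injectable makes the wrapper easy to unit test and lets you swap in alternative processing parameters (for example, a different engine mode) as the fallback path.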

Conclusion

The OCR-based text extraction workflow consists of several key operations:

  1. Open the image document using a context manager for automatic resource cleanup.
  2. Configure the vision settings by assigning VisionEngine.OCR to the engine property; OCR mode performs character recognition and word extraction without semantic analysis or layout detection.
  3. Create a vision instance with Vision.set() to bind text extraction operations to the document.
  4. Call extract_content() to invoke the OCR engine, which detects words and text lines, calculates bounding boxes, and returns a JSON-formatted string containing the extracted text with word-level coordinates in pixel units.
  5. Write the JSON content to a file using Python’s built-in file handling with context manager syntax.
  6. Handle VisionException errors for robust error recovery in production environments.

OCR processing is optimized for speed, minimizing computational overhead for high-throughput scenarios. The JSON output enables integration with search indexing (Elasticsearch, Solr), text analysis, and database storage, making OCR mode ideal for invoice processing, receipt scanning, search indexing, and document digitization where speed is critical.

Nutrient handles OCR engine configuration, language model selection, word-level bounding box calculation, text line detection, reading order determination, and JSON schema generation so you don’t need to implement character recognition algorithms or manage OCR model loading manually. The OCR system provides fast text extraction for high-throughput document processing pipelines, real-time text capture applications, search indexing systems requiring millions of document extractions, and resource-efficient digitization workflows where processing speed takes priority over semantic document understanding.

Download this ready-to-use sample package to explore the vision API capabilities with preconfigured OCR settings.