Extracting text from images programmatically enables teams to build high-throughput document processing pipelines, real-time text extraction systems, and resource-efficient content indexing workflows. Optical character recognition (OCR) provides fast text extraction optimized for throughput rather than semantic analysis, whether you’re processing invoice batches on tight payment deadlines, indexing millions of document pages for full-text search, scanning receipts on mobile devices with limited processing power, or digitizing documents where speed matters more than complex layout understanding. The OCR engine focuses on character recognition and word-level bounding boxes, skipping the computational overhead of full document layout analysis, table detection, and semantic element classification. This makes it well suited to linear documents and high-volume processing scenarios where speed and resource efficiency are critical constraints.

How Nutrient helps you achieve this

Nutrient Java SDK handles OCR engine configuration, text extraction, and JSON formatting. With the SDK, you don’t need to worry about:

  • Configuring OCR engines and language model selection for character recognition
  • Implementing word-level bounding box calculation and coordinate transformation
  • Handling text line detection and reading order determination
  • Complex language detection algorithms and multi-language text processing

Instead, Nutrient provides an API that handles all the complexity behind the scenes, letting you focus on your business logic.

Complete implementation

Below is a complete working example that demonstrates extracting text from images using the OCR engine optimized for speed and throughput. The vision API processes images and returns JSON-formatted text data with word-level bounding boxes that can be used for search indexing, text analysis, or downstream processing.

Prerequisites

Before following this guide, ensure you have:

  • Java 8 or higher installed
  • Nutrient Java SDK installed (via Maven, Gradle, or manual JAR installation)
  • An image file to process (PNG, JPEG, or other supported formats)
  • Basic familiarity with Java try-with-resources statements

For initial SDK setup and configuration, refer to the getting started guide.

Preparing the project

Start by specifying a package name:

package io.nutrient.Sample;

Import the required classes from the SDK:

import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.enums.VisionEngine;
import io.nutrient.sdk.exceptions.NutrientException;
import java.io.FileWriter;
import java.io.IOException;

Then declare the class that will contain the sample code:

public class ExtractDataFromImageOcr {

Create the main method, declaring the checked exceptions it can throw:

public static void main(String[] args) throws NutrientException, IOException {

Configuring OCR mode

Open the image file and configure the vision API to use the OCR engine for fast text extraction. The following code uses a try-with-resources statement so the document is closed automatically. Calling getSettings().getVisionSettings().setEngine(VisionEngine.Ocr) selects the OCR engine. Unlike intelligent content recognition (ICR) mode, which performs full document layout analysis and semantic element detection, OCR mode focuses exclusively on character recognition and word extraction, skipping table detection, equation parsing, and hierarchical structure analysis. This streamlined approach reduces memory consumption and CPU utilization while maximizing throughput for high-volume document processing:

try (Document document = Document.open("input_ocr_multiple_languages.png")) {
    // Configure the OCR engine for fast text extraction.
    document.getSettings().getVisionSettings().setEngine(VisionEngine.Ocr);

Creating a vision instance and extracting content

Create a vision instance and extract the text content to a JSON string. Vision.set() creates a vision instance bound to the opened document, enabling text extraction operations. The extractContent() method invokes the OCR engine, which recognizes characters in the image, detects individual words and text lines, calculates a pixel-coordinate bounding box for each word, and returns a JSON-formatted string containing the extracted text with positional data. Because OCR extraction processes text sequentially without semantic analysis, it’s well suited to simple documents, high-throughput pipelines, and resource-constrained environments:

    Vision vision = Vision.set(document);
    String contentJson = vision.extractContent();

Write the extracted text content to a JSON file for downstream processing or search indexing. The following code uses a try-with-resources statement with FileWriter to close the file automatically after writing. The contentJson string contains the extracted text with word-level bounding boxes, ready for integration with search indexing systems (Elasticsearch, Solr), natural language processing pipelines, or database storage for full-text search:

    try (FileWriter writer = new FileWriter("output.json")) {
        writer.write(contentJson);
    }
}
}
}

Understanding the output

The extractContent() method in OCR mode returns a JSON structure optimized for text extraction and word-level positioning. The OCR engine generates streamlined output focused on character recognition:

  • Text content — Extracted text from the document with original line breaks and spacing preserved where possible
  • Bounding boxes — Position coordinates and dimensions of text regions in pixel units for word-level positioning
  • Word-level data — Individual words with precise coordinates enabling text highlighting, redaction targeting, or clickable text overlays
  • Language detection — Identified language(s) in the processed text for multi-language document handling and language-specific processing

The JSON format enables integration with search indexing systems such as Elasticsearch, data extraction pipelines for invoice processing, text analysis tools for natural language processing, and database storage systems requiring full-text search capabilities. Unlike ICR mode, which provides semantic structure, OCR output focuses on text content and coordinates, omitting table structures, heading hierarchies, and document organization metadata.
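The exact JSON schema isn’t reproduced in this guide, so the field names in the sketch below (a top-level "words" array with "text" and "bbox" entries) are hypothetical placeholders, not the SDK’s documented output. With that caveat, here is a minimal, stdlib-only way to pull word strings out of the output for a quick smoke test; a real pipeline would use a proper JSON parser such as Jackson or Gson rather than a regular expression:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OcrJsonWords {

    // Matches "text": "<value>" pairs; assumes values contain no escaped quotes.
    private static final Pattern TEXT_FIELD =
            Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    // Collects every "text" value found in the OCR JSON string.
    public static List<String> extractWords(String json) {
        List<String> words = new ArrayList<>();
        Matcher matcher = TEXT_FIELD.matcher(json);
        while (matcher.find()) {
            words.add(matcher.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical sample in the general shape described above.
        String sample = "{\"words\":[{\"text\":\"Invoice\",\"bbox\":[10,20,80,35]},"
                + "{\"text\":\"2024\",\"bbox\":[90,20,130,35]}]}";
        System.out.println(extractWords(sample)); // [Invoice, 2024]
    }
}
```

For production search indexing, map the parsed words and bounding boxes into your index schema instead of discarding the coordinates.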

Error handling

The vision API throws VisionException if text extraction fails due to image processing errors or OCR resource loading failures, so production code should be prepared to catch it.

Common failure scenarios include:

  • The image file can’t be read due to file system permissions or invalid path errors
  • Image data is corrupted, truncated, or uses an unsupported encoding scheme preventing decoding
  • Required OCR models aren’t installed or accessible, or they have insufficient file permissions
  • Insufficient system memory for processing large images (OCR typically requires less memory than ICR but can still fail on extremely large images)
  • Unsupported image format or resolution (some OCR engines have minimum resolution requirements for accurate character recognition)

In production code, wrap the extraction operations in a try-catch block that catches VisionException, presents an appropriate error message to users, and logs failure details for debugging. This pattern enables graceful degradation when text extraction fails, preventing application crashes and allowing retry logic with different processing parameters or a fallback to alternative extraction methods.
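Combining that pattern with the extraction code from earlier, a guarded version might look like the following sketch. Note the assumption that VisionException lives alongside NutrientException in io.nutrient.sdk.exceptions; this guide’s import list only shows the latter, so verify the exact class before relying on it:

```java
try (Document document = Document.open("input_ocr_multiple_languages.png")) {
    document.getSettings().getVisionSettings().setEngine(VisionEngine.Ocr);
    Vision vision = Vision.set(document);
    String contentJson = vision.extractContent();
    try (FileWriter writer = new FileWriter("output.json")) {
        writer.write(contentJson);
    }
} catch (VisionException e) {
    // Log the failure details for debugging, then surface a friendly message.
    System.err.println("Text extraction failed: " + e.getMessage());
    // A production pipeline might retry with different parameters here,
    // or fall back to an alternative extraction method.
} catch (IOException e) {
    System.err.println("Could not write output.json: " + e.getMessage());
}
```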

Conclusion

The OCR-based text extraction workflow consists of several key operations:

  1. Open the image document using a try-with-resources statement for automatic resource cleanup.
  2. Configure the vision settings by calling getSettings().getVisionSettings().setEngine(VisionEngine.Ocr) to enable fast text extraction.
  3. Create a vision instance with Vision.set() to bind text extraction operations to the document.
  4. Call extractContent() to invoke the OCR engine, which performs word detection, calculates bounding boxes, and returns a JSON-formatted string containing the extracted text with word-level coordinates in pixel units.
  5. Write the JSON content to a file using a try-with-resources statement with FileWriter for automatic resource management.
  6. Handle VisionException errors for robust error recovery in production environments.

Because OCR mode skips semantic analysis and layout detection, it minimizes computational overhead for high-throughput scenarios. The JSON output integrates with search indexing (Elasticsearch, Solr), text analysis, and database storage, making OCR mode ideal for invoice processing, receipt scanning, search indexing, and document digitization where speed is critical.
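For the high-throughput batch scenarios described above, the same per-document workflow can simply be wrapped in a loop over an input directory. The sketch below assumes only the single-document API shown in this guide, plus standard java.nio file handling; directory layout and file naming are illustrative:

```java
// Process a directory of scanned images, writing one JSON file per input.
Path inputDir = Paths.get("scans");
try (DirectoryStream<Path> images = Files.newDirectoryStream(inputDir, "*.png")) {
    for (Path image : images) {
        try (Document document = Document.open(image.toString())) {
            document.getSettings().getVisionSettings().setEngine(VisionEngine.Ocr);
            String contentJson = Vision.set(document).extractContent();
            Path output = Paths.get(image.getFileName().toString() + ".json");
            Files.write(output, contentJson.getBytes(StandardCharsets.UTF_8));
        } catch (NutrientException e) {
            // Skip the failing image and keep the batch moving.
            System.err.println("Skipping " + image + ": " + e.getMessage());
        }
    }
}
```

Keeping the catch inside the loop means one corrupt or unreadable image doesn’t abort the whole batch.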

Nutrient handles OCR engine configuration, language model selection, word-level bounding box calculation, text line detection, reading order determination, and JSON schema generation so you don’t need to implement character recognition algorithms or manage OCR model loading manually. The OCR system provides fast text extraction for high-throughput document processing pipelines, real-time text capture applications, search indexing systems requiring millions of document extractions, and resource-efficient digitization workflows where processing speed takes priority over semantic document understanding.

Download this ready-to-use sample package to explore the vision API capabilities with preconfigured OCR settings.