Extracting structured data from images using VLM-enhanced processing enables higher-accuracy document analysis, intelligent form extraction, and advanced content understanding pipelines. Typical use cases include financial documents requiring accurate table extraction for regulatory compliance, invoice processing where subtle layout variations affect extraction quality, medical record digitization requiring precise semantic element detection for clinical accuracy, legal document analysis where structural understanding determines contract clause extraction, and multi-language workflows where improved language model understanding reduces extraction errors. VLM-enhanced ICR achieves this accuracy by combining local AI models with vision language model capabilities: the engine analyzes document layout using local models, applies advanced language understanding for improved semantic element classification, detects complex table structures with higher cell boundary accuracy, extracts hierarchical document organization with better reading order determination, and generates comprehensive JSON output with element classification confidence scores.

How Nutrient helps you achieve this

Nutrient Java SDK handles VLM-enhanced ICR engine configuration, AI model orchestration, and structured JSON generation. With the SDK, you don’t need to worry about:

  • Configuring hybrid processing modes combining local ICR with VLM enhancement
  • Managing AI model loading and VLM capability coordination
  • Implementing advanced semantic element classification with confidence scoring
  • Complex document structure analysis and enhanced layout understanding algorithms

Instead, Nutrient provides an API that handles all the complexity behind the scenes, letting you focus on your business logic.

Complete implementation

Below is a complete working example that demonstrates extracting structured data from images using VLM-enhanced ICR for improved accuracy. The vision API processes images and returns JSON-formatted structural data with enhanced element classification and confidence scores. The following lines set up the Java application. Start by specifying a package name and create a new class:

package io.nutrient.sample;

Import the required classes from the SDK, and declare the class that will contain the extraction logic:

import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.enums.VisionEngine;
import io.nutrient.sdk.exceptions.NutrientException;
import java.io.FileWriter;
import java.io.IOException;
public class ExtractDataFromImage {

Create the main method, declaring the checked exceptions it can throw:

public static void main(String[] args) throws NutrientException, IOException {

Loading and processing the image

Open the image file using the Document class with a try-with-resources statement for automatic resource cleanup. The following code opens an image file in any supported format, including PNG (lossless compression suitable for documents with text), JPEG (compressed format for photographs and scanned documents), and TIFF (multipage format common in document scanning). The try-with-resources pattern ensures the document is properly closed after processing, releasing memory and file handles regardless of whether extraction succeeds or fails:

try (Document document = Document.open("input.png")) {

Configuring VLM-enhanced mode

Configure the vision API to use the VLM-enhanced ICR engine for superior document analysis accuracy. The following code uses the getSettings().getVisionSettings().setEngine() method chain to assign the VisionEngine.VlmEnhancedIcr enumeration value. This configuration enables hybrid processing that combines local ICR AI models (for document layout detection and element classification) with vision language model capabilities (for advanced semantic understanding and improved accuracy). Unlike pure ICR mode, which relies solely on local models, VLM-enhanced mode applies language model intelligence to improve table cell boundary detection, semantic element classification confidence, reading order determination in complex layouts, and multi-language text understanding. This enhanced processing is particularly valuable for complex financial documents with intricate table structures, legal contracts requiring precise clause extraction, medical records with specialized terminology, or any documents where accuracy improvements justify the additional processing overhead:

// Configure VLM-enhanced ICR engine for improved accuracy
document.getSettings().getVisionSettings().setEngine(VisionEngine.VlmEnhancedIcr);

Creating a vision instance

Create a vision instance associated with the document to enable content extraction operations. The following code uses the Vision.set() static method with the document parameter to create a vision instance bound to the opened document. This binding prepares the document for AI-powered analysis, associating the vision processing context with the document’s image data and configured vision settings. The vision instance provides access to content extraction methods and maintains the processing state for the document:

Vision vision = Vision.set(document);

Extracting structured content

Call the extractContent() method to invoke the VLM-enhanced analysis and extract comprehensive structural information. Calling the method on the vision instance triggers a multistage processing pipeline:

  • The local ICR engine loads AI models and performs initial document layout analysis to detect text blocks, regions, and candidate elements.
  • The VLM enhancement layer applies advanced language understanding to improve element classification, refine table cell boundaries, and optimize reading order determination.
  • Confidence scores are calculated for each detected element based on both visual features and semantic context.
  • A JSON-formatted string is generated containing the complete document structure, with element types, bounding boxes in pixel coordinates, hierarchical relationships, reading order sequences, and classification confidence scores.

The enhanced processing typically produces higher-accuracy element detection than pure ICR, particularly for complex layouts with nested tables, multicolumn formats, or specialized document types:

String contentJson = vision.extractContent();

Write the extracted structured content to a JSON file for downstream processing, analysis, or integration with other systems. The following code uses a nested try-with-resources statement with FileWriter to automatically close the file after writing, ensuring proper resource cleanup, even if an exception occurs during the write operation. The contentJson string contains the complete VLM-enhanced document structure in JSON format, including element types, bounding boxes, hierarchical relationships, reading order, and confidence scores. This JSON output can be integrated with document processing pipelines, stored in database systems with structured schemas, indexed for search capabilities, or analyzed with custom business logic for form data extraction, contract clause identification, or invoice line item parsing:

try (FileWriter writer = new FileWriter("output.json")) {
    writer.write(contentJson);
}
}
}
}
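Once output.json has been written, downstream code can consume it. The following is a minimal, purely illustrative sketch using only the JDK — it counts occurrences of a JSON key as a quick sanity check on the extracted structure. A production pipeline would parse the file with a real JSON library such as Jackson:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class InspectOutput {

    // Count occurrences of a quoted JSON key as a rough sanity check.
    // The key is treated literally; "type" contains no regex metacharacters.
    static long countKey(String json, String key) {
        return json.split("\"" + key + "\"", -1).length - 1;
    }

    public static void main(String[] args) throws Exception {
        // Assumes output.json was produced by the extraction step above.
        String json = Files.readString(Path.of("output.json"));
        System.out.println("\"type\" keys found: " + countKey(json, "type"));
    }
}
```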

Understanding the output

The extractContent() method in VLM-enhanced mode returns a comprehensive JSON structure representing the document layout with improved accuracy and semantic understanding. The VLM-enhanced engine generates detailed output, including:

  • Document elements — Text blocks, paragraphs, headings organized by semantic roles, tables with cell-level structures and header detection, figures with caption associations, mathematical equations with LaTeX representations, and form fields with key-value pair relationships
  • Bounding boxes — Position coordinates and dimensions for each element in pixel units, with enhanced accuracy in table cell boundary detection and overlapping element disambiguation
  • Hierarchical relationships — Parent-child element associations reflecting document organization, with improved detection of nested structures, section hierarchies, and content flow between columns or regions
  • Element classification — Type identification for each detected region, with confidence scores indicating classification certainty, enabling quality filtering and accuracy assessment
  • Reading order — Elements sorted by natural reading sequence with VLM-enhanced understanding of complex multicolumn layouts, sidebar content, and non-linear document structures
  • Semantic metadata — Enhanced element attributes, including text direction, language identification, font properties, and structural roles (headers, footers, page numbers)

The JSON format enables integration with advanced document processing pipelines, including intelligent form data extraction with field validation, contract analysis with clause-level parsing, invoice processing with line item extraction and totals verification, medical record digitization with clinical terminology understanding, and legal document analysis with section-based content organization.
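The exact schema depends on the SDK version, so consult the API reference for the authoritative structure. As a purely hypothetical illustration of the categories listed above, a single detected table element might be represented along these lines (all field names here are assumptions, not the documented schema):

```json
{
  "elements": [
    {
      "type": "table",
      "confidence": 0.94,
      "boundingBox": { "x": 72, "y": 310, "width": 468, "height": 220 },
      "readingOrder": 3,
      "children": [
        { "type": "tableCell", "confidence": 0.91, "row": 0, "column": 0, "text": "Invoice #" }
      ]
    }
  ]
}
```

Confidence scores at both the element and cell level support the quality filtering described above — for example, routing low-confidence tables to manual review.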

Error handling

The vision API throws NutrientException if content extraction fails due to image processing errors, model loading failures, or VLM enhancement processing issues. Handling this exception ensures robust error recovery in production environments.

Common failure scenarios include:

  • The image file can’t be read due to file system permissions, path errors, or file locking by another process
  • Image data is corrupted, truncated, or uses an unsupported encoding scheme, preventing proper decoding
  • Required ICR models aren’t installed or accessible, or they have insufficient file permissions in the model directory
  • Insufficient system memory for loading AI models (VLM-enhanced mode typically requires more memory than pure ICR due to additional language model processing)
  • VLM enhancement processing fails due to network connectivity issues if external VLM services are required
  • Unsupported image format, extremely low resolution preventing accurate character recognition, or image dimensions exceeding processing limits

In production code, wrap the extraction operations in a try-catch block that catches NutrientException, providing appropriate error messages to users and logging failure details — including exception messages and stack traces — for debugging. This pattern enables graceful degradation when content extraction fails: the application avoids crashing and can apply retry logic with different processing parameters (e.g. falling back to pure ICR mode if VLM enhancement fails), alternative extraction methods, or user notification for manual document review.
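The try-next-strategy structure can be sketched independently of the SDK. The example below uses plain java.util.concurrent.Callable stand-ins for the extraction strategies — the extractor bodies are placeholders, not actual SDK calls — to show how a VLM-enhanced attempt can fall back to a simpler mode:

```java
import java.util.List;
import java.util.concurrent.Callable;

public class FallbackExtraction {

    // Try each extraction strategy in order and return the first result
    // that succeeds; rethrow the last failure if every strategy fails.
    static String extractWithFallback(List<Callable<String>> extractors) throws Exception {
        Exception last = new IllegalStateException("no extractors provided");
        for (Callable<String> extractor : extractors) {
            try {
                return extractor.call();
            } catch (Exception e) {
                last = e; // log here in production, then try the next strategy
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // The first strategy fails (standing in for a VLM enhancement failure);
        // the second (standing in for pure ICR) succeeds.
        String result = extractWithFallback(List.of(
                () -> { throw new IllegalStateException("VLM enhancement unavailable"); },
                () -> "{\"elements\":[]}"
        ));
        System.out.println(result); // prints {"elements":[]}
    }
}
```

In a real integration, each Callable would wrap a full open-configure-extract sequence with a different VisionEngine setting.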

Conclusion

The VLM-enhanced content extraction workflow consists of several key operations:

  1. Open the image document using try-with-resources for automatic resource cleanup.
  2. Configure the vision settings with getSettings().getVisionSettings().setEngine(VisionEngine.VlmEnhancedIcr) to enable hybrid processing that combines local ICR AI models with vision language model capabilities.
  3. Create a vision instance with Vision.set() to bind content extraction operations to the document.
  4. Call extractContent() to invoke the VLM-enhanced pipeline, which performs initial ICR layout analysis, applies VLM enhancement for semantic understanding, calculates confidence scores, and returns a JSON-formatted string containing elements, bounding boxes, hierarchical relationships, reading order, and confidence scores.
  5. Write the JSON content to a file using try-with-resources with FileWriter for automatic resource management.
  6. Handle NutrientException errors for robust error recovery, with fallback strategies such as pure ICR mode.

VLM enhancement improves table cell boundary detection, element classification accuracy, and reading order determination for complex layouts, making it ideal for complex financial documents, legal contracts, medical records, or any documents requiring superior accuracy. The JSON output enables integration with intelligent form extraction, contract analysis, invoice processing, and legal document parsing.

Nutrient handles VLM-enhanced ICR engine configuration, AI model orchestration, hybrid processing coordination, semantic element classification with confidence scoring, table structure analysis with enhanced cell boundary detection, hierarchical relationship parsing, reading order optimization for complex layouts, and JSON schema generation with enhanced metadata. You don’t need to implement advanced document analysis algorithms or manage VLM capability integration manually. The VLM-enhanced system provides superior accuracy for complex financial document extraction, legal contract clause identification, medical record digitization with clinical terminology understanding, invoice line item parsing with totals verification, and any document processing scenario where improved accuracy justifies the additional processing overhead.

You can download this ready-to-use sample package, fully configured to help you explore the vision API capabilities.