Extracting data from images using vision language models
Use VLM-enhanced ICR when you need higher extraction accuracy on complex documents.
Common use cases include:
- Financial documents with complex tables
- Invoices with varied layouts
- Medical records with specialized terminology
- Legal documents with strict structure requirements
- Multi-language document analysis
VLM-enhanced mode combines ICR layout analysis with language-model reasoning to improve classification and structure detection.
How Nutrient helps
Nutrient Java SDK handles VLM-enhanced configuration, model orchestration, and JSON output generation.
The SDK handles:
- Configuring hybrid processing modes combining local ICR with VLM enhancement
- Managing AI model loading and VLM capability coordination
- Implementing advanced semantic element classification with confidence scoring
- Analyzing complex document structure with enhanced layout understanding
Complete implementation
This example extracts structured JSON using VisionEngine.VlmEnhancedIcr:
```java
package io.nutrient.Sample;
```

Import the required classes from the SDK:

```java
import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.enums.VisionEngine;
import io.nutrient.sdk.exceptions.NutrientException;

import java.io.FileWriter;
import java.io.IOException;
```
```java
public class ExtractDataFromImage {
```

Create the main method and declare the thrown exceptions:

```java
public static void main(String[] args) throws NutrientException, IOException {
```

Loading and processing the image
Open the image in try-with-resources so resources are cleaned up after processing:

```java
try (Document document = Document.open("input.png")) {
```

Configuring VLM-enhanced mode
Set the vision engine to VisionEngine.VlmEnhancedIcr.
This mode improves:
- Table boundary detection
- Semantic element classification
- Reading order in complex layouts
- Consistent interpretation across document variations
```java
// Configure the VLM-enhanced ICR engine for improved accuracy.
document.getSettings().getVisionSettings().setEngine(VisionEngine.VlmEnhancedIcr);
```

Creating a vision instance
Create a vision instance bound to the document with Vision.set(document):
```java
Vision vision = Vision.set(document);
```

Extracting structured content
Call extractContent() to run the VLM-enhanced pipeline.
In this mode, the pipeline performs:
- Initial ICR layout detection
- VLM-based semantic refinement
- Confidence scoring
- JSON generation with structure and coordinates
```java
String contentJson = vision.extractContent();
```

Write the JSON result to a file for downstream processing.
Use this output for indexing, validation, storage, or custom analysis:

```java
            try (FileWriter writer = new FileWriter("output.json")) {
                writer.write(contentJson);
            }
        } // Close the Document try-with-resources block.
    } // Close main.
} // Close the class.
```

Understanding the output
extractContent() returns structured JSON with layout and semantic metadata.
VLM-enhanced output includes:
- Document elements — Paragraphs, headings, tables, figures, equations, and form-related regions
- Bounding boxes — Pixel coordinates with improved boundary accuracy
- Hierarchical relationships — Parent-child structure across sections and blocks
- Element classification — Semantic types with confidence scores
- Reading order — Sequence for complex layouts and multicolumn content
- Semantic metadata — Additional attributes used in downstream processing
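To make the structure above concrete, here is a hypothetical shape the output might take. The field names and values are illustrative only, not the SDK's actual schema:

```json
{
  "elements": [
    {
      "type": "table",
      "confidence": 0.97,
      "boundingBox": { "x": 48, "y": 120, "width": 520, "height": 240 },
      "readingOrder": 2,
      "children": []
    }
  ]
}
```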
Use this JSON for form extraction, contract analysis, invoice parsing, and other high-accuracy workflows.
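Downstream consumers typically filter extracted elements by confidence before trusting them. A minimal sketch of that step, where the `Element` record and its field names are assumptions for illustration rather than the SDK's actual types:

```java
import java.util.List;

public class FilterByConfidence {
    // Hypothetical element shape mirroring the JSON fields described above.
    record Element(String type, double confidence) {}

    // Keep only elements whose confidence meets the threshold.
    static List<Element> confident(List<Element> elements, double threshold) {
        return elements.stream()
                .filter(e -> e.confidence() >= threshold)
                .toList();
    }

    public static void main(String[] args) {
        List<Element> elements = List.of(
                new Element("table", 0.97),
                new Element("heading", 0.64),
                new Element("paragraph", 0.88));
        // Two of the three elements clear the 0.8 threshold.
        System.out.println(confident(elements, 0.8).size()); // prints 2
    }
}
```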
Error handling
The Vision API throws VisionException when extraction fails.
Common failure scenarios include:
- The image file can’t be read due to path or permission issues
- Image data is corrupted or unsupported
- Required models are missing or inaccessible
- Available memory is insufficient for VLM-enhanced processing
- VLM enhancement fails due to connectivity or service issues when applicable
- Image format, resolution, or dimensions are unsupported
In production code:
- Catch VisionException.
- Return a clear error message.
- Log failure details for debugging.
- Add fallback logic (for example, retry in ICR mode).
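The fallback pattern above can be sketched as a generic helper. The suppliers stand in for VLM-enhanced and pure ICR extraction runs, and a generic RuntimeException replaces VisionException so the sketch runs standalone:

```java
import java.util.function.Supplier;

public class ExtractWithFallback {
    // Try the primary extraction; on failure, retry with the fallback engine.
    // In a real workflow, log the exception before falling back.
    static String extract(Supplier<String> primary, Supplier<String> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        String result = extract(
                () -> { throw new IllegalStateException("VLM enhancement unavailable"); },
                () -> "icr-result");
        System.out.println(result); // prints icr-result
    }
}
```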
Conclusion
Use this workflow for VLM-enhanced extraction:
- Open the image document using try-with-resources for automatic resource cleanup.
- Configure the vision settings with getSettings().getVisionSettings().setEngine(VisionEngine.VlmEnhancedIcr) for enhanced accuracy.
- VLM-enhanced mode combines local ICR AI models with vision language model capabilities for superior document analysis.
- Create a vision instance with Vision.set() to bind content extraction operations to the document.
- Call extractContent() to invoke the VLM-enhanced processing pipeline.
- The pipeline performs initial ICR layout analysis, applies VLM enhancement for semantic understanding, calculates confidence scores, and generates JSON output.
- VLM enhancement improves table cell boundary detection, element classification accuracy, and reading order determination for complex layouts.
- The method returns a JSON-formatted string containing document structure with elements, bounding boxes, hierarchical relationships, reading order, and confidence scores.
- Write the JSON content to a file using try-with-resources with FileWriter for automatic resource management.
- Handle VisionException errors for robust error recovery, with fallback strategies such as pure ICR mode.
- The JSON output enables integration with intelligent form extraction, contract analysis, invoice processing, and legal document parsing.
- VLM-enhanced mode is ideal for complex documents where extraction accuracy is the priority.
For related image extraction workflows, refer to the Java SDK guides.
Download this ready-to-use sample package to explore VLM-enhanced extraction.