---
title: "Extracting data from images using vision language models | Nutrient Python SDK"
canonical_url: "https://www.nutrient.io/guides/python/extraction/extract-data-from-image-vlm/"
md_url: "https://www.nutrient.io/guides/python/extraction/extract-data-from-image-vlm.md"
last_updated: "2026-06-09T10:32:42.848Z"
description: "Extract structured data from images using vision language models with Nutrient Python SDK."
---

# Extracting data from images using vision language models

Use VLM-enhanced ICR when you need higher extraction accuracy on complex documents.

Common use cases include:

- Financial documents with complex tables

- Invoices with varied layouts

- Medical records with specialized terminology

- Legal documents with strict structure requirements

- Multi-language document analysis

VLM-enhanced mode combines ICR layout analysis with language-model reasoning to improve classification and structure detection.

[Download sample](https://www.nutrient.io/downloads/samples/python/extract-data-from-image-vlm.zip)

## How Nutrient helps

Nutrient Python SDK handles VLM-enhanced configuration, model orchestration, and JSON output generation.

The SDK handles:

- Hybrid mode configuration for ICR + VLM processing

- Model loading and capability coordination

- Semantic classification and confidence scoring internals

- Complex layout analysis implementation details

## Prerequisites

Before running this sample, make sure your VLM setup is ready.

VLM-enhanced mode requires an external VLM endpoint. The SDK does not automatically provision or start a VLM service for you.

- Configure a reachable VLM endpoint in your environment.

- Configure `api_endpoint` in [Custom VLM API Settings](https://www.nutrient.io/api/python/settings/advanced/vision/custom-vlm-api-settings.md#api_endpoint).

- Configure `model` in [Custom VLM API Settings](https://www.nutrient.io/api/python/settings/advanced/vision/custom-vlm-api-settings.md#model).

- By default, the SDK may assume:
  - `api_endpoint`: `http://localhost:1234/v1`
  - `model`: `qwen/qwen3-vl-8b`

- For clarity and reliability, explicitly set both `api_endpoint` and `model` in your configuration.

- Example with [LM Studio](https://lmstudio.ai/):
  - Run LM Studio in server mode.
  - Load a compatible vision model such as Qwen3-VL (4B, 8B, or 32B depending on your hardware).
  - `api_endpoint`: `http://127.0.0.1:1234/v1`
  - `model`: `qwen/qwen3-vl-4b`

- Ensure the endpoint is running before calling `extract_content()` in VLM-enhanced mode.

If no VLM endpoint is available, VLM-enhanced extraction can fail at runtime.
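
One way to fail fast is to probe the endpoint before starting extraction. The sketch below is not part of the Nutrient SDK. It assumes the endpoint is OpenAI-compatible and answers `GET {api_endpoint}/models` with a JSON body, as the LM Studio server does. The helper name `vlm_endpoint_reachable` is hypothetical.

```python
import http.client
import json
import urllib.request


def vlm_endpoint_reachable(api_endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if GET {api_endpoint}/models answers with HTTP 200 and a JSON body.

    Assumption: the endpoint exposes an OpenAI-compatible `/models` route.
    """
    url = api_endpoint.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            json.load(response)  # raises ValueError if the body is not JSON
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError, refused connections, and timeouts.
        # ValueError covers malformed URLs and non-JSON bodies.
        # HTTPException covers malformed HTTP responses.
        return False
```

For example, calling `vlm_endpoint_reachable("http://127.0.0.1:1234/v1")` before `extract_content()` lets you report a configuration problem instead of waiting for a runtime failure. If your server requires authentication or uses a different health route, adapt the request accordingly.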

## Complete implementation

This example extracts structured JSON using `VisionEngine.VLM_ENHANCED_ICR`. The following sections build it step by step, starting with the imports:

```python

from nutrient_sdk import Document, Vision, VisionEngine

```

## Loading and processing the image

Open the image in a [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers) so resources are cleaned up after processing:

```python

with Document.open("input.png") as document:

```

## Configuring VLM-enhanced mode

Set the vision engine to `VisionEngine.VLM_ENHANCED_ICR`.

This mode improves:

- Table boundary detection

- Semantic element classification

- Reading order in complex layouts

- Understanding across document variations

```python

    document.settings.vision_settings.engine = VisionEngine.VLM_ENHANCED_ICR

```

## Creating a vision instance

Create a vision instance bound to the opened document with `Vision.set(document)`:

```python

    vision = Vision.set(document)

```

## Extracting structured content

Call `extract_content()` to run the VLM-enhanced pipeline.

In this mode, the pipeline performs:

- Initial ICR layout detection

- VLM-based semantic refinement

- Confidence scoring

- JSON generation with structure and coordinates

```python

    content_json = vision.extract_content()

```

Use this output for indexing, validation, storage, or custom analysis.

Write the JSON result to a file for downstream processing:

```python

    with open("output.json", "w", encoding="utf-8") as f:
        f.write(content_json)

```
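
If you want to confirm that the result parses before storing it, a small helper can validate and pretty-print it in one step. This is a minimal sketch that relies only on the standard library and on the guide's statement that `extract_content()` returns a JSON-formatted string. The function name `save_extraction_json` is hypothetical.

```python
import json
from pathlib import Path


def save_extraction_json(content_json: str, path: str) -> object:
    """Parse the extraction result, write it pretty-printed as UTF-8, and return the parsed data."""
    data = json.loads(content_json)  # raises json.JSONDecodeError on invalid JSON
    Path(path).write_text(
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        json.dumps(data, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return data
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults, which matters for multi-language documents.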

## Understanding the output

`extract_content()` returns structured JSON with layout and semantic metadata.

VLM-enhanced output includes:

- **Document elements** — Paragraphs, headings, tables, figures, equations, and form-related regions

- **Bounding boxes** — Pixel coordinates with improved boundary accuracy

- **Hierarchical relationships** — Parent-child structure across sections and blocks

- **Element classification** — Semantic types with confidence scores

- **Reading order** — Sequence for complex layouts and multicolumn content

- **Semantic metadata** — Additional attributes used in downstream processing

### Key output fields

The following are the most commonly included fields in VLM JSON output:

- **`text`** — Extracted text for the element.

- **`words`** — Per-word OCR/extraction results.

- **`bounds`** — Bounding box coordinates for the element or word.

- **`confidence`** — Confidence score for the element or word.

- **`readingOrder`** — Sequence in which elements should be read.

- **`id`** — Unique identifier for the extracted element.

- **`pageNumber`** — Source page number.

- **`type` / `role`** — Semantic type of the extracted block.

When an element contains only one word, element-level and word-level `bounds`/`confidence` can appear identical.
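
The fields above can be consumed with a generic walker. The sketch below makes assumptions about the schema that this guide does not confirm: elements are JSON objects nested somewhere in the result, each element carries a `text` field, and `pageNumber` and `readingOrder` are numbers. The sample payload and the helper names `iter_elements` and `in_reading_order` are hypothetical, so inspect your real output before relying on this shape.

```python
import json
from typing import Any, Iterable, Iterator


def iter_elements(node: Any) -> Iterator[dict]:
    """Yield every dict that carries a `text` field, without descending into `words`."""
    if isinstance(node, dict):
        if "text" in node:
            yield node
        for key, value in node.items():
            if key != "words":  # skip per-word entries so only elements are returned
                yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def in_reading_order(elements: Iterable[dict]) -> list[dict]:
    """Sort by (pageNumber, readingOrder); entries missing a field sort after the others."""
    def sort_key(element: dict):
        page = element.get("pageNumber")
        order = element.get("readingOrder")
        return (page is None, page or 0, order is None, order or 0)

    return sorted(elements, key=sort_key)


# Hypothetical payload for illustration only; real output may be structured differently.
sample = json.loads("""
{
  "elements": [
    {"id": "e2", "type": "paragraph", "text": "Total due: 120.00",
     "pageNumber": 1, "readingOrder": 2,
     "words": [{"text": "Total"}, {"text": "due:"}, {"text": "120.00"}]},
    {"id": "e1", "type": "heading", "text": "Invoice",
     "pageNumber": 1, "readingOrder": 1}
  ]
}
""")

for element in in_reading_order(iter_elements(sample)):
    print(element["readingOrder"], element["type"], element["text"])
```

The walker avoids hardcoding a top-level key, so it keeps working if elements are nested under sections or pages.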

### Confidence fields in VLM output

VLM output can contain two distinct confidence signals:

1. **`confidence` (or `classificationConfidence`)** — Zone classification confidence
   - **Definition**: How confident the model is in semantic classification (for example, text, heading, table, image), heading level detection, and language detection.
   - **Scale**: `0.0` to `1.0` (float).
   - **Interpretation**:
     - `0.0` = no confidence (often treated as unknown classification)
     - `1.0` = maximum confidence
   - **Use**: Decide whether to trust semantic zone labels in downstream logic.

2. **`textConfidence`** — Text extraction confidence
   - **Definition**: How confident the model is in the extracted text quality for a zone.
   - **Scale**: Categorical values: `high`, `medium`, `low` (not numeric).
   - **Interpretation**:
     - `high` = strong confidence in extracted text
     - `medium` = moderate confidence
     - `low` = uncertain text quality
   - **Use**: Prioritize review, fallback, or fusion strategies for lower-confidence text.

Use this JSON for form extraction, contract analysis, invoice parsing, and other high-accuracy workflows.
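
The two confidence signals described above can be combined into a simple review filter. This is a sketch under stated assumptions: the `0.8` threshold and the choice to flag only `low` text confidence are illustrative defaults, not SDK recommendations, and the helper name `needs_review` is hypothetical. Missing signals are not flagged.

```python
def needs_review(
    element: dict,
    min_confidence: float = 0.8,               # hypothetical threshold; tune for your data
    review_levels: frozenset = frozenset({"low"}),
) -> bool:
    """Return True when a present confidence signal is weak."""
    # Prefer `classificationConfidence` and fall back to `confidence`.
    score = element.get("classificationConfidence", element.get("confidence"))
    weak_classification = isinstance(score, (int, float)) and score < min_confidence

    # `textConfidence` is categorical: high, medium, or low.
    text_level = str(element.get("textConfidence", "")).lower()
    weak_text = text_level in review_levels

    return weak_classification or weak_text
```

To route `medium` text to review as well, pass `review_levels=frozenset({"low", "medium"})`.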

## Error handling

The Vision API raises `VisionException` when extraction fails.

Common failure scenarios include:

- The image file can’t be read due to path or permission issues

- Image data is corrupted or unsupported

- Required models are missing or inaccessible

- Available memory is insufficient for VLM-enhanced processing

- The VLM endpoint is unreachable or returns an error during VLM enhancement

- Image format, resolution, or dimensions are unsupported

In production code:

- Catch `VisionException`.

- Return a clear error message.

- Log failure details for debugging.

- Add fallback logic (for example, retry in ICR mode).
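
The fallback step can be sketched as a small wrapper that is independent of the SDK. The helper name `extract_with_fallback` is hypothetical. It takes two callables and the exception type to catch, so it makes no assumptions about where `VisionException` is imported from or what the pure ICR engine member is called. Check the API reference for both.

```python
import logging
from typing import Callable, Type

logger = logging.getLogger(__name__)


def extract_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    error_type: Type[Exception],
) -> str:
    """Run `primary`; if it raises `error_type`, log the failure and run `fallback`.

    Other exception types, and errors raised by `fallback`, propagate to the caller.
    """
    try:
        return primary()
    except error_type as error:
        logger.warning("Primary extraction failed, retrying with fallback: %s", error)
        return fallback()
```

In practice, `primary` would set the engine to `VisionEngine.VLM_ENHANCED_ICR` and call `extract_content()`, `fallback` would do the same with the ICR engine, and `error_type` would be `VisionException`.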

## Conclusion

Key points for VLM-enhanced extraction:

1. Open the image document using a [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers) for automatic resource cleanup.

2. Configure the vision settings by assigning `VisionEngine.VLM_ENHANCED_ICR` to the `vision_settings.engine` property for enhanced accuracy.

3. VLM-enhanced mode combines local ICR AI models with vision language model capabilities for superior document analysis.

4. Create a vision instance with `Vision.set()` to bind content extraction operations to the document.

5. Call `extract_content()` to invoke the VLM-enhanced processing pipeline.

6. The pipeline performs initial ICR layout analysis, applies VLM enhancement for semantic understanding, calculates confidence scores, and generates JSON output.

7. VLM enhancement improves table cell boundary detection, element classification accuracy, and reading order determination for complex layouts.

8. The method returns a JSON-formatted string containing document structure with elements, bounding boxes, hierarchical relationships, reading order, and confidence scores.

9. Write the JSON content to a file using Python’s built-in file handling with [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers) syntax for automatic resource management.

10. Handle `VisionException` errors for robust error recovery with fallback strategies like pure ICR mode.

11. The JSON output enables integration with intelligent form extraction, contract analysis, invoice processing, and legal document parsing.

12. VLM-enhanced mode is ideal for complex documents where extraction accuracy is the priority.

For related image extraction workflows, refer to the [Python SDK guides](https://www.nutrient.io/guides/python.md).

Download [this ready-to-use sample package](https://www.nutrient.io/downloads/samples/python/extract-data-from-image-vlm.zip) to explore VLM-enhanced extraction.

---

## Related pages

- [Speeding up first ICR operation by predownloading models](/guides/python/extraction/speed-up-first-icr-by-downloading-requirements.md)
- [Extracting text from PDF documents](/guides/python/extraction/pdf-to-text.md)
- [Extracting text from multilingual images](/guides/python/extraction/read-text-from-image-multi-language.md)
- [Extracting structured data from documents](/guides/python/extraction/extract-structured-data.md)
- [Generating image descriptions using Claude](/guides/python/extraction/describe-image-with-claude.md)
- [Generating image descriptions using OpenAI](/guides/python/extraction/describe-image-with-openai.md)
- [Extracting text from images](/guides/python/extraction/read-text-from-image.md)
- [Generating image descriptions using local AI](/guides/python/extraction/describe-image-with-local-ai.md)
- [Nutrient Python SDK extraction guides](/guides/python/extraction.md)
- [Applying OCR to a PDF document](/guides/python/extraction/apply-ocr-to-pdf.md)
- [Extracting form fields from images](/guides/python/extraction/extract-form-fields-from-image.md)
- [Extracting data from images using OCR](/guides/python/extraction/extract-data-from-image-ocr.md)
- [Applying OCR to a PDF page](/guides/python/extraction/apply-ocr-to-pdf-page.md)
- [Labeling form fields with a vision language model](/guides/python/extraction/label-form-fields-with-vlm.md)
- [Extracting structured JSON data from PDF documents](/guides/python/extraction/json-data-extraction.md)
- [Extracting data from images using ICR](/guides/python/extraction/extract-data-from-image-icr.md)

