Generating image descriptions using local AI
Generating accessible image descriptions with local AI models lets teams build privacy-preserving applications, eliminate cloud API costs, and keep complete control over their vision processing infrastructure. Typical use cases include on-premises accessibility systems that generate alt text without sending data to external services, medical imaging applications that must keep patient data local for HIPAA compliance, classified document processing systems where images cannot leave secure networks, high-volume batch processing workflows that avoid per-image API costs, and offline-capable mobile applications that describe images without internet connectivity. Image description operations include analyzing image content with a locally hosted vision language model (VLM), generating concise descriptions focused on the main subject and key details, producing accessibility-compliant descriptions for screen readers, extracting semantic meaning from charts, diagrams, and photographs, and providing contextual understanding beyond simple object detection, all without external API dependencies.
This guide demonstrates using locally hosted AI models (OpenAI-compatible endpoints) as the VLM provider for generating image descriptions through the Nutrient vision API. Local VLM servers like LM Studio, Ollama, or self-hosted inference servers provide on-premises vision processing with full data privacy, zero per-image costs, and no internet requirements.
How Nutrient helps you achieve this
The Nutrient Python SDK handles vision API integration, local VLM server configuration, and image processing pipelines. With the SDK, you don’t need to worry about:
- Managing local VLM server communication, endpoint configuration, and OpenAI-compatible API formatting
- Encoding image data and handling multimodal request structures for local models
- Configuring model parameters like temperature, max tokens, and server-specific settings
- Complex error handling for local server failures and model loading issues
Instead, Nutrient provides an API that handles all the complexity behind the scenes, enabling you to focus on your business logic.
Complete implementation
Below is a complete working example that demonstrates generating accessible image descriptions using locally hosted AI models through OpenAI-compatible endpoints. The following lines set up the Python application. Start by importing the required classes from the SDK:
from nutrient_sdk import Document, Vision

Opening the image file and configuring the local server
Open the image file using a context manager and optionally configure the local VLM server settings. The following code opens an image file (PNG format in this example) using Document.open() with a file path parameter. The context manager pattern (using the with statement) ensures the document is properly closed after processing, releasing memory and image data regardless of whether description generation succeeds or fails. The SDK supports multiple image formats, including PNG, JPEG, GIF, BMP, and TIFF. The vision API automatically uses the default local VLM server configuration (endpoint: http://localhost:1234/v1, model: qwen/qwen3-vl-4b) unless explicitly configured through document.settings.custom_vlm_api_settings. The optional configuration shown sets the API endpoint and model identifier — these property assignments are only needed if your local server uses non-default settings:
with Document.open("input_photo.png") as document:
    # Optional: Configure the VLM API endpoint settings
    # These settings are customizable based on your VLM provider
    vlm_settings = document.settings.custom_vlm_api_settings
    vlm_settings.api_endpoint = "http://localhost:1234/v1"
    vlm_settings.model = "qwen/qwen3-vl-4b"

Creating a vision instance
Create a vision API instance bound to the document for image analysis with the local AI model. The following code uses the Vision.set() static method with the document parameter to create a vision processor. The vision API communicates with a locally hosted VLM server (like LM Studio, Ollama, or custom inference servers) running on localhost:1234 by default. The server must be running and have a vision-capable model (such as Qwen VL, LLaVA, or similar multimodal models) loaded before calling the vision API. The local server provides the same OpenAI-compatible chat completion interface used by cloud providers, enabling seamless local-to-cloud migration:
    vision = Vision.set(document)

Generating the description
Generate a natural language description by sending the image to the local VLM server for analysis. The following code calls vision.describe(), which encodes the image data, constructs an OpenAI-compatible multimodal API request with the image payload, sends the request to the local VLM server endpoint, and parses the response to extract the description text. The local model analyzes the image content and returns a natural language description as a string. The description focuses on the main subject and key visual details observable in the image, optimized for accessibility purposes and screen reader compatibility:
    description = vision.describe()

Outputting the description
Print the generated description to the console for review or logging. The following code prints a header line followed by the description text returned by the local model. In production applications, you might save the description to a database, write it to a file, or use it to populate alt text attributes in HTML documents. The context manager automatically closes the document after the print statements complete, ensuring proper resource cleanup:
print("Image description:") print(description)Understanding the output
Understanding the output

The describe() method returns a natural language description of the image content generated by the local VLM model. Local models analyze the visual content and generate descriptions with specific characteristics:
- Concise — Focused on the main subject and key details without unnecessary verbosity, typically 1–3 sentences depending on model configuration.
- Accessible — Written for users who cannot see the image, following accessibility best practices for alt text.
- Accurate — Describes only what is clearly visible in the image, avoiding speculation or interpretation beyond observable details.
- Model-dependent quality — Description quality varies by model size and architecture (larger models like Qwen 7B produce more nuanced descriptions than smaller 4B variants).
The generated descriptions are suitable for accessibility compliance (WCAG alt text requirements), content management metadata, searchable image cataloging, and document accessibility workflows — all processed locally without external API calls.
Configuring the VLM API endpoint
The vision API uses OpenAI-compatible endpoints for local VLM servers, enabling integration with tools like LM Studio, Ollama, vLLM, or custom inference servers. Configure the local server through custom_vlm_api_settings property assignments if your setup differs from the defaults, as shown in the sketch after the following list:
- api_endpoint — The base URL for the OpenAI-compatible API (default: http://localhost:1234/v1 for local VLM servers like LM Studio).
- api_key — Your API key for authentication with the VLM service (optional for most local servers, required for secured deployments).
- model — The model identifier to use (default: qwen/qwen3-vl-4b for local models; common alternatives include llava:7b and llama3.2-vision:11b).
- temperature — Controls response creativity (0.0 = deterministic descriptions, 1.0 = creative descriptions with varied phrasing).
- max_tokens — Maximum tokens in the response (-1 for unlimited, recommended 512–1,024 for descriptions).
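For example, a minimal configuration sketch using these settings might look like the following. The endpoint and model values target an Ollama server (listed in the setup examples below), and the API key, temperature, and token limit values are illustrative assumptions rather than required values:

```python
from nutrient_sdk import Document, Vision

with Document.open("input_photo.png") as document:
    vlm_settings = document.settings.custom_vlm_api_settings

    # Point the vision API at a non-default local server (Ollama's default endpoint).
    vlm_settings.api_endpoint = "http://localhost:11434/v1"
    vlm_settings.model = "llava:7b"

    # Only needed for secured deployments; most local servers skip authentication.
    vlm_settings.api_key = "local-test-key"

    # Deterministic descriptions with a bounded response length.
    vlm_settings.temperature = 0.0
    vlm_settings.max_tokens = 512

    description = Vision.set(document).describe()
    print(description)
```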
Local server setup examples:
- LM Studio — Start the server with a vision model loaded; default endpoint http://localhost:1234/v1.
- Ollama — Run ollama serve with a vision model loaded; default endpoint http://localhost:11434/v1.
- vLLM — Launch with the --api-key flag for authentication and custom port configuration.
The SDK automatically handles endpoint URL formatting, request/response parsing, and error retry logic for local server communication.
Error handling
The Python SDK raises NutrientException if vision operations fail due to processing errors, local server failures, or configuration issues. Exception handling ensures robust error recovery in production environments.
Common failure scenarios include:
- The input image file can’t be read due to file system permissions, path errors, or unsupported image formats
- Local VLM server isn’t running or isn’t reachable at the configured endpoint (connection refused errors)
- VLM server has no vision-capable model loaded or uses incompatible model architecture
- Server response timeout when processing high-resolution images or using large models
- Insufficient GPU/CPU memory on the local machine for model inference
- Invalid server configuration (wrong endpoint URL, port not listening, authentication failure)
In production code, wrap the vision operations in a try-except block that catches NutrientException instances, provides appropriate error messages to users, and logs failure details for debugging, as sketched below. This pattern enables graceful degradation when vision processing fails: the application doesn’t crash, transient server issues can be retried, and server setup problems that require manual intervention can be surfaced to the user.
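The following is a minimal sketch of that pattern. It assumes NutrientException can be imported from the top-level nutrient_sdk package; adjust the import path to match your installation:

```python
from typing import Optional

# The import path for NutrientException is an assumption; adjust as needed.
from nutrient_sdk import Document, NutrientException, Vision

def describe_image(path: str) -> Optional[str]:
    """Return a generated description for the image at `path`, or None if vision processing fails."""
    try:
        with Document.open(path) as document:
            vision = Vision.set(document)
            return vision.describe()
    except NutrientException as error:
        # Covers unreadable files, unreachable or misconfigured local VLM servers,
        # missing vision models, timeouts, and other vision processing failures.
        print(f"Failed to describe {path}: {error}")
        return None

description = describe_image("input_photo.png")
if description is not None:
    print("Image description:")
    print(description)
```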
Conclusion
The local AI image description workflow consists of several key operations:
- Open the image file using a context manager for automatic resource cleanup.
- The SDK supports multiple image formats, including PNG, JPEG, GIF, BMP, and TIFF.
- The vision API uses OpenAI-compatible endpoints for local VLM servers by default.
- Default configuration connects to http://localhost:1234/v1 with model qwen/qwen3-vl-4b.
- Supported local VLM servers include LM Studio, Ollama, vLLM, and custom inference servers.
- Optionally configure custom server settings through document.settings.custom_vlm_api_settings property assignments.
- Create a vision instance with Vision.set() bound to the document for local AI processing.
- Generate the description with vision.describe(), which sends the image to the local server endpoint and returns natural language text.
- The SDK encodes image data, constructs OpenAI-compatible multimodal requests, and parses responses automatically.
- Generated descriptions are concise (1–3 sentences), accessible (WCAG-compliant alt text), accurate (observable details only), and model-dependent.
- Description quality varies by model size — larger models (7B+) produce more nuanced descriptions than smaller variants (4B).
- Print or save the description for use in accessibility systems, content management, or cataloging workflows.
- Handle NutrientException for vision processing failures, including server unavailable, model not loaded, or timeout errors.
- The context manager ensures proper resource cleanup when processing completes or exceptions occur.
Nutrient handles local VLM server communication, OpenAI-compatible API formatting, image encoding, endpoint configuration, model parameter management, and response parsing so you don’t need to understand local AI server protocols or manage complex vision service integration manually. The local image description system provides privacy and cost control for on-premises accessibility systems generating alt text without external services, medical imaging applications requiring HIPAA compliance with patient data remaining local, classified document processing where images cannot leave secure networks, high-volume batch processing avoiding per-image API costs, and offline-capable applications describing images without internet connectivity.
You can download this ready-to-use sample package, fully configured to help you explore the vision API description capabilities with local AI models.