Generating accessible image descriptions using local AI models enables teams to build privacy-preserving applications, eliminate cloud API costs, and maintain complete control over vision processing infrastructure. Whether you’re creating on-premises accessibility systems that generate alt text without sending data to external services, building medical imaging applications requiring HIPAA compliance with patient data remaining local, implementing classified document processing systems where images cannot leave secure networks, creating high-volume batch processing workflows that avoid per-image API costs, or building offline-capable mobile applications that describe images without internet connectivity, local AI-powered image description provides privacy, cost control, and infrastructure independence. Image description operations include analyzing image content with locally hosted vision language models (VLMs), generating concise descriptions focused on main subjects and key details, producing accessibility-compliant descriptions for screen readers, extracting semantic meaning from charts, diagrams, and photographs, and providing contextual understanding beyond simple object detection — all without external API dependencies.

This guide demonstrates using locally hosted AI models (OpenAI-compatible endpoints) as the VLM provider for generating image descriptions through the Nutrient vision API. Local VLM servers like LM Studio, Ollama, or self-hosted inference servers provide on-premises vision processing with full data privacy, zero per-image costs, and no internet requirements.

How Nutrient helps you achieve this

Nutrient Java SDK handles vision API integration, local VLM server configuration, and image processing pipelines. With the SDK, you don’t need to worry about:

  • Managing local VLM server communication, endpoint configuration, and OpenAI-compatible API formatting
  • Encoding image data and handling multimodal request structures for local models
  • Configuring model parameters like temperature, max tokens, and server-specific settings
  • Complex error handling for local server failures and model loading issues

Instead, Nutrient provides an API that handles all the complexity behind the scenes, enabling you to focus on your business logic.

Complete implementation

Below is a complete working example that demonstrates generating accessible image descriptions using locally hosted AI models through OpenAI-compatible endpoints. The following lines set up the Java application. Start by specifying a package name and importing the required classes:

package io.nutrient.Sample;

import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.exceptions.NutrientException;

import java.io.FileWriter;
import java.io.IOException;

public class DescribeImageWithLocalAi {

Create the main method. The following code defines the main entry point that will contain the image description logic. The throws NutrientException, IOException clause declares the exceptions that may occur during document processing, vision operations with the local AI server, or file I/O operations:

public static void main(String[] args) throws NutrientException, IOException {

Opening the image file

Open the image file using a try-with-resources statement for automatic resource cleanup. The following code opens an image file (PNG format in this example) using Document.open() with a file path parameter. The try-with-resources pattern ensures the document is properly closed after processing, releasing memory and image data regardless of whether description generation succeeds or fails. The SDK supports multiple image formats, including PNG, JPEG, GIF, BMP, and TIFF. The vision API automatically uses the default local VLM server configuration (endpoint: http://localhost:1234/v1, model: qwen/qwen3-vl-4b) unless explicitly configured otherwise through vision settings:

try (Document document = Document.open("input_photo.png")) {

Creating a vision instance

Create a vision API instance bound to the document for image analysis with the local AI model. The following code uses the Vision.set() static factory method with the document parameter to create a vision processor. The vision API communicates with a locally hosted VLM server (like LM Studio, Ollama, or custom inference servers) running on localhost:1234 by default. The server must be running and have a vision-capable model (such as Qwen VL, LLaVA, or similar multimodal models) loaded before calling the vision API. The local server provides the same OpenAI-compatible chat completion interface used by cloud providers, enabling seamless local-to-cloud migration:

Vision vision = Vision.set(document);

Generating the description

Generate a natural language description by sending the image to the local VLM server for analysis. The following code calls vision.describe(), which encodes the image data, constructs an OpenAI-compatible multimodal API request with the image payload, sends the request to the local VLM server endpoint, and parses the response to extract the description text. The local model analyzes the image content and returns a natural language description as a string. The description focuses on the main subject and key visual details observable in the image, optimized for accessibility purposes and screen reader compatibility:

String description = vision.describe();

Saving the description

Write the generated description to a text file for storage or further processing. The following code uses a try-with-resources statement with FileWriter to create "output.txt" and write the description string to it. The writer is automatically closed after writing completes, flushing its buffers and releasing the file handle, while the outer try-with-resources block closes the document, ensuring proper cleanup of both the document and the file resources:

try (FileWriter writer = new FileWriter("output.txt")) {
    writer.write(description);
} // Closes the FileWriter try-with-resources block.
} // Closes the document try-with-resources block.
} // Closes main().
} // Closes the DescribeImageWithLocalAi class.

Understanding the output

The describe() method returns a natural language description of the image content generated by the local VLM. Local models analyze the visual content and generate descriptions with the following characteristics:

  • Concise — Focused on the main subject and key details without unnecessary verbosity, typically 1–3 sentences depending on model configuration.
  • Accessible — Written for users who cannot see the image, following accessibility best practices for alt text.
  • Accurate — Describes only what is clearly visible in the image, avoiding speculation or interpretation beyond observable details.
  • Model-dependent quality — Description quality varies by model size and architecture (larger models like Qwen 7B produce more nuanced descriptions than smaller 4B variants).

The generated descriptions are suitable for accessibility compliance (WCAG alt text requirements), content management metadata, searchable image cataloging, and document accessibility workflows — all processed locally without external API calls.
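
For example, an application might drop the returned text straight into an HTML img tag as WCAG alt text. The following fragment is an illustrative sketch only: it reuses the description string from the example above, and the escaping logic and file name are application-level choices, not SDK APIs.

// Illustrative only: reuse the description from the example above as HTML alt text.
// The minimal escaping below and the image file name are application-level examples, not SDK APIs.
String altText = description.trim()
        .replace("&", "&amp;")
        .replace("\"", "&quot;")
        .replace("<", "&lt;")
        .replace(">", "&gt;");
String imgTag = "<img src=\"input_photo.png\" alt=\"" + altText + "\">";
System.out.println(imgTag);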

Configuring the VLM API endpoint

The vision API uses OpenAI-compatible endpoints for local VLM servers, enabling integration with tools like LM Studio, Ollama, vLLM, or custom inference servers. Configure the local server through CustomVlmApiSettings if your setup differs from defaults:

  • ApiEndpoint — The base URL for the OpenAI-compatible API (default: http://localhost:1234/v1 for local VLM servers like LM Studio).
  • ApiKey — Your API key for authentication with the VLM service (optional for most local servers, required for secured deployments).
  • Model — The model identifier to use (default: qwen/qwen3-vl-4b for local models — common alternatives include llava:7b, llama3.2-vision:11b).
  • Temperature — Controls response creativity (0.0 = deterministic descriptions, 1.0 = creative descriptions with varied phrasing).
  • MaxTokens — Maximum tokens in the response (-1 for unlimited, recommended 512–1,024 for descriptions).

Local server setup examples:

  • LM Studio — Start server with vision model loaded, default endpoint http://localhost:1234/v1
  • Ollama — Run ollama serve with vision model, default endpoint http://localhost:11434/v1
  • vLLM — Launch with --api-key flag for authentication, custom port configuration

The SDK automatically handles endpoint URL formatting, request/response parsing, and error retry logic for local server communication.
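
The exact shape of CustomVlmApiSettings isn't shown in this guide, so the following is only a minimal sketch: it assumes a plain settings object with setters named after the options above and an overload that attaches the settings when the vision instance is created. Check the SDK reference for the actual signatures before relying on it.

// Sketch only: the setter names and the Vision.set(document, settings) overload are
// assumptions based on the option names listed above, not confirmed SDK APIs.
CustomVlmApiSettings settings = new CustomVlmApiSettings();
settings.setApiEndpoint("http://localhost:11434/v1"); // e.g. a local Ollama server
settings.setApiKey("");                               // most local servers need no key
settings.setModel("llama3.2-vision:11b");             // any vision-capable model the server has loaded
settings.setTemperature(0.0);                         // deterministic descriptions
settings.setMaxTokens(512);                           // cap the response length

Vision vision = Vision.set(document, settings);       // hypothetical overload; see the SDK reference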

Error handling

The Java SDK throws NutrientException if vision operations fail due to processing errors, local server failures, or configuration issues. The main method also declares IOException for file I/O operations. Exception handling ensures robust error recovery in production environments.

Common failure scenarios include:

  • The input image file can’t be read due to file system permissions, path errors, or unsupported image formats
  • Local VLM server isn’t running or isn’t reachable at the configured endpoint (connection refused errors)
  • VLM server has no vision-capable model loaded or uses incompatible model architecture
  • Server response timeout when processing high-resolution images or using large models
  • Insufficient GPU/CPU memory on the local machine for model inference
  • Invalid server configuration (wrong endpoint URL, port not listening, authentication failure)
  • File writing failures due to disk space, permissions, or path errors when saving output descriptions

In production code, wrap the vision operations in a try-catch block to catch NutrientException and IOException instances, providing appropriate error messages to users and logging failure details, including exception messages and stack traces for debugging. This error handling pattern enables graceful degradation when vision processing fails, preventing application crashes and enabling retry logic for transient server issues or user notification for server setup problems requiring manual intervention.
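
A production-oriented variant of the example above might look like the following sketch; the logging shown is a placeholder for your own error reporting and retry infrastructure.

try (Document document = Document.open("input_photo.png")) {
    Vision vision = Vision.set(document);
    String description = vision.describe();

    try (FileWriter writer = new FileWriter("output.txt")) {
        writer.write(description);
    }
} catch (NutrientException e) {
    // Vision processing failed: server unreachable, no vision model loaded, timeout, and so on.
    // Log the details and decide whether to retry or surface a setup problem to the user.
    System.err.println("Image description failed: " + e.getMessage());
    e.printStackTrace();
} catch (IOException e) {
    // File I/O failed: unreadable input path, disk space, or permission problems.
    System.err.println("File operation failed: " + e.getMessage());
    e.printStackTrace();
}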

Conclusion

The local AI image description workflow consists of several key operations and characteristics:

  1. Open the image file using try-with-resources for automatic resource cleanup.
  2. The SDK supports multiple image formats, including PNG, JPEG, GIF, BMP, and TIFF.
  3. The vision API uses OpenAI-compatible endpoints for local VLM servers by default.
  4. Default configuration connects to http://localhost:1234/v1 with model qwen/qwen3-vl-4b.
  5. Supported local VLM servers include LM Studio, Ollama, vLLM, and custom inference servers.
  6. Create a vision instance with Vision.set() bound to the document for local AI processing.
  7. Generate the description with vision.describe(), which sends the image to the local server endpoint and returns natural language text.
  8. The SDK encodes image data, constructs OpenAI-compatible multimodal requests, and parses responses automatically.
  9. Generated descriptions are concise (1–3 sentences), accessible (WCAG-compliant alt text), accurate (observable details only), and model-dependent.
  10. Description quality varies by model size — larger models (7B+) produce more nuanced descriptions than smaller variants (4B).
  11. Write the description to a file using try-with-resources with FileWriter for automatic resource cleanup.
  12. Handle NutrientException for vision processing failures, including server unavailable, model not loaded, or timeout errors.
  13. Handle IOException for file operations, including read failures or write errors when saving output.
  14. Configure custom endpoints through CustomVlmApiSettings for non-default server configurations (custom ports, secured deployments, alternative models).

Nutrient handles local VLM server communication, OpenAI-compatible API formatting, image encoding, endpoint configuration, model parameter management, and response parsing so you don’t need to understand local AI server protocols or manage complex vision service integration manually. The local image description system provides privacy and cost control for on-premises accessibility systems generating alt text without external services, medical imaging applications requiring HIPAA compliance with patient data remaining local, classified document processing where images cannot leave secure networks, high-volume batch processing avoiding per-image API costs, and offline-capable applications describing images without internet connectivity.

You can download this ready-to-use sample package, fully configured to help you explore the vision API description capabilities with local AI models.