Nutrient Vision API
Turn document images into natural language descriptions programmatically. Generate WCAG-compliant alt text, catalog visual content, and build accessible document workflows. Choose your vision language model — Claude for nuanced reasoning, OpenAI for enterprise scale, or local models for complete data privacy.
Generate concise, accurate, and contextual image descriptions that meet accessibility standards. Automate alt text for document remediation at scale.
Claude for nuanced visual reasoning. OpenAI for enterprise cloud scale. Local models via Ollama, LM Studio, or vLLM for complete data privacy. Switch with one configuration change.
Run vision language models on your own servers. Documents and images never leave your infrastructure. Zero per-image API costs at any volume.
Control output with custom prompts and detail levels. Generate one-sentence summaries or detailed visual analysis. Tailor descriptions to your specific use case.
Anthropic Claude delivers nuanced contextual understanding and strong visual reasoning.
Enterprise-grade image understanding with global availability and SLA guarantees.
Run vision language models on your infrastructure for complete privacy and zero API costs.
Extract text from images with high-speed OCR alongside image descriptions.
AI-powered document understanding that detects tables, equations, and structure.
Combine local AI with vision language models for maximum extraction accuracy.
Image description fits into document processing workflows wherever visual content needs to be understood, cataloged, or made accessible.
Supported formats: PNG, JPEG, GIF, BMP, TIFF, and PDF.
Part of the Vision API
Vision API also includes OCR for fast text extraction, ICR for AI-powered document understanding, and VLM-enhanced ICR for maximum accuracy on complex layouts. Combine image descriptions with structured data extraction for complete document processing in your Java application.
High-speed text extraction with word-level bounding boxes. Optimized for throughput on large document sets.
On-premises AI that detects tables, equations, key-value regions, and document structure without external API calls.
Combine on-premises document AI with Claude, OpenAI, or local vision language models for the highest accuracy on complex financial, legal, and medical documents.
Every extraction returns classified elements with bounding boxes, confidence scores, and hierarchical reading order.
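To make the result shape concrete, here is a hypothetical model of such an extraction result: the field names (`type`, `confidence`, `readingOrder`) and the `BBox` record are illustrative assumptions, not the SDK's actual types.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical shape for classified extraction elements, modeling the
// fields named above: element type, bounding box, confidence score, and
// reading order. Names are assumptions, not Nutrient SDK types.
public class ExtractionResult {
    public record BBox(double x, double y, double w, double h) {}
    public record Element(String type, BBox box, double confidence, int readingOrder) {}

    // Sort elements into their hierarchical reading order for downstream use.
    public static List<Element> inReadingOrder(List<Element> elements) {
        return elements.stream()
                .sorted(Comparator.comparingInt(Element::readingOrder))
                .toList();
    }

    public static void main(String[] args) {
        List<Element> page = List.of(
                new Element("table", new BBox(0, 300, 500, 200), 0.97, 2),
                new Element("heading", new BBox(0, 0, 500, 40), 0.99, 1));
        System.out.println(inReadingOrder(page).get(0).type());
    }
}
```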
The SDK generates natural language descriptions of visual content in images and document pages. Descriptions are concise, accurate, and contextual — focusing on observable details and relationships between objects. You can customize the output with detail levels (brief or detailed) and custom prompts to match your specific requirements, whether that’s accessibility alt text, content cataloging metadata, or detailed visual analysis.
The descriptions are designed to meet WCAG accessibility guidelines for alt text. They describe observable content accurately and concisely without making assumptions about context that isn’t visible. For document remediation workflows, you can use custom prompts to further tailor descriptions to your organization’s accessibility standards and style guides.
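A remediation pipeline might lint generated descriptions before accepting them. The checks below are editorial heuristics inspired by common accessibility style guides, not normative WCAG rules; the length threshold and banned prefixes are assumptions.

```java
// Illustrative alt-text lint: heuristic checks only. The 250-character
// cap and the redundant-prefix list are editorial choices, not WCAG
// requirements.
public class AltTextLint {
    private static final String[] REDUNDANT_PREFIXES = {
            "image of", "picture of", "photo of", "graphic of"
    };

    public static boolean isAcceptable(String altText) {
        if (altText == null) return false;
        String trimmed = altText.trim();
        if (trimmed.isEmpty()) return false;
        // Screen readers already announce the element as an image,
        // so "Image of ..." openings are redundant.
        String lower = trimmed.toLowerCase();
        for (String prefix : REDUNDANT_PREFIXES) {
            if (lower.startsWith(prefix)) return false;
        }
        // Keep alt text concise; long analysis belongs in adjacent text.
        return trimmed.length() <= 250;
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("A bar chart comparing Q1 and Q2 revenue."));
        System.out.println(isAcceptable("Image of a chart"));
    }
}
```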
There are three provider types: Anthropic Claude for strong visual reasoning and contextual understanding, OpenAI for enterprise-grade cloud scalability, and any OpenAI-compatible custom endpoint for local models. The custom endpoint option works with Ollama, LM Studio, vLLM, and other local inference servers. Switch providers with a single configuration change — no code modifications needed.
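A provider switch can be as small as one configuration value. The sketch below assumes OpenAI-compatible base URLs; the local-server addresses are the common defaults for Ollama, LM Studio, and vLLM and are assumptions about your deployment, not Nutrient SDK constants.

```java
// One configuration value selects the provider; everything else in the
// pipeline stays unchanged. Local URLs are common server defaults
// (assumptions), not SDK-defined values.
public class VlmConfig {
    public enum Provider { CLAUDE, OPENAI, OLLAMA, LM_STUDIO, VLLM }

    public static String baseUrl(Provider provider) {
        switch (provider) {
            case OLLAMA:    return "http://localhost:11434/v1";
            case LM_STUDIO: return "http://localhost:1234/v1";
            case VLLM:      return "http://localhost:8000/v1";
            case OPENAI:    return "https://api.openai.com/v1";
            case CLAUDE:    return "https://api.anthropic.com/v1";
            default:        throw new IllegalArgumentException("unknown provider");
        }
    }

    public static void main(String[] args) {
        // Switching providers is a one-line change:
        Provider provider = Provider.OLLAMA;
        System.out.println(baseUrl(provider));
    }
}
```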
Yes. Connect to a local vision language model server (Ollama, LM Studio, or vLLM) using the custom endpoint option. Images are processed entirely on your infrastructure with zero data transmitted externally. This gives you the same description capability with complete data sovereignty, suitable for HIPAA, GDPR, and air-gapped environments.
The image description feature supports PNG, JPEG, GIF, BMP, TIFF, and PDF documents. For PDFs, pages are automatically rendered as images for processing. You can describe individual images or pages from multipage documents. The same formats work across all three VLM providers.
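The SDK detects formats itself; purely for illustration, the well-known file signatures of the formats listed above can be sniffed from the first few bytes:

```java
import java.nio.charset.StandardCharsets;

// Illustrative magic-byte detection for the supported formats. This is
// not Nutrient SDK code; it only shows the standard file signatures.
public class FormatSniffer {
    public static String sniff(byte[] b) {
        if (b.length >= 8 && (b[0] & 0xFF) == 0x89 && b[1] == 'P' && b[2] == 'N' && b[3] == 'G') return "PNG";
        if (b.length >= 3 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xD8 && (b[2] & 0xFF) == 0xFF) return "JPEG";
        if (b.length >= 4 && b[0] == 'G' && b[1] == 'I' && b[2] == 'F' && b[3] == '8') return "GIF";
        if (b.length >= 4 && ((b[0] == 'I' && b[1] == 'I' && b[2] == 42 && b[3] == 0)
                || (b[0] == 'M' && b[1] == 'M' && b[2] == 0 && b[3] == 42))) return "TIFF";
        if (b.length >= 4 && b[0] == '%' && b[1] == 'P' && b[2] == 'D' && b[3] == 'F') return "PDF";
        if (b.length >= 2 && b[0] == 'B' && b[1] == 'M') return "BMP";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
    }
}
```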
The SDK provides configurable detail levels and custom prompts. Use the detail level setting to choose between concise descriptions (1–3 sentences for alt text) and detailed analysis (comprehensive visual breakdown). Custom prompts let you further shape the output — for example, focusing on specific elements, using particular terminology, or following your organization’s style guide.
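One way this combination can work is a base instruction chosen by detail level, with the custom prompt appended. The `DetailLevel` name and prompt strings below are illustrative assumptions, not the SDK's actual templates.

```java
// Sketch of combining a detail level with an optional custom prompt into
// the instruction sent to the VLM. Template wording is an assumption.
public class DescriptionPrompt {
    public enum DetailLevel { BRIEF, DETAILED }

    public static String build(DetailLevel level, String customPrompt) {
        String base = (level == DetailLevel.BRIEF)
                ? "Describe this image in 1-3 sentences suitable as alt text."
                : "Provide a detailed visual analysis of this image.";
        return (customPrompt == null || customPrompt.isBlank())
                ? base
                : base + " " + customPrompt.trim();
    }

    public static void main(String[] args) {
        System.out.println(build(DetailLevel.BRIEF, "Use medical terminology."));
    }
}
```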
Yes. Image description is well-suited for automated content cataloging workflows. Generate descriptions and metadata for large image libraries, making visual content searchable and organized. Combined with Vision API’s extraction capabilities, you can process entire document archives — extracting text and data while simultaneously generating descriptions for embedded images and figures.
Dedicated alt text tools are typically web-based services designed for content managers and marketers. The Nutrient SDK is a developer tool that integrates image description into your Java application’s document processing pipeline. You get programmatic control, choice of VLM provider (including on-premises), custom prompts, and the ability to combine descriptions with OCR, data extraction, and other document operations in a single workflow.
Claude and OpenAI charge per-request API fees based on their pricing models. Local models (via Ollama, LM Studio, or vLLM) have zero per-image API costs — you only pay for the infrastructure to run them. For high-volume description workflows, local models can significantly reduce costs while maintaining quality. The Nutrient SDK itself does not add per-image fees on top of your VLM provider costs.
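The trade-off is easy to estimate as arithmetic. The per-image price and monthly server cost below are made-up figures for illustration, not any provider's actual pricing:

```java
// Illustrative cost arithmetic only; all dollar figures are assumptions.
public class CostModel {
    // Cloud: cost scales linearly with volume.
    public static double cloudCost(long images, double pricePerImage) {
        return images * pricePerImage;
    }

    // Local inference: flat infrastructure cost, zero per-image API fees.
    public static double localCost(double monthlyInfra) {
        return monthlyInfra;
    }

    // Monthly volume above which local inference is cheaper than cloud.
    public static long breakEvenImages(double monthlyInfra, double pricePerImage) {
        return (long) Math.ceil(monthlyInfra / pricePerImage);
    }

    public static void main(String[] args) {
        // Hypothetical: $0.004 per image vs. a $400/month GPU server.
        System.out.println(breakEvenImages(400.0, 0.004));
    }
}
```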
Add the Nutrient Java SDK dependency to your project. Configure your VLM provider (Claude API key, OpenAI API key, or a local model endpoint). Open a document or image, create a vision instance, and call the describe method. The SDK handles image preparation, API communication, and response formatting. The guides include step-by-step examples for each provider.
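For a sense of what the SDK handles under the hood, here is a sketch of the raw request an OpenAI-compatible VLM endpoint expects, assuming a local Ollama server and a vision model named "llava" (both assumptions about your setup); the SDK performs this image preparation and API communication for you.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of an OpenAI-compatible chat/completions request carrying an
// image as a base64 data URL. Endpoint and model name are assumptions.
public class DescribeRequest {
    public static HttpRequest build(String baseUrl, String model, byte[] imageBytes) {
        String dataUrl = "data:image/png;base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
        String body = """
            {"model":"%s","messages":[{"role":"user","content":[
              {"type":"text","text":"Describe this image for alt text."},
              {"type":"image_url","image_url":{"url":"%s"}}]}]}
            """.formatted(model, dataUrl);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build("http://localhost:11434/v1", "llava", new byte[]{1, 2, 3});
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Nothing is sent until the request is passed to an `HttpClient`, so the sketch builds the request without needing a running server.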