Generating image descriptions using OpenAI
Generating accessible image descriptions using OpenAI lets teams leverage cloud-scale AI infrastructure, access state-of-the-art vision models, and deploy production-ready solutions without managing model infrastructure. This applies whether you're creating cloud-based accessibility systems that generate alt text with OpenAI's vision capabilities, building scalable content management platforms that catalog thousands of images with reliable API behavior, implementing global document workflows that process images from multiple regions with consistent quality, developing enterprise applications that require well-documented API contracts and SLA guarantees, or building rapid prototypes that need cutting-edge vision models without local GPU infrastructure. Image description operations include analyzing image content with OpenAI's vision language model (VLM), generating concise descriptions focused on main subjects and key details, producing accessibility-compliant descriptions for screen readers, extracting semantic meaning from charts, diagrams, and photographs, and providing contextual understanding beyond simple object detection, all through OpenAI's cloud infrastructure.
This guide demonstrates using OpenAI as the vision language model provider for generating image descriptions through the Nutrient vision API. OpenAI provides state-of-the-art vision understanding, reliable cloud-based processing with global availability, well-documented API behavior with enterprise SLAs, and scalable infrastructure eliminating local GPU requirements.
How Nutrient helps you achieve this
The Nutrient Python SDK handles vision API integration, OpenAI provider configuration, and image processing pipelines. With the SDK, you don't need to worry about:
- Managing OpenAI API authentication, endpoint configuration, and request formatting
- Encoding image data and handling multimodal API request structures
- Configuring model parameters like temperature, max tokens, and provider-specific settings
- Complex error handling for vision service failures and API rate limits
Instead, Nutrient provides an API that handles all the complexity behind the scenes, enabling you to focus on your business logic.
Complete implementation
Below is a complete working example that demonstrates generating accessible image descriptions using OpenAI as the vision language model provider. The following lines set up the Python application. Start by importing the required classes from the SDK:
```python
from nutrient_sdk import Document, Vision
from nutrient_sdk.settings import VlmProvider
```
Configuring the OpenAI provider
Open the image file using a context manager and configure the vision API to use OpenAI as the vision language model provider. The following code opens an image file (PNG format in this example) using Document.open() with a file path parameter. The context manager pattern (using the with statement) ensures the document is properly closed after processing, releasing memory and image data regardless of whether description generation succeeds or fails. The document.settings.vision_settings.provider property assignment specifies which VLM provider to use — setting it to VlmProvider.OpenAI configures OpenAI as the vision provider instead of alternatives like Claude or local AI models. The document.settings.openai_api_endpoint_settings.api_key property assignment sets your OpenAI API key for authentication — replace "OPENAI_API_KEY" with your actual API key obtained from the OpenAI platform. The SDK supports multiple image formats, including PNG, JPEG, GIF, BMP, and TIFF:
```python
with Document.open("input_photo.png") as document:
    # Configure OpenAI as the VLM provider
    document.settings.vision_settings.provider = VlmProvider.OpenAI

    # Set the OpenAI API key
    document.settings.openai_api_endpoint_settings.api_key = "OPENAI_API_KEY"
```
Creating a vision instance and generating the description
Create a vision API instance bound to the document and generate a natural language description of the image content. The following code uses the Vision.set() static method with the document parameter to create a vision processor configured with the OpenAI provider settings defined earlier. The vision.describe() method sends the image to the OpenAI API for analysis and returns a natural language description as a string. During processing, the SDK encodes the image data, constructs a multimodal API request with the image payload, sends the request to OpenAI's vision endpoint, and parses the response to extract the description text. The description focuses on the main subject and key visual details observable in the image, optimized for accessibility purposes and screen reader compatibility:
```python
    vision = Vision.set(document)
    description = vision.describe()
```
Outputting the description
Print the generated description to the console for review or logging. The following code prints a header line followed by the description text returned by OpenAI. In production applications, you might save the description to a database, write it to a file, or use it to populate alt text attributes in HTML documents. The context manager automatically closes the document after the print statements complete, ensuring proper resource cleanup:
print("Image description:") print(description)Understanding the output
Understanding the output
The describe() method returns a natural language description of the image content optimized for accessibility and content understanding. OpenAI's vision model analyzes the visual content and generates descriptions with specific characteristics:
- Concise — Focused on the main subject and key details without unnecessary verbosity, typically 1–3 sentences.
- Accessible — Written for users who cannot see the image, following accessibility best practices for alt text.
- Accurate — Describes only what is clearly visible in the image, avoiding speculation or interpretation beyond observable details.
- Contextual — Understands relationships between objects, spatial arrangements, and relevant context within the scene.
The generated descriptions are suitable for accessibility compliance (WCAG alt text requirements), content management metadata, searchable image cataloging, and document accessibility workflows.
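For example, a web application might inject the generated description into an img element's alt attribute to satisfy WCAG alt text requirements. The following is a minimal sketch using only the standard library; the helper function is hypothetical and assumes the description variable from the earlier code:

```python
import html

def img_tag(src: str, description: str) -> str:
    """Build an <img> tag whose alt text is the generated description."""
    escaped_src = html.escape(src, quote=True)
    escaped_alt = html.escape(description, quote=True)
    return f'<img src="{escaped_src}" alt="{escaped_alt}">'

print(img_tag("input_photo.png", description))
```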
OpenAI API settings
The OpenAI provider uses the following settings from openai_api_endpoint_settings (a configuration sketch follows the list):
- api_endpoint — The OpenAI API endpoint (default: https://api.openai.com/v1).
- api_key — Your OpenAI API key for authentication.
- model — The model identifier to use.
- temperature — Controls response creativity (0.0 = deterministic, 1.0 = creative).
- max_tokens — Maximum tokens in the response (default: 16384).
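A minimal sketch of adjusting these settings before generating a description. The property names follow the list above; the model identifier and parameter values shown are illustrative assumptions, not defaults documented here:

```python
from nutrient_sdk import Document, Vision
from nutrient_sdk.settings import VlmProvider

with Document.open("input_photo.png") as document:
    document.settings.vision_settings.provider = VlmProvider.OpenAI

    openai_settings = document.settings.openai_api_endpoint_settings
    openai_settings.api_key = "OPENAI_API_KEY"
    openai_settings.api_endpoint = "https://api.openai.com/v1"  # Default endpoint
    openai_settings.model = "gpt-4o"       # Illustrative model identifier
    openai_settings.temperature = 0.0      # Deterministic responses
    openai_settings.max_tokens = 1024      # Cap response length

    description = Vision.set(document).describe()
    print(description)
```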
Error handling
The Python SDK raises NutrientException if vision operations fail due to processing errors, API failures, or configuration issues. Exception handling ensures robust error recovery in production environments.
Common failure scenarios include:
- The input image file can’t be read due to file system permissions, path errors, or unsupported image formats
- Invalid or missing OpenAI API key causing authentication failures with the OpenAI API
- OpenAI API service is unavailable or experiencing outages preventing description generation
- API rate limits exceeded when processing high volumes of images in rapid succession
- Network connectivity issues preventing API requests from reaching OpenAI’s endpoints
- Image data too large or corrupted, preventing proper encoding and transmission
In production code, wrap the vision operations in a try-except block to catch NutrientException instances, providing appropriate error messages to users and logging failure details for debugging. This error handling pattern enables graceful degradation when vision processing fails, preventing application crashes and enabling retry logic with exponential backoff for transient API failures or user notification for manual intervention.
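A minimal sketch of this pattern, assuming NutrientException is importable from the SDK's top-level package (adjust the import path to match your installation) and treating every failure as potentially transient:

```python
import logging
import time

from nutrient_sdk import Document, Vision, NutrientException  # Exception import path assumed
from nutrient_sdk.settings import VlmProvider

MAX_ATTEMPTS = 3

def describe_image(path: str) -> str | None:
    """Generate a description for one image, retrying transient failures with backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with Document.open(path) as document:
                document.settings.vision_settings.provider = VlmProvider.OpenAI
                document.settings.openai_api_endpoint_settings.api_key = "OPENAI_API_KEY"
                return Vision.set(document).describe()
        except NutrientException as exc:
            logging.warning("Description failed for %s (attempt %d): %s", path, attempt, exc)
            if attempt < MAX_ATTEMPTS:
                time.sleep(2 ** attempt)  # Exponential backoff before retrying
    return None
```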
Conclusion
The image description workflow with OpenAI consists of several key operations:
- Open the image file using a context manager for automatic resource cleanup.
- The SDK supports multiple image formats, including PNG, JPEG, GIF, BMP, and TIFF.
- Access the vision settings with document.settings.vision_settings.provider to configure the VLM provider.
- Set the provider to OpenAI with the VlmProvider.OpenAI property assignment instead of alternatives like Claude or local models.
- Access OpenAI-specific settings with document.settings.openai_api_endpoint_settings for API configuration.
- Set the OpenAI API key with property assignment using credentials obtained from the OpenAI platform.
- OpenAI API settings control endpoint URLs, model selection, temperature, and max tokens.
- Create a vision instance with Vision.set() bound to the document with configured provider settings.
- Generate the description with vision.describe(), which sends the image to OpenAI's vision endpoint and returns natural language text.
- The SDK encodes image data, constructs multimodal API requests, and parses responses automatically.
- Generated descriptions are concise (1–3 sentences), accessible (WCAG-compliant alt text), accurate (observable details only), and contextual.
- Print or save the description for use in accessibility systems, content management, or cataloging workflows.
- Handle NutrientException for vision processing failures, including authentication errors, API failures, and rate limits.
- The context manager ensures proper resource cleanup when processing completes or exceptions occur.
Nutrient handles VLM API authentication, multimodal request formatting, image encoding, endpoint configuration, model parameter management, and response parsing, so you don't need to understand OpenAI API protocols or manage vision service integration manually. The image description system provides cloud scalability and consistent performance for accessibility systems generating alt text with OpenAI's vision capabilities, content management platforms cataloging thousands of images, global document workflows processing images from multiple regions, enterprise applications requiring well-documented API contracts and SLA guarantees, and rapid prototypes that need cutting-edge vision models without local GPU infrastructure.
You can download this ready-to-use sample package, fully configured to help you explore the vision API description capabilities with OpenAI.