Labeling form fields with a vision language model

Form-field detection locates the fillable regions on a page, but a bounding box alone doesn’t tell you what each field means. AI labeling adds a human-readable semantic label to each detected field, such as “First name” or “Date of birth”, by sending the page to a vision language model (VLM).

This sample builds on offline form-field detection. For the detection basics, and for a fully offline workflow that contacts no model, refer to the extract form fields from an image guide. Here you connect a VLM provider and turn labeling on with Nutrient Python SDK.

How Nutrient helps

Nutrient Python SDK runs detection and labeling behind a single method call. With labeling enabled, it also:

Draws numbered marks over each detected field on a rendered copy of the page
Sends the annotated page to the VLM you select and reads each field’s semantic label back
Optionally drops detections the model judges to be false positives
Records each field’s type, bounding box, confidence, and assigned label in JSON

The result is structured data you can index, validate, or feed into a downstream workflow.

Prerequisites

AI labeling requires a reachable VLM endpoint. The SDK does not provision or start a VLM service for you.

Configure a reachable VLM endpoint in your environment.
Configure api_endpoint and model in custom VLM API settings.
By default, the SDK may assume:
- api_endpoint: http://localhost:1234/v1
- model: qwen/qwen3-vl-8b
For clarity and reliability, set both api_endpoint and model explicitly.
Example with LM Studio:
- Run LM Studio in server mode.
- Load a compatible vision model such as Qwen3-VL (4B, 8B, or larger depending on your hardware).
Make sure the endpoint is running before you call detect_forms() with labeling enabled.

If no VLM endpoint is available, labeling fails at runtime. Leave enable_ai_labeling at its default of False to run detection only and keep the workflow offline.
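
Because labeling fails at runtime when the endpoint is down, it can help to probe the endpoint before you enable labeling. The following is a minimal sketch, not part of the SDK: vlm_endpoint_is_reachable is a hypothetical helper, and it assumes your server answers GET requests on the /models route under the configured api_endpoint, as OpenAI-compatible servers such as LM Studio generally do. Servers that require an API key for that route would need an Authorization header added.

```python
import http.client
import json
import urllib.error
import urllib.request


def vlm_endpoint_is_reachable(api_endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if GET {api_endpoint}/models answers with HTTP 200 and a JSON body.

    Hypothetical helper: assumes an OpenAI-compatible server that lists models
    at /models without authentication.
    """
    url = api_endpoint.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            json.load(response)  # raises ValueError if the body is not JSON
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        return False
```

You could call vlm_endpoint_is_reachable("http://localhost:1234/v1") first and set enable_ai_labeling to True only when it returns True. A successful probe only shows that the server is listening. It doesn't confirm that the configured model is loaded.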

Connect a vision model

Labeling uses the same provider configuration as the rest of the Vision API, so you don’t configure a separate endpoint for form labeling. Set the provider in vision settings and fill in the matching provider settings class:

Custom / local (default) — An OpenAI-compatible server such as LM Studio, Ollama, or vLLM. Configure custom VLM API settings.
OpenAI — Configure OpenAI API endpoint settings.

Complete implementation

Start by importing the classes used in the sample:

from nutrient_sdk import Document, Vision, VlmProvider, NutrientException

Load the document

Open the document in a context manager so resources are cleaned up after processing:

def main():
    try:
        with Document.open("input_forms_detection.pdf") as document:

Configure AI labeling

Select the provider, point it at your vision model, then opt in with enable_ai_labeling:

            # Select the vision model provider (the same setting Vision.describe() uses)
            document.settings.vision_settings.provider = VlmProvider.Custom

            # Configure the matching provider settings class
            vlm = document.settings.custom_vlm_api_settings
            vlm.api_endpoint = "http://localhost:1234/v1"
            vlm.model = "qwen/qwen3-vl-8b"

            # Turn on labeling and drop detections the model judges to be false positives
            form_labeling = document.settings.form_labeling_settings
            form_labeling.enable_ai_labeling = True
            form_labeling.enable_ai_remove_false_positives = True

            # Optional: constrain labels to a known vocabulary
            form_labeling.candidate_labels = "First name, Last name, Date of birth, Signature"

Detect and label form fields

Create a vision instance from the document with Vision.set(document), then call detect_forms(). The same call handles both detection-only and labeled runs. Here the result includes labels because enable_ai_labeling is set:

            vision = Vision.set(document)
            forms_json = vision.detect_forms()

Write the JSON result to a file for downstream processing:

            with open("output.json", "w", encoding="utf-8") as f:
                f.write(forms_json)
    except NutrientException as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()

Match labels to a vocabulary

Free-form labels can vary between runs (“First name” vs. “Given name”), which makes them hard to map to a database or template. Supply a vocabulary of preferred labels with candidate_labels, as shown above, and the model maps each field to one of them when one fits. If none fits, the model invents a concise new label.

Pass the labels as newline- or comma-separated text. A matched label uses the casing you supplied, and each field’s labelSource records whether the label was matched or invented. Leave candidate_labels empty, which is the default, for free-form labeling.
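
If your vocabulary lives in a list (for example, column names from a database), you can build the text value from it. The sketch below is illustrative only: build_candidate_labels is a hypothetical helper, not an SDK function. It trims whitespace, drops blanks and case-insensitive duplicates, and joins the labels with newlines, which is one of the two formats named above. This guide doesn't describe how a label that itself contains a comma is parsed, so the newline form is used here to keep the separator distinct from punctuation.

```python
def build_candidate_labels(labels):
    """Join labels into newline-separated text, skipping blanks and duplicates."""
    seen = set()
    cleaned = []
    for label in labels:
        text = " ".join(label.split())  # trim and collapse internal whitespace
        key = text.casefold()           # compare without regard to casing
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)        # keep the first casing supplied
    return "\n".join(cleaned)


vocabulary = build_candidate_labels(
    ["First name", "Last name", " Date of birth ", "first name", "Signature"]
)
```

Assign the result in the configuration step with form_labeling.candidate_labels = vocabulary.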

Understand the output

detect_forms() returns structured JSON. The elements array holds one form element per page. Each form element includes its pageNumber and a fields list, so fields from a multi-page document stay grouped by the page they came from. Each field includes:

fieldType — The detected type: Text, Checkbox, or Signature.
bounds — The bounding box of the field on the page.
confidence — The detection confidence for the field.
label — The AI-assigned semantic label (for example, “First name”). Present only when AI labeling is enabled.
labelSource — Either matched or invented. Present only when a candidate vocabulary was supplied.
id — A unique identifier for the field.
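
To work with the fields listed above, you can flatten the per-page grouping into one row per field. This is a sketch under stated assumptions: the sample JSON is hand-written to follow the field names in this guide and is not real SDK output, the top-level object is assumed to hold the elements array under an elements key, and the bounds value is left as an empty placeholder because its layout isn't described here. flatten_fields is a hypothetical helper.

```python
import json

# Hand-written sample following the field names described in this guide.
# Not real SDK output; "bounds" is an empty placeholder.
SAMPLE_FORMS_JSON = """
{
  "elements": [
    {
      "pageNumber": 1,
      "fields": [
        {"id": "f1", "fieldType": "Text", "bounds": {}, "confidence": 0.97,
         "label": "First name", "labelSource": "matched"},
        {"id": "f2", "fieldType": "Signature", "bounds": {}, "confidence": 0.88,
         "label": "Applicant signature", "labelSource": "invented"}
      ]
    },
    {
      "pageNumber": 2,
      "fields": [
        {"id": "f3", "fieldType": "Checkbox", "bounds": {}, "confidence": 0.91}
      ]
    }
  ]
}
"""


def flatten_fields(forms_json):
    """Return one flat dict per field, tagged with the page it came from."""
    rows = []
    for element in json.loads(forms_json).get("elements", []):
        for field in element.get("fields", []):
            rows.append({
                "page": element.get("pageNumber"),
                "id": field.get("id"),
                "type": field.get("fieldType"),
                "confidence": field.get("confidence"),
                # label and labelSource are optional, so they default to None.
                "label": field.get("label"),
                "label_source": field.get("labelSource"),
            })
    return rows


rows = flatten_fields(SAMPLE_FORMS_JSON)
# Invented labels are the ones most likely to need a manual review.
invented = [row["label"] for row in rows if row["label_source"] == "invented"]
```

Reading optional keys with get() means the same function works on detection-only output, where label and labelSource are absent.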

Handle errors

The Vision API raises VisionException, which derives from NutrientException, when detection or labeling fails.

Common failure scenarios include:

The document can’t be read due to path or permission issues
The page produces no renderable image
The form detection model is missing or inaccessible, or the feature isn’t licensed
AI labeling is enabled but the selected provider’s endpoint is unreachable

In production code:

Catch NutrientException.
Return a clear error message.
Log failure details for debugging.
Consider running detection only, with labeling disabled, as a fallback when the vision endpoint is unavailable.
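
The fallback in the last item can be sketched as a small wrapper. The example is deliberately SDK-agnostic so that it runs on its own: detect_with_fallback is a hypothetical helper, and run_labeled and run_detection_only stand for two functions you would write yourself, each opening the document and calling detect_forms() with enable_ai_labeling set to True or left at False. With the SDK, you would pass NutrientException as error_type.

```python
import logging

logger = logging.getLogger("form_labeling")


def detect_with_fallback(run_labeled, run_detection_only, error_type=Exception):
    """Try labeled detection first; on error_type, log it and run detection only.

    Returns (forms_json, labeled), where labeled tells you which path succeeded.
    Errors of other types, and errors from the fallback itself, propagate.
    """
    try:
        return run_labeled(), True
    except error_type as exc:
        logger.warning("Labeling failed, falling back to detection only: %s", exc)
        return run_detection_only(), False
```

Returning the labeled flag lets downstream code know whether to expect the label field. Note that this wrapper retries on any error of the given type, not only on an unreachable endpoint. If the cause is a missing license or an unreadable file, the fallback call fails too and its exception reaches the caller.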

Conclusion

The workflow for labeling form fields with a vision model is:

Open the source document using a context manager for automatic resource cleanup.
Select a provider with vision_settings.provider, configure the matching provider settings class, then set enable_ai_labeling on form_labeling_settings.
Create a vision instance with Vision.set().
Call detect_forms() to detect every field, assign a semantic label to each, and return the result as JSON.
Write the JSON to a file for indexing, validation, or downstream processing.
Handle NutrientException for robust error recovery.

Labeling adds semantic meaning when a vision model is available. For offline detection with no model contacted, refer to the extract form fields from an image guide. To produce a fillable PDF instead of data, refer to the detect and add form fields guide.

For related image extraction workflows, refer to the Python SDK guides.

Download the sample package to explore form-field labeling.