Labeling form fields with a vision language model

Form-field detection locates the fillable regions on a page, but a bounding box alone doesn’t tell you what each field means. AI labeling adds a human-readable semantic label to each detected field, such as “First name” or “Date of birth”, by sending the page to a vision language model (VLM).

This sample builds on offline form-field detection. For the detection basics, and for a fully offline workflow that contacts no model, refer to the extract form fields from an image guide. Here you connect a VLM provider and turn on labeling with Nutrient Java SDK.


How Nutrient helps

Nutrient Java SDK runs detection and labeling behind a single method call. With labeling enabled, it also:

Draws numbered marks over each detected field on a rendered copy of the page
Sends the annotated page to the VLM you select and reads each field’s semantic label back
Optionally drops detections the model judges to be false positives
Records each field’s type, bounding box, confidence, and assigned label in JSON

The result is structured data you can index, validate, or feed into a downstream workflow.

Prerequisites

AI labeling requires a reachable VLM endpoint. The SDK does not provision or start a VLM service for you.

Configure a reachable VLM endpoint in your environment.
Configure the API endpoint and model in custom VLM API settings.
By default, the SDK may assume:
- API endpoint: http://localhost:1234/v1
- Model: qwen/qwen3-vl-8b
For clarity and reliability, set both the API endpoint and model explicitly.
Example with LM Studio:
- Run LM Studio in server mode.
- Load a compatible vision model such as Qwen3-VL (4B, 8B, or larger depending on your hardware).
Make sure the endpoint is running before you call detectForms() with labeling enabled.

If no VLM endpoint is available, labeling fails at runtime. Leave enableAiLabeling at its default of false to run detection only and keep the workflow offline.
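Because the SDK doesn't provision or start the VLM service, you may want a pre-flight check before you turn labeling on. The following is a minimal sketch that uses only the JDK's java.net.http client. It assumes your server is OpenAI-compatible and answers GET requests on {endpoint}/models, which LM Studio, Ollama, and vLLM typically do. The class and method names are hypothetical and are not part of Nutrient Java SDK.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical pre-flight check for an OpenAI-compatible VLM server (not an SDK API). */
public final class VlmEndpointProbe {

    private VlmEndpointProbe() {}

    /** Builds the models URL from the base endpoint, tolerating a trailing slash. */
    public static String modelsUrl(String apiEndpoint) {
        String base = apiEndpoint.endsWith("/")
                ? apiEndpoint.substring(0, apiEndpoint.length() - 1)
                : apiEndpoint;
        return base + "/models";
    }

    /** Returns true if GET {apiEndpoint}/models answers with a 2xx status within the timeout. */
    public static boolean isReachable(String apiEndpoint, Duration timeout) {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1) // skip the h2c upgrade attempt over plain HTTP
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(modelsUrl(apiEndpoint)))
                .timeout(timeout)
                .GET()
                .build();
        try {
            HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() / 100 == 2;
        } catch (IOException e) {
            // Connection refused, unknown host, or timeout (HttpTimeoutException is an IOException).
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        String endpoint = args.length > 0 ? args[0] : "http://localhost:1234/v1";
        if (isReachable(endpoint, Duration.ofSeconds(2))) {
            System.out.println("VLM endpoint is reachable: " + endpoint);
        } else {
            System.out.println("VLM endpoint is not reachable; consider running detection only.");
        }
    }
}
```

If isReachable returns false, one option is to leave enableAiLabeling at false and run detection only, rather than letting labeling fail at runtime.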

Connect a vision model

Labeling uses the same provider configuration as the rest of the Vision API, so you don’t configure a separate endpoint for form labeling. Set the provider in vision settings and fill in the matching provider settings class:

Custom / local (default): An OpenAI-compatible server such as LM Studio, Ollama, or vLLM. Configure custom VLM API settings.
OpenAI: Configure OpenAI API endpoint settings.
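Rather than hardcoding the custom endpoint and model in source, you might resolve them from the environment and still set both explicitly, as the prerequisites recommend. This is a sketch under stated assumptions: the environment variable names VLM_API_ENDPOINT and VLM_MODEL are hypothetical, and the fallbacks are the defaults listed in the prerequisites.

```java
import java.util.Map;

/**
 * Hypothetical helper that resolves the custom VLM endpoint and model.
 * The variable names VLM_API_ENDPOINT and VLM_MODEL are assumptions, not SDK conventions.
 */
public record VlmConfig(String apiEndpoint, String model) {

    // Defaults listed in this guide's prerequisites.
    public static final String DEFAULT_ENDPOINT = "http://localhost:1234/v1";
    public static final String DEFAULT_MODEL = "qwen/qwen3-vl-8b";

    /** Resolves values from any map, which keeps the lookup easy to test. */
    public static VlmConfig from(Map<String, String> env) {
        return new VlmConfig(
                valueOrDefault(env, "VLM_API_ENDPOINT", DEFAULT_ENDPOINT),
                valueOrDefault(env, "VLM_MODEL", DEFAULT_MODEL));
    }

    /** Resolves values from the process environment. */
    public static VlmConfig fromEnvironment() {
        return from(System.getenv());
    }

    private static String valueOrDefault(Map<String, String> env, String key, String fallback) {
        String value = env.get(key);
        // Unset and blank variables both fall back to the default.
        return (value == null || value.isBlank()) ? fallback : value.trim();
    }
}
```

You would then pass config.apiEndpoint() and config.model() to setApiEndpoint and setModel on the custom VLM API settings. Taking a Map in from() lets you test the lookup without touching the real environment.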

Prepare the project

Set a package name and create the main class:

package io.nutrient.Sample;

Import the required classes from the SDK:

import io.nutrient.sdk.Document;
import io.nutrient.sdk.Vision;
import io.nutrient.sdk.enums.VlmProvider;
import io.nutrient.sdk.exceptions.NutrientException;

import java.io.FileWriter;
import java.io.IOException;

public class LabelFormFieldsWithVlm {

Load the document

Open the document with try-with-resources so the SDK closes resources after processing:

    public static void main(String[] args) throws NutrientException, IOException {
        try (Document document = Document.open("input_forms_detection.pdf")) {

Configure AI labeling

Set the provider to your vision model, then opt in with setEnableAiLabeling:

            // Select the vision model provider (the same setting Vision.describe() uses)
            document.getSettings().getVisionSettings().setProvider(VlmProvider.Custom);

            // Configure the matching provider settings class
            var vlm = document.getSettings().getCustomVlmApiSettings();
            vlm.setApiEndpoint("http://localhost:1234/v1");
            vlm.setModel("qwen/qwen3-vl-8b");

            // Turn on labeling and drop detections the model judges to be false positives
            var formLabeling = document.getSettings().getFormLabelingSettings();
            formLabeling.setEnableAiLabeling(true);
            formLabeling.setEnableAiRemoveFalsePositives(true);

            // Optional: constrain labels to a known vocabulary
            formLabeling.setCandidateLabels("First name, Last name, Date of birth, Signature");

Detect and label form fields

Create a vision instance from the document with Vision.set(document), then call detectForms(). The same call runs detection alone or detection plus labeling; here the output includes labels because enableAiLabeling is set to true:

            Vision vision = Vision.set(document);
            String formsJson = vision.detectForms();

Write the JSON result to a file for downstream processing:

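            // Write as UTF-8 so non-ASCII labels survive regardless of the platform's default charset
            try (FileWriter writer = new FileWriter("output.json", java.nio.charset.StandardCharsets.UTF_8)) {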
                writer.write(formsJson);
            }
        }
    }
}

Match labels to a vocabulary

Free-form labels can vary between runs (“First name” vs. “Given name”), which makes them hard to map to a database or template. Supply a vocabulary of preferred labels with setCandidateLabels, as shown above, and the model maps each field to one when it fits. If no label fits, it invents a concise new label.

Pass the labels as newline- or comma-separated text. A matched label uses the casing you supplied, and each field’s labelSource records whether the label was matched or invented. Leave candidate labels empty, which is the default, for free-form labeling.
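If your vocabulary already lives in a list, for example the columns of a database table, a small helper can produce the text for setCandidateLabels. This sketch joins labels with newlines, drops blanks and case-insensitive duplicates, and rejects any label that contains a comma or newline, on the assumption that those characters are always treated as separators. CandidateLabels is a hypothetical name, not an SDK API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper that builds the text passed to setCandidateLabels (not an SDK API). */
public final class CandidateLabels {

    private CandidateLabels() {}

    /** Joins labels with newlines after trimming, skipping blanks and case-insensitive duplicates. */
    public static String join(List<String> labels) {
        // Key by lowercase text; the first spelling wins and keeps its casing.
        Map<String, String> unique = new LinkedHashMap<>();
        for (String raw : labels) {
            String label = raw.trim();
            if (label.isEmpty()) {
                continue;
            }
            if (label.indexOf(',') >= 0 || label.indexOf('\n') >= 0) {
                // Assumed to be separators, so a label containing one would be split in two.
                throw new IllegalArgumentException("Label contains a separator: " + label);
            }
            unique.putIfAbsent(label.toLowerCase(Locale.ROOT), label);
        }
        return String.join("\n", unique.values());
    }
}
```

For example: formLabeling.setCandidateLabels(CandidateLabels.join(List.of("First name", "Last name", "Date of birth", "Signature"))). Keeping the first spelling of a duplicate matters because a matched label comes back in the casing you supplied.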

Understand the output

detectForms() returns structured JSON. The elements array holds one form element per page. Each form element includes its pageNumber and a fields list, so fields from a multi-page document stay grouped by the page they came from. Each field includes:

fieldType — The detected type: Text, Checkbox, or Signature.
bounds — The bounding box of the field on the page.
confidence — The detection confidence for the field.
label — The AI-assigned semantic label (for example, “First name”). Present only when AI labeling is enabled.
labelSource — matched or invented, present only when a candidate vocabulary was supplied.
id — A unique identifier for the field.
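Java's standard library has no JSON parser, so a typical approach is to deserialize the output with a library such as Jackson or Gson into plain types and work with those. The records below are a hypothetical model of the keys described above; the value types (including bounds as x, y, width, and height) and the exact labelSource strings are assumptions to check against your actual output. The two helpers group labeled fields by label across pages and collect invented labels for review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of the detectForms() output and two downstream helpers. */
public final class FormOutput {

    private FormOutput() {}

    public record Bounds(double x, double y, double width, double height) {}

    public record Field(String id, String fieldType, Bounds bounds, double confidence,
                        String label, String labelSource) {}

    public record FormElement(int pageNumber, List<Field> fields) {}

    /** A field paired with the page it came from. */
    public record PageField(int pageNumber, Field field) {}

    /** Groups labeled fields by label (sorted), keeping detections at or above minConfidence. */
    public static Map<String, List<PageField>> byLabel(List<FormElement> elements, double minConfidence) {
        Map<String, List<PageField>> result = new TreeMap<>();
        for (FormElement element : elements) {
            for (Field field : element.fields()) {
                // Unlabeled fields (labeling disabled) and low-confidence detections are skipped.
                if (field.label() == null || field.confidence() < minConfidence) {
                    continue;
                }
                result.computeIfAbsent(field.label(), k -> new ArrayList<>())
                      .add(new PageField(element.pageNumber(), field));
            }
        }
        return result;
    }

    /** Returns fields whose label the model invented rather than matched, for manual review. */
    public static List<PageField> invented(List<FormElement> elements) {
        List<PageField> out = new ArrayList<>();
        for (FormElement element : elements) {
            for (Field field : element.fields()) {
                if ("invented".equals(field.labelSource())) {
                    out.add(new PageField(element.pageNumber(), field));
                }
            }
        }
        return out;
    }
}
```

Grouping by label gives you a lookup keyed the way a database or template usually is, and the page number travels with each field so multi-page forms stay traceable.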

Handle errors

The Vision API throws VisionException, a subclass of NutrientException, when detection or labeling fails.

Common failure scenarios include:

The document can’t be read due to path or permission issues
The page produces no renderable image
The form detection model is missing or inaccessible, or the feature isn’t licensed
AI labeling is enabled but the selected provider’s endpoint is unreachable

In production code:

Catch NutrientException.
Return a clear error message.
Log failure details for debugging.
Consider running detection only, with labeling disabled, as a fallback when the vision endpoint is unavailable.
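The detection-only fallback described in the list above can be factored into a small helper. This is a minimal sketch; LabelingFallback and Result are hypothetical names. It runs the labeled call first and, only when the failure is of the type you name, reports it and returns the detection-only result, so unrelated errors still surface.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

/** Hypothetical labeled-then-detection-only fallback helper (not an SDK API). */
public final class LabelingFallback {

    private LabelingFallback() {}

    /** The result plus a flag telling downstream code whether labels are present. */
    public record Result<T>(T value, boolean labeled) {}

    /**
     * Runs withLabels first. If it throws an exception of the recoverable type, reports it
     * to onFailure and returns the detectionOnly result instead. Any other exception, and any
     * exception from detectionOnly itself, propagates to the caller.
     */
    public static <T> Result<T> run(Callable<T> withLabels,
                                    Callable<T> detectionOnly,
                                    Class<? extends Exception> recoverable,
                                    Consumer<Exception> onFailure) throws Exception {
        try {
            return new Result<>(withLabels.call(), true);
        } catch (Exception e) {
            if (!recoverable.isInstance(e)) {
                throw e; // not a failure this fallback is meant to hide
            }
            onFailure.accept(e);
            return new Result<>(detectionOnly.call(), false);
        }
    }
}
```

With the SDK, you might pass NutrientException.class as the recoverable type, a labeled callable that calls setEnableAiLabeling(true) before Vision.set(document).detectForms(), and a detection-only callable that calls setEnableAiLabeling(false) first. Whether settings can safely change between calls on the same open document is an assumption; reopening the document for the fallback is the more conservative choice. The labeled flag lets downstream code know when labels are missing.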

Conclusion

The workflow for labeling form fields with a vision model is:

Open the source document using try-with-resources for automatic resource cleanup.
Select a provider with getVisionSettings().setProvider(...), configure the matching provider settings class, then call setEnableAiLabeling(true) on getFormLabelingSettings().
Create a vision instance with Vision.set().
Call detectForms() to detect every field, assign a semantic label, and export the result as JSON.
Write the JSON to a file for indexing, validation, or downstream processing.
Handle NutrientException for robust error recovery.

Labeling adds semantic meaning when a vision model is available. For offline detection with no model contacted, refer to the extract form fields from an image guide. To produce a fillable PDF instead of data, refer to the detect and add form fields guide.

For related image extraction workflows, refer to the Java SDK guides.

Download the sample package to explore form-field labeling.