Citations and confidence

The Nutrient DWS Data Extraction API extract endpoint can return per-field citation metadata. Citations connect each extracted value to its source location in the document. They include bounding boxes, match labels, and confidence signals you can use in review workflows.

Enable citations

Use options.includeCitations to enable or disable citations. The value defaults to true. Set it to false to skip citation computation and return an empty metadata object:

{
  "schema": { "type": "object", "properties": { "total": { "type": "number" } } },
  "options": { "includeCitations": false }
}
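As a minimal sketch, the request body above can be assembled in code so the citation flag is always explicit. The helper name `build_instructions` is hypothetical, and how the body is sent (endpoint URL, authentication, file upload) is deliberately left out because this section does not specify it.

```python
import json


def build_instructions(schema: dict, include_citations: bool = True) -> str:
    """Serialize an extraction request body with an explicit includeCitations flag.

    Hypothetical helper: only the "schema" and "options.includeCitations"
    keys come from the documentation above.
    """
    body = {
        "schema": schema,
        "options": {"includeCitations": include_citations},
    }
    return json.dumps(body)


schema = {"type": "object", "properties": {"total": {"type": "number"}}}
payload = build_instructions(schema, include_citations=False)
```

Passing the flag explicitly, even when it matches the default, makes it easier to see in logs whether a given response was expected to carry citation metadata.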

Citation structure

output.metadata mirrors the structure of output.data. Each scalar field maps to a citation object. Nested objects and arrays use the same nested structure as the extracted data. For example, the citation for data.line_items[0].price appears at metadata.line_items[0].price.

The following response shows citation metadata for an extracted invoice number:

{
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "total_amount": 1547.5
    },
    "metadata": {
      "invoice_number": {
        "bbox": { "x": 878, "y": 268, "width": 82, "height": 25 },
        "match": "id_match",
        "confidence": 0.93,
        "pageIndex": 0,
        "pageNumber": 1,
        "source_blocks": ["c5"],
        "source_bboxes": [
          {
            "bbox": { "x": 878, "y": 268, "width": 82, "height": 25 },
            "block_id": "c5",
            "pageIndex": 0,
            "pageNumber": 1
          }
        ]
      }
    },
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  }
}
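Because output.metadata mirrors output.data, a citation can be looked up with the same keys and indexes used to reach the value. The following is a minimal sketch of that lookup. The function name `citation_at` is hypothetical, and the `line_items` entry in the sample is invented for illustration.

```python
from typing import Any, Optional, Sequence, Union

PathStep = Union[str, int]


def citation_at(metadata: Any, path: Sequence[PathStep]) -> Optional[dict]:
    """Follow the keys and indexes of a data path through the metadata tree.

    Returns the object found at that path, or None when any step is missing.
    """
    node = metadata
    for step in path:
        if isinstance(step, int) and isinstance(node, list):
            if not 0 <= step < len(node):
                return None
            node = node[step]
        elif isinstance(step, str) and isinstance(node, dict):
            if step not in node:
                return None
            node = node[step]
        else:
            return None  # path shape does not match the metadata shape
    return node if isinstance(node, dict) else None


# Sample metadata; the line_items entry is made up for this example.
metadata = {
    "invoice_number": {"match": "id_match", "confidence": 0.93, "pageNumber": 1},
    "line_items": [{"price": {"match": "fuzzy_match", "pageNumber": 2}}],
}

price_citation = citation_at(metadata, ["line_items", 0, "price"])
missing_citation = citation_at(metadata, ["total_amount"])  # None: no entry
```

Returning None for a missing path, instead of raising, matches the sample response above, where `total_amount` appears in data without a corresponding metadata entry.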

Each citation can include the following fields.

Field	Description
`bbox`	Bounding box of the value on the page. Refer to coordinate space for details.
`match`	Grounding label that describes how the API located the value. Refer to match labels for details.
`confidence`	Composite confidence score from zero to one, when available. Refer to interpret confidence for details.
`confidenceComponents`	Per-signal confidence breakdown, when the engine produces it. Refer to confidence components for details.
`recognitionScore`	Field-level OCR recognition confidence — the minimum recognition confidence across the matched source blocks. Present only when a matched block carries a measured OCR confidence. The API omits it for born-digital text, `not_found`, and VLM-only extractions.
`pageIndex`	Zero-based page index of the value.
`pageNumber`	One-based page number of the value.
`source_blocks`	IDs of the source blocks the value came from.
`source_bboxes`	Bounding boxes of those source blocks, each with its `block_id` and page reference.

Match labels

The match field describes how the API grounded the extracted value to a source location in the document.

Label	Meaning
`id_match`	Matched a single source block exactly.
`id_match_multiblock`	Matched source text across multiple source blocks.
`id_match_partial`	Resolved some, but not all, of the cited source blocks.
`fuzzy_match`	Matched approximately — the value is close to, but not identical to, the source text.
`not_found`	The API couldn’t ground the value to a source location.

match is the clearest grounding signal for review logic. For example, route fields with fuzzy_match or not_found to human review.

Interpret confidence

The confidence score is a relative, uncalibrated signal from zero to one. A higher value means more confidence, but the score isn’t a probability or percentage. Don’t present it to users as one.

Treat the absence of a confidence value as “no score available,” not as low confidence. The field is only present when the extraction engine provides composite scoring; when no score is available, it’s omitted. Use match for an interpretable grounding outcome, and add a human review step for high-stakes fields, regardless of the score.

When the engine supports it, confidenceComponents breaks the score into individual signals.

Signal	Description
`probabilityScore`	Model token-probability signal for the extracted value.
`marginScore`	Margin between the top candidate and the next-best alternative.
`groundingScore`	Strength of the value’s grounding to a source location.
`formatScore`	How well the value conforms to its declared schema type or format.

Each of these signals is optional. If a signal is absent, the engine didn’t produce it for that field.

confidenceComponents also carries a source field, which is always present. It reports which model-confidence signals were available for the field. It describes the logprobs side of the score, not every component above.

`source` value	Meaning
`logprobs+margin`	Token probabilities and alternative-token margin were available.
`logprobs-only`	Token probabilities were available, but margin wasn’t.
`no-logprobs`	Token probabilities weren’t available. The score uses provider-independent signals, such as grounding and format, when present.
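The optional signals and the `source` value can be summarized together when logging or debugging a score. This is a sketch only: the function name `describe_components` and the wording of the notes are hypothetical, and the sample values are invented.

```python
SIGNALS = ("probabilityScore", "marginScore", "groundingScore", "formatScore")

# Short notes paraphrasing the source table above.
SOURCE_NOTES = {
    "logprobs+margin": "token probabilities and margin available",
    "logprobs-only": "token probabilities available, no margin",
    "no-logprobs": "no token probabilities; provider-independent signals only",
}


def describe_components(citation: dict) -> dict:
    """Report which confidence signals a citation carries and which are absent."""
    components = citation.get("confidenceComponents")
    if not isinstance(components, dict):
        # The engine produced no breakdown for this field.
        return {"available": False, "signals": {}, "missing": list(SIGNALS),
                "source_note": None}
    source = components.get("source")
    return {
        "available": True,
        "signals": {n: components[n] for n in SIGNALS if n in components},
        "missing": [n for n in SIGNALS if n not in components],
        "source_note": SOURCE_NOTES.get(source, f"unrecognized source: {source!r}"),
    }


# Invented sample: grounding and format signals only, no token probabilities.
sample = {
    "match": "id_match",
    "confidence": 0.81,
    "confidenceComponents": {
        "groundingScore": 0.9,
        "formatScore": 1.0,
        "source": "no-logprobs",
    },
}
summary = describe_components(sample)
```

An absent signal is listed under `missing` and never treated as zero, which follows the rule that absence means the engine did not produce the signal for that field.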

Coordinate space

Citation bounding boxes use the same top-left origin convention as the parse endpoint. (x, y) is the top-left corner. The x coordinate increases to the right, and the y coordinate increases downward. The unit depends on the matching output.pages entry.

When the page reports width and height, coordinates use render-space pixels on that same canvas.
When page dimensions aren’t available and the entry omits width and height, coordinates use PDF points instead.

Scale against the page’s width and height when present instead of assuming a fixed unit. For the full set of scaling formulas and overlay examples, refer to the coordinate spaces guide.
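The two cases can be handled in one helper, sketched below. The function name `scale_bbox` and its `target_width` and `dpi` parameters are hypothetical. The sketch assumes the target rendering preserves the page’s aspect ratio and applies no crop or rotation, and it relies on the standard definition of a PDF point as 1/72 inch. The coordinate spaces guide remains the reference for the full formulas.

```python
from typing import Optional

POINTS_PER_INCH = 72.0


def scale_bbox(bbox: dict, page: dict,
               target_width: Optional[float] = None,
               dpi: Optional[float] = None) -> dict:
    """Scale a citation bbox to pixels in your own rendering of the page.

    If the pages entry reports width and height, the bbox is in render-space
    pixels on that canvas, so scale by target_width / page width.
    Otherwise the bbox is in PDF points, so scale by dpi / 72.
    """
    width, height = page.get("width"), page.get("height")
    if width and height:
        if target_width is None:
            raise ValueError("target_width is required when the page reports dimensions")
        factor = target_width / width
    else:
        if dpi is None:
            raise ValueError("dpi is required when the page omits dimensions")
        factor = dpi / POINTS_PER_INCH
    # The origin is top-left in both cases, so no vertical flip is needed.
    return {key: bbox[key] * factor for key in ("x", "y", "width", "height")}


# The bbox and page entry from the sample response, drawn on a 600 px wide image.
bbox = {"x": 878, "y": 268, "width": 82, "height": 25}
page = {"page": 1, "width": 1200, "height": 1697}
overlay = scale_bbox(bbox, page, target_width=600)
# overlay == {"x": 439.0, "y": 134.0, "width": 41.0, "height": 12.5}
```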

Use citations for review

A common pattern is to send low-confidence or ungrounded fields to human review. The following examples show how to flag fields by match label and confidence score.

Python

def fields_needing_review(data: dict, metadata: dict) -> list[str]:
    """Return top-level field names whose citation suggests manual review."""
    flagged = []
    for field, citation in metadata.items():
        if not isinstance(citation, dict):
            continue  # array of nested citations: handle separately
        # A nested object has no match or confidence of its own, so it is
        # never flagged here; walk into it separately to check its fields.
        match = citation.get("match")
        confidence = citation.get("confidence")
        if match in ("fuzzy_match", "not_found"):
            flagged.append(field)
        elif confidence is not None and confidence < 0.7:
            flagged.append(field)
    return flagged

JavaScript

function fieldsNeedingReview(data, metadata) {
  const flagged = [];
  for (const [field, citation] of Object.entries(metadata)) {
    // typeof null is "object", so check for null before destructuring.
    if (citation === null || typeof citation !== "object" || Array.isArray(citation)) continue;
    const { match, confidence } = citation;
    if (match === "fuzzy_match" || match === "not_found") {
      flagged.push(field);
    } else if (confidence != null && confidence < 0.7) {
      flagged.push(field);
    }
  }
  return flagged;
}
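The functions above inspect only top-level entries. Because metadata mirrors nested objects and arrays, a recursive walk can apply the same rules at every depth. The sketch below assumes a citation can be recognized as an object whose `match` value is a string. That is a heuristic for illustration, not documented API behavior, and the function names are hypothetical.

```python
from typing import Any, Iterator


def is_citation(node: Any) -> bool:
    """Heuristic (assumption): a citation is a dict with a string "match"."""
    return isinstance(node, dict) and isinstance(node.get("match"), str)


def iter_citations(node: Any, path: str = "") -> Iterator[tuple[str, dict]]:
    """Yield (path, citation) pairs, such as ("line_items[0].price", {...})."""
    if is_citation(node):
        yield path, node
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from iter_citations(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_citations(child, f"{path}[{index}]")


def nested_fields_needing_review(metadata: dict, threshold: float = 0.7) -> list[str]:
    """Apply the match and confidence rules to every citation in the tree."""
    flagged = []
    for path, citation in iter_citations(metadata):
        confidence = citation.get("confidence")
        if citation["match"] in ("fuzzy_match", "not_found"):
            flagged.append(path)
        elif confidence is not None and confidence < threshold:
            flagged.append(path)
    return flagged


# Invented sample with one nested array of line items.
metadata = {
    "invoice_number": {"match": "id_match", "confidence": 0.93},
    "line_items": [
        {"price": {"match": "fuzzy_match"}},
        {"price": {"match": "id_match", "confidence": 0.5}},
    ],
}
flagged = nested_fields_needing_review(metadata)
# flagged == ["line_items[0].price", "line_items[1].price"]
```

Checking that `match` is a string, and not only that the key exists, keeps a schema field that happens to be named `match` from being mistaken for a citation.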

Pick the confidence threshold for your documents. Because the score is relative, calibrate it against a labeled sample instead of assuming a fixed cutoff.
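One way to calibrate is to compare candidate thresholds on a labeled sample and see how much review each one costs and how many wrong values it catches. The sketch below assumes you have pairs of confidence score and a correct/incorrect label for fields that carry a score. The function name `review_tradeoff` and the sample values are invented for illustration.

```python
def review_tradeoff(sample: list[tuple[float, bool]],
                    thresholds: list[float]) -> list[dict]:
    """For each threshold, report the review rate and the share of errors caught.

    sample holds (confidence, is_correct) pairs from a hand-labeled set.
    A field is sent to review when its confidence is below the threshold.
    """
    total = len(sample)
    wrong_total = sum(1 for _, correct in sample if not correct)
    rows = []
    for threshold in thresholds:
        flagged = [(c, ok) for c, ok in sample if c < threshold]
        wrong_flagged = sum(1 for _, ok in flagged if not ok)
        rows.append({
            "threshold": threshold,
            "review_rate": len(flagged) / total if total else 0.0,
            "errors_caught": wrong_flagged / wrong_total if wrong_total else 1.0,
        })
    return rows


# Invented labeled sample: (confidence, extracted value was correct).
labeled = [(0.95, True), (0.9, True), (0.8, False),
           (0.65, True), (0.6, False), (0.4, False)]
rows = review_tradeoff(labeled, [0.5, 0.7, 0.85])
```

On this sample, a threshold of 0.7 sends half of the fields to review and catches two of the three wrong values, while 0.85 catches all three at a higher review rate. Fields without a confidence score are outside this comparison and still need their own rule, such as the match label.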

Next steps

Use these guides to continue configuring schema-based extraction:

Refer to the define a schema guide to review field types and limits.
Refer to the parse configuration guide to control the parse stage that feeds extraction.
Refer to the coordinate spaces guide to map bounding boxes to rendered pages and screen pixels.
