Citations and confidence

The Nutrient DWS Data Extraction API extract endpoint can return per-field citation metadata. Citations connect each extracted value to its source location in the document. They include bounding boxes, match labels, and confidence signals you can use in review workflows.

Enable citations

Use options.includeCitations to enable or disable citations. The value defaults to true. Set it to false to skip citation computation and return an empty metadata object:

{
  "schema": { "type": "object", "properties": { "total": { "type": "number" } } },
  "options": { "includeCitations": false }
}
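As a minimal sketch, the request body above could be assembled in Python as follows. `build_extract_body` is a hypothetical helper, not part of any Nutrient SDK. How the body is sent (endpoint URL, authentication, attaching the document) is outside the scope of this sketch.

```python
import json


def build_extract_body(schema: dict, include_citations: bool = True) -> str:
    """Serialize a request body with options.includeCitations set explicitly.

    Hypothetical helper for illustration only.
    """
    body = {
        "schema": schema,
        "options": {"includeCitations": include_citations},
    }
    return json.dumps(body)


schema = {"type": "object", "properties": {"total": {"type": "number"}}}
body = build_extract_body(schema, include_citations=False)
```

Passing the flag explicitly, even when it matches the default, keeps the intent visible to whoever reads the calling code.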

Citation structure

output.metadata mirrors the structure of output.data. Each scalar field maps to a citation object. Nested objects and arrays use the same nested structure as the extracted data. For example, the citation for data.line_items[0].price appears at metadata.line_items[0].price.
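Because the two trees share a shape, one way to pair each value with its citation is to walk `data` and look up the same path in `metadata`. The sketch below assumes only that mirroring; `iter_citations` is a hypothetical helper name, and the sample values are made up for illustration.

```python
from typing import Any, Iterator


def iter_citations(data: Any, metadata: Any, path: str = "") -> Iterator[tuple[str, Any, Any]]:
    """Yield (path, value, citation) for every scalar field in data.

    The citation is None when metadata has no entry at that path,
    for example when citations were disabled.
    """
    if isinstance(data, dict):
        for key, value in data.items():
            child_meta = metadata.get(key) if isinstance(metadata, dict) else None
            child_path = f"{path}.{key}" if path else key
            yield from iter_citations(value, child_meta, child_path)
    elif isinstance(data, list):
        for index, value in enumerate(data):
            in_range = isinstance(metadata, list) and index < len(metadata)
            child_meta = metadata[index] if in_range else None
            yield from iter_citations(value, child_meta, f"{path}[{index}]")
    else:
        # Scalar leaf: whatever sits at the same path in metadata is its citation.
        yield path, data, metadata


# Illustrative sample, not real API output.
data = {"line_items": [{"price": 9.5}], "total": 9.5}
metadata = {
    "line_items": [{"price": {"match": "id_match"}}],
    "total": {"match": "fuzzy_match"},
}
paths = [p for p, _, _ in iter_citations(data, metadata)]
```

Walking `data` rather than `metadata` avoids having to guess whether a dictionary in `metadata` is a citation or a nested object.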

The following response shows citation metadata for an extracted invoice number:

{
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "total_amount": 1547.5
    },
    "metadata": {
      "invoice_number": {
        "bbox": { "x": 878, "y": 268, "width": 82, "height": 25 },
        "match": "id_match",
        "confidence": 0.93,
        "pageIndex": 0,
        "pageNumber": 1,
        "source_blocks": ["c5"],
        "source_bboxes": [
          {
            "bbox": { "x": 878, "y": 268, "width": 82, "height": 25 },
            "block_id": "c5",
            "pageIndex": 0,
            "pageNumber": 1
          }
        ]
      }
    },
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  }
}

Each citation can include the following fields.

| Field | Description |
| --- | --- |
| bbox | Bounding box of the value on the page. Refer to coordinate space for details. |
| match | Grounding label that describes how the API located the value. Refer to match labels for details. |
| confidence | Composite confidence score from zero to one, when available. Refer to interpret confidence for details. |
| confidenceComponents | Per-signal confidence breakdown, when the engine produces it. Refer to confidence components for details. |
| recognitionScore | Field-level OCR recognition confidence: the minimum recognition confidence across the matched source blocks. Present only when a matched block carries a measured OCR confidence. The API omits it for born-digital text, not_found, and VLM-only extractions. |
| pageIndex | Zero-based page index of the value. |
| pageNumber | One-based page number of the value. |
| source_blocks | IDs of the source blocks the value came from. |
| source_bboxes | Bounding boxes of those source blocks, each with its block_id and page reference. |

Match labels

The match field describes how the API grounded the extracted value to a source location in the document.

| Label | Meaning |
| --- | --- |
| id_match | Matched a single source block exactly. |
| id_match_multiblock | Matched source text across multiple source blocks. |
| id_match_partial | Resolved some, but not all, of the cited source blocks. |
| fuzzy_match | Matched approximately: the value is close to, but not identical to, the source text. |
| not_found | The API couldn’t ground the value to a source location. |

match is the clearest grounding signal for review logic. For example, route fields with fuzzy_match or not_found to human review.

Interpret confidence

The confidence score is a relative, uncalibrated signal from zero to one. A higher value means more confidence, but the score isn’t a probability or percentage. Don’t present it to users as one.

Treat the absence of a confidence value as “no score available,” not as low confidence. The field is only present when the extraction engine provides composite scoring; when no score is available, it’s omitted. Use match for an interpretable grounding outcome, and add a human review step for high-stakes fields, regardless of the score.

When the engine supports it, confidenceComponents breaks the score into individual signals.

| Signal | Description |
| --- | --- |
| probabilityScore | Model token-probability signal for the extracted value. |
| marginScore | Margin between the top candidate and the next-best alternative. |
| groundingScore | Strength of the value’s grounding to a source location. |
| formatScore | How well the value conforms to its declared schema type or format. |

Each of these signals is optional. If a signal is absent, the engine didn’t produce it for that field.

confidenceComponents also carries a source field, which is always present. It reports which model-confidence signals were available for the field. It describes the logprobs side of the score, not every component above.

| source value | Meaning |
| --- | --- |
| logprobs+margin | Token probabilities and alternative-token margin were available. |
| logprobs-only | Token probabilities were available, but margin wasn’t. |
| no-logprobs | Token probabilities weren’t available. The score uses provider-independent signals, such as grounding and format, when present. |
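The sketch below shows one way to report which signals a citation carries. It assumes the four signals sit as top-level keys of `confidenceComponents` next to `source`, which the tables above suggest but do not spell out. `summarize_components` is a hypothetical helper, and the numbers in the sample are invented for illustration.

```python
SIGNALS = ("probabilityScore", "marginScore", "groundingScore", "formatScore")


def summarize_components(citation: dict) -> dict:
    """Report which confidence signals are present and which are missing.

    Assumes the signals are flat keys inside confidenceComponents.
    """
    components = citation.get("confidenceComponents")
    if not isinstance(components, dict):
        # The engine produced no breakdown for this field.
        return {"source": None, "present": {}, "missing": list(SIGNALS)}
    present = {name: components[name] for name in SIGNALS if name in components}
    missing = [name for name in SIGNALS if name not in components]
    return {"source": components.get("source"), "present": present, "missing": missing}


# Illustrative citation, not real API output.
citation = {
    "match": "id_match",
    "confidenceComponents": {
        "source": "no-logprobs",
        "groundingScore": 0.9,
        "formatScore": 1.0,
    },
}
summary = summarize_components(citation)
```

A missing signal is reported as missing rather than as zero, in line with the guidance that absence means the engine didn’t produce it.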

Coordinate space

Citation bounding boxes use the same top-left origin convention as the parse endpoint. (x, y) is the top-left corner. The x coordinate increases to the right, and the y coordinate increases downward. The unit depends on the matching output.pages entry.

  • When the page reports width and height, coordinates use render-space pixels on that same canvas.
  • When page dimensions aren’t available, the entry omits width and height, and coordinates use PDF points instead.

Scale against the page’s width and height when present instead of assuming a fixed unit. For the full set of scaling formulas and overlay examples, refer to the coordinate spaces guide.
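As a rough sketch of that scaling, the function below maps a citation’s bbox onto a display canvas of a chosen width. It assumes a uniform scale factor (aspect ratio preserved) and that a citation’s `pageNumber` corresponds to the `page` value in `output.pages`, as in the sample response. `scale_bbox` is a hypothetical helper; the coordinate spaces guide remains the reference for the full formulas.

```python
from typing import Optional


def scale_bbox(citation: dict, pages: list, target_width: float) -> Optional[dict]:
    """Scale a citation bbox onto a canvas that is target_width units wide.

    Returns None when the page entry has no width and height, because the
    coordinates are then PDF points and need the PDF page size instead.
    """
    page = next((p for p in pages if p.get("page") == citation.get("pageNumber")), None)
    bbox = citation.get("bbox")
    if page is None or bbox is None:
        return None
    if "width" not in page or "height" not in page:
        return None
    # Uniform scale (an assumption): the same factor applies to both axes.
    scale = target_width / page["width"]
    return {key: bbox[key] * scale for key in ("x", "y", "width", "height")}


# Values taken from the sample response in this guide.
citation = {"bbox": {"x": 878, "y": 268, "width": 82, "height": 25}, "pageNumber": 1}
pages = [{"page": 1, "width": 1200, "height": 1697}]
scaled = scale_bbox(citation, pages, target_width=600)
# scaled == {"x": 439.0, "y": 134.0, "width": 41.0, "height": 12.5}
```

Because the origin is the top-left corner in both spaces, no axis flip is needed when drawing the scaled box on a typical screen canvas.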

Use citations for review

A common pattern is to send low-confidence or ungrounded fields to human review. The following example shows how to flag top-level fields by match label and confidence score.

def fields_needing_review(data: dict, metadata: dict) -> list[str]:
    """Return top-level field names whose citation suggests manual review."""
    flagged = []
    for field, citation in metadata.items():
        if not isinstance(citation, dict):
            # Array metadata is skipped here. Nested objects and arrays
            # need their own traversal.
            continue
        match = citation.get("match")
        confidence = citation.get("confidence")
        if match in ("fuzzy_match", "not_found"):
            # Approximate or ungrounded values always go to review.
            flagged.append(field)
        elif confidence is not None and confidence < 0.7:
            # A missing score means "no score available", not low confidence.
            flagged.append(field)
    return flagged

Pick the confidence threshold for your documents. Because the score is relative, calibrate it against a labeled sample instead of assuming a fixed cutoff.
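One possible calibration approach, sketched under stated assumptions: collect `(confidence, is_correct)` pairs from a hand-labeled sample and pick the lowest cutoff at which the fields you would accept still meet a target precision. `pick_threshold`, the target value, and the sample pairs are all hypothetical. Fields without a confidence value should be left out of the sample.

```python
def pick_threshold(samples: list, target_precision: float = 0.95):
    """Return the lowest cutoff whose accepted fields meet target_precision.

    samples holds (confidence, is_correct) pairs from a labeled set.
    A field counts as accepted when its confidence is >= the cutoff.
    Returns None when no cutoff reaches the target.
    """
    best = None
    for cutoff in sorted({conf for conf, _ in samples}, reverse=True):
        accepted = [ok for conf, ok in samples if conf >= cutoff]
        precision = sum(accepted) / len(accepted)
        if precision >= target_precision:
            best = cutoff
    return best


# Invented labeled sample for illustration.
samples = [(0.95, True), (0.9, True), (0.8, True), (0.6, False), (0.5, True)]
threshold = pick_threshold(samples, target_precision=0.9)
# threshold == 0.8: accepting everything at or above 0.8 yields 3 correct out of 3.
```

A small sample gives a noisy cutoff, so it is worth re-running the calibration as more labeled documents accumulate.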

Next steps

Use these guides to continue configuring schema-based extraction: