---
title: "Citations and confidence"
canonical_url: "https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence/"
md_url: "https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence.md"
last_updated: "2026-06-11T00:00:00.000Z"
description: "Ground extracted values back to the source document with per-field citations, bounding boxes, match labels, and confidence signals."
---

# Citations and confidence

The Nutrient DWS Data Extraction API extract endpoint can return per-field citation metadata. Citations connect each extracted value to its source location in the document. They include bounding boxes, match labels, and confidence signals you can use in review workflows.

## Enable citations

Use `options.includeCitations` to enable or disable citations. The value defaults to `true`. Set it to `false` to skip citation computation and return an empty `metadata` object:

```json

{
  "schema": { "type": "object", "properties": { "total": { "type": "number" } } },
  "options": { "includeCitations": false }
}

```

## Citation structure

`output.metadata` mirrors the structure of `output.data`. Each scalar field maps to a citation object. Nested objects and arrays use the same nested structure as the extracted data. For example, the citation for `data.line_items[0].price` appears at `metadata.line_items[0].price`.

The following response shows citation metadata for an extracted invoice number:

```json

{
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "total_amount": 1547.5
    },
    "metadata": {
      "invoice_number": {
        "bbox": { "x": 878, "y": 268, "width": 82, "height": 25 },
        "match": "id_match",
        "confidence": 0.93,
        "pageIndex": 0,
        "pageNumber": 1,
        "source_blocks": ["c5"],
        "source_bboxes": [
          {
            "bbox": { "x": 878, "y": 268, "width": 82, "height": 25 },
            "block_id": "c5",
            "pageIndex": 0,
            "pageNumber": 1
          }
        ]
      }
    },
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  }
}

```
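Because `metadata` mirrors `data`, the path that reaches a value also reaches its citation. The following is a minimal sketch of that lookup. `citation_at` is a hypothetical helper, not part of the API, and the response is a trimmed copy of the one above:

```python
from typing import Any, Optional, Sequence, Union

PathPart = Union[str, int]


def citation_at(metadata: Any, path: Sequence[PathPart]) -> Optional[dict]:
    """Follow the keys and indexes used in `data` through `metadata`.

    Hypothetical helper: returns None when the path doesn't exist.
    """
    node = metadata
    for part in path:
        if isinstance(part, int) and isinstance(node, list):
            if not 0 <= part < len(node):
                return None
            node = node[part]
        elif isinstance(part, str) and isinstance(node, dict):
            if part not in node:
                return None
            node = node[part]
        else:
            return None  # path shape doesn't match the metadata shape
    return node if isinstance(node, dict) else None


# Trimmed copy of the response shown above.
response = {
    "output": {
        "data": {"invoice_number": "INV-2024-0042", "total_amount": 1547.5},
        "metadata": {
            "invoice_number": {
                "bbox": {"x": 878, "y": 268, "width": 82, "height": 25},
                "match": "id_match",
                "confidence": 0.93,
                "pageNumber": 1,
            }
        },
    }
}

metadata = response["output"]["metadata"]
citation = citation_at(metadata, ["invoice_number"])
print(citation["match"], citation["pageNumber"])

# A path that isn't in this response returns None instead of raising.
print(citation_at(metadata, ["line_items", 0, "price"]))
```

Returning `None` for a missing path keeps the caller's handling the same whether citations are disabled or a field has no citation entry.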

Each citation can include the following fields.

| Field                  | Description                                                                                                                                                                                                                                                       |
| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `bbox`                 | Bounding box of the value on the page. Refer to [coordinate space](#coordinate-space) for details.                                                                                                                                                                                 |
| `match`                | Grounding label that describes how the API located the value. Refer to [match labels](#match-labels) for details.                                                                                                                                                              |
| `confidence`           | Composite confidence score from zero to one, when available. Refer to [interpret confidence](#interpret-confidence) for details.                                                                                                                                                       |
| `confidenceComponents` | Per-signal confidence breakdown, when the engine produces it. Refer to [confidence components](#interpret-confidence) for details.                                                                                                                                                     |
| `recognitionScore`     | Field-level OCR recognition confidence — the minimum recognition confidence across the matched source blocks. Present only when a matched block carries a measured OCR confidence. The API omits it for born-digital text, `not_found`, and VLM-only extractions. |
| `pageIndex`            | Zero-based page index of the value.                                                                                                                                                                                                                               |
| `pageNumber`           | One-based page number of the value.                                                                                                                                                                                                                               |
| `source_blocks`        | IDs of the source blocks the value came from.                                                                                                                                                                                                                     |
| `source_bboxes`        | Bounding boxes of those source blocks, each with its `block_id` and page reference.                                                                                                                                                                               |

## Match labels

The `match` field describes how the API grounded the extracted value to a source location in the document.

| Label                 | Meaning                                                                               |
| --------------------- | ------------------------------------------------------------------------------------- |
| `id_match`            | Matched a single source block exactly.                                                |
| `id_match_multiblock` | Matched source text across multiple source blocks.                                    |
| `id_match_partial`    | Resolved some, but not all, of the cited source blocks.                               |
| `fuzzy_match`         | Matched approximately — the value is close to, but not identical to, the source text. |
| `not_found`           | The API couldn’t ground the value to a source location.                               |

`match` is the clearest grounding signal for review logic. For example, route fields with `fuzzy_match` or `not_found` to human review.

## Interpret confidence

The `confidence` score is a relative, uncalibrated signal from zero to one. A higher value means more confidence, but the score isn’t a probability or percentage. Don’t present it to users as one.

Treat the absence of a `confidence` value as “no score available,” not as low confidence. The API includes the field only when the extraction engine provides composite scoring and omits it otherwise. Use `match` for an interpretable grounding outcome, and add a human review step for high-stakes fields regardless of the score.

When the engine supports it, `confidenceComponents` breaks the score into individual signals.

| Signal             | Description                                                        |
| ------------------ | ------------------------------------------------------------------ |
| `probabilityScore` | Model token-probability signal for the extracted value.            |
| `marginScore`      | Margin between the top candidate and the next-best alternative.    |
| `groundingScore`   | Strength of the value’s grounding to a source location.            |
| `formatScore`      | How well the value conforms to its declared schema type or format. |

Each of these signals is optional. If a signal is absent, the engine didn’t produce it for that field.

`confidenceComponents` also carries a `source` field, which is always present. It reports which model-confidence signals were available for the field. It covers only the logprobs side of the score (token probabilities and margin), not the grounding or format components.

| `source` value    | Meaning                                                                                                                         |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `logprobs+margin` | Token probabilities and alternative-token margin were available.                                                                |
| `logprobs-only`   | Token probabilities were available, but margin wasn’t.                                                                          |
| `no-logprobs`     | Token probabilities weren’t available. The score uses provider-independent signals, such as grounding and format, when present. |
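Since every signal is optional, code that reads `confidenceComponents` should avoid substituting defaults for missing values. The sketch below shows one way to summarize a citation. `describe_confidence` is a hypothetical helper, and the numbers in the sample citation are made up for illustration:

```python
SIGNALS = ("probabilityScore", "marginScore", "groundingScore", "formatScore")


def describe_confidence(citation: dict) -> dict:
    """Summarize a citation's confidence fields without inventing defaults."""
    components = citation.get("confidenceComponents") or {}
    return {
        # None means "no score available", not "low confidence".
        "confidence": citation.get("confidence"),
        "source": components.get("source"),
        "signals": {name: components[name] for name in SIGNALS if name in components},
        "missing_signals": [name for name in SIGNALS if name not in components],
    }


# Illustrative citation; the values are invented for this example.
sample_citation = {
    "match": "id_match",
    "confidence": 0.93,
    "confidenceComponents": {
        "source": "logprobs-only",
        "probabilityScore": 0.97,
        "groundingScore": 1.0,
    },
}

summary = describe_confidence(sample_citation)
print(summary["source"], summary["missing_signals"])
```

Keeping present and missing signals separate lets review tooling show "not produced" instead of a misleading zero.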

## Coordinate space

Citation bounding boxes use the same top-left origin convention as the [parse endpoint](https://www.nutrient.io/guides/dws-data-extraction/parsing.md). `(x, y)` is the top-left corner. The x coordinate increases to the right, and the y coordinate increases downward. The unit depends on the matching `output.pages` entry.

- When the page reports `width` and `height`, coordinates use render-space pixels on that same canvas.

- When page dimensions aren’t available and the entry omits `width` and `height`, coordinates use PDF points instead.

Scale against the page’s `width` and `height` when present instead of assuming a fixed unit. For the full set of scaling formulas and overlay examples, refer to the [coordinate spaces](https://www.nutrient.io/guides/dws-data-extraction/parsing/coordinate-spaces.md) guide.
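As a simplified sketch of the two cases above, the following helper scales a citation `bbox` for an overlay. `find_page` and `bbox_for_display` are hypothetical names, and the code assumes that `page` in `output.pages` holds the same one-based number as `pageNumber`. When dimensions are missing, it converts PDF points (1/72 inch) to pixels at a chosen DPI and ignores `display_width`:

```python
from typing import Optional

PDF_POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def find_page(pages: list, page_number: int) -> Optional[dict]:
    """Find the output.pages entry for a citation.

    Assumption: `page` matches the citation's one-based `pageNumber`.
    """
    return next((p for p in pages if p.get("page") == page_number), None)


def bbox_for_display(citation: dict, pages: list,
                     display_width: float, dpi: float = 96.0) -> dict:
    """Scale a citation bbox for an overlay (hypothetical helper)."""
    page = find_page(pages, citation["pageNumber"])
    if page and page.get("width") and page.get("height"):
        # Render-space pixels: scale by display width over page width.
        factor = display_width / page["width"]
    else:
        # PDF points: convert to pixels at the chosen DPI.
        factor = dpi / PDF_POINTS_PER_INCH
    bbox = citation["bbox"]
    return {key: bbox[key] * factor for key in ("x", "y", "width", "height")}


citation = {
    "bbox": {"x": 878, "y": 268, "width": 82, "height": 25},
    "pageNumber": 1,
}
pages = [{"page": 1, "width": 1200, "height": 1697}]

# Page rendered at half size: 600 px wide instead of 1200.
print(bbox_for_display(citation, pages, display_width=600))
# {'x': 439.0, 'y': 134.0, 'width': 41.0, 'height': 12.5}
```

The origin stays at the top-left corner in both cases, so only the scale factor changes.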

## Use citations for review

A common pattern is to send low-confidence or ungrounded fields to human review. The following examples show how to flag fields by `match` label and `confidence` score.

### Python

```python

def fields_needing_review(data: dict, metadata: dict) -> list[str]:
    """Return field names whose citation suggests manual review."""
    flagged = []
    for field, citation in metadata.items():
        # Arrays are skipped here. Nested objects are dicts too, but they carry
        # no `match` of their own, so walk both separately.
        if not isinstance(citation, dict):
            continue

        match = citation.get("match")
        confidence = citation.get("confidence")
        if match in ("fuzzy_match", "not_found"):
            flagged.append(field)
        elif confidence is not None and confidence < 0.7:
            flagged.append(field)
    return flagged

```
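The function above checks only the top level of `metadata`. For schemas with nested objects or arrays, a recursive walk can apply the same rules to every citation. This is a sketch under one assumption: a citation object is recognized as a dict with a string `match` label. `is_citation`, `iter_citations`, and `nested_fields_needing_review` are hypothetical names:

```python
from typing import Any, Iterator, List, Tuple


def is_citation(node: Any) -> bool:
    # Assumption: a citation is a dict that carries a string `match` label.
    return isinstance(node, dict) and isinstance(node.get("match"), str)


def iter_citations(node: Any, path: str = "") -> Iterator[Tuple[str, dict]]:
    """Yield (path, citation) pairs, such as ("line_items[0].price", {...})."""
    if is_citation(node):
        yield path, node
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from iter_citations(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_citations(child, f"{path}[{index}]")


def nested_fields_needing_review(metadata: dict, threshold: float = 0.7) -> List[str]:
    """Apply the same review rules to every citation, at any depth."""
    flagged = []
    for path, citation in iter_citations(metadata):
        confidence = citation.get("confidence")
        if citation["match"] in ("fuzzy_match", "not_found"):
            flagged.append(path)
        elif confidence is not None and confidence < threshold:
            flagged.append(path)
    return flagged


# Illustrative metadata; the values are invented for this example.
sample_metadata = {
    "invoice_number": {"match": "id_match", "confidence": 0.93},
    "vendor": {"name": {"match": "fuzzy_match", "confidence": 0.81}},
    "line_items": [
        {"price": {"match": "id_match", "confidence": 0.55}},
        {"price": {"match": "id_match"}},  # no score available
    ],
}

print(nested_fields_needing_review(sample_metadata))
# ['vendor.name', 'line_items[0].price']
```

The second line item has no `confidence` value, so it isn't flagged: a missing score means "no score available," not low confidence.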

### JavaScript

```javascript

function fieldsNeedingReview(data, metadata) {
  const flagged = [];
  for (const [field, citation] of Object.entries(metadata)) {
    // Skip null, primitives, and arrays; nested values need their own handling.
    if (citation === null || typeof citation !== "object" || Array.isArray(citation)) continue;
    const { match, confidence } = citation;
    if (match === "fuzzy_match" || match === "not_found") {
      flagged.push(field);
    } else if (confidence != null && confidence < 0.7) {
      flagged.push(field);
    }
  }
  return flagged;
}

```

Pick the confidence threshold for your documents. Because the score is relative, calibrate it against a labeled sample instead of assuming a fixed cutoff.
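One way to calibrate is to collect `(confidence, was_correct)` pairs from a hand-checked sample and pick the lowest cutoff at which the accepted fields meet a target precision. The sketch below illustrates that idea. `pick_threshold` is a hypothetical helper, the sample values are invented, and precision isn't guaranteed to improve steadily as the cutoff rises, so treat the result as a starting point:

```python
from typing import Iterable, Optional, Tuple

Sample = Tuple[Optional[float], bool]  # (confidence, was the value correct?)


def pick_threshold(samples: Iterable[Sample],
                   target_precision: float = 0.98) -> Optional[float]:
    """Return the lowest cutoff whose accepted fields meet the target precision.

    A field counts as accepted when confidence >= cutoff, which matches the
    `confidence < threshold` review rule. Fields without a score are ignored.
    """
    scored = [(c, ok) for c, ok in samples if c is not None]
    for cutoff in sorted({c for c, _ in scored}):
        accepted = [ok for c, ok in scored if c >= cutoff]
        precision = sum(accepted) / len(accepted)
        if precision >= target_precision:
            return cutoff
    return None  # no cutoff reaches the target on this sample


# Invented labeled sample for illustration only.
labeled = [
    (0.95, True), (0.90, True), (0.85, True),
    (0.80, False), (0.75, True), (0.60, False),
    (None, True),  # no score available; excluded from calibration
]

print(pick_threshold(labeled, target_precision=0.75))  # 0.75
print(pick_threshold(labeled, target_precision=1.0))   # 0.85
```

Fields without a score need their own rule, such as routing by `match` alone.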

## Next steps

Use these guides to continue configuring schema-based extraction:

- Refer to the [define a schema](https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema.md) guide to review field types and limits.

- Refer to the [parse configuration](https://www.nutrient.io/guides/dws-data-extraction/extract/parse-configuration.md) guide to control the parse stage that feeds extraction.

- Refer to the [coordinate spaces](https://www.nutrient.io/guides/dws-data-extraction/parsing/coordinate-spaces.md) guide to map bounding boxes to rendered pages and screen pixels.

---

## Related pages

- [Extract endpoint](/guides/dws-data-extraction/extract.md)
- [Define a schema](/guides/dws-data-extraction/extract/define-a-schema.md)
- [Parse configuration](/guides/dws-data-extraction/extract/parse-configuration.md)

