Extract document elements

When output.format is set to spatial, the Data Extraction API returns a flat list of typed document elements. Each element includes its type, text content, spatial coordinates, detection confidence, and page reference.

Basic element extraction

Send a document and receive structured spatial elements.

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@document.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"}}'
  # Extract data from the JSON response.

import requests

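# Open the file in a context manager so the handle is closed after the upload.
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.nutrient.io/extraction/parse",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": f},
        data={
            "instructions": '{"mode":"understand","output":{"format":"spatial"}}'
        },
    )

# Fail early on HTTP errors instead of parsing an error body as a result.
response.raise_for_status()
result = response.json()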
for element in result["output"]["elements"]:
    print(f'{element["type"]}: {element.get("text", "")}')

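import { readFile } from "node:fs/promises";

// The built-in FormData accepts Blob or string values, not Node.js read
// streams, so load the file into a Blob before appending it.
const form = new FormData();
form.append("file", new Blob([await readFile("document.pdf")]), "document.pdf");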
form.append(
  "instructions",
  JSON.stringify({ mode: "understand", output: { format: "spatial" } }),
);

const response = await fetch("https://api.nutrient.io/extraction/parse", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

const result = await response.json();
result.output.elements.forEach((el) => {
  console.log(`${el.type}: ${el.text || ""}`);
});

Element types

The API returns six element types. All types are available in structure, understand, and agentic modes (text mode does not support spatial output).

Type	Description	Key fields
`paragraph`	Text content with semantic role	`text`, `role`, `words`
`table`	Structured table with cell data	`rowCount`, `columnCount`, `cells`
`formula`	Mathematical expression	`latex`
`picture`	Image, chart, or diagram	`classification`, `altDescription`
`keyValueRegion`	Form fields and key-value pairs	`pairs`
`handwriting`	Handwritten text content	`text`, `words`

Common fields

Every element includes these fields:

Field	Type	Description
`id`	string	Unique identifier (UUID)
`type`	string	Element type (`paragraph`, `table`, `formula`, `picture`, `keyValueRegion`, `handwriting`)
`bounds`	object	Bounding box with `x`, `y`, `width`, `height`. Origin at top-left. See coordinate spaces.
`confidence`	number	Detection confidence between 0 and 1
`readingOrder`	integer	Position in the page-reading sequence
`page`	object	Source page with `pageIndex` (0-based), `pageNumber` (1-based integer), `width`, and `height`
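
As a rough illustration of how the common fields fit together, the following Python sketch groups elements by page, drops low-confidence detections, and sorts each page by `readingOrder`. The helper name `elements_in_reading_order`, the `min_confidence` parameter, and the sample data are assumptions made for this example, not part of the API.

```python
from collections import defaultdict


def elements_in_reading_order(elements, min_confidence=0.0):
    """Group elements by page index and sort each page by readingOrder.

    min_confidence is a hypothetical client-side filter, not an API option.
    """
    pages = defaultdict(list)
    for el in elements:
        if el["confidence"] >= min_confidence:
            pages[el["page"]["pageIndex"]].append(el)
    return {
        index: sorted(els, key=lambda el: el["readingOrder"])
        for index, els in sorted(pages.items())
    }


# Illustrative sample data shaped like the common fields above.
sample = [
    {"id": "a", "type": "table", "confidence": 0.92, "readingOrder": 2,
     "page": {"pageIndex": 0, "pageNumber": 1}},
    {"id": "b", "type": "paragraph", "confidence": 0.95, "readingOrder": 0,
     "page": {"pageIndex": 0, "pageNumber": 1}},
    {"id": "c", "type": "handwriting", "confidence": 0.40, "readingOrder": 1,
     "page": {"pageIndex": 0, "pageNumber": 1}},
    {"id": "d", "type": "paragraph", "confidence": 0.90, "readingOrder": 0,
     "page": {"pageIndex": 1, "pageNumber": 2}},
]

ordered = elements_in_reading_order(sample, min_confidence=0.5)
for page_index, els in ordered.items():
    print(page_index, [el["id"] for el in els])
```

Sorting per page rather than globally follows the description of `readingOrder` as a position within the page-reading sequence.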

Paragraph elements

Paragraphs cover all text content. The role field identifies the semantic function:

{
  "type": "paragraph",
  "role": "SectionHeader",
  "text": "Revenue Summary",
  "confidence": 0.95,
  "readingOrder": 0,
  "bounds": { "x": 100, "y": 50, "width": 400, "height": 35 },
  "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 }
}

Available roles:

Role	Description
`Text`	Body paragraphs
`Title`	Document title
`SectionHeader`	Section headings
`Header`	Running page headers
`Footer`	Running page footers
`Caption`	Figure or table captions
`Footnote`	Footnotes
`ListItem`	List items (ordered or unordered)
`PageNumber`	Page number labels
`Code`	Code blocks
`CheckboxSelected`	Selected checkbox
`CheckboxUnselected`	Unselected checkbox

The role is null when the API cannot determine the semantic function.
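
One possible use of roles is building a simple outline from titles and section headers. The sketch below is a minimal example under that assumption; `build_outline` and the sample elements are hypothetical, and a null role is simply skipped.

```python
def build_outline(elements):
    """Collect (role, text) pairs for Title and SectionHeader paragraphs."""
    outline = []
    for el in elements:
        if el.get("type") != "paragraph":
            continue
        role = el.get("role")  # None when the role could not be determined
        if role in ("Title", "SectionHeader"):
            outline.append((role, el.get("text", "")))
    return outline


# Illustrative sample data; only the fields used here are included.
sample = [
    {"type": "paragraph", "role": "Title", "text": "Annual Report"},
    {"type": "paragraph", "role": "Text", "text": "This report covers..."},
    {"type": "paragraph", "role": None, "text": "Unclassified fragment"},
    {"type": "table", "rowCount": 3, "columnCount": 3, "cells": []},
    {"type": "paragraph", "role": "SectionHeader", "text": "Revenue Summary"},
]

outline = build_outline(sample)
for role, text in outline:
    print(f"{role}: {text}")
```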

Table elements

Tables include row and column counts, plus cell-level data with text, bounds, and span information:

{
  "type": "table",
  "confidence": 0.92,
  "readingOrder": 2,
  "bounds": { "x": 100, "y": 150, "width": 600, "height": 120 },
  "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
  "rowCount": 3,
  "columnCount": 3,
  "cells": [
    {
      "id": "c-001",
      "bounds": { "x": 100, "y": 150, "width": 200, "height": 30 },
      "confidence": 0.94,
      "row": 0,
      "column": 0,
      "rowSpan": 1,
      "colSpan": 1,
      "text": "Region"
    }
  ],
  "captionIds": null,
  "footnoteIds": null
}

Each cell includes row, column, rowSpan, and colSpan for reconstructing the table layout. captionIds and footnoteIds reference associated paragraph elements by their id.
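
The layout reconstruction described above could look like the following sketch. It assumes `row` and `column` are 0-based (as in the example cell) and that a spanning cell's text should be repeated in every grid position it covers; `table_to_grid` and the sample table are hypothetical.

```python
def table_to_grid(table):
    """Expand a table element into a rowCount x columnCount grid of cell text."""
    rows, cols = table["rowCount"], table["columnCount"]
    grid = [[None] * cols for _ in range(rows)]
    for cell in table["cells"]:
        row_end = cell["row"] + cell.get("rowSpan", 1)
        col_end = cell["column"] + cell.get("colSpan", 1)
        # Repeat the text across every position the cell spans,
        # clamping to the declared table size.
        for r in range(cell["row"], min(row_end, rows)):
            for c in range(cell["column"], min(col_end, cols)):
                grid[r][c] = cell["text"]
    return grid


# Illustrative sample: a header cell spanning two columns.
sample_table = {
    "type": "table",
    "rowCount": 2,
    "columnCount": 3,
    "cells": [
        {"row": 0, "column": 0, "rowSpan": 1, "colSpan": 1, "text": "Region"},
        {"row": 0, "column": 1, "rowSpan": 1, "colSpan": 2, "text": "Revenue"},
        {"row": 1, "column": 0, "rowSpan": 1, "colSpan": 1, "text": "EMEA"},
        {"row": 1, "column": 1, "rowSpan": 1, "colSpan": 1, "text": "10"},
        {"row": 1, "column": 2, "rowSpan": 1, "colSpan": 1, "text": "12"},
    ],
}

grid = table_to_grid(sample_table)
for row in grid:
    print(row)
```

Positions not covered by any cell stay `None`, which makes gaps in the detected table easy to spot.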

Formula elements

Formulas contain a LaTeX representation of the detected mathematical expression:

{
  "type": "formula",
  "confidence": 0.88,
  "readingOrder": 3,
  "bounds": { "x": 100, "y": 300, "width": 250, "height": 40 },
  "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
  "latex": "r = r_0 e^{kt}"
}

Picture elements

Pictures include a classification with its own confidence score (classificationConfidence), plus an AI-generated alt text description:

{
  "type": "picture",
  "confidence": 0.89,
  "readingOrder": 2,
  "bounds": { "x": 100, "y": 300, "width": 400, "height": 300 },
  "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
  "classification": "chart",
  "classificationConfidence": 0.91,
  "altDescription": "Bar chart showing quarterly revenue growth across regions",
  "captionIds": ["d1e2f3a4-4444-4000-8000-000000000004"],
  "footnoteIds": null
}
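
To show how `captionIds` might be resolved, the sketch below looks up the referenced elements by `id` in the same element list. It assumes picture captions follow the same convention as table captions (references to paragraph elements by `id`); `captions_for` and the caption text are hypothetical.

```python
def captions_for(element, elements):
    """Return the text of each element referenced by element['captionIds']."""
    by_id = {el["id"]: el for el in elements if "id" in el}
    caption_ids = element.get("captionIds") or []  # captionIds may be null
    return [by_id[cid].get("text", "") for cid in caption_ids if cid in by_id]


# Illustrative sample data; the caption text is made up for this example.
picture = {
    "id": "pic-001",
    "type": "picture",
    "classification": "chart",
    "captionIds": ["d1e2f3a4-4444-4000-8000-000000000004"],
    "footnoteIds": None,
}
caption = {
    "id": "d1e2f3a4-4444-4000-8000-000000000004",
    "type": "paragraph",
    "role": "Caption",
    "text": "Figure 1: Quarterly revenue by region",
}
elements = [picture, caption]

print(captions_for(picture, elements))
```

The same lookup would apply to `footnoteIds` by reading that field instead.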

Key-value region elements

Key-value regions detect form fields and structured label-value pairs:

{
  "type": "keyValueRegion",
  "confidence": 0.87,
  "readingOrder": 4,
  "bounds": { "x": 100, "y": 700, "width": 500, "height": 100 },
  "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
  "pairs": [
    {
      "id": "kvp-001",
      "key": {
        "id": "kve-001",
        "bounds": { "x": 100, "y": 700, "width": 150, "height": 25 },
        "confidence": 0.92,
        "entityType": "QUESTION",
        "value": "Invoice Number"
      },
      "value": {
        "id": "kve-002",
        "bounds": { "x": 260, "y": 700, "width": 200, "height": 25 },
        "confidence": 0.95,
        "entityType": "ANSWER",
        "value": "INV-2024-0042"
      },
      "relationshipConfidence": 0.93
    }
  ]
}
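
A key-value region can be flattened into a plain dictionary for downstream use. The following is a minimal sketch; `pairs_to_dict` and its `min_relationship_confidence` threshold are hypothetical, and handling of a missing value side is an assumption rather than documented behavior.

```python
def pairs_to_dict(region, min_relationship_confidence=0.5):
    """Map each key's text to its value's text for sufficiently confident pairs."""
    result = {}
    for pair in region.get("pairs") or []:
        if pair.get("relationshipConfidence", 0.0) < min_relationship_confidence:
            continue
        key, value = pair.get("key"), pair.get("value")
        if not key:
            continue
        # Assumption: a pair may lack a value (for example, an empty form field).
        result[key["value"]] = value["value"] if value else None
    return result


# Illustrative sample based on the element shown above, plus a weak pair.
sample_region = {
    "type": "keyValueRegion",
    "pairs": [
        {
            "key": {"entityType": "QUESTION", "value": "Invoice Number"},
            "value": {"entityType": "ANSWER", "value": "INV-2024-0042"},
            "relationshipConfidence": 0.93,
        },
        {
            "key": {"entityType": "QUESTION", "value": "Due Date"},
            "value": {"entityType": "ANSWER", "value": "2024-03-01"},
            "relationshipConfidence": 0.30,
        },
    ],
}

fields = pairs_to_dict(sample_region)
print(fields)
```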

Handwriting elements

Handwriting elements contain extracted handwritten text. Like paragraphs, they support optional word-level optical character recognition (OCR) data via includeWords:

{
  "type": "handwriting",
  "confidence": 0.78,
  "readingOrder": 5,
  "bounds": { "x": 30, "y": 320, "width": 200, "height": 30 },
  "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
  "text": "John Doe",
  "words": null
}

When includeWords is true, the words array contains per-word bounds and confidence, in the same format as paragraph word-level data.

Handwriting accuracy depends heavily on writing style. The character-level OCR used in structure and understand modes can recognize neatly printed handwriting, but it often returns poor results on cursive or free-form writing. For those documents, use agentic mode, whose vision language model (VLM) interprets whole words and lines for more reliable results. See processing modes for guidance.
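
One way to act on this guidance is to flag pages whose handwriting elements have low confidence and consider reprocessing those documents in agentic mode. The sketch below assumes a client-side cutoff; `pages_needing_review` and the `threshold` of 0.8 are hypothetical choices, not API behavior.

```python
def pages_needing_review(elements, threshold=0.8):
    """Return sorted 1-based page numbers with low-confidence handwriting."""
    pages = {
        el["page"]["pageNumber"]
        for el in elements
        if el["type"] == "handwriting" and el["confidence"] < threshold
    }
    return sorted(pages)


# Illustrative sample data; only the fields used here are included.
sample = [
    {"type": "handwriting", "confidence": 0.78, "text": "John Doe",
     "page": {"pageIndex": 0, "pageNumber": 1}},
    {"type": "handwriting", "confidence": 0.93, "text": "APPROVED",
     "page": {"pageIndex": 1, "pageNumber": 2}},
    {"type": "paragraph", "confidence": 0.50, "text": "Faint print",
     "page": {"pageIndex": 2, "pageNumber": 3}},
]

review_pages = pages_needing_review(sample)
print(review_pages)
```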

Word-level data

Set output.includeWords to true to get word-level OCR data nested inside paragraph, handwriting, and table cell elements.

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@document.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial","includeWords":true}}'

import requests

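# Open the file in a context manager so the handle is closed after the upload.
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.nutrient.io/extraction/parse",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": f},
        data={
            "instructions": '{"mode":"understand","output":{"format":"spatial","includeWords":true}}'
        },
    )

# Fail early on HTTP errors instead of parsing an error body as a result.
response.raise_for_status()
result = response.json()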
for element in result["output"]["elements"]:
    if element.get("words"):
        for word in element["words"]:
            print(f'{word["text"]} (confidence: {word["confidence"]})')

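import { readFile } from "node:fs/promises";

// The built-in FormData accepts Blob or string values, not Node.js read
// streams, so load the file into a Blob before appending it.
const form = new FormData();
form.append("file", new Blob([await readFile("document.pdf")]), "document.pdf");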
form.append(
  "instructions",
  JSON.stringify({
    mode: "understand",
    output: { format: "spatial", includeWords: true },
  }),
);

const response = await fetch("https://api.nutrient.io/extraction/parse", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

const result = await response.json();
result.output.elements.forEach((el) => {
  if (el.words) {
    el.words.forEach((w) =>
      console.log(`${w.text} (confidence: ${w.confidence})`),
    );
  }
});

Each word object includes:

Field	Type	Description
`text`	string	The word text
`bounds`	object	Bounding box in document coordinate space
`confidence`	number	OCR confidence between 0 and 1
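
Word-level confidence can be used to surface text that may need manual review. The sketch below collects words under a cutoff from paragraph, handwriting, and table elements. It assumes that each table cell carries its own `words` list when includeWords is set; `iter_words`, `low_confidence_words`, and the `threshold` value are hypothetical.

```python
def iter_words(element):
    """Yield word objects from an element and, for tables, from its cells."""
    yield from element.get("words") or []  # words is null unless requested
    # Assumption: table cells expose a "words" list of the same shape.
    for cell in element.get("cells") or []:
        yield from cell.get("words") or []


def low_confidence_words(elements, threshold=0.8):
    """Return (element type, word text, confidence) for words below threshold."""
    flagged = []
    for el in elements:
        for word in iter_words(el):
            if word["confidence"] < threshold:
                flagged.append((el["type"], word["text"], word["confidence"]))
    return flagged


# Illustrative sample data shaped like the word object described above.
box = {"x": 0, "y": 0, "width": 10, "height": 10}
sample_elements = [
    {"type": "paragraph", "text": "Revenue Summary", "words": [
        {"text": "Revenue", "bounds": box, "confidence": 0.97},
        {"text": "Summary", "bounds": box, "confidence": 0.62},
    ]},
    {"type": "handwriting", "text": "John Doe", "words": None},
    {"type": "table", "cells": [
        {"text": "Region", "words": [
            {"text": "Region", "bounds": box, "confidence": 0.71},
        ]},
    ]},
]

flagged = low_confidence_words(sample_elements)
for element_type, text, confidence in flagged:
    print(f"{element_type}: {text} ({confidence})")
```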

Comparing spatial modes

The structure, understand, and agentic modes all return the same element types and output structure. They differ in speed, cost, and the extraction pipeline used.

Aspect	`structure`	`understand`	`agentic`
Speed	Fast	Slower	Slowest
Cost	1.5 credits per page	9 credits per page	18 credits per page
Pipeline	OCR-based segmentation	AI-augmented layout analysis	Hybrid (AI + VLM) layout analysis
Best for	Scanned documents, straightforward layouts	Complex layouts, tables, forms	The most complex documents that need VLM-augmented extraction

Use structure mode when you need spatial elements and the documents have straightforward layouts. Use understand mode for complex documents with tables, multicolumn layouts, or mixed content types. Use agentic mode for the most complex documents that benefit from VLM-augmented extraction. See processing modes for a full comparison, including text mode.
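
To compare modes for a given workload, the per-page costs in the table above can be turned into a quick estimate. This sketch only multiplies page count by the listed credits per page; `estimate_credits` is a hypothetical helper, and it makes no assumptions about rounding, minimums, or other billing rules.

```python
# Credits per page, taken from the comparison table above.
CREDITS_PER_PAGE = {"structure": 1.5, "understand": 9, "agentic": 18}


def estimate_credits(page_count, mode):
    """Estimate credits as page_count multiplied by the mode's per-page cost."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"Unknown spatial mode: {mode}")
    return page_count * CREDITS_PER_PAGE[mode]


# Example: compare all three spatial modes for a 40-page document.
estimates = {mode: estimate_credits(40, mode) for mode in CREDITS_PER_PAGE}
for mode, credits in estimates.items():
    print(f"{mode}: {credits} credits")
```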