Extract document elements
When output.format is set to spatial, the Data Extraction API returns a flat list of typed document elements. Each element includes its type, text content, spatial coordinates, detection confidence, and page reference.
Basic element extraction
Send a document and receive structured spatial elements.
cURL:

    curl -X POST https://api.nutrient.io/extraction/parse \
      -H "Authorization: Bearer your_api_key_goes_here" \
      -F "file=@document.pdf" \
      -F 'instructions={"mode":"understand","output":{"format":"spatial"}}'
    # Extract data from the JSON response.

Python:

    import requests

    # Open the file in a context manager so the handle is closed after the upload.
    with open("document.pdf", "rb") as pdf:
        response = requests.post(
            "https://api.nutrient.io/extraction/parse",
            headers={"Authorization": "Bearer your_api_key_goes_here"},
            files={"file": pdf},
            data={
                "instructions": '{"mode":"understand","output":{"format":"spatial"}}'
            },
        )

    result = response.json()
    for element in result["output"]["elements"]:
        print(f'{element["type"]}: {element.get("text", "")}')

JavaScript (Node.js):

    import { readFile } from "node:fs/promises";

    // The built-in FormData expects a Blob for file parts, not a read stream.
    const form = new FormData();
    form.append("file", new Blob([await readFile("document.pdf")]), "document.pdf");
    form.append(
      "instructions",
      JSON.stringify({ mode: "understand", output: { format: "spatial" } }),
    );

    const response = await fetch("https://api.nutrient.io/extraction/parse", {
      method: "POST",
      headers: { Authorization: "Bearer your_api_key_goes_here" },
      body: form,
    });

    const result = await response.json();
    result.output.elements.forEach((el) => {
      console.log(`${el.type}: ${el.text || ""}`);
    });

Element types
The API returns six element types. All types are available in structure, understand, and agentic modes (text mode does not support spatial output).
| Type | Description | Key fields |
|---|---|---|
| paragraph | Text content with semantic role | text, role, words |
| table | Structured table with cell data | rowCount, columnCount, cells |
| formula | Mathematical expression | latex |
| picture | Image, chart, or diagram | classification, altDescription |
| keyValueRegion | Form fields and key-value pairs | pairs |
| handwriting | Handwritten text content | text, words |
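As a rough illustration of how the type field can drive downstream handling, the sketch below tallies elements by type and pulls one representative field from each. The sample list is invented and far smaller than a real response, and the `summarize` helper is hypothetical; only the field names come from the table above.

```python
from collections import Counter

# Hypothetical, trimmed-down elements; a real response carries more fields.
elements = [
    {"type": "paragraph", "text": "Revenue Summary", "role": "SectionHeader"},
    {"type": "table", "rowCount": 3, "columnCount": 3, "cells": []},
    {"type": "formula", "latex": "r = r_0 e^{kt}"},
    {"type": "paragraph", "text": "Totals are in USD.", "role": "Text"},
]


def summarize(element):
    """Return a one-line description chosen by the element's type."""
    kind = element["type"]
    if kind in ("paragraph", "handwriting"):
        return element.get("text") or ""
    if kind == "table":
        return f'{element["rowCount"]}x{element["columnCount"]} table'
    if kind == "formula":
        return element["latex"]
    if kind == "picture":
        return element.get("altDescription") or ""
    if kind == "keyValueRegion":
        return f'{len(element["pairs"])} key-value pairs'
    return ""  # Unknown type: skip it rather than fail.


counts = Counter(e["type"] for e in elements)
summaries = [summarize(e) for e in elements]
```

Falling through to an empty string for unrecognized types keeps the loop tolerant of element types this sketch does not know about.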
Common fields
Every element includes these fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier (UUID) |
| type | string | Element type (paragraph, table, formula, picture, keyValueRegion, handwriting) |
| bounds | object | Bounding box with x, y, width, height. Origin at top-left. See coordinate spaces. |
| confidence | number | Detection confidence between 0 and 1 |
| readingOrder | integer | Position in the page-reading sequence |
| page | object | Source page with pageIndex (0-based), pageNumber (1-based integer), width, and height |
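The common fields are enough to put elements back into reading sequence and to place them on the page. The following is a minimal sketch, not part of the API: `in_reading_order` and `normalized_bounds` are hypothetical helpers, and the normalization assumes that bounds and the page width and height share the same units, which the coordinate spaces documentation should confirm.

```python
def in_reading_order(elements, min_confidence=0.0):
    """Sort by page, then by readingOrder, dropping low-confidence elements."""
    kept = [e for e in elements if e["confidence"] >= min_confidence]
    return sorted(kept, key=lambda e: (e["page"]["pageIndex"], e["readingOrder"]))


def normalized_bounds(element):
    """Express bounds as fractions of the page size.

    Assumption: bounds and page dimensions use the same coordinate units.
    """
    b, p = element["bounds"], element["page"]
    return {
        "x": b["x"] / p["width"],
        "y": b["y"] / p["height"],
        "width": b["width"] / p["width"],
        "height": b["height"] / p["height"],
    }


# Invented sample data with round page sizes to keep the arithmetic readable.
page0 = {"pageIndex": 0, "pageNumber": 1, "width": 1000, "height": 2000}
page1 = {"pageIndex": 1, "pageNumber": 2, "width": 1000, "height": 2000}
sample = [
    {"id": "b", "confidence": 0.9, "readingOrder": 0, "page": page1,
     "bounds": {"x": 0, "y": 0, "width": 500, "height": 100}},
    {"id": "c", "confidence": 0.4, "readingOrder": 0, "page": page0,
     "bounds": {"x": 0, "y": 0, "width": 500, "height": 100}},
    {"id": "a", "confidence": 0.95, "readingOrder": 1, "page": page0,
     "bounds": {"x": 100, "y": 500, "width": 500, "height": 100}},
]

ordered_ids = [e["id"] for e in in_reading_order(sample, min_confidence=0.5)]
fractions = normalized_bounds(sample[2])
```

Sorting on the pair (pageIndex, readingOrder) matters because readingOrder is a position within a page, so it restarts on each page.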
Paragraph elements
Paragraphs cover all text content. The role field identifies the semantic function:
    {
      "type": "paragraph",
      "role": "SectionHeader",
      "text": "Revenue Summary",
      "confidence": 0.95,
      "readingOrder": 0,
      "bounds": { "x": 100, "y": 50, "width": 400, "height": 35 },
      "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 }
    }

Available roles:
| Role | Description |
|---|---|
| Text | Body paragraphs |
| Title | Document title |
| SectionHeader | Section headings |
| Header | Running page headers |
| Footer | Running page footers |
| Caption | Figure or table captions |
| Footnote | Footnotes |
| ListItem | List items (ordered or unordered) |
| PageNumber | Page number labels |
| Code | Code blocks |
| CheckboxSelected | Selected checkbox |
| CheckboxUnselected | Unselected checkbox |
The role is null when the API cannot determine the semantic function.
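One common use of roles is to separate headings from body text while discarding page furniture. The sketch below is one possible approach under stated assumptions: `split_by_role` and the two role groupings are hypothetical, and paragraphs with a null role are treated as body text, which is a choice made here rather than documented behavior.

```python
# Hypothetical groupings of the documented role names.
FURNITURE_ROLES = {"Header", "Footer", "PageNumber"}
HEADING_ROLES = {"Title", "SectionHeader"}


def split_by_role(elements):
    """Return (headings, body) text lists, skipping running headers and footers."""
    headings, body = [], []
    for el in elements:
        if el.get("type") != "paragraph":
            continue
        role = el.get("role")  # None when the role could not be determined.
        if role in FURNITURE_ROLES:
            continue
        target = headings if role in HEADING_ROLES else body
        target.append(el.get("text") or "")
    return headings, body


# Invented sample paragraphs.
paragraphs = [
    {"type": "paragraph", "role": "Header", "text": "ACME Corp, Confidential"},
    {"type": "paragraph", "role": "Title", "text": "Annual Report"},
    {"type": "paragraph", "role": "SectionHeader", "text": "Revenue Summary"},
    {"type": "paragraph", "role": "Text", "text": "Revenue grew in all regions."},
    {"type": "paragraph", "role": None, "text": "Unclassified line"},
    {"type": "paragraph", "role": "PageNumber", "text": "3"},
]

headings, body = split_by_role(paragraphs)
```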
Table elements
Tables include row and column counts, plus cell-level data with text, bounds, and span information:
    {
      "type": "table",
      "confidence": 0.92,
      "readingOrder": 2,
      "bounds": { "x": 100, "y": 150, "width": 600, "height": 120 },
      "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
      "rowCount": 3,
      "columnCount": 3,
      "cells": [
        {
          "id": "c-001",
          "bounds": { "x": 100, "y": 150, "width": 200, "height": 30 },
          "confidence": 0.94,
          "row": 0,
          "column": 0,
          "rowSpan": 1,
          "colSpan": 1,
          "text": "Region"
        }
      ],
      "captionIds": null,
      "footnoteIds": null
    }

Each cell includes row, column, rowSpan, and colSpan for reconstructing the table layout. captionIds and footnoteIds reference associated paragraph elements by their id.
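To make the layout reconstruction concrete, here is a minimal sketch that expands the cells of a table element into a two-dimensional grid. `table_to_grid` is a hypothetical helper, the sample table is invented, and the sketch assumes that row and column are zero-based, as the example cell above suggests. Repeating a spanning cell's text in every position it covers is one possible design choice, not something the API prescribes.

```python
def table_to_grid(table):
    """Build a rowCount x columnCount grid of cell text.

    A spanning cell's text is repeated in every grid position it covers.
    Assumption: row and column indices are zero-based.
    """
    rows, cols = table["rowCount"], table["columnCount"]
    grid = [[""] * cols for _ in range(rows)]
    for cell in table["cells"]:
        text = cell.get("text") or ""
        row_end = cell["row"] + cell.get("rowSpan", 1)
        col_end = cell["column"] + cell.get("colSpan", 1)
        # Clamp to the declared size in case a span overshoots the table.
        for r in range(cell["row"], min(row_end, rows)):
            for c in range(cell["column"], min(col_end, cols)):
                grid[r][c] = text
    return grid


# Invented sample: the "Region" header spans the first two columns.
sample_table = {
    "rowCount": 2,
    "columnCount": 3,
    "cells": [
        {"row": 0, "column": 0, "rowSpan": 1, "colSpan": 2, "text": "Region"},
        {"row": 0, "column": 2, "rowSpan": 1, "colSpan": 1, "text": "Total"},
        {"row": 1, "column": 0, "rowSpan": 1, "colSpan": 1, "text": "EMEA"},
        {"row": 1, "column": 1, "rowSpan": 1, "colSpan": 1, "text": "North"},
        {"row": 1, "column": 2, "rowSpan": 1, "colSpan": 1, "text": "42"},
    ],
}

grid = table_to_grid(sample_table)
```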
Formula elements
Formulas contain a LaTeX representation of the detected mathematical expression:
    {
      "type": "formula",
      "confidence": 0.88,
      "readingOrder": 3,
      "bounds": { "x": 100, "y": 300, "width": 250, "height": 40 },
      "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
      "latex": "r = r_0 e^{kt}"
    }

Picture elements
Pictures include classification, confidence, and an AI-generated alt text description:
    {
      "type": "picture",
      "confidence": 0.89,
      "readingOrder": 2,
      "bounds": { "x": 100, "y": 300, "width": 400, "height": 300 },
      "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
      "classification": "chart",
      "classificationConfidence": 0.91,
      "altDescription": "Bar chart showing quarterly revenue growth across regions",
      "captionIds": ["d1e2f3a4-4444-4000-8000-000000000004"],
      "footnoteIds": null
    }

Key-value region elements
Key-value regions detect form fields and structured label-value pairs:
    {
      "type": "keyValueRegion",
      "confidence": 0.87,
      "readingOrder": 4,
      "bounds": { "x": 100, "y": 700, "width": 500, "height": 100 },
      "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
      "pairs": [
        {
          "id": "kvp-001",
          "key": {
            "id": "kve-001",
            "bounds": { "x": 100, "y": 700, "width": 150, "height": 25 },
            "confidence": 0.92,
            "entityType": "QUESTION",
            "value": "Invoice Number"
          },
          "value": {
            "id": "kve-002",
            "bounds": { "x": 260, "y": 700, "width": 200, "height": 25 },
            "confidence": 0.95,
            "entityType": "ANSWER",
            "value": "INV-2024-0042"
          },
          "relationshipConfidence": 0.93
        }
      ]
    }

Handwriting elements
Handwriting elements contain extracted handwritten text. Like paragraphs, they support optional word-level OCR data via includeWords:
    {
      "type": "handwriting",
      "confidence": 0.78,
      "readingOrder": 5,
      "bounds": { "x": 30, "y": 320, "width": 200, "height": 30 },
      "page": { "pageIndex": 0, "pageNumber": 1, "width": 1818, "height": 2422 },
      "text": "John Doe",
      "words": null
    }

When includeWords is true, the words array contains per-word bounds and confidence, in the same format as paragraph word-level data.
Word-level data
Set output.includeWords to true to get word-level OCR data nested inside paragraph, handwriting, and table cell elements.
cURL:

    curl -X POST https://api.nutrient.io/extraction/parse \
      -H "Authorization: Bearer your_api_key_goes_here" \
      -F "file=@document.pdf" \
      -F 'instructions={"mode":"understand","output":{"format":"spatial","includeWords":true}}'

Python:

    import requests

    # Open the file in a context manager so the handle is closed after the upload.
    with open("document.pdf", "rb") as pdf:
        response = requests.post(
            "https://api.nutrient.io/extraction/parse",
            headers={"Authorization": "Bearer your_api_key_goes_here"},
            files={"file": pdf},
            data={
                "instructions": '{"mode":"understand","output":{"format":"spatial","includeWords":true}}'
            },
        )

    result = response.json()
    for element in result["output"]["elements"]:
        if element.get("words"):
            for word in element["words"]:
                print(f'{word["text"]} (confidence: {word["confidence"]})')

JavaScript (Node.js):

    import { readFile } from "node:fs/promises";

    // The built-in FormData expects a Blob for file parts, not a read stream.
    const form = new FormData();
    form.append("file", new Blob([await readFile("document.pdf")]), "document.pdf");
    form.append(
      "instructions",
      JSON.stringify({
        mode: "understand",
        output: { format: "spatial", includeWords: true },
      }),
    );

    const response = await fetch("https://api.nutrient.io/extraction/parse", {
      method: "POST",
      headers: { Authorization: "Bearer your_api_key_goes_here" },
      body: form,
    });

    const result = await response.json();
    result.output.elements.forEach((el) => {
      if (el.words) {
        el.words.forEach((w) =>
          console.log(`${w.text} (confidence: ${w.confidence})`),
        );
      }
    });

Each word object includes:
| Field | Type | Description |
|---|---|---|
| text | string | The word text |
| bounds | object | Bounding box in document coordinate space |
| confidence | number | OCR confidence between 0 and 1 |
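Word-level confidence is useful for flagging text that may need manual review. The sketch below collects words under a chosen threshold; `iter_words` and `low_confidence_words` are hypothetical helpers, the sample data is invented, and the sketch assumes that table cells carry their word data under a `words` key just as paragraphs do, which is not confirmed by the examples on this page.

```python
def iter_words(element):
    """Yield word objects from an element and, for tables, from its cells.

    Assumption: table cells expose word data under a "words" key.
    """
    yield from element.get("words") or []  # words is null unless requested
    for cell in element.get("cells") or []:
        yield from cell.get("words") or []


def low_confidence_words(elements, threshold=0.8):
    """Return (element id, word text, confidence) for words below threshold."""
    flagged = []
    for el in elements:
        for word in iter_words(el):
            if word["confidence"] < threshold:
                flagged.append((el.get("id"), word["text"], word["confidence"]))
    return flagged


# Invented sample elements with deliberately misread words.
sample_elements = [
    {"id": "p1", "type": "paragraph", "words": [
        {"text": "Revenue", "confidence": 0.97},
        {"text": "Sumnary", "confidence": 0.55},
    ]},
    {"id": "t1", "type": "table", "cells": [
        {"id": "c1", "words": [{"text": "Regi0n", "confidence": 0.6}]},
    ]},
    {"id": "h1", "type": "handwriting", "words": None},
]

flagged = low_confidence_words(sample_elements, threshold=0.8)
```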
Comparing spatial modes
The structure, understand, and agentic modes all return the same element types and output structure. The difference is in extraction depth and cost.
| Aspect | structure | understand | agentic |
|---|---|---|---|
| Speed | Fast | Slower | Slowest |
| Cost | 1.5 credits per page | 9 credits per page | 18 credits per page |
| Pipeline | OCR-based segmentation | AI-augmented layout analysis | Hybrid (AI + VLM) layout analysis |
| Best for | Scanned documents, straightforward layouts | Complex layouts, tables, forms | The most complex documents needing VLM |
Use structure mode when you need spatial elements and the documents have straightforward layouts. Use understand mode for complex documents with tables, multicolumn layouts, or mixed content types. Use agentic mode for the most complex documents that benefit from VLM-augmented extraction. See processing modes for a full comparison, including text mode.
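For budgeting, the per-page rates in the table translate directly into a cost estimate. This is a back-of-the-envelope sketch: `estimate_credits` is a hypothetical helper, and it assumes billing is strictly linear in page count with no rounding, minimums, or discounts, none of which this page specifies.

```python
# Credits per page, taken from the comparison table above.
CREDITS_PER_PAGE = {"structure": 1.5, "understand": 9, "agentic": 18}


def estimate_credits(page_count, mode):
    """Estimate credits for a document, assuming simple per-page billing."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"No spatial pricing known for mode: {mode!r}")
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * CREDITS_PER_PAGE[mode]


# A 40-page document: 60 credits in structure mode, 360 in understand mode,
# and 720 in agentic mode.
estimates = {mode: estimate_credits(40, mode) for mode in CREDITS_PER_PAGE}
```

Under these assumptions, agentic mode costs twelve times as much per page as structure mode, so starting with the cheapest mode that handles a given layout can make a noticeable difference on large batches.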