Extract endpoint

The Nutrient DWS Data Extraction API extract endpoint returns domain-specific data from a document as JSON shaped to the schema you provide:

POST https://api.nutrient.io/extraction/extract

Use the extract endpoint when you need specific values from a document. To return document structure, such as typed spatial elements or whole-document Markdown, refer to the parse endpoint guide. Provide a schema for fields such as invoice_number and total_amount, and the response returns those values from the document. You can also include per-field citations that point back to the source. To define the schema, refer to the define a schema guide. To configure citations, refer to the citations and confidence guide.

When to use extract vs. parse

Choose the endpoint based on the output your application needs.

Use case	Endpoint
Pull known fields into a typed JSON object, such as invoices or forms	`/extraction/extract`
Get the full document as typed elements or Markdown	`/extraction/parse`
Map data to a downstream database or API contract	`/extraction/extract`
Support retrieval-augmented generation (RAG) ingestion, search indexing, or content migration	`/extraction/parse`

If you need both raw structure and specific fields, use the extract endpoint. It runs a parse stage internally. To configure that stage, refer to the parse configuration guide.

Request formats

Every extract request must include a schema. You can send the document in two ways.

Multipart form upload

Upload the document in the file form field and send the JSON-serialized extraction instructions in the instructions form field:

curl -X POST https://api.nutrient.io/extraction/extract \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"schema":{"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice identifier"},"total_amount":{"type":"number","description":"Total amount including tax"}},"required":["invoice_number","total_amount"]}}'

import json

import requests

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total amount including tax"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Open the file in a context manager so the handle is closed after the upload.
with open("invoice.pdf", "rb") as pdf:
    response = requests.post(
        "https://api.nutrient.io/extraction/extract",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": pdf},
        data={"instructions": json.dumps({"schema": schema})},
    )

print(response.json()["output"]["data"])

import { readFile } from "node:fs/promises";

const schema = {
  type: "object",
  properties: {
    invoice_number: { type: "string", description: "Invoice identifier" },
    total_amount: { type: "number", description: "Total amount including tax" },
  },
  required: ["invoice_number", "total_amount"],
};

// The built-in FormData accepts a Blob, not a read stream, so load the file into memory first.
const form = new FormData();
form.append("file", new Blob([await readFile("invoice.pdf")]), "invoice.pdf");
form.append("instructions", JSON.stringify({ schema }));

const response = await fetch("https://api.nutrient.io/extraction/extract", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

const result = await response.json();
console.log(result.output.data);

JSON body with URL

Process a document hosted at a public URL by sending the schema and a url field as JSON:

curl -X POST https://api.nutrient.io/extraction/extract \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://storage.example.com/invoice.pdf",
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": "string", "description": "Invoice identifier" },
        "total_amount": { "type": "number", "description": "Total amount including tax" }
      },
      "required": ["invoice_number", "total_amount"]
    },
    "parseConfig": { "mode": "understand" }
  }'

import requests

response = requests.post(
    "https://api.nutrient.io/extraction/extract",
    headers={
        "Authorization": "Bearer your_api_key_goes_here",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://storage.example.com/invoice.pdf",
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice identifier"},
                "total_amount": {"type": "number", "description": "Total amount including tax"},
            },
            "required": ["invoice_number", "total_amount"],
        },
        "parseConfig": {"mode": "understand"},
    },
)

print(response.json()["output"]["data"])

const response = await fetch("https://api.nutrient.io/extraction/extract", {
  method: "POST",
  headers: {
    Authorization: "Bearer your_api_key_goes_here",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://storage.example.com/invoice.pdf",
    schema: {
      type: "object",
      properties: {
        invoice_number: { type: "string", description: "Invoice identifier" },
        total_amount: { type: "number", description: "Total amount including tax" },
      },
      required: ["invoice_number", "total_amount"],
    },
    parseConfig: { mode: "understand" },
  }),
});

const result = await response.json();
console.log(result.output.data);

Instructions

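The request accepts these instruction fields. As the examples above show, a multipart request carries them as a JSON object in the instructions form field, and a JSON request carries them at the top level of the body alongside url.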

Field	Type	Description
`schema`	object	Required — JSON Schema that describes the data to extract. The root type must be `object`. Refer to the define a schema guide.
`instructions`	string	Optional free-text guidance for the extraction model, up to 10,000 characters.
`parseConfig`	object	Optional configuration for the parse stage that runs before extraction. Refer to the parse configuration guide.
`options`	object	Extract-specific response options, such as `includeCitations`. Refer to the citations and confidence guide.
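
The fields in this table can be assembled and checked locally before a request is sent. The following is a minimal sketch, not part of the API: build_instructions is a hypothetical helper, the two checks only mirror the constraints stated in the table (root type object, 10,000-character limit), and it assumes includeCitations is a boolean flag inside options.

```python
import json

MAX_GUIDANCE_CHARS = 10_000  # limit stated for the free-text instructions field


def build_instructions(schema, guidance=None, parse_config=None, include_citations=None):
    """Assemble the instruction fields into one dict (hypothetical helper)."""
    if schema.get("type") != "object":
        raise ValueError("schema root type must be 'object'")

    payload = {"schema": schema}

    if guidance is not None:
        if len(guidance) > MAX_GUIDANCE_CHARS:
            raise ValueError("instructions text exceeds 10,000 characters")
        payload["instructions"] = guidance

    if parse_config is not None:
        payload["parseConfig"] = parse_config

    if include_citations is not None:
        # Assumption: includeCitations is a boolean flag under options.
        payload["options"] = {"includeCitations": include_citations}

    return payload


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total amount including tax"},
    },
    "required": ["invoice_number", "total_amount"],
}

payload = build_instructions(
    schema,
    guidance="Amounts are in EUR.",
    parse_config={"mode": "understand"},
    include_citations=True,
)

# Serialized form, suitable for the instructions field of a multipart upload.
form_field = json.dumps(payload)
print(form_field)
```

For a multipart upload, form_field would be sent as the instructions form field. For a JSON request, the same dict would be sent as the body with a url key added.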

Response structure

A successful response returns the extracted data, optional per-field citation metadata, and page metadata, along with processing metrics and credit usage:

{
  "status": 200,
  "requestId": "req_x1y2z3w4",
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "total_amount": 1547.5
    },
    "metadata": {},
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  },
  "metrics": {
    "processingTimeMs": 4800,
    "pagesProcessed": 1
  },
  "usage": {
    "data_extraction_credits": {
      "cost": 27,
      "remainingCredits": 832
    },
    "price_composition": {
      "parse": { "units": 1, "unit_cost": 9, "cost": 9, "currency": "data_extraction_credits" },
      "extract": { "units": 1, "unit_cost": 18, "cost": 18, "currency": "data_extraction_credits" }
    }
  }
}

The response includes these fields:

output.data — Extracted values shaped to your schema. The API returns only declared properties.
output.metadata — Per-field citation metadata that mirrors the structure of output.data. This object is empty when citations are disabled.
output.pages — Page metadata, including the dimensions citation coordinates use.
metrics — Processing time and pages processed.
usage — Total credits consumed, broken down into parse and extract components. For credit details, refer to the pricing guide.
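
A response body with this shape can be checked against the schema once it has been parsed. The sketch below uses the sample response from this section and runs offline. summarize_extract_response is a hypothetical helper, and the comparison of per-stage costs with the total reflects only what the sample shows, not a documented guarantee.

```python
sample_response = {
    "status": 200,
    "requestId": "req_x1y2z3w4",
    "output": {
        "data": {"invoice_number": "INV-2024-0042", "total_amount": 1547.5},
        "metadata": {},
        "pages": [{"page": 1, "width": 1200, "height": 1697}],
    },
    "metrics": {"processingTimeMs": 4800, "pagesProcessed": 1},
    "usage": {
        "data_extraction_credits": {"cost": 27, "remainingCredits": 832},
        "price_composition": {
            "parse": {"units": 1, "unit_cost": 9, "cost": 9,
                      "currency": "data_extraction_credits"},
            "extract": {"units": 1, "unit_cost": 18, "cost": 18,
                        "currency": "data_extraction_credits"},
        },
    },
}

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}


def summarize_extract_response(body, schema):
    """Compare extracted data with the schema and collect credit costs (hypothetical helper)."""
    data = body["output"]["data"]
    declared = set(schema["properties"])
    usage = body["usage"]
    stage_costs = {
        stage: part["cost"] for stage, part in usage["price_composition"].items()
    }
    return {
        "data": data,
        # Required fields that did not come back in output.data.
        "missing_required": [k for k in schema.get("required", []) if k not in data],
        # Keys in output.data that the schema never declared.
        "undeclared": sorted(set(data) - declared),
        "citations_enabled": bool(body["output"]["metadata"]),
        "stage_costs": stage_costs,
        "total_cost": usage["data_extraction_credits"]["cost"],
    }


summary = summarize_extract_response(sample_response, schema)
print(summary["data"])
print(summary["stage_costs"], summary["total_cost"])
```

In the sample, the parse and extract costs add up to the total cost, and output.metadata is empty because citations are disabled.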

Next steps

Use these guides to continue working with the extract endpoint.

Refer to the define a schema guide for supported JSON Schema keywords, constraints, and size limits.
Refer to the parse configuration guide to control the parse stage with parseConfig.mode and language hints.
Refer to the citations and confidence guide to ground extracted values back to the source document.
Refer to the error handling guide for status codes and error response formats.
