Extract endpoint
The Nutrient DWS Data Extraction API extract endpoint returns domain-specific data from a document as JSON shaped to the schema you provide:

```
POST https://api.nutrient.io/extraction/extract
```

Use the extract endpoint when you need specific values from a document. To return document structure, such as typed spatial elements or whole-document Markdown, refer to the parse endpoint guide. Provide a schema for fields such as invoice_number and total_amount, and the response returns those values from the document. You can also include per-field citations that point back to the source. To define the schema, refer to the define a schema guide. To configure citations, refer to the citations and confidence guide.
When to use extract vs. parse
Choose the endpoint based on the output your application needs.
| Use case | Endpoint |
|---|---|
| Pull known fields into a typed JSON object, such as invoices or forms | /extraction/extract |
| Get the full document as typed elements or Markdown | /extraction/parse |
| Map data to a downstream database or API contract | /extraction/extract |
| Support retrieval-augmented generation (RAG) ingestion, search indexing, or content migration | /extraction/parse |
If you need both raw structure and specific fields, use the extract endpoint. It runs a parse stage internally. To configure that stage, refer to the parse configuration guide.
Request formats
Every extract request must include a schema. You can send the document in two ways.
Multipart form upload
Upload a file with the JSON-serialized extraction instructions:

```shell
curl -X POST https://api.nutrient.io/extraction/extract \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"schema":{"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice identifier"},"total_amount":{"type":"number","description":"Total amount including tax"}},"required":["invoice_number","total_amount"]}}'
```

```python
import json

import requests

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total amount including tax"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Open the file in a context manager so the handle is closed after the upload.
with open("invoice.pdf", "rb") as pdf:
    response = requests.post(
        "https://api.nutrient.io/extraction/extract",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": pdf},
        data={"instructions": json.dumps({"schema": schema})},
    )

print(response.json()["output"]["data"])
```

```javascript
import fs from "node:fs";

const schema = {
  type: "object",
  properties: {
    invoice_number: { type: "string", description: "Invoice identifier" },
    total_amount: { type: "number", description: "Total amount including tax" },
  },
  required: ["invoice_number", "total_amount"],
};

// The global FormData used by fetch accepts Blob values, not Node.js read
// streams, so wrap the file bytes in a Blob and pass the file name explicitly.
const form = new FormData();
form.append("file", new Blob([fs.readFileSync("invoice.pdf")]), "invoice.pdf");
form.append("instructions", JSON.stringify({ schema }));

const response = await fetch("https://api.nutrient.io/extraction/extract", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

const result = await response.json();
console.log(result.output.data);
```

JSON body with URL
Process a document hosted at a public URL by sending the schema and a url field as JSON:

```shell
curl -X POST https://api.nutrient.io/extraction/extract \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://storage.example.com/invoice.pdf",
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": "string", "description": "Invoice identifier" },
        "total_amount": { "type": "number", "description": "Total amount including tax" }
      },
      "required": ["invoice_number", "total_amount"]
    },
    "parseConfig": { "mode": "understand" }
  }'
```

```python
import requests

response = requests.post(
    "https://api.nutrient.io/extraction/extract",
    headers={
        "Authorization": "Bearer your_api_key_goes_here",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://storage.example.com/invoice.pdf",
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "Invoice identifier"},
                "total_amount": {"type": "number", "description": "Total amount including tax"},
            },
            "required": ["invoice_number", "total_amount"],
        },
        "parseConfig": {"mode": "understand"},
    },
)

print(response.json()["output"]["data"])
```

```javascript
const response = await fetch("https://api.nutrient.io/extraction/extract", {
  method: "POST",
  headers: {
    Authorization: "Bearer your_api_key_goes_here",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://storage.example.com/invoice.pdf",
    schema: {
      type: "object",
      properties: {
        invoice_number: { type: "string", description: "Invoice identifier" },
        total_amount: { type: "number", description: "Total amount including tax" },
      },
      required: ["invoice_number", "total_amount"],
    },
    parseConfig: { mode: "understand" },
  }),
});

const result = await response.json();
console.log(result.output.data);
```

Instructions
The request accepts these instruction fields.
| Field | Type | Description |
|---|---|---|
| schema | object | Required. JSON Schema that describes the data to extract. The root type must be object. Refer to the define a schema guide. |
| instructions | string | Optional free-text guidance for the extraction model, up to 10,000 characters. |
| parseConfig | object | Optional configuration for the parse stage that runs before extraction. Refer to the parse configuration guide. |
| options | object | Extract-specific response options, such as includeCitations. Refer to the citations and confidence guide. |
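
As a rough illustration of how these fields fit together, the sketch below assembles them in Python and applies the two constraints stated in the table (an object root type and the 10,000-character limit on instructions). The helper name build_instruction_fields is hypothetical and not part of the API. Two things here are assumptions: that includeCitations takes a boolean value, and that the multipart instructions form field accepts the same fields as the JSON body.

```python
import json

MAX_INSTRUCTIONS_CHARS = 10_000  # limit stated in the instructions table


def build_instruction_fields(schema, instructions=None, parse_config=None, options=None):
    """Assemble instruction fields for an extract request (hypothetical helper)."""
    if not isinstance(schema, dict) or schema.get("type") != "object":
        raise ValueError("schema is required and its root type must be 'object'")
    if instructions is not None and len(instructions) > MAX_INSTRUCTIONS_CHARS:
        raise ValueError("instructions must be at most 10,000 characters")

    fields = {"schema": schema}
    # Optional fields are left out entirely when they are not set.
    if instructions is not None:
        fields["instructions"] = instructions
    if parse_config is not None:
        fields["parseConfig"] = parse_config
    if options is not None:
        fields["options"] = options
    return fields


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total amount including tax"},
    },
    "required": ["invoice_number", "total_amount"],
}

fields = build_instruction_fields(
    schema,
    instructions="Amounts use a period as the decimal separator.",
    parse_config={"mode": "understand"},
    options={"includeCitations": True},  # assumption: a boolean flag
)

# JSON body request: merge the fields with the document URL.
json_body = {"url": "https://storage.example.com/invoice.pdf", **fields}

# Multipart request (assumption: same fields, serialized into the form field).
multipart_instructions = json.dumps(fields)

print(json.dumps(json_body, indent=2))
```

Validating locally is only a convenience for catching obvious mistakes before a request is sent. The define a schema guide remains the reference for which keywords and sizes the API accepts.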
Response structure
A successful response returns the extracted data, optional per-field citation metadata, and page information:

```json
{
  "status": 200,
  "requestId": "req_x1y2z3w4",
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "total_amount": 1547.5
    },
    "metadata": {},
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  },
  "metrics": { "processingTimeMs": 4800, "pagesProcessed": 1 },
  "usage": {
    "data_extraction_credits": { "cost": 27, "remainingCredits": 832 },
    "price_composition": {
      "parse": { "units": 1, "unit_cost": 9, "cost": 9, "currency": "data_extraction_credits" },
      "extract": { "units": 1, "unit_cost": 18, "cost": 18, "currency": "data_extraction_credits" }
    }
  }
}
```

The response includes these fields:
- output.data: Extracted values shaped to your schema. The API returns only declared properties.
- output.metadata: Per-field citation metadata that mirrors the structure of output.data. This object is empty when citations are disabled.
- output.pages: Page metadata, including the dimensions that citation coordinates use.
- metrics: Processing time and pages processed.
- usage: Total credits consumed, broken down into parse and extract components. For credit details, refer to the pricing guide.
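
The fields above can be read as in the following sketch, which uses the sample response from this section as a Python dictionary. The function name summarize_extraction and the shape of its return value are hypothetical choices for illustration, not part of the API.

```python
# Sample response body, copied from the example in this section.
sample = {
    "status": 200,
    "requestId": "req_x1y2z3w4",
    "output": {
        "data": {"invoice_number": "INV-2024-0042", "total_amount": 1547.5},
        "metadata": {},
        "pages": [{"page": 1, "width": 1200, "height": 1697}],
    },
    "metrics": {"processingTimeMs": 4800, "pagesProcessed": 1},
    "usage": {
        "data_extraction_credits": {"cost": 27, "remainingCredits": 832},
        "price_composition": {
            "parse": {"units": 1, "unit_cost": 9, "cost": 9,
                      "currency": "data_extraction_credits"},
            "extract": {"units": 1, "unit_cost": 18, "cost": 18,
                        "currency": "data_extraction_credits"},
        },
    },
}


def summarize_extraction(body):
    """Collect the parts of an extract response most callers need (hypothetical helper)."""
    output = body["output"]
    usage = body["usage"]
    return {
        "data": output["data"],
        # An empty metadata object means citations were disabled.
        "has_citations": bool(output["metadata"]),
        # Page dimensions, keyed by page number.
        "page_sizes": {p["page"]: (p["width"], p["height"]) for p in output["pages"]},
        "total_cost": usage["data_extraction_credits"]["cost"],
        # Credit cost per stage, for example parse and extract.
        "stage_costs": {
            stage: item["cost"] for stage, item in usage["price_composition"].items()
        },
    }


summary = summarize_extraction(sample)
print(summary["data"])
print(summary["stage_costs"], "total:", summary["total_cost"])
```

In the sample, the parse cost (9) and the extract cost (18) add up to the total of 27 credits. Whether that relationship holds for every request is something to confirm against the pricing guide.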
Next steps
Use these guides to continue working with the extract endpoint.
- Refer to the define a schema guide for supported JSON Schema keywords, constraints, and size limits.
- Refer to the parse configuration guide to control the parse stage with parseConfig.mode and language hints.
- Refer to the citations and confidence guide to ground extracted values back to the source document.
- Refer to the error handling guide for status codes and error response formats.