Extract endpoint

The Nutrient DWS Data Extraction API extract endpoint returns domain-specific data from a document as JSON shaped to the schema you provide:

POST https://api.nutrient.io/extraction/extract

Use the extract endpoint when you need specific values from a document. Provide a schema for fields such as invoice_number and total_amount, and the response returns those values from the document. You can also include per-field citations that point back to the source. To return document structure instead, such as typed spatial elements or whole-document Markdown, refer to the parse endpoint guide. To define the schema, refer to the define a schema guide. To configure citations, refer to the citations and confidence guide.

When to use extract vs. parse

Choose the endpoint based on the output your application needs.

| Use case | Endpoint |
| --- | --- |
| Pull known fields into a typed JSON object, such as invoices or forms | /extraction/extract |
| Get the full document as typed elements or Markdown | /extraction/parse |
| Map data to a downstream database or API contract | /extraction/extract |
| Support retrieval-augmented generation (RAG) ingestion, search indexing, or content migration | /extraction/parse |

If you need both raw structure and specific fields, use the extract endpoint. It runs a parse stage internally. To configure that stage, refer to the parse configuration guide.

Request formats

Every extract request must include a schema. You can send the document in two ways.

Multipart form upload

Upload a file with the JSON-serialized extraction instructions:

curl -X POST https://api.nutrient.io/extraction/extract \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@invoice.pdf" \
  -F 'instructions={"schema":{"type":"object","properties":{"invoice_number":{"type":"string","description":"Invoice identifier"},"total_amount":{"type":"number","description":"Total amount including tax"}},"required":["invoice_number","total_amount"]}}'
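The same multipart request body can be assembled in Python. This is a minimal sketch that uses only the standard library and does not send anything. The helper names build_instructions and encode_multipart are hypothetical, the file bytes are a placeholder, and the form-field names (file, instructions) are taken from the curl example above.

```python
import json
import uuid


def build_instructions(schema: dict) -> str:
    """Serialize the extraction instructions for the `instructions` form field."""
    return json.dumps({"schema": schema})


def encode_multipart(filename: str, file_bytes: bytes, instructions: str):
    """Return (body, content_type) for a multipart/form-data request.

    Hypothetical helper: field names mirror the curl example.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="instructions"',
        b"",
        instructions.encode("utf-8"),
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total amount including tax"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Placeholder bytes stand in for the contents of invoice.pdf.
body, content_type = encode_multipart(
    "invoice.pdf", b"%PDF placeholder", build_instructions(schema)
)
```

To send the request, pass body as the request data together with an Authorization header and the returned content_type as the Content-Type header.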

JSON body with URL

Process a document hosted at a public URL by sending the schema and a url field as JSON:

curl -X POST https://api.nutrient.io/extraction/extract \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://storage.example.com/invoice.pdf",
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": "string", "description": "Invoice identifier" },
        "total_amount": { "type": "number", "description": "Total amount including tax" }
      },
      "required": ["invoice_number", "total_amount"]
    },
    "parseConfig": { "mode": "understand" }
  }'
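As a rough Python equivalent of the curl call above, the following sketch builds the JSON request with the standard library's urllib.request. The function name build_extract_request and the EXTRACT_URL constant are hypothetical; the endpoint, headers, and body fields come from the example above. The request is constructed but not sent.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.nutrient.io/extraction/extract"


def build_extract_request(document_url, schema, api_key, parse_config=None):
    """Build (but do not send) a JSON extract request for a hosted document."""
    payload = {"url": document_url, "schema": schema}
    if parse_config is not None:
        payload["parseConfig"] = parse_config  # optional parse-stage settings
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_extract_request(
    "https://storage.example.com/invoice.pdf",
    {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice identifier"},
            "total_amount": {"type": "number", "description": "Total amount including tax"},
        },
        "required": ["invoice_number", "total_amount"],
    },
    api_key="your_api_key_goes_here",
    parse_config={"mode": "understand"},
)

# To send it:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```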

Instructions

The request accepts these instruction fields.

| Field | Type | Description |
| --- | --- | --- |
| schema | object | Required. JSON Schema that describes the data to extract. The root type must be object. Refer to the define a schema guide. |
| instructions | string | Optional free-text guidance for the extraction model, up to 10,000 characters. |
| parseConfig | object | Optional configuration for the parse stage that runs before extraction. Refer to the parse configuration guide. |
| options | object | Extract-specific response options, such as includeCitations. Refer to the citations and confidence guide. |
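The constraints in the table above can be checked on the client before a request is sent. This is an illustrative sketch, not part of the API: check_instruction_fields is a hypothetical helper, and it assumes the 10,000-character limit can be approximated with Python's len(), which may not match how the service counts characters.

```python
MAX_INSTRUCTIONS_CHARS = 10_000  # limit stated in the field table


def check_instruction_fields(fields: dict) -> list:
    """Return a list of problems found in the instruction fields (empty if none)."""
    problems = []

    schema = fields.get("schema")
    if not isinstance(schema, dict):
        problems.append("schema is required and must be an object")
    elif schema.get("type") != "object":
        problems.append('schema root type must be "object"')

    guidance = fields.get("instructions")
    if guidance is not None:
        if not isinstance(guidance, str):
            problems.append("instructions must be a string")
        elif len(guidance) > MAX_INSTRUCTIONS_CHARS:
            problems.append("instructions exceeds 10,000 characters")

    # parseConfig and options are objects when present.
    for name in ("parseConfig", "options"):
        if name in fields and not isinstance(fields[name], dict):
            problems.append(f"{name} must be an object")

    return problems


ok = check_instruction_fields(
    {"schema": {"type": "object", "properties": {}}, "options": {"includeCitations": True}}
)
bad = check_instruction_fields(
    {"schema": {"type": "array"}, "instructions": "x" * 10_001}
)
```

Here ok is an empty list, while bad reports both the non-object schema root and the overlong instructions string.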

Response structure

A successful response returns the extracted data, optional per-field citation metadata, and page metadata:

{
  "status": 200,
  "requestId": "req_x1y2z3w4",
  "output": {
    "data": {
      "invoice_number": "INV-2024-0042",
      "total_amount": 1547.5
    },
    "metadata": {},
    "pages": [{ "page": 1, "width": 1200, "height": 1697 }]
  },
  "metrics": {
    "processingTimeMs": 4800,
    "pagesProcessed": 1
  },
  "usage": {
    "data_extraction_credits": {
      "cost": 27,
      "remainingCredits": 832
    },
    "price_composition": {
      "parse": { "units": 1, "unit_cost": 9, "cost": 9, "currency": "data_extraction_credits" },
      "extract": { "units": 1, "unit_cost": 18, "cost": 18, "currency": "data_extraction_credits" }
    }
  }
}

The response includes these fields:

  • output.data — Extracted values shaped to your schema. The API returns only declared properties.
  • output.metadata — Per-field citation metadata that mirrors the structure of output.data. This object is empty when citations are disabled.
  • output.pages — Page metadata, including the dimensions citation coordinates use.
  • metrics — Processing time and pages processed.
  • usage — Total credits consumed, broken down into parse and extract components. For credit details, refer to the pricing guide.
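A short Python sketch of reading these fields from a parsed response follows. The summarize_response helper is hypothetical, and the input is a trimmed copy of the sample response shown earlier. The check that the parse and extract costs add up to the total reflects that sample only and is an assumption about responses in general.

```python
def summarize_response(response: dict) -> dict:
    """Pull the commonly used values out of a parsed extract response."""
    output = response["output"]
    usage = response["usage"]
    total = usage["data_extraction_credits"]["cost"]
    # Sum the per-stage costs listed under price_composition.
    component_total = sum(
        stage["cost"] for stage in usage["price_composition"].values()
    )
    return {
        "data": output["data"],
        # output.metadata is empty when citations are disabled.
        "has_citations": bool(output["metadata"]),
        "page_count": len(output["pages"]),
        "credits_charged": total,
        "components_match": component_total == total,
    }


# Trimmed copy of the sample response from this guide.
sample = {
    "status": 200,
    "output": {
        "data": {"invoice_number": "INV-2024-0042", "total_amount": 1547.5},
        "metadata": {},
        "pages": [{"page": 1, "width": 1200, "height": 1697}],
    },
    "usage": {
        "data_extraction_credits": {"cost": 27, "remainingCredits": 832},
        "price_composition": {
            "parse": {"units": 1, "unit_cost": 9, "cost": 9},
            "extract": {"units": 1, "unit_cost": 18, "cost": 18},
        },
    },
}

summary = summarize_response(sample)
```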

Next steps

Use these guides to continue working with the extract endpoint.

  • Refer to the define a schema guide for supported JSON Schema keywords, constraints, and size limits.
  • Refer to the parse configuration guide to control the parse stage with parseConfig.mode and language hints.
  • Refer to the citations and confidence guide to ground extracted values back to the source document.
  • Refer to the error handling guide for status codes and error response formats.