Define a schema

The extract endpoint maps a document to the JSON Schema you provide in the schema field. The schema defines which fields Nutrient DWS Data Extraction API extracts and the shape of the returned output.data object.

Use this guide to understand supported schema keywords, closed-schema behavior, field types, and size limits.

Root requirements

Set the root schema to type: "object". The API rejects any other root type, such as array, with a 400 error. The following schema has a valid root:

{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Invoice identifier" },
    "total_amount": { "type": "number", "description": "Total including tax" }
  },
  "required": ["invoice_number", "total_amount"]
}
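
As a minimal sketch, you can check the root type on the client before sending a request, so a wrong root fails locally instead of as a 400 response. The helper name check_root is hypothetical and not part of the API.

```python
def check_root(schema: dict) -> None:
    """Raise ValueError unless the root schema declares type "object"."""
    root_type = schema.get("type")
    if root_type != "object":
        raise ValueError(f'root schema type must be "object", got {root_type!r}')


# Same schema as the JSON sample above, written as a Python dict.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice identifier"},
        "total_amount": {"type": "number", "description": "Total including tax"},
    },
    "required": ["invoice_number", "total_amount"],
}

check_root(invoice_schema)  # a valid root passes silently
```

Passing a schema whose root type is array raises ValueError in this sketch.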

Supported keywords

Keep the schema inline and self-contained. The API supports only these keywords.

| Keyword | Applies to | Description |
| --- | --- | --- |
| type | all | One of object, array, string, number, integer, boolean. |
| properties | objects | Maps a property name to a nested schema. |
| required | objects | Lists property names the extraction must always include. |
| items | arrays | Defines the schema for each array element. |
| description | all | Provides field-level guidance to the extraction model. |
| enum | string fields | Defines accepted string values. |
| format: "date" | string fields | Marks a string field as a date. |

The API rejects unsupported keywords, including:

  • $ref
  • $defs
  • Composition keywords, such as allOf, anyOf, and oneOf
  • Validation ranges, such as minimum, maxLength, and similar keywords
  • String format values other than date
  • Conditional schemas, such as if, then, and else

Send only the supported keywords listed above.
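
The keyword rules above can be approximated with a local pre-check. This is a sketch under assumptions: find_unsupported is a hypothetical helper, and the API's own validation may differ in details. Note that the function walks property names through the properties map, so a property that happens to be called description is not mistaken for a keyword.

```python
SUPPORTED_KEYWORDS = {"type", "properties", "required", "items", "description", "enum", "format"}
SUPPORTED_TYPES = {"object", "array", "string", "number", "integer", "boolean"}


def find_unsupported(schema: dict, path: str = "$") -> list[str]:
    """Return one message per keyword, type, or format this guide lists as unsupported."""
    problems = []
    for keyword, value in schema.items():
        if keyword not in SUPPORTED_KEYWORDS:
            problems.append(f"{path}: unsupported keyword {keyword!r}")
        elif keyword == "type" and (not isinstance(value, str) or value not in SUPPORTED_TYPES):
            problems.append(f"{path}: unsupported type {value!r}")
        elif keyword == "format" and value != "date":
            problems.append(f"{path}: unsupported format {value!r}")

    # Recurse into nested schemas only; property names themselves are not keywords.
    properties = schema.get("properties", {})
    if isinstance(properties, dict):
        for name, sub_schema in properties.items():
            if isinstance(sub_schema, dict):
                problems.extend(find_unsupported(sub_schema, f"{path}.{name}"))
    items = schema.get("items")
    if isinstance(items, dict):
        problems.extend(find_unsupported(items, f"{path}[]"))
    return problems


draft = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},               # a property name, not a keyword
        "total": {"type": "number", "minimum": 0},        # validation range
        "email": {"type": "string", "format": "email"},   # format other than date
    },
}
for problem in find_unsupported(draft):
    print(problem)
```

For this draft, the sketch reports the minimum keyword and the email format, and it leaves the property named description alone.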

Closed schemas

The API treats schemas as closed, and it returns only the properties you declare. It processes every object as if you set additionalProperties: false, so undeclared fields don’t appear in the response.

Don’t send additionalProperties in your schema. The API rejects requests that include it.

Use required to mark fields that extraction must always include in its output. The API can omit fields that aren’t required when the document doesn’t contain them.
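
To illustrate the difference between required and optional fields, the following sketch walks a schema and an extracted data object and lists required properties that are absent. The helper missing_required and the sample data are hypothetical; the sketch only shows how a client could assert the guarantee described above.

```python
def missing_required(schema: dict, data, path: str = "data") -> list[str]:
    """Return paths of required properties that are absent from the extracted data."""
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for name in schema.get("required", []):
            if name not in data:
                missing.append(f"{path}.{name}")
        # Descend only into properties that are actually present.
        for name, sub_schema in schema.get("properties", {}).items():
            if name in data:
                missing.extend(missing_required(sub_schema, data[name], f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, element in enumerate(data):
            missing.extend(missing_required(schema.get("items", {}), element, f"{path}[{index}]"))
    return missing


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "po_number": {"type": "string"},  # optional: may be omitted
    },
    "required": ["invoice_number", "total_amount"],
}

# Hypothetical extracted data with the optional po_number omitted.
extracted = {"invoice_number": "INV-001", "total_amount": 118.0}
print(missing_required(schema, extracted))
```

An empty list means every required field is present; the omitted optional field is not reported.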

Field types

Use primitive fields, arrays, and nested objects to model the values you want to extract.

Primitive fields

String, number, integer, and Boolean fields can include a description. String fields also support enum and format: "date":

{
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "enum": ["paid", "unpaid", "overdue"],
      "description": "Payment status of the invoice"
    },
    "issue_date": {
      "type": "string",
      "format": "date",
      "description": "Date the invoice was issued"
    },
    "is_tax_exempt": { "type": "boolean" }
  }
}

Array fields

Array fields require an items schema that defines each element. Use arrays for repeated structures, such as invoice line items:

{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "description": "One entry per row in the line-item table",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "unit_price": { "type": "number" }
        },
        "required": ["description", "quantity", "unit_price"]
      }
    }
  }
}
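
Because every array field needs an items schema, a quick local check can catch a forgotten items before the request is sent. The helper arrays_missing_items is a hypothetical name used for this sketch.

```python
def arrays_missing_items(schema: dict, path: str = "$") -> list[str]:
    """Return the path of every array schema that has no items schema."""
    missing = []
    if schema.get("type") == "array":
        items = schema.get("items")
        if isinstance(items, dict):
            # Arrays can contain objects or other arrays, so keep walking.
            missing.extend(arrays_missing_items(items, f"{path}[]"))
        else:
            missing.append(path)
    for name, sub_schema in schema.get("properties", {}).items():
        missing.extend(arrays_missing_items(sub_schema, f"{path}.{name}"))
    return missing


incomplete = {
    "type": "object",
    "properties": {
        "line_items": {"type": "array", "description": "One entry per row"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
print(arrays_missing_items(incomplete))
```

Here only line_items is reported, because tags already declares its element schema.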

Nested objects

Object fields can nest other objects up to the nesting depth limit of five levels (see Size limits). Citations mirror the same nesting, so a value at data.vendor.address.city has its citation at metadata.vendor.address.city:

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "country": { "type": "string" }
          }
        }
      }
    }
  }
}
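
Since citations mirror the nesting of the data, the same dotted path can be used to read both. The sketch below assumes you already have the data and metadata objects as Python dicts. The helper names, the sample values, and the citation placeholder are hypothetical; the actual citation payload is not shown in this guide.

```python
def get_path(tree, dotted_path: str):
    """Follow a dotted path such as "vendor.address.city"; return None if a step is missing."""
    current = tree
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def value_and_citation(data: dict, metadata: dict, dotted_path: str):
    """Look up a value and its citation with the same path in both trees."""
    return get_path(data, dotted_path), get_path(metadata, dotted_path)


# Hypothetical sample trees; the citation is a placeholder string, not a real payload.
data = {"vendor": {"name": "Acme GmbH", "address": {"city": "Vienna", "country": "AT"}}}
metadata = {"vendor": {"address": {"city": "<citation for city>"}}}

value, citation = value_and_citation(data, metadata, "vendor.address.city")
print(value, citation)
```

One design caveat: this sketch returns None both for a missing path and for a value that is legitimately null, so use a sentinel if you need to tell them apart.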

Size limits

Keep the schema within these limits. The API rejects requests that exceed them.

| Limit | Value |
| --- | --- |
| Serialized schema size | 32 KB |
| Total fields | 500 |
| Properties per object | 50 |
| Nesting depth | Five levels |
| enum values per field | 50 |
| enum value length | 256 characters |
| Property name length | 128 characters |
| description length | 1,024 characters |
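
The limits above can be checked locally with a rough pre-flight function. This is a sketch, and several counting rules are assumptions rather than documented behavior: 1 KB is treated as 1,024 bytes of compact UTF-8 JSON, every declared property at any level counts as one field, and the root counts as nesting level 1 with each nested property or items schema adding one level. The name check_limits is hypothetical.

```python
import json

LIMITS = {
    "schema_bytes": 32 * 1024,      # assumption: 1 KB = 1,024 bytes
    "total_fields": 500,
    "properties_per_object": 50,
    "nesting_depth": 5,
    "enum_values": 50,
    "enum_value_length": 256,
    "property_name_length": 128,
    "description_length": 1024,
}


def check_limits(schema: dict) -> list[str]:
    """Return a message for each limit the schema appears to exceed."""
    problems: list[str] = []
    size = len(json.dumps(schema, separators=(",", ":")).encode("utf-8"))
    if size > LIMITS["schema_bytes"]:
        problems.append(f"serialized schema size is {size} bytes")
    totals = {"fields": 0, "depth": 0}

    def walk(node: dict, path: str, depth: int) -> None:
        totals["depth"] = max(totals["depth"], depth)
        if len(node.get("description", "")) > LIMITS["description_length"]:
            problems.append(f"{path}: description too long")
        enum = node.get("enum", [])
        if len(enum) > LIMITS["enum_values"]:
            problems.append(f"{path}: too many enum values")
        if any(len(str(value)) > LIMITS["enum_value_length"] for value in enum):
            problems.append(f"{path}: enum value too long")
        properties = node.get("properties", {})
        if len(properties) > LIMITS["properties_per_object"]:
            problems.append(f"{path}: too many properties")
        for name, sub_schema in properties.items():
            totals["fields"] += 1  # assumption: each declared property is one field
            if len(name) > LIMITS["property_name_length"]:
                problems.append(f"{path}: property name too long ({name[:20]}...)")
            walk(sub_schema, f"{path}.{name}", depth + 1)
        if isinstance(node.get("items"), dict):
            walk(node["items"], f"{path}[]", depth + 1)

    walk(schema, "$", 1)
    if totals["fields"] > LIMITS["total_fields"]:
        problems.append(f"total fields is {totals['fields']}")
    if totals["depth"] > LIMITS["nesting_depth"]:
        problems.append(f"nesting depth is {totals['depth']}")
    return problems


small = {"type": "object", "properties": {"status": {"type": "string", "enum": ["paid", "unpaid"]}}}
print(check_limits(small))
```

Treat a clean result as a hint only; the API remains the authority on how it counts fields and depth.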

Write effective descriptions

The description on each field guides the extraction model. Use clear, specific descriptions to improve extraction accuracy.

  • Name the field the way it appears in the document, such as “the ‘Bill To’ company name” or “the grand total at the bottom of the invoice.”
  • State the expected format when it matters, such as “ISO 4217 currency code” or “two-letter country code.”
  • For ambiguous fields, state which value to extract when the document contains several values, such as “the final total after discounts and tax.”

For document-wide guidance that doesn’t fit a single field, use the top-level instructions string. To configure the parse stage that feeds extraction, refer to the parse configuration guide.
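
The sketch below combines field-level descriptions written in the style recommended above with a document-wide instructions string. It assumes that schema and instructions sit side by side as top-level keys of one JSON object. The function name, the sample wording, and that exact layout are assumptions, and the sketch does not show how the request is sent.

```python
import json


def build_extract_options(schema: dict, instructions: str = "") -> str:
    """Serialize the schema and optional document-wide instructions as one JSON object."""
    options = {"schema": schema}
    if instructions:
        options["instructions"] = instructions
    return json.dumps(options, indent=2)


schema = {
    "type": "object",
    "properties": {
        "bill_to": {
            "type": "string",
            "description": "The 'Bill To' company name",  # named as it appears in the document
        },
        "currency": {
            "type": "string",
            "description": "ISO 4217 currency code",  # expected format
        },
        "total_amount": {
            "type": "number",
            "description": "The final total after discounts and tax",  # resolves ambiguity
        },
    },
    "required": ["total_amount"],
}

print(build_extract_options(schema, "Ignore handwritten annotations."))
```

Keeping field-specific guidance in description and reserving instructions for guidance that applies to the whole document follows the split this guide describes.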
