---
title: "Define a schema"
canonical_url: "https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema/"
md_url: "https://www.nutrient.io/guides/dws-data-extraction/extract/define-a-schema.md"
last_updated: "2026-06-11T00:00:00.000Z"
description: "Write a JSON Schema for the Data Extraction API extract endpoint. Supported keywords, closed-schema behavior, and size limits."
---

# Define a schema

The extract endpoint maps a document to the JSON Schema you provide in the `schema` field. The schema defines which fields Nutrient DWS Data Extraction API extracts and the shape of the returned `output.data` object.

Use this guide to understand supported schema keywords, closed-schema behavior, field types, and size limits.

## Root requirements

Set the root schema to `type: "object"`. The API rejects any other root type, such as `array`, with a 400 error:

```json
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Invoice identifier" },
    "total_amount": { "type": "number", "description": "Total including tax" }
  },
  "required": ["invoice_number", "total_amount"]
}
```
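A client-side check can catch a wrong root type before the request is sent. The following is a minimal sketch; `check_root` is a hypothetical helper name, not part of any Nutrient SDK, and it covers only the root-type rule described above.

```python
import json


def check_root(schema: dict) -> None:
    """Raise ValueError unless the root schema declares type "object"."""
    root_type = schema.get("type") if isinstance(schema, dict) else None
    if root_type != "object":
        raise ValueError(f'root schema must have type "object", got {root_type!r}')


# The schema from the example above, loaded from its JSON text.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string", "description": "Invoice identifier" },
    "total_amount": { "type": "number", "description": "Total including tax" }
  },
  "required": ["invoice_number", "total_amount"]
}
""")

check_root(schema)  # passes silently for an object root
```

A root such as `{"type": "array", "items": {...}}` makes `check_root` raise, which mirrors the rejection the API would return.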

## Supported keywords

Keep the schema inline and self-contained. The API supports only these keywords.

| Keyword          | Applies to    | Description                                                         |
| ---------------- | ------------- | ------------------------------------------------------------------- |
| `type`           | all           | One of `object`, `array`, `string`, `number`, `integer`, `boolean`. |
| `properties`     | objects       | Maps a property name to a nested schema.                            |
| `required`       | objects       | Lists property names the extraction must always include.            |
| `items`          | arrays        | Defines the schema for each array element.                          |
| `description`    | all           | Provides field-level guidance to the extraction model.              |
| `enum`           | string fields | Defines accepted string values.                                     |
| `format: "date"` | string fields | Marks a string field as a date.                                     |

The API rejects unsupported keywords, including:

- `$ref`

- `$defs`

- Composition keywords, such as `allOf`, `anyOf`, and `oneOf`

- Validation constraints, such as `minimum` and `maxLength`

- String `format` values other than `date`

- Conditional schemas, such as `if`, `then`, and `else`

Send only the supported keywords listed above.
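The keyword rules above can be checked locally before sending a request. The sketch below walks a schema and reports anything outside the supported table; `find_unsupported` and the `$`-prefixed path format are illustrative choices, not part of the API. It doesn't check that `enum` and `format` appear only on string fields.

```python
SUPPORTED_KEYWORDS = {
    "type", "properties", "required", "items", "description", "enum", "format",
}


def find_unsupported(schema: dict, path: str = "$") -> list[str]:
    """Return paths of keywords that are not in the supported table."""
    problems = []
    for keyword, value in schema.items():
        if keyword not in SUPPORTED_KEYWORDS:
            problems.append(f"{path}.{keyword}")
        elif keyword == "format" and value != "date":
            # Only format: "date" is listed as supported.
            problems.append(f"{path}.format")
    # Property names are data, not keywords, so a property called
    # "description" is fine. Recurse into each property's schema only.
    for name, sub in schema.get("properties", {}).items():
        problems.extend(find_unsupported(sub, f"{path}.properties.{name}"))
    if isinstance(schema.get("items"), dict):
        problems.extend(find_unsupported(schema["items"], f"{path}.items"))
    return problems


schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number", "minimum": 0},
        "email": {"type": "string", "format": "email"},
        "description": {"type": "string"},
    },
}
print(find_unsupported(schema))
# ['$.properties.total.minimum', '$.properties.email.format']
```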

## Closed schemas

The API treats every schema as closed and returns only the properties you declare. It processes every object as if you had set `additionalProperties: false`, so undeclared fields don’t appear in the response.

Don’t send `additionalProperties` in your schema. The API rejects requests that include it.

Use `required` to mark fields that extraction must always include in its output. The API can omit fields that aren’t required when the document doesn’t contain them.
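To make the closed-schema and `required` rules concrete, the sketch below compares an extracted object with its schema. It's a client-side sanity check under stated assumptions: `check_closed` is a hypothetical helper, the `data` argument stands for the returned `output.data` object, and the sample values are invented.

```python
def check_closed(schema: dict, data: dict, path: str = "data") -> list[str]:
    """List keys that break closed-schema rules for one extracted object."""
    issues = []
    declared = schema.get("properties", {})
    # A closed schema allows only declared properties.
    for key in data:
        if key not in declared:
            issues.append(f"{path}.{key}: not declared in schema")
    # Required properties must always be present.
    for key in schema.get("required", []):
        if key not in data:
            issues.append(f"{path}.{key}: required but missing")
    # Nested objects follow the same rules.
    for key, sub in declared.items():
        value = data.get(key)
        if sub.get("type") == "object" and isinstance(value, dict):
            issues.extend(check_closed(sub, value, f"{path}.{key}"))
    return issues


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "po_number": {"type": "string"},  # optional: may be omitted
    },
    "required": ["invoice_number", "total_amount"],
}

# Invented sample: po_number is absent, which is allowed because it isn't required.
data = {"invoice_number": "INV-1", "total_amount": 10.5}
print(check_closed(schema, data))  # []
```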

## Field types

Use primitive fields, arrays, and nested objects to model the values you want to extract.

### Primitive fields

String, number, integer, and Boolean fields can include a `description`. String fields also support `enum` and `format: "date"`:

```json
{
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "enum": ["paid", "unpaid", "overdue"],
      "description": "Payment status of the invoice"
    },
    "issue_date": {
      "type": "string",
      "format": "date",
      "description": "Date the invoice was issued"
    },
    "is_tax_exempt": { "type": "boolean" }
  }
}
```

### Array fields

Array fields require an `items` schema that defines each element. Use arrays for repeated structures, such as invoice line items:

```json
{
  "type": "object",
  "properties": {
    "line_items": {
      "type": "array",
      "description": "One entry per row in the line-item table",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "unit_price": { "type": "number" }
        },
        "required": ["description", "quantity", "unit_price"]
      }
    }
  }
}
```

### Nested objects

Object fields can nest other objects up to the nesting depth limit of five levels (see Size limits). Citations mirror the same nesting, so a value at `data.vendor.address.city` has its citation at `metadata.vendor.address.city`:

```json
{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "country": { "type": "string" }
          }
        }
      }
    }
  }
}
```
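Because citations mirror the data nesting, one dotted path can look up both a value and its citation. The sketch below assumes the response has been parsed into a dictionary with `data` and `metadata` keys side by side. The `get_path` helper, the sample values, and the citation placeholder are invented for illustration; refer to the citations and confidence guide for the real citation shape.

```python
def get_path(tree: dict, dotted: str):
    """Follow a dot-separated path through nested dicts; None if a step is missing."""
    node = tree
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Hypothetical parsed response fragment (values and citation content are placeholders).
output = {
    "data": {
        "vendor": {"name": "Acme", "address": {"city": "Vienna", "country": "AT"}}
    },
    "metadata": {
        "vendor": {"address": {"city": "<citation for city>"}}
    },
}

path = "vendor.address.city"
value = get_path(output["data"], path)         # value at data.vendor.address.city
citation = get_path(output["metadata"], path)  # citation at metadata.vendor.address.city
```

This helper handles nested objects only. Array elements would need index handling, which the sketch leaves out.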

## Size limits

Keep the schema within these limits. The API rejects requests that exceed them.

| Limit                   | Value            |
| ----------------------- | ---------------- |
| Serialized schema size  | 32 KB            |
| Total fields            | 500              |
| Properties per object   | 50               |
| Nesting depth           | 5 levels         |
| `enum` values per field | 50               |
| `enum` value length     | 256 characters   |
| Property name length    | 128 characters   |
| `description` length    | 1,024 characters |
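A preflight check against these limits can avoid a rejected request. The sketch below rests on several assumptions the table doesn't spell out: 32 KB means 32 × 1,024 bytes of compact UTF-8 JSON, "total fields" counts every declared property at any level, and nesting depth counts object levels with the root as level 1. The `check_limits` helper is hypothetical, and the API may count differently.

```python
import json

LIMITS = {
    "schema_bytes": 32 * 1024,  # assumption: 1 KB = 1,024 bytes
    "total_fields": 500,
    "properties_per_object": 50,
    "nesting_depth": 5,
    "enum_values": 50,
    "enum_value_length": 256,
    "property_name_length": 128,
    "description_length": 1024,
}


def check_limits(schema: dict) -> list[str]:
    """Return human-readable messages for each limit the schema exceeds."""
    errors = []
    size = len(json.dumps(schema, separators=(",", ":")).encode("utf-8"))
    if size > LIMITS["schema_bytes"]:
        errors.append(f"schema is {size} bytes when serialized")
    total_fields = 0

    def walk(node: dict, path: str, depth: int) -> None:
        nonlocal total_fields
        if node.get("type") == "object" and depth > LIMITS["nesting_depth"]:
            errors.append(f"{path}: object at nesting depth {depth}")
        if len(node.get("description", "")) > LIMITS["description_length"]:
            errors.append(f"{path}: description too long")
        enum = node.get("enum", [])
        if len(enum) > LIMITS["enum_values"]:
            errors.append(f"{path}: {len(enum)} enum values")
        if any(len(str(v)) > LIMITS["enum_value_length"] for v in enum):
            errors.append(f"{path}: enum value too long")
        props = node.get("properties", {})
        if len(props) > LIMITS["properties_per_object"]:
            errors.append(f"{path}: {len(props)} properties in one object")
        for name, sub in props.items():
            total_fields += 1
            if len(name) > LIMITS["property_name_length"]:
                errors.append(f"{path}: property name too long")
            walk(sub, f"{path}.{name}", depth + 1)
        if isinstance(node.get("items"), dict):
            # Assumption: an array does not add a level; its element schema
            # sits at the same depth as the array field.
            walk(node["items"], f"{path}[]", depth)

    walk(schema, "$", 1)
    if total_fields > LIMITS["total_fields"]:
        errors.append(f"schema declares {total_fields} total fields")
    return errors
```

An empty list from `check_limits(schema)` means the schema fits these limits as interpreted here.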

## Write effective descriptions

The `description` on each field guides the extraction model. Use clear, specific descriptions to improve extraction accuracy.

- Name the field the way it appears in the document, such as “the ‘Bill To’ company name” or “the grand total at the bottom of the invoice.”

- State the expected format when it matters, such as “ISO 4217 currency code” or “two-letter country code.”

- For ambiguous fields, state which value to extract when the document contains several values, such as “the final total after discounts and tax.”

For document-wide guidance that doesn’t fit a single field, use the top-level `instructions` string. To configure the parse stage that feeds extraction, refer to the [parse configuration](https://www.nutrient.io/guides/dws-data-extraction/extract/parse-configuration.md) guide.
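The sketch below shows field-level `description` values next to a document-wide `instructions` string. Only the `schema` and `instructions` names come from this guide. The sketch assumes `instructions` sits next to `schema` in the request, the sample wording is invented, and how the document itself is attached is left to the extract endpoint overview.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "bill_to": {
            "type": "string",
            "description": "The 'Bill To' company name",
        },
        "currency": {
            "type": "string",
            "description": "ISO 4217 currency code",
        },
        "total_amount": {
            "type": "number",
            "description": "The final total after discounts and tax",
        },
    },
    "required": ["total_amount"],
}

# Assumed placement: instructions alongside schema at the top level.
body = {
    "schema": schema,
    "instructions": "Amounts in this document use a comma as the decimal separator.",
}

payload = json.dumps(body)  # JSON text for the schema-related part of the request
```

Guidance that applies to a single field stays in that field's `description`; only guidance that spans the whole document goes into `instructions`.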

## Next steps

Use these guides to continue setting up extraction.

- Refer to the [extract endpoint overview](https://www.nutrient.io/guides/dws-data-extraction/extract.md) guide for request formats and response structure.

- Refer to the [citations and confidence](https://www.nutrient.io/guides/dws-data-extraction/extract/citations-and-confidence.md) guide to ground extracted values back to source locations.

- Refer to the [parse configuration](https://www.nutrient.io/guides/dws-data-extraction/extract/parse-configuration.md) guide to tune the parse stage that feeds extraction.