DWS Data Extraction API

Use the Nutrient DWS Data Extraction API to extract structured content from PDFs, images, and Office files. Send a document to the API and receive typed document elements with spatial data, whole-document Markdown, or schema-shaped JSON.

What it does

Use the DWS Data Extraction API to build document extraction workflows that need structured output.

Extract paragraphs, tables, formulas, pictures, and key-value pairs from documents, with bounding box coordinates and confidence scores.
Convert documents to structured Markdown for retrieval-augmented generation (RAG) pipelines, search indexing, and content migration.
Extract domain-specific JSON data mapped to your schema, with per-field citations back to the source.
Choose between four processing modes: text extraction, optical character recognition (OCR)-based structure extraction, AI-augmented document understanding, and vision language model (VLM)-augmented agentic extraction.
Process documents in more than 100 languages with multilingual OCR support.

The DWS Data Extraction API is part of Nutrient Document Web Services (DWS) and focuses on content extraction workflows. For document generation, conversion, and editing actions, refer to the DWS Processor API guide.

Two endpoints

Choose the endpoint that matches the output you need from a document.

Parse endpoint

Return the document’s full structure, either as typed spatial elements with bounding boxes or as whole-document Markdown.

Extract endpoint

Map a document to your JSON Schema and return the requested fields with per-field citations.
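
To illustrate what "your JSON Schema" might look like, the sketch below defines a small invoice schema as a Python dictionary and serializes it to JSON. The field names (`invoice_number`, `issue_date`, `total_amount`) and the helper `schema_to_json` are hypothetical and chosen only for illustration. How the schema is attached to an Extract request is not shown here, so check the API reference for the actual request format.

```python
import json

# Hypothetical schema for an invoice. The field names are illustrative
# assumptions and are not defined by the DWS Data Extraction API.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}


def schema_to_json(schema: dict) -> str:
    """Serialize a schema dictionary to a JSON string."""
    return json.dumps(schema, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(schema_to_json(INVOICE_SCHEMA))
```

Keeping the schema small and marking only the fields you truly need as required tends to make the extracted output easier to validate downstream.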

Processing modes

Choose the processing pipeline that fits your document type and output requirements.

Text mode

Markdown extraction from born-digital documents. No OCR or AI. 1 credit per page.

Structure mode

OCR-based extraction with typed spatial elements and bounding boxes. 1.5 credits per page.

Understand mode

Full AI-augmented pipeline with layout analysis, table detection, and semantic classification. 9 credits per page.

Agentic mode

VLM-augmented extraction that builds on understand mode for deep visual understanding. 18 credits per page.
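
Because each mode is priced per page, the cost of a job can be estimated before sending it. The sketch below multiplies the per-page credit costs listed above by a page count. The lowercase mode keys and the `estimate_credits` helper are assumptions made for this example, and the sketch ignores any rounding or plan-specific billing rules, which are not described here.

```python
from decimal import Decimal

# Credits per page for each processing mode, as listed in this guide.
# The lowercase keys are an assumption, not confirmed API parameter values.
CREDITS_PER_PAGE = {
    "text": Decimal("1"),
    "structure": Decimal("1.5"),
    "understand": Decimal("9"),
    "agentic": Decimal("18"),
}


def estimate_credits(mode: str, pages: int) -> Decimal:
    """Return the estimated credit cost of processing `pages` pages in `mode`."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return CREDITS_PER_PAGE[mode] * pages


if __name__ == "__main__":
    for mode in CREDITS_PER_PAGE:
        print(f"{mode}: {estimate_credits(mode, 40)} credits for 40 pages")
```

For a 40-page document, this works out to 40 credits in text mode, 60 in structure mode, 360 in understand mode, and 720 in agentic mode.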

Output formats

Choose an output format based on what your downstream system needs.

Spatial elements

Typed document elements (such as paragraphs, tables, formulas, pictures, and key-value pairs) with bounding boxes, confidence scores, and reading order.
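
One way to consume spatial elements downstream is to drop low-confidence items and then process the rest in reading order. The sketch below models an element with a hypothetical `Element` class. The field names, the bounding box layout, and the 0.8 threshold are assumptions for illustration and do not reflect the actual response schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical shape of one spatial element (not the real API schema)."""
    kind: str                                # for example "paragraph" or "table"
    text: str
    bbox: Tuple[float, float, float, float]  # assumed layout: left, top, width, height
    confidence: float                        # assumed range: 0.0 to 1.0
    reading_order: int


def confident_in_order(elements: List[Element], min_confidence: float = 0.8) -> List[Element]:
    """Keep elements at or above the threshold, sorted by reading order."""
    kept = [e for e in elements if e.confidence >= min_confidence]
    return sorted(kept, key=lambda e: e.reading_order)


# Made-up sample data for demonstration only.
SAMPLE = [
    Element("table", "Line items", (40, 300, 500, 200), 0.91, 2),
    Element("paragraph", "Smudged footer", (40, 760, 500, 20), 0.42, 3),
    Element("paragraph", "Invoice 1001", (40, 40, 200, 24), 0.98, 1),
]

if __name__ == "__main__":
    for element in confident_in_order(SAMPLE):
        print(element.reading_order, element.kind, element.text)
```

The right confidence threshold depends on how costly a missed or wrong element is in your workflow, so treat the default here as a placeholder.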

Markdown

Whole-document Markdown representation for RAG, search indexing, and content pipelines.
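
For RAG and search indexing, whole-document Markdown is commonly split into smaller chunks before embedding or indexing. The sketch below shows one simple approach, splitting on ATX headings (`#` through `######`). The `chunk_by_heading` helper is hypothetical and not part of the API. It assumes headings start at the beginning of a line and does not treat `#` lines inside fenced code blocks specially.

```python
import re
from typing import Dict, List, Optional

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Dict[str, Optional[str]]]:
    """Split Markdown into chunks, one per heading, keeping any preamble text."""
    chunks: List[Dict[str, Optional[str]]] = []
    title: Optional[str] = None
    lines: List[str] = []

    def flush() -> None:
        # Emit a chunk if there is a heading or any non-blank body text.
        if title is not None or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Overview\nFirst part.\n\n## Details\nSecond part.\n"
    for chunk in chunk_by_heading(sample):
        print(chunk)
```

Heading-based chunks keep related text together, which usually helps retrieval quality more than fixed-size splitting, though very long sections may still need a secondary size limit.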

Essential guides

Start with these guides to set up your first request, explore the API, or review pricing.

Get started

Developer guides

API reference, request formats, output schemas, and integration patterns.

Pricing

Credit costs per mode and plan options.