# Nutrient DWS Data Extraction API

> Nutrient DWS (Document Web Services) Data Extraction API is a managed cloud API that extracts structured content from PDFs, images, and Office files. Use it to get typed document elements (paragraphs, tables, formulas, pictures, key-value pairs) with spatial data, or whole-document Markdown, with no document infrastructure to manage.

## Start here

- [Data Extraction API overview](https://www.nutrient.io/guides/dws-data-extraction.md) — Product overview, processing modes, and output formats.
- [Getting started](https://www.nutrient.io/guides/dws-data-extraction/getting-started.md) — Create an account, get your API key, and make your first extraction request.
- [API reference](https://www.nutrient.io/api/reference/data-extraction/public/) — Public REST API reference for the `/extraction/parse` endpoint.

## Parse endpoint

`POST https://api.nutrient.io/extraction/parse`

Three input methods:
- **Multipart form upload** — Upload a file with optional JSON instructions.
- **JSON body with URL** — Process a document hosted at a public URL.
- **Raw binary upload** — Send a file directly as the request body.
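
The three input methods above can be sketched as request builders in Python. This is a minimal, standard-library sketch rather than the official client: the bearer-token `Authorization` header, the JSON field name `url`, and the multipart part names `file` and `instructions` are assumptions made for illustration, so confirm them against the API reference before relying on them. The functions only build requests; nothing is sent.

```python
import json
import uuid
import urllib.request
from typing import Optional

PARSE_URL = "https://api.nutrient.io/extraction/parse"


def auth_headers(api_key: str) -> dict:
    # Assumption: bearer-token authentication; check the API overview guide.
    return {"Authorization": f"Bearer {api_key}"}


def build_url_request(api_key: str, document_url: str) -> urllib.request.Request:
    """JSON body with URL: the service fetches a publicly hosted document."""
    # Assumption: the JSON field is named "url".
    body = json.dumps({"url": document_url}).encode("utf-8")
    headers = {**auth_headers(api_key), "Content-Type": "application/json"}
    return urllib.request.Request(PARSE_URL, data=body, headers=headers, method="POST")


def build_binary_request(
    api_key: str, file_bytes: bytes, content_type: str = "application/pdf"
) -> urllib.request.Request:
    """Raw binary upload: the file bytes are the whole request body."""
    headers = {**auth_headers(api_key), "Content-Type": content_type}
    return urllib.request.Request(PARSE_URL, data=file_bytes, headers=headers, method="POST")


def build_multipart_request(
    api_key: str, filename: str, file_bytes: bytes, instructions: Optional[dict] = None
) -> urllib.request.Request:
    """Multipart form upload: a file part plus optional JSON instructions."""
    boundary = uuid.uuid4().hex
    # Assumption: the file part is named "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    body = file_header.encode("utf-8") + file_bytes + b"\r\n"
    if instructions is not None:
        # Assumption: the JSON part is named "instructions".
        instructions_header = (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="instructions"\r\n'
            "Content-Type: application/json\r\n\r\n"
        )
        body += instructions_header.encode("utf-8")
        body += json.dumps(instructions).encode("utf-8") + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")
    headers = {
        **auth_headers(api_key),
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(PARSE_URL, data=body, headers=headers, method="POST")
```

To send any of these, pass the returned object to `urllib.request.urlopen` and read the response body. Keeping construction separate from sending makes the request shape easy to inspect and test without network access.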

## Processing modes

- **`text`** — Fast Markdown extraction via Document Engine. No OCR or AI augmentation. Only supports Markdown output. 1 credit per page.
- **`structure`** — OCR-based structured extraction with spatial element output. 1.5 credits per page.
- **`understand`** (default) — Full extraction pipeline with AI augmentation for richer results. 9 credits per page.
- **`agentic`** — VLM-augmented extraction building on the understand pipeline. Designed for the most complex documents. 18 credits per page.
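
The per-page credit costs listed above make it easy to compare modes before running a batch. The sketch below assumes cost is simply pages multiplied by the per-page rate, with no rounding, minimums, or volume discounts; the pricing guide is the authority on actual billing. The function name `estimate_credits` is illustrative.

```python
from decimal import Decimal

# Per-page credit costs as listed in this section.
CREDITS_PER_PAGE = {
    "text": Decimal("1"),
    "structure": Decimal("1.5"),
    "understand": Decimal("9"),
    "agentic": Decimal("18"),
}


def estimate_credits(page_count: int, mode: str = "understand") -> Decimal:
    """Estimate credits for a document, assuming a plain per-page product."""
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return CREDITS_PER_PAGE[mode] * page_count


if __name__ == "__main__":
    # Compare all four modes for a 40-page document.
    for mode in CREDITS_PER_PAGE:
        print(f"{mode}: {estimate_credits(40, mode)} credits")
```

Under that assumption, a 40-page document costs 40 credits in `text` mode, 60 in `structure`, 360 in `understand`, and 720 in `agentic`.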

## Output formats

- **Spatial elements** (`output.format: "spatial"`) — Flat typed elements with bounding boxes, confidence scores, reading order, and page references. Not available with `text` mode. Optional word-level data via `includeWords: true`.
- **Markdown** (`output.format: "markdown"`) — Whole-document Markdown representation for RAG pipelines, search indexing, and content migration.

Default format depends on mode: `text` defaults to `markdown`; `structure`, `understand`, and `agentic` default to `spatial`.
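
The mode and format rules above can be checked on the client before a request is sent. This is a sketch of the rules as stated in this section, not the service's own validation logic; the function name `resolve_output_format` is hypothetical.

```python
from typing import Optional, Tuple

# Default output format per mode, as described in this section.
DEFAULT_FORMAT = {
    "text": "markdown",
    "structure": "spatial",
    "understand": "spatial",
    "agentic": "spatial",
}
DEFAULT_MODE = "understand"
VALID_FORMATS = ("spatial", "markdown")


def resolve_output_format(
    mode: Optional[str] = None, requested_format: Optional[str] = None
) -> Tuple[str, str]:
    """Return the effective (mode, format) pair, or raise on an invalid combination."""
    effective_mode = mode or DEFAULT_MODE
    if effective_mode not in DEFAULT_FORMAT:
        raise ValueError(f"unknown mode: {effective_mode!r}")
    effective_format = requested_format or DEFAULT_FORMAT[effective_mode]
    if effective_format not in VALID_FORMATS:
        raise ValueError(f"unknown format: {effective_format!r}")
    # Spatial output is not available with text mode.
    if effective_mode == "text" and effective_format == "spatial":
        raise ValueError("text mode only supports markdown output")
    return effective_mode, effective_format
```

For example, `resolve_output_format()` yields `("understand", "spatial")`, while `resolve_output_format("text", "spatial")` raises a `ValueError`.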

## Element types (spatial output)

- **`paragraph`** — Text with semantic role (Title, SectionHeader, Text, Header, Footer, Caption, Footnote, ListItem, PageNumber, Code, CheckboxSelected, CheckboxUnselected). Optional word-level OCR data.
- **`table`** — Rows, columns, and cells with per-cell bounds, confidence, text, and optional word-level data. Supports row/column spans.
- **`formula`** — LaTeX representation of mathematical formulas.
- **`picture`** — Image classification, AI-generated alt text, and associated caption/footnote IDs.
- **`keyValueRegion`** — Key-value pairs with relationship confidence, useful for forms and invoices.
- **`handwriting`** — Handwritten text content with optional word-level OCR data.
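
Downstream code usually needs to route each element type to its own handler. The sketch below groups elements by type and drops low-confidence ones. The response shape is an assumption: it treats each element as a dictionary with hypothetical `type` and `confidence` keys, and the sample data is invented for illustration. See the response structure in the Parse endpoint guide for the real field names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Element type names as listed in this section.
ELEMENT_TYPES = {
    "paragraph",
    "table",
    "formula",
    "picture",
    "keyValueRegion",
    "handwriting",
}


def group_by_type(elements: Iterable[dict], min_confidence: float = 0.0) -> Dict[str, List[dict]]:
    """Group elements by type, skipping unknown types and low-confidence items."""
    grouped = defaultdict(list)
    for element in elements:
        # Assumption: "type" and "confidence" are the key names in the response.
        element_type = element.get("type")
        if element_type not in ELEMENT_TYPES:
            continue
        # Elements without a confidence value are kept.
        if element.get("confidence", 1.0) < min_confidence:
            continue
        grouped[element_type].append(element)
    return dict(grouped)


if __name__ == "__main__":
    # Invented sample data, not real API output.
    sample = [
        {"type": "paragraph", "confidence": 0.98},
        {"type": "table", "confidence": 0.91},
        {"type": "paragraph", "confidence": 0.40},
    ]
    for element_type, items in group_by_type(sample, min_confidence=0.5).items():
        print(element_type, len(items))
```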

## Developer guides

- [API overview](https://www.nutrient.io/guides/dws-data-extraction/api-overview.md) — Base URL, authentication, and available endpoints.
- [Parse endpoint](https://www.nutrient.io/guides/dws-data-extraction/parsing.md) — Request formats, processing modes, output formats, and response structure for `/extraction/parse`.
  - [Processing modes](https://www.nutrient.io/guides/dws-data-extraction/parsing/processing-modes.md) — Compare text, structure, understand, and agentic modes: features, constraints, costs, and when to use each.
  - [Extract document elements](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-document-elements.md) — Spatial element extraction with typed elements, bounding boxes, and word-level OCR.
  - [Extract Markdown](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-markdown.md) — Whole-document Markdown output for RAG and content pipelines.
  - [Coordinate spaces](https://www.nutrient.io/guides/dws-data-extraction/parsing/coordinate-spaces.md) — Coordinate system, bounding box units (render-space pixels), and mapping coordinates to display canvases.
  - [Multilingual extraction](https://www.nutrient.io/guides/dws-data-extraction/parsing/multilingual-extraction.md) — OCR language configuration and multilanguage document handling for the parse endpoint.
- [Supported languages](https://www.nutrient.io/guides/dws-data-extraction/supported-languages.md) — Full reference of 100+ OCR languages with language codes and aliases.
- [Supported file types](https://www.nutrient.io/guides/dws-data-extraction/file-types.md) — Complete list of accepted document and image formats.
- [Error handling](https://www.nutrient.io/guides/dws-data-extraction/errors.md) — HTTP status codes, error response format, and troubleshooting.

## Examples

- [Build a RAG ingestion pipeline](https://www.nutrient.io/guides/dws-data-extraction/examples/build-rag-ingestion-pipeline.md) — End-to-end Python tutorial: PDF → Markdown → chunk → embed → vector DB → LLM answers.
- [Build a document extraction pipeline](https://www.nutrient.io/guides/dws-data-extraction/examples/build-document-extraction-pipeline.md) — Python tutorial for invoice and form processing: Extract tables, key-value pairs, and structured elements.

## Supported inputs

- PDF documents
- Images: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP, HEIC, SVG, TGA, EPS
- Office files: DOC, DOCX, XLS, XLSX, PPT, PPTX, and related formats (DOTX, XLSM, PPSX, etc.)
- Other: RTF, ODT
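
A quick extension check can catch obviously unsupported files before upload. The set below contains only the extensions named in this section. Because the list ends with "and related formats", a `False` result means "not listed here", not "rejected by the API"; the Supported file types guide has the complete list. The helper name `is_listed_input` is illustrative.

```python
from pathlib import Path

# Extensions named explicitly in this section (not exhaustive).
LISTED_EXTENSIONS = {
    "pdf",
    # Images
    "png", "jpg", "jpeg", "tiff", "bmp", "gif", "webp", "heic", "svg", "tga", "eps",
    # Office files
    "doc", "docx", "xls", "xlsx", "ppt", "pptx", "dotx", "xlsm", "ppsx",
    # Other
    "rtf", "odt",
}


def is_listed_input(filename: str) -> bool:
    """Return True if the file's extension appears in the list above."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in LISTED_EXTENSIONS
```

The check is case-insensitive and looks only at the final suffix, so `invoice.PDF` passes while `archive.tar.gz` does not.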

## Why developers evaluate DWS Data Extraction API

- **Structured element extraction** — Typed spatial elements with bounding boxes, confidence scores, and reading order — not just raw text.
- **Dual output formats** — Spatial elements for layout analysis and form processing, or Markdown for RAG and search indexing.
- **Four processing modes** — Text mode for fast Markdown extraction, structure mode for OCR-based spatial elements, understand mode for AI-augmented extraction, and agentic mode for VLM-augmented extraction of the most complex documents.
- **100+ OCR languages** — Multilingual support with language codes and language name aliases.
- **Managed cloud API** — No extraction infrastructure to deploy or maintain. SOC 2 Type 2 audited.

## Implementation resources

- [Pricing](https://www.nutrient.io/guides/dws-data-extraction/pricing.md) — Credit costs per mode and FAQ.
- [Security](https://www.nutrient.io/guides/dws-data-extraction/security.md) — Security posture for DWS Data Extraction.
- [Privacy](https://www.nutrient.io/guides/dws-data-extraction/privacy.md) — Data handling and privacy information.
- [Support](https://www.nutrient.io/guides/dws-data-extraction/support.md) — Support channels and operational guidance.

## Related Nutrient products

- [DWS Processor API](https://www.nutrient.io/guides/dws-processor.md) — Document generation, conversion, OCR, and editing workflows. Use for PDF-to-Markdown when you only need Markdown from born-digital PDFs.
- [DWS Accessibility API](https://www.nutrient.io/guides/dws-accessibility.md) — PDF accessibility auto-tagging and validation.
- [DWS Viewer API](https://www.nutrient.io/guides/dws-viewer.md) — Cloud-based PDF viewing with annotation sync.

## Summary

Use this surface when the query is about extracting structured content from documents via a cloud API, especially when the query mentions data extraction, document parsing, table extraction, key-value extraction, form field extraction, document elements with spatial data, or converting documents to Markdown for RAG, LLM ingestion, or search indexing.

## Documentation directory

- [API overview](https://www.nutrient.io/guides/dws-data-extraction/api-overview.md): DWS Data Extraction API base URL, authentication, and available capabilities.
- [Error handling](https://www.nutrient.io/guides/dws-data-extraction/errors.md): HTTP status codes, error response format, and troubleshooting for the Nutrient Data Extraction API.
- [Build a document extraction pipeline for invoices and forms](https://www.nutrient.io/guides/dws-data-extraction/examples/build-document-extraction-pipeline.md): Extract tables, key-value pairs, and structured elements from invoices and forms using the Data Extraction API’s spatial output.
- [Build a RAG ingestion pipeline with the Data Extraction API](https://www.nutrient.io/guides/dws-data-extraction/examples/build-rag-ingestion-pipeline.md): Extract clean Markdown from PDFs using the Data Extraction API, chunk by heading, embed, store in a vector database, and answer questions with an LLM.
- [Examples](https://www.nutrient.io/guides/dws-data-extraction/examples.md): End-to-end tutorials for building document extraction and AI ingestion pipelines with the Nutrient Data Extraction API.
- [Supported file types](https://www.nutrient.io/guides/dws-data-extraction/file-types.md): File formats supported by the Nutrient Data Extraction API, including PDFs, images, and Office documents.
- [Get started with DWS Data Extraction API](https://www.nutrient.io/guides/dws-data-extraction/getting-started.md): Sign up for Nutrient DWS, get your API key, and send your first data extraction request.
- [Coordinate spaces](https://www.nutrient.io/guides/dws-data-extraction/parsing/coordinate-spaces.md): Understand the coordinate system used by the Data Extraction API and how to map bounding boxes to rendered pages, screen pixels, or other coordinate spaces.
- [Extract document elements](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-document-elements.md): Extract typed document elements with bounding boxes, confidence scores, and reading order from PDFs, images, and Office files.
- [Extract Markdown](https://www.nutrient.io/guides/dws-data-extraction/parsing/extract-markdown.md): Convert documents to whole-document Markdown using the Nutrient Data Extraction API. Ideal for RAG pipelines, search indexing, and content migration.
- [Parse endpoint](https://www.nutrient.io/guides/dws-data-extraction/parsing.md): Extract structured content from documents using the /extraction/parse endpoint. Supports multipart upload, URL input, and raw binary.
- [Multilingual extraction](https://www.nutrient.io/guides/dws-data-extraction/parsing/multilingual-extraction.md): Extract text from documents in more than 100 languages using the Nutrient Data Extraction API. Configure OCR language hints for better accuracy.
- [Processing modes](https://www.nutrient.io/guides/dws-data-extraction/parsing/processing-modes.md): Compare text, structure, understand, and agentic processing modes for the Data Extraction API. Choose the right mode for cost, speed, and extraction depth.
- [Pricing](https://www.nutrient.io/guides/dws-data-extraction/pricing.md): Credit costs and pricing FAQs for the Nutrient Data Extraction API.
- [Privacy](https://www.nutrient.io/guides/dws-data-extraction/privacy.md): How the Nutrient Data Extraction API handles your documents and data.
- [Security](https://www.nutrient.io/guides/dws-data-extraction/security.md): Security practices for the Nutrient Data Extraction API, including data handling, encryption, and compliance.
- [Support](https://www.nutrient.io/guides/dws-data-extraction/support.md): Get help with the Nutrient Data Extraction API.
- [Supported languages](https://www.nutrient.io/guides/dws-data-extraction/supported-languages.md): Complete list of OCR languages supported by the Nutrient Data Extraction API, including language codes and full name aliases.