Multilingual extraction

The Data Extraction API supports more than 100 languages for OCR. By default, the API uses English (eng). You can specify one or more languages using the options.language parameter to improve extraction accuracy for non-English documents.

The language option only applies to structure, understand, and agentic modes, which run OCR. It has no effect in text mode.
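
As a rough illustration of the rule above, a client script could check the mode before bothering to set a language. The helper name `language_applies` is hypothetical and not part of the API; treating every mode other than the three listed as non-OCR is an assumption of this sketch.

```shell
#!/bin/sh
# language_applies MODE
# Hypothetical helper (not an API feature): succeeds when MODE is one of the
# OCR-backed modes that honor options.language, fails for any other mode.
language_applies() {
  case "$1" in
    structure|understand|agentic) return 0 ;;
    *) return 1 ;;
  esac
}

mode="text"
if ! language_applies "$mode"; then
  # Warn on stderr so the message does not mix with extraction output.
  echo "options.language has no effect in $mode mode" >&2
fi
```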

Specifying a language

Set options.language in the instructions to tell the OCR engine which language to expect.

curl -X POST https://api.nutrient.io/extraction/parse \
-H "Authorization: Bearer your_api_key_goes_here" \
-F "file=@document.pdf" \
-F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":"german"}}'
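
When the language varies per document, it may be convenient to build the instructions JSON from variables instead of hard-coding it. The following is a minimal sketch: `build_instructions`, `parse_document`, and the `API_KEY` variable are hypothetical names introduced here, while the endpoint and JSON shape are copied from the request above.

```shell
#!/bin/sh
# build_instructions MODE LANGUAGE
# Prints the instructions JSON for a single language string.
# Assumes neither argument contains characters that need JSON escaping
# (true for codes such as "deu" and names such as "german").
build_instructions() {
  printf '{"mode":"%s","output":{"format":"spatial"},"options":{"language":"%s"}}' "$1" "$2"
}

# parse_document FILE MODE LANGUAGE
# Hypothetical wrapper around the request shown above.
# Expects the API key in the API_KEY environment variable.
parse_document() {
  curl -X POST https://api.nutrient.io/extraction/parse \
    -H "Authorization: Bearer ${API_KEY}" \
    -F "file=@$1" \
    -F "instructions=$(build_instructions "$2" "$3")"
}
```

With these definitions, `parse_document document.pdf understand german` would send the same request as the example above.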

Language format

You can specify languages in three ways:

| Format | Example | Description |
| --- | --- | --- |
| Full name (lowercase) | "english", "german" | Common languages only |
| Language code | "eng", "deu" | All languages |
| Code with variant | "chi_sim", "deu_frak" | Script or historical variants |

The API normalizes full language names to language codes internally.
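
To make the idea of normalization concrete, here is a client-side sketch of what such a mapping might look like. It is only an illustration: the API's actual alias table is not shown in this guide, the function name `normalize_language` is hypothetical, and the two name-to-code pairs are inferred from the examples in the table above.

```shell
#!/bin/sh
# normalize_language NAME_OR_CODE
# Illustrative only: maps a full lowercase name to its language code.
# The pairs below are inferred from this guide's examples; the API's real
# alias list is larger and is documented in the supported languages reference.
normalize_language() {
  case "$1" in
    english) echo "eng" ;;
    german)  echo "deu" ;;
    *)       echo "$1" ;;  # assume anything else is already a code
  esac
}

normalize_language german    # prints: deu
normalize_language chi_sim   # prints: chi_sim
```

Because the API performs this step itself, a helper like this is only useful when client code needs to compare or log languages in one canonical form.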

Multilanguage documents

For documents that contain text in multiple languages, specify all relevant languages as an array or a +-joined string.

Array syntax

curl -X POST https://api.nutrient.io/extraction/parse \
-H "Authorization: Bearer your_api_key_goes_here" \
-F "file=@multilingual.pdf" \
-F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":["eng","spa","fra"]}}'

Plus-joined string syntax

You can also use a +-joined string instead of an array:

curl -X POST https://api.nutrient.io/extraction/parse \
-H "Authorization: Bearer your_api_key_goes_here" \
-F "file=@multilingual.pdf" \
-F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":"eng+spa+fra"}}'

The array and +-joined syntaxes are equivalent; use whichever is easier to produce in your client code.
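
Since the two syntaxes carry the same information, converting between them is a simple string operation. The sketch below shows one way to do it in a shell script; the function names `plus_to_json_array` and `codes_to_plus` are hypothetical, and the code assumes language codes contain no characters that need JSON escaping.

```shell
#!/bin/sh
# plus_to_json_array "eng+spa+fra"  ->  ["eng","spa","fra"]
plus_to_json_array() {
  # Wrap the whole string in brackets and quotes, then turn each "+"
  # into a closing quote, comma, and opening quote.
  printf '["%s"]' "$1" | sed 's/+/","/g'
}

# codes_to_plus eng spa fra  ->  eng+spa+fra
codes_to_plus() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the IFS change local.
  ( IFS='+'; printf '%s' "$*" )
}

plus_to_json_array "eng+spa+fra"   # prints: ["eng","spa","fra"]
echo
codes_to_plus eng spa fra          # prints: eng+spa+fra
echo
```

Either result can be placed in options.language: the array as a JSON value, the +-joined form as a JSON string.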

Tips for better accuracy

  • Always specify the document language when it isn’t English. This helps the OCR engine load the correct character models and dictionaries.
  • For multilanguage documents, list all languages present. The OCR engine handles language switching within the document.
  • Use language codes when working with languages that don’t have a full-name alias.
  • For Chinese, Japanese, and Korean, use the specific language code (chi_sim, chi_tra, jpn, or kor) so the OCR engine selects the correct character set.

Supported languages

The Data Extraction API supports more than 100 OCR languages. See the supported languages reference for the full list of language codes and aliases.