Multilingual extraction

The Data Extraction API supports more than 100 languages for OCR. By default, the API uses English (eng). You can specify one or more languages using the options.language parameter to improve extraction accuracy for non-English documents.

The language option only applies to structure, understand, and agentic modes, which run OCR. It has no effect in text mode.
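
As a minimal sketch of this rule, the helper below adds `options.language` to an instructions dictionary only when the mode is one that runs OCR. The names `OCR_MODES` and `with_language` are hypothetical and not part of the API; the mode names are the ones listed on this page.

```python
# Modes that run OCR according to this page; "text" mode does not.
OCR_MODES = {"structure", "understand", "agentic"}


def with_language(instructions, language):
    """Return a copy of `instructions` with options.language set.

    Hypothetical client-side helper: the language is only added for
    modes that run OCR, because it has no effect in text mode.
    """
    result = dict(instructions)
    if result.get("mode") in OCR_MODES:
        # Copy any existing options so the caller's dict is not mutated.
        options = dict(result.get("options", {}))
        options["language"] = language
        result["options"] = options
    return result
```

Calling `with_language({"mode": "text"}, "deu")` returns the instructions unchanged, which keeps a no-op option out of the request.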

Specifying a language

Set options.language in the instructions to tell the OCR engine which language to expect.

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@document.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":"german"}}'

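import json

import requests

# The instructions field is sent as a JSON string alongside the file.
instructions = {
    "mode": "understand",
    "output": {"format": "spatial"},
    "options": {"language": "german"},
}

# Open the file in a context manager so the handle is closed after the upload.
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        "https://api.nutrient.io/extraction/parse",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": pdf},
        data={"instructions": json.dumps(instructions)},
    )

print(response.json())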

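import { readFile } from "node:fs/promises";

// The built-in FormData accepts a Blob or File, not a read stream.
const form = new FormData();
form.append("file", new Blob([await readFile("document.pdf")]), "document.pdf");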
form.append(
  "instructions",
  JSON.stringify({
    mode: "understand",
    output: { format: "spatial" },
    options: { language: "german" },
  }),
);

const response = await fetch("https://api.nutrient.io/extraction/parse", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

console.log(await response.json());

Language format

You can specify languages in three ways:

Format	Example	Description
Full name (lowercase)	`"english"`, `"german"`	Common languages only
Language code	`"eng"`, `"deu"`	All languages
Code with variant	`"chi_sim"`, `"deu_frak"`	Script or historical variants

The API normalizes full language names to language codes internally.
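
The normalization step can be pictured with a small client-side sketch. This is not the API's implementation: the `NAME_TO_CODE` table and the `to_language_code` function are hypothetical, and the table holds only the two name and code pairs used as examples on this page. Lowercasing the input is a convenience added here, since the page shows full names in lowercase.

```python
# Illustrative subset only. The full list of aliases lives in the
# supported languages reference, not in this table.
NAME_TO_CODE = {
    "english": "eng",
    "german": "deu",
}


def to_language_code(value):
    """Map a full language name to its code; pass anything else through.

    Unknown values are returned unchanged (lowercased and trimmed) so the
    API can decide whether they are valid codes such as "chi_sim".
    """
    key = value.strip().lower()
    return NAME_TO_CODE.get(key, key)
```

Passing unknown values through, rather than rejecting them, avoids blocking codes or aliases that the API accepts but this small table does not list.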

Multilanguage documents

For documents that contain text in multiple languages, specify all relevant languages as an array or a +-joined string.

Array syntax

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@multilingual.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":["eng","spa","fra"]}}'

import json

import requests

# List every language that appears in the document.
instructions = {
    "mode": "understand",
    "output": {"format": "spatial"},
    "options": {"language": ["eng", "spa", "fra"]},
}

# Open the file in a context manager so the handle is closed after the upload.
with open("multilingual.pdf", "rb") as pdf:
    response = requests.post(
        "https://api.nutrient.io/extraction/parse",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": pdf},
        data={"instructions": json.dumps(instructions)},
    )

print(response.json())

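import { readFile } from "node:fs/promises";

// The built-in FormData accepts a Blob or File, not a read stream.
const form = new FormData();
form.append("file", new Blob([await readFile("multilingual.pdf")]), "multilingual.pdf");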
form.append(
  "instructions",
  JSON.stringify({
    mode: "understand",
    output: { format: "spatial" },
    options: { language: ["eng", "spa", "fra"] },
  }),
);

const response = await fetch("https://api.nutrient.io/extraction/parse", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

console.log(await response.json());

Plus-joined string syntax

You can also use a +-joined string instead of an array:

curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@multilingual.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":"eng+spa+fra"}}'

Both formats are equivalent. The API accepts either one.
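
Because the two forms carry the same information, client code can convert freely between them. The following sketch shows one way to do that; `to_plus_joined` and `to_language_list` are hypothetical helper names, not part of the API.

```python
def to_plus_joined(languages):
    """Return the +-joined string form, e.g. "eng+spa+fra"."""
    if isinstance(languages, str):
        return languages  # Already a single code or a +-joined string.
    return "+".join(languages)


def to_language_list(languages):
    """Return the array form, e.g. ["eng", "spa", "fra"]."""
    if isinstance(languages, str):
        return languages.split("+")
    return list(languages)
```

For example, `to_plus_joined(["eng", "spa", "fra"])` gives `"eng+spa+fra"`, and `to_language_list("eng+spa+fra")` gives the original list back.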

Tips for better accuracy

Always specify the document language when it isn’t English. This helps the OCR engine load the correct character models and dictionaries.
For multilanguage documents, list all languages present. The OCR engine handles language switching within the document.
Use language codes when working with languages that don’t have a full-name alias.
For Chinese, Japanese, and Korean, use the specific codes (chi_sim for Simplified Chinese, chi_tra for Traditional Chinese, jpn, kor) to select the correct character set.

Supported languages

See the supported languages reference for the full list of language codes and aliases.
