Multilingual extraction
The Data Extraction API supports more than 100 languages for OCR. By default, the API uses English (eng). You can specify one or more languages using the options.language parameter to improve extraction accuracy for non-English documents.
The language option only applies to structure, understand, and agentic modes, which run OCR. It has no effect in text mode.
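The mode restriction above can be handled on the client before a request is built. The following is a minimal sketch, not part of the API: the helper name language_options and the OCR_MODES set are hypothetical, and only the mode names come from this page.

```python
# Modes that run OCR according to this page; "text" mode ignores the language option.
OCR_MODES = {"structure", "understand", "agentic"}


def language_options(mode, language):
    """Return the options dict for a request, omitting language where it has no effect.

    Hypothetical client-side helper; the API itself does not require this step.
    """
    if mode in OCR_MODES and language:
        return {"language": language}
    return {}


print(language_options("understand", "deu"))  # {'language': 'deu'}
print(language_options("text", "deu"))        # {}
```

Dropping the option in text mode keeps the request honest about what will be applied, but sending it anyway should be harmless, since the page states that it has no effect there.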
Specifying a language
Set options.language in the instructions to tell the OCR engine which language to expect.
```shell
curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@document.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":"german"}}'
```

```python
import json

import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("document.pdf", "rb") as pdf:
    response = requests.post(
        "https://api.nutrient.io/extraction/parse",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": pdf},
        data={
            "instructions": json.dumps({
                "mode": "understand",
                "output": {"format": "spatial"},
                "options": {"language": "german"},
            })
        },
    )

print(response.json())
```

```javascript
import { readFile } from "node:fs/promises";

// The built-in FormData accepts only strings and Blobs, so wrap the file bytes in a Blob.
const form = new FormData();
form.append("file", new Blob([await readFile("document.pdf")]), "document.pdf");
form.append(
  "instructions",
  JSON.stringify({
    mode: "understand",
    output: { format: "spatial" },
    options: { language: "german" },
  }),
);

const response = await fetch("https://api.nutrient.io/extraction/parse", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

console.log(await response.json());
```

Language format
You can specify languages in three ways:
| Format | Example | Description |
|---|---|---|
| Full name (lowercase) | "english", "german" | Common languages only |
| Language code | "eng", "deu" | All languages |
| Code with variant | "chi_sim", "deu_frak" | Script or historical variants |
The API normalizes full language names to language codes internally.
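To illustrate what normalizing a full name to a code looks like, here is a minimal client-side sketch. It is not the API's internal implementation: the helper name normalize_language and the ALIASES table are hypothetical, and the table holds only the two name and code pairs shown in the table above.

```python
# Illustrative alias table only; the authoritative list of aliases is in the
# supported languages reference, not here.
ALIASES = {"english": "eng", "german": "deu"}


def normalize_language(name):
    """Map a full language name to its code; pass codes and variants through unchanged."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


print(normalize_language("german"))    # deu
print(normalize_language("chi_sim"))   # chi_sim
```

Because the API performs this normalization itself, a helper like this is only useful when you want consistent codes in your own logs or configuration.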
Multilanguage documents
For documents that contain text in multiple languages, specify all relevant languages as an array or a +-joined string.
Array syntax
```shell
curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@multilingual.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":["eng","spa","fra"]}}'
```

```python
import json

import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("multilingual.pdf", "rb") as pdf:
    response = requests.post(
        "https://api.nutrient.io/extraction/parse",
        headers={"Authorization": "Bearer your_api_key_goes_here"},
        files={"file": pdf},
        data={
            "instructions": json.dumps({
                "mode": "understand",
                "output": {"format": "spatial"},
                "options": {"language": ["eng", "spa", "fra"]},
            })
        },
    )

print(response.json())
```

```javascript
import { readFile } from "node:fs/promises";

// The built-in FormData accepts only strings and Blobs, so wrap the file bytes in a Blob.
const form = new FormData();
form.append("file", new Blob([await readFile("multilingual.pdf")]), "multilingual.pdf");
form.append(
  "instructions",
  JSON.stringify({
    mode: "understand",
    output: { format: "spatial" },
    options: { language: ["eng", "spa", "fra"] },
  }),
);

const response = await fetch("https://api.nutrient.io/extraction/parse", {
  method: "POST",
  headers: { Authorization: "Bearer your_api_key_goes_here" },
  body: form,
});

console.log(await response.json());
```

Plus-joined string syntax
You can also use a +-joined string instead of an array:
```shell
curl -X POST https://api.nutrient.io/extraction/parse \
  -H "Authorization: Bearer your_api_key_goes_here" \
  -F "file=@multilingual.pdf" \
  -F 'instructions={"mode":"understand","output":{"format":"spatial"},"options":{"language":"eng+spa+fra"}}'
```

Both formats are equivalent. The API accepts either one.
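If your code stores languages in one form and needs the other, the conversion is a plain string join or split. The sketch below assumes nothing about the API beyond the two syntaxes described on this page; the helper names to_plus_joined and to_array are hypothetical.

```python
import json


def to_plus_joined(languages):
    """Turn a list of language codes into the +-joined string form."""
    return "+".join(languages)


def to_array(language):
    """Turn a +-joined string into the array form."""
    return language.split("+")


codes = ["eng", "spa", "fra"]
print(to_plus_joined(codes))      # eng+spa+fra
print(to_array("eng+spa+fra"))    # ['eng', 'spa', 'fra']

# Either value can be placed in options.language when building the instructions.
instructions = json.dumps({
    "mode": "understand",
    "output": {"format": "spatial"},
    "options": {"language": to_plus_joined(codes)},
})
```

The array form is usually easier to build and validate in code, while the joined string is convenient in shell commands and configuration files.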
Tips for better accuracy
- Always specify the document language when it isn’t English. This helps the OCR engine load the correct character models and dictionaries.
- For multilanguage documents, list all languages present. The OCR engine handles language switching within the document.
- Use language codes when working with languages that don’t have a full-name alias.
- For Chinese, Japanese, and Korean, use the specific language codes (chi_sim, chi_tra, jpn, kor) to select the correct character set.
Supported languages
The Data Extraction API supports more than 100 OCR languages. See the supported languages reference for the full list of language codes and aliases.