OCR server supported languages

The Nutrient OCR component supports a wide range of languages, enabling precise text recognition based on linguistic characteristics such as ligatures, punctuation rules, and symbol variations. To ensure accurate text extraction, specify the language of the document during OCR configuration.

The following languages are supported across all platforms:

  • Croatian

  • Czech

  • Danish

  • Dutch

  • English

  • Finnish

  • French

  • German

  • Indonesian

  • Italian

  • Malay

  • Norwegian

  • Polish

  • Portuguese

  • Serbian

  • Slovak

  • Slovenian

  • Spanish

  • Swedish

  • Turkish

  • Welsh

Languages aren’t region-specific. For example, English applies to both American English and British English.

In addition to the languages listed above, you can OCR other supported languages by providing their ISO 639-2 codes in your API request. Refer to our API reference for the complete list of supported language codes.

If the language you need isn’t listed there either, contact Support for assistance.

OCR capabilities are exposed through the following API endpoints:

OCR a document with one language

The following curl request asks Document Engine to OCR a document in a single language, Japanese (ISO 639-2 code: jpn):

# Assuming Document Engine is running on `localhost:5000`.

curl -X POST http://localhost:5000/api/build \
    -H 'Authorization: Token token=secret' \
    -o jpn-ocr-result.pdf \
    --fail \
    -H 'Content-Type: multipart/form-data' \
    -F scanned=@/path/to/japanese-document.png \
    -F instructions='{
      "parts": [
        {
          "file": "scanned"
        }
      ],
      "actions": [
        {
          "type": "ocr",
          "language": "jpn"
        }
      ],
      "output": {
        "type": "pdf"
      }
    }'
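The request above passes `--fail`, so curl exits with a nonzero status when the server answers with an HTTP error. As a minimal sketch of how a script might report that, the helper below maps a few documented curl exit statuses to messages. The function name `describe_curl_status` and the message wording are illustrative assumptions, not part of Document Engine or curl.

```shell
# Hypothetical helper (an assumption for illustration): turns a curl exit
# status into a short diagnostic message.
describe_curl_status() {
  case "$1" in
    0)  echo "ok: result written to the -o file" ;;
    7)  echo "error: could not connect to the host" ;;
    22) echo "error: server returned HTTP 400 or above (reported because of --fail)" ;;
    *)  echo "error: curl exited with status $1" ;;
  esac
}

# Usage, directly after the curl command:
#   curl ... ; describe_curl_status "$?"
```

Without `--fail`, curl typically exits with status 0 even when the server returns an error response, so the error body would be saved to the output file instead.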

OCR a document with multiple languages

To perform OCR on a document containing multiple languages, pass a list of languages, each given either by name or by its ISO 639-2 code. The following curl request asks Document Engine to OCR a document in two languages, listing French by name (french) and English by its ISO 639-2 code (eng):

# Assuming Document Engine is running on `localhost:5000`.

curl -X POST http://localhost:5000/api/build \
    -H 'Authorization: Token token=secret' \
    -o french-english-ocr-result.pdf \
    --fail \
    -H 'Content-Type: multipart/form-data' \
    -F scanned=@/path/to/english-french-document.png \
    -F instructions='{
      "parts": [
        {
          "file": "scanned"
        }
      ],
      "actions": [
        {
          "type": "ocr",
          "language": ["french", "eng"]
        }
      ],
      "output": {
        "type": "pdf"
      }
    }'
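The two requests differ only in the `language` value: a string for one language, an array for several. A minimal sketch of a shell helper that builds the `instructions` payload from a list of languages could look like the following. The function name `ocr_instructions` and its letters-only validation are assumptions made for illustration; they are not part of Document Engine.

```shell
# Hypothetical helper (an assumption for illustration): prints an
# `instructions` payload for /api/build from one or more language identifiers.
ocr_instructions() {
  if [ "$#" -eq 0 ]; then
    echo "usage: ocr_instructions LANGUAGE [LANGUAGE...]" >&2
    return 1
  fi

  local lang language list=""
  for lang in "$@"; do
    # Reject empty values and anything other than the letters a-z,
    # so the values can be embedded without JSON escaping.
    case "$lang" in
      ''|*[!a-z]*) echo "invalid language: $lang" >&2; return 1 ;;
    esac
    list="${list:+$list, }\"$lang\""
  done

  # One language is emitted as a string, several as an array,
  # matching the two request shapes shown above.
  if [ "$#" -eq 1 ]; then
    language="$list"
  else
    language="[$list]"
  fi

  printf '{"parts": [{"file": "scanned"}], "actions": [{"type": "ocr", "language": %s}], "output": {"type": "pdf"}}\n' "$language"
}

# Usage, replacing the inline JSON in the requests above:
#   -F instructions="$(ocr_instructions french eng)"
```

The helper only checks the shape of each identifier. Whether a given name or code is actually supported is decided by Document Engine, so the API reference remains the source of truth.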