OCR server supported languages

The Nutrient OCR component supports a wide range of languages, enabling precise text recognition based on linguistic characteristics such as ligatures, punctuation rules, and symbol variations. To ensure accurate text extraction, specify the language of the document during OCR configuration.

Languages aren’t region-specific. For example, English applies to both American English and British English.

If your required language isn’t listed, contact Support for assistance.

Supported languages

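| Description | Language code | Full language name |
| --- | --- | --- |
| Afrikaans | afr | |
| Albanian | sqi | |
| Amharic | amh | |
| Arabic | ara | |
| Armenian | hye | |
| Assamese | asm | |
| Azerbaijani | aze | |
| Azerbaijani - Cyrillic | aze_cyrl | |
| Basque | eus | |
| Belarusian | bel | |
| Bengali | ben | |
| Bosnian | bos | |
| Breton | bre | |
| Bulgarian | bul | |
| Burmese | mya | |
| Catalan; Valencian | cat | |
| Cebuano | ceb | |
| Central Khmer | khm | |
| Cherokee | chr | |
| Chinese - Simplified | chi_sim | |
| Chinese - Simplified (Vertical) | chi_sim_vert | |
| Chinese - Traditional | chi_tra | |
| Chinese - Traditional (Vertical) | chi_tra_vert | |
| Corsican | cos | |
| Croatian | hrv | croatian |
| Czech | ces | czech |
| Danish | dan | danish |
| Danish - Fraktur | dan_frak | |
| Dhivehi; Maldivian | div | |
| Dutch; Flemish | nld | dutch |
| Dzongkha | dzo | |
| English | eng | english |
| English, Middle (1100–1500) | enm | |
| Esperanto | epo | |
| Estonian | est | |
| Faroese | fao | |
| Filipino | fil | |
| Finnish | fin | finnish |
| French | fra | french |
| French, Middle (ca. 1400–1600) | frm | |
| Galician | glg | |
| Georgian | kat | |
| Georgian - Old | kat_old | |
| German | deu | german |
| German - Fraktur | deu_frak | |
| German Fraktur | frk | |
| Greek, Ancient | grc | |
| Greek, Modern | ell | |
| Gujarati | guj | |
| Haitian; Haitian Creole | hat | |
| Hebrew | heb | |
| Hindi | hin | |
| Hungarian | hun | |
| Icelandic | isl | |
| Indonesian | ind | indonesian |
| Inuktitut | iku | |
| Irish | gle | |
| Italian | ita | italian |
| Italian - Old | ita_old | |
| Japanese | jpn | |
| Japanese (Vertical) | jpn_vert | |
| Javanese | jav | |
| Kannada | kan | |
| Kazakh | kaz | |
| Kirghiz; Kyrgyz | kir | |
| Korean | kor | |
| Korean (Vertical) | kor_vert | |
| Kurdish | kur | |
| Kurmanji (Kurdish) | kmr | |
| Lao | lao | |
| Latin | lat | |
| Latvian | lav | |
| Lithuanian | lit | |
| Luxembourgish | ltz | |
| Macedonian | mkd | |
| Malay | msa | malay |
| Malayalam | mal | |
| Maltese | mlt | |
| Maori | mri | |
| Marathi | mar | |
| Math/Equation detection | equ | |
| Mongolian | mon | |
| Nepali | nep | |
| Norwegian | nor | norwegian |
| Occitan | oci | |
| Oriya | ori | |
| Panjabi; Punjabi | pan | |
| Persian | fas | |
| Polish | pol | polish |
| Portuguese | por | portuguese |
| Pushto; Pashto | pus | |
| Quechua | que | |
| Romanian; Moldavian | ron | |
| Russian | rus | |
| Sanskrit | san | |
| Scottish Gaelic | gla | |
| Serbian | srp | serbian |
| Serbian - Latin | srp_latn | |
| Sindhi | snd | |
| Sinhala; Sinhalese | sin | |
| Slovak | slk | slovak |
| Slovak - Fraktur | slk_frak | |
| Slovenian | slv | slovenian |
| Spanish; Castilian | spa | spanish |
| Spanish - Old | spa_old | |
| Sundanese | sun | |
| Swahili | swa | |
| Swedish | swe | swedish |
| Syriac | syr | |
| Tagalog | tgl | |
| Tajik | tgk | |
| Tamil | tam | |
| Tatar | tat | |
| Telugu | tel | |
| Thai | tha | |
| Tibetan | bod | |
| Tigrinya | tir | |
| Tonga | ton | |
| Turkish | tur | turkish |
| Uighur; Uyghur | uig | |
| Ukrainian | ukr | |
| Urdu | urd | |
| Uzbek | uzb | |
| Uzbek - Cyrillic | uzb_cyrl | |
| Vietnamese | vie | |
| Welsh | cym | |
| Western Frisian | fry | |
| Yiddish | yid | |
| Yoruba | yor | |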

Usage

OCR capabilities are exposed through the following API endpoints:

You can specify the language of your document in one of the following ways:

  • Full language name, for example english or german (available only for commonly used languages)
  • ISO 639-2 language code, for example eng or deu (available for all languages)
  • ISO 639-2 language code with a variant, for example chi_sim_vert or deu_frak
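
Because full names exist only for some languages, a client that wants a single canonical form (for logging or deduplication, say) can normalize names to codes before sending a request. The following is a minimal sketch of that idea. The function name `ocr_language_code` is hypothetical and not part of Document Engine, and the mapping is copied from the rows of the table above that list a full language name. Document Engine accepts both forms directly, so this step is optional.

```shell
# Hypothetical client-side helper (not part of Document Engine):
# map a full language name to its language code, using the rows of the
# supported-languages table that list a full name. Any other value
# (for example a code such as jpn or chi_sim_vert) is passed through unchanged.
ocr_language_code() {
  case "$1" in
    croatian)   printf '%s\n' hrv ;;
    czech)      printf '%s\n' ces ;;
    danish)     printf '%s\n' dan ;;
    dutch)      printf '%s\n' nld ;;
    english)    printf '%s\n' eng ;;
    finnish)    printf '%s\n' fin ;;
    french)     printf '%s\n' fra ;;
    german)     printf '%s\n' deu ;;
    indonesian) printf '%s\n' ind ;;
    italian)    printf '%s\n' ita ;;
    malay)      printf '%s\n' msa ;;
    norwegian)  printf '%s\n' nor ;;
    polish)     printf '%s\n' pol ;;
    portuguese) printf '%s\n' por ;;
    serbian)    printf '%s\n' srp ;;
    slovak)     printf '%s\n' slk ;;
    slovenian)  printf '%s\n' slv ;;
    spanish)    printf '%s\n' spa ;;
    swedish)    printf '%s\n' swe ;;
    turkish)    printf '%s\n' tur ;;
    *)          printf '%s\n' "$1" ;;
  esac
}
```

For example, `ocr_language_code german` prints `deu`, while `ocr_language_code chi_sim_vert` prints the code unchanged.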

OCR a document with one language

Below is an example of a curl request to Document Engine to OCR a document in Japanese (ISO 639-2 code jpn):

# Assuming Document Engine is running on `localhost:5000`.
curl -X POST http://localhost:5000/api/build \
  -H 'Authorization: Token token=secret' \
  -o jpn-ocr-result.pdf \
  --fail \
  -H 'Content-Type: multipart/form-data' \
  -F 'scanned=@/path/to/japanese-document.png' \
  -F instructions='{
    "parts": [
      {
        "file": "scanned"
      }
    ],
    "actions": [
      {
        "type": "ocr",
        "language": "jpn"
      }
    ],
    "output": {
      "type": "pdf"
    }
  }'
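
When the same request is issued for different documents, it can help to generate the instructions from a language value instead of editing the JSON by hand. The sketch below is one way to do that in a shell script. The function name `ocr_instructions` is hypothetical, and the character check assumes that language values contain only lowercase letters and underscores, as every entry in the table of supported languages does.

```shell
# Hypothetical helper: print the Build instructions for a single-language OCR
# request. The JSON has the same structure as the request shown above.
ocr_instructions() {
  # Reject empty values and anything other than lowercase letters and
  # underscores, so the value cannot break out of the JSON string.
  case "$1" in
    ''|*[!a-z_]*)
      echo "invalid language: $1" >&2
      return 1
      ;;
  esac
  printf '{"parts":[{"file":"scanned"}],"actions":[{"type":"ocr","language":"%s"}],"output":{"type":"pdf"}}\n' "$1"
}

# Usage (not executed here; assumes Document Engine on localhost:5000):
#   curl -X POST http://localhost:5000/api/build \
#     -H 'Authorization: Token token=secret' \
#     -o jpn-ocr-result.pdf \
#     --fail \
#     -F 'scanned=@/path/to/japanese-document.png' \
#     -F "instructions=$(ocr_instructions jpn)"
```

The function only builds the payload, so it can be checked without a running server.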

OCR a document with multiple languages

To perform OCR on a document containing multiple languages, pass a list of full language names or ISO 639-2 codes as the language value. As the example shows, names and codes can be mixed in the same list. Below is an example of a curl request to Document Engine to OCR a document in two languages, English and French:

# Assuming Document Engine is running on `localhost:5000`.
curl -X POST http://localhost:5000/api/build \
  -H 'Authorization: Token token=secret' \
  -o french-english-ocr-result.pdf \
  --fail \
  -H 'Content-Type: multipart/form-data' \
  -F 'scanned=@/path/to/english-french-document.png' \
  -F instructions='{
    "parts": [
      {
        "file": "scanned"
      }
    ],
    "actions": [
      {
        "type": "ocr",
        "language": ["french", "eng"]
      }
    ],
    "output": {
      "type": "pdf"
    }
  }'
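
If the set of languages varies per document, the list can be assembled from script arguments. The following is a minimal sketch under the same assumptions as the request above. The function name `ocr_instructions_multi` is hypothetical, and values are restricted to lowercase letters and underscores because that is the only form that appears in the table of supported languages.

```shell
# Hypothetical helper: print the Build instructions for an OCR request with
# one or more languages, passed as separate arguments.
# The body runs in a subshell, so `list` and `lang` do not leak to the caller.
ocr_instructions_multi() (
  list=''
  for lang in "$@"; do
    case "$lang" in
      ''|*[!a-z_]*)
        echo "invalid language: $lang" >&2
        exit 1
        ;;
    esac
    # Append the quoted value, adding a ", " separator after the first item.
    list="${list:+$list, }\"$lang\""
  done
  if [ -z "$list" ]; then
    echo "at least one language is required" >&2
    exit 1
  fi
  printf '{"parts":[{"file":"scanned"}],"actions":[{"type":"ocr","language":[%s]}],"output":{"type":"pdf"}}\n' "$list"
)

# Usage (not executed here; assumes Document Engine on localhost:5000):
#   curl -X POST http://localhost:5000/api/build \
#     -H 'Authorization: Token token=secret' \
#     -o french-english-ocr-result.pdf \
#     --fail \
#     -F 'scanned=@/path/to/english-french-document.png' \
#     -F "instructions=$(ocr_instructions_multi french eng)"
```

Calling `ocr_instructions_multi french eng` yields the same `"language": ["french", "eng"]` list as the hand-written request.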