Python Tesseract OCR: Extract text from images using pytesseract

    Extract text from images and scanned documents using Python and Tesseract OCR. This tutorial covers installation, text extraction, and preprocessing techniques. For searchable PDFs from scanned documents, see the Nutrient OCR API section.
    TL;DR

    Use Tesseract OCR with pytesseract to extract text from images. Preprocessing (grayscale, resizing, thresholding) improves accuracy. For searchable PDFs or batch processing, use Nutrient OCR API.

    To use Tesseract OCR in Python, install the pytesseract wrapper library and Tesseract engine. Then call pytesseract.image_to_string(image) to extract text from any image. The function returns recognized text as a string — no cloud services or API keys are required for basic usage.

    Key capabilities of pytesseract:

    • Text extraction — Extract text from JPG, PNG, TIFF, and other image formats
    • 100+ languages — Support for English, French, German, Chinese, Arabic, and more
    • Configurable — Control page segmentation, language, and character allowlists
    • Free and open source — Apache 2.0 license with active community support
    • Cross-platform — Works on Windows, macOS, and Linux

    Python developers use Tesseract OCR with the pytesseract wrapper to extract text from images and scanned documents.

    What OCR does

    OCR extracts text from images and scanned documents. Common uses include:

    • Digitizing paper documents for search and archival
    • Automating data entry from forms and invoices
    • Making scanned PDFs searchable and copyable
    • Indexing document content for retrieval

    Tesseract OCR

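    Tesseract OCR is an open source OCR engine originally developed at Hewlett-Packard between 1985 and 1994, released as open source in 2005, and developed primarily by Google from 2006 to 2018. It is now maintained by the open source community. It combines an LSTM neural network recognizer with traditional image processing to recognize text. Tesseract is written in C++, supports 100+ languages, is released under the Apache 2.0 license, and can be used from Python, Java, and other languages through wrappers.

    Use Tesseract 5.x for best results. Versions 4 and later include an LSTM-based recognition engine that significantly improves accuracy over the legacy engine in 3.x, and 5.x is the current release series. Check your version with tesseract --version.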
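The version check can also be scripted. The sketch below is one possible approach, assuming the tesseract executable is on PATH. The helper names parse_major_version and installed_major_version are illustrative and are not part of pytesseract.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_major_version(version_output: str) -> Optional[int]:
    """Return the major version found in `tesseract --version` output, or None."""
    # The banner starts with something like "tesseract 5.3.0";
    # some builds print "tesseract v5.0.0" instead.
    match = re.search(r"tesseract\s+v?(\d+)", version_output)
    return int(match.group(1)) if match else None


def installed_major_version() -> Optional[int]:
    """Run `tesseract --version` if the executable can be found on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None  # Tesseract is not installed or not on PATH.
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Check both streams, since the banner is not always written to stdout.
    return parse_major_version(result.stdout + result.stderr)


major = installed_major_version()
if major is None:
    print("Tesseract was not found on PATH.")
elif major < 5:
    print(f"Tesseract {major}.x is older than the 5.x series recommended here.")
else:
    print(f"Tesseract {major}.x found.")
```

If pytesseract is already installed, pytesseract.get_tesseract_version() returns the same information without a manual subprocess call.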

    Pros and cons

    Pros

    • Free and open source
    • 100+ languages supported
    • Handles various fonts and text styles
    • Active community, regular updates

    Cons

    • Setup can be tricky on some systems
    • Accuracy drops with poor image quality or complex layouts
    • No built-in preprocessing — you handle that separately
    • Training required for non-standard fonts

    Prerequisites

    You need:

    1. Python 3.x
    2. Tesseract OCR
    3. pytesseract
    4. Pillow (Python Imaging Library)

    pytesseract wraps the Tesseract OCR engine and provides a Python interface for text recognition. It also works as a standalone script for direct Tesseract interaction.

    Installing Tesseract OCR

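    Install Tesseract for your operating system:

    • Windows: download the installer from GitHub
    • macOS: brew install tesseract
    • Debian/Ubuntu: sudo apt install tesseract-ocr

    For other operating systems, see the installation guide.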

    Setting up your Python OCR environment

    1. Create a new Python file in your favorite editor and name it ocr.py.
    2. Download the sample image used in this tutorial and save it in the same directory as the Python file.
    3. Install the required Python libraries using pip:
    Terminal window
    pip install pytesseract pillow

    Verify the installation:

    Terminal window
    tesseract --version

    If you encounter import issues, see troubleshooting pytesseract imports.

    Python Tesseract tutorial: Extract text from images

    Import the libraries and load your image:

    import pytesseract
    from PIL import Image
    image_path = "path/to/your/image.jpg"
    image = Image.open(image_path)

    Extracting text from the image

    To extract text from the image, use the image_to_string() function from the pytesseract library:

    extracted_text = pytesseract.image_to_string(image)
    print(extracted_text)

    The image_to_string() function takes an image as an input and returns the recognized text as a string.

    Run the Python script to see the extracted text from the sample image:

    Terminal window
    python3 ocr.py

    The image below shows the output.

    [Image: terminal showing the output]

    Saving extracted text to a file

    If you want to save the extracted text to a file, use Python’s built-in file I/O functions:

    with open("output.txt", "w", encoding="utf-8") as output_file:
        output_file.write(extracted_text)
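When several images are processed, it can help to derive the output filename from the image name. The sketch below is one way to do that with pathlib. The helpers text_path_for and save_text are hypothetical names, and the sample string stands in for real OCR output.

```python
import tempfile
from pathlib import Path


def text_path_for(image_path: str) -> Path:
    """Return the image path with its extension replaced by .txt."""
    return Path(image_path).with_suffix(".txt")


def save_text(text: str, image_path: str) -> Path:
    """Write OCR output next to the source image as UTF-8 and return the new path."""
    target = text_path_for(image_path)
    target.write_text(text, encoding="utf-8")
    return target


# Demonstration in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("Sample OCR output: café", f"{tmp}/scan.png")
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")

print(saved_name)
print(round_trip)
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the recognized text contains accented or non-Latin characters.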

    Advanced Python OCR techniques

    pytesseract supports several configuration options for the OCR engine.

    Configuring the OCR engine

    Pass a string of space-separated Tesseract command-line options to image_to_string() through the config parameter. This example sets English as the language and treats the image as a single uniform block of text:

    config = '--psm 6 -l eng'
    text = pytesseract.image_to_string(image, config=config)
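Config strings are easy to mistype once several options are combined. The following sketch assembles one from named parts. build_config is a hypothetical helper, not a pytesseract function, and it only covers the --psm and -c options used in this tutorial.

```python
from typing import Mapping, Optional


def build_config(psm: Optional[int] = None,
                 variables: Optional[Mapping[str, str]] = None) -> str:
    """Assemble a Tesseract config string from optional parts."""
    parts = []
    if psm is not None:
        parts.append(f"--psm {psm}")
    for name, value in (variables or {}).items():
        # Each -c entry sets one Tesseract config variable.
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


config = build_config(psm=6, variables={"preserve_interword_spaces": "1"})
print(config)  # --psm 6 -c preserve_interword_spaces=1
```

The resulting string can be passed as config=config. The language can stay out of the string, because image_to_string() also accepts a separate lang argument, for example lang="eng".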

    Page segmentation modes (PSM) reference

    The --psm option controls how Tesseract analyzes page layout. Choose the mode that matches your document structure:

    PSM | Mode                                   | Best for
    ----|----------------------------------------|----------------------------------------
    0   | Orientation and script detection only  | Detecting page rotation
    1   | Automatic with OSD                     | General documents with mixed content
    3   | Fully automatic (default)              | Standard documents
    4   | Single column of variable sizes        | Articles, single-column pages
    6   | Single uniform block of text           | Paragraphs, text blocks
    7   | Single text line                       | One-line captions, headers
    8   | Single word                            | Individual words, labels
    9   | Single word in a circle                | Circular text like stamps
    10  | Single character                       | Individual digits or letters
    11  | Sparse text                            | Text scattered across image
    12  | Sparse text with OSD                   | Scattered text with rotation
    13  | Raw line                               | Treat as single line, no preprocessing

    For non-standard installation paths, set the Tesseract executable location:

    pytesseract.pytesseract.tesseract_cmd = '/path/to/tesseract'
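If the install location differs between machines, the path can be looked up instead of hardcoded. This is a minimal sketch: find_tesseract is a hypothetical helper, and the candidate list contains common default locations that may not match your system.

```python
import os
import shutil
from typing import Iterable, Optional

# Common default locations; adjust or extend for your system.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
    "/usr/bin/tesseract",
)


def find_tesseract(candidates: Iterable[str] = DEFAULT_CANDIDATES,
                   search_path: bool = True) -> Optional[str]:
    """Return the first Tesseract executable found, or None."""
    if search_path:
        found = shutil.which("tesseract")
        if found:
            return found
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


tesseract_path = find_tesseract()
print(tesseract_path or "Tesseract executable not found")
# If a path was found, it could be assigned like this:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

Checking PATH first means the explicit assignment is only needed on machines where Tesseract is installed somewhere unusual.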

    Handling multiple languages

    Tesseract supports 100+ languages. Use a plus sign to combine languages:

    config = '-l eng+fra'
    text = pytesseract.image_to_string(image, config=config)
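A language only works if its trained data file is installed. The sketch below builds the plus-separated language string and reports codes that are missing. Both helper names are illustrative, and the installed list is hardcoded sample data.

```python
from typing import Iterable, List


def language_arg(codes: Iterable[str]) -> str:
    """Join language codes with '+' as Tesseract expects, e.g. 'eng+fra'."""
    return "+".join(codes)


def missing_languages(requested: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return requested codes that are not in the installed list, in order."""
    available = set(installed)
    return [code for code in requested if code not in available]


# Illustrative values; in practice the installed list could come from
# pytesseract.get_languages(), which asks the Tesseract binary.
installed = ["eng", "osd"]
requested = ["eng", "fra"]

lang = language_arg(requested)
missing = missing_languages(requested, installed)
print(lang)     # eng+fra
print(missing)  # ['fra']
```

Checking for missing languages up front gives a clearer message than the error Tesseract raises when a language file cannot be loaded.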

    Improving OCR accuracy with image preprocessing

    Preprocessing images before OCR improves recognition accuracy.

    Converting images to grayscale

    Converting to grayscale removes color information, which simplifies later steps such as thresholding:

    from PIL import Image, ImageOps
    # Open an image.
    image = Image.open("path_to_your_image.jpg")
    # Convert image to grayscale.
    gray_image = ImageOps.grayscale(image)
    # Save or display the grayscale image.
    gray_image.show()
    gray_image.save("path_to_save_grayscale_image.jpg")
    [Figure: original image of a blue lizard with vibrant colors, next to its grayscale version]

    Resizing the image for better accuracy

    Resizing to a larger size makes text easier to recognize:

    # Resize the image.
    scale_factor = 2
    resized_image = gray_image.resize(
        (gray_image.width * scale_factor, gray_image.height * scale_factor),
        resample=Image.LANCZOS,
    )

    This resizes the image by a factor of 2 using Lanczos resampling for high-quality results.

    Applying thresholding

    Thresholding creates a binary image with clear separation between text and the background. Pillow does not ship an adaptive threshold function, so this example applies a fixed global threshold with point(). The cutoff of 128 is a starting value to tune for your images:

    from PIL import Image, ImageOps
    # Load the image.
    image = Image.open('image.png')
    # Convert the image to grayscale.
    gray_image = ImageOps.grayscale(image)
    # Resize the image to enhance details.
    scale_factor = 2
    resized_image = gray_image.resize(
        (gray_image.width * scale_factor, gray_image.height * scale_factor),
        resample=Image.LANCZOS,
    )
    # Apply a global threshold: pixels brighter than 128 become white, the rest black.
    thresholded_image = resized_image.point(lambda p: 255 if p > 128 else 0)
    # Save or display the processed image.
    thresholded_image.show()  # This will display the image.
    # thresholded_image.save('path_to_save_image')  # This will save the image.
    [Figure: black-and-white text with standard contrast, next to the same text with enhanced contrast after thresholding]
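Adaptive thresholding compares each pixel with the mean of its neighborhood instead of one global cutoff, which helps when lighting is uneven across the page. The following is a minimal pure-Python sketch of a local-mean threshold, written to illustrate the idea rather than to be fast. The function name adaptive_threshold and the radius and offset defaults are assumptions for illustration.

```python
from typing import List


def adaptive_threshold(pixels: List[int], width: int, height: int,
                       radius: int = 1, offset: int = 10) -> List[int]:
    """Binarize a flat, row-major list of grayscale values (0-255).

    A pixel becomes black (0) when it is darker than the mean of its
    (2 * radius + 1)-sized square neighborhood minus `offset`, else white (255).
    """
    # Summed-area table with a zero border, so any window sum is four lookups.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += pixels[y * width + x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    result = []
    for y in range(height):
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        for x in range(width):
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            result.append(0 if pixels[y * width + x] < mean - offset else 255)
    return result


# A 5x5 light background with one dark pixel in the center.
width, height = 5, 5
pixels = [200] * (width * height)
pixels[2 * width + 2] = 50
binary = adaptive_threshold(pixels, width, height)
for y in range(height):
    print(binary[y * width:(y + 1) * width])
```

For a Pillow image in mode "L", list(gray_image.tobytes()) yields the flat pixel list, and Image.frombytes("L", gray_image.size, bytes(binary)) turns the result back into an image. Pure Python is slow on large scans, so a library routine such as OpenCV's cv2.adaptiveThreshold is a common alternative in practice.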

    Pass the preprocessed image to the OCR engine:

    # Extract text from the preprocessed image.
    improved_text = pytesseract.image_to_string(thresholded_image)
    print(improved_text)

    Complete OCR script

    Here’s the complete preprocessing and OCR example:

    from PIL import Image, ImageOps
    import pytesseract
    # Define the path to your image.
    image_path = 'image.png'
    # Open the image.
    image = Image.open(image_path)
    # Convert image to grayscale.
    gray_image = ImageOps.grayscale(image)
    # Resize the image to enhance details.
    scale_factor = 2
    resized_image = gray_image.resize(
        (gray_image.width * scale_factor, gray_image.height * scale_factor),
        resample=Image.LANCZOS,
    )
    # Apply a global threshold (cutoff of 128; tune for your images).
    thresholded_image = resized_image.point(lambda p: 255 if p > 128 else 0)
    # Extract text from the preprocessed image.
    improved_text = pytesseract.image_to_string(thresholded_image)
    # Print the extracted text.
    print(improved_text)
    # Optional: Save the preprocessed image for review.
    thresholded_image.save('preprocessed_image.jpg')

    Recognizing digits only

    To extract only digits, use --psm 6 and filter with regular expressions:

    import re
    import pytesseract
    from PIL import Image
    image_path = "image.png"
    image = Image.open(image_path)
    config = '--psm 6'
    text = pytesseract.image_to_string(image, config=config)
    # Collect every run of consecutive digits in the OCR output.
    digits = re.findall(r'\d+', text)
    print(digits)

    The re.findall() method extracts all digit sequences from the OCR output.
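One detail worth knowing is how the pattern treats decimals. The short example below runs two patterns on a made-up string that stands in for OCR output. The sample text is illustrative, not a real recognition result.

```python
import re

# Illustrative OCR output; real text would come from image_to_string().
sample_text = "Invoice 1234, total 56.78"

# \d+ matches runs of digits, so a decimal is split at the dot.
integers = re.findall(r"\d+", sample_text)
print(integers)  # ['1234', '56', '78']

# Allowing an optional fractional part keeps decimals together.
numbers = re.findall(r"\d+(?:\.\d+)?", sample_text)
print(numbers)  # ['1234', '56.78']
```

Which pattern fits depends on whether the document contains amounts and measurements or only whole numbers such as IDs.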

    Character restrictions

    Restrict OCR to specific characters using tessedit_char_whitelist:

    # Only recognize uppercase letters and numbers.
    config = '--psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    text = pytesseract.image_to_string(image, config=config)

    To preserve spaces between words when using an allowlist, add preserve_interword_spaces=1:

    config = '--psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

    Use tessedit_char_blacklist to exclude specific characters instead.

    Getting bounding boxes

    Extract character positions with image_to_boxes:

    import pytesseract
    from PIL import Image
    image = Image.open('image.png')
    boxes = pytesseract.image_to_boxes(image)
    h = image.height
    for box in boxes.splitlines():
        b = box.split()
        char, x1, y1, x2, y2 = b[0], int(b[1]), int(b[2]), int(b[3]), int(b[4])
        # Note: y-coordinates are measured from the image bottom; convert to top-origin.
        print(f"Character '{char}' at ({x1}, {h - y2}) to ({x2}, {h - y1})")
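The coordinate flip is easier to reuse when the parsing lives in a function. The sketch below parses the box format (character, left, bottom, right, top, page) into top-origin rectangles. CharBox and parse_boxes are hypothetical names, and the sample string is hand-written rather than real Tesseract output.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int
    right: int
    bottom: int


def parse_boxes(box_text: str, image_height: int) -> List[CharBox]:
    """Parse image_to_boxes() output into boxes with a top-left origin."""
    boxes = []
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # Skip blank or malformed lines.
        char = fields[0]
        left, bottom, right, top = (int(value) for value in fields[1:5])
        # Tesseract measures y from the bottom edge; flip to measure from the top.
        boxes.append(CharBox(char, left, image_height - top, right, image_height - bottom))
    return boxes


sample = "H 10 20 30 60 0\ni 35 20 40 58 0"
parsed = parse_boxes(sample, image_height=100)
for box in parsed:
    print(box)
```

Top-origin boxes can be passed directly to drawing or cropping code that uses the usual image convention of y increasing downward.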

    For word-level bounding boxes with confidence scores, use image_to_data (requires pip install pandas):

    import pytesseract
    from PIL import Image
    image = Image.open('image.png')
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DATAFRAME)
    # Non-word rows have conf == -1; conf > 60 keeps only high-confidence words.
    words = data[(data['conf'] > 60) & (data['text'].astype(str).str.strip() != '')]
    for _, row in words.iterrows():
        print(f"'{row['text']}' (conf: {row['conf']}) at ({row['left']}, {row['top']})")
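The same filtering works without pandas when image_to_data is called with output_type=pytesseract.Output.DICT, which returns a dict of column lists. The sketch below filters such a dict. confident_words is a hypothetical helper and the sample dict is hand-written test data, not real OCR output.

```python
from typing import Dict, List, Tuple


def confident_words(data: Dict[str, list], min_conf: float = 60.0) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs whose confidence exceeds min_conf."""
    results = []
    for text, conf in zip(data["text"], data["conf"]):
        score = float(conf)  # conf may arrive as int, float, or str.
        if score > min_conf and str(text).strip():
            results.append((str(text), score))
    return results


# Hand-written sample in the shape of the dict output (only two columns shown).
sample = {
    "text": ["", "Invoice", "t0ta1", " "],
    "conf": ["-1", 96.5, 41, 95],
}
kept = confident_words(sample)
print(kept)  # [('Invoice', 96.5)]
```

Converting conf with float() keeps the helper tolerant of the different value types that can appear in that column across pytesseract and Tesseract versions.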

    Orientation and script detection

    Detect the page rotation and script type with image_to_osd:

    import pytesseract
    from PIL import Image
    image = Image.open('rotated_image.png')
    osd = pytesseract.image_to_osd(image)
    print(osd)
    # Output includes:
    # - Page orientation (0, 90, 180, 270 degrees)
    # - Script type (Latin, Cyrillic, Arabic, etc.)
    # - Confidence scores

    This helps preprocess images that need rotation correction before OCR.
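To act on the result, the OSD text has to be parsed. The sketch below turns the "Key: value" lines into a dict. parse_osd is a hypothetical helper, and the sample text follows the shape of image_to_osd output with made-up values.

```python
from typing import Dict


def parse_osd(osd_text: str) -> Dict[str, str]:
    """Split 'Key: value' lines from image_to_osd() into a dictionary."""
    info = {}
    for line in osd_text.splitlines():
        key, separator, value = line.partition(":")
        if separator:
            info[key.strip()] = value.strip()
    return info


# Made-up values in the shape of the OSD report.
sample = """Page number: 0
Orientation in degrees: 180
Rotate: 180
Orientation confidence: 12.34
Script: Latin
Script confidence: 3.33"""

osd_info = parse_osd(sample)
rotation = int(osd_info["Rotate"])
print(rotation, osd_info["Script"])  # 180 Latin
```

The Rotate value is the correction Tesseract suggests, so a script can skip rotation when it is 0 and only re-run OCR on pages that need it.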

    Training Tesseract with custom data

    Training Tesseract improves accuracy for specific fonts, languages, or layouts not well-represented in the default model. The neural network engine learns from structured training data.

    You need a dataset of images with corresponding text files containing the expected output. Tesseract provides tesstrain and text2image tools for generating and labeling training data.

    Training is time-intensive but worthwhile for specialized applications with unique fonts, symbols, or languages.

    Best practices

    1. Preprocess images — Grayscale, resize, threshold. Clean images produce better results.
    2. Set the right PSM — Page segmentation mode (--psm) affects how Tesseract interprets layout. Try different values for your document type.
    3. Specify the language — Use -l eng for English, and -l eng+fra for multiple languages.
    4. Use tessdata_fast for production — The tessdata_fast models are smaller and faster than default models, with minimal accuracy loss. Download from the tessdata_fast repository.
    5. Filter by confidence — Use image_to_data and filter results by confidence score (greater than 60 percent) to reduce errors.
    6. Train for custom fonts — Non-standard fonts need custom training data.
    7. Test on representative samples — Accuracy varies by document type. Test before deploying.

    Troubleshooting pytesseract imports

    If pytesseract fails to import, the issue is usually installation, environment configuration, or system paths.

    Common causes of pytesseract import errors

    1. Incorrect installation

      • Ensure pytesseract is installed in the correct Python environment.
      • Verify installation by running:
      Terminal window
      pip show pytesseract

      If it’s not installed, install it using:

      Terminal window
      pip install pytesseract
    2. Multiple Python versions

      If you have multiple versions of Python installed, ensure pytesseract is installed in the environment corresponding to the Python version you’re using.

      • Check your Python version with:
      Terminal window
      python3 --version
      • Use the correct pip version:
      Terminal window
      python3 -m pip install pytesseract
    3. Environment issues

      • If you’re using virtual environments, activate the correct environment before installing or running your script.
      • Activate the environment (macOS/Linux):
      Terminal window
      source your_env_name/bin/activate

      Install pytesseract within the activated environment.

    4. System path issues

    • Ensure the Python and pip paths are correctly set in your system environment variables.
    • Check your current Python path:
    Terminal window
    which python3
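Most of the checks above can be gathered in one short script, run with the same interpreter that fails to import pytesseract. This is a sketch using only the standard library, and environment_report is a hypothetical helper name.

```python
import importlib.util
import shutil
import sys


def environment_report(modules=("pytesseract", "PIL")) -> dict:
    """Collect the facts that usually explain an import failure."""
    return {
        # Which interpreter is actually running this script.
        "python": sys.executable,
        "version": sys.version.split()[0],
        # True when the module can be found by this interpreter.
        "modules": {name: importlib.util.find_spec(name) is not None for name in modules},
        # Path to the Tesseract binary, or None if it is not on PATH.
        "tesseract": shutil.which("tesseract"),
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If the reported interpreter path differs from the one pip installed into, installing with that interpreter's own pip (python3 -m pip install pytesseract) usually resolves the mismatch.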

    Additional tips

    • Reinstall pytesseract — If problems persist, try uninstalling and reinstalling pytesseract:
    Terminal window
    pip uninstall pytesseract
    pip install pytesseract
    • Check the Tesseract installation — Verify with:
    Terminal window
    tesseract --version
    • Upgrade pip — Upgrading pip can resolve issues:
    Terminal window
    python3 -m pip install --upgrade pip
    • Install packages on managed environments — For externally managed environments (like macOS with Homebrew):

      • Use a virtual environment:
      Terminal window
      python3 -m venv myenv
      source myenv/bin/activate
      pip install pytesseract
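      • Use pipx (this installs the pytesseract command-line script in an isolated environment; it does not make the library importable from your own scripts):
      Terminal window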
      brew install pipx
      pipx install pytesseract
      • Override the restriction (not recommended):
      Terminal window
      python3 -m pip install pytesseract --break-system-packages

    Check PEP 668 for details.

    Limitations of Tesseract

    • Accuracy varies with image quality, language, and document complexity. Output may contain errors or miss text.
    • Non-standard fonts and handwriting require custom training data.
    • Complex layouts, graphics, and tables reduce accuracy.
    • Not all languages and scripts are supported.
    • No built-in preprocessing. You must handle resizing, skew correction, and noise removal separately.

    Comparing pytesseract and Nutrient OCR API

    Feature          | pytesseract (Tesseract)            | Nutrient OCR API
    -----------------|------------------------------------|-----------------------------------
    Output format    | Plain text string                  | Searchable PDF with text layer
    Installation     | Local engine + Python wrapper      | No installation (cloud API)
    Preprocessing    | Manual (grayscale, threshold)      | Automatic
    Languages        | 100+ (install language packs)      | 20 languages built in
    Batch processing | Write your own code                | Single API call
    PDF support      | Requires pdf2image conversion      | Native PDF input/output
    Cost             | Free (open source)                 | 200 free credits/month, then paid
    Best for         | Local text extraction, prototypes  | Production searchable PDFs

    Nutrient API for OCR

    Tesseract extracts text. Nutrient’s OCR API creates searchable PDFs — the text layer is embedded in the PDF so users can search, select, and copy text.

    When to use Nutrient instead of Tesseract:

    • You need searchable PDFs, not just raw text
    • You’re processing batches of scanned documents
    • You want to merge multiple scanned pages into one PDF
    • You need 20 languages without installing language packs
    • You want consistent results without preprocessing each image

    The API is SOC 2-audited, stores no document data, and offers 200 free credits/month to start.

    Requirements

    You need a Nutrient API key and the requests library. Install requests:

    Terminal window
    python3 -m pip install requests

    Using the OCR API

    1. Import required modules

    import requests
    import json

    2. Define the OCR instructions

    data = {
        'instructions': json.dumps({
            'parts': [
                {
                    'file': 'scanned'
                }
            ],
            'actions': [
                {
                    'type': 'ocr',
                    'language': 'english'
                }
            ]
        })
    }
    • "file": "scanned" references the uploaded file
    • "type": "ocr" applies OCR
    • "language": "english" sets the OCR language

    3. Send the OCR request to the Nutrient API

    Make a POST request to the https://api.nutrient.io/build endpoint:

    response = requests.request(
        'POST',
        'https://api.nutrient.io/build',
        headers={
            'Authorization': 'Bearer your_api_key_here'
        },
        files={
            'scanned': open('image.png', 'rb')
        },
        data={
            'instructions': json.dumps({
                'parts': [
                    {
                        'file': 'scanned'
                    }
                ],
                'actions': [
                    {
                        'type': 'ocr',
                        'language': 'english'
                    }
                ]
            })
        },
        stream=True
    )

    Replace 'your_api_key_here' with your actual API key. The request sends the file, includes OCR instructions, and streams the response for efficient handling of large files.

    You can use the sample document here to test the OCR API.

    4. Save the OCR result to a file

    Write the result to disk if successful:

    if response.ok:
        # Stream the PDF to disk in chunks.
        with open('result.pdf', 'wb') as fd:
            for chunk in response.iter_content(chunk_size=8096):
                fd.write(chunk)
    else:
        # Show the API's error message and stop.
        print(response.text)
        exit()

    This streams the OCR output into result.pdf and prints error messages if the request fails.

    Advanced OCR with Python: Merge multiple scanned pages into a searchable PDF using Nutrient API

    The Nutrient OCR API merges batches of scanned pages into a single searchable PDF in one API call, unlike pytesseract, which requires processing each page and merging afterward.

    Example: Merge four scanned images with OCR enabled

    import requests
    import json
    response = requests.request(
        'POST',
        'https://api.nutrient.io/build',
        headers={
            'Authorization': 'Bearer your_api_key_here'
        },
        files={
            'page1': open('page1.jpg', 'rb'),
            'page2': open('page2.jpg', 'rb'),
            'page3': open('page3.jpg', 'rb'),
            'page4': open('page4.jpg', 'rb')
        },
        data={
            'instructions': json.dumps({
                'parts': [
                    {'file': 'page1'},
                    {'file': 'page2'},
                    {'file': 'page3'},
                    {'file': 'page4'}
                ],
                'actions': [
                    {
                        'type': 'ocr',
                        'language': 'english'
                    }
                ]
            })
        },
        stream=True
    )
    if response.ok:
        # Stream the merged PDF to disk in chunks.
        with open('merged_scanned.pdf', 'wb') as fd:
            for chunk in response.iter_content(chunk_size=8096):
                fd.write(chunk)
    else:
        print(response.text)
        exit()
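The four-page example generalizes to any number of pages by generating the field names and parts from a list of paths. The sketch below is one way to structure that, reusing the endpoint and instruction format shown above. The helper names are hypothetical, and requests is imported inside the upload function so the two pure helpers run without it.

```python
import json
from contextlib import ExitStack
from typing import Dict, List


def field_names(paths: List[str]) -> Dict[str, str]:
    """Map form-field names (page1, page2, ...) to file paths, in order."""
    return {f"page{index}": path for index, path in enumerate(paths, start=1)}


def build_instructions(fields: Dict[str, str], language: str = "english") -> str:
    """Build the JSON instructions: one part per uploaded file, then OCR."""
    return json.dumps({
        "parts": [{"file": name} for name in fields],
        "actions": [{"type": "ocr", "language": language}],
    })


def merge_scanned_pages(paths: List[str], api_key: str, output_path: str) -> None:
    """Upload all pages in one request and stream the merged PDF to disk."""
    import requests  # Imported here so the helpers above work without it.

    fields = field_names(paths)
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits.
        files = {name: stack.enter_context(open(path, "rb"))
                 for name, path in fields.items()}
        response = requests.post(
            "https://api.nutrient.io/build",
            headers={"Authorization": f"Bearer {api_key}"},
            files=files,
            data={"instructions": build_instructions(fields)},
            stream=True,
        )
        response.raise_for_status()  # Raise on HTTP errors instead of writing a bad file.
        with open(output_path, "wb") as fd:
            for chunk in response.iter_content(chunk_size=8096):
                fd.write(chunk)


fields = field_names(["page1.jpg", "page2.jpg"])
instructions = build_instructions(fields)
print(instructions)
```

A call such as merge_scanned_pages(["page1.jpg", "page2.jpg"], "your_api_key_here", "merged_scanned.pdf") would then perform the upload. Opening the files through ExitStack also closes them afterward, which the inline open() calls do not.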

    When to use this instead of pytesseract

    pytesseract: Extract text from individual images, handle preprocessing yourself, write code to merge results.

    Nutrient API: Upload images, get a searchable PDF back. One API call handles OCR, merging, and PDF creation.

    Conclusion

    Tesseract with pytesseract handles basic text extraction. Preprocess images (grayscale, resize, threshold) for better accuracy. For searchable PDFs or batch processing, use the Nutrient OCR API.

    FAQ

    What is Tesseract OCR?

    Tesseract OCR is an open source engine for recognizing text from images and scanned documents. Originally developed at Hewlett-Packard and later sponsored by Google, it supports more than 100 languages and various text styles.

    How do I install Tesseract OCR in Python?

    To install Tesseract OCR, download the installer from GitHub for Windows, use brew install tesseract on macOS, or run sudo apt install tesseract-ocr on Debian/Ubuntu.

    How do I install pytesseract?

    Install pytesseract using pip: pip install pytesseract. You also need Tesseract OCR installed on your system. On Windows, download from GitHub. On macOS, use brew install tesseract. On Linux, use sudo apt install tesseract-ocr.

    What is the difference between Tesseract and pytesseract?

    Tesseract is the OCR engine (written in C++) that performs text recognition. pytesseract is a Python wrapper library that provides a simple interface to use Tesseract from Python code. You need both: Tesseract for the OCR functionality and pytesseract for the Python API.

    Why is pytesseract not recognizing text?

    Common causes include poor image quality, incorrect PSM mode, or missing preprocessing. Try converting to grayscale, increasing image resolution, applying thresholding, and using the correct --psm value for your document type. Also verify Tesseract is properly installed with tesseract --version.

    How do I OCR a PDF with pytesseract?

    pytesseract doesn’t directly support PDFs. First convert PDF pages to images using pdf2image library: from pdf2image import convert_from_path; images = convert_from_path('file.pdf'). Then run pytesseract.image_to_string() on each image. For native PDF OCR, use Nutrient’s OCR API instead.

    How can I improve OCR accuracy?

    You can improve OCR accuracy by converting an image to grayscale, resizing it to make the text larger, and applying thresholding to separate text from the background.

    Can Tesseract OCR handle multiple languages?

    Yes. Tesseract supports multiple languages. Use a plus sign (+) in the configuration string, like -l eng+fra for English and French.

    What are the limitations of Tesseract OCR?

    Tesseract’s limitations include varying accuracy based on image quality, difficulty with non-standard fonts, and limited support for complex layouts and languages. It also lacks built-in image preprocessing.

    How does Nutrient’s OCR API work?

    Upload scanned images or PDFs, and get searchable PDFs back. The API supports 20 languages, preserves layout, and handles multipage documents. It’s SOC 2-audited with 200 free credits/month.

    How do I use Nutrient’s OCR API?

    Install the requests library and send a POST request to https://api.nutrient.io/build with your API key and document. The response is the searchable PDF.

    Can I merge multiple scanned pages into one searchable PDF using Nutrient?

    Yes. You can merge multiple images into a single searchable PDF. Adjust the file handling and instructions in your API request to include all pages.

    What is pytesseract in Python?

    pytesseract is a Python wrapper for the open source Tesseract OCR engine. It enables developers to extract text from images using a simple Python API.

    How do I use pytesseract to extract text from an image?

    Install the pytesseract and Pillow libraries, open the image using PIL.Image.open(), and pass it to pytesseract.image_to_string() to extract text.

    How do I fix 'TesseractNotFoundError' in Python?

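    This error means Python can’t find the Tesseract executable. Either add Tesseract to your system PATH, or set the path explicitly in Python: pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' (Windows) or pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract' (macOS/Linux). The install location varies, so run which tesseract (or where tesseract on Windows) to confirm the actual path. Homebrew on Apple Silicon uses /opt/homebrew/bin/tesseract, and Debian/Ubuntu packages use /usr/bin/tesseract.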

    How do I extract only numbers with pytesseract?

    Use the --psm 6 mode with a digits-only allowlist: config = '--psm 6 -c tessedit_char_whitelist=0123456789' and pass it to pytesseract.image_to_string(image, config=config). Alternatively, extract all text and filter with regex: re.findall(r'\d+', text).

    Is pytesseract free to use?

    Yes. Both Tesseract OCR and pytesseract are free and open source under the Apache 2.0 license. You can use them in commercial projects without licensing fees. However, accuracy and preprocessing are your responsibility.

    Hulya Masharipov

    Technical Writer

    Hulya is a frontend web developer and technical writer who enjoys creating responsive, scalable, and maintainable web experiences. She’s passionate about open source, web accessibility, cybersecurity privacy, and blockchain.
