---
title: "Applying OCR to a PDF document | Nutrient Python SDK"
canonical_url: "https://www.nutrient.io/guides/python/extraction/apply-ocr-to-pdf/"
md_url: "https://www.nutrient.io/guides/python/extraction/apply-ocr-to-pdf.md"
last_updated: "2026-05-30T02:20:01.349Z"
description: "How to run OCR on a PDF document using Nutrient Python SDK."
---

# Applying OCR to a PDF document

Scanned documents arrive in many formats: image-based PDFs, multi-page TIFFs, single PNG or JPG scans, or even faxed exports. In every case the pages are pictures of text rather than selectable, searchable content. Legal firms receiving historical case files, healthcare teams handling medical records, and finance departments archiving older statements all share this problem: the information is visible but not searchable.

Applying OCR adds an invisible text layer positioned over the page image. The visual appearance doesn't change, but the text becomes selectable, searchable, and readable by assistive technologies.

This sample shows how to run OCR over every page of a document using Nutrient Python SDK and produce a searchable PDF as output. The input can be any document format the SDK supports. If the input isn't already a PDF, the SDK converts it to PDF automatically when you create the editor.

[Download sample](https://www.nutrient.io/downloads/samples/python/apply-ocr-to-pdf.zip)

## How Nutrient helps

Nutrient Python SDK handles the full OCR pipeline behind a single method call. The SDK takes care of:

- Implicitly converting non-PDF inputs (images, multi-page TIFFs, Office documents) to PDF when the editor is created

- Rendering each PDF page to a bitmap at the resolution OCR needs

- Running text recognition with the configured languages

- Preserving reading order and text block orientation returned by the recognizer

- Placing an invisible, correctly positioned text layer over the original page content

You control the outcome through document settings, such as which OCR languages to use.

## Preparing the project

Import the classes used in the sample:

```python
from nutrient_sdk import Document
from nutrient_sdk import PdfEditor
from nutrient_sdk import NutrientException
```

## Running OCR on the whole document

The `main()` function opens the source document inside a [context manager](https://docs.python.org/3/reference/datamodel.html#context-managers), configures the OCR language, and calls `make_searchable()` on the editor. The sample passes an image-based PDF as input, but the same code handles raw images, multi-page TIFFs, or any other supported document format. The context manager closes the document automatically when the block ends, even if an error is raised:

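```python
def main():
    try:
        # The context manager closes the document when the block ends.
        with Document.open("input_image_based.pdf") as document:
            # Tell the recognizer which language models to load.
            document.settings.ocr_settings.default_languages = "eng"

            # Attach an editor (non-PDF inputs are converted to PDF here),
            # then run OCR over every page.
            editor = PdfEditor.edit(document)
            editor.make_searchable()
```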

Assigning `document.settings.ocr_settings.default_languages = "eng"` tells the recognizer which language models to load. Combine languages with `+` (for example `"eng+deu"`) when you know the document contains more than one language. Matching the language setting to the document content improves accuracy on ambiguous characters.
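When the language list comes from configuration or user input, it can help to build the `+`-separated string in one place. The helper below is a minimal sketch: `join_ocr_languages` is a hypothetical name, not part of the SDK, and it only formats the string without checking that a language pack is actually available.

```python
def join_ocr_languages(codes):
    """Combine OCR language codes into the '+'-separated form, e.g. 'eng+deu'."""
    cleaned = []
    for code in codes:
        code = code.strip()
        # Skip blank entries and duplicates while keeping the original order.
        if code and code not in cleaned:
            cleaned.append(code)
    if not cleaned:
        raise ValueError("at least one OCR language code is required")
    return "+".join(cleaned)
```

The result can then be assigned directly, for example `document.settings.ocr_settings.default_languages = join_ocr_languages(["eng", "deu"])`.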

`PdfEditor.edit(document)` attaches an editor to the open document. If the document isn't already a PDF, the SDK converts it to PDF at this step so the rest of the pipeline works on a uniform page representation. Calling `editor.make_searchable()` loops through every page, runs OCR, and writes an invisible text layer on top of the existing page content. Any hidden text already present on a page is removed before the new layer is drawn, so re-running OCR doesn't duplicate content.

## Saving the result

Save the modified document to a new file and close the editor. The surrounding `try`/`except` block catches `NutrientException`, which surfaces any licensing, language-pack, or I/O issue that the SDK reports:

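```python
            # Write the searchable PDF to a new file, then release the editor.
            editor.save_as("output.pdf")
            editor.close()
    except NutrientException as e:
        # Surfaces licensing, language-pack, or I/O problems reported by the SDK.
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
```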
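In the sample, `editor.close()` runs only when OCR and saving both succeed. The variant below is a sketch of one way to close the editor on failure as well, using only the calls shown in this guide. It assumes that `close()` is safe to call after a failed operation, which this guide doesn't confirm. The function name `make_searchable_pdf` and its parameters are illustrative, and the SDK is imported inside the function only so that the definition loads without the package installed.

```python
def make_searchable_pdf(input_path, output_path, languages="eng"):
    """Run OCR over every page of input_path and save a searchable PDF."""
    # Lazy import: the SDK is only needed when the function is called.
    from nutrient_sdk import Document, PdfEditor

    with Document.open(input_path) as document:
        document.settings.ocr_settings.default_languages = languages
        editor = PdfEditor.edit(document)
        try:
            editor.make_searchable()
            editor.save_as(output_path)
        finally:
            # Assumption: close() may be called even if OCR or saving failed.
            editor.close()
```

Any `NutrientException` still propagates to the caller, so the `try`/`except` from the sample can wrap a call such as `make_searchable_pdf("input_image_based.pdf", "output.pdf")`.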

## Conclusion

The workflow for applying OCR to a whole document is:

1. Open the source document.

2. Configure OCR languages on the document settings.

3. Create a `PdfEditor` for the document.

4. Call `make_searchable()` to apply OCR to every page.

5. Save the result and close the editor.

The output is a standard PDF with an invisible text layer, so existing PDF viewers, search tools, and accessibility software can read it without any extra configuration.

---

## Related pages

- [Generating image descriptions using local AI](/guides/python/extraction/describe-image-with-local-ai.md)
- [Generating image descriptions using Claude](/guides/python/extraction/describe-image-with-claude.md)
- [Extracting data from images using ICR](/guides/python/extraction/extract-data-from-image-icr.md)
- [Applying OCR to a PDF page](/guides/python/extraction/apply-ocr-to-pdf-page.md)
- [Extracting text from multilingual images](/guides/python/extraction/read-text-from-image-multi-language.md)
- [Nutrient Python SDK extraction guides](/guides/python/extraction.md)
- [Extracting structured JSON data from PDF documents](/guides/python/extraction/json-data-extraction.md)
- [Extracting data from images using vision language models](/guides/python/extraction/extract-data-from-image-vlm.md)
- [Extracting text from images](/guides/python/extraction/read-text-from-image.md)
- [Extracting data from images using OCR](/guides/python/extraction/extract-data-from-image-ocr.md)
- [Speeding up first ICR operation by predownloading models](/guides/python/extraction/speed-up-first-icr-by-downloading-requirements.md)
- [Generating image descriptions using OpenAI](/guides/python/extraction/describe-image-with-openai.md)

