How to OCR PDF files on Linux using OCRmyPDF

    Optical character recognition (OCR) is an essential technology for converting scanned documents, images, and PDFs into searchable and editable formats. This post walks you through how to OCR PDF files on Linux using the open source tool OCRmyPDF, which is powered by Tesseract. It also covers an alternative approach using Nutrient Document Engine. Both options can extract text and make PDFs searchable.

    How to OCR a PDF on Linux using an open source library

    This section explains in detail how to OCR a PDF on Linux with an open source library.

    Why not use Tesseract directly?

    The open source library you’ll use is OCRmyPDF, a multi-platform tool for running OCR on PDF files. It’s a wrapper around Tesseract that preprocesses PDF files before running OCR on them. This preprocessing includes deskewing, noise removal, and cleaning up files so the OCR engine can read the text accurately. OCRmyPDF also does some post-processing to keep the output consistent. You can use Tesseract directly, but in doing so, you’ll miss out on these benefits.

    Key features of OCRmyPDF

    • Automatic OCR — Automatically adds OCR text layers to existing PDFs.
    • Text recognition — Utilizes Tesseract for high-quality OCR.
    • Multi-language support — Supports multiple languages, including English, French, German, Spanish, and more.
    • PDF/A conversion — Converts PDFs to the PDF/A format for long-term archiving.
    • Command-line interface — Provides a simple command-line interface for ease of use.

    Installing OCRmyPDF

    Install OCRmyPDF using the following command on Ubuntu- or Debian-based systems:

    sudo apt-get install ocrmypdf

    For Fedora, you can use the following command:

    sudo dnf install ocrmypdf

    The packaged version isn’t always the latest one, so you can also install OCRmyPDF directly with pip:

    pip install --user ocrmypdf

    Keep in mind that pip won’t install OCRmyPDF’s non-Python dependencies, so you need to install them yourself. The requirements are:

    • Python 3.8 or newer
    • Ghostscript 9.50 or newer
    • Tesseract 4.1.1 or newer
    • jbig2enc 0.29 or newer
    • pngquant 2.5 or newer
    • unpaper 6.1
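
    After a pip install, it can help to confirm that the external tools are actually reachable. The following is a minimal sketch rather than anything OCRmyPDF ships: the executable names (gs, tesseract, jbig2, pngquant, unpaper) are assumptions about how these packages are usually installed on Linux, and the check only looks for the programs on PATH without verifying their versions.

```python
import shutil

# Executable names are assumptions about how each dependency is usually
# installed on Linux; adjust them for your distribution.
EXTERNAL_TOOLS = {
    "Ghostscript": "gs",
    "Tesseract": "tesseract",
    "jbig2enc": "jbig2",
    "pngquant": "pngquant",
    "unpaper": "unpaper",
}


def missing_tools(tools=EXTERNAL_TOOLS, which=shutil.which):
    """Return the names of dependencies whose executable is not on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]
```

    Calling missing_tools() with no arguments returns an empty list when every tool was found. The which parameter exists only so the lookup can be replaced in tests.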

    Basic usage

    To use OCRmyPDF, run the following command, replacing input.pdf with the path to the PDF file you want to OCR, and output.pdf with the path where you want to save the OCR’d PDF:

    ocrmypdf input.pdf output.pdf

    This will result in a PDF/A output file with an OCR layer. PDF/A is a subset of the PDF standard that prohibits features that aren’t suitable for long-term archiving. This includes JavaScript in PDFs, font linking, and encryption. You can ask OCRmyPDF to output a standard PDF via this command:

    ocrmypdf --output-type pdf input.pdf output.pdf

    You can even perform OCR only on certain pages:

    ocrmypdf --pages 2,3,13-17 input.pdf output.pdf
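
    To make the page specification concrete, here is a small sketch that expands a value such as 2,3,13-17 into the individual page numbers it selects. The expand_pages helper is hypothetical and only models the comma-and-range syntax shown above; OCRmyPDF’s own parser may accept other forms.

```python
def expand_pages(spec):
    """Expand a page spec such as "2,3,13-17" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "13-17" is inclusive on both ends.
            start, end = (int(n) for n in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

    For the command above, expand_pages("2,3,13-17") yields pages 2, 3, and 13 through 17, which is seven pages in total.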

    OCR in a language other than English

    By default, OCRmyPDF assumes a document is in English. If the document is in another language, OCR quality will be considerably worse, so you need to pass in the language explicitly:

    ocrmypdf -l rus russian_doc.pdf russian_doc_ocr.pdf

    If the document is multilingual, you can pass in multiple languages:

    ocrmypdf -l rus+eng russian_doc.pdf russian_doc_ocr.pdf

    Tesseract (the OCR engine used by OCRmyPDF under the hood) supports many languages. Check the Tesseract documentation to determine whether it supports the language you need.

    You might need to install additional language packs before you can use them with OCRmyPDF. The OCRmyPDF documentation explains how to do so. On Debian and Ubuntu, for example, each language ships as a separate tesseract-ocr-<language code> package, such as tesseract-ocr-rus for Russian.
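
    Before starting a long run, you may want to check that the languages you plan to pass to -l are installed. The sketch below assumes the text printed by tesseract --list-langs consists of one header line followed by one language code per line; the helper names are hypothetical.

```python
def language_argument(codes):
    """Join Tesseract language codes into the value passed to -l, e.g. "rus+eng"."""
    return "+".join(codes)


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of codes.

    Assumes the first line is a header and every later non-empty line is a code.
    """
    lines = [line.strip() for line in output.splitlines()]
    return {line for line in lines[1:] if line}


def missing_languages(wanted, installed):
    """Return the requested codes that are not installed, in the order requested."""
    return [code for code in wanted if code not in installed]
```

    You could obtain the text with subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True). Depending on the Tesseract version, the list may be written to stdout or stderr, so it is worth checking both.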

    Image processing

    As mentioned earlier, OCRmyPDF can perform some image processing on each page of a PDF, if required. According to the official documentation, there are five options for this purpose. We’ve included the text from the documentation in the list below:

    • --rotate-pages attempts to determine the correct orientation for each page and rotates the page if necessary.
    • --remove-background attempts to detect and remove a noisy background from grayscale or color images. Monochrome images are ignored. This should not be used on documents that contain color photos as it may remove them.
    • --deskew will correct pages that were scanned at a skewed angle by rotating them back into place.
    • --clean uses unpaper to clean up pages before OCR, but does not alter the final output. This makes it less likely that OCR will try to find text in background noise.
    • --clean-final uses unpaper to clean up pages before OCR and inserts the page into the final output. You will want to review each page to ensure that unpaper did not remove something important.

    Regardless of the order in which you pass these options, OCRmyPDF will always apply them in this order:

    rotate -> remove background -> deskew -> clean
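
    The fixed ordering can be illustrated with a short sketch that sorts whatever flags you pass into the sequence described above. This is only a model of the documented behavior, not OCRmyPDF’s code, and placing --clean-final in the same slot as --clean is an assumption.

```python
# Order taken from the text above: rotate -> remove background -> deskew -> clean.
# Treating --clean-final as occupying the same slot as --clean is an assumption.
STAGE_OF = {
    "--rotate-pages": 0,
    "--remove-background": 1,
    "--deskew": 2,
    "--clean": 3,
    "--clean-final": 3,
}


def applied_order(flags):
    """Sort the given preprocessing flags into the order they would be applied."""
    unknown = [flag for flag in flags if flag not in STAGE_OF]
    if unknown:
        raise ValueError(f"not a preprocessing flag: {unknown}")
    return sorted(flags, key=STAGE_OF.__getitem__)
```

    Passing the flags as --clean --deskew --rotate-pages therefore models the same processing as passing them as --rotate-pages --deskew --clean.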

    File optimization

    By default, OCRmyPDF optimizes the output PDF for Fast Web View. This linearizes the PDF file so that its contents are stored in the order in which they’ll be viewed by the user, which slightly increases the file size. You can disable optimization by passing in --optimize 0 or -O0.

    At the default optimization level, -O1, OCRmyPDF also performs some lossless image optimization using the JBIG2 encoder. You can disable this by passing in -O0, or enable more aggressive, lossy optimization by passing in -O2 or -O3.

    Batch processing PDF files

    By default, OCRmyPDF uses all available CPU cores while processing a PDF file. You can limit this with the -j or --jobs option, which caps the number of concurrent jobs:

    ocrmypdf -j 4 input.pdf output.pdf

    The authors of the program also provide a watcher.py script that watches a folder and performs OCR on any new PDF file that appears in it. You might need to edit the script to suit your specific needs. Because it has some additional dependencies, you might need to install ocrmypdf with the watcher extra (the quotes keep shells such as zsh from interpreting the square brackets):

    pip install "ocrmypdf[watcher]"

    You can then run the watcher like this:

    env OCR_INPUT_DIRECTORY=./input-pdfs \
    OCR_OUTPUT_DIRECTORY=./output-pdfs \
    python3 watcher.py

    This will OCR any new PDF files that are placed in the input-pdfs folder and place the resulting PDFs in the output-pdfs folder. Note that this won’t process any files that were already in the input-pdfs folder before the watcher was run.
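
    Since the watcher skips files that were already present, one option is a one-off pass over the existing PDFs before starting it. The sketch below only plans the commands; plan_batch is a hypothetical helper, and the folder names are the ones used in the watcher example.

```python
from pathlib import Path


def plan_batch(pdf_paths, output_dir, jobs=None):
    """Return one ocrmypdf argument list per input PDF.

    Each output file keeps the input's file name inside output_dir.
    """
    commands = []
    for pdf in pdf_paths:
        pdf = Path(pdf)
        command = ["ocrmypdf"]
        if jobs is not None:
            # -j limits the number of concurrent jobs for this run.
            command += ["-j", str(jobs)]
        command += [str(pdf), str(Path(output_dir) / pdf.name)]
        commands.append(command)
    return commands
```

    To run the plan, collect the inputs with sorted(Path("./input-pdfs").glob("*.pdf")) and pass each returned list to subprocess.run(command, check=True). Keeping planning separate from execution makes it easy to print the commands first and confirm they are what you expect.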

    How to OCR a PDF on Linux using Nutrient Document Engine

    Nutrient Document Engine offers a powerful and scalable solution for performing OCR and managing document workflows. It’s PDF server software designed for processing documents and powering PDF automation workflows. Operating as a headless service, it can be deployed within your own infrastructure or hosted via Nutrient.

    Key features of Nutrient Document Engine

    • HTTP-based API — Operates as a headless service for easy integration.
    • Flexible deployment — Deploy within your infrastructure or host via Nutrient.
    • Frontend SDKs — Works alongside Nutrient’s web and mobile frontend SDKs.
    • Prebuilt features — Includes the ability to annotate, edit, sign, form fill, redact, and more.

    OCR capabilities with Document Engine

    Document Engine includes custom-built OCR technology to accurately recognize text and patterns, generating searchable PDF/A files. OCR-processed PDFs can be opened in Nutrient’s Web, iOS, Android, React Native, and Flutter client SDKs.

    Key features of Nutrient Document Engine for OCR

    • Highly accurate OCR — Document Engine includes a custom-built AI- and ML-powered OCR engine that delivers highly accurate text and pattern recognition. This enables you to convert images, scanned documents, and unstructured data into searchable and editable content.
    • Multi-language support — It supports multiple languages, including English, French, German, Spanish, and more, making it versatile for global applications.
    • Searchable PDF generation — Turn any scanned document or image into a searchable PDF or PDF/A document. This is ideal for archiving and indexing documents for quick retrieval.
    • Data extraction — The OCR engine can extract key-value pairs from unstructured documents, which can be particularly useful for automating workflows in industries like healthcare, finance, and legal.
    • Post-processing capabilities — After processing a document with OCR, you can add signatures and annotations, and even perform document assembly, enhancing your document management workflows.
    • Integrated viewing options — Document Engine integrates seamlessly with Nutrient’s Web, iOS, Android, React Native, and Flutter client SDKs, enabling you to open and display processed PDFs within your applications.

    System requirements

    To run Nutrient Document Engine, your system must meet the following criteria:

    • macOS — Ventura, Monterey, Mojave, Catalina, or Big Sur
    • Linux — Ubuntu, Fedora, Debian, CentOS, or derivatives like Kubuntu or Xubuntu; 64-bit Intel (x86_64) and ARM (AArch64) processors are supported.

    You should have a minimum of 4 GB RAM available, regardless of the operating system.

    Setting up Docker

    Document Engine is provided as a Docker container. To deploy it, first install Docker for your operating system.

    Launching Document Engine

    Once Docker is installed, follow the steps outlined below to start Document Engine.

    1. Open your terminal:
      • macOS — You can use a terminal integrated within your IDE or standalone applications like Terminal.app or iTerm2.
      • Windows/Linux — Use any terminal emulator or the one provided in your IDE.
    2. Enter the following command to start the service:
    docker run --rm -t -p 5000:5000 -e API_AUTH_TOKEN=secret pspdfkit/document-engine:1.5.0

    The initialization might take some time, depending on your network speed. Wait until you see a message like the following one:

    [info] 2024-02-05 18:56:45.286 Running Document Engine version 1.5.0
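
    If you start the container from a script, you can watch its log output for that message instead of checking by eye. This is a minimal sketch that assumes the startup line always has the shape shown above; other versions may log differently, and started_version is a hypothetical helper.

```python
import re

# Pattern modeled on the sample log line above; other versions may log differently.
READY_PATTERN = re.compile(r"Running Document Engine version (\d+(?:\.\d+)*)")


def started_version(log_line):
    """Return the version from a startup log line, or None if the line is not one."""
    match = READY_PATTERN.search(log_line)
    return match.group(1) if match else None
```

    Feeding each line of the container’s output to started_version and stopping at the first non-None result gives a simple readiness check.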

    Installing curl

    To interact with Document Engine, you need to use its HTTP API by sending commands and documents in HTTP requests. For this, ensure you have curl installed:

    • macOS — curl is preinstalled, so no additional steps are required.
    • Windows — Download and install curl from the official site.
    • Linux — Use your package manager (e.g. sudo apt-get install curl for Debian/Ubuntu).

    Performing OCR with Nutrient Document Engine

    Once Document Engine is running, you can perform OCR on your PDFs by sending requests to its API.

    1. Running OCR on document upload

      To perform OCR when uploading a new document, use the ocr action within the instructions parameter in your API request:

      curl -X POST http://localhost:5000/api/documents \
        -H "Authorization: Token token=<API token>" \
        -F instructions='{
          "parts": [
            {
              "file": "document"
            }
          ],
          "actions": [
            {
              "type": "ocr",
              "language": "english"
            }
          ]
        }' \
        -F document=@/path/to/ExampleDocument.pdf \
        -o result.pdf

      The equivalent raw HTTP request looks like this:

      POST /api/documents HTTP/1.1
      Content-Type: multipart/form-data; boundary=customboundary
      Authorization: Token token=<API token>

      --customboundary
      Content-Disposition: form-data; name="instructions"
      Content-Type: application/json

      {
        "parts": [
          {
            "file": "document"
          }
        ],
        "actions": [
          {
            "type": "ocr",
            "language": "english"
          }
        ]
      }
      --customboundary
      Content-Disposition: form-data; name="document"; filename="ExampleDocument.pdf"
      Content-Type: application/pdf

      <PDF data>
      --customboundary--

      This command uploads ExampleDocument.pdf, applies OCR in English, and outputs a searchable PDF named result.pdf.

    2. Applying OCR to existing documents

      If you have a document already uploaded to Document Engine, you can apply OCR using the apply_instructions endpoint:

      curl -X POST http://localhost:5000/api/documents/:document_id/apply_instructions \
        -H "Authorization: Token token=<API token>" \
        -H "Content-Type: application/json" \
        -d '{
          "parts": [
            {
              "document": {
                "id": "#self"
              }
            }
          ],
          "actions": [
            {
              "type": "ocr",
              "language": "english"
            }
          ]
        }' \
        -o result.pdf

      The equivalent raw HTTP request looks like this:

      POST /api/documents/:document_id/apply_instructions HTTP/1.1
      Content-Type: application/json
      Authorization: Token token=<API token>

      {
        "parts": [
          {
            "document": {
              "id": "#self"
            }
          }
        ],
        "actions": [
          {
            "type": "ocr",
            "language": "english"
          }
        ]
      }

      Replace :document_id with your document’s ID. The #self anchor is used to refer to the current document.

    3. Running OCR and retrieving the result without storing

      To perform OCR on a document and retrieve the result without storing it in Document Engine’s storage, use the /api/build endpoint:

      curl -X POST http://localhost:5000/api/build \
        -H "Authorization: Token token=<API token>" \
        -F instructions='{
          "parts": [
            {
              "file": "document"
            }
          ],
          "actions": [
            {
              "type": "ocr",
              "language": "english"
            }
          ]
        }' \
        -F document=@/path/to/ExampleDocument.pdf \
        -o result.pdf

      The equivalent raw HTTP request looks like this:

      POST /api/build HTTP/1.1
      Content-Type: multipart/form-data; boundary=customboundary
      Authorization: Token token=<API token>

      --customboundary
      Content-Disposition: form-data; name="instructions"
      Content-Type: application/json

      {
        "parts": [
          {
            "file": "document"
          }
        ],
        "actions": [
          {
            "type": "ocr",
            "language": "english"
          }
        ]
      }
      --customboundary
      Content-Disposition: form-data; name="document"; filename="ExampleDocument.pdf"
      Content-Type: application/pdf

      <PDF data>
      --customboundary--
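
    If you would rather call the API from a program than from curl, the same multipart request can be assembled with Python’s standard library. The sketch below mirrors the raw HTTP request above. The helper names are hypothetical, the base URL and boundary are the ones used in this post, and setting the part’s file value to the name of the form field that carries the PDF (document) is an assumption about how Document Engine matches parts to uploads.

```python
import json
import urllib.request

BOUNDARY = "customboundary"  # fixed for readability; must not occur in the PDF data


def ocr_instructions(part_name="document", language="english"):
    """Build the instructions object: one file part plus a single OCR action."""
    return {
        "parts": [{"file": part_name}],
        "actions": [{"type": "ocr", "language": language}],
    }


def multipart_body(instructions, pdf_bytes, filename):
    """Encode the instructions and the PDF as a multipart/form-data body."""
    dash = b"--" + BOUNDARY.encode("ascii")
    file_header = f'Content-Disposition: form-data; name="document"; filename="{filename}"'
    lines = [
        dash,
        b'Content-Disposition: form-data; name="instructions"',
        b"Content-Type: application/json",
        b"",
        json.dumps(instructions).encode("utf-8"),
        dash,
        file_header.encode("utf-8"),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        dash + b"--",
        b"",
    ]
    return b"\r\n".join(lines)


def build_request(pdf_bytes, filename, token, base_url="http://localhost:5000"):
    """Create (but do not send) the POST request for the /api/build endpoint."""
    return urllib.request.Request(
        base_url + "/api/build",
        data=multipart_body(ocr_instructions(), pdf_bytes, filename),
        method="POST",
        headers={
            "Authorization": f"Token token={token}",
            "Content-Type": f"multipart/form-data; boundary={BOUNDARY}",
        },
    )
```

    To send it, pass the request to urllib.request.urlopen and write the response body to result.pdf. A production client would normally generate a random boundary rather than a fixed one.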

    Performance considerations

    Running OCR is a CPU-bound single-threaded operation. Performing many parallel OCR operations on a single Document Engine instance can cause a high load for extended periods. Some performance benchmarks on development hardware are as follows:

    • 6-page document — ~35–40 seconds for the entire document, ~6–11 seconds per page.
    • 1-page document — ~3–4 seconds per page.

    Factors affecting performance include the number of pages, content complexity, and server hardware capabilities.
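
    One way to avoid overloading a single instance is to cap how many OCR requests your client has in flight at once. The following is a sketch of that idea using a bounded thread pool. The run_bounded helper is hypothetical, and the default limit of 2 is an arbitrary assumption that you would tune for your hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_parallel=2):
    """Run zero-argument callables with at most max_parallel running at once.

    Results are returned in the same order as the tasks were given.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

    Each task would wrap one HTTP request to Document Engine. Threads are sufficient here because the client spends its time waiting on the network while the CPU-bound work happens on the server.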

    Conclusion

    Both OCRmyPDF and Nutrient Document Engine offer robust OCR solutions for converting scanned documents into searchable PDFs. OCRmyPDF is a great choice for those who prefer an open source, command-line-based tool with simple setup and usage. In contrast, Nutrient Document Engine provides a more integrated, scalable, and feature-rich approach for enterprise applications.

    For more information on setting up and using Nutrient Document Engine, visit the Nutrient documentation or reach out to our team.

    FAQ

    What is OCRmyPDF?

    OCRmyPDF is an open source tool that adds OCR layers to PDF files, making them searchable and editable. It uses Tesseract for text recognition and performs preprocessing like deskewing and noise removal.

    How do I install OCRmyPDF on Linux?

    You can install OCRmyPDF on Ubuntu/Debian with sudo apt-get install ocrmypdf, on Fedora with sudo dnf install ocrmypdf, or via pip with pip install --user ocrmypdf.

    How can I OCR a PDF in a language other than English?

    You can specify the language using the -l flag. For example, to OCR in Russian:

    ocrmypdf -l rus input.pdf output.pdf

    What is Nutrient Document Engine and how is it different?

    Nutrient Document Engine is an enterprise-grade, scalable solution for OCR and document management. It offers a more feature-rich OCR experience, including AI-powered text recognition, multi-language support, and integration with frontend SDKs for web and mobile.

    How do I use Nutrient Document Engine for OCR?

    You can deploy Document Engine using Docker and perform OCR via its HTTP API. Here’s an example using curl:

    curl -X POST http://localhost:5000/api/documents \
      -H "Authorization: Token token=<API token>" \
      -F document=@/path/to/file.pdf \
      -F instructions='{"parts":[{"file":"document"}],"actions":[{"type":"ocr","language":"english"}]}' \
      -o result.pdf

    Yasoob Khalid

    Teja Tatimatla

    Hulya Masharipov

    Technical Writer

    Hulya is a frontend web developer and technical writer at Nutrient who enjoys creating responsive, scalable, and maintainable web experiences. She’s passionate about open source, web accessibility, cybersecurity privacy, and blockchain.
