How to extract text from a PDF using PyMuPDF and Python
- PyMuPDF provides fast text extraction from native PDFs but requires custom OCR integration for scanned documents.
- Nutrient is a cloud PDF processor with built-in OCR and ML-powered data extraction.
- When to use what — Use PyMuPDF for simple native PDFs and Nutrient for mixed documents and production systems.
PyMuPDF
PyMuPDF (imported in Python as fitz) is a Python wrapper for MuPDF that lets you extract text from native PDFs. It supports multiple extraction modes, ranging from simple plain text to detailed coordinate-based data.
Installation and basic setup
```bash
# Create and activate a virtual environment (recommended).
python -m venv .venv

# macOS/Linux
source .venv/bin/activate

# Windows (PowerShell)
. .venv/Scripts/Activate.ps1
```

```bash
# Install PyMuPDF.
pip install PyMuPDF
```
This installs PyMuPDF. The virtual environment isolates dependencies.
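To confirm the install, a quick optional sanity check is to print the library's version banner:

```python
import fitz  # PyMuPDF

# Prints something like "PyMuPDF 1.2x.x: Python bindings for the MuPDF ... library".
print(fitz.__doc__)
```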
Text extraction methods
PyMuPDF’s page.get_text() method supports several extraction modes:

- "text" — Plain text in reading order (fastest for basic extraction)
- "blocks" / "words" — Lists of blocks or words with positions
- "dict" — Structured data with fonts, coordinates, and layout information
- "json" — Same as "dict", but in JSON format
- "html", "xml", "rawdict", and "rawjson" — Also available

Use "text" for simple extraction; use "dict" when you need coordinates for tables or forms.
Basic text extraction
To perform basic text extraction, use the following code:
```python
import fitz  # PyMuPDF

def extract_text_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    try:
        text = ""
        for page in doc:
            text += page.get_text("text")  # Plain text in reading order.
        return text
    finally:
        doc.close()

# Extract text from a PDF.
result = extract_text_pymupdf("invoice.pdf")
print(result)
```
This opens the document once and concatenates text from each page. It works for native PDFs, but not for scans.
Getting layout information
For tables and forms, you need coordinates:
```python
import fitz  # PyMuPDF

def extract_with_coordinates(pdf_path):
    """Extract text with position and font information for layout analysis."""
    doc = fitz.open(pdf_path)
    try:
        results = []
        # Process each page in the document.
        for page_num, page in enumerate(doc):
            # Get structured data: blocks contain lines, lines contain spans.
            data = page.get_text("dict")  # Returns a hierarchical text structure.

            # Navigate the hierarchy: blocks > lines > spans.
            for block in data["blocks"]:
                # Skip image blocks (they don't have a "lines" key).
                for line in block.get("lines", []):
                    # Each span is the smallest text unit with consistent formatting.
                    for span in line["spans"]:
                        results.append({
                            "page": page_num,
                            "text": span["text"],
                            "bbox": span["bbox"],  # (x0, y0, x1, y1) coordinates in points.
                            "font": span["font"],  # Font name for styling analysis.
                        })
        return results
    finally:
        # Always close the document to free memory.
        doc.close()

# Usage example: Extract positioned text for table detection.
result = extract_with_coordinates("invoice.pdf")
print(result[:5])  # Preview the first 5 text spans with positions.
```
The "dict" mode returns text spans along with their bounding boxes and font information. Coordinates are measured in points (1/72 inch) from the top-left corner, and PyMuPDF automatically accounts for page rotation.
Table extraction (bordered tables)
PyMuPDF includes basic table detection for bordered tables:
```python
import fitz  # PyMuPDF

def extract_tables_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    try:
        all_tables = []
        for page in doc:
            # Returns a `TableFinder` object.
            table_finder = page.find_tables()
            for table in table_finder.tables:  # Sequence of `Table` objects.
                all_tables.append(table.extract())  # Each is a list of rows.
        return all_tables
    finally:
        doc.close()

# Usage
result = extract_tables_pymupdf("invoice.pdf")
for i, table in enumerate(result[:5]):  # First 5 tables.
    print(f"--- Table {i} ---")
    for row in table:
        print(row)
```
The code above finds tables with visible borders and returns rows as lists. Multipage tables and merged cells need manual handling.
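One pragmatic, admittedly heuristic way to stitch a table that continues across pages is to merge consecutive tables whose header rows match. The sketch below assumes continuation pages repeat the header row; it won't help when they omit it.

```python
def merge_multipage_tables(tables):
    """Merge consecutive tables with identical header rows.

    `tables` is the list returned by extract_tables_pymupdf() above.
    Assumes a table that continues on the next page repeats its header row.
    """
    if not tables:
        return []

    merged = [tables[0]]
    for table in tables[1:]:
        previous = merged[-1]
        # Same first row as the previous table: treat it as a continuation.
        if table and previous and table[0] == previous[0]:
            merged[-1] = previous + table[1:]  # Append rows, dropping the repeated header.
        else:
            merged.append(table)
    return merged

# Usage: collapse page-by-page tables into logical tables.
tables = merge_multipage_tables(extract_tables_pymupdf("invoice.pdf"))
print(f"{len(tables)} logical table(s) found")
```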
Handling scanned PDFs (OCR fallback)
PyMuPDF doesn’t handle image-only pages. This section outlines how to add Tesseract OCR for scanned content.
- Install OCR prerequisites:
```bash
# Python deps
pip install pytesseract Pillow

# Tesseract engine (system package)
# macOS
brew install tesseract

# Ubuntu / Debian
sudo apt-get install tesseract-ocr

# Windows
# 1) Install from: https://github.com/UB-Mannheim/tesseract/wiki
# 2) Add the install directory (e.g., C:\Program Files\Tesseract-OCR) to PATH
```
- Use the OCR-aware extractor:
```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import io

def is_scanned(page, threshold=40) -> bool:
    """Check if a page has little text (likely scanned/image-only)."""
    return len(page.get_text("text").strip()) < threshold

def extract_with_ocr(pdf_path: str, ocr_lang: str = "eng") -> str:
    """Hybrid extraction: use native text when available, OCR when needed."""
    doc = fitz.open(pdf_path)
    try:
        out = []
        for page in doc:
            if is_scanned(page):
                # Page has minimal text — likely scanned, so use OCR.
                pix = page.get_pixmap(dpi=300)  # Render at 300 DPI for good OCR quality.
                img = Image.open(io.BytesIO(pix.tobytes("png")))  # Convert to a PIL Image.
                # Run Tesseract OCR with the specified language.
                out.append(pytesseract.image_to_string(img, lang=ocr_lang))
            else:
                # Page has native text — extract directly (much faster).
                out.append(page.get_text("text"))
        return "".join(out)
    finally:
        doc.close()

# Example usage with error handling.
if __name__ == "__main__":
    try:
        result = extract_with_ocr("invoice.pdf")
        print(f"Extracted {len(result)} characters")
        print(result[:500])  # Preview the first 500 characters.
    except Exception as e:
        print(f"Extraction failed: {e}")
```
This code checks each page for text. If a page has little or none, it renders the page at 300 DPI and runs OCR. Set ocr_lang to a non-English language code (e.g., "deu", "spa") when the document isn't in English.
For poorly scanned documents, first preprocess them with OpenCV using binarization, deskewing, and denoising techniques.
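As a rough sketch of that preprocessing step, the function below denoises, binarizes with Otsu's method, and deskews a rendered page image before OCR. It assumes opencv-python and numpy are installed, and the parameter choices (denoising strength, the minAreaRect deskew recipe) are starting points to tune rather than a definitive pipeline.

```python
import cv2
import numpy as np

def preprocess_for_ocr(png_bytes: bytes) -> np.ndarray:
    """Denoise, binarize, and deskew a rendered page image before OCR.

    A rough sketch with assumed defaults; tune parameters per document set.
    """
    gray = cv2.imdecode(np.frombuffer(png_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)

    # 1. Remove speckle noise common in scans.
    gray = cv2.fastNlMeansDenoising(gray, None, 10)

    # 2. Binarize with Otsu's method; the inverted copy (white text on black)
    #    is only used to locate text pixels for the skew estimate.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, inverted = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # 3. Deskew using the minimum-area rectangle around the text pixels.
    #    Note: minAreaRect's angle convention changed in OpenCV 4.5, so verify
    #    the rotation direction on your own scans.
    coords = np.column_stack(np.where(inverted > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = binary.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

# Usage with the OCR fallback above (pytesseract also accepts NumPy arrays):
# pix = page.get_pixmap(dpi=300)
# text = pytesseract.image_to_string(preprocess_for_ocr(pix.tobytes("png")), lang="eng")
```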
Performance best practices
- Open once — Call fitz.open() once per document. Don't reopen it for each page.
- Prefer "text" mode — Use page.get_text("text") unless you need coordinates or font data.
- Skip unnecessary rendering — page.get_pixmap() is slow, so only use it for OCR.
- Handle invalid files — Use try/except for corrupted PDFs.
- Parallelize batches — Document objects are independent, so use multiprocessing for bulk jobs (see the sketch below).
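As a minimal example of the last point, the sketch below fans documents out across worker processes, one document per task. The helper function and the pool size of 4 are illustrative assumptions; the worker must live at module top level so child processes can import it.

```python
from multiprocessing import Pool
import fitz  # PyMuPDF

def extract_one(pdf_path):
    """Open a document, extract plain text, and close it (one task per file)."""
    doc = fitz.open(pdf_path)
    try:
        return "".join(page.get_text("text") for page in doc)
    finally:
        doc.close()

def extract_batch(pdf_paths, workers=4):
    """Process independent documents in parallel. `workers=4` is an assumed default."""
    with Pool(processes=workers) as pool:
        return dict(zip(pdf_paths, pool.map(extract_one, pdf_paths)))

if __name__ == "__main__":
    # Hypothetical file names for illustration.
    for path, text in extract_batch(["invoice.pdf", "report.pdf"]).items():
        print(path, len(text), "characters")
```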
Nutrient DWS Processor API for Python
Nutrient is a cloud PDF processor with built-in OCR and data extraction. One API call handles:
- AI and machine learning-driven OCR that processes poor-quality scans, handwriting, and mixed file types
- Intelligent reading order that maintains logical text flow in complex or multicolumn layouts
- Adaptive layout understanding that recognizes headers, paragraphs, lists, and document sections
- Key-value pair detection for forms, invoices, and other structured documents
- Layout-aware analysis that preserves spatial relationships between text, images, and annotations
Step 1: Sign up and get your API key
Sign up at Nutrient Processor API. After email verification, get your API key from the dashboard. You start with 200 free credits.
There are two ways to integrate Nutrient. Both use the same processing engine but differ in how you interact with the API.
Step 2: Choose your integration
Choose between the official Python client (nutrient-dws) and direct HTTP calls.
Option A — The official Python client
The official Python client (nutrient-dws) handles authentication, uploads, and response parsing. It's a good fit for automation scripts, backend services, or data pipelines, and it includes helper functions and error handling.
Use this if you:
- Want clean Python code
- Need automatic OCR and parsing
- Use AI code assistants (Claude, Copilot, Cursor)
```bash
pip install nutrient-dws
export NUTRIENT_API_KEY="your_api_key_here"
```
Minimal extraction:
```python
import asyncio
import os
from nutrient_dws import NutrientClient

async def main():
    client = NutrientClient(api_key=os.getenv("NUTRIENT_API_KEY"))
    result = await client.extract_text("invoice.pdf")
    print(result.get("text") or result)

asyncio.run(main())
```
The client uploads invoice.pdf, applies OCR if needed, and returns the extracted text; no OCR setup is required.
AI code helpers
The SDK includes helpers for AI coding assistants. After installing, run these commands for better completion:
```bash
# Claude Code
dws-add-claude-code-rule

# GitHub Copilot
dws-add-github-copilot-rule

# JetBrains (Junie)
dws-add-junie-rule

# Cursor
dws-add-cursor-rule

# Windsurf
dws-add-windsurf-rule
```
These enable SDK method suggestions and examples in your editor.
Option B — The HTTP API
The HTTP API works with any language. Send a request with your document, and you’ll get JSON back. Use this for non-Python projects or when you need direct control.
Use this if you:
- Use another language (Java, Go, C#, Node.js)
- Need to integrate with existing REST systems
- Want direct control over requests and responses
Start with the Python client for prototyping. Use the HTTP API for multi-language teams or existing service integration.
```bash
pip install requests
export NUTRIENT_API_KEY="your_api_key_here"
```
Request with basic error handling:
```python
import requests
import json
import os

API_KEY = os.getenv("NUTRIENT_API_KEY") or "your_api_key_here"

with open("invoice.pdf", "rb") as document:
    response = requests.post(
        "https://api.nutrient.io/build",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"document": document},
        data={
            "instructions": json.dumps({
                "parts": [{"file": "document"}],
                "output": {
                    "type": "json-content",
                    "plainText": True,
                    "structuredText": False,
                },
            })
        },
        stream=True,
    )

if response.ok:
    with open("result.json", "wb") as fd:
        for chunk in response.iter_content(chunk_size=8192):
            fd.write(chunk)
    print("Saved to result.json")
else:
    print(f"Error {response.status_code}: {response.text}")
```
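Once result.json is saved, you can load it and pull out the text. The key names used below (a pages list with plainText per page) are assumptions about the response shape; print the top-level keys and check your own result.json or the API reference for the exact schema.

```python
import json

# Inspect the saved response. The exact schema depends on the `output`
# options you requested, so look at the top-level keys first and adapt.
with open("result.json", "r", encoding="utf-8") as fd:
    data = json.load(fd)

print("Top-level keys:", list(data.keys()))

# Assumed layout: a "pages" list with a "plainText" field per page.
# Verify against your own result.json before relying on these keys.
for page in data.get("pages", []):
    print(page.get("plainText", ""))
```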
Feature comparison
The table below highlights how PyMuPDF and Nutrient compare across key PDF processing capabilities, from native text extraction to scanned documents, tables, forms, and overall development effort.
| Feature | PyMuPDF | Nutrient |
|---|---|---|
| Native PDF text | Excellent; get_text("text") is very fast | Excellent |
| Scanned documents | Requires external OCR integration | Built-in OCR |
| Table extraction | Basic bordered tables via find_tables() | Can return tables when requested in output |
| Form fields/KVP | Manual coding or heuristics required | Can return key-value pairs with instructions |
| Output format | Plain text, dict/json with coordinates | Plain text + structured JSON (order + hierarchy) |
| Setup complexity | pip install PyMuPDF | API key + HTTP or SDK client |
| Development time | 2–3 months for full pipeline | 1 week to production |
| Maintenance load | High (OCR, edge cases, error handling) | Minimal (automatic updates, provider-managed) |
PyMuPDF strengths
- Fast on native PDFs, low memory use
- Runs locally, no network latency
- Multiple output formats (text, words, blocks, coordinates)
- Basic table detection with page.find_tables()
- No external dependencies for text extraction
- Full control over processing
PyMuPDF limitations
- No built-in OCR — needs Tesseract for scans
- Limited table handling for borderless or multipage tables
- No form field detection
- Complex layouts need custom code
- Multipage tables and error handling add maintenance
Nutrient DWS Processor API strengths
- Consistent handling — Works with digital and scanned text, forms, and complex layouts
- Built-in OCR — Automatic OCR and image correction (deskewing, contrast)
- Regular updates — Accuracy improvements without code changes
- Production-ready — Scales with large document volumes
Nutrient DWS Processor API limitations
- Overhead for small tasks — Open source may be simpler for one-off extractions
- Setup required — Need to integrate SDK or API calls
- Paid service — Commercial solution, not open source
Choosing the right tool
Use PyMuPDF if
- Your PDFs are native text (not scans)
- You need full control over parsing
- You have 2–3 months for development
- You’re processing fewer than 1,000 documents/month
- Cost matters more than speed and accuracy
Use Nutrient if
- Your PDFs mix scanned and digital formats
- You need results quickly
- You’re processing thousands of documents
- Accuracy is critical
- You want to focus on your product, not PDF parsing
Migration path
Teams often start with PyMuPDF for simple PDFs, and then add Nutrient for scans, tables, and forms.
- Phase 1 — Use PyMuPDF for native text PDFs.
- Phase 2 — Hit limits with scans, tables, forms.
- Phase 3 — Hybrid approach
- If text exists → PyMuPDF
- If scanned → Nutrient
- Single routing function with logging (see the sketch below)
- Phase 4 — Move most processing to Nutrient, and keep PyMuPDF for offline cases.
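A Phase 3 routing function might look like the sketch below. The 40-character threshold mirrors the is_scanned() heuristic from the OCR section, and send_to_nutrient() is a hypothetical placeholder for whichever Nutrient integration (SDK or HTTP) you chose above.

```python
import logging
import fitz  # PyMuPDF

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pdf_router")

def send_to_nutrient(pdf_path):
    """Placeholder: wire this to the Nutrient SDK or HTTP call shown earlier."""
    raise NotImplementedError("Plug in the Nutrient SDK or HTTP request here.")

def route_and_extract(pdf_path, text_threshold=40):
    """Hybrid routing: native-text PDFs stay on PyMuPDF, scans go to Nutrient.

    `text_threshold` (in characters) is an assumed heuristic, matching the
    is_scanned() check from the OCR section.
    """
    doc = fitz.open(pdf_path)
    try:
        native_text = "".join(page.get_text("text") for page in doc)
    finally:
        doc.close()

    if len(native_text.strip()) >= text_threshold:
        logger.info("Routing %s to PyMuPDF (native text found)", pdf_path)
        return native_text

    logger.info("Routing %s to Nutrient (little or no native text)", pdf_path)
    return send_to_nutrient(pdf_path)
```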
Conclusion
PyMuPDF works well for native PDF text extraction — it’s fast, and you control everything.
For scanned documents, complex tables, or forms, Nutrient handles these without extra work.
Choose based on your situation:
- Simple PDFs and adequate development time — PyMuPDF
- Mixed documents and a need to get to production quickly — Nutrient
Start with what fits now. Migration is always possible.
Try Nutrient yourself: Sign up for 200 free credits monthly.
FAQ
Does PyMuPDF include built-in OCR for scanned PDFs?
No. PyMuPDF needs external OCR like Tesseract. You handle preprocessing and integration. Nutrient has built-in OCR.
Is PyMuPDF cheaper than Nutrient?
PyMuPDF is free but needs 2–3 months of development. Engineering cost often exceeds Nutrient’s pricing.
Can I use PyMuPDF and Nutrient together?
Yes. Start hybrid — keep simple PDFs on PyMuPDF, and send complex ones to Nutrient. Migrate fully when ready.
Which option scales better for large volumes?
At scale (thousands of documents), Nutrient has better throughput and auto-scaling. PyMuPDF needs infrastructure work.
Can Nutrient extract tables?
Yes. Nutrient extracts bordered, semi-bordered, and borderless tables to JSON or Excel. It handles multipage and merged-cell tables in most cases, but very complex layouts may need post-processing.