How to extract text from a PDF using PyMuPDF and Python

    PyMuPDF is fast for basic PDF text extraction, while Nutrient DWS Processor API handles complex documents with built-in OCR and data extraction. Here’s how both work, with code examples and performance comparisons.

    TL;DR
    • PyMuPDF provides fast text extraction from native PDFs but requires custom OCR integration for scanned documents.
    • Nutrient is a cloud PDF processor with built-in OCR and ML-powered data extraction.
    • When to use what — Use PyMuPDF for simple native PDFs and Nutrient for mixed documents and production systems.

    PyMuPDF

    PyMuPDF (imported in Python as fitz) is a Python wrapper for MuPDF that lets you extract text from native PDFs. It supports multiple extraction modes, ranging from simple plain text to detailed coordinate-based data.

    Installation and basic setup

    Terminal window
    # Create and activate a virtual environment (recommended).
    python -m venv .venv
    # macOS/Linux
    source .venv/bin/activate
    # Windows (PowerShell)
    . .venv/Scripts/Activate.ps1
    # Install PyMuPDF.
    pip install PyMuPDF

    This installs PyMuPDF. The virtual environment isolates dependencies.

    Text extraction methods

    PyMuPDF’s page.get_text() supports several extraction modes:

    • "text" — Plain text in reading order (fastest for basic extraction)
    • "blocks"/"words" — Lists of blocks or words with positions
    • "dict" — Structured data with fonts, coordinates, and layout information
    • "json" — Same as "dict", but in JSON format
    • "html", "xml", "rawdict", and "rawjson" — Also available

    Use "text" for simple extraction; use "dict" when you need coordinates for tables or forms.
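    As a quick, self-contained sketch of the mode parameter (the snippet builds a throwaway one-page PDF in memory, so no input file is needed):

    ```python
    import fitz  # PyMuPDF

    # Build a one-page PDF in memory so the example needs no input file.
    doc = fitz.open()
    page = doc.new_page()
    page.insert_text((72, 72), "Hello PDF")

    print(page.get_text("text").strip())  # Plain text in reading order.
    words = page.get_text("words")        # Tuples: (x0, y0, x1, y1, word, block, line, word_no).
    print(words[0][4])                    # The first word; its bbox is words[0][:4].
    doc.close()
    ```

    Every mode string listed above can be passed the same way; only the shape of the return value changes.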

    Basic text extraction

    To perform basic text extraction, use the following code:

    import fitz  # PyMuPDF

    def extract_text_pymupdf(pdf_path):
        doc = fitz.open(pdf_path)
        try:
            text = ""
            for page in doc:
                text += page.get_text("text")  # plain text
            return text
        finally:
            doc.close()

    # Extract text from a PDF.
    result = extract_text_pymupdf("invoice.pdf")
    print(result)

    This opens the document once and concatenates text from each page. It works for native PDFs, but not for scans.

    Getting layout information

    For tables and forms, you need coordinates:

    import fitz  # PyMuPDF

    def extract_with_coordinates(pdf_path):
        """Extract text with position and font information for layout analysis."""
        doc = fitz.open(pdf_path)
        try:
            results = []
            # Process each page in the document.
            for page_num, page in enumerate(doc):
                # Get structured data: blocks contain lines, lines contain spans.
                data = page.get_text("dict")  # Returns a hierarchical text structure.
                # Navigate the hierarchy: blocks > lines > spans.
                for block in data["blocks"]:
                    # Image blocks have no "lines" key, so they're skipped here.
                    for line in block.get("lines", []):
                        # Each span is the smallest text unit with consistent formatting.
                        for span in line["spans"]:
                            results.append({
                                "page": page_num,
                                "text": span["text"],
                                "bbox": span["bbox"],  # (x0, y0, x1, y1) coordinates in points
                                "font": span["font"],  # Font name for styling analysis
                            })
            return results
        finally:
            # Always close the document to free memory.
            doc.close()

    # Usage example: Extract positioned text for table detection.
    result = extract_with_coordinates("invoice.pdf")
    print(result[:5])  # Preview the first five text spans with positions.

    The "dict" mode returns text spans along with their bounding boxes and font information. Coordinates are measured in points (1/72 inch) from the top-left corner, and PyMuPDF automatically accounts for page rotation.


    Table extraction (bordered tables)

    PyMuPDF includes basic table detection for bordered tables:

    import fitz  # PyMuPDF

    def extract_tables_pymupdf(pdf_path):
        doc = fitz.open(pdf_path)
        try:
            all_tables = []
            for page in doc:
                # Returns a `TableFinder` object.
                table_finder = page.find_tables()
                for table in table_finder.tables:  # Sequence of `Table` objects.
                    all_tables.append(table.extract())  # Each is a list of rows.
            return all_tables
        finally:
            doc.close()

    # Usage
    result = extract_tables_pymupdf("invoice.pdf")
    for i, table in enumerate(result[:5]):  # First five tables.
        print(f"--- Table {i} ---")
        for row in table:
            print(row)

    The code above finds tables with visible borders and returns rows as lists. Multipage tables and merged cells need manual handling.
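    Because table.extract() returns plain lists of rows, exporting to CSV takes only the standard library. Here is a sketch with hypothetical row data (merged or empty cells typically come back as None, replaced with empty strings below):

    ```python
    import csv
    import io

    # Hypothetical output of table.extract(): one list per row.
    rows = [
        ["Item", "Qty", "Price"],
        ["Widget", "2", "9.99"],
        ["Gadget", None, "19.99"],  # A merged or empty cell appears as None.
    ]

    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow("" if cell is None else cell for cell in row)

    print(buf.getvalue().strip())
    ```

    Writing to a real file instead of io.StringIO works the same way with open("tables.csv", "w", newline="").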

    Handling scanned PDFs (OCR fallback)

    PyMuPDF doesn’t handle image-only pages. This section outlines how to add Tesseract OCR for scanned content.

    • Install OCR prerequisites:
    Terminal window
    # Python deps
    pip install pytesseract Pillow
    # Tesseract engine (system package)
    # macOS
    brew install tesseract
    # Ubuntu / Debian
    sudo apt-get install tesseract-ocr
    # Windows
    # 1) Install from: https://github.com/UB-Mannheim/tesseract/wiki
    # 2) Add the install directory (e.g., C:\Program Files\Tesseract-OCR) to PATH
    • Use the OCR-aware extractor:
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    def is_scanned(page, threshold=40) -> bool:
        """Check whether a page has little text (likely scanned/image-only)."""
        return len(page.get_text("text").strip()) < threshold

    def extract_with_ocr(pdf_path: str, ocr_lang: str = "eng") -> str:
        """Hybrid extraction: use native text when available, OCR when needed."""
        doc = fitz.open(pdf_path)
        try:
            out = []
            for page in doc:
                if is_scanned(page):
                    # Page has minimal text — likely scanned, so use OCR.
                    pix = page.get_pixmap(dpi=300)  # Render at 300 DPI for good OCR quality.
                    img = Image.open(io.BytesIO(pix.tobytes("png")))  # Convert to a PIL Image.
                    # Run Tesseract OCR with the specified language.
                    out.append(pytesseract.image_to_string(img, lang=ocr_lang))
                else:
                    # Page has native text — extract directly (much faster).
                    out.append(page.get_text("text"))
            return "".join(out)
        finally:
            doc.close()

    # Example usage with error handling.
    if __name__ == "__main__":
        try:
            result = extract_with_ocr("invoice.pdf")
            print(f"Extracted {len(result)} characters")
            print(result[:500])  # Preview the first 500 characters.
        except Exception as e:
            print(f"Extraction failed: {e}")

    This code checks each page for text. If a page has almost none, it renders the page at 300 DPI and runs OCR. For non-English documents, set ocr_lang to the appropriate Tesseract language code (e.g. "deu", "spa").

    For poorly scanned documents, first preprocess them with OpenCV using binarization, deskewing, and denoising techniques.
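    To illustrate what the binarization step does (OpenCV's cv2.threshold is the usual tool; this dependency-free toy applies a fixed threshold to grayscale values):

    ```python
    def binarize(pixels, threshold=128):
        """Map grayscale values (0-255) to pure black (0) or white (255)."""
        return [0 if p < threshold else 255 for p in pixels]

    # Faint background noise goes to white; dark ink goes to black.
    print(binarize([30, 90, 130, 200, 250]))  # → [0, 0, 255, 255, 255]
    ```

    Real preprocessing would pick the threshold adaptively (e.g. Otsu's method) and layer deskewing and denoising on top.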

    Performance best practices

    • Open once — Call fitz.open() once per document. Don’t reopen for each page.
    • Prefer "text" mode — Use page.get_text("text") unless you need coordinates or font data.
    • Skip unnecessary rendering — page.get_pixmap() is slow, so only use it for OCR.
    • Handle invalid files — Use try/except for corrupted PDFs.
    • Parallelize batches — Document objects are independent, so use multiprocessing for bulk jobs.
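    The last point can be sketched with concurrent.futures; extract_one below is a stand-in for a real per-document extractor such as the extract_text_pymupdf function shown earlier (each worker process opens its own Document, so nothing is shared):

    ```python
    from concurrent.futures import ProcessPoolExecutor

    def extract_one(pdf_path):
        # Stand-in for a real extractor: in practice, call fitz.open()
        # inside the worker and return the concatenated page text.
        return f"text from {pdf_path}"

    def extract_batch(paths, workers=4):
        """Extract a batch of PDFs across worker processes."""
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(extract_one, paths))

    if __name__ == "__main__":
        print(extract_batch(["a.pdf", "b.pdf"]))
    ```

    Passing file paths (not Document objects) to the workers keeps everything picklable and avoids sharing state between processes.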

    Nutrient DWS Processor API for Python

    Nutrient is a cloud PDF processor with built-in OCR and data extraction. One API call handles:

    • AI and machine learning-driven OCR that processes poor-quality scans, handwriting, and mixed file types
    • Intelligent reading order that maintains logical text flow in complex or multicolumn layouts
    • Adaptive layout understanding that recognizes headers, paragraphs, lists, and document sections
    • Key-value pair detection for forms, invoices, and other structured documents
    • Layout-aware analysis that preserves spatial relationships between text, images, and annotations

    Step 1: Sign up and get your API key

    Sign up at Nutrient Processor API. After email verification, get your API key from the dashboard. You start with 200 free credits.


    There are two ways to integrate Nutrient. Both use the same processing engine but differ in API interaction.

    Step 2: Choose your integration

    Choose between the Python client (nutrient-dws) or direct HTTP calls.

    Option A — The official Python client

    The official Python client (nutrient-dws) handles authentication, uploads, and response parsing. It’s good for automation scripts, backend services, or data pipelines. Additionally, it includes helper functions and error handling.

    Use this if you:

    • Want clean Python code
    • Need automatic OCR and parsing
    • Use AI code assistants (Claude, Copilot, Cursor)
    Terminal window
    pip install nutrient-dws
    export NUTRIENT_API_KEY="your_api_key_here"

    Minimal extraction:

    import asyncio
    import os

    from nutrient_dws import NutrientClient

    async def main():
        client = NutrientClient(api_key=os.getenv("NUTRIENT_API_KEY"))
        result = await client.extract_text("invoice.pdf")
        print(result.get("text") or result)

    asyncio.run(main())

    The client uploads invoice.pdf, applies OCR if needed, and returns extracted text; no OCR setup is required.

    AI code helpers

    The SDK includes helpers for AI coding assistants. After installing, run these commands for better completion:

    Terminal window
    # Claude Code
    dws-add-claude-code-rule
    # GitHub Copilot
    dws-add-github-copilot-rule
    # JetBrains (Junie)
    dws-add-junie-rule
    # Cursor
    dws-add-cursor-rule
    # Windsurf
    dws-add-windsurf-rule

    These enable SDK method suggestions and examples in your editor.

    Option B: HTTP API

    The HTTP API works with any language. Send a request with your document, and you’ll get JSON back. Use this for non-Python projects or when you need direct control.

    Use this if you:

    • Use another language (Java, Go, C#, Node.js)
    • Need to integrate with existing REST systems
    • Want direct control over requests and responses

    Start with the Python client for prototyping. Use the HTTP API for multi-language teams or existing service integration.

    Terminal window
    pip install requests
    export NUTRIENT_API_KEY="your_api_key_here"

    Request with streaming download:

    import json
    import os

    import requests

    API_KEY = os.getenv("NUTRIENT_API_KEY") or "your_api_key_here"

    response = requests.post(
        "https://api.nutrient.io/build",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"document": open("invoice.pdf", "rb")},
        data={
            "instructions": json.dumps({
                "parts": [{"file": "document"}],
                "output": {
                    "type": "json-content",
                    "plainText": True,
                    "structuredText": False,
                },
            })
        },
        stream=True,
    )

    if response.ok:
        with open("result.json", "wb") as fd:
            for chunk in response.iter_content(chunk_size=8192):
                fd.write(chunk)
        print("Saved to result.json")
    else:
        print(f"Error {response.status_code}: {response.text}")
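    Once saved, result.json is ordinary JSON. The snippet below parses a stand-in body (the plainText field follows the output options requested above, but treat the exact schema as something to verify against the API reference):

    ```python
    import json

    # Stand-in for the contents of result.json (hypothetical values).
    raw = '{"plainText": "Invoice INV-1042\\nTotal: $1,250.00"}'

    data = json.loads(raw)
    first_line = data["plainText"].splitlines()[0]
    print(first_line)  # → Invoice INV-1042
    ```

    In a real script you would read the file with json.load(open("result.json")) instead of a literal string.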

    Feature comparison

    The table below highlights how PyMuPDF and Nutrient compare across key PDF processing capabilities — from native text extraction to scanned documents, tables, forms, and overall development effort.

    Feature           | PyMuPDF                                   | Nutrient
    Native PDF text   | Excellent; get_text("text") is very fast  | Excellent
    Scanned documents | Requires external OCR integration         | Built-in OCR
    Table extraction  | Basic bordered tables via find_tables()   | Can return tables when requested in output
    Form fields/KVP   | Manual coding or heuristics required      | Can return key-value pairs with instructions
    Output format     | Plain text, dict/json with coordinates    | Plain text + structured JSON (order + hierarchy)
    Setup complexity  | pip install PyMuPDF                       | API key + HTTP or SDK client
    Development time  | 2–3 months for full pipeline              | 1 week to production
    Maintenance load  | High (OCR, edge cases, error handling)    | Minimal (automatic updates, provider-managed)

    PyMuPDF strengths

    • Fast on native PDFs, low memory use
    • Runs locally, no network latency
    • Multiple output formats (text, words, blocks, coordinates)
    • Basic table detection with page.find_tables()
    • No external dependencies for text extraction
    • Full control over processing

    PyMuPDF limitations

    • No built-in OCR — needs Tesseract for scans
    • Limited table handling for borderless or multipage tables
    • No form field detection
    • Complex layouts need custom code
    • Multipage tables and error handling add maintenance

    Nutrient DWS Processor API strengths

    • Consistent handling — Works with digital and scanned text, forms, and complex layouts
    • Built-in OCR — Automatic OCR and image correction (deskewing, contrast)
    • Regular updates — Accuracy improvements without code changes
    • Production-ready — Scales with large document volumes

    Nutrient DWS Processor API limitations

    • Overhead for small tasks — Open source may be simpler for one-off extractions
    • Setup required — Need to integrate SDK or API calls
    • Paid service — Commercial solution, not open source

    Choosing the right tool

    Use PyMuPDF if

    • Your PDFs are native text (not scans)
    • You need full control over parsing
    • You have 2–3 months for development
    • You’re processing more than 1,000 documents/month
    • Cost matters more than speed and accuracy

    Use Nutrient if

    • Your PDFs mix scanned and digital formats
    • You need results quickly
    • You’re processing thousands of documents
    • Accuracy is critical
    • You want to focus on your product, not PDF parsing

    Migration path

    Teams often start with PyMuPDF for simple PDFs, and then add Nutrient for scans, tables, and forms.

    • Phase 1 — Use PyMuPDF for native text PDFs.
    • Phase 2 — Hit limits with scans, tables, forms.
    • Phase 3 — Hybrid approach
      • If text exists → PyMuPDF
      • If scanned → Nutrient
      • Single routing function with logging
    • Phase 4 — Move nearly everything to Nutrient, and keep PyMuPDF for offline cases.
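    The Phase 3 router can be sketched as follows (the names, the 40-character threshold, and the page_texts parameter are all illustrative; in practice the texts would come from page.get_text("text"), and the "nutrient" branch would call the API):

    ```python
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pdf-router")

    def has_native_text(page_texts, threshold=40):
        """Heuristic: a document with almost no extractable text is likely scanned."""
        return sum(len(t.strip()) for t in page_texts) >= threshold

    def route_document(pdf_path, page_texts):
        """Decide which backend handles this document, with logging."""
        if has_native_text(page_texts):
            log.info("%s -> PyMuPDF (native text)", pdf_path)
            return "pymupdf"
        log.info("%s -> Nutrient (likely scanned)", pdf_path)
        return "nutrient"

    print(route_document("native.pdf", ["This page contains plenty of extractable native text."]))
    print(route_document("scan.pdf", ["", " "]))
    ```

    Keeping the decision in one function makes it easy to tune the threshold from logs and, in Phase 4, to collapse the router down to a single backend.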

    Conclusion

    PyMuPDF works well for native PDF text extraction — it’s fast, and you control everything.

    For scanned documents, complex tables, or forms, Nutrient handles these without extra work.

    Choose based on your situation:

    • Simple PDFs and adequate development time — PyMuPDF
    • Mixed documents and a need to get to production quickly — Nutrient

    Start with what fits now. Migration is always possible.

    Try Nutrient yourself: Sign up for 200 free credits monthly.

    FAQ

    Can PyMuPDF handle scanned PDFs?

    No. PyMuPDF needs external OCR like Tesseract. You handle preprocessing and integration. Nutrient has built-in OCR.

    What’s the main cost difference between PyMuPDF and Nutrient?

    PyMuPDF is free but needs 2–3 months development. Engineering cost often exceeds Nutrient’s pricing.

    Can I migrate from PyMuPDF to Nutrient gradually?

    Yes. Start hybrid — keep simple PDFs on PyMuPDF, and send complex ones to Nutrient. Migrate fully when ready.

    Which approach is better for high-volume processing?

    At scale (thousands of documents), Nutrient has better throughput and auto-scaling. PyMuPDF needs infrastructure work.

    Can Nutrient handle tables in PDFs?

    Yes. Nutrient extracts bordered, semi-bordered, and borderless tables to JSON or Excel. It handles multipage and merged-cell tables in most cases, but very complex layouts may need post-processing.

    Hulya Masharipov

    Technical Writer

    Hulya is a frontend web developer and technical writer at Nutrient who enjoys creating responsive, scalable, and maintainable web experiences. She’s passionate about open source, web accessibility, cybersecurity, privacy, and blockchain.
