How to extract text from a PDF using PyMuPDF and Python
- PyMuPDF provides fast text extraction from native PDFs but requires custom OCR integration for scanned documents.
- Nutrient is a cloud PDF processor with built-in OCR and ML-powered data extraction.
- When to use what — Use PyMuPDF for simple native PDFs and Nutrient for mixed documents and production systems.
PyMuPDF
PyMuPDF (imported in Python as fitz) is a Python wrapper for MuPDF that lets you extract text from native PDFs. It supports multiple extraction modes, ranging from simple plain text to detailed coordinate-based data.
Installation and basic setup
```bash
# Create and activate a virtual environment (recommended).
python -m venv .venv

# macOS/Linux
source .venv/bin/activate

# Windows (PowerShell)
. .venv/Scripts/Activate.ps1
```

```bash
# Install PyMuPDF.
pip install PyMuPDF
```
This installs PyMuPDF. The virtual environment isolates dependencies.
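To confirm the install, a quick optional sanity check is to print the library's version banner:

```python
import fitz  # PyMuPDF

# Prints something like "PyMuPDF 1.2x.x: Python bindings for the MuPDF ... library".
print(fitz.__doc__)
```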
Text extraction methods
PyMuPDF’s page.get_text() method supports several extraction modes:

- "text" — Plain text in reading order (fastest for basic extraction)
- "blocks" / "words" — Lists of blocks or words with positions
- "dict" — Structured data with fonts, coordinates, and layout information
- "json" — Same as "dict", but in JSON format
- "html", "xml", "rawdict", and "rawjson" — Also available

Use "text" for simple extraction; use "dict" when you need coordinates for tables or forms.
Basic text extraction
To perform basic text extraction, use the following code:
```python
import fitz  # PyMuPDF

def extract_text_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    try:
        text = ""
        for page in doc:
            text += page.get_text("text")  # Plain text in reading order.
        return text
    finally:
        doc.close()

# Extract text from a PDF.
result = extract_text_pymupdf("invoice.pdf")
print(result)
```
This opens the document once and concatenates text from each page. It works for native PDFs, but not for scans.
Getting layout information
For tables and forms, you need coordinates:
```python
import fitz  # PyMuPDF

def extract_with_coordinates(pdf_path):
    """Extract text with position and font information for layout analysis."""
    doc = fitz.open(pdf_path)
    try:
        results = []
        # Process each page in the document.
        for page_num, page in enumerate(doc):
            # Get structured data: blocks contain lines, lines contain spans.
            data = page.get_text("dict")  # Returns a hierarchical text structure.

            # Navigate the hierarchy: blocks > lines > spans.
            for block in data["blocks"]:
                # Skip image blocks (they don't have a "lines" key).
                for line in block.get("lines", []):
                    # Each span is the smallest text unit with consistent formatting.
                    for span in line["spans"]:
                        results.append({
                            "page": page_num,
                            "text": span["text"],
                            "bbox": span["bbox"],  # (x0, y0, x1, y1) coordinates in points.
                            "font": span["font"],  # Font name for styling analysis.
                        })
        return results
    finally:
        # Always close the document to free memory.
        doc.close()

# Usage example: Extract positioned text for table detection.
result = extract_with_coordinates("invoice.pdf")
print(result[:5])  # Preview the first 5 text spans with positions.
```
The "dict" mode returns text spans along with their bounding boxes and font information. Coordinates are measured in points (1/72 inch) from the top-left corner, and PyMuPDF automatically accounts for page rotation.
Table extraction (bordered tables)
PyMuPDF includes basic table detection for bordered tables:
```python
import fitz  # PyMuPDF

def extract_tables_pymupdf(pdf_path):
    doc = fitz.open(pdf_path)
    try:
        all_tables = []
        for page in doc:
            # Returns a `TableFinder` object.
            table_finder = page.find_tables()
            for table in table_finder.tables:  # Sequence of `Table` objects.
                all_tables.append(table.extract())  # Each is a list of rows.
        return all_tables
    finally:
        doc.close()

# Usage
result = extract_tables_pymupdf("invoice.pdf")
for i, table in enumerate(result[:5]):  # First 5 tables.
    print(f"--- Table {i} ---")
    for row in table:
        print(row)
```
The code above finds tables with visible borders and returns rows as lists. Multipage tables and merged cells need manual handling.
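One pragmatic, admittedly heuristic way to stitch a table that continues across pages is to merge consecutive tables whose header rows match. The sketch below assumes continuation pages repeat the header row; it won't help when they omit it.

```python
def merge_multipage_tables(tables):
    """Merge consecutive tables with identical header rows.

    `tables` is the list returned by extract_tables_pymupdf() above.
    Assumes a table that continues on the next page repeats its header row.
    """
    if not tables:
        return []

    merged = [tables[0]]
    for table in tables[1:]:
        previous = merged[-1]
        # Same first row as the previous table: treat it as a continuation.
        if table and previous and table[0] == previous[0]:
            merged[-1] = previous + table[1:]  # Append rows, dropping the repeated header.
        else:
            merged.append(table)
    return merged

# Usage: collapse page-by-page tables into logical tables.
tables = merge_multipage_tables(extract_tables_pymupdf("invoice.pdf"))
print(f"{len(tables)} logical table(s) found")
```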
Handling scanned PDFs (OCR fallback)
PyMuPDF doesn’t handle image-only pages. This section outlines how to add Tesseract OCR for scanned content.
- Install OCR prerequisites:
```bash
# Python deps
pip install pytesseract Pillow

# Tesseract engine (system package)
# macOS
brew install tesseract

# Ubuntu / Debian
sudo apt-get install tesseract-ocr

# Windows
# 1) Install from: https://github.com/UB-Mannheim/tesseract/wiki
# 2) Add the install directory (e.g., C:\Program Files\Tesseract-OCR) to PATH
```
- Use the OCR-aware extractor:
```python
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import io

def is_scanned(page, threshold=40) -> bool:
    """Check if a page has little text (likely scanned/image-only)."""
    return len(page.get_text("text").strip()) < threshold

def extract_with_ocr(pdf_path: str, ocr_lang: str = "eng") -> str:
    """Hybrid extraction: use native text when available, OCR when needed."""
    doc = fitz.open(pdf_path)
    try:
        out = []
        for page in doc:
            if is_scanned(page):
                # Page has minimal text — likely scanned, so use OCR.
                pix = page.get_pixmap(dpi=300)  # Render at 300 DPI for good OCR quality.
                img = Image.open(io.BytesIO(pix.tobytes("png")))  # Convert to a PIL Image.
                # Run Tesseract OCR with the specified language.
                out.append(pytesseract.image_to_string(img, lang=ocr_lang))
            else:
                # Page has native text — extract directly (much faster).
                out.append(page.get_text("text"))
        return "".join(out)
    finally:
        doc.close()

# Example usage with error handling.
if __name__ == "__main__":
    try:
        result = extract_with_ocr("invoice.pdf")
        print(f"Extracted {len(result)} characters")
        print(result[:500])  # Preview the first 500 characters.
    except Exception as e:
        print(f"Extraction failed: {e}")
```
This code checks each page for text. If a page has little or none, it renders the page at 300 DPI and runs OCR. Set ocr_lang to a non-English language code (e.g., "deu", "spa") when the document isn't in English.
For poorly scanned documents, first preprocess them with OpenCV using binarization, deskewing, and denoising techniques.
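As a rough sketch of that preprocessing step, the function below denoises, binarizes with Otsu's method, and deskews a rendered page image before OCR. It assumes opencv-python and numpy are installed, and the parameter choices (denoising strength, the minAreaRect deskew recipe) are starting points to tune rather than a definitive pipeline.

```python
import cv2
import numpy as np

def preprocess_for_ocr(png_bytes: bytes) -> np.ndarray:
    """Denoise, binarize, and deskew a rendered page image before OCR.

    A rough sketch with assumed defaults; tune parameters per document set.
    """
    gray = cv2.imdecode(np.frombuffer(png_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)

    # 1. Remove speckle noise common in scans.
    gray = cv2.fastNlMeansDenoising(gray, None, 10)

    # 2. Binarize with Otsu's method; the inverted copy (white text on black)
    #    is only used to locate text pixels for the skew estimate.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, inverted = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # 3. Deskew using the minimum-area rectangle around the text pixels.
    #    Note: minAreaRect's angle convention changed in OpenCV 4.5, so verify
    #    the rotation direction on your own scans.
    coords = np.column_stack(np.where(inverted > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = binary.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

# Usage with the OCR fallback above (pytesseract also accepts NumPy arrays):
# pix = page.get_pixmap(dpi=300)
# text = pytesseract.image_to_string(preprocess_for_ocr(pix.tobytes("png")), lang="eng")
```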
Performance best practices
- Open once — Call fitz.open() once per document. Don't reopen it for each page.
- Prefer "text" mode — Use page.get_text("text") unless you need coordinates or font data.
- Skip unnecessary rendering — page.get_pixmap() is slow, so only use it for OCR.
- Handle invalid files — Use try/except for corrupted PDFs.
- Parallelize batches — Document objects are independent, so use multiprocessing for bulk jobs (see the sketch below).
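As a minimal example of the last point, the sketch below fans documents out across worker processes, one document per task. The helper function and the pool size of 4 are illustrative assumptions; the worker must live at module top level so child processes can import it.

```python
from multiprocessing import Pool
import fitz  # PyMuPDF

def extract_one(pdf_path):
    """Open a document, extract plain text, and close it (one task per file)."""
    doc = fitz.open(pdf_path)
    try:
        return "".join(page.get_text("text") for page in doc)
    finally:
        doc.close()

def extract_batch(pdf_paths, workers=4):
    """Process independent documents in parallel. `workers=4` is an assumed default."""
    with Pool(processes=workers) as pool:
        return dict(zip(pdf_paths, pool.map(extract_one, pdf_paths)))

if __name__ == "__main__":
    # Hypothetical file names for illustration.
    for path, text in extract_batch(["invoice.pdf", "report.pdf"]).items():
        print(path, len(text), "characters")
```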
Nutrient DWS Processor API for Python
Nutrient is a cloud PDF processor with built-in OCR and data extraction. One API call handles:
- AI and machine learning-driven OCR that processes poor-quality scans, handwriting, and mixed file types
- Intelligent reading order that maintains logical text flow in complex or multicolumn layouts
- Adaptive layout understanding that recognizes headers, paragraphs, lists, and document sections
- Key-value pair detection for forms, invoices, and other structured documents
- Layout-aware analysis that preserves spatial relationships between text, images, and annotations
Step 1: Sign up and get your API key
Sign up at Nutrient Processor API. After email verification, get your API key from the dashboard. You start with 200 free credits.
There are two ways to integrate Nutrient. Both use the same processing engine but differ in how you interact with the API.
Step 2: Choose your integration
Choose between the official Python client (nutrient-dws) and direct HTTP calls.
Option A — The official Python client
The official Python client (nutrient-dws) handles authentication, uploads, and response parsing. It's a good fit for automation scripts, backend services, or data pipelines, and it includes helper functions and error handling.
Use this if you:
- Want clean Python code
- Need automatic OCR and parsing
- Use AI code assistants (Claude, Copilot, Cursor)
```bash
pip install nutrient-dws
export NUTRIENT_API_KEY="your_api_key_here"
```
Minimal extraction:
```python
import asyncio
import os
from nutrient_dws import NutrientClient

async def main():
    client = NutrientClient(api_key=os.getenv("NUTRIENT_API_KEY"))
    result = await client.extract_text("invoice.pdf")
    print(result.get("text") or result)

asyncio.run(main())
```
The client uploads invoice.pdf, applies OCR if needed, and returns the extracted text; no OCR setup is required.
AI code helpers
The SDK includes helpers for AI coding assistants. After installing, run these commands for better completion:
```bash
# Claude Code
dws-add-claude-code-rule

# GitHub Copilot
dws-add-github-copilot-rule

# JetBrains (Junie)
dws-add-junie-rule

# Cursor
dws-add-cursor-rule

# Windsurf
dws-add-windsurf-rule
```
These enable SDK method suggestions and examples in your editor.
Option B — The HTTP API
The HTTP API works with any language. Send a request with your document, and you’ll get JSON back. Use this for non-Python projects or when you need direct control.
Use this if you:
- Use another language (Java, Go, C#, Node.js)
- Need to integrate with existing REST systems
- Want direct control over requests and responses
Start with the Python client for prototyping. Use the HTTP API for multi-language teams or existing service integration.
```bash
pip install requests
export NUTRIENT_API_KEY="your_api_key_here"
```
Request with basic error handling:
```python
import requests
import json
import os

API_KEY = os.getenv("NUTRIENT_API_KEY") or "your_api_key_here"

with open("invoice.pdf", "rb") as document:
    response = requests.post(
        "https://api.nutrient.io/build",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"document": document},
        data={
            "instructions": json.dumps({
                "parts": [{"file": "document"}],
                "output": {
                    "type": "json-content",
                    "plainText": True,
                    "structuredText": False,
                },
            })
        },
        stream=True,
    )

if response.ok:
    with open("result.json", "wb") as fd:
        for chunk in response.iter_content(chunk_size=8192):
            fd.write(chunk)
    print("Saved to result.json")
else:
    print(f"Error {response.status_code}: {response.text}")
```
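Once result.json is saved, you can load it and pull out the text. The key names used below (a pages list with plainText per page) are assumptions about the response shape; print the top-level keys and check your own result.json or the API reference for the exact schema.

```python
import json

# Inspect the saved response. The exact schema depends on the `output`
# options you requested, so look at the top-level keys first and adapt.
with open("result.json", "r", encoding="utf-8") as fd:
    data = json.load(fd)

print("Top-level keys:", list(data.keys()))

# Assumed layout: a "pages" list with a "plainText" field per page.
# Verify against your own result.json before relying on these keys.
for page in data.get("pages", []):
    print(page.get("plainText", ""))
```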
Feature comparison
The table below highlights how PyMuPDF and Nutrient compare across key PDF processing capabilities, from native text extraction to scanned documents, tables, forms, and overall development effort.
| Feature | PyMuPDF | Nutrient |
|---|---|---|
| Native PDF text | Excellent; get_text("text") is very fast | Excellent |
| Scanned documents | Requires external OCR integration | Built-in OCR |
| Table extraction | Basic bordered tables via find_tables() | Can return tables when requested in output |
| Form fields/KVP | Manual coding or heuristics required | Can return key-value pairs with instructions |
| Output format | Plain text, dict/json with coordinates | Plain text + structured JSON (order + hierarchy) |
| Setup complexity | pip install PyMuPDF | API key + HTTP or SDK client |
| Development time | 2–3 months for full pipeline | 1 week to production |
| Maintenance load | High (OCR, edge cases, error handling) | Minimal (automatic updates, provider-managed) |
PyMuPDF strengths
- Fast on native PDFs, low memory use
- Runs locally, no network latency
- Multiple output formats (text, words, blocks, coordinates)
- Basic table detection with page.find_tables()
- No external dependencies for text extraction
- Full control over processing
PyMuPDF limitations
- No built-in OCR — needs Tesseract for scans
- Limited table handling for borderless or multipage tables
- No form field detection
- Complex layouts need custom code
- Multipage tables and error handling add maintenance
Nutrient DWS Processor API strengths
- Consistent handling — Works with digital and scanned text, forms, and complex layouts
- Built-in OCR — Automatic OCR and image correction (deskewing, contrast)
- Regular updates — Accuracy improvements without code changes
- Production-ready — Scales with large document volumes
Nutrient DWS Processor API limitations
- Overhead for small tasks — Open source may be simpler for one-off extractions
- Setup required — Need to integrate SDK or API calls
- Paid service — Commercial solution, not open source
Choosing the right tool
Use PyMuPDF if
- Your PDFs are native text (not scans)
- You need full control over parsing
- You have 2–3 months for development
- You’re processing fewer than 1,000 documents/month
- Cost matters more than speed and accuracy
Use Nutrient if
- Your PDFs mix scanned and digital formats
- You need results quickly
- You’re processing thousands of documents
- Accuracy is critical
- You want to focus on your product, not PDF parsing
Migration path
Teams often start with PyMuPDF for simple PDFs, and then add Nutrient for scans, tables, and forms.
- Phase 1 — Use PyMuPDF for native text PDFs.
- Phase 2 — Hit limits with scans, tables, forms.
- Phase 3 — Hybrid approach
- If text exists → PyMuPDF
- If scanned → Nutrient
- Single routing function with logging (see the sketch below)
- Phase 4 — Move most processing to Nutrient, and keep PyMuPDF for offline cases.
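A Phase 3 routing function might look like the sketch below. The 40-character threshold mirrors the is_scanned() heuristic from the OCR section, and send_to_nutrient() is a hypothetical placeholder for whichever Nutrient integration (SDK or HTTP) you chose above.

```python
import logging
import fitz  # PyMuPDF

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pdf_router")

def send_to_nutrient(pdf_path):
    """Placeholder: wire this to the Nutrient SDK or HTTP call shown earlier."""
    raise NotImplementedError("Plug in the Nutrient SDK or HTTP request here.")

def route_and_extract(pdf_path, text_threshold=40):
    """Hybrid routing: native-text PDFs stay on PyMuPDF, scans go to Nutrient.

    `text_threshold` (in characters) is an assumed heuristic, matching the
    is_scanned() check from the OCR section.
    """
    doc = fitz.open(pdf_path)
    try:
        native_text = "".join(page.get_text("text") for page in doc)
    finally:
        doc.close()

    if len(native_text.strip()) >= text_threshold:
        logger.info("Routing %s to PyMuPDF (native text found)", pdf_path)
        return native_text

    logger.info("Routing %s to Nutrient (little or no native text)", pdf_path)
    return send_to_nutrient(pdf_path)
```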
Conclusion
PyMuPDF works well for native PDF text extraction — it’s fast, and you control everything.
For scanned documents, complex tables, or forms, Nutrient handles these without extra work.
Choose based on your situation:
- Simple PDFs and adequate development time — PyMuPDF
- Mixed documents and a need to get to production quickly — Nutrient
Start with what fits now. Migration is always possible.
Try Nutrient yourself: Sign up for 200 free credits monthly.
FAQ
Does PyMuPDF include built-in OCR for scanned PDFs?
No. PyMuPDF needs external OCR like Tesseract. You handle preprocessing and integration. Nutrient has built-in OCR.
Is PyMuPDF cheaper than Nutrient?
PyMuPDF is free but needs 2–3 months of development. Engineering cost often exceeds Nutrient’s pricing.
Can I use PyMuPDF and Nutrient together?
Yes. Start hybrid — keep simple PDFs on PyMuPDF, and send complex ones to Nutrient. Migrate fully when ready.
Which option scales better for large volumes?
At scale (thousands of documents), Nutrient has better throughput and auto-scaling. PyMuPDF needs infrastructure work.
Can Nutrient extract tables?
Yes. Nutrient extracts bordered, semi-bordered, and borderless tables to JSON or Excel. It handles multipage and merged-cell tables in most cases, but very complex layouts may need post-processing.