How to Extract Text from a PDF

Patrik Weiskircher

Table of contents

  • Understanding the Challenges of Text Extraction from PDFs
  • How Text Is Represented in a PDF
  • Extracting Text from a Content Stream
  • PSPDFKit for Modern Text Extraction
  • Conclusion
  • FAQ

Extracting text from a PDF can be more challenging than expected. PDF files are designed to preserve document appearance rather than facilitate text extraction. In this post, we’ll explore how to extract text from a PDF effectively and see how PSPDFKit’s advanced features make this process easier.

Understanding the Challenges of Text Extraction from PDFs

PDFs are designed primarily for consistent visual presentation across devices. Because the format is optimized for rendering rather than for describing the structure of the text, extracting text from a PDF is more complex than reading it from a plain text file.

How Text Is Represented in a PDF

A PDF file doesn’t simply contain text the way a text file does. Instead, it contains commands that describe how to render the text on the page, often with no explicit whitespace characters or newlines. Let’s dive a little deeper into some PDF internals to see what this means in practice.

Content Streams

Each page in a PDF has one or more content streams that tell the PDF viewer application how to render a page. A very simple one might look like this:

193.95 581.633 Td
(Hello) Tj
30.68 0 Td
(World!) Tj

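The same kind of text can also be encoded quite differently while still rendering as expected. For example, a string can be written in hexadecimal form, where each group of four digits is a font-specific character code rather than a readable character: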

193.95 581.633 Td
<00290046004d004d00500001003800500053004d0045> Tj

Td tells the PDF viewer where to draw the next string by moving the text position; its two operands are an offset relative to the start of the current line. Tj specifies which string to draw.
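To make the hex string in the second example less opaque, here is a minimal sketch that decodes it. In a real PDF, the mapping from character codes to Unicode comes from the font (for example, its ToUnicode CMap). For this particular string, adding 0x1f to each two-byte code happens to produce readable ASCII, so the sketch assumes that fixed offset. This is an observation about this one example, not a general rule, and `decodeHexString` is a hypothetical helper.

```javascript
// Decode a PDF hex string made of 2-byte character codes.
// ASSUMPTION: each code maps to Unicode by adding a fixed offset (0x1f).
// Real extractors must look the codes up in the font's own mapping instead.
function decodeHexString(hex, offset = 0x1f) {
  // Split the string into groups of four hex digits (one 2-byte code each).
  const codes = hex.match(/.{4}/g) || [];
  return codes
    .map((code) => String.fromCharCode(parseInt(code, 16) + offset))
    .join("");
}

const decoded = decodeHexString("00290046004d004d00500001003800500053004d0045");
console.log(decoded); // prints "Hello World"
```

Under this assumption, the code 0001 decodes to a space. So this string carries its own space glyph, while the first example contains no whitespace at all.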

Extracting Text from a Content Stream

The only way to extract text from a PDF is to look at the rendering commands and apply a good heuristic to make sense of them. In the first example above, we know we’re supposed to render Hello, reposition the text cursor, and then output World!.
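The walk-through above can be sketched as a tiny interpreter that tracks the text position while reading the first content stream. This is only an illustration, not how any particular SDK does it: it handles just Td and Tj with simple literal strings, and `parseTextRuns` is a hypothetical name.

```javascript
// Minimal sketch: walk a content stream that uses only Td and Tj.
// Real streams also involve BT/ET, Tf, Tm, TJ, string escapes, and font encodings.
function parseTextRuns(stream) {
  // Tokens are either literal strings "(...)" (no nesting or escapes handled)
  // or bare numbers/operators separated by whitespace.
  const tokens = stream.match(/\([^)]*\)|[^\s()]+/g) || [];
  const operands = [];
  const runs = [];
  let x = 0;
  let y = 0;

  for (const token of tokens) {
    if (token === "Td") {
      // Td offsets the position relative to the start of the current line.
      const [tx, ty] = operands.splice(0).map(Number);
      x += tx;
      y += ty;
    } else if (token === "Tj") {
      // Tj draws its string operand at the current position.
      const [str] = operands.splice(0);
      runs.push({ text: str.slice(1, -1), x, y });
    } else {
      operands.push(token);
    }
  }
  return runs;
}

const stream = `193.95 581.633 Td
(Hello) Tj
30.68 0 Td
(World!) Tj`;

for (const run of parseTextRuns(stream)) {
  console.log(`${run.text} at (${run.x.toFixed(2)}, ${run.y.toFixed(3)})`);
}
```

The result is a list of positioned strings, not a sentence. Deciding how those strings relate to each other is left to the extractor.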

You might have noticed there’s no whitespace in the first example above. Because the content stream only instructs the rendering engine what to draw on the screen, and because whitespace is blank, we have to infer the spaces and newlines ourselves most of the time.
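One simple way to infer whitespace is to compare the gap between consecutive runs of text against a threshold. The sketch below assumes each run already has a position and a width. The content stream alone doesn’t provide widths (they come from font metrics), so the widths, the third run, and both thresholds are made-up values for illustration, and `joinRuns` is a hypothetical helper.

```javascript
// Sketch of a gap-based heuristic for inferring spaces and newlines.
// ASSUMPTION: every run has a start position (x, y) and a width in the same units.
function joinRuns(runs, { spaceGap = 1.5, lineGap = 5 } = {}) {
  let text = "";
  let prev = null;
  for (const run of runs) {
    if (prev) {
      const sameLine = Math.abs(run.y - prev.y) < lineGap;
      const gap = run.x - (prev.x + prev.width);
      if (!sameLine) {
        text += "\n"; // A large vertical jump is treated as a new line.
      } else if (gap > spaceGap) {
        text += " "; // A large horizontal gap is treated as a space.
      }
    }
    text += run.text;
    prev = run;
  }
  return text;
}

const runs = [
  { text: "Hello", x: 193.95, y: 581.633, width: 27.5 }, // width is a made-up value
  { text: "World!", x: 224.63, y: 581.633, width: 33.0 }, // width is a made-up value
  { text: "Next line", x: 193.95, y: 567.233, width: 49.5 }, // hypothetical third run
];
console.log(joinRuns(runs));
```

The two thresholds are exactly the kind of knob that needs tuning: a value that separates words correctly in one document can merge or split them in another.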

Doing this reliably across all the different PDF documents out there is difficult, and it’s not uncommon to encounter problems where tweaking the heuristic breaks one document but fixes another.

PSPDFKit for Modern Text Extraction

PSPDFKit offers APIs to retrieve text from a document. All of our platforms use the same underlying heuristic to determine the layout of the text on the page and how to extract blocks out of it.

iOS

On iOS, you can use PSPDFTextParser to retrieve the text, text blocks, words, or glyphs from a page:

guard let textParser = documentProvider.textParserForPage(at: 0) else {
    // Handle failure.
    abort()
}
print("Text of page 0: \(textParser.text)")

for textBlock in textParser.textBlocks {
    print("TextBlock at \(textBlock.frame): \(textBlock.content)")
}

Android

On Android, there’s no dedicated text parser class; instead, you retrieve your page text using PdfDocument:

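// Retrieve the full text of the first page.
val pageText = document.getPageText(0)
println("Text of page 0: $pageText")

// Retrieve the rects covering the page text, and then the text inside each rect.
for (textRect in document.getPageTextRects(0, 0, pageText.length)) {
    val blockText = document.getPageText(0, textRect)
    println("TextBlock at $textRect: $blockText")
}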

Web

PSPDFKit for Web can extract the text from a page using textLinesForPageIndex, but there isn’t yet an API for extracting text blocks:

// Using async/await:
const textLines = await instance.textLinesForPageIndex(0);
textLines.forEach((textLine) => console.log(textLine.contents));

// Or, equivalently, using promises:
instance.textLinesForPageIndex(0).then(function (textLines) {
	textLines.forEach(function (textLine) {
		console.log(textLine.contents);
	});
});
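If you need something like text blocks on the web, one option is to group the returned text lines yourself. The sketch below is not a PSPDFKit API. It assumes each text line has `contents` (as used above) and a `boundingBox` with `top` and `height` in a top-left coordinate system, which should be checked against the API reference. `groupIntoBlocks` and the gap threshold are hypothetical, and the sketch runs on mock data rather than a real instance.

```javascript
// Sketch: group text lines into blocks based on the vertical gap between them.
// ASSUMPTION: each line looks like { contents, boundingBox: { top, height, ... } },
// with y growing downward. A single-column layout is also assumed.
function groupIntoBlocks(textLines, maxGapFactor = 0.8) {
  const sorted = [...textLines].sort(
    (a, b) => a.boundingBox.top - b.boundingBox.top
  );
  const blocks = [];
  let prev = null;
  for (const line of sorted) {
    const gap = prev
      ? line.boundingBox.top - (prev.boundingBox.top + prev.boundingBox.height)
      : 0;
    // Start a new block when the gap is large relative to the previous line height.
    if (!prev || gap > maxGapFactor * prev.boundingBox.height) {
      blocks.push([]);
    }
    blocks[blocks.length - 1].push(line.contents);
    prev = line;
  }
  return blocks.map((lines) => lines.join("\n"));
}

// Mock data standing in for the result of textLinesForPageIndex.
const mockLines = [
  { contents: "First paragraph, line one", boundingBox: { left: 72, top: 72, width: 300, height: 12 } },
  { contents: "first paragraph, line two", boundingBox: { left: 72, top: 86, width: 280, height: 12 } },
  { contents: "Second paragraph", boundingBox: { left: 72, top: 120, width: 200, height: 12 } },
];
console.log(groupIntoBlocks(mockLines));
```

Multi-column pages would need horizontal grouping as well, which this sketch deliberately leaves out.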

Conclusion

Learning how to extract text from a PDF involves understanding the complexities of PDF design and using the right tools. PSPDFKit offers modern solutions to make this process as seamless as possible, helping you focus on your core tasks. This text extraction capability also powers our PDF text comparison feature for identifying differences between document versions.

FAQ

Why is extracting text from PDFs so difficult?
PDFs are designed for visual consistency rather than text extraction. The text is often encoded in ways that can make it challenging to extract accurately.

How does PSPDFKit handle text extraction?
PSPDFKit uses advanced algorithms to interpret the rendering commands in PDFs and retrieve text with high accuracy across different platforms.

Can PSPDFKit handle all PDFs for text extraction?
While PSPDFKit is highly effective, text extraction can vary depending on the complexity and encoding of a PDF. Our tools continuously improve to handle a wide range of documents.

Author

Patrik Weiskircher, Core Team Lead

Patrik is the team lead of the Core Team, which oversees the shared codebase between our products. He knows far too many things about PDFs — ask him about fonts!

Copyright 2025 Nutrient. All rights reserved.

PSPDFKit is now Nutrient. We've consolidated our group of trusted companies into one unified brand: Nutrient.
