Extract PDF text with Python

This guide demonstrates how to extract searchable text from PDF documents using Python and Nutrient Document Converter Services (DCS). Text extraction converts PDF content into plain text format, making it accessible for analysis, indexing, and integration workflows.

Common use cases

PDF text extraction is useful for:

Content analysis - Extract text for search indexing and content management systems
Data processing - Convert PDF reports into structured text for analysis and reporting
Document migration - Extract content when migrating from PDF to other formats
Compliance workflows - Extract text for regulatory review and archival processes
Accessibility improvements - Generate text versions of PDF documents for screen readers

The sample code in this guide was developed using Visual Studio 2022, but you can run it in any Python environment with access to the Zeep library.

Zeep is a SOAP client for Python. It reads a service's Web Services Description Language (WSDL) document, which defines how to call the web services and describes the data structures returned. Nutrient Document Converter Services (DCS) provides these WSDL definitions for text extraction and other operations.

Prerequisites

Before extracting text from PDFs, ensure you have:

Python 3.x installed on your system
The Zeep library installed (pip install zeep)
Nutrient Document Converter Services (DCS) running locally on port 41734
Valid DCS license for text extraction functionality
PDF files with extractable text (not scanned images without OCR)
Basic understanding of Python programming and web services
Appropriate file system permissions for reading input files and writing output

For initial DCS setup with Python, refer to the guide on using Document Converter Services with Python.

WSDL

Zeep extracts the following WSDL definitions:

     ns1:ExtractText(sourceFile: xsd:base64Binary, openOptions: ns2:OpenOptions, textExtractSettings: ns3:TextExtractSettings)
     ns1:ExtractTextResponse(ExtractTextResult: xsd:base64Binary)
     ...
     ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)
     ...
     ns3:TextExtractSettings(PageRange: xsd:string, PageSeparator: xsd:string, PageSeparatorPlacement: ns3:PageSeparatorPlacement)

The ExtractText method requires three parameters:

sourceFile: xsd:base64Binary
openOptions: ns2:OpenOptions
textExtractSettings: ns3:TextExtractSettings

The sourceFile parameter must be a Base64-encoded binary representation of the document, following W3C XML schema standards.
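
As a minimal sketch of what Base64 encoding does to file bytes, the following uses only the Python standard library. The sample bytes are a placeholder standing in for the contents of a real PDF and are not a valid document.

```python
import base64

# Placeholder bytes standing in for the contents of a PDF file (illustrative only).
pdf_bytes = b"%PDF-1.7 example bytes"

# Encode the raw bytes to Base64, then to an ASCII string for transport.
encoded = base64.b64encode(pdf_bytes).decode("ascii")

# Decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == pdf_bytes)
```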

Use Zeep type factories to instantiate the custom DCS types: OpenOptions (under ns2) and TextExtractSettings (under ns3).

The OpenOptions type requires basic configuration, such as the file name and extension.

The TextExtractSettings object supports several configuration options:

PageRange: Specify pages to extract (e.g., "1-5", "1,3,5", or "*" for all pages)
PageSeparator: Character(s) to insert between pages in the output
PageSeparatorPlacement: Controls where page separators are placed in the extracted text

For most use cases, setting PageRange to "*" extracts text from all pages in the document.
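
To illustrate how a PageSeparator value might be used downstream, here is a small sketch that splits extracted text back into pages. It assumes that PageSeparator was set to a form feed character and that the service placed it between pages; both the separator value and the split_pages helper are hypothetical and not part of DCS.

```python
# Hypothetical separator: assumes TextExtractSettings.PageSeparator was set to
# this value and that the service inserted it between pages.
PAGE_SEPARATOR = "\f"


def split_pages(text, separator=PAGE_SEPARATOR):
    """Split extracted text into per-page strings, dropping empty trailing pages."""
    pages = text.split(separator)
    # A separator at the very end yields an empty final element; drop it.
    while pages and pages[-1] == "":
        pages.pop()
    return pages


# Made-up sample text standing in for a real extraction result.
sample = "Page one text\fPage two text\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, page)
```

If your separator can also occur inside page text, choose a more distinctive string before relying on a split like this.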

The response is an xsd:base64Binary value that contains the extracted text; Zeep decodes it and returns a Python bytes object. Decode those bytes using utf-8-sig, which treats the byte order mark (BOM; 0xef, 0xbb, 0xbf) as metadata rather than content.
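
The difference between the two codecs can be seen with a few hand-written bytes. This is a standalone sketch; the sample bytes are made up and do not come from DCS.

```python
# Sample bytes: a UTF-8 byte order mark followed by text with a CR/LF line ending.
raw = b"\xef\xbb\xbfHello\r\n"

with_bom = raw.decode("utf-8")         # keeps the BOM as the character U+FEFF
without_bom = raw.decode("utf-8-sig")  # strips the BOM

print(repr(with_bom))
print(repr(without_bom))
```

The utf-8-sig codec also accepts input that has no BOM, so it is a safe default when you are unsure whether one is present.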

Sample code

The following Python code demonstrates how to extract text from a PDF file:

import zeep
import base64

print("Extract text from a PDF file")
# Service URL.
service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"
# WSDL URL.
wsdl_url = service_url + "?WSDL"

# Source file.
sourceFile = "SimplePDFText.pdf"

# Construct the header.
header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()
            ),
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()
            ),
        ]
    ),
)
# Create a header object.
header_value = header(Action=service_url, To=service_url)
# Create client.
client = zeep.Client(wsdl=wsdl_url)

# Create a type factory to construct objects with the namespace prefix ns2 (see the WSDL).
factory = client.type_factory("ns2")
# Create a type factory to construct objects with the namespace prefix ns3 (see the WSDL).
factory2 = client.type_factory("ns3")

# Create the OpenOptions object with minimum settings.
open_options = factory.OpenOptions(OriginalFileName = sourceFile, FileExtension = "pdf")

# Create the TextExtractSettings object with only the page range set.
text_extract_settings = factory2.TextExtractSettings(PageRange = "*")

# Load the source file as a Base64 string.
with open(sourceFile, "rb") as pdf_file:
    encoded_string = base64.b64encode(pdf_file.read()).decode('utf-8')

# Call the ExtractText method with the required parameters.
result = client.service.ExtractText(encoded_string, open_options, text_extract_settings)

# Write the extracted text to a file.
with open("SimplePDFText.txt", "w", encoding="utf-8") as f:
    # Decode the result as utf-8-sig. The "sig" (signature) variant treats the
    # byte order mark (0xef, 0xbb, 0xbf) as metadata rather than content.
    f.write(result.decode("utf-8-sig"))

# Print the extracted text to the console.
# Use print(result) to see the BOM and CR/LF as escape sequences.
print(result.decode("utf-8-sig"))
print("Done")

Troubleshooting

Service connection error: Cannot connect to DCS

Ensure DCS is running on localhost:41734
Check that no firewall is blocking the connection
Verify the service URL in your code matches your DCS installation
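
A quick way to run the first check is to test whether anything is listening on the DCS port. The following is a minimal sketch using only the standard library; can_connect is a hypothetical helper, and a successful TCP connection only shows that the port is open, not that the service is healthy.

```python
import socket


def can_connect(host="localhost", port=41734, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("DCS port reachable:", can_connect())
```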

No text extracted: Empty result or blank output

Verify that the PDF contains extractable text (not scanned images without OCR)
Check that the PDF isn’t password-protected or corrupted
Ensure the page range setting is correct (use "*" for all pages)

License error: Text extraction not available

Verify that your DCS license includes text extraction functionality
Check that the license hasn’t expired
Ensure the service is licensed and activated

File access error: Permission denied

Verify that Python has read access to the source PDF file
Check that the output directory has write permissions
Ensure the source file path is correct and the file exists
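
These checks can be run before calling the service. The sketch below is one possible approach, not part of DCS: check_pdf_input is a hypothetical helper, and the signature test assumes the file begins with the usual %PDF- header.

```python
from pathlib import Path


def check_pdf_input(path):
    """Return a list of problems found with the input file (empty if none)."""
    problems = []
    source = Path(path)
    if not source.is_file():
        problems.append(f"{source} does not exist or is not a file")
        return problems
    try:
        # Reading a few bytes confirms that the process has read access.
        with source.open("rb") as handle:
            header = handle.read(5)
    except OSError as error:
        problems.append(f"cannot read {source}: {error}")
        return problems
    if header != b"%PDF-":
        problems.append(f"{source} does not start with the %PDF- signature")
    return problems


if __name__ == "__main__":
    for problem in check_pdf_input("SimplePDFText.pdf"):
        print(problem)
```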

Encoding issues: Garbled text output

Decode the result with utf-8-sig so that a leading byte order mark (BOM) is stripped instead of appearing in the text
Check that the PDF stores its text as standard encoded text (not as embedded images or in fonts with custom encodings)
Verify that the source PDF was created with correct text layers

Large file processing: Slow performance or timeouts

For large PDF files, consider processing specific page ranges instead of all pages
Increase timeout values in your HTTP client configuration
Monitor memory usage when processing multiple large files
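
To process a large document in smaller pieces, you can generate PageRange values in fixed-size chunks and call ExtractText once per chunk. This sketch assumes you already know the total page count from another source; page_range_chunks is a hypothetical helper that only builds the range strings.

```python
def page_range_chunks(total_pages, chunk_size):
    """Return PageRange strings such as "1-50" that together cover all pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # Use a single page number when the chunk covers only one page.
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_range_chunks(120, 50))
```

If requests still time out, Zeep's Transport class accepts timeout and operation_timeout arguments, and a configured Transport can be passed to zeep.Client through its transport parameter.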

What’s next

Now that you can extract text from PDFs with Python, explore these related document processing capabilities:

Table extraction - Learn how to extract PDF tables using Python for structured data processing
Complete Python setup - Review the guide on using Document Converter Services with Python for more features
