Extract PDF text with Python
This guide demonstrates how to extract searchable text from PDF documents using Python and Nutrient Document Converter Services (DCS). Text extraction converts PDF content into plain text format, making it accessible for analysis, indexing, and integration workflows.
Common use cases
PDF text extraction is useful for:
- Content analysis - Extract text for search indexing and content management systems
- Data processing - Convert PDF reports into structured text for analysis and reporting
- Document migration - Extract content when migrating from PDF to other formats
- Compliance workflows - Extract text for regulatory review and archival processes
- Accessibility improvements - Generate text versions of PDF documents for screen readers
The sample code in this guide was developed using Visual Studio 2022, but you can run it in any Python environment with access to the Zeep library.
Zeep is a Python SOAP client. It reads the Web Services Description Language (WSDL) document published by a service, which defines how to call the web service's operations and describes the data structures they return. DCS provides WSDL definitions for text extraction and other operations.
Prerequisites
Before extracting text from PDFs, ensure you have:
- Python 3.x installed on your system
- The Zeep library installed (pip install zeep)
- Nutrient Document Converter Services (DCS) running locally on port 41734
- Valid DCS license for text extraction functionality
- PDF files with extractable text (not scanned images without OCR)
- Basic understanding of Python programming and web services
- Appropriate file system permissions for reading input files and writing output
For initial DCS setup with Python, refer to the "Using Document Converter Services with Python" guide.
WSDL
Zeep extracts the following WSDL definitions:
ns1:ExtractText(sourceFile: xsd:base64Binary, openOptions: ns2:OpenOptions, textExtractSettings: ns3:TextExtractSettings)
ns1:ExtractTextResponse(ExtractTextResult: xsd:base64Binary)
...
ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)
...
ns3:TextExtractSettings(PageRange: xsd:string, PageSeparator: xsd:string, PageSeparatorPlacement: ns3:PageSeparatorPlacement)
The ExtractText method requires three parameters:
- sourceFile: xsd:base64Binary
- openOptions: ns2:OpenOptions
- textExtractSettings: ns3:TextExtractSettings
The sourceFile parameter must be a Base64-encoded binary representation of the document, following W3C XML schema standards.
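As a minimal sketch of the Base64 step, the snippet below turns raw document bytes into Base64 text and back again. The helper names encode_for_soap and decode_from_soap are hypothetical and are not part of DCS or Zeep. Some SOAP clients encode xsd:base64Binary values themselves, so check how your client handles this type before encoding manually.

```python
import base64


def encode_for_soap(data: bytes) -> str:
    """Return the Base64 text form of raw document bytes (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


def decode_from_soap(text: str) -> bytes:
    """Reverse of encode_for_soap; handy for checking a round trip."""
    return base64.b64decode(text)


# Stand-in bytes; in practice these come from open(path, "rb").read().
sample = b"%PDF-1.7 example bytes"
encoded = encode_for_soap(sample)
round_trip_ok = decode_from_soap(encoded) == sample
```

Reading the file in binary mode ("rb") matters here: Base64 operates on bytes, and text mode could alter line endings on some platforms.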
Use Zeep type factories to instantiate the custom DCS types: OpenOptions (under ns2) and TextExtractSettings (under ns3).
The OpenOptions type requires basic configuration, such as the file name and extension.
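One way to avoid hard-coding these two values is to derive them from the file path. The sketch below assumes the extension is passed without a leading dot, as in this guide's sample code; open_options_fields is a hypothetical helper, not a DCS or Zeep function.

```python
from pathlib import Path


def open_options_fields(path: str) -> dict:
    """Derive OriginalFileName and FileExtension from a file path.

    Assumption: FileExtension is given in lowercase without a leading dot.
    """
    p = Path(path)
    return {
        "OriginalFileName": p.name,
        "FileExtension": p.suffix.lstrip(".").lower(),
    }


fields = open_options_fields("reports/SimplePDFText.pdf")
```

The resulting dictionary could then be unpacked into the type factory call, for example factory.OpenOptions(**fields), where factory is the ns2 type factory.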
The TextExtractSettings object supports several configuration options:
- PageRange: Specify pages to extract (e.g., "1-5", "1,3,5", or "*" for all pages)
- PageSeparator: Character(s) to insert between pages in the output
- PageSeparatorPlacement: Controls where page separators are placed in the extracted text
For most use cases, setting PageRange to "*" extracts text from all pages in the document.
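If the pages to extract come from application logic rather than a fixed string, a small formatter can build the PageRange value. This is a sketch that only produces the "*" and comma-separated forms listed above; format_page_range is a hypothetical helper, and it assumes page numbers start at 1.

```python
from typing import Iterable, Optional


def format_page_range(pages: Optional[Iterable[int]] = None) -> str:
    """Build a PageRange string: "*" for all pages, else "1,3,5" style."""
    if pages is None:
        return "*"
    unique = sorted(set(pages))
    if not unique:
        raise ValueError("pages must contain at least one page number")
    if unique[0] < 1:
        # Assumption: pages are numbered from 1, as the examples suggest.
        raise ValueError("page numbers start at 1")
    return ",".join(str(p) for p in unique)


all_pages = format_page_range()
some_pages = format_page_range([5, 1, 3, 3])
```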
The response carries the extracted text as xsd:base64Binary data. Decode the resulting bytes using utf-8-sig, which treats a leading byte order mark (BOM: 0xEF, 0xBB, 0xBF) as metadata rather than content.
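The difference between the two codecs is easy to see on a small byte string. The example below uses made-up sample bytes, not real DCS output, and compares plain utf-8 decoding with utf-8-sig.

```python
import codecs

# Made-up stand-in for extracted text that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + "Page one text\r\n".encode("utf-8")

# Plain utf-8 keeps the BOM as an invisible first character (U+FEFF).
with_bom = raw.decode("utf-8")

# utf-8-sig drops a leading BOM if present and is harmless if it is absent.
without_bom = raw.decode("utf-8-sig")
no_bom_input = b"plain".decode("utf-8-sig")
```

Because utf-8-sig also accepts input without a BOM, it is a safe default when you cannot tell in advance whether the marker is present.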
Sample code
The following Python code demonstrates how to extract text from a PDF file:
import zeep
import base64

print("Extract text from a PDF file")
# Service URL.
service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"
# WSDL URL.
wsdl_url = service_url + "?WSDL"

# Source file.
sourceFile = "SimplePDFText.pdf"

# Construct the header.
header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()
            ),
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()
            ),
        ]
    ),
)
# Create a header value object.
header_value = header(Action=service_url, To=service_url)
# Create client.
client = zeep.Client(wsdl=wsdl_url)

# Create a type factory to construct objects with the prefix ns2 (see the WSDL).
factory = client.type_factory("ns2")
# Create a type factory to construct objects with the prefix ns3 (see the WSDL).
factory2 = client.type_factory("ns3")

# Create the OpenOptions object with minimum settings.
open_options = factory.OpenOptions(OriginalFileName=sourceFile, FileExtension="pdf")

# Create the TextExtractSettings with only the page range.
text_extract_settings = factory2.TextExtractSettings(PageRange="*")

# Load the source file as a Base64 string.
with open(sourceFile, "rb") as pdf_file:
    encoded_string = base64.b64encode(pdf_file.read()).decode("utf-8")

# Call the ExtractText method with the required parameters.
result = client.service.ExtractText(encoded_string, open_options, text_extract_settings)

# Write the extracted text as a file.
with open("SimplePDFText.txt", "w", encoding="utf-8") as f:
    # Decode the result as utf-8-sig. The "sig" stands for signature, which treats
    # the byte order mark (0xEF, 0xBB, 0xBF) as metadata rather than content.
    f.write(result.decode("utf-8-sig"))

# Write the extracted text to the display.
# Use print(result) to see the BOM and CR/LF as characters.
print(result.decode("utf-8-sig"))
print("Done")
Troubleshooting
Service connection error: Cannot connect to DCS
- Ensure DCS is running on localhost:41734
- Check that no firewall is blocking the connection
- Verify the service URL in your code matches your DCS installation
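A quick way to separate network problems from SOAP or licensing problems is to test whether anything is listening on the service port before creating the Zeep client. The following is a minimal sketch using only the standard library; is_port_open is a hypothetical helper, and a successful connection only shows that some process accepts connections on that port, not that it is DCS.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Port 41734 is the local DCS port used throughout this guide.
    if not is_port_open("localhost", 41734):
        print("Nothing is accepting connections on localhost:41734")
```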
No text extracted: Empty result or blank output
- Verify that the PDF contains extractable text (not scanned images without OCR)
- Check that the PDF isn’t password-protected or corrupted
- Ensure the page range setting is correct (use "*" for all pages)
License error: Text extraction not available
- Verify that your DCS license includes text extraction functionality
- Check that the license hasn’t expired
- Ensure the service is licensed and activated
File access error: Permission denied
- Verify that Python has read access to the source PDF file
- Check that the output directory has write permissions
- Ensure the source file path is correct and the file exists
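These checks can be run up front so that a path or permission problem is reported before any call to the service. The sketch below is one possible pre-flight check; check_paths is a hypothetical helper, and os.access reflects the permissions of the current process only.

```python
import os
from pathlib import Path


def check_paths(source: str, output: str) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    src = Path(source)
    if not src.is_file():
        problems.append(f"Source file not found: {src}")
    elif not os.access(src, os.R_OK):
        problems.append(f"Source file is not readable: {src}")

    # The output file may not exist yet, so check its parent directory.
    out_dir = Path(output).resolve().parent
    if not out_dir.is_dir():
        problems.append(f"Output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"Output directory is not writable: {out_dir}")

    return problems
```

For example, check_paths("SimplePDFText.pdf", "SimplePDFText.txt") returns an empty list when both the source file and the output location are usable.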
Encoding issues: Garbled text output
- Use utf-8-sig encoding when decoding the result to handle byte order marks
- Check that the PDF uses standard text encoding (not custom fonts or embedded images)
- Verify that the source PDF was created with correct text layers
Large file processing: Slow performance or timeouts
- For large PDF files, consider processing specific page ranges instead of all pages
- Increase timeout values in your HTTP client configuration
- Monitor memory usage when processing multiple large files
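To process a large document in smaller requests, the page range can be split into consecutive chunks, with one ExtractText call per chunk. The sketch below only generates the PageRange strings; page_range_chunks is a hypothetical helper, it assumes you already know the total page count from another source, and it assumes a single page number is a valid range.

```python
def page_range_chunks(total_pages: int, chunk_size: int) -> list:
    """Split pages 1..total_pages into "start-end" PageRange strings."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")

    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single number (assumed valid).
        ranges.append(f"{start}-{end}" if end > start else str(start))
    return ranges


chunks = page_range_chunks(120, 50)
```

Each string can be passed as PageRange in a separate TextExtractSettings object. For timeouts, Zeep's Transport class accepts timeout and operation_timeout arguments and can be passed to zeep.Client through its transport parameter; suitable values depend on your documents and server.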
What’s next
Now that you can extract text from PDFs with Python, explore these related document processing capabilities:
- Table extraction - Learn how to extract PDF tables using Python for structured data processing
- Complete Python setup - Review the comprehensive "Using Document Converter Services with Python" guide for more features