---
title: "Extract PDF text with Python | Nutrient DCS"
canonical_url: "https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-text-using-python/"
md_url: "https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-text-using-python.md"
last_updated: "2026-05-25T16:07:03.331Z"
description: "Extract text from PDF files using Python and Nutrient Document Converter Services. Complete code example with Zeep library integration and troubleshooting steps."
---

This guide demonstrates how to extract searchable text from PDF documents using Python and Nutrient Document Converter Services (DCS). Text extraction converts PDF content into plain text format, making it accessible for analysis, indexing, and integration workflows.

## Common use cases

PDF text extraction is useful for:

- **Content analysis** - Extract text for search indexing and content management systems

- **Data processing** - Convert PDF reports into structured text for analysis and reporting

- **Document migration** - Extract content when migrating from PDF to other formats

- **Compliance workflows** - Extract text for regulatory review and archival processes

- **Accessibility improvements** - Generate text versions of PDF documents for screen readers

The sample code in this guide was developed using Visual Studio 2022, but you can run it in any Python environment with access to the [Zeep library](https://docs.python-zeep.org/en/master/in_depth.html#).

Zeep is a SOAP client that reads a service's Web Services Description Language (WSDL) document, which defines how to call the web service operations and describes the data structures they return. Nutrient Document Converter Services (DCS) publishes a WSDL document that covers text extraction and other operations.

## Prerequisites

Before extracting text from PDFs, ensure you have:

- Python 3.x installed on your system

- The Zeep library installed (`pip install zeep`)

- Nutrient Document Converter Services (DCS) running locally on port 41734

- Valid DCS license for text extraction functionality

- PDF files with extractable text (not scanned images without OCR)

- Basic understanding of Python programming and web services

- Appropriate file system permissions for reading input files and writing output

For initial DCS setup with Python, refer to the [using Document Converter Services with Python](https://www.nutrient.io/guides/document-converter/document-converter-services/dcs-with-python.md) guide.

## WSDL

Zeep extracts the following WSDL definitions:

```python
ns1:ExtractText(sourceFile: xsd:base64Binary, openOptions: ns2:OpenOptions, textExtractSettings: ns3:TextExtractSettings)
ns1:ExtractTextResponse(ExtractTextResult: xsd:base64Binary)...
ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)...
ns3:TextExtractSettings(PageRange: xsd:string, PageSeparator: xsd:string, PageSeparatorPlacement: ns3:PageSeparatorPlacement)
```

The `ExtractText` method requires three parameters:

- `sourceFile: xsd:base64Binary`

- `openOptions: ns2:OpenOptions`

- `textExtractSettings: ns3:TextExtractSettings`

The `sourceFile` parameter must be a Base64-encoded binary representation of the document, as defined by the W3C XML Schema `xsd:base64Binary` type.

Use Zeep type factories to instantiate the custom DCS types: `OpenOptions` (under `ns2`) and `TextExtractSettings` (under `ns3`).

The `OpenOptions` type requires basic configuration, such as the file name and extension.

The `TextExtractSettings` object supports several configuration options:

- **PageRange**: Specify pages to extract (e.g., "1-5", "1,3,5", or "*" for all pages)

- **PageSeparator**: Character(s) to insert between pages in the output

- **PageSeparatorPlacement**: Controls where page separators are placed in the extracted text

For most use cases, setting `PageRange` to "*" extracts text from all pages in the document.
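As a minimal sketch of how a `PageRange` string could be assembled from a list of page numbers, the hypothetical helper below (`build_page_range` is not part of DCS or Zeep) returns only the three forms listed above: `"*"`, a single range such as `"1-5"`, or a comma-separated list such as `"1,3,5"`. It assumes page numbers start at 1.

```python
def build_page_range(pages=None):
    """Return a PageRange string in one of the forms described above.

    Hypothetical helper: assumes 1-based page numbers.
    """
    # No explicit selection means "all pages".
    if not pages:
        return "*"

    # Sort and remove duplicates so the output is stable.
    ordered = sorted(set(pages))
    if ordered[0] < 1:
        raise ValueError("page numbers are assumed to start at 1")

    first, last = ordered[0], ordered[-1]

    # A contiguous run of more than one page collapses to "first-last".
    if len(ordered) > 1 and last - first + 1 == len(ordered):
        return f"{first}-{last}"

    # Anything else is written as a comma-separated list.
    return ",".join(str(page) for page in ordered)


print(build_page_range())                 # *
print(build_page_range([1, 2, 3, 4, 5]))  # 1-5
print(build_page_range([5, 1, 3]))        # 1,3,5
```

The returned string could then be passed as the `PageRange` value when constructing `TextExtractSettings`.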

The response contains the extracted text as Base64-encoded binary data. Decode the resulting bytes with the `utf-8-sig` codec, which treats a leading byte order mark (BOM: 0xef, 0xbb, 0xbf) as metadata rather than content.
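The difference between the `utf-8` and `utf-8-sig` codecs can be illustrated with the standard library alone. The sample bytes below are made up for the demonstration and do not come from DCS; `decode_extracted_text` is a hypothetical helper name.

```python
import codecs


def decode_extracted_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading byte order mark if present."""
    return raw.decode("utf-8-sig")


# Made-up sample: a UTF-8 BOM followed by one line of text ending in CR/LF.
sample = codecs.BOM_UTF8 + "Page one\r\n".encode("utf-8")

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the string.
plain = sample.decode("utf-8")
print(plain.startswith("\ufeff"))  # True

# utf-8-sig strips the BOM and leaves the text unchanged.
clean = decode_extracted_text(sample)
print(repr(clean))  # 'Page one\r\n'
```

Input without a BOM decodes the same way with both codecs, so `utf-8-sig` is safe to use in either case.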

## Sample code

The following Python code demonstrates how to extract text from a PDF file:

```python
import base64

import zeep

print("Extract text from a PDF file")

# Service URL.
service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"

# WSDL URL.
wsdl_url = service_url + "?WSDL"

# Source file.
source_file = "SimplePDFText.pdf"

# Construct the header.
header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()
            ),
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()
            ),
        ]
    ),
)

# Create a header object.
header_value = header(Action=service_url, To=service_url)

# Create client.
client = zeep.Client(wsdl=wsdl_url)

# Create a type factory to construct objects with the prefix ns2 (see the WSDL).
factory = client.type_factory("ns2")

# Create a type factory to construct objects with the prefix ns3 (see the WSDL).
factory2 = client.type_factory("ns3")

# Create the OpenOptions object with minimum settings.
open_options = factory.OpenOptions(OriginalFileName=source_file, FileExtension="pdf")

# Create the TextExtractSettings with only the page range.
text_extract_settings = factory2.TextExtractSettings(PageRange="*")

# Load the source file as a Base64 string.
with open(source_file, "rb") as pdf_file:
    encoded_string = base64.b64encode(pdf_file.read()).decode("utf-8")

# Call the ExtractText method with the required parameters.
result = client.service.ExtractText(encoded_string, open_options, text_extract_settings)

# Decode the result as utf-8-sig. The "sig" stands for signature: the codec treats
# the byte order mark (0xef, 0xbb, 0xbf) as metadata rather than content.
text = result.decode("utf-8-sig")

# Write the extracted text to a file. newline="" keeps the CR/LF line endings
# in the extracted text as they are instead of translating them again.
with open("SimplePDFText.txt", "w", encoding="utf-8", newline="") as f:
    f.write(text)

# Write the extracted text to the display.
# Use print(result) to see BOM and CR/LF as characters.
print(text)
print("Done")
```

## Troubleshooting

**Service connection error: Cannot connect to DCS**

- Ensure DCS is running on `localhost:41734`

- Check that no firewall is blocking the connection

- Verify the service URL in your code matches your DCS installation
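A quick way to separate network problems from SOAP problems is to test the TCP port before creating the Zeep client. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and the default host and port are the ones used in this guide.

```python
import socket


def is_port_open(host="localhost", port=41734, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts connections at the address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `is_port_open()` returns `False`, the service is probably not running or is blocked; if it returns `True` but the SOAP call still fails, the service URL or WSDL path is the more likely cause.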

**No text extracted: Empty result or blank output**

- Verify that the PDF contains extractable text (not scanned images without OCR)

- Check that the PDF isn’t password-protected or corrupted

- Ensure the page range setting is correct (use "*" for all pages)

**License error: Text extraction not available**

- Verify that your DCS license includes text extraction functionality

- Check that the license hasn’t expired

- Ensure the service is licensed and activated

**File access error: Permission denied**

- Verify that Python has read access to the source PDF file

- Check that the output directory has write permissions

- Ensure the source file path is correct and the file exists
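These checks can be run before calling the service. The following is a minimal sketch using the standard library; `check_paths` is a hypothetical helper, and `os.access` only reports what the operating system claims, so a later read or write can still fail.

```python
import os
from pathlib import Path


def check_paths(source, output):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # The source PDF must exist and be readable.
    src = Path(source)
    if not src.is_file():
        problems.append(f"source file not found: {src}")
    elif not os.access(src, os.R_OK):
        problems.append(f"source file is not readable: {src}")

    # The directory that will hold the output file must exist and be writable.
    out_dir = Path(output).resolve().parent
    if not out_dir.is_dir():
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    return problems
```

For example, `check_paths("SimplePDFText.pdf", "SimplePDFText.txt")` returns an empty list when both paths look usable.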

**Encoding issues: Garbled text output**

- Use the `utf-8-sig` codec when decoding the result so that a leading byte order mark is removed

- Check that the PDF text uses a standard encoding; text set in custom-encoded fonts or rendered as embedded images may not extract correctly

- Verify that the source PDF was created with correct text layers

**Large file processing: Slow performance or timeouts**

- For large PDF files, consider processing specific page ranges instead of all pages

- Increase timeout values in your HTTP client configuration (with Zeep, the `timeout` and `operation_timeout` arguments of `zeep.transports.Transport`)

- Monitor memory usage when processing multiple large files
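Processing a large document in page ranges, as suggested in the first point above, could be sketched as follows. `page_batches` is a hypothetical helper; it assumes the total page count is already known from another source and that page numbers start at 1.

```python
def page_batches(total_pages, batch_size):
    """Split pages 1..total_pages into consecutive PageRange strings."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be at least 1")

    ranges = []
    for start in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size - 1, total_pages)
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_batches(25, 10))  # ['1-10', '11-20', '21-25']
```

Each string could be used as the `PageRange` value of a separate `ExtractText` call, with the decoded results concatenated in order.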

## What’s next

Now that you can extract text from PDFs with Python, explore these related document processing capabilities:

- **Table extraction** - Learn how to [extract PDF tables using Python](https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-tables/extract-table-using-python.md) for structured data processing

- **Complete Python setup** - Review the comprehensive [using Document Converter Services with Python](https://www.nutrient.io/guides/document-converter/document-converter-services/dcs-with-python.md) guide for more features

---

## Related pages

- [Extract PDF attachments with C#](/guides/document-converter/document-converter-services/extraction/extract-attachments-from-pdf.md)

