Extracting text from PDF documents

PDF-to-text extraction pulls readable content from a static document while preserving its spatial arrangement. Layout-aware extraction keeps columns, indentation, and table alignment intact, so the output matches what readers see on the page.

Use programmatic extraction to:

Index large document libraries for search.
Send structured text to data pipelines and language models.
Reuse report and statement content without manual retyping.

Extract PDF text with the Python SDK

You can add layout-preserving text extraction to a Python application with the Nutrient Python SDK. The SDK extracts text directly from PDFs, so you don’t need external tools for this workflow.

Prepare the project

Start by importing the Nutrient Python SDK classes:

from nutrient_sdk import Document
from nutrient_sdk import NutrientException

Load the PDF document

This guide uses the Document class. Open the document in a with statement so that Python’s context manager handles the document instance lifecycle.

The SDK can load a source file from a file path or a stream. This guide uses a file path:

def main():
    try:
        with Document.open("input.pdf") as document:

The path can be absolute or relative. This example loads the file from the application’s working directory.

Extract layout-preserving text

Call export_as_text to extract the document text into a plain-text file. The method maps each word to a character grid that mirrors its position on the page:

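            # Write the layout-preserving text to a plain-text file.
            document.export_as_text("output.txt")
            print("Successfully extracted to output.txt")
    except NutrientException as e:
        # Report SDK failures, such as a file that cannot be loaded.
        print(f"Error: {e}")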


if __name__ == "__main__":
    main()

The export_as_text method analyzes the PDF text content and the position of each word, then reconstructs the page in plain text. Words that sit close together join with single spaces, large horizontal gaps become proportional whitespace that preserves columns and tab stops, and vertical gaps between lines produce blank lines. The result reads like the original page while staying in a portable format.
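To make the character-grid idea concrete, here is a minimal sketch of the general technique in plain Python. It is not the SDK's implementation: the Word tuple, the render_grid function, and the char_width and line_height parameters are hypothetical names chosen for illustration, and the word coordinates are written by hand rather than read from a PDF.

```python
from collections import defaultdict
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical word box: text plus its top-left position in points."""
    text: str
    x: float
    y: float


def render_grid(words, char_width=6.0, line_height=12.0):
    """Place words on a character grid that approximates page position."""
    rows = defaultdict(list)
    for word in words:
        # Quantize the vertical position to a row index.
        rows[round(word.y / line_height)].append(word)
    if not rows:
        return ""

    lines = []
    # Rows with no words stay empty, so vertical gaps become blank lines.
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for word in sorted(rows.get(row, []), key=lambda w: w.x):
            # Quantize the horizontal position to a column index.
            col = round(word.x / char_width)
            if line and col <= len(line):
                # The word touches the previous one: use a single space.
                col = len(line) + 1
            # Pad with spaces up to the column, then append the word.
            line = line.ljust(col) + word.text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Word("Item", 0, 0), Word("Price", 120, 0),
        Word("Blue", 0, 12), Word("pen", 30, 12), Word("1.50", 120, 12),
        Word("Total", 0, 36), Word("1.50", 120, 36),
    ]
    print(render_grid(sample))
```

In this sketch, "Blue" and "pen" sit close together and end up separated by one space, the price column lines up in every row because each price has the same x position, and the empty row before "Total" comes out as a blank line.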

The method handles these PDF content types:

Flowing text.
Multi-column layouts.
Tables and aligned data.
Mixed content layouts.
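Because wide horizontal gaps survive as runs of spaces, aligned rows in the output can often be split back into fields for a data pipeline. The following is a rough sketch under that assumption, using only the Python standard library. The split_columns and parse_rows helpers are hypothetical and not part of the SDK, and the two-space threshold is a guess that will misfire when a cell is empty or when columns sit less than two characters apart.

```python
import re


def split_columns(line, min_gap=2):
    """Split one extracted line into cells at runs of min_gap or more spaces."""
    stripped = line.strip()
    if not stripped:
        return []
    return re.split(r" {%d,}" % min_gap, stripped)


def parse_rows(text):
    """Return the non-blank lines of text as lists of cells."""
    rows = (split_columns(line) for line in text.splitlines())
    return [row for row in rows if row]
```

To try it on the file written earlier, read output.txt into a string and pass it to parse_rows. Single spaces inside a cell, such as "Blue pen", stay intact because only wider gaps count as column breaks.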

Handle errors

The Nutrient Python SDK reports errors through exceptions. The methods in this guide raise a NutrientException when they fail. Catch this exception to troubleshoot issues and to implement error handling logic.
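As one way to apply this, the sketch below keeps a batch running when a single file fails. The extract_all helper and its parameters are hypothetical. The extraction step and the exception type are passed in as arguments, so the pattern makes no assumptions about the SDK beyond the calls this guide already uses.

```python
def extract_all(pdf_paths, extract_one, error_type=Exception):
    """Run extract_one(path) for each path and collect failures.

    Returns (succeeded, failed), where failed maps each path to its
    error message. Exceptions that are not error_type still propagate.
    """
    succeeded, failed = [], {}
    for path in pdf_paths:
        try:
            extract_one(path)
        except error_type as exc:
            # Record the failure and continue with the next file.
            failed[path] = str(exc)
        else:
            succeeded.append(path)
    return succeeded, failed
```

With the SDK, extract_one could be a small function that opens the path with Document.open and calls export_as_text, and error_type would be NutrientException. Any other exception type is left uncaught on purpose, so programming errors still surface.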

Conclusion

You’ve extracted layout-preserving text from a PDF document. The extracted content is ready for search indexing, data pipelines, and downstream processing. You can also download the sample package to explore text extraction with the Python SDK.
