Extracting text from PDF documents

PDF-to-text extraction pulls readable content from a static document while preserving its spatial arrangement. Layout-aware extraction keeps columns, indentation, and table alignment intact, so the output matches what readers see on the page.

Use programmatic extraction to:

Index large document libraries for search.
Send structured text to data pipelines and language models.
Reuse report and statement content without manual retyping.

Extract PDF text with the Java SDK

You can add layout-preserving text extraction to a Java application with the Nutrient Java SDK. The SDK extracts text directly from PDFs, so you don’t need external tools for this workflow.

Prepare the project

Start by declaring a package name:

package io.nutrient.Sample;

Import the Nutrient Java SDK classes you use, or use a wildcard import if your project requires it, then declare the class:

import io.nutrient.sdk.Document;
import io.nutrient.sdk.exceptions.NutrientException;

public class PdfToText {

Create the main method and declare that it can throw a NutrientException. You can catch this exception in your program logic to handle errors:

    public static void main(String[] args) throws NutrientException {

After you set up the Java application, add the SDK-specific extraction logic.

Load the PDF document

This guide uses the Document class. Open the Document in a try-with-resources statement so Java closes the document instance automatically when the block exits.

The SDK can load a source file from a file path or a stream. This guide uses a file path:

        try (Document document = Document.open("input.pdf")) {

The path can be absolute or relative. The relative path in this example resolves against the application’s working directory, which is the directory the JVM was launched from and not necessarily the location of the JAR or class files.
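To see where a relative path will resolve before handing it to the SDK, a small check with java.nio.file can help. The following is a minimal sketch that uses only the Java standard library; the class name InputPathCheck and its helper methods are hypothetical and not part of the Nutrient Java SDK.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class InputPathCheck {

    /** Resolves a possibly relative path against the JVM's current working directory. */
    static Path resolve(String name) {
        return Paths.get(name).toAbsolutePath().normalize();
    }

    /** Returns true when the path points to an existing, readable regular file. */
    static boolean isReadableFile(Path path) {
        return Files.isRegularFile(path) && Files.isReadable(path);
    }

    public static void main(String[] args) {
        Path input = resolve("input.pdf");
        System.out.println("Working directory: " + Paths.get("").toAbsolutePath());
        System.out.println("Resolved input:    " + input);
        System.out.println("Readable file:     " + isReadableFile(input));
    }
}
```

Running this from the same directory you launch the application from shows the absolute location that "input.pdf" refers to, which helps when a file appears to be missing.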

Extract layout-preserving text

Call exportAsText to extract the document text into a plain-text file. The method maps each word to a character grid that mirrors its position on the page:

            document.exportAsText("output.txt");
        }
    }
}

The exportAsText method analyzes the PDF text content and the position of each word, then reconstructs the page in plain text. Words that sit close together join with single spaces, large horizontal gaps become proportional whitespace that preserves columns and tab stops, and vertical gaps between lines produce blank lines. The result reads like the original page while staying in a portable format.
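The placement rules described above can be illustrated with a small, self-contained sketch. It is not the SDK’s implementation: the Word class, the fixed 6 pt character width, and the 12 pt line height are assumptions chosen for illustration, and real extraction has to cope with varying fonts and sizes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GridLayoutSketch {

    /** A word and the position of its left edge, in points from the page's top-left corner. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed cell size: one character cell is 6 pt wide, one text line is 12 pt tall.
    static final double CHAR_WIDTH = 6.0;
    static final double LINE_HEIGHT = 12.0;

    static int rowOf(Word word) {
        return (int) Math.round(word.y / LINE_HEIGHT);
    }

    /** Places each word on a character grid and returns the grid as plain text. */
    static String render(List<Word> words) {
        List<Word> sorted = new ArrayList<>(words);
        // Process words row by row, left to right within each row.
        sorted.sort(Comparator.comparingInt((Word w) -> rowOf(w)).thenComparingDouble(w -> w.x));

        List<StringBuilder> rows = new ArrayList<>();
        for (Word word : sorted) {
            int row = rowOf(word);
            int column = (int) Math.round(word.x / CHAR_WIDTH);
            while (rows.size() <= row) {
                rows.add(new StringBuilder()); // skipped rows stay empty and become blank lines
            }
            StringBuilder line = rows.get(row);
            // Keep at least one space between neighbors so close words never merge.
            int start = line.length() == 0 ? column : Math.max(column, line.length() + 1);
            while (line.length() < start) {
                line.append(' '); // a large horizontal gap becomes proportional padding
            }
            line.append(word.text);
        }

        StringBuilder out = new StringBuilder();
        for (StringBuilder line : rows) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
            new Word("Item", 0, 0), new Word("Price", 120, 0),
            new Word("Widget", 0, 12), new Word("4.50", 120, 12),
            new Word("Total", 0, 36), new Word("4.50", 120, 36));
        System.out.print(render(words));
    }
}
```

In this sample, the two columns start at the same character offset on every row, and the vertical gap before the "Total" row shows up as a blank line.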

The method handles these PDF content types:

Flowing text.
Multi-column layouts.
Tables and aligned data.
Mixed content layouts.

Handle errors

The Nutrient Java SDK reports errors through exceptions. The methods in this guide throw a NutrientException if a failure occurs. Catch this exception to troubleshoot issues and implement error handling logic.
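One way to structure that handling is to catch the failure at a single boundary and turn it into a log message and an exit status. The sketch below is a hedged illustration that runs without the SDK: the ExtractionErrorHandling class and the runStep helper are hypothetical, and the failing step is a stand-in. With the SDK on the classpath, the step would open the document and call exportAsText, and the catch clause could name NutrientException instead of the general Exception.

```java
import java.util.concurrent.Callable;

final class ExtractionErrorHandling {

    /**
     * Runs one extraction step and converts any failure into an exit status:
     * 0 when the step completes, 1 when it throws.
     */
    static int runStep(String description, Callable<Void> step) {
        try {
            step.call();
            return 0;
        } catch (Exception e) {
            // Report the failure once, at the boundary, with enough context to troubleshoot.
            System.err.println(description + " failed: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        // Stand-in step that always fails. A real step would wrap the
        // Document.open and exportAsText calls shown earlier in this guide.
        int status = runStep("Text extraction", () -> {
            throw new IllegalStateException("simulated failure");
        });
        System.out.println("Exit status: " + status);
    }
}
```

Keeping the catch in one place means the extraction code itself stays free of error handling, and the caller decides whether to retry, skip the file, or stop.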

Conclusion

You’ve extracted layout-preserving text from a PDF document. The extracted content is ready for search indexing, data pipelines, and downstream processing. If you need ranked passages instead of the full extracted text, refer to the search document text guide. You can also download the sample package to explore text extraction.
