Extract text from PDFs using C#

Nutrient .NET SDK (formerly GdPicture.NET) enables you to extract all text from a PDF document, including both visible and hidden text.

To extract text from a PDF document, follow the steps below:

  1. Create a GdPicturePDF object.
  2. Select the source document by passing its path to the LoadFromFile method.
  3. Optional: Configure text extraction with the SetTextExtractionOptions method. This method takes a member of the TextExtractionOptions enumeration as its parameter. To specify multiple options, combine them with the bitwise OR operator (|).
  4. Create an empty string where you’ll save the output.
  5. Determine the number of pages with the GetPageCount method and loop through them, selecting each page with the SelectPage method.
  6. Extract the text from each page with the GetPageText method and add it to the output string.
  7. Save the output in a new text file with the standard System.IO.StreamWriter class.
  8. Release resources by closing the document with the CloseDocument method.

The example below extracts text from a PDF and saves the output in a TXT file:

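// Import the SDK namespace (GdPicture14 in GdPicture.NET 14; adjust if your version differs).
using GdPicture14;

using GdPicturePDF gdpicturePDF = new GdPicturePDF();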
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Optional: Configure text extraction.
gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching);
// Create an empty string where you'll save the output.
string outputText = "";
// Determine the number of pages and loop through them.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    // Select the page to extract text from.
    gdpicturePDF.SelectPage(page);
    // Extract the text from the page.
    string pageText = gdpicturePDF.GetPageText();
    // Add the extracted text to the output string.
    outputText += $"Page: {page}\n{pageText}\n";
}
// Save the output in a new text file. The using block closes the file even if writing fails.
using (System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt"))
{
    outputFile.WriteLine(outputText);
}
// Release unnecessary resources.
gdpicturePDF.CloseDocument();
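
The loop above builds the output with +=, which copies the whole string on every iteration. For documents with many pages, collecting the page text first and joining it with a StringBuilder is a common alternative. The sketch below is one way to do that. PageTextFormatter and Combine are hypothetical names, not part of the SDK, and the helper assumes the text of each page has already been obtained with GetPageText.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of the SDK): joins per-page text into one string.
public static class PageTextFormatter
{
    // pageTexts[0] holds the text of page 1, pageTexts[1] the text of page 2, and so on.
    public static string Combine(IReadOnlyList<string> pageTexts)
    {
        var builder = new StringBuilder();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Same layout as the example above: a "Page: N" header line, then the page text.
            builder.Append("Page: ").Append(i + 1).Append('\n');
            builder.Append(pageTexts[i]).Append('\n');
        }
        return builder.ToString();
    }
}
```

To use it, add each GetPageText result to a List<string> inside the page loop, then pass the list to PageTextFormatter.Combine and write the returned string to the output file.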