Extract Text from PDFs Using C#

GdPicture.NET enables you to extract all text from a PDF document. Both visible and hidden text are extracted.

To extract text from a PDF document, follow these steps:

  1. Create a GdPicturePDF object.

  2. Select the source document by passing its path to the LoadFromFile method.

  3. Optional: Configure text extraction with the SetTextExtractionOptions method. This method takes members of the TextExtractionOptions enumeration as its parameter. To specify multiple options, combine them with the bitwise OR operator (| in C#, Or in VB.NET).

  4. Create an empty string where you’ll save the output.

  5. Determine the number of pages with the GetPageCount method and loop through them.

  6. Select each page with the SelectPage method, extract its text with the GetPageText method, and add it to the output string.

  7. Save the output in a new text file with the standard System.IO.StreamWriter class.

  8. Release resources that are no longer needed by closing the document with the CloseDocument method.

The example below extracts text from a PDF and saves the output in a TXT file:

using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Optional: Configure text extraction.
gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching);
// Create an empty string where you'll save the output.
string outputText = "";
// Determine the number of pages and loop through them.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    // Extract the text from the page.
    string pageText = gdpicturePDF.GetPageText();
    // Add the extracted text to the output string.
    outputText += $"Page: {page}\n{pageText}\n";
}
// Save the output in a new text file.
// The using block closes the file even if the write throws an exception.
using (System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt"))
{
    outputFile.WriteLine(outputText);
}
// Release unnecessary resources.
gdpicturePDF.CloseDocument();

The same example in VB.NET:

Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Optional: Configure text extraction.
    gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching)
    ' Create an empty string where you'll save the output.
    Dim outputText = ""
    ' Determine the number of pages and loop through them.
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    For page = 1 To pageCount
        gdpicturePDF.SelectPage(page)
        ' Extract the text from the page.
        Dim pageText As String = gdpicturePDF.GetPageText()
        ' Add the extracted text to the output string.
        outputText &= "Page: " & page.ToString() & vbLf & pageText & vbLf
    Next
    ' Save the output in a new text file.
    ' The Using block closes the file even if the write throws an exception.
    Using outputFile As New System.IO.StreamWriter("C:\temp\output.txt")
        outputFile.WriteLine(outputText)
    End Using
    ' Release unnecessary resources.
    gdpicturePDF.CloseDocument()
End Using
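
For documents with many pages, appending to a string inside the loop copies the accumulated text on every iteration. The following is a minimal sketch of the same page loop built on System.Text.StringBuilder instead. The PageTextCombiner class and its Combine method are hypothetical helpers, not part of GdPicture.NET, and the getPageText delegate is an assumed stand-in for the SelectPage and GetPageText calls shown above.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of GdPicture.NET.
public static class PageTextCombiner
{
    // Builds the same "Page: N" output as the example above.
    // getPageText receives a 1-based page number and returns that page's text.
    public static string Combine(int pageCount, Func<int, string> getPageText)
    {
        var builder = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            // Page header line, followed by the page's text.
            builder.Append("Page: ").Append(page).Append('\n');
            builder.Append(getPageText(page)).Append('\n');
        }
        return builder.ToString();
    }
}
```

With the gdpicturePDF object from the example above, the delegate could be written as page => { gdpicturePDF.SelectPage(page); return gdpicturePDF.GetPageText(); }, and the returned string written to the output file as before.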