Extract Text from PDFs Using C#
GdPicture.NET enables you to extract all text from a PDF document. Both visible and hidden text are extracted.
To extract text from a PDF document, follow these steps:
- Create a GdPicturePDF object.
- Select the source document by passing its path to the LoadFromFile method.
- Optional: Configure text extraction with the SetTextExtractionOptions method. This method takes members of the TextExtractionOptions enumeration as its parameter. To specify multiple options, separate them with vertical bar (|) characters.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the GetPageCount method and loop through them.
- Select each page with the SelectPage method, extract its text with the GetPageText method, and add the text to the output string.
- Save the output in a new text file with the standard System.IO.StreamWriter class.
- Release unnecessary resources.
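The optional configuration step combines enumeration members with the | operator. The sketch below shows how that combination works for a standard .NET flags enumeration. It assumes TextExtractionOptions behaves the same way. DemoExtractionOptions and its members are hypothetical stand-ins, not GdPicture.NET names.

```csharp
using System;

// Hypothetical stand-in for a flags enumeration such as TextExtractionOptions.
// The member names are illustrative only and are not GdPicture.NET members.
[Flags]
public enum DemoExtractionOptions
{
    None = 0,
    OptionA = 1,
    OptionB = 2,
    OptionC = 4
}

public static class FlagsDemo
{
    // Combine two options into a single value with the bitwise OR operator.
    public static DemoExtractionOptions Combine() =>
        DemoExtractionOptions.OptionA | DemoExtractionOptions.OptionC;

    public static void Main()
    {
        DemoExtractionOptions options = Combine();
        Console.WriteLine(options);                                        // OptionA, OptionC
        Console.WriteLine(options.HasFlag(DemoExtractionOptions.OptionB)); // False
    }
}
```

Each member occupies its own bit, so the combined value carries every selected option and can be passed as a single argument.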
The example below extracts text from a PDF and saves the output in a TXT file:
C#

    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    // Load the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Optional: Configure text extraction.
    gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpicturePDF.GetPageCount();
    for (int page = 1; page <= pageCount; page++)
    {
        gdpicturePDF.SelectPage(page);
        // Extract the text from the page.
        string pageText = gdpicturePDF.GetPageText();
        // Add the extracted text to the output string.
        outputText += $"Page: {page}\n{pageText}\n";
    }
    // Save the output in a new text file. The using block closes the file even if writing fails.
    using (System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt"))
    {
        outputFile.WriteLine(outputText);
    }
    // Release unnecessary resources.
    gdpicturePDF.CloseDocument();

VB.NET

    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        ' Load the source document.
        gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
        ' Optional: Configure text extraction.
        gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching)
        ' Create an empty string where you'll save the output.
        Dim outputText As String = ""
        ' Determine the number of pages and loop through them.
        Dim pageCount As Integer = gdpicturePDF.GetPageCount()
        For page As Integer = 1 To pageCount
            gdpicturePDF.SelectPage(page)
            ' Extract the text from the page.
            Dim pageText As String = gdpicturePDF.GetPageText()
            ' Add the extracted text to the output string.
            outputText &= "Page: " & page.ToString() & vbLf & pageText & vbLf
        Next
        ' Save the output in a new text file. The Using block closes the file even if writing fails.
        Using outputFile As New System.IO.StreamWriter("C:\temp\output.txt")
            outputFile.WriteLine(outputText)
        End Using
        ' Release unnecessary resources.
        gdpicturePDF.CloseDocument()
    End Using
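For documents with many pages, repeated string concatenation can become slow. The sketch below shows one possible alternative that collects the page text in a System.Text.StringBuilder and writes it with File.WriteAllText. It deliberately leaves out GdPicture.NET calls so that it runs on its own. PageTextWriter, BuildOutput, and the sample page strings are hypothetical, and the pageTexts list stands in for the values that GetPageText returns for each page.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PageTextWriter
{
    // Joins per-page text into one string, labeling each page with its 1-based number.
    // pageTexts stands in for the strings returned by GetPageText for each page.
    public static string BuildOutput(IReadOnlyList<string> pageTexts)
    {
        var builder = new StringBuilder();
        for (int page = 1; page <= pageTexts.Count; page++)
        {
            builder.Append("Page: ").Append(page).Append('\n');
            builder.Append(pageTexts[page - 1]).Append('\n');
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // Placeholder page contents; in practice these would come from the PDF.
        var pages = new List<string> { "First page text", "Second page text" };
        string output = BuildOutput(pages);

        // Hypothetical output location in the system temp directory.
        string path = Path.Combine(Path.GetTempPath(), "output.txt");
        File.WriteAllText(path, output);
        Console.Write(File.ReadAllText(path));
    }
}
```

The output format matches the example above: a "Page: n" line followed by the text of that page.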