Extract text from PDFs using C#
Nutrient .NET SDK (formerly GdPicture.NET) enables you to extract all text from a PDF document. Both visible and hidden text are extracted.
To extract text from a PDF document, follow the steps below:
- Create a GdPicturePDF object.
- Select the source document by passing its path to the LoadFromFile method.
- Optional: Configure text extraction with the SetTextExtractionOptions method. This method takes members of the TextExtractionOptions enumeration as its parameter. To specify multiple options, separate them with vertical bar (|) characters.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the GetPageCount method and loop through them.
- Select each page with the SelectPage method, extract its text with the GetPageText method, and add it to the output string.
- Save the output in a new text file with the standard System.IO.StreamWriter class.
- Release unnecessary resources.
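The vertical bar in the optional step is the C# bitwise OR operator, which is the usual way to combine members of a flags-style enumeration. The sketch below illustrates that mechanism only. It uses a stand-in enumeration, DemoOptions, whose member names are hypothetical, because ExactWordLineMatching is the only TextExtractionOptions member named in this guide. It also assumes TextExtractionOptions behaves like a standard flags enumeration.

```csharp
using System;

// Stand-in for the SDK enumeration; these member names are hypothetical.
[Flags]
enum DemoOptions
{
    None = 0,
    OptionA = 1,
    OptionB = 2,
    OptionC = 4
}

static class FlagsDemo
{
    // Combines two options into a single value with the bitwise OR operator.
    public static DemoOptions Combine(DemoOptions first, DemoOptions second) => first | second;

    static void Main()
    {
        DemoOptions combined = Combine(DemoOptions.OptionA, DemoOptions.OptionC);

        // The combined value carries both flags at once.
        Console.WriteLine(combined.HasFlag(DemoOptions.OptionA)); // True
        Console.WriteLine(combined.HasFlag(DemoOptions.OptionC)); // True
        Console.WriteLine(combined.HasFlag(DemoOptions.OptionB)); // False
    }
}
```

With the real SDK, the combined value would be passed as the single argument to SetTextExtractionOptions. Check the SDK reference for the actual member names.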
The example below extracts text from a PDF and saves the output in a TXT file:
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Optional: Configure text extraction.
gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching);
// Create an empty string where you'll save the output.
string outputText = "";
// Determine the number of pages and loop through them.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    // Extract the text from the page.
    string pageText = gdpicturePDF.GetPageText();
    // Add the extracted text to the output string.
    outputText += $"Page: { page.ToString() }\n{ pageText }\n";
}
// Save the output in a new text file.
System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
outputFile.WriteLine(outputText);
outputFile.Close();
// Release unnecessary resources.
gdpicturePDF.CloseDocument();
' VB.NET version of the same example.
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Optional: Configure text extraction.
    gdpicturePDF.SetTextExtractionOptions(TextExtractionOptions.ExactWordLineMatching)
    ' Create an empty string where you'll save the output.
    Dim outputText = ""
    ' Determine the number of pages and loop through them.
    Dim pageCount As Integer = gdpicturePDF.GetPageCount()
    For page = 1 To pageCount
        gdpicturePDF.SelectPage(page)
        ' Extract the text from the page.
        Dim pageText As String = gdpicturePDF.GetPageText()
        ' Add the extracted text to the output string.
        outputText += "Page: " & page.ToString() & vbLf & pageText & vbLf
    Next
    ' Save the output in a new text file.
    Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
    outputFile.WriteLine(outputText)
    outputFile.Close()
    ' Release unnecessary resources.
    gdpicturePDF.CloseDocument()
End Using
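Appending to a string with += allocates a new string on every iteration, which can become slow for documents with many pages. As a minimal sketch of one alternative, the code below accumulates the output with System.Text.StringBuilder and writes it with System.IO.File.WriteAllText. The class name PageTextWriter, the helper BuildOutput, and the sample page strings are hypothetical stand-ins. In a real program, the per-page strings would come from GetPageText.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class PageTextWriter
{
    // Joins per-page text into one string, labeling each page as "Page: N".
    public static string BuildOutput(IReadOnlyList<string> pageTexts)
    {
        var builder = new StringBuilder();
        for (int page = 1; page <= pageTexts.Count; page++)
        {
            builder.Append("Page: ").Append(page).Append('\n');
            builder.Append(pageTexts[page - 1]).Append('\n');
        }
        return builder.ToString();
    }

    static void Main()
    {
        // Hypothetical sample data standing in for the text extracted per page.
        var pages = new List<string> { "First page text", "Second page text" };
        string output = BuildOutput(pages);

        // Write the result in one call; the file is closed automatically.
        string path = Path.Combine(Path.GetTempPath(), "output.txt");
        File.WriteAllText(path, output);
        Console.Write(output);
    }
}
```

File.WriteAllText opens, writes, and closes the file in a single call, so no explicit Close is needed.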