Read text from PDFs and images in C#
This guide explains how to read text from a PDF or an image file. Text in a PDF or an image is sometimes stored in a form that can’t be searched or copied, for example, as a scanned page. The optical character recognition (OCR) engine of Nutrient .NET SDK (formerly GdPicture.NET) recognizes this text so you can save it in a separate file, where it can be searched, copied, and pasted.
Reading text from a PDF
To read text from a PDF, follow the steps below:
- Create a `GdPicturePDF` object and a `GdPictureOCR` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine recognizes only the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the `GetPageCount` method of the `GdPicturePDF` object and loop through them.
- Select each page with the `SelectPage` method of the `GdPicturePDF` object.
- Render each page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with its `SetImage` method.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object, and append it to the output string.
- Release the image with the `DisposeImage` method of the `GdPictureDocumentUtilities` class, and release the OCR result with the `ReleaseOCRResult` method of the `GdPictureOCR` object.
- After reading all the pages, save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
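The optional settings from the configuration step aren’t used in the full example, so here is a minimal sketch of how they might be set. It assumes the `GdPicture14` namespace, that the `OCRMode` enumeration has a `FavorAccuracy` member, and that both character lists are plain strings; none of these is stated in this guide, so check them against the SDK reference. The digit-only allowlist is an illustrative value.

```csharp
using GdPicture14; // Assumed namespace for GdPicture.NET 14.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
// Required settings, as described in the steps above.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Optional: prioritize accuracy over speed.
// `FavorAccuracy` is an assumed member name; confirm it in the SDK reference.
gdpictureOCR.OCRMode = OCRMode.FavorAccuracy;
// Optional: recognize digits only (illustrative allowlist, assumed to be a string).
gdpictureOCR.CharacterSet = "0123456789";
// Alternative to the allowlist: exclude specific characters instead.
// gdpictureOCR.CharacterBlackList = "|_~";
```

An allowlist tends to suit input with a known alphabet, such as invoice numbers, while a denylist suits free text where only a few characters are commonly misread.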
The example below reads text from a PDF and saves the output in a TXT file:
C#:

    using GdPicturePDF gdpicturePDF = new GdPicturePDF();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the source document.
    gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
    // Configure the OCR process.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpicturePDF.GetPageCount();
    for (int page = 1; page <= pageCount; page++)
    {
        gdpicturePDF.SelectPage(page);
        // Render the page to a 300 DPI image.
        int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
        // Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId);
        // Run the OCR process.
        string resultId = gdpictureOCR.RunOCR();
        // Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId);
        // Release the image and the OCR result.
        GdPictureDocumentUtilities.DisposeImage(imageId);
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpicturePDF.CloseDocument();

VB.NET:

    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
        ' Select the source document.
        gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
        ' Configure the OCR process.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Create an empty string where you'll save the output.
        Dim outputText = ""
        ' Determine the number of pages and loop through them.
        Dim pageCount As Integer = gdpicturePDF.GetPageCount()
        For page = 1 To pageCount
            gdpicturePDF.SelectPage(page)
            ' Render the page to a 300 DPI image.
            Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
            ' Pass the image to the `GdPictureOCR` object.
            gdpictureOCR.SetImage(imageId)
            ' Run the OCR process.
            Dim resultId As String = gdpictureOCR.RunOCR()
            ' Get the result of the OCR process as text.
            outputText += gdpictureOCR.GetOCRResultText(resultId)
            ' Release the image and the OCR result.
            GdPictureDocumentUtilities.DisposeImage(imageId)
            gdpictureOCR.ReleaseOCRResult(resultId)
        Next
        ' Save the output in a new text file.
        Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
        outputFile.WriteLine(outputText)
        outputFile.Close()
        ' Release unnecessary resources.
        gdpicturePDF.CloseDocument()
    End Using
    End Using

Reading text from an image
This section explains how to read text from single-page image files. To read multipage image files, see the Reading text from multipage TIFF files section.
To read text from an image file, follow the steps below:
- Create a `GdPictureImaging` object and a `GdPictureOCR` object.
- Select the image by passing its path to the `CreateGdPictureImageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the image with the `SetImage` method.
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine recognizes only the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object.
- Save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
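The saving step can also be written so that the file is closed even if the write fails. The sketch below uses only the standard library; `SaveText` is a hypothetical helper name, and the temporary-folder path and sample string stand in for your own output path and the OCR result.

```csharp
using System;
using System.IO;

// Hypothetical helper: writes the recognized text to a file.
static void SaveText(string path, string text)
{
    // The using declaration disposes the writer when the method exits,
    // which flushes and closes the file even if Write throws.
    using StreamWriter writer = new StreamWriter(path);
    writer.Write(text);
}

// Stand-in values: replace with your output path and the OCR result text.
string outputPath = Path.Combine(Path.GetTempPath(), "output.txt");
SaveText(outputPath, "Recognized text goes here.");
Console.WriteLine($"Saved OCR output to {outputPath}");
```

This variant uses `Write` rather than `WriteLine`, so no trailing line break is appended to the recognized text.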
The example below reads text from an image file and saves the output in a TXT file:
C#:

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the image to read.
    int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
    // Configure the OCR parameters.
    gdpictureOCR.SetImage(imageId);
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Run the OCR process.
    string resultId = gdpictureOCR.RunOCR();
    // Get the result of the OCR process as text.
    string outputText = gdpictureOCR.GetOCRResultText(resultId);
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);

VB.NET:

    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
        ' Select the image to read.
        Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
        ' Configure the OCR parameters.
        gdpictureOCR.SetImage(imageId)
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Run the OCR process.
        Dim resultId As String = gdpictureOCR.RunOCR()
        ' Get the result of the OCR process as text.
        Dim outputText As String = gdpictureOCR.GetOCRResultText(resultId)
        ' Save the output in a new text file.
        Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
        outputFile.WriteLine(outputText)
        outputFile.Close()
        ' Release unnecessary resources.
        gdpictureImaging.ReleaseGdPictureImage(imageId)
    End Using
    End Using

Reading text from multipage TIFF files
To read text from a multipage TIFF file, follow the steps below:
- Create a `GdPictureImaging` object and a `GdPictureOCR` object.
- Select the image by passing its path to the `TiffCreateMultiPageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
  - Optional: Set whether OCR prioritizes recognition accuracy or speed with the `OCRMode` property.
  - Optional: Set the character allowlist with the `CharacterSet` property. When scanning the image, the OCR engine recognizes only the characters included in the allowlist.
  - Optional: Set the character denylist with the `CharacterBlackList` property. When scanning the image, the OCR engine doesn’t recognize the characters included in the denylist.
- Create an empty string where you’ll save the output.
- Determine the number of pages with the `GetPageCount` method of the `GdPictureImaging` object and loop through them.
- Select a page with the `TiffSelectPage` method of the `GdPictureImaging` object.
- Pass the page to the `GdPictureOCR` object with its `SetImage` method.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the result of the OCR process as text with the `GetOCRResultText` method of the `GdPictureOCR` object, and append it to the output string.
- Release the OCR result with the `ReleaseOCRResult` method of the `GdPictureOCR` object.
- After reading all the pages, save the output in a new text file with the standard `System.IO.StreamWriter` class.
- Release unnecessary resources.
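Appending each page’s text directly to one string gives no visible boundary between pages, and repeated string concatenation copies the whole output on every iteration. As one possible alternative, the sketch below collects page text with the standard `StringBuilder` class and labels each page. `JoinPages` is a hypothetical helper, and the sample strings stand in for the values returned by `GetOCRResultText` inside the loop.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: joins per-page OCR text under a labeled page header.
static string JoinPages(IEnumerable<string> pageTexts)
{
    StringBuilder output = new StringBuilder();
    int pageNumber = 1;
    foreach (string pageText in pageTexts)
    {
        output.AppendLine($"--- Page {pageNumber} ---");
        // Trim trailing whitespace so every page ends with exactly one line break.
        output.AppendLine(pageText.TrimEnd());
        pageNumber++;
    }
    return output.ToString();
}

// Stand-in strings for the text recognized on each page.
string combined = JoinPages(new[] { "First page text", "Second page text\n" });
Console.Write(combined);
```

The page header format is arbitrary; drop it if the output file should contain only the recognized text.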
The example below reads text from a multipage TIFF file and saves the output in a TXT file:
C#:

    using GdPictureImaging gdpictureImaging = new GdPictureImaging();
    using GdPictureOCR gdpictureOCR = new GdPictureOCR();
    // Select the image to read.
    int imageId = gdpictureImaging.TiffCreateMultiPageFromFile(@"C:\temp\source.tif");
    // Configure the OCR parameters.
    gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
    gdpictureOCR.AddLanguage(OCRLanguage.English);
    // Create an empty string where you'll save the output.
    string outputText = "";
    // Determine the number of pages and loop through them.
    int pageCount = gdpictureImaging.GetPageCount(imageId);
    for (int page = 1; page <= pageCount; page++)
    {
        // Select a page and pass it to the `GdPictureOCR` object.
        gdpictureImaging.TiffSelectPage(imageId, page);
        gdpictureOCR.SetImage(imageId);
        // Run the OCR process.
        string resultId = gdpictureOCR.RunOCR();
        // Get the result of the OCR process as text.
        outputText += gdpictureOCR.GetOCRResultText(resultId);
        // Release the OCR result.
        gdpictureOCR.ReleaseOCRResult(resultId);
    }
    // Save the output in a new text file.
    System.IO.StreamWriter outputFile = new System.IO.StreamWriter(@"C:\temp\output.txt");
    outputFile.WriteLine(outputText);
    outputFile.Close();
    // Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId);

VB.NET:

    Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
        ' Select the image to read.
        Dim imageId As Integer = gdpictureImaging.TiffCreateMultiPageFromFile("C:\temp\source.tif")
        ' Configure the OCR parameters.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Create an empty string where you'll save the output.
        Dim outputText = ""
        ' Determine the number of pages and loop through them.
        Dim pageCount As Integer = gdpictureImaging.GetPageCount(imageId)
        For page = 1 To pageCount
            ' Select a page and pass it to the `GdPictureOCR` object.
            gdpictureImaging.TiffSelectPage(imageId, page)
            gdpictureOCR.SetImage(imageId)
            ' Run the OCR process.
            Dim resultId As String = gdpictureOCR.RunOCR()
            ' Get the result of the OCR process as text.
            outputText += gdpictureOCR.GetOCRResultText(resultId)
            ' Release the OCR result.
            gdpictureOCR.ReleaseOCRResult(resultId)
        Next
        ' Save the output in a new text file.
        Dim outputFile As System.IO.StreamWriter = New System.IO.StreamWriter("C:\temp\output.txt")
        outputFile.WriteLine(outputText)
        outputFile.Close()
        ' Release unnecessary resources.
        gdpictureImaging.ReleaseGdPictureImage(imageId)
    End Using
    End Using
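The set-image, run, read, and release sequence is the same in all three workflows, so it can be wrapped in one function. The sketch below is one way to do that, not part of the SDK: `ReadTextFromImage` is a hypothetical helper, and it assumes the `GdPicture14` namespace and that `GetStat` on the `GdPictureOCR` object returns a `GdPictureStatus` describing the last operation. Neither is stated in this guide, so verify both against the SDK reference.

```csharp
using System;
using GdPicture14; // Assumed namespace for GdPicture.NET 14.

// Hypothetical helper: runs OCR on one image and returns the recognized text.
static string ReadTextFromImage(GdPictureOCR ocr, int imageId)
{
    ocr.SetImage(imageId);
    string resultId = ocr.RunOCR();
    // Assumption: GetStat() reports the status of the last OCR call.
    GdPictureStatus status = ocr.GetStat();
    if (status != GdPictureStatus.OK)
    {
        throw new InvalidOperationException($"OCR failed with status {status}.");
    }
    try
    {
        return ocr.GetOCRResultText(resultId);
    }
    finally
    {
        // Release the OCR result even if reading the text throws.
        ocr.ReleaseOCRResult(resultId);
    }
}

using GdPictureImaging gdpictureImaging = new GdPictureImaging();
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
string text = ReadTextFromImage(gdpictureOCR, imageId);
gdpictureImaging.ReleaseGdPictureImage(imageId);
Console.WriteLine(text);
```

In the PDF and TIFF loops, the same helper could replace the four per-page OCR calls, leaving only page selection and image disposal in the loop body.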