Convert PDF to Excel in C#
Nutrient .NET SDK’s (formerly GdPicture.NET) table extraction engine is a native SDK that enables you to recognize tables in an unstructured document or image, parse the information, and export the tables to an external destination like a spreadsheet. It can detect and extract bordered, semi-bordered, and borderless tables in images, scanned PDFs, and digitally born PDFs. As a native SDK, it can be deployed on-premises or embedded in your application, and it works offline, without internet access.
There are two possible approaches to converting PDFs to Excel with Nutrient .NET SDK:
- Convert all contents in a PDF document to Excel.
- Recognize and extract only the tables present in a document to Excel.

Both of these options are explained below.
Converting the entire PDF document to Excel
To save all contents of a PDF document to an Excel spreadsheet (XLSX), use the SaveAsXLSX method of the GdPictureDocumentConverter class. It takes one of the following parameters:
- Stream: A stream object where the current document is saved as an XLSX file. This stream object must be initialized before it’s passed to this method, and it should stay open for subsequent use. If the output stream isn’t open for both reading and writing, the method will fail, returning the GdPictureStatus.InvalidParameter status.
- FilePath (overload): The file path where the converted file will be saved. If the specified file already exists, it’ll be overwritten. You have to specify a full file path, including the file extension.
Note that the output stream should be open for both reading and writing. Closing or disposing of the stream is the caller’s responsibility once processing is complete, that is, after the CloseDocument method has been called.
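The stream requirement is easy to get wrong, so here is a minimal sketch of the Stream overload. It assumes the SDK types live in the GdPicture14 namespace and that CloseDocument is called on the converter before the stream is disposed; the file names are placeholders, and the FileStream could be replaced by any stream that supports both reading and writing.

```csharp
using System;
using System.IO;
using GdPicture14; // Assumed namespace for the SDK types.

using GdPictureDocumentConverter converter = new();

// Load the source PDF ("input.pdf" is a placeholder path).
GdPictureStatus status = converter.LoadFromFile("input.pdf");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}

// Open the output for both reading and writing. According to the parameter
// description, a stream that lacks either capability makes SaveAsXLSX
// return GdPictureStatus.InvalidParameter.
using FileStream output = new("output.xlsx", FileMode.Create, FileAccess.ReadWrite);

status = converter.SaveAsXLSX(output);
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}

// Close the document first; the stream is disposed afterward,
// when the using declarations go out of scope.
converter.CloseDocument();
```

Opening the file with FileAccess.ReadWrite rather than FileAccess.Write is the important detail here; FileMode.Create mirrors the overwrite behavior described for the file path overload.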
Here’s how to convert PDF to XLSX:
- Create a GdPictureDocumentConverter object.
- Load the source document by passing its path to the LoadFromFile method. This method accepts all supported file formats. However, only a PDF file can be converted into an XLSX file (other input file formats will return GdPictureStatus.NotImplemented). If the source document isn’t a PDF, it can be converted to PDF first with GdPictureDocumentConverter.SaveAsPDF. Recommended: Specify the source document format with a member of the DocumentFormat enumeration.
- Save the PDF file as an XLSX file using SaveAsXLSX.
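The second step mentions two things the basic flow doesn’t show: stating the input format explicitly and routing a non-PDF source through SaveAsPDF first. A rough sketch of both follows. It assumes the GdPicture14 namespace, the enumeration members DocumentFormat.DocumentFormatDOCX and DocumentFormat.DocumentFormatPDF, and a SaveAsPDF overload that takes only a file path; the file names and the Check helper are hypothetical and exist only for this illustration.

```csharp
using System;
using GdPicture14; // Assumed namespace for the SDK types.

// Step 1: convert the non-PDF source to an intermediate PDF.
using (GdPictureDocumentConverter toPdf = new())
{
    Check(toPdf.LoadFromFile("input.docx", GdPicture14.DocumentFormat.DocumentFormatDOCX));
    Check(toPdf.SaveAsPDF("intermediate.pdf"));
}

// Step 2: load the intermediate PDF, stating its format explicitly,
// and save it as an XLSX file.
using (GdPictureDocumentConverter toXlsx = new())
{
    Check(toXlsx.LoadFromFile("intermediate.pdf", GdPicture14.DocumentFormat.DocumentFormatPDF));
    Check(toXlsx.SaveAsXLSX("output.xlsx"));
}

// Hypothetical helper: throws if an SDK call did not return GdPictureStatus.OK.
static void Check(GdPictureStatus status)
{
    if (status != GdPictureStatus.OK)
    {
        throw new Exception(status.ToString());
    }
}
```

Two separate converter objects are used so that the first document is fully released before the intermediate file is reopened.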
The following example converts all content in a PDF document and saves it to an XLSX file (the output can also be written to a stream):
using System;
using GdPicture14;

using GdPictureDocumentConverter converter = new();

// Load the source PDF.
var status = converter.LoadFromFile("input.pdf");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}

// Save all content of the PDF as an XLSX file.
status = converter.SaveAsXLSX("output.xlsx");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}

Console.WriteLine("The input document has been converted to an XLSX file.");

Recognizing and extracting table data from a PDF to an Excel spreadsheet
To identify all bordered, semi-bordered, and borderless tables in a PDF and extract only those tables to an Excel spreadsheet, use the SaveAsXLSX method of the GdPictureOCR object. Unlike the converter method described in the previous section, it extracts only the tables present in the document. Follow the steps below:
- Create a GdPictureOCR object and a GdPicturePDF object.
- Select the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.
- Select the page from which to extract the table data with the SelectPage method of the GdPicturePDF object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method of the GdPicturePDF object.
- Pass the image to the GdPictureOCR object with the SetImage method.
- Configure the table extraction process with the GdPictureOCR object in the following way:
  - Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
  - With the AddLanguage method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.
  For more optional configuration parameters, see the GdPictureOCR class.
- Run the table extraction process with the RunOCR method of the GdPictureOCR object, and save the result ID in a list.
- Create a GdPictureOCR.SpreadsheetOptions object and configure the output spreadsheet. By default, tables from the same OCR result are saved in the same sheet. To save each table in a different sheet, set the SeparateTables property of the GdPictureOCR.SpreadsheetOptions object to true. For more optional configuration parameters, see the GdPictureOCR.SpreadsheetOptions class.
- Save the output in an Excel spreadsheet with the SaveAsXLSX method of the GdPictureOCR object. This method takes the following parameters:
  - The list containing the OCR result ID.
  - The path to the output file.
  - The GdPictureOCR.SpreadsheetOptions object.
- Release unnecessary resources.

The example below extracts table data from the first page of a document and saves the output in an Excel spreadsheet:
C#

using System.Collections.Generic;
using GdPicture14;

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the table extraction process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the table extraction process and save the result ID in a list.
string result = gdpictureOCR.RunOCR();
List<string> resultsList = new List<string>() { result };
// Configure the output spreadsheet.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
// Save the output in an Excel spreadsheet.
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();

VB.NET

Imports System.Collections.Generic
Imports GdPicture14

Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the table extraction process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the table extraction process and save the result ID in a list.
    Dim result As String = gdpictureOCR.RunOCR()
    Dim resultsList As List(Of String) = New List(Of String)()
    resultsList.Add(result)
    ' Configure the output spreadsheet.
    Dim spreadsheetOptions As GdPictureOCR.SpreadsheetOptions = New GdPictureOCR.SpreadsheetOptions() With {
        .SeparateTables = True
    }
    ' Save the output in an Excel spreadsheet.
    gdpictureOCR.SaveAsXLSX(resultsList, "C:\temp\output.xlsx", spreadsheetOptions)
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
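Because SaveAsXLSX takes a list of OCR result IDs, the same flow can plausibly be extended to every page of a document by collecting one result ID per page. The sketch below illustrates that idea; it isn’t taken from the SDK documentation. It assumes the GdPicture14 namespace, that GetPageCount returns the number of pages in the loaded PDF, and that OCR results remain valid after SetImage is called with a new image. Rendered images are kept until after saving, mirroring the order used for the single page.

```csharp
using System;
using System.Collections.Generic;
using GdPicture14; // Assumed namespace for the SDK types.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();

// Load the source document and stop early if loading fails.
if (gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf") != GdPictureStatus.OK)
{
    throw new Exception("The source document could not be loaded.");
}

// Configure the table extraction process once, before the loop.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

List<string> resultsList = new List<string>();
List<int> imageIds = new List<int>();

// Render each page to a 300 DPI image and collect one OCR result ID per page.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    imageIds.Add(imageId);
    gdpictureOCR.SetImage(imageId);
    resultsList.Add(gdpictureOCR.RunOCR());
}

// Save the tables from all collected results, one table per sheet.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);

// Release the OCR results, the rendered images, and the document.
gdpictureOCR.ReleaseOCRResults();
foreach (int imageId in imageIds)
{
    GdPictureDocumentUtilities.DisposeImage(imageId);
}
gdpicturePDF.CloseDocument();
```

Holding every rendered page in memory until the spreadsheet is saved is the conservative choice; for long documents, it may be worth checking whether each image can be disposed of right after its RunOCR call.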
For more information on extracting table data from PDFs, refer to the table extraction guide.