Convert PDF to Excel in C#
Nutrient .NET SDK’s (formerly GdPicture.NET) table extraction engine is a native SDK that enables you to recognize tables in an unstructured document or image, parse the information, and export the tables to an external destination like a spreadsheet. It can detect and extract bordered, semi-bordered, and borderless tables in images, scanned PDFs, and digitally born PDFs. As a native SDK, it can be deployed on-premises or embedded in your application, and it works offline, without internet access.
There are two possible approaches to converting PDFs to Excel with Nutrient .NET SDK:
- Convert all contents in a PDF document to Excel.
- Recognize and extract only the tables present in a document to Excel.
Both of these options are explained below.
Converting the entire PDF document to Excel
To save all contents of a PDF document to an Excel spreadsheet (XLSX), use the SaveAsXLSX method of the GdPictureDocumentConverter class. Depending on the overload, it takes one of the following parameters:
- Stream: A stream object where the current document is saved as an XLSX file. This stream object must be initialized before it's passed to this method, and it should stay open for subsequent use. If the output stream isn't open for both reading and writing, the method fails and returns the GdPictureStatus.InvalidParameter status.
- FilePath: The file path where the converted file will be saved. If the specified file already exists, it'll be overwritten. You have to specify a full file path, including the file extension.
Note that the output stream should be open for both reading and writing, and that you're responsible for closing or disposing of it once processing is complete and the document has been closed with the CloseDocument method.
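The stream requirements described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the file names are placeholders, and it assumes the SDK license has already been registered and that the GdPicture14 namespace applies to your SDK version.

```csharp
using System;
using System.IO;
using GdPicture14; // Assumption: namespace of the SDK version in use.

using GdPictureDocumentConverter converter = new();

// Load the source PDF and state its format explicitly.
GdPictureStatus status = converter.LoadFromFile("input.pdf", DocumentFormat.DocumentFormatPDF);
if (status != GdPictureStatus.OK)
{
    throw new InvalidOperationException($"Loading failed: {status}");
}

// Open the output stream for both reading and writing.
// A write-only stream would be rejected with GdPictureStatus.InvalidParameter.
using (FileStream output = new FileStream("output.xlsx", FileMode.Create, FileAccess.ReadWrite))
{
    status = converter.SaveAsXLSX(output);
    if (status != GdPictureStatus.OK)
    {
        throw new InvalidOperationException($"Conversion failed: {status}");
    }

    // Close the document before the stream is disposed of at the end of this block.
    converter.CloseDocument();
}
```

Keeping the stream in its own using block makes its lifetime explicit: it's opened just before the conversion and released only after the converter has finished with it.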
Here’s how to convert PDF to XLSX:
- Create a GdPictureDocumentConverter object.
- Load the source document by passing its path to the LoadFromFile method. This method accepts all supported file formats. However, only a PDF file can be converted into an XLSX (other input file formats will return GdPictureStatus.NotImplemented). If the source document isn’t a PDF, it can be converted to PDF first with GdPictureDocumentConverter.SaveAsPDF. Recommended: Specify the source document format with a member of the DocumentFormat enumeration.
- Save the PDF file as an XLSX using SaveAsXLSX.
The following example converts and saves all content in a PDF document to an XLSX file (it can also be saved as a stream):
using System;
using GdPicture14;

using GdPictureDocumentConverter converter = new();

// Load the source PDF.
var status = converter.LoadFromFile("input.pdf");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}

// Save all of its content as an XLSX file.
status = converter.SaveAsXLSX("output.xlsx");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}

Console.WriteLine("The input document has been converted to an XLSX file");
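When the source document isn't a PDF, the two-stage route mentioned in the steps (convert to PDF first, then to XLSX) might look like the sketch below. The file names and the Check helper are hypothetical, and the two-argument SaveAsPDF overload, the PdfConformance.PDF member, and the DocumentFormat.DocumentFormatDOCX member are assumptions about the SDK that you should verify against the API reference for your version.

```csharp
using System;
using System.IO;
using GdPicture14; // Assumption: namespace of the SDK version in use.

// Hypothetical file names, chosen for illustration only.
const string sourcePath = "input.docx";
string intermediatePdf = Path.Combine(Path.GetTempPath(), "intermediate.pdf");

// Stage 1: convert the non-PDF source to a PDF file.
using (GdPictureDocumentConverter toPdf = new())
{
    Check(toPdf.LoadFromFile(sourcePath, DocumentFormat.DocumentFormatDOCX), "load the source document");
    Check(toPdf.SaveAsPDF(intermediatePdf, PdfConformance.PDF), "save the intermediate PDF");
}

// Stage 2: convert the intermediate PDF to XLSX.
using (GdPictureDocumentConverter toXlsx = new())
{
    Check(toXlsx.LoadFromFile(intermediatePdf, DocumentFormat.DocumentFormatPDF), "load the intermediate PDF");
    Check(toXlsx.SaveAsXLSX("output.xlsx"), "save the XLSX file");
}

// The intermediate file is no longer needed.
File.Delete(intermediatePdf);

// Hypothetical helper: stops at the first call that doesn't return OK.
static void Check(GdPictureStatus status, string step)
{
    if (status != GdPictureStatus.OK)
    {
        throw new InvalidOperationException($"Failed to {step}: {status}");
    }
}
```

Using a separate converter object for each stage keeps the two loads independent, so a failure in the second stage can't be confused with a problem in the first.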
Recognizing and extracting table data from a PDF to an Excel spreadsheet
To identify all bordered, semi-bordered, and borderless tables in a PDF and extract only those tables to an Excel spreadsheet, use the SaveAsXLSX method of the GdPictureOCR class. Unlike the converter approach, this method exports only the tables present in the document. Follow the steps below:
- Create a GdPictureOCR object and a GdPicturePDF object.
- Select the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.
- Select the page from which to extract the table data with the SelectPage method of the GdPicturePDF object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method of the GdPicturePDF object.
- Pass the image to the GdPictureOCR object with the SetImage method.
- Configure the table extraction process with the GdPictureOCR object in the following way:
  - Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
  - With the AddLanguage method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.
  For more optional configuration parameters, see the GdPictureOCR class.
- Run the table extraction process with the RunOCR method of the GdPictureOCR object, and save the result ID in a list.
- Create a GdPictureOCR.SpreadsheetOptions object and configure the output spreadsheet. By default, tables from the same OCR result are saved in the same sheet. To save each table in a different sheet, set the SeparateTables property of the GdPictureOCR.SpreadsheetOptions object to true. For more optional configuration parameters, see the GdPictureOCR.SpreadsheetOptions class.
- Save the output in an Excel spreadsheet with the SaveAsXLSX method of the GdPictureOCR object. This method takes the following parameters:
  - The list containing the OCR result ID.
  - The path to the output file.
  - The GdPictureOCR.SpreadsheetOptions object.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and saves the output in an Excel spreadsheet:
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the table extraction process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the table extraction process and save the result ID in a list.
string result = gdpictureOCR.RunOCR();
List<string> resultsList = new List<string>() { result };
// Configure the output spreadsheet.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
// Save the output in an Excel spreadsheet.
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();

' VB.NET
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
    Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
        ' Load the source document.
        gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
        ' Select the first page.
        gdpicturePDF.SelectPage(1)
        ' Render the first page to a 300 DPI image.
        Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
        ' Pass the image to the `GdPictureOCR` object.
        gdpictureOCR.SetImage(imageId)
        ' Configure the table extraction process.
        gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
        gdpictureOCR.AddLanguage(OCRLanguage.English)
        ' Run the table extraction process and save the result ID in a list.
        Dim result As String = gdpictureOCR.RunOCR()
        Dim resultsList As List(Of String) = New List(Of String)()
        resultsList.Add(result)
        ' Configure the output spreadsheet.
        Dim spreadsheetOptions As GdPictureOCR.SpreadsheetOptions = New GdPictureOCR.SpreadsheetOptions() With {
            .SeparateTables = True
        }
        ' Save the output in an Excel spreadsheet.
        gdpictureOCR.SaveAsXLSX(resultsList, "C:\temp\output.xlsx", spreadsheetOptions)
        ' Release unnecessary resources.
        gdpictureOCR.ReleaseOCRResults()
        GdPictureDocumentUtilities.DisposeImage(imageId)
        gdpicturePDF.CloseDocument()
    End Using
End Using
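Because SaveAsXLSX takes a list of OCR result IDs, the single-page example can plausibly be extended to a whole document by running the extraction once per page and collecting every result ID. The sketch below rests on assumptions that aren't confirmed by this guide: that result IDs from earlier pages remain valid after SetImage is called with a new image, and that GetPageCount and GetStat behave as their names suggest in your SDK version. The paths are the same placeholders used above.

```csharp
using System;
using System.Collections.Generic;
using GdPicture14; // Assumption: namespace of the SDK version in use.

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();

// Load the source document and stop early if it can't be opened.
if (gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf") != GdPictureStatus.OK)
{
    throw new InvalidOperationException("The source document could not be loaded.");
}

// Configure the table extraction process once, before the loop.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

List<string> resultsList = new List<string>();
List<int> imageIds = new List<int>();

int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    // Render each page to a 300 DPI image and run the extraction on it.
    gdpicturePDF.SelectPage(page);
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    imageIds.Add(imageId);

    gdpictureOCR.SetImage(imageId);
    string result = gdpictureOCR.RunOCR();

    // Keep only the results of pages that were processed successfully.
    if (gdpictureOCR.GetStat() == GdPictureStatus.OK)
    {
        resultsList.Add(result);
    }
}

// Save the tables from all pages in one spreadsheet, one sheet per table.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);

// Release resources only after the output has been written.
gdpictureOCR.ReleaseOCRResults();
foreach (int imageId in imageIds)
{
    GdPictureDocumentUtilities.DisposeImage(imageId);
}
gdpicturePDF.CloseDocument();
```

The rendered images are kept until after the save, mirroring the order of the single-page example. For long documents this holds every page image in memory at once, so disposing of each image inside the loop may be preferable if your SDK version allows it.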
Related topics
For more information on extracting table data from PDFs, refer to the table extraction guide.