Convert PDF to Excel in C#

Nutrient .NET SDK’s (formerly GdPicture.NET) table extraction engine is a native SDK that enables you to recognize tables in an unstructured document or image, parse the information, and export the tables to an external destination like a spreadsheet. It can detect and extract bordered, semi-bordered, and borderless tables in images, scanned PDFs, and digitally born PDFs. As a native SDK, it can be deployed on-premises or embedded in your application, and it works offline, without internet access.

There are two possible approaches to converting PDFs to Excel with Nutrient .NET SDK:

  1. Convert all contents in a PDF document to Excel.
  2. Recognize the tables present in a document and extract only those tables to Excel.

Both of these options are explained below.

Converting the entire PDF document to Excel

To save all contents of a PDF document to an Excel spreadsheet (XLSX), use the SaveAsXLSX method of the GdPictureDocumentConverter class. It takes a single parameter, which is either a stream or a file path:

  • Stream: A stream object where the current document is saved as an XLSX file. This stream object must be initialized before it's passed to the method, and it should stay open for subsequent use. If the output stream isn't open for both reading and writing, the method fails and returns the GdPictureStatus.InvalidParameter status.
  • FilePath (overload): The file path where the converted file will be saved. If the specified file already exists, it'll be overwritten. You have to specify a full file path, including the file extension.

Note that the output stream must be open for both reading and writing. Closing or disposing of the stream is your responsibility: do so once processing is complete and the document has been released with the CloseDocument method.
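As a minimal sketch of the stream variant described above, the snippet below opens a FileStream for both reading and writing and passes it to SaveAsXLSX. The file names are placeholders, and the sketch assumes that the SDK has already been licensed elsewhere in the application and that its classes live in the GdPicture14 namespace.

```csharp
using System;
using System.IO;
using GdPicture14;

// Open the output stream first so that it is disposed of last,
// after the converter has released the document.
using FileStream output = new FileStream("output.xlsx", FileMode.Create, FileAccess.ReadWrite);
using GdPictureDocumentConverter converter = new();

GdPictureStatus status = converter.LoadFromFile("input.pdf");
if (status != GdPictureStatus.OK)
{
    throw new Exception($"Loading failed: {status}");
}

// The stream must be open for both reading and writing;
// otherwise the call is expected to return InvalidParameter.
status = converter.SaveAsXLSX(output);
if (status != GdPictureStatus.OK)
{
    throw new Exception($"Conversion failed: {status}");
}

// Release the document before the stream goes out of scope.
converter.CloseDocument();
```

Because using declarations are disposed of in reverse order, declaring the stream before the converter means the converter is cleaned up first and the stream second.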

Here’s how to convert PDF to XLSX:

  1. Create a GdPictureDocumentConverter object.
  2. Load the source document by passing its path to the LoadFromFile method. This method accepts all supported file formats. However, only a PDF file can be converted into an XLSX (other input file formats will return GdPictureStatus.NotImplemented). If the source document isn’t a PDF, it can be converted to PDF first with GdPictureDocumentConverter.SaveAsPDF. Recommended: Specify the source document format with a member of the DocumentFormat enumeration.
  3. Save the PDF file as an XLSX using SaveAsXLSX.

The following example converts all content in a PDF document and saves it to an XLSX file (the output can also be written to a stream):

using System;
using GdPicture14;

using GdPictureDocumentConverter converter = new();
// Load the source PDF.
var status = converter.LoadFromFile("input.pdf");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}
// Save all of its content as an XLSX file.
status = converter.SaveAsXLSX("output.xlsx");
if (status != GdPictureStatus.OK)
{
    throw new Exception(status.ToString());
}
Console.WriteLine("The input document has been converted to an XLSX file.");
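When the source isn't a PDF, the steps above suggest converting it to PDF first and stating the input format explicitly. The sketch below illustrates that two-stage flow for a DOCX file. The file names are hypothetical, and the enumeration members DocumentFormatDOCX and DocumentFormatPDF, as well as the single-argument SaveAsPDF overload, are assumptions to check against the SDK's API reference.

```csharp
using System;
using GdPicture14;

// Hypothetical file names, used for illustration only.
const string source = "input.docx";
const string intermediate = "intermediate.pdf";
const string target = "output.xlsx";

// Small helper (not part of the SDK) that turns a failed status into an exception.
static void Check(GdPictureStatus status, string step)
{
    if (status != GdPictureStatus.OK)
    {
        throw new Exception($"{step} failed: {status}");
    }
}

// Stage 1: convert the non-PDF source to an intermediate PDF.
using (GdPictureDocumentConverter toPdf = new())
{
    // Assumed enumeration member name for DOCX input.
    Check(toPdf.LoadFromFile(source, DocumentFormat.DocumentFormatDOCX), "Loading the DOCX");
    // Assumed single-argument overload of SaveAsPDF.
    Check(toPdf.SaveAsPDF(intermediate), "Saving the intermediate PDF");
}

// Stage 2: convert the intermediate PDF to XLSX.
using (GdPictureDocumentConverter toXlsx = new())
{
    // Assumed enumeration member name for PDF input.
    Check(toXlsx.LoadFromFile(intermediate, DocumentFormat.DocumentFormatPDF), "Loading the PDF");
    Check(toXlsx.SaveAsXLSX(target), "Saving the XLSX");
}
```

Using a separate converter instance for each stage keeps the two documents independent, at the cost of writing an intermediate file to disk.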

Recognizing and extracting table data from a PDF to an Excel spreadsheet

To identify all bordered, semi-bordered, and borderless tables in a PDF and extract only those tables to an Excel spreadsheet, use the SaveAsXLSX method of the GdPictureOCR class and follow the steps below:

  1. Create a GdPictureOCR object and a GdPicturePDF object.

  2. Load the source document by passing its path to the LoadFromFile method of the GdPicturePDF object.

  3. Select the page from which to extract the table data with the SelectPage method of the GdPicturePDF object.

  4. Render the selected page to a 300 dots-per-inch (DPI) image with the RenderPageToGdPictureImageEx method of the GdPicturePDF object.

  5. Pass the image to the GdPictureOCR object with the SetImage method.

  6. Configure the table extraction process with the GdPictureOCR object in the following way:

    • Set the path to the OCR resource folder with the ResourceFolder property. The default language resources are located in GdPicture.NET 14\Redist\OCR. For more information on adding language resources, see the language support guide.
    • With the AddLanguage method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the OCRLanguage enumeration.

    For more optional configuration parameters, see the GdPictureOCR class.

  7. Run the table extraction process with the RunOCR method of the GdPictureOCR object, and save the result ID in a list.

  8. Create a GdPictureOCR.SpreadsheetOptions object and configure the output spreadsheet. By default, tables from the same OCR result are saved in the same sheet. To save each table in a different sheet, set the SeparateTables property of the GdPictureOCR.SpreadsheetOptions object to true. For more optional configuration parameters, see the GdPictureOCR.SpreadsheetOptions class.

  9. Save the output in an Excel spreadsheet with the SaveAsXLSX method of the GdPictureOCR object. This method takes the following parameters:

    • The list containing the OCR result ID.
    • The path to the output file.
    • The GdPictureOCR.SpreadsheetOptions object.

  10. Release the resources that are no longer needed.

The example below extracts table data from the first page of a document and saves the output in an Excel spreadsheet:

using System.Collections.Generic;
using GdPicture14;

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the GdPictureOCR object.
gdpictureOCR.SetImage(imageId);
// Configure the table extraction process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the table extraction process and save the result ID in a list.
string result = gdpictureOCR.RunOCR();
List<string> resultsList = new List<string>() { result };
// Configure the output spreadsheet so that each table gets its own sheet.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
// Save the output in an Excel spreadsheet.
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();
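
Because SaveAsXLSX takes a list of OCR result IDs, the same approach can be extended to every page of a document. The sketch below is one way to do that, not the guide's own method. It assumes that the GetPageCount method of GdPicturePDF returns the number of pages and that the GetStat method reports the status of the last operation on each object. It also keeps the rendered images alive until the spreadsheet is saved, since the guide doesn't say whether they can be released earlier.

```csharp
using System;
using System.Collections.Generic;
using GdPicture14;

using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();

if (gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf") != GdPictureStatus.OK)
{
    throw new Exception("The source document could not be loaded.");
}

// Configure the table extraction process once for all pages.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);

List<string> resultsList = new List<string>();
List<int> imageIds = new List<int>();

// Assumption: GetPageCount returns the number of pages in the loaded PDF.
int pageCount = gdpicturePDF.GetPageCount();
for (int page = 1; page <= pageCount; page++)
{
    gdpicturePDF.SelectPage(page);
    int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
    // Assumption: GetStat reports whether the rendering succeeded.
    if (gdpicturePDF.GetStat() != GdPictureStatus.OK)
    {
        continue; // Skip pages that could not be rendered.
    }
    imageIds.Add(imageId);

    gdpictureOCR.SetImage(imageId);
    string result = gdpictureOCR.RunOCR();
    // Keep only the result IDs of pages that were processed successfully.
    if (gdpictureOCR.GetStat() == GdPictureStatus.OK)
    {
        resultsList.Add(result);
    }
}

// Save the tables from all collected OCR results into one workbook.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);

// Release the OCR results, the rendered images, and the document.
gdpictureOCR.ReleaseOCRResults();
foreach (int id in imageIds)
{
    GdPictureDocumentUtilities.DisposeImage(id);
}
gdpicturePDF.CloseDocument();
```

For long documents, holding every rendered page in memory may be costly; if the SDK's reference confirms that images can be disposed of right after RunOCR, moving the DisposeImage call into the loop would reduce memory use.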

For more information on extracting table data from PDFs, refer to the table extraction guide.