Extract tables from PDFs and images using C#
The table extraction engine in Nutrient .NET SDK (formerly GdPicture.NET) recognizes tables in unstructured documents and images, parses their contents, and exports them to an external destination such as a spreadsheet. It detects and extracts bordered, semi-bordered, and borderless tables in images, scanned PDFs, and born-digital PDFs. Because it's a native SDK, you can deploy it on-premises or embed it in your application, and it works offline, without internet access.
Extracting table data from a PDF to an Excel spreadsheet
To read and extract table data from a PDF document to an Excel spreadsheet, follow the steps below:
- Create a `GdPictureOCR` object and a `GdPicturePDF` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Select the page from which to extract the table data with the `SelectPage` method of the `GdPicturePDF` object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method.
- Configure the table extraction process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.

  For more optional configuration parameters, see the `GdPictureOCR` class.
- Run the table extraction process with the `RunOCR` method of the `GdPictureOCR` object, and save the result ID in a list.
- Create a `GdPictureOCR.SpreadsheetOptions` object and configure the output spreadsheet. By default, tables from the same OCR result are saved in the same sheet. To save each table in a different sheet, set the `SeparateTables` property of the `GdPictureOCR.SpreadsheetOptions` object to `true`. For more optional configuration parameters, see the `GdPictureOCR.SpreadsheetOptions` class.
- Save the output in an Excel spreadsheet with the `SaveAsXLSX` method of the `GdPictureOCR` object. This method takes the following parameters:
  - The list containing the OCR result ID.
  - The path to the output file.
  - The `GdPictureOCR.SpreadsheetOptions` object.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and saves the output in an Excel spreadsheet:
```csharp
using System.Collections.Generic;
using GdPicture14;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the table extraction process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the table extraction process and save the result ID in a list.
string result = gdpictureOCR.RunOCR();
List<string> resultsList = new List<string>() { result };
// Configure the output spreadsheet.
GdPictureOCR.SpreadsheetOptions spreadsheetOptions = new GdPictureOCR.SpreadsheetOptions()
{
    SeparateTables = true
};
// Save the output in an Excel spreadsheet.
gdpictureOCR.SaveAsXLSX(resultsList, @"C:\temp\output.xlsx", spreadsheetOptions);
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();
```

```vb
Imports System.Collections.Generic
Imports GdPicture14
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the table extraction process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the table extraction process and save the result ID in a list.
    Dim result As String = gdpictureOCR.RunOCR()
    Dim resultsList As List(Of String) = New List(Of String)()
    resultsList.Add(result)
    ' Configure the output spreadsheet.
    Dim spreadsheetOptions As GdPictureOCR.SpreadsheetOptions = New GdPictureOCR.SpreadsheetOptions() With {
        .SeparateTables = True
    }
    ' Save the output in an Excel spreadsheet.
    gdpictureOCR.SaveAsXLSX(resultsList, "C:\temp\output.xlsx", spreadsheetOptions)
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
```
Extracting table data from a PDF to JSON format
To read and extract table data from a PDF document to JSON format, follow these steps:
- Import the `GdPicture14` and the `Newtonsoft.Json.Linq` namespaces.
- Create a `GdPictureOCR` object and a `GdPicturePDF` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Select the page from which to extract the table data with the `SelectPage` method of the `GdPicturePDF` object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the number of tables detected during the OCR process with the `GetTableCount` method of the `GdPictureOCR` object.
- Create the JSON object that contains the tables on the page and loop through the tables.
- For each table, get the number of columns and rows with the `GetTableColumnCount` and `GetTableRowCount` methods of the `GdPictureOCR` object.
- Create the JSON object that contains the rows in the table and loop through the rows.
- Create the JSON object that contains the cells in the row and loop through the cells.
- Get the detected value for each cell with the `GetTableCellText` method of the `GdPictureOCR` object and save it in the JSON object.
- Print the tables to the console in JSON format.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and prints the output to the console in JSON format:
```csharp
using GdPicture14;
using Newtonsoft.Json.Linq;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
// Create the JSON object that contains the tables on the page and loop through the tables.
int tableCount = gdpictureOCR.GetTableCount(ocrResultId);
dynamic[] tables = new JObject[tableCount];
for (int tableIndex = 0; tableIndex < tableCount; tableIndex++)
{
    int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
    int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
    // Create the JSON object that contains the rows in the table and loop through the rows.
    dynamic[] rows = new JObject[rowCount];
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        // Create the JSON object that contains the cells in the row and loop through the cells.
        dynamic[] cells = new JObject[columnCount];
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            cells[columnIndex] = new JObject();
            cells[columnIndex].RowIndex = rowIndex;
            cells[columnIndex].ColumnIndex = columnIndex;
            // Read the content of the cell and save it in the JSON object.
            cells[columnIndex].Text = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex);
        }
        rows[rowIndex] = new JObject();
        rows[rowIndex].Cells = new JArray(cells);
    }
    tables[tableIndex] = new JObject();
    tables[tableIndex].Rows = new JArray(rows);
}
dynamic tablesOnPage = new JObject();
tablesOnPage.Tables = new JArray(tables);
// Print the tables to the console in JSON format.
Console.WriteLine(tablesOnPage.ToString());
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();
```

```vb
Imports GdPicture14
Imports Newtonsoft.Json.Linq
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    ' Create the JSON object that contains the tables on the page and loop through the tables.
    Dim tableCount As Integer = gdpictureOCR.GetTableCount(ocrResultId)
    Dim tables As Object() = New JObject(tableCount - 1) {}
    For tableIndex = 0 To tableCount - 1
        Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
        Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
        ' Create the JSON object that contains the rows in the table and loop through the rows.
        Dim rows As Object() = New JObject(rowCount - 1) {}
        For rowIndex = 0 To rowCount - 1
            ' Create the JSON object that contains the cells in the row and loop through the cells.
            Dim cells As Object() = New JObject(columnCount - 1) {}
            For columnIndex = 0 To columnCount - 1
                cells(columnIndex) = New JObject()
                cells(columnIndex).RowIndex = rowIndex
                cells(columnIndex).ColumnIndex = columnIndex
                ' Read the content of the cell and save it in the JSON object.
                cells(columnIndex).Text = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex)
            Next
            rows(rowIndex) = New JObject()
            rows(rowIndex).Cells = New JArray(cells)
        Next
        tables(tableIndex) = New JObject()
        tables(tableIndex).Rows = New JArray(rows)
    Next
    Dim tablesOnPage As Object = New JObject()
    tablesOnPage.Tables = New JArray(tables)
    ' Print the tables to the console in JSON format.
    Console.WriteLine(tablesOnPage.ToString())
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
```
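To make the resulting JSON shape easier to see, the sketch below builds the same nesting (`Tables`, then `Rows`, then `Cells` with `RowIndex`, `ColumnIndex`, and `Text`) from plain in-memory strings. It's only an illustration under stated assumptions: `TablesToJson` is a hypothetical helper that isn't part of Nutrient .NET SDK, the sample strings stand in for values returned by `GetTableCellText`, and it uses `System.Text.Json` rather than the `Newtonsoft.Json` types used in the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper (not an SDK method): tableData[t][r][c] holds the text
// of the cell in table t, row r, column c.
static string TablesToJson(string[][][] tableData)
{
    var tableList = new List<object>();
    foreach (string[][] tableRows in tableData)
    {
        var rowList = new List<object>();
        for (int rowIndex = 0; rowIndex < tableRows.Length; rowIndex++)
        {
            var cellList = new List<object>();
            for (int columnIndex = 0; columnIndex < tableRows[rowIndex].Length; columnIndex++)
            {
                // One JSON object per cell, mirroring the properties used above.
                cellList.Add(new Dictionary<string, object>
                {
                    ["RowIndex"] = rowIndex,
                    ["ColumnIndex"] = columnIndex,
                    ["Text"] = tableRows[rowIndex][columnIndex]
                });
            }
            rowList.Add(new Dictionary<string, object> { ["Cells"] = cellList });
        }
        tableList.Add(new Dictionary<string, object> { ["Rows"] = rowList });
    }
    var tablesOnPage = new Dictionary<string, object> { ["Tables"] = tableList };
    return JsonSerializer.Serialize(tablesOnPage);
}

// Sample data standing in for OCR results: one table with two rows and two columns.
string[][][] sampleTables =
{
    new[]
    {
        new[] { "No.", "Museum name" },
        new[] { "1.", "Louvre" }
    }
};
Console.WriteLine(TablesToJson(sampleTables));
```

Keeping the row and column indexes on every cell makes each cell self-describing, which can help if a consumer later flattens or filters the cells.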
Extracting table data from a PDF to Markdown format
To read and extract table data from a PDF document and print it to the console in Markdown syntax, follow these steps:
- Create a `GdPictureOCR` object and a `GdPicturePDF` object.
- Select the source document by passing its path to the `LoadFromFile` method of the `GdPicturePDF` object.
- Select the page from which to extract the table data with the `SelectPage` method of the `GdPicturePDF` object.
- Render the selected page to a 300 dots-per-inch (DPI) image with the `RenderPageToGdPictureImageEx` method of the `GdPicturePDF` object.
- Pass the image to the `GdPictureOCR` object with the `SetImage` method.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the number of tables detected during the OCR process with the `GetTableCount` method of the `GdPictureOCR` object, and loop through them.
- For each table, get the number of columns and rows with the `GetTableColumnCount` and `GetTableRowCount` methods of the `GdPictureOCR` object, and loop through them.
- Get the detected value for each cell with the `GetTableCellText` method of the `GdPictureOCR` object, and print it to the console.
- Release unnecessary resources.
The example below extracts table data from the first page of a document and prints the output to the console in Markdown syntax:
```csharp
using System;
using GdPicture14;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPicturePDF gdpicturePDF = new GdPicturePDF();
// Load the source document.
gdpicturePDF.LoadFromFile(@"C:\temp\source.pdf");
// Select the first page.
gdpicturePDF.SelectPage(1);
// Render the first page to a 300 DPI image.
int imageId = gdpicturePDF.RenderPageToGdPictureImageEx(300, true);
// Pass the image to the `GdPictureOCR` object.
gdpictureOCR.SetImage(imageId);
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
for (int tableIndex = 0; tableIndex < gdpictureOCR.GetTableCount(ocrResultId); tableIndex++)
{
    int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
    int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
    // Print the table to the console.
    Console.Write($"\nTable {tableIndex}");
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        Console.Write("\n| ");
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            string cellContent = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "");
            Console.Write($" {cellContent} |");
        }
    }
    Console.WriteLine("");
}
// Release unnecessary resources.
gdpictureOCR.ReleaseOCRResults();
GdPictureDocumentUtilities.DisposeImage(imageId);
gdpicturePDF.CloseDocument();
```

```vb
Imports GdPicture14
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpicturePDF As GdPicturePDF = New GdPicturePDF()
    ' Load the source document.
    gdpicturePDF.LoadFromFile("C:\temp\source.pdf")
    ' Select the first page.
    gdpicturePDF.SelectPage(1)
    ' Render the first page to a 300 DPI image.
    Dim imageId As Integer = gdpicturePDF.RenderPageToGdPictureImageEx(300, True)
    ' Pass the image to the `GdPictureOCR` object.
    gdpictureOCR.SetImage(imageId)
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    For tableIndex As Integer = 0 To gdpictureOCR.GetTableCount(ocrResultId) - 1
        Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
        Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
        ' Print the table to the console.
        Console.Write(vbLf & $"Table {tableIndex}")
        For rowIndex = 0 To rowCount - 1
            Console.Write(vbLf & "| ")
            For columnIndex = 0 To columnCount - 1
                Dim cellContent As String = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "")
                Console.Write($" {cellContent} |")
            Next
        Next
        Console.WriteLine("")
    Next
    ' Release unnecessary resources.
    gdpictureOCR.ReleaseOCRResults()
    GdPictureDocumentUtilities.DisposeImage(imageId)
    gdpicturePDF.CloseDocument()
End Using
End Using
```
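The example above strips `Environment.NewLine` from each cell before printing it. A slightly more defensive variant is sketched below. It assumes (this isn't confirmed by the SDK documentation) that recognized cell text might also contain bare line feeds or literal pipe characters, either of which would break a Markdown table row. `EscapeMarkdownCell` is a hypothetical helper, not an SDK method.

```csharp
using System;

// Hypothetical helper (not an SDK method): makes a piece of cell text safe to
// place inside a single Markdown table cell.
static string EscapeMarkdownCell(string text)
{
    if (string.IsNullOrEmpty(text))
    {
        return string.Empty;
    }
    return text
        .Replace("\r\n", " ")   // Windows line breaks
        .Replace("\n", " ")     // bare line feeds
        .Replace("\r", " ")     // bare carriage returns
        .Replace("|", "\\|")    // a literal pipe would otherwise start a new column
        .Trim();
}

// Prints: United States, New York
Console.WriteLine(EscapeMarkdownCell("United States,\nNew York"));
```

Unlike the example above, this sketch replaces line breaks with a space rather than removing them, so words on separate lines of a cell don't run together. Which behavior is preferable depends on the source tables.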
Extracting table data from an image
To read and extract table data from an image, follow these steps:
- Create a `GdPictureOCR` object and a `GdPictureImaging` object.
- Select the image of the table by passing its path to the `CreateGdPictureImageFromFile` method of the `GdPictureImaging` object.
- Configure the OCR process with the `GdPictureOCR` object in the following way:
  - Set the image of the table with the `SetImage` method.
  - Set the path to the OCR resource folder with the `ResourceFolder` property. The default language resources are located in `GdPicture.NET 14\Redist\OCR`. For more information on adding language resources, see the language support guide.
  - With the `AddLanguage` method, add the language resources that Nutrient .NET SDK uses to recognize text in the image. This method takes a member of the `OCRLanguage` enumeration.
- Run the OCR process with the `RunOCR` method of the `GdPictureOCR` object.
- Get the number of tables detected during the OCR process with the `GetTableCount` method of the `GdPictureOCR` object, and loop through them.
- For each table, get the number of columns and rows with the `GetTableColumnCount` and `GetTableRowCount` methods of the `GdPictureOCR` object, and loop through them.
- Get the detected value for each cell with the `GetTableCellText` method of the `GdPictureOCR` object, and print it to the console.
- Release unnecessary resources.
The example below extracts data from the following table and prints the output to the console in Markdown syntax.

Download the sample table and run the code below, or check out our demo.
```csharp
using System;
using GdPicture14;
...
using GdPictureOCR gdpictureOCR = new GdPictureOCR();
using GdPictureImaging gdpictureImaging = new GdPictureImaging();
// Load the source image.
int imageId = gdpictureImaging.CreateGdPictureImageFromFile(@"C:\temp\source.png");
// Configure the OCR process.
gdpictureOCR.ResourceFolder = @"C:\GdPicture.NET 14\Redist\OCR";
gdpictureOCR.AddLanguage(OCRLanguage.English);
gdpictureOCR.SetImage(imageId);
// Run the OCR process.
string ocrResultId = gdpictureOCR.RunOCR();
for (int tableIndex = 0; tableIndex < gdpictureOCR.GetTableCount(ocrResultId); tableIndex++)
{
    int columnCount = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex);
    int rowCount = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex);
    // Print the table to the console.
    Console.Write($"\nTable {tableIndex}");
    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        Console.Write("\n| ");
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            string cellContent = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "");
            Console.Write($" {cellContent} |");
        }
    }
    Console.WriteLine("");
}
// Release unnecessary resources.
gdpictureImaging.ReleaseGdPictureImage(imageId);
gdpictureOCR.ReleaseOCRResults();
```

```vb
Imports GdPicture14
...
Using gdpictureOCR As GdPictureOCR = New GdPictureOCR()
Using gdpictureImaging As GdPictureImaging = New GdPictureImaging()
    ' Load the source image.
    Dim imageId As Integer = gdpictureImaging.CreateGdPictureImageFromFile("C:\temp\source.png")
    ' Configure the OCR process.
    gdpictureOCR.ResourceFolder = "C:\GdPicture.NET 14\Redist\OCR"
    gdpictureOCR.AddLanguage(OCRLanguage.English)
    gdpictureOCR.SetImage(imageId)
    ' Run the OCR process.
    Dim ocrResultId As String = gdpictureOCR.RunOCR()
    For tableIndex As Integer = 0 To gdpictureOCR.GetTableCount(ocrResultId) - 1
        Dim columnCount As Integer = gdpictureOCR.GetTableColumnCount(ocrResultId, tableIndex)
        Dim rowCount As Integer = gdpictureOCR.GetTableRowCount(ocrResultId, tableIndex)
        ' Print the table to the console.
        Console.Write(vbLf & $"Table {tableIndex}")
        For rowIndex = 0 To rowCount - 1
            Console.Write(vbLf & "| ")
            For columnIndex = 0 To columnCount - 1
                Dim cellContent As String = gdpictureOCR.GetTableCellText(ocrResultId, tableIndex, columnIndex, rowIndex).Replace(Environment.NewLine, "")
                Console.Write($" {cellContent} |")
            Next
        Next
        Console.WriteLine("")
    Next
    ' Release unnecessary resources.
    gdpictureImaging.ReleaseGdPictureImage(imageId)
    gdpictureOCR.ReleaseOCRResults()
End Using
End Using
```
Format the output to obtain the following table:
| No. | Museum name | Location | Visits in 2021 | Change since 2020 | 
|---|---|---|---|---|
| 1. | Louvre | France, Paris | 2,825,000 | +5 | 
| 2. | Russian Museum | Russia, Saint Petersburg | 2,260,231 | +88% | 
| 3. | Multimedia Art Museum | Russia, Moscow | 2,242,405 | +421% | 
| 4. | Metropolitan Museum of Art | United States, New York | 1,958,000 | +84% | 
| 5. | National Gallery of Art | United States, Washington, D.C. | 1,704,606 | +133% | 
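The printed rows don't include the `|---|---|` separator that Markdown expects after the header row, which is part of the formatting step mentioned above. One possible way to add it is sketched below. It assumes that the first row of the detected table is the header and that the cell text has already been collected into a two-dimensional array. `ToMarkdownTable` is a hypothetical helper, not an SDK method, and the sample values stand in for text returned by `GetTableCellText`.

```csharp
using System;
using System.Text;

// Hypothetical helper (not an SDK method): cellText[row, column] holds the
// text of each cell, and the first row is treated as the header.
static string ToMarkdownTable(string[,] cellText)
{
    int rowCount = cellText.GetLength(0);
    int columnCount = cellText.GetLength(1);
    var builder = new StringBuilder();

    for (int rowIndex = 0; rowIndex < rowCount; rowIndex++)
    {
        builder.Append('|');
        for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
        {
            builder.Append(' ').Append(cellText[rowIndex, columnIndex]).Append(" |");
        }
        builder.Append('\n');

        // Markdown needs a separator row directly after the header row.
        if (rowIndex == 0)
        {
            builder.Append('|');
            for (int columnIndex = 0; columnIndex < columnCount; columnIndex++)
            {
                builder.Append("---|");
            }
            builder.Append('\n');
        }
    }
    return builder.ToString();
}

string[,] sample =
{
    { "No.", "Museum name" },
    { "1.", "Louvre" }
};
Console.Write(ToMarkdownTable(sample));
// Prints:
// | No. | Museum name |
// |---|---|
// | 1. | Louvre |
```

Building the whole table as a string instead of writing directly to the console makes it straightforward to save the result to a `.md` file or embed it in other output.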
 
  
  
  
 