Extract PDF tables with C#
This guide demonstrates how to extract tabular data from PDF documents using C# and Nutrient Document Converter Services (DCS). Table extraction converts structured data from PDFs into JSON format, making it accessible for data analysis, reporting, and integration workflows.
Common use cases
PDF table extraction is useful for:
- Financial data processing - Extract tables from invoices, statements, and reports for automated accounting workflows
- Research and analysis - Convert tabular data from research papers and reports into JSON for statistical analysis
- Document digitization - Transform scanned documents with tables into a structured, searchable format
- Compliance reporting - Extract regulatory data from PDF forms into a structured format for auditing
- Data migration - Recover tabular information from legacy PDF documents for database import
Prerequisites
Before extracting tables from PDFs, ensure you have:
- Nutrient Document Converter Services (DCS) installed, licensed, and running
- .NET Framework 4.6.1+ or .NET Core 2.0+ development environment
- Valid DCS license that includes table extraction functionality
- Implemented OpenService() and CloseService() methods from the DocumentConverterServiceClient sample code
- PDF files containing tabular data for testing
- Write permissions for the target output folder
Input requirements
- PDF files containing actual tabular data (not just visual table layouts)
- Files that are not password-protected or corrupted
- Tables should be reasonably structured for optimal extraction results
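Some of these requirements can be checked locally before a file is sent to the service. The following is a minimal sketch, not part of the DCS API: `PdfPreflight` and `Check` are hypothetical names, and the check only confirms that the file exists and starts with the standard `%PDF-` marker. It does not detect password protection or deeper corruption.

```csharp
using System;
using System.IO;
using System.Text;

static class PdfPreflight
{
    // Hypothetical helper: returns null when the file passes the basic checks,
    // otherwise a short description of the problem.
    public static string Check(string path)
    {
        if (!File.Exists(path))
            return "File not found";

        byte[] header = new byte[5];
        int read;
        using (FileStream stream = File.OpenRead(path))
        {
            read = stream.Read(header, 0, header.Length);
        }

        // A conforming PDF begins with "%PDF-" followed by a version number.
        if (read < header.Length || Encoding.ASCII.GetString(header) != "%PDF-")
            return "Missing %PDF- header; the file is not a PDF or is corrupted";

        return null;
    }
}
```

A caller could run `PdfPreflight.Check(sourceFileName)` first and skip the extraction call when the result is not null.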
Output format
- JSON format: Structured data with table metadata and cell content
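This guide does not describe the exact JSON schema, so it can help to inspect the output generically before writing code against specific property names. The sketch below makes no assumptions about the DCS schema: it walks any JSON document and lists each property path with its value kind. `JsonOutline` is a hypothetical helper, and it relies on `System.Text.Json`, which ships with .NET Core 3.0 and later and is available as a NuGet package for older targets.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

static class JsonOutline
{
    // Lists each property path with its JSON value kind, down to maxDepth levels.
    public static List<string> Describe(string json, int maxDepth = 3)
    {
        var lines = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            Walk(doc.RootElement, "$", 0, maxDepth, lines);
        }
        return lines;
    }

    private static void Walk(JsonElement element, string path, int depth, int maxDepth, List<string> lines)
    {
        lines.Add($"{path}: {element.ValueKind}");
        if (depth >= maxDepth)
            return;

        if (element.ValueKind == JsonValueKind.Object)
        {
            foreach (JsonProperty property in element.EnumerateObject())
                Walk(property.Value, path + "." + property.Name, depth + 1, maxDepth, lines);
        }
        else if (element.ValueKind == JsonValueKind.Array && element.GetArrayLength() > 0)
        {
            // Only the first array item is inspected to keep the outline short.
            Walk(element[0], path + "[0]", depth + 1, maxDepth, lines);
        }
    }
}
```

Passing the text of an output file to `JsonOutline.Describe` and printing the returned lines gives a quick view of how tables and cells are nested in that file.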
Sample code
// Requires: using System; using System.IO;

/// <summary>
/// Extract tabular data from a PDF.
/// </summary>
/// <param name="ServiceURL">URL endpoint for the PDF Converter service.</param>
/// <param name="sourceFileName">Source filename.</param>
/// <param name="targetFolder">Target folder to receive the output file.</param>
/// <param name="outputFileType">Output file type. Currently JSON only.</param>
/// <param name="languages">List of OCR languages.</param>
static void TestTableExtract(string ServiceURL, string sourceFileName, string targetFolder, string outputFileType, string languages = "eng")
{
    Console.WriteLine($"Extracting tables from {sourceFileName}");

    DocumentConverterServiceClient client = null;

    // Create an `OpenOptions` instance with minimum properties needed for file identification.
    OpenOptions openOptions = new OpenOptions();
    openOptions.FileExtension = Path.GetExtension(sourceFileName);
    openOptions.OriginalFileName = Path.GetFileName(sourceFileName);

    // Create a `TableExtractionSettings` object.
    TableExtractionSettings settings = new TableExtractionSettings();
    settings.DPI = "300";
    settings.SeparateTables = BooleanEnum.True;
    settings.EnableOrientationDetection = BooleanEnum.True;
    settings.EnableSkewDetection = BooleanEnum.True;
    settings.RenderFormFields = BooleanEnum.True;
    settings.OutputFileType = outputFileType;
    settings.OCRLanguage = languages;

    try
    {
        // Read the source file into a byte array.
        byte[] sourceFile = File.ReadAllBytes(sourceFileName);

        // Open the service and configure the bindings.
        client = OpenService(ServiceURL);

        // Carry out the extraction.
        BatchResult result = client.ExtractTables(sourceFile, openOptions, settings);

        if (result != null)
        {
            // Create the target folder if it does not exist.
            if (!Directory.Exists(targetFolder))
            {
                Directory.CreateDirectory(targetFolder);
            }
            Console.WriteLine($"Output to: {targetFolder}");

            // Get the filename.
            string filename = result.FileName;
            Console.WriteLine(filename);

            // Write the result to a file.
            File.WriteAllBytes(Path.Combine(targetFolder, filename), result.File);
        }
        else
        {
            Console.WriteLine("No result returned");
        }
    }
    finally
    {
        // Always close the service client, even if extraction fails.
        if (client != null)
        {
            CloseService(client);
        }
    }
}
Troubleshooting
Service connection error: Cannot connect to DCS
- Ensure DCS is running and accessible
- Verify the service URL in your code matches your DCS installation
- Check that no firewall is blocking the connection
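To separate network problems from service problems, a plain TCP probe against the host and port in the service URL can be a useful first step. The sketch below is an illustration only: `ServiceProbe` is a hypothetical helper, and a successful probe shows that something is listening on that port, not that DCS itself is healthy.

```csharp
using System;
using System.Net.Sockets;

static class ServiceProbe
{
    // Returns true if a TCP connection to the URL's host and port
    // succeeds within the timeout.
    public static bool CanReach(string serviceUrl, int timeoutMs = 3000)
    {
        Uri uri;
        if (!Uri.TryCreate(serviceUrl, UriKind.Absolute, out uri))
            return false;

        try
        {
            using (var tcp = new TcpClient())
            {
                // Wait returns false on timeout and throws if the connection is refused.
                return tcp.ConnectAsync(uri.Host, uri.Port).Wait(timeoutMs) && tcp.Connected;
            }
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```

If `ServiceProbe.CanReach(ServiceURL)` returns false, the cause is more likely a stopped service, a wrong URL, or a firewall rule than the extraction call itself.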
No tables extracted: Empty result or no output file
- Verify that the PDF contains actual tabular data, not just visual table layouts
- Check that the OCR language setting matches the document language
- Ensure the DPI setting is appropriate for your document quality (try 300 or higher)
- Enable orientation and skew detection for scanned documents
License error: Table extraction not available
- Verify that your DCS license includes table extraction functionality
- Check that the license hasn’t expired
- Ensure the service is licensed and activated
File access error: Permission denied
- Verify that the application has read access to the source PDF file
- Check that the target folder has write permissions
- Ensure the PDF file isn’t locked by other applications
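These checks can be run in code before calling the service. The following sketch uses only standard .NET file APIs; `FileAccessCheck` is a hypothetical helper name. Note that `CanWrite` creates the folder if it is missing, which matches what the sample code does before writing its output.

```csharp
using System;
using System.IO;

static class FileAccessCheck
{
    // True if the file exists and can be opened for reading right now.
    public static bool CanRead(string path)
    {
        try
        {
            using (new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read)) { }
            return true;
        }
        catch (Exception ex) when (ex is IOException || ex is UnauthorizedAccessException)
        {
            return false;
        }
    }

    // True if a file can be created in the folder. The probe file is
    // removed automatically when its stream is closed.
    public static bool CanWrite(string folder)
    {
        try
        {
            Directory.CreateDirectory(folder);
            string probe = Path.Combine(folder, Path.GetRandomFileName());
            using (new FileStream(probe, FileMode.CreateNew, FileAccess.Write,
                                  FileShare.None, 4096, FileOptions.DeleteOnClose)) { }
            return true;
        }
        catch (Exception ex) when (ex is IOException || ex is UnauthorizedAccessException)
        {
            return false;
        }
    }
}
```

A file held open exclusively by another application makes `CanRead` return false, which covers the locked-file case as well as missing read permissions.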
Poor extraction quality: Incomplete or inaccurate table data
- Increase the DPI setting for higher quality extraction (try 600 DPI for complex tables)
- Enable orientation detection if tables are rotated
- Enable skew detection for scanned documents
- Set the appropriate OCR language for non-English documents
- Consider using SeparateTables = BooleanEnum.False for complex multi-column layouts
Large file processing: Slow performance or timeouts
- For large PDF files, consider processing individual pages
- Increase timeout values for the service client if processing large files
- Monitor memory usage when processing multiple large files
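One lightweight way to see where time and memory go is to wrap the extraction call in a small measuring helper. The sketch below is an assumption-laden illustration rather than a DCS feature: `RunMetrics` is a hypothetical name, and `GC.GetTotalMemory` reports only the managed heap of the client process, not memory used by the service.

```csharp
using System;
using System.Diagnostics;

static class RunMetrics
{
    // Runs the action, then prints its duration and the change in managed heap size.
    public static T Measure<T>(string label, Func<T> action)
    {
        long before = GC.GetTotalMemory(false);
        Stopwatch watch = Stopwatch.StartNew();
        try
        {
            return action();
        }
        finally
        {
            watch.Stop();
            long deltaKb = (GC.GetTotalMemory(false) - before) / 1024;
            Console.WriteLine($"{label}: {watch.ElapsedMilliseconds} ms, managed heap change {deltaKb} KB");
        }
    }
}
```

In the sample code, the service call could be wrapped as `RunMetrics.Measure("ExtractTables", () => client.ExtractTables(sourceFile, openOptions, settings))`. Durations that approach the client timeout suggest raising the timeout or splitting the input.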
What’s next
Now that you can extract tables from PDFs with C#, explore these related document processing capabilities:
- Attachment extraction - Learn how to extract PDF attachments with C# for embedded file processing
- Python implementation - See how to extract PDF tables using Python to compare approaches across languages
- Complete C# guide - Review the document conversion with C# guide for additional processing capabilities