Extract PDF tables to JSON using Python

This guide explains how to extract tabular information from PDF documents using Python and Nutrient Document Converter Services (DCS). Table extraction is particularly useful for data analysis, reporting workflows, and document digitization processes.

The sample code in this guide can be run in any Python environment with access to the Zeep library. For other extraction capabilities, see the guide on extracting text using Python.

The Zeep library reads Web Services Description Language (WSDL) documents, which define how to call the web services and describe the data structures they return. DCS provides these WSDL definitions for table extraction and other operations.

Prerequisites

Before extracting tables from PDFs, ensure you have:

  • Python 3.x installed on your system
  • The Zeep library installed (pip install zeep)
  • Nutrient Document Converter Services running locally on port 41734
  • A valid DCS license that includes table extraction functionality
  • PDF files containing tabular data for testing
  • Basic understanding of Python programming and web services
  • Appropriate file system permissions for reading input files and writing output

For initial DCS setup with Python, refer to the guide on using Document Converter Services with Python.

WSDL

Zeep extracts the following WSDL definitions:

ExtractTables(inputFile: xsd:base64Binary, openOptions: ns2:OpenOptions, settings: ns2:TableExtractionSettings) -> ExtractTablesResult: ns2:BatchResult
...
ns2:TableExtractionSettings(RenderFormFields: ns3:BooleanEnum, EnableOrientationDetection: ns3:BooleanEnum, EnableSkewDetection: ns3:BooleanEnum, DPI: xsd:string, SeparateTables: ns3:BooleanEnum, OutputFileType: ns3:TableExtractionOutputType, OCRLanguage: xsd:string)
...
ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)
...
ns3:BooleanEnum(ns3:BooleanEnum)
...
ns3:TableExtractionOutputType(ns3:TableExtractionOutputType)
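
A listing like the one above can be reproduced from Python. The following is a minimal sketch, assuming the zeep package is installed and DCS is reachable at the WSDL URL used in this guide's sample code. The helper name dump_wsdl is hypothetical.

```python
def dump_wsdl(wsdl_url):
    """Print the prefixes, types, and operations that Zeep reads from a WSDL."""
    # Imported inside the function so the helper can be defined
    # even before Zeep is needed (assumption: zeep is installed).
    import zeep

    client = zeep.Client(wsdl=wsdl_url)
    # Writes the namespace prefixes, global types, and operation
    # signatures to standard output.
    client.wsdl.dump()


# Example call (assumes DCS is running locally, as in the sample code):
# dump_wsdl("http://localhost:41734/Muhimbi.DocumentConverter.WebService/?WSDL")
```

The output is long, so searching it for ExtractTables and TableExtractionSettings is usually the quickest way to find the relevant entries.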

The ExtractTables method requires three parameters:

  • inputFile: xsd:base64Binary
  • openOptions: ns2:OpenOptions
  • settings: ns2:TableExtractionSettings

Use a Base64-encoded binary string for inputFile, as defined by the W3C XML schema.
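
The encoding step can be sketched with the standard library alone. The helper name encode_file_as_base64 is hypothetical; it mirrors what the sample code in this guide does before calling ExtractTables.

```python
import base64


def encode_file_as_base64(path):
    """Return the contents of a file as a Base64-encoded ASCII string."""
    with open(path, "rb") as reader:
        raw_bytes = reader.read()
    # b64encode returns bytes; decode to str because Base64 output is pure ASCII.
    return base64.b64encode(raw_bytes).decode("ascii")
```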

Instantiate openOptions and settings using Zeep type factories. Both types belong to the ns2 namespace.

The OpenOptions type requires minimal setup: set the original file name and the file extension.

TableExtractionSettings supports the following configuration:

  • Multiple boolean flags using the BooleanEnum type (ns3)
  • Output format using TableExtractionOutputType (ns3)
  • OCR language
  • DPI
  • Table separation behavior

The method returns a BatchResult. In the sample code in this guide, the extracted JSON data is read from its File field, which carries the content as Base64-encoded binary.

Sample code

The following Python code demonstrates how to extract tables from a PDF file:

import zeep
import base64
print("Extract tables from a PDF")
# Source file path.
source_file = "Three-in-one invoice.pdf"
# Target file path.
target_file = "Three-in-one invoice tables.json"
# OCR languages (multiple languages can be included, separated using the '+' character; for example eng+fr).
ocr_languages = "eng"
# Service URL.
service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"
# WSDL URL.
wsdl_url = service_url + "?WSDL"
# Construct the header.
header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()
            ),
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()
            ),
        ]
    ),
)
# Create a header object.
header_value = header(Action=service_url, To=service_url)
# Create client.
client = zeep.Client(wsdl=wsdl_url)
# Create a type factory to construct objects with the prefix ns2 (see the WSDL).
factory2 = client.type_factory("ns2")
# Create a type factory to construct objects with the prefix ns3 (see the WSDL).
factory3 = client.type_factory("ns3")
# Create the BooleanEnum values (this sample only uses the true value).
boolean_enum_true = factory3.BooleanEnum("True")
boolean_enum_false = factory3.BooleanEnum("False")
# Create the OpenOptions object with minimum settings.
open_options = factory2.OpenOptions(OriginalFileName=source_file, FileExtension="pdf")
# Create the output file type.
output_file_type = factory3.TableExtractionOutputType("JSON")
# Create the TableExtractionSettings object with minimum settings.
table_extraction_settings = factory2.TableExtractionSettings(
    DPI="300",
    SeparateTables=boolean_enum_true,
    EnableOrientationDetection=boolean_enum_true,
    EnableSkewDetection=boolean_enum_true,
    RenderFormFields=boolean_enum_true,
    OutputFileType=output_file_type,
    OCRLanguage=ocr_languages,
)
# Read the file contents and encode them as a Base64 string.
with open(source_file, "rb") as filereader:
    source_file_content = base64.b64encode(filereader.read()).decode('utf-8')
# Extract the tables.
result = client.service.ExtractTables(source_file_content, open_options, table_extraction_settings)
# Write the output file.
with open(target_file, "wb") as f:
    f.write(result.File)
print("Done")

Output format

The table extraction service supports the following output format:

JSON format

  • Structured data with table metadata and cell content
  • Includes table positioning and formatting information
  • Suitable for programmatic processing and integration
  • File extension: .json
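
Before writing code against the output, it can help to look at its top-level shape. The sketch below deliberately assumes nothing about the schema, since this guide does not document the JSON field names; describe_json_file is a hypothetical helper.

```python
import json


def describe_json_file(path):
    """Load a JSON file and report its top-level shape without assuming a schema."""
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte order mark.
    with open(path, "r", encoding="utf-8-sig") as reader:
        data = json.load(reader)
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data.keys())}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Example call with the target file from the sample code:
# print(describe_json_file("Three-in-one invoice tables.json"))
```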

To set the output format, modify the OutputFileType parameter:

# For JSON output
output_file_type = factory3.TableExtractionOutputType("JSON")

Troubleshooting

Service connection error: Cannot connect to DCS

  • Ensure DCS is running on localhost:41734
  • Check that no firewall is blocking the connection
  • Verify the service URL in your code matches your DCS installation
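
A quick way to separate network problems from code problems is to test the port directly. The sketch below uses only the standard library; is_service_reachable is a hypothetical helper, and the defaults assume the host and port used in this guide. A successful TCP connection shows that something is listening, not that it is DCS.

```python
import socket


def is_service_reachable(host="localhost", port=41734, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call:
# if not is_service_reachable():
#     print("Nothing is listening on localhost:41734")
```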

File access error: File not found or permission denied

  • Verify that Python has read access to the source PDF file
  • Check that the output directory has write permissions
  • Ensure the source file path is correct and the file exists
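
These checks can be run before calling the service. The following is a minimal sketch using the standard library; check_paths is a hypothetical helper, and os.access reports permissions for the current user only.

```python
import os


def check_paths(source_file, target_file):
    """Return a list of problems found with the input and output paths."""
    problems = []
    if not os.path.isfile(source_file):
        problems.append(f"Source file not found: {source_file}")
    elif not os.access(source_file, os.R_OK):
        problems.append(f"Source file is not readable: {source_file}")

    # Resolve the output directory; a bare file name maps to the working directory.
    target_dir = os.path.dirname(os.path.abspath(target_file))
    if not os.path.isdir(target_dir):
        problems.append(f"Output directory not found: {target_dir}")
    elif not os.access(target_dir, os.W_OK):
        problems.append(f"Output directory is not writable: {target_dir}")
    return problems


# Example call with the paths from the sample code:
# for problem in check_paths("Three-in-one invoice.pdf", "Three-in-one invoice tables.json"):
#     print(problem)
```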

No tables extracted: Empty result or no output file

  • Verify that the PDF contains actual tabular data, not just visual table layouts
  • Check that the OCR language setting matches the document language
  • Ensure the DPI setting is appropriate for your document quality (try 300 or higher)
  • Enable orientation and skew detection for scanned documents

License error: Table extraction not available

  • Verify that your DCS license includes table extraction functionality
  • Check that the license hasn’t expired
  • Ensure the service is licensed and activated

Poor extraction quality: Incomplete or inaccurate table data

  • Increase the DPI setting for higher quality extraction (try 600 DPI for complex tables)
  • Enable orientation detection if tables are rotated
  • Enable skew detection for scanned documents
  • Set the appropriate OCR language for non-English documents
  • Consider using SeparateTables=False for complex multi-column layouts
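
The adjustments above can be collected in one place. The sketch below builds the keyword arguments for TableExtractionSettings from the field names shown in the WSDL section; build_settings_kwargs is a hypothetical helper, and the enum values are passed in so that they can come from the type factories in the sample code.

```python
def build_settings_kwargs(true_value, false_value, output_type,
                          languages=("eng",), dpi=600, separate_tables=True):
    """Collect keyword arguments for TableExtractionSettings."""
    return {
        # The WSDL declares DPI as xsd:string, so convert numbers to text.
        "DPI": str(dpi),
        "SeparateTables": true_value if separate_tables else false_value,
        "EnableOrientationDetection": true_value,
        "EnableSkewDetection": true_value,
        "RenderFormFields": true_value,
        "OutputFileType": output_type,
        # Multiple OCR languages are joined with '+', for example "eng+fr".
        "OCRLanguage": "+".join(languages),
    }


# Usage with the objects from the sample code:
# kwargs = build_settings_kwargs(boolean_enum_true, boolean_enum_false,
#                                output_file_type, languages=["eng", "fr"],
#                                separate_tables=False)
# table_extraction_settings = factory2.TableExtractionSettings(**kwargs)
```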

Large file processing: Slow performance or timeouts

  • For large PDF files, consider processing individual pages
  • Increase timeout values in your HTTP client configuration
  • Monitor memory usage when processing multiple large files
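
On the client side, Zeep timeouts are set on its Transport object. The following is a minimal sketch, assuming the zeep package is installed; create_client_with_timeouts is a hypothetical helper and the values are placeholders, not recommendations. Timeouts enforced by the DCS service itself are configured separately and are not covered here.

```python
def create_client_with_timeouts(wsdl_url, load_timeout=60, operation_timeout=600):
    """Create a Zeep client with explicit timeouts, in seconds."""
    # Imported inside the function so the helper can be defined
    # even before Zeep is needed (assumption: zeep is installed).
    from zeep import Client
    from zeep.transports import Transport

    # timeout applies to loading the WSDL and schema documents;
    # operation_timeout applies to service calls such as ExtractTables.
    transport = Transport(timeout=load_timeout, operation_timeout=operation_timeout)
    return Client(wsdl=wsdl_url, transport=transport)


# Example call (assumes DCS is running locally, as in the sample code):
# client = create_client_with_timeouts(
#     "http://localhost:41734/Muhimbi.DocumentConverter.WebService/?WSDL"
# )
```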

What’s next

Now that you can extract tables from PDFs with Python, explore these related document processing capabilities: