Extract PDF tables to JSON using Python

This guide explains how to extract tabular information from PDF documents using Python and Nutrient Document Converter Services (DCS). Table extraction is particularly useful for data analysis, reporting workflows, and document digitization processes.

The sample code in this guide can be run in any Python environment with access to the Zeep library. For other extraction capabilities, see extract text using Python.

The Zeep library reads Web Services Description Language (WSDL) documents, which define how to call the web services and describe the data structures they return. DCS provides these WSDL definitions for table extraction and other operations.

Prerequisites

Before extracting tables from PDFs, ensure you have:

Python 3.x installed on your system
The Zeep library installed (pip install zeep)
Nutrient Document Converter Services running locally on port 41734
Valid DCS license that includes table extraction functionality
PDF files containing tabular data for testing
Basic understanding of Python programming and web services
Appropriate file system permissions for reading input files and writing output

For initial DCS setup with Python, refer to the "Using Document Converter Services with Python" guide.

WSDL

Zeep extracts the following WSDL definitions:

ExtractTables(inputFile: xsd:base64Binary, openOptions: ns2:OpenOptions, settings: ns2:TableExtractionSettings) -> ExtractTablesResult: ns2:BatchResult
...
ns2:TableExtractionSettings(RenderFormFields: ns3:BooleanEnum, EnableOrientationDetection: ns3:BooleanEnum, EnableSkewDetection: ns3:BooleanEnum, DPI: xsd:string, SeparateTables: ns3:BooleanEnum, OutputFileType: ns3:TableExtractionOutputType, OCRLanguage: xsd:string)
...
ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)
...
ns3:BooleanEnum(ns3:BooleanEnum)
...
ns3:TableExtractionOutputType(ns3:TableExtractionOutputType)
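
A listing like the one above can be reproduced from Python. The following is a minimal sketch, assuming Zeep is installed and DCS is reachable at the service URL used in this guide. The helper name dump_service_definitions is hypothetical and not part of DCS or Zeep.

```python
def dump_service_definitions(
    wsdl_url="http://localhost:41734/Muhimbi.DocumentConverter.WebService/?WSDL",
):
    """Print the prefixes, types, and operations Zeep finds in the WSDL."""
    # Imported lazily so this module still loads where Zeep is not installed.
    import zeep

    client = zeep.Client(wsdl=wsdl_url)
    # wsdl.dump() writes the namespace prefixes (ns2, ns3, ...), the global
    # types, and the service operations to standard output.
    client.wsdl.dump()
```

Calling dump_service_definitions() with the service running should print output that includes the ExtractTables signature. The prefix numbers (ns2, ns3) are assigned by Zeep, so it is worth checking them against your own output before relying on them.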

The ExtractTables method requires three parameters:

inputFile: xsd:base64Binary
openOptions: ns2:OpenOptions
settings: ns2:TableExtractionSettings

Use a Base64-encoded binary string for inputFile, as defined by the W3C XML schema.
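
As a small illustration of this encoding step, the sketch below reads a file and returns its content as a Base64 string, mirroring what the sample code in this guide does. The helper name encode_file_base64 is an assumption made for this example.

```python
import base64
from pathlib import Path


def encode_file_base64(path):
    """Return the file's bytes as a Base64-encoded ASCII string."""
    raw = Path(path).read_bytes()
    # b64encode returns bytes; Base64 output is always plain ASCII.
    return base64.b64encode(raw).decode("ascii")
```

The returned string can be decoded again with base64.b64decode to verify that the round trip preserves the original bytes.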

Instantiate openOptions and settings using Zeep type factories. Both types belong to the ns2 namespace.

The OpenOptions type requires minimal setup: set the original file name and the file extension.

TableExtractionSettings supports the following configuration:

Multiple boolean flags using the BooleanEnum type (ns3)
Output format using TableExtractionOutputType (ns3)
OCR language
DPI
Table separation behavior
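
The options above can be collected in one place before calling the service. The following is a minimal sketch that builds a plain dictionary keyed by the TableExtractionSettings field names from the WSDL. It assumes that the BooleanEnum values are the strings "True" and "False" (as used in the sample code) and relies on Zeep generally accepting a dictionary in place of a factory-built object. The function name and its parameters are hypothetical, and the approach has not been verified against DCS.

```python
def build_table_extraction_settings(
    dpi=300,
    ocr_languages=("eng",),
    separate_tables=True,
    detect_orientation=True,
    detect_skew=True,
    render_form_fields=True,
):
    """Return a dict keyed by the TableExtractionSettings field names."""

    def as_boolean_enum(flag):
        # Assumption: BooleanEnum is passed as the string "True" or "False".
        return "True" if flag else "False"

    if isinstance(ocr_languages, str):
        ocr_languages = (ocr_languages,)

    return {
        "DPI": str(dpi),  # declared as xsd:string in the WSDL
        "OCRLanguage": "+".join(ocr_languages),  # multiple languages use '+'
        "SeparateTables": as_boolean_enum(separate_tables),
        "EnableOrientationDetection": as_boolean_enum(detect_orientation),
        "EnableSkewDetection": as_boolean_enum(detect_skew),
        "RenderFormFields": as_boolean_enum(render_form_fields),
        "OutputFileType": "JSON",
    }
```

For example, build_table_extraction_settings(dpi=600, ocr_languages=["eng", "fr"]) yields a DPI of "600" and an OCRLanguage of "eng+fr". Keeping the conversion to strings in one function avoids passing an integer where the WSDL expects xsd:string.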

The method returns a BatchResult (ns2) whose File member holds the extracted data in JSON format, transmitted as Base64-encoded binary.

Sample code

The following Python code demonstrates how to extract tables from a PDF file:

import zeep
import base64

print("Extract tables from a PDF")

# Source file path.
source_file = "Three-in-one invoice.pdf"
# Target file path.
target_file = "Three-in-one invoice tables.json"
# OCR languages (multiple languages can be included, separated using the '+' character; for example eng+fr).
ocr_languages = "eng"

# Service URL.
service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"
# WSDL URL.
wsdl_url = service_url + "?WSDL"

# Construct the header.
header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()
            ),
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()
            ),
        ]
    ),
)
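# Create a header object. Note that this sample does not pass header_value to the service call;
# if your endpoint requires these headers, Zeep accepts them via _soapheaders=[header_value].
header_value = header(Action=service_url, To=service_url)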
# Create client.
client = zeep.Client(wsdl=wsdl_url)

# Create a type factory to construct objects with the prefix ns2 (see the WSDL).
factory2 = client.type_factory("ns2")
# Create a type factory to construct objects with the prefix ns3 (see the WSDL).
factory3 = client.type_factory("ns3")

# Create the BooleanEnum types (only need true for this sample).
boolean_enum_true = factory3.BooleanEnum("True")
boolean_enum_false = factory3.BooleanEnum("False")

# Create the OpenOptions object with minimum settings.
open_options = factory2.OpenOptions(OriginalFileName=source_file, FileExtension="pdf")

# Create the output file type.
output_file_type = factory3.TableExtractionOutputType("JSON")

# Create the TableExtractionSettings object.
table_extraction_settings = factory2.TableExtractionSettings(
    DPI="300",
    SeparateTables=boolean_enum_true,
    EnableOrientationDetection=boolean_enum_true,
    EnableSkewDetection=boolean_enum_true,
    RenderFormFields=boolean_enum_true,
    OutputFileType=output_file_type,
    OCRLanguage=ocr_languages,
)

# Read the source file and Base64-encode its contents.
with open(source_file, "rb") as filereader:
    source_file_content = base64.b64encode(filereader.read()).decode('utf-8')

# Extract the tables.
result = client.service.ExtractTables(source_file_content, open_options, table_extraction_settings)

# Write the output file.
with open(target_file, "wb") as f:
    f.write(result.File)

print("Done")

Output format

The table extraction service supports the following output format:

JSON format

Structured data with table metadata and cell content
Includes table positioning and formatting information
Suitable for programmatic processing and integration
File extension: .json

To set the output format, modify the OutputFileType parameter:

# For JSON output
output_file_type = factory3.TableExtractionOutputType("JSON")
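
Once the JSON output has been written, it can be loaded for further processing. The sketch below assumes the output is UTF-8 encoded and makes no assumption about the JSON schema, which this guide does not describe; it only reports the top-level shape. Both function names are hypothetical.

```python
import json


def load_extracted_json(payload):
    """Parse the service output (bytes or str) into Python objects."""
    if isinstance(payload, (bytes, bytearray)):
        # utf-8-sig also strips a leading byte order mark if one is present.
        payload = bytes(payload).decode("utf-8-sig")
    return json.loads(payload)


def describe_top_level(data):
    """Summarize the top-level shape without assuming a schema."""
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} item(s)"
    return f"scalar of type {type(data).__name__}"
```

A typical use is to read the target file in binary mode, pass the bytes to load_extracted_json, and print describe_top_level on the result to see which keys are available before writing code that depends on them.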

Troubleshooting

Service connection error: Cannot connect to DCS

Ensure DCS is running on localhost:41734
Check that no firewall is blocking the connection
Verify the service URL in your code matches your DCS installation
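
A quick way to test the first two points is to attempt a plain TCP connection to the service port. This is a minimal sketch using only the standard library; the function name is hypothetical, and a successful connection only shows that something is listening on the port, not that DCS itself is healthy.

```python
import socket


def is_port_open(host="localhost", port=41734, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

If is_port_open() returns False on the machine running DCS, the service is probably stopped or listening on a different port. If it returns True locally but False from another machine, a firewall rule is a likely cause.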

File access error: File not found or permission denied

Verify that Python has read access to the source PDF file
Check that the output directory has write permissions
Ensure the source file path is correct and the file exists
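
These checks can be run before calling the service so that path problems surface with a clear message. The following is a minimal sketch using the standard library; the function name check_paths is an assumption made for this example.

```python
import os
from pathlib import Path


def check_paths(source_file, target_file):
    """Return a list of problems; an empty list means the paths look usable."""
    problems = []

    source = Path(source_file)
    if not source.is_file():
        problems.append(f"Source file not found: {source}")
    elif not os.access(source, os.R_OK):
        problems.append(f"Source file is not readable: {source}")

    # For a bare file name, parent is ".", the current working directory.
    target_dir = Path(target_file).parent
    if not target_dir.is_dir():
        problems.append(f"Output directory does not exist: {target_dir}")
    elif not os.access(target_dir, os.W_OK):
        problems.append(f"Output directory is not writable: {target_dir}")

    return problems
```

With the file names from the sample code, check_paths(source_file, target_file) can be called right after the paths are defined, and the script can stop early if the returned list is not empty.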

No tables extracted: Empty result or no output file

Verify that the PDF contains actual tabular data, not just visual table layouts
Check that the OCR language setting matches the document language
Ensure the DPI setting is appropriate for your document quality (try 300 or higher)
Enable orientation and skew detection for scanned documents

License error: Table extraction not available

Verify that your DCS license includes table extraction functionality
Check that the license hasn’t expired
Ensure the service is licensed and activated

Poor extraction quality: Incomplete or inaccurate table data

Increase the DPI setting for higher quality extraction (try 600 DPI for complex tables)
Enable orientation detection if tables are rotated
Enable skew detection for scanned documents
Set the appropriate OCR language for non-English documents
Consider setting SeparateTables to False (boolean_enum_false in the sample code) for complex multi-column layouts

Large file processing: Slow performance or timeouts

For large PDF files, consider processing individual pages
Increase timeout values in your HTTP client configuration
Monitor memory usage when processing multiple large files
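
With Zeep, HTTP timeouts are configured on the transport that the client uses. The sketch below assumes Zeep is installed; the function name and the timeout values are illustrative, not recommendations from DCS.

```python
def create_client_with_timeouts(wsdl_url, load_timeout=60, operation_timeout=600):
    """Create a Zeep client with explicit HTTP timeouts (in seconds)."""
    # Imported lazily so this module still loads where Zeep is not installed.
    from zeep import Client
    from zeep.transports import Transport

    # timeout applies to loading the WSDL and schema documents;
    # operation_timeout applies to the service calls themselves.
    transport = Transport(timeout=load_timeout, operation_timeout=operation_timeout)
    return Client(wsdl=wsdl_url, transport=transport)
```

The returned client can replace the zeep.Client(wsdl=wsdl_url) call in the sample code. If timeouts persist after raising the client-side value, the limit may be enforced by the server or by a proxy between the client and DCS.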

What’s next

Now that you can extract tables from PDFs with Python, explore these related document processing capabilities:

Text extraction - Learn how to extract text using Python to analyze document content beyond tables
C# implementation - Compare approaches with extract tabular data from PDFs using C# for cross-language insights
Complete Python setup - Review the using Document Converter Services with Python guide for more features
