Extract PDF tables to JSON using Python
This guide explains how to extract tabular information from PDF documents using Python and Nutrient Document Converter Services (DCS). Table extraction is particularly useful for data analysis, reporting workflows, and document digitization processes.
The sample code in this guide can be run in any Python environment with access to the Zeep library. For other extraction capabilities, see extract text using Python.
The Zeep library is a SOAP client that reads a Web Services Description Language (WSDL) document, which defines how to call the web service and describes the data structures it returns. Nutrient Document Converter Services (DCS) publishes these WSDL definitions for table extraction and other operations.
Prerequisites
Before extracting tables from PDFs, ensure you have:
- Python 3.x installed on your system
- The Zeep library installed (pip install zeep)
- Nutrient Document Converter Services running locally on port 41734
- Valid DCS license that includes table extraction functionality
- PDF files containing tabular data for testing
- Basic understanding of Python programming and web services
- Appropriate file system permissions for reading input files and writing output
For initial DCS setup with Python, refer to the using Document Converter Services with Python guide.
WSDL
Zeep extracts the following WSDL definitions:
ExtractTables(inputFile: xsd:base64Binary, openOptions: ns2:OpenOptions, settings: ns2:TableExtractionSettings) -> ExtractTablesResult: ns2:BatchResult
...
ns2:TableExtractionSettings(RenderFormFields: ns3:BooleanEnum, EnableOrientationDetection: ns3:BooleanEnum, EnableSkewDetection: ns3:BooleanEnum, DPI: xsd:string, SeparateTables: ns3:BooleanEnum, OutputFileType: ns3:TableExtractionOutputType, OCRLanguage: xsd:string)
...
ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)
...
ns3:BooleanEnum(ns3:BooleanEnum)
...
ns3:TableExtractionOutputType(ns3:TableExtractionOutputType)
The ExtractTables method requires three parameters:
- inputFile: xsd:base64Binary
- openOptions: ns2:OpenOptions
- settings: ns2:TableExtractionSettings
Use a Base64-encoded binary string for inputFile, as defined by the W3C XML schema.
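As a minimal sketch of the encoding step, the snippet below converts raw bytes into the Base64 text form used by xsd:base64Binary and checks that decoding restores the original bytes. The helper name encode_for_input_file and the sample bytes are hypothetical and serve only as an illustration.

```python
import base64


def encode_for_input_file(raw: bytes) -> str:
    """Return the Base64 text representation of raw file bytes."""
    return base64.b64encode(raw).decode("ascii")


# Illustrative bytes only; a real call would read the PDF from disk.
sample = b"%PDF-1.7 example"
encoded = encode_for_input_file(sample)

# Decoding restores the original bytes, which confirms the round trip.
assert base64.b64decode(encoded) == sample
print(encoded[:8])
```

Base64 output is pure ASCII, so decoding the encoded bytes with "ascii" (or "utf-8", as in the sample code) yields the same string.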
Instantiate openOptions and settings using Zeep type factories. Both types belong to the ns2 namespace.
The OpenOptions type requires minimal setup: set the file name and extension.
TableExtractionSettings supports the following configuration:
- Multiple boolean flags using the BooleanEnum type (ns3)
- Output format using the TableExtractionOutputType type (ns3)
- OCR language
- DPI
- Table separation behavior
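The configuration options above can be collected as plain values before they are wrapped in Zeep types. The following is a sketch under stated assumptions: build_settings_values is a hypothetical helper, not part of DCS or Zeep, and the defaults simply mirror the values used in this guide's sample. The field names come from the WSDL listing, where DPI and OCRLanguage are strings.

```python
def build_settings_values(dpi=300, languages=("eng",), separate_tables=True,
                          detect_orientation=True, detect_skew=True,
                          render_form_fields=True):
    """Collect plain values keyed by the TableExtractionSettings field names."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    if not languages:
        raise ValueError("at least one OCR language is required")

    def flag(value):
        # The sample builds BooleanEnum from the strings "True" and "False".
        return "True" if value else "False"

    return {
        "DPI": str(dpi),                     # xsd:string in the WSDL
        "OCRLanguage": "+".join(languages),  # for example "eng+fr"
        "SeparateTables": flag(separate_tables),
        "EnableOrientationDetection": flag(detect_orientation),
        "EnableSkewDetection": flag(detect_skew),
        "RenderFormFields": flag(render_form_fields),
        "OutputFileType": "JSON",
    }


values = build_settings_values(dpi=600, languages=("eng", "fr"))
print(values["DPI"], values["OCRLanguage"])
```

With a Zeep client, the flag strings would be passed through the ns3 BooleanEnum factory and the output type through the ns3 TableExtractionOutputType factory before constructing the ns2 TableExtractionSettings object.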
The method returns a BatchResult object (ns2). Its File field contains the extracted data in JSON format, transmitted as a Base64-encoded binary value.
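Once the extracted data is available as bytes, it can be loaded with the standard json module. This is a rough sketch that assumes the payload is UTF-8 encoded; the helper name load_tables_json and the example payload are hypothetical, and the real JSON schema is defined by the service rather than by this snippet.

```python
import json


def load_tables_json(payload: bytes):
    """Decode JSON bytes; utf-8-sig tolerates an optional byte order mark."""
    return json.loads(payload.decode("utf-8-sig"))


# Hypothetical payload used only to exercise the helper.
example = b'{"tables": []}'
data = load_tables_json(example)
print(type(data).__name__)
```

Inspecting the top-level type and keys of the parsed object is a reasonable first step before writing code that depends on a particular structure.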
Sample code
The following Python code demonstrates how to extract tables from a PDF file:
import zeep
import base64

print("Extract tables from a PDF")

# Source file path.
source_file = "Three-in-one invoice.pdf"
# Target file path.
target_file = "Three-in-one invoice tables.json"
# OCR languages (multiple languages can be included, separated using the '+' character; for example, eng+fr).
ocr_languages = "eng"

# Service URL.
service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"
# WSDL URL.
wsdl_url = service_url + "?WSDL"

# Construct the header.
header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element("{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()),
            zeep.xsd.Element("{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()),
        ]
    ),
)
# Create a header object.
header_value = header(Action=service_url, To=service_url)
# Create client.
client = zeep.Client(wsdl=wsdl_url)

# Create a factory type to construct objects with the prefix ns2 (see the WSDL).
factory2 = client.type_factory("ns2")
# Create a factory type to construct objects with the prefix ns3 (see the WSDL).
factory3 = client.type_factory("ns3")

# Create the BooleanEnum types (only need true for this sample).
boolean_enum_true = factory3.BooleanEnum("True")
boolean_enum_false = factory3.BooleanEnum("False")

# Create the OpenOptions object with minimum settings.
open_options = factory2.OpenOptions(OriginalFileName=source_file, FileExtension="pdf")

# Create the output file type.
output_file_type = factory3.TableExtractionOutputType("JSON")

# Create the TableExtractionSettings object with minimum settings.
table_extraction_settings = factory2.TableExtractionSettings(
    DPI="300",
    SeparateTables=boolean_enum_true,
    EnableOrientationDetection=boolean_enum_true,
    EnableSkewDetection=boolean_enum_true,
    RenderFormFields=boolean_enum_true,
    OutputFileType=output_file_type,
    OCRLanguage=ocr_languages,
)

# Read the source file and Base64-encode its contents.
with open(source_file, "rb") as filereader:
    source_file_content = base64.b64encode(filereader.read()).decode("utf-8")

# Extract the tables.
result = client.service.ExtractTables(source_file_content, open_options, table_extraction_settings)

# Write the output file.
with open(target_file, "wb") as f:
    f.write(result.File)

print("Done")
Output format
The table extraction service supports the following output format:
JSON format
- Structured data with table metadata and cell content
- Includes table positioning and formatting information
- Suitable for programmatic processing and integration
- File extension: .json
To set the output format, modify the OutputFileType parameter:
# For JSON output
output_file_type = factory3.TableExtractionOutputType("JSON")
Troubleshooting
Service connection error: Cannot connect to DCS
- Ensure DCS is running on localhost:41734
- Check that no firewall is blocking the connection
- Verify the service URL in your code matches your DCS installation
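One way to narrow down a connection problem is to test whether anything is listening on the DCS port before involving Zeep at all. The sketch below uses only the standard library; the function name dcs_port_is_open is hypothetical, and the host and port defaults are taken from this guide. An open port shows only that something accepts connections there, not that the service itself is healthy.

```python
import socket


def dcs_port_is_open(host="localhost", port=41734, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    state = "reachable" if dcs_port_is_open() else "not reachable"
    print(f"DCS port 41734 on localhost is {state}")
```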
File access error: File not found or permission denied
- Verify that Python has read access to the source PDF file
- Check that the output directory has write permissions
- Ensure the source file path is correct and the file exists
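The file checks above can be run as a small preflight step before calling the service. This is an illustrative sketch: the preflight function is a hypothetical helper, and os.access reports permissions for the current user only, so results can differ on network shares or when running with elevated privileges.

```python
import os
from pathlib import Path


def preflight(source_file, target_file):
    """Return a list of problems; an empty list means the paths look usable."""
    problems = []

    source = Path(source_file)
    if not source.is_file():
        problems.append(f"source file not found: {source}")
    elif not os.access(source, os.R_OK):
        problems.append(f"source file is not readable: {source}")

    # The output file is created in the parent directory of the target path.
    out_dir = Path(target_file).resolve().parent
    if not out_dir.is_dir():
        problems.append(f"output directory not found: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    return problems


if __name__ == "__main__":
    for problem in preflight("Three-in-one invoice.pdf", "Three-in-one invoice tables.json"):
        print(problem)
```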
No tables extracted: Empty result or no output file
- Verify that the PDF contains actual tabular data, not just visual table layouts
- Check that the OCR language setting matches the document language
- Ensure the DPI setting is appropriate for your document quality (try 300 or higher)
- Enable orientation and skew detection for scanned documents
License error: Table extraction not available
- Verify that your DCS license includes table extraction functionality
- Check that the license hasn’t expired
- Ensure the service is licensed and activated
Poor extraction quality: Incomplete or inaccurate table data
- Increase the DPI setting for higher quality extraction (try 600 DPI for complex tables)
- Enable orientation detection if tables are rotated
- Enable skew detection for scanned documents
- Set the appropriate OCR language for non-English documents
- Consider using SeparateTables=False for complex multi-column layouts
Large file processing: Slow performance or timeouts
- For large PDF files, consider processing individual pages
- Increase timeout values in your HTTP client configuration
- Monitor memory usage when processing multiple large files
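For the timeout suggestion, Zeep accepts a custom Transport when the client is created. The sketch below assumes Zeep is installed; make_client is a hypothetical helper and the timeout values are arbitrary starting points, not recommendations from DCS. In Zeep, the Transport timeout argument applies to loading the WSDL and XSD documents, while operation_timeout applies to the service calls themselves.

```python
def make_client(wsdl_url, load_timeout=30, operation_timeout=600):
    """Create a Zeep client with explicit timeouts (in seconds)."""
    # Imported here so this module can be loaded even where Zeep is missing.
    from zeep import Client
    from zeep.transports import Transport

    transport = Transport(timeout=load_timeout, operation_timeout=operation_timeout)
    return Client(wsdl=wsdl_url, transport=transport)
```

A client built this way could replace the zeep.Client(wsdl=wsdl_url) call in the sample code, for example make_client(wsdl_url, operation_timeout=1200) for very large documents.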
What’s next
Now that you can extract tables from PDFs with Python, explore these related document processing capabilities:
- Text extraction - Learn how to extract text using Python to analyze document content beyond tables
- C# implementation - Compare approaches with extract tabular data from PDFs using C# for cross-language insights
- Complete Python setup - Review the using Document Converter Services with Python guide for more features