---
title: "Extract PDF tables to JSON using Python | Nutrient DCS"
canonical_url: "https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-tables/extract-table-using-python/"
md_url: "https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-tables/extract-table-using-python.md"
last_updated: "2026-05-20T19:49:34.743Z"
description: "Extract tables from PDF files to JSON format using Python and Nutrient Document Converter Services. Complete code example with JSON output options."
---

This guide explains how to extract tabular information from PDF documents using Python and Nutrient Document Converter Services (DCS). Table extraction is particularly useful for data analysis, reporting workflows, and document digitization processes.

The sample code in this guide can be run in any Python environment with access to the [Zeep library](https://docs.python-zeep.org/en/master/in_depth.html#). For other extraction capabilities, see [extract text using Python](https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-text-using-python.md).

Zeep is a SOAP client that reads a service's Web Services Description Language (WSDL) definition, which defines how to call the web service and describes the data structures it returns. DCS provides WSDL definitions for table extraction, text extraction, and other operations.

## Prerequisites

Before extracting tables from PDFs, ensure you have:

- Python 3.x installed on your system

- The Zeep library installed (`pip install zeep`)

- Nutrient Document Converter Services running locally on port 41734

- Valid DCS license that includes table extraction functionality

- PDF files containing tabular data for testing

- Basic understanding of Python programming and web services

- Appropriate file system permissions for reading input files and writing output

For initial DCS setup with Python, refer to the [using Document Converter Services with Python](https://www.nutrient.io/guides/document-converter/document-converter-services/dcs-with-python.md) guide.

## WSDL

Zeep extracts the following WSDL definitions:

```python

ExtractTables(inputFile: xsd:base64Binary, openOptions: ns2:OpenOptions, settings: ns2:TableExtractionSettings) -> ExtractTablesResult: ns2:BatchResult...
ns2:TableExtractionSettings(RenderFormFields: ns3:BooleanEnum, EnableOrientationDetection: ns3:BooleanEnum, EnableSkewDetection: ns3:BooleanEnum, DPI: xsd:string, SeparateTables: ns3:BooleanEnum, OutputFileType: ns3:TableExtractionOutputType, OCRLanguage: xsd:string)...
ns2:OpenOptions(UserName: xsd:string, Password: xsd:string, FileExtension: xsd:string, OriginalFileName: xsd:string, RefreshContent: xsd:boolean, AllowExternalConnections: xsd:boolean, AllowMacros: ns3:MacroSecurityOption, SystemSettings: ns5:SystemSettings, SubscriptionSettings: ns9:SubscriptionSettings)...
ns3:BooleanEnum(ns3:BooleanEnum)...
ns3:TableExtractionOutputType(ns3:TableExtractionOutputType)

```
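
To reproduce a listing like the one above, one option is Zeep's `dump()` method on the loaded WSDL document. The following is a minimal sketch, assuming DCS is reachable at the URL used in this guide; the function name `dump_wsdl` is hypothetical:

```python
def dump_wsdl(wsdl_url):
    """Print the prefixes, types, and operations Zeep finds in a WSDL."""
    # Imported inside the function so a missing Zeep install only
    # surfaces when the function is called.
    import zeep

    client = zeep.Client(wsdl=wsdl_url)
    client.wsdl.dump()


# Usage, assuming DCS is running locally on the default port:
# dump_wsdl("http://localhost:41734/Muhimbi.DocumentConverter.WebService/?WSDL")
```

The output is long, so it helps to search it for `ExtractTables` and `TableExtractionSettings`.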

The `ExtractTables` method requires three parameters:

- `inputFile: xsd:base64Binary`

- `openOptions: ns2:OpenOptions`

- `settings: ns2:TableExtractionSettings`

Pass `inputFile` as Base64-encoded binary data, which is the `xsd:base64Binary` type defined by W3C XML Schema.
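
As a minimal sketch of what Base64 encoding does to file bytes, the following standard-library example encodes a byte string and decodes it back. The helper names `encode_for_dcs` and `decode_from_dcs` are hypothetical and are not part of Zeep or DCS; the encoding step mirrors the `base64.b64encode(...).decode('utf-8')` call in the sample code.

```python
import base64


def encode_for_dcs(data: bytes) -> str:
    """Return the Base64 text form of raw file bytes (hypothetical helper)."""
    return base64.b64encode(data).decode("utf-8")


def decode_from_dcs(text: str) -> bytes:
    """Reverse encode_for_dcs: turn Base64 text back into raw bytes."""
    return base64.b64decode(text)


# Stand-in content; a real call would use the bytes read from the PDF file.
sample_bytes = b"%PDF-1.7"
encoded = encode_for_dcs(sample_bytes)

# Encoding is lossless: decoding returns the original bytes.
assert decode_from_dcs(encoded) == sample_bytes
```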

Instantiate `openOptions` and `settings` using Zeep type factories. Both types belong to the `ns2` namespace.

The `OpenOptions` type requires minimal setup: set the original file name and the file extension.

`TableExtractionSettings` supports the following configuration:

- Multiple boolean flags using the `BooleanEnum` type (`ns3`)

- Output format using `TableExtractionOutputType` (`ns3`)

- OCR language

- DPI

- Table separation behavior
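
The options above can be gathered in one place. The following is a sketch only: `build_table_settings` is a hypothetical helper, and its parameter names and defaults are assumptions chosen for illustration. It expects the two Zeep type factories created with `client.type_factory("ns2")` and `client.type_factory("ns3")`.

```python
def build_table_settings(factory2, factory3, *, dpi=300, ocr_language="eng",
                         separate_tables=True, scanned=True):
    """Build a TableExtractionSettings object (hypothetical helper).

    factory2 and factory3 are the Zeep type factories for the ns2 and ns3
    namespaces.
    """
    def flag(value):
        # BooleanEnum values are constructed from the strings "True" and "False".
        return factory3.BooleanEnum("True" if value else "False")

    return factory2.TableExtractionSettings(
        DPI=str(dpi),  # The WSDL declares DPI as xsd:string.
        SeparateTables=flag(separate_tables),
        # Orientation and skew detection are aimed at scanned pages.
        EnableOrientationDetection=flag(scanned),
        EnableSkewDetection=flag(scanned),
        RenderFormFields=flag(True),
        OutputFileType=factory3.TableExtractionOutputType("JSON"),
        OCRLanguage=ocr_language,
    )
```

Keeping the flags behind keyword arguments makes it easier to try different DPI or separation settings without editing the construction code each time.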

The method returns a `BatchResult` object. Its `File` field carries the extracted data in JSON format as Base64-encoded binary; the sample code writes `result.File` directly to the output file.

## Sample code

The following Python code demonstrates how to extract tables from a PDF file:

```python

import zeep
import base64

print("Extract tables from a PDF")

# Source file path.

source_file = "Three-in-one invoice.pdf"

# Target file path.

target_file = "Three-in-one invoice tables.json"

# OCR languages (multiple languages can be included, separated using the '+' character; for example eng+fr).

ocr_languages = "eng"

# Service URL.

service_url = "http://localhost:41734/Muhimbi.DocumentConverter.WebService/"

# WSDL URL.

wsdl_url = service_url + "?WSDL"

# Construct the header.

header = zeep.xsd.Element(
    "Header",
    zeep.xsd.ComplexType(
        [
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}Action", zeep.xsd.String()
            ),
            zeep.xsd.Element(
                "{http://www.w3.org/2005/08/addressing}To", zeep.xsd.String()
            ),
        ]
    ),
)

# Create the header value.

header_value = header(Action=service_url, To=service_url)

# Create client.

client = zeep.Client(wsdl=wsdl_url)

# Create a type factory for objects in the namespace with the prefix ns2 (see the WSDL).

factory2 = client.type_factory("ns2")

# Create a type factory for objects in the namespace with the prefix ns3 (see the WSDL).

factory3 = client.type_factory("ns3")

# Create the BooleanEnum values (this sample only uses the true value).

boolean_enum_true = factory3.BooleanEnum("True")
boolean_enum_false = factory3.BooleanEnum("False")

# Create the OpenOptions object with minimum settings.

open_options = factory2.OpenOptions(OriginalFileName = source_file, FileExtension = "pdf")

# Create the output file type.

output_file_type = factory3.TableExtractionOutputType("JSON")

# Create the TableExtractionSettings object with minimum settings.

table_extraction_settings = factory2.TableExtractionSettings(DPI = "300",
                                                             SeparateTables = boolean_enum_true,
                                                             EnableOrientationDetection = boolean_enum_true,
                                                             EnableSkewDetection = boolean_enum_true,
                                                             RenderFormFields = boolean_enum_true,
                                                             OutputFileType = output_file_type,
                                                             OCRLanguage = ocr_languages)

# Read the source file and Base64-encode its contents.

with open(source_file, "rb") as filereader:
    source_file_content = base64.b64encode(filereader.read()).decode('utf-8')

# Extract the tables.

result = client.service.ExtractTables(source_file_content, open_options, table_extraction_settings)

# Write the output file.

with open(target_file, "wb") as f:
    f.write(result.File)

print("Done")

```

## Output format

The table extraction service supports the following output format:

**JSON format**

- Structured data with table metadata and cell content

- Includes table positioning and formatting information

- Suitable for programmatic processing and integration

- File extension: `.json`
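
This guide doesn't specify the JSON schema, so a cautious first step is to inspect the top-level shape of the output before writing code against it. The following sketch uses only the standard library; `summarize_json` is a hypothetical helper, and the placeholder payload is invented for illustration and does not represent real DCS output.

```python
import json


def summarize_json(raw: bytes) -> str:
    """Describe the top-level shape of a JSON document without assuming a schema."""
    # "utf-8-sig" also accepts input that starts with a byte order mark.
    data = json.loads(raw.decode("utf-8-sig"))
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return f"array with {len(data)} item(s)"
    return f"scalar of type {type(data).__name__}"


# Hypothetical stand-in payload; the real DCS output will look different.
placeholder = b'{"example_key": [1, 2, 3]}'
print(summarize_json(placeholder))
```

To inspect the file written by the sample code, read it in binary mode and pass the bytes to `summarize_json`.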

To set the output format, modify the `OutputFileType` parameter:

```python

# For JSON output

output_file_type = factory3.TableExtractionOutputType("JSON")

```

## Troubleshooting

**Service connection error: Cannot connect to DCS**

- Ensure DCS is running on `localhost:41734`

- Check that no firewall is blocking the connection

- Verify the service URL in your code matches your DCS installation
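
A quick way to separate network problems from SOAP problems is to test whether the port accepts TCP connections at all. The following is a minimal sketch using the standard library; `is_port_open` is a hypothetical helper, and a successful connection only shows that something is listening, not that it is DCS.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the default setup in this guide, call `is_port_open("localhost", 41734)` before creating the Zeep client.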

**File access error: File not found or permission denied**

- Verify that Python has read access to the source PDF file

- Check that the output directory has write permissions

- Ensure the source file path is correct and the file exists
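
These checks can be run before calling the service, so path problems are reported up front. The sketch below uses only the standard library; `check_paths` is a hypothetical helper and its messages are illustrative.

```python
import os
from pathlib import Path
from typing import List


def check_paths(source_file: str, target_file: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    source = Path(source_file)
    if not source.is_file():
        problems.append(f"Source file not found: {source}")
    elif not os.access(source, os.R_OK):
        problems.append(f"Source file is not readable: {source}")

    # The directory that will contain target_file must exist and be writable.
    target_dir = Path(target_file).resolve().parent
    if not target_dir.is_dir():
        problems.append(f"Output directory does not exist: {target_dir}")
    elif not os.access(target_dir, os.W_OK):
        problems.append(f"Output directory is not writable: {target_dir}")

    return problems
```

With the names from the sample code, `check_paths(source_file, target_file)` returns an empty list when both paths look usable.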

**No tables extracted: Empty result or no output file**

- Verify that the PDF contains actual tabular data, not just visual table layouts

- Check that the OCR language setting matches the document language

- Ensure the DPI setting is appropriate for your document quality (try 300 or higher)

- Enable orientation and skew detection for scanned documents

**License error: Table extraction not available**

- Verify that your DCS license includes table extraction functionality

- Check that the license hasn’t expired

- Ensure the service is licensed and activated

**Poor extraction quality: Incomplete or inaccurate table data**

- Increase the DPI setting for higher quality extraction (try 600 DPI for complex tables)

- Enable orientation detection if tables are rotated

- Enable skew detection for scanned documents

- Set the appropriate OCR language for non-English documents

- Consider setting `SeparateTables` to false (`boolean_enum_false` in the sample code) for complex multi-column layouts
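
For documents in more than one language, the sample code notes that OCR languages are combined with the `+` character (for example, `eng+fr`). The following sketch builds that string from a list; `join_ocr_languages` is a hypothetical helper, and it doesn't validate whether DCS supports a given code.

```python
def join_ocr_languages(languages):
    """Join language codes with '+', skipping blanks and duplicates."""
    unique = []
    for code in languages:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("At least one OCR language code is required")
    return "+".join(unique)


# The result can be passed as the OCRLanguage setting.
ocr_languages = join_ocr_languages(["eng", "fr"])
```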

**Large file processing: Slow performance or timeouts**

- For large PDF files, consider processing individual pages

- Increase timeout values in your HTTP client configuration

- Monitor memory usage when processing multiple large files
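
With Zeep, timeouts are set on the `Transport` object passed to the client. The following is a sketch under assumptions: `create_client` is a hypothetical helper, and the timeout values are illustrative rather than recommended by DCS.

```python
def create_client(wsdl_url, load_timeout=60, operation_timeout=600):
    """Create a Zeep client with explicit timeouts, in seconds."""
    # Imported inside the function so a missing Zeep install only
    # surfaces when a client is requested.
    import zeep
    from zeep.transports import Transport

    transport = Transport(
        timeout=load_timeout,                 # Loading the WSDL and XSD documents.
        operation_timeout=operation_timeout,  # Service calls such as ExtractTables.
    )
    return zeep.Client(wsdl=wsdl_url, transport=transport)
```

In the sample code, this would replace the `zeep.Client(wsdl=wsdl_url)` call.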

## What’s next

Now that you can extract tables from PDFs with Python, explore these related document processing capabilities:

- **Text extraction** - Learn how to [extract text using Python](https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-text-using-python.md) to analyze document content beyond tables

- **C# implementation** - Compare approaches with [extract tabular data from PDFs](https://www.nutrient.io/guides/document-converter/document-converter-services/extraction/extract-tables/extract-tabular-data-from-pdf.md) using C# for cross-language insights

- **Complete Python setup** - Review the [using Document Converter Services with Python](https://www.nutrient.io/guides/document-converter/document-converter-services/dcs-with-python.md) guide for more features

