Extract Tables from PDFs and Images
This guide explains how to extract table information from PDF documents using Document Engine.
Sending the Request to Extract Data
To extract table data from a document, post a multipart request to the /api/build endpoint. In the instructions, specify the following output parameters:
type specifies the output type. Set this to json-content.
tables is a Boolean value that determines whether to extract table data.
language specifies the language used for recognizing text with optical character recognition (OCR). OCR is needed when text is stored in a PDF or an image in a form that cannot be searched or copied, as in a scanned page. PSPDFKit’s OCR engine recognizes this text and saves it in a separate file where it can be searched, copied, and pasted.
curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
      "parts": [
        {
          "file": "document"
        }
      ],
      "output": {
        "type": "json-content",
        "tables": true,
        "language": "english"
      }
    }' \
  -o result.json

POST /api/build HTTP/1.1
Content-Type: multipart/form-data; boundary=customboundary
Authorization: Token token=<API token>

--customboundary
Content-Disposition: form-data; name="document"; filename="example-document.pdf"
Content-Type: application/pdf

<PDF data>
--customboundary
Content-Disposition: form-data; name="instructions"
Content-Type: application/json

{
  "parts": [
    {
      "file": "document"
    }
  ],
  "output": {
    "type": "json-content",
    "tables": true,
    "language": "english"
  }
}
--customboundary--

For more information on the Build instructions, refer to the API Reference.
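The same request can also be sent from a script. The following is a minimal Python sketch that uses only the standard library and mirrors the multipart layout of the raw HTTP example. The helper names (build_instructions, encode_multipart, extract_tables) are hypothetical and chosen for illustration. The base URL and the token header format are copied from the curl example, and the sketch assumes the server returns the JSON body directly in the response.

```python
import json
import urllib.request
import uuid


def build_instructions(language="english", tables=True):
    """Return the Build instructions used in this guide as a dict."""
    return {
        "parts": [{"file": "document"}],
        "output": {"type": "json-content", "tables": tables, "language": language},
    }


def encode_multipart(pdf_bytes, instructions, filename="example-document.pdf", boundary=None):
    """Encode the PDF and the instructions as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = (
        f"\r\n--{boundary}\r\n"
        'Content-Disposition: form-data; name="instructions"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(instructions)}\r\n"
        f"--{boundary}--\r\n"
    ).encode()
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def extract_tables(pdf_path, api_token, base_url="http://localhost:5000"):
    """POST the document to /api/build and return the parsed JSON response.

    Assumption: base_url and the "Token token=..." header follow the curl example.
    """
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    body, content_type = encode_multipart(pdf_bytes, build_instructions())
    request = urllib.request.Request(
        f"{base_url}/api/build",
        data=body,
        method="POST",
        headers={
            "Content-Type": content_type,
            "Authorization": f"Token token={api_token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a running server, calling extract_tables("/path/to/example-document.pdf", "<API token>") would return the response as a Python dict. The multipart body is built by hand here only to keep the sketch free of third-party dependencies; an HTTP client library that handles multipart uploads would work equally well.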
Example Data Extraction Response
{
  "pages": [
    {
      "pageIndex": 0,
      "tables": [
        {
          "confidence": 95.4,
          "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
          "cells": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
              "rowIndex": 0,
              "columnIndex": 0,
              "isHeader": true,
              "text": "Invoice number"
            }
          ],
          "columns": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 }
            }
          ],
          "lines": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
              "isVertical": false,
              "thickness": 0
            }
          ],
          "rows": [
            {
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 }
            }
          ]
        }
      ]
    }
  ]
}
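A response in this shape can be turned back into a grid of cell texts by placing each cell at its rowIndex and columnIndex. The sketch below is one possible way to do that in Python. The function names (table_to_rows, iter_tables) and the min_confidence parameter are hypothetical. It assumes that the indices are zero-based and that confidence is on a 0 to 100 scale, as the example response suggests, and it does not handle cells that span several rows or columns.

```python
def table_to_rows(table):
    """Arrange a table's cells into a list of rows, each a list of cell texts.

    Assumption: rowIndex and columnIndex are zero-based, as in the example response.
    """
    cells = table.get("cells", [])
    if not cells:
        return []
    n_rows = max(cell["rowIndex"] for cell in cells) + 1
    n_cols = max(cell["columnIndex"] for cell in cells) + 1
    # Start from an empty grid so that positions without a cell stay blank.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("text", "")
    return grid


def iter_tables(result, min_confidence=0.0):
    """Yield (pageIndex, rows) for every table at or above min_confidence.

    Assumption: confidence is reported on a 0 to 100 scale.
    """
    for page in result.get("pages", []):
        for table in page.get("tables", []):
            if table.get("confidence", 0.0) >= min_confidence:
                yield page["pageIndex"], table_to_rows(table)
```

For example, after loading the response with json.load, iterating over iter_tables(result, min_confidence=90) yields only the tables the engine is fairly confident about, each as a list of rows ready to be written out with the csv module.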