This guide explains how to extract table information from PDF documents using Document Engine.

Sending the request to extract data

To extract table data from a document, send a multipart POST request to the /api/build endpoint. In the instructions, specify the following output parameters:

  • type specifies the output type. Set this to json-content.
  • tables is a Boolean value that determines whether to extract table data. Set this to true.
  • language specifies the language used for recognizing text with optical character recognition (OCR). Some PDFs and images store text in a form that cannot be searched or copied. Nutrient’s OCR engine recognizes this text and saves it in a separate file, where you can search and copy it.
curl -X POST http://localhost:5000/api/build \
  -H "Authorization: Token token=<API token>" \
  -F document=@/path/to/example-document.pdf \
  -F instructions='{
    "parts": [
      {
        "file": "document"
      }
    ],
    "output": {
      "type": "json-content",
      "tables": true,
      "language": "english"
    }
  }' \
  -o result.json

For more information on the Build instructions, refer to the API Reference.
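
For readers who call the endpoint from code rather than a shell, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the form field names (document, instructions), the Authorization header, and the default URL are taken from the curl command above, while the function names, the part filename document.pdf, and the per-part Content-Type are assumptions made for illustration.

```python
import json
import urllib.request
import uuid


def build_instructions(language="english"):
    """Return the Build instructions from this guide as a dict."""
    return {
        "parts": [{"file": "document"}],
        "output": {"type": "json-content", "tables": True, "language": language},
    }


def encode_multipart(pdf_bytes, instructions, boundary):
    """Encode the `document` and `instructions` fields as multipart/form-data."""
    delimiter = f"--{boundary}".encode("ascii")
    lines = [
        delimiter,
        # Filename and part content type are illustrative assumptions.
        b'Content-Disposition: form-data; name="document"; filename="document.pdf"',
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="instructions"',
        b"",
        json.dumps(instructions).encode("utf-8"),
        delimiter + b"--",
        b"",
    ]
    return b"\r\n".join(lines)


def extract_tables(pdf_path, api_token, base_url="http://localhost:5000"):
    """POST the PDF to /api/build and return the parsed JSON response."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as pdf_file:
        body = encode_multipart(pdf_file.read(), build_instructions(), boundary)
    request = urllib.request.Request(
        f"{base_url}/api/build",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_tables("/path/to/example-document.pdf", "<API token>") would return the response as a Python dict, assuming a Document Engine instance is reachable at the given URL. The multipart body is built by hand only to avoid a third-party dependency; an HTTP client library with built-in multipart support would shorten this considerably.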

Example data extraction response

{
  "pages": [
    {
      "pageIndex": 0,
      "tables": [
        {
          "confidence": 95.4,
          "bbox": {
            "left": 0,
            "top": 0,
            "width": 100,
            "height": 100
          },
          "cells": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              },
              "rowIndex": 0,
              "columnIndex": 0,
              "isHeader": true,
              "text": "Invoice number"
            }
          ],
          "columns": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              }
            }
          ],
          "lines": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              },
              "isVertical": false,
              "thickness": 0
            }
          ],
          "rows": [
            {
              "bbox": {
                "left": 0,
                "top": 0,
                "width": 100,
                "height": 100
              }
            }
          ]
        }
      ]
    }
  ]
}
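
Because each cell in the response carries a rowIndex, a columnIndex, and its text, the flat cells array can be rearranged into a row-by-row grid. The following is a minimal sketch of that step. The function name tables_to_grids and its min_confidence parameter are hypothetical, and the code assumes that the indices are zero-based and that confidence uses the same scale as the 95.4 value in the example; neither is stated explicitly in this guide. Cells that span several rows or columns are not handled.

```python
def tables_to_grids(response, min_confidence=0.0):
    """Convert a json-content response into one text grid per table.

    Returns a list of dicts with the page index, the table confidence,
    and `rows`, a list of lists of cell text in row-major order.
    """
    grids = []
    for page in response.get("pages", []):
        for table in page.get("tables", []):
            confidence = table.get("confidence", 0.0)
            if confidence < min_confidence:
                continue  # Skip tables below the caller's threshold.
            cells = table.get("cells", [])
            if not cells:
                continue
            # Assumption: rowIndex and columnIndex are zero-based.
            row_count = max(cell["rowIndex"] for cell in cells) + 1
            column_count = max(cell["columnIndex"] for cell in cells) + 1
            rows = [[""] * column_count for _ in range(row_count)]
            for cell in cells:
                rows[cell["rowIndex"]][cell["columnIndex"]] = cell.get("text", "")
            grids.append(
                {
                    "pageIndex": page.get("pageIndex"),
                    "confidence": confidence,
                    "rows": rows,
                }
            )
    return grids
```

Passing the parsed response body to tables_to_grids yields grids that can be written out with Python's csv module or loaded into a spreadsheet. Positions with no matching cell are left as empty strings, which keeps every row the same length.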