Extract text, tables, and more from PDFs
This guide explains how to extract data from PDFs using Document Engine.
You can extract the following pieces of information from a PDF document:
- Text
- Tables
- Key-value pairs. For more information, refer to the guide on how key-value pair extraction works.
Sending the request to extract data
To extract data from all pages of a document, send a multipart POST request to the /api/build endpoint. In the instructions, specify the following output parameters:
- type specifies the output type. Set this to json-content.
- plainText is a Boolean value that determines whether to extract data as plain text.
- structuredText is a Boolean value that determines whether to extract data as structured text. Enabling this option gives you information about characters, lines, paragraphs, and words.
- keyValuePairs is a Boolean value that determines whether to extract key-value pairs.
- tables is a Boolean value that determines whether to extract table data.
- language specifies the language used for recognizing text with optical character recognition (OCR). Sometimes, text is stored in a PDF or an image in a way that prevents you from searching or copying it. Nutrient’s OCR engine recognizes this text and saves it in a separate file where you can search, copy, and paste it.
    curl -X POST http://localhost:5000/api/build \
      -H "Authorization: Token token=<API token>" \
      -F document=@/path/to/example-document.pdf \
      -F instructions='{
        "parts": [
          { "file": "document" }
        ],
        "output": {
          "type": "json-content",
          "plainText": true,
          "structuredText": true,
          "keyValuePairs": true,
          "tables": true,
          "language": "english"
        }
      }' \
      -o result.json

The equivalent raw HTTP request:

    POST /api/build HTTP/1.1
    Content-Type: multipart/form-data; boundary=customboundary
    Authorization: Token token=<API token>

    --customboundary
    Content-Disposition: form-data; name="document"; filename="example-document.pdf"
    Content-Type: application/pdf

    <PDF data>
    --customboundary
    Content-Disposition: form-data; name="instructions"
    Content-Type: application/json

    {
      "parts": [
        { "file": "document" }
      ],
      "output": {
        "type": "json-content",
        "plainText": true,
        "structuredText": true,
        "keyValuePairs": true,
        "tables": true,
        "language": "english"
      }
    }
    --customboundary--

For more information on the Build instructions, refer to the API Reference.
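If you prefer to send the request from a script, the following is a minimal Python sketch of the same multipart request, using only the standard library. The function names (build_instructions, encode_multipart, extract) and the base_url default are hypothetical and chosen for illustration; the endpoint, headers, and instruction fields are taken from the request above. It assumes the server returns the JSON result directly in the response body.

```python
import json
import urllib.request
import uuid


def build_instructions(plain_text=True, structured_text=True,
                       key_value_pairs=True, tables=True, language="english"):
    """Return the Build instructions as a dict, mirroring the request above."""
    return {
        "parts": [{"file": "document"}],
        "output": {
            "type": "json-content",
            "plainText": plain_text,
            "structuredText": structured_text,
            "keyValuePairs": key_value_pairs,
            "tables": tables,
            "language": language,
        },
    }


def encode_multipart(pdf_bytes, instructions, filename="example-document.pdf"):
    """Encode the PDF and the instructions as multipart/form-data.

    Returns a (body, content_type) tuple. The boundary is random so it
    cannot realistically collide with the PDF data.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    lines = [
        delimiter,
        f'Content-Disposition: form-data; name="document"; filename="{filename}"'.encode("utf-8"),
        b"Content-Type: application/pdf",
        b"",
        pdf_bytes,
        delimiter,
        b'Content-Disposition: form-data; name="instructions"',
        b"Content-Type: application/json",
        b"",
        json.dumps(instructions).encode("utf-8"),
        delimiter + b"--",
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def extract(pdf_path, api_token, base_url="http://localhost:5000"):
    """POST the document to /api/build and return the parsed JSON response."""
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    body, content_type = encode_multipart(pdf_bytes, build_instructions())
    request = urllib.request.Request(
        f"{base_url}/api/build",
        data=body,
        method="POST",
        headers={
            "Content-Type": content_type,
            "Authorization": f"Token token={api_token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract("/path/to/example-document.pdf", "<API token>") would then play the role of the curl command. Keeping the instructions in a separate function makes it easy to turn off the outputs you do not need, for example build_instructions(structured_text=False).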
Interpreting the data extraction response
The API response contains the data you requested in the API request, which can include:
- Plain text
- Structured text with information about characters, lines, paragraphs, and words
- Extracted key-value pairs
- Tables
Example data extraction response
    {
      "pages": [
        {
          "pageIndex": 0,
          "plainText": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa.\n",
          "structuredText": {
            "characters": [
              {
                "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                "value": "T"
              }
            ],
            "lines": [
              {
                "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                "firstWordIndex": 0,
                "isRTL": false,
                "isVertical": false,
                "wordCount": 5
              }
            ],
            "paragraphs": [
              {
                "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                "firstLineIndex": 0,
                "lineCount": 3
              }
            ],
            "words": [
              {
                "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                "characterCount": 4,
                "firstCharacterIndex": 0,
                "isFromDictionary": true,
                "value": "word"
              }
            ]
          },
          "keyValuePairs": [
            {
              "confidence": 95.4,
              "key": {
                "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                "content": "#"
              },
              "value": {
                "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                "content": "€",
                "dataType": "Currency"
              }
            }
          ],
          "tables": [
            {
              "confidence": 95.4,
              "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
              "cells": [
                {
                  "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                  "rowIndex": 0,
                  "columnIndex": 0,
                  "isHeader": true,
                  "text": "Invoice number"
                }
              ],
              "columns": [
                {
                  "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 }
                }
              ],
              "lines": [
                {
                  "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 },
                  "isVertical": false,
                  "thickness": 0
                }
              ],
              "rows": [
                {
                  "bbox": { "left": 0, "top": 0, "width": 100, "height": 100 }
                }
              ]
            }
          ]
        }
      ]
    }
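To show how a response shaped like the example above might be consumed, here is a small Python sketch. The helper names are hypothetical. It makes three assumptions that the example does not confirm: that confidence is reported on a 0 to 100 scale, that firstWordIndex is an index into the same page's words array, and that a section of the response is simply absent when it was not requested.

```python
def page_texts(response):
    """Map each pageIndex to its plainText (empty string if absent)."""
    return {page["pageIndex"]: page.get("plainText", "")
            for page in response["pages"]}


def confident_pairs(page, min_confidence=90.0):
    """Return (key, value, dataType) tuples at or above the threshold.

    Assumes confidence is on a 0-100 scale, as the 95.4 in the example suggests.
    """
    return [
        (kv["key"]["content"], kv["value"]["content"], kv["value"].get("dataType"))
        for kv in page.get("keyValuePairs", [])
        if kv["confidence"] >= min_confidence
    ]


def table_to_grid(table):
    """Arrange a table's cells into rows using rowIndex and columnIndex.

    Positions with no cell are filled with an empty string.
    """
    cells = table["cells"]
    if not cells:
        return []
    row_count = max(cell["rowIndex"] for cell in cells) + 1
    column_count = max(cell["columnIndex"] for cell in cells) + 1
    grid = [[""] * column_count for _ in range(row_count)]
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid


def line_words(structured_text, line):
    """Return the word values for one line.

    Assumes firstWordIndex points into structured_text["words"].
    """
    start = line["firstWordIndex"]
    words = structured_text["words"][start:start + line["wordCount"]]
    return [word["value"] for word in words]
```

For instance, after loading result.json with json.load, table_to_grid(response["pages"][0]["tables"][0]) yields a list of rows that can be written out with the csv module, and confident_pairs lets you discard low-confidence key-value pairs before further processing.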