Extracting Key-Value Pairs Using GdPicture.NET SDK

This article will cover what key-value pairs (KVPs) are, how they’re used, and how GdPicture.NET’s KVP extraction engine extracts them from documents that contain unstructured data.

What Are Key-Value Pairs?

In the context of a document, key-value pairs are a way of organizing information: a key names a field, and the value holds that field's content. Which key-value pairs appear depends on the type of document.

For example, key-value pairs on invoices can be the following:

| Key | Value |
| --- | --- |
| Invoice Number | No 00162 |
| Billing Date | 20/09/2022 |
| Total | 1,165.10€ |

Here’s an example of key-value pair fields on a government form:

| Key | Value |
| --- | --- |
| Company Name | Nutrient GmbH |
| Registration Number | FN 548939p |
| Date of Incorporation | 04/10/2013 |

Extracting Key-Value Pairs

It’s easy to get key-value pairs from structured documents like Excel files because the values are all named. However, a majority of documents have unstructured data. For these documents, a KVP extraction tool is required to retrieve the information. Intelligent document processing (IDP) extracts data from unstructured and semi-structured documents using optical character recognition (OCR) and artificial intelligence (AI) technologies.

In such cases, the extraction of key-value pairs involves two tasks:

  • Using OCR technology to recognize unstructured information and text in a document.

  • Using machine learning (ML) and deep learning (DL) to make sense of the unstructured information by composing links between different parts of the extracted text.

The next sections explain why these two approaches have limitations when used separately, and why combining them gives the most reliable results.

The Disadvantages of Only Using Traditional OCR

Extracting data with the traditional OCR approach is based on heuristics. Its most important limitation is that it needs a different template for each document type. This works well for simple documents with structured data, but it performs poorly on unstructured or semi-structured documents.
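
To make the template limitation concrete, here is a minimal Python sketch of template-style extraction. It is an illustration only, not GdPicture.NET code: the template, the label wording, and the function name `extract_with_template` are all assumptions made up for this example. The template finds every field in the layout it was written for and nothing in a second layout that carries the same information under different labels.

```python
import re

# Hypothetical template for one invoice layout: each field is a regular
# expression tied to the exact label wording that layout uses.
INVOICE_TEMPLATE = {
    "Invoice Number": re.compile(r"Invoice Number:\s*(.+)"),
    "Billing Date": re.compile(r"Billing Date:\s*(\d{2}/\d{2}/\d{4})"),
    "Total": re.compile(r"Total:\s*([\d.,]+\s*€)"),
}


def extract_with_template(text, template):
    """Return the fields whose pattern matches somewhere in the OCR text."""
    found = {}
    for key, pattern in template.items():
        match = pattern.search(text)
        if match:
            found[key] = match.group(1).strip()
    return found


# Text laid out the way the template expects.
known_layout = "Invoice Number: No 00162\nBilling Date: 20/09/2022\nTotal: 1,165.10€"
# The same information, but with the labels another vendor might use.
other_layout = "Inv. No 00162\nDate 20/09/2022\nAmount due 1,165.10€"

print(extract_with_template(known_layout, INVOICE_TEMPLATE))
print(extract_with_template(other_layout, INVOICE_TEMPLATE))
```

The second call returns an empty dictionary, so supporting the second layout would mean writing and maintaining another template.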

This approach also inherits the limitations of traditional OCR engines, which have difficulty recognizing text in the following contexts:

  • Colored backgrounds
  • Glare
  • Skew
  • Text in tables and graphics
  • Handwritten text

The Disadvantages of Only Using Machine Learning and Deep Learning

Data extraction solutions that leverage ML and deep learning use AI technologies to mitigate the traditional OCR limitations. These deep learning approaches are usually a combination of different techniques such as convolutional neural networks, long short-term memory layers, transformers, and graph neural networks.

However, this approach has a few drawbacks compared to traditional OCR:

  • Time — You’ll need to spend a lot of time teaching ML/DL how to perform under the specific parameters you need.
  • Fixing errors — In the same vein, it can be hard and take a long time to “unteach” incorrect results.
  • Speed and resource usage — Traditional OCR can work more quickly on smaller systems with fewer resources needed compared to ML/DL approaches.

Additionally, data extraction that relies only on machine learning and deep learning often fails for documents with a lot of noise and dotted lines.

GdPicture.NET’s Key-Value Pair Extraction Engine

A combination of the approaches above is necessary to achieve the best results in data extraction. For this reason, PSPDFKit GdPicture.NET recognizes text and key-value pairs based on a hybrid approach of the following methods:

  • Heuristics
  • Mathematics
  • Machine learning

This approach produces superior results compared to traditional OCR and pure ML approaches.

GdPicture.NET’s key-value pair extraction engine enables you to recognize related data items, such as IBANs and addresses, in a document and export them to an external destination like a spreadsheet. GdPicture.NET automatically recognizes the document type, such as a bank statement, based on adaptive layout understanding and natural language processing (NLP) technologies. It then adapts to the context and determines the extraction approach that makes the best use of available resources.

GdPicture.NET Data Model

The GdPicture.NET data model enables you to extract data from documents with excellent results. GdPicture.NET’s hybrid approach performs better than traditional OCR and pure ML engines, especially for documents with the following features:

  • Noise
  • Dotted lines
  • Broken characters
  • Text on colored backgrounds
  • Underlined text
  • Skewed text
  • Text in graphics and tables

Confidence Score — How We Ensure Extraction Accuracy

PSPDFKit GdPicture.NET’s key-value pair extraction engine calculates a confidence score, which expresses how confident the engine is in the accuracy of the extracted data.

The confidence score is calculated by considering the following factors, among others:

  • The confidence in the OCR result at the character level. Some characters are more difficult to recognize than others.
  • The confidence in the OCR result at the word level. Some words are more difficult to recognize than others.
  • The data type of the key. Some data types are more difficult to recognize than others. For example, dates and IBANs are relatively easy to recognize, while phone numbers and addresses are generally more difficult.

The confidence score enables you to filter results based on their assumed accuracy. For example, you can disregard data extraction results with a low confidence score, or flag them as data items that require manual checks.
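
As a rough illustration of this kind of filtering, the Python sketch below splits extracted pairs into accepted results and results that need a manual check. It does not use the GdPicture.NET API: the `KeyValuePair` class, the `split_by_confidence` function, and the 75% threshold are assumptions chosen for this example, and the sample values are borrowed from the invoice example later in this article.

```python
from dataclasses import dataclass


@dataclass
class KeyValuePair:
    """Hypothetical container for one extraction result."""
    key: str
    value: str
    data_type: str
    confidence: float  # percentage, 0 to 100


def split_by_confidence(pairs, threshold=75.0):
    """Separate results at or above the threshold from those below it."""
    accepted = [p for p in pairs if p.confidence >= threshold]
    needs_review = [p for p in pairs if p.confidence < threshold]
    return accepted, needs_review


pairs = [
    KeyValuePair("Billing date", "20/09/2022", "DateTime", 100.0),
    KeyValuePair("IBAN", "AT13 2060 4236 6111 5994", "IBAN", 100.0),
    KeyValuePair("Invoice number", "No 00162", "String", 70.9),
    KeyValuePair("Shipping cost", "0.00€", "Currency", 75.0),
]

accepted, needs_review = split_by_confidence(pairs)
print("accepted:", [p.key for p in accepted])
print("manual check:", [p.key for p in needs_review])
```

The right threshold depends on how costly a wrong value is in your workflow, so treat the default here as a placeholder rather than a recommendation.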

Data Types We Can Categorize

PSPDFKit GdPicture.NET’s KVP extraction engine automatically detects the data type of values. The following data types are supported:

  • Business Identifier Code (BIC)
  • Credit card number
  • Currency
  • Date and time
  • Email address
  • International Bank Account Number (IBAN)
  • Number
  • Percentage
  • Phone number
  • Postal address
  • Postal code
  • String
  • Symbol
  • Time period
  • Unique Identifier (UID)
  • URL
  • VAT ID
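
Some of the types listed above have a rigid structure that can be verified independently of any extraction engine. An IBAN, for example, carries two check digits that are validated with a mod-97 calculation. The Python sketch below shows that standard check as an optional post-processing step; it is not a description of how GdPicture.NET detects IBANs, and the function name `is_valid_iban` is an assumption for this example. The two valid inputs are the IBANs that appear in this article's invoice and bank statement examples.

```python
def is_valid_iban(iban: str) -> bool:
    """Check the IBAN check digits with the standard mod-97 test."""
    compact = iban.replace(" ", "").upper()
    # An IBAN is at most 34 ASCII letters and digits. Country-specific
    # lengths and BBAN formats are deliberately not checked in this sketch.
    if not (4 < len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers: A -> 10, B -> 11, ..., Z -> 35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(is_valid_iban("AT13 2060 4236 6111 5994"))
print(is_valid_iban("FR7611808009101234567890147"))
print(is_valid_iban("AT13 2060 4236 6111 5995"))  # last digit altered
```

A check like this can catch a single misread digit that a confidence score alone might not surface.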

How to Extract Data from Invoices

<%= partial '/guides/dotnet/partials/extraction/kvp-extraction', locals: { document_type: 'invoice' } %>

Format the output to obtain the following table:

| Key | Value | Data Type | Confidence Level |
| --- | --- | --- | --- |
| Billing date | 20/09/2022 | DateTime | 100% |
| Order date | 20/09/2022 | DateTime | 100% |
| Republic of PDF | +100 847 738 227 | PhoneNumber | 77.2% |
| IBAN | AT13 2060 4236 6111 5994 | IBAN | 100% |
| Customer | Vandelay Industries Around the Corner 13 NBC City | String | 69.8% |
| Delivery address | Vandelay Industries Around the Corner 13 NBC City | String | 69.9% |
| Invoice number | No 00162 | String | 70.9% |
| Ref. number | 34751 | Number | 92.9% |
| No | 00162 | Number | 100% |
| Reference | P00201 | UID | 100% |
| Quantity Total (excl. VAT) | 320.00€ | Currency | 59% |
| Subtotal | 1,220.00€ | Currency | 100% |
| Discount (10%) | -122.00€ | Currency | 90.6% |
| VAT (5.5%) | +6710€ | Currency | 66.9% |
| Shipping cost | 0.00€ | Currency | 75% |
| TOTAL | 1,165.10€ | Currency | 100% |
| Description | Lake Mirror | String | 99.6% |
| VAT | 5.5% | Percentage | 66.6% |
| Price per unit (excl. VAT) | 320.00€ | Currency | 80% |
| Tax No. | AT98765321 | UID | 73.8% |
| # | [email protected] | EmailAddress | 65.6% |
| # | www.bruuuk.com | URL | 65.6% |

When the engine doesn’t recognize the data type of a value, it categorizes the value as a string, as shown in the table above.

This table also contains information about the data type and the confidence level for each key-value pair:

  • The data type describes the nature of the content. In this example, the engine recognizes the value [email protected] as an email address and the value +100 847 738 227 as a phone number.
  • The confidence level describes how confident the KVP engine is in the accuracy of the data extraction.

In this example, the KVP engine automatically detected all key-value pairs in the document with minimal code and without any preconfiguration. The engine supports more than 100 formats and languages, and it has no dependencies on external models, resources, or databases.
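
Once results like those in the invoice table are available as plain values, exporting them to a spreadsheet-friendly format takes little code. The Python sketch below writes a few of the rows to CSV text using the standard library. It assumes the results have already been copied out of the engine into tuples; the `rows` list and the `to_csv` function are hypothetical names for this example and not part of GdPicture.NET.

```python
import csv
import io

# A few rows from the invoice example: key, value, data type, confidence (%).
rows = [
    ("Billing date", "20/09/2022", "DateTime", 100.0),
    ("IBAN", "AT13 2060 4236 6111 5994", "IBAN", 100.0),
    ("TOTAL", "1,165.10€", "Currency", 100.0),
    ("VAT", "5.5%", "Percentage", 66.6),
]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Key", "Value", "Data Type", "Confidence Level"])
    for key, value, data_type, confidence in rows:
        # The "g" format drops a trailing ".0", so 100.0 becomes "100".
        writer.writerow([key, value, data_type, f"{confidence:g}%"])
    return buffer.getvalue()


print(to_csv(rows))
```

The `csv` module quotes any value that contains a comma, such as the total amount, so the columns stay aligned when the file is opened in a spreadsheet application.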

How to Extract Data from Bank Statements

<%= partial '/guides/dotnet/partials/extraction/kvp-extraction', locals: { document_type: "bank statement" } %>

Format the output to obtain the following table:

| Key | Value | Data Type | Confidence Level |
| --- | --- | --- | --- |
| IBAN | FR7611808009101234567890147 | IBAN | 100% |
| Phone | 786-315-0313 | PhoneNumber | 100% |
| BIC | 12345678901 | Number | 66.4% |
| Bank Code | 11808 | Number | 99.4% |
| Counter Code | 00914 | Number | 100% |
| Number Account | 12345678901 | Number | 99.3% |
| Bank Key | 47 | Number | 74.2% |
| River Bank | 100 | Number | 74% |
| Account Owner | David Bricklane | String | 100% |
| Domiciliation | East Bank Summerfield | String | 97.5% |

Conclusion

You should now have a basic understanding of how GdPicture.NET’s key-value pair (KVP) extraction engine works, and how it ensures accurate and reliable extraction results. If you have any questions or want to discuss how to implement it in your workflows and projects, you can reach out to our team.

Jonathan D. Rhyne

Co-Founder and CEO

Jonathan joined Nutrient in 2014. As CEO, Jonathan defines the company’s vision and strategic goals, bolsters the team culture, and steers product direction. When he’s not working, he enjoys being a dad, photography, and soccer.
