Build a classification / extraction template
This guide describes the steps required to create a document classification and data extraction template.
Example use case
For this guide, assume that ACME company hosted a talent contest and awarded certificates of completion to all successful participants. Company policy requires keeping a record of every reward in the rewards archive system.
The large number of participants produced a substantial volume of documents, and the HR department raised concerns about the labor involved in processing them manually.
In response, the engineering department was tasked with quickly developing an intelligent data processing system to capture and manage this data.
The most evident solution was to use AI Document Processing to create a tailored data extraction template that captures all the necessary information.
Download the example input image.
Building the document template
Here, you’ll create a DocumentTemplate object. This object will serve as the comprehensive definition of a specific type of document.
To build a document template, start by defining a unique identifier and a public name. Then provide a semantic description of the document and define a set of fields for extraction.
In the example use case of ACME company, the following information needs to be extracted:
- The year of certificate delivery.
- The person who received the certificate.
- The mentor of the student.
- The member of the jury.
- The achievement of the student.
- The postal address of the organization.
Each data point to be extracted needs to be given a name, a format, and a semantic description. You can pick from several predefined data formats, and you can also opt for one of the built-in data validation methods or define your own.
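This guide doesn't show how a custom validation method is registered with the library, so the following is only a sketch of the kind of check a custom validator for the “Year” field might perform. It's plain C# with no dependency on the library; the class and method names (YearCheck, IsPlausibleCertificateYear) and the minYear default are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the library: illustrates the logic a
// custom check for the "Year" field could apply to an extracted value.
public static class YearCheck
{
    // Returns true when value consists only of digits and falls between
    // minYear and the current year (inclusive).
    public static bool IsPlausibleCertificateYear(string value, int minYear = 1900)
    {
        if (string.IsNullOrWhiteSpace(value))
        {
            return false;
        }

        // NumberStyles.None rejects signs, separators, and decimal points.
        bool parsed = int.TryParse(
            value.Trim(), NumberStyles.None, CultureInfo.InvariantCulture, out int year);

        return parsed && year >= minYear && year <= DateTime.UtcNow.Year;
    }
}
```

With this sketch, a value such as "2023" passes, while "20x3", an empty string, or a year in the future doesn't.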
Here’s how ACME company would define its template:
static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organization address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organization",
                StandardValidationMethods = new[]
                {
                    new StandardFieldValidationMethod(StandardFieldValidation.PostalAddressIntegrity)
                }
            }
        }
    };
}
In the code above, ACME developers decided to use number and text data formats and applied the built-in postal address validator for the “Organization address” field.
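The Identifier value in the template has the shape of an uppercase GUID. Assuming any unique string of that shape is acceptable (this guide doesn't state the exact requirements), a new identifier could be generated once with the standard .NET Guid type and then hard-coded in the template. The TemplateIds class and its method name are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the library.
public static class TemplateIds
{
    // Returns a new GUID string in the 8-4-4-4-12 hexadecimal layout,
    // uppercased to match the identifier used in this guide's template.
    public static string NewTemplateIdentifier() =>
        Guid.NewGuid().ToString("D").ToUpperInvariant();
}
```

Generate the value a single time and paste it into the template, rather than calling the helper on every run, so the identifier stays stable.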
Building the extraction instructions component
To extract the data using the document template, create a ProcessorComponent object, which provides instructions to the document processor:
static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        // Disable classification: not required; a single document type will be processed.
        EnableClassifier = false,
        // Enable extraction.
        EnableFieldsExtraction = true,
        Templates = new DocumentTemplate[] { buildOrpalisCertificateTemplate() }
    };
}
Processing a document and analyzing results
The last step is to instantiate a DocumentProcessor object and invoke the Process method to start the data extraction.
The Process method returns a ProcessorResult object that contains the processing outcome. Use it to access the extracted fields, which correspond to the instructions in the ProcessorComponent object:
// Build the component.
ProcessorComponent component = buildComponent();

// Process a document.
ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);

// Output results.
if (result.ExtractedFields != null)
{
    foreach (var item in result.ExtractedFields)
    {
        Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
    }
}
This code will output the following result:
Field name: 'Year' - Field value: '2023' - Validation state: (Undefined)
Field name: 'Student' - Field value: 'Fabio Escobar' - Validation state: (Undefined)
Field name: 'Mentor' - Field value: 'Loïc Carrère' - Validation state: (Undefined)
Field name: 'Jury member' - Field value: 'Olivier Houssin' - Validation state: (Undefined)
Field name: 'Achievement' - Field value: 'Successfully juggled with 3 bananas' - Validation state: (Undefined)
Field name: 'Organization address' - Field value: '52 Rue de Marclan, 31600 MURET, France' - Validation state: (Valid)
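In the output above, only the field with a validation method reports Valid; the others report Undefined. Before writing a record to an archive, it may be useful to index the fields by name and to flag any field whose validation ran and didn't succeed. The sketch below shows one way to do that. It uses a stand-in record (ExtractedFieldStub) that mirrors only the three property names printed above and stores the validation state as a string; the library's real field type may differ, and the ResultSummary helper is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the library's extracted-field type (an assumption): only the
// three property names used in this guide are mirrored, and the validation
// state is kept as the string that the guide's output prints.
public record ExtractedFieldStub(string FieldName, string Value, string ValidationState);

// Hypothetical helper, not part of the library.
public static class ResultSummary
{
    // Indexes field values by field name. Throws ArgumentException if two
    // fields share a name, which a well-formed template should not produce.
    public static Dictionary<string, string> ToDictionary(IEnumerable<ExtractedFieldStub> fields) =>
        fields.ToDictionary(f => f.FieldName, f => f.Value);

    // Names of fields whose state is neither "Undefined" (no validation
    // configured) nor "Valid", so they can be routed to manual review.
    public static List<string> FailedValidation(IEnumerable<ExtractedFieldStub> fields) =>
        fields.Where(f => f.ValidationState != "Undefined" && f.ValidationState != "Valid")
              .Select(f => f.FieldName)
              .ToList();
}
```

In real code, the stub would be replaced by the items of result.ExtractedFields, with ValidationState converted through ToString().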
The complete solution
Here’s a full example showing how to extract data from an image document using the certificate document template:
static void runExtraction()
{
    // "GDPICTURE_KEY" and OPENAI_KEY are placeholders: replace them with your own keys.
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
    Configuration.ResourcesFolder = "resources";

    // Building the component.
    ProcessorComponent component = buildComponent();

    // Processing the document.
    ProcessorResult result = new DocumentProcessor().Process("orpalis_certificate.jpg", component);

    // Analyzing results.
    if (result.ExtractedFields != null)
    {
        foreach (var item in result.ExtractedFields)
        {
            Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        // Classification isn't required, as a single class of documents will be processed.
        EnableClassifier = false,
        // Enabling extraction of the fields specified in the previously defined template.
        EnableFieldsExtraction = true,
        Templates = new DocumentTemplate[] { buildOrpalisCertificateTemplate() }
    };
}

static DocumentTemplate buildOrpalisCertificateTemplate()
{
    return new DocumentTemplate()
    {
        Name = "ORPALIS certificate",
        Identifier = "8843294B-5840-4693-8D2A-C4CF76DB1060",
        SemanticDescription = "ORPALIS certificate of completion",
        Fields = new List<TemplateField>
        {
            new()
            {
                Name = "Year",
                Format = FieldDataFormat.Number,
                SemanticDescription = "The year of certificate delivery"
            },
            new()
            {
                Name = "Student",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The person who received the certificate"
            },
            new()
            {
                Name = "Mentor",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The mentor of the student"
            },
            new()
            {
                Name = "Jury member",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The member of the jury"
            },
            new()
            {
                Name = "Achievement",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The achievement of the student"
            },
            new()
            {
                Name = "Organization address",
                Format = FieldDataFormat.Text,
                SemanticDescription = "The postal address of the organization",
                StandardValidationMethods = new[]
                {
                    new StandardFieldValidationMethod(StandardFieldValidation.PostalAddressIntegrity)
                }
            }
        }
    };
}
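Because the use case involves a large volume of certificates, the single-file call can be wrapped in a loop over a folder. The following is a minimal sketch of such a loop using only standard .NET file APIs. The BatchRunner class is hypothetical, and the per-file work is passed in as a delegate so that the sketch doesn't depend on the library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the library.
public static class BatchRunner
{
    // Calls processOne for every file in folder that matches pattern.
    // A failure on one file doesn't stop the batch: the method returns a
    // map from each failing file path to its exception message.
    public static Dictionary<string, string> ProcessFolder(
        string folder, string pattern, Action<string> processOne)
    {
        var failures = new Dictionary<string, string>();

        foreach (string path in Directory.EnumerateFiles(folder, pattern))
        {
            try
            {
                processOne(path);
            }
            catch (Exception ex)
            {
                failures[path] = ex.Message;
            }
        }

        return failures;
    }
}
```

A caller could pass a lambda that invokes new DocumentProcessor().Process(path, component) and handles the result as in runExtraction, for example with the pattern "*.jpg".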