Document classification and data extraction
This guide describes how to build a solution that classifies documents and extracts their data with AI Document Processing.
Example use case
The examples in this guide assume that ACME's CRM system handles a variety of document types, such as invoices, resumes, purchase orders, and payroll statements. ACME needs an automated solution that categorizes these documents and extracts the relevant data for each category.
Setting up document templates
Templates enable document recognition and drive data extraction through rules and validators. Each template is the complete definition of one specific document type.
To execute document classification or data extraction on a set of files, first create an enumerable of DocumentTemplate objects, with each object representing a single specific document type you want to classify or extract data from:
static List<DocumentTemplate> setupDocumentTemplates()
{
    List<DocumentTemplate> templates = new List<DocumentTemplate>();
    templates.Add(DocumentTemplates.Invoice); // Add invoice template.
    templates.Add(DocumentTemplates.Resume); // Add resume template.
    templates.Add(DocumentTemplates.PurchaseOrder); // Add purchase order template.
    templates.Add(DocumentTemplates.PayrollStatement); // Add payroll statement template.
    return templates;
}
Building the component
Next, create a ProcessorComponent object. This required object serves as the set of instructions for the document processor and encapsulates the logic of the document processing workflow.
Use it to tell the processor whether to classify documents, extract data, or both, and pass it the list of document templates to use:
static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = true, // Enable classification.
        EnableFieldsExtraction = true, // Enable extraction.
        Templates = setupDocumentTemplates()
    };
}
Processing the documents
The last step is to instantiate a DocumentProcessor object and invoke the Process method on one or more files, using the instructions from the ProcessorComponent object.
The Process method returns a ProcessorResult object that holds the processing outcome. Use it to determine which document template matched and to access the extracted fields, depending on the instructions set in the ProcessorComponent object:
// Building the component.
ProcessorComponent component = buildComponent();

// Processing all documents. Replace [DIRECTORY_PATH] with the folder to scan.
foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
{
    ProcessorResult result = new DocumentProcessor().Process(documentFile, component);

    // Analyzing results. Template is null when no template matched.
    if (result.Template != null)
    {
        Console.WriteLine("Document category: " + result.Template.Name);
        if (result.ExtractedFields != null)
        {
            foreach (var item in result.ExtractedFields)
            {
                Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
            }
        }
    }
}
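Once each file has a category, an integration such as ACME's CRM will often need to route files by document type rather than print them. The following is a minimal sketch of that step and is not part of the SDK: ClassifiedFile, ResultRouting, GroupByCategory, and the "Unclassified" label are hypothetical names introduced here for illustration. It assumes you copy the file path and the matched template name (or null when no template matched) out of each ProcessorResult yourself.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the values read from a ProcessorResult:
// the processed file path and the matched template name (null if none matched).
public record ClassifiedFile(string FilePath, string? Category);

public static class ResultRouting
{
    // Label used for files that matched no template (an assumption of this sketch).
    public const string Unclassified = "Unclassified";

    // Groups file paths by category name so that each document type
    // can be handed to its own downstream handler.
    public static Dictionary<string, List<string>> GroupByCategory(IEnumerable<ClassifiedFile> files)
    {
        return files
            .GroupBy(file => file.Category ?? Unclassified)
            .ToDictionary(g => g.Key, g => g.Select(file => file.FilePath).ToList());
    }
}
```

One way to use it would be to add a ClassifiedFile to a list for every processed file inside the processing loop, then call ResultRouting.GroupByCategory on that list after the loop. Keeping unmatched files under their own label, instead of dropping them, makes it easy to send them to manual review.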
Complete solution
Here’s a full example showing how to classify all documents from a specific folder and extract data from them, using four distinct document templates:
using System;
using System.Collections.Generic;
using System.IO;
// Also import the AI Document Processing SDK namespace for your installed version.

static void runExtraction()
{
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
    Configuration.ResourcesFolder = "resources";

    // Building the component.
    ProcessorComponent component = buildComponent();

    // Processing all documents. Replace [DIRECTORY_PATH] with the folder to scan.
    foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
    {
        ProcessorResult result = new DocumentProcessor().Process(documentFile, component);

        // Analyzing results. Template is null when no template matched.
        if (result.Template != null)
        {
            Console.WriteLine("Document category: " + result.Template.Name);
            if (result.ExtractedFields != null)
            {
                foreach (var item in result.ExtractedFields)
                {
                    Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
                }
            }
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = true, // Enabling classification.
        EnableFieldsExtraction = true, // Enabling extraction.
        Templates = setupDocumentTemplates()
    };
}

static List<DocumentTemplate> setupDocumentTemplates()
{
    List<DocumentTemplate> templates = new List<DocumentTemplate>();
    templates.Add(DocumentTemplates.Invoice); // Add invoice template.
    templates.Add(DocumentTemplates.Resume); // Add resume template.
    templates.Add(DocumentTemplates.PurchaseOrder); // Add purchase order template.
    templates.Add(DocumentTemplates.PayrollStatement); // Add payroll statement template.
    return templates;
}