One of the more positive changes that have been embraced over the past year is digitization, especially in the mortgage industry, where piles of paperwork have regularly filled desks and filing cabinets.
Digitization gives us the ability to sign documents electronically and store digital files rather than paperwork, all without leaving the comfort and safety of home. Many people are already familiar with Optical Character Recognition (OCR) and its ability to convert paper-based documents into electronic ones. However, they might not be aware that AI technologies can turn unstructured data into indexed data, saving companies significant overhead.
You may be wondering how digitizing documents results in cost savings. The answer lies in data extraction. For any task involving unstructured paper documents, it takes time to find the document, identify the data point(s), and then take action or log the results. Simply converting the paper documents with OCR saves time on the first step but still leaves the remaining steps to manual effort.
An AI-enabled data extraction solution can actively learn to extract those data points over time, allowing employees to focus on higher-value tasks. The process can be completed without any complex programming and is designed to be user-friendly so that anyone can benefit from its accuracy.
FPT’s Data Extraction Solution
FPT recently developed a system built on the IBM Watson Natural Language Understanding (NLU) platform, which was chosen due to its comprehensive text analytic features and ability to establish complex data relationships. Our team has implemented many pre- and post-processing customizations specifically for mortgage-related datasets to improve the accuracy of results.
The system begins by reformatting the OCR results from a scanned document to remove noise and other artefacts. Then it uses Natural Language Processing (NLP) to identify the critical data points (items) and groups of data (objects). Initially, it must be calibrated manually to recognize the items and objects, after which it uses Named Entity Recognition (NER) and Relation Extraction (RE) to identify them automatically. It also incorporates a dictionary to bootstrap the annotation task, provide equivalent words and reduce errors. Finally, the results of the data extraction are captured and output in the preferred format.
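To make the first and fourth steps more concrete, here is a minimal Python sketch of OCR clean-up followed by dictionary-based term normalization. It is an illustration only, not the code behind our solution: the function names, the noise characters, and the entries in SYNONYMS are all assumptions chosen for the example.

```python
import re

# Hypothetical dictionary of equivalent words (illustrative entries only).
SYNONYMS = {
    "subdiv.": "subdivision",
    "sub-division": "subdivision",
    "abst.": "abstract",
}


def clean_ocr_text(raw: str) -> str:
    """Remove a few common kinds of OCR noise from scanned text."""
    text = re.sub(r"-\s*\n\s*", "", raw)   # rejoin words hyphenated across lines
    text = re.sub(r"[|~^`]+", " ", text)   # drop stray symbols (assumed noise set)
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


def normalize_terms(text: str) -> str:
    """Replace known variants with their canonical dictionary term."""
    words = [SYNONYMS.get(word.lower(), word) for word in text.split(" ")]
    return " ".join(words)


if __name__ == "__main__":
    scanned = "Lot 5 of Oak Hills Sub-\ndivision ||  Block 2"
    print(normalize_terms(clean_ocr_text(scanned)))
```

In practice, the noise rules and dictionary entries would be tuned to the scanner output and the vocabulary of the documents being processed.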
Complicated as it might sound, the solution is optimized for mortgage-related documents and tailored for non-technical personnel. Based on customers' feedback, the latest update includes a Pattern Extractor feature that identifies specific patterns under-represented in the dataset. This allows users to create rules to quickly find patterns that would otherwise require larger amounts of training data.
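As a rough illustration of what a pattern rule can look like, the Python sketch below uses a regular expression to pick out a "Lot ..., Block ..." reference from a legal description. The rule, the field names, and the sample sentence are hypothetical; they are not taken from the Pattern Extractor itself, which users configure without writing code.

```python
import re
from typing import Optional

# Hypothetical rule: a lot number followed by a block number,
# e.g. "Lot 12, Block 3". Both may carry a single letter suffix.
LOT_BLOCK_RULE = re.compile(
    r"\bLot\s+(?P<lot>\d+[A-Z]?),?\s+Block\s+(?P<block>\d+[A-Z]?)\b",
    re.IGNORECASE,
)


def extract_lot_block(text: str) -> Optional[dict]:
    """Return the first lot/block pair found in the text, or None."""
    match = LOT_BLOCK_RULE.search(text)
    if match is None:
        return None
    return {"lot": match.group("lot"), "block": match.group("block")}


if __name__ == "__main__":
    print(extract_lot_block("being Lot 12, Block 3 of Oak Hills Subdivision"))
```

A rule like this needs no training examples at all, which is why it suits patterns that appear too rarely for a model to learn them reliably.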
Data Extraction in Action
A client recently asked for our help to streamline the task of auditing the subdivision property and abstract property information from a stack of 2,500+ mortgage documents, a volume that might reach millions in the future. They estimated the amount of manual effort involved and asked if we had any suggestions to “work smarter, not harder.”
We chose our data extraction tool for its speed and high accuracy rating and worked closely with the client to expedite the task. We began by configuring the Named Entity Recognition (NER) and Relation Extraction (RE) models in the extraction tool. They were set up using the fields relating to subdivision property and abstract property, and the relationships among those fields were defined. After the documents were scanned and pre-processed, the client began to train the model by identifying and labelling the correct data points. This took about 7 minutes per document initially, then dropped to 5 minutes once the system had enough information to pre-annotate automatically.
Using the limited training dataset, the system periodically extracted the data and evaluated it for accuracy against a validation dataset. As the training dataset grew, accuracy increased, indicating that the model was improving with the larger dataset. Our solution aims for 96%+ accuracy, and we relayed the progress as the client continued to provide training data via the annotated documents.
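One simple way to score extraction results against a validation dataset is field-level accuracy: the share of expected fields whose extracted value matches exactly. The Python sketch below assumes that metric and a hypothetical data layout (document ID mapped to field/value pairs); the article does not specify how our tool computes its accuracy figure, so treat this as an illustration only.

```python
from typing import Dict

# Assumed layout: document ID -> {field name: value}.
Fields = Dict[str, Dict[str, str]]


def field_accuracy(predicted: Fields, expected: Fields) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    total = 0
    correct = 0
    for doc_id, gold_fields in expected.items():
        pred_fields = predicted.get(doc_id, {})
        for field, gold_value in gold_fields.items():
            total += 1
            if pred_fields.get(field) == gold_value:
                correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample values for demonstration only.
    expected = {
        "doc1": {"subdivision": "Oak Hills", "lot": "12"},
        "doc2": {"subdivision": "Pine Ridge", "lot": "4"},
    }
    predicted = {
        "doc1": {"subdivision": "Oak Hills", "lot": "12"},
        "doc2": {"subdivision": "Pine Ridge", "lot": "7"},
    }
    print(f"{field_accuracy(predicted, expected):.1%}")
```

Re-running a check like this after each new batch of annotated documents shows whether additional training data is still improving the model.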
In just a few days, the tool achieved 96.3% accuracy while using less than 25% of the training data. These results surpassed the client’s expectation of 90%+ and saved a significant amount of time in the process. The client is currently exploring more proactive use of the extraction tool, digitizing documents before audits and deadlines become concerns.
A Customizable Tool
With our innovative and customizable extraction tool, it is now possible to have a digital index of a wide range of unstructured paper documents. For more information about our digitization solutions or other services, contact FPT today.