Structured Data Extraction from Documents
Credit card statements, invoices, medical test reports, and resumes all contain structured information that remains locked inside the documents themselves. We are exploring an extensible deep learning solution to efficiently extract structured data from such documents.
|Ready to kick-off|
|ML engineers, App developers, Domain experts in sign language|
Why this is relevant in the Indian context
Documents are the primary tools for accumulating information in each and every process. We record information in documents as resumes when applying for a job, financial statements after bank transactions, experimental results in research papers, personal records in offices, mark sheets in colleges, tickets in public transport, and so on. India is a country of 1.3 billion+ people, and most of us come across at least one new document every week, which indicates the huge amount of data associated with documents. These documents hold valuable information locked inside them, which could be put to use for fast and efficient search and business processing. Structured data extraction will also bring down storage requirements, as we no longer need to keep the documents as images.
Structured vs Unstructured
Data can be broadly classified into structured and unstructured based on its organization. Structured data strictly follows a predefined format, whereas unstructured data hardly has any clear format. A data table is a typical example of structured data. Examples of unstructured data include images, audio, video, etc.
Structured Data Example:
Top 5 languages in India by the number of mother-tongue speakers as per the 2011 census. (Source: Wikipedia)
Unstructured Data Example:
A satellite image of India at night in 2016. (Source: NASA)
Information search is easier in structured data. In the above example, the number of Tamil speakers can be fetched simply by retrieving the value at row “Tamil” and column “Speakers”, which is 68,890,000. This method of information retrieval is universal across tables. In unstructured data like the satellite image, by contrast, there is no such ubiquitous way to search, and far more expertise is required to harness it. Hence structured data has the advantage over unstructured data from a technical perspective because of its analytical friendliness.
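The row/column lookup described above can be sketched with pandas. Only the Tamil figure (68,890,000) comes from the example; the other row is a placeholder value for illustration.

```python
import pandas as pd

# Illustrative table; only the Tamil figure is taken from the example
# above -- the "Hindi" row is a placeholder, not a census value.
df = pd.DataFrame(
    {"Language": ["Hindi", "Tamil"], "Speakers": [528_000_000, 68_890_000]}
).set_index("Language")

# Universal retrieval: the value at row "Tamil", column "Speakers".
print(df.loc["Tamil", "Speakers"])  # 68890000
```

The same `df.loc[row, column]` pattern works for any table once it is in structured form, which is exactly the analytical friendliness the text refers to.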
Document images are one of the major subsets of unstructured data that have structural information locked inside them. In this project, we are going to explore different deep learning based approaches to extract such structural information from document images. The structured data of interest includes tables and key-value pairs. A high-level overview of the system is shown below:
Structured data extraction model
Sample Document Image (Input):
Sample structured data (Output):
Table #1: (file: table_1.csv)
|MARKS OBTAINED FOR 200|
Key-Value Pairs: (file: key_value_pairs.json)
{
  "TOTAL MARKS": 1164,
  "DATE OF BIRTH": "09.02.98",
  "REGISTER NO": "078053",
  "TMR CODE NO & DATE": "G877239 11.06.2015"
}
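Once written as valid JSON, the key-value output can be consumed directly with the standard library. A minimal sketch, assuming a cleaned-up version of the sample above (string values are kept as strings since fields like the register number carry leading zeros):

```python
import json

# Cleaned-up version of the sample key_value_pairs.json output above.
# "REGISTER NO" stays a string because a numeric type would drop its
# leading zero.
sample = """
{
  "TOTAL MARKS": 1164,
  "DATE OF BIRTH": "09.02.98",
  "REGISTER NO": "078053",
  "TMR CODE NO & DATE": "G877239 11.06.2015"
}
"""

pairs = json.loads(sample)
print(pairs["TOTAL MARKS"])   # 1164
print(pairs["REGISTER NO"])   # 078053
```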
The SDE system is intended to extract tables and key-value pairs from a document image passed as input. This can be achieved by building separate deep learning models for extracting the tables and the key-value pairs. The reason for developing individual models is the assumption that tables carry a special spatial structure, not shared by key-value pairs, irrespective of the document format.
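The two-model design can be sketched as a skeleton pipeline. All function names here are hypothetical placeholders, not part of any existing library; the stubs stand in for the table and key-value models described above.

```python
from typing import Any

# Hypothetical skeleton of the SDE pipeline: two independent models,
# one for tables and one for key-value pairs, fused into one output.

def detect_tables(image: Any) -> list:
    """Stub for the table extraction model."""
    return []

def extract_key_values(image: Any) -> dict:
    """Stub for the key-value extraction model."""
    return {}

def extract_structured_data(image: Any) -> dict:
    return {
        "tables": detect_tables(image),
        "key_value_pairs": extract_key_values(image),
    }

result = extract_structured_data(None)
print(sorted(result))  # ['key_value_pairs', 'tables']
```

Keeping the two models behind separate functions also means either one can be retrained or replaced without touching the other, which matches the rationale for building them individually.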
Data availability and collection
Existing work - Research and Practice
There are multiple open source solutions for extracting such data from documents. However, they are template-driven models constrained to specific document formats. In many cases, they require additional metadata about the document, which is hardly ever available. Some proprietary tools, like Amazon Textract and PDFTRON, perform the extraction with a content-driven, machine learning based approach. Our goal is to build an SDE system that is:
Content-driven, not template-driven
Robust to structural noise
Low in extraction latency
Holt et al. extracted key-value pairs from invoices based on the output of an optical character recognizer (OCR) and the images themselves. The system filters the tokens recognized by OCR based on the query type (the key's data type). The filtered tokens for a given key, along with a variety of their features, are passed to a machine learning model to identify the most probable value. If the most likely prediction falls below a tuned likelihood threshold, the system emits null to indicate that no field value is predicted.
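The thresholding step can be sketched as follows. This is not Holt et al.'s actual model, only the decision rule described above; the candidate scores are made up for illustration.

```python
# Sketch of the likelihood-threshold rule: among the candidate tokens
# for a key, pick the highest-scoring one, but emit None (null) when
# even the best score falls below a tuned threshold.

def predict_value(candidates: dict, threshold: float = 0.5):
    """candidates maps token -> model likelihood score."""
    if not candidates:
        return None
    best_token, best_score = max(candidates.items(), key=lambda kv: kv[1])
    return best_token if best_score >= threshold else None

print(predict_value({"1164": 0.92, "09.02.98": 0.11}))  # 1164
print(predict_value({"G877": 0.31}))                    # None
```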
Hao et al. implemented a deep learning model for table detection in documents based on the input image, document metadata, and loose heuristic rules.
Shigarov et al. developed a complex, customizable model for structured data extraction from tables. The system allows configuring the algorithms, thresholds, and rule sets used for decomposing tables. Even though the system can adapt to different table formats, it relies heavily on hyperparameter settings as well as heuristic rules.
In DeepDeSRT, Schreiber et al. presented an end-to-end system for table understanding in document images. The system comprises two subsequent models, one for table detection and one for structured data extraction from the recognized tables. It outperformed state-of-the-art methods, achieving F1-measures of 96.77% for table detection and 91.44% for structure recognition.
DeepDeSRT utilized a deep neural network trained on an object detection dataset and applied transfer learning to identify the locations of tables, instead of objects, in document images. This is based on the assumption that tables in document images have a visual correlation with objects in natural images. They used Faster R-CNN for this task.
DeepDeSRT used deep learning based semantic segmentation for structure recognition, with a convolutional neural network with skip connections as the architecture. The system preprocesses table images by stretching them vertically to increase the amount of background separating each row from the remaining structure components.
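The vertical stretching preprocessing can be sketched with NumPy. This simply repeats each pixel row; the exact resampling DeepDeSRT uses may differ, so treat this as an illustration of the idea rather than the paper's implementation.

```python
import numpy as np

# Sketch of the preprocessing step: stretch a table image vertically so
# more background separates adjacent rows before structure recognition.
# Here every pixel row is repeated `factor` times.

def stretch_vertically(image: np.ndarray, factor: int = 2) -> np.ndarray:
    return np.repeat(image, factor, axis=0)

table = np.zeros((100, 300), dtype=np.uint8)  # dummy grayscale table crop
stretched = stretch_vertically(table, factor=2)
print(stretched.shape)  # (200, 300)
```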
To develop a model for the recognition of tables in document images
To develop a model for the extraction of structured data from the tables
To develop a model for the extraction of key-value pairs in the document image
To deploy the models as an API service in the cloud
To build a mobile/web application and integrate the SDE APIs