Structured Data Extraction from Documents

Credit card statements, invoices, medical test reports and resumes all carry structured information that remains locked inside documents. We are exploring an extensible deep learning solution to efficiently extract structured data from such documents.

Status

Ready to kick-off

Positions

ML engineers, App developers, Domain experts

Goal

-
A collage of healthcare documents

Why this is relevant in the Indian context

Documents are the primary tools for accumulating information in almost every process. We record information in documents: resumes when applying for a job, financial statements after bank transactions, experimental results in research papers, personal records in offices, mark sheets in colleges, tickets in public transport, and so on. India is a country of 1.3 billion+ people, and most of us come across at least one new document every week, which indicates the huge amount of data associated with documents. These documents hold valuable information locked inside them, which could be used for fast and efficient search and business processing. Structured data extraction will also bring down the storage requirement, as we no longer need to keep the documents as images.

Structured vs Unstructured

Data can be broadly classified into structured and unstructured based on its organization. Structured data strictly follows a predefined format, whereas unstructured data hardly has any clear format. A data table is a typical example of structured data. Examples of unstructured data include images, audio, video, etc.

Structured Data Example:

Language     Speakers
Hindi        322,200,000
Bengali      96,180,000
Marathi      82,800,000
Telugu       80,910,000
Tamil        68,890,000

Top 5 languages in India by number of mother-tongue speakers, as per the 2011 census. (Source: Wikipedia)


Unstructured Data Example:


A satellite image of India at night in 2016. (Source: NASA)

Information search is easier in structured data. In the table above, the number of Tamil speakers can be fetched simply by retrieving the value at row "Tamil" and column "Speakers", which is 68,890,000. This method of information retrieval is uniform across tables. In unstructured data such as the satellite image, there is no such universal way to search, and harnessing it requires far more expertise. Hence, from a technical perspective, structured data wins over unstructured data because of its analytical friendliness.
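As a minimal illustration of such a lookup, the table above can be represented and queried with pandas; this is only a sketch and the column names simply mirror the example.

import pandas as pd

# The "Top 5 languages" table above, held as structured data.
speakers = pd.DataFrame(
    {
        "Language": ["Hindi", "Bengali", "Marathi", "Telugu", "Tamil"],
        "Speakers": [322_200_000, 96_180_000, 82_800_000, 80_910_000, 68_890_000],
    }
).set_index("Language")

# Retrieve the value at row "Tamil", column "Speakers".
print(speakers.loc["Tamil", "Speakers"])  # 68890000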

System Overview

Document images are a major subset of unstructured data that have structured information locked inside them. In this project, we are going to explore different deep learning based approaches to extract such structured information from document images. This structured data includes tables and key-value pairs. A high-level overview of the system is shown below:


Structured data extraction model

Sample Document Image (Input):


Sample structured data (Output):

Tables:
Table #1: (file: table_1.csv)

SUBJECT             THEORY 150   PRAC 50   MARKS OBTAINED FOR 200
TAMIL                                      189
ENGLISH                                    189
PHYSICS             144          050       194
CHEMISTRY           143          050       193
COMPUTER SCIENCE    150          050       200
MATHEMATICS                                199

Key-Value Pairs: (file: key_value_pairs.json)

{
  "TOTAL MARKS": 1164,
  "DATE OF BIRTH": "09.02.98",
  "REGISTER NO": "078053",
  "TMR CODE NO & DATE": "G877239 11.06.2015"
}

The SDE system is intended to extract the tables and key-value pairs from a document image passed as input. This can be achieved by building separate deep learning models for extracting tables and key-value pairs. The reason for developing individual models is the assumption that tables have distinctive spatial properties, not shared by key-value pairs, irrespective of the document format.
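A minimal sketch of how the two models could sit behind a single interface is shown below; every class and method name here is a hypothetical placeholder, not an existing implementation.

from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SDEResult:
    # Hypothetical container for everything extracted from one document image.
    tables: List[Any] = field(default_factory=list)            # e.g. one grid of cells per detected table
    key_value_pairs: Dict[str, Any] = field(default_factory=dict)


class SDESystem:
    # Sketch of the proposed design: two independent deep learning models,
    # one for tables and one for key-value pairs, run on the same image.

    def __init__(self, table_model, kv_model):
        self.table_model = table_model    # detects tables and recognizes their structure
        self.kv_model = kv_model          # extracts key-value pairs

    def extract(self, document_image) -> SDEResult:
        return SDEResult(
            tables=self.table_model.extract_tables(document_image),
            key_value_pairs=self.kv_model.extract_pairs(document_image),
        )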

Data availability and collection

The ICDAR 2013 Table Competition dataset could be used for the task of structured data recognition in tables.

Existing work - Research and Practice

There are multiple open source solutions for doing such extraction from documents. However, they are template-driven models constrained to specific document formats. In many cases, they require additional metadata about the document, which is hardly ever available. Some proprietary tools like Amazon Textract and PDFTRON perform the extraction in a content-driven, machine learning based manner. Our goal is to build an SDE system that is:

  • Open source

  • Content driven not template driven

  • High performing 

  • Robust to structural noise

  • Having low extraction latency

Holt et al. extracted key-value pairs from invoices based on the output of an optical character recognizer (OCR) and the images. The system filters the tokens recognized by OCR based on the query type (the key's data type). The filtered tokens for a given key, along with a variety of their features, are passed to a machine learning model to identify the most probable value. If the most likely prediction falls below a tuned likelihood threshold, the system emits null to indicate that no field value is predicted.
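A simplified sketch of that filter-then-score pipeline is given below; the type patterns, features and classifier interface are our own illustrative assumptions, not the ones used by Holt et al.

import re
from typing import List, Optional

# Illustrative type patterns; a real system would use much richer type checks.
TYPE_PATTERNS = {"date": r"\d{2}\.\d{2}\.\d{2,4}", "amount": r"\d+(\.\d+)?"}

def matches_type(text: str, key_type: str) -> bool:
    return re.fullmatch(TYPE_PATTERNS.get(key_type, r".*"), text) is not None

def featurize(token: dict) -> List[float]:
    # Toy feature vector: token length plus bounding box coordinates.
    x0, y0, x1, y1 = token["bbox"]
    return [len(token["text"]), x0, y0, x1, y1]

def extract_value(tokens: List[dict], key_type: str, classifier, threshold: float = 0.5) -> Optional[str]:
    # classifier: any model mapping a feature vector to the probability that the token is the value.
    # 1. Keep only OCR tokens whose text is compatible with the queried data type.
    candidates = [t for t in tokens if matches_type(t["text"], key_type)]
    # 2. Score each candidate with the machine learning model.
    scored = [(classifier.predict_proba(featurize(t)), t) for t in candidates]
    if not scored:
        return None
    # 3. Emit the best candidate, or null if it falls below the tuned threshold.
    best_score, best_token = max(scored, key=lambda item: item[0])
    return best_token["text"] if best_score >= threshold else None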

Hao et al. implemented a deep learning model for table detection in documents based on the input image, document metadata and loose heuristic rules.

Shigarov et al. developed a complex, customizable model for structured data extraction from tables. The system allows configuring the algorithms, thresholds and rule sets used for decomposing tables. Even though the system can adapt to different table formats, it relies heavily on manually set hyperparameters and heuristic rules.

In DeepDeSRT, Schreiber et al. presented an end-to-end system for table understanding in document images. The system comprises two subsequent models, one for table detection and one for structured data extraction from the recognized tables. It outperformed state-of-the-art methods by achieving F1-measures of 96.77% for table detection and 91.44% for structure recognition.

For table detection, DeepDeSRT utilized a deep neural network trained on an object detection dataset and applied transfer learning to identify the locations of tables, instead of objects, in document images. This is based on the assumption that tables in document images have a visual correlation with objects in natural images. Faster R-CNN was used for this task.
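A rough sketch of this transfer learning step using torchvision's Faster R-CNN is shown below; the backbone, weights and training loop are our assumptions for illustration and do not reproduce DeepDeSRT's exact configuration.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pre-trained on natural images (transfer learning).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the classification head: 2 classes = background + "table".
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# Fine-tune on document images annotated with table bounding boxes.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
# for images, targets in table_loader:    # targets: [{"boxes": ..., "labels": ...}, ...]
#     losses = model(images, targets)     # dict of detection losses
#     loss = sum(losses.values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step()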

For structure recognition, DeepDeSRT used a deep learning based semantic segmentation approach, with a convolutional neural network with skip connections as the architecture. The system preprocesses table images by stretching them vertically to increase the amount of background separating each row from the remaining structural components.
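One plausible reading of that preprocessing step is a simple vertical resize, sketched below with OpenCV; the stretch factor is an arbitrary illustrative value and the paper's exact procedure may differ.

import cv2

def stretch_vertically(table_image, factor: float = 1.5):
    # Resize the cropped table image along the vertical axis only, so that the
    # background separating consecutive rows grows before segmentation.
    height, width = table_image.shape[:2]
    return cv2.resize(table_image, (width, int(height * factor)), interpolation=cv2.INTER_LINEAR)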

Technical Milestones

  1. To develop a model for the recognition of tables in document images

  2. To develop a model for the extraction of structured data from the tables

  3. To develop a model for the extraction of key-value pairs in the document image

  4. To deploy the models as an API service in the cloud

  5. To build a mobile/web application and integrate the SDE APIs

Current Team and Open Positions

Mentor

Lead

Gokul karthik  

Researcher, TCS Innovation Labs

Contributors

Premkumar Selvaraj  

PSG Tech, One Fourth Labs

References