Digitisation of Ancient Manuscripts

A large number of manuscripts still hold, locked within them, direct sources of Indic heritage knowledge. Digitising these documents and then processing their language is required to preserve the knowledge they contain.

Status

Ready to kick-off

Open Positions

Sanskrit language scholars, AI engineers, Web developers

Goal

Help digitise 1,000 pages by the end of 2019

Background

Bharat’s ancient civilizational knowledge across all fields of human knowledge and endeavor is preserved, in varying degrees, in its ancient manuscripts and their copies. These manuscripts range from rock-cut inscriptions and palm leaves to more recent paper-based texts and images from about 200-300 years ago. Manuscripts are held in repositories at many libraries and research institutions in India, and a number of important manuscripts are in famous libraries across the Western world. It is important to realize that these manuscripts are what bootstrapped the past millennia of progress in the Western world, yet we ourselves have paid scant attention to them.

Over the past 30+ years, many basic efforts to collect, digitize and catalog these manuscripts have been underway, primarily as small-scale research projects rather than a unified Bharat-scale initiative, owing to various constraints - the primary one being the need for compute tools to scale the overall effort. Findings presented at the recent National Seminar on Cataloging, Editing and Publication of Ancient Indian Manuscripts on Science and Technology suggest that only 2% of scientific documents have been studied and translated for present-day use. One exemplar number: only 450 out of nearly 9,000 source works in science and technology have been translated (Ref: MD Shrinivas). If we consider other bodies of knowledge, we believe the source collections waiting to be harnessed could be 10-100 times larger.

To facilitate access to these documents and the associated knowledge, the following key steps need to happen:

  a) Access to the raw document sources in digital form, for manual review and analysis,
  b) Digital encoding of these documents, for building various tools and technologies,
  c) Development of information science and language technology tools such as concordancers, thesauri, lexicons and catalogs, to facilitate research and utilization of these documents, and
  d) Development of a body of scholars and associated scholarship, to facilitate on-going preservation and contemporaneous utilization of the knowledge in these manuscripts.

Our focus as part of the AI4Bharat initiative is to enable steps a) and b) above at scale. Raw document sources need to be scanned, and the resulting images digitized into Unicode. With the recent advances in machine learning/deep learning and the availability of open-source tools, crowdsourcing the effort and sharing intermediary results will help a) build on the collective effort to digitize as many manuscripts as possible and b) enable experts in different fields to contribute. However, before outlining where we are on this journey, we describe below a number of open issues that need to be addressed.

Open Issues:

  1. Physical manuscripts need a digital version - primarily as images. Different kinds of raw source materials, such as palm leaves and older textual manuscripts, need to be scanned. Raw sources can come in many different formats and orientations, and images may have errors due to lighting and shadow. Multiple images may need to be “stitched” together to complete a manuscript page, and multiple images of the same text from different sources may need to be reconciled with each other manually. Partial manuscripts need to be linked with each other, and content from multiple raw sources needs to be compiled into one “complete” version - in many cases, we may not even know what a “complete” version is. Tools to view these raw images need to be provided to experts so that they can annotate and comment on the underlying content, correcting omissions, highlighting errors and more. Once these basic image compilations are done and meet some pre-defined/established quality bar, the images can undergo the next step of digitization (a minimal cleanup sketch is given after this list). At this stage, the physical documents exist as images - that is, primarily as bits.

  2. Text and other symbolic content must be “extracted” from the raw sources that were converted into image bitmaps in the afore-mentioned step. Relevant content in the images needs to be identified, extracted and converted into its digital encoding - primarily Unicode; this is optical character recognition (OCR). In this step, image-level bits are converted into a “script-centric” digital representation in terms of Unicode (a short encoding sketch is given at the end of this section). Problems in this phase include: a) handling different scripts, including Grantha and others - different scripts may be used for the same language, and since these are handwritten, the same script may be written by different authors in different ways; b) identifying the destination language itself so that the text can be encoded properly, and so that the digital manuscript can finally be “analyzed” downstream for its syntax and semantics and subjected to other kinds of content analyses. Many scripts (or specific glyphs) may not yet have Unicode code points - Indic scripts are still being integrated into the Unicode standard - so we may also have to consider “transliteration” where needed. Finally, since a single source manuscript may be written in different scripts and languages, the various digital manuscripts need to be linked together so that downstream efforts are focused properly.

  3. A variety of tools are required to manually assist and curate the various stages of the first two steps. Human feedback has to be collected properly with annotations and used to build ML tools that scale out the processes.
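
As a concrete illustration of the image-level cleanup described in issue 1 above, the sketch below denoises and binarises a single scanned page with OpenCV. This is a minimal sketch, not the project's prescribed tooling: the file name is a placeholder, and a real pipeline would add stitching, deskewing and manual review on top of it.

```python
# Minimal cleanup of one scanned page with OpenCV: denoise, then binarise with an
# adaptive threshold so uneven lighting and shadows do not swallow the text.
# "page_0001.jpg" is a placeholder path for a single scanned manuscript image.
import cv2

page = cv2.imread("page_0001.jpg", cv2.IMREAD_GRAYSCALE)

# A light median blur removes specks and scanner noise without blurring strokes.
denoised = cv2.medianBlur(page, 3)

# A local (adaptive) threshold copes with the uneven illumination typical of
# palm-leaf and bound-volume scans better than a single global cutoff would.
binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 15)

cv2.imwrite("page_0001_binary.png", binary)
```

Pages that pass this kind of automated cleanup would still go through the expert annotation and quality checks described above before being queued for OCR.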

Once the above issues are addressed at scale, future work can address the downstream steps of linguistic and content analyses by different subject matter experts.
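
To make the “digital encoding” target in issue 2 concrete, the short sketch below inspects the Unicode code points behind a Devanagari word and then transliterates it into a Roman scheme. The indic_transliteration package used at the end is an assumption chosen for illustration; any equivalent transliteration tool would serve.

```python
# What "digital encoding" means in practice: every character of a correctly
# encoded manuscript maps to a named Unicode code point, independent of any font.
import unicodedata

word = "धर्म"   # a sample Devanagari word
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0927  DEVANAGARI LETTER DHA
# U+0930  DEVANAGARI LETTER RA
# U+094D  DEVANAGARI SIGN VIRAMA
# U+092E  DEVANAGARI LETTER MA

# Transliteration between schemes, for scripts or glyphs that are easier to type
# in another representation. The indic_transliteration package is an assumption.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

print(transliterate(word, sanscript.DEVANAGARI, sanscript.IAST))   # dharma
```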

Approach & Resources:

As a coordinated effort towards enabling steps 1 and 2 above, Vedavaapi (https://vedavaapi.org) is a crowdsourcing platform being developed by the Vedavaapi Foundation, Bangalore. APIs to access both the raw data (as images) and other tools are publicly available. The overview paper on the Vedavaapi architecture in reference 2 provides the details; please reach out to info@vedavaapi.org for further information.

Project Proposal:

Human-assisted OCR for Indic texts

Vedavaapi has a web-based UI that presents a scanned manuscript as image snippets containing words or lines of text and allows human experts to type the text contained therein in any Indic script, as Unicode. Vedavaapi stores the text labels of the image snippets persistently in a scalable, Internet-accessible object store with a RESTful API for access. Once the first few pages of a manuscript have been typed manually, a machine-learning tool is needed to auto-convert the rest of the manuscript into text and present it to the user for correction. Based on the corrections, we need the tool to re-learn and re-detect the text so as to reduce further errors. Since the manuscript text is transcribed by human experts, the original text in the image need not have a computer font or Unicode encoding, and its script need not match that of the typed text.
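
As a rough picture of how one such human-typed label could be persisted through a RESTful object store, here is a hypothetical example. The endpoint URL and field names are placeholders invented for illustration, not the actual Vedavaapi object schema; the real calls should be taken from the Vedavaapi API documentation.

```python
# HYPOTHETICAL illustration only: persist one expert-typed label for an image
# snippet via a RESTful call. The URL and JSON fields are placeholders, not the
# actual Vedavaapi schema.
import requests

snippet_label = {
    "manuscript_id": "manuscript-123",       # placeholder document id
    "page": 4,
    "snippet_bbox": [120, 310, 640, 360],    # x1, y1, x2, y2 in page pixels
    "script": "Devanagari",
    "text": "धर्मक्षेत्रे कुरुक्षेत्रे",      # Unicode text typed by the expert
    "typed_by": "volunteer-42",
}

resp = requests.post("https://example.vedavaapi.org/api/labels", json=snippet_label)
resp.raise_for_status()
```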

So the machine-learning problem is as follows:

  1. Use the Vedavaapi Object API (RESTful) to extract image snippets of lines/words of a manuscript along with their transliterated Unicode text labels.

  2. Build an OCR model of that document using those labeled image snippets.

  3. Use the model to output the text of unlabelled image snippets in the rest of the manuscript.

  4. Provide a way for the model to be “corrected” based on updated labeled image snippets.
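
A minimal sketch of steps 1-4 is given below, assuming a hypothetical layout for the Vedavaapi Object API (the endpoint paths, JSON fields and document id are placeholders) and using a deliberately simple nearest-neighbour classifier as a stand-in for the OCR model; a real system would more likely use a sequence model such as a CRNN trained with CTC loss.

```python
# Sketch of the fetch-train-predict-correct loop. Endpoints and fields are
# HYPOTHETICAL placeholders; consult the Vedavaapi API docs for the real schema.
import io
import requests
import numpy as np
from PIL import Image
from sklearn.neighbors import KNeighborsClassifier

API_BASE = "https://example.vedavaapi.org/api"    # placeholder, not the real API root

def fetch_snippets(doc_id, labelled=True):
    """Yield (image, text) pairs for one manuscript (hypothetical endpoint)."""
    resp = requests.get(f"{API_BASE}/docs/{doc_id}/snippets",
                        params={"labelled": labelled})
    resp.raise_for_status()
    for item in resp.json():                      # assumed: a list of snippet records
        img_bytes = requests.get(item["image_url"]).content
        image = Image.open(io.BytesIO(img_bytes)).convert("L")
        yield image, item.get("text")             # text is None for unlabelled snippets

def to_features(image, size=(32, 32)):
    """Toy featurisation: resize and flatten. A real OCR model would read sequences."""
    return np.asarray(image.resize(size), dtype=np.float32).ravel() / 255.0

def train(snippets):
    """Fit the stand-in OCR model on (image, text) pairs."""
    features = np.stack([to_features(img) for img, _ in snippets])
    texts = [text for _, text in snippets]
    model = KNeighborsClassifier(n_neighbors=1)   # stand-in for a real OCR model
    model.fit(features, texts)
    return model

# Steps 1-2: build a model from the manually typed pages.
labelled = list(fetch_snippets("manuscript-123", labelled=True))
model = train(labelled)

# Step 3: predict text for the remaining snippets and present it for correction.
unlabelled = list(fetch_snippets("manuscript-123", labelled=False))
predicted = model.predict(np.stack([to_features(img) for img, _ in unlabelled]))

# Step 4: once experts correct the predictions in the UI, fold the corrected
# pairs back into the labelled set and retrain, so errors shrink with each pass.
# corrected = [(img, fixed) for (img, _), fixed in zip(unlabelled, fixed_texts)]
# model = train(labelled + corrected)
```

The nearest-neighbour stand-in only works when identical words recur as whole snippets; it is here to make the loop concrete, and the open problem is precisely to replace it with a model that generalises across handwriting, scripts and unseen words.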

How can you help/participate?

As a volunteer on this AI4Bharat effort, you can help at multiple levels, a few of which are mentioned below:

  • Check and validate raw image collections from different sources for basic quality. Collect and share manuscripts to be digitized and labelled.

  • Annotate raw images - identify scripts, languages, etc. Help build training sets for machine learning.

  • Develop and run models for OCR. Maintain working models for different scripts and languages. This problem is elaborated above.

  • Evaluate extracted data for quality and errors, and provide feedback.

  • Use tools from the Vedavaapi platform.

  • Contribute to the crowdsourcing toolkit to scale out the effort.

Mentor

Sai Susarla  

Dean at Maharshi Veda Vyas MIT School of Vedic Sciences

Lead

Joseph Geo  

One Fourth Labs

Contributors

Gokul NC  

One Fourth Labs