Word embeddings for Indian Languages

Status

Ongoing

Positions

ML engineers

Goal

Release word embeddings for 11 Indian languages by end of 2019

Problem statement

The goal of this project is to build high-quality word embeddings for major Indian languages. The first phase covers the following languages: Hindi, Tamil, Bangla, Punjabi, Gujarati, Marathi, Odia, Assamese, Kannada, Telugu, Malayalam. In the second phase, we plan to cover all the scheduled languages listed in the Constitution of India.

Why this is relevant in the Indian context

What are word embeddings and why build them for Indian languages? Word embeddings are a fundamental resource in modern, deep-learning based NLP. A word embedding (or a distributed representation of a word) is a vector representation of the word with the property that embeddings of similar words lie close to each other in the vector space. This allows modelling semantics and reasoning about similarities between words. Compared to the highly sparse feature spaces of classical ML, word embeddings enable NLP applications with much smaller feature spaces. In addition, word embeddings can be learnt in an unsupervised manner from raw corpora based on the distributional hypothesis, succinctly captured by Firth's dictum: "You shall know a word by the company it keeps." It is the regularities in the co-occurrence of words that enable learning similarities between them. By learning from large corpora, word embeddings can capture diverse semantic and syntactic relationships between words. Learning word embeddings can thus be thought of as unsupervised feature extraction, reducing the need to build linguistic resources for feature extraction and to hand-code feature extractors.
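
As a concrete illustration, the sketch below loads a set of pre-trained vectors and queries them by cosine similarity using gensim; the file name cc.hi.300.vec and the Hindi query words are placeholders, and any pre-trained vectors in word2vec text format could be used.

    # Minimal sketch: querying pre-trained word embeddings by cosine similarity.
    # Assumes a vectors file in word2vec text format (e.g. a FastText .vec file);
    # the path and the query words below are placeholders.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("cc.hi.300.vec", binary=False)

    # Cosine similarity between two words (higher means more similar).
    print(wv.similarity("राजा", "रानी"))

    # Nearest neighbours of a word in the embedding space.
    print(wv.most_similar("भारत", topn=5))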

India has 22 constitutionally recognised languages with a combined speaker base of over 1 billion people. Though India is rich in languages, it is poor in resources for these languages. This severely limits our ability to build natural language processing tools for Indian languages. The demand for such tools is likely to increase as the number of Indian language users on the internet grows. The availability of affordable Indian-language-enabled phones is also increasing the demand for and supply of original content in Indian languages in the form of blogs, social media posts, advertisements, etc. NLP of such content, to identify sentiments, summarise blogs, translate articles, etc., will become crucial in the coming years.

While it may be feasible to invest effort in developing various linguistic resources for a single language like English, such an effort is expensive and does not scale to the large number of Indian languages that need to be supported. Neural NLP methods using pre-trained word embeddings can largely mitigate the need to develop many such linguistic tools, with the word embeddings themselves serving as a rich source of feature information extracted in an unsupervised way. Further, recent advances have made it possible to represent embeddings from different languages in the same vector space (referred to as multilingual embeddings). This opens up the possibility of sharing linguistic resources across languages via transfer learning, thus reducing the cumulative need for resources.

Data availability and collection

This section describes the data and resources required for building word embeddings and lists the known publicly available sources of information. 

Monolingual Corpora

The basic requirement for building word embeddings is a large monolingual corpus representative of the various domains of interest. The quality of a word's embedding depends on the word's frequency, which is related to the corpus size. As the morphological richness of the language increases, the vocabulary size increases and the average frequency of each token decreases. Hence, it is important to build large monolingual corpora for Indian languages to get high-quality word embeddings.

Today, few monolingual corpora suitable for building word embeddings are available off the shelf for Indian languages. The prominent sources are:

  • Wikipedia

  • CIIL Corpus (Research)

  • LINDAT Hindi Monolingual Corpus (Research)

  • CommonCrawl (Research)

  • EMILLE corpus (Research)

Most of these corpora are not large enough to train high-quality word embeddings (with the exception of the Hindi monolingual corpus). For comparison, the pre-trained embeddings made available by FastText (trained on Wikipedia and CommonCrawl) were trained on billions of tokens.

While curated corpora may not be available off-the-shelf, most major Indian languages have a significant presence online in the form of newspapers, news portals, government websites, etc. which can be a source of high-quality, diverse monolingual corpora. Crawling these sources should be the first target to build reasonably large monolingual corpora for Indian languages. 

Pre-trained embeddings

FastText provides a comprehensive set of pre-trained embeddings for Indian languages. They provide 2 sets of pre-trained embeddings: 

  • Trained on Wikipedia data

  • Trained on CommonCrawl + Wikipedia data

The Polyglot project also provides pre-trained embeddings trained on Wikipedia for many Indian languages. The model used is similar to the Neural Language Model proposed by Bengio et al. 

Evaluation Resources

The quality of word embeddings can be evaluated using either an intrinsic or an extrinsic task. Extrinsic evaluation refers to showing the utility of the embeddings in another NLP application like sentiment analysis, machine translation, etc. Intrinsic evaluation refers to directly obtaining judgments on the quality of the embeddings themselves. Intrinsic evaluation is more insightful and speeds up the development cycle of building word embeddings, so we focus on it here. The following are commonly used intrinsic evaluation tasks:

  • Word similarity/relatedness: In this task, a dataset containing word pairs and manual similarity judgments is created. Using the learnt word embeddings, a similarity score for each word pair is computed, e.g. the cosine similarity between the embeddings. The correlation between the manually judged similarity scores and the scores derived from the embeddings measures the embedding quality. Many word similarity datasets are available for English, German, etc. A word similarity dataset has also been published for some Indian languages; it is a partial translation of the WS353 dataset, with similarity scores annotated by multiple Indian annotators. (A small evaluation sketch follows this list.)

  • Word Analogy: Given a pair of words (A:B) with a particular relationship and a word C belonging to the same category as A, the word analogy task involves predicting the word D that holds the same relationship to C as B holds to A. This task exploits the linear relationships between word embeddings to search for the missing word. Word analogy datasets can capture semantic as well as syntactic relationships depending on the tuples included in the query inventory. See Mikolov et al. for examples of the different kinds of relationships represented in the word analogy dataset. A word analogy dataset for Hindi has been made publicly available by the FastText project.
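
Both intrinsic evaluations above can be run with a few lines of code. The sketch below assumes `wv` is the gensim KeyedVectors object loaded earlier and that a word similarity file with lines of the form word1<TAB>word2<TAB>score is available; the file name and the Hindi analogy words are placeholders.

    # Sketch of intrinsic evaluation. Assumes `wv` is a gensim KeyedVectors
    # object and a word-similarity file with "word1<TAB>word2<TAB>score" lines
    # (the file name is a placeholder).
    from scipy.stats import spearmanr

    human_scores, model_scores = [], []
    with open("hi_word_similarity.tsv", encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            if w1 in wv and w2 in wv:
                human_scores.append(float(score))
                model_scores.append(wv.similarity(w1, w2))

    # Word similarity: Spearman correlation between human judgments and
    # embedding-based similarities.
    rho, _ = spearmanr(human_scores, model_scores)
    print("Word similarity (Spearman rho):", rho)

    # Word analogy A:B :: C:D via vector arithmetic, D ≈ B - A + C.
    # Example query (placeholder words): raja:rani :: aadmi:?
    print(wv.most_similar(positive=["रानी", "आदमी"], negative=["राजा"], topn=1))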

Existing work - Research and Practice

Please refer to the Stanford NLP course material for an overview of the current state of research and practice on word embeddings.

Open Technical Challenges

Below we list some open technical challenges in building word embeddings for Indian languages.

  1. Indian languages are morphologically rich. This increases the vocabulary size and reduces the number of observed instances of a given token in the corpus. This has an adverse impact on the quality of the word embeddings. 

  2. Beyond the challenges of data sparsity introduced by morphological richness, the very question of how word embeddings must incorporate morphology in order to be useful to downstream applications needs to be investigated. 

  3. Many Indian languages are unlikely to have large monolingual corpora. However, these low-resource languages are often related to high-resource languages. It is worth investigating whether the word embeddings of low-resource languages can be improved using the corpora/embeddings of related high-resource languages (one possible alignment approach is sketched after this list).
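
One commonly studied direction for challenge 3 is to align the embedding spaces of a related high-resource and low-resource language pair using a small bilingual dictionary, so that information can be transferred from the high-resource side. The sketch below shows the standard orthogonal Procrustes alignment in NumPy; the dictionary embedding matrices are assumed inputs, and this is only one of several possible approaches.

    # Sketch: align the embedding space of a high-resource language to that of
    # a related low-resource language using an orthogonal map learnt from a
    # small bilingual dictionary (orthogonal Procrustes). X and Y are assumed
    # inputs whose i-th rows are the embeddings of the i-th dictionary pair.
    import numpy as np

    def learn_orthogonal_map(X, Y):
        """X: (n, d) source vectors, Y: (n, d) target vectors for n word pairs.
        Returns an orthogonal W such that X @ W approximates Y."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    # Applying the learnt map to the full source embedding matrix:
    #   W = learn_orthogonal_map(X_dict, Y_dict)
    #   mapped_source = source_embeddings @ W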

Technical Milestones

Below we list the project milestones across three functions: corpus collection, building evaluation datasets, and building pre-trained word embeddings.

Monolingual Corpus Collection

The goal is to collect large-scale monolingual corpora for major Indian languages and distribute them in a simple format (one tokenised sentence per line; a minimal preprocessing sketch follows the milestone list below). The major sources to explore are CommonCrawl, newspaper websites and government websites. We target the following milestones based on the size of the collected monolingual corpora.

  • 100 million tokens: This should be feasible for all major Indian languages and is a good starting point for building word embeddings.

  • 500 million tokens: This should be feasible for all major Indian languages.

  • 1 billion tokens: We aim to collect corpora of this size for major Indian languages like Hindi, Tamil and Bangla, which have a lot of online content.

  • 100 million tokens for all 22 scheduled languages. 
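
As an illustration of the target corpus format, the sketch below converts raw text into one tokenised sentence per line; it assumes the Indic NLP Library's sentence splitter and tokeniser, and the file paths and language code are placeholders (any comparable tools could be substituted).

    # Sketch: normalise raw crawled text into one tokenised sentence per line.
    # Assumes the Indic NLP Library; paths and the language code are placeholders.
    from indicnlp.tokenize import sentence_tokenize, indic_tokenize

    lang = "hi"
    with open("raw_text.txt", encoding="utf-8") as fin, \
         open("corpus.hi.txt", "w", encoding="utf-8") as fout:
        for paragraph in fin:
            for sentence in sentence_tokenize.sentence_split(paragraph.strip(), lang):
                tokens = indic_tokenize.trivial_tokenize(sentence, lang)
                fout.write(" ".join(tokens) + "\n")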

Building Evaluation datasets

Our goal is to build evaluation datasets which give us a good indication of the quality of the word embeddings. Building evaluation datasets is a labour-intensive and expensive operation. To optimize this process, we can consider different approaches: 

  • Cover the most important languages in terms of number of speakers (Hindi, Tamil, Bangla) 

  • Select some representative languages whose performance can provide an indication of the performance for other languages. 

  • Leverage existing resources to build evaluation datasets for novel tasks. For instance, IndoWordNet can be leveraged to design a hypernym prediction task (see the sketch after this list).
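
For instance, assuming (word, hypernym) pairs have already been extracted from IndoWordNet into a tab-separated file, a simple hypernym prediction evaluation could rank candidate hypernyms by embedding similarity, as in the hypothetical sketch below (the file name and setup are assumptions, not an existing dataset).

    # Hypothetical sketch: hypernym prediction as an embedding evaluation.
    # Assumes a "word<TAB>hypernym" file derived from IndoWordNet (extraction
    # not shown) and `wv`, a gensim KeyedVectors object loaded earlier.
    pairs = []
    with open("hi_hypernyms.tsv", encoding="utf-8") as f:
        for line in f:
            word, hypernym = line.strip().split("\t")
            if word in wv and hypernym in wv:
                pairs.append((word, hypernym))

    candidates = sorted({h for _, h in pairs})

    # Precision@1: is the true hypernym the candidate closest to the word?
    correct = sum(
        1 for word, hypernym in pairs
        if max((c for c in candidates if c != word),
               key=lambda c: wv.similarity(word, c)) == hypernym
    )
    print("Hypernym precision@1:", correct / len(pairs))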

Build Pre-trained Word Embeddings

  • Milestone 1: Compare different word embedding algorithms and build high-quality embeddings for all major Indian languages with 100 million tokens using off-the-shelf tools (a minimal training sketch follows this list).

  • Milestone 2: Improve the quality of word embeddings using 500 million to 1 billion tokens, with the goal of significantly bettering the best publicly available pre-trained embeddings on the defined evaluation sets.

  • Milestone 3: Explore better ways of modelling rich morphology and develop/adapt relevant evaluation metrics. This investigation should lead to the release of: (a) software for learning word embeddings that models morphology better and can efficiently train models on large datasets, (b) pre-trained embeddings based on the proposed approach.

  • Milestone 4: Improve embeddings of low resource languages by utilizing corpora/embeddings from high-resource languages. 
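
For Milestone 1, one obvious off-the-shelf baseline is FastText's skip-gram model with subword information, which also addresses morphological richness to some extent. The sketch below uses the fasttext Python package; the corpus path and hyperparameters are placeholders to be tuned while comparing algorithms.

    # Sketch for Milestone 1: train skip-gram embeddings with subword
    # information using the off-the-shelf fasttext package. The corpus path
    # and hyperparameters below are placeholders.
    import fasttext

    model = fasttext.train_unsupervised(
        "corpus.hi.txt",      # one tokenised sentence per line
        model="skipgram",
        dim=300,              # embedding dimension
        minCount=5,           # ignore very rare tokens
        minn=2, maxn=5,       # character n-gram range for subword modelling
        epoch=10,
    )
    model.save_model("embeddings.hi.bin")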

Current Team and Open Positions

Mentors

Mitesh Khapra
IIT Madras, One Fourth Labs

Anoop K
Microsoft

Lead

Contributors

References