Sign Language Translation
|Status|Ready to kick-off|
|Team|ML engineers, App developers, Domain experts in sign language|
|Timeline|Public release of translation app planned by January 2020|
The aim of this project is to build an automatic Indian Sign Language (ISL) translator - i.e. given an input video of someone signing a word, the model should be able to predict which word is being signed.
The model built should be reasonably robust to variations in camera angle, video quality, distance from the camera to the subject, variations in lighting etc.
Such a model has several use-cases, and can be deployed through an app or website for people to use. It is highly relevant not only to the Deaf community, but also to anyone who interacts with them.
There are 1.1 million Deaf people in India. 98% of them are illiterate, making sign language their only means of communication. This trend is not expected to change anytime soon, as only 2% of deaf children attend school.
Relevance in India
The number of people in the country with hearing impairment was estimated at 13 lakh in the 2011 Indian census. However, this could be a case of significant under-reporting: the National Association of the Deaf estimates that there are 1.8 crore people who are deaf. The real numbers could be even larger, as about 3 percent of the population of most countries experiences hearing loss, which translates to about 4 crore people in India. With such a large population of the nation depending on sign language, it is important to support them and their kith and kin with assistive technology.
Though the signs used in the country vary from region to region, several attempts are being made to systematise a convention that is referred to as the Indian Sign Language (ISL). However, this development is still in its nascent stage and is not reaching most deaf schools. Even otherwise, an estimated only 2% of deaf children attend formal education. These twin problems of a developing standard and the limited reach of formal education necessitate tools that can be centrally developed and deployed at scale to accelerate awareness, adoption, and inter-operability of ISL.
Data Availability and Collection
Internationally, a large number of sign language datasets do exist. However, most of them are image datasets. Letters can be fingerspelled in sign language, and these image-based datasets contain pictures of finger positions, with each class representing a different letter.
The most well-known image-based datasets are:
Sign Language MNIST: This Kaggle dataset contains finger-spellings in ASL (American Sign Language)
Turkish Fingerspelling Dataset: This dataset contains Turkish fingerspelling alphabets.
For video based datasets, collection efforts have been undertaken by Germany, USA, Poland and China. A detailed survey of the various Sign Language datasets, and their characteristics, can be found here.
One of the major challenges that we face is that there are no ISL datasets available online. Though datasets are available for American, British, Chinese, German and other sign languages, there exist absolutely none for Indian Sign Language.
Hence, as a part of this project, we attempted to create the first such dataset. It is a general purpose dataset with recordings of the 264 most commonly used words in ISL. The dataset contains around 4600 videos belonging to 264 classes with roughly 20 videos per class.
A frame from a video in the dataset
Each sign was recorded by 6-7 ISL interpreters, with each interpreter being the subject of 3-4 videos per sign. Each video is between 1-4 seconds in length. All videos were recorded in bright lighting conditions with similar backgrounds.
For more information, and to download the dataset, please visit the Dataset page (link coming soon).
Finding NGOs and universities to collaborate with on data generation was also a considerable challenge, and we are incredibly grateful to the Ability Foundation, Chennai and St. Louis College for the Deaf, Chennai, for their cooperation in this regard.
Existing Work - Research and Practice
Globally, numerous attempts have been made to do Sign Language classification on videos. A few of the notable attempts are:
Deep Hand: This CVPR 2016 paper presents a new approach to learning a frame-based classifier on weakly labelled sequence data by embedding a CNN within an iterative EM algorithm. This allows the CNN to be trained on a vast number of example images when only loose sequence-level information is available for the source videos. Its value for continuous sign language recognition is evident from tests on two publicly available large sign language datasets, where it outperforms the previous state-of-the-art by a large margin. More details about the paper can be found here.
Deep Sign: Hybrid CNN-HMM for Continuous Sign Language Recognition – This paper introduces the end-to-end embedding of a CNN into a HMM, while interpreting the outputs of the CNN in a Bayesian fashion. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. More information about the paper can be found here.
Recognition of Sign Language expressions using Kinect: The paper attempts to recognise isolated Polish Sign Language words using a Kinect. The paper tries two different approaches – whole word classification using a kNN model, and breaking down the words into subunits (phonemes) and classifying at the phoneme level using kNNs based on the edit distance. The subunits based model proved to be superior during testing. More details about the paper can be found here.
There have been many sign language translation attempts with custom-made datasets. Most of these datasets are heavily limited in scope (only images, videos with uniform backgrounds such that hands can be easily segmented, etc.), which makes the resulting models unusable in a general context. Given below are some of the important works in this area:
Recognition of sign language in live video by Singha and Das works only on hands detected against a uniform dark background, by converting the image to the HSV space. It then uses histograms to detect similarity between images and selects unique images. Finally, images are passed through a classifier to predict signs.
Nearest Neighbour classification of ISL using Kinect cameras by Ansari and Harit uses kNNs to classify 140 different signs. The paper uses a dataset with only images, and makes several assumptions about the available data.
Selfie video based ISL recognition system by Rao and Kishore  again depends on the contrast between the hands and the background, which it detects using an edge detector. After this, some feature extraction is done and kNN is used for classification.
Low latency human action recognition by Cai et al.  detects skeletal key points from a human gesture. It uses these to define limbs or connections between adjacent points on the skeleton. Velocity and acceleration of these points are calculated as well, and a Markov Random Field is used to label test data limbs into states. Classification is done by observing histograms of this test data.
Segment, track, extract, recognize and convert sign language by Kis classifies 351 symbols using an ANN. The model relies on the subject wearing a t-shirt of the same colour as the background wall, which should contrast with his/her hands and face.
Open Technical Challenges
Various technical challenges remain to be resolved in order to build a fully functional ISL translator. Given below are the main types of challenges still to be solved:
Data Generation and Augmentation
There are 2 sets of challenges here:
Data generation for ISL: There are almost no publicly available datasets for ISL. The dataset put out by us could also vastly benefit from more recordings per sign, as well as the addition of more signs. Moreover, there are a large number of variations of ISL across India, and robust translation will require accounting for these variations as well. Hence, data collection for variations should happen too.
Data augmentation: Data generation for ISL is an expensive and time-consuming activity. Therefore, data augmentation techniques that create realistic variations in the data can play a pivotal role in reducing the amount of effort spent on data generation. Techniques such as fixed-cost Gaussian mixture models could be explored here.
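As a starting point for augmentation, simple geometric and temporal perturbations of keypoint sequences already go a long way. The sketch below is illustrative, not a description of our pipeline: it assumes a sequence of 2D keypoints of shape (frames, keypoints, 2) with coordinates normalised to [0, 1], and the function name and perturbation magnitudes are our own choices.

```python
import numpy as np

def augment_keypoints(seq, rng=None):
    """Apply simple random augmentations to a keypoint sequence.

    seq: array of shape (frames, keypoints, 2) holding 2D coordinates
    normalised to [0, 1]. Returns a new, perturbed sequence.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = seq.astype(float).copy()

    # Small Gaussian jitter simulates pose-estimation noise.
    out += rng.normal(scale=0.01, size=out.shape)

    # Global scaling simulates varying distance from the camera.
    out *= rng.uniform(0.9, 1.1)

    # Horizontal flip mirrors the signer. NOTE: only valid for signs
    # that are not handedness-sensitive; gate this per class in practice.
    if rng.random() < 0.5:
        out[..., 0] = 1.0 - out[..., 0]

    # Temporal resampling (linear interpolation between neighbouring
    # frames) simulates different signing speeds.
    t = len(out)
    new_t = max(2, int(t * rng.uniform(0.8, 1.2)))
    idx = np.linspace(0, t - 1, new_t)
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    frac = (idx - lo)[:, None, None]
    out = (1 - frac) * out[lo] + frac * out[hi]
    return out
```

Each augmented copy is a "new" recording from the model's point of view, multiplying the effective dataset size at negligible cost.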
Sign Segmentation
One approach to sign language classification involves splitting a video into its constituent signs, and then running a pretrained model to classify each sign. However, segmentation of signs is a non-trivial problem, and has not yet been successfully solved. This approach can be taken one step further, by splitting each sign into its constituent sub-units, running a pre-trained model to classify those sub-units, and using these classes to predict the meaning of the sentence as a whole. However, this faces the same challenge as sign segmentation: there are no clearly demarcated boundaries when transitioning from one sign to the next. Several signs are also combinations of other signs (for example, the sign for "truck" is the sign for "car" + the sign for "long"), and these will have to be handled as well.
An alternate way to go about this (while building early versions of a translator) would be to introduce a minor inconvenience for the user, by asking him/her to shift to a rest pose after each sign, while signing. Adding this inconvenience would make segmenting a trivial job, and then each sign can be easily classified.
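With the rest-pose convention, segmentation reduces to finding low-motion stretches between signs. A minimal sketch of this idea, assuming per-frame 2D keypoints are already available (the threshold values are hypothetical and would need tuning on real data):

```python
import numpy as np

def segment_by_rest_pose(frames, motion_threshold=0.02, min_len=5):
    """Split a keypoint sequence into signs at low-motion (rest) frames.

    frames: array (T, K, 2) of per-frame 2D keypoints.
    Returns a list of (start, end) frame-index pairs, one per detected sign.
    """
    # Mean per-keypoint displacement between consecutive frames.
    motion = np.linalg.norm(np.diff(frames, axis=0), axis=-1).mean(axis=-1)
    moving = motion > motion_threshold  # True where the signer is moving

    segments, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                       # a sign begins
        elif not m and start is not None:
            if i - start >= min_len:        # discard spurious blips
                segments.append((start, i))
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments
```

Each returned (start, end) window can then be passed independently to a per-sign classifier.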
Grammar
ISL rules of grammar are not as strict as English rules of grammar. The same set of signs can be permuted in almost any order, with the resulting sentence having the same meaning.
Speed and Variations
Just like users of any other language, different sign language users sign at different speeds. There are also a large number of variations in the language from region to region. These should gradually be accounted for.
Pose Extraction
Current pose detection models like OpenPose give highly accurate estimates of certain skeletal points, such as those of the face, chest and shoulders. However, these models falter when detecting finger skeletal points. Finger skeletal points are incredibly important for classification, and hence better techniques for finger pose extraction are needed. The major challenge here appears to be fingers occluding each other when gestures are performed.
OpenPose by Cao et al. is the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (135 keypoints in total) on single images and videos. Its true strength, however, lies in its ability to accurately detect pose keypoints in a wide variety of conditions: diverse backgrounds, lighting conditions, camera perspectives and occlusions.
The number and type of skeletal points obtained are detailed enough to allow for sign classification solely using the above points as input features.
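For reference, OpenPose writes one JSON file per frame, storing each keypoint set as a flat [x1, y1, c1, x2, y2, c2, ...] list (c is the detection confidence). A small loader like the one below turns that into NumPy arrays; the function name and the single-signer assumption are ours:

```python
import json
import numpy as np

def load_openpose_frame(path):
    """Read one OpenPose per-frame JSON file into (N, 3) keypoint arrays.

    Returns None if no person was detected in the frame.
    """
    with open(path) as f:
        data = json.load(f)
    if not data["people"]:
        return None
    person = data["people"][0]  # assume a single signer per video

    def to_array(key):
        # Flat [x, y, confidence, ...] list -> (N, 3) array.
        return np.array(person[key], dtype=float).reshape(-1, 3)

    return {
        "body": to_array("pose_keypoints_2d"),             # 25 points (BODY_25)
        "left_hand": to_array("hand_left_keypoints_2d"),   # 21 points per hand
        "right_hand": to_array("hand_right_keypoints_2d"),
    }
```

Low-confidence points (third column near zero) are exactly the occluded-finger cases discussed above, and should be masked or interpolated before feature extraction.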
Classification of signs using single images has been done successfully. The signs chosen for this classification were those which, from a single frame, appeared uniquely different from one another. Key skeletal points were extracted from each image using OpenPose, and a Decision Tree was used for classification. Given below is a visualisation of the skeletal point data for 6 classes (the points are a 2D representation of the original data, obtained using t-SNE). We achieved 96% accuracy on our limited dataset.
Observations in Video Classification
Though complete video classification remains a challenge, given below are a few of our insights, based on our experiments:
Due to the lack of a large dataset, we first perform feature engineering to extract from the dataset a few features we believe are important. We define the concept of limbs, inspired by the Cai et al. paper on skeleton representation.
A limb is defined as a skeleton segment connecting any 2 adjacent points on the human arm. Formally, let j = (x, y) denote the 2D coordinate of a skeletal point, and let l denote the limb connecting the adjacent points j_a and j_b. The position vector P(l) is defined as:

P(l) = j_a − j_b
For our purpose, we consider only limbs found on the arm and the fingers of the human body.
We also define a velocity and an acceleration for each limb. The velocity vector of a frame is the difference between its position vector and that of the previous frame: V_t(l) = P_t(l) − P_{t−1}(l). Similarly, the acceleration vector is the difference between the velocity vectors of consecutive frames: A_t(l) = V_t(l) − V_{t−1}(l).
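These three definitions translate directly into array operations. A minimal sketch, assuming keypoints of shape (frames, keypoints, 2); the limb index pairs are hypothetical placeholders for the actual arm and finger connections:

```python
import numpy as np

# Hypothetical limb definitions as (joint_a, joint_b) index pairs into the
# keypoint array, e.g. shoulder->elbow and elbow->wrist for one arm.
LIMBS = [(2, 3), (3, 4)]

def limb_features(frames, limbs=LIMBS):
    """Position, velocity and acceleration vectors for each limb.

    frames: array (T, K, 2) of 2D keypoints per frame.
    Returns three arrays of shape (T, len(limbs), 2); velocity and
    acceleration are zero-padded at frame 0 so all share one length.
    """
    a = np.array([p for p, _ in limbs])
    b = np.array([q for _, q in limbs])
    pos = frames[:, a] - frames[:, b]               # P(l) = j_a - j_b
    vel = np.diff(pos, axis=0, prepend=pos[:1])     # V_t = P_t - P_{t-1}
    acc = np.diff(vel, axis=0, prepend=vel[:1])     # A_t = V_t - V_{t-1}
    return pos, vel, acc
```

Concatenating the three arrays per frame gives the limb feature vector used in the clustering experiments below.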
These limb features appear to form clusters, for each sign. For example, this is a visualisation of the features of the right arm, for a single class (visualised using tSNE):
Based on these clusters, each sign appears to be composed of a few "action states". We use these action states to further reduce our feature set: by running k-means on the limb features, we can assign an action state to each limb, for each frame of a video. The figure below shows 8 different action states, and the labels assigned for each frame. The clusters have been formed in the high-dimensional space; the 2D space is used only for visualisation.
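The action-state assignment is a plain k-means fit over per-frame feature rows. A small sketch (the function name and the choice of 8 states mirror the description above; scikit-learn is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_action_states(features, n_states=8, seed=0):
    """Cluster per-frame limb features into discrete 'action states'.

    features: array (T, D) - one row per frame, e.g. the concatenated
    limb position/velocity/acceleration vectors.
    Returns an integer state label for each frame.
    """
    km = KMeans(n_clusters=n_states, n_init=10, random_state=seed)
    return km.fit_predict(features)
```

The resulting label sequence is a compact, discrete summary of the video, which is what makes the transition analysis below feasible with little data.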
Relations between Action States
We then explore the relations between different action states. To do this, we build a state transition matrix for the right arm. We observe that each state branches out to a maximum of 2 other states, and that states tend to form loops with each other. We also observe that the state transition diagrams for the various classes differ significantly from one another. Given below are examples of the state transition diagrams for 2 classes, "city" and "street".
These temporal relations can potentially be used for classification. Such methods require far less data to train robust models than traditional deep learning does, and hence should be explored further.