Android App for Crowdsourcing Data and Labels in the Wild
A critical horizontal component for most AI tasks is the ability to efficiently and reliably collect data (images, text, video) and labels in the wild. AI4Bharat’s open-source crowdsourcing Android app will enable this at scale.
|Status|Ready to kick-off|
|Team|Web and mobile app developers (ideally a partnering organisation), ML engineers|
|Priority|High priority. Public release by Dec 2019|
Build an Android app that can (a) allow domain experts to collect data across modalities: text, images, videos, and speech; (b) label or tag data samples by annotating images, writing captions, or classifying across categories; and (c) back up the data to a cloud database and enable downstream visualisation and machine learning. The app must be customisable to allow different configurations: one instance may use only text captioning of images, while another may require annotations on the images. Finally, the app must be progressive, i.e., it should work across network conditions: if no network is available, data must be stored locally and transmitted to the cloud when connectivity is restored. The app must follow best practices of data version control.
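The progressive, offline-first requirement above can be sketched as a local queue that buffers samples and flushes them to the cloud once connectivity returns. This is a minimal illustration, not the app's actual design; the names `Sample`, `CloudStore`, and `SyncQueue` are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A captured data sample; fields are placeholders for illustration.
class Sample {
    final String id;
    final String modality;
    final String payload;
    Sample(String id, String modality, String payload) {
        this.id = id; this.modality = modality; this.payload = payload;
    }
}

// Abstraction over the cloud backend; returns false when an upload fails.
interface CloudStore {
    boolean upload(Sample sample);
}

class SyncQueue {
    private final CloudStore cloud;
    private final Deque<Sample> pending = new ArrayDeque<>();

    SyncQueue(CloudStore cloud) { this.cloud = cloud; }

    // Called whenever a domain expert captures a new sample.
    void submit(Sample sample, boolean networkAvailable) {
        pending.addLast(sample);
        if (networkAvailable) flush();
    }

    // Called when connectivity is restored.
    void flush() {
        while (!pending.isEmpty()) {
            if (cloud.upload(pending.peekFirst())) {
                pending.pollFirst();
            } else {
                return; // keep the sample queued locally and retry later
            }
        }
    }

    int pendingCount() { return pending.size(); }
}
```

In a production Android app this buffering would typically sit on top of persistent local storage rather than an in-memory queue, so queued samples survive app restarts.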
Relevance in India
There are three axes along which AI4Bharat’s crowdsourcing app is relevant in the Indian context:
The scale and diversity of the country necessitate efficient data collection. With a major push towards digitisation, several organisations — private companies, NGOs, and government departments — need data collection tools. Each such body is building individual solutions that do not interoperate or scale to enable downstream AI applications. There is thus a need to define a standard for collecting data and labels for AI use-cases.
In some applications, such as building a face recognition engine, data can be “scraped” from the internet, but with questionable access rights and often with privacy violations. Collecting data from willing participants through an AI4Bharat crowdsourcing app instead provides an ethical alternative.
In specific domains, the data collection process requires customisations motivated by the AI model used. For instance, consider the problem of classifying a heart condition from an image of an ECG print-out. Doctors and medical students could collect images and annotate the heart condition during their day-to-day work. It turns out that only specific patterns in the long time series of an ECG reading are salient for classifying the heart condition. An AI model’s accuracy at this classification can be significantly improved if this saliency is provided as an additional input, which requires that doctors be able to annotate specific regions of the image as salient. Such configurable data collection requires a custom-built crowdsourcing app.
A lot of work has been done on collecting form-based data, usually surveys that social workers fill out on the ground, such as exit polls after elections. Examples include Social Cops Collect, KoBo Toolbox, Open Data Kit, and Magpi. These tools provide form authoring with subsequent visualisation and integration with BI applications. Their primary focus is to enable data collection on the ground with resource-constrained devices, such as feature phones. However, these tools are not designed for AI applications; for instance, they do not allow any user annotation on images.
The complement to the above tools is Google’s Crowdsource app, which allows end users to help generate labelled data for training models on tasks such as translation, landmark detection, and sentiment classification. With Crowdsource, users do not collect data; instead, they help generate labels on already sourced data. This is similar to labelling services on platforms such as Amazon Mechanical Turk and Figure Eight, where employed or contracted individuals can be paid to perform annotation tasks.
A third category of existing work is tools built specifically for annotating data for machine learning. These include open-source tools such as Annotorious for images, Praat for audio, and Bella for text. Another interesting solution is the open-source tool from Data Turks.
The first technical challenge is the design — defining an extensible system architecture of the app and its deployment environment.
The system architecture can be split into three different components.
First, there is an authoring tool that can be used on a desktop machine to design a data collection flow.
Second, there is the app itself which instantiates a given data collection flow on a smartphone.
Third, there is the cloud infrastructure for storing, retrieving, and basic visualisation of the data and labels.
This flow should be instantiable across different applications, for instance the different projects under AI4Bharat. The sign language translation project will require video data collection, with videos labelled with text. The ECG-based heart condition classification project will require images, but with custom labels as annotations on the images. The authoring tool should enable the creation of multiple such instances, which will then be accessible to domain experts through the crowdsourcing app. The cloud store could support multiple visualisation applications; for instance, in the case of the Paddy crop diseases project, multiple government bodies may want to visualise the data.
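One way the authoring tool could represent a data collection flow is as a small declarative configuration that the app then instantiates. The sketch below is an assumption about the schema, not the project's actual format; it expresses the two instances mentioned above.

```java
import java.util.List;

// Input modalities and label types supported by the (hypothetical) schema.
enum Modality { TEXT, IMAGE, VIDEO, SPEECH }
enum LabelType { TEXT_CAPTION, REGION_ANNOTATION, CATEGORY }

// One data collection flow as produced by the authoring tool.
class CollectionFlow {
    final String project;
    final Modality inputModality;
    final LabelType labelType;
    final List<String> categories; // empty when labels are free-form

    CollectionFlow(String project, Modality inputModality,
                   LabelType labelType, List<String> categories) {
        this.project = project;
        this.inputModality = inputModality;
        this.labelType = labelType;
        this.categories = categories;
    }
}

class ExampleFlows {
    // Sign language translation: collect videos, caption them with text.
    static final CollectionFlow SIGN_LANGUAGE = new CollectionFlow(
            "Sign language translation", Modality.VIDEO,
            LabelType.TEXT_CAPTION, List.of());

    // ECG classification: collect images, annotate salient regions.
    // The category names here are illustrative placeholders.
    static final CollectionFlow ECG = new CollectionFlow(
            "ECG heart condition classification", Modality.IMAGE,
            LabelType.REGION_ANNOTATION, List.of("normal", "abnormal"));
}
```

A declarative representation like this is what would let one codebase serve very different projects: the app reads the flow and renders the matching capture and labelling screens.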
The next challenge in the project is to adapt best practices from data version control; a good open-source standard for this is defined by DVC. Version control is essential because the data collection process is rolled out with iterative refinements to the collection process and labelling practices. Moreover, the data set itself is not always stationary, so annotating metadata such as geo- and time-tags is important.
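A core idea the app could borrow from DVC-style tooling is content addressing: each sample is stored under the hash of its bytes, so a dataset version is just a named mapping from file paths to addresses, edited samples get new addresses, and unchanged samples deduplicate naturally. The sketch below illustrates the hashing step; `ContentStore` and its method name are assumptions for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentStore {
    // Returns the SHA-256 digest of the sample's bytes as lowercase hex;
    // this string serves as the sample's stable, content-derived address.
    static String contentAddress(byte[] bytes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(bytes);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

With addresses like these, tagging a dataset version amounts to recording a small map from logical names (e.g. `paddy/leaf_0042.jpg`) to content addresses, which is cheap to store and sync even over poor networks.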
The third challenge is supporting advanced features in the app that increase the quality of collected data. For instance, in the Paddy crop diseases project, when volunteers photograph diseased leaves, it is important to ensure good coverage of the affected region of the plant: with common bacterial leaf blight, the specific affected region of the leaf must be captured. If instead a picture covers the entire plant or an even larger area, the volunteer should get a prompt from the app suggesting a tighter crop. Such active suggestions would increase the quality of the data and help raise awareness amongst data collectors.
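The crop prompt described above could be driven by a simple framing heuristic: if the region of interest (e.g. a detected leaf) occupies too small a fraction of the photo, ask the volunteer to move closer. This is a toy stand-in for whatever detection model the project would actually use; the names and the 20% threshold are assumptions.

```java
class CropCheck {
    // Returns a prompt string when the region of interest is too small a
    // fraction of the image, or null when the framing looks acceptable.
    // regionW/regionH would come from an upstream leaf detector (not shown).
    static String cropSuggestion(int imageW, int imageH,
                                 int regionW, int regionH, double minFraction) {
        double fraction = ((double) regionW * regionH)
                        / ((double) imageW * imageH);
        if (fraction < minFraction) {
            return String.format(
                "The affected region fills only %.0f%% of the frame; "
                + "please take a closer shot.", fraction * 100);
        }
        return null;
    }
}
```

For example, a 100x100 leaf region in a 1000x1000 photo covers only 1% of the frame and would trigger the prompt, while a 600x600 region (36%) would pass.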
|Design document on the app|
|First instance of an Android app to enable data collection in ongoing projects|
|First instance of authoring tool to configure Android app|
|Complete version of the app and authoring tool|