Crowdsourcing Corpus Cleaning



Creator/Artist: Krishna Nair

Category: Interaction Design

Document: P2 Project

Batch: 2018-2022

Source: India, IDC IIT Bombay

Period: 2019-onwards

Medium: Report pdf

Supervisor: Prof. Anirudha Joshi


Detailed Description

Swarachakra Malayalam is a Malayalam text input keyboard developed by IDC, IIT Bombay, for touch-input mobile devices. It has over 100,000 downloads on the Play Store at the time of writing. The words typed by users through this keyboard are recorded as a word list. Each entry in the word list has two fields: the word that was typed and the number of times (frequency) it was typed. The copy of the word list I worked with is from 2015 and contains 412,495 unique words with their frequencies. This project starts from the need to clean the word list (remove or correct errors and tag problem words) to produce a usable database of corrected words and the nature of their corrections, which can then be used to develop an autocorrection or text suggestion system.
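
To make the shape of the data concrete, here is a minimal Python sketch of a word list with frequencies and of how corrections might be folded back into it. The report does not describe the storage format or the tooling used, so everything here is an assumption for illustration: the tab-separated "word, frequency" line format, the `Correction` record, the tag names, and the choice to drop tagged words from the cleaned list are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Correction:
    """Hypothetical record of one cleaning decision."""
    original: str             # word as it appears in the raw list
    corrected: Optional[str]  # corrected spelling, or None if only tagged
    tag: str                  # nature of the problem (assumed labels)


def parse_word_list(lines: Iterable[str]) -> Counter:
    """Read 'word<TAB>frequency' lines (assumed format) into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, freq = line.rsplit("\t", 1)
        counts[word] += int(freq)
    return counts


def apply_corrections(counts: Counter, corrections: Iterable[Correction]) -> Counter:
    """Return a cleaned copy: corrected words pass their frequency to the
    corrected form; words that are only tagged are removed (a design
    choice of this sketch, not something the report specifies)."""
    cleaned = Counter(counts)
    for c in corrections:
        freq = cleaned.pop(c.original, 0)
        if c.corrected is not None and freq:
            cleaned[c.corrected] += freq
    return cleaned


# Placeholder Latin strings stand in for Malayalam words.
raw = parse_word_list(["word\t5", "wrod\t2", "xqz\t1"])
fixes = [Correction("wrod", "word", "typo"), Correction("xqz", None, "junk")]
cleaned = apply_corrections(raw, fixes)
```

Keeping the `Correction` records separate from the cleaned counts preserves both outputs the project asks for: the corrected words and the nature of each correction.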