Data Challenge: Differentiating Code-Borrowing from Code-Mixing
About the challenge:
Code-mixing and code-borrowing are two important linguistic phenomena seen in multilingual contexts. In this challenge, participants are required to develop a metric that ranks a set of candidate words by their likelihood of being borrowed. The team(s) that develop the best ranking metric will be announced as the winner(s).
Context and Motivation:
Code-mixing refers to the mixing of two languages. In particular, when words and phrases of one language (the foreign language) are used while communicating in another language (the domain language), code-mixing is said to occur. This phenomenon is often seen in communication among bilingual and multilingual speakers. In code-mixing, speakers are subconsciously aware of the foreign origin of the code-mixed word or phrase. A similar linguistic phenomenon is code-borrowing, where a word or phrase from a foreign language is used as part of the native vocabulary of the domain language. It can be seen in communication between monolingual speakers of a language, who often use borrowed words or phrases without being aware of their foreign origin. Example: some Hindi words borrowed from English are botal (from bottle), kaptaan (from captain), and afsar (from officer). Similarly, English-Hindi bilingual speakers often use English words such as money, cool, and moment, which are not in the vocabulary of monolingual Hindi speakers. These are mostly examples of code-mixing. However, certain English words have at some point entered the vocabulary of the monolingual Hindi-speaking community, e.g., match dekhna (to see the match). The English word "match" in this case is an example of borrowing.
Distinguishing code-borrowing from code-mixing can help in many aspects of multilingual information retrieval and natural language processing. For example, if we can identify multilingual queries that contain only borrowed foreign words or phrases, then to process these queries we need to access only monolingual documents in the domain language. This reduces the computational cost of such queries.
In the literature, very few computational methods have been proposed to distinguish code-borrowing from code-mixing. Bali et al. (2014) is one of them: they proposed a frequency-based approach to distinguish borrowed words from mixed ones. Their method has several shortcomings; e.g., it performs well only for the most likely borrowed words, and it gives no clarity on words that fall on the borderline between borrowing and mixing. The main reasons that make this problem challenging are:
- There is no clear signal of borrowing.
- The borrowing phenomenon evolves over time, i.e., it is dynamic in nature.
Since code-borrowing is dynamic in nature, identifying words that are likely to be borrowed requires tracking recent conversations among people. A social media corpus can help us track these conversations.
The task has three parts:
- Borrowedness index: A metric that indicates the likeliness of a word to be borrowed needs to be developed from various signals of user communication obtained from social media.
- Ranking: Based on the value of the borrowedness index of a word, a rank list of words needs to be prepared.
- Ground-truth rank list preparation: Each participant shall have to collect ground-truth responses (described later) from at least 10 individuals.
Ideas for Ranking:
In this section, we present two borrowedness metrics, Unique User Ratio (UUR) and Unique Tweet Ratio (UTR). You may treat these two metrics as baselines when evaluating your own metric. To calculate UUR and UTR, we classified each tweet into one of the following categories:
- English: If almost every word (i.e. > 90%) in the tweet is tagged as En.
- Hindi: If almost every word (i.e. > 90%) in the tweet is tagged as Hi.
- CME: Code-mixed tweet in which the majority (i.e. > 50%) of words are tagged as En.
- CMH: Code-mixed tweet in which the majority (i.e. > 50%) of words are tagged as Hi.
- CMEQ: Code-mixed tweet with an equal number of words tagged as En and as Hi.
- Code-Switched: There is a trail of Hindi words followed by a trail of English words or vice versa.
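The categories above can be sketched as a small classifier over a tweet's word-level tags. This is only an illustration under stated assumptions, not the organizers' implementation: the function name `classify_tweet` is hypothetical, OTHER tags are ignored when computing percentages, and a tweet with exactly one language change point is treated as Code-Switched before the majority-based categories are checked (the text does not say how overlapping categories are resolved).

```python
from collections import Counter


def classify_tweet(tags):
    """Classify a tweet from its word-level language tags ('En', 'Hi', 'OTHER')."""
    # Assumption: only En/Hi tags count; OTHER (hashtags, URLs, mentions) is ignored.
    langs = [t for t in tags if t in ("En", "Hi")]
    if not langs:
        return "Unknown"
    counts = Counter(langs)
    en_frac = counts["En"] / len(langs)
    hi_frac = counts["Hi"] / len(langs)
    if en_frac > 0.9:
        return "English"
    if hi_frac > 0.9:
        return "Hindi"
    # Assumption: one run of one language followed by one run of the other
    # (a single change point) counts as Code-Switched.
    switches = sum(1 for a, b in zip(langs, langs[1:]) if a != b)
    if switches == 1:
        return "Code-Switched"
    if en_frac > 0.5:
        return "CME"
    if hi_frac > 0.5:
        return "CMH"
    return "CMEQ"
```

For instance, under these assumptions `classify_tweet(["Hi", "En", "Hi", "Hi"])` falls into CMH, while `classify_tweet(["Hi", "Hi", "En", "En"])` is Code-Switched.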
UUR and UTR are then defined as follows.
The Unique User Ratio of a word w is given by:
UUR(w) = (U_hi + U_cmh) / U_en
where U_hi is the number of users who have used the word w in their Hindi tweets at least once. Similarly, U_cmh and U_en are the numbers of users who have used the word w at least once in their CMH tweets and in their English tweets, respectively.
Unique Tweet Ratio of a word w is given by:
UTR(w) = (T_hi + T_cmh) / T_en
where T_hi, T_cmh, and T_en represent the number of Hindi tweets, CMH tweets and English tweets in which word w is present respectively.
A high value of UUR or UTR is taken as a signal that a word is highly likely to be borrowed. Candidate words are therefore sorted in descending order of their UUR and UTR values to obtain the respective rank lists.
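The two baseline ratios and the resulting rank lists could be computed along the following lines. This is a minimal sketch, not the organizers' code: the record layout `(user_id, tweet_id, category, words)`, the function names, and the handling of words that never occur in English tweets (infinite ratio, or 0.0 if the numerator is also zero) are all assumptions made for illustration.

```python
from collections import defaultdict


def borrowedness_ratios(records):
    """records: iterable of (user_id, tweet_id, category, words) tuples."""
    users = defaultdict(lambda: defaultdict(set))   # word -> category -> user ids
    tweets = defaultdict(lambda: defaultdict(set))  # word -> category -> tweet ids
    for user_id, tweet_id, category, words in records:
        for w in set(words):
            users[w][category].add(user_id)
            tweets[w][category].add(tweet_id)

    def ratio(by_cat):
        # (count in Hindi + count in CMH) / count in English, as in UUR and UTR.
        num = len(by_cat["Hindi"]) + len(by_cat["CMH"])
        den = len(by_cat["English"])
        if den == 0:
            # Assumption: the ratio is undefined here; treat it as infinite
            # when the word occurs in Hindi/CMH tweets, else as 0.0.
            return float("inf") if num else 0.0
        return num / den

    uur = {w: ratio(c) for w, c in users.items()}
    utr = {w: ratio(c) for w, c in tweets.items()}
    return uur, utr


def rank_words(scores):
    """Sort words in descending order of score; ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))
```

Calling `rank_words(uur)` and `rank_words(utr)` then gives the two baseline rank lists.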
The rank list obtained in the task will be compared with the ground-truth rank list (e.g., using Spearman's rank correlation coefficient). The ground-truth rank list is obtained through a human judgment experiment. In this experiment, for each candidate word, a pair of similar Hindi sentences is released: one containing the candidate word transliterated into Hindi, and the other containing the Hindi translation of the candidate word. These pairs are given to survey participants, who are asked to choose their preferred sentence. Participants also have the option of choosing neither. The ground-truth rank list is prepared using the ground-truth metric (Ecount - Hcount), where Ecount is the number of survey participants who preferred the sentence containing the Hindi transliteration and Hcount is the number who preferred the sentence containing the Hindi translation of the candidate word. An example survey form for 12 English words can be found here (LINK). The Spearman's correlation between UUR and the ground truth for these 12 words is 0.58. Similarly, the Spearman's correlation between UTR and the ground truth for these 12 words is also 0.58.
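The ground-truth metric and the rank comparison can be sketched as follows. This is an illustrative, dependency-free version under assumptions: survey responses are encoded as 'E' (transliteration preferred), 'H' (translation preferred), or None (neither), tied scores share their mean rank, and all function names are hypothetical. The official evaluation script may differ.

```python
def ground_truth_scores(responses):
    """responses: word -> list of 'E', 'H' or None; returns Ecount - Hcount."""
    return {w: votes.count("E") - votes.count("H") for w, votes in responses.items()}


def average_ranks(values):
    """1-based ranks, largest value first; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    # Undefined (division by zero) if either list is constant.
    return cov / (sx * sy)
```

To compare a metric with the survey outcome, pass the metric values and the Ecount - Hcount values for the same words, in the same order, to `spearman`.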
The following data sets will be provided for the data challenge.
Social Data Set: This data set contains a set of tweets; each tweet is language tagged (either Hi or En) at the word level. The language labels are machine-generated using the system of Gella et al. (2013), so they contain some errors, estimated at 2-5%. Also note that, although we made sure that the tweets are from Hindi monolingual and Hindi-English bilingual speakers, some of the tweets may be in languages other than Hindi and English (estimated to be less than 5%). By design, the language labeler tags these tweets as either Hindi (most likely for other Indian languages) or English. Since Twitter prohibits third-party data sharing, we cannot provide the tweets themselves. We will give you the tweet ids, and you can crawl the corresponding tweets using Twitter's API. The language tag for each word is given with respect to the starting and ending character positions in the tweet. Hashtags, URLs and address-mentions (@username) are tagged as OTHER. Participants can ignore the character chunks that are not language tagged; typically, these are punctuation marks, spaces, tabs, smilies, etc. Each entry (row) in the Social Dataset (.csv file) is structured as follows.
The first entity is the tweet id, and the subsequent comma-separated entities are in the format START: END: LANG, where START and END denote the starting and ending positions of the word in the tweet, and LANG denotes the language tag assigned to that word. For the character positions, we considered each tweet as a character array starting at index 1. You can download the data set from this link.
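A row in this format could be parsed roughly as shown below. This is a hedged sketch: the helper names are hypothetical, the parser tolerates optional spaces around the colons, and END is assumed to be inclusive, which the description does not state explicitly. The sample row in the usage note is made up for illustration and is not taken from the data set.

```python
def parse_row(line):
    """Split one dataset row into (tweet_id, [(start, end, lang), ...])."""
    fields = [f.strip() for f in line.strip().split(",")]
    tweet_id, spans = fields[0], []
    for field in fields[1:]:
        if not field:
            continue  # skip empty trailing fields
        start, end, lang = (part.strip() for part in field.split(":"))
        spans.append((int(start), int(end), lang))
    return tweet_id, spans


def tagged_words(text, spans):
    """Pair each tagged chunk of the crawled tweet text with its language tag."""
    # Positions are 1-based; assumption: END is inclusive.
    return [(text[start - 1:end], lang) for start, end, lang in spans]
```

For a hypothetical row `123456, 1:5:En, 7:11:Hi` and crawled tweet text `match dekha`, `tagged_words` pairs `match` with En and `dekha` with Hi.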
Frequent 230-word list: This link provides the list of 230 words. Participants need to provide a ranking of these 230 words according to their proposed metric. The ranking should be in descending order of borrowing likeliness.
- Ground-truth rank for 12 words: This file gives the ground-truth responses for these 12 words. You may use it to test your proposed metric. The survey form for collecting ground truth looks like this.
- Candidate test dataset: This is the set of candidate words on which each participant has to conduct a survey and prepare their own ground-truth rank list. Participants then have to compare their metric-based rank list with their ground-truth rank list. Please note that this is not the final evaluation: we will also check participants' metric-based rank lists against the cumulative survey responses. This dataset is to be released at a later date.
- Survey form: Soon, we will release the survey form for the candidate words. Participants have to conduct the survey with at least 10 people (more are encouraged), well distributed over age, education, and location. The people surveyed should be primarily monolingual Hindi speakers; however, a few English-Hindi bilinguals may be included if monolingual speakers are unavailable. The survey form shall be released at a later date.
- 1st Jan 2017: The social dataset and frequent 230 words will be released.
- 3rd Feb 2017 (revised from 31st Jan 2017): Last date for participants to submit their proposed metric definition and their ranking of the given 230 words. The description of the proposed metric and the obtained Spearman's rank correlation should be submitted as a two-page PDF document in ACM format. The exact instructions for the submission shall be updated later.
- 4th Feb 2017 (revised from 2nd Feb 2017): Release date of the test candidate words and sample survey form. (Test Candidate Words, Sample Survey Form)
- 15th Feb 2017: Last date for participants to submit their survey based outcomes.
- 5th March 2017: Winning team(s) will be notified about the award and about submitting their results in ACM format.
- Participants have to develop the models and code entirely by themselves. They can use open-source software packages with appropriate citations in their submissions.
- Participants are encouraged to make use of publicly available data, which should be cited in the submissions. No proprietary data source that does not offer free access to all can be used.
- Each group can have at most 4 members, who do not necessarily have to be from the same organization/institution. A member cannot belong to more than one submitting team.
- Submissions involving plagiarism, or means found unbecoming of the spirit of this contest, shall be disqualified.
- Participants need to submit the full rank list for all 230 words; otherwise the submission shall be disqualified.
- Every word in the rank list should appear only once, i.e., should have one unique rank.
- We shall evaluate based on our own ground-truth rank list (Yes! we have our own ground truth to cross-check, detect and filter outlier responses) plus the ground truth collected by the participants. In case of disagreement, the ground-truth rank list generated by us will be used for ranking.
- The final set of candidate words that we are going to evaluate you on might include a few unseen words (i.e., words not in the list of 230 words we have released and asked you to rank in the first phase).
Participants can submit their responses through the following link. Here are a few points regarding the submission process.
The submission form contains three sections. In the first section, each team has to provide their team name and affiliation. It is OK to have members from different organizations in a team. If there are multiple teams participating from the same organization then team names should be different.
In the second section, basic details of the team members should be provided. Please make sure to fill all the details (Name and Email Id) for every member.
The third section is the result submission section. Each team needs to have a GitHub page. We will accept all submissions of a team (during all phases of the data challenge) through their GitHub page. The GitHub page should contain at least the following three files:
Description of the Model: A PDF file of at most two pages, in ACM conference proceedings format (Link), describing the model the team has built.
Code: This is the implementation of the model the team has proposed. The code takes as its input argument a text file input.txt containing candidate words separated by newlines ("\n"); when executed, it should generate a comma-separated text file output.csv. Each row of output.csv should contain a candidate word and the calculated rank for that word, separated by a comma.
Rank List: This is the same output.csv file generated by the code described above.
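The expected input/output behaviour of the code file might look like the sketch below. It is only an illustration: `score` is a hypothetical placeholder standing in for a team's own borrowedness metric, and the function name, the de-duplication of repeated input words, and the alphabetical tie-breaking are assumptions rather than requirements stated by the organizers.

```python
import csv


def score(word):
    """Hypothetical placeholder for a team's borrowedness metric."""
    return float(len(word))


def write_rank_list(input_path="input.txt", output_path="output.csv"):
    # Read candidate words, one per line, skipping blank lines.
    with open(input_path, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    # Drop duplicates (keeping first occurrence) so each word gets one unique rank.
    words = list(dict.fromkeys(words))
    # Descending order of score; ties broken alphabetically (an assumption).
    ranked = sorted(words, key=lambda w: (-score(w), w))
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for rank, word in enumerate(ranked, start=1):
            writer.writerow([word, rank])
    return ranked
```

Replacing `score` with an actual metric keeps the file handling unchanged, and enumerating the sorted list guarantees that every word appears exactly once with a distinct rank.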
In addition, if the model needs any additional input files to execute the code, those files should be uploaded in the "additional" directory. Please make sure your code file loads these files internally, so that the code can be executed as long as the "additional" directory is present. In such cases, there should also be a Readme.txt file giving all the information needed to run the code, e.g., the compiler to be used and the dependencies (open-source packages your team has used). Alternatively, your team can include a script file that, when run, automatically installs all of the additional libraries and packages on the host system.
Participants may join the Google group: IKDD CODS 2017 Data Challenge. This is a discussion and Q&A forum created to handle participants' queries.
BEST BY SCORE:
Team name: flytxt_datasciences_iitd
Participants: Jobin Wilson, Ram Mohan, Muhammad Arif
BEST BY NOVELTY:
Team name: VITians
Participants: Rajalakshmi R, Rohan Agrawal
Organisation: Vellore Institute of Technology
Team name: Nautilus
Participants: Neha Prabhugaonkar, Sai Peketi, Kavita Ganeshan, Unni Krishnan
- Bali et al. (2014): Kalika Bali, Jatin Sharma, Monojit Choudhury, and Yogarshi Vyas. "'I am borrowing ya mixing?' An Analysis of English-Hindi Code Mixing in Facebook." EMNLP 2014 (2014): 116.
- Gella et al. (2013): Spandana Gella, Jatin Sharma, and Kalika Bali. "Query word labeling and Back Transliteration for Indian Languages: Shared task system description." In Proceedings of the Fifth Workshop on Forum for Information Retrieval (FIRE 2013). New Delhi, India.
Coordinator: Jasabanta Patro (firstname.lastname@example.org)