Data Science In India
KDD 2016

August 15th, 2016

KDD 2016 Networking Session

The KDD community in India is rapidly growing, and the purpose of this event is to showcase the work happening in both Indian academia and industry. The event is organized by the ACM SIGKDD India chapter. The program consists of invited talks by leading researchers in the field, as well as talks by early-career researchers on their experience in India. This will be followed by a panel discussion on the startup ecosystem in India. This is the second event in the series, following a successful session at KDD 2015 in Sydney.

Target Audience: Researchers interested in the state of KDD research in India – whether with a view to collaborating, starting up, or returning to take up positions in academia or research labs.


Organizers:

Balaraman Ravindran (IIT Madras)
Gautam Shroff (TCS Research)
Manish Gupta (Xerox Research, India)

Program Schedule (KDD 2016)


1:45 to 2:00  Welcome: Balaraman Ravindran, IIT Madras
2:00 to 2:45  Plenary Talk 1: Manik Varma, IIT Delhi/Microsoft Research
2:45 to 3:10  Experience Talk 1: Sayan Ranu, IIT Madras [Academia]
3:10 to 3:35  Experience Talk 2: Animesh Mukherjee, IIT Kharagpur [Academia]
3:35 to 4:15  Break and networking
4:15 to 4:40  Experience Talk 3: Vikas Raykar, IBM IRL [Industry]
4:40 to 5:25  Plenary Talk 2: Harrick Vin, Tata Consultancy Services
5:25 to 6:10  Panel Discussion
              Topic: Data science startup ecosystem in India
              Panelists: Anand Rajaraman (Co-founder, Cambrian Ventures), Abinash Tripathi (Co-founder, Helpshift), Parul Gupta (Co-founder, Springboard), Gursimran Singh (Aspiring Minds)
              Moderator: Manish Gupta (Xerox Research)
6:10 to 6:15  Concluding Remarks: Gautam Shroff, TCS Research

Invited Talks

Animesh Mukherjee
Language of social media: from hashtags to question topics.

In this talk I shall summarize our three-year-long initiative studying the popularity dynamics of various human-language-like entities on social media. Some of the topics I plan to cover are (a) how hashtags on Twitter form compounds like natural language words (e.g., milk+man=milkman) that become far more popular than the individual constituent hashtags [CSCW 2016, Honorable Mention], (b) how conversational hashtags (aka idioms) like #4thingsbeforebreakup, #threethingsIlike, etc. spread and become popular on Twitter [ICWSM 2015], and (c) how question topics in community QA forums like Quora become popular over time [ICWSM 2015]. Finally, I shall outline some of the factors that helped me foster an early research career while staying in India.

Manik Varma
Extreme Multi-label Loss Functions for Ranking, Recommendation, Tagging & Other Missing Label Applications

The choice of the loss function is critical in extreme multi-label learning where the objective is to annotate each data point with the most relevant subset of labels from an extremely large label set. Unfortunately, existing loss functions, such as the Hamming loss, are unsuitable for learning, model selection, hyperparameter tuning and performance evaluation. We address the issue by developing propensity scored losses which: (a) prioritize predicting the few relevant labels over the large number of irrelevant ones; (b) do not erroneously treat missing labels as irrelevant but instead provide unbiased estimates of the true loss function even when ground truth labels go missing under arbitrary probabilistic label noise models; and (c) promote the accurate prediction of infrequently occurring, hard to predict, but rewarding tail labels. We also develop algorithms which efficiently scale to large datasets with up to 9 million labels, 70 million points and 2 million dimensions and which give significant improvements over the state-of-the-art.

Our results also apply to ranking, recommendation and tagging which are the motivating applications for extreme multi-label learning. They generalize previous attempts at deriving unbiased losses under the restrictive assumption that labels go missing uniformly at random from the ground truth. Furthermore, they provide a sound theoretical justification for popular label weighting heuristics used to recommend rare items. Finally, they demonstrate that the proposed contributions align with real world applications by achieving superior clickthrough rates on sponsored search advertising in Bing.
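The core unbiasedness trick behind propensity-scored losses can be illustrated with a small sketch (the scores, labels, and propensity values below are hypothetical, and this is not the talk's actual algorithm): each observed relevant label in the predicted top-k is up-weighted by the inverse of its propensity, the probability that a truly relevant label was actually recorded in the ground truth.

```python
import numpy as np

def psp_at_k(scores, observed, propensities, k):
    """Propensity-scored precision@k: each observed relevant label l in the
    predicted top-k contributes 1/p_l, where p_l is the probability that a
    relevant label was observed rather than missing. In expectation this
    compensates for missing labels instead of treating them as irrelevant."""
    top_k = np.argsort(-scores)[:k]
    return sum(observed[l] / propensities[l] for l in top_k) / k

# Toy example with 4 labels: label 3 is a rare "tail" label whose low
# propensity (it is often missing from the ground truth) earns a large weight.
scores = np.array([0.9, 0.8, 0.1, 0.7])        # model's label scores
observed = np.array([1, 0, 0, 1])              # incomplete ground truth
propensities = np.array([0.9, 0.5, 0.5, 0.25]) # assumed observation probabilities
print(psp_at_k(scores, observed, propensities, k=3))
```

Note that the raw propensity-scored score can exceed 1; in evaluation it is typically normalized by the best achievable value on the same data.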

Manik Varma is a researcher at Microsoft Research India and an adjunct professor of computer science at IIT Delhi. His research interests span machine learning, computational advertising and computer vision. He has served as an area chair for CVPR, ICCV, ICML, ICVGIP, IJCAI and NIPS. Classifiers that he has developed are running live on millions of devices around the world protecting them from viruses and malware. Manik has been awarded the Microsoft Gold Star award, won the PASCAL VOC Object Detection Challenge and stood first in chicken chess tournaments and Pepsi drinking competitions. He is a failed physicist (BSc St. Stephen's College, David Raja Ram Prize), theoretician (BA Oxford, Rhodes Scholar), engineer (DPhil Oxford, University Scholar) and mathematician (MSRI Berkeley, Post-doctoral Fellow).

Sayan Ranu
A scalable approach to mining temporally anomalous sub-trajectories.

With the abundance of GPS-enabled devices, there has been an explosion in the availability of trajectory data. In this talk, we will discuss the problem of mining temporally anomalous sub-trajectory patterns and their applications. Given the prevailing road conditions, a sub-trajectory is temporally anomalous if its travel time deviates significantly from the expected time. Mining these patterns requires us to delve into the sub-trajectory space, which is not scalable for real-time analytics. To overcome this scalability challenge, we have designed a technique called MANTRA. MANTRA exploits the properties of anomalous sub-trajectories to iteratively refine the search space into a disjoint set of sub-trajectory islands. The expensive enumeration of all possible sub-trajectories is performed only on the islands to compute the answer set of maximal anomalous sub-trajectories. Extensive experiments on both real and synthetic datasets show that MANTRA is more than three orders of magnitude faster than baseline techniques. Moreover, through trajectory classification and segmentation, we will demonstrate that the proposed model conforms to human intuition.
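The scalability challenge comes from the quadratic number of contiguous sub-trajectories. A naive baseline makes this concrete (illustrative only: this is not MANTRA, and the simple ratio test below is a stand-in for the talk's notion of significant deviation):

```python
def anomalous_subtrajectories(observed_times, expected_times, theta=2.0):
    """Naive O(n^2) enumeration over a trajectory of n segments: flag every
    contiguous sub-trajectory (segment span i..j) whose total observed travel
    time is at least theta times the expected time under prevailing conditions."""
    n = len(observed_times)
    anomalies = []
    for i in range(n):
        obs = exp = 0.0
        for j in range(i, n):
            obs += observed_times[j]
            exp += expected_times[j]
            if obs >= theta * exp:
                anomalies.append((i, j))
    return anomalies

# A 50-minute delay on the second segment, where 10 minutes were expected:
print(anomalous_subtrajectories([10, 50, 10], [10, 10, 10]))
# → [(0, 1), (0, 2), (1, 1), (1, 2)]
```

Every sub-trajectory overlapping the delayed segment is flagged; MANTRA's contribution is to avoid this full enumeration by pruning the search space to a small set of islands before enumerating.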

Vikas Raykar
Evolving predictive models.

A conventional textbook prescription for building good predictive models is to split the data into three parts: a training set (for model fitting), a validation set (for model selection), and a test set (for final model assessment). In practice, however, the search for the best predictive model is often an iterative and continuous process. Predictive models can evolve over time as developers improve their performance, either by acquiring new data or by improving the existing model. In this talk we will discuss some of the challenges in such scenarios and propose a few solutions to safeguard against the bias introduced by the repeated use of the existing validation or test set. (Joint work with Amrita Saha, IBM Research.)
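The bias from reusing one validation set for repeated model selection is easy to simulate (a toy illustration of the problem, not of the talk's proposed safeguards): when the labels are pure coin flips, no model can genuinely beat chance, yet picking the best of many random models on a fixed validation set makes the winner look well above 50% there, while a fresh test set reveals chance-level performance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models = 200, 200, 500

# Labels are pure coin flips, so true accuracy of any model is 50%.
y_val = rng.integers(0, 2, n_val)
y_test = rng.integers(0, 2, n_test)

# Each "model" predicts at random; selection reuses the same validation set.
preds_val = rng.integers(0, 2, (n_models, n_val))
preds_test = rng.integers(0, 2, (n_models, n_test))

val_acc = (preds_val == y_val).mean(axis=1)
best = val_acc.argmax()

# The selected model overfits the validation set but not the held-out test set.
print("best validation accuracy:", val_acc[best])
print("same model on test set:  ", (preds_test[best] == y_test).mean())
```

The gap between the two numbers is exactly the selection bias the talk addresses: the more candidate models are compared on the same validation set, the more optimistic the winner's validation score becomes.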

Vikas C. Raykar works as a researcher in the Cognitive Solutions and Services group at IBM Research, Bangalore. Earlier he worked as a senior research scientist at Siemens Healthcare, USA for 5 years. He finished his doctoral studies in the computer science department at the University of Maryland, College Park.