Since 2015, the ACM SIGKDD India Chapter (IKDD) has been organizing the “Data Science in India” event, co-located with the KDD conference, to showcase the growth and achievements of the Data Science community in India. Over the years we have been able to bring to this forum several renowned researchers and practitioners from academia and industry. Continuing the journey, this year we have put together a diverse and engaging program in a virtual format.

Program Highlights

Panel Discussion on Generative AI for India: Possibilities, Pitfalls and Dilemmas

Moderator - Debdoot Mukherjee (Meesho)

09:30 AM - 10:30 AM (IST)

An engaging panel discussion on Generative AI for India, in which experts delve into the challenges that need to be addressed and uncover opportunities for advancing the state of the art in Gen AI.

Pratyush Kumar
AI4Bharat, Microsoft Research
Preethi Jyothi
IIT Bombay

Keynote By Ronen Eldan

10:30 AM - 11:30 AM (IST)

An awe-inspiring keynote by Ronen Eldan, Principal Researcher in the Machine Learning Foundations group at Microsoft Research, centered on "The Power of Synthetic Datasets: From TinyStories to Phi-1".

Ronen Eldan
Microsoft Research

This talk presents two recent papers that show how synthetic datasets generated by large language models can enable training smaller and more efficient models for specific tasks.

The first paper introduces TinyStories, a dataset of short stories using only very simple words, generated by GPT-3.5/4. TinyStories attempts to preserve the essential elements of natural language, such as grammar, vocabulary, facts, and reasoning, while being more compact and focused than typical corpora. While language models as big as 1B parameters often struggle to produce coherent text beyond one or two sentences, we show that TinyStories can be used to train language models that are much smaller than the state-of-the-art models (below 10 million parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate certain reasoning capabilities.

However, most attempts to create synthetic data using LLMs produce datasets that are very repetitive and lack the diversity a model trained on them would need in order to exhibit any ability beyond memorizing these repeating patterns. The generation of TinyStories relies on the (new) idea of attaining this diversity by injecting randomness into the prompt.
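As a rough illustration of injecting randomness into the prompt, the Python sketch below builds a different story prompt on each call by sampling required words and optional story features. It is only a sketch under stated assumptions: the word lists, the feature list, the prompt wording and the function name `build_prompt` are hypothetical and are not taken from the talk.

```python
import random

# Hypothetical word and feature lists, for illustration only.
NOUNS = ["dog", "ball", "tree", "boat", "cake"]
VERBS = ["jump", "find", "share", "hide", "sing"]
ADJECTIVES = ["happy", "tiny", "brave", "red", "sleepy"]
FEATURES = ["a dialogue", "a moral", "a plot twist", "a sad ending"]


def build_prompt(rng: random.Random) -> str:
    """Return a story prompt whose constraints are chosen at random."""
    noun = rng.choice(NOUNS)
    verb = rng.choice(VERBS)
    adjective = rng.choice(ADJECTIVES)
    # Pick zero, one or two extra story features without repetition.
    features = rng.sample(FEATURES, k=rng.randint(0, 2))

    prompt = (
        "Write a short story using only very simple words. "
        f'The story must use the noun "{noun}", the verb "{verb}" '
        f'and the adjective "{adjective}".'
    )
    if features:
        prompt += " The story should contain " + " and ".join(features) + "."
    return prompt


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    for _ in range(3):
        print(build_prompt(rng))
```

Each generated prompt would then be sent to the language model that writes the stories. Because the constraints differ from prompt to prompt, the resulting stories are less likely to collapse into a few repeated patterns.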

A second paper, based on the same paradigm, presents Phi-1, a new large language model for code, trained using a combination of "textbook quality" data from the web and a dataset of synthetically generated textbooks and exercises. Despite having only 1.3B parameters, it achieves a pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP, surpassing models more than 10 times its size. We discuss the implications of these results for the development, analysis and research of language models, especially for low-resource or specialized domains, and the potential of synthetic datasets to improve the performance and efficiency of LLMs.

Kaleidoscope of Papers from India in Top AI Conferences

11:30 AM - 12:30 PM (IST)

A video kaleidoscope of a representative sample of research papers from Indian institutions that appeared in recent editions of top AI conferences, including AAAI, CVPR, ECCV, ACL, NeurIPS, WSDM and ICLR.

  • ACL 2023

    Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

    Sumanth Doddapaneni (IIT Madras, AI4Bharat), Rahul Aralikatte (MILA - Quebec AI Institute, McGill University), Gowtham Ramesh (AI4Bharat), Shreya Goyal (AI4Bharat), Mitesh M. Khapra (IIT Madras, AI4Bharat), Anoop Kunchukuttan (Microsoft, AI4Bharat, IIT Madras), Pratyush Kumar (Microsoft, AI4Bharat, IIT Madras)
  • AAAI 2023

    Interactive Concept Bottleneck Models

    Kushal Chauhan (Google Research India), Rishabh Tiwari (Google Research India), Jan Freyberg (Google Research), Pradeep Shenoy (Google Research India), DJ Dvijotham (Google Research)
  • AAAI 2023

    Clustering What Matters: Optimal Approximation for Clustering with Outliers

    Akanksha Agrawal (Indian Institute of Technology Madras), Tanmay Inamdar (University of Bergen), Saket Saurabh (Institute of Mathematical Sciences), Jie Xue (NYU Shanghai)
  • ECCV 2022

    Novel Class Discovery without Forgetting

    Joseph K J (IIT Hyderabad), Sujoy Paul (Google Research), Soma Biswas (Indian Institute of Science), Piyush Rai (IIT Kanpur), Kai Han (The University of Hong Kong), Vineeth N Balasubramanian (IIT Hyderabad)
  • CVPR 2023

    Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields

    Rohith Agaram (IIIT Hyderabad), Shaurya Dewan (IIIT Hyderabad), Rahul Sajnani (Brown University), Adrien Poulenard (Stanford University), Madhava Krishna (IIIT Hyderabad), Srinath Sridhar (Brown University)
  • Interspeech 2023

    Speech Taskonomy: Which Speech Tasks are the most Predictive of fMRI Brain Activity?

    Subba Reddy Oota (Inria Bordeaux, France), Veeral Agarwal (IIIT Hyderabad), Mounika Marreddy (IIIT Hyderabad), Manish Gupta (Microsoft, India), Bapi S. Raju (IIIT Hyderabad)
  • WSDM 2023

    BLADE: Biased Neighborhood Sampling based Graph Neural Network for Directed Graphs

    Srinivas Virinchi (Amazon), Anoop Saladi (Amazon)
  • ICLR 2023

    Enhancing the Inductive Biases of Graph Neural ODE for Modeling Dynamical Systems

    Suresh Bishnoi (IIT Delhi), Ravinder Bhattoo (IIT Delhi), Jayadeva (IIT Delhi), Sayan Ranu (IIT Delhi), N.M. Anoop Krishnan (IIT Delhi)
  • CVPR 2023

    Few-Shot Referring Relationships in Videos

    Yogesh Kumar (IIT Jodhpur), Anand Mishra (IIT Jodhpur)
  • NeurIPS 2022

    Neural Estimation of Submodular Functions with Applications to Differentiable Subset Selection

    Abir De (IIT Bombay), Soumen Chakrabarti (IIT Bombay)
  • ACL 2023

    Multi-Row, Multi-Span Distant Supervision For Table+Text Question Answering

    Vishwajeet Kumar (IBM Research), Saneem Chemmengath (IBM Research), Yash Gupta (IIT Bombay), Jaydeep Sen (IBM Research), Samarth Bharadwaj (IBM Research), Feifei Pan (IBM Research), Soumen Chakrabarti (IIT Bombay)

Networking Roundtables

12:30 PM - 01:15 PM (IST)

An opportunity for researchers with similar interests to connect with each other in virtual breakout rooms and discuss mutual interests.



09:15 AM - 09:30 AM: Welcome Address
09:30 AM - 10:30 AM: Panel Discussion on “Generative AI for India: Possibilities, Pitfalls, and Ethical Dilemmas”
10:30 AM - 11:30 AM: Keynote By Ronen Eldan
11:30 AM - 12:30 PM: A Kaleidoscope of Papers from India in Top AI Conferences
12:30 PM - 01:15 PM: Roundtable Networking Sessions


Registration for this event is closed