Since 2015, the ACM SIGKDD India Chapter (IKDD) has been organizing the “Data Science in India” event, co-located with the KDD conference, to showcase the growth and achievements of the Data Science community in India. Over the years, we have brought several renowned researchers and practitioners from academia and industry to this forum. Continuing the journey, this year we have put together a diverse and engaging program in a virtual format.
An engaging panel discussion on Generative AI for India, in which experts delve into the challenges that need to be addressed and uncover opportunities for advancing the state of the art in Gen AI.
An awe-inspiring keynote by Ronen Eldan, Principal Researcher in the Machine Learning Foundations group at Microsoft Research, centered on "The Power of Synthetic Datasets: From TinyStories to Phi-1".
Abstract:
This talk presents two recent papers that show how synthetic datasets generated by large language models can enable training smaller and more efficient models for specific tasks.
The first paper introduces TinyStories, a dataset of short stories using only very simple words, generated by GPT-3.5/4. TinyStories attempts to preserve the essential elements of natural language, such as grammar, vocabulary, facts, and reasoning, while being more compact and focused than typical corpora. While language models as big as 1B parameters often struggle to produce coherent text beyond one or two sentences, we show that TinyStories can be used to train language models that are much smaller than the state-of-the-art models (below 10 million parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate certain reasoning capabilities.
However, most attempts to create synthetic data using LLMs produce datasets that are highly repetitive and lack the diversity a model trained on them would need to exhibit any ability beyond memorizing those repeated patterns. The generation of TinyStories relies on the (new) idea of attaining this diversity by injecting randomness into the prompt.
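As a rough illustration of what injecting randomness into the prompt can look like, the sketch below draws random required words and story features for each generation request. The word lists, the feature list, and the `build_prompt` helper are all hypothetical; the abstract does not specify the actual prompt template or vocabulary used for TinyStories.

```python
import random

# Hypothetical vocabulary and story features, for illustration only.
# The lists actually used to build TinyStories are not given in the abstract.
NOUNS = ["dog", "ball", "tree", "boat"]
VERBS = ["jump", "find", "share", "hide"]
ADJECTIVES = ["happy", "tiny", "brave", "sleepy"]
FEATURES = ["contains dialogue", "has a plot twist", "has a moral", "ends badly"]


def build_prompt(rng: random.Random) -> str:
    """Build one story-generation prompt with randomly chosen constraints."""
    noun = rng.choice(NOUNS)
    verb = rng.choice(VERBS)
    adjective = rng.choice(ADJECTIVES)
    # Pick two distinct features so each prompt combines several constraints.
    features = rng.sample(FEATURES, k=2)
    return (
        "Write a short story using only very simple words. "
        f'The story must use the noun "{noun}", the verb "{verb}", '
        f'and the adjective "{adjective}". '
        f"The story {features[0]} and {features[1]}."
    )


# A seeded generator keeps the sampled prompts reproducible.
rng = random.Random(0)
prompts = [build_prompt(rng) for _ in range(5)]
for prompt in prompts:
    print(prompt)
```

Each prompt would then be sent to the generating model. Even with these tiny lists there are several hundred distinct constraint combinations, which is the point of the approach: the variety comes from the sampled constraints rather than from the model's own sampling alone.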
A second paper, based on the same paradigm, presents Phi-1, a new large language model for code, trained using a combination of "textbook quality" data from the web and a dataset of synthetically generated textbooks and exercises. Despite having only 1.3B parameters, it achieves a pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP, surpassing models more than 10 times its size. We discuss the implications of these results for the development, analysis and research of language models, especially for low-resource or specialized domains, and the potential of synthetic datasets to improve the performance and efficiency of LLMs.
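For readers unfamiliar with the pass@1 metric quoted above: pass@k is the probability that at least one of k generated solutions to a problem passes that problem's unit tests, averaged over all problems in the benchmark. The sketch below uses a commonly used unbiased estimator for this quantity; the abstract does not say how the reported numbers were computed, and the function names and sample counts here are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems, 10 samples each (not real benchmark data).
example_results = [(10, 5), (10, 10), (10, 0)]
print(mean_pass_at_k(example_results, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of generated samples that pass, so a pass@1 score can be read as the average chance that a single generated solution is correct.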
A video kaleidoscope of a representative sample of relevant research papers from Indian institutions that appeared in recent editions of top AI conferences, including AAAI, CVPR, ECCV, ACL, NeurIPS, WSDM, and ICLR.
Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages
Interactive concept bottleneck models
Clustering What Matters: Optimal Approximation for Clustering with Outliers
Novel Class Discovery without Forgetting
Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields
Speech Taskonomy: Which Speech Tasks are the most Predictive of fMRI Brain Activity?
BLADE: Biased Neighborhood Sampling based Graph Neural Network for Directed Graphs
Enhancing the Inductive Biases of Graph Neural ODE for Modeling Dynamical Systems
Few-Shot Referring Relationships in Videos
Neural Estimation of Submodular Functions with Applications to Differentiable Subset Selection
Multi-Row, Multi-Span Distant Supervision For Table+Text Question Answering
An opportunity for researchers with similar interests to connect with each other in virtual breakout rooms and discuss mutual interests.
Topics:
TIME (IST) | TITLE |
---|---|
09:15 AM - 09:30 AM | Welcome Address |
09:30 AM - 10:30 AM | Panel Discussion on “Generative AI for India: Possibilities, Pitfalls, and Ethical Dilemmas” |
10:30 AM - 11:30 AM | Keynote By Ronen Eldan |
11:30 AM - 12:30 PM | A Kaleidoscope of Papers from India in Top AI Conferences |
12:30 PM - 01:15 PM | Roundtable Networking Sessions |