# Data Challenge UNDERSTANDING THE LABOUR MARKETS

India produces 1.5 million engineers every year. A relevant question is what determines the salary and the jobs these engineers are offered right after graduation. Various factors determine this, such as college grades, candidate skills, the proximity of the college to industrial hubs, the candidate's specialization, and market conditions for specific industries.

In this data challenge, we begin with the profiles of several students with varied backgrounds and use the data to get insights and answers to several questions:
• [Predictive Modeling]   Given a new student profile, can we predict his/her annual salary from historic data?
• [Recommendation]   Can we identify the key parameters in a student's profile which, if changed, would help him/her earn a better salary?
• [Data Insights]   Can we understand what factors in the labor market determine one’s salary? Is it just one’s skills, or are there other factors which influence the return in the labor market? What signals and biases enter the labor market? Can we make interpretable models or visualize features to understand what determines salary – for instance, do kids from smaller towns get lower salaries? This can help us understand inefficiencies in the labor market, which will be extremely useful for policy making and constructing interventions.
• [Visualization]   Finally, can we visualize where people get jobs and what jobs they get, to gain a quick and deeper understanding? With all your creativity, show us visuals that tell us something new that we did not know.
The answers to these interesting questions lie in data hitherto unexplored! Here’s a chance to get to the bottom of labour market dynamics through a systematic study of employment data.

### Final Accepted Entries

The following submissions have been selected for final presentation from the 30 submissions to the data challenge.
• Shrihari Vasudevan, Ritwik Chaudhuri, Madhavan Pallan and Sudhanshu Shekhar Singh.
Gaussian process modeling for understanding labour markets
• Kaushal Paneri, Karamjit Singh, Aditeya Pandey and Geetika Sharma.
Bayesian Visual Analysis of the Indian Labour Market
• Saurabh Banerjee, Yaasna Dua, Kanishk Agarwal and Paviya Chemparathy.
Understanding Labor Markets in India
• Swarnendu Chakraborty and Debjyoti Paul.
Comparative Study of Student Profile Using Linear Regression, Support Vector Regression and Ensemble Techniques
• Shabana K M, Tony Gracious and Hrishikesh Subramonian.
Understanding the Indian Labour Market: A Data Centric Approach

### Data: Aspiring Minds' Employment Outcomes 2015

#### (AMEO 2015)

The dataset contains various information about a set of engineering candidates and their employment outcomes. For every candidate, the data contains both the profile information along with their employment outcome information. Candidate Profile Information includes:
• Scores on Aspiring Minds’ AMCAT – a standardized test of job skills. The test includes cognitive, domain and personality assessments
• Personal information like gender, date of birth, etc.
• Pre-university information like high school grades, high school location
• University information like GPA, college major, college reputation proxy.
• Demographic information like location of college, candidates’ permanent location
Employment Outcome Information includes:
• First job annual salary
• First job title
• First job location
This is the only data set where we have employment outcomes together with scores on a standardized job test, which makes it unique. Other such data sets either do not include test scores at all or include only scores on pre-university tests.
(Cite the dataset as – Aspiring Minds (2015), Aspiring Minds Employment Outcomes 2015.)

#### Data Collection

A million undergraduates take AMCAT every year as a way to get job credentials and feedback to improve themselves. Candidates are tested on the following skills –
• English Language, Logical Ability and Quantitative Ability – these are IRT based adaptive modules. More information here on what IRT based adaptive tests are.
• Skill tests – Chosen by candidates based on their interest
These assessments are validated against on-job performance and show a validity between 0.3 and 0.55 (Learn more about test validity here - http://www.centerforpubliceducation.org/Main-Menu/Evaluating-performance/A-guide-to-standardized-testing-The-nature-of-assessment). These scores are used by 2000+ companies.
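To give a rough feel for what an IRT-based adaptive module does, here is a minimal sketch. The source does not say which IRT model or item-selection rule AMCAT uses; the two-parameter logistic (2PL) model and the "most informative item" rule below are common textbook choices picked as assumptions, and the item bank and function names are hypothetical.

```python
import math


def p_correct(ability, discrimination, difficulty):
    """2PL probability that a candidate with the given ability answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def item_information(ability, discrimination, difficulty):
    """Fisher information of a 2PL item at the given ability level."""
    p = p_correct(ability, discrimination, difficulty)
    return discrimination ** 2 * p * (1.0 - p)


def pick_next_item(ability_estimate, items):
    """Choose the item that is most informative at the current ability estimate."""
    return max(
        items,
        key=lambda item: item_information(ability_estimate, item["a"], item["b"]),
    )


# Hypothetical item bank: "a" is discrimination, "b" is difficulty.
item_bank = [
    {"name": "easy", "a": 1.0, "b": -1.5},
    {"name": "medium", "a": 1.0, "b": 0.0},
    {"name": "hard", "a": 1.0, "b": 1.5},
]

print(pick_next_item(0.2, item_bank)["name"])
print(pick_next_item(1.4, item_bank)["name"])
```

With equal discrimination across items, the rule simply picks the item whose difficulty is closest to the current ability estimate, which is the intuition behind adaptive testing: each candidate sees questions matched to his/her level.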

Randomly selected AMCAT takers were surveyed via email, and they provided information on the dependent variables in this dataset – the jobs they are in and their corresponding annual salaries. The corresponding independent information about the candidates was recorded when they took AMCAT.

• train.xlsx: An Excel file containing the training data
• test.xlsx: An Excel file containing the validation data
• data_description.xlsx: A data description document, which lists out the independent and dependent variables.
• results.xlsx: The file in which you are expected to enter the predictions on the unseen data. The order of the elements should _exactly correspond_ with those present in test.xlsx.
• In total, the dataset has 32 independent variables and 5 dependent variables.
• Participants are expected to treat this as raw data and perform any necessary cleaning/validation steps as required.
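Because the rows of results.xlsx must follow the order of test.xlsx exactly, a quick sanity check before uploading can catch reordered or missing rows. The sketch below is one possible way to do this; the helper name `check_alignment` is hypothetical, and it assumes each row is identified by the `ID` column used in the sample program.

```python
def check_alignment(test_ids, result_ids):
    """Return a list of problems found when comparing result IDs to test IDs.

    An empty list means the predictions are in the same order as the test rows.
    """
    problems = []
    if len(test_ids) != len(result_ids):
        problems.append(
            f"row count differs: {len(test_ids)} test rows "
            f"vs {len(result_ids)} result rows"
        )
    # Compare position by position; zip stops at the shorter sequence.
    for position, (expected, actual) in enumerate(zip(test_ids, result_ids)):
        if expected != actual:
            problems.append(
                f"row {position}: expected ID {expected}, found {actual}"
            )
    return problems


# Toy IDs for illustration only. Real IDs would come from the two files,
# for example test["ID"].tolist() after loading them with pandas.
print(check_alignment([101, 102, 103], [101, 102, 103]))
print(check_alignment([101, 102, 103], [101, 103, 102]))
```

Comparing position by position, rather than comparing sorted sets of IDs, is deliberate: a file with the right IDs in the wrong order would still be scored against the wrong rows.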

#### Disclaimer on the Released Data

Aspiring Minds provides no warranty, expressed or implied, regarding the accuracy, adequacy, completeness, legality, reliability or usefulness of the released data. This disclaimer applies to both isolated and aggregate uses of the data. The data is provided on an "as is" basis. All warranties of any kind, express or implied, including but not limited to the implied warranties of national distribution of AMCAT scores, national distribution of student demographics, national distribution of employment parameters, state-wise distribution of AMCAT scores, state-wise distribution of student demographics, state-wise distribution of employment parameters are disclaimed. Aspiring Minds has released this data for limited public use as preliminary data to be used only with appropriate caution.

As mentioned earlier, the task is a mixture of predictive modelling, recommendation, providing insights from the data and furnishing visualization on the insights.
TASK A [Predictive Modelling]: Design an annual-salary predictor based on the independent variables. You may use any machine learning technique.
TASK B [Recommendation and Data Insights]: Provide us insights on what factors determine salaries in the labour market. You may do this through one or more of the following techniques: feature analysis, interpretable models, causal analysis and other approaches. Provide your commentary/interpretation of why particular factors may be influencing salary outcomes. You may use the same model for both tasks, Task A and Task B.
TASK C [Visualization]: Come up with interesting visualizations and inferences on job titles and city of job (the two other dependent variables). Feel free to use any sort of plots, graphics and visualizations to provide more insight into patterns in job titles and job location and how the independent variables influence the labour market.
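As one illustration of the "feature analysis" route mentioned for Task B, the sketch below ranks features by the size of their standardized linear regression coefficients. It is only a starting point under stated assumptions: the data is synthetic, the feature names (`quant_score`, `college_gpa`, `noise_feature`) are hypothetical and not taken from the dataset, and standardized coefficients describe association, not causation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for three candidate attributes (hypothetical names).
features = {
    "quant_score": rng.normal(500, 100, n),
    "college_gpa": rng.normal(70, 8, n),
    "noise_feature": rng.normal(0, 1, n),
}
# Synthetic salary that depends on the first two attributes only.
salary = (
    300 * features["quant_score"]
    + 1000 * features["college_gpa"]
    + rng.normal(0, 20000, n)
)


def standardize(values):
    """Rescale to zero mean and unit variance so coefficients are comparable."""
    return (values - values.mean()) / values.std()


names = list(features)
X = np.column_stack([standardize(features[name]) for name in names])
y = standardize(salary)

# Ordinary least squares on standardized data. No intercept is needed
# because every column of X, and y itself, has zero mean.
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rank features by the absolute size of their standardized coefficient.
ranking = sorted(zip(names, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```

Standardizing first matters because the raw features live on very different scales; without it, a coefficient's size would mostly reflect the units of its feature rather than its relationship with salary.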

#### Submission Format

Each participating team is expected to submit a write-up covering the models explored, the inferences drawn along with sufficient justification, and the various visualizations explored. The submission should be formatted in ACM proceedings format, using one of the templates available at: http://www.acm.org/sigs/pubs/proceed/template.html. Templates are available in Word and LaTeX (version 2e). For the LaTeX formats, you may use either the standard style or the SIG-alternate style. Along with the write-up, submit the results.xlsx for every model explored for Tasks A and B on the portal.

Note - The submissions made on this leaderboard will not be considered for the final evaluation. The final models have to be submitted on EasyChair for final evaluations.

The following EasyChair portal has been set up to accept submissions. In order to submit, the team members need to have an account on EasyChair. Please upload your results.xlsx and write-up (in PDF) via this portal.

#### Evaluation

The panel is going to evaluate each submission based on the following criteria.

For TASK A: The models must minimize the Mean Squared Error (MSE). The MSE is defined as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_{pred}^{(i)} - Y^{(i)} \right)^2$$

where $n$ is the number of examples in the test set, $Y_{pred}^{(i)}$ is the predicted output on the $i$-th test sample, and $Y^{(i)}$ is the actual value of the dependent variable in that test sample.
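The MSE described here, the average of the squared differences between predicted and actual values over the test set, can be computed by hand as in the following minimal sketch. The function name `mse` and the toy numbers are illustrative only; in practice a library routine such as scikit-learn's `mean_squared_error` gives the same quantity.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    squared_errors = [(pred - true) ** 2 for true, pred in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy values for illustration: the errors are 1, 0 and 2,
# so the squared errors are 1, 0 and 4.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
print(mse(actual, predicted))
```

Because the errors are squared, one large miss on a high salary can outweigh many small misses, which is worth keeping in mind when cleaning outliers.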

For TASK B and TASK C, we will evaluate it based on the nature of insights provided through the data analysis the participant performs (quantity, quality and surprise factor) along with the novelty of the data processing, mining and learning techniques used.

50% of the overall evaluation emphasis will be on Task A, considering a combination of the overall performance (MSE) and the novelty of the methods used.

### Important Dates

Interested participants are encouraged to get started immediately. The following are the important dates.
• 15th January, 2016: Each team has to submit a plain-text abstract of length less than 500 words describing their approach and key insights from their exploration.
• 31st January, 2016: Each team has to submit their findings and exploration along with the description of their algorithms in the ACM format.
• 15th February, 2016: Finalists are notified
• 1st March, 2016: Camera Ready Version of the paper due
• 16th March, 2016: CODS Research Challenge presentation and demonstration session followed by announcement of winners.

### Rules

• Participants have to develop the models and code entirely by themselves. They can use open-source software packages with appropriate citations in their submissions.
• Participants are encouraged to make use of publicly available data, which should be cited in the submissions. No proprietary data source that does not offer free access to all may be used.
• Each group cannot have more than 4 members, who do not necessarily have to be from the same organization/institution. A member cannot belong to more than one submitting team.
• Submissions involved in plagiarism or means found unbecoming of the spirit of this contest shall be disqualified.

### Award

• Shortlisted teams based on abstract and paper submission will be eligible to present and demonstrate their application to CODS Conference participants.
• The winner and the first two runners-up (chosen by an expert panel and conference attendees) will get cash awards of INR 30,000, 20,000 and 15,000 respectively.
• The top-3 teams will be invited to write a 4-page paper to be included in ACM CODS proceedings.

### Sample Program and Visualization/Insight

Here’s a sample Python program which accesses the train data, builds a simple linear regression model on it and tests its performance on the unseen data. Alternatively, one could use Matlab/GNU Octave/R to get similar results.
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load the released data (assumes both files are in the working directory)
train = pd.read_excel('train.xlsx')
test = pd.read_excel('test.xlsx')

# Remove ID and Salary, then keep only the numeric columns
X_train = train.drop(['ID', 'Salary'], axis=1).select_dtypes(include='number')
y_train = train.Salary

print(y_train.shape)
print(X_train.shape)

# Fit a simple linear regression model
clf = LinearRegression()
clf = clf.fit(X_train, y_train)

# Prepare the test data the same way, using the same columns as in training
X_test = test.drop(['ID', 'Salary'], axis=1).select_dtypes(include='number')
X_test = X_test[X_train.columns]
y_test = test.Salary

# Evaluate on the unseen data
r_sqr = clf.score(X_test, y_test)
y_pred = clf.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(mse)

# Write the predictions in the same row order as test.xlsx
pd.DataFrame({'ID': test.ID, 'Salary': y_pred}).to_excel('results.xlsx', index=False)
```

Here’s an interesting visualization which is in line with a submission for Task B – the scatter plot between English scores (on the x-axis) and annual salaries (on the y-axis) shows how high-paying jobs in the market attract candidates with higher English skills. If found to be statistically valid, this inference may imply that English plays a very important role in determining whether a candidate makes it through to high-paying jobs. But could it also mean that those who have high English skills also have high domain skills, resulting in this trend? Let us know!
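One way to probe the question raised here, whether the English-salary trend might really be driven by domain skills, is to compare the raw correlation with a partial correlation that controls for a domain score. The sketch below uses purely synthetic data built so that domain skill drives salary and English is merely correlated with domain skill; the variable names are hypothetical, and nothing here is a result from the AMEO data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Synthetic data: domain skill drives salary, while English is only
# correlated with domain skill and has no direct effect on salary.
domain = rng.normal(0, 1, n)
english = 0.7 * domain + rng.normal(0, 1, n)
salary = 2.0 * domain + rng.normal(0, 1, n)


def residuals(target, control):
    """Part of `target` not explained by a straight-line fit on `control`."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)


# Raw correlation between English and salary.
raw_corr = np.corrcoef(english, salary)[0, 1]

# Partial correlation: correlate what is left of each variable
# after removing the part explained by the domain score.
partial_corr = np.corrcoef(
    residuals(english, domain), residuals(salary, domain)
)[0, 1]

print(f"raw correlation:     {raw_corr:.3f}")
print(f"partial correlation: {partial_corr:.3f}")
```

In this constructed example the raw correlation is clearly positive while the partial correlation is close to zero, which is the pattern one would expect if domain skill were the real driver. On the actual data the outcome could go either way, and a linear control like this is only a first check, not a causal analysis.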

### Contact

Balaji Vasan Srinivasan, balsrini at adobe dot com
Varun Aggarwal, varun at aspiringminds dot com

Footnote: We understand this is not a causal inference, but still extremely useful and a starting point for a causal inference.
