The document provides an introduction to competitive data science. It outlines the data science process for competitions, which includes data cleaning, exploratory data analysis, feature engineering, modeling, and ensemble techniques. It explains that the goal of competitive data science is to improve performance on predefined metrics for a given task. Participants can enhance their skills, showcase their work, and learn from others in a challenging environment.
4/23/2018
Introduction to competitive data science
Nathaniel Shimoni
Talk outline
• What is competitive data science?
• Why should you participate in CDS?
• Data science process outline
• How competitive data science differs from other DS processes
• Useful tips & common practices for new participants
What is competitive data science?
• Use of a competition-based reward system to drive performance improvement on a given data-related task with predefined metric(s)
• Usually aims to improve existing results (does not start from scratch)
• Predefined objective
• Predefined metric
• Even small improvements count toward ranking
Why should you participate in CDS?
• Enhanced learning, both during the competition and after it ends
• Opportunity to showcase your skills and knowledge
• Work on real-life problems
• Meet great like-minded people
• Challenging competitive setting
• It’s FUN!!!
Common competitive data science project flow
• Data cleaning & wrangling
• Data augmentation / adding external data (not always allowed, yet good practice to consider when possible)
• Exploratory data analysis
• Feature engineering & architecture design
• Diverse single models
• Ensemble learning
• Final prediction
Supporting activities throughout: set a relevant validation method; results evaluation & error analysis; improve data quality; increase model pool diversity.
Approximate share of total time spent in each activity: data cleaning, augmentation, EDA & preprocessing: 20%; feature generation / architecture design: 40%; modeling: 30%; ensemble learning: 10%.
Project summary: main findings; lessons learned; things that worked well; things that we tried and didn’t work; ideas that we considered but haven’t tried (time limitation).
Data cleaning
• Impute missing values (mean, median, most common value, or a separate prediction task)
• Remove zero-variance features
• Remove duplicated features
• Outlier removal: use with caution, as it can be harmful; at the cleaning stage, remove only clearly irrelevant values (e.g. a negative price)
• NA encoding / imputing
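A minimal sketch of median imputation in plain Python (the `prices` column here is a hypothetical example, not from the talk):

```python
# Impute missing values (None) in a numeric column with the column median.
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

prices = [10.0, None, 12.0, 11.0, None]
print(impute_median(prices))  # -> [10.0, 11.0, 12.0, 11.0, 11.0]
```

The same pattern works for mean or most-common-value imputation by swapping the fill statistic.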
Data augmentation & external data
• External data sources:
  • OpenStreetMap
  • weather measurement data
  • online calendars
• Publicly available data
• APIs
• Scraping (using Scrapy / Beautiful Soup / other libraries or services)
Feature engineering
• Rescaling / standardization of existing features
• Data transformations: TF-IDF, log1p, min-max scaling, binning of numeric features
• Turning categorical features into numeric ones (label encoding / one-hot encoding)
• Creating count features
• Parsing textual features to obtain more generalizable features
• The hashing trick
• Extracting date/time features, e.g. month, year, dayOfWeek, dayOfMonth, isHoliday, isExtreme, etc.
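Two of the transformations above, sketched in plain Python (the color column and example date are illustrative, not from the talk):

```python
# Label encoding of a categorical column and extraction of simple
# date/time features.
from datetime import date

def label_encode(values):
    """Map each distinct category to an integer (first-seen order)."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def date_features(d):
    """Extract month, year, and day-of-week features from a date."""
    return {"month": d.month, "year": d.year, "dayOfWeek": d.weekday()}

codes, mapping = label_encode(["red", "blue", "red", "green"])
print(codes)                             # -> [0, 1, 0, 2]
print(date_features(date(2018, 4, 23)))  # a Monday -> dayOfWeek 0
```

In practice libraries such as pandas or scikit-learn provide these transformations directly.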
Exploratory data analysis (EDA)
Target: become familiar with and better understand the dataset at hand
Means:
• Feature distributions
• Histograms
• Correlograms
• Density plots
• Skewness
• Outlier analysis
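As one example of the checks above, sample skewness can be computed directly; this sketch uses the adjusted Fisher-Pearson formula (the sample data is hypothetical):

```python
# Sample skewness of a feature: positive values indicate a long right tail.
from math import sqrt

def skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    # small-sample adjustment (requires n > 2)
    return sqrt(n * (n - 1)) / (n - 2) * g1

print(skewness([1, 2, 2, 3, 10]) > 0)  # long right tail -> positive skew
```

Strongly skewed features are typical candidates for the log1p transform mentioned earlier.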
Architecture design (not just for NNs)
• Re-defining the problem (regression vs. classification)
• Using unsupervised learning before / in addition to supervised learning
• Pre-processing
• A different sub-model per segment
• Post-processing
Feature selection
• Remove near-zero-variance features
• Use feature importances and eliminate the least important features
• Remove the 1-2 most significant features to increase model diversity
• Recursive Feature Elimination (RFE)
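The first selection step above can be sketched as a simple variance filter (feature names and threshold are illustrative):

```python
# Near-zero-variance filtering: drop feature columns whose variance
# falls below a threshold.
def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def drop_low_variance(columns, threshold=1e-3):
    """columns: dict of feature name -> list of values; keep informative ones."""
    return {name: vals for name, vals in columns.items()
            if variance(vals) > threshold}

features = {
    "almost_constant": [1.0, 1.0, 1.0, 1.001],
    "informative": [0.1, 2.5, 1.7, 0.9],
}
print(sorted(drop_low_variance(features)))  # -> ['informative']
```

scikit-learn's `VarianceThreshold` implements the same idea for arrays.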
Hyper-parameter optimization
• Grid search CV (exhaustive; rarely better than the alternatives)
• Random search CV
• Hyperopt
• Bayesian optimization
* Hyper-parameter tuning will usually yield better results, but not as much as the other activities
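A minimal sketch of random search over a parameter space; the `evaluate` function and search space are hypothetical stand-ins for a real cross-validation score:

```python
# Random search: sample parameter combinations and keep the best score.
import random

random.seed(0)

space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
}

def evaluate(params):
    # stand-in for a real CV score; here it peaks at lr=0.1, depth=5
    return -abs(params["learning_rate"] - 0.1) - 0.1 * abs(params["max_depth"] - 5)

best_params, best_score = None, float("-inf")
for _ in range(20):
    params = {k: random.choice(v) for k, v in space.items()}
    score = evaluate(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```

Unlike grid search, the budget (20 trials here) is fixed regardless of how many parameters are searched.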
Selection of the most suitable validation method
• Train/test split
• Shuffle split
• K-fold (the most commonly used)
• Time-based separation
• Group K-fold
• Leave-one-group-out
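The K-fold scheme above can be sketched as an index generator (a simplified version of what libraries like scikit-learn provide):

```python
# A minimal K-fold index generator: every sample appears in exactly
# one validation fold, and in the training set of all other folds.
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs covering all samples once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

for train, valid in kfold_indices(6, 3):
    print(train, valid)
```

Shuffling the indices first, or grouping them by an ID or timestamp, gives the shuffle-split, group, and time-based variants listed above.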
Results evaluation & error analysis
• Predicted class distribution
• Classification report
• Confusion matrix
• Decision-path analysis for specific samples
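A confusion matrix can be built directly from label lists; this sketch uses hypothetical cat/dog labels:

```python
# Build a confusion matrix: rows are true labels, columns are predictions.
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

y_true = ["cat", "dog", "cat", "dog", "dog"]
y_pred = ["cat", "cat", "cat", "dog", "dog"]
print(confusion_matrix(y_true, y_pred, ["cat", "dog"]))
# -> [[2, 0], [1, 2]]  (one dog was misclassified as a cat)
```

Off-diagonal cells point at the error types worth analyzing sample by sample.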
Ensemble of several models
• Simple / weighted average of previous best models
• Bagging of the same type of model (e.g. different RNG seeds, different hyper-parameters)
• Majority vote
• Using out-of-fold predictions as meta features, a.k.a. stacking
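The two simplest schemes above, weighted averaging and majority voting, sketched with hypothetical model outputs:

```python
# Weighted averaging of predicted probabilities and majority voting
# over predicted class labels.
from collections import Counter

def weighted_average(predictions, weights):
    """predictions: one list of probabilities per model."""
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, predictions)) / total
            for i in range(len(predictions[0]))]

def majority_vote(label_lists):
    """label_lists: one list of predicted labels per model."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*label_lists)]

model_a = [0.9, 0.2, 0.4]
model_b = [0.7, 0.4, 0.6]
print([round(x, 3) for x in weighted_average([model_a, model_b], [2, 1])])
# -> [0.833, 0.267, 0.467]
print(majority_vote([["a", "b"], ["a", "a"], ["b", "a"]]))  # -> ['a', 'a']
```

Weights are often chosen from each model's validation score, which is where a trustworthy validation scheme pays off.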
Out-of-fold predictions, a.k.a. meta features
Divide the training data into n folds; train on n-1 folds, then predict both the remaining fold and the test data. Repeating this for each fold (4 folds in this example) produces out-of-fold predictions (oof 1 ... oof 4) that together cover the entire training set, plus one set of test predictions per fold model; the latter are averaged into a single set of test predictions (mean of all fold models).
Out-of-fold predictions, a.k.a. meta features
Repeating the out-of-fold procedure with several different model types (e.g. SVM, KNN, GBDT, NN) yields, per model, one column of out-of-fold predictions over the training data and one column of averaged test predictions. Placed side by side, the out-of-fold columns form the train meta features and the averaged test prediction columns form the test meta features, with the true labels of each fold remaining the targets.
After training several models using this method (4 different models in this example), we can train a new model using our newly formed meta features.
* Note that we can train our meta model either using only these new features, or using the new features along with our original train data.
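The out-of-fold mechanics can be sketched as follows; `MeanModel` is a hypothetical stand-in for any real learner (SVM, KNN, GBDT, NN), and a full version would also collect and average each fold model's test predictions:

```python
# Generate out-of-fold (OOF) predictions: each training sample is
# predicted by a model that never saw it during training.
class MeanModel:
    """Toy learner: predicts the mean of its training labels."""
    def fit(self, X, y):
        self.value = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.value] * len(X)

def oof_predictions(X, y, n_folds=4):
    """Return one OOF prediction per training sample (a meta feature)."""
    n = len(X)
    oof = [None] * n
    fold = [i % n_folds for i in range(n)]  # simple round-robin fold assignment
    for k in range(n_folds):
        train_idx = [i for i in range(n) if fold[i] != k]
        valid_idx = [i for i in range(n) if fold[i] == k]
        model = MeanModel().fit([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        for i, p in zip(valid_idx, model.predict([X[i] for i in valid_idx])):
            oof[i] = p
    return oof

X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
meta_feature = oof_predictions(X, y)
print(meta_feature)  # one leakage-free prediction per training sample
```

Running this once per base model gives one meta-feature column per model; a meta model is then trained on those columns against the original labels.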