Introduction to
Competitive data science
Nathaniel Shimoni 23/4/2018
4/23/2018
Introduction to competitive data science
Nathaniel Shimoni
2
Talk outline
• What is competitive data science?
• Why should you participate in CDS?
• Data science process outline
• How competitive data science differs from other DS processes
• Useful tips & common practices for new participants
What is competitive data science?
• Use of a competition-based reward system to drive performance improvement on a given data-related task with predefined metric(s)
• Usually intended to improve existing results (does not start from scratch)
• Predefined objective
• Predefined metric
• Any small improvement counts in the ranking
Why should you participate in CDS?
• Enhanced learning, both during the competition and after it ends
• Opportunity to showcase your skills and knowledge
• Work on real-life problems
• Meet great like-minded people
• Challenging competitive setting
• It’s FUN!!!
Common competitive data science project flow
• Data cleaning & wrangling
• Data augmentation
• Adding external data (not always allowed, yet good practice to consider when possible)
• Exploratory data analysis
• Feature engineering & architecture design
• Diverse single models
• Ensemble learning
• Final prediction
• Set a relevant validation method
• Results evaluation & error analysis
• Increase model pool diversity
• Improve data quality
• Project summary: main findings; lessons learned; things that worked well; things that we tried and didn’t work; ideas that we considered but haven’t tried (time limitation)
% of total time spent in each activity (data cleaning and augmentation; EDA & preprocessing; feature generation / architecture design; modeling; ensemble learning): 20% 40% 30% 10%
Data cleaning
• Impute missing values (mean, median, most common value, or a separate prediction task)
• Remove zero-variance features
• Remove duplicated features
• Outlier removal: use caution, it can be harmful. At the cleaning stage we only remove irrelevant values (e.g. a negative price)
• NA encoding / imputing
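A minimal sketch of the cleaning steps listed on this slide, assuming pandas. The toy frame and its column names (`price`, `price_copy`, `constant`, `city`) are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy data; all column names are hypothetical.
df = pd.DataFrame({
    "price": [10.0, -5.0, 12.0, np.nan, 11.0],
    "price_copy": [10.0, -5.0, 12.0, np.nan, 11.0],  # duplicate of "price"
    "constant": [1, 1, 1, 1, 1],                      # zero variance
    "city": ["a", "b", None, "a", "a"],
})

# Remove irrelevant values (negative price); NaN rows are kept for imputation.
df = df[~(df["price"] < 0)].reset_index(drop=True)

# Remove zero-variance features (a single distinct value).
df = df.loc[:, df.nunique(dropna=False) > 1]

# Remove duplicated features (columns with identical content).
df = df.loc[:, ~df.T.duplicated()].copy()

# Encode missingness as its own flag, then impute.
df["price_was_na"] = df["price"].isna().astype(int)
df["price"] = df["price"].fillna(df["price"].median())     # numeric: median
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])  # categorical: most common value
```

The missingness flag is computed before imputing so that the information "this value was missing" is not lost.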
Data augmentation & external data
• External data sources:
• OpenStreetMap
• Weather measurement data
• Online calendars
• Publicly available data
• APIs
• Scraping (using Scrapy / Beautiful Soup / other libraries or services)
Feature engineering
• Rescaling / standardization of existing features
• Performing data transformations: TF-IDF, log1p, min-max scaling, binning of numeric features
• Turning categorical features into numeric ones (label encoding / one-hot encoding)
• Creating count features
• Parsing textual features to get more generalizable features
• Hashing trick
• Extracting date/time features, e.g. month, year, dayOfWeek, dayOfMonth, isHoliday?, isExtreme?, etc.
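A few of the transformations above can be sketched with pandas and NumPy. The frame and its columns (`timestamp`, `amount`, `color`) are made up for the example. Holiday flags, TF-IDF and the hashing trick are left out to keep it short.

```python
import numpy as np
import pandas as pd

# Toy data; column names are hypothetical.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-04-23", "2018-04-28", "2018-12-25"]),
    "amount": [0.0, 9.0, 99.0],
    "color": ["red", "blue", "red"],
})

# Date/time features.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0
df["day_of_month"] = df["timestamp"].dt.day

# log1p transform and min-max scaling of a numeric feature.
df["amount_log1p"] = np.log1p(df["amount"])
amount_range = df["amount"].max() - df["amount"].min()
df["amount_minmax"] = (df["amount"] - df["amount"].min()) / amount_range

# Count feature: how often each category occurs.
df["color_count"] = df["color"].map(df["color"].value_counts())

# Label encoding and one-hot encoding of a categorical feature.
df["color_label"] = df["color"].astype("category").cat.codes
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)
```

In a competition, statistics such as the min/max or the category counts would normally be fit on the training data and then applied to the test data.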
Exploratory data analysis (EDA)
Target: get familiar with and better understand the dataset at hand
Means:
• Feature distributions
• Histograms
• Correlograms
• Density plots
• Skewness
• Outlier analysis
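As a small illustration of the skewness and outlier checks, here is a sketch using pandas on a toy feature. The 1.5 × IQR fence is an assumption (a common convention), not something the slide prescribes.

```python
import pandas as pd

# Toy feature with one extreme value.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], name="feature")

summary = values.describe()   # count, mean, std, min, quartiles, max
skewness = values.skew()      # > 0 means a long right tail

# Outlier analysis with the 1.5 * IQR rule (assumed convention).
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
```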
Architecture design (not just for NN)
• Redefining the problem (regression / classification)
• Using unsupervised learning before / in addition to supervised learning
• Preprocessing
• Different sub-model per segment
• Postprocessing
Feature selection
• Remove near-zero-variance features
• Use feature importance and eliminate the least important features
• Remove the 1-2 most significant features to increase model diversity
• Recursive feature elimination
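A rough sketch of two of these steps with scikit-learn: a variance filter followed by recursive feature elimination. The synthetic data, the logistic regression estimator, the variance threshold and the target of 3 features are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, 3 of them informative, plus one constant column.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = np.hstack([X, np.full((X.shape[0], 1), 0.5)])

# Step 1: drop (near-)zero-variance features.
variance_filter = VarianceThreshold(threshold=1e-8)
X_var = variance_filter.fit_transform(X)

# Step 2: recursive feature elimination, dropping the weakest feature each round.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
X_selected = rfe.fit_transform(X_var, y)
selected_mask = rfe.support_  # True for the features that were kept
```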
Hyperparameter optimization
• Grid search CV (exhaustive, rarely better than the alternatives)
• Random search CV
• Hyperopt
• Bayesian optimization
* Hyperparameter tuning will usually yield better results, but not by as much as other activities
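Random search with cross-validation could look roughly like this in scikit-learn. The model, the parameter ranges and the budget of 5 iterations are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Distributions to sample from (illustrative ranges, not recommendations).
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations
    cv=3,                # 3-fold cross-validation per configuration
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean CV accuracy of the best configuration
```

Unlike grid search, the cost is fixed by `n_iter` rather than by the size of the grid, which is one reason random search is often preferred when many parameters are tuned.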
Selection of the most suitable validation method
• Train/test split
• Shuffle split
• K-fold (the most commonly used)
• Time-based separation
• Group K-fold
• Leave one group out
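To make the difference between these splitters concrete, here is a small sketch using scikit-learn. The 12-row dataset and the `groups` array (think of it as a customer id) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)     # 12 samples, assumed ordered in time
groups = np.repeat([0, 1, 2, 3], 3)  # hypothetical group id per sample

# Plain K-fold: every sample lands in a validation fold exactly once.
kfold_splits = list(KFold(n_splits=4, shuffle=True, random_state=0).split(X))

# Group K-fold: all samples of a group stay on the same side of the split.
group_splits = list(GroupKFold(n_splits=4).split(X, groups=groups))

# Time-based separation: validation samples always come after training samples.
time_splits = list(TimeSeriesSplit(n_splits=3).split(X))
```

The choice should mirror how the test set was separated from the training set: by time, by group, or at random.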
Results evaluation & error analysis
• Classifier distribution
• Classification report
• Confusion matrix
• Specific sample decision path analysis
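A short sketch of three of these checks with scikit-learn, on made-up labels. Here "classifier distribution" is read as comparing the predicted class counts with the true class counts, which is an interpretation rather than something the slide spells out.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for a 3-class problem.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 2, 2, 1])

# Distribution of predicted classes versus true classes.
true_counts = np.bincount(y_true)
pred_counts = np.bincount(y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Per-class precision, recall and F1 as a text table.
report = classification_report(y_true, y_pred)
print(report)
```

Off-diagonal cells of the confusion matrix point at the class pairs worth inspecting sample by sample.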
Ensemble of several models
• Simple / weighted average of previous best models
• Bagging of models of the same type (e.g. different RNG seeds, different hyperparameters)
• Majority vote
• Using out-of-fold predictions as meta features, a.k.a. stacking
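Averaging and voting are simple enough to sketch with NumPy alone. The three prediction vectors and the weights below are invented; in practice the weights would typically be chosen from validation scores.

```python
import numpy as np

# Predicted probabilities of the positive class from three hypothetical models.
preds = np.vstack([
    [0.9, 0.4, 0.2, 0.6],  # model 1
    [0.8, 0.6, 0.1, 0.4],  # model 2
    [0.7, 0.3, 0.4, 0.7],  # model 3
])

# Simple average across models.
simple_avg = preds.mean(axis=0)

# Weighted average (weights are illustrative and sum to 1).
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = np.average(preds, axis=0, weights=weights)

# Majority vote on hard labels, thresholding each model at 0.5.
votes = (preds >= 0.5).astype(int)
majority = (votes.sum(axis=0) > preds.shape[0] / 2).astype(int)
```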
Out-of-fold predictions, a.k.a. meta features
Divide the training data into n folds; for each fold, train on the other n-1 folds and predict both the held-out fold and the test data.
[Diagram: the training data is split into Fold 1 to Fold 4. Each fold model yields the out-of-fold predictions for its held-out fold (oof 1 to oof 4) and one set of test predictions (fold 1 to fold 4). The oof pieces together form the out-of-fold predictions; the test predictions are averaged (mean of all fold models).]
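The procedure on this slide can be sketched as follows with scikit-learn. The helper name `oof_predictions`, the synthetic data and the logistic regression model are assumptions for the example, and binary classification is assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split


def oof_predictions(model_factory, X_train, y_train, X_test, n_splits=4, seed=0):
    """Return (out-of-fold train predictions, fold-averaged test predictions)."""
    oof = np.zeros(len(X_train))
    test_pred = np.zeros(len(X_test))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X_train):
        model = model_factory()  # fresh model for every fold
        model.fit(X_train[train_idx], y_train[train_idx])
        # Predict the held-out fold: each training row is predicted exactly once,
        # by a model that never saw it.
        oof[valid_idx] = model.predict_proba(X_train[valid_idx])[:, 1]
        # Predict the test data and accumulate the mean over fold models.
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred


X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
oof, test_pred = oof_predictions(
    lambda: LogisticRegression(max_iter=1000), X_train, y_train, X_test
)
```

Because no row of `oof` comes from a model trained on that row, the vector can be used as a training feature for a second-level model without leaking the labels.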
Out-of-fold predictions, a.k.a. meta features
[Diagram: the out-of-fold predictions (oof 1 to oof 4) of model 1 (e.g. SVM), model 2 (e.g. KNN), model 3 (e.g. GBDT) and model 4 (e.g. NN) are placed side by side as columns, next to the true labels of the training data (Fold 1 to Fold 4 true labels), to form the train meta features. The averaged test predictions of models 1 to 4 form the test meta features.]
After training several models using this method (4 different models in this example), we can train a new model using our newly formed meta features.
* Note that we can train the meta model either using only these new features or using the new features along with our original training data.
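A compact stacking sketch along these lines, using scikit-learn. To keep it quick, three base models are used instead of four, with logistic regression standing in for the SVM and NN of the slide; the data is synthetic and the meta model is trained on the meta features only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base models (an illustrative choice, not the slide's exact line-up).
base_models = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]

folds = KFold(n_splits=4, shuffle=True, random_state=0)
n_folds = folds.get_n_splits()
train_meta = np.zeros((len(X_train), len(base_models)))  # one column per model
test_meta = np.zeros((len(X_test), len(base_models)))

for j, base in enumerate(base_models):
    for train_idx, valid_idx in folds.split(X_train):
        model = clone(base).fit(X_train[train_idx], y_train[train_idx])
        # Out-of-fold predictions become the train meta feature of model j.
        train_meta[valid_idx, j] = model.predict_proba(X_train[valid_idx])[:, 1]
        # Fold-averaged test predictions become the test meta feature of model j.
        test_meta[:, j] += model.predict_proba(X_test)[:, 1] / n_folds

# Second-level (meta) model trained on the meta features only.
meta_model = LogisticRegression().fit(train_meta, y_train)
stacked_accuracy = meta_model.score(test_meta, y_test)
```

To use the original features as well, `np.hstack([X_train, train_meta])` and `np.hstack([X_test, test_meta])` could be passed to the meta model instead.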
Questions?
