1. Data Science Competition
2.25.2017
The 27th Annual KSEA South-Western Regional Conference
Jeong-Yoon Lee, Ph.D.
2. Chief Data Scientist, Conversion Logic
Ph.D. in Computer Science, USC
M.S. in Electrical Engineering, USC
B.S. in Electrical Engineering, SNU
KDD Cup Winner 2012 & 2015
Top 10, Kaggle 2015
Jeong-Yoon Lee, Ph.D.
23. No EDA?
• Most competitions provide actual labels, so typical EDA applies.
• Anonymized data calls for more creative EDA:
o People decode age, states, time intervals, income, etc.
31. Algorithms
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM; best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest; good for ensembles
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
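To make the table concrete, here is a minimal sketch of the most popular entry, gradient boosting, using scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for XGBoost/LightGBM (which follow the same fit/predict pattern); the data and parameters are illustrative assumptions, not from the talk.

```python
# Minimal sketch: a gradient boosting classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (placeholder for competition data).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=0)
model.fit(X_tr, y_tr)

# Competitions with probabilistic targets are often scored by AUC.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"test AUC: {auc:.3f}")
```

Swapping in XGBoost or LightGBM is mostly a one-line change (`xgboost.XGBClassifier` / `lightgbm.LGBMClassifier`), which is why the fit/predict sketch above transfers directly.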
32. Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
34. Ensemble
* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
I am Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic. I am going to tell you a little bit about our attribution approach.
states, age, time interval, weekday,
Training data are split into five folds while the sample size and dropout rate are preserved across folds.
For validation, each of the single and ensemble models is trained five times. Each time, one fold is held out and the remaining four folds are used for training. The predictions for the held-out folds are then combined to form the model's CV prediction. CV predictions are used in AUC score calculation and/or as inputs to ensemble model training.
For test, each of the single and ensemble models is retrained on the whole training data. The predictions for the test data are then used for submission and/or as inputs to ensemble model prediction.
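The out-of-fold scheme described above can be sketched as follows; `LogisticRegression` and the synthetic data are illustrative assumptions standing in for any competition model and dataset.

```python
# Sketch of the out-of-fold (OOF) CV scheme: train five times, each time
# predicting the held-out fold; the combined OOF predictions form the
# model's CV prediction, used for AUC scoring and as ensemble input.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=42)
cv_pred = np.zeros(len(y))

# Stratified folds preserve the label (e.g. dropout) rate in each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for trn_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[trn_idx], y[trn_idx])
    cv_pred[val_idx] = clf.predict_proba(X[val_idx])[:, 1]

print(f"CV AUC: {roc_auc_score(y, cv_pred):.3f}")

# For the test-data predictions, retrain on the whole training data.
final_clf = LogisticRegression(max_iter=1000).fit(X, y)
```

Because every training example is predicted exactly once by a model that never saw it, `cv_pred` can feed a higher-stage ensemble without leaking labels.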
Stage-I Ensemble: We trained 15 stage-I ensemble classifiers with different subsets of CV predictions of 64 individual classifiers.
Stage-II Ensemble: We trained 2 stage-II ensemble classifiers with different subsets of CV predictions of 15 stage-I ensemble classifiers.
Stage-III Ensemble: We trained a stage-III ensemble classifier with CV predictions of 5 classifiers: 1 stage-II ensemble, 3 stage-I ensembles, and 1 individual classifier.
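The staged stacking above can be sketched in miniature; this is a hedged illustration, not the competition code. It uses two stage-I models and one stage-II blender (the actual solution used 64 individual, 15 stage-I, and 2 stage-II models), with `RandomForestClassifier` and `LogisticRegression` as assumed stand-ins.

```python
# Sketch of multi-stage stacking: stage-I models are trained on the raw
# features via the OOF scheme, and a stage-II model is trained on their
# CV predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)
stage1_models = [RandomForestClassifier(n_estimators=50, random_state=0),
                 LogisticRegression(max_iter=1000)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.zeros((len(y), len(stage1_models)))  # stage-I CV predictions
for j, model in enumerate(stage1_models):
    for trn, val in skf.split(X, y):
        model.fit(X[trn], y[trn])
        oof[val, j] = model.predict_proba(X[val])[:, 1]

# Stage-II ensemble: fit on the stage-I CV predictions, again out-of-fold
# so its own CV prediction could feed a stage-III model.
stage2_pred = np.zeros(len(y))
for trn, val in skf.split(oof, y):
    blender = LogisticRegression(max_iter=1000).fit(oof[trn], y[trn])
    stage2_pred[val] = blender.predict_proba(oof[val])[:, 1]

print(f"stage-II CV AUC: {roc_auc_score(y, stage2_pred):.3f}")
```

Each further stage repeats the same pattern, treating the previous stage's CV predictions as its input features.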