Ensembling & Boosting: A Conceptual Introduction
Wayne Chen
201608

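Purpose of this talk
Build a better sense of the data analysis field
Stop feeling intimidated when you meet someone who says they have competed on Kaggle
Even if you never use these techniques directly, the ideas may be worth borrowing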

If Deep Learning changed the rules of the game for ML...
XGBoost : Kaggle Winning Solution
Giuliano Janson: Won two games and retired from Kaggle
Persistence: every Kaggler nowadays can put up a great model in a few hours
and usually achieve 95% of final score. Only persistence will get you the
remaining 5%.
Ensembling: need to know how to do it "like a pro". Forget about averaging
models. Nowadays many Kaggler do meta-models, and meta-meta-models.

Why is ensembling needed?
Occam's Razor
● An explanation of the data should be made as simple as possible, but no simpler.
Simple methods beat complex ones. Simple is good. Any waste is bad.
Combining several simple models often works better than a single complex model
● Training data might not provide sufficient information for choosing a single best learner.
● The search processes of the learning algorithms might be imperfect (difficult to achieve unique
best hypothesis)
● Hypothesis space being searched might not contain the true target function.

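What counts as a simple method
ID3, C4.5, CART … tree-based methods
Entropy
e.g. Find the big spenders, splitting on gender: 5 spenders (1M, 4F), 9 non-spenders (6M, 3F)
● E_all → -5/14 * log2(5/14) - 9/14 * log2(9/14)
● Entropy (base-2 log) is 1 for a 50%/50% split and 0 for a 100%/0% split
Information Gain
● How much the entropy drops, relative to before, after choosing attribute a as the split attribute
● E_gender → P(M) * E(1,6) + P(F) * E(4,3)
● Gain = E_all - E_gender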
http://www.saedsayad.com/decision_tree.htm
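A minimal sketch of the entropy and information gain calculation for the gender split above. The helper name entropy and the variable names are illustrative, not from the original slides, and base-2 logarithms are assumed so that a 50/50 split has entropy 1.

```python
from math import log2

def entropy(*counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Counts from the slide: 5 spenders (1M, 4F) and 9 non-spenders (6M, 3F).
e_all = entropy(5, 9)
e_male = entropy(1, 6)      # males: 1 spender, 6 non-spenders
e_female = entropy(4, 3)    # females: 4 spenders, 3 non-spenders

# Weighted entropy after splitting on gender: P(M) = 7/14, P(F) = 7/14.
e_gender = (7 / 14) * e_male + (7 / 14) * e_female
gain = e_all - e_gender

print(f"E_all={e_all:.3f}  E_gender={e_gender:.3f}  Gain={gain:.3f}")
```

A tree learner such as ID3 would compute this gain for every candidate attribute and split on the one with the largest value.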

What could go wrong?
The more precisely a model fits its training data, the more skewed it may be (overfitting)
http://blogs.sas.com/content/jmp/2013/03/25/partitioning-a-quadratic-in-jmp/

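Boosting ensembles in one sentence
To recognize a mistake and correct it is the greatest good
Learning means revisiting your mistakes again and again, weighting them more heavily in memory, and then improving
There is no undoing a mistake; take the lesson and work on not repeating it
1. A mistake is a mistake: do not throw it away, and do not dwell on it
2. Remember where you went wrong and put more weight on it next time
3. Keep learning until you score 100 on every exam (just kidding)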

Learn to ensemble in one second
You have probably already tried a few different models
● Decision tree, NN, SVM, Regression ..
Ensemble the Kaggle submission CSV files. → It works!
Majority Voting
● Three models, each 70% accurate
● A majority-vote ensemble reaches ~78% if the models' errors are independent.
● Averaging predictions often reduces overfitting.
http://mlwave.com/kaggle-ensembling-guide/
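The ~78% figure can be checked with a short calculation. This is a sketch that assumes the models make independent errors, which real models rarely do; the function name majority_vote_accuracy is illustrative.

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """Probability that a strict majority of n independent models is correct."""
    need = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )

# Three models at 70%: 0.7^3 + 3 * 0.7^2 * 0.3 = 0.784
print(round(majority_vote_accuracy(0.7, 3), 4))
# Adding more independent models raises the figure further.
print(round(majority_vote_accuracy(0.7, 5), 4))
```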

Ensemble pitfalls
If you put Kobe, Curry, and LBJ on one team, will they win the championship?
Uncorrelated models usually perform better when combined
Base models should be as accurate as possible and as diverse as possible
Common mechanisms: majority vote, weighted averaging
Voting Ensemble → RandomForest → GradientBoostingMachine
Three highly correlated models (ground truth is all 1s):
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
Majority vote: 1111111100 = 80% accuracy
Three less correlated models:
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
Majority vote: 1111111101 = 90% accuracy
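The two voting examples can be reproduced in a few lines. This sketch assumes the ground truth is ten 1s, which is what the accuracy figures imply; the helper names are illustrative.

```python
def majority_vote(*predictions):
    """Per-position majority vote over equal-length strings of 0/1 predictions."""
    return "".join(
        "1" if sum(int(bit) for bit in bits) * 2 > len(bits) else "0"
        for bits in zip(*predictions)
    )

def accuracy(prediction, truth):
    """Fraction of positions where the prediction matches the truth."""
    return sum(p == t for p, t in zip(prediction, truth)) / len(truth)

truth = "1111111111"  # assumed ground truth: every label is 1

correlated = ["1111111100", "1111111100", "1011111100"]
diverse = ["1111111100", "0111011101", "1000101111"]

for name, models in [("correlated", correlated), ("diverse", diverse)]:
    vote = majority_vote(*models)
    print(name, vote, accuracy(vote, truth))
```

The diverse group contains weaker individual models, yet its vote scores higher because the models make their mistakes in different positions.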

The ensemble method you have surely heard of: Random Forest
● Randomly samples not only the data but also the features
● Majority vote
● Minimal tuning
● Outperforms many more complex methods
n: subsample size
m: feature subset size
tree size, number of trees
http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013

Ensemble keywords
Base learner: the basic model that gets ensembled, e.g. a single tree or a simple neural network
● Trained by a base learning algorithm (e.g. decision tree, neural network ..)
Three main families of training methods:
● Boosting - Boost weak learners to strong learners (sequential learners)
● Bagging - Like RandomForest, sampling from data or features
● Stacking - The idea of packaging models together (parallel learners)
  ● Employs different learning algorithms to train the individual learners
  ● The individual learners are then combined by a second-level learner, called the meta-learner.

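Bagging Ensemble: Bootstrap Aggregating
Each round draws m data points (a bootstrap sample) and trains a base learner by calling a base learning algorithm
● Choosing the sampling ratio takes skill
● You can even train different models on sub-datasets built from different features
○ Cherkauer (1996): volcano identification project, 32 NNs, split by different input features
● Inject randomness
○ random initialization for backpropagation, random feature selection for trees
● Majority voting
Advantage: preserves the diversity of the overall hypothesis set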
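A minimal bagging sketch, assuming one-dimensional toy data and a decision stump as the base learner. The data, the stump learner, and all function names are illustrative choices, not from the original slides.

```python
import random
from collections import Counter

def train_stump(sample):
    """Base learner: pick the threshold and sign with the fewest errors on the sample."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for sign in (1, -1):
            errors = sum((sign if x >= threshold else -sign) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best
    return lambda x: sign if x >= threshold else -sign

def bagging(data, n_learners, m, rng):
    """Train n_learners stumps, each on a bootstrap sample of m points."""
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in range(m)]  # draw with replacement
        learners.append(train_stump(sample))
    return learners

def predict(learners, x):
    """Majority vote over the base learners."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy data: label is +1 when x >= 5, with two labels flipped as noise.
data = [(x, 1 if x >= 5 else -1) for x in range(10)]
data[2] = (2, 1)
data[7] = (7, -1)

learners = bagging(data, n_learners=25, m=len(data), rng=rng)
print([predict(learners, x) for x in range(10)])
```

Each stump sees a different resample, so the stumps disagree on the noisy points; the vote is what smooths those disagreements out.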

Boost Family
● AdaBoost (Adaptive Boosting)
● Gradient Tree Boosting
● XGBoost
Combination of Additive Models
Converges well during learning
Risks amplifying noise
● Bagging can significantly reduce the variance
● Boosting can significantly reduce the bias

http://slideplayer.com/slide/4816467/
Initially assigns equal weights to all training examples, then increases
the weights of incorrectly classified examples after each round.

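AdaBoost characteristics
In most cases it performs very well, but its tendency to amplify noise
is a weakness it has to overcome.
In each round of classification, we want to raise the probability that
misclassified points are classified correctly next time, and lower the
probability that they are misclassified again.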
http://www.37steps.com/exam/adaboost_comp/html/adaboost_comp.html
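The reweighting idea can be sketched with the standard discrete AdaBoost update. This is a sketch under assumptions: one-dimensional toy data, decision stumps as weak learners, and illustrative function names; it is not the code behind the figures referenced above.

```python
import math

def best_stump(xs, ys, weights):
    """Weak learner: threshold/sign stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= threshold else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n                      # start with equal weights
    ensemble = []                              # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, threshold, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, sign))
        # Misclassified points (y * prediction = -1) get heavier, the rest lighter.
        weights = [
            w * math.exp(-alpha * y * (sign if x >= threshold else -sign))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy labels that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set the best single stump still gets three points wrong, while the weighted vote of three stumps classifies every training point correctly. The same mechanism is what amplifies noise: a mislabeled point keeps gaining weight.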

Gradient Boosting
Additive training
● New predictor is optimized by moving in the opposite direction of the
gradient to minimize the loss function.
Decision trees in GBDT are shallow: depth usually does not exceed 5, and the number of leaf nodes usually does not exceed 10
● Boosted Tree: GBDT, GBRT, MART, LambdaMART

Gradient Boosting Model Steps
● Leaf weighted cost score
● Additive training: add one new model to the ensemble → choose the
model whose addition reduces the cost error the most
● A greedy algorithm builds each new tree starting from a single leaf
● Weights are updated along the gradient
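The additive training loop can be sketched as follows. Several things here are assumptions made for illustration: squared-error loss (so the negative gradient is simply the residual), one-split regression stumps as the trees, a fixed step-size factor named learning_rate, and all function names. XGBoost itself uses a more elaborate regularized objective.

```python
def fit_stump(xs, residuals):
    """Greedy one-split regression tree: each leaf predicts its mean residual."""
    best = None
    for threshold in sorted(set(xs))[1:]:      # skip the split with an empty left side
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        lw, rw = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lw) ** 2 for r in left) + sum((r - rw) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lw, rw)
    return best[1:]

def gradient_boost(xs, ys, n_trees, learning_rate):
    base = sum(ys) / len(ys)                   # initial constant prediction
    preds = [base] * len(xs)
    trees = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, lw, rw = fit_stump(xs, residuals)
        lw, rw = learning_rate * lw, learning_rate * rw   # scaled leaf weights
        trees.append((threshold, lw, rw))
        preds = [p + (lw if x < threshold else rw) for x, p in zip(xs, preds)]
    return base, trees

def predict(model, x):
    base, trees = model
    return base + sum(lw if x < t else rw for t, lw, rw in trees)

def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [x * x for x in xs]                       # toy regression target

for n_trees in (1, 5, 50):
    model = gradient_boost(xs, ys, n_trees, learning_rate=0.1)
    print(n_trees, round(mse(model, xs, ys), 2))
```

Each new stump is fit to what the current ensemble still gets wrong, so the training error keeps falling as trees are added.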

Training Tips
Shrinkage
● Reduces the influence of each individual tree and leaves space for
future trees to improve the model.
● Better to improve the model by many small steps than by large steps.
Subsampling, Early Stopping, Post-Pruning

Why XGBoost is so powerful
● Of the 29 Kaggle challenge winning solutions in 2015, 17 used XGBoost
(deep neural nets: 11)
● All KDDCup 2015 winning solutions mention it.
● Use it and land straight in the leaderboard top 10
Scalability enables data scientists to process hundreds of millions of
examples on a desktop.
● OpenMP CPU multi-thread
● DMatrix
● Cache-aware and Sparsity-aware

Column Block for Parallel Learning
The most time consuming part of tree learning is to get the data into sorted
order.
Data is stored in in-memory blocks in compressed column format, with each
column sorted by its feature value. Related techniques: block compression, block sharding.

Use it in Python
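from xgboost import XGBClassifier
# n_jobs / random_state replace the older nthread / seed argument names
xgb_model = XGBClassifier(
    learning_rate=0.1, n_estimators=1000, max_depth=5, min_child_weight=1,
    gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', n_jobs=8, scale_pos_weight=1, random_state=27)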
● gamma : Minimum loss reduction required to make a further partition on a
leaf node of the tree.
● min_child_weight : Minimum sum of instance weight(hessian) needed in a
child.
● colsample_bytree : Subsample ratio of columns when constructing each
tree.

Ensemble in Kaggle
Voting ensembles, Weighted majority vote, Bagged Perceptrons, Rank
averaging, Historical ranks, Stacking & Blending (Netflix)
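Of the techniques listed, rank averaging is easy to sketch: each model's scores are converted to ranks before averaging, so a model whose scores sit in a narrow band still contributes its ordering. The score lists and function names below are hypothetical, and ties are broken by original order as a simplifying assumption.

```python
def ranks(scores):
    """Rank of each score (0 = lowest); ties are broken by original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def rank_average(*model_scores):
    """Average the per-model ranks, then rescale to the range [0, 1]."""
    n = len(model_scores[0])
    all_ranks = [ranks(s) for s in model_scores]
    mean = [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]
    return [m / (n - 1) for m in mean]

model_a = [0.10, 0.40, 0.35, 0.80]       # hypothetical, well-spread scores
model_b = [0.990, 0.992, 0.995, 0.999]   # hypothetical, squeezed into a narrow band

print(rank_average(model_a, model_b))
```

This only makes sense for metrics that depend on ordering, such as AUC, since the averaged ranks are no longer calibrated probabilities.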

Ensemble in Kaggle
Image classification competition
● Voting ensemble of around 30 convnets. The best single model scored
0.93170. Final score 0.94120.

No Free Lunch
An ensemble is usually much better than a single learner.
Bias-variance tradeoff → use boosting or an averaged vote.
● Hard to interpret, like DNNs and non-linear SVMs
● There is no ensemble method which outperforms other ensemble methods
consistently
Selecting a subset of base learners, instead of using all of them, often
gives a better ensemble (selective ensembles)
XGBoost (tabular data) vs. Deep Learning (more & complex data, hard tuning)

Reference
● Alexey Natekin and Alois Knoll, Gradient boosting machines, a tutorial
● Tianqi Chen, XGBoost: A Scalable Tree Boosting System
● NTU cmlab http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/
● http://mlwave.com/kaggle-ensembling-guide/
