Ensemble Learning with
Apache Spark MLlib 1.5
leoricklin@gmail.com
Reference
[1] http://www.csdn.net/article/2015-03-02/2824069
[2] http://www.csdn.net/article/2015-09-07/2825629
[3] http://www.scholarpedia.org/article/Ensemble_learning
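What is Ensemble Learning?
● Combines different learners (individual models) to improve the stability and predictive power of the overall model
● Four main factors make models differ from one another; combinations of these factors can also produce different models:
○ Different populations
○ Different hypotheses
○ Different modeling techniques
○ Different initial parameters
● Ensemble learning is a typical practice-driven research area: it was first shown to work in practice, and only afterwards did researchers analyze it theoretically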
A pinch of math
● There are 3 (independent) binary classifiers (A, B, C), each with 70% accuracy
● For a majority vote with 3 members we can expect 4 outcomes:
● All three are correct
0.7 * 0.7 * 0.7 = 0.343
● Two are correct
0.7 * 0.7 * 0.3 + 0.7 * 0.3 * 0.7 + 0.3 * 0.7 * 0.7 = 0.441
● Two are wrong
0.3 * 0.3 * 0.7 + 0.3 * 0.7 * 0.3 + 0.7 * 0.3 * 0.3 = 0.189
● All three are wrong
0.3 * 0.3 * 0.3 = 0.027
● The majority vote is correct in the first two cases:
0.343 + 0.441 = 0.784 > 0.7
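The same arithmetic generalizes to any odd number of voters. The sketch below is a minimal illustration, assuming every voter has the same accuracy and that their errors are independent; the function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes every voter has the same accuracy p and errors are independent.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    # Sum the binomial probabilities of having `need` or more correct voters.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_vote_accuracy(0.7, n), 4))
```

For three voters at 70% accuracy this gives 0.784, and the value keeps rising as more independent voters are added. Correlated errors, which are common in practice, shrink that gain.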
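Model Error
● The error of any model can be decomposed mathematically into three components:
○ Bias error measures how far the predicted values are from the actual values
○ Variance measures how much the predictions for the same observation differ from one another
○ Irreducible error is the noise that no model can remove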
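For squared-error loss, the three-component decomposition is usually written as below. The notation is assumed here and does not come from the slides: y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and \hat f is the fitted model, with expectations taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat f(x) - \mathbb{E}[\hat f(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```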
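Trade-off management of bias-variance errors
● As model complexity increases, the model usually ends up overfitting, and variance starts to appear
● A good model should keep a balance between these two kinds of error
● Ensemble learning is one way to manage this trade-off
○ How should each algorithm be trained?
○ How should the algorithms be combined?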
EL techniques (1): Bagging
● Fits similar learners on small samples of the data, then averages their predictions
● Helps reduce variance
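The bagging idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the deck's implementation: the samples are bootstrap samples (drawn with replacement, same size as the data), the base learner is a toy 1-nearest-neighbour regressor chosen because it has high variance, and all function names and data are invented.

```python
import random
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_1nn(sample):
    """Hypothetical high-variance base learner: 1-nearest-neighbour regression."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict


def bagged_predict(data, x, n_models=25, seed=0):
    """Average the predictions of n_models learners, each fit on its own bootstrap sample."""
    rng = random.Random(seed)
    models = [fit_1nn(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(model(x) for model in models)


if __name__ == "__main__":
    # Toy (x, y) pairs, invented for illustration.
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 5.0), (4.0, 4.0)]
    print(bagged_predict(data, 1.4))
```

Each individual 1-NN model jumps between neighbouring targets depending on which points its sample happened to contain; averaging many of them smooths those jumps, which is the variance reduction the slide refers to.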
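EL techniques (2): Boosting
● An iterative technique
● Adjusts the weight of each observation based on the previous round of classification: if an observation was misclassified, its weight is increased
● Reduces bias error, but can sometimes overfit the training data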
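The slides do not name a specific boosting algorithm. As one concrete instance of the reweighting step, the sketch below uses the classic AdaBoost update, which is an assumption made for illustration; the function name `reweight` is hypothetical.

```python
import math


def reweight(weights, y_true, y_pred):
    """One AdaBoost-style round: return (alpha, new_weights).

    weights must sum to 1; labels can be any values comparable with !=.
    """
    # Weighted error: total weight of the misclassified observations.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)  # vote weight of this learner
    scaled = [
        w * math.exp(alpha if t != p else -alpha)  # raise mistakes, lower hits
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]  # renormalize to sum to 1


if __name__ == "__main__":
    alpha, new_w = reweight([0.25] * 4, [1, 1, 0, 0], [0, 1, 0, 0])
    print(alpha, new_w)
```

With four equally weighted observations and one mistake, the misclassified observation ends up carrying half of the total weight, so the next learner is pushed to get it right.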
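EL techniques (3): Stacking
● Uses one learner to combine the outputs of different learners
● Can reduce both bias error and variance
● Choosing the right ensemble members is more an art than a purely scientific question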
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng
Stacking with Apache Spark MLlib (1)
● Dataset: UCI Covtype (Ch. 4, Advanced Analytics with Spark)
● Baseline: RandomForest (best of 8 hyper-parameter settings with 3-fold C.V.)
○ precision = 0.956144
○ recall = 0.956144
● Flow: RF(θ1) fits Training set X, then predicts Training set Y, giving h1(Y,θ1)
● #trees = 32
● θ1: #bins=300, #depth=30, entropy
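The baseline precision and recall are identical. That is what one would expect if both are micro-averaged over all classes, as the overall precision and recall of `MulticlassMetrics` in Spark 1.x are; the slides do not say how the metrics were computed, so this is an assumption. The sketch below shows why the two numbers coincide for single-label multiclass data: every misclassification is one false positive for the predicted class and one false negative for the true class. The function name and the toy labels are invented.

```python
from collections import Counter


def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall for single-label multiclass data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gets a false positive
            fn[t] += 1  # the true class gets a false negative
    total_tp = sum(tp.values())
    precision = total_tp / (total_tp + sum(fp.values()))
    recall = total_tp / (total_tp + sum(fn.values()))
    return precision, recall


if __name__ == "__main__":
    y_true = [1, 2, 3, 3, 2, 1, 1]
    y_pred = [1, 2, 3, 2, 2, 3, 1]
    print(micro_precision_recall(y_true, y_pred))
```

Both values equal plain accuracy (5 correct out of 7 in the toy data), so a single number would summarize the results tables equally well under this assumption.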
Stacking with Apache Spark MLlib (2)
● Using Meta-features
● #trees = 32
● θ1: #bins=300, #depth=30, entropy
● θ2: #bins=40, #depth=30, entropy
● θ3: #bins=300, #depth=30, gini
● Tier 1: RF(θ1), RF(θ2), RF(θ3) fit and predict Training set X with 3-fold C.V., giving the meta-features h1(X,θ1), h2(X,θ2), h3(X,θ3)
● Tier 2: RF(θ1) fits on h1(X,θ1), h2(X,θ2), h3(X,θ3) and Label
● Prediction: RF(θ1), RF(θ2), RF(θ3) predict Training set Y, giving h1(Y,θ1), h2(Y,θ2), h3(Y,θ3); the tier-2 RF(θ1) predicts from these (diagram note: sort by precision)
● Results:
            Baseline   Current
precision   0.956144   0.951056
recall      0.956144   0.951056
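The meta-feature step, in which each tier-1 model predicts every row of Training set X using a model fit on the other folds, can be sketched without Spark. Everything here is a stand-in chosen for illustration: the threshold "stump" learners replace RF(θ1), RF(θ2), RF(θ3), the folds are contiguous rather than shuffled, and all names and data are hypothetical.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(fit, X, y, k=3):
    """Predict every row of X with a model that never saw that row."""
    preds = [None] * len(X)
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_X = [x for i, x in enumerate(X) if i not in held]
        train_y = [t for i, t in enumerate(y) if i not in held]
        model = fit(train_X, train_y)
        for i in held_out:
            preds[i] = model(X[i])
    return preds


def meta_features(fits, X, y, k=3):
    """One column of out-of-fold predictions per tier-1 learner."""
    columns = [out_of_fold_predictions(fit, X, y, k) for fit in fits]
    return [list(row) for row in zip(*columns)]


def make_stump_fit(j):
    """Hypothetical tier-1 learner: threshold feature j at its training mean."""
    def fit(X, y):
        cut = sum(x[j] for x in X) / len(X)

        def side_label(above):
            # Majority label among training rows on one side of the cut.
            votes = Counter(t for x, t in zip(X, y) if (x[j] > cut) == above)
            return votes.most_common(1)[0][0] if votes else y[0]

        hi, lo = side_label(True), side_label(False)
        return lambda x: hi if x[j] > cut else lo
    return fit


if __name__ == "__main__":
    X = [[0, 5], [1, 4], [2, 6], [7, 0], [8, 1], [9, 2]]
    y = [0, 0, 0, 1, 1, 1]
    print(meta_features([make_stump_fit(0), make_stump_fit(1)], X, y, k=3))
```

Using out-of-fold predictions keeps the tier-2 model from training on predictions that the tier-1 models made for rows they had already memorized. On real data the rows would normally be shuffled before being split into folds.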
Stacking with Apache Spark MLlib (3)
● Using Original features & Meta-features
● #trees = 32
● θ1: #bins=300, #depth=30, entropy
● θ2: #bins=40, #depth=30, entropy
● θ3: #bins=300, #depth=30, gini
● Tier 1: RF(θ1), RF(θ2), RF(θ3) fit and predict Training set X with 3-fold C.V., giving the meta-features h1(X,θ1), h2(X,θ2), h3(X,θ3)
● Tier 2: RF(θ1) fits on the original features f1 ... fn together with h1(X,θ1), h2(X,θ2), h3(X,θ3) and Label
● Prediction: RF(θ1), RF(θ2), RF(θ3) predict Training set Y, giving h1(Y,θ1), h2(Y,θ2), h3(Y,θ3); the tier-2 RF(θ1) predicts from f1 ... fn plus these (diagram note: sort by precision)
● Results:
            Baseline   Current
precision   0.956144   0.951094
recall      0.956144   0.951094
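Combining the original features f1 ... fn with the meta-features amounts to widening each row before the tier-2 model is fit. A minimal sketch follows; the function name `tier2_rows` is hypothetical, and the (label, features) row shape is chosen only because it mirrors how MLlib's `LabeledPoint` pairs a label with a feature vector.

```python
def tier2_rows(features, meta, labels):
    """Build (label, f1..fn + h1..hm) rows for the tier-2 learner."""
    if not (len(features) == len(meta) == len(labels)):
        raise ValueError("features, meta and labels must have the same length")
    # Original features first, then one meta-feature per tier-1 model.
    return [(y, list(f) + list(h)) for f, h, y in zip(features, meta, labels)]


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 4.0]]   # original features f1, f2
    meta = [[0, 1, 0], [1, 1, 1]]         # h1, h2, h3 from the tier-1 models
    labels = [0, 1]
    print(tier2_rows(features, meta, labels))
```

The same widening has to be applied to Training set Y at prediction time, with the columns in the same order.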
Stacking with Apache Spark MLlib (4)
● Retrain tier-1 models and stacking with all features
● #trees = 32
● θ1: #bins=300, #depth=30, entropy
● θ2: #bins=40, #depth=30, entropy
● θ3: #bins=300, #depth=30, gini
● Tier 1: RF(θ1), RF(θ2), RF(θ3) fit and predict Training set X, giving the meta-features h1(X,θ1), h2(X,θ2), h3(X,θ3)
● Tier 2: RF(θ1) fits on the original features f1 ... fn together with h1(X,θ1), h2(X,θ2), h3(X,θ3) and Label
● Prediction: RF(θ1), RF(θ2), RF(θ3) predict Training set Y, giving h1(Y,θ1), h2(Y,θ2), h3(Y,θ3); the tier-2 RF(θ1) predicts from f1 ... fn plus these (diagram note: sort by precision)
● Results:
            Baseline   Current
precision   0.956144   0.956836
recall      0.956144   0.956836
