What is an "ensemble learner"? How can we combine different base learners into an ensemble in order to improve the overall classification performance? In this lecture, we are providing some answers to these questions.
Machine Learning for Language Technology
Marina Santini, Uppsala University
Department of Linguistics and Philology
Where we are…
Lecture 6: Ensemble Methods
Previous lectures: various different learning algorithms
This lecture: how to combine classifiers
Combining Multiple Learners
Thanks to E. Alpaydin and Oscar Täckström
Wisdom of the Crowd
Guess the weight of an ox
Average of people's votes close to true weight
Better than most individual members' votes and cattle experts' votes
Intuitively, the law of large numbers…
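The intuition can be checked with a tiny simulation (all numbers here are invented for illustration):

```python
import random

random.seed(0)
true_weight = 1200  # hypothetical true weight of the ox (lbs)

# Each voter makes a noisy but unbiased guess around the true weight.
guesses = [true_weight + random.gauss(0, 100) for _ in range(1000)]

crowd_average = sum(guesses) / len(guesses)
# By the law of large numbers, the crowd's average lands much closer
# to the true weight than a typical individual guess does.
```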
An ensemble of classifiers is a set of classifiers
whose individual decisions are combined in some
way to classify new examples (Dietterich, 2000)
Diversity vs accuracy
An ensemble of classifiers must be more accurate
than any of its individual members.
The individual classifiers composing an ensemble must be accurate and diverse:
An accurate classifier is one that has an error rate better than random guessing on new examples
Two classifiers are diverse if they make different errors on
new data points.
Why it can be a good idea to build an ensemble
It is possible to build good ensembles for three fundamental reasons (Dietterich, 2000):
1. Statistical: if there is little data
2. Computational: enough data, but local optima produced by local search
3. Representational: when the true function f cannot be represented by any of the hypotheses in H (weighted sums of hypotheses drawn from H might expand the space of representable functions)
Base learner: an arbitrary learning algorithm which could be used on its own.
Ensemble learner: a learning algorithm composed of a set of base learners. The base learners may be organized in some structure.
However, the distinction is not completely clear cut: e.g. a linear classifier can be seen as a combination of multiple simple learners, in the sense that each dimension is handled by a simple learner.
The main purpose of an ensemble:
maximising individual accuracy and diversity
Different learners use different algorithms, hyperparameters, input representations, or training sets.
No Free Lunch Theorem: there is no single algorithm that is always the most accurate in all domains.
Instead, generate a group of base-learners which, when combined, has higher accuracy.
Methods for Constructing Ensembles
How do we generate base-learners that complement each other?
How do we combine the outputs of the base learners for maximum accuracy?
Use bootstrapping to generate L training sets.
Train L base learners using an unstable learning procedure.
During testing, take the average (majority vote for classification).
In bagging, generating complementary base-learners is left to chance and to the instability of the learning method.
**Unstable algorithm: a small change in the training set causes a large difference in the base-learner.
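A minimal sketch of bagging on a toy one-dimensional problem. Everything here (the stump base learner, the data, the function names) is illustrative; real bagging typically uses a genuinely unstable learner such as a full decision tree:

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a 1-D decision stump: choose the sampled threshold and sign
    that minimize training error on `data` (a list of (x, label) pairs)."""
    best = None
    for x, _ in data:
        for sign in (1, -1):
            preds = [1 if sign * (xi - x) >= 0 else -1 for xi, _ in data]
            err = sum(p != y for p, (_, y) in zip(preds, data))
            if best is None or err < best[0]:
                best = (err, x, sign)
    _, thr, sign = best
    return lambda xi: 1 if sign * (xi - thr) >= 0 else -1

def bag(data, L, rng):
    """Bagging: train L stumps on bootstrap samples (drawn with
    replacement), then predict by majority vote."""
    learners = [train_stump([rng.choice(data) for _ in data])
                for _ in range(L)]
    def predict(x):
        return Counter(h(x) for h in learners).most_common(1)[0][0]
    return predict

rng = random.Random(0)
data = [(x / 10, 1 if x >= 5 else -1) for x in range(10)]
ensemble = bag(data, L=11, rng=rng)
```

An odd number of voters (L=11) avoids ties in the two-class majority vote.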
Boosting: Weak learner vs Strong learner
In boosting, we actively try to generate
complementary base-learners by training the next
learner on the mistakes of the previous learners.
The original boosting algorithm (Schapire 1990)
combines three weak learners to generate a strong learner.
A weak learner has error probability less than
1/2, which makes it better than random guessing on
a two-class problem
A strong learner has arbitrarily small error probability.
Boosting (ii) [Alpaydin, 2012: 431]
Given a large training set, we randomly divide it into three sets: X1, X2, and X3.
We use X1 and train d1.
We then take X2 and feed it to d1. We take all
instances misclassified by d1 and also as many
instances on which d1 is correct from X2, and these
together form the training set of d2.
We then take X3 and feed it to d1 and d2. The
instances on which d1 and d2 disagree form the
training set of d3.
During testing, given an instance, we give it to d1
and d2; if they agree, that is the response, otherwise
the response of d3 is taken as the output.
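The test-time rule just described is easy to state in code (the function name is ours):

```python
def boost3_predict(d1, d2, d3, x):
    """Schapire's three-learner combination: if d1 and d2 agree,
    return their common answer; otherwise defer to d3."""
    y1, y2 = d1(x), d2(x)
    return y1 if y1 == y2 else d3(x)
```

When d1 and d2 agree, d3 is never consulted; d3 only arbitrates their disagreements.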
Though it is quite successful, the disadvantage of
the original boosting method is that it requires a very
large training sample.
AdaBoost (adaptive boosting)
AdaBoost uses the same training set over and over, and thus the training set need not be large.
Classifiers must be simple so they do not overfit.
Can combine an arbitrary number of base
learners, not only three.
AdaBoost trains a sequence of base-learners on modified versions of the training set: the probability of a correctly classified instance is decreased, and the probability of a misclassified instance is increased.
This has the effect that the next base-learner focuses more on instances misclassified by the previous ones.
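One round of this reweighting can be sketched as follows, using the multiplicative update beta = eps/(1 - eps) from AdaBoost.M1 (the function name and example weights are illustrative):

```python
def adaboost_reweight(weights, correct):
    """One AdaBoost.M1 round: shrink the weights of correctly
    classified instances by beta = eps / (1 - eps), then renormalize
    so the weights again form a probability distribution."""
    eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error
    beta = eps / (1 - eps)  # beta < 1 whenever the learner beats chance
    new = [w * beta if c else w for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

# Four equally weighted instances; the learner gets the last one wrong.
updated = adaboost_reweight([0.25] * 4, [True, True, True, False])
```

After this update the single misclassified instance carries half of the total probability mass, so the next base-learner concentrates on it.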
Given an instance, all the classifiers decide and a
weighted vote is taken.
The weights are proportional to the base learners’
accuracies on the training set.
The success of AdaBoost is due to its property of increasing the margin. If the margin increases, the training instances are better separated and errors are less likely. (This aim is similar to SVMs.)
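The weighted vote itself is a one-liner for labels in {-1, +1} (names are ours; in AdaBoost each learner's weight grows with its training accuracy):

```python
def weighted_vote(learners, alphas, x):
    """Combine base learners by a vote weighted by their
    accuracy-derived coefficients alphas, for labels in {-1, +1}."""
    score = sum(a * h(x) for h, a in zip(learners, alphas))
    return 1 if score >= 0 else -1
```

Note that a single highly weighted, accurate learner can outvote several weaker ones.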
In stacking, the combiner f() is another learner and is not restricted to being a linear combination as in voting.
The combiner system should learn how the base
learners make errors.
Stacking is a means of estimating and correcting for
the biases of the base-learners.
Therefore, the combiner should be trained on data
unused in training the base-learners
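The essential wiring, with the training of the combiner on held-out data omitted, might look like this (all names are illustrative):

```python
def stacked_predict(base_learners, combiner, x):
    """Stacking: the base learners' outputs become the input features
    of the combiner f(), which is itself a trained model.  Crucially,
    the combiner must have been fit on predictions made for data the
    base learners were NOT trained on."""
    z = [h(x) for h in base_learners]  # meta-features
    return combiner(z)
```

Here the combiner could be any learner; a fixed majority vote is just the special case where it is not trained at all.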
Use dj only if all preceding learners are not confident; the learners are applied in order of increasing complexity.
Cascading is a multistage method, and we use dj
only if all preceding learners are not confident.
Associated with each learner is a confidence wj such
that we say dj is confident of its output and can be
used if wj > θj (the threshold).
Not confident: misclassifications as well as instances for which the posterior is not high enough; these are passed on to the next learner.
Important: the idea is that an early simple classifier handles the majority of instances, and a more complex classifier is used only for a small percentage, so it does not significantly increase the overall computational cost.
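A sketch of the cascade's decision rule, assuming each learner returns a (label, confidence) pair; the thresholds and all names are illustrative:

```python
def cascade_predict(learners, thresholds, x):
    """Cascading: consult learners in order of increasing complexity and
    return the first output whose confidence w_j exceeds theta_j; if no
    learner is confident, fall back on the last learner's answer."""
    for h, theta in zip(learners, thresholds):
        y, w = h(x)  # each learner reports its output and its confidence
        if w > theta:
            return y
    return y

# A cheap first-stage model that is confident only on "easy" inputs,
# backed by a costlier model that always answers.
cheap = lambda x: (1, 0.9) if x < 5 else (1, 0.2)
costly = lambda x: (-1, 0.95)
```

For an easy input the cheap learner answers immediately; a hard input falls through to the costly one.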
It is often a good idea to combine several learning methods.
We want diverse classifiers, so that their errors cancel out.
However, remember: ensemble methods do not always improve accuracy.
In the case of arc-factored graph-based parsing, we relied on finding a spanning tree over a dense graph over the input words: a dense graph is a graph that contains all possible arcs, and a spanning tree is a tree that has an incoming arc for every word.
Combining multiple learners has been a popular
topic in machine learning since the early 1990s, and
research has been going on ever since.
Recently, it has been noticed that ensembles do not
always improve accuracy and research has started
to focus on the criteria that a good ensemble should
satisfy or how to form a good one.