Lecture No. 2

       Ravi Gupta
 AU-KBC Research Centre,
MIT Campus, Anna University




                              Date: 8.3.2008
Today’s Agenda

•   Recap (FIND-S Algorithm)
•   Version Space
•   Candidate-Elimination Algorithm
•   Decision Tree
•   ID3 Algorithm
•   Entropy
Concept Learning as Search

Concept learning can be viewed as the task of searching through
a large space of hypotheses implicitly defined by the hypothesis
representation.


The goal of the concept learning search is to find the hypothesis
that best fits the training examples.
General-to-Specific Learning
                                                     Every day Tom enjoys
                                                       his sport, i.e., only
                                                     positive examples.


  Most General Hypothesis: h = <?, ?, ?, ?, ?, ?>




 Most Specific Hypothesis: h = < Ø, Ø, Ø, Ø, Ø, Ø>
General-to-Specific Learning




             h2 is more general than h1

     h2 imposes fewer constraints on the instance than h1
Definition

Given hypotheses hj and hk, hj is more_general_than_or_equal_to
hk if and only if any instance that satisfies hk also satisfies hj.




We can also say that hj is more_specific_than hk when hk is
more_general_than hj.
FIND-S: Finding a Maximally
    Specific Hypothesis
Step 1: FIND-S




h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
Step 2: FIND-S




                          h0 = <Ø, Ø, Ø, Ø, Ø, Ø>

                     a1      a2    a3   a4   a5     a6

              x1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Iteration 1

              h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
h1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Iteration 2
                        x2 = <Sunny, Warm, High, Strong, Warm, Same>




              h2 = <Sunny, Warm, ?, Strong, Warm, Same>
Iteration 3   x3 is a negative example, so FIND-S ignores it:   h3 = <Sunny, Warm, ?, Strong, Warm, Same>
h3 = < Sunny, Warm, ?, Strong, Warm, Same >


Iteration 4
                        x4 = < Sunny, Warm, High, Strong, Cool, Change >


Step 3

Output        h4 = <Sunny, Warm, ?, Strong, ?, ?>
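
To make the procedure concrete, here is a minimal FIND-S sketch in Python for this six-attribute EnjoySport example (a sketch; variable names are illustrative, and the training list is the four examples used in the lecture):

# Minimal FIND-S sketch for conjunctive hypotheses over discrete attributes.
# 'None' plays the role of the null constraint Ø and '?' accepts any value.

def find_s(examples):
    n = len(examples[0][0])
    h = [None] * n                      # h0 = <Ø, Ø, ..., Ø>
    for x, label in examples:
        if label != 'Yes':              # FIND-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:            # first positive example: copy it
                h[i] = value
            elif h[i] != value:         # disagreement: generalize to '?'
                h[i] = '?'
    return h

# The four EnjoySport training examples used in the lecture.
training_data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]

print(find_s(training_data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']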
Unanswered Questions by FIND-S

• Has the learner converged to the correct target
  concept?

• Why prefer the most specific hypothesis?

•   What if the training examples are inconsistent?
Version Space

The set of all hypotheses consistent with the training examples
is called the version space (VS) with respect to the hypothesis
space H and the given example set D.
Candidate-Elimination Algorithm

  The Candidate-Elimination algorithm finds all describable hypotheses
  that are consistent with the observed training examples




  Hypotheses are derived from the examples regardless of whether x is a
  positive or a negative example
Candidate-Elimination Algorithm




    Earlier
(i.e., FIND-S)
      Def.
LIST-THEN-ELIMINATE Algorithm
    to Obtain Version Space
LIST-THEN-ELIMINATE Algorithm
    to Obtain Version Space
                       Examples
Hypothesis Space


                   .
                                  Version Space
                   .
                   .
                   .
                   .
                                       VSH,D
                   .
       H



                         D
LIST-THEN-ELIMINATE Algorithm
    to Obtain Version Space

 • In principle, the LIST-THEN-ELIMINATE algorithm can be
 applied whenever the hypothesis space H is finite.

 • It is guaranteed to output all hypotheses consistent with the
 training data.

 • Unfortunately, it requires exhaustively enumerating all
 hypotheses in H, an unrealistic requirement for all but the most
 trivial hypothesis spaces.
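
A minimal sketch of LIST-THEN-ELIMINATE in Python, assuming the hypothesis space is given explicitly as a finite list of conjunctive hypotheses represented as tuples of attribute constraints ('?' meaning any value):

# LIST-THEN-ELIMINATE: start with every hypothesis in H and discard any
# hypothesis that misclassifies a training example. What remains is the
# version space VS_{H,D}.

def matches(h, x):
    """A conjunctive hypothesis h covers instance x."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def list_then_eliminate(H, examples):
    version_space = list(H)
    for x, label in examples:
        version_space = [h for h in version_space
                         if matches(h, x) == (label == 'Yes')]
    return version_space

Even for the small EnjoySport space, enumerating H explicitly is costly, which is exactly the practical objection raised above.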
Candidate-Elimination Algorithm

  • The CANDIDATE-ELIMINATION algorithm works on the same
  principle as the above LIST-THEN-ELIMINATE algorithm.

  • It employs a much more compact representation of the version
  space.

  • Here the version space is represented by its most general and its most
  specific (least general) members.

  • These members form general and specific boundary sets that delimit
  the version space within the partially ordered hypothesis space.
Least General
  (Specific)




Most General
Candidate-Elimination Algorithm
Example




                 G0 ← {<?, ?, ?, ?, ?, ?>}

Initialization

                 S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø >}
G0 ← {<?, ?, ?, ?, ?, ?>}

                        S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø >}




              x1 = <Sunny, Warm, Normal, Strong, Warm, Same>
Iteration 1
                        G1 ← {<?, ?, ?, ?, ?, ?>}

              S1 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}




                 x2 = <Sunny, Warm, High, Strong, Warm, Same>
Iteration 2
                         G2 ← {<?, ?, ?, ?, ?, ?>}

              S2 ← {< Sunny, Warm, ?, Strong, Warm, Same >}
G2 ← {<?, ?, ?, ?, ?, ?>}

              S2 ← {< Sunny, Warm, ?, Strong, Warm, Same >}



                               consistent



                x3 = <Rainy, Cold, High, Strong, Warm, Change>
Iteration 3
                S3 ← {< Sunny, Warm, ?, Strong, Warm, Same >}


       G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}




                             G2 ← {<?, ?, ?, ?, ?, ?>}
S3 ← {< Sunny, Warm, ?, Strong, Warm, Same >}


     G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}




                x4 = <Sunny, Warm, High, Strong, Cool, Change>
Iteration 4
                S4 ← {< Sunny, Warm, ?, Strong, ?, ? >}


                 G4 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}



         G3 ← {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
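
The boundary updates traced above can be written down compactly. Below is a Python sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses over discrete attributes (a sketch, not the textbook pseudo-code verbatim: the attribute domains are taken from the values seen in the lecture's examples, pruning of redundant S members is omitted, and the names are illustrative):

# Hypotheses are tuples whose entries are a value, '?' (any value),
# or None (the null constraint Ø).

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """True if every instance covered by h2 is also covered by h1."""
    if any(c is None for c in h2):          # h2 covers nothing
        return True
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def strictly_more_general(h1, h2):
    return more_general_or_equal(h1, h2) and not more_general_or_equal(h2, h1)

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple([None] * n)]                  # most specific boundary
    G = [tuple(['?'] * n)]                   # most general boundary
    for x, label in examples:
        if label == 'Yes':                   # positive example
            G = [g for g in G if matches(g, x)]
            new_S = []
            for s in S:
                if matches(s, x):
                    new_S.append(s)
                    continue
                # unique minimal generalization of s that covers x
                h = tuple(v if c is None else (c if c == v else '?')
                          for c, v in zip(s, x))
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.append(h)
            S = new_S
        else:                                # negative example
            S = [s for s in S if not matches(s, x)]
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                # minimal specializations of g that exclude x
                for i, c in enumerate(g):
                    if c != '?':
                        continue
                    for v in domains[i]:
                        if v == x[i]:
                            continue
                        h = g[:i] + (v,) + g[i + 1:]
                        if any(more_general_or_equal(h, s) for s in S):
                            new_G.append(h)
            G = [g for g in new_G
                 if not any(strictly_more_general(g2, g) for g2 in new_G)]
    return S, G

examples = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong',), ('Warm', 'Cool'), ('Same', 'Change')]

S, G = candidate_elimination(examples, domains)
print(S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]

Running this on the four training examples reproduces S4 and G4 from the trace above.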
Remarks on Version Spaces and
    Candidate-Elimination

The version space learned by the CANDIDATE-ELIMINATION algorithm
will converge toward the hypothesis that correctly describes the target
concept, provided

 (1) there are no errors in the training examples, and

(2) there is some hypothesis in H that correctly
describes the target concept.
What will Happen if the Training
      Data Contains Errors?



                             No
G0 ← {<?, ?, ?, ?, ?, ?>}

                        S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø >}




              x1 = <Sunny, Warm, Normal, Strong, Warm, Same>
Iteration 1
                        G1 ← {<?, ?, ?, ?, ?, ?>}

              S1 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}




                 x2 = <Sunny, Warm, High, Strong, Warm, Same>
Iteration 2
                         G2 ← {<?, ?, Normal, ?, ?, ?>}

              S2 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}
G2 ← {<?, ?, Normal, ?, ?, ?>}

              S2 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}



                            consistent



               x3 = <Rainy, Cold, High, Strong, Warm, Change>
Iteration 3
               S3 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}

                          G3 ← {<?, ?, Normal, ?, ?, ?>}
S3 ← {< Sunny, Warm, Normal, Strong, Warm, Same >}

                         G3 ← {<?, ?, Normal, ?, ?, ?>}




              x4 = <Sunny, Warm, High, Strong, Cool, Change>
Iteration 4
                              S4 ← { }
                                                        Empty

                              G4 ← { }




                       G3 ← {<?, ?, Normal, ?, ?, ?>}
What will Happen if the Target
       Hypothesis is not Present in H?
Remarks on Version Spaces and
    Candidate-Elimination


The target concept is exactly learned when
the S and G boundary sets converge to a
single, identical, hypothesis.
Remarks on Version Spaces and
    Candidate-Elimination

How Can Partially Learned Concepts Be Used?
  Suppose that no additional training examples are available beyond
  the four in our example, and the learner is now required to classify
  new instances that it has not yet observed.




   The target concept is exactly learned when
   the S and G boundary sets converge to a
   single, identical, hypothesis.
Remarks on Version Spaces and
    Candidate-Elimination
Remarks on Version Spaces and
    Candidate-Elimination



  All six hypotheses satisfied




  All six hypotheses satisfied
Remarks on Version Spaces and
    Candidate-Elimination


  Three hypotheses satisfied
  Three hypotheses not satisfied




  Two hypotheses satisfied
  Four hypotheses not satisfied
Remarks on Version Spaces and
    Candidate-Elimination


                         Yes
                         No
Decision Trees
Decision Trees


• Decision tree learning is a method for approximating
discrete-valued target functions, in which the learned function
is represented by a decision tree.

• Decision trees can also be represented as if-then-else rules.

• Decision tree learning is one of the most widely used
approaches for inductive inference.
Decision Trees




An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example. This process
is then repeated for the subtree rooted at the new node.
Decision Trees




<Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>




                       PlayTennis = No
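
As a concrete illustration, here is a nested-dictionary encoding of the PlayTennis tree on the slide (the structure is assumed from the standard PlayTennis example, consistent with the if-then rules given later) and the top-down classification walk described above:

# The PlayTennis tree as nested structures: an inner node is
# (attribute_name, {attribute_value: subtree}); a leaf is just the label.
play_tennis_tree = ('Outlook', {
    'Sunny':    ('Humidity', {'High': 'No', 'Normal': 'Yes'}),
    'Overcast': 'Yes',
    'Rain':     ('Wind', {'Strong': 'No', 'Weak': 'Yes'}),
})

def classify(tree, instance):
    """Walk from the root, following the branch for the instance's attribute value."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

x = {'Outlook': 'Sunny', 'Temperature': 'Hot', 'Humidity': 'High', 'Wind': 'Strong'}
print(classify(play_tennis_tree, x))   # 'No'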
Decision Trees
Edges: attribute values
Intermediate nodes: attributes (A1, A2, A3, ...)
Leaf nodes: output values
Decision Trees

                                                     conjunction
                            disjunction




Decision trees represent a disjunction of conjunctions of constraints
on the attribute values of instances.

Each path from the tree root to a leaf corresponds to a conjunction of
attribute tests, and the tree itself to a disjunction of these
conjunctions.
Decision Trees
Decision Trees (F = A ^ B')
                   F = A ^ B'
      If (A=True and B = False) then Yes
      else
          No

                                           If then else form
                 A
                   False → No
                   True  → B
                             False → Yes
                             True  → No
Decision Trees (F = A V (B ^ C))

 If (A=True) then Yes
 else if (B = True and C=True) then Yes              If then else form
       else No


                           A
                             True  → Yes
                             False → B
                                       False → No
                                       True  → C
                                                 False → No
                                                 True  → Yes
Decision Trees (F = A XOR B)
            F = (A ^ B') V (A' ^ B)

  If (A=True and B=False) then Yes
  else if (A=False and B=True) then Yes                  If then else form
       else No


                                      A
                                        False → B
                                                  False → No
                                                  True  → Yes
                                        True  → B
                                                  False → Yes
                                                  True  → No
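
A quick truth-table check (a small sanity-check snippet, not from the slides) that the tree above computes XOR:

# Evaluate the XOR tree for all four input combinations and compare with
# the boolean expression F = (A and not B) or (not A and B).
def tree_xor(a, b):
    if not a:                 # A = False branch
        return b              # B = True -> Yes, B = False -> No
    return not b              # A = True branch: B = False -> Yes, B = True -> No

for a in (False, True):
    for b in (False, True):
        assert tree_xor(a, b) == ((a and not b) or (not a and b))
print("tree matches F = (A ^ B') V (A' ^ B)")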
Decision Trees as If-then-else rule
                                                    conjunction
                             disjunction




   If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes
   If (Outlook = Overcast) then PlayTennis = Yes
   If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
Problems Suitable for Decision Trees

 • Instances are represented by attribute-value pairs
     Instances are described by a fixed set of attributes (e.g., Temperature) and
     their values (e.g., Hot). The easiest situation for decision tree learning is when
     each attribute takes on a small number of disjoint possible values (e.g., Hot,
     Mild, Cold). However, extensions to the basic algorithm allow handling real-
     valued attributes as well (e.g., representing Temperature numerically).

 • The target function has discrete output values

 • Disjunctive descriptions may be required

 • The training data may contain errors

 • The training data may contain missing attribute values
Basic Decision Tree Learning Algorithm

 • ID3 Algorithm (Quinlan 1986) and its
 successors C4.5 and C5.0

 • Employs a top-down, greedy search through
 the space of possible decision trees. The
 algorithm never backtracks to reconsider
 earlier choices.

     An instance is classified by starting at the root
     node of the tree, testing the attribute specified
     by this node, then moving down the tree
     branch corresponding to the value of the
     attribute in the given example. This process is
     then repeated for the subtree rooted at the
     new node.

                                                          http://www.rulequest.com/Personal/
ID3 Algorithm
Example
Attributes…




Attributes are Outlook, Temperature, Humidity, Wind
Building Decision Tree
Building Decision Tree

A generic tree under construction: the root tests attribute A1 and branches on its
values; one branch ends in an output value, the others lead to further test
attributes A2 and A3, each of which branches on its values to leaf output values.
Building Decision Tree

              Outlook, Temperature, Humidity, Wind

              Which attribute to select for the root node?
Which Attribute to Select ??

  • We would like to select the attribute that is most useful for
  classifying examples.

  • What is a good quantitative measure of the worth of an
  attribute?




  ID3 uses a statistical measure called information gain to select among the
  candidate attributes at each step while growing the tree.
Information Gain

Information gain is based on an information-theory concept called entropy

“Nothing in life is certain except death, taxes and the second law of
thermodynamics. All three are processes in which useful or accessible forms of
some quantity, such as energy or money, are transformed into useless,
inaccessible forms of the same quantity. That is not to say that these three
processes don’t have fringe benefits: taxes pay for roads and schools; the
second law of thermodynamics drives cars, computers and metabolism; and
death, at the very least, opens up tenured faculty positions”

Seth Lloyd, writing in Nature 430, 971 (26 August 2004).

Rudolf Julius Emanuel Clausius (January 2, 1822 – August 24, 1888) was a
German physicist and mathematician and is considered one of the central
founders of the science of thermodynamics.

Claude Elwood Shannon (April 30, 1916 – February 24, 2001), an American
electrical engineer and mathematician, has been called "the father of
information theory".
Entropy

• In information theory, the Shannon entropy or
information entropy is a measure of the uncertainty
associated with a random variable.

• It quantifies the information contained in a
message, usually in bits or bits/symbol.

• It is the minimum message length necessary to
communicate information.
Why Shannon named his uncertainty
      function "entropy"?

                                                                                 John von Neumann




 My greatest concern was what to call it. I thought of calling it 'information,' but the
 word was overly used, so I decided to call it 'uncertainty.' When I discussed it with
 John von Neumann, he had a better idea. Von Neumann told me, 'You should call
 it entropy, for two reasons. In the first place your uncertainty function has
 been used in statistical mechanics under that name, so it already has a name.
 In the second place, and more important, no one really knows what entropy
 really is, so in a debate you will always have the advantage.'
Shannon's mouse


Shannon and his famous
electromechanical mouse
Theseus, named after the hero of
Greek mythology famed for the
Minotaur and the Labyrinth, and which he
tried to teach to come out of the
maze in one of the first
experiments in artificial
intelligence.
Entropy


The information entropy of a discrete random variable X that can take on
possible values {x1, ..., xn} is



where
   I(X) is the information content or self-information of X, which is itself a
   random variable; and
   p(xi) = Pr(X=xi) is the probability mass function of X.
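
The formula referenced here is the standard Shannon entropy (a reconstruction; the slide shows it as an image):

    H(X) = E[I(X)] = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)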
Entropy in our Context

Given a collection S, containing positive and negative
examples of some target concept, the entropy of S relative to
this boolean classification (yes/no) is




 where p⊕ is the proportion of positive examples in S and p⊖ is the
 proportion of negative examples in S. In all calculations involving
 entropy we define 0 log 0 to be 0.
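
The formula the slide refers to here (standard form, base-2 logarithms) is:

    Entropy(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus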
Example




There are 14 examples: 9 positive and 5 negative [9+, 5-].

The entropy of S relative to this boolean (yes/no) classification is
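
Working the numbers out (the slide shows this computation as an image; the values below follow directly from [9+, 5-]):

    Entropy(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.940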
Information Gain Measure

 Information gain, is simply the expected reduction in entropy
 caused by partitioning the examples according to this attribute.

 More precisely, the information gain, Gain(S, A) of an attribute A,
 relative to a collection of examples S, is defined as




 where Values(A) is the set of all possible values for attribute A,
 and Sv, is the subset of S for which attribute A has value v, i.e.,
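
In symbols (standard definition; the slide shows it as an image):

    Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)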
Information Gain Measure



                                                 Entropy of S after
                    Entropy of S
                                                     partition




Gain(S, A) is the expected reduction in entropy caused by knowing the value of
attribute A.

Gain(S, A) is the information provided about the target function value, given the
value of some other attribute A. The value of Gain(S, A) is the number of bits
saved when encoding the target value of an arbitrary member of S, by knowing
the value of attribute A.
Example




There are 14 examples: 9 positive and 5 negative [9+, 5-].

The entropy of S relative to this boolean (yes/no) classification is
Gain (S, Attribute = Wind)
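
The slide shows this computation as a figure. Below is a small Python sketch of the same calculation, assuming the standard 14-example split for Wind (Weak: [6+, 2-], Strong: [3+, 3-]); these counts are an assumption taken from the standard PlayTennis table, not read off the slide:

from math import log2

def entropy(pos, neg):
    """Entropy of a collection with 'pos' positive and 'neg' negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                         # 0 log 0 is defined as 0
            p = count / total
            result -= p * log2(p)
    return result

# S = [9+, 5-]; Wind partitions S into Weak = [6+, 2-] and Strong = [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))                # 0.048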
Gain (S,A)
Gain (SSunny,A)




Temperature          Humidity             Wind
(Hot)  {0+, 2-}      (High)   {0+, 3-}    (Weak)   {1+, 2-}
(Mild) {1+, 1-}      (Normal) {2+, 0-}    (Strong) {1+, 1-}
(Cool) {1+, 0-}
Gain (SSunny,A)
                    Entropy(SSunny) = - { 2/5 log(2/5) + 3/5 log(3/5) } = 0.97095


 Temperature             Entropy(Hot) = 0
 (Hot)  {0+, 2-}          Entropy(Mild) = 1
 (Mild) {1+, 1-}          Entropy(Cool) = 0
 (Cool) {1+, 0-}          Gain(SSunny, Temperature) = 0.97095 – 2/5*0 – 2/5*1 – 1/5*0 = 0.57095


     Humidity             Entropy(High) = 0
 (High)   {0+, 3-}        Entropy(Normal) = 0
 (Normal) {2+, 0-}        Gain(SSunny, Humidity) = 0.97095 – 3/5*0 – 2/5*0 = 0.97095


       Wind               Entropy(Weak) = 0.9183
 (Weak)   {1+, 2-}        Entropy(Strong) = 1.0
 (Strong) {1+, 1-}        Gain(SSunny, Wind) = 0.97095 – 3/5*0.9183 – 2/5*1 = 0.01997
Modified Decision Tree
Gain (SRain,A)




Temperature          Humidity             Wind
(Hot)  {0+, 0-}      (High)   {1+, 1-}    (Weak)   {3+, 0-}
(Mild) {2+, 1-}      (Normal) {2+, 1-}    (Strong) {0+, 2-}
(Cool) {1+, 1-}
Gain (SRain,A)
                    Entropy(SRain) = - { 3/5 log(3/5) + 2/5 log(2/5) } = 0.97095


 Temperature             Entropy(Hot) = 0 (empty subset)
 (Hot)  {0+, 0-}          Entropy(Mild) = 0.9183
 (Mild) {2+, 1-}          Entropy(Cool) = 1.0
 (Cool) {1+, 1-}          Gain(SRain, Temperature) = 0.97095 – 0 – 3/5*0.9183 – 2/5*1 = 0.01997


     Humidity             Entropy(High) = 1.0
 (High)   {1+, 1-}        Entropy(Normal) = 0.9183
 (Normal) {2+, 1-}        Gain(SRain, Humidity) = 0.97095 – 2/5*1.0 – 3/5*0.9183 = 0.01997


       Wind               Entropy(Weak) = 0.0
 (Weak)   {3+, 0-}        Entropy(Strong) = 0.0
 (Strong) {0+, 2-}        Gain(SRain, Wind) = 0.97095 – 3/5*0 – 2/5*0 = 0.97095
Final Decision Tree
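
Tying the pieces together, here is a compact ID3 sketch in Python (a sketch assuming discrete attributes, no missing values, and examples given as attribute-to-value dicts; function and variable names are illustrative):

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Expected reduction in entropy from partitioning on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                       # pure node -> leaf
        return labels[0]
    if not attributes:                              # no attributes left -> majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [a for a in attributes if a != best])
    return (best, branches)

Called on the standard 14-example PlayTennis table with attributes ['Outlook', 'Temperature', 'Humidity', 'Wind'], this reproduces the tree developed above: Outlook at the root, Humidity under Sunny, and Wind under Rain.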
Home work
Home work
Home work
a1
(True) {2+, 1-}
(False) {1+, 2-}

Entropy(a1=True) = -{2/3log(2/3) + 1/3log(1/3)} = 0.9183
Entropy(a1=False) = 0.9183
Gain (S, a1) = 1 – 3/6*0.9183 – 3/6*0.9183 = 0.0817        S {3+, 3-} => Entropy(S) = 1




a2                 Entropy(a2=True) = 1.0
(True) {2+, 2-}    Entropy(a2=False) = 1.0
(False) {1+, 1-}   Gain (S, a2) = 1 – 4/6*1 – 2/6*1 = 0.0
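
These two gains can be double-checked with a few lines (a sketch; the split counts are the ones listed above):

from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

# S = [3+, 3-]; a1 splits it into {2+, 1-} / {1+, 2-}, a2 into {2+, 2-} / {1+, 1-}
gain_a1 = entropy(3, 3) - (3 / 6) * entropy(2, 1) - (3 / 6) * entropy(1, 2)
gain_a2 = entropy(3, 3) - (4 / 6) * entropy(2, 2) - (2 / 6) * entropy(1, 1)
print(round(gain_a1, 4), round(gain_a2, 4))   # 0.0817 0.0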
Home work

               a1
                 True  → [D1, D2, D3]
                 False → [D4, D5, D6]
Home work

                             a1
                               True  → [D1, D2, D3], split on a2:
                                         True  → + (Yes)
                                         False → - (No)
                               False → [D4, D5, D6], split on a2:
                                         True  → - (No)
                                         False → + (Yes)
Home work
                            a1
                              True  → [D1, D2, D3], split on a2:
                                        True  → + (Yes)
                                        False → - (No)
                              False → [D4, D5, D6], split on a2:
                                        True  → - (No)
                                        False → + (Yes)


           (a1 ^ a2) V (a1' ^ a2')
Some Insights into Capabilities and
       Limitations of ID3 Algorithm
•   ID3 searches a complete hypothesis space. [Advantage]

•   ID3 maintains only a single current hypothesis as it searches through
    the space of decision trees. By committing to a single
    hypothesis, ID3 loses the capabilities that follow from explicitly
    representing all consistent hypotheses. [Disadvantage]

•   ID3 in its pure form performs no backtracking in its search. Once it
    selects an attribute to test at a particular level in the tree, it never
    backtracks to reconsider this choice. Therefore, it is susceptible to
    the usual risks of hill-climbing search without backtracking:
    converging to locally optimal solutions that are not globally optimal.
    [Disadvantage]
Some Insights into Capabilities and
       Limitations of ID3 Algorithm

•   ID3 uses all training examples at each step in the search to make
    statistically based decisions regarding how to refine its current
    hypothesis. This contrasts with methods that make decisions
    incrementally, based on individual training examples (e.g., FIND-S
    or CANDIDATE-ELIMINATION). One advantage of using statistical
    properties of all the examples (e.g., information gain) is that the
    resulting search is much less sensitive to errors in individual training
    examples. [Advantage]
