Unit-2: Bayes Decision Theory
2. Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
1. Data collection
Attributes may also be called parameters, features, or variables.
• Errors will propagate: an error introduced at the very beginning (the 1st step) carries through every later step (1st step → 2nd → 3rd → 4th).
3. Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some records modified but others not, dangling references, …
Timeliness: is the data updated in a timely manner?
Believability: how much are the data trusted to be correct?
Interpretability: how easily can the data be understood?
4. Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
5. Histogram Analysis
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): roughly the same number of values per bucket
[Figure: histogram with counts 0–40 on the y-axis over values 10,000–90,000 on the x-axis, using an interval of 10,000]
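A minimal sketch of the two partitioning rules, assuming NumPy is available; the data values and bucket count below are invented for illustration.

```python
import numpy as np

data = np.array([12000, 15000, 23000, 31000, 34000, 45000,
                 52000, 58000, 61000, 77000, 81000, 88000])
n_buckets = 4

# Equal-width: every bucket spans the same value range.
width_edges = np.linspace(data.min(), data.max(), n_buckets + 1)

# Equal-frequency (equal-depth): every bucket holds ~the same count.
freq_edges = np.quantile(data, np.linspace(0, 1, n_buckets + 1))

for name, edges in [("equal-width", width_edges), ("equal-frequency", freq_edges)]:
    counts, _ = np.histogram(data, bins=edges)
    print(name, edges.round(0), counts)
```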
6. Correlation Analysis (Nominal Data)
χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

The larger the χ² value, the more likely the variables (features, attributes) are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
The # of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
7. Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

|                          | Play chess | Not play chess | Sum (row) |
|--------------------------|------------|----------------|-----------|
| Like science fiction     | 250 (90)   | 200 (360)      | 450       |
| Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
| Sum (col.)               | 300        | 1200           | 1500      |

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

This shows that like_science_fiction and play_chess are correlated in the group.
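A hedged check of these numbers with SciPy (assuming it is available); chi2_contingency recomputes both the expected counts and the statistic from the observed table.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)                # [[ 90. 360.] [210. 840.]] -- matches the parentheses
print(round(chi2, 2), dof, p)  # 507.93, 1 degree of freedom, p ~ 0
```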
8. Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σaibi is the sum of the AB cross-products.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated.
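A small sketch checking the formula against NumPy's built-in Pearson correlation; the two attribute vectors are invented for illustration.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(a)

# The formula above, with sample standard deviations (ddof=1).
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy = np.corrcoef(a, b)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # both ~ 0.9407
```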
10. Covariance (Numeric Data)
Covariance is similar to correlation:

$$\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
Positive covariance: if CovA,B > 0, then A and B both tend to be larger than their expected values.
Negative covariance: if CovA,B < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, CovA,B = 0, but the converse is not true:
Some pairs of random variables have a covariance of 0 yet are not independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$$
11. Covariance: An Example
Computation can be simplified as:

$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$$

Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together, since Cov(A,B) > 0.
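A quick check of the worked example with NumPy; note that bias=True selects the 1/n convention used in the slide.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_manual = (A * B).mean() - A.mean() * B.mean()  # E(A.B) - mean(A)*mean(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]
print(cov_manual, cov_numpy)  # 4.0 4.0
```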
12. The Normal Distribution
[Figure: normal density curves f(X) plotted against X]
Changing μ shifts the distribution left or right.
Changing σ increases or decreases the spread.
Notation: μ and σ denote the mean and standard deviation of a population; x̄ and s denote those of a sample.
13. The Normal Distribution: as a mathematical function (pdf)

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Note the constants:
π = 3.14159…
e = 2.71828…
This is a bell-shaped curve with different centers and spreads depending on μ and σ.
14. The beauty of the normal curve:
No matter what μ and σ are, the area between μ−σ and μ+σ is about 68%; the area between μ−2σ and μ+2σ is about 95%; and the area between μ−3σ and μ+3σ is about 99.7%. Almost all values fall within 3 standard deviations.
16. Example
Suppose SAT scores roughly follow a normal distribution in the U.S. population of college-bound students (with range restricted to 200–800), and the average math SAT is 500 with a standard deviation of 50. Then:
68% of students will have scores between 450 and 550
95% will be between 400 and 600
99.7% will be between 350 and 650
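A quick confirmation of these three intervals with SciPy's normal CDF (assuming scipy is available).

```python
from scipy.stats import norm

mu, sigma = 500, 50
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    mass = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    print(f"between {lo} and {hi}: {mass:.3f}")
# between 450 and 550: 0.683
# between 400 and 600: 0.954
# between 350 and 650: 0.997
```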
17. Basic Formulas for Probabilities
• Product rule: the probability P(A, B) of a conjunction of two events A and B:

$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

• Sum rule: the probability of a disjunction of two events A and B:

$$P(A \vee B) = P(A) + P(B) - P(A, B)$$

• Theorem of total probability: if events A1, …, An are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
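A toy numeric check of the three rules on a made-up joint distribution over two binary events A and B.

```python
P = {  # P(A=a, B=b), an invented joint distribution
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}
P_A = P[(True, True)] + P[(True, False)]   # 0.5
P_B = P[(True, True)] + P[(False, True)]   # 0.3

# Product rule: P(A,B) = P(B|A) P(A)
P_B_given_A = P[(True, True)] / P_A
assert abs(P_B_given_A * P_A - P[(True, True)]) < 1e-12

# Sum rule: P(A or B) = P(A) + P(B) - P(A,B)
P_A_or_B = P_A + P_B - P[(True, True)]     # 0.6

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
P_B_given_notA = P[(False, True)] / (1 - P_A)
assert abs(P_B_given_A * P_A + P_B_given_notA * (1 - P_A) - P_B) < 1e-12
print(P_A_or_B, P_B)
```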
18. Basic Approach
Bayes rule:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D (the posterior)
P(D|h) = probability of D given h (the likelihood of D given h)
The goal of Bayesian learning: find the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis):

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

In classification terms (a null hypothesis versus an alternate hypothesis): into which class should the sample be put? The prediction (classification) is over the class labels, e.g. Class ∈ {0, 1}.
19. An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease is
not present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = .008, P(¬cancer) = .992
P(+|cancer) = .98, P(−|cancer) = .02
P(+|¬cancer) = .03, P(−|¬cancer) = .97

$$P(\text{cancer} \mid +) = \frac{P(+ \mid \text{cancer})\,P(\text{cancer})}{P(+)}, \qquad
P(\neg\text{cancer} \mid +) = \frac{P(+ \mid \neg\text{cancer})\,P(\neg\text{cancer})}{P(+)}$$
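A short sketch finishing the computation: P(+) follows from total probability, and the larger unnormalized posterior wins.

```python
p_cancer, p_not = 0.008, 0.992
p_pos_cancer, p_pos_not = 0.98, 0.03

joint_cancer = p_pos_cancer * p_cancer    # P(+|cancer) P(cancer) = 0.00784
joint_not = p_pos_not * p_not             # P(+|~cancer) P(~cancer) = 0.02976
p_pos = joint_cancer + joint_not          # P(+) = 0.0376

print(joint_cancer, joint_not)            # MAP decision: "no cancer"
print(joint_cancer / p_pos)               # P(cancer|+) ~ 0.21
```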
20. MAP Learner
For each hypothesis h in H, calculate the posterior probability:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Output the hypothesis hMAP with the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$

Comments:
Computationally intensive
Provides a standard for judging the performance of learning algorithms
Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
21. Bayes Optimal Classifier
Question: given a new instance x, what is its most probable classification?
hMAP(x) is not the most probable classification!
Example: let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new data x, we have h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?
Bayes optimal classification:

$$v^{*} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4 \qquad \sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$$

so the Bayes optimal classification of x is −.
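A tiny sketch of the vote-weighting above, using the example's posteriors; the dictionary names are mine.

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
votes = {"h1": "+", "h2": "-", "h3": "-"}        # each hypothesis's prediction

scores = {}
for h, p in posteriors.items():
    scores[votes[h]] = scores.get(votes[h], 0.0) + p
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' -- differs from h_MAP = h1's '+'
```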
22. Naïve Bayes Learner
Assume a target function f: X → V, where each instance x is described by attributes ⟨a1, a2, …, an⟩. The most probable value of f(x) is:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \dots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)$$

Naïve Bayes assumption:

$$P(a_1, a_2, \dots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

(attributes are conditionally independent)
[Illustration: a training table with attribute columns a1 (# persons), a2 (temp), …, an and a label column]
23. Bayesian classification
The classification problem may be formalized using a-posteriori probabilities:
P(C|X) = probability that the sample tuple X = ⟨x1, …, xk⟩ is of class C.
E.g., P(class=N | outlook=sunny, windy=true, …)
Idea: assign to sample X the class label C such that P(C|X) is maximal.
24. Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C) · P(C) / P(X)
P(X) is constant for all classes
P(C) = relative frequency of class C samples
The C such that P(C|X) is maximum is the C such that P(X|C) · P(C) is maximum
Problem: computing P(X|C) directly is infeasible!
25. Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C) · … · P(xk|C)
If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases; a sketch of the continuous case follows.
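A minimal sketch of the Gaussian option, assuming we fit the density to an attribute's values within a class; the attribute values below are invented for illustration.

```python
import math

def gaussian_density(x, values):
    """Density of x under a normal fitted to the class's training values."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_in_class_P = [21.0, 24.5, 19.8, 23.1, 22.4]  # values of one attribute in class P
print(gaussian_density(22.0, temps_in_class_P))    # used in place of a relative frequency
```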
26. Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities:

| Attribute   | Value    | P   | N   |
|-------------|----------|-----|-----|
| Outlook     | sunny    | 2/9 | 3/5 |
| Outlook     | overcast | 4/9 | 0   |
| Outlook     | rain     | 3/9 | 2/5 |
| Temperature | hot      | 2/9 | 2/5 |
| Temperature | mild     | 4/9 | 2/5 |
| Temperature | cool     | 3/9 | 1/5 |
| Humidity    | high     | 3/9 | 4/5 |
| Humidity    | normal   | 6/9 | 1/5 |
| Windy       | true     | 3/9 | 3/5 |
| Windy       | false    | 6/9 | 2/5 |
27. Play-tennis example: estimating P(xi|C)

| Outlook  | Temperature | Humidity | Windy | Class |
|----------|-------------|----------|-------|-------|
| sunny    | hot         | high     | false | N     |
| sunny    | hot         | high     | true  | N     |
| overcast | hot         | high     | false | P     |
| rain     | mild        | high     | false | P     |
| rain     | cool        | normal   | false | P     |
| rain     | cool        | normal   | true  | N     |
| overcast | cool        | normal   | true  | P     |
| sunny    | mild        | high     | false | N     |
| sunny    | cool        | normal   | false | P     |
| rain     | mild        | normal   | false | P     |
| sunny    | mild        | normal   | true  | P     |
| overcast | mild        | high     | true  | P     |
| overcast | hot         | normal   | false | P     |
| rain     | mild        | high     | true  | N     |
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5
P(overcast|p) = 4/9 P(overcast|n) = 0
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 2/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
28. Example: Naïve Bayes
Predict playing tennis on a day with the conditions ⟨sunny, cool, high, strong⟩ (i.e., P(v | o=sunny, t=cool, h=high, w=strong)) using the following training data:

| Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis |
|-----|----------|-------------|----------|--------|-------------|
| 1   | Sunny    | Hot         | High     | Weak   | No          |
| 2   | Sunny    | Hot         | High     | Strong | No          |
| 3   | Overcast | Hot         | High     | Weak   | Yes         |
| 4   | Rain     | Mild        | High     | Weak   | Yes         |
| 5   | Rain     | Cool        | Normal   | Weak   | Yes         |
| 6   | Rain     | Cool        | Normal   | Strong | No          |
| 7   | Overcast | Cool        | Normal   | Strong | Yes         |
| 8   | Sunny    | Mild        | High     | Weak   | No          |
| 9   | Sunny    | Cool        | Normal   | Weak   | Yes         |
| 10  | Rain     | Mild        | Normal   | Weak   | Yes         |
| 11  | Sunny    | Mild        | Normal   | Strong | Yes         |
| 12  | Overcast | Mild        | High     | Strong | Yes         |
| 13  | Overcast | Hot         | Normal   | Weak   | Yes         |
| 14  | Rain     | Mild        | High     | Strong | No          |
Each conditional probability is estimated as a relative frequency, e.g.:

$$P(\text{strong} \mid \text{yes}) = \frac{\#\ \text{days of playing tennis with strong wind}}{\#\ \text{days of playing tennis}}$$

We have:

$$P(\text{yes})\,P(\text{sunny} \mid \text{yes})\,P(\text{cool} \mid \text{yes})\,P(\text{high} \mid \text{yes})\,P(\text{strong} \mid \text{yes}) = \tfrac{9}{14}\cdot\tfrac{2}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9} = .005$$

$$P(\text{no})\,P(\text{sunny} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{high} \mid \text{no})\,P(\text{strong} \mid \text{no}) = \tfrac{5}{14}\cdot\tfrac{3}{5}\cdot\tfrac{1}{5}\cdot\tfrac{4}{5}\cdot\tfrac{3}{5} = .021$$

Since .021 > .005, Naïve Bayes predicts Play Tennis = No.
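A compact sketch that reproduces the slide's numbers by estimating every probability as a relative frequency over the 14-day table above.

```python
data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny","hot","high","weak","no"),    ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"),("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"),("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"),("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"),("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"),("rain","mild","high","strong","no"),
]
query = ("sunny", "cool", "high", "strong")

for label in ("yes", "no"):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                # P(v)
    for i, value in enumerate(query):            # times each P(a_i|v)
        score *= sum(r[i] == value for r in rows) / len(rows)
    print(label, round(score, 3))  # yes 0.005, no 0.021 -> predict "no"
```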
29. The independence hypothesis…
… makes computation possible
… yields optimal classifiers when satisfied
… but is seldom satisfied in practice, as
attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
Decision trees, that reason on one attribute at a time, considering the most important attributes first
30. Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
  for each target value vj
    estimate P(vj)
    for each attribute value ai of each attribute a
      estimate P(ai | vj)
Classify_New_Instance (x)

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i \mid v_j)$$

Typical estimation of P(ai | vj) (the m-estimate):

$$P(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}$$

where n is the number of examples with v = vj, nc is the number of those examples with a = ai, p is the prior estimate for P(ai|vj), and m is the weight given to the prior.
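A small sketch of this estimator; the function name and worked numbers are mine, with nc, n, p, m as defined above.

```python
def m_estimate(n_c, n, p, m):
    """P(a_i|v_j) ~ (n_c + m*p) / (n + m): blends the observed relative
    frequency n_c/n with a prior p weighted as m virtual examples."""
    return (n_c + m * p) / (n + m)

# E.g. P(overcast|n) in the play-tennis table is 0/5 as a raw relative
# frequency, which would zero out every product it enters; with a
# uniform prior p = 1/3 over the three outlook values and m = 3:
print(m_estimate(0, 5, 1/3, 3))  # 0.125 instead of 0.0
```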
31. K-Nearest-Neighbors Algorithm
K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)
KNN has been used in statistical estimation and pattern recognition since the 1970s.
32. K-Nearest-Neighbors Algorithm
A case is classified by a majority voting of its neighbors, with the
case being assigned to the class most common among its K nearest
neighbors measured by a distance function.
If K=1, then the case is simply assigned to the class of its nearest
neighbor
37. What is the most probable label for c?
Solution: look for the K nearest neighbors of c and take the majority label as c's label.
Let's suppose k = 3:
39. What is the most probable label for c?
The 3 nearest points to c are: a, a, and o.
Therefore, the most probable label for c is a.
40. Nearest Neighbour Rule
Non-parametric pattern classification.
Consider a two-class problem where each sample consists of two measurements (x, y).
k = 1: for a given query point q, assign the class of the nearest neighbour.
k = 3: compute the k nearest neighbours and assign the class by majority vote.
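A minimal sketch of both rules, assuming Euclidean distance on the two measurements; the training points, labels, and queries are invented.

```python
import math
from collections import Counter

train = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"),
         ((6.0, 9.0), "B"), ((1.2, 0.5), "A"), ((5.5, 7.5), "B")]

def knn_classify(q, k):
    # Sort all training points by distance to q, keep the k closest.
    neighbours = sorted(train, key=lambda t: math.dist(q, t[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_classify((1.3, 1.0), k=1))  # 'A' -- nearest neighbour rule
print(knn_classify((4.0, 6.0), k=3))  # 'B' -- majority vote among 3
```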
41. Nearest Neighbour Issues
Expensive
To determine the nearest neighbour of a query point q, we must compute the distance to all N training examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
Storage Requirements
Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
High Dimensional Data
“Curse of Dimensionality”
Required amount of training data increases exponentially with dimension
Computational cost also increases dramatically
Partitioning techniques degrade to linear search in high dimension
42. Decision theory
Decision theory is the study of making decisions that have
a significant impact
Decision-making is distinguished into:
Decision-making under certainty
Decision-making under non-certainty, which is further divided into:
Decision-making under risk
Decision-making under uncertainty
43. Probability theory
Most decisions have to be taken in the presence of uncertainty
Probability theory quantifies uncertainty regarding the
occurrence of events or states of the world
Basic elements of probability theory:
Random variables describe aspects of the world whose state
is initially unknown
Each random variable has a domain of values that it can take
on (discrete, boolean, continuous)
An atomic event is a complete specification of the state of the
world, i.e. an assignment of values to variables of which the
world is composed
44. Probability Theory (cont.)
Probability space:
The sample space S = {e1, e2, …, en}, which is a set of atomic events
A probability measure P, which assigns a real number between 0 and 1 to the members of the sample space
Axioms:
All probabilities are between 0 and 1
The probabilities of the atomic events of a probability space must sum to 1
The certain event S (the sample space itself) has probability P(S) = 1
45. Prior
A priori probabilities (priors) reflect our prior knowledge of how likely an event is to occur.
In the absence of any other information, a random variable is assigned a degree of belief called an unconditional or prior probability.
46. Class Conditional probability
When we have information concerning previously unknown random variables, we use posterior or conditional probabilities: P(a|b), the probability of event a given that we know b.
Alternatively this can be written (the product rule):
P(a ∧ b) = P(a|b) P(b), i.e.

$$P(a \mid b) = \frac{P(a \wedge b)}{P(b)}$$
47. Bayes' rule
The product rule can be written as:
P(a ∧ b) = P(a|b) P(b)
P(a ∧ b) = P(b|a) P(a)
By equating the right-hand sides:

$$P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}$$

This is known as Bayes' rule.
48. Bayesian Decision Theory
Bayesian Decision Theory is a fundamental
statistical approach that quantifies the
tradeoffs between various decisions using
probabilities and costs that accompany such
decisions.
Example: Patient has trouble breathing
– Decision: Asthma versus Lung cancer
– Decide lung cancer when person has asthma
Cost: moderately high (e.g., order unnecessary tests, scare patient)
– Decide asthma when person has lung cancer
Cost: very high (e.g., a missed chance to treat the cancer early)
49. Decision Rules
Progression of decision rules:
– (1) Decide based on prior probabilities
– (2) Decide based on posterior probabilities
– (3) Decide based on risk
53. Question
Consider a two-class problem {c1, c2}, where the prior probabilities of the two classes are given by P(c1) = 0.7 and P(c2) = 0.3.
Design a classification rule for a pattern based only on the prior probabilities.
Calculation of the error probability P(error).
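A worked answer (the standard result; the solution slides are not part of this extract): with no measurement available, the error-minimizing rule always picks the class with the larger prior.

$$\text{Decide } c_1 \text{ always, since } P(c_1) = 0.7 > P(c_2) = 0.3, \qquad P(\text{error}) = P(c_2) = 0.3$$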
57. Bayes Formula
Suppose the priors P(ωj) and the conditional densities p(x|ωj) are known. Then:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)} \qquad \left(\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\right)$$
60. Example of the two regions R1 and R2 formed by the Bayesian classifier for the case of two equiprobable classes.
The dotted line at x0 is a threshold partitioning the feature space into two regions, R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides ω1, and for all values in R2 it decides ω2. However, it is obvious from the figure that decision errors are unavoidable.
62. Minimizing the Classification Error Probability
Show that the Bayesian classifier is optimal with respect to
minimizing the classification error probability.
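A sketch of the argument, filling in the standard steps: for two classes, an error occurs when x falls in R2 while its true class is ω1, or in R1 while its true class is ω2, so

$$P(\text{error}) = \int_{R_2} p(x \mid \omega_1)\,P(\omega_1)\,dx + \int_{R_1} p(x \mid \omega_2)\,P(\omega_2)\,dx$$

At each x exactly one of the two integrands is paid, and assigning x to the class with the larger posterior P(ωi|x) makes the paid term the pointwise minimum of the two; no other partition of the feature space can therefore achieve a smaller error probability.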