Unit-2: Bayes Decision Theory
2. Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
1. Data collection
Attributes may also be called parameters, features, or variables.
• Errors will propagate: an error introduced at the very beginning (the 1st step) carries through every later step (1st step → 2nd → 3rd → 4th).
3. Data Quality: Why Preprocess the Data?
Measures for data quality: a multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some records modified but others not, dangling references, …
Timeliness: is the data updated in a timely manner?
Believability: how much are the data trusted to be correct?
Interpretability: how easily can the data be understood?
4. Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
5. Histogram Analysis
Divide data into buckets and store the average (or sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): roughly the same number of values per bucket
[Figure: histogram with counts 0–40 on the y-axis over values 10,000–90,000 on the x-axis, using an interval of 10,000]
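A minimal sketch of the two partitioning rules, assuming NumPy is available; the data values and bucket count below are invented for illustration.

```python
import numpy as np

data = np.array([12000, 15000, 23000, 31000, 34000, 45000,
                 52000, 58000, 61000, 77000, 81000, 88000])
n_buckets = 4

# Equal-width: every bucket spans the same value range.
width_edges = np.linspace(data.min(), data.max(), n_buckets + 1)

# Equal-frequency (equal-depth): every bucket holds ~the same count.
freq_edges = np.quantile(data, np.linspace(0, 1, n_buckets + 1))

for name, edges in [("equal-width", width_edges), ("equal-frequency", freq_edges)]:
    counts, _ = np.histogram(data, bins=edges)
    print(name, edges.round(0), counts)
```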
6. Correlation Analysis (Nominal Data)
χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

The larger the χ² value, the more likely the variables (features, attributes) are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Correlation does not imply causality
The # of hospitals and # of car thefts in a city are correlated
Both are causally linked to a third variable: population
7. Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

|                          | Play chess | Not play chess | Sum (row) |
|--------------------------|------------|----------------|-----------|
| Like science fiction     | 250 (90)   | 200 (360)      | 450       |
| Not like science fiction | 50 (210)   | 1000 (840)     | 1050      |
| Sum (col.)               | 300        | 1200           | 1500      |

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

This shows that like_science_fiction and play_chess are correlated in the group.
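A hedged check of these numbers with SciPy (assuming it is available); chi2_contingency recomputes both the expected counts and the statistic from the observed table.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)                # [[ 90. 360.] [210. 840.]] -- matches the parentheses
print(round(chi2, 2), dof, p)  # 507.93, 1 degree of freedom, p ~ 0
```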
8. Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σaibi is the sum of the AB cross-products.
If rA,B > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated.
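A small sketch checking the formula against NumPy's built-in Pearson correlation; the two attribute vectors are invented for illustration.

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(a)

# The formula above, with sample standard deviations (ddof=1).
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
r_numpy = np.corrcoef(a, b)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # both ~ 0.9407
```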
10. Covariance (Numeric Data)
Covariance is similar to correlation:

$$\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
Positive covariance: if CovA,B > 0, then A and B both tend to be larger than their expected values.
Negative covariance: if CovA,B < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
Independence: if A and B are independent, CovA,B = 0, but the converse is not true:
Some pairs of random variables have a covariance of 0 yet are not independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$$
11. Covariance: An Example
Computation can be simplified as:

$$\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$$

Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
Thus, A and B rise together, since Cov(A,B) > 0.
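A quick check of the worked example with NumPy; note that bias=True selects the 1/n convention used in the slide.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

cov_manual = (A * B).mean() - A.mean() * B.mean()  # E(A.B) - mean(A)*mean(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]
print(cov_manual, cov_numpy)  # 4.0 4.0
```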
12. The Normal Distribution
[Figure: normal density curves f(X) plotted against X]
Changing μ shifts the distribution left or right.
Changing σ increases or decreases the spread.
Notation: μ and σ denote the mean and standard deviation of a population; x̄ and s denote those of a sample.
13. The Normal Distribution: as a mathematical function (pdf)

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Note the constants:
π = 3.14159…
e = 2.71828…
This is a bell-shaped curve with different centers and spreads depending on μ and σ.
14. The beauty of the normal curve:
No matter what μ and σ are, the area between μ−σ and μ+σ is about 68%; the area between μ−2σ and μ+2σ is about 95%; and the area between μ−3σ and μ+3σ is about 99.7%. Almost all values fall within 3 standard deviations.
16. Example
Suppose SAT scores roughly follow a normal distribution in the U.S. population of college-bound students (with range restricted to 200–800), and the average math SAT is 500 with a standard deviation of 50. Then:
68% of students will have scores between 450 and 550
95% will be between 400 and 600
99.7% will be between 350 and 650
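A quick confirmation of these three intervals with SciPy's normal CDF (assuming scipy is available).

```python
from scipy.stats import norm

mu, sigma = 500, 50
for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    mass = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)
    print(f"between {lo} and {hi}: {mass:.3f}")
# between 450 and 550: 0.683
# between 400 and 600: 0.954
# between 350 and 650: 0.997
```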
17. Basic Formulas for Probabilities
• Product rule: the probability P(A, B) of a conjunction of two events A and B:

$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

• Sum rule: the probability of a disjunction of two events A and B:

$$P(A \vee B) = P(A) + P(B) - P(A, B)$$

• Theorem of total probability: if events A1, …, An are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$$
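A toy numeric check of the three rules on a made-up joint distribution over two binary events A and B.

```python
P = {  # P(A=a, B=b), an invented joint distribution
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}
P_A = P[(True, True)] + P[(True, False)]   # 0.5
P_B = P[(True, True)] + P[(False, True)]   # 0.3

# Product rule: P(A,B) = P(B|A) P(A)
P_B_given_A = P[(True, True)] / P_A
assert abs(P_B_given_A * P_A - P[(True, True)]) < 1e-12

# Sum rule: P(A or B) = P(A) + P(B) - P(A,B)
P_A_or_B = P_A + P_B - P[(True, True)]     # 0.6

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
P_B_given_notA = P[(False, True)] / (1 - P_A)
assert abs(P_B_given_A * P_A + P_B_given_notA * (1 - P_A) - P_B) < 1e-12
print(P_A_or_B, P_B)
```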
18. Basic Approach
Bayes rule:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D (the posterior)
P(D|h) = probability of D given h (the likelihood of D given h)
The goal of Bayesian learning: find the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis):

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

In classification terms (a null hypothesis versus an alternate hypothesis): into which class should the sample be put? The prediction (classification) is over the class labels, e.g. Class ∈ {0, 1}.
19. An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease is
not present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = .008, P(¬cancer) = .992
P(+|cancer) = .98, P(−|cancer) = .02
P(+|¬cancer) = .03, P(−|¬cancer) = .97

$$P(\text{cancer} \mid +) = \frac{P(+ \mid \text{cancer})\,P(\text{cancer})}{P(+)}, \qquad
P(\neg\text{cancer} \mid +) = \frac{P(+ \mid \neg\text{cancer})\,P(\neg\text{cancer})}{P(+)}$$
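A short sketch finishing the computation: P(+) follows from total probability, and the larger unnormalized posterior wins.

```python
p_cancer, p_not = 0.008, 0.992
p_pos_cancer, p_pos_not = 0.98, 0.03

joint_cancer = p_pos_cancer * p_cancer    # P(+|cancer) P(cancer) = 0.00784
joint_not = p_pos_not * p_not             # P(+|~cancer) P(~cancer) = 0.02976
p_pos = joint_cancer + joint_not          # P(+) = 0.0376

print(joint_cancer, joint_not)            # MAP decision: "no cancer"
print(joint_cancer / p_pos)               # P(cancer|+) ~ 0.21
```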
20. MAP Learner
For each hypothesis h in H, calculate the posterior probability:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Output the hypothesis hMAP with the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$

Comments:
Computationally intensive
Provides a standard for judging the performance of learning algorithms
Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
21. Bayes Optimal Classifier
Question: given a new instance x, what is its most probable classification?
hMAP(x) is not the most probable classification!
Example: let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new data x, we have h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?
Bayes optimal classification:

$$v^{*} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4 \qquad \sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$$

so the Bayes optimal classification of x is −.
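A tiny sketch of the vote-weighting above, using the example's posteriors; the dictionary names are mine.

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
votes = {"h1": "+", "h2": "-", "h3": "-"}        # each hypothesis's prediction

scores = {}
for h, p in posteriors.items():
    scores[votes[h]] = scores.get(votes[h], 0.0) + p
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' -- differs from h_MAP = h1's '+'
```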
22. Naïve Bayes Learner
Assume a target function f: X → V, where each instance x is described by attributes ⟨a1, a2, …, an⟩. The most probable value of f(x) is:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \dots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \dots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\,P(v_j)$$

Naïve Bayes assumption:

$$P(a_1, a_2, \dots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

(attributes are conditionally independent)
[Illustration: a training table with attribute columns a1 (# persons), a2 (temp), …, an and a label column]
23. Bayesian classification
The classification problem may be formalized using a-posteriori probabilities:
P(C|X) = probability that the sample tuple X = ⟨x1, …, xk⟩ is of class C.
E.g., P(class=N | outlook=sunny, windy=true, …)
Idea: assign to sample X the class label C such that P(C|X) is maximal.
24. Estimating a-posteriori probabilities
Bayes theorem:
P(C|X) = P(X|C) · P(C) / P(X)
P(X) is constant for all classes
P(C) = relative frequency of class C samples
The C such that P(C|X) is maximum is the C such that P(X|C) · P(C) is maximum
Problem: computing P(X|C) directly is infeasible!
25. Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C) · … · P(xk|C)
If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases; a sketch of the continuous case follows.
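A minimal sketch of the Gaussian option, assuming we fit the density to an attribute's values within a class; the attribute values below are invented for illustration.

```python
import math

def gaussian_density(x, values):
    """Density of x under a normal fitted to the class's training values."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_in_class_P = [21.0, 24.5, 19.8, 23.1, 22.4]  # values of one attribute in class P
print(gaussian_density(22.0, temps_in_class_P))    # used in place of a relative frequency
```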
26. Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities:

| Attribute   | Value    | P   | N   |
|-------------|----------|-----|-----|
| Outlook     | sunny    | 2/9 | 3/5 |
| Outlook     | overcast | 4/9 | 0   |
| Outlook     | rain     | 3/9 | 2/5 |
| Temperature | hot      | 2/9 | 2/5 |
| Temperature | mild     | 4/9 | 2/5 |
| Temperature | cool     | 3/9 | 1/5 |
| Humidity    | high     | 3/9 | 4/5 |
| Humidity    | normal   | 6/9 | 1/5 |
| Windy       | true     | 3/9 | 3/5 |
| Windy       | false    | 6/9 | 2/5 |
27. Play-tennis example: estimating P(xi|C)

| Outlook  | Temperature | Humidity | Windy | Class |
|----------|-------------|----------|-------|-------|
| sunny    | hot         | high     | false | N     |
| sunny    | hot         | high     | true  | N     |
| overcast | hot         | high     | false | P     |
| rain     | mild        | high     | false | P     |
| rain     | cool        | normal   | false | P     |
| rain     | cool        | normal   | true  | N     |
| overcast | cool        | normal   | true  | P     |
| sunny    | mild        | high     | false | N     |
| sunny    | cool        | normal   | false | P     |
| rain     | mild        | normal   | false | P     |
| sunny    | mild        | normal   | true  | P     |
| overcast | mild        | high     | true  | P     |
| overcast | hot         | normal   | false | P     |
| rain     | mild        | high     | true  | N     |
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5
P(overcast|p) = 4/9 P(overcast|n) = 0
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 2/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
28. Example: Naïve Bayes
Predict playing tennis on a day with the conditions ⟨sunny, cool, high, strong⟩ (i.e., P(v | o=sunny, t=cool, h=high, w=strong)) using the following training data:

| Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis |
|-----|----------|-------------|----------|--------|-------------|
| 1   | Sunny    | Hot         | High     | Weak   | No          |
| 2   | Sunny    | Hot         | High     | Strong | No          |
| 3   | Overcast | Hot         | High     | Weak   | Yes         |
| 4   | Rain     | Mild        | High     | Weak   | Yes         |
| 5   | Rain     | Cool        | Normal   | Weak   | Yes         |
| 6   | Rain     | Cool        | Normal   | Strong | No          |
| 7   | Overcast | Cool        | Normal   | Strong | Yes         |
| 8   | Sunny    | Mild        | High     | Weak   | No          |
| 9   | Sunny    | Cool        | Normal   | Weak   | Yes         |
| 10  | Rain     | Mild        | Normal   | Weak   | Yes         |
| 11  | Sunny    | Mild        | Normal   | Strong | Yes         |
| 12  | Overcast | Mild        | High     | Strong | Yes         |
| 13  | Overcast | Hot         | Normal   | Weak   | Yes         |
| 14  | Rain     | Mild        | High     | Strong | No          |
Each conditional probability is estimated as a relative frequency, e.g.:

$$P(\text{strong} \mid \text{yes}) = \frac{\#\ \text{days of playing tennis with strong wind}}{\#\ \text{days of playing tennis}}$$

We have:

$$P(\text{yes})\,P(\text{sunny} \mid \text{yes})\,P(\text{cool} \mid \text{yes})\,P(\text{high} \mid \text{yes})\,P(\text{strong} \mid \text{yes}) = \tfrac{9}{14}\cdot\tfrac{2}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9}\cdot\tfrac{3}{9} = .005$$

$$P(\text{no})\,P(\text{sunny} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{high} \mid \text{no})\,P(\text{strong} \mid \text{no}) = \tfrac{5}{14}\cdot\tfrac{3}{5}\cdot\tfrac{1}{5}\cdot\tfrac{4}{5}\cdot\tfrac{3}{5} = .021$$

Since .021 > .005, Naïve Bayes predicts Play Tennis = No.
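A compact sketch that reproduces the slide's numbers by estimating every probability as a relative frequency over the 14-day table above.

```python
data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny","hot","high","weak","no"),    ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"),("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"),("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"),("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"),("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"),("rain","mild","high","strong","no"),
]
query = ("sunny", "cool", "high", "strong")

for label in ("yes", "no"):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                # P(v)
    for i, value in enumerate(query):            # times each P(a_i|v)
        score *= sum(r[i] == value for r in rows) / len(rows)
    print(label, round(score, 3))  # yes 0.005, no 0.021 -> predict "no"
```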
29. The independence hypothesis…
… makes computation possible
… yields optimal classifiers when satisfied
… but is seldom satisfied in practice, as
attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
Decision trees, that reason on one attribute at a time, considering the most important attributes first
30. Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
  for each target value vj
    estimate P(vj)
    for each attribute value ai of each attribute a
      estimate P(ai | vj)
Classify_New_Instance (x)

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i \mid v_j)$$

Typical estimation of P(ai | vj) (the m-estimate):

$$P(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}$$

where n is the number of examples with v = vj, nc is the number of those examples with a = ai, p is the prior estimate for P(ai|vj), and m is the weight given to the prior.
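A small sketch of this estimator; the function name and worked numbers are mine, with nc, n, p, m as defined above.

```python
def m_estimate(n_c, n, p, m):
    """P(a_i|v_j) ~ (n_c + m*p) / (n + m): blends the observed relative
    frequency n_c/n with a prior p weighted as m virtual examples."""
    return (n_c + m * p) / (n + m)

# E.g. P(overcast|n) in the play-tennis table is 0/5 as a raw relative
# frequency, which would zero out every product it enters; with a
# uniform prior p = 1/3 over the three outlook values and m = 3:
print(m_estimate(0, 5, 1/3, 3))  # 0.125 instead of 0.0
```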
31. K-Nearest-Neighbors Algorithm
K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)
KNN has been used in statistical estimation and pattern recognition since the 1970s.
32. K-Nearest-Neighbors Algorithm
A case is classified by a majority voting of its neighbors, with the
case being assigned to the class most common among its K nearest
neighbors measured by a distance function.
If K=1, then the case is simply assigned to the class of its nearest
neighbor
37. What is the most probable label for c?
Solution: look for the K nearest neighbors of c and take the majority label as c's label.
Let's suppose k = 3:
39. What is the most probable label for c?
The 3 nearest points to c are: a, a, and o.
Therefore, the most probable label for c is a.
40. Nearest Neighbour Rule
Non-parametric pattern classification.
Consider a two-class problem where each sample consists of two measurements (x, y).
k = 1: for a given query point q, assign the class of the nearest neighbour.
k = 3: compute the k nearest neighbours and assign the class by majority vote.
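A minimal sketch of both rules, assuming Euclidean distance on the two measurements; the training points, labels, and queries are invented.

```python
import math
from collections import Counter

train = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"),
         ((6.0, 9.0), "B"), ((1.2, 0.5), "A"), ((5.5, 7.5), "B")]

def knn_classify(q, k):
    # Sort all training points by distance to q, keep the k closest.
    neighbours = sorted(train, key=lambda t: math.dist(q, t[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_classify((1.3, 1.0), k=1))  # 'A' -- nearest neighbour rule
print(knn_classify((4.0, 6.0), k=3))  # 'B' -- majority vote among 3
```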
41. Nearest Neighbour Issues
Expensive
To determine the nearest neighbour of a query point q, we must compute the distance to all N training examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
Storage Requirements
Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
High Dimensional Data
“Curse of Dimensionality”
Required amount of training data increases exponentially with dimension
Computational cost also increases dramatically
Partitioning techniques degrade to linear search in high dimension
42. Decision theory
Decision theory is the study of making decisions that have
a significant impact
Decision-making is distinguished into:
Decision-making under certainty
Decision-making under non-certainty, which is further divided into:
Decision-making under risk
Decision-making under uncertainty
43. Probability theory
Most decisions have to be taken in the presence of uncertainty
Probability theory quantifies uncertainty regarding the
occurrence of events or states of the world
Basic elements of probability theory:
Random variables describe aspects of the world whose state
is initially unknown
Each random variable has a domain of values that it can take
on (discrete, boolean, continuous)
An atomic event is a complete specification of the state of the
world, i.e. an assignment of values to variables of which the
world is composed
44. Probability Theory (cont.)
Probability space:
The sample space S = {e1, e2, …, en}, which is a set of atomic events
A probability measure P, which assigns a real number between 0 and 1 to the members of the sample space
Axioms:
All probabilities are between 0 and 1
The probabilities of the atomic events of a probability space must sum to 1
The certain event S (the sample space itself) has probability P(S) = 1
45. Prior
A priori probabilities (priors) reflect our prior knowledge of how likely an event is to occur.
In the absence of any other information, a random variable is assigned a degree of belief called an unconditional or prior probability.
46. Class Conditional probability
When we have information concerning previously unknown random variables, we use posterior or conditional probabilities: P(a|b), the probability of event a given that we know b.
Alternatively this can be written (the product rule):
P(a ∧ b) = P(a|b) P(b), i.e.

$$P(a \mid b) = \frac{P(a \wedge b)}{P(b)}$$
47. Bayes' rule
The product rule can be written as:
P(a ∧ b) = P(a|b) P(b)
P(a ∧ b) = P(b|a) P(a)
By equating the right-hand sides:

$$P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}$$

This is known as Bayes' rule.
48. Bayesian Decision Theory
Bayesian Decision Theory is a fundamental
statistical approach that quantifies the
tradeoffs between various decisions using
probabilities and costs that accompany such
decisions.
Example: Patient has trouble breathing
– Decision: Asthma versus Lung cancer
– Decide lung cancer when person has asthma
Cost: moderately high (e.g., order unnecessary tests, scare patient)
– Decide asthma when person has lung cancer
Cost: very high (e.g., a missed chance to treat the cancer early)
49. Decision Rules
Progression of decision rules:
– (1) Decide based on prior probabilities
– (2) Decide based on posterior probabilities
– (3) Decide based on risk
53. Question
Consider a two-class problem {c1, c2}, where the prior probabilities of the two classes are given by P(c1) = 0.7 and P(c2) = 0.3.
Design a classification rule for a pattern based only on the prior probabilities.
Calculation of the error probability P(error).
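A worked answer (the standard result; the solution slides are not part of this extract): with no measurement available, the error-minimizing rule always picks the class with the larger prior.

$$\text{Decide } c_1 \text{ always, since } P(c_1) = 0.7 > P(c_2) = 0.3, \qquad P(\text{error}) = P(c_2) = 0.3$$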
57. Bayes Formula
Suppose the priors P(ωj) and the conditional densities p(x|ωj) are known. Then:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)} \qquad \left(\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\right)$$
60. Example of the two regions R1 and R2 formed by the Bayesian classifier for the case of two equiprobable classes.
The dotted line at x0 is a threshold partitioning the feature space into two regions, R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides ω1, and for all values in R2 it decides ω2. However, it is obvious from the figure that decision errors are unavoidable.
62. Minimizing the Classification Error Probability
Show that the Bayesian classifier is optimal with respect to
minimizing the classification error probability.
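A sketch of the argument, filling in the standard steps: for two classes, an error occurs when x falls in R2 while its true class is ω1, or in R1 while its true class is ω2, so

$$P(\text{error}) = \int_{R_2} p(x \mid \omega_1)\,P(\omega_1)\,dx + \int_{R_1} p(x \mid \omega_2)\,P(\omega_2)\,dx$$

At each x exactly one of the two integrands is paid, and assigning x to the class with the larger posterior P(ωi|x) makes the paid term the pointwise minimum of the two; no other partition of the feature space can therefore achieve a smaller error probability.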