Data Preprocessing
 Data Preprocessing: An Overview
 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
1. Data collection: parameters, features, variables
• Errors will propagate: an error made at the very beginning (1st step) carries through every later step: 1st step → 2nd → 3rd → 4th
Data Quality: Why Preprocess the Data?
 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling,
…
 Timeliness: timely update?
 Believability: how much can we trust that the data are correct?
 Interpretability: how easily the data can be
understood?
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Histogram Analysis
 Divide data into buckets and
store average (sum) for each
bucket
 Partitioning rules:
 Equal-width: equal bucket
range
 Equal-frequency (or equal-depth): each bucket holds roughly the same number of samples
[Figure: equal-width histogram with an interval of 10,000; x-axis buckets from 10,000 to 90,000, y-axis counts from 0 to 40.]
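A small Python sketch contrasting the two partitioning rules on hypothetical price data (the values, bucket width, and depth below are illustrative, not from the slide):

```python
# Hypothetical sorted price data; bucket width and depth are illustrative.
data = sorted([12000, 18000, 25000, 31000, 42000, 55000, 61000, 78000, 88000])

# Equal-width: every bucket spans the same value range (width 20000 here).
width = 20000
equal_width = {}
for x in data:
    lo = (x // width) * width
    equal_width.setdefault((lo, lo + width), []).append(x)

# Equal-frequency (equal-depth): every bucket holds the same number of
# samples (3 per bucket here), so bucket ranges vary instead.
depth = 3
equal_depth = [data[i:i + depth] for i in range(0, len(data), depth)]

print(equal_width)   # {(0, 20000): [12000, 18000], (20000, 40000): ...}
print(equal_depth)   # [[12000, 18000, 25000], [31000, 42000, 55000], ...]
```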
Correlation Analysis (Nominal Data)
 Χ2 (chi-square) test
 The larger the Χ2 value, the more likely the variables
are related
 The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
(Here "features", "attributes", and "variables" refer to the same thing.)
Chi-Square Calculation: An Example
 Χ2 (chi-square) calculation (numbers in parenthesis are
expected counts calculated based on the data distribution
in the two categories)
 It shows that like_science_fiction and play_chess are
correlated in the group
Play chess Not play chess Sum (row)
Like science fiction 250 (90) 200 (360) 450
Not like science fiction 50 (210) 1000 (840) 1050
Sum (col.) 300 1200 1500

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
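As a quick check on this arithmetic, a minimal sketch in plain Python that recomputes the expected counts and the χ² statistic from the observed table:

```python
# Observed contingency table from the slide (rows: like/not like science
# fiction; columns: play/not play chess).
observed = [[250, 200],
            [50, 1000]]
row = [sum(r) for r in observed]        # [450, 1050]
col = [sum(c) for c in zip(*observed)]  # [300, 1200]
n = sum(row)                            # 1500

# Expected count for cell (i, j) is row_total * col_total / grand_total.
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
print(round(chi2, 2))  # 507.94 (the slide rounds down to 507.93)
```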
Correlation Analysis (Numeric Data)
 Correlation coefficient (also called Pearson’s product
moment coefficient)
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σaibi is the sum of the AB cross-product.
 If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
 rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
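A minimal Python sketch of this formula, using the sample standard deviations with the n−1 denominator as above (the example data reuse the stock prices from the covariance slide later on):

```python
import math

def pearson_r(a, b):
    # Sample Pearson correlation: sum of cross-deviations over (n-1)*sa*sb.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sa = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sb = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return sum((x - mean_a) * (y - mean_b)
               for x, y in zip(a, b)) / ((n - 1) * sa * sb)

print(pearson_r([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))  # ~0.94, strongly positive
```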
Visually Evaluating Correlation
Scatter plots showing correlation values ranging from –1 to 1.
[Figure: example scatter plots for attribute pairs such as Age vs. Height and f1 vs. f2.]
Covariance (Numeric Data)
 Covariance is similar to correlation:
$Cov(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
 Positive covariance: If CovA,B > 0, then A and B both tend to be larger
than their expected values.
 Negative covariance: If CovA,B < 0 then if A is larger than its expected
value, B is likely to be smaller than its expected value.
 Independence: CovA,B = 0 but the converse is not true:
 Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
Correlation coefficient: $r_{A,B} = \frac{Cov(A,B)}{\sigma_A\,\sigma_B}$
Co-Variance: An Example
 It can be simplified in computation as $Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$
 Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
 Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
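The same arithmetic as a small Python sketch, using the simplified form $Cov(A,B) = E(A \cdot B) - \bar{A}\,\bar{B}$:

```python
A = [2, 3, 5, 4, 6]     # stock A prices over the week
B = [5, 8, 10, 11, 14]  # stock B prices over the week
n = len(A)
mean_a, mean_b = sum(A) / n, sum(B) / n  # 4.0 and 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b
print(round(cov, 2))  # 4.0 > 0, so the two stocks tend to rise together
```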
The Normal Distribution
[Figure: normal density curves f(X) over X for different values of μ and σ.]
Changing μ shifts the distribution left or right.
Changing σ increases or decreases the spread.
Notation: μ and σ are the population mean and standard deviation; $\bar{x}$ and s are the corresponding sample estimates.
The Normal Distribution:
as mathematical function (pdf)
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
Note constants: π = 3.14159, e = 2.71828.
This is a bell-shaped curve with different centers and spreads depending on μ and σ.
**The beauty of the normal curve:
No matter what μ and σ are, the area between μ−σ and μ+σ is about 68%; the area between μ−2σ and μ+2σ is about 95%; and the area between μ−3σ and μ+3σ is about 99.7%. Almost all values fall within 3 standard deviations.
68-95-99.7 Rule
68% of
the data
95% of the data
99.7% of the data
Example
 Suppose SAT scores roughly follow a normal distribution in the U.S. population of college-bound students (with range restricted to 200-800), and the average math SAT is 500 with a standard deviation of 50. Then:
 68% of students will have scores between 450
and 550
 95% will be between 400 and 600
 99.7% will be between 350 and 650
Basic Formulas for Probabilities
• Product Rule: probability P(A, B) of a conjunction of two events A and B:
$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
• Sum Rule: probability of a disjunction of two events A and B:
$P(A \vee B) = P(A) + P(B) - P(A, B)$
• Theorem of Total Probability: if events A1, …, An are mutually exclusive with $\sum_{i=1}^{n} P(A_i) = 1$, then
$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)$
Basic Approach
Bayes Rule:
$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$
 P(h) = prior probability of hypothesis h
 P(D) = prior probability of training data D
 P(h|D) = probability of h given D (posterior density )
 P(D|h) = probability of D given h (likelihood of D given h)
The Goal of Bayesian Learning: the most probable hypothesis given the training data (the Maximum A Posteriori hypothesis):
$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$
Null hypothesis vs. alternate hypothesis: in which class should I put my sample? Prediction (classification): Class ∈ {0, 1}
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct
negative result in only 97% of the cases in which the disease is
not present. Furthermore, .008 of the entire population have this
cancer.
Known:
P(cancer) = .008, P(¬cancer) = .992
P(+ | cancer) = .98, P(− | cancer) = .02
P(+ | ¬cancer) = .03, P(− | ¬cancer) = .97
By Bayes' rule:
$P(cancer \mid +) = \frac{P(+ \mid cancer)\,P(cancer)}{P(+)}$ and $P(\neg cancer \mid +) = \frac{P(+ \mid \neg cancer)\,P(\neg cancer)}{P(+)}$
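Plugging the numbers in, a minimal Python sketch that normalizes the two products to get the actual posteriors (the slide gives only the priors and likelihoods; the total-probability normalization is filled in here):

```python
p_cancer, p_not = 0.008, 0.992        # priors
p_pos_cancer, p_pos_not = 0.98, 0.03  # P(+|cancer), P(+|~cancer)

joint_cancer = p_pos_cancer * p_cancer  # 0.00784
joint_not = p_pos_not * p_not           # 0.02976
p_pos = joint_cancer + joint_not        # P(+) by total probability

print(round(joint_cancer / p_pos, 3))   # P(cancer|+)  ~ 0.209
print(round(joint_not / p_pos, 3))      # P(~cancer|+) ~ 0.791 -> hMAP: no cancer
```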
MAP Learner
For each hypothesis h in H, calculate the posterior probability
$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$
Output the hypothesis hmap with the highest posterior probability
$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$
Comments:
Computationally intensive
Provides a standard for judging the performance of learning algorithms
Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
Bayes Optimal Classifier
 Question: Given new instance x, what is its most probable
classification?
 hMAP(x) is not necessarily the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new data x, we have h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?
Bayes optimal classification:
$v^{*} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$
Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4$ and $\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$
so the Bayes optimal classification of x is −.
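The same example as a short Python sketch: each hypothesis votes for a label, weighted by its posterior, and the label with the largest weighted total wins:

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}  # P(h_i | D)
votes = {"h1": "+", "h2": "-", "h3": "-"}       # h_i(x)

scores = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    scores[votes[h]] += p                       # sum_i P(v|h_i) P(h_i|D)

print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' : the Bayes optimal classification
```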
Naïve Bayes Learner
Assume target function f: X → V, where each instance x is described by attributes <a1, a2, …, an>. The most probable value of f(x) is:
$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$
Naïve Bayes assumption:
$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
(attributes are conditionally independent)
[Table: each instance is described by attributes a1 (# persons), a2 (temp), …, an, plus a label column.]
Bayesian classification
 The classification problem may be formalized
using a-posteriori probabilities:
 P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
 E.g. P(class=N | outlook=sunny,windy=true,…)
 Idea: assign to sample X the class label C such
that P(C|X) is maximal
Estimating a-posteriori probabilities
 Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
 P(X) is constant for all classes
 P(C) = relative freq of class C samples
 C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
 Problem: computing P(X|C) is infeasible!
Naïve Bayesian Classification
 Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
 If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
 If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density
function
 Computationally easy in both cases
Naive Bayesian Classifier (II)
 Given a training set, we can compute the
probabilities
Outlook      P    N      Humidity   P    N
sunny        2/9  3/5    high       3/9  4/5
overcast     4/9  0      normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N      Windy      P    N
hot          2/9  2/5    true       3/9  3/5
mild         4/9  2/5    false      6/9  2/5
cool         3/9  1/5
Play-tennis example: estimating P(xi|C)
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
outlook
P(sunny|p) = 2/9 P(sunny|n) = 3/5
P(overcast|p) = 4/9 P(overcast|n) = 0
P(rain|p) = 3/9 P(rain|n) = 2/5
temperature
P(hot|p) = 2/9 P(hot|n) = 2/5
P(mild|p) = 4/9 P(mild|n) = 2/5
P(cool|p) = 3/9 P(cool|n) = 1/5
humidity
P(high|p) = 3/9 P(high|n) = 4/5
P(normal|p) = 6/9 P(normal|n) = 2/5
windy
P(true|p) = 3/9 P(true|n) = 3/5
P(false|p) = 6/9 P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
Example : Naïve Bayes
Predict playing tennis on a day with the conditions <sunny, cool, high, strong> (i.e., P(v | o=sunny, t=cool, h=high, w=strong)) using the following training data:
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
we have:
$P(y)\,P(sun \mid y)\,P(cool \mid y)\,P(high \mid y)\,P(strong \mid y) = .005$
$P(n)\,P(sun \mid n)\,P(cool \mid n)\,P(high \mid n)\,P(strong \mid n) = .021$
where each factor is a relative frequency, e.g. P(strong|y) = (# days of playing tennis with strong wind) / (# days of playing tennis).
Since .021 > .005, the Naïve Bayes prediction is Play Tennis = No.
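The same computation as a compact Python sketch over the training table above (a minimal illustration using raw relative-frequency estimates, with no smoothing, so the outputs match the hand computation):

```python
from collections import Counter

data = [  # (outlook, temperature, humidity, wind, play)
    ("sunny", "hot", "high", "weak", "no"), ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"), ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"), ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"), ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"), ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"), ("rain", "mild", "high", "strong", "no"),
]
label_counts = Counter(row[-1] for row in data)

def score(x, v):
    # P(v) times the product over attributes of P(a_i | v), by relative frequency.
    s = label_counts[v] / len(data)
    for i, a in enumerate(x):
        s *= sum(1 for r in data if r[i] == a and r[-1] == v) / label_counts[v]
    return s

x = ("sunny", "cool", "high", "strong")
print({v: round(score(x, v), 3) for v in label_counts})  # {'no': 0.021, 'yes': 0.005}
```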
The independence hypothesis…
 … makes computation possible
 … yields optimal classifiers when satisfied
 … but is seldom satisfied in practice, as
attributes (variables) are often correlated.
 Attempts to overcome this limitation:
 Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
 Decision trees, that reason on one attribute at a
time, considering most important attributes first
Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
for each target value vj
estimate P(vj)
for each attribute value ai of each attribute a
estimate P(ai | vj )
Classify_New_Instance (x)
$v^{*} = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i \mid v_j)$
Typical estimation of P(ai | vj) (the m-estimate):
$P(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}$
where n is the number of examples with v = vj, nc the number of examples with v = vj and a = ai, p a prior estimate for P(ai|vj), and m the weight given to the prior.
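A one-line sketch of this estimator; the example call shows how it repairs the zero count for P(overcast|n) = 0/5 from the earlier play-tennis table, under an illustrative uniform prior p = 1/3 and weight m = 3:

```python
def m_estimate(n_c, n, p, m):
    # (n_c + m*p) / (n + m): blends the observed frequency with a prior p.
    return (n_c + m * p) / (n + m)

# P(outlook=overcast | n): 0 of the 5 negative examples are overcast.
print(m_estimate(0, 5, 1/3, 3))  # 0.125 rather than 0
```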
K-Nearest-Neighbors Algorithm
 K nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (distance function)
 KNN has been used in statistical estimation and pattern recognition since the 1970s.
K-Nearest-Neighbors Algorithm
 A case is classified by a majority voting of its neighbors, with the
case being assigned to the class most common among its K nearest
neighbors measured by a distance function.
 If K=1, then the case is simply assigned to the class of its nearest
neighbor
Distance Function Measurements
Hamming Distance
 For categorical variables, the Hamming distance can be used.
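For example, a minimal sketch counting mismatched positions between two categorical vectors:

```python
def hamming(x, y):
    # Number of attribute positions at which the two cases differ.
    return sum(a != b for a, b in zip(x, y))

print(hamming(("sunny", "cool", "high"), ("rain", "cool", "high")))  # 1
```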
K-Nearest-Neighbors
What is the most probable label for c?
 Solution: look for the K nearest neighbors of c and take the majority label as c's label.
 Suppose k = 3: the 3 nearest points to c are a, a, and o, so the most probable label for c is a.
Nearest Neighbour Rule
Non-parametric pattern
classification.
Consider a two class problem
where each sample consists of
two measurements (x,y).
k = 1: for a given query point q, assign the class of the nearest neighbour.
k = 3: compute the k nearest neighbours and assign the class by majority vote.
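A plain-Python sketch of this rule on a hypothetical two-class set of (x, y) samples (the points and labels below are illustrative, not from the slides):

```python
from collections import Counter
import math

def knn_classify(query, points, labels, k=3):
    # Sort training points by Euclidean distance to the query, then take a
    # majority vote among the k nearest ones.
    nearest = sorted(range(len(points)),
                     key=lambda i: math.dist(query, points[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["a", "a", "a", "o", "o", "o"]
print(knn_classify((2, 2), points, labels, k=3))  # 'a'
print(knn_classify((5, 6), points, labels, k=1))  # 'o'
```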
Nearest Neighbour Issues
 Expensive
 To determine the nearest neighbour of a query point q, must compute the distance to all N training
examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
 Storage Requirements
 Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
 High Dimensional Data
 “Curse of Dimensionality”
 Required amount of training data increases exponentially with dimension
 Computational cost also increases dramatically
 Partitioning techniques degrade to linear search in high dimension
Decision theory
 Decision theory is the study of making decisions that have
a significant impact
 Decision-making is distinguished into:
 Decision-making under certainty
 Decision-making under non-certainty
 Decision-making under risk
 Decision-making under uncertainty
Probability theory
 Most decisions have to be taken in the presence of uncertainty
 Probability theory quantifies uncertainty regarding the
occurrence of events or states of the world
 Basic elements of probability theory:
 Random variables describe aspects of the world whose state
is initially unknown
 Each random variable has a domain of values that it can take
on (discrete, boolean, continuous)
 An atomic event is a complete specification of the state of the
world, i.e. an assignment of values to variables of which the
world is composed
Probability Theory..
 Probability space
 The sample space S={e1 ,e2 ,…,en } which is
a set of atomic events
 Probability measure P which assigns a real
number between 0 and 1 to the members of
the sample space
 Axioms
 All probabilities are between 0 and 1
 The sum of probabilities for the atomic events
of a probability space must sum up to 1
 The certain event S (the sample space itself) has probability 1
Prior
 A priori probability (the prior) reflects our prior knowledge of how likely an event is to occur.
 In the absence of any other information, a random
variable is assigned a degree of belief called unconditional
or prior probability
Class Conditional probability
 When we have information concerning
previously unknown random variables
then we use posterior or conditional
probabilities: P(a|b), the probability of event a given that we know b
 Alternatively this can be written (the
product rule):
$P(a \wedge b) = P(a \mid b)\,P(b)$, which rearranges to $P(a \mid b) = \frac{P(a \wedge b)}{P(b)}$

Bayes’ rule
 The product rule can be written as:
 P(a ∧ b) = P(a|b) P(b)
 P(a ∧ b) = P(b|a) P(a)
 By equating the right-hand sides:
$P(b \mid a) = \frac{P(a \mid b)\,P(b)}{P(a)}$
 This is known as Bayes' rule
Bayesian Decision Theory
 Bayesian Decision Theory is a fundamental
statistical approach that quantifies the
tradeoffs between various decisions using
probabilities and costs that accompany such
decisions.
 Example: Patient has trouble breathing
– Decision: Asthma versus Lung cancer
– Decide lung cancer when person has
asthma
 Cost: moderately high (e.g., order
unnecessary tests, scare patient)
– Decide asthma when person has lung cancer
Decision Rules
 Progression of decision rules:
 – (1) Decide based on prior probabilities
 – (2) Decide based on posterior probabilities
 – (3) Decide based on risk
Fish Sorting Example Revisited
Decision based on prior probabilities
Question
 Consider a two-class problem, { c1 and c2 } where the prior
probabilities of the two classes are given by
 P(c1) = 0.7 and P(c2) = 0.3
 Design a classification rule for a pattern based only on prior
probabilities
 Calculation of Error Probability – P ( error )
Solution: with only the priors available, always decide c1 (the class with the larger prior); the probability of error is then P(error) = P(c2) = 0.3.
Decision based on class conditional probabilities
Posterior Probabilities
Bayes Formula
 Suppose the priors P(wj) and conditional densities p(x|wj) are
known,
$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$
posterior = (likelihood × prior) / evidence
Making a Decision
Probability of Error
Average probability of error: $P(error) = \int P(error \mid x)\,p(x)\,dx$, where $P(error \mid x) = \min[P(\omega_1 \mid x),\, P(\omega_2 \mid x)]$.
The Bayes decision rule minimizes this error because it chooses, for every x, the class with the larger posterior, making P(error|x) as small as possible at every point.
Example of the two regions R1 and R2 formed by the Bayesian classifier for the case of two equiprobable classes: the dotted line at x0 is a threshold partitioning the feature space into two regions, R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides ω1, and for all values in R2 it decides ω2. However, it is obvious from the figure that decision errors are unavoidable.
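A small Python sketch of this two-region picture with hypothetical Gaussian class-conditional densities and equal priors (the means, sigma, and priors are illustrative); for these parameters the threshold x0 falls midway between the two means:

```python
import math

def gauss_pdf(x, mu, sigma):
    # Normal density N(mu, sigma^2) evaluated at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two equiprobable classes: omega_1 ~ N(0, 1) and omega_2 ~ N(2, 1).
def decide(x, priors=(0.5, 0.5), means=(0.0, 2.0), sigma=1.0):
    post1 = gauss_pdf(x, means[0], sigma) * priors[0]
    post2 = gauss_pdf(x, means[1], sigma) * priors[1]
    return 1 if post1 >= post2 else 2  # region R1 or R2

print(decide(0.7), decide(1.5))  # 1 2  (threshold x0 = 1.0 here)
```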
The total probability, Pe, of committing a decision error is equal to the total shaded area under the curves in the figure.
Minimizing the Classification Error Probability
 Show that the Bayesian classifier is optimal with respect to
minimizing the classification error probability.
Generalized Bayesian Decision Theory

Mais conteúdo relacionado

Semelhante a Unit-2 Bayes Decision Theory.pptx

Novel set approximations in generalized multi valued decision information sys...
Novel set approximations in generalized multi valued decision information sys...Novel set approximations in generalized multi valued decision information sys...
Novel set approximations in generalized multi valued decision information sys...
Soaad Abd El-Badie
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
butest
 
09 logic programming
09 logic programming09 logic programming
09 logic programming
saru40
 
lecture 5 about lecture 5 about lecture lecture
lecture 5 about lecture 5 about lecture lecturelecture 5 about lecture 5 about lecture lecture
lecture 5 about lecture 5 about lecture lecture
anxiousanoja
 

Semelhante a Unit-2 Bayes Decision Theory.pptx (20)

Novel set approximations in generalized multi valued decision information sys...
Novel set approximations in generalized multi valued decision information sys...Novel set approximations in generalized multi valued decision information sys...
Novel set approximations in generalized multi valued decision information sys...
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
 
Classification
ClassificationClassification
Classification
 
Interpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptxInterpreting Logistic Regression.pptx
Interpreting Logistic Regression.pptx
 
Probability based learning (in book: Machine learning for predictve data anal...
Probability based learning (in book: Machine learning for predictve data anal...Probability based learning (in book: Machine learning for predictve data anal...
Probability based learning (in book: Machine learning for predictve data anal...
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
M03 nb-02
M03 nb-02M03 nb-02
M03 nb-02
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
09 logic programming
09 logic programming09 logic programming
09 logic programming
 
Data and donuts: Data Visualization using R
Data and donuts: Data Visualization using RData and donuts: Data Visualization using R
Data and donuts: Data Visualization using R
 
Fst ch3 notes
Fst ch3 notesFst ch3 notes
Fst ch3 notes
 
lecture 5 about lecture 5 about lecture lecture
lecture 5 about lecture 5 about lecture lecturelecture 5 about lecture 5 about lecture lecture
lecture 5 about lecture 5 about lecture lecture
 
Regression on gaussian symbols
Regression on gaussian symbolsRegression on gaussian symbols
Regression on gaussian symbols
 
R for Statistical Computing
R for Statistical ComputingR for Statistical Computing
R for Statistical Computing
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
 
Outrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph Databases
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the Weights
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the Weights
 
Decision Trees and Bayes Classifiers
Decision Trees and Bayes ClassifiersDecision Trees and Bayes Classifiers
Decision Trees and Bayes Classifiers
 

Último

怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
q6pzkpark
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
ptikerjasaptiker
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 

Último (20)

怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 

Unit-2 Bayes Decision Theory.pptx

  • 1.
  • 2. 2 Data Preprocessing  Data Preprocessing: An Overview  Data Quality  Major Tasks in Data Preprocessing  Data Cleaning  Data Integration  Data Reduction  Data Transformation and Data Discretization 1. Data collection parameters, features, variables • Errors will propagate If you have error in very beginning (1st step) 1st step  2nd  3rd  4th
  • 3. Data Quality: Why Preprocess the Data?  Measures for data quality: A multidimensional view  Accuracy: correct or wrong, accurate or not  Completeness: not recorded, unavailable, …  Consistency: some modified but some not, dangling, …  Timeliness: timely update?  Believability: how trustable the data are correct?  Interpretability: how easily the data can be understood?
  • 4. Major Tasks in Data Preprocessing  Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data integration  Integration of multiple databases, data cubes, or files  Data reduction  Dimensionality reduction  Numerosity reduction  Data compression  Data transformation and data discretization  Normalization  Concept hierarchy generation
  • 5. 5 Histogram Analysis  Divide data into buckets and store average (sum) for each bucket  Partitioning rules:  Equal-width: equal bucket range  Equal-frequency (or equal- depth) 0 5 10 15 20 25 30 35 40 10000 30000 50000 70000 90000 Interval of 10000
  • 6. 6 Correlation Analysis (Nominal Data)  Χ2 (chi-square) test  The larger the Χ2 value, the more likely the variables are related  The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count  Correlation does not imply causality  # of hospitals and # of car-theft in a city are correlated  Both are causally linked to the third variable: population    Expected Expected Observed 2 2 ) (  Features, attributes or variable
  • 7. 7 Chi-Square Calculation: An Example  Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories)  It shows that like_science_fiction and play_chess are correlated in the group 93 . 507 840 ) 840 1000 ( 360 ) 360 200 ( 210 ) 210 50 ( 90 ) 90 250 ( 2 2 2 2 2           Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 300 1200 1500
  • 8. 8 Correlation Analysis (Numeric Data)  Correlation coefficient (also called Pearson’s product moment coefficient) where n is the number of tuples, and are the respective means of A and B, σA and σB are the respective standard deviation of A and B, and Σ(aibi) is the sum of the AB cross-product.  If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger correlation.  rA,B = 0: independent; rAB < 0: negatively correlated B A n i i i B A n i i i B A n B A n b a n B b A a r     ) 1 ( ) ( ) 1 ( ) )( ( 1 1 ,            A B
  • 9. Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1. Age Height f1 f2
  • 10. Covariance (Numeric Data)  Covariance is similar to correlation where n is the number of tuples, and are the respective mean or expected values of A and B, σA and σB are the respective standard deviation of A and B.  Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values.  Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller than its expected value.  Independence: CovA,B = 0 but the converse is not true:  Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence A B Correlation coefficient:
  • 11. Co-Variance: An Example  It can be simplified in computation as  Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).  Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?  E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4  E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6  Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4  Thus, A and B rise together since Cov(A, B) > 0.
  • 12. The Normal Distribution X f(X)   Changing μ shifts the distribution left or right. Changing σ increases or decreases the spread. - Population - Samples - Mean - Standard Deviation
  • 13. The Normal Distribution: as mathematical function (pdf) 2 ) ( 2 1 2 1 ) (         x e x f Note constants: =3.14159 e=2.71828 This is a bell shaped curve with different centers and spreads depending on  and 
  • 14. **The beauty of the normal curve: No matter what  and  are, the area between -1 and +1 is about 68%; the area between -2 and +2 is about 95%; and the area between -3 and +3 is about 99.7%. Almost all values fall within 3 standard deviations.
  • 15. 68-95-99.7 Rule 68% of the data 95% of the data 99.7% of the data
  • 16. Example  Suppose SAT scores roughly follows a normal distribution in the U.S. population of college- bound students (with range restricted to 200-800), and the average math SAT is 500 with a standard deviation of 50, then:  68% of students will have scores between 450 and 550  95% will be between 400 and 600  99.7% will be between 350 and 650
  • 17. Basic Formulas for Probabilities • Product Rule : probability P(AB) of a conjunction of two events A and B: •Sum Rule: probability of a disjunction of two events A and B: •Theorem of Total Probability : if events A1, …., An are mutually exclusive with ) ( ) | ( ) ( ) | ( ) , ( A P A B P B P B A P B A P   ) ( ) ( ) ( ) ( AB P B P A P B A P     ) ( ) | ( ) ( 1 i n i i A P A B P B P   
  • 18. Basic Approach Bayes Rule: ) ( ) ( ) | ( ) | ( D P h P h D P D h P   P(h) = prior probability of hypothesis h  P(D) = prior probability of training data D  P(h|D) = probability of h given D (posterior density )  P(D|h) = probability of D given h (likelihood of D given h) The Goal of Bayesian Learning: the most probable hypothesis given the training data (Maximum A Posteriori hypothesis ) map h ) ( ) | ( max ) ( ) ( ) | ( max ) | ( max h P h D P D P h P h D P D h P h H h H h H h map       Null hypothesis Alternate hypothesis In which class I have to put my sample? Prediction (classification ) Class = [0 1]
  • 19. An Example Does patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer. ) ( ) ( ) | ( ) | ( ) ( ) ( ) | ( ) | ( 97 . ) | ( , 03 . ) | ( 02 . ) | ( , 98 . ) | ( 992 . ) ( , 008 . ) (                         P cancer P cancer P cancer P P cancer P cancer P cancer P cancer P cancer P cancer P cancer P cancer P cancer P
  • 20. MAP Learner For each hypothesis h in H, calculate the posterior probability ) ( ) ( ) | ( ) | ( D P h P h D P D h P  Output the hypothesis hmap with the highest posterior probability ) | ( max D h P h H h map   Comments: Computational intensive Providing a standard for judging the performance of learning algorithms Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task
  • 21. Bayes Optimal Classifier  Question: Given new instance x, what is its most probable classification?  Hmap(x) is not the most probable classification! Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3 Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = - What is the most probable classification of x ? Bayes optimal classification: ) | ( ) | ( max D h P h v P i H hj i j V vj    Example: P(h1| D) =.4, P(-|h1)=0, P(+|h1)=1 P(h2|D) =.3, P(-|h2)=1, P(+|h2)=0 P(h3|D)=.3, P(-|h3)=1, P(+|h3)=0 6 . ) | ( ) | ( 4 . ) | ( ) | (         D h P h P D h P h P i H hi i i H hi i
  • 22. Naïve Bayes Learner Assume target function f: X-> V, where each instance x described by attributes <a1, a2, …., an>. Most probable value of f(x) is: ) ( ) | .... , ( max ) .... , ( ) ( ) | .... , ( max ) .... , | ( max 2 1 2 1 2 1 2 1 j j n V vj n j j n V vj n j V vj v P v a a a P a a a P v P v a a a P a a a v P v       Naïve Bayes assumption: ) | ( ) | .... , ( 2 1 j i i j n v a P v a a a P   (attributes are conditionally independent) a1 (# persons) a2 (temp) … . an label
  • 23. Bayesian classification  The classification problem may be formalized using a-posteriori probabilities:  P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.  E.g. P(class=N | outlook=sunny,windy=true,…)  Idea: assign to sample X the class label C such that P(C|X) is maximal
  • 24. Estimating a-posteriori probabilities  Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X)  P(X) is constant for all classes  P(C) = relative freq of class C samples  C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum  Problem: computing P(X|C) is unfeasible!
  • 25. Naïve Bayesian Classification  Naïve assumption: attribute independence P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)  If i-th attribute is categorical: P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C  If i-th attribute is continuous: P(xi|C) is estimated thru a Gaussian density function  Computationally easy in both cases
  • 26. Naive Bayesian Classifier (II)  Given a training set, we can compute the probabilities O utlook P N H um idity P N sunny 2/9 3/5 high 3/9 4/5 overcast 4/9 0 norm al 6/9 1/5 rain 3/9 2/5 Tem preature W indy hot 2/9 2/5 true 3/9 3/5 m ild 4/9 2/5 false 6/9 2/5 cool 3/9 1/5
  • 27. Play-tennis example: estimating P(xi|C) Outlook Temperature Humidity Windy Class sunny hot high false N sunny hot high true N overcast hot high false P rain mild high false P rain cool normal false P rain cool normal true N overcast cool normal true P sunny mild high false N sunny cool normal false P rain mild normal false P sunny mild normal true P overcast mild high true P overcast hot normal false P rain mild high true N outlook P(sunny|p) = 2/9 P(sunny|n) = 3/5 P(overcast|p) = 4/9 P(overcast|n) = 0 P(rain|p) = 3/9 P(rain|n) = 2/5 temperature P(hot|p) = 2/9 P(hot|n) = 2/5 P(mild|p) = 4/9 P(mild|n) = 2/5 P(cool|p) = 3/9 P(cool|n) = 1/5 humidity P(high|p) = 3/9 P(high|n) = 4/5 P(normal|p) = 6/9 P(normal|n) = 2/5 windy P(true|p) = 3/9 P(true|n) = 3/5 P(false|p) = 6/9 P(false|n) = 2/5 P(p) = 9/14 P(n) = 5/14
  • 28. Example : Naïve Bayes Predict playing tennis in the day with the condition <sunny, cool, high, strong> (P(v| o=sunny, t= cool, h=high w=strong)) using the following training data: Day Outlook Temperature Humidity Wind Play Tennis 1 Sunny Hot High Weak No 2 Sunny Hot High Strong No 3 Overcast Hot High Weak Yes 4 Rain Mild High Weak Yes 5 Rain Cool Normal Weak Yes 6 Rain Cool Normal Strong No 7 Overcast Cool Normal Strong Yes 8 Sunny Mild High Weak No 9 Sunny Cool Normal Weak Yes 10 Rain Mild Normal Weak Yes 11 Sunny Mild Normal Strong Yes 12 Overcast Mild High Strong Yes 13 Overcast Hot Normal Weak Yes 14 Rain Mild High Strong No we have : 021 . ) | ( ) | ( ) | ( ) | ( ) ( 005 . ) | ( ) | ( ) | ( ) | ( ) (   n strong p n high p n cool p n sun p n p y strong p y high p y cool p y sun p y p tennise playing of days wind strong with tennise playing of days # #
  • 29. The independence hypothesis…  … makes computation possible  … yields optimal classifiers when satisfied  … but is seldom satisfied in practice, as attributes (variables) are often correlated.  Attempts to overcome this limitation:  Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes  Decision trees, that reason on one attribute at the time, considering most important attributes first
  • 30. Naïve Bayes Algorithm Naïve_Bayes_Learn (examples) for each target value vj estimate P(vj) for each attribute value ai of each attribute a estimate P(ai | vj ) Classify_New_Instance (x) ) | ( ) ( max j x a i V vj j v a P v P v i     Typical estimation of P(ai | vj) m n mp n v a P c j i    ) | ( Where n: examples with v=v; p is prior estimate for P(ai|vj) nc: examples with a=ai, m is the weight to prior
  • 31. K-Nearest-Neighbors Algorithm  K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (distance function)  KNN has been used in statistical estimation and pattern recognition since 1970’s.
  • 32. K-Nearest-Neighbors Algorithm  A case is classified by a majority voting of its neighbors, with the case being assigned to the class most common among its K nearest neighbors measured by a distance function.  If K=1, then the case is simply assigned to the class of its nearest neighbor
  • 34. Hamming Distance  For category variables, Hamming distance can be used.
  • 36. What is the most possible label for c? c
  • 37. What is the most possible label for c?  Solution: Looking for the nearest K neighbors of c.  Take the majority label as c’s label  Let’s suppose k = 3:
  • 38. What is the most possible label for c? c
  • 39. What is the most possible label for c?  The 3 nearest points to c are: a, a and o.  Therefore, the most possible label for c is a.
  • 40. Nearest Neighbour Rule Non-parametric pattern classification. Consider a two class problem where each sample consists of two measurements (x,y). k = 1 k = 3 For a given query point q, assign the class of the nearest neighbour. Compute the k nearest neighbours and assign the class by majority vote.
  • 41. Nearest Neighbour Issues  Expensive  To determine the nearest neighbour of a query point q, must compute the distance to all N training examples + Pre-sort training examples into fast data structures (kd-trees) + Compute only an approximate distance (LSH) + Remove redundant data (condensing)  Storage Requirements  Must store all training data P + Remove redundant data (condensing) - Pre-sorting often increases the storage requirements  High Dimensional Data  “Curse of Dimensionality”  Required amount of training data increases exponentially with dimension  Computational cost also increases dramatically  Partitioning techniques degrade to linear search in high dimension
  • 42. Decision theory  Decision theory is the study of making decisions that have a significant impact  Decision-making is distinguished into:  Decision-making under certainty  Decision-making under non-certainty  Decision-making under risk  Decision-making under uncertainty
  • 43. Probability theory  Most decisions have to be taken in the presence of uncertainty  Probability theory quantifies uncertainty regarding the occurrence of events or states of the world  Basic elements of probability theory:  Random variables describe aspects of the world whose state is initially unknown  Each random variable has a domain of values that it can take on (discrete, boolean, continuous)  An atomic event is a complete specification of the state of the world, i.e. an assignment of values to variables of which the world is composed
  • 44. Probability Theory..  Probability space  The sample space S={e1 ,e2 ,…,en } which is a set of atomic events  Probability measure P which assigns a real number between 0 and 1 to the members of the sample space  Axioms  All probabilities are between 0 and 1  The sum of probabilities for the atomic events of a probability space must sum up to 1  The certain event S (the sample space itself)
  • 45. Prior  Priori Probabilities or Prior reflects our prior knowledge of how likely an event occurs.  In the absence of any other information, a random variable is assigned a degree of belief called unconditional or prior probability
  • 46. Class Conditional probability  When we have information concerning previously unknown random variables then we use posterior or conditional probabilities: P(a|b) the probability of a given event a that we know b  Alternatively this can be written (the product rule): P(a b)=P(a|b)P(b)  ) ( ) ( ) | ( b P b a P b a P  
  • 47. Bayes’ rule  The product rule can be written as:  P(a b)=P(a|b)P(b)  P(a b)=P(b|a)P(a)  By equating the right-hand sides:  This is known as Bayes’ rule   ) ( ) ( ) | ( ) | ( a P b P b a P a b P 
  • 48. Bayesian Decision Theory  Bayesian Decision Theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions using probabilities and costs that accompany such decisions.  Example: Patient has trouble breathing – Decision: Asthma versus Lung cancer – Decide lung cancer when person has asthma  Cost: moderately high (e.g., order unnecessary tests, scare patient) – Decide asthma when person has lung
  • 49. Decision Rules  Progression of decision rules:  – (1) Decide based on prior probabilities  – (2) Decide based on posterior probabilities  – (3) Decide based on risk
  • 50. Fish Sorting Example Revisited
  • 51. Decision based on prior probabilities
  • 52.
  • 53. Question  Consider a two-class problem, { c1 and c2 } where the prior probabilities of the two classes are given by  P ( c1 ) = ⋅7 and P ( c2 ) = ⋅3  Design a classification rule for a pattern based only on prior probabilities  Calculation of Error Probability – P ( error )
  • 55. Decision based on class conditional probabilities
  • 57. Bayes Formula  Suppose the priors P(wj) and conditional densities p(x|wj) are known, ( | ) ( ) ( | ) ( ) j j j p x P P x p x     posterior likelihood prior evidence
  • 59. Probability of Error Average probability of error P(error ) Bayes decision rule minimizes this error because
  • 60.  The dotted line at x0 is a threshold partitioning the feature  space into two regions,R1 and R2. According to the Bayes decision rule,for all values  of x in R1 the classifier decides 1 and for all values in R2 it decides 2. However,  it is obvious from the figure that decision errors are unavoidable. Example of the two regions R1 and R2 formed by the Bayesian classifier for the case of two equiprobable classes. The dotted line at x0 is a threshold partitioning the feature space into two regions,R1 and R2. According to the Bayes decision rule, for all values of x in R1 the classifier decides 1 and for all values in R2 it decides 2. However, it is obvious from the figure that decision errors are unavoidable.
  • 61. total probability,Pe,of committing a decision error  which is equal to the total shaded area under the curves in Figure
  • 62. Minimizing the Classification Error Probability  Show that the Bayesian classifier is optimal with respect to minimizing the classification error probability.