1. ASU-CSC446 : Pattern Recognition. Prof. Dr. Mostafa Gadal-Haqq slide - 1
Chapter 4:
Nonparametric Techniques
(Study Chapter 4, Sections: 4.1 – 4.4)
CSC446 : Pattern Recognition
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
2. 4-1. Introduction
4-2. Density Estimation
4-3. Parzen windows
4-3.1. Classification Example
4-3.2. Probabilistic Neural Networks (PNN)
4-4. k-NN Method
4-4.1 Metrics and k-NN Classification
Nonparametric Techniques
3. Density Estimation: Introduction
• In Chapter 3, we treated supervised learning under
the assumption that the forms of the underlying
densities were known. In most PR applications this
assumption is suspect.
• All of the classical parametric densities are
unimodal (have a single local maximum), whereas
many practical problems involve multimodal
densities.
4. Density Estimation: Introduction
• Nonparametric procedures can be used with
arbitrary distributions and without the
assumption that the forms of the underlying
densities are known.
• There are several types of nonparametric methods,
two of them are of interest:
– Estimating the density function p(x | ωj).
– Bypassing density estimation and directly
estimating the a posteriori probability P(ωj | x).
5. Density Estimation: Basic idea
• We need to estimate the density (likelihood) of each
category at the test point x.
• We expect p(x) to be given by the formula:
p(x) ≈ (k/n) / V
where k of the n samples fall in a region of volume V around x.
6. Density Estimation: Basic idea
•The Probability P that a vector x falls in a region R
is:
•If we have a sample of size n, the probability that k
of them fall in R is :
and the expected value for k is:
• As expected, the ratio k/n is a good estimate of the
probability P, and hence of p(x), when n is large.
P = ∫_R p(x′) dx′        (1)
P_k = C(n, k) · P^k (1 − P)^(n−k)        (2)
E[k] = nP   ⇒   k/n ≈ P        (3)
8. Density Estimation: Basic idea
• Assume p(x) is continuous and that the region R is
so small that p(x) does not vary significantly within
it; then we can write:
P = ∫_R p(x′) dx′ ≈ p(x) · V        (4)
where V is the volume enclosed by R. Then,
combining eq. (1) and eq. (4) yields:
p(x) ≈ (k/n) / V
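The counting estimate above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n samples from a uniform density on [0, 1), where the true p(x) = 1.
n = 100_000
samples = rng.random(n)

# Estimate p(x) at x = 0.5: count the k samples falling in a small
# region R of volume (length) V centered at x, then use p(x) = (k/n)/V.
x, V = 0.5, 0.02
k = np.sum(np.abs(samples - x) < V / 2)
p_hat = (k / n) / V
print(p_hat)  # close to the true value 1.0
```

As the slides note, the estimate only becomes reliable for large n, and V cannot be made arbitrarily small without increasing the variance of k/n.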
9. • Several problems arise for the estimate k/(nV), some
practical, and some theoretical.
• Practical Standpoints:
– If we fix V and take more samples (increase n), we get a
space-averaged value of p(x); therefore, to obtain p(x) we
must let V approach zero.
– However, if we fix n and let V shrink, we may get p(x) ≈ 0,
which is useless; or, if by chance one or more samples
coincide with x, p(x) ≈ ∞.
• Practically, then, V cannot be allowed to become arbitrarily
small: we must accept some variance in
k/n and some averaging in p(x).
Density Estimation: Convergence
10. Density Estimation: Convergence
• Theoretical standpoints: to estimate p(x) we
proceed as follows:
– We form a sequence of regions R1, R2, …, with one
sample in R1, two in R2, and so on.
– Let Vn be the volume of Rn, kn the number of samples falling
in Rn, and pn(x) the nth estimate of p(x):
pn(x) = (kn/n) / Vn
– If pn(x) is to converge to p(x), three conditions are required:
1) lim(n→∞) Vn = 0,   2) lim(n→∞) kn = ∞,   3) lim(n→∞) kn/n = 0
11. Density Estimation: Convergence
• Theoretical standpoints:
• First condition: assures us that the space average
P/V will converge to p(x).
• Second condition: assures us that the frequency
ratio k/n will converge to the probability P.
• Third condition: assures the convergence of pn(x) .
1) lim(n→∞) Vn = 0,   2) lim(n→∞) kn = ∞,   3) lim(n→∞) kn/n = 0
12. • There are two different ways of obtaining sequences
of regions that satisfy these conditions:
– The Parzen-window method:
• Shrink an initial region, e.g. taking Vn = 1/√n, and show
that pn(x) → p(x).
– The kn-nearest-neighbor method:
• Specify kn as some function of n, such as kn = √n; the
volume Vn is grown until it encloses kn neighbors of x.
Density Estimation: Implementation
14. Parzen Window Method
Non Parametric Density Estimation
15. Density Estimation: Parzen Windows
• Parzen-window approach to estimate densities
assume that the region Rn is a d-dimensional
hypercube,
• Let φ(u) be a window function of the form:
φ(u) = 1 if |u_j| ≤ 1/2, j = 1, …, d;  0 otherwise.
• That is, φ(u) defines a unit hypercube, and φ((x − xi)/hn) equals
unity if xi falls within a hypercube of volume Vn
centered at x, and zero otherwise, where
Vn = hn^d   (hn: length of the edge of Rn).
16. • The number of samples in this hypercube is:
• By substituting kn in pn(x), we obtain the following
estimate:
• pn(x) estimates p(x) as an average of functions of x
and the samples xi ; i = 1,… ,n. These functions
can be general!
kn = Σ(i=1..n) φ((x − xi)/hn)
pn(x) = (1/n) Σ(i=1..n) (1/Vn) φ((x − xi)/hn)
Density Estimation: Parzen Windows
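A direct implementation of this hypercube estimate can be sketched as follows (an illustration with names of my own choosing, not code from the slides):

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen-window estimate with a d-dimensional hypercube window of
    edge length h, so V = h**d and phi(u) = 1 iff every |u_j| <= 1/2."""
    samples = np.atleast_2d(samples)            # shape (n, d)
    n, d = samples.shape
    u = (x - samples) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi evaluated at each sample
    return inside.sum() / (n * h ** d)          # (1/n) * sum phi / V

rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 1))           # samples from N(0, 1)
print(parzen_estimate(np.array([0.0]), data, h=0.5))  # near the N(0,1) peak
```

The choice of h trades resolution against variance, exactly as the slides discuss for V.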
17. • Illustrating the behavior of Parzen-window:
• Consider p(x) ~ N(0, 1).
– Let φ(u) = (1/√(2π)) e^(−u²/2)
and hn = h1/√n,
– where h1 is a parameter of our choice; thus
pn(x) = (1/n) Σ(i=1..n) (1/hn) φ((x − xi)/hn)
is an average of normal densities centered at the samples xi.
Density Estimation: Parzen Windows
18. Numerical Example:
– For n = 1 and h1=1, then
is a single normal density centered about the
first sample x1.
– For n = 10 and h1 = 0.1, the contributions of the
individual samples are clearly noticeable !
– It is clear that many samples are required to have
an accurate estimate.
p1(x) = φ(x − x1) = (1/√(2π)) e^(−(x − x1)²/2)  ~  N(x1, 1)
Density Estimation: Parzen Windows
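The Gaussian-window estimate in this example can be sketched as follows (an assumed setup mirroring the formulas above; names are my own):

```python
import numpy as np

def gaussian_parzen(x, samples, h1):
    """Parzen estimate with Gaussian window phi(u) = exp(-u**2/2)/sqrt(2*pi)
    and width h_n = h1/sqrt(n): an average of normal densities centered
    at the samples x_i."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - np.asarray(samples)) / hn
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(phi / hn)

# n = 1, h1 = 1: a single normal density centered at x1, as on the slide.
print(gaussian_parzen(0.3, [0.3], h1=1.0))  # peak value 1/sqrt(2*pi)

rng = np.random.default_rng(2)
data = rng.standard_normal(1000)            # p(x) ~ N(0, 1)
print(gaussian_parzen(0.0, data, h1=1.0))   # approaches the true p(0)
```

With few samples the individual Gaussian bumps are clearly visible; only for large n does the average smooth out, as the slides emphasize.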
21. Analogous results are also obtained in two dimensions as
illustrated:
Density Estimation: Parzen Windows
23. • Case where p(x) = λ1·U(a, b) + λ2·T(c, d) (unknown
density: a mixture of a uniform and a triangle density)
Density Estimation: Parzen Windows
25. Classification example:
• We estimate the densities for each category and
classify a test point by the label corresponding to the
maximum posterior.
• It is clear that many samples are required to have an
accurate estimate.
• The decision region for a Parzen-window classifier
depends upon the choice of window function as
illustrated in the previous figures.
Density Estimation: Parzen Windows
26. • In general, the training error (the empirical error on
the training points themselves) can be made
arbitrarily low by making the window width
sufficiently small.
• However, a low training error does not guarantee a
small test error.
Density Estimation: Parzen Windows
28. • These examples illustrate some of the power and
some of the limitations of nonparametric methods.
• Their power resides in their generality. Exactly the
same procedure was used for the unimodal normal
case and the bimodal mixture case.
• On the other hand, the number of samples needed
may be very large indeed — much greater than would
be required if we knew the form of the unknown
density.
Density Estimation: Conclusion
29. Probabilistic Neural Networks
• Most PR methods can be implemented in a
parallel fashion that trades space complexity
for time complexity.
• A parallel implementation of the Parzen-window
method is known as the Probabilistic Neural
Network (PNN).
30. Probabilistic Neural Networks
• Suppose we wish to form a Parzen estimate
based on n patterns, each of which is d-
dimensional, randomly sampled from c
classes.
• The PNN for this case consists of:
– d input units comprising the input layer,
– each input unit connected to each of
the n pattern units,
– each pattern unit connected to one and
only one of the c category (output) units.
32. • Algorithm 1: Training the PNN network
1. Normalize each pattern x of the training set to unity.
2. Place the first training pattern on the input units.
3. Set the weights linking the input units and the first
pattern units such that: w1 = x1 .
4. Make a single connection from the first pattern unit to
the category unit corresponding to the known class of
that pattern.
5. Repeat the process for all remaining training patterns by
setting the weights such that wk = xk (k = 1, 2, …, n).
Probabilistic Neural Networks
33. Probabilistic Neural Networks
Algorithm 1: PNN Training
begin initialize j ← 0, aji ← 0 (j = 1, 2, …, n; i = 1, 2, …, c)
   do j ← j + 1
      xjk ← xjk / (Σ(i=1..d) xji²)^(1/2)   (normalize xj; k = 1, 2, …, d)
      wjk ← xjk   (train)
      if xj ∈ ωi then aji ← 1
   until j = n
end
34. • Algorithm 2: Testing the PNN network
1. Normalize the test pattern x and place it at the input
units.
2. Each pattern unit computes the inner product to
yield the net activation
netk = wk^t · x
and emits a nonlinear function of it:
φ(netk) = exp[(netk − 1)/σ²]
3. Each output unit sums the contributions from all pattern
units connected to it:
Pn(x | ωj) ∝ Σ(i : aij = 1) exp[(neti − 1)/σ²]
4. Classify by selecting the maximum value of Pn(x | ωj).
Probabilistic Neural Networks
35. Probabilistic Neural Networks
σ is a free parameter
Algorithm 2: PNN Classification
begin initialize j ← 0, gi ← 0, x ← test pattern
   do j ← j + 1
      netj ← wj^t x
      if aji = 1 then gi ← gi + exp[(netj − 1)/σ²]
   until j = n
   return class ← arg max_i gi(x)
end
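Putting the training and classification algorithms together, a minimal NumPy sketch (function names and the value of σ are my own choices, not from the slides):

```python
import numpy as np

def pnn_train(X, labels):
    """Algorithm 1: normalize each training pattern to unit length and
    store it as the weight vector of its pattern unit (w_k = x_k)."""
    W = X / np.linalg.norm(X, axis=1, keepdims=True)
    return W, np.asarray(labels)

def pnn_classify(W, labels, x, sigma=0.5):
    """Algorithm 2: net_j = w_j^t x, activation exp((net_j - 1)/sigma**2),
    summed per category; return the category with the largest sum."""
    x = x / np.linalg.norm(x)
    act = np.exp((W @ x - 1.0) / sigma ** 2)
    classes = np.unique(labels)
    g = np.array([act[labels == c].sum() for c in classes])
    return classes[np.argmax(g)]

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])
W, lab = pnn_train(X, y)
print(pnn_classify(W, lab, np.array([1.0, 0.0])))  # assigned to class 0
```

Since every pattern has unit length, net_j ≤ 1 with equality only when the test pattern coincides with a stored one, so each activation is a Parzen-style Gaussian contribution.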
• Motivation: a solution for the problem of the
unknown “best” window function.
– Solution: Let the cell volume be a function of the
training data; the value of kn .
• kn-NN procedure:
– Center a cell about x and let it grow until it
captures kn samples (e.g., kn = √n).
– These samples are called the kn nearest neighbors of x.
The kn–Nearest-Neighbor Estimation
The kn–Nearest-Neighbor Estimation
Figure 4.10: Eight points in one dimension and the k-nearest-
neighbor density estimates, for k = 3 and 5. Note especially that
the discontinuities in the slopes in the estimates generally occur
away from the positions of the points themselves.
pn(x) = (kn/n) / Vn
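In one dimension this estimate can be sketched as follows (a toy illustration with names of my own):

```python
import numpy as np

def knn_density(x, samples, k):
    """k_n-nearest-neighbor estimate: grow an interval around x until it
    captures k samples, then p(x) = (k/n)/V with V the interval length."""
    n = len(samples)
    dist = np.sort(np.abs(np.asarray(samples) - x))
    r = dist[k - 1]            # distance to the k-th nearest neighbor
    V = 2 * r                  # 1-D "volume": an interval of radius r
    return (k / n) / V

rng = np.random.default_rng(3)
data = rng.random(10_000)      # uniform on [0, 1): true p(x) = 1
print(knn_density(0.5, data, k=100))  # close to 1.0
```

Because the k-th neighbor always sits on the cell boundary, the estimate is continuous but has discontinuous slope, as Figure 4.10 shows.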
• Two possibilities can occur:
– If the density is high near x, the cell will be
small, which provides good resolution.
– If the density is low near x, the cell will grow
large, stopping only when it reaches regions of
higher density.
• We can obtain a family of estimates by setting
kn = k1√n and choosing different values for k1.
The kn–Nearest-Neighbor Estimation
The kn–Nearest-Neighbor Estimation
• kn-nearest-neighbors vs. Parzen-Window
For n = 1, kn = √n = 1; the estimate becomes:
pn(x) = (kn/n) / Vn = 1/V1 = 1/(2|x − x1|),
which is a poor estimate for small n, and gets better
only for large n.
That is: for small n the Parzen window outperforms the
kn-nearest-neighbor estimate.
• Estimation of the a posteriori probabilities
– Goal: estimate P(ωi | x) from a set of n labeled
samples.
– Place a cell of volume V around x that captures
k samples.
– If ki samples amongst the k turn out to be labeled ωi,
the joint density estimate is:
pn(x, ωi) = (ki/n) / V
– The estimate for Pn(ωi | x) is then:
Pn(ωi | x) = pn(x, ωi) / Σ(j=1..c) pn(x, ωj) = ki / k
The Non-Parametric Density Estimation
• Classify x by assigning it the label most
frequently represented among the k-nearest
samples and use a voting scheme.
The k-Nearest–Neighbor rule
In this example,
x should be assigned
the label of the black
samples.
• ki/k is the fraction of the samples within the cell
that are labeled ωi.
• For minimum error rate, the most frequently
represented category within the cell is selected.
• If k is large and the cell sufficiently small, the
performance will approach the best possible.
The kn–Nearest-Neighbor Estimation
• Let Dn = {x1, x2, …, xn} be a set of n prototypes.
• Let x′ ∈ Dn be the closest prototype to a test point
x; then the nearest-neighbor rule for classifying x
is to assign it the label associated with x′.
– The NN rule leads to an error rate greater than
the minimum possible Bayes error rate .
– If the number of prototypes is large (unlimited), the error
rate of the nearest-neighbor classifier is never worse than
twice the Bayes rate (proof not required: Sec. 4.5.3).
The Nearest–Neighbor rule
• If n is very large, it is reasonable to assume
that x′ is sufficiently close to x, so that:
P(ωi | x) ≈ P(ωi | x′)
• Voronoi tessellation:
– The NN rule partitions the feature
space into cells, each consisting of all points closer to a
given prototype x′ than to any other prototype; all points
in such a cell are given the label of x′.
The Nearest –Neighbor rule
The NN rule: Voronoi Tessellation
• The k-NN classifier relies on a metric or distance
function.
• A metric D(., .) is a function that gives a generalized
scalar distance between two arguments (patterns).
• A metric must have four properties:
– Non-negativity: D(a, b) ≥ 0.
– Reflexivity: D(a, b) = 0 if and only if a = b.
– Symmetry: D(a, b) = D(b, a).
– Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c).
Metrics and k-NN Classification
Metrics and k-NN Classification
• The most popular metric functions (in d
dimensions):
– The Euclidean distance:
D(a, b) = (Σ(k=1..d) (ak − bk)²)^(1/2)
– The Minkowski distance
or Lk norm:
Lk(a, b) = (Σ(i=1..d) |ai − bi|^k)^(1/k)
– The Manhattan or
city-block distance:
C(a, b) = Σ(i=1..d) |ai − bi|
• Note that the L2 and L1 norms give the Euclidean and
Manhattan metrics, respectively.
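All three metrics can be computed with one Minkowski routine (a small sketch; the function name is my own):

```python
import numpy as np

def minkowski(a, b, k):
    """Minkowski distance L_k(a, b) = (sum_i |a_i - b_i|**k)**(1/k).
    k = 2 gives the Euclidean metric, k = 1 the Manhattan metric."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 2))  # Euclidean (L2): 5.0
print(minkowski(a, b, 1))  # Manhattan / city block (L1): 7.0
```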
Metrics and k-NN Classification
• The Minkowski distance for different values of k:
Metrics and k-NN Classification
• The Euclidean distance has the drawback that it is
sensitive to transformations (translation,
rotation, and scaling).
Figure 4.20: The uncritical use of Euclidean
metric cannot address the problem of
translation invariance. Pattern x represents a
handwritten 5, and x(s = 3) the same shape
but shifted three pixels to the right. The
Euclidean distance D(x, x(s = 3)) is much
larger than D(x, x8), where x8 represents the
handwritten 8. Nearest-neighbor
classification based on the Euclidean distance
in this way leads to very large errors. Instead,
we seek a distance measure that would be
insensitive to such translations, or indeed
other known invariance, such as scale or
rotation.
shifted 3 pixels
Metrics and k-NN Classification
• The tangent distance is a more general metric that
accounts for invariance: during training, we build tangent
vectors TVi for all the possible transformations
Fi(x′; αi). The tangent vectors take the form:
TVi = Fi(x′; αi) − x′
• To get the distance between x and x′, we use the tangent
distance:
D_tan(x′, x) = min_a ||(x′ + T a) − x||
• where the columns of T are the tangent vectors TVi, and
a is obtained by minimizing D.
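The minimization over a is a linear least-squares problem, so a one-sided tangent distance can be sketched as follows (a toy example with a hypothetical tangent vector, not the slides' image setup):

```python
import numpy as np

def tangent_distance(x_prime, x, T):
    """One-sided tangent distance min_a ||(x' + T a) - x||, where the
    columns of T are tangent vectors at x'. The optimal a is the
    least-squares solution of T a = x - x'."""
    a, *_ = np.linalg.lstsq(T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T @ a - x)

# One tangent vector along the first axis: displacements along that axis
# cost nothing, while displacements along the second axis are penalized.
x_prime = np.array([0.0, 0.0])
T = np.array([[1.0], [0.0]])
print(tangent_distance(x_prime, np.array([5.0, 0.0]), T))  # ~ 0.0
print(tangent_distance(x_prime, np.array([0.0, 2.0]), T))  # 2.0
```

This is what makes the metric insensitive to the modeled transformations: a shifted "5" lies along a translation tangent vector, so its tangent distance stays small.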
• If we have n prototypes in d dimensions, and we seek the
closest prototype to a test point x (k = 1).
• we calculate the distance, O (d ), to x from each prototype
point, O(n ), Thus total is O (dn ).
• Three algorithms can be used to reduce complexity:
– Computing partial distances: compute the distance using
only a subset r of the full d dimensions, abandoning a
prototype once the partial distance exceeds the best
found so far.
– Prestructuring: create a search tree in which prototypes
are selectively linked.
– Editing (also known as pruning or condensing): eliminate
useless prototypes during the search.
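The partial-distance idea, for instance, can be sketched as (an illustrative implementation, not from the slides):

```python
import numpy as np

def partial_distance_nn(x, prototypes):
    """Nearest neighbor with partial distances: accumulate the squared
    distance one dimension at a time and abandon a prototype as soon
    as its running sum already exceeds the best full distance so far."""
    best, best_i = np.inf, -1
    for i, p in enumerate(prototypes):
        s = 0.0
        for xj, pj in zip(x, p):
            s += (xj - pj) ** 2
            if s >= best:        # prune: this prototype cannot win
                break
        else:                    # completed all d dimensions
            best, best_i = s, i
    return best_i

protos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(partial_distance_nn(np.array([0.9, 1.2]), protos))  # index 1 is closest
```

The worst case is still O(dn), but pruning typically skips most coordinates of far-away prototypes.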
Computational Complexity of the k-NN
Consider the prototypes in the table, the test point
x = (0.10, 0.25), and k = 3 (odd to avoid ties),
• The closest k vectors to x are:
(0.10, 0.28) → ω2;
(0.12, 0.20) → ω2;
(0.09, 0.30) → ω5.
• Then the voting scheme assigns the label ω2 to x,
since ω2 is the most frequently represented.

Prototypes      Labels
(0.15, 0.35)    ω1
(0.10, 0.28)    ω2
(0.09, 0.30)    ω5
(0.12, 0.20)    ω2
Exercise: The k-NN rule
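The exercise can be checked with a short k-NN routine (a sketch; the function name is my own):

```python
import numpy as np

# Prototypes and labels from the exercise table.
protos = np.array([[0.15, 0.35], [0.10, 0.28], [0.09, 0.30], [0.12, 0.20]])
labels = np.array([1, 2, 5, 2])

def knn_label(x, protos, labels, k=3):
    """k-NN rule: vote among the labels of the k closest prototypes."""
    d = np.linalg.norm(protos - x, axis=1)
    nearest = labels[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

print(knn_label(np.array([0.10, 0.25]), protos, labels))  # label 2, as above
```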
Next Time
Discriminant Functions
and
Neural Networks