1. ASU-CSC446 : Pattern Recognition. Prof. Dr. Mostafa Gadal-Haqq slide - 1
Chapter 4:
Nonparametric Techniques
(Study Chapter 4, Sections: 4.1 – 4.4)
CSC446 : Pattern Recognition
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
2. 4-1. Introduction
4-2. Density Estimation
4-3. Parzen windows
4-3.1. Classification Example
4-3.2. Probabilistic Neural Networks (PNN)
4-4. k-NN Method
4-4.1 Metrics and k-NN Classification
Nonparametric Techniques
3. Density Estimation: Introduction
• In Chapter 3, we treated supervised learning under
the assumption that the forms of the underlying
densities were known. In most PR applications this
assumption is suspect.
• All of the classical parametric densities are
unimodal (have a single local maximum), whereas
many practical problems involve multimodal
densities.
4. Density Estimation: Introduction
• Nonparametric procedures can be used with
arbitrary distributions and without the
assumption that the forms of the underlying
densities are known.
• There are several types of nonparametric methods,
two of them are of interest:
– Estimating the density function p(x | ωj).
– Bypassing density estimation and directly
estimating the a posteriori probability P(ωj | x).
5. Density Estimation: Basic idea
• We need to estimate the density (likelihood) of each
category at the test point x.
• We expect p(x) to be given by the formula:
p(x) ≈ (k/n) / V
where k of the n samples fall in a region of volume V around x.
6. Density Estimation: Basic idea
•The Probability P that a vector x falls in a region R
is:
•If we have a sample of size n, the probability that k
of them fall in R is :
and the expected value for k is:
• As expected, the ratio k/n is a good estimate of the
probability P, and hence of p(x), when n is large.
P = ∫_R p(x′) dx′        (1)
P_k = C(n, k) · P^k (1 − P)^(n−k)        (2)
E[k] = nP   ⇒   k/n ≈ P        (3)
8. Density Estimation: Basic idea
• Assume p(x) is continuous and that the region R is
so small that p(x) does not vary significantly within
it; then we can write:
P = ∫_R p(x′) dx′ ≈ p(x) · V        (4)
where V is the volume enclosed by R. Then,
combining eq. (1) and eq. (4) yields:
p(x) ≈ (k/n) / V
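The counting estimate above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n samples from a uniform density on [0, 1), where the true p(x) = 1.
n = 100_000
samples = rng.random(n)

# Estimate p(x) at x = 0.5: count the k samples falling in a small
# region R of volume (length) V centered at x, then use p(x) = (k/n)/V.
x, V = 0.5, 0.02
k = np.sum(np.abs(samples - x) < V / 2)
p_hat = (k / n) / V
print(p_hat)  # close to the true value 1.0
```

As the slides note, the estimate only becomes reliable for large n, and V cannot be made arbitrarily small without increasing the variance of k/n.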
9. • Several problems arise for the estimate k/(nV), some
practical, and some theoretical.
• Practical Standpoints:
– If we fix V and take more samples (increase n), we get a
space-averaged value of p(x); therefore, to obtain p(x) we
must let V approach zero.
– However, if we fix n and let V shrink, we may get p(x) ≈ 0,
which is useless; or, if by chance one or more samples
coincide with x, p(x) ≈ ∞.
• Practically, then, V cannot be allowed to become arbitrarily
small: we must accept some variance in
k/n and some averaging in p(x).
Density Estimation: Convergence
10. Density Estimation: Convergence
• Theoretical standpoints: to estimate p(x) we
proceed as follows:
– We form a sequence of regions R1, R2, …, with one
sample in R1, two in R2, and so on.
– Let Vn be the volume of Rn, kn the number of samples falling
in Rn, and pn(x) the nth estimate of p(x):
pn(x) = (kn/n) / Vn
– If pn(x) is to converge to p(x), three conditions are required:
1) lim(n→∞) Vn = 0,   2) lim(n→∞) kn = ∞,   3) lim(n→∞) kn/n = 0
11. Density Estimation: Convergence
• Theoretical standpoints:
• First condition: assures us that the space average
P/V will converge to p(x).
• Second condition: assures us that the frequency
ratio k/n will converge to the probability P.
• Third condition: assures the convergence of pn(x) .
1) lim(n→∞) Vn = 0,   2) lim(n→∞) kn = ∞,   3) lim(n→∞) kn/n = 0
12. • There are two different ways of obtaining sequences
of regions that satisfy these conditions:
– The Parzen-window method:
• Shrink an initial region, e.g. taking Vn = 1/√n, and show
that pn(x) → p(x).
– The kn-nearest-neighbor method:
• Specify kn as some function of n, such as kn = √n; the
volume Vn is grown until it encloses kn neighbors of x.
Density Estimation: Implementation
14. Parzen Window Method
Non Parametric Density Estimation
15. Density Estimation: Parzen Windows
• Parzen-window approach to estimate densities
assume that the region Rn is a d-dimensional
hypercube,
• Let φ(u) be a window function of the form:
φ(u) = 1 if |u_j| ≤ 1/2, j = 1, …, d;  0 otherwise.
• That is, φ(u) defines a unit hypercube, and φ((x − xi)/hn) equals
unity if xi falls within a hypercube of volume Vn
centered at x, and zero otherwise, where
Vn = hn^d   (hn: length of the edge of Rn).
16. • The number of samples in this hypercube is:
• By substituting kn in pn(x), we obtain the following
estimate:
• pn(x) estimates p(x) as an average of functions of x
and the samples xi ; i = 1,… ,n. These functions
can be general!
kn = Σ(i=1..n) φ((x − xi)/hn)
pn(x) = (1/n) Σ(i=1..n) (1/Vn) φ((x − xi)/hn)
Density Estimation: Parzen Windows
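A direct implementation of this hypercube estimate can be sketched as follows (an illustration with names of my own choosing, not code from the slides):

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen-window estimate with a d-dimensional hypercube window of
    edge length h, so V = h**d and phi(u) = 1 iff every |u_j| <= 1/2."""
    samples = np.atleast_2d(samples)            # shape (n, d)
    n, d = samples.shape
    u = (x - samples) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi evaluated at each sample
    return inside.sum() / (n * h ** d)          # (1/n) * sum phi / V

rng = np.random.default_rng(1)
data = rng.standard_normal((5000, 1))           # samples from N(0, 1)
print(parzen_estimate(np.array([0.0]), data, h=0.5))  # near the N(0,1) peak
```

The choice of h trades resolution against variance, exactly as the slides discuss for V.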
17. • Illustrating the behavior of Parzen-window:
• Consider p(x) ~ N(0, 1).
– Let φ(u) = (1/√(2π)) e^(−u²/2)
and hn = h1/√n,
– where h1 is a parameter of our choice; thus
pn(x) = (1/n) Σ(i=1..n) (1/hn) φ((x − xi)/hn)
is an average of normal densities centered at the samples xi.
Density Estimation: Parzen Windows
18. Numerical Example:
– For n = 1 and h1=1, then
is a single normal density centered about the
first sample x1.
– For n = 10 and h1 = 0.1, the contributions of the
individual samples are clearly noticeable !
– It is clear that many samples are required to have
an accurate estimate.
p1(x) = φ(x − x1) = (1/√(2π)) e^(−(x − x1)²/2)  ~  N(x1, 1)
Density Estimation: Parzen Windows
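The Gaussian-window estimate in this example can be sketched as follows (an assumed setup mirroring the formulas above; names are my own):

```python
import numpy as np

def gaussian_parzen(x, samples, h1):
    """Parzen estimate with Gaussian window phi(u) = exp(-u**2/2)/sqrt(2*pi)
    and width h_n = h1/sqrt(n): an average of normal densities centered
    at the samples x_i."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - np.asarray(samples)) / hn
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(phi / hn)

# n = 1, h1 = 1: a single normal density centered at x1, as on the slide.
print(gaussian_parzen(0.3, [0.3], h1=1.0))  # peak value 1/sqrt(2*pi)

rng = np.random.default_rng(2)
data = rng.standard_normal(1000)            # p(x) ~ N(0, 1)
print(gaussian_parzen(0.0, data, h1=1.0))   # approaches the true p(0)
```

With few samples the individual Gaussian bumps are clearly visible; only for large n does the average smooth out, as the slides emphasize.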
21. Analogous results are also obtained in two dimensions as
illustrated:
Density Estimation: Parzen Windows
23. • Case where p(x) = λ1·U(a, b) + λ2·T(c, d) (unknown
density: a mixture of a uniform and a triangle density)
Density Estimation: Parzen Windows
25. Classification example:
• We estimate the densities for each category and
classify a test point by the label corresponding to the
maximum posterior.
• It is clear that many samples are required to have an
accurate estimate.
• The decision region for a Parzen-window classifier
depends upon the choice of window function as
illustrated in the previous figures.
Density Estimation: Parzen Windows
26. • In general, the training error (the empirical error on
the training points themselves) can be made
arbitrarily low by making the window width
sufficiently small.
• However, a low training error does not guarantee a
small test error.
Density Estimation: Parzen Windows
28. • These examples illustrate some of the power and
some of the limitations of nonparametric methods.
• Their power resides in their generality. Exactly the
same procedure was used for the unimodal normal
case and the bimodal mixture case.
• On the other hand, the number of samples needed
may be very large indeed — much greater than would
be required if we knew the form of the unknown
density.
Density Estimation: Conclusion
29. Probabilistic Neural Networks
• Most PR methods can be implemented in a
parallel fashion that trades space complexity
for time complexity.
• A parallel implementation of the Parzen-window
method is known as the Probabilistic Neural
Network (PNN).
30. Probabilistic Neural Networks
• Suppose we wish to form a Parzen estimate
based on n patterns, each of which is d-
dimensional, randomly sampled from c
classes.
• The PNN for this case consists of:
– d input units comprising the input layer,
– each input unit connected to each of
the n pattern units,
– each pattern unit connected to one and
only one of the c category (output) units.
32. • Algorithm 1: Training the PNN network
1. Normalize each pattern x of the training set to unity.
2. Place the first training pattern on the input units.
3. Set the weights linking the input units and the first
pattern units such that: w1 = x1 .
4. Make a single connection from the first pattern unit to
the category unit corresponding to the known class of
that pattern.
5. Repeat the process for all remaining training patterns by
setting the weights such that wk = xk (k = 1, 2, …, n).
Probabilistic Neural Networks
33. Probabilistic Neural Networks
Algorithm 1: PNN Training
begin initialize j ← 0, aji ← 0 (j = 1, 2, …, n; i = 1, 2, …, c)
   do j ← j + 1
      xjk ← xjk / (Σ(i=1..d) xji²)^(1/2)   (normalize xj; k = 1, 2, …, d)
      wjk ← xjk   (train)
      if xj ∈ ωi then aji ← 1
   until j = n
end
34. • Algorithm 2: Testing the PNN network
1. Normalize the test pattern x and place it at the input
units.
2. Each pattern unit computes the inner product to
yield the net activation
netk = wk^t · x
and emits a nonlinear function of it:
φ(netk) = exp[(netk − 1)/σ²]
3. Each output unit sums the contributions from all pattern
units connected to it:
Pn(x | ωj) ∝ Σ(i : aij = 1) exp[(neti − 1)/σ²]
4. Classify by selecting the maximum value of Pn(x | ωj).
Probabilistic Neural Networks
35. Probabilistic Neural Networks
σ is a free parameter
Algorithm 2: PNN Classification
begin initialize j ← 0, gi ← 0, x ← test pattern
   do j ← j + 1
      netj ← wj^t x
      if aji = 1 then gi ← gi + exp[(netj − 1)/σ²]
   until j = n
   return class ← arg max_i gi(x)
end
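Putting the training and classification algorithms together, a minimal NumPy sketch (function names and the value of σ are my own choices, not from the slides):

```python
import numpy as np

def pnn_train(X, labels):
    """Algorithm 1: normalize each training pattern to unit length and
    store it as the weight vector of its pattern unit (w_k = x_k)."""
    W = X / np.linalg.norm(X, axis=1, keepdims=True)
    return W, np.asarray(labels)

def pnn_classify(W, labels, x, sigma=0.5):
    """Algorithm 2: net_j = w_j^t x, activation exp((net_j - 1)/sigma**2),
    summed per category; return the category with the largest sum."""
    x = x / np.linalg.norm(x)
    act = np.exp((W @ x - 1.0) / sigma ** 2)
    classes = np.unique(labels)
    g = np.array([act[labels == c].sum() for c in classes])
    return classes[np.argmax(g)]

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])
W, lab = pnn_train(X, y)
print(pnn_classify(W, lab, np.array([1.0, 0.0])))  # assigned to class 0
```

Since every pattern has unit length, net_j ≤ 1 with equality only when the test pattern coincides with a stored one, so each activation is a Parzen-style Gaussian contribution.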
• Motivation: a solution for the problem of the
unknown “best” window function.
– Solution: Let the cell volume be a function of the
training data; the value of kn .
• kn-NN procedure:
– Center a cell about x and let it grow until it
captures kn samples (e.g., kn = √n).
– These samples are called the kn nearest neighbors of x.
The kn–Nearest-Neighbor Estimation
The kn–Nearest-Neighbor Estimation
Figure 4.10: Eight points in one dimension and the k-nearest-
neighbor density estimates, for k = 3 and 5. Note especially that
the discontinuities in the slopes in the estimates generally occur
away from the positions of the points themselves.
pn(x) = (kn/n) / Vn
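In one dimension this estimate can be sketched as follows (a toy illustration with names of my own):

```python
import numpy as np

def knn_density(x, samples, k):
    """k_n-nearest-neighbor estimate: grow an interval around x until it
    captures k samples, then p(x) = (k/n)/V with V the interval length."""
    n = len(samples)
    dist = np.sort(np.abs(np.asarray(samples) - x))
    r = dist[k - 1]            # distance to the k-th nearest neighbor
    V = 2 * r                  # 1-D "volume": an interval of radius r
    return (k / n) / V

rng = np.random.default_rng(3)
data = rng.random(10_000)      # uniform on [0, 1): true p(x) = 1
print(knn_density(0.5, data, k=100))  # close to 1.0
```

Because the k-th neighbor always sits on the cell boundary, the estimate is continuous but has discontinuous slope, as Figure 4.10 shows.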
• Two possibilities can occur:
– If the density is high near x, the cell will be
small, which provides good resolution.
– If the density is low near x, the cell will grow
large, stopping only when it reaches regions of
higher density.
• We can obtain a family of estimates by setting
kn = k1√n and choosing different values for k1.
The kn–Nearest-Neighbor Estimation
The kn–Nearest-Neighbor Estimation
• kn-nearest-neighbors vs. Parzen-Window
For n = 1, kn = √n = 1; the estimate becomes:
pn(x) = (kn/n) / Vn = 1/V1 = 1/(2|x − x1|),
which is a poor estimate for small n, and gets better
only for large n.
That is: for small n the Parzen window outperforms the
kn-nearest-neighbor estimate.
• Estimation of the a posteriori probabilities
– Goal: estimate P(ωi | x) from a set of n labeled
samples.
– Place a cell of volume V around x that captures
k samples.
– If ki samples amongst the k turn out to be labeled ωi,
the joint density estimate is:
pn(x, ωi) = (ki/n) / V
– The estimate for Pn(ωi | x) is then:
Pn(ωi | x) = pn(x, ωi) / Σ(j=1..c) pn(x, ωj) = ki / k
The Non-Parametric Density Estimation
• Classify x by assigning it the label most
frequently represented among the k-nearest
samples and use a voting scheme.
The k-Nearest–Neighbor rule
In this example,
x should be assigned
the label of the black
samples.
• ki/k is the fraction of the samples within the cell
that are labeled ωi.
• For minimum error rate, the most frequently
represented category within the cell is selected.
• If k is large and the cell sufficiently small, the
performance will approach the best possible.
The kn–Nearest-Neighbor Estimation
• Let Dn = {x1, x2, …, xn} be a set of n prototypes.
• Let x′ ∈ Dn be the closest prototype to a test point
x; then the nearest-neighbor rule for classifying x
is to assign it the label associated with x′.
– The NN rule leads to an error rate greater than
the minimum possible Bayes error rate .
– If the number of prototypes is large (unlimited), the error
rate of the nearest-neighbor classifier is never worse than
twice the Bayes rate (proof not required: Sec. 4.5.3).
The Nearest–Neighbor rule
• If n is very large, it is reasonable to assume
that x′ is sufficiently close to x, so that:
P(ωi | x) ≈ P(ωi | x′)
• Voronoi tessellation:
– The NN rule partitions the feature
space into cells, each consisting of all points closer to a
given prototype x′ than to any other prototype; all points
in such a cell are given the label of x′.
The Nearest –Neighbor rule
The NN rule: Voronoi Tessellation
• The k-NN classifier relies on a metric or distance
function.
• A metric D(., .) is a function that gives a generalized
scalar distance between two arguments (patterns).
• A metric must have four properties:
– Non-negativity: D(a, b) ≥ 0.
– Reflexivity: D(a, b) = 0 if and only if a = b.
– Symmetry: D(a, b) = D(b, a).
– Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c).
Metrics and k-NN Classification
Metrics and k-NN Classification
• The most popular metric functions (in d
dimensions):
– The Euclidean distance:
D(a, b) = (Σ(k=1..d) (ak − bk)²)^(1/2)
– The Minkowski distance
or Lk norm:
Lk(a, b) = (Σ(i=1..d) |ai − bi|^k)^(1/k)
– The Manhattan or
city-block distance:
C(a, b) = Σ(i=1..d) |ai − bi|
• Note that the L2 and L1 norms give the Euclidean and
Manhattan metrics, respectively.
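All three metrics can be computed with one Minkowski routine (a small sketch; the function name is my own):

```python
import numpy as np

def minkowski(a, b, k):
    """Minkowski distance L_k(a, b) = (sum_i |a_i - b_i|**k)**(1/k).
    k = 2 gives the Euclidean metric, k = 1 the Manhattan metric."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 2))  # Euclidean (L2): 5.0
print(minkowski(a, b, 1))  # Manhattan / city block (L1): 7.0
```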
Metrics and k-NN Classification
• The Minkowski distance for different values of k:
Metrics and k-NN Classification
• The Euclidean distance has the drawback that it is
sensitive to transformations (translation,
rotation, and scaling).
Figure 4.20: The uncritical use of Euclidean
metric cannot address the problem of
translation invariance. Pattern x represents a
handwritten 5, and x(s = 3) the same shape
but shifted three pixels to the right. The
Euclidean distance D(x, x(s = 3)) is much
larger than D(x, x8), where x8 represents the
handwritten 8. Nearest-neighbor
classification based on the Euclidean distance
in this way leads to very large errors. Instead,
we seek a distance measure that would be
insensitive to such translations, or indeed
other known invariance, such as scale or
rotation.
shifted 3 pixels
Metrics and k-NN Classification
• The tangent distance is a more general metric that
accounts for invariance: during training, we build tangent
vectors TVi for all the possible transformations
Fi(x′; αi). The tangent vectors take the form:
TVi = Fi(x′; αi) − x′
• To get the distance between x and x′, we use the tangent
distance:
D_tan(x′, x) = min_a ||(x′ + T a) − x||
• where the columns of T are the tangent vectors TVi, and
a is obtained by minimizing D.
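The minimization over a is a linear least-squares problem, so a one-sided tangent distance can be sketched as follows (a toy example with a hypothetical tangent vector, not the slides' image setup):

```python
import numpy as np

def tangent_distance(x_prime, x, T):
    """One-sided tangent distance min_a ||(x' + T a) - x||, where the
    columns of T are tangent vectors at x'. The optimal a is the
    least-squares solution of T a = x - x'."""
    a, *_ = np.linalg.lstsq(T, x - x_prime, rcond=None)
    return np.linalg.norm(x_prime + T @ a - x)

# One tangent vector along the first axis: displacements along that axis
# cost nothing, while displacements along the second axis are penalized.
x_prime = np.array([0.0, 0.0])
T = np.array([[1.0], [0.0]])
print(tangent_distance(x_prime, np.array([5.0, 0.0]), T))  # ~ 0.0
print(tangent_distance(x_prime, np.array([0.0, 2.0]), T))  # 2.0
```

This is what makes the metric insensitive to the modeled transformations: a shifted "5" lies along a translation tangent vector, so its tangent distance stays small.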
• If we have n prototypes in d dimensions, and we seek the
closest prototype to a test point x (k = 1).
• we calculate the distance, O (d ), to x from each prototype
point, O(n ), Thus total is O (dn ).
• Three algorithms can be used to reduce complexity:
– Computing partial distances: compute the distance using
only a subset r of the full d dimensions, abandoning a
prototype once the partial distance exceeds the best
found so far.
– Prestructuring: create a search tree in which prototypes
are selectively linked.
– Editing (also known as pruning or condensing): eliminate
useless prototypes during the search.
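The partial-distance idea, for instance, can be sketched as (an illustrative implementation, not from the slides):

```python
import numpy as np

def partial_distance_nn(x, prototypes):
    """Nearest neighbor with partial distances: accumulate the squared
    distance one dimension at a time and abandon a prototype as soon
    as its running sum already exceeds the best full distance so far."""
    best, best_i = np.inf, -1
    for i, p in enumerate(prototypes):
        s = 0.0
        for xj, pj in zip(x, p):
            s += (xj - pj) ** 2
            if s >= best:        # prune: this prototype cannot win
                break
        else:                    # completed all d dimensions
            best, best_i = s, i
    return best_i

protos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
print(partial_distance_nn(np.array([0.9, 1.2]), protos))  # index 1 is closest
```

The worst case is still O(dn), but pruning typically skips most coordinates of far-away prototypes.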
Computational Complexity of the k-NN
Consider the prototypes in the table, the test point
x = (0.10, 0.25), and k = 3 (odd to avoid ties),
• The closest k vectors to x are:
(0.10, 0.28) → ω2;
(0.12, 0.20) → ω2;
(0.09, 0.30) → ω5.
• Then the voting scheme assigns the label ω2 to x,
since ω2 is the most frequently represented.

Prototypes      Labels
(0.15, 0.35)    ω1
(0.10, 0.28)    ω2
(0.09, 0.30)    ω5
(0.12, 0.20)    ω2
Exercise: The k-NN rule
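The exercise can be checked with a short k-NN routine (a sketch; the function name is my own):

```python
import numpy as np

# Prototypes and labels from the exercise table.
protos = np.array([[0.15, 0.35], [0.10, 0.28], [0.09, 0.30], [0.12, 0.20]])
labels = np.array([1, 2, 5, 2])

def knn_label(x, protos, labels, k=3):
    """k-NN rule: vote among the labels of the k closest prototypes."""
    d = np.linalg.norm(protos - x, axis=1)
    nearest = labels[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

print(knn_label(np.array([0.10, 0.25]), protos, labels))  # label 2, as above
```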
Next Time
Discriminant Functions
and
Neural Networks