This presentation summarizes paper #7, "Nonlinear component analysis as a kernel eigenvalue problem" by Schölkopf, Smola, and Müller. It introduces Kernel Principal Component Analysis (KPCA), an extension of PCA that maps data into a higher-dimensional feature space and computes principal components there by solving a kernel eigenvalue problem. The presentation covers the mathematical formulation and algorithm of KPCA, its applications, advantages, and disadvantages, and experiments comparing KPCA to other dimensionality reduction techniques.
1. Presentation of paper #7:
Nonlinear component analysis as a kernel eigenvalue problem
Schölkopf, Smola, Müller
Neural Computation 10, 1299-1319, MIT Press (1998)
Group C:
M. Filannino, G. Rates, U. Sandouk
COMP61021: Modelling and Visualization of high-dimensional data
2. Introduction
● Kernel Principal Component Analysis (KPCA)
○ KPCA is an extension of Principal Component Analysis
○ It computes PCA in a new, higher-dimensional feature space
○ Useful for feature extraction and dimensionality reduction (a minimal usage sketch follows below)
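As a quick illustration of the idea (not part of the original slides), here is a minimal sketch using scikit-learn's KernelPCA; the kernel choice and all parameters are illustrative assumptions.

```python
# Minimal sketch (assumes numpy and scikit-learn are installed).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
X = np.hstack([x, x**2 + rng.normal(0, 0.1, size=(100, 1))])  # y = x^2 + noise

# Polynomial kernel of degree 2; keep the first 3 nonlinear components.
kpca = KernelPCA(n_components=3, kernel="poly", degree=2)
Z = kpca.fit_transform(X)
print(Z.shape)  # (100, 3)
```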
4. Motivation: possible solutions
Principal Curves
Trevor Hastie; Werner Stuetzle, “Principal Curves,” Journal of the American
Statistical Association, Vol. 84, No. 406. (Jun. 1989), pp. 502-516.
● Optimization (including the quality of data approximation)
● Natural geometric meaning
● Natural projection
http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png
5. Motivation: possible solutions
Autoencoders
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of
data with neural networks. Science, 313, 504--507.
● Feed-forward neural network
● Approximates the identity function
http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png
6. Motivation: some new problems
● Low input dimensions
● Problem dependent
● Hard optimization problems
12. Principle
[Diagram: input data mapped to new features]
"We are not interested in PCs in the input space, we are interested in PCs of features that are nonlinearly related to the original ones"
14. Principle
Given a data set of N centered observations x_1, ..., x_N in a d-dimensional space (Σ_k x_k = 0):
● PCA diagonalizes the covariance matrix: C = (1/N) Σ_j x_j x_jᵀ
● It is necessary to solve the eigenvalue problem: λv = Cv
● We can define the same computation in another dot product space F, related to the input space by a nonlinear map Φ: Rᵈ → F (a numpy sketch of the linear case follows below)
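Before moving to F, the linear case can be written in a few lines of numpy (an illustrative sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # N = 200 observations in d = 5 dimensions
Xc = X - X.mean(axis=0)              # center the observations
C = (Xc.T @ Xc) / len(Xc)            # covariance matrix C = (1/N) sum_j x_j x_j^T
lam, V = np.linalg.eigh(C)           # solve lambda v = C v (ascending eigenvalues)
Z = Xc @ V[:, ::-1][:, :2]           # project onto the two leading principal axes
```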
15. Principle
Given a data set of N centered observations Φ(x_1), ..., Φ(x_N) in the high-dimensional space F:
● Covariance matrix in the new space: C̄ = (1/N) Σ_j Φ(x_j) Φ(x_j)ᵀ
● Again, it is necessary to solve the eigenvalue problem: λV = C̄V
● This means that all solutions V with λ ≠ 0 lie in the span of Φ(x_1), ..., Φ(x_N), so we can write V = Σ_i α_i Φ(x_i)
16. Principle
● Combining the last three equations, we obtain: λ (Φ(x_k)·V) = (Φ(x_k)·C̄V) for all k = 1, ..., N
● we define a new function: k(x_i, x_j) = (Φ(x_i)·Φ(x_j))
● and a new N x N matrix: K_ij = k(x_i, x_j)
● our equation becomes: Nλα = Kα (spelled out below)
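Written out, the substitution runs as follows (reconstructed from the paper's derivation; a sketch, not a verbatim quote):

```latex
% Substitute V = \sum_i \alpha_i \Phi(x_i) into \lambda V = \bar{C} V and
% take dot products with each mapped point \Phi(x_k):
\lambda\,(\Phi(x_k)\cdot V) \;=\; (\Phi(x_k)\cdot \bar{C}\,V), \qquad k = 1,\dots,N.
% With K_{ij} = k(x_i, x_j) = (\Phi(x_i)\cdot\Phi(x_j)) this system becomes
N\lambda\,K\alpha \;=\; K^{2}\alpha,
% and it suffices to solve the kernel eigenvalue problem
N\lambda\,\alpha \;=\; K\alpha.
```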
17. Principle
● let λ1 ≤ λ2 ≤ ... ≤ λN denote the eigenvalues of K, and α1, ..., αN the
corresponding eigenvectors, with λp being the first nonzero eigenvalue
then we require they are normalized in F:
● Encoding a data point y means computing:
18. Algorithm
● Centralization
For a given data set, subtract the mean from all observations to obtain centered data in Rᵈ.
● Finding principal components
Compute the matrix K_ij = k(x_i, x_j) using the kernel function, then find its eigenvectors and eigenvalues.
● Encoding training/testing data
Compute (V^k·Φ(x)) = Σ_i α_i^k k(x_i, x), where x is the vector to encode. This can be done since we have calculated the eigenvalues and eigenvectors (a numpy sketch of the whole procedure follows below).
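A compact numpy sketch of these steps, reconstructed from the formulas above (the kernel and its parameters are illustrative assumptions):

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Kernel PCA: build K, center it in feature space,
    solve the eigenvalue problem, and encode the training data."""
    N = len(X)
    # Gram matrix K_ij = k(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Center in feature space: K~ = K - 1_N K - K 1_N + 1_N K 1_N, (1_N)_ij = 1/N
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one
    lam, alpha = np.linalg.eigh(Kc)                  # ascending eigenvalues
    lam = lam[::-1][:n_components]                   # keep the largest ones
    alpha = alpha[:, ::-1][:, :n_components]
    # Normalize so that lambda_k (alpha^k . alpha^k) = 1, i.e. |V^k| = 1 in F
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))
    # Encoding of the training data: (V^k . Phi(x_i)) = (Kc alpha)_ik
    return Kc @ alpha, alpha, lam

# Example with a degree-2 polynomial kernel (illustrative parameters):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Z, alpha, lam = kernel_pca(X, lambda a, b: (a @ b + 1.0) ** 2, n_components=3)
```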
19. Algorithm
● Reconstructing training data
The operation cannot be done exactly, because the eigenvectors do not have pre-images in the original input space.
● Reconstructing a test data point
The same operation cannot be done either, for the same reason: the eigenvectors do not have pre-images in the original input space.
20. Disadvantages
● Centering in the original space does not imply centering in F; we need to adjust the K matrix as follows: K̃ = K − 1_N K − K 1_N + 1_N K 1_N, where 1_N is the N x N matrix with all entries equal to 1/N (a sketch follows after this list)
● KPCA is now a parametric technique:
○ choice of a proper kernel function
■ Gaussian, sigmoid, polynomial
○ Mercer's theorem
■ k(x,y) must be continuous, symmetric, and positive semi-definite (xᵀAx ≥ 0 for the Gram matrix A)
■ it guarantees that k acts as a dot product in some feature space, with non-negative eigenvalues
● Data reconstruction is not possible, unless an approximation formula is used (see the pre-image discussion later)
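The centering adjustment in code, including the analogous adjustment for test points, which is derived from the same algebra and spelled out here as an assumption:

```python
import numpy as np

def center_gram(K):
    """Center an N x N Gram matrix in feature space:
    K~ = K - 1_N K - K 1_N + 1_N K 1_N, with (1_N)_ij = 1/N."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    return K - one @ K - K @ one + one @ K @ one

def center_gram_test(K_test, K_train):
    """Analogous adjustment for an M x N test Gram matrix
    against the N training points."""
    M, N = K_test.shape
    one_mn = np.full((M, N), 1.0 / N)
    one_nn = np.full((N, N), 1.0 / N)
    return (K_test - one_mn @ K_train - K_test @ one_nn
            + one_mn @ K_train @ one_nn)
```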
21. Advantages
● Time complexity
○ we will return to this point later
● Handles nonlinearly separable problems
● Extraction of more principal components than PCA (up to N instead of d)
○ Feature extraction vs. dimensionality reduction
23. Applications
● Clustering
○ Density estimation
■ e.g., high correlation between features
○ De-noising
■ e.g., removing lighting effects from bright images
○ Compression
■ e.g., image compression
● Classification
○ e.g., categorization
24. Datasets
Experiment name | Created by | Representation
● Simple example 1 ("1+2=3"): y = x² plus noise with sd 0.1, x drawn from a uniform distribution on [-1, 1]; unlabelled, 2 dimensions
● Simple example 2 ("1+2=3", three clusters): three Gaussians with sd = 0.1 in [-1, 1] x [-0.5, 1]; unlabelled, 2 dimensions
● Kernels: a circle and a square
● De-noising: eleven Gaussians with zero-mean centers drawn from [-1, 1]; unlabelled, 10 dimensions
● USPS: handwritten digit characters; labelled, 256 dimensions, 9298 digits
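The toy datasets above can be generated in a few lines (a sketch under the parameters listed; the exact settings of the paper may differ):

```python
import numpy as np
rng = np.random.default_rng(0)

# Simple example 1: y = x^2 plus Gaussian noise (sd 0.1), x uniform on [-1, 1]
x = rng.uniform(-1, 1, 200)
simple1 = np.column_stack([x, x**2 + rng.normal(0, 0.1, 200)])

# Simple example 2: three Gaussian clusters (sd 0.1) in [-1, 1] x [-0.5, 1]
centers = rng.uniform([-1, -0.5], [1, 1], size=(3, 2))
simple2 = np.vstack([c + rng.normal(0, 0.1, (30, 2)) for c in centers])

# De-noising: eleven Gaussians in R^10 with centers drawn from [-1, 1]
centers10 = rng.uniform(-1, 1, size=(11, 10))
denoising = np.vstack([c + rng.normal(0, 0.1, (10, 10)) for c in centers10])
```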
25. Experiments
1. Simple example 1
Dataset: "1+2=3", the uniform distribution with noise sd = 0.2
Kernel: polynomial, degrees 1 to 4
2. USPS character recognition
Dataset: USPS
Kernel PCA: polynomial kernel, degrees 1 to 7; components 32 to 2048 (doubling)
Methods: five-layer neural networks, SVM on kernel PCA components, SVM on linear PCA components; neural networks and SVM use the best parameters for the task
3. De-noising
Dataset: de-noising, 11 Gaussians with sd = 0.1
Methods (best parameters for the task): kernel autoencoders, principal curves, kernel PCA, linear PCA
4. Kernels
Kernels: radial basis function and sigmoid, with the best parameters for the task
26. Methods
These are the methods used in the experiments:
● Dimensionality reduction
○ Unsupervised, linear: linear PCA
○ Unsupervised, nonlinear: kernel PCA, kernel autoencoders, principal curves
○ Supervised, nonlinear: kernel LDA (applied, e.g., to face recognition)
● Classification
○ Neural networks
○ SVM
27. Assessment
● 1. Accuracy
Classification: exact classification
Clustering: comparable to other clusterings
● 2. Time complexity
The time to compute
● 3. Storage complexity
The storage required for the data
● 4. Interpretability
How easy the results are to understand
28. Simple Example
● Recreated example
Dataset: "1+2=3", the uniform distribution with noise sd 0.2
Kernel: polynomial, degrees 1 to 4
PC: 1 to 3
● Nonlinear PCA paper example
Dataset: the USPS handwritten digits
Training set: 3000
Classifier: SVM with dot-product kernels of degree 1 to 7
PC: 32 to 2048 (doubling)
[Figure: 2D data generated as y = x² + B, with noise B of sd 0.2 from a uniform distribution on [-1, 1]; 3D plots of the eigenvectors 1 to 3 of highest eigenvalue found by kernel PCA with polynomial kernels of degree 1 to 4, showing accurate clustering of the nonlinear features.]
29. Character recognition
Dataset: the USPS handwritten digits
Training set: 3000
Classifier: SVM with dot-product kernels of degree 1 to 7
PC: 32 to 2048 (doubling)
● The performance is better for a linear classifier trained on nonlinear components than on linear components
● The performance improves over linear PCA as the number of components is increased
[Figure: results of the character recognition experiment.]
30. De-noising
Dataset: the de-noising eleven Gaussians
Training set: 100
Kernel: Gaussian, with the sd as parameter
PC: 2
The de-noising operates on the nonlinear features of the distribution.
[Figure: results of the de-noising experiment.]
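For the de-noising use case, scikit-learn's KernelPCA offers an approximate pre-image via fit_inverse_transform (a sketch only; the paper used its own approximation formula, and gamma and n_components here are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(11, 10))            # eleven Gaussians in R^10
X = np.vstack([c + rng.normal(0, 0.1, (10, 10)) for c in centers])

# fit_inverse_transform=True learns an approximate pre-image map by regression.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0,
                 fit_inverse_transform=True)
Z = kpca.fit_transform(X)
X_denoised = kpca.inverse_transform(Z)                 # back in the input space
```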
31. Kernels
The choice of kernel regulates the accuracy of the algorithm and depends on the application. The kernels must be Mercer kernels, i.e. yield positive semi-definite Gram matrices.
Experiments:
● Radial Basis Function
Dataset: three Gaussians, sd 0.1
Kernel: k(x, y) = exp(−‖x − y‖² / 0.1)
PC: 1 to 8
● Sigmoid
Dataset: three Gaussians, sd 0.1
Kernel: k(x, y) = tanh(κ(x·y) + Θ)
PC: 1 to 3
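The two kernels written out in code (the RBF width 0.1 follows the slide; the sigmoid's gain and offset are assumptions):

```python
import numpy as np

def rbf(x, y, c=0.1):
    """Radial basis function kernel: k(x, y) = exp(-||x - y||^2 / c)."""
    return np.exp(-np.sum((x - y) ** 2) / c)

def sigmoid(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: k(x, y) = tanh(kappa * (x . y) + theta)."""
    return np.tanh(kappa * np.dot(x, y) + theta)
```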
32. Results
RBF
- PCs 1-2 separate the 3 clusters
- PCs 3-5 halve the clusters
- PCs 6-8 split them orthogonally
- The clusters are split into 12 regions
[Figure: panels for PC 1 to PC 8.]
Sigmoid
- PCs 1-2 separate the 3 clusters
- PC 3 halves the 3 clusters
- The same number of PCs separates the clusters; the sigmoid needs fewer PCs to halve them
[Figure: panels for PC 1 to PC 3.]
33. Results
                   Experiment 1      Experiment 2   Experiment 3   Experiment 4
1 Accuracy
  Kernel           Polynomial 4      Polynomial 4   Gaussian 0.2   Sigmoid
  Components       8 (split to 12)   512            2              3 (split to 6)
  Accuracy                           4.4
2 Time
3 Space
4 Interpretability Very good         Very good      Complicated    Very good
34. Discussions: KDA
Kernel Fisher Discriminant (KDA)
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller [3]
● Best discriminant projection
http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png
35. Discussions
Doing PCA in F rather than in Rᵈ keeps the classical PCA guarantees:
● The first k principal components carry more variance than any other k directions
● The mean squared error incurred by representing the data with the first k principal components is minimal
● The principal components are uncorrelated
36. Discussions
Going through a higher dimensionality to reach a lower dimensionality
● Pick the right high-dimensional space
Need for a proper kernel
● What kernel to use?
○ Gaussian, sigmoid, polynomial
● Problem dependent
37. Discussions
Time complexity
● A lot of features (a lot of dimensions).
● KPCA still works:
○ it operates in the subspace of F spanned by the observed x's
○ no explicit dot product calculation in F
● Computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products
○ (if the kernel is easy to compute)
○ e.g., polynomial kernels
Payback: we can use a linear classifier on the nonlinear features.
38. Discussions
Pre-image reconstruction may be impossible.
An approximation can be computed in F; exact reconstruction would need an explicit ϕ. Approximations are obtained as (a fixed-point sketch for the Gaussian kernel follows below):
● a regression learning problem
● a nonlinear optimization problem
● an algebraic solution (rarely)
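For the Gaussian kernel, the nonlinear optimization admits a fixed-point iteration (after the approximate pre-image work cited in [9] and [10]; this sketch assumes an RBF kernel and that beta holds the expansion coefficients of the feature-space point to reconstruct):

```python
import numpy as np

def preimage_rbf(X, beta, c, z0, n_iter=100):
    """Approximate pre-image of P = sum_i beta_i Phi(x_i) for the RBF kernel
    k(x, y) = exp(-||x - y||^2 / c), via the fixed-point iteration
    z <- sum_i beta_i k(z, x_i) x_i / sum_i beta_i k(z, x_i)."""
    z = z0.copy()
    for _ in range(n_iter):
        w = beta * np.exp(-np.sum((X - z) ** 2, axis=1) / c)
        denom = w.sum()
        if abs(denom) < 1e-12:   # the iteration can become unstable; restart in practice
            break
        z = (w[:, None] * X).sum(axis=0) / denom
    return z
```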
41. References
[1] J.T. Kwok and I.W. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
[2] G.E. Hinton and R.R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504-507, 2006.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher Discriminant Analysis with Kernels," Neural Networks for Signal Processing IX, IEEE, pp. 41-48, 1999.
[4] T. Hastie and W. Stuetzle, "Principal Curves," Journal of the American Statistical Association, vol. 84, no. 406, pp. 502-516, June 1989.
[5] G. Moser, "Principal component analysis" (in Italian: "Analisi delle componenti principali"), Vector space transformation techniques for multi-dimensional statistical analysis.
[6] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 2002.
[7] Wikipedia, "Kernel Principal Component Analysis," 2011.
[8] A. Ghodsi, "Data Visualization," 2006.
[9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images," in Proceedings of the 8th International Conference on Artificial Neural Networks, pp. 147-152, 1998.
42. References
[10] J.T. Kwok and I.W. Tsang, "The pre-image problem in kernel methods," in Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, March 2001.
[12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and De-Noising in Feature Spaces," in Advances in Neural Information Processing Systems 11, 1999.