A Unifying Probabilistic Perspective for Spectral
Dimensionality Reduction: Insights and New
Models – Neil D. Lawrence
Presenters: Sean Golliher and Derek Reimanis
1 / 21
Claims and Contributions
“Unifying Approach”: Views previous methods as less general than
this approach and proves they are special cases
Presents the Unified Algorithm: Maximum Entropy Unfolding (MEU)
Improves Locally Linear Embedding (LLE) and introduces ALLE
(Acyclic Version)
Introduces a 3rd algorithm, DRILL, which estimates the dimensional
structure from the data (rather than from K nearest neighbors)
Mostly theoretical paper
2 / 21
Features of Spectral Methods
Spectral Methods: Low dimensional representations are derived from
eigenvectors of specially constructed matrices.
Methods Share a Common Approach:
Compute the nearest neighbors of each input pattern
Construct a weighted graph based on the neighborhood relations
Derive a matrix from the weighted graph
Produce a low-dimensional representation from the eigenvectors of
that matrix
Keep this in mind when trying to understand the unification
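The four shared steps can be sketched end-to-end with NumPy (an illustrative toy pipeline; the choice of k, the binary edge weights, and the graph Laplacian are our assumptions, not prescribed by any one method):

```python
import numpy as np

def spectral_embedding(X, k=5, d=2):
    """Generic spectral pipeline: kNN graph -> weighted matrix -> eigenvectors."""
    n = X.shape[0]
    # Step 1: pairwise squared distances and k nearest neighbors
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(D2, axis=1)[:, 1:k + 1]  # skip self at position 0
    # Step 2: weighted adjacency (symmetrized, binary weights here)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, nbrs[i]] = 1.0
    W = np.maximum(W, W.T)
    # Step 3: derive a matrix -- here the unnormalized graph Laplacian
    L = np.diag(W.sum(1)) - W
    # Step 4: representation from eigenvectors with smallest eigenvalues,
    # dropping the constant eigenvector
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:d + 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
Z = spectral_embedding(X)
print(Z.shape)  # (30, 2)
```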
3 / 21
Most Common Examples
PCA: technique that preserves the covariance structure of the data
MDS (Multidimensional Scaling): preserves inner products of input
patterns, and equivalently the pairwise distances ‖xi − xj‖²
ISOMAP: preserves pairwise distances measured along a
sub-manifold of the sample space. A variant of MDS which uses
geodesics along the sub-manifold
geodesic: a curve whose tangent vectors remain parallel if they are
transported along it
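Classical MDS, which ISOMAP builds on, can be sketched in a few lines: double-center the squared distances to recover a Gram matrix, then take the top eigenvectors (our toy construction, not code from the paper):

```python
import numpy as np

def classical_mds(D2, d=2):
    """Classical MDS: double-center squared distances, take top eigenvectors."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ D2 @ J                 # recovered inner-product (Gram) matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:d]      # d largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Points that truly lie in 2-D: MDS should reproduce their distances exactly.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
Z = classical_mds(D2)
D2_rec = ((Z[:, None] - Z[None, :]) ** 2).sum(-1)
print(np.allclose(D2, D2_rec))  # True
```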
4 / 21
Locally Linear Embedding (LLE)
Tries to capture non-linear relationships by preserving the local
linear structure of the data.
Four Step Algorithm:
Step 1: Compute the K-nearest neighbors of each higher-dimensional
input pattern xi and create a directed graph whose edges indicate
nearest-neighbor relations.
Step 2: Assign weights Wxy to the edges in the graph. Each input
pattern, together with its neighbors, can be viewed as samples from a
linear patch on a lower-dimensional sub-manifold.
Find the “reconstruction weights” that provide a local linear fit of the
k + 1 points at each of the n points in the data set. Minimize the
error function:
E(W|D) = Σ_{x∈D} ‖x − Σ_{y∈N(x)} Wxy y‖²
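The Step 2 weights for a single point have a closed-form solution via the local Gram matrix (a sketch; the regularization term is a standard numerical safeguard we added, not part of the slide):

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """Solve for w minimizing ||x - sum_j w_j * neighbors[j]||^2
    subject to sum(w) = 1, via the local Gram matrix."""
    Z = neighbors - x                           # neighbors centered on x
    G = Z @ Z.T                                 # local Gram matrix, shape (k, k)
    G += reg * np.trace(G) * np.eye(len(G))     # regularize for stability
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                          # enforce sum-to-one constraint

rng = np.random.default_rng(2)
nbrs = rng.normal(size=(4, 3))
x = nbrs.mean(0)                                # a point inside the neighbor hull
w = reconstruction_weights(x, nbrs)
print(np.isclose(w.sum(), 1.0))  # True
```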
5 / 21
Locally Linear Embedding (LLE) Cont’d
Step 3: After the n local models are constructed, keep the weights
fixed and find a new projection. This becomes a minimization problem
similar to MDS:
E(Z|D) = Σ_{x∈D} ‖zx − Σ_{y∈N(x)} Wxy zy‖²
Collecting the terms into a matrix Mxy gives a new objective (error)
function in matrix form:
E(Z|W) = Σ_{x,y} Mxy zxᵀ zy
Find the m + 1 eigenvectors with the smallest eigenvalues and discard
the constant one, giving a new set of coordinates for the new
m-dimensional space.
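Steps 3–4 in matrix form, as a sketch: build M = (I − W)ᵀ(I − W), take the bottom eigenvectors, and discard the constant one (the random W here is for shape only; real weights would come from Step 2):

```python
import numpy as np

def lle_embedding(W, d=2):
    """Given reconstruction weights W (rows sum to 1), embed via the
    bottom eigenvectors of M = (I - W)^T (I - W)."""
    n = W.shape[0]
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    # Discard the constant eigenvector (eigenvalue ~0), keep the next d.
    return vecs[:, 1:d + 1]

rng = np.random.default_rng(3)
W = rng.random((10, 10))
W /= W.sum(1, keepdims=True)         # rows sum to one, so (I - W)1 = 0
Z = lle_embedding(W)
print(Z.shape)  # (10, 2)
```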
6 / 21
Maximum Entropy Unfolding (MEU)
Kernel PCA idea: apply the “kernel trick” to transform to a higher
dimensional space, then transform using the covariance as in PCA.
This increases the feature space rather than reducing the dimensions,
which motivated the development of Maximum Variance Unfolding
(MVU).
MEU: since entropy is related to variance, they use entropy and
obtain a probabilistic model. They derive a density p(Y) directly (not
over squared distances) by constraining the expected squared
inter-point distances dij of any two samples yi and yj
7 / 21
Maximum Entropy Unfolding (MEU) Cont’d
The observations are squared distances; they derive a density p(Y)
that gives rise to those distances.
They use the KL divergence between two distributions to define the
entropy, which requires a base distribution m(Y):
H = −∫ p(Y) log (p(Y) / m(Y)) dY
For m(Y) they assume a very broad, spherical Gaussian, so that its
effect on the solution can be assumed to be small.
Solving for the density over Y that minimizes the KL divergence
under the constraints on the expectations gives:
p(Y) = ∏_{j=1}^{p} |L + γI|^{1/2} / (2π)^{n/2} · exp(−½ yjᵀ (L + γI) yj)
where L is a matrix whose off-diagonal elements contain the
information from the distance constraints D, similar to the previous
examples.
The key point is that this assumes independence of the density across
data features (the per-feature densities are multiplied)
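The factorized Gaussian above can be evaluated as a sum of per-feature log densities (a sketch; the PSD stand-in for L, the value of γ, and the random data are our assumptions):

```python
import numpy as np

def meu_log_density(Y, L, gamma=0.1):
    """log p(Y) = sum_j log N(y_j | 0, (L + gamma*I)^{-1}),
    i.e. independence across the p feature columns of Y (n x p)."""
    n, p = Y.shape
    precision = L + gamma * np.eye(n)
    sign, logdet = np.linalg.slogdet(precision)
    ll = 0.0
    for j in range(p):
        y = Y[:, j]
        ll += 0.5 * logdet - 0.5 * n * np.log(2 * np.pi) - 0.5 * y @ precision @ y
    return ll

rng = np.random.default_rng(4)
n, p = 8, 3
A = rng.normal(size=(n, n))
L = A @ A.T                         # a PSD stand-in for the MEU matrix L
Y = rng.normal(size=(n, p))
print(np.isfinite(meu_log_density(Y, L)))  # True
```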
8 / 21
Maximum Entropy Unfolding (MEU) Cont’d
GMRF: a (finite-dimensional) random vector following a multivariate
normal (Gaussian) distribution, with a graph encoding its conditional
independence structure.
Since GRFs provide an alternative approach to reducing the number of
parameters in the covariance matrix, they use the GRF approach, which
is common in spatial methods.
Independence in their model is expressed over data features instead
of data points; this assumes the features are i.i.d.
For their model the number of parameters does not increase with the
amount of data. As the number of features increases there is a clear
“blessing of dimensionality”.
9 / 21
Maximum Entropy Unfolding (MEU) Cont’d
With the GRF representation they show that, with some manipulation,
they can compute ⟨dij⟩ from the covariance matrix:
⟨dij⟩ = p/2 (k_{i,i} − 2 k_{i,j} + k_{j,j})
This is of the same form as the distances and similarities in kernel
PCA methods
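The relation above can be evaluated for all pairs at once from any covariance/kernel matrix (a sketch; the test matrix is arbitrary, and the p/2 factor is taken from the slide as written):

```python
import numpy as np

def expected_sq_distances(K, p):
    """<d_ij> = p/2 * (k_ii - 2*k_ij + k_jj) for all pairs, vectorized."""
    diag = np.diag(K)
    return 0.5 * p * (diag[:, None] - 2 * K + diag[None, :])

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 6))
K = A @ A.T                     # any PSD covariance/kernel matrix
D = expected_sq_distances(K, p=4)
print(np.allclose(np.diag(D), 0.0))  # True: distance of a point to itself
```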
10 / 21
MEU and LLE
They show that LLE is approximating maximum likelihood in their
MEU model
Finding estimates of the parameters that maximize the probability of
observing the data that we have
Thus the claim of a “unifying model”: LLE is a special case of MEU.
They also showed that using pseudo-likelihood in the MEU model
reduces to the generalization they presented in equation (9) for LLE.
This uses the fact that the joint probability density of a GRF can be
represented as a factorization over the cliques of the graph.
The pseudo-likelihood reduces the product form of the distribution
presented earlier to the objective matrix for LLE.
11 / 21
Acyclic Locally Linear Embedding (ALLE)
Pseudo-likelihood: an approximation to the joint probability
distribution of a collection of random variables.
If they force their matrix M to be lower triangular, then the true
log-likelihood log p(Y) factors into n independent regression
functions.
This approach requires an ordering of the data because of the
restriction j > i in the matrix M
12 / 21
Dimensionality reduction through Regularization of the
Inverse covariance in the Log Likelihood (DRILL)
An optimization technique that applies L1 regularization of the
dependencies to estimate the graph structure
How does this compare to the nearest-neighbor approaches?
E(Λ) = −log p(Y) + Σ_{i<j} |λ_{i,j}|
Minimize E(Λ) through Least Absolute Shrinkage and Selection
Operator (LASSO) regression
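The penalized objective E(Λ) can be evaluated directly (a sketch of the objective only, not of the LASSO optimizer; the zero-mean Gaussian form, the value of λ, and the data are our assumptions):

```python
import numpy as np

def drill_objective(Lam, Y, lam=0.1):
    """E(Lam) = -log p(Y | Lam) + lam * sum_{i<j} |Lam_ij|,
    with p(Y) a zero-mean Gaussian with precision Lam per feature column."""
    n, p = Y.shape
    sign, logdet = np.linalg.slogdet(Lam)
    nll = 0.0
    for j in range(p):
        y = Y[:, j]
        nll -= 0.5 * logdet - 0.5 * n * np.log(2 * np.pi) - 0.5 * y @ Lam @ y
    l1 = lam * np.abs(Lam[np.triu_indices(n, k=1)]).sum()  # off-diagonal L1 penalty
    return nll + l1

rng = np.random.default_rng(6)
n, p = 5, 2
Y = rng.normal(size=(n, p))
Lam = np.eye(n)                  # identity precision as a starting point
print(np.isfinite(drill_objective(Lam, Y)))  # True
```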
13 / 21
Gaussian process latent variable model (GP-LVM) likelihood
A process which maps data from a latent space to the data space,
where the locations of points are determined by maximizing a
Gaussian likelihood.
This allows us to map from a low-dimensional latent space Rd to a
high-dimensional data space RD
GP-LVM scoring
Define hyperparameters θ = (σ, θrbf, θnoise)
lml = max_{σ, θrbf, θnoise} log p(Y | X, θ)
Find hyperparameter values through gradient descent
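The score log p(Y | X, θ) can be sketched for an RBF-plus-noise kernel with independent GPs over feature columns (the kernel form and parameter names loosely follow θ = (σ, θrbf, θnoise) and are our assumptions; the gradient-descent optimization of these values is omitted):

```python
import numpy as np

def rbf_kernel(X, variance=1.0, lengthscale=1.0, noise=0.1):
    """RBF kernel over latent points X (n x d) plus noise on the diagonal."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * D2 / lengthscale**2) + noise * np.eye(len(X))

def gplvm_log_likelihood(Y, X, **kernel_params):
    """log p(Y | X, theta) = sum_j log N(y_j | 0, K): one GP per feature."""
    n, p = Y.shape
    K = rbf_kernel(X, **kernel_params)
    sign, logdet = np.linalg.slogdet(K)
    K_inv = np.linalg.inv(K)
    ll = 0.0
    for j in range(p):
        y = Y[:, j]
        ll += -0.5 * logdet - 0.5 * n * np.log(2 * np.pi) - 0.5 * y @ K_inv @ y
    return ll

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 2))    # latent positions
Y = rng.normal(size=(10, 5))    # observed data
print(np.isfinite(gplvm_log_likelihood(Y, X)))  # True
```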
14 / 21
Motion Capture Experiments
15 / 21
Motion Capture Experiments
16 / 21
Robot Navigation Experiments
17 / 21
Robot Navigation Experiments
18 / 21
GP-LVM scores for experiments
19 / 21
Structural learning for DRILL
20 / 21
Discussion Questions
Is there an equivalent “no free lunch theorem” for dimensionality
reduction? Why or why not?
Were you convinced that this is a unifying approach?
If this is a unifying approach why is MEU the worst performer?
Does a theoretical paper such as this need more experiments?
Is the GP-LVM score a good performance measure for comparing
these methods?
Are the experiments themselves introducing any bias in the results?
Why do you think ALLE outperforms MEU?
Is the “blessing of dimensionality” a valid claim?
21 / 21

More Related Content

What's hot

A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNNtuxette
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphstuxette
 
High Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNEHigh Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNEKai-Wen Zhao
 
Dimensionality reduction with UMAP
Dimensionality reduction with UMAPDimensionality reduction with UMAP
Dimensionality reduction with UMAPJakub Bartczuk
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNEDavid Khosid
 
How to Layer a Directed Acyclic Graph (GD 2001)
How to Layer a Directed Acyclic Graph (GD 2001)How to Layer a Directed Acyclic Graph (GD 2001)
How to Layer a Directed Acyclic Graph (GD 2001)Nikola S. Nikolov
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Double Patterning
Double PatterningDouble Patterning
Double PatterningDanny Luk
 
2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlonozomuhamada
 
2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approachnozomuhamada
 
Generalized Notions of Data Depth
Generalized Notions of Data DepthGeneralized Notions of Data Depth
Generalized Notions of Data DepthMukund Raj
 
Satellite image compression technique
Satellite image compression techniqueSatellite image compression technique
Satellite image compression techniqueacijjournal
 
TENSOR VOTING BASED BINARY CLASSIFIER
TENSOR VOTING BASED BINARY CLASSIFIERTENSOR VOTING BASED BINARY CLASSIFIER
TENSOR VOTING BASED BINARY CLASSIFIERcscpconf
 
Spme 2013 segmentation
Spme 2013 segmentationSpme 2013 segmentation
Spme 2013 segmentationQujiang Lei
 

What's hot (20)

A review on structure learning in GNN
A review on structure learning in GNNA review on structure learning in GNN
A review on structure learning in GNN
 
03 Data Representation
03 Data Representation03 Data Representation
03 Data Representation
 
From RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphsFrom RNN to neural networks for cyclic undirected graphs
From RNN to neural networks for cyclic undirected graphs
 
High Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNEHigh Dimensional Data Visualization using t-SNE
High Dimensional Data Visualization using t-SNE
 
Dimensionality reduction with UMAP
Dimensionality reduction with UMAPDimensionality reduction with UMAP
Dimensionality reduction with UMAP
 
06 Vector Visualization
06 Vector Visualization06 Vector Visualization
06 Vector Visualization
 
Visualizing Data Using t-SNE
Visualizing Data Using t-SNEVisualizing Data Using t-SNE
Visualizing Data Using t-SNE
 
reportVPLProject
reportVPLProjectreportVPLProject
reportVPLProject
 
05 Scalar Visualization
05 Scalar Visualization05 Scalar Visualization
05 Scalar Visualization
 
regions
regionsregions
regions
 
How to Layer a Directed Acyclic Graph (GD 2001)
How to Layer a Directed Acyclic Graph (GD 2001)How to Layer a Directed Acyclic Graph (GD 2001)
How to Layer a Directed Acyclic Graph (GD 2001)
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Double Patterning
Double PatterningDouble Patterning
Double Patterning
 
2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo2012 mdsp pr04 monte carlo
2012 mdsp pr04 monte carlo
 
2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach2012 mdsp pr08 nonparametric approach
2012 mdsp pr08 nonparametric approach
 
Generalized Notions of Data Depth
Generalized Notions of Data DepthGeneralized Notions of Data Depth
Generalized Notions of Data Depth
 
Em molnar2015
Em molnar2015Em molnar2015
Em molnar2015
 
Satellite image compression technique
Satellite image compression techniqueSatellite image compression technique
Satellite image compression technique
 
TENSOR VOTING BASED BINARY CLASSIFIER
TENSOR VOTING BASED BINARY CLASSIFIERTENSOR VOTING BASED BINARY CLASSIFIER
TENSOR VOTING BASED BINARY CLASSIFIER
 
Spme 2013 segmentation
Spme 2013 segmentationSpme 2013 segmentation
Spme 2013 segmentation
 

Similar to A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction

Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEParogozhnikov
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...Adam Fausett
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practicetuxette
 
A Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares ProblemsA Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares ProblemsDawn Cook
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsFrank Nielsen
 
Matrix Completion Presentation
Matrix Completion PresentationMatrix Completion Presentation
Matrix Completion PresentationMichael Hankin
 
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...IOSRJECE
 
2014 vulnerability assesment of spatial network - models and solutions
2014   vulnerability assesment of spatial network - models and solutions2014   vulnerability assesment of spatial network - models and solutions
2014 vulnerability assesment of spatial network - models and solutionsFrancisco Pérez
 
Iterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO DecoderIterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO DecoderCSCJournals
 
Iterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO DecoderIterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO DecoderCSCJournals
 
Image Processing
Image ProcessingImage Processing
Image ProcessingTuyen Pham
 
Singh gordon-unified-factorization-ecml
Singh gordon-unified-factorization-ecmlSingh gordon-unified-factorization-ecml
Singh gordon-unified-factorization-ecmlHuỳnh Thông
 
COMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXING
COMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXINGCOMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXING
COMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXINGcsandit
 
Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)Mumbai Academisc
 

Similar to A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction (20)

Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
 
10.1.1.630.8055
10.1.1.630.805510.1.1.630.8055
10.1.1.630.8055
 
2009 asilomar
2009 asilomar2009 asilomar
2009 asilomar
 
17_monte_carlo.pdf
17_monte_carlo.pdf17_monte_carlo.pdf
17_monte_carlo.pdf
 
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
An_Accelerated_Nearest_Neighbor_Search_Method_for_the_K-Means_Clustering_Algo...
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
A Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares ProblemsA Robust Method Based On LOVO Functions For Solving Least Squares Problems
A Robust Method Based On LOVO Functions For Solving Least Squares Problems
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributions
 
Matrix Completion Presentation
Matrix Completion PresentationMatrix Completion Presentation
Matrix Completion Presentation
 
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
Investigation on the Pattern Synthesis of Subarray Weights for Low EMI Applic...
 
QMC: Transition Workshop - Small Sample Statistical Analysis and Algorithms f...
QMC: Transition Workshop - Small Sample Statistical Analysis and Algorithms f...QMC: Transition Workshop - Small Sample Statistical Analysis and Algorithms f...
QMC: Transition Workshop - Small Sample Statistical Analysis and Algorithms f...
 
2014 vulnerability assesment of spatial network - models and solutions
2014   vulnerability assesment of spatial network - models and solutions2014   vulnerability assesment of spatial network - models and solutions
2014 vulnerability assesment of spatial network - models and solutions
 
Iterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO DecoderIterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO Decoder
 
Iterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO DecoderIterative Soft Decision Based Complex K-best MIMO Decoder
Iterative Soft Decision Based Complex K-best MIMO Decoder
 
Line
LineLine
Line
 
(Slide)concentration effect
(Slide)concentration effect(Slide)concentration effect
(Slide)concentration effect
 
Image Processing
Image ProcessingImage Processing
Image Processing
 
Singh gordon-unified-factorization-ecml
Singh gordon-unified-factorization-ecmlSingh gordon-unified-factorization-ecml
Singh gordon-unified-factorization-ecml
 
COMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXING
COMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXINGCOMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXING
COMPARISON OF VOLUME AND DISTANCE CONSTRAINT ON HYPERSPECTRAL UNMIXING
 
Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)
 

More from Sean Golliher

Time Series Forecasting using Neural Nets (GNNNs)
Time Series Forecasting using Neural Nets (GNNNs)Time Series Forecasting using Neural Nets (GNNNs)
Time Series Forecasting using Neural Nets (GNNNs)Sean Golliher
 
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...Sean Golliher
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Sean Golliher
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingSean Golliher
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - IndexingSean Golliher
 
PageRank and The Google Matrix
PageRank and The Google MatrixPageRank and The Google Matrix
PageRank and The Google MatrixSean Golliher
 
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a CrawlerCSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a CrawlerSean Golliher
 

More from Sean Golliher (9)

Time Series Forecasting using Neural Nets (GNNNs)
Time Series Forecasting using Neural Nets (GNNNs)Time Series Forecasting using Neural Nets (GNNNs)
Time Series Forecasting using Neural Nets (GNNNs)
 
Goprez sg
Goprez  sgGoprez  sg
Goprez sg
 
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...
Property Matching and Query Expansion on Linked Data Using Kullback-Leibler D...
 
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)Lecture 9 - Machine Learning and Support Vector Machines (SVM)
Lecture 9 - Machine Learning and Support Vector Machines (SVM)
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Lecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document ParsingLecture 7- Text Statistics and Document Parsing
Lecture 7- Text Statistics and Document Parsing
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 
PageRank and The Google Matrix
PageRank and The Google MatrixPageRank and The Google Matrix
PageRank and The Google Matrix
 
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a CrawlerCSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
CSCI 494 - Lect. 3. Anatomy of Search Engines/Building a Crawler
 

Recently uploaded

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)riyaescorts54
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxMurugaveni B
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRlizamodels9
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
preservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxpreservation, maintanence and improvement of industrial organism.pptx
preservation, maintanence and improvement of industrial organism.pptxnoordubaliya2003
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Tamer Koksalan, PhD
 

Recently uploaded (20)

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
(9818099198) Call Girls In Noida Sector 14 (NOIDA ESCORTS)
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction

  • 1. A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models – Neil D. Lawrence
    Presenters: Sean Golliher and Derek Reimanis
    1 / 21
  • 2. Claims and Contributions
    “Unifying Approach”: views previous methods as less general than this approach and proves they are special cases
    Presents the unifying algorithm: Maximum Entropy Unfolding (MEU)
    Improves Local Linear Embedding (LLE) and introduces ALLE (an acyclic variant)
    Introduces a third algorithm, DRILL, which estimates the graph structure from the data (rather than from K nearest neighbors)
    Mostly a theoretical paper
    2 / 21
  • 3. Features of Spectral Methods
    Spectral methods: low-dimensional representations are derived from the eigenvectors of specially constructed matrices.
    Methods share a common approach:
    Compute the nearest neighbors of each input pattern
    Construct a weighted graph based on the neighborhood relations
    Derive a matrix from the weighted graph
    Produce a low-dimensional representation from the eigenvectors of that matrix
    Keep this in mind when trying to understand the unification
    3 / 21
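The shared four-step pipeline can be sketched in a few lines of NumPy. This is a generic, Laplacian-eigenmaps-style illustration of the recipe (neighbors, weighted graph, derived matrix, eigenvectors), not any one method from the paper; the function names and the choice of an unweighted, symmetrized graph are our own assumptions.

```python
import numpy as np

def knn_graph(X, k):
    """Steps 1-2: symmetric k-nearest-neighbor adjacency matrix."""
    n = len(X)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        # index 0 of the argsort is the point itself (distance 0): skip it
        nbrs = np.argsort(d2[i])[1:k + 1]
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)  # symmetrize the directed kNN relation

def spectral_embedding(X, k=5, dims=2):
    """Steps 3-4: derive a graph Laplacian, then embed with its
    bottom non-trivial eigenvectors."""
    W = knn_graph(X, k)
    D = np.diag(W.sum(1))
    L = D - W                        # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)   # ascending eigenvalues
    # drop the constant eigenvector (eigenvalue ~ 0), keep the next `dims`
    return vecs[:, 1:dims + 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Z = spectral_embedding(X, k=5, dims=2)
print(Z.shape)  # (30, 2)
```

The individual methods on the following slides differ mainly in which matrix is built from the graph and which end of the spectrum is kept.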
  • 4. Most Common Examples
    PCA: preserves the covariance structure of the data
    MDS (Multidimensional Scaling): preserves inner products of the input patterns, equivalently the pairwise distances ||x_i − x_j||²
    ISOMAP: preserves pairwise distances measured along a sub-manifold of the sample space; a variant of MDS that uses geodesics along the sub-manifold
    Geodesic: a curve whose tangent vectors remain parallel when transported along it
    4 / 21
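Classical MDS, for instance, recovers coordinates whose inner products match a double-centered squared-distance matrix. A minimal NumPy sketch (function names are ours; this is the textbook algorithm, not code from the paper):

```python
import numpy as np

def classical_mds(D2, dims=2):
    """Classical MDS: turn squared distances D2 into a Gram matrix by
    double-centering, then embed with its top eigenvectors."""
    n = len(D2)
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ D2 @ J                     # Gram matrix of centered points
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]     # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
Z = classical_mds(D2, dims=2)
# MDS recovers the geometry up to rotation/reflection, so distances match
D2z = ((Z[:, None] - Z[None, :]) ** 2).sum(-1)
print(np.allclose(D2, D2z))  # True
```

ISOMAP reuses exactly this machinery but replaces the Euclidean `D2` with graph-shortest-path (geodesic) distances.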
  • 5. Local Linear Embedding (LLE)
    Tries to capture non-linear relationships by preserving the local linear structure of the data.
    Four-step algorithm:
    Step 1: Compute the K nearest neighbors of each high-dimensional input pattern x_i and create a directed graph whose edges indicate nearest-neighbor relations.
    Step 2: Assign weights W_xy to the edges of the graph. Each input pattern and its neighbors can be viewed as samples from a locally linear patch on a lower-dimensional sub-manifold.
    Find the “reconstruction weights” that provide a local linear fit of the k + 1 points at each of the n points in the data set. Minimize the error function:
    E(W|D) = Σ_{x∈D} ||x − Σ_{y∈N(x)} W_xy y||²
    5 / 21
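Steps 1 and 2 amount to a small least-squares problem per point. A sketch of the standard solution (the sum-to-one constraint and the trace-scaled regularizer are the usual choices for this algorithm; the regularization constant is our assumption):

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Solve for reconstruction weights W minimizing
    E(W) = sum_x || x - sum_{y in N(x)} W_xy y ||^2
    subject to each row of W summing to one."""
    n = len(X)
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]    # skip the point itself
        Z = X[nbrs] - X[i]                   # center the patch on x_i
        C = Z @ Z.T                          # local (k x k) covariance
        C += reg * np.trace(C) * np.eye(k)   # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()             # enforce the sum-to-one constraint
    return W

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
W = lle_weights(X, k=4)
print(np.allclose(W.sum(axis=1), 1.0))  # True: rows sum to one
```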
  • 6. Local Linear Embedding (LLE) Cont’d
    Step 3: After the n local models are constructed, keep the weights fixed and find a new projection. This becomes a minimization problem similar to MDS:
    E(Z|D) = Σ_{x∈D} ||z_x − Σ_{y∈N(x)} W_xy z_y||²
    Rewriting with a matrix M_xy gives the objective (error function) in matrix form:
    E(Z|W) = Σ_{x,y} M_xy z_x^T z_y
    Find the m + 1 eigenvectors with the smallest eigenvalues and drop the bottom (constant) one, giving a new set of coordinates in the m-dimensional space.
    6 / 21
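Step 3 can be sketched as follows. It assumes a row-stochastic weight matrix W as produced by step 2; the random W built here is only a stand-in so the sketch runs on its own.

```python
import numpy as np

def lle_embedding(W, dims=2):
    """With weights fixed, minimize
    E(Z) = sum_x || z_x - sum_y W_xy z_y ||^2 = tr(Z^T M Z),
    where M = (I - W)^T (I - W)."""
    n = len(W)
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)   # ascending eigenvalues
    # rows of W sum to one, so (I - W) annihilates the constant vector:
    # the bottom eigenvector is constant (eigenvalue ~ 0) and is dropped
    return vecs[:, 1:dims + 1]

# stand-in weight matrix: random, zero diagonal, rows summing to one
rng = np.random.default_rng(3)
W = rng.random((15, 15))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)
Z = lle_embedding(W, dims=2)
print(Z.shape)  # (15, 2)
```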
  • 7. Maximum Entropy Unfolding (MEU)
    Kernel PCA idea: apply the “kernel trick” to map the data into a higher-dimensional feature space, then proceed as in PCA on the covariance in that space. This increases the feature space rather than reducing the dimensions, and it motivated the development of Maximum Variance Unfolding (MVU).
    MEU: since entropy is related to variance, they maximize entropy instead and obtain a probabilistic model.
    Derive a density p(Y) directly (not over squared distances) by constraining the expected squared inter-point distances d_ij of any two samples y_i and y_j.
    7 / 21
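The kernel PCA idea referenced above can be sketched in NumPy. The RBF kernel and its parameter are our illustrative choices; the structure (kernelize, double-center, eigendecompose the Gram matrix) is the standard algorithm:

```python
import numpy as np

def kernel_pca(X, dims=2, gamma=1.0):
    """Kernel PCA sketch: apply an RBF kernel (the 'kernel trick'),
    center it in feature space, then take the top eigenvectors of the
    centered kernel (Gram) matrix."""
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                  # implicit high-dim feature space
    n = len(K)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                           # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:dims]    # largest variance directions
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 3))
Z = kernel_pca(X, dims=2)
print(Z.shape)  # (25, 2)
```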
  • 8. Maximum Entropy Unfolding (MEU) Cont’d
    The observations are squared distances; they derive a density p(Y) that gives rise to those distances.
    Entropy is measured relative to a base density m(Y) via the KL divergence, so two distributions are needed:
    H = − ∫ p(Y) log [p(Y) / m(Y)] dY
    Assume a very broad, spherical Gaussian base density whose effect can be assumed to be small. The density over Y that minimizes the KL divergence under the constraints on the expectations is:
    p(Y) = Π_{j=1}^p |L + γI|^{1/2} / (2π)^{n/2} exp(−½ y_j^T (L + γI) y_j)
    where L is a matrix whose off-diagonal elements contain the neighborhood information for D, similar to the previous examples.
    Key point: this assumes independence of the density across data features (multiplying densities).
    8 / 21
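The density above factorizes over feature columns, each a zero-mean Gaussian with precision K = L + γI, so its log can be evaluated directly. A toy sketch, assuming a chain-graph Laplacian for L (our choice, purely illustrative):

```python
import numpy as np

def meu_log_density(Y, L, gamma=1e-2):
    """Log of the maximum-entropy density over Y (n points x p features):
    each feature column y_j ~ N(0, (L + gamma*I)^-1) independently, so
    log p(Y) = sum_j [ 0.5 log|K| - (n/2) log(2*pi) - 0.5 y_j^T K y_j ]."""
    n, p = Y.shape
    K = L + gamma * np.eye(n)
    sign, logdet = np.linalg.slogdet(K)
    # sum over feature columns of the quadratic forms y_j^T K y_j
    quad = np.einsum('ij,ik,kj->', Y, K, Y)
    return p * (0.5 * logdet - 0.5 * n * np.log(2 * np.pi)) - 0.5 * quad

# toy L: chain-graph Laplacian (off-diagonals encode neighbor relations)
n = 5
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
rng = np.random.default_rng(5)
Y = rng.normal(size=(n, 3))
print(meu_log_density(Y, L))
```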
  • 9. Maximum Entropy Unfolding (MEU) Cont’d
    GMRF: a (finite-dimensional) random vector following a multivariate normal (Gaussian) distribution with a Markov structure.
    Since GRFs provide an alternative approach to reducing the number of parameters in the covariance matrix, they use a GRF formulation, which is common in spatial statistics.
    Independence in their model is expressed over data features instead of data points; this assumes the features are i.i.d.
    For their model, the number of parameters does not grow with the number of data points. As the number of features increases, there is a clear “blessing of dimensionality”.
    9 / 21
  • 10. Maximum Entropy Unfolding (MEU) Cont’d
    With the GRF representation they show that, with some manipulation, the expected squared distances ⟨d_ij⟩ can be computed from the covariance matrix:
    ⟨d_ij⟩ = p/2 (k_ii − 2 k_ij + k_jj)
    This has the same form as the distance–similarity relations used in kernel PCA methods.
    10 / 21
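The kernel-to-distance identity underneath this (shown here without the slide's p/2 expectation factor, which comes from the GMRF model) can be checked directly with a linear kernel:

```python
import numpy as np

def expected_sq_distances(K):
    """Squared inter-point distances implied by a covariance/kernel
    matrix K: d_ij = k_ii - 2 k_ij + k_jj."""
    diag = np.diag(K)
    return diag[:, None] - 2 * K + diag[None, :]

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 3))
K = X @ X.T                     # linear kernel: pairwise inner products
D2 = expected_sq_distances(K)
# sanity check against directly computed pairwise squared distances
direct = ((X[:, None] - X[None, :]) ** 2).sum(-1)
print(np.allclose(D2, direct))  # True
```

This is the same similarity-to-distance conversion exploited by kernel PCA and classical MDS.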
  • 11. MEU and LLE
    They show that LLE approximates maximum likelihood in their MEU model:
    finding estimates of the parameters that maximize the probability of observing the data we have.
    Hence the claim of a “unifying model”: LLE is a special case of MEU.
    They also showed that using a pseudo-likelihood in the MEU model reduces to the generalization they presented in equation (9) for LLE.
    This uses the fact that the joint probability density of a GRF can be represented as a factorization over the cliques of the graph.
    The pseudo-likelihood reduces the product form of the distribution presented earlier to the objective matrix for LLE.
    11 / 21
  • 12. Acyclic Locally Linear Embedding (ALLE)
    The pseudo-likelihood is an approximation to the joint probability distribution of a collection of random variables.
    If they force the matrix M to be lower triangular, the true log-likelihood log p(Y) factorizes into n independent regression functions.
    This approach requires an ordering of the data because of the restriction j > i in the matrix M.
    12 / 21
  • 13. Dimensionality reduction through Regularization of the Inverse covariance in the Log Likelihood (DRILL)
    An optimization technique that applies L1 regularization to the dependencies to estimate the graph structure.
    How does this compare to the nearest-neighbor approaches?
    E(Λ) = − log p(Y|Λ) + λ Σ_{i<j} |Λ_ij|
    Minimize E(Λ) through Least Absolute Shrinkage and Selection Operator (LASSO) regression.
    13 / 21
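The objective can be sketched as follows: the GMRF negative log-likelihood (features i.i.d., as on the earlier slides) plus an L1 penalty on the off-diagonal precision entries. The exact constants in the paper's formulation are reconstructed here and should be treated as assumptions; the sketch only evaluates the objective, it does not run the LASSO optimization itself.

```python
import numpy as np

def drill_objective(Y, Lam, lam=0.1):
    """E(Lam) = -log p(Y | Lam) + lam * sum_{i<j} |Lam_ij| for a GMRF
    with precision Lam over the n data points. Driving off-diagonal
    entries of Lam to zero removes edges, learning the graph structure."""
    n, p = Y.shape
    sign, logdet = np.linalg.slogdet(Lam)
    assert sign > 0, "precision matrix must be positive definite"
    # sum over feature columns of the quadratic forms y_j^T Lam y_j
    quad = np.einsum('ij,ik,kj->', Y, Lam, Y)
    nll = -p * 0.5 * (logdet - n * np.log(2 * np.pi)) + 0.5 * quad
    l1 = lam * np.abs(np.triu(Lam, k=1)).sum()
    return nll + l1

rng = np.random.default_rng(7)
Y = rng.normal(size=(6, 4))
print(drill_objective(Y, np.eye(6)))  # identity precision: no edges, no penalty
```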
  • 14. Gaussian process latent variable model (GP-LVM) likelihood
    A process which maps data from a latent space to the observed data space, where the locations of the points are determined by maximizing a Gaussian likelihood.
    This allows us to map from a low-dimensional latent space R^d to a high-dimensional data space R^D.
    GP-LVM scoring: define the hyperparameters θ = (σ, θ_rbf, θ_noise) and compute
    lml = max_{σ, θ_rbf, θ_noise} log p(Y|X, θ)
    Find the hyperparameter values through gradient descent.
    14 / 21
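A hypothetical sketch of this score for an RBF kernel over the latent positions, treating each of the D data dimensions as an independent GP draw. The argument names map loosely onto (σ, θ_rbf, θ_noise) and are illustrative only; the maximization over hyperparameters (gradient descent) is omitted.

```python
import numpy as np

def gp_lvm_log_likelihood(Y, X, theta_rbf=1.0, lengthscale=1.0, noise=0.1):
    """log p(Y | X, theta): Gaussian marginal likelihood of the data Y
    (n x D) given latent positions X (n x d) under an RBF kernel."""
    n, D = Y.shape
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    K = theta_rbf * np.exp(-0.5 * d2 / lengthscale**2) + noise * np.eye(n)
    sign, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)           # K^{-1} Y without explicit inverse
    return (-0.5 * D * (logdet + n * np.log(2 * np.pi))
            - 0.5 * np.trace(Y.T @ Kinv_Y))

rng = np.random.default_rng(8)
X = rng.normal(size=(10, 2))   # latent positions (low-dimensional)
Y = rng.normal(size=(10, 5))   # observed data (high-dimensional)
print(gp_lvm_log_likelihood(Y, X))
```

Higher values of this score indicate that the latent positions X explain the observed data Y better, which is how the paper compares the embeddings produced by the different methods.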
  • 19. GP-LVM scores for experiments 19 / 21
  • 20. Structural learning for DRILL 20 / 21
  • 21. Discussion Questions
    Is there an equivalent “no free lunch theorem” for dimensionality reduction? Why or why not?
    Were you convinced that this is a unifying approach? If this is a unifying approach, why is MEU the worst performer?
    Does a theoretical paper such as this need more experiments?
    Is the GP-LVM score a good performance measure for comparing these methods? Are the experiments themselves introducing any bias in the results?
    Why do you think ALLE outperforms MEU?
    Is the “blessing of dimensionality” a valid claim?
    21 / 21