Multiple Non-Redundant Spectral Clustering Views
2010 ICML
1. Donglin Niu, Jennifer G. Dy
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
Michael I. Jordan
EECS and Statistics Departments, University of California, Berkeley
2.
3. Given medical data,
From doctor’s view:
according to type of disease
From insurance company view:
based on patient’s cost/risk
4. Two kinds of Approaches: Iterative & Simultaneous
Iterative
Given an existing clustering, find another
clustering
Conditional Information Bottleneck. Gondek and
Hofmann (2004)
COALA. Bae and Bailey (2006)
Minimizing KL-divergence. Qi and Davidson (2009)
Multiple alternative clusterings
Orthogonal Projection. Cui et al. (2007)
5. Simultaneous
Discovery of all the possible partitionings
Meta Clustering. Caruana et al. (2006)
De-correlated kmeans. Jain et al. (2008)
7. VIEW 1 VIEW 2
There are O(K^N) possible clustering solutions.
We’d like to find solutions that:
1. have high cluster quality, and
2. are non-redundant
and we’d like to simultaneously
3. learn the subspace in each view
8. Normalized Cut
(On Spectral Clustering, Ng et al.)
-maximize within-cluster similarity and minimize
between-cluster similarity.
Let U be the cluster assignment
$$\max_{U} \; \operatorname{tr}\left( U^T D^{-1/2} K D^{-1/2} U \right) \quad \text{s.t.} \quad U^T U = I$$
Advantage: Can discover arbitrarily-shaped clusters.
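A minimal NumPy sketch of the relaxed normalized-cut problem on this slide. It assumes a Gaussian kernel for K (the slide does not fix a kernel), and the helper names `gaussian_kernel` and `spectral_embedding` are hypothetical. Under the constraint U^T U = I, the trace is maximized by the top-c eigenvectors of D^{-1/2} K D^{-1/2}:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Pairwise Gaussian (RBF) similarities; the kernel choice is an assumption.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def spectral_embedding(K, c):
    # Normalized similarity matrix D^{-1/2} K D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    K_norm = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    K_norm = (K_norm + K_norm.T) / 2.0  # guard against round-off asymmetry
    # eigh returns eigenvalues in ascending order; keep the c largest.
    eigvals, eigvecs = np.linalg.eigh(K_norm)
    U = eigvecs[:, -c:]
    return U, eigvals[-c:].sum()  # embedding and the attained trace value

# Toy data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
K = gaussian_kernel(X, sigma=1.0)
U, objective = spectral_embedding(K, c=2)
```

With c well-separated groups, the attained trace is close to its upper bound c, because each group contributes an eigenvalue near 1.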
9. There are several possible criteria:
Correlation, Mutual information.
Correlation: can capture only linear dependencies.
Mutual information: can capture non-linear
dependencies, but requires estimating the joint probability
distribution.
In this approach, we choose the
Hilbert-Schmidt Independence Criterion (HSIC):
$$\mathrm{HSIC}(x, y) = \| C_{xy} \|_{HS}^{2}$$
Advantage: Can detect non-linear dependence and does not need
to estimate joint probability distributions.
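10. HSIC is the norm of a cross-covariance matrix
in kernel space.
$$\mathrm{HSIC}(x, y) = \| C_{xy} \|_{HS}^{2}$$
$$C_{xy} = E_{xy}\left[ (\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y) \right]$$
Empirical estimate of HSIC:
$$\mathrm{HSIC}(X, Y) := \frac{1}{n^2} \operatorname{tr}(KHLH)$$
$$\text{s.t.} \quad H, K, L \in \mathbb{R}^{n \times n}, \quad K_{ij} := k(x_i, x_j), \quad L_{ij} := l(y_i, y_j), \quad H = I - \frac{1}{n} 1_n 1_n^T$$
where n is the number of observations and k, l are kernel functions.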
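The empirical estimate on this slide can be sketched in a few lines of NumPy. The Gaussian kernel, its bandwidth, and the toy data are assumptions made for illustration; `hsic` is a hypothetical helper name. The example uses y = x^2, a dependence that linear correlation largely misses:

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix; the kernel choice is an assumption.
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(K, L):
    # Empirical HSIC: (1 / n^2) * tr(K H L H), with H = I - (1/n) 1 1^T.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y_dep = x**2 + 0.05 * rng.normal(size=(200, 1))  # non-linear function of x
y_ind = rng.uniform(-1, 1, size=(200, 1))        # drawn independently of x

Kx = gaussian_kernel(x, sigma=0.5)
hsic_dep = hsic(Kx, gaussian_kernel(y_dep, sigma=0.5))
hsic_ind = hsic(Kx, gaussian_kernel(y_ind, sigma=0.5))
```

Comparing `hsic_dep` with `hsic_ind` shows the criterion responding to the non-linear relationship without any estimate of the joint distribution.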
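11. Cluster Quality: Normalized Cut
Redundancy: HSIC
$$\underset{U_v \in \mathbb{R}^{n \times c},\; W_v}{\text{maximize}} \;\; \sum_{v} \operatorname{tr}\left( U_v^T D_v^{-1/2} K_v D_v^{-1/2} U_v \right) - \lambda \sum_{v \neq q} \operatorname{tr}\left( K_v H K_q H \right)$$
$$\text{s.t.} \quad U_v^T U_v = I, \quad W_v^T W_v = I, \quad K_{v,ij} = K(W_v^T x_i, W_v^T x_j)$$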
where Uv is the embedding,
Kv is the kernel matrix, and
Dv is the degree matrix for each view v.
H is the matrix that centers the kernel matrix.
All of these are defined in the subspace Wv.
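To make the two terms concrete, here is a hedged sketch that evaluates the objective for given subspaces Wv and embeddings Uv. The Gaussian kernel, the function name `msc_objective`, and the random toy inputs are assumptions; the code only evaluates the objective and does not optimize it:

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix; the kernel choice is an assumption.
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def msc_objective(X, Ws, Us, lam, sigma=1.0):
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # Rows of X @ W are the projected points W^T x_i for each view.
    Ks = [gaussian_kernel(X @ W, sigma) for W in Ws]
    quality = 0.0
    for K, U in zip(Ks, Us):
        d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
        K_norm = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
        quality += np.trace(U.T @ K_norm @ U)          # cluster-quality term
    redundancy = 0.0
    for v in range(len(Ks)):
        for q in range(len(Ks)):
            if v != q:
                redundancy += np.trace(Ks[v] @ H @ Ks[q] @ H)  # HSIC penalty
    return quality - lam * redundancy, quality, redundancy

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Ws = [np.eye(4)[:, :2], np.eye(4)[:, 2:]]   # two views on disjoint features
Us = [np.linalg.qr(rng.normal(size=(30, 2)))[0] for _ in Ws]  # orthonormal
total, quality, redundancy = msc_objective(X, Ws, Us, lam=0.5)
```

Raising `lam` weights the redundancy penalty more heavily relative to cluster quality.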
12. We use a coordinate ascent approach.
Step 1: Fix Wv, optimize for Uv
The solution for Uv is given by the eigenvectors with the
largest eigenvalues of the normalized kernel
similarity matrix.
Step 2: Fix Uv, optimize for Wv
We use gradient ascent on a Stiefel manifold.
Repeat Steps 1 & 2 until convergence.
K-means Step:
Normalize Uv. Apply k-means on Uv.
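The final discretization step can be sketched as follows. The slide says only "Normalize Uv"; scaling each row to unit length (as in Ng et al.) is an assumption here, and the small k-means routine with farthest-point initialization is a stand-in written for illustration, not the authors' implementation:

```python
import numpy as np

def normalize_rows(U, eps=1e-12):
    # Scale each row of the embedding to unit length (assumed normalization).
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, eps)

def kmeans(Z, k, n_iter=100):
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [Z[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = Z[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy embedding: rows point along one of two directions with varying length.
U = np.array([[0.5, 0.0], [0.2, 0.01], [0.9, 0.0],
              [0.0, 0.3], [0.02, 0.8], [0.0, 0.6]])
labels = kmeans(normalize_rows(U), k=2)
```

Row normalization removes the differences in row length, so k-means groups points by direction in the embedding space.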
13. Cluster the features using spectral clustering.
Data x = [f1 f2 f3 f4 f5 …fd]
Feature similarity based on HSIC(fi,fj).
[Figure: Transformation matrix Wv. The features (f1, f2, f3, ...) are grouped into clusters, and each feature group yields a 0/1 selection matrix Wv.]
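A hedged sketch of this initialization: compute HSIC between every pair of features, then turn each feature group into a 0/1 selection matrix Wv. The Gaussian kernel and the helper names are assumptions, and the spectral clustering of the features is omitted; the group labels below are hard-coded as a hypothetical clustering outcome:

```python
import numpy as np

def gaussian_kernel(Z, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix; the kernel choice is an assumption.
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def feature_similarity(X, sigma=1.0):
    # S[i, j] = HSIC(f_i, f_j), treating each column of X as one variable.
    d = X.shape[1]
    Ks = [gaussian_kernel(X[:, [j]], sigma) for j in range(d)]
    S = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            S[i, j] = S[j, i] = hsic(Ks[i], Ks[j])
    return S

def build_W(feature_labels, view):
    # One column per selected feature, with a single 1 in that feature's row.
    idx = np.flatnonzero(np.asarray(feature_labels) == view)
    W = np.zeros((len(feature_labels), idx.size))
    W[idx, np.arange(idx.size)] = 1.0
    return W

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 100))
X = np.column_stack([a, a + 0.1 * rng.normal(size=100),
                     b, b + 0.1 * rng.normal(size=100)])
S = feature_similarity(X)
feature_labels = [0, 0, 1, 1]  # hypothetical result of clustering features on S
W0, W1 = build_W(feature_labels, 0), build_W(feature_labels, 1)
```

Each W built this way has orthonormal columns, so it satisfies the constraint Wv^T Wv = I at the start of the optimization.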
15. FACE Data: Identity (ID) View and Pose View
NMI Results
          ID     POSE
mSC       0.79   0.42
OPC       0.67   0.37
DK        0.70   0.40
SC        0.67   0.22
Kmeans    0.64   0.24
• Mean face
• Number below each image is cluster purity
16. WebKB Data: High-Weight Words
High-weight words in each subspace view:
view 1: Cornell, Texas, Wisconsin, Madison, Washington
view 2: homework, student, professor, project, Ph.D.
NMI Results (WebKB)
          Univ.  Type
mSC       0.81   0.54
OPC       0.43   0.53
DK        0.48   0.57
SC        0.25   0.39
Kmeans    0.10   0.50
17. NSF Award Data: High-Frequency Words
Subjects                                  Work Type
Physics     Information   Biology         experimental    theoretical
materials   control       cell            methods         Experiments
chemical    programming   gene            mathematical    Processes
metal       information   protein         develop         Techniques
optical     function      DNA             equation        Measurements
quantum     languages     Biological      theoretical     surface
18. Machine Sound Data
          Motor  Fan    Pump
mSC       0.82   0.75   0.83
OPC       0.73   0.68   0.47
DK        0.64   0.58   0.75
SC        0.42   0.16   0.09
Kmeans    0.57   0.16   0.09
Normalized Mutual Information (NMI) Results
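For readers unfamiliar with the metric in these results tables, here is a minimal sketch of NMI between a clustering and a reference labeling. The slides do not state which normalization was used; dividing by the geometric mean of the two entropies is an assumption, and `nmi` is a hypothetical helper name:

```python
import numpy as np

def nmi(a, b):
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    n = a.size
    # Map arbitrary labels to 0..k-1 and build the contingency table.
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    cont = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(cont, (ai, bi), 1)
    pxy = cont / n
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px * np.log(px))
    hy = -np.sum(py * np.log(py))
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a single-cluster labeling carries no information
    return mi / np.sqrt(hx * hy)  # assumed normalization: sqrt(H(a) * H(b))

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to relabeling
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])       # statistically independent
```

NMI is invariant to permuting cluster labels, which is why it suits comparing a discovered view against a ground-truth labeling.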
19. Most clustering algorithms only find one single
clustering solution. However, data may be multi-
faceted (i.e., it can be interpreted in many different
ways).
We introduced a new method for discovering
multiple non-redundant clusterings.
Our approach, mSC, optimizes both a spectral
clustering objective (to measure quality) and an HSIC
regularization term (to measure redundancy).
mSC can discover multiple clusterings with flexible
shapes while simultaneously finding the subspace in
which these clustering views reside.
Good afternoon. My name is Donglin Niu and I’m going to talk about “Multiple Non-Redundant Spectral Clustering Views.” This is work I did with my advisor, Jennifer Dy, from Northeastern University, and with Mike Jordan from UC Berkeley.
Clustering is often the first step in exploring data. Most clustering algorithms only find one clustering solution. However, data may be multi-faceted by nature (i.e., a single data set can be interpreted in many different ways). For example, let’s say our data is a bunch of webpages, as shown here. One way to cluster this data is by grouping faculty webpages together in one cluster and student webpages in another. Another way is to group them according to the university they belong to.
Another example: given medical data, a doctor may be interested in grouping the data based on disease type. An insurance company may be interested in grouping the patients according to their cost/risk.
Because the need to find multiple alternative clustering interpretations has been recognized, there is recent interest in this new clustering research paradigm. There are two kinds of approaches to solving this problem: iterative and simultaneous. In iterative methods, one is given an existing clustering, and the goal is to find an alternative clustering. Gondek and Hofmann find an alternative clustering using a conditional information bottleneck approach, Bae and Bailey apply must-link and cannot-link constraints with agglomerative clustering, and Qi and Davidson minimize a KL-divergence criterion. In many cases, one may be interested in finding not just one but multiple alternative clusterings. Cui et al. introduced an iterative orthogonal projection approach for finding multiple alternative clustering solutions.
Another type of solution is to discover all the possible partitionings simultaneously. Meta Clustering by Caruana et al. generates several alternative solutions by random projection, then applies hierarchical clustering to the clustering solutions. De-correlated k-means by Jain et al. minimizes both the k-means sum-squared error for each clustering solution and the correlation between solutions to find multiple cluster partitionings. Our approach is a simultaneous approach. However, unlike meta clustering, which applies random projection, we find multiple alternative clusterings based on an objective function. Unlike de-correlated k-means, which is based on k-means and thereby limited to finding only spherical clusters, our approach can discover non-convex-shaped clusters. Moreover, de-correlated k-means uses all the features in all the views; our approach learns the subspace in each clustering view.
The paradigm of finding multiple alternative clusterings is different from ensemble methods. Like this paradigm, ensemble clustering generates several alternative clusterings, but its ultimate goal is to find a SINGLE consensus clustering solution. Hierarchical clustering also generates several partitionings; however, it generates a hierarchy of coarse-to-fine clusters, such that samples that belong to the same cluster at the lower, finer levels of the hierarchy stay together at the higher, coarser levels. In our case, samples that belong to the same cluster in one view or solution can belong to different clusters in other views.
Let’s say we have data in four dimensions. In features F1 and F2 it has a three-ring cluster structure, as shown in View 1, and in features F3 and F4 it has a two half-moon cluster structure, as shown in View 2. A standard clustering algorithm faces the dilemma of selecting which of these two structures is more interesting to discover. Instead of finding one of them, our goal is to find all possible interesting cluster structures/views. There are O(K^n) possible ways to cluster n samples into K groups modulo permutation of the clusters. We do not want to show all of these to the user, as it would overwhelm the data analyst. We’d like to find solutions that have high cluster quality, and we’d like to provide non-redundant cluster views. Moreover, we’ve noticed that the different alternative clusterings typically reside in different subspaces (i.e., they utilize different similarity metrics to find these clusters). Thus, in our formulation, we also simultaneously learn the subspace in which the clusterings reside in each view. I’ll discuss each component in the following slides.
We’d like to capture arbitrarily shaped clusters. We employ the normalized-cut criterion and spectral clustering to define cluster quality. Normalized cut maximizes the within-cluster similarity and minimizes the between-cluster similarity. Let U be the cluster assignment. In spectral clustering, we relax the cluster assignment U to take on any real value; the normalized-cut clustering objective then becomes maximizing the trace of U transposed times the normalized similarity matrix times U, subject to the constraint that U is orthonormal. The advantage of this criterion is that it can discover arbitrarily shaped clusters.
We’d like the clustering solutions we discover to be non-redundant with each other. There are several possible criteria for measuring non-redundancy: correlation or mutual information. (Read slide)
HSIC is a norm of a cross-covariance matrix in kernel space. Empirically, we can estimate the HSIC between two random variables X and Y as the trace of a product of two kernel matrices, K and L. H here simply centers the kernel matrices.
Our overall objective is then to maximize this function. The first term optimizes for cluster quality, the spectral clustering criterion. The second term minimizes the redundancies among the clustering views. Lambda is the regularization parameter that controls the trade-off between these two criteria. We incorporate discovering the subspace in which the clustering solutions in each view reside by learning a transformation matrix W_v. Note that W_v is inside the kernel and operates on the original input x.
We optimize our objective to solve for the cluster embedding Uv and the subspace Wv in each view as follows. (Read slide) We discretize by applying a k-means step: (read slide)
Our approach is only guaranteed to find local optima. Thus, the solution is dependent on initialization. We initialize the subspaces Wv in each view as follows. We cluster the features (i.e., columns of x) using spectral clustering, with HSIC(f_i, f_j) between features as the measure of similarity. This groups features that are dependent on each other into the same cluster and those that are independent of each other into different groups. Each feature group forms the transformation matrix Wv in each view as follows. (Click through the animation and explain.) Note that even though each view starts with disjoint features, after running our algorithm to convergence, each feature will have some weight in all views. Note too that the dimensions in each view are set by the number of features in each view in our initialization.