2. Application 1: Topic Modeling
Document modeling
Observed: words in document corpus.
Hidden: topics.
Goal: carry out document summarization.
3. Application 2: Understanding Human Communities
Social Networks
Observed: network of social ties, e.g., friendships, co-authorships.
Hidden: groups/communities of actors.
4. Application 3: Recommender Systems
Recommender System
Observed: ratings of users for various products, e.g., Yelp reviews.
Goal: Predict new recommendations.
Modeling: Find groups/communities of users and products.
5. Application 4: Feature Learning
Feature Engineering
Learn good features/representations for classification tasks, e.g., image and speech recognition.
Sparse representations; low-dimensional hidden structures.
6. Application 5: Computational Biology
Observed: gene expression levels
Goal: discover gene groups
Hidden variables: regulators controlling gene groups
A. Gitter, F. Huang, R. Valluvan, E. Fraenkel, and A. Anandkumar, “Unsupervised Learning of Transcriptional Regulatory Networks via Latent Tree Graphical Model,” submitted to BMC Bioinformatics, Jan. 2014.
9. Statistical Framework
In all applications: discover hidden structure in data — unsupervised learning.
Latent Variable Models
Concise statistical description through graphical modeling.
Conditional independence relationships or a hierarchy of variables.
[Graphical model: observed variables x1, . . . , x5 at the leaves; hidden variable h1 at the root with children h2 and h3.]
14. Computational Framework
Challenge: Efficient Learning of Latent Variable Models
Maximum likelihood is NP-hard.
In practice: EM and Variational Bayes have no consistency guarantees.
Can we obtain efficient computational and sample complexities?
Fast methods such as matrix factorization are not statistical: we cannot learn the latent variable model through such methods.
Tensor-based Estimation
Estimate moment tensors from data: higher-order relationships.
Compute a decomposition of the moment tensor.
Iterative updates, e.g., tensor power iterations, alternating minimization.
Non-convex: convergence only to a local optimum, with no guarantees in general.
Innovation: guaranteed convergence to the correct model.
In this talk: tensor decompositions and applications.
17. Probabilistic Topic Models
Bag of words: the order of words does not matter.
Graphical model representation:
l words in a document: x1, . . . , xl.
h: proportions of topics in the document.
Word xi is generated from topic yi.
A(i, j) := P[xm = i | ym = j]: the topic-word matrix.
[Graphical model: topic mixture h at the root; topics y1, . . . , y5; words x1, . . . , x5, each xi generated from yi through A.]
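As a minimal sketch of this generative process (all dimensions, the Dirichlet prior on h, and the parameter values are made-up illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, l = 3, 10, 5                          # topics, vocabulary size, words per document
A = rng.dirichlet(np.ones(d), size=k).T     # d x k matrix; A[i, j] = P[x = i | y = j]

def sample_document(alpha=0.5):
    h = rng.dirichlet(alpha * np.ones(k))   # topic proportions for this document
    y = rng.choice(k, size=l, p=h)          # topic y_i for each word
    return np.array([rng.choice(d, p=A[:, t]) for t in y])  # word x_i ~ column y_i of A

doc = sample_document()
```

Each column of A sums to one, so each topic is a distribution over the d words.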
21. Geometric Picture for Topic Models
[Figure: topic proportions vector h on the topic simplex (a single topic sits at a corner); words x1, x2, x3 generated through A.]
Linear model: E[xi | h] = Ah.
Multiview model: h is fixed and multiple words (xi) are generated.
24. Moment Tensors
Consider the single-topic model.
E[xi | h] = Ah, with λr := P[h = r], i.e., λ = E[h].
Learn the topic-word matrix A and the vector λ.
M2: co-occurrence of two words in a document:
M2 := E[x1 x2⊤] = E[ E[x1 x2⊤ | h] ] = A E[h h⊤] A⊤ = Σ_{r=1}^{k} λr ar ar⊤
Tensor M3: co-occurrence of three words:
M3 := E[x1 ⊗ x2 ⊗ x3] = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar
Matrix and tensor forms, with ar := rth column of A:
M2 = Σ_{r=1}^{k} λr ar ⊗ ar,  M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar
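The equality between the conditional-expectation form and the rank-k form can be checked numerically on a small made-up instance (dimensions and parameters below are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 8
A = rng.dirichlet(np.ones(d), size=k).T     # columns a_r: topic-word distributions
lam = np.array([0.5, 0.3, 0.2])             # lam_r = P[h = r]

# In the single-topic model h is one-hot, so E[h h^T] = diag(lam), and
# M2 = A E[h h^T] A^T collapses to the rank-k form  sum_r lam_r a_r a_r^T.
M2 = A @ np.diag(lam) @ A.T
M2_rank1 = sum(lam[r] * np.outer(A[:, r], A[:, r]) for r in range(k))
assert np.allclose(M2, M2_rank1)

# M3 = sum_r lam_r a_r (x) a_r (x) a_r: a d x d x d tensor of rank k.
M3 = sum(lam[r] * np.einsum('i,j,l->ijl', A[:, r], A[:, r], A[:, r]) for r in range(k))
```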
25. Tensor Decomposition Problem
M2 = Σ_{r=1}^{k} λr ar ⊗ ar,  M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar
[Figure: tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + · · ·]
u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.
k topics, d words in the vocabulary.
M3: a d × d × d tensor of rank k.
Learning Topic Models through Tensor Decomposition
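The rank-1 definition is easy to verify directly (a toy example with made-up vectors):

```python
import numpy as np

u, v, w = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])

# The (i, j, k) entry of the rank-1 tensor u (x) v (x) w is u_i * v_j * w_k.
T = np.einsum('i,j,k->ijk', u, v, w)
assert T.shape == (2, 2, 2)
assert T[0, 1, 1] == u[0] * v[1] * w[1]     # 1 * 4 * 6 = 24
```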
30. Detecting Communities in Networks
Stochastic block model: non-overlapping communities.
Mixed membership model: overlapping communities.
Unifying Assumption
Edges are conditionally independent given the community memberships.
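A small stochastic block model makes this assumption concrete: once the hidden memberships are fixed, every edge is an independent coin flip (the community count, network size, and edge probabilities below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 3
z = rng.integers(k, size=n)                   # hidden community of each node
P = np.full((k, k), 0.05) + np.eye(k) * 0.4   # edge prob: 0.45 within, 0.05 across
B = P[z][:, z]                                # B[u, v] = P[edge (u, v) | z_u, z_v]

# Conditional independence: given z, each edge is its own Bernoulli draw.
U = rng.random((n, n))
Adj = np.triu(U < B, 1)                       # upper triangle, no self-loops
Adj = (Adj | Adj.T).astype(int)               # symmetric adjacency matrix
```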
32. Tensor Forms in Other Models
Independent Component Analysis
Independent sources, unknown mixing.
Blind source separation of speech, images, video, . . .
[Graphical model: sources h1, h2, . . . , hk mixed through A into observations x1, x2, . . . , xd.]
Gaussian Mixtures, Hidden Markov Models / Latent Trees
[Graphical model: hidden tree h1, h2, h3 over observations x1, . . . , x5.]
Reduction to similar moment forms.
34. Tensor Decomposition Problem
M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar
[Figure: tensor M3 = λ1 a1 ⊗ a1 ⊗ a1 + λ2 a2 ⊗ a2 ⊗ a2 + · · ·]
u ⊗ v ⊗ w is a rank-1 tensor whose (i, j, k)th entry is ui vj wk.
M3: a d × d × d tensor of rank k.
Here d is the vocabulary size for topic models, or n, the size of the network, for community models.
35. Dimensionality Reduction for Tensor Decomposition
M3 = Σ_{r=1}^{k} λr ar ⊗ ar ⊗ ar
Dimensionality Reduction (Whitening)
Convert M3 of size d × d × d to a tensor T of size k × k × k.
Carry out the decomposition of T.
[Figure: tensor M3 → tensor T.]
Dimensionality reduction through multilinear transforms, computed from data, e.g., pairwise moments.
T = Σi ρi ri⊗3 is a symmetric orthogonal tensor: the {ri} are orthonormal.
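A population-level sketch of the whitening step, reusing the moment forms from the earlier slides (in practice M2 and M3 would be estimated from data; dimensions and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k, d = 3, 8
A = rng.dirichlet(np.ones(d), size=k).T
lam = np.array([0.5, 0.3, 0.2])
M2 = sum(lam[r] * np.outer(A[:, r], A[:, r]) for r in range(k))
M3 = sum(lam[r] * np.einsum('i,j,l->ijl', A[:, r], A[:, r], A[:, r]) for r in range(k))

# Whitening matrix W (d x k) from the top-k eigenpairs of M2, so that W^T M2 W = I_k.
evals, evecs = np.linalg.eigh(M2)
W = evecs[:, -k:] / np.sqrt(evals[-k:])

# Multilinear transform T = M3(W, W, W): a k x k x k symmetric tensor.
T = np.einsum('ijl,ia,jb,lc->abc', M3, W, W, W)
```

The whitened vectors √λr W⊤ar are orthonormal, which is what makes T orthogonally decomposable.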
38. Orthogonal/Eigen Decomposition
Orthogonal symmetric tensor: T = Σ_{j∈[k]} ρj rj⊗3.
T(I, r1, r1) = Σ_{j∈[k]} ρj ⟨r1, rj⟩² rj = ρ1 r1.
Obtaining eigenvectors through power iterations:
u ← T(I, u, u) / ‖T(I, u, u)‖
Basic Algorithm
Random initialization, run power iterations and deflate
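The basic algorithm can be sketched on a synthetic orthogonal symmetric tensor (the ρ values and iteration count are made-up illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 3
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))    # orthonormal r_1, ..., r_k (columns)
rho = np.array([3.0, 2.0, 1.0])
T = sum(rho[j] * np.einsum('a,b,c->abc', Q[:, j], Q[:, j], Q[:, j]) for j in range(k))

def power_iteration(T, iters=100):
    u = rng.standard_normal(k)
    u /= np.linalg.norm(u)                          # random initialization
    for _ in range(iters):
        u = np.einsum('abc,b,c->a', T, u, u)        # u <- T(I, u, u)
        u /= np.linalg.norm(u)
    return np.einsum('abc,a,b,c->', T, u, u, u), u  # eigenvalue T(u, u, u) and eigenvector

eigs = []
for _ in range(k):
    eig, u = power_iteration(T)
    eigs.append(eig)
    T = T - eig * np.einsum('a,b,c->abc', u, u, u)  # deflate the recovered component
```

Each round converges to one robust eigenvector ±rj; deflation removes its component so the next round finds another.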
39. Practical Considerations
k communities, n nodes, k ≪ n.
Steps
k-SVD of an n × n matrix: randomized techniques.
Online k × k × k tensor decomposition: no tensor is explicitly formed.
Parallelization: inherently parallelizable; GPU deployment.
Sparse implementation: real-world networks are sparse.
Validation metric: p-value test based on “soft-pairing.”
Parallel time complexity: O(nsk/c + k³), where s is the maximum degree in the graph and c is the number of cores.
Huang, Niranjan, Hakeem, and Anandkumar, “Fast Detection of Overlapping Communities via Online Tensor Methods,” preprint, Sept. 2013.
40. Scaling Of The Stochastic Iterations
vi^{t+1} ← vi^t − 3θβt Σ_{j=1}^{k} ⟨vj^t, vi^t⟩² vj^t + βt ⟨vi^t, yA^t⟩ ⟨vi^t, yB^t⟩ yC^t + · · ·
Parallelize across eigenvectors.
STGD is iterative: the device code reuses buffers for the updates.
[Diagram: transfer of vi^t and yA^t, yB^t, yC^t between CPU and GPU under the standard interface vs. the device interface.]
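One STGD update, as written above, can be sketched in vectorized form (an illustrative sketch, not the authors' GPU implementation; `V` stacks the estimates vi as columns, and yA, yB, yC stand for whitened samples):

```python
import numpy as np

def stgd_step(V, yA, yB, yC, beta, theta):
    # V: d x k, column i is the current estimate v_i; beta is the step size,
    # theta weights the orthogonality penalty.
    C = V.T @ V                                      # C[j, i] = <v_j, v_i>
    penalty = V @ (C ** 2)                           # column i: sum_j <v_j, v_i>^2 v_j
    signal = np.outer(yC, (V.T @ yA) * (V.T @ yB))   # column i: <v_i, yA><v_i, yB> yC
    return V - 3 * theta * beta * penalty + beta * signal
```

All k columns are updated in one pass, which is what makes parallelizing across eigenvectors natural.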
41. Scaling Of The Stochastic Iterations
[Plot: running time (secs) vs. number of communities k, on log–log axes, comparing the MATLAB Tensor Toolbox, CULA Standard Interface, CULA Device Interface, and Eigen Sparse.]
43. Experimental Results
Datasets: Facebook (users and friendships, n ∼ 20,000); Yelp (users, businesses, and reviews, n ∼ 40,000); DBLP (authors and co-authorships, n ∼ 1 million).
Error (E) and recovery ratio (R):
Dataset            | k̂   | Method      | Running Time | E      | R
Facebook (k = 360) | 500 | ours        | 468          | 0.0175 | 100%
Facebook (k = 360) | 500 | variational | 86,808       | 0.0308 | 100%
Yelp (k = 159)     | 100 | ours        | 287          | 0.046  | 86%
Yelp (k = 159)     | 100 | variational | N.A.         |        |
DBLP (k = 6000)    | 100 | ours        | 5407         | 0.105  | 95%
45. Experimental Results on Yelp
Lowest-error business categories & largest-weight businesses:
Rank | Category       | Business                  | Stars | Review Count
1    | Latin American | Salvadoreno Restaurant    | 4.0   | 36
2    | Gluten Free    | P.F. Chang’s China Bistro | 3.5   | 55
3    | Hobby Shops    | Make Meaning              | 4.5   | 14
4    | Mass Media     | KJZZ 91.5FM               | 4.0   | 13
5    | Yoga           | Sutra Midtown             | 4.5   | 31
Bridgeness: distance from the vector [1/k̂, . . . , 1/k̂]⊤.
Top-5 bridging nodes (businesses):
Business             | Categories
Four Peaks Brewing   | Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco      | Restaurants, Pizza, Phoenix
FEZ                  | Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt’s Big Breakfast | Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co     | Restaurants, Bars, Nightlife, Pubs, Tempe
47. Conclusion
Guaranteed Learning of Latent Variable Models
Guaranteed to recover the correct model.
Efficient sample and computational complexities.
Better performance compared to EM, Variational Bayes, etc.
Mixed membership communities, topic models, ICA, Gaussian mixtures, . . .
Current and Future Goals
Guaranteed online learning in high dimensions.
Large-scale cloud-based implementation of tensor approaches.
Code available on website and GitHub.