1. Clustering Relational Data using the Infinite Relational Model
Ana Daglis
Supervised by: Matthew Ludkin
September 4, 2015
Ana Daglis Clustering Data using the Infinite Relational Model September 4, 2015 1 / 29
2. Outline
1 Clustering
2 Model
3 Gibbs Sampling
Methodology
Results
4 Split-Merge Algorithm
Methodology
Results
5 Future Work
3. Clustering
Clustering
Cluster Analysis: given unlabelled data, we want algorithms that
automatically group the data points into coherent subsets (clusters).
Applications:
recommendation engines (Netflix, iTunes, Quora,...)
image compression
targeted marketing
Google News
4. Model
Infinite Relational Model
The Infinite Relational Model (IRM) is a model in which each node is
assigned to a cluster. The number of clusters is not known initially and is
learned from the data as part of the statistical inference.
The IRM is represented by the following parameters:
zi - the cluster containing node i, for i = 1, ..., n.
φk,l - the probability of an edge between clusters k and l.
5. Model
Assumptions
Given the adjacency matrix of the graph, X, as our data, we assume
that Xi,j ∼ Bernoulli(φzi,zj).
Since z and φ are not known, Chinese restaurant process and beta priors
respectively are imposed:
z ∼ CRP(A)
φk,l ∼ Beta(a, b).
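As an illustrative sketch (not part of the talk), the generative process above can be simulated directly. The function name is my own, and I treat X as a directed, unsymmetrised adjacency matrix:

```python
import random

def simulate_irm(n, A, a, b, seed=None):
    """Generate (z, X) from the IRM: z ~ CRP(A), phi_kl ~ Beta(a, b),
    X_ij ~ Bernoulli(phi[z_i][z_j])."""
    rng = random.Random(seed)
    # CRP(A) prior on cluster assignments: block k with weight |block k|,
    # a brand-new block with weight A.
    z = []
    K = 0
    for i in range(n):
        counts = [z.count(k) for k in range(K)]
        k = rng.choices(range(K + 1), weights=counts + [A])[0]
        if k == K:
            K += 1
        z.append(k)
    # Beta(a, b) prior on between-cluster edge probabilities.
    phi = [[rng.betavariate(a, b) for _ in range(K)] for _ in range(K)]
    # Bernoulli likelihood for each entry of the adjacency matrix.
    X = [[1 if rng.random() < phi[z[i]][z[j]] else 0 for j in range(n)]
         for i in range(n)]
    return z, X
```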
6. Model
Chinese Restaurant Process (CRP(A))
The Chinese restaurant process is a discrete-time process whose value at
time n is a partition of {1, 2, ..., n}. At time n = 1, we have the trivial
partition {{1}}. At time n + 1, element n + 1 either:
1 joins an existing block b with probability |b|/(n + A), where |b| is
the size of the block, or
2 creates a completely new block with probability A/(n + A).
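A minimal simulation of this process (function and variable names are my own choices):

```python
import random

def sample_crp(n, A, seed=None):
    """Draw a partition of {0, ..., n-1} from a Chinese restaurant
    process with concentration parameter A."""
    rng = random.Random(seed)
    blocks = []  # blocks[k] holds the elements currently in block k
    for i in range(n):
        # Existing block b is chosen with probability |b| / (i + A),
        # a new block with probability A / (i + A).
        weights = [len(b) for b in blocks] + [A]
        k = rng.choices(range(len(blocks) + 1), weights=weights)[0]
        if k == len(blocks):
            blocks.append([i])   # start a new block
        else:
            blocks[k].append(i)  # join an existing block
    return blocks
```

Larger A tends to produce more, smaller blocks, since the new-block weight A competes with the block sizes.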
7. Model
Chinese Restaurant Process (CRP(A))
[Diagram: the first element starts a new block with probability 1.]
8. Model
Chinese Restaurant Process (CRP(A))
[Diagram: element 2 joins the existing block with probability 1/(1+A), or starts a new block with probability A/(1+A).]
9. Model
Chinese Restaurant Process (CRP(A))
[Diagram: element 3 joins either of the two singleton blocks with probability 1/(2+A) each, or starts a new block with probability A/(2+A).]
10. Model
Chinese Restaurant Process (CRP(A))
[Diagram: element 4 joins the blocks with probabilities 1/(3+A) and 2/(3+A), or starts a new block with probability A/(3+A).]
11. Gibbs Sampling Methodology
Gibbs Sampling
Want: a sample from a multivariate distribution of θ = (θ1, θ2, ..., θd).
Algorithm:
1 Initialise with θ^(0) = (θ1^(0), θ2^(0), ..., θd^(0)).
2 For i = 1, 2, ..., n:
Simulate θ1^(i) from the conditional θ1 | (θ2^(i−1), ..., θd^(i−1))
Simulate θ2^(i) from the conditional θ2 | (θ1^(i), θ3^(i−1), ..., θd^(i−1))
...
Simulate θd^(i) from the conditional θd | (θ1^(i), θ2^(i), ..., θd−1^(i)).
3 Discard the first k iterations (burn-in) and estimate the posterior distribution
using θ^(k+1), ..., θ^(n).
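The generic algorithm can be illustrated on a toy target (my own example, not from the talk): a standard bivariate normal with correlation ρ, whose full conditionals are θ1 | θ2 ∼ N(ρθ2, 1 − ρ²) and θ2 | θ1 ∼ N(ρθ1, 1 − ρ²):

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, burn_in, seed=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Each conditional is theta_j | theta_other ~ N(rho * theta_other, 1 - rho^2)."""
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)
    t1, t2 = 0.0, 0.0            # step 1: initialise theta^(0)
    samples = []
    for _ in range(n_iter):      # step 2: sweep through the conditionals
        t1 = rng.gauss(rho * t2, sd)   # theta1^(i) | theta2^(i-1)
        t2 = rng.gauss(rho * t1, sd)   # theta2^(i) | theta1^(i)
        samples.append((t1, t2))
    return samples[burn_in:]     # step 3: discard the first k iterations
```

With enough iterations, the empirical correlation of the retained samples should be close to ρ.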
12. Gibbs Sampling Methodology
Gibbs Sampling
We use Gibbs sampling to infer the posterior distribution of z.
The cluster assignments, zi, are iteratively sampled from their
conditional distribution,
P(zi = k | z−i, X) ∝ P(X | z) P(zi = k | z−i),
where z−i denotes all cluster assignments except zi.
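The prior factor P(zi = k | z−i) follows directly from the CRP: existing blocks are weighted by their size, and a new block by A. A small sketch (the helper name and list representation are my own; it assumes block labels in z−i are contiguous from 0):

```python
def crp_prior_weights(z_minus_i, A):
    """P(z_i = k | z_{-i}) under CRP(A): proportional to the size of
    block k for existing blocks, and to A for a brand-new block.
    Returns one weight per existing block plus one for a new block."""
    K = max(z_minus_i) + 1
    counts = [z_minus_i.count(k) for k in range(K)]
    total = len(z_minus_i) + A
    return [c / total for c in counts] + [A / total]
```

Multiplying these weights by the likelihood term P(X | z) for each candidate k gives the (unnormalised) conditional used in the Gibbs sweep.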
13. Gibbs Sampling Methodology
Simulated Data
We applied the Gibbs sampling algorithm to a simulated network with
the following parameters:
96 nodes split into 6 blocks
φk,k = 0.85, for k = 1, ..., 6
φk,l = 0.05, for k ≠ l
a = b = 1 for uniform prior
A = 1.
[Figure (a): simulated network]
14. Gibbs Sampling Results
Simulated Data
[Figure (b): supplied network]
15. Gibbs Sampling Results
Block structure obtained
16. Gibbs Sampling Results
Trace-plot of the number of blocks
[Trace plot: number of blocks (1–6) against iteration (0–10,000).]
17. Gibbs Sampling Results
Gibbs Sampling Summary
The algorithm fails to split the data into 6 clusters within 10,000
iterations and remains stuck in a five-cluster configuration for a long time.
The main problem with the Gibbs sampler is that it converges slowly
and often becomes trapped in a local mode (5 blocks in this case).
A possible improvement is the split-merge algorithm, which updates
a group of nodes simultaneously and avoids these problems.
18. Split-Merge Algorithm Methodology
Split-Merge Algorithm
Algorithm:
1 Select two distinct nodes, i and j, uniformly at random.
2 If i and j belong to the same cluster, split that cluster into two by
assigning each of its elements to one of the two new clusters
independently with equal probability.
3 If i and j belong to different clusters, merge those clusters.
4 Evaluate the Metropolis–Hastings acceptance probability. If the
proposal is accepted, the new cluster assignment becomes the next state
of the chain; otherwise, the current cluster assignment remains as the
next state.
a(z∗, z) = min[1, q(z|z∗)P(z∗)L(X|z∗) / q(z∗|z)P(z)L(X|z)],
where q is the proposal probability, P(z) the prior, and L(X|z) the likelihood.
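A sketch of the proposal step above (a naive random-split variant in the spirit of steps 1–3; the function name, label handling, and the returned proposal ratio q(z|z∗)/q(z∗|z) are my own choices):

```python
import random

def propose_split_merge(z, rng):
    """One naive split-merge proposal on cluster labels z.
    Returns the proposed assignment and the ratio q(z|z*) / q(z*|z)
    needed in the Metropolis-Hastings acceptance probability."""
    n = len(z)
    i, j = rng.sample(range(n), 2)  # step 1: two distinct nodes
    z_new = list(z)
    if z[i] == z[j]:
        # step 2: split; other members go to either cluster with prob 1/2,
        # j is moved to the new cluster so i and j end up apart.
        new_label = max(z) + 1
        members = [m for m in range(n) if z[m] == z[i] and m not in (i, j)]
        for m in members:
            if rng.random() < 0.5:
                z_new[m] = new_label
        z_new[j] = new_label
        # forward prob (1/2)^|members|; reverse merge is deterministic.
        q_ratio = 2.0 ** len(members)
    else:
        # step 3: merge j's cluster into i's cluster (deterministic).
        z_new = [z[i] if lab == z[j] else lab for lab in z]
        # reverse move is a split with prob (1/2)^(merged size - 2).
        merged = sum(1 for lab in z if lab in (z[i], z[j]))
        q_ratio = 0.5 ** (merged - 2)
    return z_new, q_ratio
```

Step 4 would then accept z_new with probability min[1, q_ratio × P(z∗)L(X|z∗) / (P(z)L(X|z))].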
25. Split-Merge Algorithm Results
Gibbs Sampler + Split-Merge
We applied the Gibbs sampler together with the split-merge algorithm
to the earlier network: for every nine full Gibbs sampling scans, one
split-merge step was used.
The algorithm correctly splits the data into six clusters, has a short
burn-in and mixes well.
26. Split-Merge Algorithm Results
Block structure obtained
27. Split-Merge Algorithm Results
Trace-plot of the number of blocks
[Trace plot: number of blocks (1–7) against iteration (0–1,000).]
28. Future Work
Future Work
Assess the performance of the algorithms when the blocks
significantly vary in size.
Evaluate the computational complexity of the algorithms.
Explore more advanced algorithms (such as the Restricted Gibbs
Sampling Split-Merge).
29. Future Work
References
Schmidt, M. N. and Mørup, M. (2013). Non-parametric Bayesian modeling of
complex networks. IEEE Signal Processing Magazine, 30:110–128.
Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo
procedure for the Dirichlet process mixture model. Journal of Computational
and Graphical Statistics, 13:158–182.