Clustering Relational Data using the Infinite Relational
Model
Ana Daglis
Supervised by: Matthew Ludkin
September 4, 2015
Ana Daglis Clustering Data using the Infinite Relational Model September 4, 2015 1 / 29
Outline
1 Clustering
2 Model
3 Gibbs Sampling
Methodology
Results
4 Split-Merge Algorithm
Methodology
Results
5 Future Work
Clustering
Cluster Analysis: Given unlabelled data, we want algorithms that
automatically group the data points into coherent subsets (clusters).
Applications:
recommendation engines (Netflix, iTunes, Quora,...)
image compression
targeted marketing
Google News
Model
Infinite Relational Model
The Infinite Relational Model (IRM) is a model in which each node is
assigned to a cluster. The number of clusters is not known initially and is
learned from the data as part of the statistical inference.
The IRM is represented by the following parameters:
$z_i$: the cluster containing node $i$, for $i = 1, \ldots, n$.
$\phi_{i,j}$: the probability of an edge between a node in the $i$-th cluster and a node in the $j$-th cluster.
Assumptions
Given the adjacency matrix of the graph, $X$, as our data, we assume
that $X_{i,j} \sim \mathrm{Bernoulli}(\phi_{z_i, z_j})$.
Since $z$ and $\phi$ are not known, priors are imposed on both (a CRP prior and a beta prior, respectively), giving a hierarchical model:
$z \sim \mathrm{CRP}(A)$
$\phi_{i,j} \sim \mathrm{Beta}(a, b)$.
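Because the beta prior is conjugate to the Bernoulli likelihood, the edge probabilities can be integrated out in closed form, leaving a marginal likelihood that depends only on the cluster assignments. The following is a sketch under two assumptions that the slides do not state: the graph is undirected and has no self-loops. The symbols $N^{+}_{kl}$ and $N^{-}_{kl}$ are introduced here for the numbers of edges and non-edges between clusters $k$ and $l$, and $B$ is the beta function.

```latex
P(X \mid z)
  = \prod_{k \le l} \int_0^1
      \phi_{k,l}^{N^{+}_{kl}} (1 - \phi_{k,l})^{N^{-}_{kl}}
      \, \frac{\phi_{k,l}^{a-1} (1 - \phi_{k,l})^{b-1}}{B(a, b)}
      \, d\phi_{k,l}
  = \prod_{k \le l}
      \frac{B\!\left(N^{+}_{kl} + a,\; N^{-}_{kl} + b\right)}{B(a, b)}
```

With $a = b = 1$ each factor reduces to $N^{+}_{kl}!\, N^{-}_{kl}! / (N^{+}_{kl} + N^{-}_{kl} + 1)!$.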
Chinese Restaurant Process (CRP(A))
The Chinese restaurant process is a discrete-time stochastic process whose value at
time n is a partition of {1, 2, ..., n}. At time n = 1, we have the trivial partition
{{1}}. At time n + 1, element n + 1 is either:
1 added to an existing block b with probability |b|/(n + A), where |b| is
the size of the block, or
2 placed in a completely new block with probability A/(n + A).
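The sequential construction above can be simulated directly. The following is a minimal sketch, not the author's code; the function name `sample_crp` and the zero-based labels are assumptions made for illustration.

```python
import random


def sample_crp(n, A, rng=random):
    """Draw a partition of n elements from CRP(A).

    Returns (labels, sizes): labels[m] is the block of element m and
    sizes[k] is the size of block k. Requires n >= 1 and A > 0.
    """
    labels = [0]  # the first element forms the trivial partition
    sizes = [1]
    for m in range(1, n):  # m elements are already placed
        # An existing block b has weight |b|; a new block has weight A.
        # The weights sum to m + A, matching the probabilities above.
        weights = sizes + [A]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new block
        else:
            sizes[k] += 1  # join existing block k
        labels.append(k)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = sample_crp(20, A=1.0, rng=random.Random(0))
    print(labels, sizes)
```

Larger values of A make new blocks more likely, so the expected number of blocks grows with A.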
[Figure sequence: probabilities with which the next element joins each of three blocks]
Element 1: 1, 0, 0
Element 2: 1/(1 + A), A/(1 + A), 0
Element 3: 1/(2 + A), 1/(2 + A), A/(2 + A)
Element 4: 1/(3 + A), 2/(3 + A), A/(3 + A)
Gibbs Sampling Methodology
Gibbs Sampling
Want: a sample from the joint distribution of $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$.
Algorithm:
1 Initialize with $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_d^{(0)})$.
2 For $i = 1, 2, \ldots, n$:
Simulate $\theta_1^{(i)}$ from the conditional $\theta_1 \mid (\theta_2^{(i-1)}, \ldots, \theta_d^{(i-1)})$
Simulate $\theta_2^{(i)}$ from the conditional $\theta_2 \mid (\theta_1^{(i)}, \theta_3^{(i-1)}, \ldots, \theta_d^{(i-1)})$
...
Simulate $\theta_d^{(i)}$ from the conditional $\theta_d \mid (\theta_1^{(i)}, \theta_2^{(i)}, \ldots, \theta_{d-1}^{(i)})$.
3 Discard the first $k$ iterations and estimate the posterior distribution
using $(\theta_1^{(k+1)}, \theta_2^{(k+1)}, \ldots, \theta_d^{(k+1)}), \ldots, (\theta_1^{(n)}, \theta_2^{(n)}, \ldots, \theta_d^{(n)})$.
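As a small illustration of the three steps, the sketch below runs a Gibbs sampler on a toy target that is not from the slides: a standard bivariate normal with correlation `rho`, for which each full conditional is $N(\rho\,\theta_{\text{other}},\, 1 - \rho^2)$. The function name and parameters are assumptions for the example.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, burn_in, rng=random):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    theta1, theta2 = 0.0, 0.0  # step 1: initial state
    samples = []
    for i in range(n_iter):  # step 2: sweep over the coordinates
        theta1 = rng.gauss(rho * theta2, sd)  # theta1 | theta2
        theta2 = rng.gauss(rho * theta1, sd)  # theta2 | updated theta1
        if i >= burn_in:  # step 3: discard the first burn_in iterations
            samples.append((theta1, theta2))
    return samples


if __name__ == "__main__":
    draws = gibbs_bivariate_normal(0.8, 6000, 1000, random.Random(1))
    print(len(draws), "retained draws")
```

The retained draws are correlated with one another, so the effective sample size is smaller than the number of retained iterations.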
We use Gibbs sampling to infer the posterior distribution of $z$.
The cluster assignments, $z_i$, are iteratively sampled from their
conditional distribution,
$P(z_i = k \mid z_{-i}, X) \propto P(X \mid z)\, P(z_i = k \mid z_{-i})$,
where $z_{-i}$ denotes all cluster assignments except $z_i$.
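One way to realise this update is sketched below. It is a deliberately naive illustration rather than the implementation used for the results. It assumes an undirected graph without self-loops, integrates out the edge probabilities under the Beta(a, b) prior to obtain P(X | z), and recomputes the full marginal likelihood for every candidate cluster. The prior term uses the CRP weights: the size of an existing cluster, or A for a new one. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def log_marginal_likelihood(X, z, a=1.0, b=1.0):
    """log P(X | z) with the edge probabilities integrated out."""
    edges, pairs = Counter(), Counter()
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered node pair once
            key = (min(z[i], z[j]), max(z[i], z[j]))
            pairs[key] += 1
            edges[key] += X[i][j]
    return sum(
        log_beta(edges[key] + a, pairs[key] - edges[key] + b) - log_beta(a, b)
        for key in pairs
    )


def gibbs_scan(X, z, A=1.0, a=1.0, b=1.0, rng=random):
    """One full scan: resample every z_i from P(z_i = k | z_{-i}, X).

    Modifies z in place and returns it.
    """
    n = len(z)
    for i in range(n):
        counts = Counter(z[m] for m in range(n) if m != i)
        new_label = max(z) + 1  # a label that no node currently uses
        candidates = list(counts) + [new_label]
        log_w = []
        for k in candidates:
            z[i] = k
            prior_weight = counts[k] if k in counts else A
            log_w.append(
                math.log(prior_weight) + log_marginal_likelihood(X, z, a, b)
            )
        top = max(log_w)  # subtract the maximum for numerical stability
        weights = [math.exp(w - top) for w in log_w]
        z[i] = rng.choices(candidates, weights=weights)[0]
    return z
```

Recomputing the likelihood from scratch costs O(n^2) per candidate; a practical sampler would update the edge and pair counts incrementally instead.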
Simulated Data
We applied the Gibbs sampling algorithm to a simulated network with
the following parameters:
96 nodes split into 6 blocks
$\phi_{i,i} = 0.85$ for every block $i$
$\phi_{i,j} = 0.05$ for $i \neq j$
$a = b = 1$, giving a uniform prior on $\phi_{i,j}$
$A = 1$.
(a) Simulated network
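A network with these parameters could be generated along the following lines. This is a sketch under assumptions the slides do not state: the six blocks are taken to be of equal size (16 nodes each), and the graph is undirected with no self-loops. The function name is hypothetical.

```python
import random


def simulate_block_network(n=96, n_blocks=6, p_within=0.85,
                           p_between=0.05, rng=random):
    """Simulate an adjacency matrix X and block labels z.

    Assumes equal-sized contiguous blocks, an undirected graph,
    and no self-loops.
    """
    z = [i * n_blocks // n for i in range(n)]  # block label of each node
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_within if z[i] == z[j] else p_between
            X[i][j] = X[j][i] = int(rng.random() < p)  # symmetric edge
    return X, z


if __name__ == "__main__":
    X, z = simulate_block_network(rng=random.Random(0))
    print(sum(map(sum, X)) // 2, "edges")
```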
Gibbs Sampling Results
(b) Supplied network
[Figure: Block structure obtained]
[Figure: Trace plot of the number of blocks; x-axis: iteration (0 to 10000), y-axis: number of blocks (1 to 6)]
Gibbs Sampling Summary
The algorithm fails to split the data into 6 clusters within 10000
iterations, and is stuck in a five-cluster configuration for a long time.
The main problem with the Gibbs sampler is that it is slow to
converge and often becomes trapped in a local mode (5 blocks
in this case).
A possible improvement is the split-merge algorithm, which updates
a group of nodes simultaneously and so helps avoid these problems.
Split-Merge Algorithm Methodology
Split-Merge Algorithm
Algorithm:
1 Select two distinct nodes, i and j, uniformly at random.
2 If i and j belong to the same cluster, propose splitting that cluster into two by
assigning its elements to either of the two clusters independently with
equal probability.
3 If i and j belong to different clusters, propose merging those clusters.
4 Evaluate the Metropolis-Hastings acceptance probability. If the proposal is accepted,
the new cluster assignment becomes the next state of the algorithm.
Otherwise, the current cluster assignment remains as the next state.
$a(z^*, z) = \min\left[1, \frac{q(z \mid z^*)\, P(z^*)\, L(X \mid z^*)}{q(z^* \mid z)\, P(z)\, L(X \mid z)}\right]$,
where $q$ is the proposal probability, $P(z)$ the prior, and $L(X \mid z)$ the likelihood.
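The four steps can be sketched as follows. Several details are assumptions, since the slides do not spell them out: in a split, node i keeps the old label and node j seeds the new cluster, so only the remaining members are assigned at random, as in the simple random split of Jain and Neal (2004). The probability of picking the pair (i, j) is the same in both directions and cancels. `log_joint` is a hypothetical caller-supplied function returning log P(z) + log L(X | z); `log_crp_prior` gives the log P(z) part using the standard CRP partition probability.

```python
import math
import random
from collections import Counter


def log_crp_prior(z, A=1.0):
    """log P(z) under CRP(A): A^K * Gamma(A)/Gamma(A+n) * prod (n_k - 1)!"""
    sizes = Counter(z).values()
    n = len(z)
    return (len(sizes) * math.log(A) + math.lgamma(A) - math.lgamma(A + n)
            + sum(math.lgamma(s) for s in sizes))


def split_merge_step(z, log_joint, rng=random):
    """One split-merge Metropolis-Hastings step; returns the next state."""
    n = len(z)
    i, j = rng.sample(range(n), 2)  # two distinct nodes, uniformly at random
    proposal = list(z)
    if z[i] == z[j]:
        # Split: j starts a new cluster, the other members choose a side
        # independently with probability 1/2 (i stays put by assumption).
        new_label = max(z) + 1
        others = [m for m in range(n) if z[m] == z[i] and m not in (i, j)]
        proposal[j] = new_label
        for m in others:
            if rng.random() < 0.5:
                proposal[m] = new_label
        # q(z*|z) = (1/2)^len(others); the reverse merge has q(z|z*) = 1.
        log_q_ratio = len(others) * math.log(2.0)
    else:
        # Merge: move every node in j's cluster into i's cluster.
        others = [m for m in range(n)
                  if z[m] in (z[i], z[j]) and m not in (i, j)]
        for m in range(n):
            if z[m] == z[j]:
                proposal[m] = z[i]
        # q(z*|z) = 1; the reverse split has q(z|z*) = (1/2)^len(others).
        log_q_ratio = -len(others) * math.log(2.0)
    log_accept = log_q_ratio + log_joint(proposal) - log_joint(z)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal  # accepted: the proposal becomes the next state
    return list(z)  # rejected: the current assignment is kept
```

Working on the log scale avoids underflow, because the likelihood ratio between a split and a merged configuration can be extremely small or large.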
Split-Merge Algorithm Results
Gibbs Sampler + Split-Merge
We applied the Gibbs sampler together with the split-merge algorithm
to the earlier network. For every nine full Gibbs sampling scans, one
split-merge step was used.
The algorithm correctly splits the data into six clusters, has a short
burn-in time, and mixes well.
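The nine-to-one schedule can be expressed as a small driver loop. This is a sketch only: `gibbs_scan` and `split_merge_step` are hypothetical callables that each take and return a cluster assignment. How iterations are counted in the trace plot is not stated in the slides, so one cycle here means nine Gibbs scans followed by one split-merge step.

```python
def run_hybrid_sampler(z, gibbs_scan, split_merge_step, n_cycles,
                       gibbs_per_split_merge=9):
    """Alternate full Gibbs scans with split-merge steps.

    Returns the final assignment and a trace of the number of blocks
    recorded at the end of each cycle.
    """
    trace = []
    for _ in range(n_cycles):
        for _ in range(gibbs_per_split_merge):
            z = gibbs_scan(z)  # local, one-node-at-a-time updates
        z = split_merge_step(z)  # global move on a whole cluster
        trace.append(len(set(z)))  # number of blocks after this cycle
    return z, trace


if __name__ == "__main__":
    identity = lambda z: z  # stand-in updates that leave z unchanged
    z, trace = run_hybrid_sampler([0, 0, 1], identity, identity, n_cycles=3)
    print(z, trace)
```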
[Figure: Block structure obtained]
[Figure: Trace plot of the number of blocks; x-axis: iteration (0 to 1000), y-axis: number of blocks (1 to 7)]
Future Work
Assess the performance of the algorithms when the blocks
vary significantly in size.
Evaluate the computational complexity of the algorithms.
Explore more advanced algorithms (such as the Restricted Gibbs
Sampling Split-Merge).
References
Schmidt, M. N. and Mørup, M. (2013). Non-parametric Bayesian modeling of complex networks. IEEE Signal Processing Magazine, 30:110-128.
Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158-182.
