Presented at the 9th International Conference on Modeling Decisions for Artificial Intelligence (MDAI'12), Girona, Spain.
Abstract. In this paper we benchmark two distinct algorithms for extracting community structure from social networks represented as graphs, considering how we can representatively sample an OSN graph while maintaining its community structure. We also evaluate each extraction algorithm's optimum value (modularity) for the number of communities using five well-known benchmarking datasets, two of which represent real online OSN data. We also consider the assignment of the filtering and sampling criteria for each dataset. We find that the extraction algorithms work well for finding the major communities in both the original and the sampled datasets. The quality of the results is measured using an NMI (Normalized Mutual Information) type metric to identify the degree of correspondence between the communities generated from the original data and those generated from the sampled data. We find that a representative sampling is possible which preserves the key community structures of an OSN graph, significantly reducing computational cost and also making the resulting graph structure easier to visualize. Finally, comparing the communities generated by each algorithm, we identify their degree of correspondence.
2. Introduction
We present a benchmarking of two distinct algorithms for
extracting community structure from On-line Social Networks
(OSNs), considering how we can representatively sample an OSN
graph while maintaining its community structure.
We do this by extracting the community structure from the
original and sampled versions of five well-known benchmarking
datasets and comparing the results.
• We assume no a priori knowledge about the expected result.
• A supervised sampling is performed.
3. Extraction of the community structure
Algorithm 1: Newman’s algorithm
• Extracts communities by successively dividing the graph into components, using Freeman’s betweenness centrality measure, until modularity Q is maximized.
• Modularity (Q): the measure used to quantify the quality of the community partitions ‘on the fly’. Usual range: [0.3 - 0.7].
• We have implemented it in Python (using the NetworkX library).
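A minimal sketch of this divisive approach using NetworkX’s Girvan–Newman generator (which removes edges of highest betweenness) on Zachary’s karate club graph; instead of stopping early, the sketch scans all divisions and keeps the partition that maximizes Q. This is an illustration, not the authors’ actual implementation:

```python
import networkx as nx
from networkx.algorithms import community

# Zachary's karate club: a small classic benchmark graph
G = nx.karate_club_graph()

# girvan_newman yields successively finer partitions (tuples of node sets);
# we keep the one with the highest modularity Q
best_q, best_partition = -1.0, None
for partition in community.girvan_newman(G):
    q = community.modularity(G, partition)
    if q > best_q:
        best_q, best_partition = q, partition

print(f"best Q = {best_q:.3f} with {len(best_partition)} communities")
```

The resulting Q typically falls in the [0.3 - 0.7] range mentioned above.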
4. Extraction of the community structure
Algorithm 2: Blondel’s method
1. The method looks for smaller communities by optimizing modularity locally.
2. It then aggregates the nodes of each community and builds a new network whose nodes are the communities.
Steps 1 and 2 are repeated until modularity Q is maximized.
• We used the default implementation from the Gephi graph processing software.
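Steps 1 and 2 above describe the Louvain method. A minimal sketch using NetworkX’s implementation rather than Gephi’s (which the slides used), on the same karate club graph; the `seed` parameter fixes the randomized node ordering so the run is reproducible:

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Louvain: local modularity optimization + community aggregation,
# repeated until Q no longer improves
parts = community.louvain_communities(G, seed=42)
q = community.modularity(G, parts)
print(f"Q = {q:.3f} with {len(parts)} communities")
```

Because the method is stochastic, different seeds may give slightly different partitions, as the slides note later.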
5. Filtering / Sampling process
2-step process: in order to obtain a subset of a complete graph we apply a process consisting of filtering and sampling.
• First step: filtering (seed node selection).
Consists of filtering the graph nodes based on their degree or their clustering coefficient. Filtering thresholds are user defined.
Goal: identify hub nodes and dense regions of the graph.
• Second step: sampling.
We apply 1-hop sampling to obtain all the neighbours connected to each seed node.
Goal: maintain the core community structure.
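The two-step process above might be sketched as follows; the helper function and thresholds are hypothetical illustrations, not the authors’ code:

```python
import networkx as nx

def filter_and_sample(G, min_degree=None, min_clustering=None):
    """Two-step subset extraction (sketch of the slides' process):
    1) filter seed nodes by degree and/or clustering coefficient,
    2) take the 1-hop neighbourhood of every seed.
    Thresholds are user defined; None disables that criterion."""
    seeds = set(G.nodes())
    if min_degree is not None:
        seeds = {n for n in seeds if G.degree(n) >= min_degree}
    if min_clustering is not None:
        cc = nx.clustering(G)  # clustering coefficient per node
        seeds = {n for n in seeds if cc[n] >= min_clustering}
    # 1-hop sampling: keep the seeds plus all of their neighbours
    keep = set(seeds)
    for s in seeds:
        keep.update(G.neighbors(s))
    return G.subgraph(keep).copy()

G = nx.karate_club_graph()
S = filter_and_sample(G, min_degree=10)  # illustrative threshold
print(S.number_of_nodes(), S.number_of_edges())
```

Degree-based filtering targets hubs; clustering-coefficient filtering targets dense regions, matching the two criteria used for the datasets below.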
9. Sampling statistics
• Indicator: Clustering coefficient shows a common pattern, increasing
in sampled datasets. This serves as an indicator that the core is
preferentially included in the samples.
                  GrQc             Enron              Facebook
Filter            Degree >= 30     Clust.Coef = 1     Clust.Coef >= 0.5
#Nodes            939 / 5242       2218 / 10630       3410 / 31720
#Edges            5715 / 14446     14912 / 164387     6561 / 80592
Avg. degree       12.17 / 5.53     12.315 / 31.013    3.848 / 5.081
Clust. coef.      0.698 / 0.529    0.761 / 0.383      0.632 / 0.079
Avg. path length  4.51 / 6.049     3.143 / 3.160      8.388 / 6.432
Diameter          10 / 17          7 / 20             27 / 9
(values shown as sampled / original)
10. Empirical Tests and Results
1. First, we evaluate Newman’s algorithm with the sampled datasets.
              Stop iteration   Q       Communities   Original (O) / Sampled (S)
Karate        4                0.494   5             O
Dolphins      5                0.591   6             O
GrQc          56               0.777   57            S
Enron         865              0.421   869           S
Enron Early*  51               0.325   56            S
Facebook      40               0.870   190           S
11. Empirical Tests and Results
2. Blondel’s method allows us to extract the communities from the original datasets, given its greater execution speed in comparison with Newman’s method.
            Original          Sampled
            Q       C         Q       C
GrQc        0.856   390       0.789   11
Enron       0.491   43        0.560   68
Facebook    0.681   1105      0.519   33
• How do we compare the community assignments of the nodes?
NMI: Normalized Mutual Information
12. Normalized Mutual Information
After labeling the communities, we match the nodes inside every corresponding community in the sampled and original datasets.
Purity: 100% means that all nodes in the same communities are matched in both datasets.
We compare the Top N communities (N = 10).
• Handicap: Newman’s and Blondel’s methods are stochastic and non-deterministic, so they give slightly different results in each execution.
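For illustration, a minimal standard-library sketch of NMI between two community labelings of the same nodes, using the arithmetic-mean normalization (the paper’s exact NMI variant may differ); the label vectors are hypothetical:

```python
from collections import Counter
from math import log

def nmi(a, b):
    """Normalized mutual information between two labelings of the same
    nodes: NMI = 2*I(A;B) / (H(A) + H(B))."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    # mutual information from the joint and marginal label counts
    mi = sum(c / n * log(n * c / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    # entropies of each labeling
    ha = -sum(c / n * log(c / n) for c in pa.values())
    hb = -sum(c / n * log(c / n) for c in pb.values())
    return 2 * mi / (ha + hb) if ha + hb else 1.0

# hypothetical labels for the same 8 nodes under two partitions
labels_orig = [0, 0, 0, 1, 1, 1, 2, 2]
labels_samp = [0, 0, 1, 1, 1, 1, 2, 2]
print(round(nmi(labels_orig, labels_orig), 3))  # identical labelings -> 1.0
print(round(nmi(labels_orig, labels_samp), 3))
```

A value of 1 means identical partitions; values closer to 0 mean little agreement between the two labelings.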
13. Normalized Mutual Information
            NMI orig. vs.   NMI sampled vs.   NMI orig. vs.   Net loss
            sampled (A)     sampled (B)       orig. (C)       (C - A)
GrQc        0.66559         0.82544           0.77301         0.10742
Enron       0.69069         0.86903           0.82012         0.12943
Facebook    0.58996         0.73249           0.69215         0.10219
14. Newman’s Vs. Blondel’s
• In terms of modularity (Q) and number of communities (C)
            Blondel’s Original   Blondel’s Sampled   Newman’s Sampled
            Q       C            Q       C           Q       C
GrQc        0.856   390          0.789   11          0.777   57
Enron       0.491   43           0.560   68          0.325   56
Facebook    0.681   1105         0.519   33          0.870   190
• The best modularity values are dataset dependent.
15. Newman’s Vs. Blondel’s
• The methods may give distinct results in terms of the number of
communities and modularity values.
16. Newman’s (NG) Vs. Blondel’s (BN)
• In terms of NMI: Normalized Mutual Information.
Comparing Top N communities
             NMI BN vs. NG   NMI NG vs. BN   NMI orig. vs.   Net loss
             (A)             (B)             orig. (C)       (C - Avg(A,B))
GrQc         0.69116         0.87243         0.77301         -0.00878
Enron        0.31313         0.68796         0.82012         0.31958
Enron Early  0.83437         0.44320         0.82012         0.18133
Facebook     0.62056         0.54551         0.69215         0.10911
• Results show significant differences in the assignment of nodes to communities between the two methods.
17. Conclusions
We have benchmarked 5 statistically and topologically distinct datasets
• applying 2 community structure algorithms
• sampling the original datasets
Results indicate that it is possible to identify the principal communities of large complex datasets using sampling.
Sampling maintains the key facets of the community structure of a dataset (the NMI statistic shows that a high correspondence is maintained).
It significantly reduces the dataset size (by 80-90%).
However, a difference is found in the assignment of nodes to communities between different executions and methods, due to their stochastic nature.