Presented at the 9th International Conference on Modeling Decisions for Artificial Intelligence (MDAI'12), Girona, Spain.
Abstract. In this paper we benchmark two distinct algorithms for extracting community structure from social networks represented as graphs, considering how we can representatively sample an OSN graph while maintaining its community structure. We also evaluate each extraction algorithm's optimum value (modularity) for the number of communities using five well-known benchmarking datasets, two of which represent real online OSN data. We also consider the assignment of the filtering and sampling criteria for each dataset. We find that the extraction algorithms work well for finding the major communities in both the original and the sampled datasets. The quality of the results is measured using an NMI (Normalized Mutual Information) type metric to identify the degree of correspondence between the communities generated from the original data and those generated from the sampled data. We find that a representative sampling is possible which preserves the key community structures of an OSN graph, significantly reducing computational cost and also making the resulting graph structure easier to visualize. Finally, comparing the communities generated by each algorithm, we identify their degree of correspondence.
2. Introduction
We present a benchmarking of two distinct algorithms for
extracting community structure from On-line Social Networks
(OSNs), considering how we can representatively sample an OSN
graph while maintaining its community structure.
We do this by extracting the community structure from the
original and sampled versions of five well-known benchmarking
datasets and comparing the results.
• We assume no a priori knowledge about the expected result.
• A supervised sampling is performed.
3. Extraction of the community structure
Algorithm 1: Newman’s algorithm
• Extracts communities by successively dividing the graph into components, using Freeman’s betweenness centrality measure, until modularity Q is maximized.
• Modularity (Q): the measure used to quantify the quality of the community partitions ‘on the fly’. Usual range: [0.3 - 0.7].
• We have implemented it in Python (using the NetworkX library).
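A minimal sketch of this divisive approach using NetworkX’s Girvan–Newman generator (which removes edges of highest betweenness) on Zachary’s karate club graph; instead of stopping early, the sketch scans all divisions and keeps the partition that maximizes Q. This is an illustration, not the authors’ actual implementation:

```python
import networkx as nx
from networkx.algorithms import community

# Zachary's karate club: a small classic benchmark graph
G = nx.karate_club_graph()

# girvan_newman yields successively finer partitions (tuples of node sets);
# we keep the one with the highest modularity Q
best_q, best_partition = -1.0, None
for partition in community.girvan_newman(G):
    q = community.modularity(G, partition)
    if q > best_q:
        best_q, best_partition = q, partition

print(f"best Q = {best_q:.3f} with {len(best_partition)} communities")
```

The resulting Q typically falls in the [0.3 - 0.7] range mentioned above.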
4. Extraction of the community structure
Algorithm 2: Blondel’s method
1. The method looks for smaller communities by optimizing modularity locally.
2. It then aggregates the nodes of each community and builds a new network whose nodes are the communities.
Steps 1 and 2 are repeated until modularity Q is maximized.
• We used the default implementation from the Gephi graph processing software.
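Steps 1 and 2 above describe the Louvain method. A minimal sketch using NetworkX’s implementation rather than Gephi’s (which the slides used), on the same karate club graph; the `seed` parameter fixes the randomized node ordering so the run is reproducible:

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Louvain: local modularity optimization + community aggregation,
# repeated until Q no longer improves
parts = community.louvain_communities(G, seed=42)
q = community.modularity(G, parts)
print(f"Q = {q:.3f} with {len(parts)} communities")
```

Because the method is stochastic, different seeds may give slightly different partitions, as the slides note later.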
5. Filtering / Sampling process
2-step process: in order to obtain a subset of a complete graph we apply a process consisting of filtering and sampling.
• First step: filtering (seed node selection).
Consists of filtering the graph nodes based on their degree or their clustering coefficient. Filtering thresholds are user defined.
Goal: identify hub nodes and dense regions of the graph.
• Second step: sampling.
We apply 1-hop sampling to obtain all the neighbours connected to each seed node.
Goal: maintain the core community structure.
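The two-step process above might be sketched as follows; the helper function and thresholds are hypothetical illustrations, not the authors’ code:

```python
import networkx as nx

def filter_and_sample(G, min_degree=None, min_clustering=None):
    """Two-step subset extraction (sketch of the slides' process):
    1) filter seed nodes by degree and/or clustering coefficient,
    2) take the 1-hop neighbourhood of every seed.
    Thresholds are user defined; None disables that criterion."""
    seeds = set(G.nodes())
    if min_degree is not None:
        seeds = {n for n in seeds if G.degree(n) >= min_degree}
    if min_clustering is not None:
        cc = nx.clustering(G)  # clustering coefficient per node
        seeds = {n for n in seeds if cc[n] >= min_clustering}
    # 1-hop sampling: keep the seeds plus all of their neighbours
    keep = set(seeds)
    for s in seeds:
        keep.update(G.neighbors(s))
    return G.subgraph(keep).copy()

G = nx.karate_club_graph()
S = filter_and_sample(G, min_degree=10)  # illustrative threshold
print(S.number_of_nodes(), S.number_of_edges())
```

Degree-based filtering targets hubs; clustering-coefficient filtering targets dense regions, matching the two criteria used for the datasets below.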
9. Sampling statistics
• Indicator: Clustering coefficient shows a common pattern, increasing
in sampled datasets. This serves as an indicator that the core is
preferentially included in the samples.
                  GrQc             Enron              Facebook
Filter            Degree >= 30     Clust.Coef = 1     Clust.Coef >= 0.5
#Nodes            939 / 5242       2218 / 10630       3410 / 31720
#Edges            5715 / 14446     14912 / 164387     6561 / 80592
Avg. degree       12.17 / 5.53     12.315 / 31.013    3.848 / 5.081
Clust. coef.      0.698 / 0.529    0.761 / 0.383      0.632 / 0.079
Avg. path length  4.51 / 6.049     3.143 / 3.160      8.388 / 6.432
Diameter          10 / 17          7 / 20             27 / 9
(values shown as sampled / original)
10. Empirical Tests and Results
1. First, we evaluate Newman’s algorithm with the sampled datasets.
              Stop iteration   Q       Communities   Original (O) / Sampled (S)
Karate        4                0.494   5             O
Dolphins      5                0.591   6             O
GrQc          56               0.777   57            S
Enron         865              0.421   869           S
Enron Early*  51               0.325   56            S
Facebook      40               0.870   190           S
11. Empirical Tests and Results
2. Blondel’s method allows us to extract the communities from the original datasets, given its greater execution speed in comparison with Newman’s method.
            Original          Sampled
            Q       C         Q       C
GrQc        0.856   390       0.789   11
Enron       0.491   43        0.560   68
Facebook    0.681   1105      0.519   33
• How do we compare the community assignments of the nodes?
NMI: Normalized Mutual Information
12. Normalized Mutual Information
After labeling the communities, we match the nodes inside every corresponding community in the sampled and original datasets.
Purity: 100% means that all nodes in the same communities are matched in both datasets.
We compare the Top N communities (N = 10).
• Handicap: Newman’s and Blondel’s methods are stochastic and non-deterministic, so they give slightly different results in each execution.
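For illustration, a minimal standard-library sketch of NMI between two community labelings of the same nodes, using the arithmetic-mean normalization (the paper’s exact NMI variant may differ); the label vectors are hypothetical:

```python
from collections import Counter
from math import log

def nmi(a, b):
    """Normalized mutual information between two labelings of the same
    nodes: NMI = 2*I(A;B) / (H(A) + H(B))."""
    n = len(a)
    pa, pb = Counter(a), Counter(b)
    pab = Counter(zip(a, b))
    # mutual information from the joint and marginal label counts
    mi = sum(c / n * log(n * c / (pa[x] * pb[y]))
             for (x, y), c in pab.items())
    # entropies of each labeling
    ha = -sum(c / n * log(c / n) for c in pa.values())
    hb = -sum(c / n * log(c / n) for c in pb.values())
    return 2 * mi / (ha + hb) if ha + hb else 1.0

# hypothetical labels for the same 8 nodes under two partitions
labels_orig = [0, 0, 0, 1, 1, 1, 2, 2]
labels_samp = [0, 0, 1, 1, 1, 1, 2, 2]
print(round(nmi(labels_orig, labels_orig), 3))  # identical labelings -> 1.0
print(round(nmi(labels_orig, labels_samp), 3))
```

A value of 1 means identical partitions; values closer to 0 mean little agreement between the two labelings.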
13. Normalized Mutual Information
            NMI orig. vs.   NMI sampled vs.   NMI orig. vs.   Net loss
            sampled (A)     sampled (B)       orig. (C)       (C - A)
GrQc        0.66559         0.82544           0.77301         0.10742
Enron       0.69069         0.86903           0.82012         0.12943
Facebook    0.58996         0.73249           0.69215         0.10219
14. Newman’s Vs. Blondel’s
• In terms of modularity (Q) and number of communities (C)
            Blondel’s Original   Blondel’s Sampled   Newman’s Sampled
            Q       C            Q       C           Q       C
GrQc        0.856   390          0.789   11          0.777   57
Enron       0.491   43           0.560   68          0.325   56
Facebook    0.681   1105         0.519   33          0.870   190
• The best modularity values are dataset dependent.
15. Newman’s Vs. Blondel’s
• The methods may give distinct results in terms of the number of
communities and modularity values.
16. Newman’s (NG) Vs. Blondel’s (BN)
• In terms of NMI: Normalized Mutual Information.
Comparing Top N communities
             NMI BN vs. NG   NMI NG vs. BN   NMI orig. vs.   Net loss
             (A)             (B)             orig. (C)       (C - Avg(A,B))
GrQc         0.69116         0.87243         0.77301         -0.00878
Enron        0.31313         0.68796         0.82012         0.31958
Enron Early  0.83437         0.44320         0.82012         0.18133
Facebook     0.62056         0.54551         0.69215         0.10911
• Results show significant differences in the assignment of nodes to communities between the two methods.
17. Conclusions
We have benchmarked 5 statistically and topologically distinct datasets
• applying 2 community structure algorithms
• sampling the original datasets
Results indicate that it is possible to identify the principal communities of large complex datasets using sampling.
Sampling maintains the key facets of the community structure of a dataset (the NMI statistic shows that a high correspondence is maintained).
It significantly reduces the dataset size (by 80-90%).
However, a difference is found in the assignment of nodes to communities between different executions and methods, due to their stochastic nature.