SlideShare uma empresa Scribd logo
1 de 18
Modeling Decisions for Artificial Intelligence (MDAI)
               2012, Girona, Spain
Introduction

 We present a benchmarking of two distinct algorithms for
extracting community structure from On-line Social Networks
(OSNs), considering how we can representatively sample an OSN
graph while maintaining its community structure.



 We do this by extracting the community structure from the
original and sampled versions of five well-known benchmarking
datasets and comparing the results.
   • We assume there is NO a priori knowledge about the
   expected result.
   • A supervised sampling is performed.
Extraction of the community structure


Algorithm 1: Newman’s algorithm


   • Extracts the communities by successively dividing the graph
   into components, using Freeman’s betweenness centrality
   measure until modularity Q is maximized.


   • Modularity (Q): Is the measure used to quantify the quality of
   the community partitions ‘on the fly’. Usual range: [0.3 - 0.7].


   • We have implemented it in Python (using NetworkX library).
Extraction of the community structure


Algorithm 2: Blondel’s method


   1. The method looks for smaller communities by optimizing
   modularity locally.
   2. Then it aggregates nodes of the same community and builds
   a new network whose nodes are communities.
    Steps 1 and 2 are repeated until modularity Q is maximized.


   • The default version was used from the Gephi graph
   processing software.
Filtering / Sampling process

2-step process: In order to obtain a subset of a complete graph we
apply a process consisting of filtering and sampling.

    • First step: Filtering (seed node selection).
         Consists of filtering the graph nodes based on their degree
         or their clustering coefficient. Filtering thresholds are user
         defined.

         Goal: Identify hub nodes and dense regions of the
        graph.

    • Second step: Sampling.
        We apply a sampling at 1 hop to obtain all the neighbours
        connected to each seed node.

         Goal: Maintain core community structure.
Datasets used

                         Graph Statistics

               Karate   Dolphins    GrQc    Enron    Facebook


#Nodes          34        62        5242    10630     31720


#Edges          78        159       14496   164837    80592


Avg. degree     4.59      5.13      5.530   31.013    5.081


Clust. coef.    0.57      0.26      0.529   0.383     0.079



Avg. path
               2.408     3.356      6.049   3.160     6.432
length



Diameter         5         8         17       20        9
Sample criteria

• Sampling requires user supervision

• Sampling % desired: [ 10%-20% ]

• Only large graphs datasets (3) are sampled


                                           Resulting sample
                    Filter         Value                        Sample
                                                size

   ArXiv-GrQc       Degree          ≥30        17.91%         All neighbors


   Enron        Clustering coef.    =1         20.83%         All neighbors


   Facebook     Clustering coef.   ≥0.5        10.75%         All neighbors
Sampling statistics

• Graph statistics for sampled Vs original datasets (sss / ooo)


                           GrQc             Enron          Facebook
                       Degree >= 30     Clust.Coef = 1   Clust.Coef >= 0.5


   #Nodes               939 / 5242      2218 / 10630     3410 / 31720



   #Edges              5715 / 14446    14912 / 164387    6561 / 80592



   Avg. degree         12.17 / 5,53    12.315 / 31,013   3.848 / 5,081



   Clust. coef.        0.698 / 0,529    0.761 / 0,383    0.632 / 0,079



   Avg. path length    4.51 / 6,049     3.143 / 3,160    8.388 / 6,432



   Diameter               10 / 17           7 / 20            27 / 9
Sampling statistics

• Indicator: Clustering coefficient shows a common pattern, increasing
   in sampled datasets. This serves as an indicator that the core is
   preferentially included in the samples.
                           GrQc             Enron          Facebook
                       Degree >= 30     Clust.Coef = 1   Clust.Coef >= 0.5


    #Nodes              939 / 5242      2218 / 10630     3410 / 31720



    #Edges             5715 / 14446    14912 / 164387    6561 / 80592



    Avg. degree        12.17 / 5,53    12.315 / 31,013   3.848 / 5,081



    Clust. coef.       0.698 / 0,529   0.761 / 0,383     0.632 / 0,079



    Avg. path length   4.51 / 6,049     3.143 / 3,160    8.388 / 6,432



    Diameter              10 / 17           7 / 20            27 / 9
Empirical Tests and Results

1. First, we evaluate Newman’s algorithm with the sampled datasets.

                        Stop                           Original or
                     Iteration    Q      Communities   Sampled

     Karate              4       0.494        5            O
     Dolphins            5       0.591        6            O
     GrQc                56      0.777       57            S
     Enron              865      0.421       869           S
     Enron Early*        51      0.325       56            S
     Facebook            40      0.870       190           S
Empirical Tests and Results

2. Blondel’s method allows us to extract the communities from the
original dataset, given it’s greater execution speed in comparison with
Newman’s method.

                              Original              Sampled

                          Q              C      Q             C
           GrQc          0.856           390   0.789      11
           Enron         0.491           43    0.560      68
           Facebook      0.681       1105      0.519      33




• How to compare nodes community matching?
     NMI : Normalized Mutual Information
Normalized Mutual Information

 After labeling the communities, we match the nodes inside every
corresponding community in the sampled and original datasets.

 Purity: 100% means that all nodes in same communities are matched in
both datasets.

 We compare the Top N communities ( N =10 )

o Handicap
    Newman’s and Blondel’s methods are stochastic and non-deterministic
     Give slightly different results in each execution.
Normalized Mutual Information

 After labeling the communities, we match the nodes inside every
corresponding community in the sampled and original datasets.

 Purity: 100% means that all nodes in same communities are matched in
both datasets.

 We compare the Top N communities ( N =10 )

o Handicap
    Newman’s and Blondel’s methods are stochastic and non-deterministic
     Give slightly different results in each execution.
                                 NMI sampled
               NMI orig. Vs.                     NMI orig Vs. orig.   Net loss
                               Vs. sampled (B)
               sampled (A)                            (C)              (C- A)

   GrQc           0.66559          0.82544            0.77301         0.10742


   Enron          0.69069          0.86903            0.82012         0.12943


   Facebook       0.58996          0.73249            0.69215         0.10219
Newman’s Vs. Blondel’s

• In terms of modularity (Q) and number of communities (C)



                    Blondel’s          Blondel’s        Newman’s
                    Original           Sampled          Sampled

                   Q            C    Q         C       Q       C
     GrQc        0.856      390     0.789      11     0.777    57
     Enron       0.491      43      0.560      68     0.325    56
     Facebook    0.681     1105     0.519      33     0.870   190




• The best modularity values are dataset dependent.
Newman’s Vs. Blondel’s

• In terms of modularity (Q) and number of communities (C)



                     Blondel’s          Blondel’s      Newman’s
                     Original           Sampled        Sampled

                   Q             C    Q         C     Q       C
     GrQc         0.856      390     0.789      11   0.777    57
     Enron        0.491      43      0.560      68   0.325    56
     Facebook     0.681     1105     0.519      33   0.870   190




• The methods may give distinct results in terms of the number of
communities and modularity values.
Newman’s (NG) Vs. Blondel’s (BN)

• In terms of NMI: Normalized Mutual Information.
         Comparing Top N communities

              NMI BN Vs. NG   NMI NG Vs. BN   NMI orig Vs. orig.      Net loss
                  (A)             (B)              (C)             (C- Avg (A,B))

GrQc             0.69116         0.87243            0.77301          -0.00878


Enron            0.31313         0.68796            0.82012            0.31958


Enron Early      0.83437         0.44320            0.82012            0.18133


Facebook         0.62056         0.54551            0.69215            0.10911




• Results show significant differences between the assignment of the
nodes between methods.
Conclusions

 We’ve benchmarked 5 statistically and topologically distinct datasets

    • Applying 2 community structure algorithms
    • Sampling original datasets


 Results indicate that is possible to identify the principal communities
for large complex datasets using sampling.

     It maintains the key facets of the community structure of a
    dataset (NMI statistic shows high correspondence is maintained)
     Significantly reduces of the dataset size (80-90%)


 However, a difference is found in the assignment of nodes to
communities between different executions and methods, due to their
stochastic nature.
Thank you for
your attention :-)

Mais conteúdo relacionado

Semelhante a Analysis of on line social networks (OSNs) represented as graphs- extraction of an approximation of community structure using sampling

Original SOINN
Original SOINNOriginal SOINN
Original SOINNSOINN Inc.
 
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...ABINASHPADHY6
 
Improving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsImproving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsChester Chen
 
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMLAI2
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksmourya chandra
 
Network sampling, community detection
Network sampling, community detectionNetwork sampling, community detection
Network sampling, community detectionroberval mariano
 
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...Alexander Gorban
 
CDAC 2018 Pellegrini clustering ppi networks
CDAC 2018 Pellegrini clustering ppi networksCDAC 2018 Pellegrini clustering ppi networks
CDAC 2018 Pellegrini clustering ppi networksMarco Antoniotti
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering odsc
 
Design of an Intelligent System for Improving Classification of Cancer Diseases
Design of an Intelligent System for Improving Classification of Cancer DiseasesDesign of an Intelligent System for Improving Classification of Cancer Diseases
Design of an Intelligent System for Improving Classification of Cancer DiseasesMohamed Loey
 
4.1 network analysis basic
4.1 network analysis basic4.1 network analysis basic
4.1 network analysis basicjilung hsieh
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...
A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...
A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...Rui Maranhao Abreu
 

Semelhante a Analysis of on line social networks (OSNs) represented as graphs- extraction of an approximation of community structure using sampling (20)

Hierarchical Clustering
Hierarchical ClusteringHierarchical Clustering
Hierarchical Clustering
 
Microarray Analysis
Microarray AnalysisMicroarray Analysis
Microarray Analysis
 
Original SOINN
Original SOINNOriginal SOINN
Original SOINN
 
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
 
Improving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsImproving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN Applications
 
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and ArchitecturesMetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures
 
Fuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networksFuzzy c means clustering protocol for wireless sensor networks
Fuzzy c means clustering protocol for wireless sensor networks
 
Fa19_P1.pptx
Fa19_P1.pptxFa19_P1.pptx
Fa19_P1.pptx
 
Network sampling, community detection
Network sampling, community detectionNetwork sampling, community detection
Network sampling, community detection
 
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...
Do Fractional Norms and Quasinorms Help to Overcome the Curse of Dimensiona...
 
CDAC 2018 Pellegrini clustering ppi networks
CDAC 2018 Pellegrini clustering ppi networksCDAC 2018 Pellegrini clustering ppi networks
CDAC 2018 Pellegrini clustering ppi networks
 
Feature Engineering
Feature Engineering Feature Engineering
Feature Engineering
 
Phylogenetics1
Phylogenetics1Phylogenetics1
Phylogenetics1
 
Ieek fall Conference 2013
Ieek fall Conference 2013Ieek fall Conference 2013
Ieek fall Conference 2013
 
Design of an Intelligent System for Improving Classification of Cancer Diseases
Design of an Intelligent System for Improving Classification of Cancer DiseasesDesign of an Intelligent System for Improving Classification of Cancer Diseases
Design of an Intelligent System for Improving Classification of Cancer Diseases
 
4.1 network analysis basic
4.1 network analysis basic4.1 network analysis basic
4.1 network analysis basic
 
L6 Speed Studies
L6 Speed StudiesL6 Speed Studies
L6 Speed Studies
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...
A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...
A Test-suite Diagnosability Metric for Spectrum-based Fault Localization Appr...
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
 

Último

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Último (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Analysis of on line social networks (OSNs) represented as graphs- extraction of an approximation of community structure using sampling

  • 1. Modeling Decisions for Artificial Intelligence (MDAI) 2012, Girona, Spain
  • 2. Introduction  We present a benchmarking of two distinct algorithms for extracting community structure from On-line Social Networks (OSNs), considering how we can representatively sample an OSN graph while maintaining its community structure.  We do this by extracting the community structure from the original and sampled versions of five well-known benchmarking datasets and comparing the results. • We assume there is NO a priori knowledge about the expected result. • A supervised sampling is performed.
  • 3. Extraction of the community structure Algorithm 1: Newman’s algorithm • Extracts the communities by successively dividing the graph into components, using Freeman’s betweenness centrality measure until modularity Q is maximized. • Modularity (Q): Is the measure used to quantify the quality of the community partitions ‘on the fly’. Usual range: [0.3 - 0.7]. • We have implemented it in Python (using NetworkX library).
  • 4. Extraction of the community structure Algorithm 2: Blondel’s method 1. The method looks for smaller communities by optimizing modularity locally. 2. Then it aggregates nodes of the same community and builds a new network whose nodes are communities. Steps 1 and 2 are repeated until modularity Q is maximized. • The default version was used from the Gephi graph processing software.
  • 5. Filtering / Sampling process 2-step process: In order to obtain a subset of a complete graph we apply a process consisting of filtering and sampling. • First step: Filtering (seed node selection). Consists of filtering the graph nodes based on their degree or their clustering coefficient. Filtering thresholds are user defined.  Goal: Identify hub nodes and dense regions of the graph. • Second step: Sampling. We apply a sampling at 1 hop to obtain all the neighbours connected to each seed node.  Goal: Maintain core community structure.
  • 6. Datasets used Graph Statistics Karate Dolphins GrQc Enron Facebook #Nodes 34 62 5242 10630 31720 #Edges 78 159 14496 164837 80592 Avg. degree 4.59 5.13 5.530 31.013 5.081 Clust. coef. 0.57 0.26 0.529 0.383 0.079 Avg. path 2.408 3.356 6.049 3.160 6.432 length Diameter 5 8 17 20 9
  • 7. Sample criteria • Sampling requires user supervision • Sampling % desired: [ 10%-20% ] • Only large graphs datasets (3) are sampled Resulting sample Filter Value Sample size ArXiv-GrQc Degree ≥30 17.91% All neighbors Enron Clustering coef. =1 20.83% All neighbors Facebook Clustering coef. ≥0.5 10.75% All neighbors
  • 8. Sampling statistics • Graph statistics for sampled Vs original datasets (sss / ooo) GrQc Enron Facebook Degree >= 30 Clust.Coef = 1 Clust.Coef >= 0.5 #Nodes 939 / 5242 2218 / 10630 3410 / 31720 #Edges 5715 / 14446 14912 / 164387 6561 / 80592 Avg. degree 12.17 / 5,53 12.315 / 31,013 3.848 / 5,081 Clust. coef. 0.698 / 0,529 0.761 / 0,383 0.632 / 0,079 Avg. path length 4.51 / 6,049 3.143 / 3,160 8.388 / 6,432 Diameter 10 / 17 7 / 20 27 / 9
  • 9. Sampling statistics • Indicator: Clustering coefficient shows a common pattern, increasing in sampled datasets. This serves as an indicator that the core is preferentially included in the samples. GrQc Enron Facebook Degree >= 30 Clust.Coef = 1 Clust.Coef >= 0.5 #Nodes 939 / 5242 2218 / 10630 3410 / 31720 #Edges 5715 / 14446 14912 / 164387 6561 / 80592 Avg. degree 12.17 / 5,53 12.315 / 31,013 3.848 / 5,081 Clust. coef. 0.698 / 0,529 0.761 / 0,383 0.632 / 0,079 Avg. path length 4.51 / 6,049 3.143 / 3,160 8.388 / 6,432 Diameter 10 / 17 7 / 20 27 / 9
  • 10. Empirical Tests and Results 1. First, we evaluate Newman’s algorithm with the sampled datasets. Stop Original or Iteration Q Communities Sampled Karate 4 0.494 5 O Dolphins 5 0.591 6 O GrQc 56 0.777 57 S Enron 865 0.421 869 S Enron Early* 51 0.325 56 S Facebook 40 0.870 190 S
  • 11. Empirical Tests and Results 2. Blondel’s method allows us to extract the communities from the original dataset, given it’s greater execution speed in comparison with Newman’s method. Original Sampled Q C Q C GrQc 0.856 390 0.789 11 Enron 0.491 43 0.560 68 Facebook 0.681 1105 0.519 33 • How to compare nodes community matching?  NMI : Normalized Mutual Information
  • 12. Normalized Mutual Information  After labeling the communities, we match the nodes inside every corresponding community in the sampled and original datasets.  Purity: 100% means that all nodes in same communities are matched in both datasets.  We compare the Top N communities ( N =10 ) o Handicap Newman’s and Blondel’s methods are stochastic and non-deterministic  Give slightly different results in each execution.
  • 13. Normalized Mutual Information  After labeling the communities, we match the nodes inside every corresponding community in the sampled and original datasets.  Purity: 100% means that all nodes in same communities are matched in both datasets.  We compare the Top N communities ( N =10 ) o Handicap Newman’s and Blondel’s methods are stochastic and non-deterministic  Give slightly different results in each execution. NMI sampled NMI orig. Vs. NMI orig Vs. orig. Net loss Vs. sampled (B) sampled (A) (C) (C- A) GrQc 0.66559 0.82544 0.77301 0.10742 Enron 0.69069 0.86903 0.82012 0.12943 Facebook 0.58996 0.73249 0.69215 0.10219
  • 14. Newman’s Vs. Blondel’s • In terms of modularity (Q) and number of communities (C) Blondel’s Blondel’s Newman’s Original Sampled Sampled Q C Q C Q C GrQc 0.856 390 0.789 11 0.777 57 Enron 0.491 43 0.560 68 0.325 56 Facebook 0.681 1105 0.519 33 0.870 190 • The best modularity values are dataset dependent.
  • 15. Newman’s Vs. Blondel’s • In terms of modularity (Q) and number of communities (C) Blondel’s Blondel’s Newman’s Original Sampled Sampled Q C Q C Q C GrQc 0.856 390 0.789 11 0.777 57 Enron 0.491 43 0.560 68 0.325 56 Facebook 0.681 1105 0.519 33 0.870 190 • The methods may give distinct results in terms of the number of communities and modularity values.
  • 16. Newman’s (NG) Vs. Blondel’s (BN) • In terms of NMI: Normalized Mutual Information. Comparing Top N communities NMI BN Vs. NG NMI NG Vs. BN NMI orig Vs. orig. Net loss (A) (B) (C) (C- Avg (A,B)) GrQc 0.69116 0.87243 0.77301 -0.00878 Enron 0.31313 0.68796 0.82012 0.31958 Enron Early 0.83437 0.44320 0.82012 0.18133 Facebook 0.62056 0.54551 0.69215 0.10911 • Results show significant differences between the assignment of the nodes between methods.
  • 17. Conclusions  We’ve benchmarked 5 statistically and topologically distinct datasets • Applying 2 community structure algorithms • Sampling original datasets  Results indicate that is possible to identify the principal communities for large complex datasets using sampling.  It maintains the key facets of the community structure of a dataset (NMI statistic shows high correspondence is maintained)  Significantly reduces of the dataset size (80-90%)  However, a difference is found in the assignment of nodes to communities between different executions and methods, due to their stochastic nature.
  • 18. Thank you for your attention :-)