Anonymizing Graphs: Measuring Quality for Clustering

Anonymization of graph-based data has been widely studied in recent years, and several anonymization methods have been developed. In this presentation, the authors (Jordi Casas-Roma, Jordi Herrera-Joancomartí, Vicenç Torra) study several generic information loss measures for graphs and compare them with clustering-specific ones. They evaluate whether the generic information loss measures indicate how useful the data remain for subsequent data mining processes.

Published in: Software
  1. Anonymizing graphs: measuring quality for clustering
     Jordi Casas-Roma (1), Jordi Herrera-Joancomartí (2), Vicenç Torra (3)
     (1) Universitat Oberta de Catalunya (UOC), jcasasr@uoc.edu
     (2) Universitat Autònoma de Barcelona (UAB), jherrera@deic.uab.cat
     (3) Artificial Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), vtorra@iiia.csic.es
     UOC Research Showcase 2015. February 11, 2015
     Sections: Motivation; Information loss measures; Experimental framework; Correlating GIL and SIL measures; Conclusions
  2. Overview
     1 Motivation
     2 Information loss measures
     3 Experimental framework
     4 Correlating GIL and SIL measures
     5 Conclusions
  3. Scenario
     • Release data to third parties
     • Preserve the privacy of users
  4. Motivation
     We observe: there are several graph-mining tasks and several methods to compute each task. How can we evaluate the real data utility?
     Question: can we use some generic graph metrics to predict data utility for real graph-mining tasks?
  5. Generic information loss measures (GIL)
     Subsections: Generic information loss (GIL); Specific information loss (SIL)
     [Figure: framework for evaluating generic information loss measures. An anonymization process p turns the original graph $G$ into the perturbed graph $\tilde{G}$; the same metric m is computed on both graphs and the two values are compared as $m(G, \tilde{G})$.]
  6. Generic information loss measures (GIL)
     Network metrics:
     • average distance (dist)
     • diameter (d)
     • harmonic mean of the shortest distance (h)
     • sub-graph centrality (SC)
     • transitivity (T)
     • edge intersection (EI)
     • clustering coefficient (C)
     • modularity (Q)
     $$m(G, \tilde{G}) = |m(G) - m(\tilde{G})| \qquad (1)$$
     where $\tilde{G}$ is the graph obtained from $G$ by the anonymization process p.
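
Equation (1) can be illustrated with a minimal sketch for one network metric, the average distance. The adjacency-dictionary representation, the function names `average_distance` and `network_gil`, and the two toy graphs are assumptions made for illustration; they do not come from the slides.

```python
from collections import deque

def average_distance(adj):
    """Mean shortest-path length over all reachable ordered vertex pairs (BFS)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0

def network_gil(metric, original, perturbed):
    """Equation (1): absolute difference of a graph-level metric."""
    return abs(metric(original) - metric(perturbed))

# Hypothetical toy graphs: a 4-vertex path, and the same path with edge 0-3 added.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

loss = network_gil(average_distance, path, cycle)
print(loss)
```

Any other graph-level metric from the list could be passed as `metric` in the same way, as long as it returns a single number per graph.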
  7. Generic information loss measures (GIL)
     Spectral metrics:
     • the largest eigenvalue of the adjacency matrix A (λ1)
     • the second smallest eigenvalue of the Laplacian matrix L (µ2)
     Vertex metrics:
     • betweenness centrality (CB)
     • closeness centrality (CC)
     • degree centrality (CD)
     $$m(G, \tilde{G}) = \frac{1}{n} \sum_{i=1}^{n} \left( m(v_i) - m(\tilde{v}_i) \right)^2 \qquad (2)$$
     where $\tilde{v}_i$ is vertex $v_i$ in the perturbed graph $\tilde{G}$.
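
A minimal sketch of the vertex-level aggregation in equation (2), using degree centrality as the per-vertex metric. It assumes both graphs share the same vertex set and normalizes degree by n - 1; the function names and toy graphs are hypothetical.

```python
def degree_centrality(adj):
    """Degree of each vertex divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}

def vertex_gil(metric, original, perturbed):
    """Equation (2) as written on the slide: mean squared per-vertex difference."""
    m_orig = metric(original)
    m_pert = metric(perturbed)
    return sum((m_orig[v] - m_pert[v]) ** 2 for v in m_orig) / len(m_orig)

# Hypothetical toy graphs: a 4-vertex path, and the same path with edge 0-3 added.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

loss = vertex_gil(degree_centrality, path, cycle)
print(loss)
```

Only vertices 0 and 3 change degree here, so two of the four squared differences are zero.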
  8. Clustering-specific information loss measures (SIL)
     [Figure: the same clustering method c is applied to the original graph $G$ and to the perturbed graph $\tilde{G}$ produced by the anonymization process p; the original clusters $c(G)$ and the perturbed clusters $c(\tilde{G})$ are compared with the precision index.]
     $$\text{precision index}(G, \tilde{G}) = \frac{1}{n} \sum_{v=1}^{n} \mathbb{1}_{l_{tc} = l_{pc}} \qquad (3)$$
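
The precision index in equation (3) can be sketched as follows. The slide does not define the two labels, so this sketch makes two assumptions: the first label of a vertex is its cluster in the original graph, and the second is the majority original label inside the vertex's perturbed cluster. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def precision_index(true_labels, perturbed_clusters):
    """Fraction of vertices whose original label equals the majority
    original label of the perturbed cluster they were assigned to.

    true_labels: dict vertex -> cluster label in the original graph
    perturbed_clusters: dict vertex -> cluster id in the perturbed graph
    """
    members = {}
    for vertex, cluster_id in perturbed_clusters.items():
        members.setdefault(cluster_id, []).append(vertex)

    matches = 0
    for vertices in members.values():
        counts = Counter(true_labels[v] for v in vertices)
        majority_label, _ = counts.most_common(1)[0]
        matches += sum(1 for v in vertices if true_labels[v] == majority_label)
    return matches / len(true_labels)

# Hypothetical example: vertex 2 moves to the other cluster after perturbation.
true_labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
perturbed_clusters = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}

score = precision_index(true_labels, perturbed_clusters)
print(score)
```

A score of 1 means the perturbed clustering reproduces the original one exactly; lower values indicate more clustering-specific information loss.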
  9. Clustering-specific information loss measures (SIL)
     Clustering algorithms:
     • Markov Cluster Algorithm (MCL)
     • Algorithm of Girvan and Newman (Girvan-Newman or GN)
     • Fast greedy modularity optimization (Fastgreedy or FG)
     • Walktrap (WT)
     • Infomap (IM)
     • Multilevel (ML)
  10. Experimental framework
      [Figure: experimental framework for testing the correlation between GIL and SIL. The perturbation process produces anonymized versions of the original graph at levels from 1% to 25%. Graph assessment yields a GIL value per level and clustering assessment yields a SIL value per level; the two series are then compared ("Are they equal?").]
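
The perturbation sweep in this framework can be sketched as below. The slides do not say which perturbation method or which intermediate levels were used, so everything here is an assumption: the perturbation deletes k random edges and adds k random non-edges, the levels are four hand-picked values between 1% and 25%, and `missing_edge_share` is an illustrative stand-in for a generic loss measure rather than the authors' edge intersection metric.

```python
import random
from itertools import combinations

def perturb_edges(edges, vertices, fraction, rng):
    """Hypothetical perturbation: swap a fraction of edges for random non-edges."""
    edges = set(edges)
    k = round(fraction * len(edges))
    non_edges = [e for e in combinations(sorted(vertices), 2) if e not in edges]
    k = min(k, len(non_edges))
    removed = rng.sample(sorted(edges), k)
    added = rng.sample(non_edges, k)
    return (edges - set(removed)) | set(added)

def missing_edge_share(original, perturbed):
    """Illustrative generic measure: share of original edges that disappeared."""
    return 1 - len(original & perturbed) / len(original)

# Toy graph: 25 vertices, each linked to its next four neighbours on a ring (100 edges).
vertices = range(25)
original = {tuple(sorted((i, (i + step) % 25))) for i in vertices for step in (1, 2, 3, 4)}

rng = random.Random(0)            # fixed seed so the sweep is repeatable
levels = [0.01, 0.05, 0.10, 0.25]  # assumed perturbation levels
losses = [missing_edge_share(original, perturb_edges(original, vertices, p, rng))
          for p in levels]
print(list(zip(levels, losses)))
```

In a full experiment each perturbed graph would also be clustered, giving a SIL series over the same levels that can be correlated with the GIL series.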
  11. GIL self-correlation
      Subsections: GIL self-correlation; SIL self-correlation; GIL vs. SIL; Comparing datasets
      Do the generic information loss measures behave in a similar way independently of the dataset?
      Pearson    dist   d      CB     CC     CD     EI     C      T      λ1     µ2
      r          0.85   0.15   0.96   0.90   0.99   0.99   0.97   0.94   0.24   0.09
      p-value    0      0.007  0      0      0      0      0      0      0      0.006
      Pearson self-correlation value (r) and its associated p-value for the GIL measures.
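
For reference, the Pearson coefficient r used in these tables can be computed as in the sketch below. The slides do not spell out how the self-correlation was set up; one plausible reading, assumed here, is that the loss series of one measure on one dataset is correlated with its series on another dataset. The two series are invented for illustration and are not results from the presentation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical loss values of one measure at four perturbation levels,
# on two different datasets.
dataset_a = [0.02, 0.05, 0.09, 0.20]
dataset_b = [0.01, 0.04, 0.10, 0.18]

r = pearson_r(dataset_a, dataset_b)
print(r)
```

A value of r close to 1 would indicate that the measure grows in a similar way on both datasets.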
  12. SIL self-correlation
      Do the clustering-specific information loss measures behave in a similar way independently of the dataset?
      Pearson    MCL     IM      ML      GN      FG      WT
      r          0.287   0.626   0.777   0.828   0.782   0.656
      p-value    0       0       0       0       0       0
      Pearson self-correlation value (r) and its associated p-value for the precision index.
  13. GIL vs. SIL
      Are GIL and SIL measures correlated?
      Pearson   MCL       IM        ML        GN      FG      WT        µ
      dist      0.580     0.716     0.807     0.785   0.747   0.755     0.732
      d         0.201     0.101 *   0.098 *   0.134   0.218   0.014 *   0.128
      CB        0.559     0.687     0.854     0.865   0.831   0.724     0.753
      CC        0.667     0.833     0.903     0.909   0.874   0.899     0.848
      CD        0.296     0.380     0.416     0.504   0.481   0.457     0.422
      EI        0.581     0.820     0.861     0.887   0.814   0.748     0.785
      C         0.614     0.833     0.889     0.909   0.836   0.802     0.814
      T         0.557     0.763     0.840     0.840   0.770   0.690     0.743
      λ1        0.191     0.482     0.509     0.546   0.529   0.397     0.442
      µ2        0.086 *   0.152     0.131     0.154   0.135   0.040 *   0.116
      µ         0.433     0.577     0.631     0.653   0.624   0.553     NA
      Pearson correlation values (r) and their average values µ. An asterisk indicates p-values ≥ 0.05, i.e., results which are not statistically significant.
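
The averages in the table can be recomputed from its entries. The sketch below copies the r values from the table, recomputes the row means (per GIL measure) and column means (per clustering algorithm), and ranks the measures by mean correlation. The ASCII keys `lambda1` and `mu2` and all variable names are choices made for this sketch.

```python
algorithms = ["MCL", "IM", "ML", "GN", "FG", "WT"]
pearson = {  # r values copied from the table, one row per GIL measure
    "dist":    [0.580, 0.716, 0.807, 0.785, 0.747, 0.755],
    "d":       [0.201, 0.101, 0.098, 0.134, 0.218, 0.014],
    "CB":      [0.559, 0.687, 0.854, 0.865, 0.831, 0.724],
    "CC":      [0.667, 0.833, 0.903, 0.909, 0.874, 0.899],
    "CD":      [0.296, 0.380, 0.416, 0.504, 0.481, 0.457],
    "EI":      [0.581, 0.820, 0.861, 0.887, 0.814, 0.748],
    "C":       [0.614, 0.833, 0.889, 0.909, 0.836, 0.802],
    "T":       [0.557, 0.763, 0.840, 0.840, 0.770, 0.690],
    "lambda1": [0.191, 0.482, 0.509, 0.546, 0.529, 0.397],
    "mu2":     [0.086, 0.152, 0.131, 0.154, 0.135, 0.040],
}
reported_row_means = {
    "dist": 0.732, "d": 0.128, "CB": 0.753, "CC": 0.848, "CD": 0.422,
    "EI": 0.785, "C": 0.814, "T": 0.743, "lambda1": 0.442, "mu2": 0.116,
}
reported_col_means = [0.433, 0.577, 0.631, 0.653, 0.624, 0.553]

# Mean correlation per GIL measure (rows) and per algorithm (columns).
row_means = {name: sum(values) / len(values) for name, values in pearson.items()}
col_means = [sum(row[j] for row in pearson.values()) / len(pearson)
             for j in range(len(algorithms))]

# Largest gap between recomputed and reported averages (rounding only).
max_row_gap = max(abs(row_means[k] - reported_row_means[k]) for k in pearson)
max_col_gap = max(abs(c - r) for c, r in zip(col_means, reported_col_means))

# GIL measures ordered by mean correlation with the precision index.
ranking = sorted(row_means, key=row_means.get, reverse=True)
print(ranking[:6], max_row_gap, max_col_gap)
```

The recomputed averages agree with the reported ones up to rounding in the third decimal place.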
  14. Aggregated GIL vs. SIL
      Can we use more than one GIL measure to improve correlation?
      Num.   GIL measures        r-square   σ
      1      CC                  0.725      0.146
      2      CB+CC               0.742      0.150
      3      CB+CC+EI            0.765      0.155
      4      d+CB+CC+EI          0.777      0.127
      5      dist+d+CB+CC+EI     0.787      0.117
      Multivariate regression analysis: r-square is indicative of the aggregate correlation and σ is the standard deviation.
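
The slides do not state how the multivariate regression was fitted. As a rough sketch, assuming ordinary least squares with an intercept, r-square can be computed as below and compared when a second predictor is added. The helper names and the toy data are hypothetical and are not the authors' measurements.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(predictors, target):
    """R^2 of an OLS fit of target on the given predictor columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(target))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1 - ss_res / ss_tot

# Hypothetical loss series: two generic measures and one clustering-specific measure.
gil_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
gil_b = [0.3, 0.1, 0.4, 0.2, 0.6, 0.5]
sil = [0.15, 0.18, 0.35, 0.33, 0.58, 0.55]

r2_one = r_squared([gil_a], sil)
r2_two = r_squared([gil_a, gil_b], sil)
print(r2_one, r2_two)
```

With least squares, adding a predictor can never lower r-square on the same data, which is consistent with the steadily increasing values in the table; whether the gain justifies the extra metric is a separate question.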
  15. GIL vs. SIL
      Are the results independent of the data to which they are applied?
      Pearson   Karate   Football   Jazz    Flickr   URV Email
      µ         0.716    0.796      0.717   0.780    0.729
      σ         0.247    0.119      0.170   0.184    0.163
      Pearson correlation averaged values (µ) and standard deviation (σ) for each dataset.
  16. Conclusions
      • Some measures behave in a similar way independently of the data to which they are applied.
      • There is a strong correlation between some GIL and SIL measures:
        1 closeness centrality
        2 clustering coefficient
        3 edge intersection
        4 betweenness centrality
        5 transitivity
        6 average distance
      • Considering more than one metric yields slightly higher correlation values, but adds computational cost.
  17. The End
      Thanks for your attention
      Jordi Casas-Roma, UOC, jcasasr@uoc.edu
      Jordi Herrera-Joancomartí, UAB, jherrera@deic.uab.cat
      Vicenç Torra, IIIA-CSIC, vtorra@iiia.csic.es
