O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

A Framework for Online Clustering Based on Evolving Semi-supervision

149 visualizações

Publicada em

Presentation of the paper entitled "A Framework for Online Clustering Based on Evolving Semi-supervision" at SBBD'17

Publicada em: Ciências
  • Seja o primeiro a comentar

  • Seja a primeira pessoa a gostar disto

A Framework for Online Clustering Based on Evolving Semi-supervision

  1. 1. A Framework for Online Clustering Based on Evolving Semi-Supervision Guilherme Alves, Maria Camila N. Barioni and Elaine Faria Universidade Federal de Uberlândia
  2. 2. Outline ■ Introduction ■ The CABESS Framework ■ Pointwise CABESS ■ Experiments and Results ■ Conclusion 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 2
  3. 3. Introduction Motivation ■ The desired organization for the data may change over time. Semi-supervised approaches may be useful for guiding clustering algorithms in the adaptation process. ■ Additional information (semi-supervision) may change over time. ■ It may causes clustering transitions: birth, split or merge. 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 3
  4. 4. Introduction Objective and Research Questions Goal: to provide a framework that is able to use and maintain semi-supervision correctly to enable efficient and effective online clustering processes. Q1. How accurately does semi-supervision aid on clustering effectiveness when there are external clustering transitions over time? 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 4
  5. 5. Introduction Objective and Research Questions Goal: to provide a framework that is able to use and maintain semi-supervision correctly to enable efficient and effective online clustering processes. Q1. How accurately does semi-supervision aid on clustering effectiveness when there are external clustering transitions over time? Q2. Are there major differences between clustering effectiveness when using semi-supervised clustering approaches based on feedback and labels? 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 5
  6. 6. Introduction Objective and Research Questions Goal: to provide a framework that is able to use and maintain semi-supervision correctly to enable efficient and effective online clustering processes. Q1. How accurately does semi-supervision aid on clustering effectiveness when there are external clustering transitions over time? Q2. Are there major differences between clustering effectiveness when using semi-supervised clustering approaches based on feedback and labels? Q3. How the feedback window size variation affects semi-supervision information and clustering effectiveness? 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 6
  7. 7. Introduction Objective and Research Questions Goal: to provide a framework that is able to use and maintain semi-supervision correctly to enable efficient and effective online clustering processes. Q1. How accurately does semi-supervision aid on clustering effectiveness when there are external clustering transitions over time? Q2. Are there major differences between clustering effectiveness when using semi-supervised clustering approaches based on feedback and labels? Q3. How the feedback window size variation affects semi-supervision information and clustering effectiveness? Q4. How efficient is our approach compared to existing semi-supervised approaches? 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 7
  8. 8. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 8 𝑡0 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 III.Groundtruth Time
  9. 9. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 9 𝑡0 𝑡 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 2 2 3 33 4 4 4 4 2 2 4 4 III.Groundtruth Time
  10. 10. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 10 𝑡0 𝑡 𝑡 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 2 2 3 33 4 4 4 4 2 2 4 4 3 3 2 2 2 3 2 4 4 4 4 4 4 2 2 4 3 4 4 2 2 2 3 4 4 3 2 2233 3 2 2 3 III.Groundtruth Time
  11. 11. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 11 𝑡0 𝑡 𝑡 𝑡 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 2 2 3 33 4 4 4 4 2 2 4 4 3 3 2 2 2 3 2 4 4 4 4 4 4 2 2 4 3 4 4 2 2 2 3 4 4 3 2 2233 3 2 2 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 4 3 4 4 4 2 23 4 4 3 2 2 4 22 2233 3 3 III.Groundtruth Time
  12. 12. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 12 𝑡0 1 2 1 1 2 2 2 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 II.PartitionSetIII.Groundtruth Time
  13. 13. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 13 𝑡0 𝑡 1 2 1 1 2 2 2 1 1 2 1 2 2 23 3 41 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 2 2 3 33 4 4 4 4 2 2 4 4 II.PartitionSetIII.Groundtruth Time
  14. 14. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 14 𝑡0 𝑡 𝑡 DivisãoSplit: → { , } 1 2 1 1 2 2 2 1 1 2 1 2 2 23 3 41 2 4 4 2 4 3 1 2 2 23 3 2 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 2 2 3 33 4 4 4 4 2 2 4 4 3 3 2 2 2 3 2 4 4 4 4 4 4 2 2 4 3 4 4 2 2 2 3 4 4 3 2 2233 3 2 2 3 II.PartitionSetIII.Groundtruth Time I.Clustering transition
  15. 15. Introduction Example 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 15 𝑡0 𝑡 𝑡 𝑡 DivisãoSplit: → { , } 1 2 1 1 2 2 2 1 1 2 1 2 2 23 3 41 2 4 4 2 4 3 1 2 2 23 3 2 3 2 4 4 2 4 3 4 2 2 2 23 3 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 2 2 3 33 4 4 4 4 2 2 4 4 3 3 2 2 2 3 2 4 4 4 4 4 4 2 2 4 3 4 4 2 2 2 3 4 4 3 2 2233 3 2 2 3 3 3 2 2 2 3 2 4 4 4 4 4 2 2 2 4 3 4 4 4 2 23 4 4 3 2 2 4 22 2233 3 3 II.PartitionSetIII.Groundtruth Time I.Clustering transition
  16. 16. Introduction Main Contributions ■ The introduction of CABESS (Cluster Adaptation Based on Evolving Semi-Supervision) ■ A framework for online clustering using semi-supervision in the form of feedback. ■ A strategy that extracts semi-supervision information from feedback given in the form of labels. ■ An approach to keep the labels consistent over time. 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 16
  17. 17. The CABESS Framework 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 17 AF Summarization 1 Semi-supervised Clustering 4 Clustering 2 ℱ Semi-Supervision Deduction 3 B Transition detection 5 𝒟 CABESS 𝑡 ℛ 𝑡 , 𝒮 𝑡 ℱ 𝑡 𝒟 𝑡 T Partition Set Storage 𝑡 𝑡− 𝑡 F T 𝑡− 𝒮 𝑡 A B Verify if new instances have been generated or the user has been satisfied with the cluster quality. Verify if it is the first clustering performed.
  18. 18. The Framework CABESS Pointwise CABESS 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 18 AF BIRCH [Zhang et al.1996] 1 SSDBScan [Lelis and Sander 2009] 4 DBScan [Ester et al.1996] 2 ℱ Semi-Supervision Deduction 3 B MONIC [Spiliopoulou et al. 2006] 5 𝒟 CABESS 𝑡 ℛ 𝑡 , 𝒮 𝑡 ℱ 𝑡 𝒟 𝑡 T Partition Set Storage 𝑡 𝑡− 𝑡 F T 𝑡− 𝒮 𝑡
  19. 19. The CABESS Framework » Pointwise CABESS Extracting labels from feedbacks 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 19 1. Feedbacks 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛 Instance-level labels ✓ ✓ ✓ ✓ ✓ ✓ ✓ (a)
  20. 20. The CABESS Framework » Pointwise CABESS Extracting labels from feedbacks 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 20 1. Feedbacks 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛 Instance-level labels ■ A neighborhood 𝑁 is defined as a set of instances that must be in the same cluster. ✓ ✓ ✓ ✓ ✓ ✓ ✓ (a) (b)
  21. 21. The CABESS Framework » Pointwise CABESS Extracting labels from feedbacks 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 21 1. Feedbacks 𝑒𝑥𝑡𝑟𝑎𝑐𝑡𝑖𝑜𝑛 Instance-level labels ■ A neighborhood 𝑁 is defined as a set of instances that must be in the same cluster. ■ Same previous cluster AND received positive feedback → Same label ■ Received displacement feedback → assign to label of the neighborhood of the destination rule. ✓ ✓ ✓ ✓ ✓ ✓ ✓ 1 1 2 1 2 2 2 2 2 (a) (b) (c)
  22. 22. The CABESS Framework » Pointwise CABESS Extracting labels from feedbacks 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 22 2. Instance-level labels 𝑑𝑒𝑑𝑢𝑐𝑡𝑖𝑜𝑛 Summarized-level labels ■ Performed as a propagation task ■ If one of the summarized instances has labeled instances with different labels ■ Split it to obtain purified summarized instances ■ summarized instances that contain only the same label in labeled instances. 1 1 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 1 2 2 2 2 (c) (d)
  23. 23. The CABESS Framework » Pointwise CABESS Dealing with obsolete labels 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 23 ■ Obsolete labels: labels assigned to instances, for which the clusters do not exist anymore. ■ Minimizing the problem: adoption of a detector of transitions. ■ Better neighborhood management. ■ when a cluster survives both neighborhood and associated labels are preserved. 𝑠𝑝𝑙𝑖𝑡 does not exist at 𝒕 → label 2 becomes obsolete 𝑡 𝑡
  24. 24. Experiments Datasets 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 24 Name # instances d # classes Reference Type DB7 9,050 2 8 [Silva et al. 2015] SyntheticSYN3 5,000 2 3 streamMOA SYN4 10,000 3 5 streamMOA FROGS 1,484 8 4 [Colonna et al. 2016] RealIPEA 5,564 5 27 IPEA KDD’995 24,692 19 11 UCI
  25. 25. Experiments Comparison Methods 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 25 ■ Unsupervised (DBScan) ■ Consists in periodically executing a clustering algorithm without any semi-supervision. ■ Semi-supervised (SSDBScan) ■ Static Approach. ■ Consists in periodically applying a semi-supervised clustering algorithm. ■ This approach does not discard any label over time. ■ Window-based Approach. ■ It is a variation of the previous approach. ■ Instead of executing the clustering algorithm over all the label set, we remove old labels.
  26. 26. Experiments Protocol and Evaluation 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 26 ■ Adjusted Rand Index (ARI) ■ Prequential Protocol ■ For each timestamp 𝑡 only one label is considered valid for a data instance according to the grouping tree. ■ Online arrival of instances and feedback. ■ The arrival of the data instances and the user feedback are given according to the uniform distribution.
  27. 27. Experimental Results Semi-supervised vs. Unsupervised Q1. How accurately does semi-supervision aid on clustering effectiveness when there are external clustering transitions over time? 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 27
  28. 28. Experimental Results Semi-supervised vs. Unsupervised Q1. How accurately does semi-supervision aid on clustering effectiveness when there are external clustering transitions over time? 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 28
  29. 29. Experimental Results Feedbacks vs. Labels 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 29 Q2. Are there major differences between clustering effectiveness when using semi-supervised clustering approaches based on feedback and labels?
  30. 30. Experimental Results Feedbacks vs. Labels 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 30 Q2. Are there major differences between clustering effectiveness when using semi-supervised clustering approaches based on feedback and labels?
  31. 31. Experimental Results Feedback Window Size Variation 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 31 Q3. How the feedback window size variation affects semi-supervision information and clustering effectiveness?
  32. 32. Experimental Results Efficiency 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 32 Q4. How efficient is our approach compared to existing semi-supervised approaches?
  33. 33. Conclusion ■ Higher efficiency when compared to other semi-supervised approaches ■ While keeping an equivalent effectiveness. ■ Future works: ■ Exploring other types of semi-supervision information such as instance-level constraints. ■ Tackling other strategies for detecting transitions. 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 33
  34. 34. References Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231. AAAI Press Colonna, J. G., Gama, J., and Nakamura, E. F. (2016). Recognizing Family, Genus, and Species of Anuran Using a Hierarchical Classification Approach. pages 198–212. Springer, Cham. Lai, H. P., Visani, M., Boucher, A., and Ogier, J.-M. (2014). A new interactive semi- supervised clustering model for large image database indexing. Pattern Recognition Letters, 37(1):94–106. Lelis, L. and Sander, J. (2009). Semi-supervised Density-Based Clustering. In IEEE ICDM, pages 842–847. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., and Schult, R. (2006). MONIC: Modeling and Monitoring Cluster Transitions. In ACM SIGKDD, page 706, New York, NY, USA. ACM Press. Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for very Large Databases. ACM SIGMOD Record, 25(2):103–114. 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 34
  35. 35. A Framework for Online Clustering Based on Evolving Semi-Supervision Guilherme Alves guilhermealves@ufu.br Maria Camila N. Barioni camila.barioni@ufu.br Elaine Faria elaine@ufu.br 6-Oct-17 32nd Brazilian Symposium on Databases (SBBD 2017) 35 Acknowledgments

×