Adaptive XML Tree Mining on Evolving Data Streams


                                 Albert Bifet

       Laboratory for Relational Algorithmics, Complexity and Learning LARCA
                Departament de Llenguatges i Sistemes Informàtics
                         Universitat Politècnica de Catalunya




                           Porto, 21 May 2009
Mining Evolving Massive Structured Data


                                    The basic problem
                                    Finding interesting structure
                                    in data
                                        Mining massive data
                                        Mining time-varying data
                                        Mining in real time
                                        Mining XML data


The Disintegration of the Persistence
      of Memory, 1952-54

         Salvador Dalí


                                                                   2 / 30
XML Tree Classification on evolving data
               streams

                Figure: A dataset example. Four trees with node labels
                A, B, C and D; trees 1 and 3 belong to CLASS 1, trees 2
                and 4 to CLASS 2 (tree drawings not recoverable).


                                                          3 / 30
Tree Pattern Mining

"Trees are sanctuaries. Whoever knows how to listen to them,
can learn the truth."
  Hermann Hesse

Given a dataset of trees, find the complete set of frequent subtrees
    Frequent Tree Pattern (FT):
        Include all the trees whose support is no less than min_sup

    Closed Frequent Tree Pattern (CT):
        Include no tree which has a super-tree with the same support

    CT ⊆ FT
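
The FT and CT definitions can be illustrated on a toy dataset. The sketch below is only an illustration under stated assumptions: trees are unlabeled and unordered, each tree is encoded as a sorted tuple of its children's encodings, and the names `is_subtree`, `support`, `dataset` and `candidates` are hypothetical, not taken from the talk. Closedness is checked only against the listed candidates, which here happen to be all trees with nonzero support.

```python
def is_subtree(s, t):
    """True if unordered tree s embeds into t with the root mapped to the root."""
    def match(i, used):
        # Give child i of s its own, not yet used, child of t.
        if i == len(s):
            return True
        return any(
            j not in used and is_subtree(s[i], t[j]) and match(i + 1, used | {j})
            for j in range(len(t))
        )
    return len(s) <= len(t) and match(0, frozenset())


def support(pattern, dataset):
    """Number of dataset trees that contain the pattern."""
    return sum(is_subtree(pattern, tree) for tree in dataset)


# Toy data (hypothetical): () is a single node, ((),) a root with one leaf.
dataset = [
    (((),),),      # chain of three nodes
    ((), ((),)),   # root with a leaf child and a child that has one leaf
    ((), ()),      # root with two leaf children
]
candidates = [(), ((),), ((), ()), (((),),), ((), ((),))]
min_sup = 2

sup = {t: support(t, dataset) for t in candidates}

# FT: every candidate whose support is no less than min_sup.
frequent = [t for t in candidates if sup[t] >= min_sup]

# CT: frequent trees with no proper super-tree of the same support.
closed = [
    t for t in frequent
    if not any(s != t and is_subtree(t, s) and sup[s] == sup[t] for s in frequent)
]

print(frequent)
print(closed)
```

In this toy run the single-node tree is frequent but not closed, because the root-with-one-leaf tree occurs in exactly the same dataset trees.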

                                                               4 / 30
Mining Closed Frequent Trees

Our trees are:
    Labeled and Unlabeled
    Ordered and Unordered

Our subtrees are:
    Induced
    Top-down

Figure: Two different ordered trees but the same unordered tree
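
To make the ordered/unordered distinction concrete, here is a small sketch. It assumes the preorder depth-sequence encoding that appears on the Example slides of this deck; the function name `from_depths` and the two sample sequences are hypothetical. Sorting the children recursively gives one canonical form per unordered tree.

```python
def from_depths(depths):
    """Build an unordered tree (sorted tuple of children) from preorder depths."""
    def build(pos):
        # Parse the node at index pos; return its encoding and the next index.
        depth = depths[pos]
        pos += 1
        children = []
        while pos < len(depths) and depths[pos] == depth + 1:
            child, pos = build(pos)
            children.append(child)
        return tuple(sorted(children)), pos

    tree, _ = build(0)
    return tree


left_heavy = (0, 1, 2, 1)    # the first child of the root has a child
right_heavy = (0, 1, 1, 2)   # the second child of the root has a child

# As ordered trees the two sequences differ ...
print(left_heavy != right_heavy)
# ... but they describe the same unordered tree.
print(from_depths(left_heavy) == from_depths(right_heavy))
```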




                                                     5 / 30
A tale of two trees


Consider D = {A, B}, where A and B are the two trees in the figure
(drawings not recoverable), and let min_sup = 2.

Frequent subtrees: (drawings not recoverable)


                                                     6 / 30
A tale of two trees


Consider D = {A, B}, where A and B are the two trees in the figure
(drawings not recoverable), and let min_sup = 2.

Closed subtrees: (drawings not recoverable)


                                                        6 / 30
XML Tree Classification on evolving data
               streams

Notation: X(Y, Z) is a node X with children Y and Z.

                                                     Tree Trans.
       Closed         Freq. not Closed Trees         1   2   3   4
 c1    D(B(C, C))     B(C, C)                        1   0   1   0
 c2    D(B(C(A)))     B(C(A)), C(A), A               1   0   0   1


                                                           8 / 30
XML Tree Classification on evolving data
               streams
                              Frequent Trees
          c1         c2                   c3         c4
  Id   c1 f1^1    c2 f2^1 f2^2 f2^3    c3 f3^1    c4 f4^1 f4^2 f4^3 f4^4 f4^5
   1    1  1       1  1    1    1       0  0       1  1    1    1    1    1
   2    0  0       0  0    0    0       1  1       1  1    1    1    1    1
   3    1  1       0  0    0    0       1  1       1  1    1    1    1    1
   4    0  0       1  1    1    1       1  1       1  1    1    1    1    1

Here fi^j denotes the j-th frequent, non-closed tree listed under closed tree ci.

                 Closed Trees          Maximal Trees
  Id Tree     c1   c2   c3   c4        c1   c2   c3      Class
     1         1    1    0    1         1    1    0      CLASS 1
     2         0    0    1    1         0    0    1      CLASS 2
     3         1    0    1    1         1    0    1      CLASS 1
     4         0    1    1    1         0    1    1      CLASS 2
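
The step from the frequent-tree table to the closed-tree table can be sketched as follows. This is an illustration only: the column names (`f1_1` and so on) and the grouping rule (columns with identical occurrence vectors collapse into one attribute) are assumptions that match this example, not a description of the actual implementation.

```python
# Occurrence of each frequent tree (columns) in each dataset tree (rows),
# copied from the table: 2 columns for c1, 4 for c2, 2 for c3, 6 for c4.
frequent_columns = [
    "c1", "f1_1",
    "c2", "f2_1", "f2_2", "f2_3",
    "c3", "f3_1",
    "c4", "f4_1", "f4_2", "f4_3", "f4_4", "f4_5",
]
occurrences = {
    1: [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    3: [1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    4: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}
classes = {1: "CLASS 1", 2: "CLASS 2", 3: "CLASS 1", 4: "CLASS 2"}

# Group columns that occur in exactly the same dataset trees.
groups = {}
for col, name in enumerate(frequent_columns):
    signature = tuple(occurrences[tree_id][col] for tree_id in sorted(occurrences))
    groups.setdefault(signature, []).append(name)

# Keep one attribute per group; in this table the first name is the closed tree.
closed_attributes = [names[0] for names in groups.values()]

# One labeled tuple per dataset tree, ready to feed to a stream classifier.
tuples = [
    ([occurrences[tree_id][frequent_columns.index(a)] for a in closed_attributes],
     classes[tree_id])
    for tree_id in sorted(occurrences)
]
for row in tuples:
    print(row)
```

Fourteen frequent-tree columns shrink to four closed-tree attributes without changing which dataset trees can be told apart.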

                                                        9 / 30
XML Tree Framework on evolving data
                streams


XML Tree Classification Framework Components
   An XML closed frequent tree miner
   A Data stream classifier algorithm, which we will feed with tuples
   to be classified online.




                                                                       10 / 30
Mining Evolving Tree Data Streams


Problem
Given a data stream D of rooted and unordered trees, find
frequent closed trees.


                                 We provide three algorithms,
                                 of increasing power
                                     Incremental
                                     Sliding Window
                                     Adaptive

           D



                                                                11 / 30
Mining Closed Unordered Subtrees



CLOSED_SUBTREES(t, D, min_sup, T)
    if not CANONICAL_REPRESENTATIVE(t)
        then return T
    for every t′ that can be extended from t in one step
        do if Support(t′) ≥ min_sup
               then T ← CLOSED_SUBTREES(t′, D, min_sup, T)
           if Support(t′) = Support(t)
               then t is not closed
    if t is closed
        then insert t into T
    return T
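
A runnable sketch of the recursion above, for unlabeled unordered trees only. It is not the TreeNat implementation. Assumptions: trees are encoded as sorted tuples of child encodings; a one-step extension attaches one new leaf anywhere in the tree; and a `seen` set of canonical encodings stands in for the canonical-representative test. All function names are hypothetical.

```python
def is_subtree(s, t):
    """True if unordered tree s embeds into t with the root mapped to the root."""
    def match(i, used):
        if i == len(s):
            return True
        return any(
            j not in used and is_subtree(s[i], t[j]) and match(i + 1, used | {j})
            for j in range(len(t))
        )
    return len(s) <= len(t) and match(0, frozenset())


def support(t, dataset):
    return sum(is_subtree(t, d) for d in dataset)


def extensions(t):
    """All trees obtained from t by attaching one new leaf somewhere."""
    result = {tuple(sorted(t + ((),)))}          # new leaf under the root
    for i, child in enumerate(t):
        for grown in extensions(child):          # new leaf deeper down
            result.add(tuple(sorted(t[:i] + (grown,) + t[i + 1:])))
    return result


def closed_subtrees(t, dataset, min_sup, closed, seen):
    if t in seen:        # stands in for the canonical-representative test
        return closed
    seen.add(t)
    sup_t = support(t, dataset)
    t_is_closed = True
    for ext in extensions(t):
        sup_ext = support(ext, dataset)
        if sup_ext >= min_sup:
            closed_subtrees(ext, dataset, min_sup, closed, seen)
        if sup_ext == sup_t:
            t_is_closed = False
    if t_is_closed:
        closed[t] = sup_t
    return closed


# The depth sequences (0, 1, 2, 3, 2, 1) and (0, 1, 2, 3, 1, 2, 2),
# read as unordered trees and written in the nested encoding.
A = ((), ((), ((),)))
B = (((), ()), (((),),))

result = closed_subtrees((), [A, B], 2, {}, set())
for tree, sup in result.items():
    print(tree, sup)
```

The call starts from the single-node tree, which is assumed to be frequent. With min_sup = 2 the sketch reports two closed trees, corresponding to the depth sequences (0, 1, 2, 2, 1) and (0, 1, 2, 3, 1).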



                                                                12 / 30
Example
D = {A, B}            A = (0, 1, 2, 3, 2, 1)        B = (0, 1, 2, 3, 1, 2, 2)

min_sup = 2.


                                          (0, 1, 2, 1)

                                                            (0, 1, 2, 2, 1)
                          (0, 1, 1)
                                          (0, 1, 2, 2)
  (0)        (0, 1)
                          (0, 1, 2)                         (0, 1, 2, 3, 1)

                                          (0, 1, 2, 3)
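
The depth sequences in the diagram grow by one node per step. A minimal sketch of that growth, assuming each step appends a node on the rightmost path of the tree (the function name `one_step_extensions` is hypothetical, and the support-based pruning is left out):

```python
def one_step_extensions(depths):
    """Append one node on the rightmost path of a preorder depth sequence."""
    # The new node may hang from any node on the rightmost path, so its
    # depth ranges from 1 (child of the root) to last depth + 1.
    return [depths + (d,) for d in range(1, depths[-1] + 2)]


level = [(0,)]
for _ in range(3):
    print(level)
    level = [ext for seq in level for ext in one_step_extensions(seq)]
```

For instance, (0, 1, 2) extends to (0, 1, 2, 1), (0, 1, 2, 2) and (0, 1, 2, 3), the three sequences shown to its right in the diagram.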




                                                                              13 / 30
Experimental results




TreeNat                   CMTreeMiner
   Unlabeled Trees           Labeled Trees
   Top-Down Subtrees         Induced Subtrees
   No Occurrences            Occurrences

                                                14 / 30
Closure Operator on Trees
    D: the finite input dataset of trees
    T : the (infinite) set of all trees

Definition
We define the following Galois connection pair:
    For finite A ⊆ D
         σ(A) is the set of trees in T that are subtrees of every tree in A

                           σ(A) = {t ∈ T | ∀ t′ ∈ A (t ⪯ t′)}
    For finite B ⊂ T
         τD(B) is the set of trees in D that are supertrees of every tree in B

                          τD(B) = {t′ ∈ D | ∀ t ∈ B (t ⪯ t′)}

Closure Operator
The composition ΓD = σ ◦ τD is a closure operator.
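
The two maps and their composition can be tried out on a toy example. The sketch below makes several assumptions: trees are unlabeled and unordered, encoded as sorted tuples of child encodings; the infinite set T is replaced by a small finite `universe`; and the names `tau`, `sigma` and `gamma` are chosen here to mirror the symbols.

```python
def is_subtree(s, t):
    """True if unordered tree s embeds into t with the root mapped to the root."""
    def match(i, used):
        if i == len(s):
            return True
        return any(
            j not in used and is_subtree(s[i], t[j]) and match(i + 1, used | {j})
            for j in range(len(t))
        )
    return len(s) <= len(t) and match(0, frozenset())


def tau(B, D):
    """Trees of the dataset D that are supertrees of every tree in B."""
    return [d for d in D if all(is_subtree(b, d) for b in B)]


def sigma(A, universe):
    """Trees of the finite universe that are subtrees of every tree in A."""
    return {t for t in universe if all(is_subtree(t, a) for a in A)}


def gamma(B, D, universe):
    """The composition sigma after tau."""
    return sigma(tau(B, D), universe)


# Toy dataset and universe (hypothetical).
D = [(((),),), ((), ((),)), ((), ())]
universe = {(), ((),), ((), ()), (((),),), ((), ((),))}

B = {((), ())}                      # a root with two leaf children
closure = gamma(B, D, universe)
print(closure)
print(B <= closure)                              # extensive
print(gamma(closure, D, universe) == closure)    # idempotent
```

The closure of B comes out as the two-leaf tree together with its subtrees, the same shape of result as in the lattice slides that follow.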
                                                                   15 / 30
Galois Lattice of closed set of trees




Figure: Galois lattice with nodes 1, 2, 3 (top row), 12, 13, 23
(middle row) and 123 (bottom).
                                        16 / 30
Galois Lattice of closed set of trees


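Figure: Galois lattice with nodes 1, 2, 3 (top row), 12, 13, 23
(middle row) and 123 (bottom); tree drawings not recoverable.

     B = {one tree}

     τD(B) = {two trees of D}

     ΓD(B) = σ ◦ τD(B) = {a tree and its subtrees}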
                                                                17 / 30
Algorithms

Algorithms
    Incremental: IncTreeNat
    Sliding Window: WinTreeNat
    Adaptive: AdaTreeNat, which uses ADWIN to monitor change

ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)
    On ratio of false positives and false negatives
    On the relation of the size of the current window and change
    rates
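
The adaptive-window idea can be sketched in a few lines. This is a simplified illustration, not the ADWIN data structure: it stores every element instead of a compressed window, rescans all split points on each update, and drops one element at a time. The threshold is the Hoeffding-style cut from the ADWIN paper as understood here; the class name `SimpleAdwin` and the default `delta` are assumptions.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive adaptive window over values in [0, 1] (illustration only)."""

    def __init__(self, delta=0.01):
        self.delta = delta
        self.window = deque()

    def _change_detected(self):
        n = len(self.window)
        total = sum(self.window)
        head_sum = 0.0
        # Try every split of the window into an older and a newer part.
        for n0, x in enumerate(self.window, start=1):
            head_sum += x
            n1 = n - n0
            if n1 == 0:
                break
            m = 1.0 / (1.0 / n0 + 1.0 / n1)      # harmonic-style size term
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) > eps_cut:
                return True
        return False

    def add(self, x):
        """Insert x; shrink the window while two parts disagree. Returns True on change."""
        self.window.append(x)
        changed = False
        while len(self.window) > 1 and self._change_detected():
            self.window.popleft()                # forget the oldest element
            changed = True
        return changed


detector = SimpleAdwin(delta=0.01)
flags = [detector.add(x) for x in [0.0] * 300 + [1.0] * 300]
print(any(flags[:300]), any(flags[300:]), len(detector.window))
```

On this synthetic stream the window grows while the data is stable and shrinks shortly after the jump, so it ends up holding almost only post-change values.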


                                                                   18 / 30
Experimental Validation: TN1



Figure: Experiments on ordered trees with TN1 dataset. Running time in
seconds (axis ticks 100 to 300) against dataset size in millions (axis
ticks 2 to 8), for CMTreeMiner and IncTreeNat.




                                                           19 / 30
What is MOA?

Massive Online Analysis (MOA) is a framework for online learning
from data streams.

    It is closely related to WEKA
    It includes a collection of offline and online methods as well as
    tools for evaluation:
         boosting and bagging
         Hoeffding Trees
    with and without Naïve Bayes classifiers at the leaves.




                                                                         20 / 30
WEKA: the bird




                 21 / 30
MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.




                                                                   22 / 30
Data stream classification cycle

1   Process an example at a
    time, and inspect it only
    once (at most)
2   Use a limited amount of
    memory
3   Work in a limited amount
    of time
4   Be ready to predict at any
    point
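
The four requirements map naturally onto a test-then-train loop. The sketch below is an assumption-laden illustration: `MajorityClass` is a deliberately trivial learner, and `prequential_accuracy` is a hypothetical helper, not MOA code. Each example is seen once, the learner keeps only a few counters, and it can predict before any training.

```python
from collections import Counter


class MajorityClass:
    """Predicts the most frequent label seen so far; memory is one counter."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Ready to predict at any point, even before the first example.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(learner, stream):
    """Test on each example first, then train on it; one pass, no storage."""
    correct = total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0


stream = [((i,), label) for i, label in enumerate("aabaaaba")]
accuracy = prequential_accuracy(MajorityClass(), iter(stream))
print(accuracy)
```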




                                           23 / 30
Environments and Data Sources


   Environments
      Sensor Network: 100 KB
      Handheld Computer: 32 MB
      Server: 400 MB

   Data Sources
      Random Tree Generator
      Random RBF Generator
      LED Generator
      Waveform Generator
      Function Generator


                                 24 / 30
Algorithms




Naive Bayes               Prediction strategies
Decision stumps               Majority class
Hoeffding Tree                Naive Bayes Leaves
Hoeffding Option Tree         Adaptive Hybrid
Bagging and Boosting




                                                   25 / 30
Hoeffding Option Tree
Hoeffding Option Trees
Regular Hoeffding tree containing additional option nodes that
allow several tests to be applied, leading to multiple Hoeffding
trees as separate paths.
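
A possible shape for such a tree, as a sketch only. The class names, the numeric threshold tests and the rule for combining paths (summing the class counts of every leaf reached) are assumptions made for illustration; MOA's actual option tree may combine predictions differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Leaf:
    counts: Counter          # class counts observed at this leaf

    def votes(self, x):
        return Counter(self.counts)


@dataclass
class SplitNode:
    attribute: str
    threshold: float
    left: object
    right: object

    def votes(self, x):
        # A regular test node follows exactly one branch.
        branch = self.left if x[self.attribute] <= self.threshold else self.right
        return branch.votes(x)


@dataclass
class OptionNode:
    options: list            # alternative subtrees, all of which are followed

    def votes(self, x):
        total = Counter()
        for option in self.options:
            total.update(option.votes(x))
        return total


root = OptionNode([
    SplitNode("a", 0.5, Leaf(Counter(pos=3, neg=1)), Leaf(Counter(neg=4))),
    SplitNode("b", 0.5, Leaf(Counter(neg=2)), Leaf(Counter(pos=5))),
])

x = {"a": 0.2, "b": 0.9}
combined = root.votes(x)
print(combined.most_common(1)[0][0])
```

The example instance travels down both options, so its prediction draws on two leaves instead of one.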




                                                                   26 / 30
GUI
java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.gui.TaskLauncher




                                               27 / 30
Ensemble Methods
   http://www.cs.waikato.ac.nz/~abifet/MOA/




New ensemble methods:
   ADWIN bagging: When a change is detected, the worst classifier
   is removed and a new classifier is added.
   Adaptive-Size Hoeffding Tree bagging
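
The replacement rule of ADWIN bagging can be sketched in isolation. Everything below is hypothetical scaffolding: `Member` only tracks an error rate, and the change signal is a plain boolean where the real method would consult ADWIN.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    mistakes: int = 0
    seen: int = 0

    def error_rate(self):
        return self.mistakes / self.seen if self.seen else 0.0


def replace_worst(ensemble, fresh):
    """Swap the member with the highest error rate for a fresh one."""
    worst = max(range(len(ensemble)), key=lambda i: ensemble[i].error_rate())
    removed, ensemble[worst] = ensemble[worst], fresh
    return removed


ensemble = [Member("h1", 10, 100), Member("h2", 40, 100), Member("h3", 15, 100)]

change_detected = True   # in ADWIN bagging this signal would come from ADWIN
if change_detected:
    removed = replace_worst(ensemble, Member("h4"))
    print(removed.name, [m.name for m in ensemble])
```

The ensemble size stays constant: one classifier leaves and one untrained classifier takes its slot.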

                                                                   28 / 30
XML Tree Framework on evolving data
               streams



                          Maximal                     Closed
           # Trees   Att.    Acc.    Mem.      Att.    Acc.    Mem.
CSLOG12     15483     84    79.64    1.2       228    78.12    2.54
CSLOG23     15037     88    79.81    1.21      243    78.77    2.75
CSLOG31     15702     86    79.94    1.25      243    77.60    2.73
CSLOG123    23111     84    80.02    1.7       228    78.91    4.18

           Table: BAGGING on unordered trees.
           Att. = attributes, Acc. = accuracy (%), Mem. = memory.




                                                              29 / 30
Conclusions

XML tree stream classifier system.


    Using Galois Lattice Theory, we present methods for mining
    closed trees
        Incremental
        Sliding Window
        Adaptive: using ADWIN to monitor change

    We use MOA data stream classifiers.




                                                                30 / 30

  • 2. Mining Evolving Massive Structured Data. The basic problem: finding interesting structure on data. Mining massive data; mining time varying data; mining on real time; mining XML data. (Image: The Disintegration of Persistence of Memory, 1952-54, Salvador Dalí.)
  • 3. XML Tree Classification on evolving data streams. Figure: A dataset example (four trees over the node labels A, B, C and D, labeled CLASS 1, CLASS 2, CLASS 1 and CLASS 2).
  • 4. Tree Pattern Mining. Given a dataset of trees, find the complete set of frequent subtrees. Frequent Tree Pattern (FT): include all the trees whose support is no less than min_sup. Closed Frequent Tree Pattern (CT): include no tree which has a super-tree with the same support. CT ⊆ FT. Quote: "Trees are sanctuaries. Whoever knows how to listen to them, can learn the truth." Hermann Hesse
  • 5. Mining Closed Frequent Trees. Our trees are: Labeled and Unlabeled; Ordered and Unordered. Our subtrees are: Induced; Top-down. Figure: two different ordered trees but the same unordered tree.
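The ordered/unordered distinction can be illustrated with a small sketch. It assumes unlabeled trees written as preorder depth sequences (the notation the example slide uses for A and B), and it picks one fixed representative per unordered tree by sorting sibling subtrees so that the largest depth sequence comes first. That ordering rule and the function names are choices made for this illustration, not something the slides specify.

```python
def parse(depths):
    """Build a nested tuple of children from a preorder depth sequence."""
    def build(i, d):
        children = []
        i += 1
        while i < len(depths) and depths[i] == d + 1:
            child, i = build(i, d + 1)
            children.append(child)
        return tuple(children), i

    tree, end = build(0, depths[0])
    if end != len(depths):
        raise ValueError("not a valid preorder depth sequence")
    return tree


def flatten(tree, d=0):
    """Inverse of parse: nested tuple back to a depth sequence (as a list)."""
    seq = [d]
    for child in tree:
        seq.extend(flatten(child, d + 1))
    return seq


def canonical(depths):
    """Depth sequence of one fixed representative of the unordered tree.

    Assumption of this sketch: siblings are sorted so that the subtree with
    the lexicographically largest depth sequence comes first.
    """
    def canon(tree):
        return tuple(sorted((canon(c) for c in tree), key=flatten, reverse=True))
    return tuple(flatten(canon(parse(depths))))


def same_unordered_tree(a, b):
    """True when two ordered trees are the same tree once sibling order is ignored."""
    return canonical(a) == canonical(b)


print(canonical((0, 1, 1, 2)))                          # (0, 1, 2, 1)
print(same_unordered_tree((0, 1, 1, 2), (0, 1, 2, 1)))  # True
```

Here (0, 1, 1, 2) and (0, 1, 2, 1) are different ordered trees but map to the same representative, so a miner working on unordered trees only needs to explore one of them.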
  • 6. A tale of two trees. Consider D = {A, B}, where A and B are the two trees drawn on the slide, and let min_sup = 2. Successive views of the slide show the frequent subtrees and then the closed subtrees (drawings omitted).
  • 9. XML Tree Classification on evolving data streams. Table with columns "Closed", "Freq. not Closed Trees" and "Tree Trans." (tree drawings omitted):
      Closed | Trans. 1  2  3  4
      c1     |        1  0  1  0
      c2     |        1  0  0  1
  • 10. XML Tree Classification on evolving data streams.
      Frequent Trees:
        Id | c1 f1^1 | c2 f2^1 f2^2 f2^3 | c3 f3^1 | c4 f4^1 f4^2 f4^3 f4^4 f4^5
        1  | 1  1    | 1  1    1    1    | 0  0    | 1  1    1    1    1    1
        2  | 0  0    | 0  0    0    0    | 1  1    | 1  1    1    1    1    1
        3  | 1  1    | 0  0    0    0    | 1  1    | 1  1    1    1    1    1
        4  | 0  0    | 1  1    1    1    | 1  1    | 1  1    1    1    1    1
      Closed Trees and Maximal Trees:
        Id | Closed: c1 c2 c3 c4 | Maximal: c1 c2 c3 | Class
        1  |         1  1  0  1  |          1  1  0  | CLASS 1
        2  |         0  0  1  1  |          0  0  1  | CLASS 2
        3  |         1  0  1  1  |          1  0  1  | CLASS 1
        4  |         0  1  1  1  |          0  1  1  | CLASS 2
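A frequent tree that is not closed has a super-tree with the same support, so it occurs in exactly the same transactions and its 0/1 column repeats the column of that closed tree. The sketch below illustrates this on the numbers from the frequent-tree table: dropping repeated columns from the 14 frequent-tree attributes leaves the 4 closed-tree attributes. The function name `distinct_columns` is made up for this example.

```python
# Rows copied from the frequent-tree table: 14 attributes per transaction,
# in the order c1 f1^1 | c2 f2^1 f2^2 f2^3 | c3 f3^1 | c4 f4^1 .. f4^5.
frequent_rows = {
    1: [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    3: [1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    4: [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}


def distinct_columns(rows):
    """Keep the first occurrence of every distinct attribute column."""
    ids = sorted(rows)
    columns = list(zip(*(rows[i] for i in ids)))  # one tuple per attribute
    kept = []
    for col in columns:
        if col not in kept:
            kept.append(col)
    # Rebuild one row per transaction from the surviving columns.
    return {i: [col[k] for col in kept] for k, i in enumerate(ids)}


closed_rows = distinct_columns(frequent_rows)
for tree_id, row in closed_rows.items():
    print(tree_id, row)
```

The four rows printed match the "Closed" columns of the second table, which is why feeding the classifier with closed trees only loses no information relative to all frequent trees.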
  • 11. XML Tree Framework on evolving data streams. XML Tree Classification Framework components: an XML closed frequent tree miner; a data stream classifier algorithm, which we will feed with tuples to be classified online.
  • 12. Mining Evolving Tree Data Streams. Problem: given a data stream D of rooted and unordered trees, find frequent closed trees. We provide three algorithms, of increasing power: Incremental, Sliding Window, Adaptive.
  • 15. Mining Closed Unordered Subtrees
      CLOSED_SUBTREES(t, D, min_sup, T)
        if not CANONICAL_REPRESENTATIVE(t)
           then return T
        for every t′ that can be extended from t in one step
           do if Support(t′) ≥ min_sup
                 then T ← CLOSED_SUBTREES(t′, D, min_sup, T)
              do if Support(t′) = Support(t)
                 then t is not closed
        if t is closed
           then insert t into T
        return T
  • 16. Example. D = {A, B}, A = (0, 1, 2, 3, 2, 1), B = (0, 1, 2, 3, 1, 2, 2), min_sup = 2. Trees shown: (0), (0, 1), (0, 1, 1), (0, 1, 2), (0, 1, 2, 1), (0, 1, 2, 2), (0, 1, 2, 2, 1), (0, 1, 2, 3), (0, 1, 2, 3, 1).
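The recursion of CLOSED_SUBTREES can be tried on the example D = {A, B} with a brute-force sketch. It makes several simplifying assumptions that are not taken from the slides: trees are unlabeled and unordered, subtrees are top-down (roots are matched and children are mapped injectively), support is recounted from scratch, and a `seen` set of already visited trees stands in for the CANONICAL_REPRESENTATIVE test. All function names are illustrative.

```python
def parse(depths):
    """Nested tuple of children from a preorder depth sequence."""
    def build(i, d):
        kids = []
        i += 1
        while i < len(depths) and depths[i] == d + 1:
            kid, i = build(i, d + 1)
            kids.append(kid)
        return tuple(kids), i
    return build(0, depths[0])[0]


def canon(t):
    """One fixed ordering of siblings, so equal unordered trees compare equal."""
    return tuple(sorted((canon(c) for c in t), reverse=True))


def flatten(t, d=0):
    """Nested tuple back to a depth sequence."""
    out = (d,)
    for c in t:
        out += flatten(c, d + 1)
    return out


def extensions(t):
    """All unordered trees obtained from t by adding one leaf anywhere."""
    result = {canon(t + ((),))}                      # new leaf under the root
    for i, child in enumerate(t):
        for ext in extensions(child):                # new leaf deeper down
            result.add(canon(t[:i] + (ext,) + t[i + 1:]))
    return result


def contains(big, small):
    """True if small is a top-down subtree of big (backtracking over children)."""
    def assign(k, used):
        if k == len(small):
            return True
        return any(
            j not in used and contains(big[j], small[k]) and assign(k + 1, used | {j})
            for j in range(len(big))
        )
    return assign(0, frozenset())


def support(t, data):
    return sum(contains(d, t) for d in data)


def closed_subtrees(t, data, min_sup, closed, seen):
    if t in seen:                 # stands in for the canonical-representative test
        return closed
    seen.add(t)
    sup, is_closed = support(t, data), True
    for ext in extensions(t):
        s = support(ext, data)
        if s >= min_sup:
            closed_subtrees(ext, data, min_sup, closed, seen)
        if s == sup:
            is_closed = False     # a one-step super-tree has the same support
    if is_closed:
        closed.add(t)
    return closed


A = parse((0, 1, 2, 3, 2, 1))
B = parse((0, 1, 2, 3, 1, 2, 2))
closed, seen = set(), set()
closed_subtrees((), [A, B], 2, closed, seen)        # () is the single-node tree
result = sorted(flatten(t) for t in closed)
print(len(seen), result)  # 9 [(0, 1, 2, 2, 1), (0, 1, 2, 3, 1)]
```

On this input the sketch visits nine frequent trees, the same nine depth sequences listed in the example, and reports two of them as closed. The exhaustive matching in `contains` is exponential in the worst case, so this is only suitable for toy inputs.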
  • 18. Experimental results. TreeNat: Unlabeled Trees; Top-Down Subtrees; No Occurrences. CMTreeMiner: Labeled Trees; Induced Subtrees; Occurrences.
  • 19. Closure Operator on Trees. D: the finite input dataset of trees. T: the (infinite) set of all trees. Definition: we define the following Galois connection pair, writing t ⪯ t′ when t is a subtree of t′. For finite A ⊆ D, σ(A) is the set of subtrees of the A trees in T: σ(A) = {t ∈ T | ∀ t′ ∈ A (t ⪯ t′)}. For finite B ⊂ T, τD(B) is the set of supertrees of the B trees in D: τD(B) = {t′ ∈ D | ∀ t ∈ B (t ⪯ t′)}. Closure Operator: the composition ΓD = σ ◦ τD is a closure operator.
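For reference, saying that ΓD is a closure operator means it satisfies the three standard properties below for all sets of trees B, B′ ⊆ T. The property names are the usual ones from lattice theory and are not spelled out on the slide.

```latex
\begin{align*}
B &\subseteq \Gamma_D(B) && \text{(extensivity)}\\
B \subseteq B' &\implies \Gamma_D(B) \subseteq \Gamma_D(B') && \text{(monotonicity)}\\
\Gamma_D(\Gamma_D(B)) &= \Gamma_D(B) && \text{(idempotency)}
\end{align*}
```

A set of trees B is then called closed when ΓD(B) = B, that is, when it already contains every common subtree of the dataset trees that contain all of B.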
  • 20. Galois Lattice of closed set of trees. Figure: lattice whose nodes are labeled 1, 2, 3, 12, 13, 23 and 123.
  • 23. Galois Lattice of closed set of trees. Example on the same lattice (tree drawings omitted): starting from a set of trees B, τD(B) is a set of two dataset trees, and ΓD(B) = σ ◦ τD(B) is a tree together with its subtrees.
  • 24. Algorithms. Incremental: IncTreeNat. Sliding Window: WinTreeNat. Adaptive: AdaTreeNat, which uses ADWIN to monitor change. ADWIN: an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems) on the ratio of false positives and false negatives, and on the relation between the size of the current window and change rates.
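The idea of a window that shrinks when change is observed can be sketched as follows. This is a naive illustration, not the ADWIN implementation: it checks every split of the window on each new value instead of using ADWIN's compressed bucket structure, and the threshold used (a Hoeffding-style bound with the harmonic mean of the two sub-window sizes and δ divided by the window length) is an assumption recalled from the ADWIN literature, not something stated on the slides. The class name `SimpleAdwin` is made up.

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive adaptive window over values in [0, 1] (illustrative only)."""

    def __init__(self, delta=0.01):
        self.delta = delta          # confidence parameter (assumed default)
        self.window = deque()

    def _has_cut(self):
        """True if some split old|recent has significantly different means."""
        n = len(self.window)
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)
        head = 0.0
        for n0, x in enumerate(list(self.window)[:-1], start=1):
            head += x
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)          # harmonic-mean term
            eps = math.sqrt(log_term / (2.0 * m))    # assumed cut threshold
            if abs(head / n0 - (total - head) / n1) >= eps:
                return True
        return False

    def add(self, x):
        """Insert x; drop the oldest values while a change is detected."""
        self.window.append(x)
        changed = False
        while len(self.window) > 1 and self._has_cut():
            self.window.popleft()
            changed = True
        return changed
```

Feeding this detector a run of 0.0 values followed by a run of 1.0 values keeps the window growing during the first run, then discards most of the old values shortly after the switch, which is the behaviour an adaptive miner relies on to forget outdated trees.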
  • 25. Experimental Validation: TN1. Figure: Experiments on ordered trees with TN1 dataset (time in seconds against size in millions, with curves for CMTreeMiner and IncTreeNat).
  • 26. What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA. It includes a collection of offline and online methods as well as tools for evaluation, including boosting and bagging, and Hoeffding Trees with and without Naïve Bayes classifiers at the leaves.
  • 27. WEKA: the bird
  • 28. MOA: the bird. The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.
  • 31. Data stream classification cycle. (1) Process an example at a time, and inspect it only once (at most). (2) Use a limited amount of memory. (3) Work in a limited amount of time. (4) Be ready to predict at any point.
  • 32. Environments and Data Sources. Environments: Sensor Network: 100Kb; Handheld Computer: 32 Mb; Server: 400 Mb. Data Sources: Random Tree Generator; Random RBF Generator; LED Generator; Waveform Generator; Function Generator.
  • 33. Algorithms: Naive Bayes; Decision stumps; Hoeffding Tree; Hoeffding Option Tree; Bagging and Boosting. Prediction strategies: Majority class; Naive Bayes Leaves; Adaptive Hybrid.
  • 34. Hoeffding Option Tree. Hoeffding Option Trees: a regular Hoeffding tree containing additional option nodes that allow several tests to be applied, leading to multiple Hoeffding trees as separate paths.
  • 37. Ensemble Methods. http://www.cs.waikato.ac.nz/~abifet/MOA/ New ensemble methods: ADWIN bagging (when a change is detected, the worst classifier is removed and a new classifier is added); Adaptive-Size Hoeffding Tree bagging.
  • 38. XML Tree Framework on evolving data streams. Table: BAGGING on unordered trees.
                           Maximal                Closed
      Dataset   # Trees    Att.  Acc.   Mem.      Att.  Acc.   Mem.
      CSLOG12   15483      84    79.64  1.2       228   78.12  2.54
      CSLOG23   15037      88    79.81  1.21      243   78.77  2.75
      CSLOG31   15702      86    79.94  1.25      243   77.60  2.73
      CSLOG123  23111      84    80.02  1.7       228   78.91  4.18
  • 39. Conclusions. An XML tree stream classifier system. Using Galois Lattice Theory, we present methods for mining closed trees: Incremental; Sliding Window; Adaptive, using ADWIN to monitor change. We use MOA data stream classifiers.