SlideShare a Scribd company logo
1 of 72
Download to read offline
Mining Frequent Closed Graphs on Evolving Data Streams



                                                         `
         A. Bifet, G. Holmes, B. Pfahringer and R. Gavalda
                                 University of Waikato
                                Hamilton, New Zealand


         Laboratory for Relational Algorithmics, Complexity and Learning LARCA
                              UPC-Barcelona Tech, Catalonia




                 San Diego, 24 August 2011
         17th ACM SIGKDD International Conference on
          Knowledge Discovery and Data Mining 2011
Mining Evolving Graph Data Streams
Problem
Given a data stream D of graphs, find frequent closed graphs.


Transaction Id      Graph        We provide three algorithms,
                       O         of increasing power
                 C C S N             Incremental
       1                O            Sliding Window
                        O            Adaptive
                 C C S N
       2                C
                        N
       3         C C S N

                                                                2 / 48
Non-incremental
    Frequent Closed Graph Mining


CloseGraph: Xifeng Yan, Jiawei Han
   right-most extension based on depth-first search
   based on gSpan ICDM’02

MoSS: Christian Borgelt, Michael R. Berthold
   breadth-first search
   based on MoFa ICDM’02




                                                     3 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      4 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      5 / 48
Mining Massive Data




Source: IDC’s Digital Universe Study (EMC), June 2011

                                                        6 / 48
Data Streams

Data Streams
   Sequence is potentially infinite
   High amount of data
   High speed of arrival
   Once an element from a data stream has been processed
   it is discarded or archived

Tools:
   approximation
   randomization, sampling
   sketching



                                                           7 / 48
Data Streams Approximation Algorithms

           1011000111 1010101

 Sliding Window
 We can maintain simple statistics over sliding windows, using
 O ( 1 log2 N ) space, where
     ε
     N is the length of the sliding window
     ε is the accuracy parameter

     M. Datar, A. Gionis, P. Indyk, and R. Motwani.
     Maintaining stream statistics over sliding windows. 2002




                                                                 8 / 48
Data Streams Approximation Algorithms

           10110001111 0101011

 Sliding Window
 We can maintain simple statistics over sliding windows, using
 O ( 1 log2 N ) space, where
     ε
     N is the length of the sliding window
     ε is the accuracy parameter

     M. Datar, A. Gionis, P. Indyk, and R. Motwani.
     Maintaining stream statistics over sliding windows. 2002




                                                                 8 / 48
Data Streams Approximation Algorithms

           101100011110 1010111

 Sliding Window
 We can maintain simple statistics over sliding windows, using
 O ( 1 log2 N ) space, where
     ε
     N is the length of the sliding window
     ε is the accuracy parameter

     M. Datar, A. Gionis, P. Indyk, and R. Motwani.
     Maintaining stream statistics over sliding windows. 2002




                                                                 8 / 48
Data Streams Approximation Algorithms

           1011000111101 0101110

 Sliding Window
 We can maintain simple statistics over sliding windows, using
 O ( 1 log2 N ) space, where
     ε
     N is the length of the sliding window
     ε is the accuracy parameter

     M. Datar, A. Gionis, P. Indyk, and R. Motwani.
     Maintaining stream statistics over sliding windows. 2002




                                                                 8 / 48
Data Streams Approximation Algorithms

           10110001111010 1011101

 Sliding Window
 We can maintain simple statistics over sliding windows, using
 O ( 1 log2 N ) space, where
     ε
     N is the length of the sliding window
     ε is the accuracy parameter

     M. Datar, A. Gionis, P. Indyk, and R. Motwani.
     Maintaining stream statistics over sliding windows. 2002




                                                                 8 / 48
Data Streams Approximation Algorithms

           101100011110101 0111010

 Sliding Window
 We can maintain simple statistics over sliding windows, using
 O ( 1 log2 N ) space, where
     ε
     N is the length of the sliding window
     ε is the accuracy parameter

     M. Datar, A. Gionis, P. Indyk, and R. Motwani.
     Maintaining stream statistics over sliding windows. 2002




                                                                 8 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      9 / 48
Pattern Mining


Dataset Example
Document   Patterns
   d1        abce
   d2         cde
   d3        abce
   d4        acde
   d5       abcde
   d6         bcd




                            10 / 48
Itemset Mining


d1   abce          Support          Frequent
d2   cde       d1,d2,d3,d4,d5,d6         c
d3   abce       d1,d2,d3,d4,d5         e,ce
d4   acde         d1,d3,d4,d5      a,ac,ae,ace
d5   abcde        d1,d3,d5,d6          b,bc
d6   bcd           d2,d4,d5            d,cd
                   d1,d3,d5        ab,abc,abe
                                   be,bce,abce
                   d2,d4,d5          de,cde




                                                 11 / 48
Itemset Mining


d1   abce      Support    Frequent
d2   cde          6            c
d3   abce         5          e,ce
d4   acde         4      a,ac,ae,ace
d5   abcde        4          b,bc
d6   bcd          4          d,cd
                  3      ab,abc,abe
                         be,bce,abce
                 3         de,cde




                                       12 / 48
Itemset Mining


d1   abce    Support    Frequent     Gen   Closed
d2   cde        6            c         c      c
d3   abce       5          e,ce        e     ce
d4   acde       4      a,ac,ae,ace     a    ace
d5   abcde      4          b,bc        b     bc
d6   bcd        4          d,cd        d     cd
                3      ab,abc,abe     ab
                       be,bce,abce    be    abce
               3         de,cde       de    cde




                                                    12 / 48
Itemset Mining


d1   abce    Support    Frequent     Gen   Closed   Max
d2   cde        6            c         c      c
d3   abce       5          e,ce        e     ce
d4   acde       4      a,ac,ae,ace     a    ace
d5   abcde      4          b,bc        b     bc
d6   bcd        4          d,cd        d     cd
                3      ab,abc,abe     ab
                       be,bce,abce    be    abce    abce
               3         de,cde       de    cde     cde




                                                           12 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      13 / 48
Graph Dataset
Transaction Id    Graph     Weight
                     O
                 C C S N
      1                 O     1
                        O
                 C C S N
      2                 C     1
                   O
                  C S N
      3             C         1
                        N
      4          C C S N      1

                                     14 / 48
Graph Coresets

Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
    Solving the problem for the coreset provides an
    approximate solution for the problem on P.

δ -tolerance Closed Graph
A graph g is δ -tolerance closed if none of its proper frequent
supergraphs has a weighted support ≥ (1 − δ ) · support (g ).
    Maximal graph: 1-tolerance closed graph
    Closed graph: 0-tolerance closed graph.




                                                                  15 / 48
Graph Coresets

Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
    Solving the problem for the coreset provides an
    approximate solution for the problem on P.

δ -tolerance Closed Graph
A graph g is δ -tolerance closed if none of its proper frequent
supergraphs has a weighted support ≥ (1 − δ ) · support (g ).
    Maximal graph: 1-tolerance closed graph
    Closed graph: 0-tolerance closed graph.




                                                                  15 / 48
Graph Coresets

Relative support of a closed graph
Support of a graph minus the relative support of its closed
supergraphs.
    The sum of the closed supergraphs’ relative supports of a
    graph and its relative support is equal to its own support.

(s, δ )-coreset for the problem of computing closed
graphs
Weighted multiset of frequent δ -tolerance closed graphs with
minimum support s using their relative support as a weight.




                                                                  16 / 48
Graph Coresets

Relative support of a closed graph
Support of a graph minus the relative support of its closed
supergraphs.
    The sum of the closed supergraphs’ relative supports of a
    graph and its relative support is equal to its own support.

(s, δ )-coreset for the problem of computing closed
graphs
Weighted multiset of frequent δ -tolerance closed graphs with
minimum support s using their relative support as a weight.




                                                                  16 / 48
Graph Dataset
Transaction Id    Graph     Weight
                     O
                 C C S N
      1                 O     1
                        O
                 C C S N
      2                 C     1
                   O
                  C S N
      3             C         1
                        N
      4          C C S N      1

                                     17 / 48
Graph Coresets

           Graph         Relative Support    Support
         C C S N                 3              3
               O
           C S N                 3              3
                   N
             C S                 3              3
Table: Example of a coreset with minimum support 50% and δ = 1




                                                                 18 / 48
Graph Coresets




Figure: Number of graphs in a (40%, δ )-coreset for NCI.



                                                           19 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      20 / 48
I NC G RAPH M INER


I NC G RAPH M INER (D, min sup )
    Input: A graph dataset D, and min sup.
    Output: The frequent graph set G.

1   G←0   /
2   for every batch bt of graphs in D
3        do C ← C ORESET (bt , min sup )
4           G ← C ORESET (G ∪ C, min sup )
5   return G




                                             21 / 48
W IN G RAPH M INER
W IN G RAPH M INER (D, W , min sup )
    Input: A graph dataset D, a size window W and min sup.
    Output: The frequent graph set G.

1   G←0   /
2   for every batch bt of graphs in D
3        do C ← C ORESET (bt , min sup )
4           Store C in sliding window
5           if sliding window is full
6              then R ← Oldest C stored in sliding window,
                         negate all support values
7              else R ← 0 /
8           G ← C ORESET (G ∪ C ∪ R, min sup )
9   return G


                                                             22 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 1



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 1 W1 = 01010110111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 10 W1 = 1010110111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 101 W1 = 010110111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 1010 W1 = 10110111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 10101 W1 = 0110111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 101010 W1 = 110111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 1010101 W1 = 10111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 10101011 W1 = 0111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111 |µW0 − µW1 | ≥ εc : CHANGE DET.!
                      ˆ    ˆ
W0 = 101010110 W1 = 111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111 Drop elements from the tail of W
W0 = 101010110 W1 = 111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Example
W = 01010110111111 Drop elements from the tail of W
W0 = 101010110 W1 = 111111



ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t > 0
3       do W ← W ∪ {xt } (i.e., add xt to the head of W )
4           repeat Drop elements from the tail of W
5             until |µW0 − µW1 | ≥ εc holds
                      ˆ    ˆ
6               for every split of W into W = W0 · W1
7           Output µWˆ



                                                            23 / 48
Algorithm ADaptive Sliding WINdow
Theorem
At every time step we have:
 1   (False positive rate bound). If µt remains constant within
     W , the probability that ADWIN shrinks the window at this
     step is at most δ .
 2   (False negative rate bound). Suppose that for some
     partition of W in two parts W0 W1 (where W1 contains the
     most recent items) we have |µW0 − µW1 | > 2εc . Then with
     probability 1 − δ ADWIN shrinks W to W1 , or shorter.

ADWIN tunes itself to the data stream at hand, with no need for
the user to hardwire or precompute parameters.



                                                                  24 / 48
Algorithm ADaptive Sliding WINdow
ADWIN using a Data Stream Sliding Window Model,
    can provide the exact counts of 1’s in O (1) time per point.
    tries O (log W ) cutpoints
    uses O ( 1 log W ) memory words
             ε
    the processing time per example is O (log W ) (amortized
    and worst-case).

                     Sliding Window Model

            1010101 101 11 1 1
Content:         4      2        2   1 1
Capacity:        7      3        2   1 1




                                                                   25 / 48
A DAG RAPH M INER
A DAG RAPH M INER (D, Mode, min sup )
 1 G←0    /
 2 Init ADWIN
 3 for every batch bt of graphs in D
 4       do C ← C ORESET (bt , min sup )
 5          R←0   /
 6          if Mode is Sliding Window
 7             then Store C in sliding window
 8                  if ADWIN detected change
 9                     then R ← Batches to remove
                                in sliding window
                                with negative support
10          G ← C ORESET (G ∪ C ∪ R, min sup )
11          if Mode is Sliding Window
12             then Insert # closed graphs into ADWIN
13             else for every g in G update g’s ADWIN
14 return G

                                                        26 / 48
A DAG RAPH M INER
A DAG RAPH M INER (D, Mode, min sup )
 1 G←0    /
 2 Init ADWIN
 3 for every batch bt of graphs in D
 4       do C ← C ORESET (bt , min sup )
 5          R←0 /
 6
 7
 8
 9


10        G ← C ORESET (G ∪ C ∪ R, min sup )
11
12
13         for every g in G update g’s ADWIN
14 return G

                                               26 / 48
A DAG RAPH M INER
A DAG RAPH M INER (D, Mode, min sup )
 1 G←0    /
 2 Init ADWIN
 3 for every batch bt of graphs in D
 4       do C ← C ORESET (bt , min sup )
 5          R←0   /
 6          if Mode is Sliding Window
 7             then Store C in sliding window
 8                  if ADWIN detected change
 9                     then R ← Batches to remove
                                in sliding window
                                with negative support
10          G ← C ORESET (G ∪ C ∪ R, min sup )
11          if Mode is Sliding Window
12             then Insert # closed graphs into ADWIN
13
14 return G

                                                        26 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      27 / 48
Experimental Evaluation

ChemDB dataset
   Public dataset
   4 million molecules
   Institute for Genomics and Bioinformatics at the University
   of California, Irvine

Open NCI Database
   Public domain
   250,000 structures
   National Cancer Institute




                                                                 28 / 48
What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online
learning from data streams.




    It is closely related to WEKA
    It includes a collection of offline and online as well as tools
    for evaluation:
        classification
        clustering
    Easy to extend
    Easy to design and run experiments


                                                                     29 / 48
WEKA: the bird




                 30 / 48
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.




                                                                   31 / 48
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.




                                                                   31 / 48
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like the
                  Weka, but also extinct.




                                                                   31 / 48
Classification Experimental Setting




                                     32 / 48
Classification Experimental Setting
Classifiers
   Naive Bayes
   Decision stumps
   Hoeffding Tree
   Hoeffding Option Tree
   Bagging and Boosting
   ADWIN Bagging and
   Leveraging Bagging

Prediction strategies
   Majority class
   Naive Bayes Leaves
   Adaptive Hybrid
                                        33 / 48
Clustering Experimental Setting




                                  34 / 48
Clustering Experimental Setting

Clusterers
   StreamKM++
   CluStream
   ClusTree
   Den-Stream
   CobWeb




                                      35 / 48
Clustering Experimental Setting
  Internal measures              External measures
  Gamma                          Rand statistic
  C Index                        Jaccard coefficient
  Point-Biserial                 Folkes and Mallow Index
  Log Likelihood                 Hubert Γ statistics
  Dunn’s Index                   Minkowski score
  Tau                            Purity
  Tau A                          van Dongen criterion
  Tau C                          V-measure
  Somer’s Gamma                  Completeness
  Ratio of Repetition            Homogeneity
  Modified Ratio of Repetition    Variation of information
  Adjusted Ratio of Clustering   Mutual information
  Fagan’s Index                  Class-based entropy
  Deviation Index                Cluster-based entropy
  Z-Score Index                  Precision
  D Index                        Recall
  Silhouette coefficient          F-measure
Table: Internal and external clustering evaluation measures.
                                                               36 / 48
Cluster Mapping Measure

   Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas
   Seidl, Albert Bifet, Geoff Holmes and Bernhard Pfahringer.
   An Effective Evaluation Measure for Clustering on Evolving
   Data Stream
   KDD’11

CMM: Cluster Mapping Measure
A novel evaluation measure for stream clustering on evolving
streams

                                ∑o∈F w (o ) · pen(o, C )
        CMM (C , C L ) = 1 −
                               ∑o∈F w (o ) · con(o, Cl (o ))



                                                                37 / 48
Extensions of MOA




    Multi-label Classification
    Active Learning
    Regression
    Closed Frequent Graph Mining
    Twitter Sentiment Analysis

Challenges for bigger data streams
Sampling and distributed systems (Map-Reduce, Hadoop, S4)

                                                            38 / 48
Open NCI dataset

                                         Time NCI Dataset

          120
          100
           80
Seconds




           60
           40
           20
            0
               00

                     00

                              00

                                    00

                                             00

                                                       0

                                                                0

                                                                         0

                                                                                  0

                                                                                           0

                                                                                                    0

                                                                                                             0

                                                                                                                      0
                                                   00

                                                            00

                                                                     00

                                                                              00

                                                                                       00

                                                                                                00

                                                                                                         00

                                                                                                                  00
           .0

                    .0

                          .0

                                   .0

                                         .0

                                                  0.

                                                           0.

                                                                    0.

                                                                             0.

                                                                                      0.

                                                                                               0.

                                                                                                        0.

                                                                                                                 0.
          10

                30

                         50

                               70

                                        90

                                              11

                                                       13

                                                                15

                                                                         17

                                                                                  19

                                                                                           21

                                                                                                    23

                                                                                                             25
                                                           Instances
                         IncGraphMiner             IncGraphMiner-C                    MoSS              closeGraph




                                                                                                                          39 / 48
Open NCI dataset

                                          Memory NCI Dataset

            9000
            8000
            7000
Megabytes




            6000
            5000
            4000
            3000
            2000
            1000
               0
                 00

                        00

                                00

                                       00

                                               00

                                                         0

                                                                  0

                                                                           0

                                                                                    0

                                                                                             0

                                                                                                      0

                                                                                                               0

                                                                                                                        0
                                                      00

                                                              00

                                                                       00

                                                                                00

                                                                                         00

                                                                                                  00

                                                                                                           00

                                                                                                                    00
              .0

                      .0

                             .0

                                     .0

                                            .0

                                                    0.

                                                             0.

                                                                      0.

                                                                               0.

                                                                                        0.

                                                                                                 0.

                                                                                                          0.

                                                                                                                   0.
            10

                   30

                           50

                                  70

                                          90

                                                 11

                                                         13

                                                                  15

                                                                           17

                                                                                    19

                                                                                             21

                                                                                                      23

                                                                                                               25
                                                             Instances
                           IncGraphMiner              IncGraphMiner-C                   MoSS              closeGraph




                                                                                                                            40 / 48
Megabytes
                                           10
                                             .




                                                         0
                                                      5000
                                                     10000
                                                     15000
                                                     20000
                                                     25000
                                                     30000
                                                     35000
                                                     40000
                                                     45000
                                          24 000     50000
                                            0.
                                          47 000
                                            0.
                                          70 000
                                            0.
                                          93 000




          IncGraphMiner
                                        1. 0. 0
                                          16 00
                                        1. 0.0
                                          39 00
                                        1. 0.0
                                          62 00
                                        1. 0.0
                                          85 00
                                        2. 0.0
                                          08 00
                                        2. 0.0
                                          31 00

          IncGraphMiner-C
                                        2. 0.0
                                          54 00
                            Instances

                                        2. 0.0
                                          77 00
                                                                  Memory ChemDB Dataset




                                        3. 0.0
          MoSS


                                          00 00
                                        3. 0.0
                                          23 00
                                                                                          ChemDB dataset




                                        3. 0.0
                                          46 00
                                        3. 0.0
                                          69 00
                                        3. 0.0
                                          92 00
          closeGraph




                                            0.
                                              00
                                                 0
41 / 48
Seconds
                                           10
                                             .




                                                        0
                                                      500
                                                     1000
                                                     1500
                                                     2000
                                                     2500
                                                     3000
                                                     3500
                                                     4000
                                                     4500
                                                     5000
                                          24 000
                                            0.
                                          47 000
                                            0.
                                          70 000
                                            0.
                                          93 000
                                        1. 0. 0
                                          16 00
                                        1. 0.0
                                          39 00




          IncGraphMiner
                                        1. 0.0
                                          62 00
                                        1. 0.0
                                          85 00
                                        2. 0.0
                                          08 00
                                        2. 0.0
                                          31 00
                                        2. 0.0
                                          54 00
                            Instances
          IncGraphMiner-C
                                        2. 0.0
                                          77 00
                                                                Time ChemDB Dataset




                                        3. 0.0
                                          00 00
                                        3. 0.0
                                          23 00
          MoSS
                                                                                      ChemDB dataset




                                        3. 0.0
                                          46 00
                                        3. 0.0
                                          69 00
                                        3. 0.0
                                          92 00
                                            0.
                                              00
                                                 0
          closeGraph




42 / 48
Number of Closed Graphs
                                              10
                                                .0




                                                         0
                                                             10
                                                                  20
                                                                        30
                                                                              40
                                                                                    50
                                                                                         60

                                                    0
                                              60 0
                                                .0
                                             11 00
                                               0.
                                                  0
                                             16 00
                                               0.
                                                  00
                                             21 0
                                               0.
                                                  0
                                             26 00
                                               0.
                                                  0
                                             31 00
                                               0.
                                                  0
                                             36 00
                                               0.
                                                  0




          ADAGRAPHMINER
                                             41 00
                                               0.
                                                  0
                                             46 00
                                               0.
                                                  0
                                             51 00
                                               0.
                                                  0
                                             56 00
                                               0.
                                                  00
                                 Instances   61 0
                                               0.
                                                  0
                                             66 00
                                               0.
                                                  0
                                             71 00
                                               0.
                                                  0
                                             76 00
          ADAGRAPHMINER-Window




                                               0.
                                                  0
                                             81 00
                                               0.
                                                                                              A DAG RAPH M INER




                                                  0
                                             86 00
                                               0.
                                                  0
                                             91 00
                                               0.
                                                  0
                                             96 00
                                               0.
                                                  00
          IncGraphMiner




                                                     0
43 / 48
Outline




1   Streaming Data
2   Frequent Pattern Mining
      Mining Evolving Graph Streams
      A DAG RAPH M INER
3   Experimental Evaluation
4   Summary and Future Work


                                      44 / 48
Summary
Mining Evolving Graph Data Streams
Given a data stream D of graphs, find frequent closed graphs.


Transaction Id      Graph        We provide three algorithms,
                       O         of increasing power
                 C C S N             Incremental
       1                O            Sliding Window
                        O            Adaptive
                 C C S N
       2                C
                        N
       3         C C S N

                                                                45 / 48
Summary
{M}assive {O}nline {A}nalysis is a framework for online
learning from data streams.




                       http://moa.cs.waikato.ac.nz

    It is closely related to WEKA
    It includes a collection of offline and online as well as tools
    for evaluation:
        classification
        clustering
        frequent pattern mining
    MOA deals with evolving data streams
    MOA is easy to use and extend


                                                                     46 / 48
Future work




Structured massive data mining
   sampling, sketching
   parallel techniques




                                  47 / 48
Thanks!




          48 / 48

More Related Content

What's hot

Leveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsLeveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsAlbert Bifet
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEHONGJOO LEE
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker DiarizationHONGJOO LEE
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịHong Ong
 
PAC Bayesian for Deep Learning
PAC Bayesian for Deep LearningPAC Bayesian for Deep Learning
PAC Bayesian for Deep LearningMark Chang
 
PAC-Bayesian Bound for Deep Learning
PAC-Bayesian Bound for Deep LearningPAC-Bayesian Bound for Deep Learning
PAC-Bayesian Bound for Deep LearningMark Chang
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...BigMine
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine LearningFabian Pedregosa
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupSri Ambati
 
Domain Adaptation
Domain AdaptationDomain Adaptation
Domain AdaptationMark Chang
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19GoDataDriven
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNetAI Frontiers
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationFeynman Liang
 
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...Tomonari Masada
 

What's hot (19)

Leveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsLeveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data Streams
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thịDistance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
Distance oracle - Truy vấn nhanh khoảng cách giữa hai điểm bất kỳ trên đồ thị
 
PAC Bayesian for Deep Learning
PAC Bayesian for Deep LearningPAC Bayesian for Deep Learning
PAC Bayesian for Deep Learning
 
PAC-Bayesian Bound for Deep Learning
PAC-Bayesian Bound for Deep LearningPAC-Bayesian Bound for Deep Learning
PAC-Bayesian Bound for Deep Learning
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
Distributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta MeetupDistributed GLM with H2O - Atlanta Meetup
Distributed GLM with H2O - Atlanta Meetup
 
Domain Adaptation
Domain AdaptationDomain Adaptation
Domain Adaptation
 
Python classes in mumbai
Python classes in mumbaiPython classes in mumbai
Python classes in mumbai
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19Sea Amsterdam 2014 November 19
Sea Amsterdam 2014 November 19
 
Codes and Isogenies
Codes and IsogeniesCodes and Isogenies
Codes and Isogenies
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Scaling Deep Learning with MXNet
Scaling Deep Learning with MXNetScaling Deep Learning with MXNet
Scaling Deep Learning with MXNet
 
Accelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference CompilationAccelerating Metropolis Hastings with Lightweight Inference Compilation
Accelerating Metropolis Hastings with Lightweight Inference Compilation
 
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Al...
 

Viewers also liked

Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...
Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...
Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...PRORP México
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseCédric Fauvet
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Seattle DAML meetup
 
Frequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigDataFrequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigDataRaju Gupta
 
Survey on Frequent Pattern Mining on Graph Data - Slides
Survey on Frequent Pattern Mining on Graph Data - SlidesSurvey on Frequent Pattern Mining on Graph Data - Slides
Survey on Frequent Pattern Mining on Graph Data - SlidesKasun Gajasinghe
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Salah Amean
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like YouSalford Systems
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methodsProf.Nilesh Magar
 

Viewers also liked (10)

Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...
Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...
Comunicación en Redes Sociales y Políticas Corporativas por IBM México en 5° ...
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
 
Frequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigDataFrequent Itemset Mining(FIM) on BigData
Frequent Itemset Mining(FIM) on BigData
 
Survey on Frequent Pattern Mining on Graph Data - Slides
Survey on Frequent Pattern Mining on Graph Data - SlidesSurvey on Frequent Pattern Mining on Graph Data - Slides
Survey on Frequent Pattern Mining on Graph Data - Slides
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 

Similar to Mining Frequent Closed Graphs on Evolving Data Streams

Lecture09-SQ-P2.pdf
Lecture09-SQ-P2.pdfLecture09-SQ-P2.pdf
Lecture09-SQ-P2.pdfssuserb4d806
 
Publishing consuming Linked Sensor Data meetup Cuenca
Publishing consuming Linked Sensor Data meetup CuencaPublishing consuming Linked Sensor Data meetup Cuenca
Publishing consuming Linked Sensor Data meetup CuencaJean-Paul Calbimonte
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideWooSung Choi
 
ieee nss mic 2016 poster N30-21
ieee nss mic 2016 poster N30-21ieee nss mic 2016 poster N30-21
ieee nss mic 2016 poster N30-21Dae Woon Kim
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomicsUSC
 
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataSpectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataRobert Oostenveld
 
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...Ravi Kiran B.
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceDavid Gleich
 
A robust blind and secure watermarking scheme using positive semi definite ma...
A robust blind and secure watermarking scheme using positive semi definite ma...A robust blind and secure watermarking scheme using positive semi definite ma...
A robust blind and secure watermarking scheme using positive semi definite ma...ijcsit
 
Particle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer VisionParticle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer Visionzukun
 
Dna computing
Dna computingDna computing
Dna computingsathish3
 
Reverse Engineering for additive manufacturing
Reverse Engineering for additive manufacturingReverse Engineering for additive manufacturing
Reverse Engineering for additive manufacturingAchmadRifaie4
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Saad Liaqat
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionJordan McBain
 
Process Mining: Understanding and Improving Desire Lines in Big Data
Process Mining: Understanding and Improving Desire Lines in Big DataProcess Mining: Understanding and Improving Desire Lines in Big Data
Process Mining: Understanding and Improving Desire Lines in Big DataWil van der Aalst
 
Some thoughts on 2 d 3-d information processing
Some thoughts on 2 d 3-d information processingSome thoughts on 2 d 3-d information processing
Some thoughts on 2 d 3-d information processingCurvSurf
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts TheContentMine
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for factsMining Scientific Diagrams for facts
Mining Scientific Diagrams for factspetermurrayrust
 

Similar to Mining Frequent Closed Graphs on Evolving Data Streams (20)

Lecture09-SQ-P2.pdf
Lecture09-SQ-P2.pdfLecture09-SQ-P2.pdf
Lecture09-SQ-P2.pdf
 
Publishing consuming Linked Sensor Data meetup Cuenca
Publishing consuming Linked Sensor Data meetup CuencaPublishing consuming Linked Sensor Data meetup Cuenca
Publishing consuming Linked Sensor Data meetup Cuenca
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
ieee nss mic 2016 poster N30-21
ieee nss mic 2016 poster N30-21ieee nss mic 2016 poster N30-21
ieee nss mic 2016 poster N30-21
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomics
 
3. Synthesis.pptx
3. Synthesis.pptx3. Synthesis.pptx
3. Synthesis.pptx
 
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG dataSpectral-, source-, connectivity- and network analysis of EEG and MEG data
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
 
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
AUTO AI 2021 talk Real world data augmentations for autonomous driving : B Ra...
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
A robust blind and secure watermarking scheme using positive semi definite ma...
A robust blind and secure watermarking scheme using positive semi definite ma...A robust blind and secure watermarking scheme using positive semi definite ma...
A robust blind and secure watermarking scheme using positive semi definite ma...
 
Particle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer VisionParticle Filters and Applications in Computer Vision
Particle Filters and Applications in Computer Vision
 
Dna computing
Dna computingDna computing
Dna computing
 
Reverse Engineering for additive manufacturing
Reverse Engineering for additive manufacturingReverse Engineering for additive manufacturing
Reverse Engineering for additive manufacturing
 
Mit15 082 jf10_lec01
Mit15 082 jf10_lec01Mit15 082 jf10_lec01
Mit15 082 jf10_lec01
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
Process Mining: Understanding and Improving Desire Lines in Big Data
Process Mining: Understanding and Improving Desire Lines in Big DataProcess Mining: Understanding and Improving Desire Lines in Big Data
Process Mining: Understanding and Improving Desire Lines in Big Data
 
Some thoughts on 2 d 3-d information processing
Some thoughts on 2 d 3-d information processingSome thoughts on 2 d 3-d information processing
Some thoughts on 2 d 3-d information processing
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for factsMining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
 

More from Albert Bifet

Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream miningAlbert Bifet
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 Albert Bifet
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersAlbert Bifet
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Multi-label Classification with Meta-labels
Multi-label Classification with Meta-labelsMulti-label Classification with Meta-labels
Multi-label Classification with Meta-labelsAlbert Bifet
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themAlbert Bifet
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsPAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsAlbert Bifet
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataAlbert Bifet
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsAlbert Bifet
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsAlbert Bifet
 
MOA : Massive Online Analysis
MOA : Massive Online AnalysisMOA : Massive Online Analysis
MOA : Massive Online AnalysisAlbert Bifet
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streamsAlbert Bifet
 
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Albert Bifet
 
Adaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data StreamsAdaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data StreamsAlbert Bifet
 
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAdaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAlbert Bifet
 
Mining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed TreesMining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed TreesAlbert Bifet
 

More from Albert Bifet (20)

Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
Efficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream ClassifiersEfficient Online Evaluation of Big Data Stream Classifiers
Efficient Online Evaluation of Big Data Stream Classifiers
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache Flink
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Multi-label Classification with Meta-labels
Multi-label Classification with Meta-labelsMulti-label Classification with Meta-labels
Multi-label Classification with Meta-labels
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid them
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsPAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming Data
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data Streams
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
MOA : Massive Online Analysis
MOA : Massive Online AnalysisMOA : Massive Online Analysis
MOA : Massive Online Analysis
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streams
 
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
 
Adaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data StreamsAdaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data Streams
 
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAdaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent Patterns
 
Mining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed TreesMining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed Trees
 

Recently uploaded

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Recently uploaded (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Mining Frequent Closed Graphs on Evolving Data Streams

  • 1. Mining Frequent Closed Graphs on Evolving Data Streams ` A. Bifet, G. Holmes, B. Pfahringer and R. Gavalda University of Waikato Hamilton, New Zealand Laboratory for Relational Algorithmics, Complexity and Learning LARCA UPC-Barcelona Tech, Catalonia San Diego, 24 August 2011 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2011
  • 2. Mining Evolving Graph Data Streams Problem Given a data stream D of graphs, find frequent closed graphs. Transaction Id Graph We provide three algorithms, O of increasing power C C S N Incremental 1 O Sliding Window O Adaptive C C S N 2 C N 3 C C S N 2 / 48
  • 3. Non-incremental Frequent Closed Graph Mining CloseGraph: Xifeng Yan, Jiawei Han right-most extension based on depth-first search based on gSpan ICDM’02 MoSS: Christian Borgelt, Michael R. Berthold breadth-first search based on MoFa ICDM’02 3 / 48
  • 4. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 4 / 48
  • 5. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 5 / 48
  • 6. Mining Massive Data Source: IDC’s Digital Universe Study (EMC), June 2011 6 / 48
  • 7. Data Streams Data Streams Sequence is potentially infinite High amount of data High speed of arrival Once an element from a data stream has been processed it is discarded or archived Tools: approximation randomization, sampling sketching 7 / 48
  • 8. Data Streams Approximation Algorithms 1011000111 1010101 Sliding Window We can maintain simple statistics over sliding windows, using O ( 1 log2 N ) space, where ε N is the length of the sliding window ε is the accuracy parameter M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002 8 / 48
  • 9. Data Streams Approximation Algorithms 10110001111 0101011 Sliding Window We can maintain simple statistics over sliding windows, using O ( 1 log2 N ) space, where ε N is the length of the sliding window ε is the accuracy parameter M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002 8 / 48
  • 10. Data Streams Approximation Algorithms 101100011110 1010111 Sliding Window We can maintain simple statistics over sliding windows, using O ( 1 log2 N ) space, where ε N is the length of the sliding window ε is the accuracy parameter M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002 8 / 48
  • 11. Data Streams Approximation Algorithms 1011000111101 0101110 Sliding Window We can maintain simple statistics over sliding windows, using O ( 1 log2 N ) space, where ε N is the length of the sliding window ε is the accuracy parameter M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002 8 / 48
  • 12. Data Streams Approximation Algorithms 10110001111010 1011101 Sliding Window We can maintain simple statistics over sliding windows, using O ( 1 log2 N ) space, where ε N is the length of the sliding window ε is the accuracy parameter M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002 8 / 48
  • 13. Data Streams Approximation Algorithms 101100011110101 0111010 Sliding Window We can maintain simple statistics over sliding windows, using O ( 1 log2 N ) space, where ε N is the length of the sliding window ε is the accuracy parameter M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002 8 / 48
  • 14. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 9 / 48
  • 15. Pattern Mining Dataset Example Document Patterns d1 abce d2 cde d3 abce d4 acde d5 abcde d6 bcd 10 / 48
  • 16. Itemset Mining d1 abce Support Frequent d2 cde d1,d2,d3,d4,d5,d6 c d3 abce d1,d2,d3,d4,d5 e,ce d4 acde d1,d3,d4,d5 a,ac,ae,ace d5 abcde d1,d3,d5,d6 b,bc d6 bcd d2,d4,d5 d,cd d1,d3,d5 ab,abc,abe be,bce,abce d2,d4,d5 de,cde 11 / 48
  • 17. Itemset Mining d1 abce Support Frequent d2 cde 6 c d3 abce 5 e,ce d4 acde 4 a,ac,ae,ace d5 abcde 4 b,bc d6 bcd 4 d,cd 3 ab,abc,abe be,bce,abce 3 de,cde 12 / 48
  • 18. Itemset Mining d1 abce Support Frequent Gen Closed d2 cde 6 c c c d3 abce 5 e,ce e ce d4 acde 4 a,ac,ae,ace a ace d5 abcde 4 b,bc b bc d6 bcd 4 d,cd d cd 3 ab,abc,abe ab be,bce,abce be abce 3 de,cde de cde 12 / 48
  • 19. Itemset Mining d1 abce Support Frequent Gen Closed Max d2 cde 6 c c c d3 abce 5 e,ce e ce d4 acde 4 a,ac,ae,ace a ace d5 abcde 4 b,bc b bc d6 bcd 4 d,cd d cd 3 ab,abc,abe ab be,bce,abce be abce abce 3 de,cde de cde cde 12 / 48
  • 20. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 13 / 48
  • 21. Graph Dataset Transaction Id Graph Weight O C C S N 1 O 1 O C C S N 2 C 1 O C S N 3 C 1 N 4 C C S N 1 14 / 48
  • 22. Graph Coresets Coreset of a set P with respect to some problem Small subset that approximates the original set P. Solving the problem for the coreset provides an approximate solution for the problem on P. δ -tolerance Closed Graph A graph g is δ -tolerance closed if none of its proper frequent supergraphs has a weighted support ≥ (1 − δ ) · support (g ). Maximal graph: 1-tolerance closed graph Closed graph: 0-tolerance closed graph. 15 / 48
  • 23. Graph Coresets Coreset of a set P with respect to some problem Small subset that approximates the original set P. Solving the problem for the coreset provides an approximate solution for the problem on P. δ -tolerance Closed Graph A graph g is δ -tolerance closed if none of its proper frequent supergraphs has a weighted support ≥ (1 − δ ) · support (g ). Maximal graph: 1-tolerance closed graph Closed graph: 0-tolerance closed graph. 15 / 48
  • 24. Graph Coresets Relative support of a closed graph Support of a graph minus the relative support of its closed supergraphs. The sum of the closed supergraphs’ relative supports of a graph and its relative support is equal to its own support. (s, δ )-coreset for the problem of computing closed graphs Weighted multiset of frequent δ -tolerance closed graphs with minimum support s using their relative support as a weight. 16 / 48
  • 25. Graph Coresets Relative support of a closed graph Support of a graph minus the relative support of its closed supergraphs. The sum of the closed supergraphs’ relative supports of a graph and its relative support is equal to its own support. (s, δ )-coreset for the problem of computing closed graphs Weighted multiset of frequent δ -tolerance closed graphs with minimum support s using their relative support as a weight. 16 / 48
  • 26. Graph Dataset Transaction Id Graph Weight O C C S N 1 O 1 O C C S N 2 C 1 O C S N 3 C 1 N 4 C C S N 1 17 / 48
  • 27. Graph Coresets Graph Relative Support Support C C S N 3 3 O C S N 3 3 N C S 3 3 Table: Example of a coreset with minimum support 50% and δ = 1 18 / 48
  • 28. Graph Coresets Figure: Number of graphs in a (40%, δ )-coreset for NCI. 19 / 48
  • 29. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 20 / 48
  • 30. I NC G RAPH M INER I NC G RAPH M INER (D, min sup ) Input: A graph dataset D, and min sup. Output: The frequent graph set G. 1 G←0 / 2 for every batch bt of graphs in D 3 do C ← C ORESET (bt , min sup ) 4 G ← C ORESET (G ∪ C, min sup ) 5 return G 21 / 48
  • 31. W IN G RAPH M INER W IN G RAPH M INER (D, W , min sup ) Input: A graph dataset D, a size window W and min sup. Output: The frequent graph set G. 1 G←0 / 2 for every batch bt of graphs in D 3 do C ← C ORESET (bt , min sup ) 4 Store C in sliding window 5 if sliding window is full 6 then R ← Oldest C stored in sliding window, negate all support values 7 else R ← 0 / 8 G ← C ORESET (G ∪ C ∪ R, min sup ) 9 return G 22 / 48
  • 32. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 1 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 33. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 1 W1 = 01010110111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 34. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 10 W1 = 1010110111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 35. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 101 W1 = 010110111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 36. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 1010 W1 = 10110111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 37. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 10101 W1 = 0110111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 38. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 101010 W1 = 110111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 39. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 1010101 W1 = 10111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 40. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 W0 = 10101011 W1 = 0111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 41. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 |µW0 − µW1 | ≥ εc : CHANGE DET.! ˆ ˆ W0 = 101010110 W1 = 111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 42. Algorithm ADaptive Sliding WINdow Example W = 101010110111111 Drop elements from the tail of W W0 = 101010110 W1 = 111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 43. Algorithm ADaptive Sliding WINdow Example W = 01010110111111 Drop elements from the tail of W W0 = 101010110 W1 = 111111 ADWIN: A DAPTIVE W INDOWING A LGORITHM 1 Initialize Window W 2 for each t > 0 3 do W ← W ∪ {xt } (i.e., add xt to the head of W ) 4 repeat Drop elements from the tail of W 5 until |µW0 − µW1 | ≥ εc holds ˆ ˆ 6 for every split of W into W = W0 · W1 7 Output µWˆ 23 / 48
  • 44. Algorithm ADaptive Sliding WINdow Theorem At every time step we have: 1 (False positive rate bound). If µt remains constant within W , the probability that ADWIN shrinks the window at this step is at most δ . 2 (False negative rate bound). Suppose that for some partition of W in two parts W0 W1 (where W1 contains the most recent items) we have |µW0 − µW1 | > 2εc . Then with probability 1 − δ ADWIN shrinks W to W1 , or shorter. ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters. 24 / 48
  • 45. Algorithm ADaptive Sliding WINdow ADWIN using a Data Stream Sliding Window Model, can provide the exact counts of 1’s in O (1) time per point. tries O (log W ) cutpoints uses O ( 1 log W ) memory words ε the processing time per example is O (log W ) (amortized and worst-case). Sliding Window Model 1010101 101 11 1 1 Content: 4 2 2 1 1 Capacity: 7 3 2 1 1 25 / 48
  • 46. A DAG RAPH M INER A DAG RAPH M INER (D, Mode, min sup ) 1 G←0 / 2 Init ADWIN 3 for every batch bt of graphs in D 4 do C ← C ORESET (bt , min sup ) 5 R←0 / 6 if Mode is Sliding Window 7 then Store C in sliding window 8 if ADWIN detected change 9 then R ← Batches to remove in sliding window with negative support 10 G ← C ORESET (G ∪ C ∪ R, min sup ) 11 if Mode is Sliding Window 12 then Insert # closed graphs into ADWIN 13 else for every g in G update g’s ADWIN 14 return G 26 / 48
  • 47. A DAG RAPH M INER A DAG RAPH M INER (D, Mode, min sup ) 1 G←0 / 2 Init ADWIN 3 for every batch bt of graphs in D 4 do C ← C ORESET (bt , min sup ) 5 R←0 / 6 7 8 9 10 G ← C ORESET (G ∪ C ∪ R, min sup ) 11 12 13 for every g in G update g’s ADWIN 14 return G 26 / 48
  • 48. A DAG RAPH M INER A DAG RAPH M INER (D, Mode, min sup ) 1 G←0 / 2 Init ADWIN 3 for every batch bt of graphs in D 4 do C ← C ORESET (bt , min sup ) 5 R←0 / 6 if Mode is Sliding Window 7 then Store C in sliding window 8 if ADWIN detected change 9 then R ← Batches to remove in sliding window with negative support 10 G ← C ORESET (G ∪ C ∪ R, min sup ) 11 if Mode is Sliding Window 12 then Insert # closed graphs into ADWIN 13 14 return G 26 / 48
  • 49. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 27 / 48
  • 50. Experimental Evaluation ChemDB dataset Public dataset 4 million molecules Institute for Genomics and Bioinformatics at the University of California, Irvine Open NCI Database Public domain 250,000 structures National Cancer Institute 28 / 48
  • 51. What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. It is closely related to WEKA It includes a collection of offline and online as well as tools for evaluation: classification clustering Easy to extend Easy to design and run experiments 29 / 48
  • 52. WEKA: the bird 30 / 48
  • 53. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 31 / 48
  • 54. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 31 / 48
  • 55. MOA: the bird The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct. 31 / 48
  • 57. Classification Experimental Setting Classifiers Naive Bayes Decision stumps Hoeffding Tree Hoeffding Option Tree Bagging and Boosting ADWIN Bagging and Leveraging Bagging Prediction strategies Majority class Naive Bayes Leaves Adaptive Hybrid 33 / 48
  • 59. Clustering Experimental Setting Clusterers StreamKM++ CluStream ClusTree Den-Stream CobWeb 35 / 48
  • 60. Clustering Experimental Setting Internal measures External measures Gamma Rand statistic C Index Jaccard coefficient Point-Biserial Folkes and Mallow Index Log Likelihood Hubert Γ statistics Dunn’s Index Minkowski score Tau Purity Tau A van Dongen criterion Tau C V-measure Somer’s Gamma Completeness Ratio of Repetition Homogeneity Modified Ratio of Repetition Variation of information Adjusted Ratio of Clustering Mutual information Fagan’s Index Class-based entropy Deviation Index Cluster-based entropy Z-Score Index Precision D Index Recall Silhouette coefficient F-measure Table: Internal and external clustering evaluation measures. 36 / 48
  • 61. Cluster Mapping Measure Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes and Bernhard Pfahringer. An Effective Evaluation Measure for Clustering on Evolving Data Stream KDD’11 CMM: Cluster Mapping Measure A novel evaluation measure for stream clustering on evolving streams ∑o∈F w (o ) · pen(o, C ) CMM (C , C L ) = 1 − ∑o∈F w (o ) · con(o, Cl (o )) 37 / 48
  • 62. Extensions of MOA Multi-label Classification Active Learning Regression Closed Frequent Graph Mining Twitter Sentiment Analysis Challenges for bigger data streams Sampling and distributed systems (Map-Reduce, Hadoop, S4) 38 / 48
  • 63. Open NCI dataset Time NCI Dataset 120 100 80 Seconds 60 40 20 0 00 00 00 00 00 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 .0 .0 .0 .0 .0 0. 0. 0. 0. 0. 0. 0. 0. 10 30 50 70 90 11 13 15 17 19 21 23 25 Instances IncGraphMiner IncGraphMiner-C MoSS closeGraph 39 / 48
  • 64. Open NCI dataset Memory NCI Dataset 9000 8000 7000 Megabytes 6000 5000 4000 3000 2000 1000 0 00 00 00 00 00 0 0 0 0 0 0 0 0 00 00 00 00 00 00 00 00 .0 .0 .0 .0 .0 0. 0. 0. 0. 0. 0. 0. 0. 10 30 50 70 90 11 13 15 17 19 21 23 25 Instances IncGraphMiner IncGraphMiner-C MoSS closeGraph 40 / 48
  • 65. Megabytes 10 . 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 24 000 50000 0. 47 000 0. 70 000 0. 93 000 IncGraphMiner 1. 0. 0 16 00 1. 0.0 39 00 1. 0.0 62 00 1. 0.0 85 00 2. 0.0 08 00 2. 0.0 31 00 IncGraphMiner-C 2. 0.0 54 00 Instances 2. 0.0 77 00 Memory ChemDB Dataset 3. 0.0 MoSS 00 00 3. 0.0 23 00 ChemDB dataset 3. 0.0 46 00 3. 0.0 69 00 3. 0.0 92 00 closeGraph 0. 00 0 41 / 48
  • 66. Seconds 10 . 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 24 000 0. 47 000 0. 70 000 0. 93 000 1. 0. 0 16 00 1. 0.0 39 00 IncGraphMiner 1. 0.0 62 00 1. 0.0 85 00 2. 0.0 08 00 2. 0.0 31 00 2. 0.0 54 00 Instances IncGraphMiner-C 2. 0.0 77 00 Time ChemDB Dataset 3. 0.0 00 00 3. 0.0 23 00 MoSS ChemDB dataset 3. 0.0 46 00 3. 0.0 69 00 3. 0.0 92 00 0. 00 0 closeGraph 42 / 48
  • 67. Number of Closed Graphs 10 .0 0 10 20 30 40 50 60 0 60 0 .0 11 00 0. 0 16 00 0. 00 21 0 0. 0 26 00 0. 0 31 00 0. 0 36 00 0. 0 ADAGRAPHMINER 41 00 0. 0 46 00 0. 0 51 00 0. 0 56 00 0. 00 Instances 61 0 0. 0 66 00 0. 0 71 00 0. 0 76 00 ADAGRAPHMINER-Window 0. 0 81 00 0. A DAG RAPH M INER 0 86 00 0. 0 91 00 0. 0 96 00 0. 00 IncGraphMiner 0 43 / 48
  • 68. Outline 1 Streaming Data 2 Frequent Pattern Mining Mining Evolving Graph Streams A DAG RAPH M INER 3 Experimental Evaluation 4 Summary and Future Work 44 / 48
  • 69. Summary Mining Evolving Graph Data Streams Given a data stream D of graphs, find frequent closed graphs. Transaction Id Graph We provide three algorithms, O of increasing power C C S N Incremental 1 O Sliding Window O Adaptive C C S N 2 C N 3 C C S N 45 / 48
  • 70. Summary {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. http://moa.cs.waikato.ac.nz It is closely related to WEKA It includes a collection of offline and online as well as tools for evaluation: classification clustering frequent pattern mining MOA deals with evolving data streams MOA is easy to use and extend 46 / 48
  • 71. Future work Structured massive data mining sampling, sketching parallel techniques 47 / 48
  • 72. Thanks! 48 / 48