SlideShare uma empresa Scribd logo
1 de 39
March 30, 2011, The 5th International Workshop on
Data-Mining and Statistical Science@Osaka University	


    Kernel-based Similarity Search
     in Massive Graph Databases
          with Wavelet Trees

               Yasuo Tabei
        (JST ERATO Minato Project)
         Joint work with Koji Tsuda (AIST)
Outline	
n  Introduction
  l  Recent development of graph databases
  l  Needs for graph similarity search

  l  Bag-of-words representation of a graph

  l  Semi-conjunctive query

n  Method
  l    Scalable similarity search with wavelet trees
n  Experiments
  l    Use a large-scale graph dataset
  l    25 million chemical compounds
Graphs are everywhere	



Gene co-expression
network                                     RNA 2D structure



                     Protein 3D structure

	
                                             	
Chemical compound
                                             Social Network
Graph Similarity Search	
n  Retrievegraphs similar to the query
n  Large databases
  l    More than 20 million chemical compounds in
        PubChem database
n  Bag-of-words    representation of graphs
  l    WL procedure (NIPS 2009)
n  Why    not use document retrieval methods?
  l  Inverted index
  l  Not that easy (explained later)
Weisfeiler-Lehman Procedure (NIPS,09)	
n Convert
         a graph into a set of words
 (bag-of-words)	
 i) Make a label set of adjacent
    A	
                     vertices ex) {E,A,D}
        B	
  E	
  ii) Sort     ex) A,D,E
                  iii) Add the vertex label as a prefix
    D	
                              ex) B,A,D,E
    A	
                  iv) Map the label sequence to a
        R	
   E	
     unique value
                               ex) B,A,D,E R
    D	
                  v) Assign the value as the new
 Bag-of-words
 {A,B,D,E,…,R,…}	
 vertex label
Search by cosine similarity	
n  Identify
          all graphs in the database whose
  cosine is larger than a threshold 1-ε
                                        | Wi !Q |
                 Wi s.t K N (Wi ,Q) =                ! 1" !
                                        | Wi | | Q |

   l    Wi, Q: bag-of-words of graphs
n  The above solution can be relaxed as
  follows,
 If K (Q,W ) ! 1" ! , then
         N

                                             |Q |
                   (1! ! )2 | Q |!| W |!
                                           (1! ! )2
   l    Can be used for fast search
Semi-conjunctive query	
n Cosine   query can be relaxed to the following
form
              Wi s.t | Wi !Q |" k
  l  The number of common words between
      two bag-of-words Wi and Q
  Ex) |Wi Q|=|(A,C,E,F,H) (A,E,I,J,L)|
                 =|(A,E)|=2
  l  k=(1-ε)2|Q|
  l No false negatives

  l False positives can easily be filtered out by
  cosine calculations
Inverted index	
n  Innatural language processing, inverted
   index has been used to solve semi-
   conjunctive query	

Inverted Index	
         	
           	
   n  Associative  map
  	
                 	
    l    key    word
  	
          	
  	
            	
         l    value graph identifiers
  	
          	
  	
                 	
                                       including a word
Bottom-up search	
   Inverted Index	
        	
           	
       i) Look the index up with
   	
               	
   	
             	
              query bag-of-words
   	
                	
   	
             	
          ii) Aggregate all the lists
   	
                   	
         of graph indices
         Query:(A,C,E)	
      iii) Sort
                Aggregation	
     )Scan
(2,8,13,15,8,10,16,4,9,13,14)	
               Sort	
(2,4,8,8,9,10,13,13,14,15,16)	
	
 k=
Search time of inverted index
         on 25 million graphs	
n Searchtime of inverted index is not so different from
that of sequencial scan	
                                              40 sec	
                                              38 sec
Why?	
                           n  Each    word is not
        Query:(A,C,E)	
          specific enough
                Aggregation	
 Query contains 1000s
                             n 
(2,8,13,15,8,10,16,4,9,13,14)	
 of words
                Sort	
       n  Aggregated array is

(2,4,8,8,9,10,13,13,14,15,16)	
 VERY long
                             n  Sorting takes O(ClogC)
              C	
                in time
Overview of our method 	
n  Top-down   search in a tree over the series of
    graphs
n  Huge memory, if tree is implemented with
    pointers
n  Wavelet Tree: Succinct data structure
n  The smaller the similarity threshold is, the
    quicker the algorithm finishes
     •  Not the case in inverted index
Binary tree over graphs	
                                      n  leaf       graph
                  [1,8]	
           0	
                        n  node       interval
                            1	
        [1,4]	
             [5,8]	
                                      n        Each node is identified
      0	
        1	
      0	
      1	
          by a bit string (v={01})
                                            n  At the leaves, the graph
    [1,2]	
    [3,4]	
   [5,6]	
  [7,8]	
                                                indices correspond to
 0	
     1	
 0	
 1	
0	
 1	
 0	
        1	
                                                int(v)+1
  1	
 2	
 3	
 4	
 5	
 6	
 7	
 8	
{000}	
 001}	
      {     {010}	
 {100}	
 {110}	
                  {011}	
 {101}	
   {111}	
    (e.g., int(010)+1=2+1=3)
Summarization of bag-of-words	
  Represent bag-of-words as a bit array
n 
                       12345678
Ex) Wi=(1,3,4,7,8) xi=(1,0,1,1,0,0,1,1)

n    Take disjunction   of all bit arrays in the interval
      of a node v


Ex) For an interval [1,4]
X1=(0,1,0,0,0,0,1,0)           yv=x1 x2 x3 x4
X2=(1,0,1,1,0,0,0,0)             =(1,1,1,1,0,0,1,1)	
X3=(1,0,0,0,0,0,1,1)
X4=(1,0,0,0,0,0,0,1)
Binary tree over graphs	
                        yv=111111	
                       n  Assign   to each node
                         [1,8]	
                             v a bit arrays yv
        yv=110111	
                yv=101101	
                                                          n  yv : bit array
              [1,4]	
                 [5,8]	
                                                            l    i-th bit is 1 if graphs
yv=010101	
yv=110100	
 v=100100	
yv=001101	
                    y
                                                                  in an interval have
    [1,2]	
       [3,4]	
       [5,6]	
         [7,8]	
                                                                  the corresponding
                                                                  word.
  1	
   2	
      3	
     4	
   5	
    6	
   7	
     8	
{000}	
 001}	
      {     {010}	
 {100}	
 {110}	
                 {011}	
 {101}	
 {111}
Top-down traversal	
                   yv=111111	
                      n  Q: bag-of-words of a
                       [1,8]	
                          query
        vy =110111 	
                               y =101101
                               v          	
        n  Perform top-down
             [1,4]	
              [5,8]	
               traversal
 y =010101 y =110100 y =100100 y =001101 n  Prune the search
 v          	
 v        	
 v       	
 v        	

     [1,2]	
     [3,4]	
     [5,6]	
      [7,8]	
       space if # y [ j] ! k
                                                             j"Q
                                                                   v


                                                    n  The larger k is, the
   1	
 2	
 3	
 4	
 5	
 6	
 7	
 8	
{000}	
 001}	
       {      {010}	
 {100}	
 {110}	
                     {011}	
 {101}	
        {111}	
     smaller the search
                                                        space is
Huge Memory	
n  Time   is O(τm) : Very fast
  l  τ: the number of traversed node
  l  m: the number of bag-of-words in a query

n  Space   is O(Mnlogn) bit
  l  M: the number of unique words
  l  n: the number of graphs
Wavelet Tree! (SODA,03)	
n  Replace yv    in each node v by a rank
  dictionary
  l    explained in next slides
n  Implement  a tree without using pointers
n  Only 60% memory overhead compared to
    the inverted index (Vigna,08)
n  Access to the summary information in any
    internal node
Rank dictionary (Raman,02)	
n  Give    bit array B[1,n] the following operation:
   l    rankc(B,i): return the number of c   {0,1} in B[1…i]	



Ex) B=0110011100	
                        i 1 2 3 4 5 6 7 8 9 10
           rank1(B,8)=5 0 1 1 0 0 1 1 1 0 0
           rank0(B,5)=3	
 0 1 1 0 0 1 1 1 0 0
Implementation of rank dictionary	
                           n  Divide the bit array B into
B	
                        large blocks of length l=log2n
                             RL=Ranks of large blocks
RL	
                       n  Divide each large block to
                           small blocks of length s=logn/2
                             Rs=Ranks of small blocks
RS	
                            relative to the large block
               rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank)	
                                 Time:O(1)
                                 Memory: n +o(n) bits
Restricted inverted index	
Inverted Index	
                      n  Concatenate    graph ids
     	
                          	
                                          for words in the root
	
                          	
	
                     	
             n  Restrict the inverted index for
	
                     	
                 the interval [sv,tv] of a node v	
	
                	

          [1,8]    A	
    B	
  C	
   D	
          Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                                       4	
              >4	
[1,4]        A	
 B	
 C	
 D	
                           A	
 B	
 C	
 D	
 [4,8]
 Aleft      1 3 2 1 2 4                        Aright 6 8 5 7 7 5
Whole structure of
          restricted inverted index	
          [1,8]    A	
    B	
   C	
  D	
                1 3 6 8 2 5 7 1 2 7 4 5


  [1,4]    A	
 B	
 C	
 D	
                  A	
 B	
 C	
 D	
[5,8]
          1 3 2 1 2 4                      6 8 5 7 7 5

[1,2]                        [3,4] [5,6]                 [6,7]
   A	
B	
 C	
          A	
D	
        A	
B	
 D	
      A	
 B	
C	
   1 2 1 2             3 4           6 5 5            8 7 7


 A	
C	
    B	
C	
      A	
     D	
   B	
D	
 A	
      B	
 C	
 A	
 1 1       2 2         3        4    5 5 6           7 7     8
Similarity search	
n  To
    retrieve graphs similar to a query
  Q=(A,C), the tree is traversed in the top-
  down manner.

          [1,8]    A	
         C	
          Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                       4	
           >4	
 [1,4]       A	
  C	
        [5,8]   A	
   C	
  Aleft     1 3 2 1 2 4      Aright 6 8 5 7 7 5
Similarity search	
n    To retrieve graphs similar to a query Q=(A,C), the tree
      is traversed in a top-down manner.
n  Observation
      l    To perform top-down traversal, only intervals
            of words in each node are necessary

                    A [1,4]	
    C [8,10]	
              Aroot 1 3 6 8 2 5 7 1 2 7 4 5

                             4	
              >4	
            A [1,2]	
C [4,5]	
             A[1,2]	
 C[5,5]	
       Aleft 1 3 2 1 2 4            Aright 6 8 5 7 7 5
Similarity search	
n  Replace
          restricted inverted index Av in
  each node v with a bit array bv.
   l    bv[i]=1 if Av[i] goes to the right child


                	
 0	
 0	
 1	
1	
 0	
 1	
 1	
 0	
 0	
 1	
 0	
 1	
            Aroot 1 3 6 8 2 5 7 1 2 7 4 5
                             0	
                       1	

 bleft	
     0	
 1	
 0	
0	
 0	
 1	
      bright	
 0	
 1	
 0	
1	
 1	
 0	
 Aleft       1 3 2 1 2 4                 Aright 6 8 5 7 7 5
Similarity search	
n  Intervals       of child nodes can be computed by
      rank operations
l    sleft(v),j=rank0(bv,svj-1)+1,tleft(v),j=rank0(bv,tvj)
l    sright(v),j =rank1(bv,svj-1)+1,tright(v),j=rank1(bv,tvj)
                                               C [8,10]	
      Ex)
                    	
 0	
 0	
 1	
1	
 0	
 1	
 1	
 0	
 0	
 1	
 0	
 1	
                Aroot 1 3 6 8 2 5 7 1 2 7 4 5
rank0(broot,8-1)+1=4,                                            rank1(broot,8-1)+1=5,
rank0(broot,10)=5	
                                              rank1(broot,10)=5	
                        C [4,5]	
                               C [5,5]	
      bleft	
    0	
 1	
 0	
0	
 0	
 1	
       bright	
 0	
 1	
 0	
1	
 1	
 0	
      Aleft      1 3 2 1 2 4                  Aright 6 8 5 7 7 5
Wavelet Tree	
n Wavelet tree can be obtained to replace the restricted
Inverted indices with bit arrays
n Wavelet tree consists of bit arrays bv and initial
intervals Croot.

     Croot	
      A	
    B	
   C	
  D	
               0 0 1 1 0 1 1 0 0 1 0 1



         0 1 0 0 0 1             0 1 0 1 1 0



    0 1 0 1          0 1       1 0 0       1 0 0
Wavelet Tree	
n Graphids can be recovered from bit strings
 on the path from the root to leaves	
  Croot	
 A	
       B	
   C	
   D	
                 0 0 1 1 0 1 1 0 0 1 0 1

                     0	
                   1	

        0 1 0 0 0 1                0 1 0 1 1 0

        0	
           1	
           0	
           1	

  0 1 0 1                  0 1   1 0 0            1 0 0

  0	
      1	
        0	
 1	
    0	
 1	
          0	
 1	
 000    001          010 011 100 101             110 111
Memory	
n  (1+α)Nlogn          + MlogN bits
        Bit arrays bv	
 Initial intervals Croot	
      l  N: the number of all words in the database
      l  n: the number of graphs

      l  α: overhead for rank dictionary (α=0.6)

n    For inverted index, Nlogn bits
n    About 60% overhead to inverted index!!
Experiments	
n  25 million chemical compounds from
    PubChem database
n  Use search time and memory as
    evaluation measures
n  Compare our method gWT to
   l  inverted index
   l  sequential scan implemented in G-Hash

       (Wang et al, 2009)
Search time on 25 million graphs	
                              40 sec	
                              38 sec	




                              8 sec	
                              3 sec	
                              2 sec
Memory usage
Overhead of rank dictionary
Construction time
Related work	
•  A lot of methods have been proposed so far.
 1.gIndex [Yan et al., 04]
 2.Graph Grep [Shasha et al., 07]
 3.Tree+Delta [Zhao et al., 07]
 4.TreePi [Zhang et al., 07]
 5.Gstring [Jiang et al., 07]
 6.FG-Index [Cheng et al., 07]
 7.GDIndex [Williams et al., 07]
 etc
Related work 	
              •  Decompose graphs
                 into a set of
Decompose	
      substructures
               - subgraphs, trees,
          …	
    paths etc
              •  Build a substructure-
Index	
                 based index
Drawbacks	
•  Require frequent subgraph mining
•  Do not scale to millions of graphs
Summary	
n  Efficientsimilarity search method for
    massive graph databases
n  Solve semi-conjunctive query efficiently
n  Built on wavelet trees
n  Use Weisfeiler-Lehman procedure to
    convert graphs into bag-of-words
n  Applicable to more than 20 million graphs
n  Software
    http://code.google.com/p/gwt/
Acknowledgements	

•  Prof. Shin-ichi Minato (Hokkaido Univ.)
•  Dr. Daisuke Okanohara (PFI)
•  Members in ERATO Minato Project

Mais conteúdo relacionado

Mais procurados

A new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiffA new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiffAlexander Decker
 
Introduction to inverse problems
Introduction to inverse problemsIntroduction to inverse problems
Introduction to inverse problemsDelta Pi Systems
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureVjekoslavKovac1
 
Datamining 6th svm
Datamining 6th svmDatamining 6th svm
Datamining 6th svmsesejun
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeVjekoslavKovac1
 
Numerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodNumerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodAlexander Decker
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flowsVjekoslavKovac1
 
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...grssieee
 
New Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial SectorNew Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial SectorSSA KPI
 
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Alex (Oleksiy) Varfolomiyev
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilationsVjekoslavKovac1
 
Signal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationSignal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationGabriel Peyré
 
Signal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationSignal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationGabriel Peyré
 
Datamining 6th Svm
Datamining 6th SvmDatamining 6th Svm
Datamining 6th Svmsesejun
 
Data Exchange over RDF
Data Exchange over RDFData Exchange over RDF
Data Exchange over RDFnet2-project
 
A Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cubeA Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cubeVjekoslavKovac1
 
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...IJERA Editor
 

Mais procurados (20)

A new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiffA new class of a stable implicit schemes for treatment of stiff
A new class of a stable implicit schemes for treatment of stiff
 
Introduction to inverse problems
Introduction to inverse problemsIntroduction to inverse problems
Introduction to inverse problems
 
1 cb02e45d01
1 cb02e45d011 cb02e45d01
1 cb02e45d01
 
Image denoising
Image denoisingImage denoising
Image denoising
 
Multilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structureMultilinear singular integrals with entangled structure
Multilinear singular integrals with entangled structure
 
Datamining 6th svm
Datamining 6th svmDatamining 6th svm
Datamining 6th svm
 
A Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cubeA Szemerédi-type theorem for subsets of the unit cube
A Szemerédi-type theorem for subsets of the unit cube
 
Numerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis methodNumerical solution of boundary value problems by piecewise analysis method
Numerical solution of boundary value problems by piecewise analysis method
 
cyclic_code.pdf
cyclic_code.pdfcyclic_code.pdf
cyclic_code.pdf
 
Tales on two commuting transformations or flows
Tales on two commuting transformations or flowsTales on two commuting transformations or flows
Tales on two commuting transformations or flows
 
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
WE4.L09 - MEAN-SHIFT AND HIERARCHICAL CLUSTERING FOR TEXTURED POLARIMETRIC SA...
 
New Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial SectorNew Mathematical Tools for the Financial Sector
New Mathematical Tools for the Financial Sector
 
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
Optimal Finite Difference Grids for Elliptic and Parabolic PDEs with Applicat...
 
Paraproducts with general dilations
Paraproducts with general dilationsParaproducts with general dilations
Paraproducts with general dilations
 
Signal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationSignal Processing Course : Convex Optimization
Signal Processing Course : Convex Optimization
 
Signal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationSignal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems Regularization
 
Datamining 6th Svm
Datamining 6th SvmDatamining 6th Svm
Datamining 6th Svm
 
Data Exchange over RDF
Data Exchange over RDFData Exchange over RDF
Data Exchange over RDF
 
A Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cubeA Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cube
 
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
Catalan Tau Collocation for Numerical Solution of 2-Dimentional Nonlinear Par...
 

Destaque

Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicYasuo Tabei
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20Yasuo Tabei
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabeiYasuo Tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceYasuo Tabei
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicYasuo Tabei
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306Yasuo Tabei
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009Yasuo Tabei
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesYasuo Tabei
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesYasuo Tabei
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 publicYasuo Tabei
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法MapR Technologies Japan
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界Preferred Networks
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)Shirou Maruyama
 
Grammar Tenses.pptx
Grammar Tenses.pptxGrammar Tenses.pptx
Grammar Tenses.pptxwendyvinueza
 
Com Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv TallersCom Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv TallersArnau Cerdà
 

Destaque (20)

GIW2013
GIW2013GIW2013
GIW2013
 
Sketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - publicSketch sort sugiyamalab-20101026 - public
Sketch sort sugiyamalab-20101026 - public
 
Ibisml2011 06-20
Ibisml2011 06-20Ibisml2011 06-20
Ibisml2011 06-20
 
Kdd2015reading-tabei
Kdd2015reading-tabeiKdd2015reading-tabei
Kdd2015reading-tabei
 
DCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant SpaceDCC2014 - Fully Online Grammar Compression in Constant Space
DCC2014 - Fully Online Grammar Compression in Constant Space
 
Sketch sort ochadai20101015-public
Sketch sort ochadai20101015-publicSketch sort ochadai20101015-public
Sketch sort ochadai20101015-public
 
Lp Boost
Lp BoostLp Boost
Lp Boost
 
CPM2013-tabei201306
CPM2013-tabei201306CPM2013-tabei201306
CPM2013-tabei201306
 
SPIRE2013-tabei20131009
SPIRE2013-tabei20131009SPIRE2013-tabei20131009
SPIRE2013-tabei20131009
 
Lgm saarbrucken
Lgm saarbruckenLgm saarbrucken
Lgm saarbrucken
 
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributesNIPS2013読み会: Scalable kernels for graphs with continuous attributes
NIPS2013読み会: Scalable kernels for graphs with continuous attributes
 
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data MatricesScalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
 
Lgm pakdd2011 public
Lgm pakdd2011 publicLgm pakdd2011 public
Lgm pakdd2011 public
 
異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法異常検知 - 何を探すかよく分かっていないものを見つける方法
異常検知 - 何を探すかよく分かっていないものを見つける方法
 
ウェーブレット木の世界
ウェーブレット木の世界ウェーブレット木の世界
ウェーブレット木の世界
 
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
文法圧縮入門:超高速テキスト処理のためのデータ圧縮(NLP2014チュートリアル)
 
bigdata2012nlp okanohara
bigdata2012nlp okanoharabigdata2012nlp okanohara
bigdata2012nlp okanohara
 
мобильный смарт фитнес
мобильный смарт фитнесмобильный смарт фитнес
мобильный смарт фитнес
 
Grammar Tenses.pptx
Grammar Tenses.pptxGrammar Tenses.pptx
Grammar Tenses.pptx
 
Com Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv TallersCom Ensenyar Llengua A Xinesos Xiv Tallers
Com Ensenyar Llengua A Xinesos Xiv Tallers
 

Semelhante a Dmss2011 public

[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural ProcessesDeep Learning JP
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processesKazuki Fujikawa
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayJen Aman
 
Review session2
Review session2Review session2
Review session2NEEDY12345
 
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic AlgorithmsEstimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic AlgorithmsAnnibale Panichella
 
Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009Darren Kuropatwa
 
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...inventy
 
Skew Products on Directed Graphs
Skew Products on Directed GraphsSkew Products on Directed Graphs
Skew Products on Directed GraphsCaridad Arroyo
 
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskinComputer Science Club
 
Part4 graph algorithms
Part4 graph algorithmsPart4 graph algorithms
Part4 graph algorithmsOh-Heum Kwon
 

Semelhante a Dmss2011 public (20)

Partitions
PartitionsPartitions
Partitions
 
SISAP17
SISAP17SISAP17
SISAP17
 
Vector Space.pptx
Vector Space.pptxVector Space.pptx
Vector Space.pptx
 
Algorithm Exam Help
Algorithm Exam HelpAlgorithm Exam Help
Algorithm Exam Help
 
[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes[DL輪読会]Conditional Neural Processes
[DL輪読会]Conditional Neural Processes
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
Bolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarrayBolt: Building A Distributed ndarray
Bolt: Building A Distributed ndarray
 
Review session2
Review session2Review session2
Review session2
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
 
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic AlgorithmsEstimating the Evolution Direction of Populations to Improve Genetic Algorithms
Estimating the Evolution Direction of Populations to Improve Genetic Algorithms
 
Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009Pre-Cal 20S January 20, 2009
Pre-Cal 20S January 20, 2009
 
Thesis defense
Thesis defenseThesis defense
Thesis defense
 
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
Convergence Theorems for Implicit Iteration Scheme With Errors For A Finite F...
 
Skew Products on Directed Graphs
Skew Products on Directed GraphsSkew Products on Directed Graphs
Skew Products on Directed Graphs
 
Q
QQ
Q
 
ALG5.1.ppt
ALG5.1.pptALG5.1.ppt
ALG5.1.ppt
 
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
 
Linear sorting
Linear sortingLinear sorting
Linear sorting
 
10_rnn.pdf
10_rnn.pdf10_rnn.pdf
10_rnn.pdf
 
Part4 graph algorithms
Part4 graph algorithmsPart4 graph algorithms
Part4 graph algorithms
 

Último

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 

Último (20)

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 

Dmss2011 public

  • 1. March 30, 2011, The 5th International Workshop on Data-Mining and Statistical Science@Osaka University Kernel-based Similarity Search in Massive Graph Databases with Wavelet Trees Yasuo Tabei (JST ERATO Minato Project) Joint work with Koji Tsuda (AIST)
  • 2. Outline n  Introduction l  Recent development of graph databases l  Needs for graph similarity search l  Bag-of-words representation of a graph l  Semi-conjunctive query n  Method l  Scalable similarity search with wavelet trees n  Experiments l  Use a large-scale graph dataset l  25 million chemical compounds
  • 3. Graphs are everywhere Gene co-expression network RNA 2D structure Protein 3D structure Chemical compound Social Network
  • 4. Graph Similarity Search n  Retrievegraphs similar to the query n  Large databases l  More than 20 million chemical compounds in PubChem database n  Bag-of-words representation of graphs l  WL procedure (NIPS 2009) n  Why not use document retrieval methods? l  Inverted index l  Not that easy (explained later)
  • 5. Weisfeiler-Lehman Procedure (NIPS,09) n Convert a graph into a set of words (bag-of-words) i) Make a label set of adjacent A vertices ex) {E,A,D} B E ii) Sort ex) A,D,E iii) Add the vertex label as a prefix D ex) B,A,D,E A iv) Map the label sequence to a R E unique value ex) B,A,D,E R D v) Assign the value as the new Bag-of-words {A,B,D,E,…,R,…} vertex label
  • 6. Search by cosine similarity n  Identify all graphs in the database whose cosine is larger than a threshold 1-ε | Wi !Q | Wi s.t K N (Wi ,Q) = ! 1" ! | Wi | | Q | l  Wi, Q: bag-of-words of graphs n  The above solution can be relaxed as follows, If K (Q,W ) ! 1" ! , then N |Q | (1! ! )2 | Q |!| W |! (1! ! )2 l  Can be used for fast search
  • 7. Semi-conjunctive query n Cosine query can be relaxed to the following form Wi s.t | Wi !Q |" k l  The number of common words between two bag-of-words Wi and Q Ex) |Wi Q|=|(A,C,E,F,H) (A,E,I,J,L)| =|(A,E)|=2 l  k=(1-ε)2|Q| l No false negatives l False positives can easily be filtered out by cosine calculations
  • 8. Inverted index n  Innatural language processing, inverted index has been used to solve semi- conjunctive query Inverted Index n  Associative map l  key word l  value graph identifiers including a word
  • 9. Bottom-up search Inverted Index i) Look the index up with query bag-of-words ii) Aggregate all the lists of graph indices Query:(A,C,E) iii) Sort Aggregation )Scan (2,8,13,15,8,10,16,4,9,13,14) Sort (2,4,8,8,9,10,13,13,14,15,16) k=
  • 10. Search time of inverted index on 25 million graphs n Searchtime of inverted index is not so different from that of sequencial scan 40 sec 38 sec
  • 11. Why? n  Each word is not Query:(A,C,E) specific enough Aggregation Query contains 1000s n  (2,8,13,15,8,10,16,4,9,13,14) of words Sort n  Aggregated array is (2,4,8,8,9,10,13,13,14,15,16) VERY long n  Sorting takes O(ClogC) C in time
  • 12. Overview of our method n  Top-down search in a tree over the series of graphs n  Huge memory, if tree is implemented with pointers n  Wavelet Tree: Succinct data structure n  The smaller the similarity threshold is, the quicker the algorithm finishes •  Not the case in inverted index
  • 13. Binary tree over graphs n  leaf graph [1,8] 0 n  node interval 1 [1,4] [5,8] n  Each node is identified 0 1 0 1 by a bit string (v={01}) n  At the leaves, the graph [1,2] [3,4] [5,6] [7,8] indices correspond to 0 1 0 1 0 1 0 1 int(v)+1 1 2 3 4 5 6 7 8 {000} 001} { {010} {100} {110} {011} {101} {111} (e.g., int(010)+1=2+1=3)
  • 14. Summarization of bag-of-words Represent bag-of-words as a bit array n  12345678 Ex) Wi=(1,3,4,7,8) xi=(1,0,1,1,0,0,1,1) n  Take disjunction of all bit arrays in the interval of a node v Ex) For an interval [1,4] X1=(0,1,0,0,0,0,1,0) yv=x1 x2 x3 x4 X2=(1,0,1,1,0,0,0,0) =(1,1,1,1,0,0,1,1) X3=(1,0,0,0,0,0,1,1) X4=(1,0,0,0,0,0,0,1)
  • 15. Binary tree over graphs yv=111111 n  Assign to each node [1,8] v a bit arrays yv yv=110111 yv=101101 n  yv : bit array [1,4] [5,8] l  i-th bit is 1 if graphs yv=010101 yv=110100 v=100100 yv=001101 y in an interval have [1,2] [3,4] [5,6] [7,8] the corresponding word. 1 2 3 4 5 6 7 8 {000} 001} { {010} {100} {110} {011} {101} {111}
  • 16. Top-down traversal yv=111111 n  Q: bag-of-words of a [1,8] query vy =110111 y =101101 v n  Perform top-down [1,4] [5,8] traversal y =010101 y =110100 y =100100 y =001101 n  Prune the search v v v v [1,2] [3,4] [5,6] [7,8] space if # y [ j] ! k j"Q v n  The larger k is, the 1 2 3 4 5 6 7 8 {000} 001} { {010} {100} {110} {011} {101} {111} smaller the search space is
  • 17. Huge Memory n  Time is O(τm) : Very fast l  τ: the number of traversed node l  m: the number of bag-of-words in a query n  Space is O(Mnlogn) bit l  M: the number of unique words l  n: the number of graphs
  • 18. Wavelet Tree! (SODA,03) n  Replace yv in each node v by a rank dictionary l  explained in next slides n  Implement a tree without using pointers n  Only 60% memory overhead compared to the inverted index (Vigna,08) n  Access to the summary information in any internal node
  • 19. Rank dictionary (Raman,02) n  Give bit array B[1,n] the following operation: l  rankc(B,i): return the number of c {0,1} in B[1…i] Ex) B=0110011100 i 1 2 3 4 5 6 7 8 9 10 rank1(B,8)=5 0 1 1 0 0 1 1 1 0 0 rank0(B,5)=3 0 1 1 0 0 1 1 1 0 0
  • 20. Implementation of rank dictionary n  Divide the bit array B into B large blocks of length l=log2n RL=Ranks of large blocks RL n  Divide each large block to small blocks of length s=logn/2 Rs=Ranks of small blocks RS relative to the large block rank1(B,i)=RL[i/l]+Rs[i/s]+(remaining rank) Time:O(1) Memory: n +o(n) bits
  • 21. Restricted inverted index Inverted Index n  Concatenate graph ids for words in the root n  Restrict the inverted index for the interval [sv,tv] of a node v [1,8] A B C D Aroot 1 3 6 8 2 5 7 1 2 7 4 5 4 >4 [1,4] A B C D A B C D [4,8] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 22. Whole structure of restricted inverted index [1,8] A B C D 1 3 6 8 2 5 7 1 2 7 4 5 [1,4] A B C D A B C D [5,8] 1 3 2 1 2 4 6 8 5 7 7 5 [1,2] [3,4] [5,6] [6,7] A B C A D A B D A B C 1 2 1 2 3 4 6 5 5 8 7 7 A C B C A D B D A B C A 1 1 2 2 3 4 5 5 6 7 7 8
  • 23. Similarity search n  To retrieve graphs similar to a query Q=(A,C), the tree is traversed in the top- down manner. [1,8] A C Aroot 1 3 6 8 2 5 7 1 2 7 4 5 4 >4 [1,4] A C [5,8] A C Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 24. Similarity search n  To retrieve graphs similar to a query Q=(A,C), the tree is traversed in a top-down manner. n  Observation l  To perform top-down traversal, only intervals of words in each node are necessary A [1,4] C [8,10] Aroot 1 3 6 8 2 5 7 1 2 7 4 5 4 >4 A [1,2] C [4,5] A[1,2] C[5,5] Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 25. Similarity search n  Replace restricted inverted index Av in each node v with a bit array bv. l  bv[i]=1 if Av[i] goes to the right child 0 0 1 1 0 1 1 0 0 1 0 1 Aroot 1 3 6 8 2 5 7 1 2 7 4 5 0 1 bleft 0 1 0 0 0 1 bright 0 1 0 1 1 0 Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 26. Similarity search n  Intervals of child nodes can be computed by rank operations l  sleft(v),j=rank0(bv,svj-1)+1,tleft(v),j=rank0(bv,tvj) l  sright(v),j =rank1(bv,svj-1)+1,tright(v),j=rank1(bv,tvj) C [8,10] Ex) 0 0 1 1 0 1 1 0 0 1 0 1 Aroot 1 3 6 8 2 5 7 1 2 7 4 5 rank0(broot,8-1)+1=4, rank1(broot,8-1)+1=5, rank0(broot,10)=5 rank1(broot,10)=5 C [4,5] C [5,5] bleft 0 1 0 0 0 1 bright 0 1 0 1 1 0 Aleft 1 3 2 1 2 4 Aright 6 8 5 7 7 5
  • 27. Wavelet Tree n Wavelet tree can be obtained to replace the restricted Inverted indices with bit arrays n Wavelet tree consists of bit arrays bv and initial intervals Croot. Croot A B C D 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 0
  • 28. Wavelet Tree n Graphids can be recovered from bit strings on the path from the root to leaves Croot A B C D 0 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 1 000 001 010 011 100 101 110 111
  • 29. Memory n  (1+α)Nlogn + MlogN bits Bit arrays bv Initial intervals Croot l  N: the number of all words in the database l  n: the number of graphs l  α: overhead for rank dictionary (α=0.6) n  For inverted index, Nlogn bits n  About 60% overhead to inverted index!!
  • 30. Experiments n  25 million chemical compounds from PubChem database n  Use search time and memory as evaluation measures n  Compare our method gWT to l  inverted index l  sequential scan implemented in G-Hash (Wang et al, 2009)
  • 31. Search time on 25 million graphs 40 sec 38 sec 8 sec 3 sec 2 sec
  • 33. Overhead of rank dictionary
  • 35. Related work •  A lot of methods have been proposed so far. 1.gIndex [Yan et al., 04] 2.Graph Grep [Shasha et al., 07] 3.Tree+Delta [Zhao et al., 07] 4.TreePi [Zhang et al., 07] 5.Gstring [Jiang et al., 07] 6.FG-Index [Cheng et al., 07] 7.GDIndex [Williams et al., 07] etc
  • 36. Related work •  Decompose graphs into a set of Decompose substructures - subgraphs, trees, … paths etc •  Build a substructure- Index based index
  • 37. Drawbacks •  Require frequent subgraph mining •  Do not scale to millions of graphs
  • 38. Summary n  Efficientsimilarity search method for massive graph databases n  Solve semi-conjunctive query efficiently n  Built on wavelet trees n  Use Weisfeiler-Lehman procedure to convert graphs into bag-of-words n  Applicable to more than 20 million graphs n  Software http://code.google.com/p/gwt/
  • 39. Acknowledgements •  Prof. Shin-ichi Minato (Hokkaido Univ.) •  Dr. Daisuke Okanohara (PFI) •  Members in ERATO Minato Project