SlideShare uma empresa Scribd logo
1 de 36
Baixar para ler offline
.
                        Algorithm of Suffix Tree
                           by farseerfc@gmail.com
.

                                   Jiachen Yang1
                           1 Research   Student in Osaka University


                               November 12, 2011




                                                                .     .   .        .      .       .

    jc-yang (Osaka-U)                       SuffixTree                         November 12, 2011       1 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .        .      .       .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011       2 / 25
What can we do with suffix tree?
.
                                           Find properties of the strings
Linear algorithms for exact string           . Find the longest common substrings of the
                                             1
matching
                                               string Si and Sj in θ (ni + nj ) time .
    like KMP                                 . Find all maximal pairs, maximal repeats or
                                             2

Search for strings                             supermaximal repeats in θ (n + z) time, if there
  . Check if a string P of length m is
  1                                            are z such repeats .
                                             . Find the Lempel-Ziv decomposition in θ (n)
                                             3
    a substring in O(m) time.
  . Find all z occurrences of the
  2                                            time .
                                             . Find the longest repeated substrings in θ (n)
    patterns P1 , · · · , Pq of total        4


    length m as substrings in                  time.
                                             . Find the most frequently occurring substrings
                                             5
    O(m + z) time.
  . Search for a regular expression P
  3                                            of a minimum length in θ (n) time.
                                             . Find the shortest strings from ∑ that do not
                                             6
    in time expected sublinear in n .
  . Find for each suffix of a pattern
  4                                            occur in D, in O(n + z) time, if there are z such
    P, the length of the longest               strings.
                                             . Find the shortest substrings occurring only
                                             7
    match between a prefix of
    P[i . . . m] and a substring in D in       once in θ (n) time.
                                             . Find, for each i, the shortest substrings of Si
    θ (m) time .                             8

  . ...
  5                                            not occurring elsewhere in D in θ (n) time.
                                             . ...
                                             9


                                                               .     .     .        .      .       .

      jc-yang (Osaka-U)                     SuffixTree                          November 12, 2011       3 / 25
Trie, Radix Tree, Suffix Trie & Suffix Tree

         trie1                    is a dictionary tree.
                   (AKA prefix tree)

                             stores a set of words.
                             each node represents a character except that root is empty string.
                             words with common prefix share same parent nodes.
                             minimal deterministic finite automaton that accepts all words.
  radix tree                            is a trie with compressed chain of nodes.
                   (AKA patricia trie or radix trie)

                             Each internal node has at least 2 children.
  suffix trie is a trie which stores all suffix of a given string.
  suffix tree is a suffix radix tree.
                   that enables linear time construction and fast
                   algorithms of other problems on a string.
   1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but
pronounced /traI/ “try” by other authors                           .   .    .        .      .       .

      jc-yang (Osaka-U)                                SuffixTree                November 12, 2011       4 / 25
Trie & Radix Tree

.
Trie of “A to tea ted ten i
in inn”
.
                                 .
                                 Radix tree example
                                 .




                                 .

.
                                          .   .   .        .      .       .

    jc-yang (Osaka-U)         SuffixTree               November 12, 2011       5 / 25
Suffix Trie & Suffix Tree of “banana”
                   ^




     b        a          n       $
                                                                                0:6


 a       n         $         a                                  banana$ a             na         $


                                                      1:0             8:6                 6:6            10:6
 n       a                   n       $

                                                                na     $                   na$       $
 a       n         $         a
                                                      4:6             9:6                 3:2            7:6

 n       a                   $
                                                      na$   $


 a       $
                                                2:1         5:6



 $


                                                                  .         .         .          .         .    .

     jc-yang (Osaka-U)                   SuffixTree                                         November 12, 2011        6 / 25
Suffix Tree of “mississippi”
                                                    mississippi$

                                                            0:11


                            (1,2)                     (0,...) (8,9)            (11,...)      (2,3)
                              i                     mississi... p                $...          s


                  12:8                        1:0          15:10                  18:11              4:4


            (8,...) (2,5)          (11,...)                   (10,...)         (9,...)
            ppi$... ssi              $...                       i$...          pi$...

                                                                                             (3,5)           (4,5)
  13:8           6:8                 17:11                 16:10                  14:8
                                                                                               si              i

                (8,...)       (5,...)
                ppi$...     ssippi$...


          7:8                2:1                                         8:8                                 10:8


                                                               (8,...) (5,...)                           (8,...)   (5,...)
                                                               ppi$... ssippi$...                        ppi$... ssippi$...


                                                     9:8                 3:2              11:8                  5:4



                                                                                                     .           .            .        .      .       .

     jc-yang (Osaka-U)                                             SuffixTree                                                      November 12, 2011       7 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .        .      .       .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011       8 / 25
History of Suffix Tree Algorithms
.
                                                           M. Farach. Optimal suffix tree
    First linear algorithm was introduced by Weiner            construction with large
                                                               alphabets. In focs, page
    1973 as position tree. Awarded by Donald Knuth             137. Published by the IEEE
                                                               Computer Society, 1997.
    as “Algorithm of the year 1973”.                       E. M. McCreight. A
                                                               space-economical suffix
    Greatly simplified by McCreight 1976.                       tree construction algorithm.
                                                               J. ACM, 23:262–272, April
Above two algorithms are processing string backward.           1976. ISSN 0004-5411.
                                                               doi: http://doi.acm.org/10.
                                                               1145/321941.321946. URL
    First online construction by Ukkonen 1995, which           http:
                                                               //doi.acm.org/10.
    is easier to understand.                                   1145/321941.321946.
                                                           E. Ukkonen. On-line
                                                               construction of suffix trees.
Above algorithms assume size of alphabet as fixed               Algorithmica, 14:249–260,
                                                               1995. ISSN 0178-4617.
constant .                                                     URL
                                                               http://dx.doi.org/10.
    Limitation was break by Farach 1997, optimal for           1007/BF01206331.
                                                               10.1007/BF01206331.
    all alphabets.                                         P. Weiner. Linear pattern
                                                               matching algorithms. In
                                                               Switching and Automata
Further study are continued to scale to scenarios when         Theory, 1973. SWAT ’08.
                                                               IEEE Conference Record of
the whole suffix tree or even input string cannot fit into       14th Annual Symposium on,
                                                               pages 1 –11, oct. 1973. doi:
memory.                                           .   .    .
                                                               10.1109/SWAT.1973.13.
                                                                    .        .        .

     jc-yang (Osaka-U)             SuffixTree                   November 12, 2011          9 / 25
Backward Construction of Suffix Tree
   1                       2                        3                                4


  banana$            banana$ anana$      banana$ anana$       nana$        banana$ ana nana$



                                                                                    na$ $




                 7                              6                               5


         banana$ a na          $      banana$       a    na           banana$ ana                na



              na $         na$ $         na $           na$ $             na$ $                na$ $


        na$ $                         na$ $


                                                                      .     .            .         .       .     .

       jc-yang (Osaka-U)                        SuffixTree                                    November 12, 2011   10 / 25
Construct Suffix Tree by Sorting Suffix
Suffix :                  Sorted suffix :          Tree of sorted suffix :
mississippi              i                       | -i - >| -
ississippi               ippi                    |       | - ppi
ssissippi                issippi                 |       | - ssi - >| - ppi
sissippi                 ississippi              |                   | - ssippi
issippi                  mississippi             | - mississippi
ssippi                   pi                      | -p - >| - i
sippi                    ppi                     |       | - pi
ippi                     sippi                   | -s - >| -i - - >| - ppi
ppi                      sissippi                        |         | - ssippi
pi                       ssippi                          | - si - >| - ppi
i                        ssissippi                                 | - ssippi
Time complexity will be O(N 2 log N) .
Space complexity will be O(N 2 ) .
                                                        .   .    .         .       .     .

     jc-yang (Osaka-U)               SuffixTree                       November 12, 2011   11 / 25
Naïve Algorithm

S UFFIX T REE(string)
1 for i ← 1 to length(string)
2       do U PDATE(treei )    £ Phrase i

U PDATE(treei )
1 for j ← 1 to i
2       do node ← treei .F IND(suffix[j to i − 1])
3          E XTEND(node, string[i])       £ Extension j

Time complexity will be O(N 3 ) .
Space complexity will be O(N 2 ) .
The challenge is to make sure treei is updated to treei+1 efficiently.
                                                .   .    .         .       .     .

     jc-yang (Osaka-U)           SuffixTree                   November 12, 2011   12 / 25
Suffix Extend Cases of Naïve Algorithm
.
     Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf.
         Sj...i = . . . na                                       Sj...i+1 = . . . nan

                    na                                                  nan

    Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to
              the next char in the edge, split that edge, create a internal node, add a new
              edge to a new leaf.
         Sj...i = . . . na                                        Sj...i+1 = . . . nay
                                                                                         n
                    nan                                                 na
                                                                                         y
    Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the
               next char in the edge, do nothing, extenstion has done.
          Sj...i = . . . na                                      Sj...i+1 = . . . nan
                    nan                                                 nan

                                                              .     .     .         .        .    .

       jc-yang (Osaka-U)                   SuffixTree                          November 12, 2011   13 / 25
Naïve Online Construction of Suffix Tree
   1                   2            3                                  4


       b           ba      a     ban an n                       bana ana na



               7                     6                             5


       banana$ a na        $   banana anana     nana           banan anan nan



           na $        na$ $



       na$ $


                                                       .   .       .         .       .     .

   jc-yang (Osaka-U)                SuffixTree                          November 12, 2011   14 / 25
Properties of Suffix Tree
.
    . Each update will add exactly 1
    1

      leaf node .
                                                                                          0:6
              nr_leaf = N
    . Suffix tree is full tree.
    2
                                                                           banana$ a              na          $
              Each internal node has at least 2
              children.                                      1:0                    8:6             6:6               10:6
              nr_internal < N
              nr_node < 2N                                                 na        $                  na$       $
    . Worst case Fabonacci word
    3
                                                             4:6                    9:6             3:2               7:6
              abaababaabaab
    . Suffix is either explicit or implicit.
    4
                                                             na$       $

              Explicit when it ends at a node.         2:1             5:6
              Implicit when it ends in the
              middle of an edge.

                                                                   .            .         .         .         .        .

        jc-yang (Osaka-U)                  SuffixTree                                          November 12, 2011         15 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .         .       .     .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011   16 / 25
Optimization of Naïve Algorithm
.
     . Substrings can be represented as (start, end) pair of their index in
     1

       orignal string.
                    Reduce space complexity to O(N) if size of alphabet is fixed constant.
     . Once a leaf, Always a leaf
     2

                    Represent edge that links to a leaf as (start, · · · ).
                    Extend leaf nodes for free. We do not need Extend Case I.
                               0:6
                                                                                                                   0:6


                    banana$ a        na         $                                                    (1,2)        (0,...)  (6,...)           (2,4)
                                                                                                       a        banana$... $...               na

          1:0            8:6          6:6               10:6                      8:6                 1:0                   10:6                    6:6


                                                                                 (6,...)   (2,4)                                                   (6,...)   (4,...)
                    na    $               na$       $                             $...      na                                                      $...     na$...


          4:6            9:6          3:2               7:6                9:6                 4:6                                     7:6                    3:2


                                                                                           (6,...)    (4,...)
          na$   $                                                                           $...      na$...


                                                                                   5:6                  2:1
    2:1         5:6                                                                        .            .            .             .           .              .

           jc-yang (Osaka-U)                                   SuffixTree                                                 November 12, 2011                    17 / 25
Active Point

                                     During a phrase, if we meet Extend Case III, that is if we found
                                     S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in
                                     ∀suffix[k . . . i], k ∈ j . . . i.
                                     Thus Case III is a sign that means update of this phrase is finished.
                                     During phrase i if we stopped at suffix[k . . . i] by Case III, then in
                                     next phrase we can start from suffix[k . . . i + 1] because all suffix
                                     start with 1 . . . k − 1 will end at Case I.
                                     We called this point(current internal node, current position k in
                                     string) as Active Point.
a                    ab                       aba                      abaa                   abaab                    abaaba                                  abaabab                                              abaababa                                              abaababaa                                           abaababaab                                          abaababaaba                                                         abaababaabaa                                                                       abaababaaba ab


0:0               0:1                       0:2                      0:3                      0:4                      0:5                                        0:6                                                   0:7                                                 0:8                                                0:9                                                 0:10                                                                             0:11                                                                          0:12


 (0,...)         (0,...)   (1,...)         (0,...)   (1,...)         (0,1)    (1,...)     (0,1)        (1,...)     (0,1)         (1,...)                       (0,1)      (1,3)                                     (0,1)      (1,3)                                       (0,1)    (1,3)                                     (0,1)     (1,3)                                     (0,1)     (1,3)                                                                   (0,1)       (1,3)                                                            (0,1)         (1,3)
  a...            ab...     b...           aba...     ba...            a      baa...        a         baab...        a          baaba...                         a         ba                                         a         ba                                           a       ba                                         a        ba                                         a        ba                                                                       a          ba                                                                a            ba


1:0        1:0              2:1      1:0              2:1      3:3              2:1     3:3             2:1      3:3             2:1                    3:3               7:6                                 3:3              7:6                                  3:3             7:6                                 3:3             7:6                                 3:3             7:6                                                              3:3                  7:6                                                   3:3                          7:6


                                                                (3,...)       (1,...)    (3,...)       (1,...)    (3,...)     (1,...)             (3,...) (1,3)              (3,...)   (6,...)          (3,...) (1,3)             (3,...)   (6,...)           (3,...) (1,3)             (3,...)   (6,...)       (3,...)  (1,3)            (3,...)     (6,...)       (3,...)  (1,3)             (3,...)     (6,...)                                    (3,6) (1,3)                  (3,6)       (6,...)                             (3,6) (1,3)                        (3,6)         (6,...)
                                                                 a...         baa...      ab...       baab...     aba...     baaba...            abab... ba                 abab...     b...           ababa... ba               ababa...    ba...          ababaa... ba              ababaa...   baa...      ababaab... ba             ababaab...   baab...     ababaaba... ba             ababaaba...   baaba...                                     aba   ba                     aba       baabaa...                             aba ba                             aba        baabaab...


                                                               4:3              1:0     4:3             1:0      4:3             1:0       4:3          5:6               2:1          8:6       4:3          5:6              2:1          8:6       4:3           5:6              2:1          8:6       4:3          5:6                2:1       8:6       4:3         5:6                 2:1       8:6                        13:11                   5:6                11:11           8:6                   13:11             5:6                         11:11           8:6


                                                                                                                                                     (3,...)    (6,...)                                    (3,...) (6,...)                                       (3,...) (6,...)                                     (3,...)   (6,...)                                    (3,...)   (6,...)                                      (11,...) (6,...)         (3,6)      (6,...)       (11,...)     (6,...)            (11,...) (6,...)  (3,6)               (6,...)       (11,...)      (6,...)
                                                                                                                                                    abab...      b...                                     ababa... ba...                                       ababaa... baa...                                    ababaab... baab...                                  ababaaba... baaba...                                        a... baabaa...          aba      baabaa...        a...      baabaa...            ab... baabaab... aba               baabaab...       ab...      baabaab...


                                                                                                                                                 1:0              6:6                                  1:0              6:6                                   1:0             6:6                                 1:0                 6:6                             1:0                 6:6                            14:11        4:3            9:11            6:6           12:11           2:1     14:11         4:3           9:11              6:6           12:11           2:1


                                                                                                                                                                                                                                                                                                                                                                                                                                                (11,...)      (6,...)                                                              (11,...)     (6,...)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  a...       baabaa...                                                              ab...     baabaab...


                                                                                                                                                                                                                                                                                                                                                                                                                                             10:11            1:0                                                              10:11             1:0




                                                                                                                                                                                                                                                                                                                                                                            .                                        .                                .                                        .                               .                                            .

                                     jc-yang (Osaka-U)                                                                                                                                                                                                       SuffixTree                                                                                                                                                                                      November 12, 2011                                                                                                 18 / 25
Ukk’s Update using Active Point

U PDATE(treei )
 1 current_suffix ← active_point
 2 next_char ← string[i]
 3 while True
 4      do if there exists edge start with next_char
 5              then break       £ Case III
 6              else
 7                   split current edge if implicit
 8                   create new leaf with new edge labelled next_char
 9          if current_suffix is empty
10              then break
11              else current_suffix ← next shorter suffix
12 active_point ← current_suffix
                                              .   .   .         .       .     .

     jc-yang (Osaka-U)          SuffixTree                 November 12, 2011   19 / 25
Suffix Link to find next shorter suffix
            Suffix link
                             Internal node of suffix X α has a link to node α .
                             If α is empty, suffix link points to root.
            How to create suffix link
                             Link together every internal nodes that are created by splitting in
                             same phrase.
                            banana$                                                     banana$

                             0:6                                                        0:6


            (1,2) (0,...)  (6,...)               (2,4)                  (0,...)  (1,2)             (6,...)     (2,4)
              a banana$... $...                   na                  banana$...   a                $...        na


 8:6                 1:0               10:6         6:6         1:0              8:6               10:6            6:6


  (6,...)         (2,4)                       (6,...) (4,...)            (6,...) (2,4)                       (6,...) (4,...)
   $...            na                          $... na$...                $... na                             $... na$...


 9:6                 4:6               7:6          3:2         9:6              4:6               7:6             3:2


                  (6,...)    (4,...)                                          (6,...)    (4,...)
                   $...      na$...                                            $...      na$...


            5:6               2:1                                       5:6               2:1
                                                                                                                               .   .   .         .       .     .

             jc-yang (Osaka-U)                                                                SuffixTree                                    November 12, 2011   20 / 25
Fast jump using Suffix Link


   Assume we are at Suffix X αβ , whose parent internal node
   represent X α .
      1. Go back to parent internal node,
      2. Jumping follow the node’s suffix link to the node represent Suffix
         α
      3. Go down to Suffix αβ .

   Even jump down in step 3 because we already know length of β .
   (Skip/Count trick)
   Combining all these tricks we can Extend a character in O(1)
   time.


                                                .    .   .         .       .     .

   jc-yang (Osaka-U)             SuffixTree                   November 12, 2011   21 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .         .       .     .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011   22 / 25
Experiment – mississippi
.

                        m

                        0:0


                         (0,...)
                          m...


                        1:0




                                               .   .   .         .       .     .

    jc-yang (Osaka-U)              SuffixTree               November 12, 2011   23 / 25
Experiment – mississippi
.
                             mi

                         0:1


                        (1,...)   (0,...)
                          i...     mi...


               2:1                    1:0




                                               .   .   .         .       .     .

    jc-yang (Osaka-U)              SuffixTree               November 12, 2011   23 / 25
Experiment – mississippi
.
                             mis

                              0:2


                        (1,...) (2,...)        (0,...)
                         is... s...            mis...


        2:1                   3:2                     1:0




                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                     SuffixTree                     November 12, 2011   23 / 25
Experiment – mississippi
.
                            miss

                              0:3


                        (1,...) (2,...)        (0,...)
                        iss... ss...           miss...


        2:1                   3:2                     1:0




                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                     SuffixTree                     November 12, 2011   23 / 25
Experiment – mississippi
.
                             miss i

                              0:4


                        (1,...) (0,...)        (2,3)
                        issi... missi...         s


         2:1                  1:0                      4:4


                                               (4,...) (3,...)
                                                 i...   si...


                             5:4                       3:2




                                                                 .   .   .         .       .     .

    jc-yang (Osaka-U)                      SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                              miss is

                              0:5


                         (1,...) (0,...)          (2,3)
                        issis... missis...          s


         2:1                   1:0                        4:4


                                                 (4,...) (3,...)
                                                  is... sis...


                              5:4                         3:2




                                                                   .   .   .         .       .     .

    jc-yang (Osaka-U)                        SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                               miss iss

                             0:6


                         (1,...)   (0,...)          (2,3)
                        ississ... mississ...          s


         2:1                       1:0                      4:4


                                                  (4,...) (3,...)
                                                  iss... siss...


                                5:4                        3:2




                                                                    .   .   .         .       .     .

    jc-yang (Osaka-U)                          SuffixTree                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                mississi

                              0:7


                          (1,...)    (0,...)           (2,3)
                        ississi... mississi...           s


         2:1                        1:0                       4:4


                                                    (4,...) (3,...)
                                                    issi... sissi...


                                 5:4                         3:2




                                                                       .   .   .         .       .     .

    jc-yang (Osaka-U)                            SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                                               mississip

                                       0:8


                        (8,...) (1,2)                   (0,...)             (2,3)
                         p...     i                   mississi...             s


    14:8                     12:8                           1:0                      4:4


                   (8,...)     (2,5)
                    p...        ssi

                                                                            (3,5)             (4,5)
    13:8                     6:8
                                                                              si                i

                   (8,...)     (5,...)
                    p...       ssip...


     7:8                     2:1                            8:8                               10:8


                                          (8,...)     (5,...)                       (8,...)      (5,...)
                                           p...       ssip...                        p...        ssip...


           9:8                          3:2                         11:8                       5:4




                                                                                                           .   .   .         .       .     .

    jc-yang (Osaka-U)                                                  SuffixTree                                       November 12, 2011   23 / 25
Experiment – mississippi
.
                                              mississip p

                                       0:9


                        (8,...) (1,2)                   (0,...)             (2,3)
                        pp...     i                   mississi...             s


    14:8                     12:8                           1:0                      4:4


                   (8,...)     (2,5)
                   pp...        ssi

                                                                            (3,5)             (4,5)
    13:8                     6:8
                                                                              si                i

                   (8,...)      (5,...)
                   pp...       ssipp...


     7:8                     2:1                            8:8                               10:8


                                          (8,...)      (5,...)                      (8,...)       (5,...)
                                          pp...       ssipp...                      pp...        ssipp...


           9:8                          3:2                         11:8                       5:4




                                                                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                                                  SuffixTree                                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                                       mississippi

                                                        0:10


                                              (1,2)       (0,...)         (8,9)             (2,3)
                                                i       mississi...         p                 s


                             12:8              1:0                       15:10                    4:4


               (2,5)                (8,...)                           (9,...) (10,...)
                ssi                 ppi...                             pi...    i...

                                                                                         (3,5)          (4,5)
    6:8                            13:8               14:8                   16:10
                                                                                           si             i

     (8,...)            (5,...)
     ppi...            ssippi...


    7:8                     2:1                                        8:8                                 10:8


                                                                (8,...) (5,...)                         (8,...)    (5,...)
                                                                ppi... ssippi...                        ppi...    ssippi...


                                              9:8                      3:2                 11:8                   5:4




                                                                                                                        .     .   .         .       .     .

      jc-yang (Osaka-U)                                                              SuffixTree                                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                                        mississippi$

                                                                0:11


                                (1,2)                     (0,...) (8,9)            (11,...)      (2,3)
                                  i                     mississi... p                $...          s


                      12:8                        1:0          15:10                  18:11              4:4


                (8,...) (2,5)          (11,...)                   (10,...)         (9,...)
                ppi$... ssi              $...                       i$...          pi$...

                                                                                                 (3,5)         (4,5)
    13:8             6:8                 17:11                 16:10                  14:8
                                                                                                   si            i

                    (8,...)       (5,...)
                    ppi$...     ssippi$...


              7:8                2:1                                         8:8                               10:8


                                                                   (8,...) (5,...)                       (8,...)   (5,...)
                                                                   ppi$... ssippi$...                    ppi$... ssippi$...


                                                         9:8                 3:2              11:8                5:4




                                                                                                                        .     .   .         .       .     .

           jc-yang (Osaka-U)                                                         SuffixTree                                        November 12, 2011   23 / 25
Time Complexity Analysis
$
i
p
p
i
s
s
i
s
s
i
m
 .   m        i          s   s   i   s     s         i   p       p       i          $
Time complexity is 2N = O(N) .
                                                             .       .       .          .      .     .

     jc-yang (Osaka-U)                   SuffixTree                               November 12, 2011   24 / 25
Experiment – English text
.

                           16
                                                                "data" using 1:3
                           14
Construction Time (sec.)




                           12

                           10

                           8

                           6

                           4

                           2

                           0
                                0      20000 40000   60000 80000 100000 120000 140000
                                                     File size (byte)
                                                                             .     .   .         .       .     .

                           jc-yang (Osaka-U)               SuffixTree                       November 12, 2011   25 / 25

Mais conteúdo relacionado

Mais de Jiachen Yang

データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性Jiachen Yang
 
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Jiachen Yang
 
チェックリストと分割に基づく 網羅と使用テスト
チェックリストと分割に基づく  網羅と使用テストチェックリストと分割に基づく  網羅と使用テスト
チェックリストと分割に基づく 網羅と使用テストJiachen Yang
 
Active Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsActive Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsJiachen Yang
 
Inference and Checking of Object Ownership
Inference  and  Checking  of  Object OwnershipInference  and  Checking  of  Object Ownership
Inference and Checking of Object OwnershipJiachen Yang
 
基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究Jiachen Yang
 
Output fica.beamer.43
Output fica.beamer.43Output fica.beamer.43
Output fica.beamer.43Jiachen Yang
 

Mais de Jiachen Yang (8)

データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性
 
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
 
チェックリストと分割に基づく 網羅と使用テスト
チェックリストと分割に基づく  網羅と使用テストチェックリストと分割に基づく  網羅と使用テスト
チェックリストと分割に基づく 網羅と使用テスト
 
Active Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsActive Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly Reports
 
Inference and Checking of Object Ownership
Inference  and  Checking  of  Object OwnershipInference  and  Checking  of  Object Ownership
Inference and Checking of Object Ownership
 
基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究
 
Output fica.beamer.43
Output fica.beamer.43Output fica.beamer.43
Output fica.beamer.43
 
Cloud sim report
Cloud sim reportCloud sim report
Cloud sim report
 

Último

CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10
CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10
CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10uasjlagroup
 
PORTAFOLIO 2024_ ANASTASIYA KUDINOVA
PORTAFOLIO   2024_  ANASTASIYA  KUDINOVAPORTAFOLIO   2024_  ANASTASIYA  KUDINOVA
PORTAFOLIO 2024_ ANASTASIYA KUDINOVAAnastasiya Kudinova
 
Mookuthi is an artisanal nose ornament brand based in Madras.
Mookuthi is an artisanal nose ornament brand based in Madras.Mookuthi is an artisanal nose ornament brand based in Madras.
Mookuthi is an artisanal nose ornament brand based in Madras.Mookuthi
 
昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档
昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档
昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)jennyeacort
 
MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...
MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...
MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...katerynaivanenko1
 
Design Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryDesign Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryWilliamVickery6
 
FiveHypotheses_UIDMasterclass_18April2024.pdf
FiveHypotheses_UIDMasterclass_18April2024.pdfFiveHypotheses_UIDMasterclass_18April2024.pdf
FiveHypotheses_UIDMasterclass_18April2024.pdfShivakumar Viswanathan
 
How to Empower the future of UX Design with Gen AI
How to Empower the future of UX Design with Gen AIHow to Empower the future of UX Design with Gen AI
How to Empower the future of UX Design with Gen AIyuj
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdfSwaraliBorhade
 
Design principles on typography in design
Design principles on typography in designDesign principles on typography in design
Design principles on typography in designnooreen17
 
专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree
专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree
专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degreeyuu sss
 
Pharmaceutical Packaging for the elderly.pdf
Pharmaceutical Packaging for the elderly.pdfPharmaceutical Packaging for the elderly.pdf
Pharmaceutical Packaging for the elderly.pdfAayushChavan5
 
办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一
办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一
办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一z xss
 
8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR
8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR
8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCRdollysharma2066
 
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,Aginakm1
 
Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025Rndexperts
 
group_15_empirya_p1projectIndustrial.pdf
group_15_empirya_p1projectIndustrial.pdfgroup_15_empirya_p1projectIndustrial.pdf
group_15_empirya_p1projectIndustrial.pdfneelspinoy
 
办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一
办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一
办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一F dds
 
Call Girls Aslali 7397865700 Ridhima Hire Me Full Night
Call Girls Aslali 7397865700 Ridhima Hire Me Full NightCall Girls Aslali 7397865700 Ridhima Hire Me Full Night
Call Girls Aslali 7397865700 Ridhima Hire Me Full Nightssuser7cb4ff
 

Último (20)

CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10
CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10
CREATING A POSITIVE SCHOOL CULTURE CHAPTER 10
 
PORTAFOLIO 2024_ ANASTASIYA KUDINOVA
PORTAFOLIO   2024_  ANASTASIYA  KUDINOVAPORTAFOLIO   2024_  ANASTASIYA  KUDINOVA
PORTAFOLIO 2024_ ANASTASIYA KUDINOVA
 
Mookuthi is an artisanal nose ornament brand based in Madras.
Mookuthi is an artisanal nose ornament brand based in Madras.Mookuthi is an artisanal nose ornament brand based in Madras.
Mookuthi is an artisanal nose ornament brand based in Madras.
 
昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档
昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档
昆士兰大学毕业证(UQ毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
Call Us ✡️97111⇛47426⇛Call In girls Vasant Vihar༒(Delhi)
 
MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...
MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...
MT. Marseille an Archipelago. Strategies for Integrating Residential Communit...
 
Design Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryDesign Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William Vickery
 
FiveHypotheses_UIDMasterclass_18April2024.pdf
FiveHypotheses_UIDMasterclass_18April2024.pdfFiveHypotheses_UIDMasterclass_18April2024.pdf
FiveHypotheses_UIDMasterclass_18April2024.pdf
 
How to Empower the future of UX Design with Gen AI
How to Empower the future of UX Design with Gen AIHow to Empower the future of UX Design with Gen AI
How to Empower the future of UX Design with Gen AI
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf
 
Design principles on typography in design
Design principles on typography in designDesign principles on typography in design
Design principles on typography in design
 
专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree
专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree
专业一比一美国亚利桑那大学毕业证成绩单pdf电子版制作修改#真实工艺展示#真实防伪#diploma#degree
 
Pharmaceutical Packaging for the elderly.pdf
Pharmaceutical Packaging for the elderly.pdfPharmaceutical Packaging for the elderly.pdf
Pharmaceutical Packaging for the elderly.pdf
 
办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一
办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一
办理(UC毕业证书)查尔斯顿大学毕业证成绩单原版一比一
 
8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR
8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR
8377877756 Full Enjoy @24/7 Call Girls in Nirman Vihar Delhi NCR
 
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
 
Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025
 
group_15_empirya_p1projectIndustrial.pdf
group_15_empirya_p1projectIndustrial.pdfgroup_15_empirya_p1projectIndustrial.pdf
group_15_empirya_p1projectIndustrial.pdf
 
办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一
办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一
办理学位证(SFU证书)西蒙菲莎大学毕业证成绩单原版一比一
 
Call Girls Aslali 7397865700 Ridhima Hire Me Full Night
Call Girls Aslali 7397865700 Ridhima Hire Me Full NightCall Girls Aslali 7397865700 Ridhima Hire Me Full Night
Call Girls Aslali 7397865700 Ridhima Hire Me Full Night
 

Ukk's Algorithm of Suffix Tree

  • 1. . Algorithm of Suffix Tree by farseerfc@gmail.com . Jiachen Yang1 1 Research Student in Osaka University November 12, 2011 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 1 / 25
  • 2. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 2 / 25
  • 3. What can we do with suffix tree? . Find properties of the strings Linear algorithms for exact string . Find the longest common substrings of the 1 matching string Si and Sj in θ (ni + nj ) time . like KMP . Find all maximal pairs, maximal repeats or 2 Search for strings supermaximal repeats in θ (n + z) time, if there . Check if a string P of length m is 1 are z such repeats . . Find the Lempel-Ziv decomposition in θ (n) 3 a substring in O(m) time. . Find all z occurrences of the 2 time . . Find the longest repeated substrings in θ (n) patterns P1 , · · · , Pq of total 4 length m as substrings in time. . Find the most frequently occurring substrings 5 O(m + z) time. . Search for a regular expression P 3 of a minimum length in θ (n) time. . Find the shortest strings from ∑ that do not 6 in time expected sublinear in n . . Find for each suffix of a pattern 4 occur in D, in O(n + z) time, if there are z such P, the length of the longest strings. . Find the shortest substrings occurring only 7 match between a prefix of P[i . . . m] and a substring in D in once in θ (n) time. . Find, for each i, the shortest substrings of Si θ (m) time . 8 . ... 5 not occurring elsewhere in D in θ (n) time. . ... 9 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 3 / 25
  • 4. Trie, Radix Tree, Suffix Trie & Suffix Tree trie1 is a dictionary tree. (AKA prefix tree) stores a set of words. each node represents a character except that root is empty string. words with common prefix share same parent nodes. minimal deterministic finite automaton that accepts all words. radix tree is a trie with compressed chain of nodes. (AKA patricia trie or radix trie) Each internal node has at least 2 children. suffix trie is a trie which stores all suffix of a given string. suffix tree is a suffix radix tree. that enables linear time construction and fast algorithms of other problems on a string. 1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but pronounced /traI/ “try” by other authors . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 4 / 25
  • 5. Trie & Radix Tree . Trie of “A to tea ted ten i in inn” . . Radix tree example . . . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 5 / 25
  • 6. Suffix Trie & Suffix Tree of “banana” ^ b a n $ 0:6 a n $ a banana$ a na $ 1:0 8:6 6:6 10:6 n a n $ na $ na$ $ a n $ a 4:6 9:6 3:2 7:6 n a $ na$ $ a $ 2:1 5:6 $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 6 / 25
  • 7. Suffix Tree of “mississippi” mississippi$ 0:11 (1,2) (0,...) (8,9) (11,...) (2,3) i mississi... p $... s 12:8 1:0 15:10 18:11 4:4 (8,...) (2,5) (11,...) (10,...) (9,...) ppi$... ssi $... i$... pi$... (3,5) (4,5) 13:8 6:8 17:11 16:10 14:8 si i (8,...) (5,...) ppi$... ssippi$... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi$... ssippi$... ppi$... ssippi$... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 7 / 25
  • 8. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 8 / 25
  • 9. History of Suffix Tree Algorithms . M. Farach. Optimal suffix tree First linear algorithm was introduced by Weiner construction with large alphabets. In focs, page 1973 as position tree. Awarded by Donald Knuth 137. Published by the IEEE Computer Society, 1997. as “Algorithm of the year 1973”. E. M. McCreight. A space-economical suffix Greatly simplified by McCreight 1976. tree construction algorithm. J. ACM, 23:262–272, April Above two algorithms are processing string backward. 1976. ISSN 0004-5411. doi: http://doi.acm.org/10. 1145/321941.321946. URL First online construction by Ukkonen 1995, which http: //doi.acm.org/10. is easier to understand. 1145/321941.321946. E. Ukkonen. On-line construction of suffix trees. Above algorithms assume size of alphabet as fixed Algorithmica, 14:249–260, 1995. ISSN 0178-4617. constant . URL http://dx.doi.org/10. Limitation was break by Farach 1997, optimal for 1007/BF01206331. 10.1007/BF01206331. all alphabets. P. Weiner. Linear pattern matching algorithms. In Switching and Automata Further study are continued to scale to scenarios when Theory, 1973. SWAT ’08. IEEE Conference Record of the whole suffix tree or even input string cannot fit into 14th Annual Symposium on, pages 1 –11, oct. 1973. doi: memory. . . . 10.1109/SWAT.1973.13. . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 9 / 25
  • 10. Backward Construction of Suffix Tree 1 2 3 4 banana$ banana$ anana$ banana$ anana$ nana$ banana$ ana nana$ na$ $ 7 6 5 banana$ a na $ banana$ a na banana$ ana na na $ na$ $ na $ na$ $ na$ $ na$ $ na$ $ na$ $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 10 / 25
  • 11. Construct Suffix Tree by Sorting Suffix Suffix : Sorted suffix : Tree of sorted suffix : mississippi i | -i - >| - ississippi ippi | | - ppi ssissippi issippi | | - ssi - >| - ppi sissippi ississippi | | - ssippi issippi mississippi | - mississippi ssippi pi | -p - >| - i sippi ppi | | - pi ippi sippi | -s - >| -i - - >| - ppi ppi sissippi | | - ssippi pi ssippi | - si - >| - ppi i ssissippi | - ssippi Time complexity will be O(N 2 log N) . Space complexity will be O(N 2 ) . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 11 / 25
  • 12. Naïve Algorithm S UFFIX T REE(string) 1 for i ← 1 to length(string) 2 do U PDATE(treei ) £ Phrase i U PDATE(treei ) 1 for j ← 1 to i 2 do node ← treei .F IND(suffix[j to i − 1]) 3 E XTEND(node, string[i]) £ Extension j Time complexity will be O(N 3 ) . Space complexity will be O(N 2 ) . The challenge is to make sure treei is updated to treei+1 efficiently. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 12 / 25
  • 13. Suffix Extend Cases of Naïve Algorithm . Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf. Sj...i = . . . na Sj...i+1 = . . . nan na nan Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to the next char in the edge, split that edge, create a internal node, add a new edge to a new leaf. Sj...i = . . . na Sj...i+1 = . . . nay n nan na y Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the next char in the edge, do nothing, extenstion has done. Sj...i = . . . na Sj...i+1 = . . . nan nan nan . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 13 / 25
  • 14. Naïve Online Construction of Suffix Tree 1 2 3 4 b ba a ban an n bana ana na 7 6 5 banana$ a na $ banana anana nana banan anan nan na $ na$ $ na$ $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 14 / 25
  • 15. Properties of Suffix Tree . . Each update will add exactly 1 1 leaf node . 0:6 nr_leaf = N . Suffix tree is full tree. 2 banana$ a na $ Each internal node has at least 2 children. 1:0 8:6 6:6 10:6 nr_internal < N nr_node < 2N na $ na$ $ . Worst case Fabonacci word 3 4:6 9:6 3:2 7:6 abaababaabaab . Suffix is either explicit or implicit. 4 na$ $ Explicit when it ends at a node. 2:1 5:6 Implicit when it ends in the middle of an edge. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 15 / 25
  • 16. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 16 / 25
  • 17. Optimization of Naïve Algorithm . . Substrings can be represented as (start, end) pair of their index in 1 orignal string. Reduce space complexity to O(N) if size of alphabet is fixed constant. . Once a leaf, Always a leaf 2 Represent edge that links to a leaf as (start, · · · ). Extend leaf nodes for free. We do not need Extend Case I. 0:6 0:6 banana$ a na $ (1,2) (0,...) (6,...) (2,4) a banana$... $... na 1:0 8:6 6:6 10:6 8:6 1:0 10:6 6:6 (6,...) (2,4) (6,...) (4,...) na $ na$ $ $... na $... na$... 4:6 9:6 3:2 7:6 9:6 4:6 7:6 3:2 (6,...) (4,...) na$ $ $... na$... 5:6 2:1 2:1 5:6 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 17 / 25
  • 18. Active Point During a phrase, if we meet Extend Case III, that is if we found S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in ∀suffix[k . . . i], k ∈ j . . . i. Thus Case III is a sign that means update of this phrase is finished. During phrase i if we stopped at suffix[k . . . i] by Case III, then in next phrase we can start from suffix[k . . . i + 1] because all suffix start with 1 . . . k − 1 will end at Case I. We called this point(current internal node, current position k in string) as Active Point. a ab aba abaa abaab abaaba abaabab abaababa abaababaa abaababaab abaababaaba abaababaabaa abaababaaba ab 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 0:10 0:11 0:12 (0,...) (0,...) (1,...) (0,...) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) a... ab... b... aba... ba... a baa... a baab... a baaba... a ba a ba a ba a ba a ba a ba a ba 1:0 1:0 2:1 1:0 2:1 3:3 2:1 3:3 2:1 3:3 2:1 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 (3,...) (1,...) (3,...) (1,...) (3,...) (1,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,6) (1,3) (3,6) (6,...) (3,6) (1,3) (3,6) (6,...) a... baa... ab... baab... aba... baaba... abab... ba abab... b... ababa... ba ababa... ba... ababaa... ba ababaa... baa... ababaab... ba ababaab... baab... ababaaba... ba ababaaba... baaba... aba ba aba baabaa... aba ba aba baabaab... 4:3 1:0 4:3 1:0 4:3 1:0 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 13:11 5:6 11:11 8:6 13:11 5:6 11:11 8:6 (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) abab... b... ababa... ba... ababaa... baa... ababaab... baab... ababaaba... baaba... a... baabaa... aba baabaa... a... baabaa... ab... baabaab... aba baabaab... ab... baabaab... 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 14:11 4:3 9:11 6:6 12:11 2:1 14:11 4:3 9:11 6:6 12:11 2:1 (11,...) (6,...) (11,...) (6,...) a... baabaa... ab... baabaab... 10:11 1:0 10:11 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 18 / 25
  • 19. Ukk’s Update using Active Point U PDATE(treei ) 1 current_suffix ← active_point 2 next_char ← string[i] 3 while True 4 do if there exists edge start with next_char 5 then break £ Case III 6 else 7 split current edge if implicit 8 create new leaf with new edge labelled next_char 9 if current_suffix is empty 10 then break 11 else current_suffix ← next shorter suffix 12 active_point ← current_suffix . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 19 / 25
  • 20. Suffix Link to find next shorter suffix Suffix link Internal node of suffix X α has a link to node α . If α is empty, suffix link points to root. How to create suffix link Link together every internal nodes that are created by splitting in same phrase. banana$ banana$ 0:6 0:6 (1,2) (0,...) (6,...) (2,4) (0,...) (1,2) (6,...) (2,4) a banana$... $... na banana$... a $... na 8:6 1:0 10:6 6:6 1:0 8:6 10:6 6:6 (6,...) (2,4) (6,...) (4,...) (6,...) (2,4) (6,...) (4,...) $... na $... na$... $... na $... na$... 9:6 4:6 7:6 3:2 9:6 4:6 7:6 3:2 (6,...) (4,...) (6,...) (4,...) $... na$... $... na$... 5:6 2:1 5:6 2:1 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 20 / 25
  • 21. Fast jump using Suffix Link Assume we are at Suffix X αβ , whose parent internal node represent X α . 1. Go back to parent internal node, 2. Jumping follow the node’s suffix link to the node represent Suffix α 3. Go down to Suffix αβ . Even jump down in step 3 because we already know length of β . (Skip/Count trick) Combining all these tricks we can Extend a character in O(1) time. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 21 / 25
  • 22. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 22 / 25
  • 23. Experiment – mississippi . m 0:0 (0,...) m... 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 24. Experiment – mississippi . mi 0:1 (1,...) (0,...) i... mi... 2:1 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 25. Experiment – mississippi . mis 0:2 (1,...) (2,...) (0,...) is... s... mis... 2:1 3:2 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 26. Experiment – mississippi . miss 0:3 (1,...) (2,...) (0,...) iss... ss... miss... 2:1 3:2 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 27. Experiment – mississippi . miss i 0:4 (1,...) (0,...) (2,3) issi... missi... s 2:1 1:0 4:4 (4,...) (3,...) i... si... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 28. Experiment – mississippi . miss is 0:5 (1,...) (0,...) (2,3) issis... missis... s 2:1 1:0 4:4 (4,...) (3,...) is... sis... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 29. Experiment – mississippi . miss iss 0:6 (1,...) (0,...) (2,3) ississ... mississ... s 2:1 1:0 4:4 (4,...) (3,...) iss... siss... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 30. Experiment – mississippi . mississi 0:7 (1,...) (0,...) (2,3) ississi... mississi... s 2:1 1:0 4:4 (4,...) (3,...) issi... sissi... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 31. Experiment – mississippi . mississip 0:8 (8,...) (1,2) (0,...) (2,3) p... i mississi... s 14:8 12:8 1:0 4:4 (8,...) (2,5) p... ssi (3,5) (4,5) 13:8 6:8 si i (8,...) (5,...) p... ssip... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) p... ssip... p... ssip... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 32. Experiment – mississippi . mississip p 0:9 (8,...) (1,2) (0,...) (2,3) pp... i mississi... s 14:8 12:8 1:0 4:4 (8,...) (2,5) pp... ssi (3,5) (4,5) 13:8 6:8 si i (8,...) (5,...) pp... ssipp... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) pp... ssipp... pp... ssipp... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 33. Experiment – mississippi . mississippi 0:10 (1,2) (0,...) (8,9) (2,3) i mississi... p s 12:8 1:0 15:10 4:4 (2,5) (8,...) (9,...) (10,...) ssi ppi... pi... i... (3,5) (4,5) 6:8 13:8 14:8 16:10 si i (8,...) (5,...) ppi... ssippi... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi... ssippi... ppi... ssippi... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 34. Experiment – mississippi . mississippi$ 0:11 (1,2) (0,...) (8,9) (11,...) (2,3) i mississi... p $... s 12:8 1:0 15:10 18:11 4:4 (8,...) (2,5) (11,...) (10,...) (9,...) ppi$... ssi $... i$... pi$... (3,5) (4,5) 13:8 6:8 17:11 16:10 14:8 si i (8,...) (5,...) ppi$... ssippi$... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi$... ssippi$... ppi$... ssippi$... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 35. Time Complexity Analysis $ i p p i s s i s s i m . m i s s i s s i p p i $ Time complexity is 2N = O(N) . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 24 / 25
  • 36. Experiment – English text . 16 "data" using 1:3 14 Construction Time (sec.) 12 10 8 6 4 2 0 0 20000 40000 60000 80000 100000 120000 140000 File size (byte) . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 25 / 25