A High-Performance Input-Aware Multiple String-Match Algorithm
Erez Buchnik
Agenda
• Problem
• Existing Solutions
• Bouma2 – Model
• Comparisons
• Preprocessing in Detail
• Future Work


                        Page 2
The Multiple String-Match Problem
• Goal: Given a set of strings and input
 text, find all occurrences of any of the
 strings in the text
• Input: Set of strings L and input text M
• Output: Offsets 1 ≤ i ≤ |M| where a
 substring of M matches any of the
 strings in L
• Uses: antivirus (AV), intrusion prevention systems (IPS), deep packet inspection (DPI), DNA search, etc.
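To make the problem statement concrete, here is a minimal baseline sketch that searches for each string independently. It is not any of the algorithms discussed in this talk; the function name `find_all_naive` is hypothetical, and offsets are 0-based here (the slide counts from 1).

```python
def find_all_naive(strings, text):
    """Return sorted (offset, string) pairs for every occurrence of any string in text."""
    hits = []
    for s in strings:
        if not s:
            continue  # an empty string would match everywhere; skip it
        start = text.find(s)
        while start != -1:
            hits.append((start, s))
            start = text.find(s, start + 1)  # step by 1 so overlapping occurrences are kept
    return sorted(hits)


print(find_all_naive(["bits", "boat", "book", "cooks"], "rabbits hate cooks"))
```

Each string costs a separate pass over the text, which is what multiple string-match algorithms try to avoid.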
                             Page 4
The Multiple String-Match Problem - References

• Aho-Corasick ’75
• Commentz-Walter ’79
• Rabin-Karp ’87
• Wu-Manber ’94
• Muth-Manber ’96
• Hopcroft-Motwani-Ullman ’00
• Dori-Landau ’06
                              Page 5
Stateful Approach (e.g. Aho-Corasick)


• One state transition per symbol
• Linear in the length of the input
• Large automata cause cache-misses and degrade performance
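For reference, a compact sketch of the stateful approach in the style of Aho-Corasick. It assumes non-empty strings and uses failure links instead of a fully expanded transition table, so a symbol may follow several failure links before its transition (still linear overall). The names `build_automaton` and `search` are illustrative.

```python
from collections import deque


def build_automaton(strings):
    """Build goto, failure and output tables for a set of non-empty strings."""
    goto, fail, out = [{}], [0], [[]]
    for s in strings:
        state = 0
        for ch in s:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(s)
    queue = deque(goto[0].values())  # depth-1 states keep failure link 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending at the failure state
    return goto, fail, out


def search(strings, text):
    """Return sorted (offset, string) pairs, offsets 0-based."""
    goto, fail, out = build_automaton(strings)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for s in out[state]:
            hits.append((i - len(s) + 1, s))
    return sorted(hits)
```

The tables grow with the total length of the string-set, which is the source of the cache pressure noted above.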
                          Page 7
Guidelines
• INTUITIVE: Search for ‘Hints’ of
 a Match Before the Full Match

• REALISTIC: Use Prior
 Knowledge of Expected Input

• SIMPLE: Trivial Match Process

                      Page 9
Bouma2: Motif-Based String Match
Set of strings: bore, core, trek, bits, corridor, boat, book, cooks
Set of selected 2-symbols long substrings: re, ek, bi, at, ok, or
• Preprocessing: Map every string to its own substring: Motif
Q1: How to select motifs?
                          Page 10
Bouma2: Motif-Based String Match (cont.)
[Figure: scanning the text “rabbits hate cooks”. Three motif occurrences are found; the full-match attempts for boat and book fail (No match), while bits and cooks are matched.]
• Match: Examine symbols 2-by-2 (STATELESS); attempt full match around motif occurrences
Q2: How to resolve collisions?
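The basic scheme can be sketched as follows. This is an illustration under assumptions, not the Bouma2 implementation: the motif offsets in `CHOICES` are picked by hand so that the motif-set equals the one on the slide (re, ek, bi, at, ok, or), and the names `build_motif_map` and `scan` are hypothetical.

```python
# Assumed mapping: string -> offset of its motif inside the string.
CHOICES = {"bore": 2, "core": 2, "trek": 2, "bits": 0,
           "corridor": 1, "boat": 2, "book": 2, "cooks": 2}


def build_motif_map(choices):
    """Map each 2-symbol motif to the (string, motif offset) pairs it stands for."""
    table = {}
    for s, k in choices.items():
        table.setdefault(s[k:k + 2], []).append((s, k))
    return table


def scan(table, text, start=0):
    """Examine text 2 symbols at a time from `start`; attempt a full match around each motif."""
    hits = []
    for i in range(start, len(text) - 1, 2):
        for s, k in table.get(text[i:i + 2], ()):
            begin = i - k
            if begin >= 0 and text.startswith(s, begin):
                hits.append((begin, s))
    return hits


table = build_motif_map(CHOICES)
text = "rabbits hate cooks"
print(scan(table, text, start=0))  # pass over even offsets
print(scan(table, text, start=1))  # pass over odd offsets
```

With one motif per string, a single 2-by-2 pass only sees motifs that land on its own parity, so two passes are needed to find everything in this text. The motifs "at" and "ok" also trigger failed attempts for boat and book.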
                                     Page 11
Capturing all Occurrences

[Figure: the text “habits of rabbits” with both occurrences of bits matched; one starts at an even offset, the other at an odd offset]

• Even-offset occurrences and odd-offset occurrences require separate passes, but instead…
                          Page 12
Upgrade #1: 2-Symbol Strides

[Figure: the text “habits of rabbits” scanned in 2-symbol strides, with three motif matches marked and both occurrences of bits found]

• We map each string TWICE: once to an even-offset motif, and once to an odd-offset motif
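A minimal sketch of the double mapping, assuming every string has at least 3 symbols and picking the motifs at offsets 0 and 1 arbitrarily (motif selection is a separate question). The names `build_table` and `scan_stride2` are hypothetical.

```python
def build_table(strings):
    """Map each string twice: to its motif at offset 0 (even) and at offset 1 (odd)."""
    table = {}
    for s in strings:
        if len(s) < 3:
            raise ValueError("this sketch assumes strings of at least 3 symbols")
        for k in (0, 1):  # arbitrary choice of one even and one odd motif offset
            table.setdefault(s[k:k + 2], []).append((s, k))
    return table


def scan_stride2(table, text):
    """Single pass over even text offsets only."""
    hits = []
    for i in range(0, len(text) - 1, 2):
        for s, k in table.get(text[i:i + 2], ()):
            begin = i - k
            if begin >= 0 and text.startswith(s, begin):
                hits.append((begin, s))
    return hits


print(scan_stride2(build_table(["bits"]), "habits of rabbits"))
```

An occurrence that starts at an even text offset is caught through its even-offset motif, and one that starts at an odd offset through its odd-offset motif, so one pass in strides of 2 is enough.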
                             Page 13
Upgrade #2: Fast-Path / Slow-Path
[Figure: the fast-path scans “habits of rabbits” and outputs the offsets 4 and 14]

 • Fast-Path:
 - Stateless
 - “Monolithic” (zero branches)
 - Cache-Aware (small direct-table)
 - SIMPLE…
                                Page 14
Upgrade #2: Fast-Path / Slow-Path
[Figure: the slow-path takes the offsets 4 and 14 from the fast-path and matches bits around each of them in “habits of rabbits”]
• Slow-Path:
 - Memory-Efficient (pointers to original strings for comparison)
 - “Localized” (separate structure for every motif)
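The split can be sketched as below, under assumptions: symbols are bytes, the direct table holds one flag per 2-byte value, and bits is mapped to the motifs "ts" (offset 2) and "it" (offset 1), which reproduces the offsets 4 and 14 from the slide. The loop body has no if-statement, which only hints at the branch-free idea; Python itself gives no such guarantee. All names are illustrative.

```python
MOTIF_TO_STRINGS = {b"ts": [(b"bits", 2)], b"it": [(b"bits", 1)]}  # assumed motif choice


def build_fast_table(motifs):
    """Direct table with one flag per possible 2-byte value (65,536 entries)."""
    table = bytearray(65536)
    for m in motifs:
        table[(m[0] << 8) | m[1]] = 1
    return table


def fast_path(table, data):
    """Collect every even offset whose 2-byte value is flagged in the table."""
    candidates = [0] * (len(data) // 2 + 1)
    count = 0
    for i in range(0, len(data) - 1, 2):
        candidates[count] = i                         # always store the offset...
        count += table[(data[i] << 8) | data[i + 1]]  # ...and keep it only on a hit
    return candidates[:count]


def slow_path(motif_to_strings, data, candidates):
    """Compare the original strings around each candidate offset."""
    hits = []
    for i in candidates:
        for s, k in motif_to_strings[bytes(data[i:i + 2])]:
            begin = i - k
            if begin >= 0 and data.startswith(s, begin):
                hits.append((begin, s))
    return hits


data = b"habits of rabbits"
offsets = fast_path(build_fast_table(MOTIF_TO_STRINGS), data)
print(offsets, slow_path(MOTIF_TO_STRINGS, data, offsets))
```

The fast-path touches only the input and one small table; everything string-specific is deferred to the slow-path, which runs only at the reported offsets.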
                                   Page 15
Bouma2 vs. Aho-Corasick
• n – length of input
• S – no. of string-matches in n
• m – no. of motif-matches in n
• l – length of the longest string
• Match Complexities:
- Aho-Corasick: $O(n + S)$
- Bouma2: $O(\frac{n}{2} + m \cdot l)$
                           Page 17
Bouma2 vs. Aho-Corasick (Speed)
[Chart: speed comparison of Bouma2 Fast-Path, Bouma2 Slow-Path (Sub-Optimal), and Aho-Corasick]

• In practice, Bouma2 is usually at least twice as fast as Aho-Corasick
• Fast-path alone is 10 times faster
Q3: How to optimize slow-path?
                                  Page 18
Bouma2 vs. Aho-Corasick (Cache)
[Chart: cache-misses of Bouma2 vs. cache-misses of Aho-Corasick]

• Bouma2 exhibits 8.5 times fewer cache-misses than Aho-Corasick (fast-path + slow-path)
                           Page 19
Bouma2 vs. Aho-Corasick (Memory)
[Chart: memory footprint of Bouma2 (Fast-Path, Slow-Path, Original Strings) vs. Aho-Corasick]

• Bouma2's memory footprint is less than 70% of Aho-Corasick's for textual search (down to 35% in other cases)
                             Page 20
Q1: How to select motifs?
                            bo  co  do  id  or  re  ri  rr
Even Offset  bo re          •                   •
             co re              •               •
             co rr id or        •       •   •           •
Odd Offset   b or e                         •
             c or e                         •
             c or ri do r           •       •       •

• A1: Out of all 2-symbol substrings, find a minimum subset that covers all given strings (even & odd offsets)
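The coverage table above can be derived mechanically. A small sketch, with hypothetical helper names `pairs_by_parity` and `coverage`:

```python
def pairs_by_parity(s):
    """Return (even, odd): the 2-symbol substrings at even and at odd offsets of s."""
    even = {s[k:k + 2] for k in range(0, len(s) - 1, 2)}
    odd = {s[k:k + 2] for k in range(1, len(s) - 1, 2)}
    return even, odd


def coverage(strings):
    """Map each 2-symbol substring to the (string, parity) requirements it can cover."""
    covers = {}
    for s in strings:
        for parity, pairs in zip(("even", "odd"), pairs_by_parity(s)):
            for t in pairs:
                covers.setdefault(t, set()).add((s, parity))
    return covers


covers = coverage(["bore", "core", "corridor"])
for t in sorted(covers):
    print(t, sorted(covers[t]))
```

Each string contributes two requirements (even and odd), and a motif-set is valid when the union of what its members cover includes all of them. Here "or" and "re" together already cover all six.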
                                    Page 22
Q1: How to select motifs?
                            bo  co  do  id  or  re  ri  rr
Even Offset  bo re          Χ                   √
             co re              Χ               √
             co rr id or        Χ       Χ   √           Χ
Odd Offset   b or e                         √
             c or e                         √
             c or ri do r           Χ       √       Χ

• But… maybe the minimum subset is not the optimal subset?

                                   Page 23
Q1: How to select motifs?
• Bad selection of motifs for English text searches: substrings of ‘the’, the most common word in English
                            at  ea  er  he  te  th
Even Offset  th ea te r         Χ           Χ   √
Odd Offset   t he at er     Χ       Χ   √

[Figure: the text “The good, the bad and the ugly” in theaters nearby, annotated with Match / No match labels as theater is compared around its motifs; the occurrence in “theaters” is the only place where theater actually appears]

                                                                 Page 24
Q1: How to select motifs?
     2-Symbol Sequence   Occurrence Probability
            bo           0.0002
            re           0.001861
            co           0.001028
            rr           0.000031
            id           0.001756
            or           0.000444
            ri           0.000284
            do           0.000151
• Use input-specific occurrence
 statistics to optimize motif-sets
• REALISTIC…
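Such statistics could be gathered from a sample of the expected input along these lines. This is a sketch only: it counts pairs at every offset, which is an assumption, and `pair_probabilities` is a hypothetical name.

```python
from collections import Counter


def pair_probabilities(sample):
    """Estimate the occurrence probability of every 2-symbol sequence in a sample."""
    counts = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


probs = pair_probabilities("the good, the bad and the ugly")
print(sorted(probs.items(), key=lambda item: -item[1])[:3])
```

The same function works on `bytes`, so it can be fed captured traffic or file contents rather than text.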
                                     Page 25
Q1: How to select motifs?
                            bo  co  do  id  or  re  ri  rr
Even Offset  bo re          √                   Χ
             co re              √               Χ
             co rr id or        √       Χ   √           Χ
Odd Offset   b or e                         √
             c or e                         √
             c or ri do r           Χ       √       Χ

• NOTE: After selecting the motif-set, remove redundant mappings from the final String-to-Motif mapping
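For a string-set this small, the selection can be checked by exhaustive search. The sketch below assumes strings of at least 3 symbols and uses the occurrence probabilities listed earlier as costs. Keeping the cheapest selected motif per string and parity is only one possible way to drop redundant mappings. Function names are hypothetical.

```python
from itertools import combinations

PROB = {"bo": 0.0002, "re": 0.001861, "co": 0.001028, "rr": 0.000031,
        "id": 0.001756, "or": 0.000444, "ri": 0.000284, "do": 0.000151}


def pairs_by_parity(s):
    even = {s[k:k + 2] for k in range(0, len(s) - 1, 2)}
    odd = {s[k:k + 2] for k in range(1, len(s) - 1, 2)}
    return even, odd


def select_motifs(strings, cost):
    """Try every subset; return the cheapest one covering each string at both parities."""
    needs = [group for s in strings for group in pairs_by_parity(s)]
    universe = sorted(set().union(*needs))
    best, best_cost = None, float("inf")
    for size in range(1, len(universe) + 1):
        for subset in combinations(universe, size):
            chosen = set(subset)
            if all(chosen & group for group in needs):
                total = sum(cost(t) for t in chosen)
                if total < best_cost:
                    best, best_cost = chosen, total
    return best


def final_mapping(strings, chosen, cost):
    """Keep one selected motif per string and parity (the cheapest one)."""
    mapping = {}
    for s in strings:
        mapping[s] = tuple(min(group & chosen, key=cost) for group in pairs_by_parity(s))
    return mapping


strings = ["bore", "core", "corridor"]
print(select_motifs(strings, lambda t: 1))  # minimum-size motif-set
chosen = select_motifs(strings, PROB.get)   # minimum expected-occurrence motif-set
print(chosen, final_mapping(strings, chosen, PROB.get))
```

With unit costs the result is the minimum subset (or, re); with the probabilities it is (bo, co, or), and corridor's even-offset mapping keeps only "or" because "co" is redundant for it.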
                                    Page 26
Statistics for Motif Selection
[Figure: two plots of occurrence counts per 2-symbol sequence (x-axis ticks 0 to 70000, y-axis: Occurrences). Top: only counts of more than 100,000 are shown, y-axis up to 10000000, labeled peaks 00 00, “rn” and FF FF. Bottom: only counts of more than 40,000 are shown, y-axis up to 35000000, labeled peaks 00 00, “??” and FF FF.]

• 2-symbol sequence statistics: IP traffic (top) vs. OS files (bottom)
                                                                         Page 27
Motif Selection as an ILP Problem
• L: a given string-set
• $T_L$: all 2-symbol substrings of strings in L
• c(t): cost-function for every t in $T_L$

Minimize $\sum_{t \in T_L} c(t) \cdot x_t$, where $x_t \in \{0,1\}$ for every $t \in T_L$

Subject To: for every $w \in L$,

$\sum_{t \in T_L} x_t \cdot assoc_0(w, t) \ge 1$, and $\sum_{t \in T_L} x_t \cdot assoc_1(w, t) \ge 1$
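As a worked instance, take the strings bore, core and corridor from the earlier slides. It assumes that $assoc_0(w, t)$ is 1 when t occurs at an even offset of w, that $assoc_1(w, t)$ is 1 when it occurs at an odd offset, and that both are 0 otherwise. The constraints would then read:

```latex
\begin{aligned}
\text{bore (even):}     \quad & x_{bo} + x_{re} \ge 1 \\
\text{bore (odd):}      \quad & x_{or} \ge 1 \\
\text{core (even):}     \quad & x_{co} + x_{re} \ge 1 \\
\text{core (odd):}      \quad & x_{or} \ge 1 \\
\text{corridor (even):} \quad & x_{co} + x_{rr} + x_{id} + x_{or} \ge 1 \\
\text{corridor (odd):}  \quad & x_{or} + x_{ri} + x_{do} \ge 1
\end{aligned}
```

With c(t) = 1 for every t, the cheapest solution sets $x_{or} = x_{re} = 1$, the minimum subset. With the occurrence probabilities listed earlier as c(t), it sets $x_{bo} = x_{co} = x_{or} = 1$, at a cost of 0.0002 + 0.001028 + 0.000444 = 0.001672, against 0.000444 + 0.001861 = 0.002305 for or and re.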


                                     Page 28
Q2: How to resolve collisions?
 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6
                 b  o  r  e
                 c  o  r  e
                 c  o  r  r  i  d  o  r
  c  o  r  r  i  d  o  r
• A2:
- Examine adjacent symbols at
 relative offsets to eliminate strings
- New structure: The Mangled-Trie
                               Page 29
The Mangled-Trie
‘or’ Motif at Offset 0
Strings: bore, core, corridor (corridor contains the motif twice)
1. Resolve: Offset -1
 - ‘b’: ‘e’ in Offset 2? YES: “bore” in Offset -1. NO: NO MATCH
 - ‘c’: go to 2
 - ‘d’: “corri” in Offset -6? YES: “corridor” in Offset -6. NO: NO MATCH
 - OTHER: NO MATCH
2. Resolve: Offset 2
 - ‘e’: “core” in Offset -1
 - ‘r’: go to 3
 - OTHER: NO MATCH
3. “idor” in Offset 3? YES: “corridor” in Offset -1. NO: NO MATCH
Example input: ...corricorridor...
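The decision structure for the ‘or’ motif can be written down by hand as nested tuples. This encoding (the node kinds "resolve", "verify" and "report", and the function names) is an assumption made for illustration, not the actual Mangled-Trie layout. For simplicity the driver visits every ‘or’ in the text rather than 2-symbol strides.

```python
# ("resolve", offset, {symbol: node})      branch on the symbol at a relative offset
# ("verify", offset, expected, string, start)  compare `expected` at `offset`, then report
# ("report", string, start)                string is fully determined, report it
OR_TRIE = ("resolve", -1, {
    "b": ("verify", 2, "e", "bore", -1),
    "c": ("resolve", 2, {
        "e": ("report", "core", -1),
        "r": ("verify", 3, "idor", "corridor", -1),
    }),
    "d": ("verify", -6, "corri", "corridor", -6),
})


def symbol_at(text, pos):
    return text[pos] if 0 <= pos < len(text) else ""


def resolve(node, text, at):
    """Walk the structure for a motif found at text offset `at`; return (start, string) or None."""
    while node is not None:
        if node[0] == "resolve":
            _, offset, branches = node
            node = branches.get(symbol_at(text, at + offset))  # OTHER: no match
        elif node[0] == "verify":
            _, offset, expected, string, start = node
            pos = at + offset
            ok = pos >= 0 and text.startswith(expected, pos)
            return (at + start, string) if ok else None
        else:  # "report"
            _, string, start = node
            return (at + start, string)
    return None


def find_with_or_motif(text):
    hits = set()
    at = text.find("or")
    while at != -1:
        hit = resolve(OR_TRIE, text, at)
        if hit:
            hits.add(hit)
        at = text.find("or", at + 1)
    return sorted(hits)


print(find_with_or_motif("...corricorridor..."))
```

On the example input the first ‘or’ is rejected at step 3, while the second and third both resolve to the same occurrence of corridor.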

                                                            Page 30
Q3: How to optimize slow-path?
• A3:
- Optimize Frequent Scenarios:
 Apply statistics to Mangled-Trie
 construction
- Improve Motif-Set Quality: Avoid
 slow-path altogether when possible


                        Page 32
More Future Work…
• Adaptive System: Collect statistics
 “on-the-go” and improve motif-set
• Faster Preprocessing: Custom
 Branch-and-Cut (Margot ’10)
• Regular Expressions
• Hardware Implementation
• Bouma3?…

                         Page 33
“ Search has always been about
 people. It's not an abstract thing.
 It's not a formula. It's about getting
 people what they need... It depends
 on the type of search you do—and
 how to take all those signals and
 put them together.”
- Udi Manber, Google, 2008
                         Page 34
Thank You
