String Matching Algorithms Explained

String Matching
 Definition of string matching
 Naive string-matching algorithm
 Rabin-Karp algorithm
 Finite automata
 Linear time matching using finite automata
 Knuth-Morris-Pratt algorithm
Dr. AMIT KUMAR @JUET

Outline
String Matching
 Introduction
 Naïve Algorithm

Introduction
 What is string matching?
 Finding all occurrences of a pattern in a
given text (or body of text)
 Many applications
 While using editor/word processor/browser
 Login name & password checking
 Virus detection
 Header analysis in data communications
 DNA sequence analysis

TYPES OF STRING MATCHING:-
 Exact string matching:
means finding one or all exact occurrences
of a pattern in a text.
 Naïve (brute-force) algorithm
 Boyer-Moore algorithm
 Knuth-Morris-Pratt algorithm
are exact string matching algorithms.

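 Approximate string matching
It is the technique of finding approximate
(possibly inexact) matches to a pattern in a
string.
 Karp and Rabin algorithm: it compares hash
values instead of characters, so a hash match is
only a probable match; once each hash match is
verified, the reported occurrences are exact.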

String-Matching Problem
 The text is in an array T [1..n] of length n
 The pattern is in an array P [1..m] of
length m
 Elements of T and P are characters from
a finite alphabet Σ
 E.g., Σ = {0,1} or Σ = {a, b, …, z}
 Usually T and P are called strings of
characters

String-Matching Problem
…contd
 We say that pattern P occurs with shift s
in text T if:
a) 0 ≤ s ≤ n-m and
b) T [(s+1)..(s+m)] = P [1..m]
 If P occurs with shift s in T, then s is a valid
shift, otherwise s is an invalid shift
 String-matching problem: finding all
valid shifts for a given T and P
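
The two conditions for a valid shift can be checked directly. The following is a minimal sketch, assuming Python strings stand in for the arrays T and P; the function name is_valid_shift is hypothetical. Because Python indexes from 0, the slide's T[(s+1)..(s+m)] corresponds to the slice T[s:s+m].

```python
def is_valid_shift(T, P, s):
    """Return True if P occurs in T with shift s (conditions a and b)."""
    n, m = len(T), len(P)
    # Condition a): 0 <= s <= n - m.
    if not (0 <= s <= n - m):
        return False
    # Condition b): T[(s+1)..(s+m)] = P[1..m]; in 0-based Python this is T[s:s+m].
    return T[s:s + m] == P


T = "abcabaabcabac"
P = "abaa"
print(is_valid_shift(T, P, 3))  # True: P lines up with text positions 4..7
print(is_valid_shift(T, P, 4))  # False
```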

Example 1
position:   1  2  3  4  5  6  7  8  9 10 11 12 13
text T:     a  b  c  a  b  a  a  b  c  a  b  a  c
pattern P:           a  b  a  a                     (s = 3)
shift s = 3 is a valid shift
(n=13, m=4 and 0 ≤ s ≤ n-m holds)

Example 2
position:   1  2  3  4  5  6  7  8  9 10 11 12 13
text T:     a  b  c  a  b  a  a  b  c  a  b  a  a
pattern P:           a  b  a  a                     (s = 3)
pattern P:                             a  b  a  a   (s = 9)

Terminology
 Concatenation of 2 strings x and y is xy
 E.g., x = “putra”, y = “jaya” ⇒ xy = “putrajaya”
 A string w is a prefix of a string x, if x=wy
for some string y
 E.g., “putra” is a prefix of “putrajaya”
 A string w is a suffix of a string x, if x=yw
for some string y
 E.g., “jaya” is a suffix of “putrajaya”
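
These definitions map onto built-in Python string operations. The sketch below uses the slide's own example strings; the helper names is_prefix and is_suffix are hypothetical.

```python
x, y = "putra", "jaya"
xy = x + y  # concatenation of x and y


def is_prefix(w, x):
    """w is a prefix of x if x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w, x):
    """w is a suffix of x if x = y + w for some string y."""
    return x.endswith(w)


print(xy)                            # putrajaya
print(is_prefix("putra", xy))        # True
print(is_suffix("jaya", xy))         # True
```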

Naïve String-Matching Algorithm
Input: Text strings T [1..n] and P[1..m]
Result: All valid shifts displayed
NAÏVE-STRING-MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m
    if P[1..m] = T [(s+1)..(s+m)]
        print “pattern occurs with shift” s
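
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, assuming strings for T and P and returning the list of valid shifts instead of printing them; the function name naive_string_matcher is an assumption.

```python
def naive_string_matcher(T, P):
    """Return every valid shift s, i.e. every s with T[s:s+m] == P."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):      # s = 0, 1, ..., n - m
        if T[s:s + m] == P:         # the slice comparison hides the inner loop
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
print(naive_string_matcher("abcabaabcabaa", "abaa"))  # [3, 9]
```

The two calls use the texts from Example 1 and Example 2.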

WORKING OF NAÏVE STRING
MATCHING
 The naive string‐matching procedure can be
interpreted graphically as sliding a
"template" containing the pattern over the
text, noting for which shifts all of the
characters on the template equal the
corresponding characters in the text.

Contd…
 The for loop beginning on line 3 considers
each possible shift explicitly.
 The test on line 4 determines whether the
current shift is valid or not; this test involves an
implicit loop to check corresponding character
positions until all positions match successfully
or a mismatch is found.
 Line 5 prints out each valid shift s

Analysis: Worst-case Example
position:   1  2  3  4  5  6  7  8  9 10 11 12 13
text T:     a  a  a  a  a  a  a  a  a  a  a  a  a
pattern P:  a  a  a  b   (tried at every shift s = 0, 1, …, 9; the final b never matches)

Worst-case Analysis
 There are m comparisons for each shift
in the worst case
 There are n-m+1 shifts
 So, the worst-case running time is
Θ((n-m+1)m), which is Θ(n^2) if
m = floor(n/2)
 In the example on the previous slide, we
have (13-4+1)·4 = 40 comparisons in total
 The naïve method is inefficient because
information gained at one shift is not reused
at later shifts
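
To make the count concrete, the following sketch writes the implicit inner loop out explicitly and counts character comparisons. The function name count_comparisons is hypothetical; the input is the worst-case example with thirteen a's as the text and aaab as the pattern.

```python
def count_comparisons(T, P):
    """Naive matching with an explicit inner loop; returns (shifts, comparisons)."""
    n, m = len(T), len(P)
    comparisons = 0
    shifts = []
    for s in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1          # one character comparison
            if T[s + j] != P[j]:
                break                 # mismatch: give up on this shift
            j += 1
        if j == m:                    # all m positions matched
            shifts.append(s)
    return shifts, comparisons


shifts, comparisons = count_comparisons("a" * 13, "aaab")
print(shifts, comparisons)  # [] 40
```

Each of the 10 shifts costs 4 comparisons (three matches, then the mismatch on b), which gives the 40 comparisons stated above.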

ADVANTAGES:-
 No preprocessing phase is required, so
the running time of
NAIVE-STRING-MATCHER is equal to its
matching time
 No extra space is needed.
 Also, the comparisons can be done in
any order.

Problem with naïve algorithm
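 Suppose P = ababc, T = cabababcd.
T: c a b a b a b c d
P: a …                  (s = 0: mismatch at the first character)
P:   a b a b c          (s = 1: mismatch at the fifth character)
P:     a …              (s = 2: mismatch at the first character)
P:       a b a b c      (s = 3: match)
 Whenever a mismatch occurs after several
characters have matched, the comparison restarts
in the text at the character that follows the
previous starting position, so text characters
that were already examined are compared again.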

QUESTION???
Consider a situation where all characters of the
pattern are different. Can we modify the
original naive string-matching algorithm so
that it works better for such patterns?
If we can, what changes are needed to the
original algorithm?

ANSWER:-
In the original naive string-matching algorithm, we
always slide the pattern by 1. When all characters of
the pattern are different, we can slide the pattern by
more than 1.
When a mismatch occurs after j matches, we know
that the first character of the pattern cannot match any
of the other j-1 matched text characters, because those
equal later pattern characters and all characters of the
pattern are different. So we can always slide the
pattern by j (or by 1 when j = 0) without missing any
valid shifts.
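
A minimal sketch of this modification follows, assuming the pattern really has pairwise distinct characters (the code checks this and raises an error otherwise). The function name search_distinct_pattern is hypothetical. After a full match the pattern is slid by m, which follows from the same argument.

```python
def search_distinct_pattern(T, P):
    """Naive matching that slides by j after j matches.

    Only correct when all characters of P are different.
    """
    if len(set(P)) != len(P):
        raise ValueError("all pattern characters must be different")
    n, m = len(T), len(P)
    shifts = []
    s = 0
    while s <= n - m:
        j = 0
        while j < m and T[s + j] == P[j]:
            j += 1                    # j counts matched characters
        if j == m:
            shifts.append(s)          # full match; next candidate is s + m
            s += m
        else:
            s += max(j, 1)            # slide by j, or by 1 when j = 0
    return shifts


print(search_distinct_pattern("abcabcdabcd", "abcd"))  # [3, 7]
```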

QUESTION??
HOW TO REDUCE THE
PROCESSING TIME OF NAÏVE
STRING MATCHING ??

Three exact single pattern matching
algorithms:-
 FC-RJ (First Character-Rami and Jehad)
 FLC-RJ (First and Last Characters-Rami
and Jehad)
 FMLC-RJ (First, Middle and Last
Characters-Rami and Jehad)

FC-RJ (First Character-Rami and Jehad)
 The algorithm creates a new array called
(Occurrence_List) of size (n - m + 1), where
n is the size of the text and m is the size of
the pattern. The length of the
Occurrence_List is (n - m + 1) because it is
impossible for the pattern to start after
position (n - m) in the text

 This array will hold the indices of the
occurrences of the pattern’s first character in the
text. It is filled using an integer variable (i) that
starts from (0) and is incremented by one after each match
 The algorithm scans the text in a single pass,
using an integer variable (j), and compares its
characters with the pattern’s first character. If
the current character of the text (the jth character)
is equal to the pattern's first character, the
algorithm saves the index of the current
character in the text (the value of j) in the ith
index of the Occurrence_List array and
increments i by one.
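
The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the scan stops at index n - m so that the array of size (n - m + 1) cannot overflow, and the search phase (comparing the whole pattern at each recorded index) is an assumption, since the slides do not spell it out.

```python
def build_occurrence_list(text, pattern):
    """Indices in text where the pattern's first character occurs."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    occurrence_list = [0] * (n - m + 1)  # size (n - m + 1), as described
    i = 0                                # next free slot in occurrence_list
    for j in range(n - m + 1):           # the pattern cannot start after n - m
        if text[j] == pattern[0]:
            occurrence_list[i] = j       # save index j in slot i
            i += 1
    return occurrence_list[:i]           # keep only the filled slots


def fc_rj_search(text, pattern):
    """Assumed search phase: verify the whole pattern at each recorded index."""
    m = len(pattern)
    return [j for j in build_occurrence_list(text, pattern)
            if text[j:j + m] == pattern]


print(build_occurrence_list("cabababcd", "ababc"))  # [1, 3]
print(fc_rj_search("cabababcd", "ababc"))           # [3]
```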

FLC-RJ algorithm:
 The concept of the FLC-RJ (First and Last
Characters-Rami and Jehad) algorithm
follows the concept of the FC-RJ algorithm.
 It seems more efficient to attempt
matching the pattern only with the
substrings of the text that start with the
pattern’s first character and also end with
the pattern’s last character.
 This technique decreases the number of
character comparisons in the text.
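
The first-and-last-character filter might look like the sketch below. It illustrates the idea only and is not the published implementation; the function name flc_rj_search is hypothetical.

```python
def flc_rj_search(text, pattern):
    """Keep only substrings that start and end like the pattern, then verify."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    first, last = pattern[0], pattern[-1]
    # Candidate start indices: first and last characters both agree.
    candidates = [j for j in range(n - m + 1)
                  if text[j] == first and text[j + m - 1] == last]
    # Only the interior characters still need to be compared.
    return [j for j in candidates
            if text[j + 1:j + m - 1] == pattern[1:m - 1]]


print(flc_rj_search("cabababcd", "ababc"))  # [3]
```

For a two-character pattern the interior is empty, so the candidate list is already the result.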

FMLC-RJ Algorithm:-
 The FMLC-RJ algorithm adds another restriction on a
substring of the text before it is considered an expected
occurrence of the pattern.
 It seems more efficient to attempt matching the pattern
only with the substrings of the text that start with the
pattern’s first character, end with the pattern’s last
character and, at the same time, have a middle
character equal to the pattern’s middle character.
 This technique decreases the number of character
comparisons in the text during the searching phase.
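
A sketch of the three-character filter follows, again as an illustration and not the published implementation. The function name fmlc_rj_search is hypothetical, and taking index m // 2 (0-based) as the middle character is an assumption, since the slides do not define the middle of an even-length pattern.

```python
def fmlc_rj_search(text, pattern):
    """Filter on first, middle and last characters, then verify the rest."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    mid = m // 2  # assumed definition of the middle position
    candidates = [j for j in range(n - m + 1)
                  if text[j] == pattern[0]
                  and text[j + m - 1] == pattern[-1]
                  and text[j + mid] == pattern[mid]]
    # Verify each surviving candidate with a full comparison.
    return [j for j in candidates if text[j:j + m] == pattern]


print(fmlc_rj_search("abcabaabcabaa", "abaa"))  # [3, 9]
```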

RESULTS:-
 The naïve string-matching algorithm
performed best when the pattern was
relatively short. Since the algorithm
compares up to m characters at each index
of the text, the execution time increases
as m gets larger.
 The FLC-RJ algorithm performed best
when the pattern was two characters long,
since in that case the algorithm only
outputs the content of the
Occurrence_List array.

Contd…
 The FMLC-RJ algorithm performed best
when the pattern was three characters long.
The algorithm searches for the first, middle and
last characters of the pattern and then
outputs the content of the Occurrence_List
array as the result.

[Figure: Experimental results of the FC-RJ algorithm]
[Figure: Experimental results of the FLC-RJ algorithm]
[Figure: Experimental results of the FMLC-RJ algorithm]
[Figure: Experimental results of the naïve string-matching algorithm]

CONCLUSION:-

 In the reported experiments, the FC-RJ, FLC-RJ and FMLC-RJ algorithms
outperform the brute-force algorithm.
 The proposed algorithms reduce the execution time of
string matching compared with the brute-force algorithm.
 This improvement is measured as the difference in the
execution times of the algorithms when searching for 14 pattern samples, as
recorded in Table 1.

SUMMARY
 The "naive" approach is easy to understand and
implement, but it can be too slow in some cases. If
the length of the text is n and the length of the
pattern is m, in the worst case it may take on the
order of (n * m) character comparisons to complete the task.
 It should be noted, though, that for most practical
purposes, which deal with texts based on human
languages, this approach is much faster, since the
inner loop usually finds a mismatch quickly and
breaks. A problem arises when we are faced with
different kinds of "texts," such as the genetic code.

THANK YOU
