This document discusses string matching algorithms. It begins with an introduction to the naive string matching algorithm and its quadratic runtime. Then it proposes three improved algorithms: FC-RJ, FLC-RJ, and FMLC-RJ, which attempt to match patterns by restricting comparisons based on the first, first and last, or first, middle, and last characters, respectively. Experimental results show that these three proposed algorithms outperform the naive algorithm by reducing execution time, with FMLC-RJ working best for three-character patterns.
4. Introduction
What is string matching?
Finding all occurrences of a pattern in a
given text (or body of text)
Many applications
While using editor/word processor/browser
Login name & password checking
Virus detection
Header analysis in data communications
DNA sequence analysis
Dr. AMIT KUMAR @JUET
5. TYPES OF STRING MATCHING:-
Exact string matching:
means finding one or all exact occurrences
of a pattern in a text.
Naïve (Brute force) algorithm
Boyer and Moore
Knuth, Morris and Pratt
are exact string matching
algorithms.
6. Approximate string matching
It is the technique of finding approximate
(possibly inexact) matches to a pattern in a
string
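Karp and Rabin algorithm (hash-based; hash hits are verified
character by character, so it is usually classified as exact matching)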
7. String-Matching Problem
The text is in an array T [1..n] of length n
The pattern is in an array P [1..m] of
length m
Elements of T and P are characters from
a finite alphabet Σ
E.g., Σ = {0,1} or Σ = {a, b, …, z}
Usually T and P are called strings of
characters
8. String-Matching Problem
…contd
We say that pattern P occurs with shift s
in text T if:
a) 0 ≤ s ≤ n-m and
b) T [(s+1)..(s+m)] = P [1..m]
If P occurs with shift s in T, then s is a valid
shift, otherwise s is an invalid shift
String-matching problem: finding all
valid shifts for a given T and P
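The two conditions for a valid shift can be checked directly. A minimal Python sketch follows; the name `is_valid_shift` is an assumption made for illustration. Python strings are 0-indexed, so T[(s+1)..(s+m)] in the slide notation corresponds to `T[s:s+m]`.

```python
def is_valid_shift(T: str, P: str, s: int) -> bool:
    """Return True if pattern P occurs in text T with shift s."""
    n, m = len(T), len(P)
    # Condition (a): the shifted pattern must fit inside the text.
    if not (0 <= s <= n - m):
        return False
    # Condition (b): T[(s+1)..(s+m)] = P[1..m], written with 0-based slicing.
    return T[s:s + m] == P


# Shift s = 3 for text "abcabaabcabac" and pattern "abaa".
print(is_valid_shift("abcabaabcabac", "abaa", 3))
```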
9. Example 1
text T (positions 1..13): a b c a b a a b c a b a c
pattern P (positions 1..4): a b a a
s = 3: P aligns with T[4..7] = a b a a
shift s = 3 is a valid shift
(n=13, m=4 and 0 ≤ s ≤ n-m holds)
10. Example 2
text T (positions 1..13): a b c a b a a b c a b a a
pattern P (positions 1..4): a b a a
s = 3: P aligns with T[4..7] = a b a a
s = 9: P aligns with T[10..13] = a b a a
11. Terminology
Concatenation of 2 strings x and y is xy
E.g., x = “putra”, y = “jaya” gives xy = “putrajaya”
A string w is a prefix of a string x, if x=wy
for some string y
E.g., “putra” is a prefix of “putrajaya”
A string w is a suffix of a string x, if x=yw
for some string y
E.g., “jaya” is a suffix of “putrajaya”
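These three terms map onto built-in Python string operations. A short sketch using the strings from the slide:

```python
x, y = "putra", "jaya"
xy = x + y  # concatenation of x and y

print(xy)                 # the concatenated string
print(xy.startswith(x))   # is x a prefix of xy?
print(xy.endswith(y))     # is y a suffix of xy?
```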
12. Naïve String-Matching Algorithm
Input: text T[1..n] and pattern P[1..m]
Result: All valid shifts displayed
NAÏVE-STRING-MATCHER (T, P)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n-m
    if P[1..m] = T[(s+1)..(s+m)]
      print “pattern occurs with shift” s
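The pseudocode translates almost line for line into Python. This is a minimal sketch; the function name `naive_string_matcher` and the choice to return a list instead of printing are assumptions. Shifts are reported 0-based, which matches the shift values s used in the slides.

```python
def naive_string_matcher(T: str, P: str) -> list:
    """Return every valid shift s at which pattern P occurs in text T."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):      # s = 0 .. n-m
        # Slice comparison stands in for P[1..m] = T[(s+1)..(s+m)].
        if T[s:s + m] == P:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabaa", "abaa"))  # prints [3, 9]
```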
13. WORKING OF NAÏVE STRING
MATCHING
The naive string-matching procedure can be
interpreted graphically as sliding a
"template" containing the pattern over the
text, noting for which shifts all of the
characters on the template equal the
corresponding characters in the text.
14. Contd…
The for loop beginning on line 3 considers
each possible shift explicitly.
The test on line 4 determines whether the
current shift is valid or not; this test involves an
implicit loop to check corresponding character
positions until all positions match successfully
or a mismatch is found.
Line 5 prints out each valid shift s.
15. Analysis: Worst-case Example
text T (positions 1..13): a a a a a a a a a a a a a
pattern P (positions 1..4): a a a b
P is tried at every shift s = 0, 1, …, 9; each time a a a matches and b mismatches.
16. Worst-case Analysis
There are m comparisons for each shift
in the worst case
There are n-m+1 shifts
So, the worst-case running time is
Θ((n-m+1)m), which is Θ(n²) if
m = floor(n/2)
In the example on the previous slide, we
have (13-4+1)·4 = 40 comparisons in total
Naïve method is inefficient because
information from a shift is not used again
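The comparison count can be confirmed by instrumenting the naive matcher. A small sketch follows; `count_comparisons` is a hypothetical helper that makes the implicit inner loop explicit and counts each character comparison.

```python
def count_comparisons(T: str, P: str) -> int:
    """Count the character comparisons made by the naive matcher."""
    n, m = len(T), len(P)
    comparisons = 0
    for s in range(n - m + 1):       # n-m+1 shifts
        for j in range(m):           # at most m comparisons per shift
            comparisons += 1
            if T[s + j] != P[j]:
                break                # stop at the first mismatch
    return comparisons


# Worst-case example: T = a repeated 13 times, P = aaab.
print(count_comparisons("a" * 13, "aaab"))  # prints 40
```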
17. ADVANTAGES:-
No preprocessing phase is required,
so the running time of
NAIVE-STRING-MATCHER is equal to its
matching time
No extra space is needed.
Also, the comparisons can be done in
any order.
18. Problem with naïve algorithm
Problem with Naïve algorithm:
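Suppose P = ababc, T = cabababcd.
T: c a b a b a b c d
P: a …              (s = 0, mismatch at the first character)
P:   a b a b c      (s = 1, mismatch at the fifth character)
P:     a …          (s = 2, mismatch at the first character)
P:       a b a b c  (s = 3, match)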
Whenever a mismatch occurs after several
characters have matched, the comparison restarts
in T from the character that follows the previous
starting position, so characters that were already
examined are compared again.
19. QUESTION???
Consider a situation where all characters of
the pattern are different. Can we modify the
original naive string matching algorithm so
that it works better for this type of pattern?
If we can, what changes must be made to the
original algorithm?
20. ANSWER:-
In the original naive string matching algorithm, we
always slide the pattern by 1. When all characters of
the pattern are different, we can slide the pattern by
more than 1.
When a mismatch occurs after j matches (j > 0), we know
that the first character of the pattern cannot match any of
the other j-1 matched text characters, because all characters
of the pattern are different. So we can slide the
pattern by j (or by 1 when j = 0) without missing any valid shifts.
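The modified algorithm can be sketched in Python as follows. The function name `search_distinct` is an assumption, and the code is only correct when all characters of P are distinct. Sliding by m after a full match is an extension of the same argument and is not stated on the slide.

```python
def search_distinct(T: str, P: str) -> list:
    """Naive matching with larger slides; assumes all characters of P differ."""
    n, m = len(T), len(P)
    shifts = []
    if m == 0:
        return shifts
    s = 0
    while s <= n - m:
        j = 0
        # Count how many leading characters match at this shift.
        while j < m and T[s + j] == P[j]:
            j += 1
        if j == m:
            shifts.append(s)
            s += m          # full match: P[0] cannot recur inside it
        elif j == 0:
            s += 1          # mismatch on the first character: slide by 1
        else:
            s += j          # mismatch after j matches: slide by j
    return shifts


print(search_distinct("ABCEABCDABCEABCD", "ABCD"))  # prints [4, 12]
```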
22. Three exact single pattern matching
algorithms:-
FC-RJ (First Character-Rami and Jehad)
FLC-RJ (First and Last Characters-Rami
and Jehad)
FMLC-RJ (First, Middle and Last
Characters-Rami and Jehad)
23. FC-RJ (First Character-Rami and Jehad)
The algorithm creates a new array called
Occurrence_List of size (n - m + 1), where
n is the size of the text and m is the size of
the pattern. The length of
Occurrence_List is (n - m + 1) because it is
impossible for the pattern to start after
position (n - m) in the text.
24. This array holds the indices of the
occurrences of the pattern’s first character in the
text. An integer variable (i), starting from (0)
and incremented by one after each match, marks the next free slot.
The algorithm scans the text in a single pass,
using an integer variable (j), and compares its
characters with the pattern’s first character. If
the current character of the text (the jth character)
is equal to the pattern's first character, the
algorithm saves the index of the current
character in the text (the value of j) in the ith
index of the Occurrence_List array and
increments the value of i by one.
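A minimal Python sketch of the FC-RJ idea follows. The preprocessing pass mirrors the description above. The searching phase is not described on these slides, so the final verification step is an assumption: each recorded index is treated as a candidate and the rest of the pattern is compared there. The name `fc_rj` is also illustrative.

```python
def fc_rj(T: str, P: str) -> list:
    """Sketch of FC-RJ: filter on the first character, then verify."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    # Preprocessing: Occurrence_List has room for n - m + 1 indices.
    occurrence_list = [0] * (n - m + 1)
    i = 0
    for j in range(n - m + 1):          # single pass over the text
        if T[j] == P[0]:
            occurrence_list[i] = j      # save index j in slot i
            i += 1
    # Searching (assumed): compare the remaining characters at each candidate.
    return [k for k in occurrence_list[:i] if T[k + 1:k + m] == P[1:]]


print(fc_rj("abcabaabcabaa", "abaa"))  # prints [3, 9]
```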
25. FLC-RJ algorithm:
The concept of the FLC-RJ (First and Last
Characters-Rami and Jehad) algorithm
follows the concept of the FC-RJ algorithm.
It seems more efficient to attempt
matching the pattern only with the
substrings of the text that start with the
pattern’s first character and also end with
the pattern’s last character.
This technique decreases the number of
character comparisons in the text.
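The FLC-RJ filter can be sketched the same way. This is an illustrative reading of the description, not the authors' code: `flc_rj` is a hypothetical name, and the candidate list plays the role of Occurrence_List.

```python
def flc_rj(T: str, P: str) -> list:
    """Sketch of FLC-RJ: filter on first and last characters, then verify."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    # Keep only windows that start with P's first and end with P's last character.
    candidates = [j for j in range(n - m + 1)
                  if T[j] == P[0] and T[j + m - 1] == P[-1]]
    # Verify the interior characters; for m <= 2 there is nothing left to check.
    return [j for j in candidates if T[j + 1:j + m - 1] == P[1:-1]]


print(flc_rj("abcabaabcabaa", "abaa"))  # prints [3, 9]
```

For a two-character pattern the interior is empty, so the candidate list is already the answer.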
26. FMLC-RJ Algorithm:-
The FMLC-RJ algorithm adds another restriction that a
substring of the text must satisfy to be considered an expected
occurrence of the pattern.
It seems more efficient to attempt matching the pattern
only with the substrings of the text that start with the
pattern’s first character, end with the pattern’s last
character and, at the same time, have a middle
character equal to the pattern’s middle character.
This technique decreases the number of character
comparisons in the text during the searching phase.
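A sketch of the FMLC-RJ filter in Python, under two assumptions that the slides do not state: the middle character is taken at index m // 2, and candidates are verified with a full comparison for simplicity (which re-checks the three filtered positions). The name `fmlc_rj` is illustrative.

```python
def fmlc_rj(T: str, P: str) -> list:
    """Sketch of FMLC-RJ: filter on first, middle and last characters."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    mid = m // 2   # assumed position of the "middle" character
    candidates = [j for j in range(n - m + 1)
                  if T[j] == P[0]
                  and T[j + mid] == P[mid]
                  and T[j + m - 1] == P[-1]]
    # Verification by full comparison, kept simple for the sketch.
    return [j for j in candidates if T[j:j + m] == P]


print(fmlc_rj("abcabaabcabaa", "abaa"))  # prints [3, 9]
```

For a three-character pattern the three filtered positions cover the whole pattern, so every candidate is already a match.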
27. RESULTS:-
The naïve string matching algorithm performs
best when the pattern is relatively
short. Since the
algorithm compares almost m characters
at each index of the text, the execution
time increases as m gets larger.
The FLC-RJ algorithm performs best
when the pattern is two characters
long, since in that case the
algorithm only has to output the content of the
Occurrence_List array.
28. Contd…
The FMLC-RJ algorithm performs best
when the pattern is three characters
long. The
algorithm searches for the first, middle and
last characters of the pattern and then
outputs the content of the Occurrence_List
array as the result.
33. It is apparent that the FC-RJ, FLC-RJ and FMLC-RJ algorithms
outperform the brute force algorithm.
The proposed algorithms reduce the execution time of
string matching compared to the brute force algorithm.
This improvement is calculated from the differences in the
execution times of the algorithms when searching for 14 pattern samples, as
recorded in Table 1.
34. SUMMARY
The "naive" approach is easy to understand and
implement but it can be too slow in some cases. If
the length of the text is n and the length of the
pattern is m, in the worst case it may take as many as
(n * m) iterations to complete the task.
It should be noted though, that for most practical
purposes, which deal with texts based on human
languages, this approach is much faster since the
inner loop usually quickly finds a mismatch and
breaks. A problem arises when we are faced with
different kinds of "texts," such as the genetic code.