String Matching Algorithms Explained

String Matching Algorithms
Presented by Swapan Shakhari
Under the guidance of Dr. Prasun Ghosal

What is String Matching?
• Checking whether two or more strings are the same.
• Finding a string (the pattern) within another string (the text), that is, looking for a substring.
Text: ATGCTTATCG
Pattern: ATC

Algorithms Discussed
• Knuth–Morris–Pratt algorithm
• Boyer–Moore string search algorithm
• Bitap algorithm (for exact string searching)

Knuth–Morris–Pratt Algorithm

Knuth–Morris–Pratt algorithm
Inventors
• Donald Knuth
• Vaughan Pratt and
• James H. Morris.

Outline of the Algorithm
• The Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S.
• It employs the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin.
• It thus bypasses re-examination of previously matched characters.
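
The "information embodied in the word" is usually precomputed as a table over W. The sketch below uses the common prefix-function form of that table, which is one of several equivalent formulations and is an assumption here, since the slides do not show the table. The name partial_match_table is illustrative.

```python
def partial_match_table(word):
    """fail[i] = length of the longest proper prefix of word[:i + 1]
    that is also a suffix of word[:i + 1]."""
    fail = [0] * len(word)
    k = 0  # length of the prefix currently matched
    for i in range(1, len(word)):
        # Fall back to shorter prefixes until one can be extended.
        while k > 0 and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k
    return fail


print(partial_match_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

For W = "ABCDABD", the entry 2 at index 5 records that "AB" is both a prefix and a suffix of "ABCDAB", so two characters can be kept after a mismatch at W[6].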

Worked example
• Let W = "ABCDABD" and
S = "ABC ABCDAB ABCDABCDABDE".
• At any given time, the algorithm is in a state
determined by two integers:
– m, denoting the position within S where the
prospective match for W begins,
– i, denoting the index of the currently considered
character in W.

Worked example
• In each step we compare S[m+i] with W[i] and advance i if they are equal. At the start of the run (m = 0, i = 0) the alignment is:
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456

Worked example
• We proceed by comparing successive
characters of W to "parallel" characters of S,
moving from one to the next if they match.
• In the fourth step, we get S[3] = ' ' and W[3] =
'D', a mismatch.

Worked example
• Rather than beginning to search again at S[1],
we note that no 'A' occurs between positions
0 and 3 in S, except at 0.

Worked example
• Hence, having checked all those characters
previously, we know that there is no chance of
finding the beginning of a match if we check
them again.

Worked example
• Therefore, we move on to the next character,
setting m = 4 and i = 0.

Worked example
• From m = 4 the characters "ABCDAB" match, but at W[6] and S[10] we again have a mismatch.
• However, just before the mismatch the algorithm passed an "AB", which could be the beginning of a new match.
– Since those two characters are already known to match, it simply resets m = 8, i = 2.

Worked example
• This search fails immediately in the first trial: S[10] = ' ' does not match W[2] = 'C'.
– Reset m = 11, i = 0.

Worked example
• From m = 11 the characters "ABCDAB" match, but we again have a mismatch.
– W[6] == 'D' but S[17] == 'C'.

Worked example
• Reasoning as before (S[15] == W[0]), we set m = 15 and, to continue after the two-character string "AB", set i = 2.
• The remaining characters match, so a match is found at S[15].
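
The worked example can be reproduced with a short sketch that keeps the same two state variables, m and i. This is an illustrative implementation under the assumption that the table is the prefix-function form; its intermediate states can differ slightly from the slides (for example, it visits m = 3 before m = 4), but the result is the same.

```python
def kmp_search(S, W):
    """Return the start index m of the first occurrence of W in S, or -1."""
    if not W:
        return 0
    # fail[j]: length of the longest proper prefix of W[:j + 1] that is also its suffix.
    fail = [0] * len(W)
    k = 0
    for j in range(1, len(W)):
        while k > 0 and W[j] != W[k]:
            k = fail[k - 1]
        if W[j] == W[k]:
            k += 1
        fail[j] = k

    m = 0  # position in S where the prospective match begins
    i = 0  # index of the currently considered character in W
    while m + i < len(S):
        if S[m + i] == W[i]:
            i += 1
            if i == len(W):
                return m
        elif i > 0:
            # Keep the characters already known to match, never re-read S.
            keep = fail[i - 1]
            m += i - keep
            i = keep
        else:
            m += 1
    return -1


print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))  # 15
```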

Boyer–Moore String Search Algorithm
The standard benchmark in the practical string-search literature.

Algorithm
Inventors (1977)
• Robert S. Boyer and
• J Strother Moore

Algorithm
Some Definitions Required
• S[i] refers to the character at index i of
string S, counting from 1.
• S[i..j] refers to the substring of string S starting
at index i and ending at j, inclusive.
• A prefix of S is a substring S[1..i] for some i in
range [1, n], where n is the length of S.

Algorithm
• A suffix of S is a substring S[i..n] for some i in
range [1, n], where n is the length of S.
• The string to be searched for is called
the pattern and is referred to with symbol P.
• The string being searched in is called
the text and is referred to with symbol T.

Algorithm
• The length of P is n.
• The length of T is m.
• An alignment of P to T is an index k in T such
that the last character of P is aligned with
index k of T.
• A match or occurrence of P occurs at an
alignment if P is equivalent to T[(k-n+1)..k].

Algorithm
Explanation
The Boyer-Moore algorithm searches for
occurrences of P in T by performing explicit
character comparisons at different
alignments. Instead of a brute-force search of
all alignments (of which there are m - n + 1),
Boyer-Moore uses information gained by
preprocessing P to skip as many alignments as
possible.

Algorithm
Explanation
The algorithm begins at alignment k = n,
so the start of P is aligned with the start of T.
Characters in P and T are then compared
starting at index n in P and k in T , moving
backward: the strings are matched from the
end of P to the start of P.

Algorithm
Explanation
The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted to the right by the maximum value permitted by a number of rules.

Algorithm
Explanation
The comparisons are performed again at
the new alignment, and the process repeats
until the alignment is shifted past the end
of T, which means no further matches will be
found.
The shift rules are implemented as
constant-time table lookups, using tables
generated during the preprocessing of P.

Algorithm
Explanation
Shift Rules
A shift is calculated by applying two rules:
the bad character rule and the good suffix
rule. The actual shifting offset is the maximum
of the shifts calculated by these rules.

Algorithm
Explanation
Shift Rules: The Bad Character Rule
The idea of the bad character rule is to shift P by more than one character when possible.
For each character x, let R(x) be the position of the right-most occurrence of x in P.

Algorithm
Explanation
R(x) is defined to be zero if x does not occur in P.
Time to construct table R: O(n), where n is the length of P.
Space used by R: O(|Σ|), where Σ is the alphabet.
Access time of R: O(1)

Algorithm
Explanation
Example of R
Pattern P = A C C T T T
x:     A  C  T  otherwise
R(x):  1  3  6  0
Algorithm
Explanation
In a particular alignment of P against T, let the rightmost n-i characters of P match the corresponding characters in T, and let the character P[i] not match the text character T[k] opposite it. Let the rightmost position of character T[k] in P, R(T[k]), be j.

Algorithm
Explanation
If j < i, then shift P so that P[j] is aligned below T[k], that is, shift by i - j.
In general, shift by max{1, i - R(T[k])}.

Algorithm
Explanation
• If j > i, then shift P to the right by 1.
• If R(T[k]) = 0, that is, T[k] does not occur in P, align P[1..n] with T[k+1..k+n].

Algorithm
Explanation
T:         G A A C C T T T
P:         A C C T T T        (mismatch at i = 5, text character C, R(C) = 3)
P shifted:     A C C T T T    (shift by 5 - 3 = 2)
x:     A  C  T  otherwise
R(x):  1  3  6  0
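
The example above can be checked with a sketch that scans right to left and shifts using the bad character rule only. This is deliberately not the full Boyer–Moore algorithm, because the good suffix rule is left out; the function name and the choice to return 1-based start positions are assumptions made to match the slide notation.

```python
def bad_character_search(T, P):
    """Return the 1-based start positions of P in T, shifting by the bad character rule only."""
    n, m = len(P), len(T)
    if n == 0:
        return []
    # R[x]: 1-based position of the rightmost occurrence of x in P (0 if absent).
    R = {ch: pos for pos, ch in enumerate(P, start=1)}
    starts = []
    k = n  # alignment: the last character of P sits under T[k] (1-based)
    while k <= m:
        i, h = n, k  # 1-based indices into P and T, compared right to left
        while i >= 1 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            starts.append(k - n + 1)
            k += 1  # without the good suffix rule, move on by one
        else:
            k += max(1, i - R.get(T[h - 1], 0))
    return starts


print(bad_character_search("GAACCTTT", "ACCTTT"))  # [3]
```

At the first alignment the mismatch occurs at i = 5 against the text character C, so the shift is 5 - 3 = 2 and the next alignment is an occurrence starting at position 3.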

Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: Suppose for a given alignment
of P and T, a substring t of T matches a suffix
of P, but a mismatch occurs at the next
comparison to the left.
T = A T G G C A A T T G G A A A G A A T T G A T
P = G A A A G A A
[Figure: t marks the substring of T matched by a suffix of P]

Algorithm
Explanation
Description: Then find, if it exists, the right-most
copy t' of t in P such that t' is not a suffix of P and the
character to the left of t' in P differs from the
character to the left of t in P.
[Figure: P = G A A A G A A, with t' and t marked]

Algorithm
Explanation
Description: Shift P to the right so that
substring t' in P aligns with substring t in T.
[Figure: P = G A A A G A A, with t' and t marked]

Algorithm
Explanation
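Description: If t' does not exist, shift P by the least amount such that a prefix of the shifted P matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right.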
(Example with different text and pattern)
T = A T G G C A T G A A G A A A G A A T T G A T
P = A G A A G A A

Algorithm
Explanation
shift P by n places to the right.
[Figure: P = G A A A G A A]

Algorithm
Explanation
Description: If an occurrence of P is found, then
shift P by the least amount so that a proper prefix of
the shifted P matches a suffix of the occurrence
of P in T.
[Figure: P = G A A A G A A]

Algorithm
Explanation
If no such shift is possible, then shift P by n places, that is, shift P past t.
(Example with different text and pattern)
T = A T G G C A A T G C G A A A G A A T T G A T
P = A T G C
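
The good suffix rule can be read off almost literally as a brute-force computation. The sketch below is an assumption-laden illustration, not the constant-time table lookup that a real implementation would precompute: it uses 0-based indices, treats a copy of t at the very start of P as valid (there is no character to its left), includes the case where only a prefix of P matches a suffix of t, and uses the hypothetical names good_suffix_shift and shift_after_match.

```python
def good_suffix_shift(P, i):
    """Shift after a mismatch at P[i] (0-based) when the suffix t = P[i + 1:] matched."""
    n = len(P)
    t = P[i + 1:]
    L = len(t)
    if L == 0:
        return 1  # nothing matched yet, so the rule gives no information
    # Rightmost copy t' of t that is not a suffix of P and whose
    # preceding character (if any) differs from P[i].
    for s in range(n - L - 1, -1, -1):
        if P[s:s + L] == t and (s == 0 or P[s - 1] != P[i]):
            return (i + 1) - s
    # Otherwise: the longest prefix of P that matches a suffix of t.
    for length in range(L - 1, 0, -1):
        if P[:length] == P[n - length:]:
            return n - length
    return n


def shift_after_match(P):
    """Shift after a full occurrence: align the longest proper prefix of P
    that is also a suffix of P, or shift by len(P) if there is none."""
    n = len(P)
    for length in range(n - 1, 0, -1):
        if P[:length] == P[n - length:]:
            return n - length
    return n


print(good_suffix_shift("GAAAGAA", 4))  # 3
print(shift_after_match("GAAAGAA"))     # 4
print(shift_after_match("ATGC"))        # 4
```

For P = GAAAGAA with t = "AA" matched and a mismatch at the G before it, the rightmost valid copy of "AA" lies three positions to the left, so the shift is 3. For P = ATGC no proper prefix is also a suffix, so after an occurrence P is shifted by its full length n = 4.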

Bitap Algorithm
(for exact string searching)

Bitap Algorithm
Inventors
• The bitap algorithm for exact string searching was invented by Bálint Dömölki in 1964 and extended by R. K. Shyamasundar in 1977.

Bitap Algorithm
Pseudo code
bitap_search(text : string, pattern : string)
    m := length(pattern)
    if m == 0: return -1
    /* Initialize the bit array R. */
    R := new array[m+1] of bit, initially all 0
    R[0] := 1
    for i = 0; i < length(text); i += 1:
        /* Update the bit array. */
        for k = m; k >= 1; k -= 1:
            R[k] := R[k-1] & (text[i] == pattern[k-1])
        if R[m]: return i - m + 1
    return -1
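
The pseudo code translates almost line by line into Python. The sketch below keeps the bit array as a list of 0/1 integers and, like the pseudo code, returns -1 for an empty pattern; that return convention is carried over from the slide rather than being a general requirement.

```python
def bitap_search(text, pattern):
    """Return the 0-based start of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return -1
    # R[k] == 1 means pattern[:k] matches the text ending at the current character.
    R = [0] * (m + 1)
    R[0] = 1
    for i, ch in enumerate(text):
        # Update from the end so that R[k - 1] still holds the previous step's value.
        for k in range(m, 0, -1):
            R[k] = R[k - 1] & int(ch == pattern[k - 1])
        if R[m]:
            return i - m + 1
    return -1


print(bitap_search("ATTGCAC", "TGCA"))  # 2
```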

Bitap Algorithm
Explanation of the Algorithm
The algorithm begins by pre-computing a set
of bitmasks (bit array) containing one bit for
each element of the pattern and an extra bit.
Then it is able to do most of the work
with bitwise operations, which are extremely
fast.
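
When the pattern fits in a machine word, the whole bit array can be packed into one integer and updated with a shift and an AND per text character. The sketch below shows this packed form, often called the Shift-And variant; it is an illustration under that assumption, and the names bitap_shift_and, masks and state are not from the slides.

```python
def bitap_shift_and(text, pattern):
    """Return the 0-based start of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return -1
    # Precompute one bitmask per character: bit k is set when pattern[k] == c.
    masks = {}
    for k, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << k)
    # Bit k of state is set when pattern[:k + 1] matches the text ending here.
    state = 0
    for i, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            return i - m + 1
    return -1


print(bitap_shift_and("ATTGCAC", "TGCA"))  # 2
```

Each text character now costs a constant number of bitwise operations instead of an inner loop over the pattern, which is why short patterns and small alphabets suit the algorithm well.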

Bitap Algorithm
Initially, the first position of the bit array contains 1 and all the remaining positions contain 0.
For every character of the text, from start to end, the bit array is updated from the last position down to position 1 (the 1st, not the 0th).

Bitap Algorithm
The current bit array position is set to 1 if the previous bit array position is 1 and the text character equals the pattern character at the previous position; otherwise it is set to 0.

Bitap Algorithm
Bit_array[current_position] = Bit_array[previous_position] & (text[i] == pattern[previous_position])
for (i = 0; i < text.size(); i += 1)
    for (k = m; k >= 1; k -= 1)
        r[k] = r[k-1] & (text[i] == pattern[k-1]);

Bitap Algorithm
A match is found when the last position of the bit array becomes 1.
if (Bit_array[last_position])
    found a match!

Bitap Algorithm
Explanation with an example
The text is: ATTGCAC
The pattern is: TGCA
m = 4 (pattern length)
i= index of the text
r= bit array
Initial bit array is: 1 0 0 0 0

Bitap Algorithm
i= 0
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 0 0 0
k= 3, r= 1 0 0 0 0
k= 2, r= 1 0 0 0 0
k= 1, r= 1 0 0 0 0

Bitap Algorithm
i= 1
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 0 0 0
k= 3, r= 1 0 0 0 0
k= 2, r= 1 0 0 0 0
k= 1, r= 1 1 0 0 0

Bitap Algorithm
i= 2
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 1 0 0 0
k= 3, r= 1 1 0 0 0
k= 2, r= 1 1 0 0 0
k= 1, r= 1 1 0 0 0

Bitap Algorithm
i= 3
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 1 0 0 0
k= 3, r= 1 1 0 0 0
k= 2, r= 1 1 1 0 0
k= 1, r= 1 0 1 0 0

Bitap Algorithm
i= 4
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 1 0 0
k= 3, r= 1 0 1 1 0
k= 2, r= 1 0 0 1 0
k= 1, r= 1 0 0 1 0

Bitap Algorithm
i= 5
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 0 1 1
k= 3, r= 1 0 0 0 1
k= 2, r= 1 0 0 0 1
k= 1, r= 1 0 0 0 1
r[4] = 1, so a match is found starting at index i - m + 1 = 2 of the text.
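
The step-by-step tables above can be regenerated with a small tracing sketch. It records the bit array once per text character, after the inner loop over k has finished, and stops at the first match; the helper name bitap_trace is hypothetical.

```python
def bitap_trace(text, pattern):
    """Return the bit array after each text character, stopping at the first match."""
    m = len(pattern)
    r = [1] + [0] * m
    snapshots = []
    for ch in text:
        for k in range(m, 0, -1):
            r[k] = r[k - 1] & int(ch == pattern[k - 1])
        snapshots.append(r.copy())
        if r[m]:
            break
    return snapshots


for i, r in enumerate(bitap_trace("ATTGCAC", "TGCA")):
    print("i =", i, "r =", *r)
# last line printed: i = 5 r = 1 0 0 0 1
```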

Bitap Algorithm
Properties
Due to the data structures required by the algorithm, it performs best on patterns shorter than a constant length (typically the machine word size), and it also prefers inputs over a small alphabet (suitable for DNA strings).
It runs in O(mn) operations, where m is the pattern length and n is the text length, no matter the structure of the text or the pattern.

References
• http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
• http://www.ijsce.org/attachments/File/Vol-1_Issue-6/F0304111611.pdf
• http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
• http://en.wikipedia.org/wiki/Bitap_algorithm
