Regular expressions (regex) are used to match patterns in text. They contain special characters called meta characters that represent expressions to match, like * for 0 or more matches and + for 1 or more matches. Regex can be used for text processing tasks like validating formats. The document discusses various regex meta characters, quantifiers, character sets, modifiers, grouping, backreferences, and lookahead/lookbehind operations. It provides examples of regex patterns for tasks like matching XML tags and validating email addresses.
2. Introduction
Also referred to as Regex or RegExp
Used to match the pattern of text
− Ex: maven and maeven can be matched with
regex “mae?ven”
Regular expressions are processed by a piece
of software called a “regular expression
engine”
Most languages support regex
− Ex: Perl, Java, C#, etc.
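As a quick illustration of the “mae?ven” example, here is a minimal sketch in Python (the language is chosen only for illustration; the slides name Perl, Java and C#). The word list is made up for the demo.

```python
import re

# 'e?' makes the letter 'e' optional, so both spellings match.
pattern = re.compile(r"mae?ven")

# fullmatch() requires the whole word to match the pattern.
results = {word: bool(pattern.fullmatch(word))
           for word in ["maven", "maeven", "maeeven"]}
print(results)
```

Only one optional 'e' is allowed, so “maeeven” is rejected.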
3. Introduction (Contd..)
Used where text processing is required.
Simple XML processing can use regex, as it is based on pattern
matching.
− We will see how to match an xml or html tag.
Automation of the tasks
− Ex: if mail subject contains “<operation> <some task
name> <command>” then start processing the task.
Text editors adding comments to functions
automatically (replacing a pattern with some text)
− Ex: replace
− “sub subroutine(parameters){<statements>}” with
/* this is a sample subroutine*/
sub subroutine(parameters){<statements>}
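The replacement described above could be sketched as follows in Python. The pattern used to recognise a subroutine is an assumption made for this demo, not something the slides specify.

```python
import re

source = "sub subroutine(parameters){<statements>}"

# Assumed pattern: 'sub', a name, a parenthesised parameter list and a
# brace-delimited body on one line. The whole definition is captured as group 1.
pattern = re.compile(r"^(sub\s+\w+\(.*?\)\{.*?\})", re.MULTILINE)

# Re-emit the captured definition (\1) after a comment line.
commented = pattern.sub(r"/* this is a sample subroutine*/\n\1", source)
print(commented)
```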
5. Meta Characters (Contd..)
Character Meaning
* 0 or more
+ 1 or more
? 0 or 1 (optional)
. Any single character except new-line
^ Start of line (in most engines, only the start of the
string unless the m modifier is set). But [^abc] means
any character other than 'a' or 'b' or 'c'
$ End of line
\A Start of string
\Z End of string
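The difference between the line anchors and the string anchors can be seen in a small Python sketch. The sample text is made up, and the behaviour shown is that of Python's `re` module; other engines may differ in details.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ only matches at the start of the whole string.
starts_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A is always the start of the string, even in MULTILINE mode.
starts_string = re.findall(r"\A\w+", text, re.MULTILINE)

# \Z is the end of the string, even in MULTILINE mode.
ends_string = re.findall(r"\w+\Z", text, re.MULTILINE)

print(starts_default, starts_multiline, starts_string, ends_string)
```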
6. Meta Characters (Contd..)
Character Meaning
{ } Used when the number of repetitions
is known
Ex: a{2,5} matches 'a' repeated
minimum 2 times and maximum 5
times.
| 'Or' between patterns
Ex: cat|dog|mouse
() Used to capture groups
[ ] Exactly one character from the set
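A short Python sketch of the four constructs in this table. The sample strings are made up for the demo.

```python
import re

# {2,5}: between 2 and 5 repetitions of 'a' (no space inside the braces).
counts = [bool(re.fullmatch(r"a{2,5}", s))
          for s in ["a", "aa", "aaaaa", "aaaaaa"]]

# |: alternation, any one of the listed words.
animals = re.findall(r"cat|dog|mouse", "a cat chased a mouse past the dog")

# [ ] picks exactly one character from the set; ( ) captures it.
match = re.search(r"([bc])at", "the bat")
captured = match.group(1)

print(counts, animals, captured)
```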
7. Quantifiers
To specify the quantity
− Ex: ear, eaaaar – the quantity of a is 1 and 4
in these two cases.
If a pattern is repeated then we need to use
quantifiers to match that repeated pattern.
To match the above case we use the following
regex
− ea+r means a can come 1 or more times
8. Quantifiers (Contd..)
* 0 or more times (greedy, or “hungry”, matching)
Ex: ca* matches c, ca, caa, caaa etc.
Matches even if the character does not
exist, and otherwise matches as many 'a's
as possible
+ 1 or more times (greedy matching)
Ex: ca+ matches ca, caa, caaa etc.
{n} Matches exactly n times
Ex: ca{4}r matches caaaar
{m,} Matches a minimum of m times, with no
upper limit
Ex: ca{2,}r matches only if 'a' repeats
2 or more times. (greedy matching)
{m,n} Matches minimum m times and maximum n
times.
Ex: ca{2,3}r matches when 'a' repeats
minimum 2 times and maximum 3 times.
(greedy matching)
Greedy (hungry) matching means that the pattern matches the maximum possible text.
Ex: for ca{0,4} the text “caaaa” matches in full, i.e. all 4 'a's are matched.
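The greedy behaviour of each quantifier can be checked with a small Python sketch that applies several patterns to the same text:

```python
import re

text = "caaaa"
patterns = [r"ca*", r"ca+", r"ca{2}", r"ca{2,}", r"ca{0,4}"]

# Each greedy quantifier takes as many 'a's as it is allowed to.
greedy = {p: re.match(p, text).group() for p in patterns}

# '*' also matches when there is no 'a' at all.
zero_times = re.match(r"ca*", "c").group()

print(greedy, zero_times)
```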
9. Quantifiers (Contd..)
*? Lazy matching, i.e. it matches 0 or
more times but stops at the shortest possible match
Ex: if text is “caaaaaa” then “ca*?”
will match only 'c'.
+? Lazy matching, i.e. it matches 1 or
more times but stops at the shortest possible match
Ex: if text is “caaaaaa” then “ca+?”
will match only 'ca'.
?? Lazy matching, i.e. it matches 0 or 1
times but prefers 0
Ex: if text is “ca” then “ca??” will
match only 'c'.
{min,}?
{n}?
{min,max}?
Lazy forms of the counted quantifiers
({n}? behaves the same as {n})
Lazy matching means that the pattern matches the minimum possible text.
Ex: for ca{0,4}? the text “caaaa” matches only “c”
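The same kind of check for the lazy quantifiers, again as a Python sketch. The last line is an addition to show that a lazy quantifier still expands when the rest of the pattern needs it.

```python
import re

text = "caaaaaa"
patterns = [r"ca*?", r"ca+?", r"ca??", r"ca{2,}?", r"ca{0,4}?"]

# Each lazy quantifier takes as few 'a's as it is allowed to.
lazy = {p: re.match(p, text).group() for p in patterns}

# Laziness is only a preference: here 'a+?' must grow to reach the 'r'.
forced = re.match(r"ca+?r", "caaar").group()

print(lazy, forced)
```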
10. Character Sets
Matches one character among the set of
characters
[abcd] is same as [a-d]
[a-di-l] is same as [abcdijkl]
[^abcd] matches any character other than
a,b,c,d
Quantifiers can be applied to the character sets
− [a-z]+ matches the string 'hello' in
'hello1234E'
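A Python sketch of the character set examples above. The alphabet string is made up for the demo.

```python
import re

# A quantifier applied to a set: the first run of lowercase letters.
first_run = re.search(r"[a-z]+", "hello1234E").group()

# Two ranges in one set: [a-di-l] is the same as [abcdijkl].
two_ranges = re.findall(r"[a-di-l]", "abcdefghijklm")

# Negated set: anything other than a, b, c, d.
negated = re.findall(r"[^abcd]", "abcdef")

print(first_run, two_ranges, negated)
```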
11. Characters for Matching
Common character classes shorthand
[a-zA-Z0-9_] \w
[0-9] \d
[ \t\n\r] \s
[^a-zA-Z0-9_] \W
[^0-9] \D
[^ \t\n\r] \S
\b Word Boundary
\B Other than a Word Boundary
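A Python sketch of the shorthand classes. In Python, \w and \d also match non-ASCII letters and digits by default, so the `re.ASCII` flag is used here to get the ASCII-only sets listed in the table. The sample strings are made up.

```python
import re

text = "Order 66 shipped_ok!"

words = re.findall(r"\w+", text, re.ASCII)      # letters, digits, underscore
digits = re.findall(r"\d+", text, re.ASCII)     # digits only
spaces = re.findall(r"\s", text, re.ASCII)      # whitespace characters
non_word = re.findall(r"\W", "a-b c", re.ASCII) # anything that is not \w

# \b matches at a word boundary: only the standalone 'cat's are found.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \B matches where there is no boundary: only the 'cat' inside 'concat'.
inside = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]

print(words, digits, spaces, non_word, whole_words, inside)
```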
12. Simple Matching
modegunta.srikanth@gmail.com
− Mail id should not start with a number or special
symbol
− Mail id can start with _
− Mail id can have '.' in the middle
− Should end with @domain.com (or @domain.co.in)
Pattern :
− [a-zA-Z_][a-zA-Z_.]+@\w+\.(com|co\.in)
− Meta characters must be escaped in the
pattern to match them as normal characters
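The pattern can be tried out with a short Python sketch. The test addresses other than the one on the slide are made up, and the pattern only implements the simplified rules listed above; it is not a general email validator.

```python
import re

# The '.' before the domain suffix is escaped so it matches a literal dot.
mail_pattern = re.compile(r"[a-zA-Z_][a-zA-Z_.]+@\w+\.(com|co\.in)")

def is_valid(mail_id: str) -> bool:
    """Return True if the whole string satisfies the simplified rules."""
    return mail_pattern.fullmatch(mail_id) is not None

checks = {
    "modegunta.srikanth@gmail.com": is_valid("modegunta.srikanth@gmail.com"),
    "_user@example.co.in": is_valid("_user@example.co.in"),
    "1user@example.com": is_valid("1user@example.com"),   # starts with a digit
    "user@examplexcom": is_valid("user@examplexcom"),     # no literal '.'
}
print(checks)
```

The last case shows why the escape matters: an unescaped '.' would accept any character in that position.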
13. Modifiers
Modifier Meaning
i Case insensitive
g Global matching (in perl)
m Multiline matching
s Dot all ('.' matches \n also)
x Extended regex pattern (pretty format
ref: perl)
e (Used for replacing string) evaluate the
replacing pattern as an expression
(ref: perl)
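The modifiers above are written in Perl notation. As a rough sketch, these are the closest Python equivalents: flags for i, m, s and x, `findall` in place of g, and a function as the replacement in place of e. The sample strings are made up.

```python
import re

# i -> re.IGNORECASE
case_insensitive = re.findall(r"regex", "Regex REGEX regex", re.IGNORECASE)

# g -> search() returns only the first match, findall() returns all of them
first_only = re.search(r"\d+", "10 20 30").group()
all_matches = re.findall(r"\d+", "10 20 30")

# m -> re.MULTILINE: ^ matches at the start of every line
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# s -> re.DOTALL: '.' also matches a newline
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# x -> re.VERBOSE: whitespace and # comments inside the pattern are ignored
date = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
year_month = date.match("2024-05").groups()

# e -> pass a function; its return value becomes the replacement text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2),
                 "3 apples and 10 pears")
print(doubled)
```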
14. Grouping
Groups can be captured using parentheses
− (<pattern>)
− Saves the text identified by the group into a
backreference (we will see it later)
Groups capture part of the text matched by the
pattern
− Ex: take a simple xml element
<root>test</root>
− <(\w+)>.*?</\1>
− Here \1 is the back reference
Java has a “group(int)” method in the
“java.util.regex.Matcher” class.
15. Grouping Example
If the command is
− /sbin/service <service-name> <command>
− ([^\s]+)\s+([\w-]+)\s+(start|stop|status)
− Group 0=matched pattern
− Group 1="/sbin/service"
− Group 2=<service-name>
− Group 3=<command>
− Command can be start, stop or status
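The groups can be inspected with a small Python sketch. The service name `httpd` is a hypothetical value chosen for the demo. The set is written as `[\w-]` because \w already includes '_' and a trailing '-' is read as a literal hyphen.

```python
import re

command_pattern = re.compile(r"([^\s]+)\s+([\w-]+)\s+(start|stop|status)")

# 'httpd' is a hypothetical service name used only for this demo.
match = command_pattern.fullmatch("/sbin/service httpd start")

whole = match.group(0)     # group 0 is the complete match
program = match.group(1)
service = match.group(2)
action = match.group(3)

# A command outside start|stop|status is rejected.
rejected = command_pattern.fullmatch("/sbin/service httpd restart")

print(whole, program, service, action, rejected)
```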
16. Back References
Stores the part of the string matched by the part
of the regular expression inside the
parentheses
If there is any string that occurs multiple times
in the input, we can use back reference to
identify the match
Ex: an xml/html start-tag should have a matching end-tag
Here if we capture the start-tag name in the first
group, we can put the end-tag name as a back
reference (\1)
17. Back references example
For example take the xml tag
− <root id="E12">test</root>
− <([\w-]+)\s*([^<>]+)?>\w+</\1> matches
the xml element
− Group 0: <root id="E12">test</root>
− Group 1: root
− Group 2: id="E12"
− \1 in the regex pattern is the back reference to
group 1.
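A Python sketch of this back reference example. The mismatched element in the last step is made up to show that \1 must repeat the captured tag name.

```python
import re

# \1 must repeat exactly what group 1 captured (the tag name).
element_pattern = re.compile(r"<([\w-]+)\s*([^<>]+)?>\w+</\1>")

match = element_pattern.search('<root id="E12">test</root>')
groups = (match.group(0), match.group(1), match.group(2))

# The end-tag name differs from the start-tag name, so there is no match.
mismatch = element_pattern.search('<root id="E12">test</item>')

print(groups, mismatch)
```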
18. No grouping with parenthesis
If groups are not required for the parenthesized
patterns
− Use ?: inside the group: (?:<pattern>)
− (text1|text2|text3) is any one of text1, text2 and
text3, captured as a group
− (?:text1|text2|text3) matches the same text but will not be a group
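The effect on group numbering can be seen in a short Python sketch. The `-(\d+)` suffix and the input string are made up for the demo.

```python
import re

text = "text2-42"

# Capturing alternation: the chosen word becomes group 1, the number group 2.
capturing = re.search(r"(text1|text2|text3)-(\d+)", text).groups()

# Non-capturing alternation: the word is matched but not stored,
# so the number is now group 1.
non_capturing = re.search(r"(?:text1|text2|text3)-(\d+)", text).groups()

print(capturing, non_capturing)
```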
19. Look ahead and Look behind
Positive look-ahead
− \w+(?=:) not all words.... selects only words that come
directly before ':'
Negative look-ahead
− \w+(?!:) words other than those coming before ':'
(because of backtracking this can also match part of a word
that is followed by ':'; \b\w+\b(?!:) avoids that)
With look-ahead, the regex engine matches the pattern and then looks
ahead for the filtering pattern without consuming it.
Positive look-behind
− (?<=a)b selects 'b' that follows 'a'
Negative look-behind
− (?<!a)b selects 'b' that doesn't follow 'a'
With look-behind, the regex engine looks behind the current position
for the filtering pattern without consuming it.
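The four look-around forms can be compared in a Python sketch. The sample strings are made up, and the `\b` variant is an addition that keeps the negative look-ahead from matching part of a word.

```python
import re

text = "name: value"

# Positive look-ahead: words directly followed by ':' (the ':' is not consumed).
before_colon = re.findall(r"\w+(?=:)", text)

# Negative look-ahead: backtracking lets \w+ give up the 'e' of 'name',
# so a partial word slips through.
naive_negative = re.findall(r"\w+(?!:)", text)

# Anchoring with \b keeps only whole words not followed by ':'.
whole_negative = re.findall(r"\b\w+\b(?!:)", text)

# Look-behind: positions of 'b' that do / do not follow an 'a'.
after_a = [m.start() for m in re.finditer(r"(?<=a)b", "ab cb")]
not_after_a = [m.start() for m in re.finditer(r"(?<!a)b", "ab cb")]

print(before_colon, naive_negative, whole_negative, after_a, not_after_a)
```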