Regular Expressions for Beginners: A Guide to Regex Patterns

Srikanth Modegunta

Introduction

Also referred to as regex or regexp

Used to match patterns in text
− Ex: “maven” and “maeven” are both matched by the
regex “mae?ven”

Regular expressions are processed by a piece
of software called a “regular expression
engine”

Most programming languages support regex
− Ex: Perl, Java, C#, etc.
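The “mae?ven” example above can be tried out directly. This is a minimal sketch that assumes Python's standard `re` module (the slides do not fix a language); the word list is made up for illustration.

```python
import re

# "e?" makes the "e" optional, so both spellings are accepted.
pattern = re.compile(r"mae?ven")

words = ["maven", "maeven", "mavn"]  # hypothetical sample input
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['maven', 'maeven']
```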

Introduction (contd.)

Used wherever text processing is required.

Matching XML or HTML tags is one such use, since it comes down to
pattern matching.
− We will see how to match an XML or HTML tag.

Automation of tasks
− Ex: if a mail subject contains “<operation> <some task
name> <command>”, then start processing the task.

Text editors adding comments to functions
automatically (replacing a pattern with some text)
− Ex: replace
− “sub subroutine(parameters){<statements>}” with
/* this is a sample subroutine */
sub subroutine(parameters){<statements>}
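The comment-insertion example above could look roughly like this. It is a sketch assuming Python's standard `re` module; the exact pattern for a subroutine header is an illustrative guess, not something the slides specify.

```python
import re

source = "sub subroutine(parameters){<statements>}"

# Capture the "sub name(...){" header so it can be put back after the comment.
header = re.compile(r"^(sub\s+\w+\s*\([^)]*\)\s*\{)", re.MULTILINE)

# \1 in the replacement re-inserts the captured header.
commented = header.sub(r"/* this is a sample subroutine */\n\1", source)
print(commented)
```

The replacement keeps the original text intact and only prepends the comment line.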

Meta Characters
The following are the meta characters
\ | ( ) [ { ^ $ * + ? .

Meta Characters (contd.)
Character   Meaning
*           0 or more
+           1 or more
?           0 or 1 (optional)
.           Any character except newline
^           Start of line (start of the string when multiline
            mode is off). Inside a set it negates: [^abc] means
            any character other than 'a', 'b' or 'c'
$           End of line
\A          Start of string
\Z          End of string
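To see how the anchors `^`, `$`, `\A` and `\Z` and the dot behave, here is a small sketch assuming Python's standard `re` module. Anchor details vary between engines (for example, Perl's `\Z` also matches before a final newline), so treat this as Python-specific.

```python
import re

text = "first line\nsecond line"

# Without multiline mode, ^ matches only at the start of the string.
start_default = re.findall(r"^\w+", text)                # ['first']
# With multiline mode, ^ also matches right after each newline.
start_multi = re.findall(r"^\w+", text, re.MULTILINE)    # ['first', 'second']
# \A means start of string regardless of multiline mode.
start_string = re.findall(r"\A\w+", text, re.MULTILINE)  # ['first']
# $ in multiline mode matches at the end of every line.
end_multi = re.findall(r"\w+$", text, re.MULTILINE)      # ['line', 'line']
# \Z matches only at the very end of the string.
end_string = re.findall(r"\w+\Z", text, re.MULTILINE)    # ['line']
# The dot stops at the newline, so each line is matched separately.
dot_runs = re.findall(r".+", text)                       # ['first line', 'second line']

print(start_default, start_multi, start_string, end_multi, end_string, dot_runs)
```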

Meta Characters (contd.)
Character   Meaning
{ }         Used when the number of repetitions is known
            Ex: a{2,5} matches 'a' repeated
            a minimum of 2 and a maximum of 5
            times.
|           Alternation ('or') in patterns
            Ex: cat|dog|mouse
( )         Used to group and capture
[ ]         Matches exactly one character from the set
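The four constructs in this table can be exercised as follows. This is a sketch assuming Python's standard `re` module, with made-up sample strings. Note that the repetition count is written without a space, as `a{2,5}`.

```python
import re

# { }: 'a' repeated between 2 and 5 times.
three_as = re.fullmatch(r"a{2,5}", "aaa") is not None  # True
one_a = re.fullmatch(r"a{2,5}", "a") is not None       # False

# |: any one of the alternatives.
animals = re.findall(r"cat|dog|mouse", "a cat chased a mouse")  # ['cat', 'mouse']

# ( ): the parenthesized part is captured as group 1.
pet = re.search(r"(cat|dog) food", "dog food").group(1)  # 'dog'

# [ ]: exactly one character from the set.
spellings = re.findall(r"gr[ae]y", "gray grey groy")  # ['gray', 'grey']

print(three_as, one_a, animals, pet, spellings)
```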

Quantifiers

Quantifiers specify how many times a pattern repeats
− Ex: ear, eaaaar: 'a' occurs 1 time and 4
times in these two cases.

If a pattern is repeated, we need
quantifiers to match the repetition.

To match both words above we use the following
regex
− ea+r, where 'a' can occur 1 or more times
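A quick check of `ea+r` against the two words above, plus one word that should fail. This is a sketch assuming Python's standard `re` module; "er" is an extra test word added for contrast.

```python
import re

pattern = re.compile(r"ea+r")

# "er" has no 'a' at all, so "a+" (one or more) cannot match it.
results = {w: bool(pattern.fullmatch(w)) for w in ["ear", "eaaaar", "er"]}
print(results)  # {'ear': True, 'eaaaar': True, 'er': False}
```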

Quantifiers (contd.)
Quantifier  Meaning
*           0 or more times (greedy)
            Ex: ca* matches c, ca, caa, caaa etc.
            Matches even if the character is
            absent, and otherwise takes as many 'a's
            as it can
+           1 or more times (greedy)
            Ex: ca+ matches ca, caa, caaa etc.
{n}         Exactly n times
            Ex: ca{4}r matches caaaar
{m,}        At least m times, with no upper limit
            (greedy)
            Ex: ca{2,}r matches only if 'a' repeats
            2 or more times.
{m,n}       At least m and at most n times (greedy)
            Ex: ca{2,3}r matches only if 'a' repeats
            2 or 3 times.

Greedy ("hungry") matching means the quantifier matches as much text as possible.
Ex: for ca{0,4} and the text “caaaa”, the whole text matches, i.e. all 4 'a's are consumed.
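The greedy ("hungry") behavior described above can be observed by looking at what each pattern actually consumes. This is a sketch assuming Python's standard `re` module; the HTML-like string is an illustrative assumption.

```python
import re

text = "caaaa"

star = re.match(r"ca*", text).group()         # 'caaaa': takes every 'a'
bounded = re.match(r"ca{0,4}", text).group()  # 'caaaa': takes all 4 allowed
capped = re.match(r"ca{2,3}", text).group()   # 'caaa': stops at the maximum of 3
zero = re.match(r"ca*", "c").group()          # 'c': zero repetitions is fine

# Greedy ".*" runs to the last '>' it can reach, not the first one.
greedy_tag = re.search(r"<.*>", "<b>bold</b>").group()  # '<b>bold</b>'

print(star, bounded, capped, zero, greedy_tag)
```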

Quantifiers (contd.)
Quantifier  Meaning
*?          Lazy: 0 or more times, but as few
            as possible
            Ex: for the text “caaaaaa”, “ca*?”
            matches only 'c'.
+?          Lazy: 1 or more times, but as few
            as possible
            Ex: for the text “caaaaaa”, “ca+?”
            matches only 'ca'.
??          Lazy: 0 or 1 times, preferring 0
            Ex: for the text “ca”, “ca??”
            matches only 'c'.
{min,}?     Lazy forms of the counted quantifiers
{min,max}?
{n}?        Same as {n}, since the count is fixed

Lazy matching means the quantifier matches as little text as possible.
Ex: for ca{0,4}? and the text “caaaa”, only “c” is matched.
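The lazy examples above, run as written. This is a sketch assuming Python's standard `re` module; the last two cases are added illustrations showing that a lazy quantifier still expands when the rest of the pattern requires it.

```python
import re

lazy_star = re.match(r"ca*?", "caaaaaa").group()    # 'c'
lazy_plus = re.match(r"ca+?", "caaaaaa").group()    # 'ca'
lazy_opt = re.match(r"ca??", "ca").group()          # 'c'
lazy_range = re.match(r"ca{0,4}?", "caaaa").group() # 'c'

# Lazy does not mean "always minimal": here 'r' must follow,
# so "a+?" grows until the overall pattern can match.
forced = re.match(r"ca+?r", "caaar").group()        # 'caaar'

# Lazy ".*?" stops at the first '>' it can, giving one match per tag.
tags = re.findall(r"<.*?>", "<b>bold</b>")          # ['<b>', '</b>']

print(lazy_star, lazy_plus, lazy_opt, lazy_range, forced, tags)
```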

Character Sets

A character set matches one character from the set of
characters

[abcd] is the same as [a-d]

[a-di-l] is the same as [abcdijkl]

[^abcd] matches any character other than
a, b, c, d

Quantifiers can be applied to character sets
− [a-z]+ matches the string 'hello' in
'hello1234E'
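The character-set rules above can be verified with a few one-liners. This is a sketch assuming Python's standard `re` module; the alphabet strings are sample input chosen for illustration.

```python
import re

# A quantified set: the first run of lowercase letters.
word = re.search(r"[a-z]+", "hello1234E").group()       # 'hello'

# Two ranges in one set: a-d and i-l.
ranged = re.findall(r"[a-di-l]", "abcdefghijklm")
# ['a', 'b', 'c', 'd', 'i', 'j', 'k', 'l']

# A negated set: everything except a, b, c, d.
others = re.findall(r"[^abcd]", "abcdef")               # ['e', 'f']

print(word, ranged, others)
```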

Characters for Matching
Common character class shorthands
Set              Shorthand
[a-zA-Z0-9_]     \w
[0-9]            \d
[ \t\n\r]        \s
[^a-zA-Z0-9_]    \W
[^0-9]           \D
[^ \t\n\r]       \S

Shorthand        Meaning
\b               Word boundary
\B               Not a word boundary
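The shorthands `\w`, `\d`, `\s`, `\D`, `\b` and `\B` in action. This is a sketch assuming Python's standard `re` module with made-up sample strings. In Python 3 these classes are Unicode-aware for text patterns by default, so the ASCII sets in the table correspond exactly only when the `re.ASCII` flag is used.

```python
import re

numbers = re.findall(r"\d+", "room 42, floor 7")              # ['42', '7']
words = re.findall(r"\w+", "snake_case and CamelCase9")
# ['snake_case', 'and', 'CamelCase9']
fields = re.split(r"\s+", "a b\tc\nd")                        # ['a', 'b', 'c', 'd']
digits_only = re.sub(r"\D", "", "+1 (555) 010-9999")          # '15550109999'

# \b: "cat" as a whole word only, not the "cat" inside "concat".
whole = re.findall(r"\bcat\b", "cat concat cat.")             # ['cat', 'cat']
# \B: "cat" that does NOT start at a word boundary (inside "concat").
inner = re.findall(r"\Bcat", "cat concat")                    # ['cat']

print(numbers, words, fields, digits_only, whole, inner)
```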

Simple Matching

modegunta.srikanth@gmail.com
− Mail id should not start with a number or special
symbol
− Mail id can start with _
− Mail id can have '.' in the middle
− Should end with @domain.com (or @domain.co.in)

Pattern:
− [a-zA-Z_][a-zA-Z_.]+@\w+\.(com|co\.in)
− Meta characters must be escaped in the
pattern to match them as normal characters (here '.' is written as \.)
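The mail-id pattern, with its dots escaped as `\.`, can be tested as below. This is a sketch assuming Python's standard `re` module. All addresses other than the one on the slide are invented test data. The last check shows why escaping matters: an unescaped `.` accepts any character.

```python
import re

escaped = re.compile(r"[a-zA-Z_][a-zA-Z_.]+@\w+\.(com|co\.in)")
# Same pattern with the dots left unescaped, for comparison only.
unescaped = re.compile(r"[a-zA-Z_][a-zA-Z_.]+@\w+.(com|co.in)")

checks = {
    "modegunta.srikanth@gmail.com": bool(escaped.fullmatch("modegunta.srikanth@gmail.com")),
    "_user@yahoo.co.in": bool(escaped.fullmatch("_user@yahoo.co.in")),
    "1user@gmail.com": bool(escaped.fullmatch("1user@gmail.com")),  # starts with a digit
    "user@gmailxcom": bool(escaped.fullmatch("user@gmailxcom")),    # no literal dot
}
print(checks)

# The unescaped version wrongly accepts 'x' where a dot is expected.
loose_hit = bool(unescaped.fullmatch("user@gmailxcom"))
print(loose_hit)  # True
```

This pattern is only as strict as the rules listed on the slide; real-world address validation needs more than this.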

Modifiers
Modifier  Meaning
i         Case insensitive
g         Global matching (in Perl)
m         Multiline matching (^ and $ match at every line)
s         Dot all ('.' matches \n also)
x         Extended regex pattern (whitespace and comments
          allowed for readable formatting; ref: Perl)
e         (Used when replacing) evaluate the
          replacement as an expression
          (ref: Perl)
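The modifiers in the table are written in Perl's notation. As a rough sketch, Python's standard `re` module offers close equivalents: flags for `i`, `m`, `s` and `x`, `findall` for `g`, and a function passed as the replacement for `e`. This mapping is my own and approximate, not something the slides state.

```python
import re

# i: case-insensitive; findall returns every match, much like Perl's g.
any_case = re.findall(r"regex", "Regex REGEX regex", re.IGNORECASE)
# ['Regex', 'REGEX', 'regex']

# m: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)  # ['one', 'two']

# s: the dot also matches a newline.
dot_default = re.match(r"a.b", "a\nb") is not None             # False
dot_all = re.match(r"a.b", "a\nb", re.DOTALL) is not None      # True

# x: whitespace and comments are allowed inside the pattern.
phone = re.compile(r"""
    \d{3}   # first three digits
    -       # literal hyphen
    \d{4}   # last four digits
""", re.VERBOSE)
phone_ok = phone.fullmatch("555-0100") is not None             # True

# e: compute the replacement from each match (here: double every number).
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")
# '6 apples, 20 pears'

print(any_case, line_starts, dot_default, dot_all, phone_ok, doubled)
```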

Grouping

Groups are captured using parentheses
− (<pattern>)
− Saves the text matched by the group into a
backreference (we will see it later)

Groups capture part of the text matched by the
pattern
− Ex: take a simple XML element
<root>test</root>
− <(\w+)>.*?</\1>
− Here \1 is the back reference to group 1

Java has a “group(int)” method in the
“java.util.regex.Matcher” class.
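The same idea as Java's `Matcher.group(int)` exists in other engines. As a sketch, Python's standard `re` module exposes it as `group(n)` on the match object, shown here with the pattern `<(\w+)>.*?</\1>` and the XML element from the slide.

```python
import re

pattern = re.compile(r"<(\w+)>.*?</\1>")

m = pattern.fullmatch("<root>test</root>")
whole = m.group(0)      # the entire matched text
tag_name = m.group(1)   # text captured by the first pair of parentheses
print(whole, tag_name)  # <root>test</root> root
```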

Grouping Example

If the command is
− /sbin/service <service-name> <command>
− ([^\s]+)\s+([\w-]+)\s+(start|stop|status)
− Group 0 = the whole matched text
− Group 1 = “/sbin/service”
− Group 2 = <service-name>
− Group 3 = <command>
− Command can be start, stop or status
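Applying the service-command pattern to a concrete line shows what lands in each group. This is a sketch assuming Python's standard `re` module. The service name "httpd" is a hypothetical example, and the service-name set is written as `[\w-]` because Python rejects a range such as `\w-_`.

```python
import re

pattern = re.compile(r"([^\s]+)\s+([\w-]+)\s+(start|stop|status)")

m = pattern.fullmatch("/sbin/service httpd start")
executable, service, command = m.group(1), m.group(2), m.group(3)
print(executable, service, command)  # /sbin/service httpd start

# A command outside start|stop|status is rejected.
rejected = pattern.fullmatch("/sbin/service httpd restart")
print(rejected)  # None
```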

Back References

A back reference stores the part of the string matched by the part
of the regular expression inside the
parentheses

If a string occurs multiple times
in the input, we can use a back reference to
match the repeats

Ex: an XML/HTML start-tag should have a matching end-tag

Here, if we capture the start-tag name in the first
group, we can write the end-tag name as the back
reference (\1)
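A common use of a back reference to find text that occurs more than once is spotting a doubled word. This is a sketch assuming Python's standard `re` module; the doubled-word pattern and the sentence are my own illustration, not from the slides.

```python
import re

# (\w+) captures a word; \1 requires the very same word to appear again.
doubled = re.compile(r"\b(\w+)\s+\1\b")

sentence = "this is is a test"
repeated = doubled.search(sentence).group(1)  # 'is'

# In the replacement, \1 puts the word back once.
fixed = doubled.sub(r"\1", sentence)          # 'this is a test'
print(repeated, fixed)
```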

Back references example

For example, take the XML tag
− <root id="E12">test</root>
− <([\w-]+)\s*([^<>]+)?>\w+</\1> matches
the XML element
− Group 0: <root id="E12">test</root>
− Group 1: root
− Group 2: id="E12"
− \1 in the regex pattern is the back reference to
group 1.
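The XML example can be checked group by group, along with a mismatched end-tag that the back reference `\1` rejects. This is a sketch assuming Python's standard `re` module. The tag-name set is written as `[\w-]`, and the mismatched element is invented for the test.

```python
import re

pattern = re.compile(r'<([\w-]+)\s*([^<>]+)?>\w+</\1>')

m = pattern.fullmatch('<root id="E12">test</root>')
tag, attributes = m.group(1), m.group(2)
print(tag, attributes)  # root id="E12"

# The end-tag must repeat group 1, so a different closing name fails.
mismatch = pattern.fullmatch('<root id="E12">test</node>')
print(mismatch)  # None
```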

No grouping with parentheses

If capture groups are not required for the parenthesized
patterns
− Use ?: at the start of the group: (?:<pattern>)
− (text1|text2|text3) matches any one of text1, text2 and
text3, and captures it as a group
− (?:text1|text2|text3) matches the same text but does not create a group
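The difference between a capturing group and a `(?:...)` group shows up in the list of captured groups. This is a sketch assuming Python's standard `re` module; the trailing `-(\d+)` part and the sample string are added for illustration.

```python
import re

capturing = re.search(r"(text1|text2|text3)-(\d+)", "text2-42").groups()
# ('text2', '42'): the alternation is group 1, the number is group 2

non_capturing = re.search(r"(?:text1|text2|text3)-(\d+)", "text2-42").groups()
# ('42',): the alternation still matches but is not stored

print(capturing, non_capturing)
```

With the non-capturing form, the number becomes group 1, which keeps group numbering stable when grouping is needed only for alternation.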

Look ahead and Look behind

Positive look-ahead
− \w+(?=:) does not select all words; it selects words that come
immediately before ':'

Negative look-ahead
− \b\w+\b(?!:) selects whole words that are not followed by ':'
(without the \b anchors, \w+(?!:) would also match part of a word that precedes ':')

With look-ahead, after matching the pattern the regex engine looks ahead for
the filtering pattern without consuming it.

Positive look-behind
− (?<=a)b selects a 'b' that immediately follows 'a'

Negative look-behind
− (?<!a)b selects a 'b' that does not immediately follow 'a'

With look-behind, the regex engine looks behind the current position for
the filtering pattern without including it in the match.
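The four look-around forms can be compared side by side. This is a sketch assuming Python's standard `re` module with made-up input. It also shows a pitfall of the negative look-ahead: without word boundaries, `\w+(?!:)` backtracks and matches part of a word that precedes ':'.

```python
import re

text = "name: Alice age: 30"

# Positive look-ahead: words immediately followed by ':' (the ':' is not consumed).
keys = re.findall(r"\w+(?=:)", text)            # ['name', 'age']

# Negative look-ahead with word boundaries: whole words not followed by ':'.
values = re.findall(r"\b\w+\b(?!:)", text)      # ['Alice', '30']

# Without the boundaries the engine gives back one letter to satisfy (?!:).
partial = re.findall(r"\w+(?!:)", text)         # ['nam', 'Alice', 'ag', '30']

# Look-behind: upper-case only the 'b' that follows (or does not follow) 'a'.
after_a = re.sub(r"(?<=a)b", "B", "ab cb")      # 'aB cb'
not_after_a = re.sub(r"(?<!a)b", "B", "ab cb")  # 'ab cB'

print(keys, values, partial, after_a, not_after_a)
```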

References:
1) http://www.regular-expressions.info/tutorial.html
2) Thinking in Java, 4th Edition, Chapter: Strings, page 392
