2. Regex - What is a regular expression
- A mini program that accepts or rejects a string.
- Can be used to parse data out of strings
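Both uses can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the patterns and input strings are made up for the example.

```python
import re

# Accept or reject: is the whole string exactly five digits?
accepted = re.fullmatch(r"\d{5}", "12345") is not None
rejected = re.fullmatch(r"\d{5}", "12a45") is not None

# Parse: pull a number out of a larger string.
found = re.search(r"\d+", "order 66 shipped")
number = int(found.group(0))

print(accepted, rejected, number)
```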
3. Regex - Why regular expressions?
PRO
● Parsing is fast: O(n) + DFA generation time
○ DFA generation time is a one-time penalty
○ For an RE of size m we can build a DFA at a worst-case
cost of O(2^m)
● Useful for validating string input
● Can be used in nearly all programming languages
○ Even in MySQL or other databases. But please
please don’t use RE in database queries :-)
● Useful for fetching/parsing data out of strings
● Very powerful tool. A real swiss army knife!
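The "one-time penalty" point can be illustrated by compiling a pattern once and reusing it. This is only a sketch: Python's re module uses a backtracking engine rather than a DFA, so the O(n) bound does not carry over directly. The pattern name DATE and the sample lines are assumptions for the example.

```python
import re

# Compile once; the compiled object is reused for every line.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = ["backup 2014-06-04 ok", "no date here", "restore 2014-06-05 ok"]

dates = []
for line in lines:
    found = DATE.search(line)
    if found:
        dates.append(found.group(0))

print(dates)
```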
4. Regex - When to avoid RE?
CONS
● Regexes are mini programs in themselves
● They can become very complex
● Some people argue regexes should always be avoided
● They are not very human readable
● Not everyone is comfortable with RE
● DFA must be created/compiled initially
5. Regex - Getting a feel
Two dummy examples
^aap?$
a()?p+p
Real world example:
DB_BACKUP_REGEX = r"^[a-zA-Z0-9_-]+_((\d|-)+_(\d|-)+)_UTC\.sql\.gz$"
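To get a feel for these three patterns, they can be tried in Python. This sketch assumes the backslashes lost in the slide text (\d and \.) are restored; the file name is a hypothetical example, not one taken from the slides.

```python
import re

# Dummy example 1: "aa" followed by an optional "p".
dummy1 = [s for s in ["aa", "aap", "aapp"] if re.fullmatch(r"^aap?$", s)]

# Dummy example 2: "a", an optional empty group, then at least two "p"s.
dummy2 = [s for s in ["ap", "app", "appp"] if re.fullmatch(r"a()?p+p", s)]

# Real world example, with backslashes restored (assumption).
DB_BACKUP_REGEX = r"^[a-zA-Z0-9_-]+_((\d|-)+_(\d|-)+)_UTC\.sql\.gz$"
filename = "mydb_2014-06-04_20-00_UTC.sql.gz"  # hypothetical file name
found = re.match(DB_BACKUP_REGEX, filename)
timestamp = found.group(1) if found else None

print(dummy1, dummy2, timestamp)
```

Group 1 captures the date and time part of the file name, which is the piece a backup script would typically want.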
6. Regex - Semantic building blocks
‘.’ == Matches any character except a newline
‘^’ == Matches the start of the string
‘$’ == Matches the end of a string
‘*’ == Causes the resulting RE to match 0 or more repetitions
‘+’ == Causes the resulting RE to match 1 or more repetitions
‘?’ == Causes the resulting RE to match 0 or 1 repetitions
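Each building block above can be checked with a tiny Python experiment. The helper name matches is hypothetical and exists only for this sketch.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

dot_ok = matches(r"a.c", "abc")         # '.' stands for any one character
dot_newline = matches(r"a.c", "a\nc")   # ...except a newline
star_zero = matches(r"ab*", "a")        # '*' allows zero repetitions
star_many = matches(r"ab*", "abbb")     # ...or many
plus_zero = matches(r"ab+", "a")        # '+' needs at least one
plus_one = matches(r"ab+", "ab")
opt_zero = matches(r"ab?", "a")         # '?' allows zero or one
opt_two = matches(r"ab?", "abb")        # ...but not two

# '^' and '$' anchor the match to the start and end of the string.
starts_with_ab = re.search(r"^ab", "cab") is not None
ends_with_ab = re.search(r"ab$", "cab") is not None
```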
10. Regex - basics - Which string matches?
Regex: aa+b*b$ (old regex: ^a(ab)*b$)
Strings:
aaab
aabb
ab
abbb
aababb
_aabb   (_ == whitespace)
aabb_
11. Regex - basics - Which string matches?
Regex: aa+b*b$ (old regex: ^a(ab)*b$)
Strings:
aaab
aabb
ab
abbb
aababa
_aabb
aabb_
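One way to check the answers is to let Python try every candidate. This sketch uses re.search, which looks for the pattern anywhere in the string; the underscore on the slide is written as a real space here.

```python
import re

pattern = re.compile(r"aa+b*b$")

candidates = ["aaab", "aabb", "ab", "abbb", "aababb", "aababa", " aabb", "aabb "]

# True where the regex finds a match somewhere in the string.
results = {s: pattern.search(s) is not None for s in candidates}

for s, ok in results.items():
    print(repr(s), "matches" if ok else "no match")
```

Because the pattern has no '^', the leading space in ' aabb' does not prevent a match with re.search. With re.match, which only tries at position 0, that string would be rejected.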
13. Regex - some more building blocks
[a-zA-Z0-9_] == \w
Matches 1 character: a-z, A-Z, 0-9 or underscore.
This is the same as \w
\d == [0-9]
Matches 1 digit
\d{5}
Matches 5 digits
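A short Python check of these shorthand classes, with the backslashes written out. The sample strings are made up. Note that for Python 3 text strings, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is passed.

```python
import re

# \w matches letters, digits and the underscore, but not '-' or '!'.
word_chars = re.findall(r"\w", "a_1-!")

# \d matches one digit at a time.
digits = re.findall(r"\d", "a1b22")

# \d{5} needs exactly five digits.
five_ok = re.fullmatch(r"\d{5}", "12345") is not None
four_ok = re.fullmatch(r"\d{5}", "1234") is not None

print(word_chars, digits, five_ok, four_ok)
```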
14. Regex - bad practical example
import re
data = "2014-06-04 20:00"
# How do we parse this to integers?
regex = r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2})"
regex2 = r"(\d+)-(\d+)-(\d+) (\d+):(\d+)"  # Works too!
re.findall(regex, data)
# returns [('2014', '06', '04', '20', '00')]
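The captured groups are still strings. One possible way to finish the job and get integers is sketched below; the variable names are assumptions, and the trailing '$' is added here to reject extra text after the minutes.

```python
import re

data = "2014-06-04 20:00"
pattern = re.compile(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2})$")

found = pattern.match(data)
if found is None:
    raise ValueError("unexpected date format: " + data)

# Convert each captured group from str to int.
year, month, day, hour, minute = (int(g) for g in found.groups())

print(year, month, day, hour, minute)
```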
16. Regex - stuff we didn’t cover! :D
Regex can get very very complicated.
Just to give you some idea:
- Lookahead assertion
(?=...)
Matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example,
Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
For example: (Isaac (?=Asimov))|(Banaan)
Will match ‘Isaac ’ (only when followed by ‘Asimov’) or ‘Banaan’
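A quick Python check makes the "doesn't consume" part visible: the lookahead requires 'Asimov' to follow, but the matched text stops before it. The test strings other than the slide's own are made up.

```python
import re

pattern = re.compile(r"(Isaac (?=Asimov))|(Banaan)")

asimov = pattern.search("Isaac Asimov")
matched_text = asimov.group(0)   # the lookahead text is not included
match_end = asimov.end()         # position right after 'Isaac '

newton = pattern.search("Isaac Newton")   # lookahead fails, no match
banaan = pattern.search("Banaan").group(0)

print(repr(matched_text), match_end, newton, banaan)
```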
17. Regex - stuff we didn’t cover! :D
- Greedy vs Non-Greedy
‘*’, ‘+’, ‘?’ are greedy quantifiers. They will match as much
as possible to obtain a match.
Non-greedy quantifiers will match as little as possible to
achieve a match.
Adding a ‘?’ makes the above quantifiers non-greedy:
‘*?’, ‘+?’, ‘??’
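The difference is easiest to see on a string with several possible end points. A minimal sketch with a made-up HTML-like input:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: '.+' runs to the last '>' in the string.
greedy = re.findall(r"<.+>", text)

# Non-greedy: '.+?' stops at the first '>' it can.
lazy = re.findall(r"<.+?>", text)

print(greedy)
print(lazy)
```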
We’ll skip these 2 for now :-)
- Positive lookbehind assertion