7. The Fear Factor!
For unknown reasons regular expressions
are deeply shrouded in mystery
Many programmers outright fear them
I stumped a room full of programmers
in Tulsa by shouting out a two
character expression
I have no idea why this is
11. What is a Regex?
Regular expressions are a very small
language for describing text
You can use them to dissect and change
textual data
I think of them as a DSL for find and
replace operations
17. Why Learn Regular Expressions?
Ruby leans heavily on regular expressions:
Many text operations in Ruby are
easiest with the right regex
Regular expressions are fast
Regular expressions are encoding aware
You can be the one scaring all the other
programmers
26. Basic Regex Usage
Strings have methods supporting:
Find/Find All
    if "100" =~ /\A\d+\z/
      puts "This is a number."
    end
    "Find all, words.".scan(/\w+/) do |word|
      puts word.downcase
    end
    year, month, day = "2008-09-04".scan(/\d+/)
Replace/Replace All
    csv = "C, S, V".sub(/,\s+/, ",")
    cap = "one two".sub(/\w+/) { |n| n.capitalize }
    csv  = "C, S, V".gsub(/,\s+/, ",")
    caps = "one two".gsub(/\w+/) { |n| n.capitalize }
Use sub!()/gsub!() to modify a String in place
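The slide names sub!()/gsub!() but shows no in-place example. Here is a small sketch of the difference. The sample string and variable names are made up for illustration.

```ruby
greeting = "hello world".dup        # hypothetical sample; dup guarantees a mutable String

copy = greeting.sub(/o/, "0")       # sub returns a new String
puts copy                           # hell0 world
puts greeting                       # hello world (unchanged)

greeting.gsub!(/o/, "0")            # gsub! changes greeting itself
puts greeting                       # hell0 w0rld

# The bang versions return nil when nothing was replaced
result = greeting.gsub!(/z/, "Z")
p result                            # nil
```

Because of that nil, it is usually safer not to chain another method call onto sub!() or gsub!().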
31. Literal Characters
Most characters in a regex match
themselves literally
The only special characters are:
[].^$?*+{}|()
You can precede a special character
with \ to make it literal
The regex /James Gray/ matches my name
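A quick sketch of why escaping matters, using made-up sample strings. The names loose and strict are just for illustration.

```ruby
price = "Total: $5.00 (approx.)"    # hypothetical sample text

# Unescaped, . matches any character, so this also matches "5x00"
loose  = /5.00/
# Escaped with a backslash, $ and . match only themselves
strict = /\$5\.00/

loose_hit  = "5x00" =~ loose        # 0
strict_hit = "5x00" =~ strict       # nil
price_hit  = price =~ strict        # 7

p [loose_hit, strict_hit, price_hit]
```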
36. Character Classes
Characters in [ … ] are choices for a
single character match
A leading ^ negates the class, so [^ … ]
matches what is not listed
You can use ranges like a-z or 0-9
The expression /[bcr]at/ will match
“bat,” “cat,” or “rat”
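To try the three ideas above (choices, negation, ranges), something like this should work. The word list and the "Room 101" string are invented for the example.

```ruby
words = %w[bat cat rat hat]         # hypothetical sample words

listed     = words.grep(/[bcr]at/)  # ["bat", "cat", "rat"]
not_listed = words.grep(/[^bcr]at/) # ["hat"]
digits     = "Room 101"[/[0-9]+/]   # "101"
lowercase  = "Room 101"[/[a-z]+/]   # "oom" (the capital R is outside a-z)

p listed, not_listed, digits, lowercase
```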
50. Anchors
Anchors match between characters
They are used to assert that the content you want must appear in a certain place
Thus /^Totals/ searches for a line starting with “Totals”

Anchor   Matches
\A       Start of the String
\Z       End of the String or before trailing newline
\z       End of the String
^        Start of a line
$        End of a line
\b       Between \w\W or \W\w, and at \A and \z
\B       Between \w\w or \W\W
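The line anchors and the String anchors are easy to mix up, so here is a rough sketch contrasting them. The report text is a made-up sample.

```ruby
report = "Subtotals: 3\nTotals: 10\n"   # hypothetical two-line sample

line_start   = report =~ /^Totals/      # 13: ^ matches at the start of any line
string_start = report =~ /\ATotals/     # nil: the String starts with "Subtotals"
before_nl    = report =~ /10\Z/         # 21: \Z tolerates one trailing newline
very_end     = report =~ /10\z/         # nil: \z means the absolute end

# \b keeps "cat" from matching inside "concat"
whole_words = "cat concat".scan(/\bcat\b/)   # ["cat"]

p [line_start, string_start, before_nl, very_end, whole_words]
```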
61. Repetition
You can tack symbols onto an element of a regex to indicate that element can repeat
The expression /ab+c?/ matches an a, followed by one or more b’s, and optionally followed by a c

Repeater   Allowed Count
?          Zero or one
+          One or more
*          Zero or more
{n}        Exactly n
{n,}       At least n
{,m}       No more than m
{n,m}      Between n and m
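A small sketch of the repeaters in action. The \A and \z anchors are my addition so that each sample has to match the whole expression, and the sample strings are invented.

```ruby
pattern = /\Aab+c?\z/                    # the slide's /ab+c?/, anchored for the demo

samples = %w[ab abbb abc ac abcc]        # hypothetical inputs
matched = samples.select { |s| s =~ pattern }
p matched                                # ["ab", "abbb", "abc"]

# Counted repetition: exactly four digits, then two, then two
date_ok = "2008-09-04" =~ /\A\d{4}-\d{2}-\d{2}\z/
p date_ok                                # 0

# {n,m} takes as many as it can, up to m
run = "aaaaa"[/a{2,3}/]
p run                                    # "aaa"
```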
73. Some Examples
    if var =~ /\A\s*\z/
      puts "Variable is blank."
    end
    if var !~ /\S/
      puts "Variable is blank."
    end
From TopCoder.com, SRM 216 “CultureShock:”
Bob and Doug have recently moved from Canada to the United States, and they are confused
by this strange letter, "ZEE". They need your assistance. Given a String text, replace every
occurrence of the word, "ZEE", with the word, "ZED", and return the result.
Note that if "ZEE" is just part of a larger word (for example, "ZEES"), it should not be altered.
    solution = text.gsub(/\bZEE\b/, "ZED")
78. Greedy Versus Non-Greedy
By default repetition will always be greedy, consuming as many characters as possible
The match will backtrack, giving up characters, if it helps it succeed
You can negate this, matching minimal characters

Greedy   Non-Greedy
?        ??
+        +?
*        *?
{n}      N/A
{n,}     {n,}?
{,m}     {,m}?
{n,m}    {n,m}?
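The classic way to see the difference is with tag-like text. This is only a sketch with a made-up sample string.

```ruby
html = "<b>bold</b> and <i>italic</i>"   # hypothetical sample

greedy = html[/<.+>/]     # .+ grabs everything, then backs up to the LAST >
lazy   = html[/<.+?>/]    # .+? stops at the FIRST > it can
tags   = html.scan(/<.+?>/)

p greedy                  # "<b>bold</b> and <i>italic</i>"
p lazy                    # "<b>"
p tags                    # ["<b>", "</b>", "<i>", "</i>"]
```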
90. Alternation
In a regex, | means “or”
You can put a full expression on the left
and another full expression on the right
Either can match
The expression /James|words?/ will
match “James,” “word,” or “words”
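A short sketch of the expression above, plus one gotcha worth knowing: | splits the entire expression, anchors included. The sample strings are invented.

```ruby
re = /James|words?/

first  = "James Gray"[re]    # "James"
second = "many words"[re]    # "words"
third  = "a word"[re]        # "word"
p [first, second, third]

# /\AJames|words?\z/ reads as (\AJames) OR (words?\z),
# so it happily matches the start of "Jameson"
loose = "Jameson" =~ /\AJames|words?\z/       # 0
# Wrapping the choices in ( ... ) limits how far the | reaches
tight = "Jameson" =~ /\A(James|words?)\z/     # nil
p [loose, tight]
```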
94. Grouping
Everything in ( … ) is grouped into a
single element for the purposes of
repetition and alternation
The expression /(ha)+/ matches “ha,”
“haha,” “hahaha,” etc.
The expression /Greg(ory)?/ matches
“Greg” and “Gregory”
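To see what the parentheses buy you, compare a grouped repeat with an ungrouped one. This is a sketch with made-up inputs.

```ruby
laugh   = "hahaha!"[/(ha)+/]        # "hahaha": + repeats the whole group
scream  = "haaa"[/ha+/]             # "haaa":   without a group, + repeats only the a
long    = "Gregory"[/Greg(ory)?/]   # "Gregory"
short   = "Greg"[/Greg(ory)?/]      # "Greg"

p [laugh, scream, long, short]
```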
101. Captures
( … ) also capture what they match
After a match, you can access these captures in the variables $1, $2, etc., from left to right
Use \1, \2, etc. in String replacements
    "$99.95" =~ /\$(\d+(\.\d+)?)/
    # $1 => "99.95" (outer group)
    # $2 => ".95"   (inner group)
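Here is a sketch of both halves: reading $1 and $2 after a match, then using \1-style references inside a replacement String. The date reshuffle is my own example, not from the talk.

```ruby
if "$99.95" =~ /\$(\d+(\.\d+)?)/
  whole, cents = $1, $2      # groups are numbered by their opening parenthesis
end
p whole                      # "99.95"
p cents                      # ".95"

# Single quotes keep Ruby from treating \2 as a String escape
us_date = "2008-09-04".sub(/(\d+)-(\d+)-(\d+)/, '\2/\3/\1')
p us_date                    # "09/04/2008"
```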
107. Modes
Regular expressions have modes
End an expression with /i to make the
expression case insensitive
End with /m for “multi-line” mode
where . will also match newlines
Use /x to add space and comments
You can combine modes: /mi
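A sketch of each mode on tiny made-up inputs. TIME_RE is a hypothetical name used only to show /x.

```ruby
plain       = "RUBY" =~ /ruby/      # nil: case matters by default
insensitive = "RUBY" =~ /ruby/i     # 0
dot_default = "a\nb" =~ /a.b/       # nil: . skips newlines by default
dot_multi   = "a\nb" =~ /a.b/m      # 0
both        = "A\nB" =~ /a.b/mi     # 0: modes combined

# With /x, whitespace and # comments inside the regex are ignored
TIME_RE = /
  (\d{1,2})   # hours
  :           # separator
  (\d{2})     # minutes
/x
hours, minutes = "10:45".match(TIME_RE).captures

p [plain, insensitive, dot_default, dot_multi, both]   # [nil, 0, nil, 0, 0]
p [hours, minutes]                                     # ["10", "45"]
```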
111. More Examples
    if ip =~ /\A\d{1,3}(\.\d{1,3}){3}\z/
      puts "IP address is well formed."
    end
    if text =~ /\b(at|for|in)[.?!]/
      puts "You have bad grammar."
    end
    james_gray = "Gray, James".sub(/(\S+),\s*(.+)/, '\2 \1')
120. Other Tricks
There are other special variables for regexen including $`, $&, and $’
    "one_two_three" =~ /two/
    one_, two, _three = $`, $&, $'
You can escape content for use in a regex
    print "What's your favorite language? "
    lang = $stdin.gets.strip
    if "Perl Java" =~ /\b#{Regexp.escape(lang)}\b/i
      puts "You are weird."
    else
      puts "OK."
    end
There’s a MatchData object for matches
    CONFIG_RE = /\A([^=\s]+)\s*=\s*(\S+)/
    config = "url = http://ruby-lang.org"
    key, value = config.match(CONFIG_RE).captures
Many methods can take a regex
    fields = "1|2 | 3".split(/\s*\|\s*/)
    last_word_i = "one two three".rindex(/\b\w+/)
    five = "Count: 5"[/\d+/]
    five = "Count: 5"[/Count:\s*(\d+)/, 1]
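The MatchData object carries more than captures. Here is a sketch of a few of its methods, on a made-up sample. Note that match() returns nil when nothing matches, so calling captures on the result without checking can raise NoMethodError.

```ruby
m = /(\w+)@(\w+)/.match("mail james@example today")   # hypothetical sample text

p m[0]           # "james@example" (same as $&)
p m.captures     # ["james", "example"]
p m.pre_match    # "mail "   (same as $`)
p m.post_match   # " today"  (same as $')
p m.begin(0)     # 5, the index where the match starts

# No match gives nil, so check before asking for captures
miss = "no at sign".match(/(\w+)@(\w+)/)
user, host = miss ? miss.captures : [nil, nil]
p [user, host]   # [nil, nil]
```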
136. Regular Expression Extensions
Ruby’s regex engine adds several common extensions
These usually look something like (? … )
The simplest is (?: … ) which is grouping without capturing
    data = "put the ball in the sack"
    re = %r{
      (put|set)          # verb: $1
      \s+                # some space (/x safe)
      (?:(?:the|a)\s+)?  # an article (optional)
      (\w+)              # noun: $2
      \s+
      (?:in(?:side)?)?   # preposition (optional)
      \s+
      (?:(?:the|a)\s+)?
      (\w+)              # noun: $3
    }x
    p data =~ re
    p [$1, $2, $3]
142. Look-Around Assertions
You can use look-ahead assertions to peek ahead without consuming characters: (?= … ) and (?! … )
Ruby 1.9 adds a fixed-width look-behind: (?<= … ) and (?<! … )
    class Numeric
      def commify
        to_s.reverse.
             gsub(/(\d\d\d)(?=\d)(?!\d*\.)/, '\1,').
             reverse
      end
    end
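A smaller sketch of all four assertions, one at a time. The sample sentence is invented. In each case the peeked-at text is not part of the result.

```ruby
text = "cost $42 for 7 items"            # hypothetical sample

ahead      = text[/\d+(?= items)/]       # "7":  digits followed by " items"
behind     = text[/(?<=\$)\d+/]          # "42": digits preceded by "$"
not_ahead  = text.scan(/\b\d+\b(?! items)/)   # ["42"]: numbers NOT followed by " items"
not_behind = text.scan(/(?<!\$)\b\d+/)        # ["7"]:  numbers NOT preceded by "$"

p [ahead, behind, not_ahead, not_behind]
```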
150. Oniguruma
Ruby 1.9’s regex engine is faster and more powerful:
Named groups
    config = "mode = wrap"
    if /\A(?<key>\w+)\s*=\s*(?<value>\w+)/ =~ config
      puts "Key is #{key} and value is #{value}"
    end
Nested matching
    CHECK = /\A(?<paren>\((\g<paren>|[^()])*?\))\z/
    %w[ ()
        (()())
        (a(b(c,d())))
        ()) ].each do |test|
      unless test =~ CHECK
        puts "#{test} isn't balanced"
      end
    end
Improved encodings
And more…
153. The Data
data = <<END_FIELDS.gsub(/\s+/, " ")
Business Name (Text Field),
Allows Pets (Check),
Open To (Dropdown: Men, Women, Children, Any),
Atmosphere (Check List: Calm, Romantic, New Age)
END_FIELDS