Advanced Regular Expressions Redux

Scope

• medium to advanced
• 30 minutes
• performance / backtracking irrelevant
• no compatibility charts (yet)

TOC

• basic matching, quantifiers
• character classes, types, properties, anchors
• groups, options, replace string
• look-ahead/behind
• subexpressions

RE overview

             match "foo"         replace with "bar"
Perl         /foo/ (on $_)       s/foo/bar/ (on $_)
JavaScript   /foo/               "foolish".replace(/foo/, "bar")
Vi           /foo/               :s/foo/bar/
TextMate     ⌘-F, Find: foo      ⌘-F, Find: foo, Replace: bar
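
The JavaScript row of the overview can be tried directly. This is only a minimal sketch; the variable names are made up for illustration.

```javascript
// Sample input taken from the JavaScript row of the overview.
const input = "foolish";

// Match: test() reports whether the pattern occurs anywhere in the string.
const found = /foo/.test(input);

// Replace: without the g flag only the first occurrence is replaced.
const replaced = input.replace(/foo/, "bar");

console.log(found);    // true
console.log(replaced); // barlish
```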

Quantifiers
• classic greedy: ?, *, +
• specific: {1,5}, {,5}
• ? == {0,1}
• * == {0,}
• + == {1,}
• non-greedy: ??, *?, +?, {5,7}?

Example
This reveals that plain text is in fact the
technical user's way to regard a file or a
sequence of bytes. In this sense, there is no
plain text.

/reveal(.*)plain/
/reveal(.*?)plain/
/t.{2,3}t/
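
A sketch of how the three patterns above behave, assuming the sample text is joined into a single line (in JavaScript `.` does not cross newlines by default):

```javascript
const text =
  "This reveals that plain text is in fact the technical user's way " +
  "to regard a file or a sequence of bytes. In this sense, there is no " +
  "plain text.";

// Greedy: .* runs on to the LAST "plain" in the string.
const greedy = text.match(/reveal(.*)plain/)[1];

// Non-greedy: .*? stops at the FIRST "plain".
const lazy = text.match(/reveal(.*?)plain/)[1];

// Bounded: "t", then two or three characters, then "t" (first match only).
const bounded = text.match(/t.{2,3}t/)[0];

console.log(JSON.stringify(lazy));        // "s that "
console.log(greedy.length > lazy.length); // true
console.log(bounded);                     // that
```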

Character Classes / Properties
• [0-9a-z] (classes)
• \+420[0-9]{9} = simplified Czech phone nr.
• don't: [A-z0-] (A-z also spans the punctuation between Z and a; the trailing - is a literal hyphen)
• [a-z&&[^j-n]] == [a-io-z]
• \p{Upper} (properties): works great on Unicode text (Latin, Katakana)
• [:alnum:], [:^space:] (POSIX bracket)
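
A rough JavaScript take on the class and property bullets. Two assumptions: JavaScript has no POSIX brackets and (without the newer `v` flag) no `&&` intersection, so the intersection is spelled out by hand, and `\p{Lu}` / `\p{Script=Katakana}` stand in for the `\p{Upper}` style of naming.

```javascript
// Simplified Czech phone number; the leading + has to be escaped.
const czechPhone = /^\+420[0-9]{9}$/;

// [A-z] also covers the six ASCII characters between "Z" and "a".
const sloppy = /^[A-z]$/;

// Hand-written equivalent of [a-z&&[^j-n]].
const noJtoN = /^[a-io-z]+$/;

// Unicode properties need the u flag in JavaScript.
const upper = /\p{Lu}/gu;
const katakana = /\p{Script=Katakana}+/u;

// "Žatec a Ústí", written with escapes to keep the source encoding-proof.
const czech = "\u017Datec a \u00DAst\u00ED";

console.log(czechPhone.test("+420123456789"));       // true
console.log(sloppy.test("_"));                       // true, probably not intended
console.log(noJtoN.test("abc"), noJtoN.test("jam")); // true false
console.log(czech.match(upper));                     // the two capitals, Ž and Ú
console.log("Tokyo トウキョウ".match(katakana)[0]);   // トウキョウ
```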

Character Types
• . == anything (apart from newline)
• \s == space == [\t\n\v\f\r ] (more in Unicode)
• \w == word char == approx. [0-9a-zA-Z_] (is complicated in Unicode)
• \d == digit == [0-9]
• \h == hexadecimal digit == [0-9a-fA-F] (Oniguruma; in Perl \h means horizontal whitespace)
• \S \W \D == [^\s] [^\w] [^\d]
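
How far the shorthands reach into Unicode differs between engines. The sketch below shows only JavaScript's behaviour, which is one data point and not a compatibility chart; the sample strings are made up.

```javascript
// \s covers more than ASCII whitespace, e.g. the no-break space U+00A0.
const parts = "a\u00A0b".split(/\s/);

// \w and \d are ASCII-only by default in JavaScript.
const asciiWord = /^\w+$/.test("cafe");
const accented = /^\w+$/.test("caf\u00E9"); // "café"
const arabicDigit = /\d/.test("\u0663");    // ARABIC-INDIC DIGIT THREE

// JavaScript has no \h shorthand, so the hex class is spelled out.
const isHex = /^[0-9a-fA-F]+$/.test("DeadBeef");

// \D is the negation of \d: strip everything that is not a digit.
const stripped = "tel: 420-123".replace(/\D/g, "");

console.log(parts);                 // [ 'a', 'b' ]
console.log(asciiWord, accented);   // true false
console.log(arabicDigit);           // false
console.log(isHex);                 // true
console.log(stripped);              // 420123
```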

Example
This reveals that plain text is in fact the
technical user's way to regard a file or a
sequence of bytes. In this sense, there is no
plain text.

/\b[\w&&[^aA]]+\b/
/\W{2,}\w+\b/
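
A sketch of both patterns in JavaScript, again with the sample text on one line. Since `&&` is not available there without the `v` flag, the first pattern is rewritten by hand as "word characters minus a and A"; that rewrite is an assumption of this example, not part of the slide.

```javascript
const text =
  "This reveals that plain text is in fact the technical user's way " +
  "to regard a file or a sequence of bytes. In this sense, there is no " +
  "plain text.";

// Whole words that contain neither "a" nor "A".
const noA = text.match(/\b[b-zB-Z0-9_]+\b/g);

// Two or more non-word characters followed by a word.
const afterPunct = text.match(/\W{2,}\w+\b/g);

console.log(noA.slice(0, 4)); // [ 'This', 'text', 'is', 'in' ]
console.log(afterPunct);      // [ '. In', ', there' ]
```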

Anchors
• ^ - beginning (line, string)
• $ - end (line, string)
• \b - word boundary ~ \w\W (almost)
• \b.{5}\b != \W\w{5}\W
• zero width!
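
The "zero width" point is easiest to see by comparing what each pattern actually consumes. A minimal sketch with made-up sample strings:

```javascript
const padded = " hello ";

// \b matches a position, so only the five characters are consumed.
const withAnchors = padded.match(/\b.{5}\b/);

// \W matches a character, so the surrounding spaces are consumed too.
const withClasses = padded.match(/\W\w{5}\W/);

// At the edges of the string there is no character for \W to match.
const bare = "hello".match(/\W\w{5}\W/);
const bareAnchored = "hello".match(/\b.{5}\b/);

console.log(withAnchors[0].length, withAnchors.index); // 5 1
console.log(withClasses[0].length, withClasses.index); // 7 0
console.log(bare);                                     // null
console.log(bareAnchored[0]);                          // hello
```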

Options
• /foo/imsx
• i - case insensitive
• m - multiline (^, $ match at line boundaries, not only at the start/end of the string/file)
• s - single line (. matches newlines)
• x - extended!
• g - global
• can be written inline
• (?imsx-imsx)
• (?imsx-imsx:...)

(?x-i)
#this is cool
(
foo #my important value
| #don't forget the alternative
bar
) # result equals to (foo|bar)
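
A sketch of the i, m, s and g options in JavaScript. JavaScript has no x option, so the last part imitates the commented layout by assembling the pattern from an array of pieces; that workaround is an assumption of this example, not something from the slides.

```javascript
const text = "Foo\nfoo\nbar";

// g + i: every occurrence, ignoring case.
const all = text.match(/foo/gi);

// Without m, ^ and $ only match at the ends of the whole string.
const wholeString = /^foo$/.test(text);
// With m, they also match at the start and end of each line.
const perLine = /^foo$/m.test(text);

// Without s, the dot stops at newlines; with s it matches them.
const dotStops = /Foo.foo/.test(text);
const dotAll = /Foo.foo/s.test(text);

// Stand-in for the x option: comments live in the JavaScript source.
const extended = new RegExp([
  "(",
  "foo", // my important value
  "|",   // don't forget the alternative
  "bar",
  ")",   // result equals (foo|bar)
].join(""));

console.log(all);                  // [ 'Foo', 'foo' ]
console.log(wholeString, perLine); // false true
console.log(dotStops, dotAll);     // false true
console.log(extended.source);      // (foo|bar)
```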

Groups/Replacing
• (...) - matched group
• $1 - $9
• alternatively \1 - \9 (not recommended)
• nested groups ordered by left bracket
• (?:...) - non-captured group
• useful for (?:foo)+ or (?:foo|bar)

Example
"foobarman".replace(
/(?:f)((o)+)(bar)|(baz|man)/g,
'$1, $2, $3, $4, $5')

group   match "foobar"   match "man"
1       oo
2       o
3       bar
4                        man
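
The group contents can be checked by listing them per match. This sketch uses `matchAll` and a slightly different replacement string, `[$1|$2|$3|$4]`, chosen here only to make empty groups visible.

```javascript
const re = /(?:f)((o)+)(bar)|(baz|man)/g;

// One row per match; groups that did not participate become "".
const rows = [..."foobarman".matchAll(re)].map((m) => ({
  match: m[0],
  groups: m.slice(1).map((g) => (g === undefined ? "" : g)),
}));

// Groups are numbered by their opening bracket; (?:f) is not counted.
const replaced = "foobarman".replace(re, "[$1|$2|$3|$4]");

console.log(rows[0]);  // { match: 'foobar', groups: [ 'oo', 'o', 'bar', '' ] }
console.log(rows[1]);  // { match: 'man', groups: [ '', '', '', 'man' ] }
console.log(replaced); // [oo|o|bar|][|||man]
```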

Look-ahead/behind
• defines custom zero-width anchors

          positive    negative
ahead     (?=...)     (?!...)
behind    (?<=...)    (?<!...)

Example

zdenek@gooddata.com
/.*?@gooddata/

zdenek@gooddata.com
/.*?(?=@gooddata)/
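
The difference between the two patterns is what ends up in the match. A sketch in JavaScript; the look-behind and negative look-ahead lines are extra illustrations and not from the slide.

```javascript
const address = "zdenek@gooddata.com";

// Plain pattern: "@gooddata" is part of the match.
const consumed = address.match(/.*?@gooddata/)[0];

// Look-ahead: "@gooddata" must follow, but is not consumed.
const ahead = address.match(/.*?(?=@gooddata)/)[0];

// Look-behind: the word must be preceded by "@".
const behind = address.match(/(?<=@)\w+/)[0];

// Negative look-ahead: the first whole word NOT followed by "@".
const notBeforeAt = address.match(/\b\w+\b(?!@)/)[0];

console.log(consumed);    // zdenek@gooddata
console.log(ahead);       // zdenek
console.log(behind);      // gooddata
console.log(notBeforeAt); // gooddata
```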

Recursive RE

• very important!
• quote & bracket matching

• technically not part of regular grammar

• two styles
• \g<name> or \g<n> - TextMate

• (?R) - Perl

Example
(?x:

\( # match the initial opening parenthesis

# Now make a named group 'balanced' which
# matches a balanced substring.

(?<balanced>

[^()] # A balanced substring is either something
# that is not a parenthesis:

| # …or a parenthesised string:

\( # A parenthesised string begins with an opening parenthesis

\g<balanced>* # …followed by a sequence of balanced substrings

\) # …and ends with a closing parenthesis

)* # Look for a sequence of balanced substrings

\) # Finally, the outer closing parenthesis
)

or: \(([^()]|(?R))*\)
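
JavaScript's RegExp has neither `\g<name>` nor `(?R)`, so the patterns above cannot be run there as written. As a stand-in, the sketch below finds a balanced span with a plain depth counter instead of a recursive pattern. `firstBalanced` is a hypothetical helper, and it is simpler than the regex: it only considers the first opening parenthesis and does not retry from later ones.

```javascript
// Hypothetical helper: returns the balanced "(...)" span that starts at
// the first "(" in the text, or null if that parenthesis is never closed.
function firstBalanced(text) {
  const start = text.indexOf("(");
  if (start === -1) return null;

  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "(") depth++;
    else if (text[i] === ")") depth--;

    // Depth back at zero: the opening parenthesis has found its partner.
    if (depth === 0) return text.slice(start, i + 1);
  }
  return null; // input ended while still inside parentheses
}

console.log(firstBalanced("f(a(b)c)d)")); // (a(b)c)
console.log(firstBalanced("no parens"));  // null
console.log(firstBalanced("(a(b)"));      // null
```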
