Oracle database supports perl- and POSIX-compatible regular expressions with five elegant and powerful functions: REGEXP_REPLACE, REGEXP_SUBSTR, REGEXP_INSTR, REGEXP_LIKE, and REGEXP_COUNT.
This session will demonstrate their nuances and how to use them effectively for data cleansing, manipulation and selection, for validating things such as Social Security Numbers, credit cards, IP addresses, phone numbers, DNAs, XMLs, for extracting things such as email-ids, hostnames from URLs and strings, and for transposing delimited columns to rows. There will be a demo of a few tricky examples taken from forums.oracle.com and asktom.oracle.com.
The session will end with fuzzy matching and optimization techniques, and things to watch out for.
http://docs.oracle.com/cd/E11882_01/appdev.112/e25518/adfns_regexp.htm
2. Agenda
Introduction to regular expressions
REGEXP_* functions in Oracle
Coffee Break
Examples
More examples
3. Meeting Basics
Put your phones/pagers on vibrate/mute
Messenger: Change the status to offline or
in-meeting
Remote attendees: Mute yourself (*6). Ask
questions via Adobe Connect.
4. What are Regular Expressions?
A way to express patterns
credit cards, license plate numbers, vehicle identification
numbers, voter id, driving license
UNIX (grep, egrep), PHP, JAVA support Regular
Expressions
PERL made it popular
5. String operations before Regular Expression
support in Oracle
Pull the data from DB and perform it in middle tier
or FE
OWA_PATTERN in 9i and before
LIKE operator
6. LIKE operator
% matches zero or more of any character
_ matches exactly one character
Examples
WHERE col1 LIKE 'abc%';
WHERE col1 LIKE '%abc';
WHERE col1 LIKE 'ab_d';
WHERE col1 LIKE 'ab_d' escape '';
WHERE col1 NOT LIKE 'abc%';
Very limited functionality
Check whether first character is numeric: where c1 like '0%' OR c1
like '1%' OR .. .. c1 like '9%'
Very trivial with Regular Exp: where regexp_like(c1, '^[0-9]')
7. Regular Expressions
Meta Meaning
character
. Matches any single "character" except newline.
* Matches zero or more of the character preceding it
e.g.: bugs*, table.*
^ Denotes the beginning of the line. ^A denotes lines starting
with A
$ Denotes the end of the line. :$ denotes lines ending with :
Escape character (., *, [, , etc)
[] matches one or more characters within the brackets. e.g.
[aeiou], [a-z], [a-zA-Z], [0-9], [:alpha:], [a-z?,!]
[^] negation - matches any characters other than the ones
inside brackets. eg. ^[^13579] denotes all lines not starting
with odd numbers, [^02468]$ denotes all lines not ending
with even numbers
7
8. Extended Regular Expressions
Meta character Meaning
| alternation. e.g.: ho(use|me), the(y|m),
(they|them)
+ one or more occurrences of previous character.
? zero or one occurrences of previous character.
{n} exactly n repetitions of the previous char or group
{n,} n or more repetitions of the previous char or
group
{,m} zero to m repetitions of the previous char or
group
{n, m} n to m repetitions of previous char or group
(....) grouping or subexpression
n back referencing where n stands for the nth sub-
expression. e.g.: 1 is the back reference for first
sub-expression.
8
9. POSIX Character Classes
POSIX Description
[:alnum:] Alphanumeric characters
[:alpha:] Alphabetic characters
[:ascii:] ASCII characters
[:blank:] Space and tab
[:cntrl:] Control characters
[:digit:] Digits, Hexadecimal digits
[:xdigit:]
[:graph:] Visible characters (i.e. anything except spaces, control characters,
etc.)
[:lower:] Lowercase letters
[:print:] Visible characters and spaces (i.e. anything except control
characters)
[:punct:] Punctuation and symbols.
[:space:] All whitespace characters, including line breaks
[:upper:] Uppercase letters
[:word:] Word characters (letters, numbers and underscores)
10. Perl Character Classes
Perl POSIX Description
d [[:digit:]] [0-9]
D [^[:digit:]] [^0-9]
w [[:alnum:]_] [0-9a-zA-Z_]
W [^[:alnum:]_] [^0-9a-zA-Z_]
s [[:space:]]
S [^[:space:]]
10
11. Tools to learn Regular Expressions
http://www.weitz.de/regex-coach/
http://www.regexbuddy.com/
12. REGEXP_* functions
Available from 10g onwards.
Powerful and flexible, but CPU-hungry.
Easy and elegant, but sometimes less performant
Usable on text literal, bind variable, or any column
that holds character data such as CHAR, NCHAR,
CLOB, NCLOB, NVARCHAR2, and VARCHAR2
(but not LONG).
Useful as column constraint for data validation
13. REGEXP_LIKE
Determines whether pattern matches.
REGEXP_LIKE (source_str, pattern,
[,match_parameter])
Returns TRUE or FALSE.
Use in WHERE clause to return rows matching a pattern
Use as a constraint
alter table t add constraint alphanum check (regexp_like (x,
'[[:alnum:]]'));
Use in PL/SQL to return a boolean.
IF (REGEXP_LIKE(v_name, '[[:alnum:]]')) THEN ..
Can't be used in SELECT clause
regexp_like.sql
14. REGEXP_SUBSTR
Extracts the matching pattern. Returns NULL when
nothing matches
REGEXP_SUBSTR(source_str, pattern [, position
[, occurrence [, match_parameter]]])
position: character at which to begin the search.
Default is 1
occurrence: The occurrence of pattern you want to
extract
regexp_substr.sql
15. REGEXP_INSTR
Returns the location of match in a string
REGEXP_INSTR(source_str, pattern, [, position [,
occurrence [, return_option [, match_parameter]]]])
return_option:
0, the default, returns the position of the first character.
1 returns the position of the character following the occurence.
regexp_instr.sql
16. REGEXP_REPLACE
Search and Replace a pattern
REGEXP_REPLACE(source_str, pattern
[, replace_str] [, position [, occurrence
[, match_parameter]]]])
If replace_str is not specified, pattern/search_str is
replaced with empty string
occurence:
when 0, the default, replaces all occurrences of the match.
when n, any positive integer, replaces the nth occurrence.
regexp_replace.sql
17. REGEXP_COUNT
New in 11g
Returns the number of times a pattern appears in a
string.
REGEXP_COUNT(source_str, pattern [,position
[,match_param]])
For simple patterns it is same as
(LENGTH(source_str) –
LENGTH(REPLACE(source_str,
pattern)))/LENGTH(pattern)
regexp_count.sql
18. Pattern Matching modifiers
i – Specifies case-insensitive matching (ignore case)
c – Specifies case-sensitive matching
n – allows the period (.) to match the newline character
m - treats the source string as multiple lines.
x - ignores whitespace characters
when match_parameter is not specified,
case sensitivity is determined by NLS_SORT parameter
(BINARY, BINARY_CI)
A period (.) doesn't match newline character
Source string is treated as a single line
match_params.sql
19. Is a CHAR column all numeric?
to_number(c1) returns ORA-01722:
invalid number if a varchar2 column
contains alpha characters.
is_numeric.sql
20. Check constraints
Put validation close to where the data is
stored
No need to have validation at different
clients
check_constraint.sql
21. Extract email-ids
Find email-ids embedded in text
strings. Possible email-id formats:
abc123@company.com
namex@mail.company.com
xyz_1@yahoo.co.in
extract_emailid.sql
22. Extract dates
Extract dates embedded in text strings.
Possible formats
1/5/2007, 2-5-03, 12-31-2009,
1/31/10, 2/5-10
extract_date.sql
23. Extracting hostnames from URLs
Extract hostnames/domain-names embedded in
text strings. Possible formats
http://us.mg201.mail.yahoo.com/dc/launch?.partner
=sbc&.gx=1&.rand=fegr2vucbecu5
https://www.mybank.com:8080/abc/xyz
www.mybank.com
ftp://www.mycharity.org/abc/xyz
extract_hostname.sql
24. Convert value pairs to XML
Input: A string such as 'remain1=1;remain2=2;'
Output: An XML string
<remain1><value=1></remain1>
<remain2><value=2></remain2>
convert_to_xml.sql
25. Sort IP addresses in numerical order
Sort IP addresses, that are stored as character
strings, in numerical order.
Input
10.10.20.10
127.0.0.1
166.22.33.44
187.222.121.0
20.23.23.20
sort_ip_address.sql
26. Extract first name, last name, and middle initial
Extract the first name, last name with
an optional middle initial.
first_last_mi.sql
27. Finding the Last Occurrence
Find the last numeric sequence from a
sequence.
Return 567 from 'abc/123/def567/xyz'
INSTR and SUBSTR allow backward
search when position is negative.
REGEXP functions don't allow backward
search
last_occurrence.sql
28. Fuzzy Match
Tables t1 and t2 each have a
varchar2(12) column (t1.x, t2.y).
A row in t1 is considered a match for a
row in t2, if any six characters in t1.x
matches with any six characters in t2.y
fuzzy_match.sql
30. Meta-characters with multiple meanings
Same meta characters are used with
multiple meanings
^ used for anchoring and negation.
? used as quantifier and lazy operator
() used for grouping or sub-expression
metachars_with_multiple_meanings.sql
31. Nuances
? (zero or one), * (zero or more)
could sometimes mislead you
nuances.sql
32. Stored patterns
patterns can be stored in table
columns and be referenced in
REGEXP functions
No need to hard-code them
stored_patterns.sql
33. Random things
Insert a dash before the two last digits
Remove a substring
Get rid of useless commas from a string
Find the word that comes immediately
before a substring (e.g. XXX)
Replace whole words, not its parts
Trimming the trailing digits
random.sql
34. A few other points
When not to use Regular Expressions
If the same thing could be used without regular
expressions and without too much coding.
POSIX notations need double brackets [[:upper]]. [:upper:]
won't work. [[:UPPER:]] won't work either. It has to be in
lower case letters.
Locale support provided with Collation Element ][.ce.]],
and Equivalence Classes [[=e=]]
MySQL supports regular expressions with RLIKE
35. References
Oracle® Database Advanced Application
Developer's Guide
(http://download.oracle.com/docs/cd/E11882_0
1/appdev.112/e17125/adfns_regexp.htm#CHDGH
BHF)
Anti-Patterns in Regular Expressions:
http://gennick.com/antiregex.html
First Expressions. An article by Jonathan Gennick Oracle
Magazine, Sep/Oct 2003.
Oracle Regular Expressions Pocket Reference by
Gonathan Gennick
http://examples.oreilly.com/9780596006013/Re
gexPocketData.sql