1. By Niko Adrianus Yuwono
BUZOO PHP TEAM
REGULAR EXPRESSIONS LECTURE
2. What Are Regular Expressions?
Regular expressions, or regex (the term used in
the rest of this presentation), are a powerful tool
for examining and modifying text. Regex uses a
general pattern notation that lets you describe
and parse text.
PHP has supported two types of regular
expressions: POSIX-extended (the ereg functions,
deprecated in PHP 5.3 and removed in PHP 7) and
Perl-Compatible Regular Expressions (PCRE, the
preg functions). This lecture focuses on PCRE.
3. Delimiters
When using the PCRE functions, the pattern must
be enclosed in delimiters.
Commonly used delimiters are forward slashes (/),
hash signs (#) and tildes (~). If the delimiter
character appears inside the pattern, it must be
escaped with a backslash.
Examples:
/([^\/ | ^-]+).html/
/<\/span>(.*?)<\/span>/
4. Literal-Characters
Literal characters are normal characters that
match themselves. Alphanumeric characters and
most symbols are examples of literal characters.
To use a meta-character as a literal character,
add a backslash (\) before it. The backslash tells
the regex engine to treat that character literally
rather than as a meta-character.
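A minimal sketch of the difference between an escaped and an unescaped dot. Python's re module is used here as a stand-in for PHP's preg functions (an assumption of this example; the sample strings are made up), since both treat the backslash the same way for this case.

```python
import re

text = "index.html indexXhtml"

# Unescaped dot is a meta-character: it matches any single character.
unescaped = re.findall(r"index.html", text)

# Escaped dot is a literal: it matches only a real "." character.
escaped = re.findall(r"index\.html", text)

print(unescaped)
print(escaped)
```

The unescaped pattern also accepts "indexXhtml", which is usually not what was intended.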
5. Meta-characters
Meta-characters are the main power of regular
expressions. With meta-characters it is possible to
encode alternatives and repetitions in the pattern.
Meta-characters are divided into two types: those
recognized outside a character class and those
recognized inside a character class.
6. Meta-characters Cont’d
Here is the list of meta-characters that work
outside a class:
\ , ^ , $ , . , [ , ] , | , ( , ) , ? , * , + , { , }
And this is the list of meta-characters that work
inside a class:
\ , ^ , -
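To illustrate why the two lists differ, the sketch below shows that . and + lose their special meaning inside a class. It uses Python's re module as a stand-in for PCRE (an assumption; the sample text is invented for the example).

```python
import re

text = "a.b+c"

# Outside a class, "." is a meta-character and matches every character here.
outside = re.findall(r".", text)

# Inside a class, "." and "+" are plain literals.
inside = re.findall(r"[.+]", text)

print(outside)
print(inside)
```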
7. Character Classes
A character class in regex starts with an opening
square bracket ([) and ends with a closing
square bracket (]).
A character class matches a single character in
the subject; the character must be in the set of
characters defined by the class.
Examples:
[a-z] matches any lowercase letter
[^A-Z] matches any character that is not an uppercase letter
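The two classes above can be tried out as follows. This is a sketch using Python's re module in place of PHP's preg functions (an assumption), with a made-up sample string.

```python
import re

text = "Buzoo PHP"

# [a-z] picks out each lowercase letter, one character per match.
lower = re.findall(r"[a-z]", text)

# [^A-Z] picks out everything that is not an uppercase letter,
# which includes the space.
not_upper = re.findall(r"[^A-Z]", text)

print(lower)
print(not_upper)
```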
8. Subpatterns
Subpatterns are delimited by parentheses (round
brackets), which can be nested.
Marking part of a pattern as a subpattern does
two things:
1. It localizes a set of alternatives. For example,
the pattern hen(dy|rio|ri) matches one of the
words "hendy", "henrio", or "henri". Without the
parentheses, hendy|rio|ri would match "hendy",
"rio" or "ri".
2. It sets up the subpattern as a capturing
subpattern: the part of the subject that matched
the subpattern is returned along with the overall
match.
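The effect of the parentheses on alternation can be checked with a short sketch. Python's re module stands in for PCRE here (an assumption), and the list of candidate words is invented for the example.

```python
import re

names = ["hendy", "henrio", "henri", "rio"]

# With parentheses, the alternatives apply only to the part after "hen".
grouped = [n for n in names if re.fullmatch(r"hen(dy|rio|ri)", n)]

# Without parentheses, the alternatives are the whole words hendy, rio, ri.
ungrouped = [n for n in names if re.fullmatch(r"hendy|rio|ri", n)]

print(grouped)
print(ungrouped)
```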
9. Subpatterns Cont’d
For example, if the string "kafji tinggi" is matched
against the pattern ((kafji|niko) (tinggi|tampan)),
the captured substrings are "kafji tinggi", "kafji",
and "tinggi", numbered 1, 2, and 3.
Often we need grouping but not capturing. In that
case, add ?: right after the opening parenthesis,
as in (?:kafji|niko), to make the subpattern
non-capturing.
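A small sketch of capture numbering and of the ?: modifier, using Python's re module as a stand-in for PHP's preg_match (an assumption; in PHP the captures would arrive in the matches array instead).

```python
import re

subject = "kafji tinggi"

# Three capturing subpatterns, numbered by their opening parenthesis.
m = re.search(r"((kafji|niko) (tinggi|tampan))", subject)
captured = m.groups()

# Making the first inner group non-capturing removes it from the numbering.
m2 = re.search(r"((?:kafji|niko) (tinggi|tampan))", subject)
captured_non_capturing = m2.groups()

print(captured)
print(captured_non_capturing)
```

With ?: in place, "tinggi" becomes capture 2 instead of capture 3.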
10. Optional Items
The question mark makes the preceding token in
the regular expression optional.
Example : colou?r will match both
colour and color.
You can also wrap a set of characters in
parentheses to make the whole group optional.
Example : Jan(uary)? will match both Jan and
January.
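Both optional-item examples can be tried with the sketch below. It assumes Python's re module as a substitute for PCRE, and the sample strings are invented.

```python
import re

# The "u" is optional, so both spellings match; "colouur" does not.
colours = re.findall(r"colou?r", "color colour colouur")

# The whole group "uary" is optional. finditer with group(0) returns the
# full match rather than only the captured group.
months = [m.group(0) for m in re.finditer(r"Jan(uary)?", "Jan January")]

print(colours)
print(months)
```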
11. Repetition
There are two repetition characters, star (*) and
plus (+).
The star (*) tries to match the preceding
token zero or more times.
The plus (+) tries to match the preceding
token one or more times.
Examples:
[\s\S]+ matches any character, one or more times
[\s\S]* matches any character, zero or more times
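The difference between the two only shows up when there is nothing to match, as this sketch illustrates. Python's re module is used in place of PCRE (an assumption), and the variable names are made up for the example.

```python
import re

# Star accepts zero repetitions, so it matches the empty string.
star_on_empty = re.fullmatch(r"[\s\S]*", "") is not None

# Plus needs at least one character, so the empty string fails.
plus_on_empty = re.fullmatch(r"[\s\S]+", "") is not None

# [\s\S] covers whitespace and non-whitespace, so newlines are included.
plus_on_lines = re.fullmatch(r"[\s\S]+", "line one\nline two") is not None

print(star_on_empty, plus_on_empty, plus_on_lines)
```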
12. Limiting Repetition
Sometimes we need to limit how many times a
token repeats. To do that, use curly braces { }.
The syntax is {min,max}. min is required. If you
leave max empty ({min,}), there is no upper limit. If
you omit both the comma and max ({min}), the
token is repeated exactly min times.
Example:
([A-Z]{3}|[0-9]{4}) matches three uppercase letters
or four digits
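The three forms of the brace syntax can be sketched like this, with Python's re module standing in for PCRE (an assumption). The test strings and the a{2,} and a{2,3} patterns are invented for illustration.

```python
import re

pattern = re.compile(r"([A-Z]{3}|[0-9]{4})")

# Exactly three uppercase letters or exactly four digits.
exact = [bool(pattern.fullmatch(s)) for s in ["ABC", "1234", "AB", "12345"]]

# {2,} has no upper limit; {2,3} stops at three repetitions.
open_ended = re.fullmatch(r"a{2,}", "aaaaa") is not None
bounded = re.fullmatch(r"a{2,3}", "aaaa") is not None

print(exact)
print(open_ended, bounded)
```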
13. Greediness
Greediness describes what the regex engine does
when a token gives it a choice between matching
and not matching, as an optional item does.
The engine always tries to match first, and gives
the match up only if the rest of the pattern would
otherwise fail. This can produce unexpected
results.
For example, applying the regex Feb 23(rd)? to the
string Today is Feb 23rd, 2003 always matches
Feb 23rd, never just Feb 23.
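The Feb 23 example can be reproduced with a short sketch. It assumes Python's re module behaves like PCRE for a greedy optional group, which holds for this simple case.

```python
import re

subject = "Today is Feb 23rd, 2003"

# The optional group is greedy: "rd" is present, so it is taken.
greedy_optional = re.search(r"Feb 23(rd)?", subject).group(0)

print(greedy_optional)
```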
14. Greediness Cont’d
Example with repetition:
You want to extract HTML tags while crawling a
website. Newcomers often use <.+> to match an
HTML tag, but it returns a different result than
expected. Let's match that pattern against
this string: "Saya <b>suka</b> makan"
The result is <b>suka</b>
Why?
15. Greediness Cont’d
That is because of greediness: in the pattern <.+>,
the .+ tries to match as many characters as
possible.
Let's go through it step by step.
First the engine searches for < in the string
"Saya <b>suka</b> makan", so "Saya " is
skipped.
After finding <, it runs .+, which matches any
character one or more times, so it consumes
everything from b to the end of the string. The
pattern still needs a >, so the engine backtracks
one character at a time until it reaches the last >
in the string. The result is therefore <b>suka</b>,
not <b> and </b>.
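The walkthrough above can be confirmed with a minimal sketch, again assuming Python's re module as a stand-in for PHP's preg_match_all.

```python
import re

subject = "Saya <b>suka</b> makan"

# Greedy .+ runs to the end of the string, then backtracks to the last ">".
greedy_tags = re.findall(r"<.+>", subject)

print(greedy_tags)
```

There is a single match covering both tags and the text between them.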
16. Laziness
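How do we fix the greediness problem? Make the
quantifier lazy by adding a question mark (?) after
the repetition operator (*, +, {min,max}) or after
the optional-item question mark. A lazy quantifier
matches as few characters as possible.
An alternative to laziness is a negated character
class.
Examples for the previous problem:
<.+?> matches < followed by as few characters as possible up to the first >
<[^>]+> matches < followed by one or more characters that are not >, then >
Both match <b> and </b> separately.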
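Both fixes can be compared side by side in the sketch below. It uses Python's re module in place of PCRE (an assumption); the lazy forms +? and ?? are written the same way in both.

```python
import re

subject = "Saya <b>suka</b> makan"

# Lazy quantifier: stop at the first ">" that lets the pattern succeed.
lazy_tags = re.findall(r"<.+?>", subject)

# Negated class: the repeated token can never run past a ">".
negated_tags = re.findall(r"<[^>]+>", subject)

# A lazy optional item (??) prefers to skip the group when it can.
lazy_optional = re.search(r"Feb 23(rd)??", "Today is Feb 23rd, 2003").group(0)

print(lazy_tags)
print(negated_tags)
print(lazy_optional)
```

The negated class is often preferred for tag matching because it says exactly which characters are allowed and needs no backtracking to find the closing bracket.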