Working with text, Regular expressions

Perl Programming
Course
Working with text
Regular expressions

Krassimir Berov

I-can.eu

Contents
1. Simple word matching
2. Character classes
3. Matching this or that
4. Grouping
5. Extracting matches
6. Matching repetitions
7. Search and replace
8. The split operator

Simple word matching
• It's all about identifying patterns in text
• The simplest regex is simply
a word – a string of characters.
• A regex consisting of a word matches any
string that contains that word
• The sense of the match can be reversed
by using !~ operator
my $string ='some probably big string containing just
about anything in it';
print "found 'string'n" if $string =~ /string/;
print "it is not about dogsn" if $string !~ /dog/;

(2)
• The literal string in the regex can be
replaced by a variable
• If matching against $_ , the $_ =~ part
can be omitted:

my $string ='stringify this world';
my $word = 'string'; my $animal = 'dog';
print "found '$word'n" if $string =~ /$word/;
print "it is not about ${animal}sn"
if $string !~ /$animal/;
for('dog','string','dog'){
print "$wordn" if /$word/
}

(3)
• The // default delimiters for a match can be
changed to arbitrary delimiters by putting an 'm'
in front
• Regexes must match a part of the string exactly
in order for the statement to be true
my $string ='Stringify this world!';
my $word = 'string'; my $animal = 'dog';
print "found '$word'n" if $string =~ m#$word#;
print "found '$word' in any casen"
if $string =~ m#$word#i;
print "it is not about ${animal}sn"
if $string !~ m($animal);
for('dog','string','Dog'){
local $=$/;
print if m|$animal|
}

(4)
• perl will always match at the earliest possible
point in the string
my $string ='Stringify this stringy world!';
my $word = 'string';
print "found '$word' in any casen"
if $string =~ m{$word}i;

• Some characters, called metacharacters, are
reserved for use in regex notation. The
metacharacters are (14):
{ } [ ] ( ) ^ $ . | * + ?

• A metacharacter can be matched by putting a
backslash before it
print "The string n'$string'n contains a DOTn"
if $string =~ m|.|;

(5)
• Non-printable ASCII characters are
represented by escape sequences
• Arbitrary bytes are represented by octal
escape sequences
use utf8;
binmode(STDOUT, ':utf8') if $ENV{LANG} =~/UTF-8/;
$=$/;
my $string ="containsrn Then we have sometttabs.
б";
print 'matched б(x{431})'
if $string =~ /x{431}/;
print 'matched б' if $string =~/б/;
print 'matched rn' if $string =~/rn/;
print 'The string was:"' . $string.'"';

(6)
• To specify where it should match, use the
anchor metacharacters ^ and $ .

use strict; use warnings;$=$/;
my $string ='A probably long chunk of text containing
strings';
print 'matched "A"' if $string =~ /^A/;
print 'matched "strings"' if $string =~ /strings$/;
print 'matched "A", matched "strings" and something in
between'
if $string =~ /^A.*?strings$/;

Character classes
• A character class allows a set of possible
characters, to match
• Character classes are denoted by brackets [ ]
with the set of characters to be possibly matched
inside
• The special characters for a character class are
- ] ^ $ and are matched using an escape
• The special character '-' acts as a range operator
within character classes so you can write [0-9]
and [a-z]

Character classes
• Example
my $string ='A probably long chunk of text containing
strings';
my $thing = 'ong ung ang enanything';
my $every = 'iiiiii';
my $nums = 'I have 4325 Euro';
my $class = 'dog';
print 'matched any of a, b or c'
if $string =~ /[abc]/;

for($thing, $every, $string){
print 'ingy brrrings nothing using: '.$_
if /[$class]/
}
print $nums if $nums =~/[0-9]/;

Character classes
• Perl has several abbreviations for common character
classes
• d is a digit – [0-9]
• s is a whitespace character – [ trnf]
• w is a word character
(alphanumeric or _) – [0-9a-zA-Z_]
• D is a negated d – any character but a digit [^0-9]
• S is a negated s; it represents any non-whitespace
character [^s]
• W is a negated w – any non-word character
• The period '.' matches any character but "n"
• The dswDSW inside and outside of character classes
• The word anchor b matches a boundary between a word
character and a non-word character wW or Ww

Character classes
• Example
my $digits ="here are some digits3434 and then ";

print 'found digit' if $digits =~/d/;

print 'found alphanumeric' if $digits =~/w/;

print 'found space' if $digits =~/s/;

print 'digit followed by space, followed by letter'
if $digits =~/ds[A-z]/;

Matching this or that
• We can match different character strings
with the alternation metacharacter '|'
• perl will try to match the regex at the
earliest possible point in the string
my $digits ="here are some digits3434 and then ";

print 'found "are" or "and"' if $digits =~/are|and/;

Extracting matches
• The grouping metacharacters () also
allow the extraction of the parts of a
string that matched
• For each grouping, the part that matched
inside goes into the special variables $1 ,
$2 , etc.
• They can be used just as ordinary
variables

Extracting matches
• The grouping metacharacters () allow a part of a
regex to be treated as a single unit
• If the groupings in a regex are nested, $1 gets the
group with the leftmost opening parenthesis, $2
the next opening parenthesis, etc.
my $digits ="here are some digits3434 and then678 ";

print 'found a letter followed by leters or digits":'.$1
if $digits =~/[a-z]([a-z]|d+)/;
print 'found a letter followed by digits":'.$1
if $digits =~/([a-z](d+))/;
# $1 $2
print 'found letters followed by digits":'.$1
if $digits =~/([a-z]+)(d+)/;
# $1 $2

Matching repetitions
• The quantifier metacharacters ?, * , + , and {} allow
us to determine the number of repeats of a portion
of a regex
• Quantifiers are put immediately after the character,
character class, or grouping
• a? = match 'a' 1 or 0 times
• a* = match 'a' 0 or more times, i.e., any number of times
• a+ = match 'a' 1 or more times, i.e., at least once
• a{n,m} = match at least n times, but not more than m
times
• a{n,} = match at least n or more times
• a{n} = match exactly n times



print 'found some letters followed by leters or
digits":'.$1 .$2
if $digits =~/([a-z]{2,})(w+)/;

print 'found three letter followed by digits":'.$1 .$2
if $digits =~/([a-z]{3}(d+))/;

print 'found up to four letters followed by digits":'.
$1 .$2
if $digits =~/([a-z]{1,4})(d+)/;

• Greeeedy


print 'found as much as possible letters
followed by digits":'.$1 .$2
if $digits =~/([a-z]*)(d+)/;

Search and replace
• Search and replace is performed using
s/regex/replacement/modifiers.
• The replacement is a Perl double quoted string
that replaces in the string whatever is matched with
the regex .
• The operator =~ is used to associate a string with
s///.
• If matching against $_ , the $_ =~ can be dropped.
• If there is a match, s/// returns the number of
substitutions made, otherwise it returns false

Search and replace
• The matched variables $1 , $2 , etc. are immediately
available for use in the replacement expression.
• With the global modifier, s///g will search and
replace all occurrences of the regex in the string
• The evaluation modifier s///e wraps an eval{...}
around the replacement string and the evaluated
result is substituted for the matched substring.
• s/// can use other delimiters, such as s!!! and s{}{},
and even s{}//
• If single quotes are used s''', then the regex and
replacement are treated as single quoted strings

Search and replace
• Example

#TODO....

The split operator
• split /regex/, string
splits string into a list of substrings and
returns that list
• The regex determines the character sequence
that string is split with respect to
#TODO....

Regular expressions

• Resources
• perlrequick - Perl regular expressions quick start
• perlre - Perl regular expressions
• perlreref - Perl Regular Expressions Reference
• Beginning Perl
(Chapter 5 – Regular Expressions)

Regular expressions

Questions?

Working with text, Regular expressions

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (19)

Semelhante a Working with text, Regular expressions

Semelhante a Working with text, Regular expressions (20)

Mais de Krasimir Berov (Красимир Беров)

Mais de Krasimir Berov (Красимир Беров) (11)

Último

Último (20)

Working with text, Regular expressions