This document provides an introduction to using regular expressions (regex) for parsing text. It explains some common regex syntax elements like character classes, quantifiers, grouping, and alternation. Examples are given for matching letters, digits, whitespace, character ranges, and more. The document encourages trying out regex with tools like Firebug and grep. It also provides starter Python code to parse a sample text using regex to extract tags and titles.
5. • [] denotes a set of characters (a “character class”)
• slashes delimit a regex: /pattern/
• + denotes “one or more of previous thing”
• () denotes “remember this matched group”
• /[ab]/ # an ‘a’ or a ‘b’
• /[ab]+/ # one or more ‘a’s or ‘b’s
• /([ab]+)/ # a group of one or more ‘a’s or ‘b’s
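The three patterns above can be tried directly in Python. This is a minimal sketch: Python has no /.../ delimiters, so each pattern is written as a raw string, and the sample text is made up for illustration.

```python
import re

# Made-up sample text, chosen only for illustration.
text = "xxabbayy"

one = re.search(r"[ab]", text)      # an 'a' or a 'b'
run = re.search(r"[ab]+", text)     # one or more 'a's or 'b's
grp = re.search(r"([ab]+)", text)   # same run, remembered as group 1

print(one.group(0))   # a
print(run.group(0))   # abba
print(grp.group(1))   # abba
```

The parentheses do not change what matches; they only make the matched text retrievable by group number.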
7. • [a-z] is any lowercase character from a to z
• [0-9] is any digit
• + is one or more of previous thing
• ? is zero or one of previous thing
• | is “or”, e.g. (a|b) is ‘a’ or ‘b’; inside [] a | is just a literal character
• * is zero to many of previous thing
• . matches any character (except a newline, by default)
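A short sketch of ranges, quantifiers, alternation, and the dot, using Python's re.findall. The input strings are invented for the example; only the patterns come from the list above.

```python
import re

# Each call returns every non-overlapping match in the (made-up) input string.
lower = re.findall(r"[a-z]+", "abc DEF ghi")            # runs of lowercase letters
digits = re.findall(r"[0-9]+", "room 101, floor 7")     # runs of digits
optional = re.findall(r"colou?r", "color colour colr")  # the 'u' is optional
either = re.findall(r"cat|dog", "cat, dog, bird")       # 'cat' or 'dog'
star = re.findall(r"ab*", "a ab abbb")                  # 'a' then zero or more 'b's
dot = re.findall(r"h.t", "hat hot h t hit")             # any one character between h and t

print(lower)     # ['abc', 'ghi']
print(digits)    # ['101', '7']
print(optional)  # ['color', 'colour']
print(either)    # ['cat', 'dog']
print(star)      # ['a', 'ab', 'abbb']
print(dot)       # ['hat', 'hot', 'h t', 'hit']
```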
8. • [^a-z] is anything *but* [a-z]
• [a-zA-Z0-9] is any of a-z, A-Z, 0-9
• {5} matches exactly 5 of the preceding thing
• {2,} matches at least 2 of the preceding thing
• {2,6} matches from 2 to 6 of preceding thing
• \d is like [0-9] (any digit)
• \S is any non-whitespace character
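Negated classes, counted repetition, and the backslash shorthands for digits (\d) and non-whitespace (\S) can be sketched the same way. The input strings are made up; raw strings (r"...") keep Python from interpreting the backslashes itself.

```python
import re

not_lower = re.findall(r"[^a-z]+", "abc123def")      # anything but a-z
alnum = re.findall(r"[a-zA-Z0-9]+", "AB-42 ok!")     # letters and digits only
five = re.findall(r"\d{5}", "12345 678")             # exactly five digits
two_plus = re.findall(r"\d{2,}", "1 22 333")         # at least two digits
two_six = re.findall(r"\d{2,6}", "1 22 12345678")    # two to six digits, greedy
non_ws = re.findall(r"\S+", "  two  words ")         # runs of non-whitespace

print(not_lower)  # ['123']
print(alnum)      # ['AB', '42', 'ok']
print(five)       # ['12345']
print(two_plus)   # ['22', '333']
print(two_six)    # ['22', '123456', '78']
print(non_ws)     # ['two', 'words']
```

Note how {2,6} splits the eight-digit run: it takes six digits first, then the remaining two form a second match.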
9. try this
• visit any web page
• open firebug console
• title = window.document.title
• try regexes to match parts of the title
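The same exercise can be done outside the browser. The sketch below is a Python stand-in for the console session: the title string is hypothetical, standing in for whatever window.document.title returns on the page you visit.

```python
import re

# Hypothetical page title; in the browser this value would come from
# window.document.title.
title = "Regular expression - Wikipedia"

first_word = re.match(r"[A-Za-z]+", title)   # letters at the very start
site = re.search(r"- (.+)$", title)          # everything after the "- "

print(first_word.group(0))   # Regular
print(site.group(1))         # Wikipedia
```

re.match only tries the start of the string, while re.search scans for the first position where the pattern fits.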
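16. #!/usr/bin/env python3
import re

# a tag is a run of capitals, spaces and dots followed by a colon
re_tag = re.compile(r'([A-Z .]+):')
# capture everything after the "TITLE: " tag
re_title = re.compile(r'TITLE: (.*)')

with open('journals-carol-bean.txt') as f:
    for line in f:
        line = line.strip()
        if line == "":
            continue  # skip blank lines
        # match() only tries the pattern at the start of the line
        m1 = re_tag.match(line)
        m2 = re_title.match(line)
        print("\n->", line, "<-")
        if m1 or m2:
            print("MATCH")
        if m1:
            print('tag:', m1.groups())
        if m2:
            print('title:', m2.groups())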
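To try the two patterns without the input file, here is a small self-contained sketch that runs them over a few in-memory lines. The sample lines are invented; the real layout of journals-carol-bean.txt is an assumption.

```python
import re

re_tag = re.compile(r'([A-Z .]+):')
re_title = re.compile(r'TITLE: (.*)')

# Made-up records standing in for the contents of the real file.
sample_lines = [
    "TITLE: Journal of Examples",
    "PUB. DATE: 2009",
    "",
    "no tag on this line",
]

tags = []
titles = []
for line in sample_lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    m1 = re_tag.match(line)
    m2 = re_title.match(line)
    if m1:
        tags.append(m1.group(1))    # text before the colon
    if m2:
        titles.append(m2.group(1))  # text after "TITLE: "

print(tags)    # ['TITLE', 'PUB. DATE']
print(titles)  # ['Journal of Examples']
```

A title line matches both patterns, because "TITLE" is itself a run of capitals followed by a colon.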