hacker 102
 code4lib 2010 preconference
Asheville, NC, USA 2010-02-21
iv. regular expressions

      JavaScript
if all language
      looked like
“aabaaaabbbabaababa”
         it’d be
    easy to parse
parsing
“aabaaaabbbabaababa”
  •   there are two
      elements, “a” and “b”
  •   either may occur in
      any order
  •   /([ab]+)/
• [] denotes “elements” or “class”
• // demarcates regex
• + denotes “one or more of previous thing”
• () denotes “remember this matched group”
• /[ab]/ # an ‘a’ or a ‘b’
• /[ab]+/ # one or more ‘a’s or ‘b’s
• /([ab]+)/ # a group of one or more ‘a’s or ‘b’s
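A minimal sketch of the three patterns above, written in Python (the language used in part v) rather than JavaScript. The "xyz " prefix on the sample string is an assumption added for illustration, so the match does not start at position 0.

```python
import re

# the string from the slide, with a made-up "xyz " prefix
text = "xyz aabaaaabbbabaababa"

# /[ab]/ : the first single 'a' or 'b'
one = re.search(r'[ab]', text)

# /[ab]+/ : the first run of one or more 'a's or 'b's
run = re.search(r'[ab]+', text)

# /([ab]+)/ : the same run, remembered as group 1
grouped = re.search(r'([ab]+)', text)

print(one.group(0))       # a
print(run.group(0))       # aabaaaabbbabaababa
print(grouped.group(1))   # aabaaaabbbabaababa
print(grouped.start(1))   # 4, the run starts after "xyz "
```

Python has no // delimiters; the pattern is passed as a raw string (r'...') instead.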
to firebug!
• [a-z] is any lower case char bet. a-z
• [0-9] is any digit
• + is one or more of previous thing
• ? is zero or one of previous thing
• | is or, e.g. (a|b) is ‘a’ or ‘b’ (inside [] a | is just a literal ‘|’)
• * is zero to many of previous thing
• . matches any character (except a newline)
• [^a-z] is anything *but* [a-z]
• [a-zA-Z0-9] is any of a-z, A-Z, 0-9
• {5} matches exactly 5 of the preceding thing
• {2,} matches at least 2 of the preceding thing
• {2,6} matches from 2 to 6 of preceding thing
• \d is like [0-9] (any digit)
• \S is any non-whitespace
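One way to check the cheat sheet above is to run each pattern against a sample that should match. This is a sketch in Python rather than the Firebug console; the sample strings are made up for illustration, and re.fullmatch is used so the whole sample has to match.

```python
import re

# (pattern, sample that should match in full); samples are invented
checks = [
    (r'[a-z]+',       'hello'),      # one or more lower case chars
    (r'[0-9]+',       '2010'),       # one or more digits
    (r'colou?r',      'color'),      # ? : zero 'u's
    (r'colou?r',      'colour'),     # ? : one 'u'
    (r'(a|b)',        'b'),          # | : 'a' or 'b'
    (r'ab*',          'a'),          # * : zero 'b's is fine
    (r'a.c',          'a-c'),        # . : any character
    (r'[^a-z]',       'Q'),          # anything but a-z
    (r'[a-zA-Z0-9]+', 'code4lib'),   # letters and digits
    (r'\d{5}',        '28801'),      # exactly 5 digits
    (r'\d{2,}',       '2010'),       # at least 2 digits
    (r'\d{2,6}',      '123456'),     # from 2 to 6 digits
    (r'\S+',          'Asheville'),  # non-whitespace
]

results = [(p, s, re.fullmatch(p, s) is not None) for p, s in checks]
for pattern, sample, matched in results:
    print(pattern, sample, matched)

# inside [] the | is not "or": [a|b] also matches a literal '|'
pipe_is_literal = re.fullmatch(r'[a|b]', '|') is not None
print(pipe_is_literal)
```

Every row should print True. Changing a sample (say '28801' to '2880') is a quick way to see a pattern fail.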
try this

• visit any web page
• open firebug console
• title = window.document.title
• try regexes to match parts of
  the title
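The same exercise can be tried outside the browser. A sketch in Python, assuming a hard-coded title string as a stand-in for window.document.title (Python has no access to the page):

```python
import re

# stand-in for window.document.title; any page title will do
title = "Hacker 102 - regexes w/Javascript, Python"

# first run of digits
number = re.search(r'\d+', title)

# every run of letters
words = re.findall(r'[A-Za-z]+', title)

# everything after the "- ", remembered as group 1
after_dash = re.search(r'- (.*)', title)

print(number.group(0))       # 102
print(words)                 # ['Hacker', 'regexes', 'w', 'Javascript', 'Python']
print(after_dash.group(1))   # regexes w/Javascript, Python
```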
most every language
 has regex support
try unix “grep”
v. glue it together

     Python
problem: Carol’s data
TITLE: ABA journal.
BD. HOLDINGS: Vol. 70 (1984) - Vol. 94 (2008)
CURRENT VOL.: Vol. 95 (2009) -
OTHER LIBRARIES:
      Miami:v. 68 (1982) -
      USDC: v. 88 (2002) -
      Birm.:v. 89 (2003) -
(Formerly: American Bar Association Journal)
(Bound and on Hein)


TITLE: Administrative law review.
BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)
CURRENT VOL.: Vol. 61 (2009) -
(Bound and on Hein)
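One possible first step with this data is pulling volume numbers and years out of the holdings lines. The pattern below is a guess based only on the lines shown here ("Vol. 70 (1984)", "v. 68 (1982)", years like "1969/1970"); the real file may hold other variations.

```python
import re

# "Vol." or "v.", an optional space, a volume number, then a year in parens
re_vol = re.compile(r'[Vv]o?l?\. ?(\d+) \(([0-9/]+)\)')

lines = [
    "BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)",
    "CURRENT VOL.: Vol. 95 (2009) -",
    "Miami:v. 68 (1982) -",
    "(Bound and on Hein)",
]

# findall returns one (volume, year) tuple per match on the line
found = [re_vol.findall(line) for line in lines]
for line, pairs in zip(lines, found):
    print(line, "=>", pairs)
```

The "VOL." inside the "CURRENT VOL.:" tag is not picked up, because the pattern wants a digit after the period.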
starter code
   for you
#!/usr/bin/env python3
import re

# a tag is a run of capitals, spaces, and periods followed by a colon
re_tag = re.compile(r'([A-Z .]+):')
# capture everything after the TITLE: tag
re_title = re.compile(r'TITLE: (.*)')

with open('journals-carol-bean.txt') as f:
    for line in f:
        line = line.strip()
        if line == "":
            continue  # skip blank lines
        m1 = re_tag.match(line)
        m2 = re_title.match(line)
        print("\n->", line, "<-")
        if m1 or m2:
            print("MATCH")
        if m1:
            print('tag:', m1.groups())
        if m2:
            print('title:', m2.groups())
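One way the starter code might be taken further is to group the tagged lines into one dict per journal. This is only a sketch: the sample text is copied from the data slide instead of read from the file, re_field and records are names made up here, and untagged lines (notes in parentheses, the "Miami:v. ..." lines) are simply skipped.

```python
import re

# a few records copied from the data slide
sample = """\
TITLE: ABA journal.
BD. HOLDINGS: Vol. 70 (1984) - Vol. 94 (2008)
CURRENT VOL.: Vol. 95 (2009) -
(Bound and on Hein)

TITLE: Administrative law review.
BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008)
CURRENT VOL.: Vol. 61 (2009) -
(Bound and on Hein)
"""

# group 1 is the tag, group 2 is whatever follows the colon
re_field = re.compile(r'([A-Z .]+): ?(.*)')

records = []
for line in sample.splitlines():
    line = line.strip()
    if line == "":
        continue
    m = re_field.match(line)
    if not m:
        continue  # untagged lines are ignored in this sketch
    tag, value = m.groups()
    if tag == 'TITLE':
        records.append({})  # a TITLE line starts a new record
    if records:
        records[-1][tag] = value

for record in records:
    print(record)
```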
