Introduction to Regular Expressions

Ben Brumfield
RootsTech 2013
Our Texts
• My Texts
  – Manuscript Transcripts
  – OCR
• Our Texts
  – Name Variants
  – Abbreviations
  – Spelling Changes
  – “Mistakes”
What are Regular Expressions?
• Very small language for describing text.
• Not a programming language.
• Incredibly powerful tool for search/replace
  operations.
• Old (1950s-60s)
• Arcane art.
• Ubiquitous.
Why Use Regular Expressions?
• Finding every instance of a string in a file
  – e.g. every mention of “chickens” in a farm
    diary
• How many times does “sing” appear in a
  text in all tenses and conjugations?
• Reformatting dirty data
• Validating input.
• Command line work – listing files,
  grepping log files
The Basics
• A regex is a pattern enclosed within
  delimiters.
• Most characters match themselves.
• /rootstech/ is a regular expression that
  matches “rootstech”.
  – Slash is the delimiter enclosing the
    expression.
  – “rootstech” is the pattern.
/at/
• Matches strings with “a” followed by “t”.
• Test strings: at, hat, that, atlas, aft, Athens
  – Match: at, hat, that, atlas
  – No match: aft (“f” separates the letters), Athens (matching is case sensitive)
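The slide above can be checked with a short script. This is a minimal sketch assuming Python's built-in re module, which takes the pattern as a raw string rather than between slash delimiters; the variable names are illustrative only.

```python
import re

# The six test strings from the slide.
samples = ["at", "hat", "that", "atlas", "aft", "Athens"]

# re.search looks for the pattern anywhere in the string.
matched = [s for s in samples if re.search(r"at", s)]
unmatched = [s for s in samples if not re.search(r"at", s)]

print(matched)
print(unmatched)
```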
Some Theory
• Finite State Machine for the regex /at/
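The diagram for this slide did not survive, so the following is only one possible way to write such a machine by hand, not a reproduction of the original figure. The function name contains_at and the state numbering are assumptions made for illustration.

```python
def contains_at(text):
    """Hand-written state machine that accepts any text containing "at"."""
    # State 0: nothing useful seen yet.
    # State 1: the previous character was "a".
    # State 2: "a" followed by "t" has been seen (accepting state).
    state = 0
    for ch in text:
        if state == 1 and ch == "t":
            state = 2
            break  # once accepted, the rest of the text does not matter
        # Any "a" (re)starts a potential match; anything else resets.
        state = 1 if ch == "a" else 0
    return state == 2
```

Real regex engines build machines like this automatically from the pattern.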
Characters
• Matching is case sensitive.
• Special characters: ( ) ^ $ { } [ ] \ | . + ? *
• To match a special character in your text,
  precede it with \ in your pattern:
  – /snarky [sic]/ does not match “snarky [sic]”
  – /snarky \[sic\]/ matches “snarky [sic]”
• Regular expressions can support Unicode.
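To see why the backslashes matter, here is a small sketch assuming Python's re module. Unescaped, [sic] is read as a character class meaning one of “s”, “i” or “c”. re.escape is a standard-library helper that adds the backslashes for you; the variable names are illustrative.

```python
import re

text = "snarky [sic]"

# Unescaped: "[sic]" is a character class, so this looks for "snarky s",
# "snarky i" or "snarky c" and finds none of them.
unescaped = re.search(r"snarky [sic]", text)

# Escaped by hand: the brackets are now literal characters.
escaped = re.search(r"snarky \[sic\]", text)

# Escaped automatically from a literal string.
auto = re.search(re.escape("snarky [sic]"), text)

print(unescaped is None, escaped.group(0), auto.group(0))
```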
Character Classes
• Characters within [ ] are choices for a
  single-character match.
• Think of a set operation, or a type of or.
• Order within the set is unimportant.
• /x[01]/ matches “x0” and “x1”.
• /[10][23]/ matches “02”, “03”, “12” and
  “13”.
• An initial ^ negates the class:
  – /[^45]/ matches any character except 4 or 5.
/[ch]at/
• Matches strings with “c” or “h”, followed by
  “a”, followed by “t”.
• Test strings: that, at, chat, cat, fat, phat
  – Match: that, chat, cat, phat
  – No match: at, fat
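A quick check of the character-class slides, sketched with Python's re module (an assumption; any regex engine should agree on these cases). The variable names are illustrative.

```python
import re

samples = ["that", "at", "chat", "cat", "fat", "phat"]

# [ch] stands for exactly one character: either "c" or "h".
class_hits = [s for s in samples if re.search(r"[ch]at", s)]

# A leading ^ inside the brackets negates the class.
not_4_or_5 = re.findall(r"[^45]", "3456")

print(class_hits)
print(not_4_or_5)
```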
Ranges
• Ranges define sets of characters within a
  class.
  – /[1-9]/ matches any non-zero digit.
  – /[a-zA-Z]/ matches any letter.
  – /[12][0-9]/ matches numbers between 10 and
    29.
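The range examples can be tried as follows. This sketch assumes Python's re module and uses re.fullmatch so that the whole string must fit the pattern. The helper name is_10_to_29 is hypothetical.

```python
import re

def is_10_to_29(s):
    """True when s is exactly two characters: 1 or 2, then any digit."""
    return re.fullmatch(r"[12][0-9]", s) is not None

# [1-9] picks out the non-zero digits one at a time.
nonzero_digits = re.findall(r"[1-9]", "2013")

print([s for s in ["10", "29", "30", "9", "100"] if is_10_to_29(s)])
print(nonzero_digits)
```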
Shortcuts
Shortcut   Name         Equivalent Class
  \d       digit        [0-9]
  \D       not digit    [^0-9]
  \w       word         [a-zA-Z0-9_]
  \W       not word     [^a-zA-Z0-9_]
  \s       space        [\t\n\r\f\v ]
  \S       not space    [^\t\n\r\f\v ]
  .        everything   [^\n] (depends on mode)
/\d\d\d[- ]\d\d\d\d/
• Matches strings with:
  – Three digits
  – Space or dash
  – Four digits
• Test strings: 501-1234, 234 1252, 652.2648, 713-342-7452, PE6-5000, 653-6464x256
  – Match: 501-1234, 234 1252, 713-342-7452 (on “342-7452”), 653-6464x256 (on “653-6464”)
  – No match: 652.2648, PE6-5000
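The phone-number pattern can be run against the slide's test strings to see which part of each string is matched. This is a sketch assuming Python's re module; PHONE and first_match are illustrative names.

```python
import re

PHONE = re.compile(r"\d\d\d[- ]\d\d\d\d")

def first_match(s):
    """Return the first matching substring of s, or None if there is none."""
    m = PHONE.search(s)
    return m.group(0) if m else None

samples = ["501-1234", "234 1252", "652.2648",
           "713-342-7452", "PE6-5000", "653-6464x256"]
found = {s: first_match(s) for s in samples}

for s, hit in found.items():
    print(s, "->", hit)
```

The pattern matches part of a longer string, which is why “713-342-7452” and “653-6464x256” still count as matches.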
Repeaters
• Symbols indicating that the preceding
  element of the pattern can repeat.
• /runs?/ matches runs or run
• /1\d*/ matches any number beginning
  with “1”.

Repeater   Count
   ?       zero or one
   +       one or more
   *       zero or more
  {n}      exactly n
 {n,m}     between n and m times
  {,m}     no more than m times
  {n,}     at least n times
Repeaters
Strings:
1: “at”       2: “art”
3: “arrrrt”   4: “aft”
Patterns:
A: /ar?t/     B: /a[fr]?t/
C: /ar*t/     D: /ar+t/
E: /a.*t/     F: /a.+t/
Repeaters
•   /ar?t/ matches “at” and “art” but not “arrrrt”.
•   /a[fr]?t/ matches “at”, “art”, and “aft”.
•   /ar*t/ matches “at”, “art”, and “arrrrt”.
•   /ar+t/ matches “art” and “arrrrt” but not “at”.
•   /a.*t/ matches anything with an ‘a’
    eventually followed by a ‘t’.
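The string/pattern grid from the Repeaters slides can be worked out mechanically. The sketch below assumes Python's re module and uses re.fullmatch, so each pattern must account for the entire string; the labels A to F follow the slide.

```python
import re

strings = ["at", "art", "arrrrt", "aft"]
patterns = {
    "A": r"ar?t",
    "B": r"a[fr]?t",
    "C": r"ar*t",
    "D": r"ar+t",
    "E": r"a.*t",
    "F": r"a.+t",
}

# For each pattern, list the strings it matches in full.
results = {
    label: [s for s in strings if re.fullmatch(p, s)]
    for label, p in patterns.items()
}

for label, hits in results.items():
    print(label, patterns[label], hits)
```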
Lab Session I
Try this URL:


tinyurl.com/rootstechlab
Lab Session I
Match “Brumfield” and “Bromfield” in

1702 John Bromfield's estate had been
  proved in Isle of Wight prior to 1702,
Anne Brumfield rec'd. more than her share
  from her father's estate.
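One possible answer to this lab, sketched in Python (the original lab ran in a web tool, so the script form is an assumption): a character class covers the one letter that varies.

```python
import re

lab_text = (
    "1702 John Bromfield's estate had been proved in Isle of Wight "
    "prior to 1702, Anne Brumfield rec'd. more than her share "
    "from her father's estate."
)

# [uo] accepts either spelling of the second vowel.
surnames = re.findall(r"Br[uo]mfield", lab_text)
print(surnames)
```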
Lab Reference
Repeater   Count
   ?       zero or one
   +       one or more
   *       zero or more
  {n}      exactly n times
 {n,m}     between n and m times
  {,m}     no more than m times
  {n,}     at least n times

Shortcut   Name
  \d       digit
  \D       not digit
  \w       word
  \W       not word
  \s       space
  \S       not space
  .        everything
Anchors
• Anchors match between characters.
• Used to assert that the characters you’re
  matching must appear in a certain place.
• /\bat\b/ matches “at work” but not “batch”.

Anchor   Matches
  ^      start of line
  $      end of line
  \b     word boundary
  \B     not boundary
  \A     start of string
  \Z     end of string
  \z     raw end of string (rare)
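A short sketch of anchors in practice, assuming Python's re module. In that module ^ and $ only match at line boundaries when the re.MULTILINE flag is set; other engines may behave differently. The variable names are illustrative.

```python
import re

word_at = re.compile(r"\bat\b")

# \b requires a word boundary on both sides of "at".
in_phrase = word_at.search("at work") is not None
in_batch = word_at.search("batch") is not None

text = "cat\nat work"

# Without MULTILINE, ^ only matches at the very start of the string.
default_mode = re.findall(r"^at", text)

# With MULTILINE, ^ also matches right after each newline.
line_mode = re.findall(r"^at", text, flags=re.MULTILINE)

print(in_phrase, in_batch, default_mode, line_mode)
```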
Alternation
• In Regex, | means “or”.
• You can put a full expression on the left
  and another full expression on the right.
• Either can match.
• /seeks?|sought/ matches “seek”, “seeks”,
  or “sought”.
Grouping
• Everything within ( … ) is grouped into a
  single element for the purposes of
  repetition and alternation.
• The expression /(la)+/ matches “la”, “lala”,
  “lalalala” but not “all”.
• /schema(ta)?/ matches “schema” and
  “schemata” as whole words; in “schematic”
  it matches only the leading “schema”.
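The grouping examples can be checked with the sketch below, which assumes Python's re module. re.fullmatch tests the whole string, while re.search shows what an unanchored pattern picks out of a longer word.

```python
import re

# (la)+ repeats the whole group "la", not just the final "a".
la_hits = [s for s in ["la", "lala", "lalalala", "all"]
           if re.fullmatch(r"(la)+", s)]

# (ta)? makes the whole group "ta" optional.
schema_hits = [w for w in ["schema", "schemata", "schematic"]
               if re.fullmatch(r"schema(ta)?", w)]

# Unanchored, the pattern still finds the "schema" inside "schematic".
partial = re.search(r"schema(ta)?", "schematic").group(0)

print(la_hits, schema_hits, partial)
```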
Grouping Example
• What regular expression matches “eat”,
  “eats”, “ate” and “eaten”?
• /eat(s|en)?|ate/
• Add word boundary anchors to exclude
  “sate” and “eating”: /\b(eat(s|en)?|ate)\b/
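The effect of the word boundaries is easiest to see side by side. This is a sketch assuming Python's re module; the names LOOSE and STRICT are illustrative.

```python
import re

LOOSE = re.compile(r"eat(s|en)?|ate")
STRICT = re.compile(r"\b(eat(s|en)?|ate)\b")

words = ["eat", "eats", "ate", "eaten", "sate", "eating"]

# Without anchors, "sate" matches on "ate" and "eating" on "eat".
loose_hits = [w for w in words if LOOSE.search(w)]

# With \b on both sides, only the four intended words match.
strict_hits = [w for w in words if STRICT.search(w)]

print(loose_hits)
print(strict_hits)
```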
Lab Session II
Match “William” and “Wm.” in

1736 Robert Mosby and John Brumfield
  processioned the lands of Wm. Brittain
1739 … Witnesses: Richard Echols, William
  Brumfield, John Hendrick
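One possible answer to this lab, again sketched in Python as an assumption about tooling: make the middle letters and the trailing period optional. An alternation such as /William|Wm\./ would work as well.

```python
import re

lab_text = (
    "1736 Robert Mosby and John Brumfield processioned the lands of "
    "Wm. Brittain 1739 … Witnesses: Richard Echols, William Brumfield, "
    "John Hendrick"
)

# "W", optionally "illia", then "m", optionally followed by a literal period.
pattern = re.compile(r"W(illia)?m\.?")

# finditer with group(0) returns the full matched text, not the inner group.
given_names = [m.group(0) for m in pattern.finditer(lab_text)]
print(given_names)
```

“Witnesses” is not matched because no “m” follows its “W”.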
Replacement
• Regex is most often used for search/replace.
• Syntax varies; Perl and CLI tools such as
  sed and vi use s/pattern/replacement/ .
• s/dog/hound/ converts “slobbery dogs” to
  “slobbery hounds”.
• s/\bsheeps\b/sheep/ converts
  – “sheepskin is made from sheeps” to
  – “sheepskin is made from sheep”
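Python has no s/// operator, so the sketch below uses re.sub(pattern, replacement, text) from the standard library as a rough equivalent. That substitution is an assumption of this example, not something the slides prescribe.

```python
import re

# Plain replacement: every "dog" becomes "hound".
dogs = re.sub(r"dog", "hound", "slobbery dogs")

# Word boundaries keep "sheepskin" untouched and fix only the word "sheeps".
sheep = re.sub(r"\bsheeps\b", "sheep", "sheepskin is made from sheeps")

print(dogs)
print(sheep)
```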
Capture
• During searches, ( … ) groups capture
  patterns for use in replacement.
• Special variables $1, $2, $3 etc. contain
  the capture.
• /(\d\d\d)-(\d\d\d\d)/ applied to “123-4567”:
  – $1 contains “123”
  – $2 contains “4567”
Capture
• How do you convert
  – “Smith, James” and “Jones, Sally” to
  – “James Smith” and “Sally Jones”?
• s/(\w+), (\w+)/$2 $1/
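The capture slides translate to Python roughly as follows. This sketch assumes the re module, where captures are read with group(1), group(2) and written as \1, \2 in the replacement string instead of $1, $2. The helper name flip_name is hypothetical.

```python
import re

# Reading captures after a search.
m = re.search(r"(\d\d\d)-(\d\d\d\d)", "123-4567")
first, second = m.group(1), m.group(2)

def flip_name(name):
    """Turn "Surname, Given" into "Given Surname"."""
    return re.sub(r"(\w+), (\w+)", r"\2 \1", name)

print(first, second)
print(flip_name("Smith, James"))
print(flip_name("Jones, Sally"))
```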
Caveats
• Check the language/application-specific
  documentation: some common shortcuts
  are not universal.
Questions

Ben Brumfield
benwbrum@gmail.com
FromThePage.com
ManuscriptTranscription.blogspot.com
