Regular expressions
Paolo Carrasco

The Nearshore Advantage!
• Definition
• How a regex engine works
• Applications
  – Matching
    • Regex pattern
    • Characters
      – Literal characters
      – Special characters
      – Character classes/sets
    • Anchors
    • Repetition and optional items
  – Extracting
    • Grouping
    • Alternation
    • Groups
  – Replacing
• Advanced topics
  – Greediness
  – Lookahead and lookbehind
• Techniques for Faster Expressions

• Basically, a regular expression is a pattern describing a certain amount of text.
• The name comes from the mathematical theory on which regular expressions are based.
• Regular expressions provide a powerful, flexible, and efficient method for processing text.
• A regular expression is sometimes also called a regex or regexp.
The centerpiece of text
processing with regular
expressions is the regex
engine.
A regex "engine" is a piece
of software that can
process regular
expressions, trying to
match the pattern to the
given string.
Usually, the engine is part
of a larger application and
you do not access the
engine directly.


At a minimum, processing text using regular expressions requires that the engine be provided with the following two items of information:

• The regular expression pattern to identify in the text.
• The text to parse for the regular expression pattern.
• (Optionally) A replacement string.

[Diagram: Pattern, Input, and (optionally) Replacement are fed into the Regex Engine]
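As a minimal sketch of these inputs, the following uses Python's built-in `re` module as the engine. The pattern, the input text, and the replacement string are made-up values chosen only for illustration.

```python
import re

# Hypothetical inputs, chosen only for illustration.
pattern = r"\d+"                      # the regular expression pattern
text = "Order 66 shipped in 3 boxes"  # the text to parse
replacement = "#"                     # the optional replacement string

# Matching: the engine scans the text for the first place the pattern fits.
first = re.search(pattern, text)
print(first.group())   # 66

# Replacing: the engine substitutes every match with the replacement string.
replaced = re.sub(pattern, replacement, text)
print(replaced)        # Order # shipped in # boxes
```

Note that the engine is reached through library functions (`re.search`, `re.sub`) rather than directly, which is the usual situation described earlier.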


The extensive pattern-matching notation of regular expressions enables you to quickly parse large amounts of text to find specific character patterns; to validate text to ensure that it matches a predefined pattern; to extract, edit, replace, or delete text substrings; and to add the extracted strings to a collection in order to generate a report.

[Diagram: the Regex Engine supports three applications: Matching, Extracting, Replacing]

Applications: Matching

[Diagram: a Pattern is built from Literal Characters, Operators, and Constructs]

A regular expression is a pattern that the regular expression engine attempts to
match in input text. A pattern consists of one or more character
literals, operators, or constructs.

• All characters except the special characters [\^$.|?*+() are literal characters.
• A literal character matches a single instance of itself.
• Note: Regex engines are case sensitive by default.
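A short sketch of literal matching and default case sensitivity, using Python's `re` module. The sample text is an assumption made for illustration, and `re.IGNORECASE` is Python's flag for relaxing case sensitivity; other engines expose a similar option under different names.

```python
import re

text = "Regex and regex"  # hypothetical sample text

# Literal characters match themselves, and matching is case sensitive by default.
case_sensitive = re.findall(r"regex", text)
print(case_sensitive)     # ['regex']

# With the IGNORECASE flag, both spellings are found.
case_insensitive = re.findall(r"regex", text, re.IGNORECASE)
print(case_insensitive)   # ['Regex', 'regex']
```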

• Most of the time we will need special characters for complex matches.
• These characters are often called metacharacters.
• The dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused.

• When a special character is needed as a literal, put a backslash in front of the metacharacter.
• The backslash escapes the special character and suppresses its special meaning.
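The effect of escaping can be sketched in Python as follows. The sample text is hypothetical; `re.escape` is a helper in Python's standard library that adds the backslashes automatically.

```python
import re

text = "version 1.5 or 1x5"  # hypothetical sample text

# Unescaped, the dot is a metacharacter and matches any character.
unescaped = re.findall(r"1.5", text)
print(unescaped)   # ['1.5', '1x5']

# Escaped with a backslash, the dot only matches a literal period.
escaped = re.findall(r"1\.5", text)
print(escaped)     # ['1.5']

# re.escape() builds the escaped pattern from a literal string.
auto_pattern = re.escape("1.5")
auto = re.findall(auto_pattern, text)
print(auto)        # ['1.5']
```

This is also a typical case of the dot being misused: the unescaped pattern silently accepts "1x5".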
| Escaped character | Description | Pattern | Matches |
|---|---|---|---|
| \t | Matches a tab, \u0009. | (\w+)\t | "item1\t", "item2\t" in "item1\titem2\t" |
| \r | Matches a carriage return, \u000D. (\r is not equivalent to the newline character, \n.) | \r\n(\w+) | "\r\nThese" in "\r\nThese are\ntwo lines." |
| \n | Matches a new line, \u000A. | \r\n(\w+) | "\r\nThese" in "\r\nThese are\ntwo lines." |
| \s | Matches any white-space character. | \w+\s\w+ | "Hello world" in "Hello world" |
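The rows of this table can be checked with a small Python sketch. The patterns and inputs are the ones from the table; the variable names are only illustrative.

```python
import re

# \t: each full match is a word followed by a tab.
tabs = [m.group(0) for m in re.finditer(r"(\w+)\t", "item1\titem2\t")]
print(tabs)                  # ['item1\t', 'item2\t']

# \r\n: a carriage return and a new line followed by a word.
lines = re.search(r"\r\n(\w+)", "\r\nThese are\ntwo lines.")
print(repr(lines.group(0)))  # '\r\nThese'
print(lines.group(1))        # These (the text captured by the group)

# \s: two words separated by one white-space character.
words = re.search(r"\w+\s\w+", "Hello world")
print(words.group(0))        # Hello world
```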
A character class matches any one of a set of characters. The order of the characters inside a character class does not matter.

Positive character group
• A character in the input string must match one of a specified set of characters.

Negative character group
• A character in the input string must not match one of a specified set of characters.

Any character
• The dot or period character is a wildcard character that matches any character except \n.

Shorthand classes
• Since certain character classes are used often, a series of shorthand character classes is available.
• Shorthand character classes can be used both inside and outside the square brackets.
• Some shorthand classes have negated versions.
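A minimal Python sketch of the character classes described above. The sample words are invented for illustration; `\d` and its negated version `\D` are the shorthand classes for digits and non-digits.

```python
import re

text = "gray grey groy gr8y"  # hypothetical sample text

positive = re.findall(r"gr[ae]y", text)     # one of 'a' or 'e'
negative = re.findall(r"gr[^ae]y", text)    # anything except 'a' or 'e'
any_char = re.findall(r"gr.y", text)        # the dot matches any character
shorthand = re.findall(r"gr\dy", text)      # \d is shorthand for a digit
negated = re.findall(r"gr\Dy", text)        # \D is the negated version of \d
in_brackets = re.findall(r"gr[\de]y", text) # shorthand used inside brackets

print(positive)     # ['gray', 'grey']
print(negative)     # ['groy', 'gr8y']
print(any_char)     # ['gray', 'grey', 'groy', 'gr8y']
print(shorthand)    # ['gr8y']
print(negated)      # ['gray', 'grey', 'groy']
print(in_brackets)  # ['grey', 'gr8y']

# By default the dot does not match a newline character.
dot_newline = re.findall(r"a.b", "a\nb")
print(dot_newline)  # []
```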
| Anchor | Description |
|---|---|
| ^ | The match must occur at the beginning of the string or line. |
| $ | The match must occur at the end of the string or line, or before \n at the end of the string or line. |
| \b | The match must occur on a word boundary. |
| \B | The match must not occur on a word boundary. |


• Anchors match a
position before,
after or between
characters.
• They can be used
to "anchor" the
regex match at a
certain position.
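The anchors can be tried out with the following Python sketch. The two-line sample text is an assumption; `re.MULTILINE` is the Python flag that makes ^ and $ work per line instead of per string.

```python
import re

text = "cat concat\ncatalog cat"  # hypothetical two-line sample

# ^ anchors at the start of the string (or of each line with MULTILINE).
at_start = re.findall(r"^cat", text)
per_line = re.findall(r"^cat", text, re.MULTILINE)

# $ anchors at the end of the string.
at_end = re.findall(r"cat$", text)

# \b requires a word boundary, \B forbids one.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(at_start)     # ['cat']
print(per_line)     # ['cat', 'cat'] (the second one starts "catalog")
print(at_end)       # ['cat']
print(whole_word)   # [0, 19] (the two standalone words)
print(inside_word)  # [7] (the "cat" inside "concat")
```

Anchors match positions rather than characters, so every result above is still just the three letters "cat"; only the allowed positions change.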
| Operator | Meaning |
|---|---|
| * | Match zero or more |
| + | Match one or more |
| ? | Match zero or one |
| {} | Interval |
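A small Python sketch of the four repetition operators. The sample strings and the helper function `matching` are hypothetical and exist only for this illustration.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]  # hypothetical sample strings

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_more = matching(r"ca*t")    # * : zero or more 'a'
one_or_more = matching(r"ca+t")     # + : one or more 'a'
zero_or_one = matching(r"ca?t")     # ? : zero or one 'a'
interval = matching(r"ca{2,3}t")    # {2,3} : between two and three 'a'

print(zero_or_more)  # ['ct', 'cat', 'caat', 'caaat']
print(one_or_more)   # ['cat', 'caat', 'caaat']
print(zero_or_one)   # ['ct', 'cat']
print(interval)      # ['caat', 'caaat']
```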

Applications: Extracting

• A group, also known as a subexpression, consists of an "open-group operator", any number of other operators, and a "close-group operator". In most regex syntaxes these are the parentheses ( and ).
• Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.
• Alternation matches one of a choice of regular expressions:
  – If you put the character(s) representing the alternation operator between any two regular expressions A and B, the result matches the union of the strings that A and B match.
• It operates on the largest possible surrounding regular expressions. Thus, the only way you can delimit its arguments is to use grouping.
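The point about alternation taking the largest possible surrounding expressions can be sketched in Python, where the alternation operator is `|`. The sentences are invented for illustration.

```python
import re

# Without grouping, | splits the whole expression: "I like cats" OR "dogs".
ungrouped = re.compile(r"I like cats|dogs")

# With grouping, the alternation is limited to "cats" or "dogs".
grouped = re.compile(r"I like (cats|dogs)")

ungrouped_dogs = ungrouped.fullmatch("dogs") is not None
ungrouped_sentence = ungrouped.fullmatch("I like dogs") is not None
grouped_dogs = grouped.fullmatch("dogs") is not None
grouped_sentence = grouped.fullmatch("I like dogs") is not None

print(ungrouped_dogs)      # True
print(ungrouped_sentence)  # False
print(grouped_dogs)        # False
print(grouped_sentence)    # True
```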
• By placing part of a regular expression inside
round brackets or parentheses, you can group
that part of the regular expression together.
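Besides grouping, parentheses also capture the text they match, which is what makes extraction possible. Here is a minimal Python sketch; the log line and the key/value text are hypothetical inputs.

```python
import re

log_line = "2024-05-17 ERROR disk full"  # hypothetical input

# Each pair of parentheses captures one piece of the match.
m = re.search(r"(\d{4})-(\d{2})-(\d{2}) (\w+)", log_line)
year, month, day, level = m.groups()
print(year, month, day, level)  # 2024 05 17 ERROR

# findall collects the captured groups of every match into a list.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")
print(pairs)                    # [('a', '1'), ('b', '22')]
```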

Applications: Replacing

| Feature | .NET | Java | Perl | ECMA | Ruby |
|---|---|---|---|---|---|
| $& (whole regex match) | YES | error | YES | YES | no |
| $0 (whole regex match) | YES | YES | no | no | no |
| $1 through $99 (backreference) | YES | YES | YES | YES | no |
| ${1} through ${99} (backreference) | YES | error | YES | no | no |
| ${group} (named backreference) | YES | error | no | no | no |
| $` (backtick; subject text to the left of the match) | YES | error | YES | YES | no |
| $' (straight quote; subject text to the right of the match) | YES | error | YES | YES | no |
| $_ (entire subject string) | YES | error | YES | IE only | no |
| $+ (highest-numbered group in the regex) | YES | error | no | IE and Firefox | no |
| $$ (escape dollar with another dollar) | YES | error | no | YES | no |
| $ (unescaped dollar as literal text) | YES | error | no | YES | YES |
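Python is not one of the flavors in this table, and its replacement syntax uses backslashes rather than dollar signs. As a rough sketch of the same ideas (whole match, numbered backreference, named backreference), with made-up input text:

```python
import re

text = "Doe, John"  # hypothetical input
pattern = r"(?P<last>\w+), (?P<first>\w+)"

# Numbered backreferences: \1 and \2 in the replacement string.
numbered = re.sub(pattern, r"\2 \1", text)
# Named backreferences: \g<name>.
named = re.sub(pattern, r"\g<first> \g<last>", text)
# Whole regex match: \g<0>.
whole = re.sub(pattern, r"[\g<0>]", text)
# In Python's replacement syntax an unescaped dollar is literal text.
dollar = re.sub(r"\d+", r"$\g<0>", "cost 5")

print(numbered)  # John Doe
print(named)     # John Doe
print(whole)     # [Doe, John]
print(dollar)    # cost $5
```

As the table suggests, replacement syntax differs between flavors more than pattern syntax does, so it is worth checking the documentation of the engine actually in use.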
• The repetition operators or quantifiers are greedy.
• They will expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex. As a result, a pattern can match more text than intended.
• The quick fix to this problem is to make the quantifier lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You can do this by putting a question mark after the quantifier, for example +? instead of +.
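The difference between greedy and lazy quantifiers can be seen in a short Python sketch. The HTML-like sample string is an assumption made for illustration.

```python
import re

html = "<em>first</em> and <em>second</em>"  # hypothetical sample

# Greedy: .+ expands as far as it can, then gives back just enough
# for the final > to match, so everything ends up in one match.
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<em>first</em> and <em>second</em>']

# Lazy: .+? stops at the first > it can reach.
lazy = re.findall(r"<.+?>", html)
print(lazy)    # ['<em>', '</em>', '<em>', '</em>']
```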
Negative lookahead
• It is indispensable if you want to match something not followed by something else.

Positive lookahead
• It lets you match something only if it is followed by something else.

Collectively, lookahead and lookbehind are called "lookaround".
They do not consume characters in the string, but only assert whether a match is possible or not.
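A minimal Python sketch of lookaround, using the common syntax `(?=...)`, `(?!...)`, `(?<=...)`, and `(?<!...)`. The sample strings are invented for illustration.

```python
import re

text = "q qu quit Iraq"  # hypothetical sample

# Negative lookahead: a 'q' that is not followed by 'u'.
q_without_u = [m.start() for m in re.finditer(r"q(?!u)", text)]
# Positive lookahead: a 'q' that is followed by 'u'.
q_with_u = [m.start() for m in re.finditer(r"q(?=u)", text)]
# The lookahead consumes nothing: each match is only the 'q' itself.
matched = re.findall(r"q(?=u)", text)

print(q_without_u)  # [0, 13]
print(q_with_u)     # [2, 5]
print(matched)      # ['q', 'q']

prices = "cost $25, qty 3"  # hypothetical sample

# Positive lookbehind: digits preceded by a dollar sign.
with_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative lookbehind: whole numbers not preceded by a dollar sign.
without_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(with_dollar)     # ['25']
print(without_dollar)  # ['3']
```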

Common Sense Techniques
• Avoid recompiling
• Use non-capturing parentheses
• Don't add superfluous parentheses
• Don't use superfluous character classes
• Use leading anchors

Expose Anchors
• Expose ^ and \G at the front of expressions
• Expose $ at the end of expressions

Lazy Versus Greedy: Be Specific
• The repetition operators or quantifiers are greedy.

Lead the Engine to a Match
• Put the most likely alternative first
• Distribute into the end of alternation
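A few of these techniques can be sketched in Python. This is only an illustration under assumed inputs: the log lines are made up, the name `LEVEL` is hypothetical, and the claim that ERROR is the most likely alternative is an assumption about the data. No timing figures are given, because actual gains depend on the engine and the input.

```python
import re

# Avoid recompiling: compile the pattern once and reuse the object.
# Leading anchor: ^ is exposed at the front of the expression.
# Non-capturing parentheses: (?:...) groups the alternation without
# storing it, so only the component name is captured.
# Most likely alternative first: ERROR is assumed to be more common.
LEVEL = re.compile(r"^(?:ERROR|WARN) (\w+)")

lines = ["ERROR disk", "INFO boot", "WARN fan"]  # hypothetical log lines

components = []
for line in lines:
    m = LEVEL.match(line)
    if m:
        components.append(m.group(1))

print(components)    # ['disk', 'fan']
print(LEVEL.groups)  # 1 (the non-capturing group is not counted)
```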
• Regex Testers
– http://www.gskinner.com/RegExr/
– http://osteele.com/tools/rework/

• Regex Patterns Library
– http://regexlib.com

• Complete Tutorials
– http://www.regular-expressions.info/
– Javascript:
• http://www.w3schools.com/jsref/jsref_obj_regexp.asp

– .NET:
• http://msdn.microsoft.com/en-us/library/hs600312.aspx

– Java:
• http://java.sun.com/docs/books/tutorial/essential/regex/index.html

The Nearshore Advantage!

More Related Content

What's hot

Regular expression automata
Regular expression automataRegular expression automata
Regular expression automata성욱 유
 
Java: Regular Expression
Java: Regular ExpressionJava: Regular Expression
Java: Regular ExpressionMasudul Haque
 
Javascript regular expression
Javascript regular expressionJavascript regular expression
Javascript regular expressionDhairya Joshi
 
Bioinformatica 06-10-2011-p2 introduction
Bioinformatica 06-10-2011-p2 introductionBioinformatica 06-10-2011-p2 introduction
Bioinformatica 06-10-2011-p2 introductionProf. Wim Van Criekinge
 
Php String And Regular Expressions
Php String  And Regular ExpressionsPhp String  And Regular Expressions
Php String And Regular Expressionsmussawir20
 
Effective Scala (SoftShake 2013)
Effective Scala (SoftShake 2013)Effective Scala (SoftShake 2013)
Effective Scala (SoftShake 2013)mircodotta
 
Regular expressions and php
Regular expressions and phpRegular expressions and php
Regular expressions and phpDavid Stockton
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Roy Zimmer
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupOleksii Holub
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and YouJames Armes
 
Perl programming language
Perl programming languagePerl programming language
Perl programming languageElie Obeid
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in pythonJohn(Qiang) Zhang
 
Python Programming - XI. String Manipulation and Regular Expressions
Python Programming - XI. String Manipulation and Regular ExpressionsPython Programming - XI. String Manipulation and Regular Expressions
Python Programming - XI. String Manipulation and Regular ExpressionsRanel Padon
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrLucidworks
 

What's hot (20)

Regular expression automata
Regular expression automataRegular expression automata
Regular expression automata
 
Java: Regular Expression
Java: Regular ExpressionJava: Regular Expression
Java: Regular Expression
 
Javascript regular expression
Javascript regular expressionJavascript regular expression
Javascript regular expression
 
Bioinformatica 06-10-2011-p2 introduction
Bioinformatica 06-10-2011-p2 introductionBioinformatica 06-10-2011-p2 introduction
Bioinformatica 06-10-2011-p2 introduction
 
PHP Regular Expressions
PHP Regular ExpressionsPHP Regular Expressions
PHP Regular Expressions
 
Php String And Regular Expressions
Php String  And Regular ExpressionsPhp String  And Regular Expressions
Php String And Regular Expressions
 
Effective Scala (SoftShake 2013)
Effective Scala (SoftShake 2013)Effective Scala (SoftShake 2013)
Effective Scala (SoftShake 2013)
 
Making Topicmaps SPARQL
Making Topicmaps SPARQLMaking Topicmaps SPARQL
Making Topicmaps SPARQL
 
Regular expressions and php
Regular expressions and phpRegular expressions and php
Regular expressions and php
 
Hashes
HashesHashes
Hashes
 
Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)Plunging Into Perl While Avoiding the Deep End (mostly)
Plunging Into Perl While Avoiding the Deep End (mostly)
 
Alexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape MeetupAlexey Golub - Writing parsers in c# | 3Shape Meetup
Alexey Golub - Writing parsers in c# | 3Shape Meetup
 
Spsl II unit
Spsl   II unitSpsl   II unit
Spsl II unit
 
Lexing and parsing
Lexing and parsingLexing and parsing
Lexing and parsing
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and You
 
Perl programming language
Perl programming languagePerl programming language
Perl programming language
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in python
 
Python Programming - XI. String Manipulation and Regular Expressions
Python Programming - XI. String Manipulation and Regular ExpressionsPython Programming - XI. String Manipulation and Regular Expressions
Python Programming - XI. String Manipulation and Regular Expressions
 
Perl Programming - 02 Regular Expression
Perl Programming - 02 Regular ExpressionPerl Programming - 02 Regular Expression
Perl Programming - 02 Regular Expression
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 

Similar to Regular expressions

Introduction to Boost regex
Introduction to Boost regexIntroduction to Boost regex
Introduction to Boost regexYongqiang Li
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionskeeyre
 
Don't Fear the Regex LSP15
Don't Fear the Regex LSP15Don't Fear the Regex LSP15
Don't Fear the Regex LSP15Sandy Smith
 
Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014Sandy Smith
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionsRaghu nath
 
CiNPA Security SIG - Regex Presentation
CiNPA Security SIG - Regex PresentationCiNPA Security SIG - Regex Presentation
CiNPA Security SIG - Regex PresentationCiNPA Security SIG
 
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/ibrettflorio
 
Don't Fear the Regex - Northeast PHP 2015
Don't Fear the Regex - Northeast PHP 2015Don't Fear the Regex - Northeast PHP 2015
Don't Fear the Regex - Northeast PHP 2015Sandy Smith
 
Don't Fear the Regex WordCamp DC 2017
Don't Fear the Regex WordCamp DC 2017Don't Fear the Regex WordCamp DC 2017
Don't Fear the Regex WordCamp DC 2017Sandy Smith
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrepTri Truong
 

Similar to Regular expressions (20)

Introduction to Boost regex
Introduction to Boost regexIntroduction to Boost regex
Introduction to Boost regex
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Bioinformatica p2-p3-introduction
Bioinformatica p2-p3-introductionBioinformatica p2-p3-introduction
Bioinformatica p2-p3-introduction
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Don't Fear the Regex LSP15
Don't Fear the Regex LSP15Don't Fear the Regex LSP15
Don't Fear the Regex LSP15
 
Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
PHP - Introduction to PHP
PHP -  Introduction to PHPPHP -  Introduction to PHP
PHP - Introduction to PHP
 
2013 - Andrei Zmievski: Clínica Regex
2013 - Andrei Zmievski: Clínica Regex2013 - Andrei Zmievski: Clínica Regex
2013 - Andrei Zmievski: Clínica Regex
 
CiNPA Security SIG - Regex Presentation
CiNPA Security SIG - Regex PresentationCiNPA Security SIG - Regex Presentation
CiNPA Security SIG - Regex Presentation
 
Quick start reg ex
Quick start reg exQuick start reg ex
Quick start reg ex
 
Andrei's Regex Clinic
Andrei's Regex ClinicAndrei's Regex Clinic
Andrei's Regex Clinic
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
/Regex makes me want to (weep|give up|(╯°□°)╯︵ ┻━┻)\.?/i
 
RegEx Parsing
RegEx ParsingRegEx Parsing
RegEx Parsing
 
Don't Fear the Regex - Northeast PHP 2015
Don't Fear the Regex - Northeast PHP 2015Don't Fear the Regex - Northeast PHP 2015
Don't Fear the Regex - Northeast PHP 2015
 
Don't Fear the Regex WordCamp DC 2017
Don't Fear the Regex WordCamp DC 2017Don't Fear the Regex WordCamp DC 2017
Don't Fear the Regex WordCamp DC 2017
 
Regular expression for everyone
Regular expression for everyoneRegular expression for everyone
Regular expression for everyone
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
 

Recently uploaded

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Regular expressions

  • 2. • • • Definition How a regex engine works Applications – Matching • • Regex Pattern Characters – – – • • Literal Characters Special characters Character classes/sets Anchors Repetition and optional items – Extracting • • • Grouping Alternation Groups – Replacing • Advanced topics – Greediness – Lookahead and lookbehind • Techniques for Faster Expressions The Nearshore Advantage!
  • 3. • Basically, a regular expression is a pattern describing a certain amount of text. • Their name comes from the mathematical theory on which they are based. • Regular expressions provide a powerful, flexible, and efficient method for processing text. • Sometimes it is also named regex or regexp. The Nearshore Advantage!
  • 4. The centerpiece of text processing with regular expressions is the regex engine. A regex "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly. The Nearshore Advantage! Regex Engine
  • 5. At a minimum, processing text using regular expressions requires that the Pattern Input engine be provided with the following two items of information: •The regular expression pattern to Replacement identify in the text. •The text to parse for the regular expression pattern. • (Optionally) A replacement string. The Nearshore Advantage! Regex Engine
  • 6. • Matching The extensive pattern-matching notation of regular expressions enables you to quickly parse large amounts of text to find specific character patterns; to validate text to Extracting ensure that it matches a predefined pattern; to extract, edit, replace, or delete text substrings; and to add the extracted strings to a collection in Replacing The Nearshore Advantage! order to generate a report.
  • 8. Operators Literal Characters Constructs Pattern A regular expression is a pattern that the regular expression engine attempts to match in input text. A pattern consists of one or more character literals, operators, or constructs. The Nearshore Advantage!
  • 9. • All characters except [^$.|?*+() refer to the simple meaning of characters. • All characters except the listed special characters match a single instance of themselves. • Note: The regex engines are case sensitive by default. The Nearshore Advantage!
  • 10. • Most of the times we will need special characters for complex matches. • These characters are often called metacharacters. • The dot or period is one of the most commonly used. Unfortunately, it is also the most commonly misused metacharacter. The Nearshore Advantage! • When a special char is needed as literal, it requires a backslash followed by any of metacharacters. • A backslash escapes special characters to suppress their special meaning.
  • 11. Escaped character Description Pattern Matches t Matches a tab, u0009. (w+)t "item1t", "item2t" in "item1titem2t" r Matches a carriage rn(w+) return, u000D. (r is not equivalent to the newline character, n.) "rnThese" in "rnThese arentwo lines." n Matches a new line, u000A. rn(w+) "rnThese" in "rnThese arentwo lines." s Matches an empty w+sw+ space The Nearshore Advantage! “Hello world” in “Hello world”
  • 12. A character class matches any one of a set of characters. The order of the characters inside a character class does not matter. Positive character group • A character in the input string must match one of a specified set of characters. Negative character group • A character in the input string must not match one of a specified set of characters. Any character • The dot or period character is a wildcard character that matches any character except n The Nearshore Advantage! Shorthand classes • Since certain character classes are used often, a series of shorthand character classes are available. • Shorthand character classes can be used both inside and outside the square brackets. • Some shorthand have negated versions.
  • 13. Anchor Description ^ The match must occur at the beginning of the string or line. $ The match must occur at the end of the string or line, or before n at the end of the string or line. b The match must occur on a word boundary. B The match must not occur on a word boundary. The Nearshore Advantage! • Anchors match a position before, after or between characters. • They can be used to "anchor" the regex match at a certain position.
  • 17. • A group, also known as a subexpression, consists of an "open-group operator", any number of other operators, and a "close-group operator". • Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.
  • 18. • Alternation matches one of a choice of regular expressions: – If you put the alternation operator between any two regular expressions A and B, the result matches the union of the strings that A and B match. • It operates on the largest possible surrounding regular expressions, so the only way to delimit its arguments is to use grouping.
  • 19. • By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together.
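A short sketch of what grouping buys you when extracting text, assuming Python's `re` module and a hypothetical date string:

```python
import re

# Each pair of parentheses groups part of the pattern and captures what it matched.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2013-11-05.")
year, month, day = match.groups()
whole = match.group(0)  # group 0 is the whole match

# Grouping also lets a quantifier apply to the group as a unit.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None

print(whole, year, month, day)  # 2013-11-05 2013 11 05
print(repeated)                 # True
```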
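  • 21. Replacement text feature, with support in .NET, Java, Perl, ECMA and Ruby
    – $& (whole regex match): .NET YES, Java error, Perl YES, ECMA YES, Ruby no
    – $0 (whole regex match): .NET YES, Java YES, Perl no, ECMA no, Ruby no
    – $1 through $99 (backreference): .NET YES, Java YES, Perl YES, ECMA YES, Ruby no
    – ${1} through ${99} (backreference): .NET YES, Java error, Perl YES, ECMA no, Ruby no
    – ${group} (named backreference): .NET YES, Java error, Perl no, ECMA no, Ruby no
    – $` (backtick; subject text to the left of the match): .NET YES, Java error, Perl YES, ECMA YES, Ruby no
    – $' (straight quote; subject text to the right of the match): .NET YES, Java error, Perl YES, ECMA YES, Ruby no
    – $_ (entire subject string): .NET YES, Java error, Perl YES, ECMA IE only, Ruby no
    – $+ (highest-numbered group in the regex): .NET YES, Java error, Perl no, ECMA IE and Firefox, Ruby no
    – $$ (escape dollar with another dollar): .NET YES, Java error, Perl no, ECMA YES, Ruby no
    – $ (unescaped dollar as literal text): .NET YES, Java error, Perl no, ECMA YES, Ruby YES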
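Python is not one of the flavors in the table, and its replacement syntax uses backslashes rather than dollar signs. As a rough illustration of the same ideas (numbered, named and whole-match references), here is a sketch with Python's `re.sub`; the sample name is hypothetical:

```python
import re

text = "Doe, Jane"  # hypothetical sample text

# \1 and \2 are numbered backreferences in the replacement string.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", text)

# \g<name> refers to a named group.
named = re.sub(r"(?P<last>\w+), (?P<first>\w+)", r"\g<first> \g<last>", text)

# \g<0> stands for the whole regex match.
wrapped = re.sub(r"\w+", r"[\g<0>]", text)

print(swapped)  # Jane Doe
print(named)    # Jane Doe
print(wrapped)  # [Doe], [Jane]
```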
  • 23. • The repetition operators or quantifiers are greedy. • They expand the match as far as they can, and only give back characters if they must to satisfy the remainder of the regex. • When a greedy quantifier matches too much, the quick fix is to make it lazy instead. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You make a quantifier lazy by putting a question mark after it, for example +? instead of +.
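The difference between greedy and lazy quantifiers is easiest to see on markup-like text. A minimal sketch in Python's `re` module, with a hypothetical input string:

```python
import re

html = "<em>first</em> and <em>second</em>"  # hypothetical sample text

# Greedy: .+ runs to the end of the string, then backs off to the last '>'.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first '>' that lets the rest of the regex match.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<em>first</em> and <em>second</em>']
print(lazy)    # ['<em>', '</em>', '<em>', '</em>']
```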
  • 24. Negative lookahead • It is indispensable if you want to match something not followed by something else. Positive lookahead • It matches something only if it is followed by something else, without making that something else part of the match. Collectively, lookahead and lookbehind are called "lookaround". They do not consume characters in the string, but only assert whether a match is possible or not.
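A small sketch of both kinds of lookahead, assuming Python's `re` module and a made-up word list:

```python
import re

text = "quit quick Iraq qatar"  # hypothetical sample text

# Negative lookahead: a 'q' that is NOT followed by 'u'.
not_followed = [m.start() for m in re.finditer(r"q(?!u)", text)]

# Positive lookahead: a 'q' that IS followed by 'u'.
# The 'u' is only asserted, so it is not part of the match.
followed = re.findall(r"q(?=u)", text)

print(not_followed)  # [14, 16]
print(followed)      # ['q', 'q']
```

In both cases the match is the single character q, which shows that lookaround does not consume the characters it inspects.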
  • 25. Common Sense Techniques • Avoid recompiling • Use non-capturing parentheses • Don't add superfluous parentheses • Don't use superfluous character classes • Use leading anchors Expose Anchors • Expose ^ and \G at the front of expressions • Expose $ at the end of expressions Lazy Versus Greedy: Be Specific • The repetition operators or quantifiers are greedy. Lead the Engine to a Match • Put the most likely alternative first • Distribute into the end of alternation
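A few of these techniques can be sketched together. The example below assumes Python's `re` module and hypothetical log lines; the ordering of the alternatives assumes ERROR is the most common level, which is an illustration rather than a measured fact:

```python
import re

# Avoid recompiling: the pattern is compiled once, outside the loop.
# (?:...) is a non-capturing group, ^ is a leading anchor, and the
# alternative assumed to be most likely (ERROR) comes first.
LEVEL_LINE = re.compile(r"^(?:ERROR|WARNING|INFO): (.*)$")

lines = ["INFO: started", "debug noise", "ERROR: disk full"]  # hypothetical input

messages = []
for line in lines:
    found = LEVEL_LINE.match(line)
    if found:
        # Group 1 is the only capturing group, so it holds the message text.
        messages.append(found.group(1))

print(messages)  # ['started', 'disk full']
```

Python's `re` module caches recently used patterns, so the gain from compiling up front is usually modest there; in other engines it can matter more.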
  • 27. • Regex Testers – http://www.gskinner.com/RegExr/ – http://osteele.com/tools/rework/ • Regex Patterns Library – http://regexlib.com • Complete Tutorials – http://www.regular-expressions.info/ – Javascript: • http://www.w3schools.com/jsref/jsref_obj_regexp.asp – .NET: • http://msdn.microsoft.com/en-us/library/hs600312.aspx – Java: • http://java.sun.com/docs/books/tutorial/essential/regex/index.html

Editor's Notes

  1. Anchors do not match any character at all. Instead, they match a position before, after or between characters. They can be used to "anchor" the regex match at a certain position. The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b will not match abc at all, because the b cannot be matched right after the start of the string, matched by ^. See below for the inside view of the regex engine. Similarly, $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all. There are three different positions that qualify as word boundaries: before the first character in the string, if the first character is a word character; after the last character in the string, if the last character is a word character; and between two characters in the string, where one is a word character and the other is not a word character.
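The anchor examples in this note can be checked directly. A minimal sketch with Python's `re` module; the word-boundary and multi-line inputs are hypothetical additions:

```python
import re

# ^ anchors at the start of the string, $ at the end.
start_a = re.search(r"^a", "abc")  # matches 'a'
start_b = re.search(r"^b", "abc")  # None: 'b' is not at the start
end_c = re.search(r"c$", "abc")    # matches 'c'
end_a = re.search(r"a$", "abc")    # None: 'a' is not at the end

# \b matches only at a word boundary, so 'cat' inside a longer word is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat's category")

# With MULTILINE, ^ also matches at the beginning of each line.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

print(start_a.group(), start_b, end_c.group(), end_a)  # a None c None
print(whole_words)  # ['cat', 'cat']
print(line_starts)  # ['one', 'two']
```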
  2. Limiting Repetition { }: Modern regex flavors (regex engines), like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a non-negative integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
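The three forms of the {min,max} operator can be sketched as follows, using Python's `re` module and made-up inputs:

```python
import re

# {min}: exactly min repetitions.
exact = re.fullmatch(r"\d{4}", "2013") is not None

# {min,max}: between min and max repetitions.
ranged = re.findall(r"\b\d{2,3}\b", "7 42 365 2013")

# {min,}: at least min repetitions, with no upper limit.
open_ended = re.findall(r"a{2,}", "a aa aaaa")

# {0,} behaves like *, and {1,} behaves like +.
sample = "a ab abb"
same_as_star = re.findall(r"ab{0,}", sample) == re.findall(r"ab*", sample)
same_as_plus = re.findall(r"ab{1,}", sample) == re.findall(r"ab+", sample)

print(exact)       # True
print(ranged)      # ['42', '365']
print(open_ended)  # ['aa', 'aaaa']
print(same_as_star, same_as_plus)  # True True
```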
  3. If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish. The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping. If we want to improve the first example to match whole words only, we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either "cat" or "dog", and then another word boundary.
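The precedence point in this note can be seen by comparing the grouped and ungrouped patterns. A minimal sketch in Python's `re` module, with a hypothetical sample string:

```python
import re

text = "cat dog catalog hotdog"  # hypothetical sample text

# Plain alternation also matches inside longer words.
loose = re.findall(r"cat|dog", text)

# Grouping limits the reach of |, so both word boundaries apply to either option.
whole_words = re.findall(r"\b(cat|dog)\b", text)

# Without the group, the pattern reads as (\bcat) or (dog\b),
# so 'cat' in 'catalog' and 'dog' in 'hotdog' still match.
ungrouped = re.findall(r"\bcat|dog\b", text)

print(loose)        # ['cat', 'dog', 'cat', 'dog']
print(whole_words)  # ['cat', 'dog']
print(ungrouped)    # ['cat', 'dog', 'cat', 'dog']
```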