Unicode Regular Expressions

  s/�/�/g
       Nick Patch
    23 January 2013
Unicode Refresher

Unicode attempts to support the characters of the world: a massive task!

It's hard to attach a single meaning to the word “character”, but most folks think of characters as the smallest stand-alone components of a writing system.

In Unicode, this sense of characters is represented by one or more code points, which are each stored in one or more bytes.
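
The three levels can be counted directly. This is a minimal JavaScript sketch, not part of the original slides; it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (for example a recent Node.js), and the variable names are illustrative.

```javascript
// One user-perceived character: alpha plus three combining marks.
const text = '\u03B1\u0313\u0300\u0345';

// UTF-8 bytes: each of these four code points takes two bytes.
const byteCount = new TextEncoder().encode(text).length;

// Code points: the string iterator yields one item per code point.
const codePointCount = [...text].length;

// Grapheme clusters: Intl.Segmenter applies the Unicode segmentation rules.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const graphemeCount = [...segmenter.segment(text)].length;

console.log(byteCount, codePointCount, graphemeCount); // 8 4 1
```

One character to a reader, four code points to most regex engines, eight bytes on disk.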

However, programmers and programming languages tend to think of characters as individual code points, or worse, individual bytes.

We need to modernize our habits!

Unicode is not just a big set of characters. It also defines standard properties for each character and standard algorithms for operations such as collation, normalization, and segmentation.
Normalization

NFD(ᾀ◌̀) = α◌̓◌̀◌ͅ
NFC(ᾀ◌̀) = ᾂ
Normalization

NFD(Чю◌́рлёнис) = Чю◌́рле◌̈нис
NFC(Чю◌́рлёнис) = Чю◌́рлёнис
Normalization

  ᾂ ≡ ἂ◌ͅ ≡ ᾀ◌̀ ≡ ᾳ◌̓◌̀ ≡
 α◌̓◌̀◌ͅ ≡ α◌̓◌ͅ◌̀ ≡ α◌ͅ◌̓◌̀
             ≠
ᾲ◌̓ ≡ ὰ◌̓◌ͅ ≡ ὰ◌ͅ◌̓ ≡ ᾳ◌̀◌̓ ≡
 α◌̀◌̓◌ͅ ≡ α◌̀◌ͅ◌̓ ≡ α◌ͅ◌̀◌̓
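
The two equivalence classes above can be checked with a short sketch. It is not from the slides: it uses the built-in `String.prototype.normalize`, which current JavaScript engines provide, and the array and variable names are illustrative.

```javascript
// Sequences from the first equivalence class on the slide.
const first = [
  '\u1F82',                   // fully precomposed
  '\u1F02\u0345',             // alpha with psili and varia + ypogegrammeni
  '\u1F80\u0300',             // alpha with psili and ypogegrammeni + varia
  '\u03B1\u0313\u0300\u0345', // fully decomposed, canonical order
  '\u03B1\u0345\u0313\u0300', // fully decomposed, marks in another order
];

// A sequence from the second class: varia comes before psili.
const other = '\u03B1\u0300\u0313\u0345';

// Every member of the first class composes to the same single code point.
const allEquivalent = first.every((s) => s.normalize('NFC') === '\u1F82');

// The second class does not, because psili and varia share a combining
// class and therefore are never reordered relative to each other.
const crossEquivalent = other.normalize('NFC') === '\u1F82';

console.log(allEquivalent);   // true
console.log(crossEquivalent); // false
```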
Perl Normalization

use v5.12;         # enables say
use Unicode::Normalize;

say $str;          # ᾀ◌̀
say NFD($str);     # α◌̓◌̀◌ͅ
say NFC($str);     # ᾂ
JavaScript Normalization

var unorm = require('unorm');

console.log($str);              // ᾀ◌̀
console.log(unorm.nfd($str));   // α◌̓◌̀◌ͅ
console.log(unorm.nfc($str));   // ᾂ
PHP Normalization

echo $str;            # ᾀ◌̀

echo Normalizer::normalize($str, Normalizer::FORM_D); # α◌̓◌̀◌ͅ

echo Normalizer::normalize($str, Normalizer::FORM_C); # ᾂ
Grapheme Clusters

regex:         /^.$/

string 1:      ᾂ          (1 code point: matches)
string 2:      α◌̓◌̀◌ͅ       (4 code points: fails at step 3)

1. anchor beginning of string
2. match code point (excl. \n)
3. anchor at end of string
4. 1 success but 1 failure: mixed results
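
The mixed result is easy to reproduce. A minimal JavaScript sketch, not from the slides; it assumes the `u` flag so that `.` consumes a whole code point, and the variable names are illustrative.

```javascript
const dotPattern = /^.$/u; // with the u flag, . matches one whole code point

const precomposed = '\u1F82';                  // string 1: one code point
const decomposed = '\u03B1\u0313\u0300\u0345'; // string 2: four code points

const dotResults = [dotPattern.test(precomposed), dotPattern.test(decomposed)];
console.log(dotResults); // [ true, false ]
```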
Grapheme Clusters

regex:         /^\X$/

string 1:      ᾂ          (1 grapheme cluster: matches)
string 2:      α◌̓◌̀◌ͅ       (1 grapheme cluster: matches)

1. anchor beginning of string
2. match grapheme cluster
3. anchor at end of string
4. success!
Perl

use v5.12; # better yet: v5.14
use utf8;
use charnames qw( :full ); # unless v5.16
use open qw( :encoding(UTF-8) :std );

$str =~ /^\X$/;

$str =~ s/^(\X)$/->$1<-/;
PHP

preg_match('/^\X$/u', $str);

preg_replace('/^(\X)$/u', '->$1<-', $str);
JavaScript
[This slide intentionally left blank.]
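
JavaScript regular expressions have no `\X`. One workaround, sketched here and not from the slides, is to segment the string with `Intl.Segmenter` and count the clusters. It assumes a runtime that ships `Intl.Segmenter`, and `isSingleGrapheme` is a hypothetical helper name.

```javascript
// Hypothetical helper: true when the string is exactly one grapheme cluster,
// which is roughly what /^\X$/ checks in Perl and PHP.
function isSingleGrapheme(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const clusters = [...segmenter.segment(str)];
  return clusters.length === 1;
}

console.log(isSingleGrapheme('\u1F82'));                   // true
console.log(isSingleGrapheme('\u03B1\u0313\u0300\u0345')); // true
console.log(isSingleGrapheme('\u03B1\u03B2'));             // false
```

This is a whole-string check rather than a regex building block, so it does not compose with other pattern pieces the way `\X` does.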
Match Any Character

two bytes (if byte mode):      е..и
code point (excl. \n):         е.и
code point (incl. \n):         е\p{Any}и
grapheme cluster (incl. \n):   е\Xи
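
The difference between `.` and `\p{Any}` shows up as soon as a newline is involved. A small JavaScript sketch, not from the slides; it assumes an engine with Unicode property escapes (ES2018 or later), and the letters are written as escapes to avoid look-alike characters.

```javascript
// Cyrillic е (U+0435), a line feed, Cyrillic и (U+0438).
const withNewline = '\u0435\n\u0438';

// . does not match line terminators unless the s flag is set.
const dotMatches = /\u0435.\u0438/u.test(withNewline);

// \p{Any} matches every code point, including the line feed.
const anyMatches = /\u0435\p{Any}\u0438/u.test(withNewline);

console.log(dotMatches, anyMatches); // false true
```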
Match Any Letter

letter code point:       е\p{General_Category=Letter}и
letter code point:       е\pLи
Cyrillic code point:     е\p{Script=Cyrillic}и
Cyrillic code point:     е\p{Cyrillic}и

letter grapheme cluster: е(?=\pL)\Xи

regex:         / о \p{Cyrillic} т /x

string 1:      който      (й as 1 code point: matches)
string 2:      кои◌̆то     (и plus combining breve, 2 code points: fails)

1. match letter о
2. match Cyrillic letter (1 code point)
3. match letter т
4. 1 success but 1 failure: mixed results

regex:         / о (?= \p{Cyrillic} ) \X т /x

string 1:      който      (matches)
string 2:      кои◌̆то     (matches)

1. match letter о
2. positive lookahead Cyrillic letter (1 code point)
3. match grapheme cluster (1+ code points)
4. match letter т
5. success!
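
Both Cyrillic patterns can be tried in JavaScript with one substitution. This sketch is not from the slides: since JavaScript has no `\X`, it stands in `\P{M}\p{M}*` (a non-mark code point followed by any combining marks), which is only an approximation of a grapheme cluster. Variable names are illustrative.

```javascript
const composed = '\u043A\u043E\u0439\u0442\u043E';         // който, й as one code point
const decomposed = '\u043A\u043E\u0438\u0306\u0442\u043E'; // и + combining breve

// Exactly one Cyrillic code point between о and т.
const codePointPattern = /\u043E\p{Script=Cyrillic}\u0442/u;

// Lookahead for a Cyrillic code point, then consume base plus marks.
const clusterPattern = /\u043E(?=\p{Script=Cyrillic})\P{M}\p{M}*\u0442/u;

console.log(codePointPattern.test(composed));   // true
console.log(codePointPattern.test(decomposed)); // false
console.log(clusterPattern.test(composed));     // true
console.log(clusterPattern.test(decomposed));   // true
```

The lookahead checks the property on the base code point only; the combining breve has the script value Inherited, which is why the code point pattern stops at it.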
Character Literals

[يی]

(?:ي|ی)

[\x{064A}\x{06CC}]

[\N{ARABIC LETTER YEH}\N{ARABIC LETTER FARSI YEH}]
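
For comparison, the same character class in JavaScript, as a sketch that is not from the slides. JavaScript writes code points as `\u{...}` under the `u` flag and has no `\N{...}` named escape, so the names appear only in comments.

```javascript
// ARABIC LETTER YEH (U+064A) and ARABIC LETTER FARSI YEH (U+06CC).
const yehPattern = /^[\u{064A}\u{06CC}]$/u;

console.log(yehPattern.test('\u064A')); // true
console.log(yehPattern.test('\u06CC')); // true
console.log(yehPattern.test('\u0627')); // false (ARABIC LETTER ALEF)
```

Escapes keep the two look-alike letters distinguishable in source code, where the literal glyphs are easy to confuse.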
Properties

         \p{Script=Latin}

           Name: Script
           Value: Latin


   Match any code point with the
value “Latin” for the Script property.
Properties

         \P{Script=Latin}

           Name: Script
          Value: not Latin

           Negated form:
 Match any code point without the
value “Latin” for the Script property.
Properties

           \p{Latin}

     Name: Script (implicit)
        Value: Latin


The Script and General Category
properties don't require the name
because they're so common and
    their values don't conflict.
Properties

     \p{General_Category=Letter}

        Name: General Category
            Value: Letter


   Match any code point with the value
“Letter” for the General Category property.
Properties

          \p{gc=Letter}

   Name: General Category (gc)
          Value: Letter


Property names may be abbreviated.
Properties

            \p{gc=L}

 Name: General Category (gc)
      Value: Letter (L)


The General Category property is
so commonly used that its values
 all have standard abbreviations.
Properties

                   \p{L}

    Name: General Category (implicit)
           Value: Letter (L)


And the General Category values may even
be used on their own, like the Script values.
 These two properties have distinct values.
Properties

               \pL

Name: General Category (implicit)
       Value: Letter (L)


Single-character General Category
 values don't require curly braces.
Properties

               \PL

Name: General Category (implicit)
      Value: not Letter (L)


      Don't forget negation!
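
Most of these property forms carry over to JavaScript under the `u` flag. The sketch below is not from the slides and covers only the forms JavaScript accepts: script values need an explicit `Script=` (or `sc=`) name, and the braces are always required, so `\p{Latin}` and `\pL` are left out. Variable names are illustrative.

```javascript
const isLatin = /^\p{Script=Latin}$/u;
const notLatin = /^\P{Script=Latin}$/u;

// Four spellings of "any letter", from most to least explicit.
const letterForms = [
  /^\p{General_Category=Letter}$/u,
  /^\p{gc=Letter}$/u,
  /^\p{gc=L}$/u,
  /^\p{L}$/u,
];
const notLetter = /^\P{L}$/u;

const cyrillicYa = '\u044F'; // я, a Cyrillic letter

console.log(isLatin.test('a'));                              // true
console.log(isLatin.test(cyrillicYa));                       // false
console.log(notLatin.test(cyrillicYa));                      // true
console.log(letterForms.every((re) => re.test(cyrillicYa))); // true
console.log(notLetter.test('1'));                            // true
```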
s/�/�/g
