SlideShare uma empresa Scribd logo
1 de 18
Baixar para ler offline
The Word-based Regular Expressions
(originally CHeuSoV :-) )
Aleksey Cheusov
vle@gmx.net
Minsk LUG meeting, 29 Dec. 2012, Belarus
Plan for my presentation
Mathematics: RL definition and theorem
Real World: applications of RL and some notes
Linguistics: POS tagset, Semantic tagset, Text tagging,
Tagged text corpus
Wanted: (mumbling...)
Mathematics: ExpRL definition, WRL definition
Linguistics + Real World: WRE syntax and examples
Вброс, драка, разбор завалов, развоз трупов
Regular language
Definition: given a finite non-empty set of elements Σ (alphabet):
∅ is a regular language.
For any element v ∈ Σ, {v} is a regular language.
If A and B are regular languages, so is A ∪ B.
If A and B are regular languages, so is {ab|a ∈ A, b ∈ B},
where ab means string concatenation.
If A is a regular language, so is A∗ where
A∗ = ∅ ∪ {a1a2. . .an|ai ∈ A, n > 0}.
No other languages over Σ are regular.
Example:
Let Σ = {a, b, c}. Then since aab and cc are members of Σ∗,
{aab} and {cc} are regular languages. So is the union of these
two sets {aab, cc}, and so is the concatenation of the two
{aabcc}. Likewise, {aab}∗, {cc}∗ and {aab, cc, aabcc}∗ are
regular languages.
Regular language (interesting fact)
Provable fact:
Regular languages are closed under intersection and negation
operations.
Finite state automaton
Definition: 5-tuple (Σ, S, S0, F, δ) is a finite state automaton
Σ is a finite non-empty set of elements (alphabet).
S is a finite non-empty set of states.
S0 ⊆ S is a set of start states.
F ⊆ S is a set of finite states.
δ : S × Σ → 2S is a state transition function.
Theorem: ∀ regular language R, ∃ FSA f , such that L(f ) = R; ∀
FSA f , L(f ) is a regular language (L — language of FSA, that is a
set of accepted inputs).
Real world. Regular expressions (hello UNIX!).
Regular language is a foundation for so called regular expressions,
in most cases alphabet Σ is a set of characters (ASCII, Unicode
etc.):
POSIX basic regular expressions (grep, sed, vi etc.) and
extended regular expressions (grep -E, awk etc.)
Google re2, Yandex PIRE
Perl, pcre, Ruby, Python (superset of regular language,
and. . . extremely inefficient ;-) )
lex
...
Widely used extensions:
submatch (depending on implementation may still be FSM but
not FSA)
backreferences (incompatible with regular language and FSM
at all)
Tagsets
Penn part-of-speech tag set:
NN noun, singular or mass (apple, computer, fruit etc.)
NNS noun plural (apples, computers, fruits etc.)
CC coordinating conjunction (and, or)
VB verb, base form (give, book, destroy etc.)
VBD verb, past tense form (gave, booked, destroyed etc.)
VBN verb, past participle form (given, booked, destroyed etc.)
VBG verb, gerund/present participle (giving, booking,
destroying etc.)
etc.
Semantic tag set:
LinkVerb a verb that connects the subject to the
complement(seem, feel, look etc.)
AnimateNoun (brother, son etc.)
etc.
Tagged sentence, POS tagging, semantic tagging
Examples:
The book is red →
The_DT book_NN is_VBZ red_JJ →
The_DT book_NN/Object is_VBZ red_JJ/Color
Note: Words book and red are ambiguous.
My son goes to school →
My_PRP$ son_NN goes_VBZ to_IN school_NN →
My_PRP$ son_NN/Person goes_VBZ to_TO
school_NN/Establishment
Questions: How to match tagged sentence or find portions of it?
Can regular expressions help? Do we need
regcomp(3)/regexec(3)-like functionality?
Expanded regular language
Definition: given a non-empty set of elements Σ (alphabet) and a
set of one-place predicates P = {P1, P2, . . . Pk},
Pi : Σ → {true, false}:
∅ is an expanded regular language.
For any element v ∈ Σ, {v} is an expanded regular language.
For any i {v|Pi (v) = true} is an expanded regular language.
Σ is an expanded regular language.
If A and B are expanded regular languages, so is A ∪ B.
If A and B are expanded regular languages, so is A  B.
If A and B are expanded regular languages, so is
{ab|a ∈ A, b ∈ B}, where ab means string concatenation.
If A is a regular language, so is A∗ where
A∗ = ∅ ∪ {a1a2. . .an|ai ∈ A, n > 0}.
No other languages over Σ are expanded regular.
Expanded regular language (interesting facts)
Provable facts:
Expanded regular languages are closed under intersection and
negation operations.
∀ expanded regular language R we can build a regular
language R∗ over alphabet Σ∗, such that ∃f : Σ∗ → 2Σ and
L(R) = L(R∗) (expanding all elements in L(R∗) with a help of
f ). Thus, we can build expanded regular expression engine
based on traditional FSA/FSM-based algorithms.
The Word-based regular language
Definition: given
W — set of words (character sequences), e.g.
”the”, ”apple””123”, ”; ”, ”C2H5OH”, . . .
TPOS — finite non-empty set of part-of-speech tags, e.g.
{DT, NN, NNS, VBP, VBZ, . . . }
TSem — finite non-empty set of semantic tags, e.g.
{LinkVerb, Person, Object, TransitiveVerb, . . . }
DPOS : W → 2TPOS — POS dictionary, e.g.
DPOS (”the”) = {DT}, DPOS(”book”) = {NN, VB, VBP},
DPOS (”and”) = {CC}
DSem : W → 2TSem — semantic dictionary, e.g.
DSem(”son”) = {Person},
DSem(”mouse”) = {Animal, ComputerDevice}
EREs — finite set of POSIX extended regular expressions, e.g.
{”. ∗ ing”, ”.multi. ∗ al”, ”[A − Z][a − z]”, ”[0 − 9] + ” . . . } etc.
(to be continued)
The Word-based regular language
(continuation) the word-based regular language
(TPOS , TSem, DPOS, DSem, EREs) is an expanded regular
language over alphabet W × TPOS × 2TSem and one-place
predicates P = {Pcheck
tPOS
, Pcheck
tSem
, Ptagging
tPOS
, Ptagging
tSem
, Pword
re }, where
Pcheck
tPOS
(w, ·, ·) = true if tPOS ∈ DPOS(w), and false otherwise
Pcheck
tSem
(w, ·, ·) = true if tSem ∈ DSem(w), and false otherwise
Ptagging
tPOS
(·, tagPOS , ·) = true if tPOS = tagPOS, and false
otherwise
Ptagging
tSem
(·, ·, tagsSem) = true if tSem ∈ tagsSem, and false
otherwise
Pword
re (w, ·, ·) = true if POSIX extended regular expression
re ∈ EREs matches w, and false otherwise
The Word-based regular expressions (Finally!).
Syntax:
"word" — word itself (Pword
re ), e.g. "the", "2012-12-29" etc.
’regexp’ — words matched by specified regexp (Pword
re )
Tag — words tagged as tagPOS (Ptagging
tPOS
), e.g. NN, DT, VB
etc.
%Tag — words tagged as tagSem (Ptagging
tSem
), e.g. %Person,
%Object, %LinkLerb etc.
_Tag — words having as tagPOS in POS dictionary (Pcheck
tPOS
),
e.g. _NN, _DT, _VB etc.
@Tag — words having as tagSem in semantic dictionary
(Pcheck
tSem
), e.g. @LinkVerb, @Object etc.
. (dot) — any word with any POS and semantic tags
ˆ — beginning of the sentence
$ — end of the sentence
(to be continued)
The Word-based regular expressions (Finally!).
Syntax (continuation):
( R ) — grouping like in mathematical expressions
<num R > — submatch and extraction
R ? and [ R ] — optional WRE
R * and R + — possibly empty and non-empty repetitions
R {n,m}, R {n,} and R {,m} — repetitions
R – S — subtraction
R & S and R / S — intersection, / is for single word WREs,
& is for complex WREs
R | S — union
R S — concatenation
!R — negation (L(!R) is equal to either Σ  L(R) or Σ ∗ L(R)
depending on a context of use)
(to be continued)
The Word-based regular expressions (Finally!).
Syntax (priorities from highest to lowest, continuation):
, / and | in single word non-spaced WREs
Prepositional unary operation !
Postpositional unary operations {n,m}, ’?’, ’+’ and ’*’
( R ) and <num R >
R & S
R – S
R S
R | S
(to be continued)
WRE examples
How to select noun phrases (leftmost-longest match, only POS
tags is our rules)
( DT | CD+ )? RB * CC|JJ|JJR|JJS * (NN|NNS + | NP + )
Ex.: This absolutely stupid decision
Ex.: The best fuel cell
Ex.: Black and white colors
Ex.: Vasiliy Pupkin
NER (Named Entity Recognition) for person names
DSem("MrDr") = {"Mrs.", "Mr.", "Dr.", "Doctor",
"President", "Dear", . . . }
@MrDr <1 ’[A-Z][a-z]+’/!@MrDr␣+>
Ex.: Doctor <1 Zhivago>
Ex.: Dr. <1 Vasiliy Pupkin>
Ex.: Mrs. <1 Kate>
but not
Ex.: Mr. <1 President>
WRE examples
Question focus (object attributes)
ˆ "What" "is" "the" <1 @Attribute > "of" <2 .* > "?"
Ex.: What is the <1 color > of <2 your book> ?
Tiny definitions
m4_define(NounPhrase, “(see previous slide)”) ˆ (NounPhrase
& (. . . . . . .)) "is" (NounPhrase & (. . . . . . .))
Ex.: What is the <1 color > of <2 your book> ?
Sentiment analysis
DSem("BadCharact") = {"sucks", "stupid", "crappy",
"shitty", . . . }
DSem("GoodCharact") = {"rocks", "awesome", "excellent",
. . . }
DSem("OurProduct") = {"Linux", "NetBSD", "Ubuntu",
"AltLinux", "iPad", "Android", . . . }
<1 @BadCharact|@GoodCharact> <2 @OurProduct> |
<2 @OurProduct> <1 @BadCharact|@GoodCharact>
The Word-based Regular Expressions
is really cool DSL!

Mais conteúdo relacionado

Mais procurados

Deterministic context free grammars &non-deterministic
Deterministic context free grammars &non-deterministicDeterministic context free grammars &non-deterministic
Deterministic context free grammars &non-deterministic
Leyo Stephen
 
Class7
 Class7 Class7
Class7
issbp
 
R.4 solving literal equations
R.4 solving literal equationsR.4 solving literal equations
R.4 solving literal equations
hisema01
 

Mais procurados (20)

Lecture 3,4
Lecture 3,4Lecture 3,4
Lecture 3,4
 
Deterministic context free grammars &non-deterministic
Deterministic context free grammars &non-deterministicDeterministic context free grammars &non-deterministic
Deterministic context free grammars &non-deterministic
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Formal Languages (in progress)
Formal Languages (in progress)Formal Languages (in progress)
Formal Languages (in progress)
 
Lecture 7: Definite Clause Grammars
Lecture 7: Definite Clause GrammarsLecture 7: Definite Clause Grammars
Lecture 7: Definite Clause Grammars
 
Lecture: Context-Free Grammars
Lecture: Context-Free GrammarsLecture: Context-Free Grammars
Lecture: Context-Free Grammars
 
3.1 intro toautomatatheory h1
3.1 intro toautomatatheory  h13.1 intro toautomatatheory  h1
3.1 intro toautomatatheory h1
 
context free language
context free languagecontext free language
context free language
 
Introduction of predicate logics
Introduction of predicate  logicsIntroduction of predicate  logics
Introduction of predicate logics
 
Introduction of predicate logics
Introduction of predicate  logicsIntroduction of predicate  logics
Introduction of predicate logics
 
Hak ontoforum
Hak ontoforumHak ontoforum
Hak ontoforum
 
Types of Language in Theory of Computation
Types of Language in Theory of ComputationTypes of Language in Theory of Computation
Types of Language in Theory of Computation
 
Word level language identification in code-switched texts
Word level language identification in code-switched textsWord level language identification in code-switched texts
Word level language identification in code-switched texts
 
Regular expressions and languages pdf
Regular expressions and languages pdfRegular expressions and languages pdf
Regular expressions and languages pdf
 
First order logic
First order logicFirst order logic
First order logic
 
Top school in ghaziabad
Top school in ghaziabadTop school in ghaziabad
Top school in ghaziabad
 
Context Free Grammar
Context Free GrammarContext Free Grammar
Context Free Grammar
 
Class7
 Class7 Class7
Class7
 
R.4 solving literal equations
R.4 solving literal equationsR.4 solving literal equations
R.4 solving literal equations
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 

Semelhante a Алексей Чеусов - Расчёсываем своё ЧСВ

Speech To Sign Language Interpreter System
Speech To Sign Language Interpreter SystemSpeech To Sign Language Interpreter System
Speech To Sign Language Interpreter System
kkkseld
 

Semelhante a Алексей Чеусов - Расчёсываем своё ЧСВ (20)

INFO-2950-Languages-and-Grammars.ppt
INFO-2950-Languages-and-Grammars.pptINFO-2950-Languages-and-Grammars.ppt
INFO-2950-Languages-and-Grammars.ppt
 
Ch3 4 regular expression and grammar
Ch3 4 regular expression and grammarCh3 4 regular expression and grammar
Ch3 4 regular expression and grammar
 
Lecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdfLecture Notes-Are Natural Languages Regular.pdf
Lecture Notes-Are Natural Languages Regular.pdf
 
L_2_apl.pptx
L_2_apl.pptxL_2_apl.pptx
L_2_apl.pptx
 
Formal methods 3 - languages and machines
Formal methods   3 - languages and machinesFormal methods   3 - languages and machines
Formal methods 3 - languages and machines
 
Statistical machine translation
Statistical machine translationStatistical machine translation
Statistical machine translation
 
ToC_M1L3_Grammar and Derivation.pdf
ToC_M1L3_Grammar and Derivation.pdfToC_M1L3_Grammar and Derivation.pdf
ToC_M1L3_Grammar and Derivation.pdf
 
Chapter2CDpdf__2021_11_26_09_19_08.pdf
Chapter2CDpdf__2021_11_26_09_19_08.pdfChapter2CDpdf__2021_11_26_09_19_08.pdf
Chapter2CDpdf__2021_11_26_09_19_08.pdf
 
Speech To Sign Language Interpreter System
Speech To Sign Language Interpreter SystemSpeech To Sign Language Interpreter System
Speech To Sign Language Interpreter System
 
Natural Language parsing.pptx
Natural Language parsing.pptxNatural Language parsing.pptx
Natural Language parsing.pptx
 
Ch2 automata.pptx
Ch2 automata.pptxCh2 automata.pptx
Ch2 automata.pptx
 
01-Introduction&Languages.pdf
01-Introduction&Languages.pdf01-Introduction&Languages.pdf
01-Introduction&Languages.pdf
 
Contemporary Models of Natural Language Processing
Contemporary Models of Natural Language ProcessingContemporary Models of Natural Language Processing
Contemporary Models of Natural Language Processing
 
1LECTURE 8 Regular_Expressions.ppt
1LECTURE 8 Regular_Expressions.ppt1LECTURE 8 Regular_Expressions.ppt
1LECTURE 8 Regular_Expressions.ppt
 
Lecture 3,4
Lecture 3,4Lecture 3,4
Lecture 3,4
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
 
10651372.ppt
10651372.ppt10651372.ppt
10651372.ppt
 
Theory of Computation Lecture Notes
Theory of Computation Lecture NotesTheory of Computation Lecture Notes
Theory of Computation Lecture Notes
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 

Mais de Minsk Linux User Group

Mais de Minsk Linux User Group (20)

Vladimir ’mend0za’ Shakhov — Linux firmware for iRMC controller on Fujitsu P...
 Vladimir ’mend0za’ Shakhov — Linux firmware for iRMC controller on Fujitsu P... Vladimir ’mend0za’ Shakhov — Linux firmware for iRMC controller on Fujitsu P...
Vladimir ’mend0za’ Shakhov — Linux firmware for iRMC controller on Fujitsu P...
 
Андрэй Захарэвіч — Hack the Hackpad: Першая спроба публічнага кіравання задач...
Андрэй Захарэвіч — Hack the Hackpad: Першая спроба публічнага кіравання задач...Андрэй Захарэвіч — Hack the Hackpad: Першая спроба публічнага кіравання задач...
Андрэй Захарэвіч — Hack the Hackpad: Першая спроба публічнага кіравання задач...
 
Святлана Ермаковіч — Вікі-дапаможнік. Як узмацніць беларускую вікі-супольнасць
Святлана Ермаковіч — Вікі-дапаможнік. Як узмацніць беларускую вікі-супольнасцьСвятлана Ермаковіч — Вікі-дапаможнік. Як узмацніць беларускую вікі-супольнасць
Святлана Ермаковіч — Вікі-дапаможнік. Як узмацніць беларускую вікі-супольнасць
 
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использованияТимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
Тимофей Титовец — Elastic+Logstash+Kibana: Архитектура и опыт использования
 
Андрэй Захарэвіч - Як мы ставілі KDE пад FreeBSD
Андрэй Захарэвіч - Як мы ставілі KDE пад FreeBSDАндрэй Захарэвіч - Як мы ставілі KDE пад FreeBSD
Андрэй Захарэвіч - Як мы ставілі KDE пад FreeBSD
 
Vitaly ̈_Vi ̈ Shukela - My FOSS projects
Vitaly  ̈_Vi ̈ Shukela - My FOSS projectsVitaly  ̈_Vi ̈ Shukela - My FOSS projects
Vitaly ̈_Vi ̈ Shukela - My FOSS projects
 
Vitaly ̈_Vi ̈ Shukela - Dive
Vitaly  ̈_Vi ̈ Shukela - DiveVitaly  ̈_Vi ̈ Shukela - Dive
Vitaly ̈_Vi ̈ Shukela - Dive
 
Alexander Lomov - Cloud Foundry и BOSH: истории из жизни
Alexander Lomov - Cloud Foundry и BOSH: истории из жизниAlexander Lomov - Cloud Foundry и BOSH: истории из жизни
Alexander Lomov - Cloud Foundry и BOSH: истории из жизни
 
Vikentsi Lapa — How does software testing become software development?
Vikentsi Lapa — How does software testing  become software development?Vikentsi Lapa — How does software testing  become software development?
Vikentsi Lapa — How does software testing become software development?
 
Михаил Волчек — Свободные лицензии. быть или не быть? Продолжение
Михаил Волчек — Свободные лицензии. быть или не быть? ПродолжениеМихаил Волчек — Свободные лицензии. быть или не быть? Продолжение
Михаил Волчек — Свободные лицензии. быть или не быть? Продолжение
 
Максим Мельников — IPv6 at Home: NAT64, DNS64, OpenVPN
Максим Мельников — IPv6 at Home: NAT64, DNS64, OpenVPNМаксим Мельников — IPv6 at Home: NAT64, DNS64, OpenVPN
Максим Мельников — IPv6 at Home: NAT64, DNS64, OpenVPN
 
Слава Машканов — “Wubuntu”: Построение гетерогенной среды Windows+Linux на н...
Слава Машканов — “Wubuntu”: Построение гетерогенной среды  Windows+Linux на н...Слава Машканов — “Wubuntu”: Построение гетерогенной среды  Windows+Linux на н...
Слава Машканов — “Wubuntu”: Построение гетерогенной среды Windows+Linux на н...
 
MajorDoMo: Открытая платформа Умного Дома
MajorDoMo: Открытая платформа Умного ДомаMajorDoMo: Открытая платформа Умного Дома
MajorDoMo: Открытая платформа Умного Дома
 
Максим Салов - Отладочный монитор
Максим Салов - Отладочный мониторМаксим Салов - Отладочный монитор
Максим Салов - Отладочный монитор
 
Максим Мельников - FOSDEM 2014 overview
Максим Мельников - FOSDEM 2014 overviewМаксим Мельников - FOSDEM 2014 overview
Максим Мельников - FOSDEM 2014 overview
 
Константин Шевцов - Пара слов о Jenkins
Константин Шевцов - Пара слов о JenkinsКонстантин Шевцов - Пара слов о Jenkins
Константин Шевцов - Пара слов о Jenkins
 
Ермакович Света - Операция «Пингвин»
Ермакович Света - Операция «Пингвин»Ермакович Света - Операция «Пингвин»
Ермакович Света - Операция «Пингвин»
 
Михаил Волчек - Смогут ли беларусы вкусить плоды Творческих Общин? Creative C...
Михаил Волчек - Смогут ли беларусы вкусить плоды Творческих Общин? Creative C...Михаил Волчек - Смогут ли беларусы вкусить плоды Творческих Общин? Creative C...
Михаил Волчек - Смогут ли беларусы вкусить плоды Творческих Общин? Creative C...
 
Vikentsi Lapa - Tools for testing
Vikentsi Lapa - Tools for testingVikentsi Lapa - Tools for testing
Vikentsi Lapa - Tools for testing
 
Алексей Туля - А нужен ли вам erlang?
Алексей Туля - А нужен ли вам erlang?Алексей Туля - А нужен ли вам erlang?
Алексей Туля - А нужен ли вам erlang?
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Último (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 

Алексей Чеусов - Расчёсываем своё ЧСВ

  • 1. The Word-based Regular Expressions (originally CHeuSoV :-) ) Aleksey Cheusov vle@gmx.net Minsk LUG meeting, 29 Dec. 2012, Belarus
  • 2. Plan for my presentation Mathematics: RL definition and theorem Real World: applications of RL and some notes Linguistics: POS tagset, Semantic tagset, Text tagging, Tagged text corpus Wanted: (mumbling...) Mathematics: ExpRL definition, WRL definition Linguistics + Real World: WRE syntax and examples Вброс, драка, разбор завалов, развоз трупов
  • 3. Regular language Definition: given a finite non-empty set of elements Σ (alphabet): ∅ is a regular language. For any element v ∈ Σ, {v} is a regular language. If A and B are regular languages, so is A ∪ B. If A and B are regular languages, so is {ab|a ∈ A, b ∈ B}, where ab means string concatenation. If A is a regular language, so is A∗ where A∗ = ∅ ∪ {a1a2. . .an|ai ∈ A, n > 0}. No other languages over Σ are regular. Example: Let Σ = {a, b, c}. Then since aab and cc are members of Σ∗, {aab} and {cc} are regular languages. So is the union of these two sets {aab, cc}, and so is the concatenation of the two {aabcc}. Likewise, {aab}∗, {cc}∗ and {aab, cc, aabcc}∗ are regular languages.
  • 4. Regular language (interesting fact) Provable fact: Regular languages are closed under intersection and negation operations.
  • 5. Finite state automaton Definition: 5-tuple (Σ, S, S0, F, δ) is a finite state automaton Σ is a finite non-empty set of elements (alphabet). S is a finite non-empty set of states. S0 ⊆ S is a set of start states. F ⊆ S is a set of finite states. δ : S × Σ → 2S is a state transition function. Theorem: ∀ regular language R, ∃ FSA f , such that L(f ) = R; ∀ FSA f , L(f ) is a regular language (L — language of FSA, that is a set of accepted inputs).
  • 6. Real world. Regular expressions (hello UNIX!). Regular language is a foundation for so called regular expressions, in most cases alphabet Σ is a set of characters (ASCII, Unicode etc.): POSIX basic regular expressions (grep, sed, vi etc.) and extended regular expressions (grep -E, awk etc.) Google re2, Yandex PIRE Perl, pcre, Ruby, Python (superset of regular language, and. . . extremely inefficient ;-) ) lex ... Widely used extensions: submatch (depending on implementation may still be FSM but not FSA) backreferences (incompatible with regular language and FSM at all)
  • 7. Tagsets Penn part-of-speech tag set: NN noun, singular or mass (apple, computer, fruit etc.) NNS noun plural (apples, computers, fruits etc.) CC coordinating conjunction (and, or) VB verb, base form (give, book, destroy etc.) VBD verb, past tense form (gave, booked, destroyed etc.) VBN verb, past participle form (given, booked, destroyed etc.) VBG verb, gerund/present participle (giving, booking, destroying etc.) etc. Semantic tag set: LinkVerb a verb that connects the subject to the complement(seem, feel, look etc.) AnimateNoun (brother, son etc.) etc.
  • 8. Tagged sentence, POS tagging, semantic tagging Examples: The book is red → The_DT book_NN is_VBZ red_JJ → The_DT book_NN/Object is_VBZ red_JJ/Color Note: Words book and red are ambiguous. My son goes to school → My_PRP$ son_NN goes_VBZ to_IN school_NN → My_PRP$ son_NN/Person goes_VBZ to_TO school_NN/Establishment Questions: How to match tagged sentence or find portions of it? Can regular expressions help? Do we need regcomp(3)/regexec(3)-like functionality?
  • 9. Expanded regular language Definition: given a non-empty set of elements Σ (alphabet) and a set of one-place predicates P = {P1, P2, . . . Pk}, Pi : Σ → {true, false}: ∅ is an expanded regular language. For any element v ∈ Σ, {v} is an expanded regular language. For any i {v|Pi (v) = true} is an expanded regular language. Σ is an expanded regular language. If A and B are expanded regular languages, so is A ∪ B. If A and B are expanded regular languages, so is A B. If A and B are expanded regular languages, so is {ab|a ∈ A, b ∈ B}, where ab means string concatenation. If A is a regular language, so is A∗ where A∗ = ∅ ∪ {a1a2. . .an|ai ∈ A, n > 0}. No other languages over Σ are expanded regular.
  • 10. Expanded regular language (interesting facts) Provable facts: Expanded regular languages are closed under intersection and negation operations. ∀ expanded regular language R we can build a regular language R∗ over alphabet Σ∗, such that ∃f : Σ∗ → 2Σ and L(R) = L(R∗) (expanding all elements in L(R∗) with a help of f ). Thus, we can build expanded regular expression engine based on traditional FSA/FSM-based algorithms.
  • 11. The Word-based regular language Definition: given W — set of words (character sequences), e.g. ”the”, ”apple””123”, ”; ”, ”C2H5OH”, . . . TPOS — finite non-empty set of part-of-speech tags, e.g. {DT, NN, NNS, VBP, VBZ, . . . } TSem — finite non-empty set of semantic tags, e.g. {LinkVerb, Person, Object, TransitiveVerb, . . . } DPOS : W → 2TPOS — POS dictionary, e.g. DPOS (”the”) = {DT}, DPOS(”book”) = {NN, VB, VBP}, DPOS (”and”) = {CC} DSem : W → 2TSem — semantic dictionary, e.g. DSem(”son”) = {Person}, DSem(”mouse”) = {Animal, ComputerDevice} EREs — finite set of POSIX extended regular expressions, e.g. {”. ∗ ing”, ”.multi. ∗ al”, ”[A − Z][a − z]”, ”[0 − 9] + ” . . . } etc. (to be continued)
  • 12. The Word-based regular language (continuation) the word-based regular language (TPOS , TSem, DPOS, DSem, EREs) is an expanded regular language over alphabet W × TPOS × 2TSem and one-place predicates P = {Pcheck tPOS , Pcheck tSem , Ptagging tPOS , Ptagging tSem , Pword re }, where Pcheck tPOS (w, ·, ·) = true if tPOS ∈ DPOS(w), and false otherwise Pcheck tSem (w, ·, ·) = true if tSem ∈ DSem(w), and false otherwise Ptagging tPOS (·, tagPOS , ·) = true if tPOS = tagPOS, and false otherwise Ptagging tSem (·, ·, tagsSem) = true if tSem ∈ tagsSem, and false otherwise Pword re (w, ·, ·) = true if POSIX extended regular expression re ∈ EREs matches w, and false otherwise
  • 13. The Word-based regular expressions (Finally!). Syntax: "word" — word itself (Pword re ), e.g. "the", "2012-12-29" etc. ’regexp’ — words matched by specified regexp (Pword re ) Tag — words tagged as tagPOS (Ptagging tPOS ), e.g. NN, DT, VB etc. %Tag — words tagged as tagSem (Ptagging tSem ), e.g. %Person, %Object, %LinkLerb etc. _Tag — words having as tagPOS in POS dictionary (Pcheck tPOS ), e.g. _NN, _DT, _VB etc. @Tag — words having as tagSem in semantic dictionary (Pcheck tSem ), e.g. @LinkVerb, @Object etc. . (dot) — any word with any POS and semantic tags ˆ — beginning of the sentence $ — end of the sentence (to be continued)
  • 14. The Word-based regular expressions (Finally!). Syntax (continuation): ( R ) — grouping like in mathematical expressions <num R > — submatch and extraction R ? and [ R ] — optional WRE R * and R + — possibly empty and non-empty repetitions R {n,m}, R {n,} and R {,m} — repetitions R – S — subtraction R & S and R / S — intersection, / is for single word WREs, & is for complex WREs R | S — union R S — concatenation !R — negation (L(!R) is equal to either Σ L(R) or Σ ∗ L(R) depending on a context of use) (to be continued)
  • 15. The Word-based regular expressions (Finally!). Syntax (priorities from highest to lowest, continuation): , / and | in single word non-spaced WREs Prepositional unary operation ! Postpositional unary operations {n,m}, ’?’, ’+’ and ’*’ ( R ) and <num R > R & S R – S R S R | S (to be continued)
  • 16. WRE examples How to select noun phrases (leftmost-longest match, only POS tags is our rules) ( DT | CD+ )? RB * CC|JJ|JJR|JJS * (NN|NNS + | NP + ) Ex.: This absolutely stupid decision Ex.: The best fuel cell Ex.: Black and white colors Ex.: Vasiliy Pupkin NER (Named Entity Recognition) for person names DSem("MrDr") = {"Mrs.", "Mr.", "Dr.", "Doctor", "President", "Dear", . . . } @MrDr <1 ’[A-Z][a-z]+’/!@MrDr␣+> Ex.: Doctor <1 Zhivago> Ex.: Dr. <1 Vasiliy Pupkin> Ex.: Mrs. <1 Kate> but not Ex.: Mr. <1 President>
  • 17. WRE examples Question focus (object attributes) ˆ "What" "is" "the" <1 @Attribute > "of" <2 .* > "?" Ex.: What is the <1 color > of <2 your book> ? Tiny definitions m4_define(NounPhrase, “(see previous slide)”) ˆ (NounPhrase & (. . . . . . .)) "is" (NounPhrase & (. . . . . . .)) Ex.: What is the <1 color > of <2 your book> ? Sentiment analysis DSem("BadCharact") = {"sucks", "stupid", "crappy", "shitty", . . . } DSem("GoodCharact") = {"rocks", "awesome", "excellent", . . . } DSem("OurProduct") = {"Linux", "NetBSD", "Ubuntu", "AltLinux", "iPad", "Android", . . . } <1 @BadCharact|@GoodCharact> <2 @OurProduct> | <2 @OurProduct> <1 @BadCharact|@GoodCharact>
  • 18. The Word-based Regular Expressions is really cool DSL!