LEXING AND PARSING 
THE BEGINNER’S GUIDE
WHY ARE WE DOING THIS? 
• bbcode 
• html 
• xml 
• programming language
BUT I CAN JUST REGEX 
• sometimes you can 
• sometimes you can’t 
• is your html well formed? (view source some time) 
• it depends!!
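The "sometimes you can't" case usually comes down to nesting. As a minimal sketch, assuming a toy bbcode dialect with only `[b]` and `[i]` tags (the helper name `is_well_nested` is made up for illustration), a regex can find the tags, but it takes a stack to check that they nest properly:

```python
import re

# Toy assumption: the only tags are [b]...[/b] and [i]...[/i].
TAG = re.compile(r"\[(/?)(b|i)\]")

def is_well_nested(text: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            stack.append(name)          # remember the open tag
        elif not stack or stack.pop() != name:
            return False                # closed something that was not open
    return not stack                    # anything left over was never closed

print(is_well_nested("[b]bold [i]both[/i][/b]"))
print(is_well_nested("[b]bold [i]both[/b][/i]"))
```

The regex does the lexing here (finding tags) and the stack does the parsing (tracking depth), which is the split the rest of the deck describes.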
CHOMSKY HIERARCHY
COMPUTER SCIENCE 
WE LIKE ACRONYMS AND WEIRD WORDS
ENGLISH IS HARD! 
• tokenizer 
• scanner 
• lexer 
• parser 
• lexical analyzer 
• syntactic analyzer 
• formal grammar
LEXICAL ANALYSIS 
BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS 
LEXING
SCANNING 
• Finite State Machine 
• Finds Lexemes 
• Might backtrack
FINITE STATE MACHINE
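A hand-rolled scanner for the `$y = 5;` example might look like the sketch below. The three states (`start`, `variable`, `number`) and the function name `scan` are assumptions for illustration, not how PHP's real scanner is built, and this version never needs to backtrack:

```python
def scan(source: str) -> list:
    """Split source into lexemes with a three-state machine."""
    lexemes, current, state = [], "", "start"
    for ch in source + " ":              # trailing space flushes the last lexeme
        if state == "variable" and (ch.isalnum() or ch == "_"):
            current += ch
            continue
        if state == "number" and ch.isdigit():
            current += ch
            continue
        if state != "start":             # the lexeme in progress just ended
            lexemes.append(current)
            current, state = "", "start"
        if ch == "$":
            current, state = ch, "variable"
        elif ch.isdigit():
            current, state = ch, "number"
        elif ch in "=;":
            lexemes.append(ch)           # single-character lexeme
        elif not ch.isspace():
            raise ValueError(f"unexpected character {ch!r}")
    return lexemes

print(scan("$y = 5;"))
```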
EVALUATOR 
• looks at lexeme to get value 
• lexeme + value = token
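As a rough sketch of the evaluator step, assuming a made-up `Token` shape and three made-up token kinds, the lexeme is classified and, where it makes sense, converted to a value:

```python
from typing import NamedTuple

class Token(NamedTuple):
    kind: str      # token category, e.g. "NUMBER"
    lexeme: str    # raw text found by the scanner
    value: object  # evaluated value, or None when the text says it all

def evaluate(lexeme: str) -> Token:
    """Turn a raw lexeme into a token by working out its kind and value."""
    if lexeme.isdigit():
        return Token("NUMBER", lexeme, int(lexeme))
    if lexeme.startswith("$"):
        return Token("VARIABLE", lexeme, lexeme[1:])   # value is the bare name
    return Token("SYMBOL", lexeme, None)

print([evaluate(lexeme) for lexeme in ["$y", "=", "5"]])
```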
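LEXING PHP - $y = 5; 
• $y 
• array(309, '$y', 1) 
• = 
• '=' (single-character tokens come back as plain strings) 
• 5 
• array(305, '5', 1) 
• 309 == T_VARIABLE 
• 305 == T_LNUMBER (numeric token IDs vary between PHP versions; token_name() translates them)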
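For comparison, Python exposes its own lexer through the standard `tokenize` module, much as PHP does through `token_get_all`. The sketch below lexes the Python equivalent `y = 5`; the wrapper name `lex_python` is made up for illustration:

```python
import io
import tokenize

def lex_python(source: str) -> list:
    """Return (token name, lexeme) pairs for a piece of Python source."""
    reader = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(reader)
    ]

pairs = lex_python("y = 5\n")
for name, lexeme in pairs:
    print(name, repr(lexeme))
```

The first three pairs are the name, the operator and the number; the remaining tokens mark the end of the line and of the input.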
LEXER GENERATORS 
DO NOT WRITE THIS BY HAND 
Famous 
• lex 
• flex 
• re2c 
• ANTLR 
• DFASTAR 
• jflex 
• jlex 
• quex 
PHP generators 
• https://github.com/oliverheins/PHPSimpleLexYacc 
  • lex syntax 
• https://github.com/pear/PHP_LexerGenerator 
  • re2c syntax 
• https://github.com/wez/JLexPHP 
  • jlex syntax 
• token_get_all (see php-parser) 
• parse_ini_file/string (combined with parser)
RE2C
IN PHP LAND
SYNTACTIC ANALYSIS 
CONSTRUCTING SOMETHING BASED ON A GRAMMAR 
PARSING
THE PARSING PROCESS 
• Tokens come in 
• Magic 
• Data structure comes out 
• parse tree 
• AST
GRAMMAR (FORMAL OF COURSE) 
• "Brave men run in my family." 
• I can't recommend this book too highly. 
• Prostitutes Appeal to Pope 
• I had had my car for four years before I ever learned to drive it.
TYPES OF PARSERS 
• Top Down 
  • Recursive Descent 
  • LL (left to right, leftmost derivation) 
  • Earley parser 
• Bottom Up 
  • Precedence parser 
    • Operator-precedence parser 
    • Simple precedence parser 
  • BC (bounded context) parsing 
  • LR parser (Left-to-right, Rightmost derivation) 
    • Simple LR (SLR) parser 
    • LALR parser 
    • Canonical LR (LR(1)) parser 
    • GLR parser 
  • CYK parser 
  • Recursive ascent parser
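Recursive descent is the easiest of these to write by hand: one method per grammar rule, each calling the others. The sketch below assumes a small arithmetic grammar chosen for illustration (`expr`, `term`, `factor`) and returns nested tuples as its tree; its toy tokenizer silently skips characters it does not know:

```python
import re

def tokenize_expr(text: str) -> list:
    """Split an arithmetic expression into number and symbol lexemes."""
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens: list):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr := term (("+" | "-") term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.term())
        return node

    def term(self):
        # term := factor (("*" | "/") factor)*
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.take(), node, self.factor())
        return node

    def factor(self):
        # factor := NUMBER | "(" expr ")"
        token = self.take()
        if token == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

def parse(text: str):
    """Parse a whole expression and reject trailing tokens."""
    parser = Parser(tokenize_expr(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("1 + 2 * 3"))
```

Operator precedence falls out of the rule layering: `term` is called from `expr`, so multiplication groups before addition.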
SENTENCE DIAGRAMMING 
• People who live in glass houses shouldn't throw stones.
PARSE TREE
TOP DOWN VS. BOTTOM UP PARSING
PARSE TREES 
• Constituency-based parse trees 
• Dependency-based parse trees
AST 
• Not everything appears 
• additional information may be applied 
• can “improve” tree nodes 
• PHP is getting one!
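Python's standard `ast` module makes the first two bullets easy to see. In this sketch the source `y = (5)` is an arbitrary example: the parentheses are gone from the tree, while position information has been added to each node:

```python
import ast

tree = ast.parse("y = (5)")
assignment = tree.body[0]

# The parentheses from the source appear nowhere in the tree: the AST
# keeps the meaning (assign the constant 5 to y), not the punctuation.
node_types = [type(node).__name__ for node in ast.walk(assignment)]
print(node_types)

# Extra information that was never typed in the source: positions.
print(assignment.lineno, assignment.value.col_offset)
```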
LALR(K) 
• Look ahead prevents “ambiguous” parsing 
• I have one token, what token comes next?
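The "what token comes next?" question can be sketched with a tiny shift-reduce loop. This is an operator-precedence evaluator, a much simpler relative of LALR parsing rather than a table-driven LALR parser, and it assumes well-formed input of the form `number operator number ...`; the names `shift_reduce` and `$end` are made up for illustration:

```python
import operator
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
APPLY = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}

def shift_reduce(text: str):
    """Evaluate a flat arithmetic expression bottom-up with one token of lookahead."""
    tokens = re.findall(r"\d+|[-+*/]", text) + ["$end"]
    values, operators = [], []

    def reduce_once():
        right, left = values.pop(), values.pop()
        values.append(APPLY[operators.pop()](left, right))

    for position in range(0, len(tokens) - 1, 2):
        values.append(int(tokens[position]))      # shift the operand
        lookahead = tokens[position + 1]          # peek at the next token
        # Reduce while the pending operator binds at least as tightly
        # as the operator we are about to see; otherwise keep shifting.
        while operators and (
            lookahead == "$end"
            or PRECEDENCE[operators[-1]] >= PRECEDENCE[lookahead]
        ):
            reduce_once()
        if lookahead != "$end":
            operators.append(lookahead)           # shift the operator
    return values[0]

print(shift_reduce("1 + 2 * 3"))
```

After reading `1 + 2`, the loop cannot know whether to add yet. Seeing `*` as the lookahead tells it to keep shifting; seeing `+` or the end of input tells it to reduce.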
PARSER GENERATORS 
Famous 
• bison 
• bison 
• bison 
• bison 
• yacc 
• lemon 
• ANTLR 
PHP versions 
• https://github.com/wez/lemon-php 
• https://github.com/pear/PHP_ParserGenerator 
• lemon 
• https://github.com/scato/phpeg 
• peg (peg.js) 
• https://github.com/jakubkulhan/pacc 
• yacc
BISON 
• Generates LALR (or GLR) parsers 
• Code in C, C++ or Java 
• reentrant with %define api.pure set 
• used by ALL THE THINGS 
• PHP 
• Ruby 
• PostgreSQL 
• Go
BISON IN C
LEMON 
• Generates LALR(1) parser 
• reentrant AND thread safe 
• non-terminal destructor (leak avoidance) 
• pull parsing 
• SQLite
PHP LEMON
REENTRANT VS THREAD SAFE 
• Process 
• Thread 
• Locking 
• Scope 
• Reentrant
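A rough illustration of the difference in scope, using made-up names: the first scanner keeps its position in module-level state, so two interleaved scans trample each other; the second carries its state in an object, so each scan is independent:

```python
# Non-reentrant style: the position lives in shared module-level state.
_position = 0

def next_word_global(words: list):
    global _position
    word = words[_position] if _position < len(words) else None
    _position += 1
    return word

# Reentrant style: every scan owns its own position.
class WordScanner:
    def __init__(self, words: list):
        self.words = words
        self.position = 0

    def next_word(self):
        word = self.words[self.position] if self.position < len(self.words) else None
        self.position += 1
        return word

first, second = WordScanner(["a", "b"]), WordScanner(["x", "y"])
interleaved = [first.next_word(), second.next_word(),
               first.next_word(), second.next_word()]

# The second call starts from where the first one left off.
global_interleaved = [next_word_global(["a", "b"]), next_word_global(["x", "y"])]
print(interleaved, global_interleaved)
```

Reentrant is not the same as thread safe: sharing one `WordScanner` between threads would still need locking around `next_word`.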
COMPILE IT 
• transforms a programming language into machine language
INTERPRET IT 
• directly executes programming language
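The contrast between compiling and interpreting can be sketched on one small tree. Both functions below are illustrative assumptions: the tree is nested tuples, and the "computer language" is a made-up stack machine with `PUSH` and `APPLY` instructions:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interpret(node):
    """Execute the tree directly and return the result."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))

def compile_to_stack_code(node) -> list:
    """Translate the tree into instructions for a made-up stack machine."""
    if isinstance(node, int):
        return [("PUSH", node)]
    op, left, right = node
    return (compile_to_stack_code(left)
            + compile_to_stack_code(right)
            + [("APPLY", op)])

tree = ("+", 1, ("*", 2, 3))
program = compile_to_stack_code(tree)
print(interpret(tree))
print(program)
```

The interpreter produces an answer; the compiler produces another program that something else still has to run.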
PROFIT
UNDER THE HOOD 
WHAT USES THIS STUFF?
PHP 
RE2C + Bison + these crazy opcodes….
LALR(1) WRITTEN BY HAND 
How - pythonic
HHVM 
Flex and Bison and JIT – OH MY!
SQLITE 
Lemon is tasty!
WRITING PARSERS AND LEXERS 
THEORIES OF CODING
STEP 1: THINK SMALL 
• Writing a general purpose parser is hard – that’s why you use PHP 
• Writing a single purpose parser is much easier 
• markup text (markdown) 
• configuration or definition files (behat/gherkin syntax) 
• complex validation (addresses in multiple formats)
STEP 2: SEPARATE AND UNOPTIMIZED 
• premature optimization yada yada 
• combine after it’s ready to be used (or not at all if you’ll need to change it later) 
• lexer and parser each have unique, well defined goals 
• the ability to potentially switch parser styles later will help you!
STEP 3: LEXER 
• the lexer's job is to recognize tokens 
• it can do this via a giant switch statement of doom 
• or maybe a giant loop 
• or maybe a list of goto statements 
• or maybe a complex class with methods 
• …. or you can just use a generator
LET’S BREAK THAT DOWN 
1. Define a token format 
2. Define grammar format (what are we looking for?) 
3. Go over the input data (usually a string) and make matches 
1. compare or regex or ctype_* or however it makes sense 
4. Keep track of your current state 
5. Have an output format – AST, tree, whatever
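The five steps above can be sketched as a small regex-driven lexer. The token shape and the rule names are toy assumptions for the `$y = 5;` example, not PHP's real rules; the master-pattern approach follows the tokenizer example in Python's `re` documentation:

```python
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):          # step 1: token format
    kind: str
    text: str
    line: int

# Step 2: what are we looking for? Order matters: first match wins.
RULES = [
    ("VARIABLE", r"\$[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("ASSIGN", r"="),
    ("SEMICOLON", r";"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t]+"),
    ("MISMATCH", r"."),           # anything else is an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def lex(source: str) -> Iterator[Token]:
    line = 1                                  # step 4: current state
    for match in MASTER.finditer(source):     # step 3: go over the input
        kind, text = match.lastgroup, match.group()
        if kind == "NEWLINE":
            line += 1
        elif kind == "MISMATCH":
            raise SyntaxError(f"unexpected {text!r} on line {line}")
        elif kind != "SKIP":
            yield Token(kind, text, line)     # step 5: output format

for token in lex("$y = 5;\n$z = 10;"):
    print(token)
```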
STEP 4: PARSER 
• Loop over our tokens 
• Look at the values and decide what to do
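For a single-purpose format the loop can be very plain. This sketch assumes tokens arrive as `(kind, text)` pairs and that the only legal statement is `VARIABLE ASSIGN NUMBER SEMICOLON`; the function name and token kinds are made up for illustration:

```python
def parse_assignments(tokens: list) -> dict:
    """Parse a flat list of `VARIABLE ASSIGN NUMBER SEMICOLON` statements."""
    expected = ["VARIABLE", "ASSIGN", "NUMBER", "SEMICOLON"]
    result, statement = {}, []
    for kind, text in tokens:                     # loop over our tokens
        want = expected[len(statement)]
        if kind != want:                          # look at it, decide what to do
            raise SyntaxError(f"expected {want}, got {kind} ({text!r})")
        statement.append(text)
        if len(statement) == len(expected):       # a full statement: record it
            name, _, number, _ = statement
            result[name] = int(number)
            statement = []
    if statement:
        raise SyntaxError("unexpected end of input")
    return result

tokens = [("VARIABLE", "$y"), ("ASSIGN", "="), ("NUMBER", "5"), ("SEMICOLON", ";")]
print(parse_assignments(tokens))
```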
STEP 5: DO SOMETHING WITH IT! 
1. Compile – write out to something that can be run (html) 
2. Interpret – run through another program to get output (templates to html) 
3. Analyze – run through to analyze the data inside (code analysis/sniffer tools) 
4. Validate – check for proper “spelling and grammar” 
5. ??? 
6. PROFIT
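As one possible take on option 1, a parsed markup tree can be written out as HTML. The tree shape here (a string, or a `(tag, children)` pair) is an assumption for illustration:

```python
import html

def to_html(node) -> str:
    """Render a toy markup tree as an HTML string."""
    if isinstance(node, str):
        return html.escape(node)                  # plain text gets escaped
    tag, children = node
    inner = "".join(to_html(child) for child in children)
    return f"<{tag}>{inner}</{tag}>"

tree = ("p", ["Hello ", ("b", ["bold & ", ("i", ["both"])])])
print(to_html(tree))
```

Because the output is built from a tree, tags always close in the right order and text is escaped in one place.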
“If you’re not sure how to do a job – ask!” 
- silly poster on my laundry room wall
RESOURCES 
• http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html 
• http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html 
• https://github.com/hafriedlander/php-peg 
• https://github.com/nikic/PHP-Parser/ 
• http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html 
• http://wikipedia.org
CONTACT ME 
• auroraeosrose@gmail.com 
• auroraeosrose – freenode.net #phpmentoring #phpwomen 
• Twitter - @auroraeosrose 
• http://github.com/auroraeosrose

Lexing and parsing

  • 1. LEXING AND PARSING THE BEGINNER’S GUIDE
  • 2. WHY ARE WE DOING THIS? • bbcode • html • xml • programming language
  • 3. BUT I CAN JUST REGEX • sometimes you can • sometimes you can’t • is your html well formed? (view source some time) • it depends!!
  • 5. COMPUTER SCIENCE WE LIKE ACRONYMS AND WEIRD WORDS
  • 6. ENGLISH IS HARD! • tokenizer • scanner • lexer • parser • lexical analyzer • syntactic analyzer • formal grammar
  • 7. LEXICAL ANALYSIS BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS LEXING
  • 8. SCANNING • Finite State Machine • Finds Lexemes • Might backtrack
  • 10. EVALUATOR • looks at lexeme to get value • lexeme + value = token
  • 11. LEXING PHP - $Y = 5; • $y • array[309, ‘$y’, 1], • = • = • 5 • array[305, 5, 1] • 309 == T_VARIABLE • 305 == T_LNUMBER
  • 12. LEXER GENERATORS DO NOT WRITE THIS BY HAND Famous • lex • flex • re2c • ANTLR • DFASTAR • jflex • jlex • quex PHP generators • https://github.com/oliverheins/PHPSimpleLexYacc • lex syntax • https://github.com/pear/PHP_LexerGenerator • re2c syntax • https://github.com/wez/JLexPHP • jlex syntax • token_get_all (see php-parser) • parse_ini_file/string (combined with parser)
  • 13. RE2C
  • 15. SYNTACTIC ANALYSIS CONSTRUCTING SOMETHING BASED ON A GRAMMAR PARSING
  • 16. THE PARSING PROCESS • Tokens come in • Magic • Data structure comes out • parse tree • AST
  • 17. GRAMMAR (FORMAL OF COURSE) • "Brave men run in my family.” • I can't recommend this book too highly. • Prostitutes Appeal to Pope • I had had my car for four years before I ever learned to drive it.
  • 18. TYPES OF PARSERS • Top Down • Recursive Decent • LL (left to right, leftmost derivation) • Earley parser • Bottom Up • Precedence parser • Operator-precedence parser • Simple precedence parser • BC (bounded context) parsing • LR parser (Left-to-right, Rightmost derivation) • Simple LR (SLR) parser • LALR parser • Canonical LR (LR(1)) parser • GLR parser • CYK parser • Recursive ascent parser
  • 19. SENTENCE DIAGRAMMING • People who live in glass house shouldn't throw stones.
  • 21. TOP DOWN VS. BOTTOM UP PARSING
  • 22. PARSE TREES • Constituency-based parse trees • Dependency-based parse trees
  • 23. AST • Not everything appears • additional information may be applied • can “improve” tree nodes • PHP is getting one!
  • 24. LALR(K) • Look ahead prevents “ambiguous” parsing • I have one token, what token comes next?
  • 25. PARSER GENERATORS Famous • bison • bison • bison • bison • yacc • lemon • ANTLR PHP versions • https://github.com/wez/lemon-php • https://github.com/pear/PHP_ParserGenerator • lemon • https://github.com/scato/phpeg • peg (peg.js) • https://github.com/jakubkulhan/pacc • yacc
  • 26. BISON • Generates LALR (or GLR) parsers • Code in C, C++ or Java • reentrant with %define api.pure set • used by ALL THE THINGS • PHP • Ruby • Postgresql • Go
  • 28. LEMON • Generates LALR(1) parser • reentrant AND thread safe • non-terminal destructor (leak avoidance) • pull parsing • sqlite
  • 30. REENTRANT VS THREAD SAFE • Process • Thread • Locking • Scope • Reentrant
  • 31. COMPILE IT • transform programming language to computer language
  • 32. INTERPRET IT • directly executes programming language
  • 34. UNDER THE HOOD WHAT USES THIS STUFF?
  • 35. PHP RE2C + Bison + these crazy opcodes….
  • 36. LALR(1) WRITTEN BY HAND How - pythonic
  • 37. HHVM Flex and Bison and JIT – OH MY!
  • 38. SQLITE Lemon is tasty!
  • 39. WRITING PARSERS AND LEXERS THEORIES OF CODING
  • 40. STEP 1: THINK SMALL • Writing a general purpose parser is hard – that’s why you use PHP • Writing a single purpose parser is much easier • markup text (markdown) • configuration or definition files (behat/gherkin syntax) • complex validation (addresses in multiple formats)
  • 41. STEP 2: SEPARATE AND UNOPTIMIZED • premature optimization yada yada • combine after it’s ready to be used (or not at if you’ll need to change it later) • lexer and parser each have unique, well defined goals • the ability to potentially switch parser styles later will help you!
  • 42. STEP 3: LEXER • the lexer's job is to recognize tokens • it can do this via a giant switch statement of doom • or maybe a giant loop • or maybe a list of goto statements • or maybe a complex class with methods • …. or you can just use a generator
  • 43. LET’S BREAK THAT DOWN 1. Define a token format 2. Define grammar format (what are we looking for?) 3. Go over the input data (usually a string) and make matches 1. compare or regex or ctype_* or however it make sense 4. Keep track of your current state 5. Have an output format – AST, tree, whatever
  • 44. STEP 4: PARSER • Loop over our tokens • Look at the values and decide to what to do
  • 45. STEP 5: DO SOMETHING WITH IT! 1. Compile – write out to something that can be run (html) 2. Interpret – run through another program to get output (templates to html) 3. Analyze – run through to analyze the data inside (code analysis/sniffer tools) 4. Validate – check for proper “spelling and grammar” 5. ??? 6. PROFIT
  • 46. “If you’re not sure how to do a job – ask!” - silly poster on my laundry room wall
  • 47. RESOURCES • http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html • http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html • https://github.com/hafriedlander/php-peg • https://github.com/nikic/PHP-Parser/ • http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html • http://wikipedia.org
  • 48. CONTACT ME • auroraeosrose@gmail.com • auroraeosrose – freenode.net #phpmentoring #phpwomen • Twitter - @auroraeosrose • http://github.com/auroraeosrose
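
The lexer steps from slide 43 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how PHP's own lexer works: `T_VARIABLE` and `T_LNUMBER` are the token names from the `$y = 5;` slide, while `T_ASSIGN`, `T_SEMICOLON`, `T_NEWLINE`, `T_WHITESPACE`, the `Token` tuple and the `lex` function are hypothetical names made up for this example.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Step 1: the token format (name, matched lexeme, line number)."""
    type: str
    value: str
    line: int


# Step 2: what we are looking for. Illustrative subset only; every name
# except T_VARIABLE and T_LNUMBER is hypothetical.
TOKEN_SPEC = [
    ("T_VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("T_LNUMBER", r"\d+"),
    ("T_ASSIGN", r"="),
    ("T_SEMICOLON", r";"),
    ("T_NEWLINE", r"\n"),
    ("T_WHITESPACE", r"[ \t\r]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source: str) -> Iterator[Token]:
    """Steps 3-5: walk the input, match, track state, emit tokens."""
    pos, line = 0, 1  # step 4: the current state of the scan
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} on line {line}")
        kind = match.lastgroup  # name of the alternative that matched
        if kind == "T_NEWLINE":
            line += 1  # suppressed lexeme, but it updates our state
        elif kind != "T_WHITESPACE":
            yield Token(kind, match.group(), line)
        pos = match.end()
```

Calling `list(lex("$y = 5;"))` yields four tokens, starting with a `T_VARIABLE` token whose value is `$y` on line 1. Whitespace is matched but never emitted, which is the "suppressed lexeme" case from the notes.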

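Steps 4 and 5 (loop over the tokens, decide what to do, then do something with the result) can be sketched as a small top-down, recursive descent parser for arithmetic. Everything here is an assumption made for illustration: the `(type, lexeme)` token format, the token names `NUMBER`, `PLUS`, `MINUS`, `STAR`, `SLASH`, `LPAREN`, `RPAREN`, and the `Parser` and `evaluate` names. The tree it builds is a nested tuple `(operator, left, right)`.

```python
from typing import List, Tuple, Union

Token = Tuple[str, str]    # (type, lexeme); hypothetical token format
Node = Union[int, tuple]   # a number, or (operator, left, right)


class Parser:
    """Recursive descent: one method per grammar rule.

    expr   := term (("+" | "-") term)*
    term   := factor (("*" | "/") factor)*
    factor := NUMBER | "(" expr ")"
    """

    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> str:
        """Type of the next token, or "EOF" when the input is used up."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def take(self, expected: str) -> str:
        """Consume the next token if it has the expected type."""
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, got {self.peek()}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def parse(self) -> Node:
        tree = self.expr()
        if self.peek() != "EOF":
            raise SyntaxError(f"unexpected {self.peek()} after expression")
        return tree

    def expr(self) -> Node:
        node = self.term()
        while self.peek() in ("PLUS", "MINUS"):
            op = self.take(self.peek())
            node = (op, node, self.term())
        return node

    def term(self) -> Node:
        node = self.factor()
        while self.peek() in ("STAR", "SLASH"):
            op = self.take(self.peek())
            node = (op, node, self.factor())
        return node

    def factor(self) -> Node:
        if self.peek() == "LPAREN":
            self.take("LPAREN")
            node = self.expr()
            self.take("RPAREN")
            return node  # the parentheses leave no node of their own
        return int(self.take("NUMBER"))


def evaluate(node: Node) -> Union[int, float]:
    """Step 5, the "interpret" option: walk the tree and compute a value."""
    if isinstance(node, int):
        return node
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    return a / b
```

For the tokens of `2 + 3 * (4 - 1)`, `Parser(tokens).parse()` returns `("+", 2, ("*", 3, ("-", 4, 1)))` and `evaluate` on that tree gives `11`. The grouping parentheses disappear from the tree because the nesting already records them.
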
Editor’s Notes

  1. Why I got started with this: I’ve never taken a computer class. I wanted to understand why PHP worked the way it does, because I’d been pondering putting some eventing/async magic inside, and I ended up down this deep computer science pit where compilers are at the bottom.
  2. Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers. Parsers are used to recognize the "structure" of a language's phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
  3. Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, and those will not work in every case. It should be possible to present an HTML file that will be matched wrongly by any regular expression.
  4. A formal grammar defines (or generates) a formal language, which is a (usually infinite) set of finite-length sequences of symbols (i.e. strings) that may be constructed by applying production rules to another sequence of symbols which initially contains just the start symbol. Type-0 grammars (unrestricted grammars) include all formal grammars. Type-1 grammars (context-sensitive grammars) generate the context-sensitive languages. Type-2 grammars (context-free grammars) generate the context-free languages. Type-3 grammars (regular grammars) generate the regular languages.
  5. So computer science is a really weird discipline. Quite a bit of what computer science is and does comes from, well, math. The other part, the “language” aspects and even the concepts of grammar and meaning, comes from “English”, or “language arts” as my kids’ school calls it. The only “science” part that I think really applies is that we test theories and apply logic. At its core, remember, computers are algorithms (rules) and information (data), but “computer science” has grown to encompass LOTS of things. What we’re going to talk about is a small but fundamental window, lexing and parsing, so let’s start with words.
  6. Ask who has seen these terms. Ask if anyone knows a definition of these terms, even a non-computer-science definition. Almost all of these terms have different meanings depending on their context; the computer science definitions are what we’re going to be using. We’re also going to mention that some terms get thrown around a bit (parser and scanner are the two worst), but I’m also going to attempt to help you build your own internal rules, so you don’t confuse yourself and others, by always using them in the “computer science dictionary” manner.
  7. Scanner == first stage of the lexer. Strictly speaking, a lexer is itself a kind of parser, but we won’t EVER call it a parser because CONFUSION. The syntax of some programming languages is divided into two pieces: the lexical syntax (token structure), which is processed by the lexer, and the phrase syntax, which is processed by the parser. The lexical syntax is usually a regular language, whose alphabet consists of the individual characters of the source code text. The phrase syntax is usually a context-free language, whose alphabet consists of the tokens produced by the lexer. While this is a common separation, a lexer can alternatively be combined with the parser in scannerless parsing. I would say, though: DO NOT DO THIS. It may seem easier in the short term, but when you have to start changing stuff you will have PAIN.
  8. Finite state machine: we have a finite (bounded) list of states, and the machine can be in only one state at any one time. Because a finite state machine can represent any history and a reaction, by regarding the change of state as a response to the history, it has been argued that it is a sufficient model of human behaviour, i.e. humans are finite state machines. Lexeme == the characters that have been matched by our state machine; it needs to be translated to a value.
  9. States: happy, sad, angry. Inputs: money, food, kick in pants. Outputs: smile, frown, punch back. Set up an example of a state machine for people.
  10. Sometimes there isn’t a value (parentheses in a programming language, for example). Sometimes a lexeme is suppressed (comments, anyone?). Sometimes a lexeme or token is even ADDED by the lexer: line continuation (C code), semicolon insertion (lazy bad JavaScript! and Go? really!), and the off-side rule, where blocks are marked by indents (oh Python) rather than braces (PHP and C and friends). Context sensitivity: good lexers are NOT context-sensitive. the more look ahead, look back, and backtracking
  11. So discuss a little bit about PHP: its lexer is exposed with token_get_all. It’ll “parse”/“tokenize” (lex is the correct term) the PHP fed to it. This is why there are many parsers written in PHP but not really any lexers: it’s in there. This is GENERALLY the easy part! What is the 1? Line numbers.
  12. ANTLR - Can generate lexical analyzers and parsers. DFASTAR - Generates DFA matrix table-driven lexers in C++. Flex - Alternative variant of the classic "lex" (C/C++). JFlex - A rewrite of JLex. Ragel - A state machine and lexer generator with output in C, C++, C#, Objective-C, D, Java, Go and Ruby. The following lexical analysers can handle Unicode: JavaCC - JavaCC generates lexical analyzers written in Java. JLex - A lexical analyzer generator for Java. Quex - A fast universal lexical analyzer generator for C and C++. SO if you’re generating
  13. rules, named definitions and in-place configurations.
  14. Ah, the overloading of the word parsing. Syntactic analysis uses a grammar: it looks at the data sent and builds a model of it, usually some kind of data structure or tree. Just like in English, we take grammar to define ideas.
  15. A parser is a software component that takes input data (frequently text) and builds a data structure, often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input and checking for correct syntax in the process. You can do scannerless parsing (again with the silly overloading of words), a “non lexed” parser, but, sigh.
  16. A formal grammar is a set of rules for rewriting strings, along with a "start symbol" from which rewriting starts. Parsing is the process of recognizing an utterance (a string in natural languages) by breaking it down to a set of symbols and analyzing each one against the grammar of the language. This is why what comes before and after can be important when parsing. Your brain is a very good parser.
  17. One first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. Top-down parsers can be small, powerful and readable, although they can be slower. A top-down parser with a direct path is going to beat a more complex path. A bottom-up parser can be faster, but you need to match the type of parser with what you’re doing.
  18. So let’s take a theoretical piece of code that’s been lexed into these values and turn it into a “parse tree”. We’ll get into that in a moment.
  19. The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parser discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. Left corner parsing is a hybrid method which works bottom-up along the left edges of each subtree, and top-down on the rest of the parse tree. If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger range of computer language grammars than do deterministic top-down parsers. Bottom-up parsing is sometimes done by backtracking. But much more commonly, bottom-up parsing is done by a shift-reduce parser such as a LALR parser.
  20. A parse tree is an ordered, rooted tree that represents the syntactic structure of a string. Their structure and elements more concretely reflect the syntax of the input language. Parse trees can be constituency-based (built from parts) or dependency-based. Dependency-based trees are simpler on average than constituency-based parse trees because they contain many fewer nodes. So dependency would say noun, verb, adverb; constituency would be sentence, noun phrase, verb phrase, and breaks it down into smaller pieces.
  21. Abstract syntax tree: the syntax is "abstract" in not representing every detail appearing in the real syntax. Grouping parentheses are implicit in the tree structure, and a syntactic construct like an if-condition-then expression may be denoted by means of a single node with three branches.
  22. LALR: look-ahead, left-to-right, rightmost derivation. The look ahead can be different depending on the parser type, but bison and friends are all LALR(1) generators.
  23. bison is re-entrant but NOT thread safe
  24. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser (in C, C++, or Java) which reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar. Note that bison is re-entrant, but it is not thread safe by default (these are two different things).
  25. Lemon requires you to write more rules in comparison with Bison because of its simplified syntax: no repetitions and optionals, one action per rule, etc. Complete set of LALR(1) parser limitations. Only the C language.
  26. A subroutine is reentrant if it can be interrupted in the middle of its execution and then safely called again ("re-entered") before its previous invocations complete execution. A reentrant subroutine can achieve thread-safety, but being reentrant alone might not be sufficient to be thread-safe in all situations. Conversely, thread-safe code does not necessarily have to be reentrant. A piece of code is thread-safe if it only manipulates shared data structures in a manner that guarantees safe execution by multiple threads at the same time.
  27. Compilers generally write out to assembly or machine code, but technically anything can be compiled down to something to be run (plug reckit).
  28. An interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program.
  29. PHP bison file PHP bison C output
  30. hand written lexer and lemon parser
  31. A parser is a program which processes an input and "understands" it. A lexer is a program which splits something into tokens and assigns each one a value. There are steps you can take to make doing this easier and make you feel less “OMG I’m WRITING A PARSER”, or you can cheat and just use a generator.
  32. So when you first get started think of something small
  33. Each of these types of lexers is going to have its advantages and disadvantages. The trick here is not to let the lexer do more than it’s supposed to. It should be context free or you’ll hate yourself later. If you absolutely positively have to look ahead or look behind, you’ll hate yourself later. Put as much information into your token definition as you want.