SlideShare uma empresa Scribd logo
1 de 17
Chapter 4
                        Lexical analysis

CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.   1
Scanner

         • Main task: identify tokens
                 – Basic building blocks of programs
                 – E.g. keywords, identifiers, numbers, punctuation marks
         • Desk calculator language example:
                         read A
                         sum := A + 3.45e-3
                         write sum
                         write sum / 2




CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.              2
Formal definition of tokens

         • A set of tokens is a set of strings over an alphabet
            – {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …}
         • A set of tokens is a regular set that can be defined by
           comprehension using a regular expression
         • For every regular set, there is a deterministic finite
           automaton (DFA) that can recognize it
            – (Aka deterministic Finite State Machine (FSM))
            – i.e. determine whether a string belongs to the set or not
            – Scanners extract tokens from source code in the same
              way DFAs determine membership

CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.        3
Regular Expressions
         A regular expression (RE) is:
         • A single character
         • The empty string, ε
         • The concatenation of two regular expressions
                 – Notation: RE1 RE2 (i.e. RE1 followed by RE2)
         • The union of two regular expressions
                 – Notation: RE1 | RE2
         • The closure of a regular expression
                 – Notation: RE*
                 – * is known as the Kleene star
                 – * represents the concatenation of 0 or more strings
         • Caution: notations for regular expressions vary
                 – Learn the basic concepts and the rest is just syntactic sugar
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                     4
Token Definition Example
                                                                         DIG
     • Numeric literals in Pascal, e.g.
        1, 123, 3.1415, 10e-3, 3.14e4                                            *             DIG

     • Definition of token unsignedNum                                       .

        DIG → 0|1|2|3|4|5|6|7|8|9                                                    DIG
                                                                                              *
        unsignedInt → DIG DIG*                                               e       e
        unsignedNum →                                                                    +
            unsignedInt                                                                  -           DIG
                                                                       DIG                   DIG
           (( . unsignedInt) | ε)
           ((e ( + | – | ε) unsignedInt) | ε)                                    *             DIG

     • Notes:
        – Recursion is not allowed!                              • FAs with epsilons are
                                                                   nondeterministic.
        – Parentheses used to avoid                              • NFAs are much harder to
          ambiguity                                                implement (use backtracking)
        – It’s always possible to rewrite                        • Every NFA can be rewriten as
                                                                   a DFA (gets larger, tho)
          removing epsilons
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                                        5
Simple Problem
          • Write a C program which reads in a character string, consisting
            of a’s and b’s, one character at a time. If the string contains a
            double aa, then print string accepted else print string rejected.
          • An abstract solution to this can be expressed as a DFA

                b                                                       a
                             1-                                  2                   3+          a, b
                                              a
                     Start state               b                               An accepting state

                                                                                         input
                                                                                     a           b
               The state transitions of a DFA
               can be encoded as a table                             current   1      2          1
               which specifies the new state
               for a given current state and
                                                                       state
                                                                               2      3          1
               input                                                           3      3          3
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                                          6
#include <stdio.h>
               main()
               { enum State {S1, S2, S3};                        an approach in C
                  enum State currentState = S1;
                  int c = getchar();
                  while (c != EOF) {
                       switch(currentState) {
                         case S1: if (c == ‘a’) currentState = S2;
                                    if (c == ‘b’) currentState = S1;
                                    break;
                         case S2: if (c == ‘a’) currentState = S3;
                                    if (c == ‘b’) currentState = S1;
                                    break;
                         case S3: break;
                        }
                        c = getchar();
                   }
                   if (currentState == S3) printf(“string acceptedn”);
                   else printf(“string rejectedn”);
               }
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                      7
#include <stdio.h>                                    Using a table
           main()
           { enum State {S1, S2, S3};                            simplifies the
              enum Label {A, B};
              enum State currentState = S1;
                                                                 program
              enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}};
              int label;
              int c = getchar();
              while (c != EOF) {
                   if (c == ‘a’) label = A;
                   if (c == ‘b’) label = B;
                   currentState = table[currentState][label];
                   c = getchar();
               }
              if (currentState == S3) printf(“string acceptedn”);
              else printf(“string rejectedn”);
           }


CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                    8
Lex
         • Lexical analyzer generator
            – It writes a lexical analyzer
         • Assumption
            – each token matches a regular expression
         • Needs
            – set of regular expressions
            – for each expression an action
         • Produces
            – A C program
         • Automatically handles many tricky problems
         • flex is the gnu version of the venerable unix tool lex.
            – Produces highly optimized code

CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.         9
Scanner Generators
          • E.g. lex, flex
          • These programs take
            a table as their input
            and return a program
            (i.e. a scanner) that
            can extract tokens
            from a stream of
            characters
          • A very useful
            programming utility,
            especially when
            coupled with a parser
            generator (e.g., yacc)
          • standard in Unix
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.     10
Lex example                                                                                   input


                                         lex                         cc                foolex
               foo.l                                      foolex.c          foolex
                                                                                              tokens
                                                                     > foolex < input
              > flex -ofoolex.c foo.l                                Keyword: begin
              > cc -ofoolex foolex.c -lfl                            Keyword: if
                                                                     Identifier: size
                                                                     Operator: >
             >more input                                             Integer: 10 (10)
             begin                                                   Keyword: then
                                                                     Identifier: size
              if size>10                                             Operator: *
                then size * -3.1415                                  Operator: -
             end                                                     Float: 3.1415 (3.1415)
                                                                     Keyword: end
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                                    11
A Lex Program

                                                                 DIG [0-9]

        … definitions …                                          ID [a-z][a-z0-9]*
                                                                 %%

        %%                                                       {DIG}+            printf("Integern”);
                                                                 {DIG}+"."{DIG}* printf("Floatn”);

        … rules …                                                {ID}
                                                                 [ tn]+
                                                                                   printf("Identifiern”);
                                                                                   /* skip whitespace */
                                                                 .                 printf(“Huh?n");
        %%
                                                                 %%
        … subroutines …                                          main(){yylex();}




CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                                               12
Simplest Example


                              %%
                              .|n                ECHO;
                              %%

                              main()
                              {
                                yylex();
                              }




CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.       13
Strings containing aa

                  %%
                  (a|b)*aa(a|b)*                            {printf(“Accept %sn”, yytext);}

                  [ab]+                                     {printf(“Reject %sn”, yytext);}

                  .|n              ECHO;
                  %%
                  main() {yylex();}




CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                                 14
Rules
              • Each has a rule has a pattern and an action.
              • Patterns are regular expression
              • Only one action is performed
                 – The action corresponding to the pattern matched
                   is performed.
                 – If several patterns match the input, the one
                   corresponding to the longest sequence is chosen.
                 – Among the rules whose patterns match the same
                   number of characters, the rule given first is
                   preferred.

CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.           15
x         character 'x'
                                                                 Flex’s RE syntax
      .         any character except newline
      [xyz]     character class, in this case, matches either an 'x', a 'y', or a 'z'
      [abj-oZ]  character class with a range in it; matches 'a', 'b', any letter
                from 'j' through 'o', or 'Z'
      [^A-Z]    negated character class, i.e., any character but those in the
                class, e.g. any character except an uppercase letter.
      [^A-Zn] any character EXCEPT an uppercase letter or a newline
      r*        zero or more r's, where r is any regular expression
      r+        one or more r's
      r?        zero or one r's (i.e., an optional r)
      {name} expansion of the "name" definition (see above)
      "[xy]"foo" the literal string: '[xy]"foo' (note escaped “)
      x        if x is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C
                interpretation of x. Otherwise, a literal 'x' (e.g., escape)
      rs        RE r followed by RE s (e.g., concatenation)
      r|s       either an r or an s
      <<EOF>> end-of-file
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.                          16
/* scanner for a toy Pascal-like language */
                           %{
                           #include <math.h> /* needed for call to atof() */
                           %}
                           DIG [0-9]
                           ID [a-z][a-z0-9]*
                           %%
                           {DIG}+             printf("Integer: %s (%d)n", yytext, atoi(yytext));

                           {DIG}+"."{DIG}* printf("Float: %s (%g)n", yytext, atof(yytext));

                           if|then|begin|end                     printf("Keyword: %sn",yytext);

                           {ID}                                  printf("Identifier: %sn",yytext);

                           "+"|"-"|"*"|"/"                       printf("Operator: %sn",yytext);

                           "{"[^}n]*"}"                         /* skip one-line comments */
                           [ tn]+                              /* skip whitespace */                17
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.

Mais conteúdo relacionado

Semelhante a 4 lexical (20)

Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler Design
 
Lifting 1
Lifting 1Lifting 1
Lifting 1
 
Pcd(Mca)
Pcd(Mca)Pcd(Mca)
Pcd(Mca)
 
PCD ?s(MCA)
PCD ?s(MCA)PCD ?s(MCA)
PCD ?s(MCA)
 
Principles of Compiler Design
Principles of Compiler DesignPrinciples of Compiler Design
Principles of Compiler Design
 
Pcd(Mca)
Pcd(Mca)Pcd(Mca)
Pcd(Mca)
 
Compiler Design Material 2
Compiler Design Material 2Compiler Design Material 2
Compiler Design Material 2
 
Intermediate code generation1
Intermediate code generation1Intermediate code generation1
Intermediate code generation1
 
Intermediate code generation
Intermediate code generationIntermediate code generation
Intermediate code generation
 
New compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdfNew compiler design 101 April 13 2024.pdf
New compiler design 101 April 13 2024.pdf
 
Lecture 12 intermediate code generation
Lecture 12 intermediate code generationLecture 12 intermediate code generation
Lecture 12 intermediate code generation
 
Lexicalanalyzer
LexicalanalyzerLexicalanalyzer
Lexicalanalyzer
 
Lexicalanalyzer
LexicalanalyzerLexicalanalyzer
Lexicalanalyzer
 
Interm codegen
Interm codegenInterm codegen
Interm codegen
 
Syntaxdirected
SyntaxdirectedSyntaxdirected
Syntaxdirected
 
Syntaxdirected
SyntaxdirectedSyntaxdirected
Syntaxdirected
 
Syntaxdirected (1)
Syntaxdirected (1)Syntaxdirected (1)
Syntaxdirected (1)
 
Predication
PredicationPredication
Predication
 
System Programming Unit IV
System Programming Unit IVSystem Programming Unit IV
System Programming Unit IV
 
LEC 7-DS ALGO(expression and huffman).pdf
LEC 7-DS  ALGO(expression and huffman).pdfLEC 7-DS  ALGO(expression and huffman).pdf
LEC 7-DS ALGO(expression and huffman).pdf
 

Último

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 

Último (20)

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 

4 lexical

  • 1. Chapter 4 Lexical analysis CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 1
  • 2. Scanner • Main task: identify tokens – Basic building blocks of programs – E.g. keywords, identifiers, numbers, punctuation marks • Desk calculator language example: read A sum := A + 3.45e-3 write sum write sum / 2 CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 2
  • 3. Formal definition of tokens • A set of tokens is a set of strings over an alphabet – {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …} • A set of tokens is a regular set that can be defined by comprehension using a regular expression • For every regular set, there is a deterministic finite automaton (DFA) that can recognize it – (Aka deterministic Finite State Machine (FSM)) – i.e. determine whether a string belongs to the set or not – Scanners extract tokens from source code in the same way DFAs determine membership CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 3
  • 4. Regular Expressions A regular expression (RE) is: • A single character • The empty string, ε • The concatenation of two regular expressions – Notation: RE1 RE2 (i.e. RE1 followed by RE2) • The union of two regular expressions – Notation: RE1 | RE2 • The closure of a regular expression – Notation: RE* – * is known as the Kleene star – * represents the concatenation of 0 or more strings • Caution: notations for regular expressions vary – Learn the basic concepts and the rest is just syntactic sugar CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 4
  • 5. Token Definition Example DIG • Numeric literals in Pascal, e.g. 1, 123, 3.1415, 10e-3, 3.14e4 * DIG • Definition of token unsignedNum . DIG → 0|1|2|3|4|5|6|7|8|9 DIG * unsignedInt → DIG DIG* e e unsignedNum → + unsignedInt - DIG DIG DIG (( . unsignedInt) | ε) ((e ( + | – | ε) unsignedInt) | ε) * DIG • Notes: – Recursion is not allowed! • FAs with epsilons are nondeterministic. – Parentheses used to avoid • NFAs are much harder to ambiguity implement (use backtracking) – It’s always possible to rewrite • Every NFA can be rewriten as a DFA (gets larger, tho) removing epsilons CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 5
  • 6. Simple Problem • Write a C program which reads in a character string, consisting of a’s and b’s, one character at a time. If the string contains a double aa, then print string accepted else print string rejected. • An abstract solution to this can be expressed as a DFA b a 1- 2 3+ a, b a Start state b An accepting state input a b The state transitions of a DFA can be encoded as a table current 1 2 1 which specifies the new state for a given current state and state 2 3 1 input 3 3 3 CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 6
  • 7. #include <stdio.h> main() { enum State {S1, S2, S3}; an approach in C enum State currentState = S1; int c = getchar(); while (c != EOF) { switch(currentState) { case S1: if (c == ‘a’) currentState = S2; if (c == ‘b’) currentState = S1; break; case S2: if (c == ‘a’) currentState = S3; if (c == ‘b’) currentState = S1; break; case S3: break; } c = getchar(); } if (currentState == S3) printf(“string acceptedn”); else printf(“string rejectedn”); } CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 7
  • 8. #include <stdio.h> Using a table main() { enum State {S1, S2, S3}; simplifies the enum Label {A, B}; enum State currentState = S1; program enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}}; int label; int c = getchar(); while (c != EOF) { if (c == ‘a’) label = A; if (c == ‘b’) label = B; currentState = table[currentState][label]; c = getchar(); } if (currentState == S3) printf(“string acceptedn”); else printf(“string rejectedn”); } CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 8
  • 9. Lex • Lexical analyzer generator – It writes a lexical analyzer • Assumption – each token matches a regular expression • Needs – set of regular expressions – for each expression an action • Produces – A C program • Automatically handles many tricky problems • flex is the gnu version of the venerable unix tool lex. – Produces highly optimized code CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 9
  • 10. Scanner Generators • E.g. lex, flex • These programs take a table as their input and return a program (i.e. a scanner) that can extract tokens from a stream of characters • A very useful programming utility, especially when coupled with a parser generator (e.g., yacc) • standard in Unix CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 10
  • 11. Lex example input lex cc foolex foo.l foolex.c foolex tokens > foolex < input > flex -ofoolex.c foo.l Keyword: begin > cc -ofoolex foolex.c -lfl Keyword: if Identifier: size Operator: > >more input Integer: 10 (10) begin Keyword: then Identifier: size if size>10 Operator: * then size * -3.1415 Operator: - end Float: 3.1415 (3.1415) Keyword: end CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 11
  • 12. A Lex Program DIG [0-9] … definitions … ID [a-z][a-z0-9]* %% %% {DIG}+ printf("Integern”); {DIG}+"."{DIG}* printf("Floatn”); … rules … {ID} [ tn]+ printf("Identifiern”); /* skip whitespace */ . printf(“Huh?n"); %% %% … subroutines … main(){yylex();} CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 12
  • 13. Simplest Example %% .|n ECHO; %% main() { yylex(); } CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 13
  • 14. Strings containing aa %% (a|b)*aa(a|b)* {printf(“Accept %sn”, yytext);} [ab]+ {printf(“Reject %sn”, yytext);} .|n ECHO; %% main() {yylex();} CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 14
  • 15. Rules • Each has a rule has a pattern and an action. • Patterns are regular expression • Only one action is performed – The action corresponding to the pattern matched is performed. – If several patterns match the input, the one corresponding to the longest sequence is chosen. – Among the rules whose patterns match the same number of characters, the rule given first is preferred. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 15
  • 16. x character 'x' Flex’s RE syntax . any character except newline [xyz] character class, in this case, matches either an 'x', a 'y', or a 'z' [abj-oZ] character class with a range in it; matches 'a', 'b', any letter from 'j' through 'o', or 'Z' [^A-Z] negated character class, i.e., any character but those in the class, e.g. any character except an uppercase letter. [^A-Zn] any character EXCEPT an uppercase letter or a newline r* zero or more r's, where r is any regular expression r+ one or more r's r? zero or one r's (i.e., an optional r) {name} expansion of the "name" definition (see above) "[xy]"foo" the literal string: '[xy]"foo' (note escaped “) x if x is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of x. Otherwise, a literal 'x' (e.g., escape) rs RE r followed by RE s (e.g., concatenation) r|s either an r or an s <<EOF>> end-of-file CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 16
  • 17. /* scanner for a toy Pascal-like language */ %{ #include <math.h> /* needed for call to atof() */ %} DIG [0-9] ID [a-z][a-z0-9]* %% {DIG}+ printf("Integer: %s (%d)n", yytext, atoi(yytext)); {DIG}+"."{DIG}* printf("Float: %s (%g)n", yytext, atof(yytext)); if|then|begin|end printf("Keyword: %sn",yytext); {ID} printf("Identifier: %sn",yytext); "+"|"-"|"*"|"/" printf("Operator: %sn",yytext); "{"[^}n]*"}" /* skip one-line comments */ [ tn]+ /* skip whitespace */ 17 CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.

Notas do Editor

  1. Makes parsing much easier (the parsers looks for relations between tokens) Other tasks: remove comments
  2. Automatic generation of scanners