1. Token, Pattern and Lexeme
Token
A token is a name for a class of valid character sequences (lexemes) in the source program. In a programming language,
• keywords,
• constants,
• identifiers,
• numbers,
• operators and
• punctuation symbols
are possible tokens to be identified.
Lexemes
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that token.
Pattern
A pattern describes the rule that a sequence of characters (a lexeme) must match to form a token. It can
be defined by regular expressions or grammar rules. In the case of a keyword as a token, the pattern is just
the sequence of characters that form the keyword.
Example: c=a+b*5;
Lexemes and tokens
Lexeme   Token
c        identifier
=        assignment symbol
a        identifier
+        + (addition symbol)
b        identifier
*        * (multiplication symbol)
5        5 (number)
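The mapping above can be sketched as a small pattern-matching lexer. This is an illustrative sketch, not a production scanner: the token names and regular-expression rules below are assumptions chosen to reproduce the table, and rule order matters (number is tried before identifier).

```python
import re

# Illustrative (token name, pattern) rules; each pattern is a regular
# expression tried in order at the current input position.
TOKEN_RULES = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("assign",     r"="),
    ("plus",       r"\+"),
    ("star",       r"\*"),
    ("semicolon",  r";"),
]

def tokenize(source):
    """Return (token, lexeme) pairs by matching rules left to right."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():        # whitespace separates lexemes
            pos += 1
            continue
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, source[pos:])
            if m:                        # lexeme matched this pattern
                tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"lexical error at position {pos}")
    return tokens

print(tokenize("c=a+b*5;"))
```

Running it on the example input produces the same lexeme/token pairs as the table: c and a and b become identifiers, 5 becomes a number, and each symbol becomes its own token.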
Attributes of Tokens
The lexical analyzer collects information about tokens into their associated attributes. As a practical
matter, a token usually has only a single attribute: a pointer to the symbol-table entry in which the
information about the token is kept; the pointer becomes the attribute of the token.
For example, let num be the token representing an integer. When a sequence of digits appears in the input stream, the
lexical analyzer will pass num to the parser. The value of the integer will be passed along as an attribute of
the token num. Logically, the lexical analyzer passes both the token and the attribute to the parser.
If we write a token and its attribute as a tuple enclosed between < and >, the input 33 + 89 - 60 is transformed into
the sequence of tuples <num, 33> <+, > <num, 89> <-, > <num, 60>
The token "+" has no attribute. The second components of the tuples, the attributes, play no role during
parsing, but are needed during translation.
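The token/attribute tuples above can be sketched with a hand-written scanner. This is a minimal sketch, assuming a language of integers, +, and -; `None` stands in for "no attribute".

```python
def scan(source):
    """Return <token, attribute> tuples: a number token carries its
    integer value as its attribute; operators carry no attribute."""
    tuples = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1                            # consume the whole digit run
            tuples.append(("num", int(source[i:j])))  # attribute = value
            i = j
        elif ch in "+-":
            tuples.append((ch, None))             # token with no attribute
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return tuples

print(scan("33 + 89 - 60"))
# [('num', 33), ('+', None), ('num', 89), ('-', None), ('num', 60)]
```

Note that the parser only needs the first component of each tuple; the attributes ride along for use in later translation phases.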
The token names and associated attribute values for the Fortran statement E = M * C ** 2
are written below as a sequence of pairs.
<id, pointer to symbol-table entry for E>
<assign-op>
<id, pointer to symbol-table entry for M>
<mult-op>
<id, pointer to symbol-table entry for C>
<exp-op>
<number, integer value 2>
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
For instance, if the string fi is encountered for the first time in a C program,
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser.
A character sequence that cannot be scanned into any valid token is a lexical error.
Lexical errors are uncommon, but they still must be handled by a scanner.
Misspellings of identifiers, keywords, or operators are considered lexical errors.
Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a
token.
Error Recovery Strategies
The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the
remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.
This recovery technique may confuse the parser, but in an interactive computing environment, it may be
quite adequate.
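Panic-mode recovery can be sketched by extending a rule-matching scanner: when no rule matches at the current position, delete that character and try again. The rules and the error-reporting style here are illustrative assumptions, not a fixed recovery API.

```python
import re

# Illustrative rules; on a character no rule matches, panic-mode
# recovery deletes successive characters until scanning can resume.
RULES = [
    ("num",  r"\d+"),
    ("id",   r"[A-Za-z_]\w*"),
    ("plus", r"\+"),
]

def tokenize_with_recovery(source):
    tokens, deleted, pos = [], [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for name, pattern in RULES:
            m = re.match(pattern, source[pos:])
            if m:
                tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            deleted.append(source[pos])  # delete the offending character
            pos += 1                     # and resume scanning after it
    return tokens, deleted

print(tokenize_with_recovery("a + @# 12"))
# ([('id', 'a'), ('plus', '+'), ('num', '12')], ['@', '#'])
```

Here the illegal characters @ and # are simply dropped, and scanning resumes at the well-formed token 12; the parser sees a clean token stream, which is why this crude strategy is often adequate in practice.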
The following are the error-recovery actions in lexical analysis:
1. Deleting an extraneous character.
2. Inserting a missing character.
3. Replacing an incorrect character by a correct character.
4. Transposing two adjacent characters.