Unlocking the Future of AI Agents with Large Language Models
Lexical Analysis
1. Challenge the future
Delft
University of
Technology
Course IN4303
Compiler Construction
Eduardo Souza, Guido Wachsmuth, Eelco Visser
Lexical Analysis
10. formal languages
Lexical Analysis
vocabulary Σ
finite, nonempty set of elements (words, letters)
alphabet
string over Σ
finite sequence of elements chosen from Σ
word, sentence, utterance
4
Recap: A Theory of Language
11. formal languages
Lexical Analysis
vocabulary Σ
finite, nonempty set of elements (words, letters)
alphabet
string over Σ
finite sequence of elements chosen from Σ
word, sentence, utterance
formal language λ
set of strings over a vocabulary Σ
λ ⊆ Σ*
4
Recap: A Theory of Language
13. formal grammars
Lexical Analysis
formal grammar G = (N, Σ, P, S)
nonterminal symbols N
terminal symbols Σ
production rules P ⊆ (N∪Σ)*
N (N∪Σ)*
× (N∪Σ)*
start symbol S∈N
5
Recap: A Theory of Language
14. formal grammars
Lexical Analysis
formal grammar G = (N, Σ, P, S)
nonterminal symbols N
terminal symbols Σ
production rules P ⊆ (N∪Σ)*
N (N∪Σ)*
× (N∪Σ)*
start symbol S∈N
5
Recap: A Theory of Language
nonterminal symbol
15. formal grammars
Lexical Analysis
formal grammar G = (N, Σ, P, S)
nonterminal symbols N
terminal symbols Σ
production rules P ⊆ (N∪Σ)*
N (N∪Σ)*
× (N∪Σ)*
start symbol S∈N
5
Recap: A Theory of Language
context
16. formal grammars
Lexical Analysis
formal grammar G = (N, Σ, P, S)
nonterminal symbols N
terminal symbols Σ
production rules P ⊆ (N∪Σ)*
N (N∪Σ)*
× (N∪Σ)*
start symbol S∈N
5
Recap: A Theory of Language
replacement
17. formal grammars
Lexical Analysis
formal grammar G = (N, Σ, P, S)
nonterminal symbols N
terminal symbols Σ
production rules P ⊆ (N∪Σ)*
N (N∪Σ)*
× (N∪Σ)*
start symbol S∈N
grammar classes
type-0, unrestricted
type-1, context-sensitive: (a A c, a b c)
type-2, context-free: P ⊆ N × (N∪Σ)*
type-3, regular: (A, x) or (A, xB)
5
Recap: A Theory of Language
18. right regular grammar
Lexical Analysis 6
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
Decimal Numbers
19. right regular grammar
Lexical Analysis 7
Id → “a” R
…
Id → “z” R
R → “a” R
…
R → “z” R
R → “0” R
…
R → “9” R
Id → “a”
…
Id → “z”
R → “a”
…
R → “z”
R → “0”
…
R → “9”
Identifiers
22. formal languages
Lexical Analysis
formal grammar G = (N, Σ, P, S)
derivation relation G ⊆ (N∪Σ)*
× (N∪Σ)*
w G w’
∃(p, q)∈P: ∃u,v∈(N∪Σ)*
:
w=u p v ∧ w’=u q v
8
Recap: A Theory of Language
23. formal languages
Lexical Analysis
formal grammar G = (N, Σ, P, S)
derivation relation G ⊆ (N∪Σ)*
× (N∪Σ)*
w G w’
∃(p, q)∈P: ∃u,v∈(N∪Σ)*
:
w=u p v ∧ w’=u q v
formal language L(G) ⊆ Σ*
L(G) = {w∈Σ*
| S G
*
w}
8
Recap: A Theory of Language
24. formal languages
Lexical Analysis
formal grammar G = (N, Σ, P, S)
derivation relation G ⊆ (N∪Σ)*
× (N∪Σ)*
w G w’
∃(p, q)∈P: ∃u,v∈(N∪Σ)*
:
w=u p v ∧ w’=u q v
formal language L(G) ⊆ Σ*
L(G) = {w∈Σ*
| S G
*
w}
classes of formal languages
8
Recap: A Theory of Language
39. regular expressions
Lexical Analysis 14
Num: (0|1|2|3|4|5|6|7|8|9)+
A valid Num is any word that consists of a sequence of
one or more digits.
Id: (a|…|z)(a|…|z|0|…|9)*
A valid Id is any word that starts with a lowercase letter
followed by sequence of zero or more lowercase letters or
digits.
Decimal Numbers & Identifiers
51. Lexical Analysis 16
Example
Gr:
S → ('A'= | … | 'Z')*(0 | … | 9)*
L(Gr) = The set of words that start with
zero or more capital letters, followed by
zero or more decimal digits.
54. formal definition
Lexical Analysis
finite automaton M = (Q, Σ, T, q0, F)
states Q
input symbols Σ
transition function T
start state q0∈Q
final states F⊆Q
18
Finite Automata
55. formal definition
Lexical Analysis
finite automaton M = (Q, Σ, T, q0, F)
states Q
input symbols Σ
transition function T
start state q0∈Q
final states F⊆Q
transition function
nondeterministic FA T : Q × Σ → P(Q)
NFA with ε-moves T : Q × (Σ ∪ {ε}) → P(Q)
deterministic FA T : Q × Σ → Q
18
Finite Automata
78. right regular grammar
Lexical Analysis
formal grammar G = (N, Σ, P, S)
finite automaton M = (N ∪ {f}, Σ, T, S, F)
28
NFA construction
79. right regular grammar
Lexical Analysis
formal grammar G = (N, Σ, P, S)
finite automaton M = (N ∪ {f}, Σ, T, S, F)
transition function T
(X, aY)∈P : (X, a, Y)∈T
(X, a)∈P : (X, a, f)∈T
28
NFA construction
80. right regular grammar
Lexical Analysis
formal grammar G = (N, Σ, P, S)
finite automaton M = (N ∪ {f}, Σ, T, S, F)
transition function T
(X, aY)∈P : (X, a, Y)∈T
(X, a)∈P : (X, a, f)∈T
final states F
(S, ε)∈P : F = {S, f}
else: F = {f}
28
NFA construction
81. Example
Lexical Analysis 29
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
NFA construction
82. Example
Lexical Analysis 29
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
NFA construction
N
f
83. Example
Lexical Analysis 29
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
NFA construction
N
f
84. Example
Lexical Analysis 29
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
NFA construction
N
f
0-9
85. Example
Lexical Analysis 29
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
NFA construction
N
f
0-9
0-9
86. Example
Lexical Analysis
formal grammar G = (N, Σ, P, S)
finite automaton M = (N ∪ {f}, Σ, T, S, F)
transition function T
(X, aY)∈P : (X, a, Y)∈T
(X, a)∈P : (X, a, f)∈T
final states F
(S, ε)∈P : F = {S, f}
else: F = {f}
30
NFA construction
G:
S → a
S → aA
A → aB
B → aC
C → aD
D → aS
92. ε elimination
Lexical Analysis
additional final states
• states with ε-moves into final states
• become final states themselves
additional transitions
• ε-move from source to target state
• transitions from target state
• add these transitions to the source state
36
NFA construction
93. ε elimination
Lexical Analysis
NFA construction
additional final states
• states with ε-moves into final states
• become final states themselves
additional transitions
• ε-move from source to target state
• transitions from target state
• add these transitions to the source state
37
Gr:
S → ('A'= | … | 'Z')*(0 | … | 9)*
95. eliminating nondeterminism
Lexical Analysis
nondeterministic finite automaton M = (Q, Σ, T, q0, F)
deterministic finite automaton M’ = (P(Q), Σ, T’, {q0}, F’)
transition function T’
• T’({q1,..., qn}, x) = T({q1,..., qn}, x) = T(q1, x) ∪ … ∪ T(qn, x)
final states F’ = {S⎮S⊆Q, S∩F ≠ ∅}
• all states that include a final state of the original NFA
39
Powerset construction
101. lessons learned
Lexical Analysis
Summary
What are the formalisms to describe regular languages?
• regular grammars
• regular expressions
• finite state automata
Why are these formalisms equivalent?
• constructive proofs
43
102. lessons learned
Lexical Analysis
Summary
What are the formalisms to describe regular languages?
• regular grammars
• regular expressions
• finite state automata
Why are these formalisms equivalent?
• constructive proofs
How can we generate compiler tools from that?
• implement DFAs
43
104. learn more
Lexical Analysis
Literature
Formal languages
Noam Chomsky: Three models for the description of language. 1956
J. E. Hopcroft, R. Motwani, J. D. Ullman: Introduction to Automata
Theory, Languages, and Computation. 2006
44
105. learn more
Lexical Analysis
Literature
Formal languages
Noam Chomsky: Three models for the description of language. 1956
J. E. Hopcroft, R. Motwani, J. D. Ullman: Introduction to Automata
Theory, Languages, and Computation. 2006
Lexical analysis
Andrew W. Appel, Jens Palsberg: Modern Compiler Implementation in
Java, 2nd edition. 2002
44
108. copyrights
Lexical Analysis
Pictures
Slide 1:
Book Scanner by Ben Woosley, some rights reserved
Slides 4, 5, 8:
Noam Chomsky by Fellowsisters, some rights reserved
Slide 6, 7, 11, 15:
Tiger by Bernard Landgraf, some rights reserved
Slide 18:
Coffee Primo by Dominica Williamson, some rights reserved
Slide 32:
Pine Creek by Nicholas, some rights reserved
47