2. Introduction to Compiler
• Compilers are basically “language translators”.
– Language translators switch texts from one language into
another, making sure that the translated version conforms to
the grammar and style rules of the target language.
• A compiler is a program that takes a program in one language (the source
program) as input and converts it into an equivalent program in another
language (the target program).
Source program → Compiler → Target program
3. Introduction to Compiler
• If errors are encountered during translation, the compiler reports
them as error messages.
• The compiler takes a source program written in a high-level language
such as C, Pascal, or FORTRAN and converts it into a low-level
language or machine language.
4. Process of Compiling
Stream of characters
→ Scanner → Stream of tokens
→ Parser → Parse/syntax tree
→ Semantic analyzer → Annotated tree
→ Intermediate code generator → Intermediate code
→ Code optimization → Intermediate code
→ Code generator → Target code
→ Code optimization → Target code
5. Compiler: Analysis-Synthesis Model
• The compilation can be done in two parts
– Analysis
– Synthesis
• Analysis part: the source program is read
and broken down into its constituent pieces.
– (The syntax and the meaning of the source string are
determined, and then an intermediate code is
created from the input source program.)
• Synthesis part: this intermediate form of the
source program is taken and converted into an
equivalent target program.
7. Analysis Part
• The analysis part is carried out in three sub-parts
– Lexical Analysis
• In this part the source program is read and broken
into a stream of strings. Such strings are called tokens
(a token is a group of characters that has a collective meaning).
– Syntax Analysis
• In this step the tokens are arranged in a hierarchical
structure that ultimately helps in finding the syntax of the
source string.
– Semantic Analysis
• In this step the meaning of the source string is determined.
8. Properties of Compiler
• It must be bug free
• It must generate correct machine code.
• The generated machine code must run fast.
• The compiler itself must run fast (compilation
time must be proportional to program size)
• The compiler must be portable (i.e., modular,
supporting separate compilation)
• It must give good diagnostics and error
messages.
• The generated code must work well with
existing debuggers.
9. Phases of Compiler – Lexical Analysis
• It is also called scanning.
• It breaks the complete source code into tokens.
– For example: total = count + rate * 10
• In the lexical analysis phase this statement is broken up into a
series of tokens as follows:
– The identifier total
– The assignment symbol
– The identifier count
– The plus sign
– The identifier rate
– The multiplication sign
– The constant number 10
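The token list above can be reproduced with a small scanner. The following is a minimal sketch, assuming regular-expression-based matching; the token category names (IDENTIFIER, ASSIGN, and so on) and the function name tokenize are illustrative and not taken from the slides.

```python
import re

# Token categories mirror the slide's wording; the names are illustrative.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (kind, lexeme) pairs; raise on an unknown character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":      # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for kind, lexeme in tokenize("total = count + rate * 10"):
    print(kind, lexeme)
```

The scanner reports seven tokens for the statement, in the same order as the list on the slide.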
10. Phases of Compiler – Syntax Analysis
• It is also called parsing.
• In this phase, the tokens generated by lexical analyzer are grouped
together to form a hierarchical structure.
• It determines the structure of the source string by grouping the
tokens together.
• The hierarchical structure generated in this phase is called a parse
tree or syntax tree.
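As an illustration of grouping tokens into a hierarchy, here is a minimal recursive-descent sketch for statements such as total = count + rate * 10. The grammar in the docstring, the nested-tuple tree shape, and the function name parse_assignment are assumptions made for this example. The sketch takes an already-split token list and does not check that leaves are valid identifiers or numbers.

```python
def parse_assignment(tokens):
    """Parse: assignment -> IDENT '=' expr ; expr -> term ('+' term)* ; term -> factor ('*' factor)*."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        return advance()                    # leaf: identifier or number

    def term():
        node = factor()
        while peek() == "*":                # '*' binds tighter than '+'
            advance()
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    target = advance()
    if peek() != "=":
        raise SyntaxError("expected '='")
    advance()
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

tree = parse_assignment("total = count + rate * 10".split())
print(tree)
```

Because term is called from inside expr, the multiplication ends up deeper in the tree than the addition, which is how the hierarchy encodes operator precedence.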
12. Phases of Compiler – Semantic Analysis
• It determines the meaning of the source string.
• For example, determining the meaning of the source string involves
matching parentheses in an expression, matching if…else statements,
checking that the operands of arithmetic operations are
type compatible, and checking the scope of operations.
13. Intermediate Representation
• Most compilers translate the source code into some form of
intermediate code.
• The intermediate code is later converted into machine code.
• Intermediate code forms include three-address code, quadruples,
triples, and postfix notation.
14. Example: Intermediate Code Generation
• Three-address code for total = count + rate * 10:
– t1 = inttofloat(10)
– t2 = rate * t1
– t3 = count + t2
– total = t3
• Syntax tree (root first): = ( total, + ( count, * ( rate, 10 ) ) )
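The following sketch shows one way such three-address code could be produced from a syntax tree. It assumes the tree is a nested tuple (operator, left, right), that count and rate are declared as float (the slide's int-to-float conversion implies this but does not say so), and that temporaries are named t1, t2, and so on. The names generate_tac and inttofloat are illustrative.

```python
def generate_tac(tree, types):
    """Emit three-address code for ('=', target, expr); `types` maps names to 'int' or 'float'."""
    code = []
    counter = 0

    def new_temp():
        nonlocal counter
        counter += 1
        return f"t{counter}"

    def widen(place, type_):
        """Insert an int-to-float conversion when an int meets a float operand."""
        if type_ == "float":
            return place
        temp = new_temp()
        code.append(f"{temp} = inttofloat({place})")
        return temp

    def visit(node):
        """Return (place, type) for the value computed by `node`."""
        if isinstance(node, str):
            return (node, "int") if node.isdigit() else (node, types[node])
        op, left, right = node
        left_place, left_type = visit(left)
        right_place, right_type = visit(right)
        result_type = "float" if "float" in (left_type, right_type) else "int"
        if result_type == "float":
            left_place = widen(left_place, left_type)
            right_place = widen(right_place, right_type)
        temp = new_temp()
        code.append(f"{temp} = {left_place} {op} {right_place}")
        return temp, result_type

    _, target, expression = tree
    place, _ = visit(expression)
    code.append(f"{target} = {place}")
    return code

tree = ("=", "total", ("+", "count", ("*", "rate", "10")))
assumed_types = {"total": "float", "count": "float", "rate": "float"}  # assumption
for line in generate_tac(tree, assumed_types):
    print(line)
```

With these assumed types the sketch emits the same four instructions as the slide: the conversion of 10, the multiplication, the addition, and the final copy into total.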
15. Code Optimization
• It attempts to improve the intermediate code.
• The goal is faster-executing code or lower memory consumption.
• Machine-independent code optimization
• Machine-dependent code optimization
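As one small example of a machine-independent improvement, the sketch below performs a conversion at compile time instead of at run time. It assumes three-address code written as text lines of the form "target = operands", assumes each temporary is assigned exactly once, and uses a hypothetical inttofloat operation. It is only one of many possible optimizations and is not presented as the method of any particular compiler.

```python
import re

def fold_int_to_float(code):
    """Evaluate `t = inttofloat(<integer literal>)` at compile time and substitute the result."""
    constants = {}
    optimized = []
    for line in code:
        match = re.fullmatch(r"(\w+) = inttofloat\((\d+)\)", line)
        if match:
            # Record the float literal; the instruction itself is no longer needed.
            constants[match.group(1)] = str(float(match.group(2)))
            continue
        target, expression = line.split(" = ", 1)
        operands = [constants.get(token, token) for token in expression.split()]
        optimized.append(f"{target} = {' '.join(operands)}")
    return optimized

before = ["t1 = inttofloat(10)", "t2 = rate * t1", "t3 = count + t2", "total = t3"]
for line in fold_int_to_float(before):
    print(line)
```

The four-instruction input shrinks to three instructions, with the literal 10.0 used directly in the multiplication.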
16. Code Generation
• In this phase the target code (machine code) is generated.
• The intermediate code instructions are translated into a sequence of
machine instructions:
– MOV rate, R1
– MUL #10.0, R1
– MOV count, R2
– ADD R2, R1
– MOV R1, total
17. Symbol Table Management
• It maintains and stores the identifiers (variables)
used in the program.
• It stores information about the attributes of each
identifier (attributes: type, scope, and
information about the storage allocated to it).
• It also stores information about the
subroutines (functions) used in the program:
– The number of arguments
– The types of these arguments
– The method of passing these arguments (call by value or by reference)
– The return type
18. Symbol Table Management
• Various phases use the symbol table:
– Semantic Analysis and Intermediate Code Generation need
to know the types of the identifiers used.
– Code Generation typically needs information about how much
storage is allocated to each identifier.
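A symbol table holding these attributes can be sketched as a dictionary keyed by scope and name. The class names Symbol and SymbolTable, the field names, and the example function area are all hypothetical. Real compilers typically use nested scope structures rather than this flat table.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Symbol:
    name: str
    kind: str                                   # "variable" or "function"
    type: Optional[str] = None                  # variable type
    scope: str = "global"
    size_bytes: int = 0                         # storage allocated to the identifier
    params: list = field(default_factory=list)  # for functions: (type, passing mode) pairs
    return_type: Optional[str] = None

class SymbolTable:
    def __init__(self):
        self._symbols = {}

    def insert(self, symbol):
        key = (symbol.scope, symbol.name)
        if key in self._symbols:
            raise KeyError(f"{symbol.name!r} already declared in scope {symbol.scope!r}")
        self._symbols[key] = symbol

    def lookup(self, name, scope="global"):
        """Search the given scope first, then fall back to the global scope."""
        return self._symbols.get((scope, name)) or self._symbols.get(("global", name))

table = SymbolTable()
table.insert(Symbol("rate", "variable", type="float", size_bytes=4))
table.insert(Symbol("area", "function",
                    params=[("float", "by value"), ("float", "by reference")],
                    return_type="float"))
print(table.lookup("rate").type, table.lookup("area").return_type)
```

Semantic analysis would call lookup to find an identifier's type, while code generation would read size_bytes.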
21. Grouping of Phases
Input program → Front End → Intermediate code → Back End → Output program
• Front end: Lexical Analysis, Parser, Semantic Analysis
• Back end: Code Optimizer, Code Generator
22. Compiler Development Approach
• Initially, compilers were divided into multiple passes so that
the compiler had to manage only one pass at a time.
• This approach was used because main memory was limited.
• Nowadays a two-pass compiler design is used.
– The front end translates the source code into an intermediate
representation.
– The back end works with the intermediate representation to
produce the machine code.
23. Compiler Development Approach
• In many cases optimizers and error checkers can be shared by both
phases if they work on the intermediate representation.
• Certain languages can also be compiled in a single pass
because of a few rules of the language, such as:
– All variable declarations are placed at the beginning
– Functions are declared before they are used
24. Types of Compiler
• Native code compiler
– A compiler designed to compile source code for the same
type of platform on which it runs.
• Cross compiler
– A compiler designed to compile source code for a platform
different from the one on which it runs.
– Such compilers are often used for developing embedded systems.
• Source-to-source compiler or transcompiler
– A compiler that takes high-level language source code as
input and outputs source code in another high-level language.
– For example, it may translate a program from Pascal to C. An
automatic parallelizing compiler will frequently take a
high-level language program as input, transform the
code, and annotate it with parallel code annotations.
25. Types of Compiler
• One pass Compiler
– A compiler that completes the whole compilation
process in a single pass,
– i.e., it traverses the whole source code only
once.
• Threaded Code Compiler
– A compiler that simply replaces a string
(e.g., the name of a subroutine) with an appropriate binary
code.
• Incremental Compiler
– A compiler that compiles only the changed lines
of the source code and updates the object code.
26. Types of Compiler
• Stage Compiler
– A compiler which converts the code into assembly
code only.
• Just-in-time Compiler
– A compiler which converts the code into machine
code after the program starts execution.
• Retargetable Compiler
– A compiler that can be easily modified to compile a
source code for different CPU architectures.
• Parallelizing Compiler
– A compiler capable of compiling code for a parallel
computer architecture.
27. Language Specification
• In a computer, all instructions are represented as strings.
– Instructions may be in the form of numbers, names, pictures
or sounds.
• Strings used in an organized manner form a language.
• Every programming language can be described by a grammar.
• The grammar tells us how to write a valid computer program.
• Program code is checked to see whether a string of statements is
syntactically correct.
28. Language Specification
• To design a language, we first have to define an alphabet.
– Alphabet: a finite non-empty set of symbols that
are used to form words (strings)
– Example: an alphabet might be a set like {a, b}.
• The symbol “∑” denotes an alphabet.
• If ∑ = {a, b}, then we can create strings like a, ab, aab,
abb, bba and so on; the null (empty) string is denoted by “ε”.
• The length of a string X is denoted by |X|. Then |aba| =
3, |a| = 1 and |ε| = 0.
– The concatenation of X and Y is denoted by XY.
– The set of all strings over an alphabet “∑” is
denoted by “∑*”.
29. Language Specification
– The set of nonempty strings over “∑” is denoted by
“∑+”.
– Languages are sets, so standard set operations such as union,
intersection and complementation apply to them.
• A language can be described by regular expressions or grammars,
which let us determine whether a given string belongs to the language or not.
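These definitions can be tried out directly with Python strings and sets. The sketch below enumerates only a finite slice of the set of all strings (up to length 3), because the full set is infinite. The helper name strings_up_to and the two sample languages are assumptions made for illustration.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """All strings over `alphabet` of length 0..max_len (a finite slice of the set of all strings)."""
    return {"".join(p) for n in range(max_len + 1)
            for p in product(sorted(alphabet), repeat=n)}

sigma = {"a", "b"}
sigma_star = strings_up_to(sigma, 3)      # includes the null string ""
sigma_plus = sigma_star - {""}            # nonempty strings only

print(len("aba"), len("a"), len(""))      # string lengths
print("ab" + "ba")                        # concatenation XY

# Two sample languages (hypothetical): strings starting with a, strings ending with b.
starts_with_a = {w for w in sigma_star if w.startswith("a")}
ends_with_b = {w for w in sigma_star if w.endswith("b")}
union = starts_with_a | ends_with_b
intersection = starts_with_a & ends_with_b
complement = sigma_star - starts_with_a   # complement relative to the finite slice
print(sorted(intersection))
```

Membership in a language is then an ordinary set test, for example "ab" in intersection.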
30. Regular Expressions
• A regular expression provides a concise and
flexible means to "match" (specify and recognize)
strings of text, such as particular characters,
words, or patterns of characters.
• The concept of regular expressions was first
popularized by utilities provided by Unix
distributions, in particular the editor ed and the
filter grep.
• A regular expression is written in a formal language
that can be interpreted by a regular expression
processor, which is a program that either serves as
a parser generator or examines text and identifies
parts that match the provided specification.
31. Regular Expressions
• Regular expressions are used by many text
editors, utilities, and programming languages to
search and manipulate text based on patterns.
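A short illustration of searching and manipulating text by pattern, using Python's standard re module. The three patterns are examples chosen for this sketch and do not come from the slides.

```python
import re

# Search: find every identifier-like word in a statement.
identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
print(identifier.findall("total = count + rate * 10"))

# Recognize: does a whole binary string contain 01 as a substring?
contains_01 = re.compile(r"[01]*01[01]*")
print(bool(contains_01.fullmatch("1101")), bool(contains_01.fullmatch("1110")))

# Manipulate: collapse runs of whitespace into a single space.
print(re.sub(r"\s+", " ", "a   b\tc"))
```

findall searches anywhere in the text, while fullmatch succeeds only when the entire string fits the pattern, which is the sense in which a regular expression "recognizes" a string.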
32. Finite Automata
• A finite-state machine (FSM) or finite-state
automaton (plural: automata), or simply a state
machine, is a mathematical model used to design
computer programs and digital logic circuits.
• It is conceived as an abstract machine that can be in
one of a finite number of states.
• The machine is in only one state at a time; the state it
is in at any given time is called the current state.
• One of the states is designated as the “starting state”.
• One or more states are designated as “final states”.
33. Finite Automata
• It can change from one state to another when initiated
by a triggering event or condition; this is called a
transition.
• A particular FSM is defined by a list of the possible
transition states from each current state, and the
triggering condition for each transition.
• Finite-state machines can model a large number of
problems, among which are electronic design
automation, communication protocol design, parsing
and other engineering applications.
34. Finite Automata
• States are represented as circles.
• Transitions are represented by arrows.
• Each arrow is labeled with a character or a set of characters that
cause the specified transition to occur.
• The starting state has an arrow entering it that is not connected to
anything else.
35. Finite Automata
• Deterministic Finite Automata (DFA)
– The machine can exist in only one state at any given
time
• Non-deterministic Finite Automata (NFA)
– The machine can exist in multiple states at the
same time
36. Deterministic Finite Automata
• A Deterministic Finite Automaton (DFA) consists of:
Q ==> a finite set of states
Σ ==> a finite set of input symbols (alphabet)
q0 ==> a start state
F ==> set of final states
δ ==> a transition function, which is a mapping Q × Σ → Q
A DFA is defined by the 5-tuple (Q, Σ, q0, F, δ).
37. How to use a DFA?
• Input: a word w in Σ*
– Question: Is w accepted by the DFA?
– Steps:
• Start at the “start state” q0.
• For every input symbol in the sequence w:
– Compute the next state from the current state, given the
current input symbol in w and the transition function.
• If, after all symbols in w are consumed, the current state is
one of the final states (F), then accept w; otherwise, reject
w.
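The acceptance steps above translate almost line for line into code. In this sketch the transition function is a dictionary keyed by (state, symbol). The function name dfa_accepts and the toy automaton (binary strings ending in 1, with states A and B) are hypothetical and chosen only to exercise the procedure.

```python
def dfa_accepts(delta, start, finals, word):
    """Run a DFA on `word` and report whether it ends in a final state."""
    state = start                           # start at q0
    for symbol in word:                     # consume one input symbol at a time
        state = delta[(state, symbol)]      # exactly one next state
    return state in finals                  # accept if the last state is in F

# Toy DFA (assumption): accepts binary strings that end in 1.
toy_delta = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "A", ("B", "1"): "B",
}
print(dfa_accepts(toy_delta, "A", {"B"}, "0101"))
print(dfa_accepts(toy_delta, "A", {"B"}, "0110"))
```

The run takes one dictionary lookup per input symbol, so deciding membership is linear in the length of w.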
38. Regular Languages
• Let L(A) be the language recognized by a DFA A.
– Then L(A) is called a “Regular Language”.
39. Example #1
• Build a DFA for the following language:
– L = {w | w is a binary string that contains 01 as a substring}
– Steps for building a DFA to recognize L:
• Σ = {0,1}
• Decide on the states: Q
• Designate start state and final state(s)
• δ: Decide on the transitions:
– Final states == same as “accepting states”
– Other states == same as “non-accepting states”
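One possible result of these steps is a three-state DFA. The state names and their meanings below are assumptions, since the slide's diagram is not available: q0 means no progress yet, q1 means the last symbol was 0, and q2 means 01 has been seen. The loop then checks the automaton against Python's substring test for every binary string up to length 8.

```python
from itertools import product

# State names are illustrative: q0 = no progress, q1 = last symbol was 0, q2 = "01" seen.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q2", ("q2", "1"): "q2",   # once 01 is seen, stay accepting
}
START, FINALS = "q0", {"q2"}

def contains_01(word):
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINALS

# Cross-check against the substring test for all binary strings of length 0..8.
for n in range(9):
    for bits in product("01", repeat=n):
        w = "".join(bits)
        assert contains_01(w) == ("01" in w)
print("DFA agrees with the substring test")
```

The exhaustive comparison is not a proof, but it is a cheap way to catch a wrong transition when designing a small automaton by hand.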
41. Non-deterministic Finite Automata (NFA)
• A Non-deterministic Finite Automaton (NFA)
– is of course “non-deterministic”,
– implying that the machine can exist in more than one state at
the same time
– Outgoing transitions could be non-deterministic
42. Non-deterministic Finite Automata (NFA)
• A Non-deterministic Finite Automaton (NFA) consists of:
– Q ==> a finite set of states
– Σ ==> a finite set of input symbols (alphabet)
– q0 ==> a start state
– F ==> set of final states
– δ ==> a transition function, which is a mapping Q × Σ → subset of Q
– An NFA is also defined by the 5-tuple (Q, Σ, q0, F, δ).
43. How to use an NFA?
• Input: a word w in Σ*
• Question: Is w acceptable by the NFA?
• Steps:
– Start at the “start state” q0
– For every input symbol in the sequence w do
– Determine all the possible next states from the
current state, given the current input symbol in w
and the transition function
– If after all symbols in w are consumed, at least one
of the current states is a final state then accept w;
– Otherwise, reject w.
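The same procedure can be sketched by tracking the set of all states the NFA could currently be in. The transition function maps (state, symbol) to a set of states, and a missing entry means no move. The function name nfa_accepts and the example automaton (binary strings ending in 01) are assumptions for illustration. This sketch does not handle ε-transitions, which the definition above does not include.

```python
def nfa_accepts(delta, start, finals, word):
    """Simulate an NFA by tracking every state it could be in."""
    current = {start}
    for symbol in word:
        # Union of all possible next states from all current states.
        current = {nxt for state in current
                   for nxt in delta.get((state, symbol), set())}
    return bool(current & finals)           # accept if at least one final state is reached

# Example NFA (assumption): accepts binary strings that end in 01.
nfa_delta = {
    ("q0", "0"): {"q0", "q1"},              # non-deterministic choice on 0
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
print(nfa_accepts(nfa_delta, "q0", {"q2"}, "1101"))
print(nfa_accepts(nfa_delta, "q0", {"q2"}, "0110"))
```

Note that q1 has no transition on 0 and q2 has none at all, which is allowed for an NFA: those paths simply die out.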
45. Differences: DFA vs. NFA
DFA
• All transitions are deterministic
– Each transition leads to exactly one state
• For each state, a transition on every possible symbol (alphabet)
should be defined
• Accepts input if the last state is in F
• Sometimes harder to construct because of the number of states
• Practical implementation is feasible
NFA
• Transitions are non-deterministic
– A transition could lead to a subset of states
• For each state, not all symbols necessarily have to be defined in
the transition function
• Accepts input if one of the last states is in F
• Generally easier than a DFA to construct
• Practical implementation has to be deterministic (so it needs
conversion to a DFA)
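The conversion mentioned in the last row is commonly done with the subset construction, in which each DFA state is a set of NFA states. The slides do not name a method, so this is a minimal sketch of that standard technique, assuming an NFA without ε-transitions. The function name nfa_to_dfa and the example NFA (binary strings ending in 01) are illustrative.

```python
from collections import deque

def nfa_to_dfa(nfa_delta, start, finals, alphabet):
    """Subset construction: build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset({start})
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in sorted(alphabet):
            target = frozenset(nxt for state in current
                               for nxt in nfa_delta.get((state, symbol), ()))
            dfa_delta[(current, symbol)] = target
            if target not in seen:          # a newly discovered DFA state
                seen.add(target)
                queue.append(target)
    dfa_finals = {subset for subset in seen if subset & finals}
    return dfa_delta, start_set, dfa_finals

# Example NFA (assumption): accepts binary strings that end in 01.
nfa_delta = {("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"}, ("q1", "1"): {"q2"}}
dfa_delta, dfa_start, dfa_finals = nfa_to_dfa(nfa_delta, "q0", {"q2"}, {"0", "1"})

def run_dfa(word):
    state = dfa_start
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in dfa_finals

print(len({state for state, _ in dfa_delta}), run_dfa("1101"), run_dfa("0110"))
```

Only reachable subsets are generated, so this example yields three DFA states rather than all eight subsets of {q0, q1, q2}.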
46. Construct a DFA to accept a string containing a zero
followed by a one.
47. Construct a DFA to accept a string containing two consecutive zeroes
followed by two consecutive ones
48. Grammars
• A grammar for any natural language such as Hindi,
Gujarati, English, etc. is a formal description of the
correctness of any kind of simple, complex or
compound sentence of that language.
• Grammar checks the syntactic correctness of a
sentence.
• Similarly, a grammar for a programming language is a
formal description of the syntax, form or construction,
of programs and individual statements written in that
programming language.
49. A formal grammar G is a 4-tuple
• G = {N, T, P, S}
– Where, N : Set of non-terminal symbols
– T : Set of terminal symbols
– P : Set of production rules or simply productions
– S : Start symbol
• Terminal
– Terminal symbols are literal characters that can appear in the inputs
to or outputs from the production rules of a formal grammar and that
cannot be broken down into "smaller" units. To be precise, terminal
symbols cannot be changed using the rules of the grammar.
• Non-terminal
– Nonterminal symbols, are the symbols which can be replaced; thus
there are strings composed of some combination of terminal and
nonterminal symbols.
50. Grammar
• Subject : The subject is the person, place, or thing
that acts, is acted on, or is described in the
sentence.
• Simple subject - a noun or a pronoun (e.g., she, he, cat, city)
• Complete subject - a noun or a pronoun plus any modifiers
(e.g., the black cat, the clouds in the sky)
• Adjectives : They are words that describe nouns or
pronouns. They may come before the word they
describe (That is a cute puppy.) or they may follow
the word they describe (That puppy is cute.).
51. Grammar
• Predicate : The predicate usually follows the subject and
tells what the subject does, has, or is, what is done to
it, or where it is. It is the action or description that
occurs in the sentence.
• Noun : A noun is a word used to refer to people,
animals, objects, substances, states, events and
feelings.
• Article : English has two types of articles: definite (the)
and indefinite (a, an.) The use of these articles
depends mainly on whether you are referring to any
member of a group, or to a specific member of a group
52. Grammar
• Verbs : Verbs are a class of words used to show the
performance of an action (do, throw, run), existence
(be), possession (have), or state (know, love) of a
subject.
• Direct Object : A direct object is a noun or pronoun
that receives the action of a "transitive verb" in an
active sentence or shows the result of the action. It
answers the question "What?" or "Whom?" after an
action verb.
• Consider the English statement below
– The small CD contains a large information.
53. Grammar
• Subject
– Article : the
– Adjective : small
– Noun : CD
• Predicate
– Verb : contains
– Direct object : a large information
• A direct object
– Article : a
– Adjective : large
– Noun : information
54. Grammar
• The small CD contains a large information.
1. <sentence> : <subject><predicate>
2. <subject> : <article><adjective><noun>
3. <predicate> : <verb><direct-object>
4. <direct-object> : <article><adjective><noun>
5. <article> : The | a
6. <adjective> : small | large
7. <noun> : CD | information
8. <verb> : contains
55. Generating a string in the language
• <sentence>
• ⇒ <subject><predicate>
• ⇒ <article><adjective><noun><predicate>
• ⇒ <article><adjective><noun><verb><direct-object>
• ⇒ <article><adjective><noun><verb><article><adjective><noun>
• ⇒ The small CD contains a large information
56. Grammar
• N = {sentence, subject, predicate, article, adjective, noun, verb,
direct-object}
• T = {The, a, small, large, CD, information, contains}
• S = sentence
• P={ <sentence> : <subject><predicate>
<subject> : <article><adjective><noun>
<predicate> : <verb><direct-object>
<direct-object> : <article><adjective><noun>
<article> : The | a
<adjective> : small | large
<noun> : CD | information
<verb> : contains
}
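The 4-tuple can be written down as plain data and used to derive a sentence. The sketch below assumes a dictionary representation of P, spells the noun as lowercase information to match T, and always expands the leftmost nonterminal. The function name leftmost_derivation and the choices list (which alternative to pick whenever a rule offers more than one) are illustrative.

```python
N = {"sentence", "subject", "predicate", "article", "adjective",
     "noun", "verb", "direct-object"}
T = {"The", "a", "small", "large", "CD", "information", "contains"}
S = "sentence"
P = {
    "sentence": [["subject", "predicate"]],
    "subject": [["article", "adjective", "noun"]],
    "predicate": [["verb", "direct-object"]],
    "direct-object": [["article", "adjective", "noun"]],
    "article": [["The"], ["a"]],
    "adjective": [["small"], ["large"]],
    "noun": [["CD"], ["information"]],
    "verb": [["contains"]],
}

def leftmost_derivation(choices):
    """Expand the leftmost nonterminal at each step; return every sentential form."""
    form = [S]
    steps = [list(form)]
    picks = iter(choices)
    while any(symbol in N for symbol in form):
        index = next(i for i, symbol in enumerate(form) if symbol in N)
        alternatives = P[form[index]]
        pick = next(picks) if len(alternatives) > 1 else 0
        form[index:index + 1] = alternatives[pick]
        steps.append(list(form))
    return steps

steps = leftmost_derivation([0, 0, 0, 1, 1, 1])
for form in steps:
    print(" ".join(f"<{s}>" if s in N else s for s in form))
```

The derivation ends when the sentential form contains only terminals, which here is the example sentence from the slides.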
57. The C Language Grammar (abbreviated)
• Terminals:
– if do while for switch break continue typedef struct return main
int long char float double void static ; ( ) a b c A B C 0 1 2 + * - / _ #
include += ++ ...
• Nonterminals:
– <statement> <expression> <C source file> <identifier> <digit>
<nondigit> <selection-statement>
<loop-statement>
• Start symbol: <C source file>
• A string:
#include <stdio.h>
int main(void)
{
printf("Hello World!\n");
return 0;
}
61. Hierarchy of Grammars
• Grammars can be divided into four classes by increasing
the restrictions on the form of the productions.
• This hierarchy is also known as the Chomsky hierarchy (1963).
• It consists of four classes:
– Type 0 : formal or unrestricted grammar
– Type 1 : context-sensitive grammar
– Type 2 : context-free grammar
– Type 3 : right linear or regular grammar
62. Type 0 Grammars
• These grammars, known as phrase structure grammars, contain
productions of the form
α ::= β
where both α and β can be strings of terminal and nonterminal symbols.
63. Type-1 Grammar
• These grammars are known as context-sensitive grammars.
• Derivation or reduction of strings can take place only in
specific contexts.
• αAβ ::= α∏β
– The string ∏ in a sentential form can be replaced by ‘A’ only when
it is enclosed by the strings α and β.
64. Type-2 Grammar
• These grammars are known as context-free grammars.
– A ::= ∏
65. Type-3 grammar
• These grammars are also known as linear or regular
grammars.