Ch-three Regular languages and regular grammar
1. Ch-three
Regular languages and regular
grammar
Chapter programme
Regular Language
Regular Expressions
Grammar
Regular Grammar
2. Introduction(prelude)
What is language?
May refer either to the (human) capacity for acquiring and using
complex systems of communication, or to a specific instance of such
a system of complex communication.
...and a formal language?
A language is formal if it is provided with a mathematically rigorous
representation of its:
Alphabet of symbols, and
Formation rules specifying which strings of symbols count as
well-formed.
3. Describing formal languages
Generative approach: A language is
the set of strings generated by a
grammar.
Generation process:
Start symbol
Expand with rewrite rules.
Stop when a word of the language
is generated.
A grammar that generates a regular
language is a regular grammar.
Recognition approach: A language is
the set of strings accepted by an
automaton.
Recognition process:
Start in initial state.
Transitions to other states guided
by the string symbols.
Continue until the whole string has been
read and an accept/reject state is reached.
4. Linguistics and formal language
Two application settings: Formal languages are studied in linguistics and computer
science.
Linguistics
In linguistics, formal languages are used for the scientific study of human language.
Linguists privilege a generative approach, as they are interested in defining a
(finite) set of rules stating the grammar based on which any reasonable sentence in
the language can be constructed.
A grammar does not describe the meaning of the sentences or what can be done
with them in whatever context, but only their form.
5. Computer science and formal
languages
Computer science
In computer science, formal languages are used for the precise definition of
programming languages and, therefore, in the development of compilers.
A compiler is a computer program (or set of programs) that transforms source code
written in a programming language (the source language) into another computer
language (the target language).
Computer scientists privilege a recognition approach based on abstract machines
(automata) that take in input a sentence and decide whether it belongs to the
reference language.
6. Grammars and formal languages
Chomsky
Noam Chomsky (born 1928) is an American linguist, philosopher, cognitive scientist,
historian, and activist.
In Syntactic Structures (1957), Chomsky models knowledge of language using a
formal grammar, by claiming that formal grammars can explain the ability of a
hearer/speaker to produce and interpret an infinite number of sentences with a
limited set of grammatical rules and a finite set of terms.
The human brain contains a limited set of rules for organizing language, known as
Universal Grammar. The basic rules of grammar are hard-wired into the brain, and
manifest themselves without being taught.
7. Conti…
Chomsky proposed a hierarchy that partitions formal grammars into classes with
increasing expressive power, i.e. each successive class can generate a broader set of
formal languages than the one before.
Interestingly, modeling some aspects of human language requires a more complex
formal grammar (as measured by the Chomsky hierarchy) than modeling others.
8. Chomsky Classification of Grammars
Grammar Type | Grammar (generator) | Language accepted | Automaton (recognizer)
Type 0 | Unrestricted grammar | Recursively enumerable language | Turing machine
Type 1 | Context-sensitive grammar | Context-sensitive language | Linear-bounded automaton
Type 2 | Context-free grammar | Context-free language | Pushdown automaton
Type 3 | Regular grammar | Regular language | Finite-state automaton
10. Automata and formal language
Stephen Kleene introduced finite automata in the 1950s: abstract state machines equipped
with a finite memory. He demonstrated the equivalence between such a model and the
description of sequences of symbols using only three logical primitives (set union, set
product, and iteration).
Alan Turing (and, independently, Emil Post and John Backus) put forward the ideas
underlying the notion of pushdown automata: abstract state machines with an
unbounded memory that is accessible only through a restricted mode (called a stack).
In 1936, Alan Turing described the Turing Machine: an abstract state machine
equipped with an infinite memory in the form of a strip of tape. Turing machines
simulate the logic of any computer algorithm and recognize the largest class of formal
languages.
11. Automata and grammar
We have already seen formal languages, their definition, and basic notions.
Which class of formal languages is recognized by a given type of
automaton? The Chomsky hierarchy answers this.
Languages described by deterministic finite acceptors (DFAs) are called regular
languages.
For any nondeterministic finite acceptor (NFA) we can find an equivalent DFA.
Thus, NFAs also describe regular languages.
Regular language
12. Conti…
A language L is said to be regular if there is a DFA (in fact, any FA) M such that L
= L(M). Sometimes we say that M recognizes L.
The set of strings in a regular language is a regular set.
From now on we use "regular set" and "regular language" interchangeably.
In fact, most languages are not regular!
The class of regular languages is the smallest class of languages in the hierarchy of
classes that we will consider.
To explicitly give an example of a language that is not regular, though, we will need
a proof technique called the pumping lemma.
Regular languages form a simple but important family of formal languages.
13. Conti…
They can be specified in several equivalent ways:
Using FA, as we have already seen
Using regular expressions, see next
Using regular grammars, see next
14. Regular expression(REX)
An FA (NFA or DFA) is a "blueprint" for constructing a machine recognizing a
regular language.
A regular expression (RE, or REX) is a "user-friendly", declarative way of describing a
regular language.
A Regular Expression is a string of symbols that describes a regular
language.
An algebraic expression (notation) for describing (some) languages.
Represents regular languages or regular sets: a language is regular IFF some
REX describes it.
Languages are basically sets of strings.
Each REX represents a set of strings; these sets of strings are regular sets.
15. cont…
For every regular set there is a corresponding REX.
Given a regular expression E, the language it describes is L(E) = { … set of strings … }.
R is a regular expression over an alphabet Σ exactly if R is one of the following:
a, for some a in Σ,
ε,
∅,
(R1 + R2), where R1 and R2 are regular expressions,
(R1 ∘ R2), where R1 and R2 are regular expressions,
(R1*), where R1 is a regular expression.
16. Conti…
Operators of REX represent operations on regular languages.
Regular language operations are operations that generate a regular language (RL) from
other RLs (union, concatenation, power, intersection, Kleene closure, plus, …).
Since RLs are sets, we use set operations (for example union, concatenation, Kleene closure).
For every RL operation there is a corresponding REX operator:
Star, *
Dot, ·
Plus, +
17. Operations on RLs
Examples:
L1 = {a, b} and L2 = {c, d}, then
union: L1 ∪ L2 = {a, b, c, d}.
concatenation: L1L2 = {ac, ad, bc, bd}.
Kleene closure: L1* = {ε, a, b, aa, ab, ba, bb, …}.
plus closure: L1+ includes all strings of L1* except ε.
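These set operations can be sketched directly on small finite languages; a minimal Python sketch (the function names are mine, not from the slides), with the Kleene closure truncated to a maximum length since L1* is infinite:

```python
def concat(L1, L2):
    """Concatenation: every string of L1 followed by every string of L2."""
    return {x + y for x in L1 for y in L2}

def kleene_star(L, max_len):
    """Kleene closure, truncated to strings of length <= max_len."""
    result = {""}          # epsilon is always in L*
    frontier = {""}
    while frontier:
        # extend every frontier string by one more string of L
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

L1, L2 = {"a", "b"}, {"c", "d"}
print(sorted(L1 | L2))             # union: ['a', 'b', 'c', 'd']
print(sorted(concat(L1, L2)))      # ['ac', 'ad', 'bc', 'bd']
print(sorted(kleene_star(L1, 2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```

The plus closure is then simply `kleene_star(L, n) - {""}` when ε is not in L.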
18. Conti…
Now look at which regular set (or RL) each REX represents.

REX | Corresponding regular set or RL
1. ∅ | {} (the empty set)
2. ε | {ε}
3. a, where a ∈ Σ | {a}
4. r1 and r2 | R1 and R2, respectively
5. r1 + r2 | R1 ∪ R2
6. r1r2 | R1R2, the concatenation
7. r1* | R1*, the Kleene closure
8. r1+ | R1+, the plus closure
19. Conti…
Suppose a, b, c, and d are symbols in Σ and are REXs; then the corresponding regular sets
are {a}, {b}, {c} and {d} respectively.
Now find the regular sets for the following REXs
REXs Regular set or RLs
( a + b ) {a} U {b} = { a, b }
ab
(a +b )( c + d )
(a +b ) c ( c + d )
a*
(ab)*
(a+b)*
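The exercise above can be checked mechanically with Python's re module, which writes the union operator + as | and concatenation as juxtaposition (the pattern strings below are my translations of the REXs):

```python
import re

# Textbook REX -> Python re: '+' becomes '|', dot is juxtaposition, '*' is '*'.
cases = {
    "a|b": ["a", "b"],                      # (a + b)
    "ab": ["ab"],                           # ab
    "(a|b)(c|d)": ["ac", "ad", "bc", "bd"], # (a + b)(c + d)
    "(ab)*": ["", "ab", "abab"],            # (ab)*
    "(a|b)*": ["", "a", "b", "ab", "aabb"], # (a + b)*
}
for pattern, strings in cases.items():
    for s in strings:
        # fullmatch tests membership of the whole string in L(pattern)
        assert re.fullmatch(pattern, s), (pattern, s)
print("all strings belong to their languages")
```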
20. Order of precedence for operators:
(Precedence Rules)
1. Star
2. Dot
3. Plus
The star (*) operation has the highest precedence.
The concatenation ( · ) operation is second in the precedence order.
The union (+) operation has the lowest precedence.
Parentheses can be omitted using these rules.
Example:
01 + 1 is grouped as (0·1)+1
a+bb*a can be parenthesized as a+(b(b*))a
The operations + and . in a regular expression satisfy the distributive law: For any
regular expressions r, s and t,
r(s + t) = rs + rt,
(r + s)t = rt + st.
21. Conti…
If L(r) = L(s), i.e. the language represented by
r equals the language represented by s,
then the two regular expressions are
equivalent, written r = s.
Example:
r = (a+b)*, L(r) = {ε, a, b, aa, ab, ba, bb, aaa, aab, …}
s = (a*b*)*, L(s) = {ε, a, b, aa, ab, ba, bb, aaa, aab, …}
Both represent the same language,
hence r = s.
REX simplification
Simplification rules:
(ε + r)* = r*
(ε + r)r* = r*
r*(ε + r) = r*
ε + rr* = r*
r∅ = ∅r = ∅ (annihilation)
∅ + r = r + ∅ = r (identity)
There are also many other proven properties
of REX.
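Equivalences like r = s can be tested (not proved) by enumerating all strings up to some length and comparing the matched sets; a small Python sketch using re (the helper language() is my own):

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=4):
    """All strings over the alphabet, up to max_len, matched by the pattern."""
    strings = [""] + ["".join(p) for n in range(1, max_len + 1)
                      for p in product(alphabet, repeat=n)]
    return {s for s in strings if re.fullmatch(pattern, s)}

# r = (a+b)* and s = (a*b*)* agree on every string up to the length bound
assert language("(a|b)*") == language("(a*b*)*")
# one instance of the distributive law r(s + t) = rs + rt, with r=a, s=b, t=c
assert language("a(b|c)", "abc") == language("ab|ac", "abc")
print("equivalent on all tested strings")
```

Agreement up to a length bound is evidence, not a proof; a real proof uses the algebraic identities or automaton equivalence.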
22. Equivalence of FA's and REX
Regular expressions and finite automata are equivalent in their descriptive power.
This fact is expressed in the following theorem:
Theorem
A set is regular if and only if it can be described by a regular expression.
We know REXs represent RLs accepted by FAs.
So, there is an FA accepting the RL represented by the RE.
For each REX there is an FA (equivalent to the REX) that accepts the RL represented by the REX.
We have already shown that DFA's, NFA's, and ε-NFA's are all equivalent.
Now see,
23. Conti…
To show FA's equivalent to REX we need to establish that
1. For every DFA A we can find (construct, in this case) a REX R with L(R) = L(A).
2. For every REX R there is an ε-NFA A with L(A) = L(R).
24. Conti…
We have a regular expression, call it R.
Convert R to an ε-NFA.
Conclude that the language must be regular, since it is accepted by some FA.
25. RE to ε-NFA
For every REX R there is an equivalent ε-NFA A with L(A) = L(R).
Automata for ∅, ε, and a.
27. Conti…
First construct an FA for the basic (primitive) REXs.
Then extend this method to any REX.
Example: convert (0+1)*1(0+1) to a finite automaton.
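Assuming the example expression is (0+1)*1(0+1), i.e. strings over {0, 1} whose second-to-last symbol is 1, its NFA can be simulated in a few lines of Python by tracking the set of reachable states (the state names q0, q1, q2 are mine):

```python
# NFA for (0+1)*1(0+1): guess nondeterministically that a 1 is the
# second-to-last symbol, then read exactly one more symbol.
delta = {
    ("q0", "0"): {"q0"},
    ("q0", "1"): {"q0", "q1"},   # nondeterministic choice on 1
    ("q1", "0"): {"q2"},
    ("q1", "1"): {"q2"},
}

def accepts(w, start="q0", final={"q2"}):
    states = {start}
    for sym in w:
        # follow every possible transition from every current state
        states = set().union(*(delta.get((q, sym), set()) for q in states))
    return bool(states & final)

print(accepts("0110"))  # True: the second-to-last symbol is 1
print(accepts("0101"))  # False
```

This subset-tracking loop is exactly the idea behind the NFA-to-DFA subset construction.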
28. What is grammar
We have learned two ways of characterizing regular languages: REXs and FA
Grammar is another way of characterizing them.
A grammar is a set of rewrite (production) rules, used to generate strings by successively
rewriting symbols.
Example: a language L is the set of strings of the symbol 'a' except the empty string:
{a, aa, aaa, …}.
Generate the strings of L by the following procedure.
Start the process from a symbol S, and rewrite S using one of the following rules:
S → a and S → aS, meaning S can be rewritten as either a (the first production) or
aS (the second production).
Purpose: to specify how to form grammatically correct strings in the language the
grammar represents.
29. Conti…
To generate the string aa, for example:
Start with S and apply the second rule to replace S with the right hand side of the
rule, i.e. aS (a sentential form), to obtain aS.
Then apply the first rule to aS to rewrite S as a. That gives us aa (a sentence).
We write S ⇒ aS to express that aS is obtained from S by applying a single production.
Thus the process of obtaining aa from S (a derivation) is written
S ⇒ aS ⇒ aa.
If we are not interested in the intermediate steps, the fact that aa is obtained from S
is written S ⇒* aa.
In general, if a string β is obtained from a string α by applying productions of a
grammar G, we write α ⇒*G β (β is derived from α).
We simply write α ⇒* β when it is clear which grammar we are using.
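The derivation process for S → a | aS can be sketched as a breadth-first search over sentential forms (a toy generator, not from the slides):

```python
from collections import deque

rules = {"S": ["a", "aS"]}   # S -> a | aS

def generate(max_len):
    """Collect all sentences of length <= max_len derivable from S."""
    words, queue = set(), deque(["S"])
    while queue:
        form = queue.popleft()        # a sentential form
        if "S" not in form:           # no non-terminals left: a sentence
            words.add(form)
            continue
        if len(form) > max_len:       # prune: forms only ever grow
            continue
        for rhs in rules["S"]:
            # one derivation step: rewrite the leftmost S
            queue.append(form.replace("S", rhs, 1))
    return words

print(sorted(generate(3)))   # ['a', 'aa', 'aaa']
```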
30. Formal definition
A grammar G is a mathematical object with four components:
G = (V, T, S, P), where
V is a nonempty, finite set of non-terminal symbols,
T is a finite set of terminal symbols, or alphabet; V and T are disjoint,
S ∈ V is the start symbol,
P is a finite set of rules (productions); the first rule uses the start symbol.
General form of a production: x → y, where the left hand side x is a string in
(V ∪ T)+ containing at least one symbol of V, and the right hand side y is a string
in (V ∪ T)*.
31. The language of G
The set of all strings that can be derived from the grammar.
The language generated by G:
L(G) = { w ∈ T* | S ⇒* w }
All strings in T* that can be produced from the start symbol by applying rules of
the grammar are in the language.
All others are not.
32. Derivation
For a grammar, a derivation of a string is a sequence of grammar rule applications
that transforms the start symbol into the string.
It proves that the string belongs to the grammar's language.
Leftmost derivation
Replace the leftmost non-terminal using one of the production rules.
Rightmost derivation
Replace the rightmost non-terminal using one of the production rules.
Both generate the same string in the language.
33. Parse tree(parsing, derivation, syntax tree)
A tree representation of a derivation.
Start symbol is root node.
Non- terminals are interior nodes.
Terminals are leaf nodes.
Represents the syntactic structure of a string, in the manner of the abstract syntax
trees used in computer programming.
Syntax tree: a tree representation of the abstract syntactic structure of source code
written in a programming language.
Built by parsers during the source code translation and compiling process.
34. Regular grammars
Generate regular languages. (To prove this, convert RG → FA; since we then have an FA
for the language, it is regular.)
– One of the simplest families of formal languages.
Are linear.
– At most one non-terminal on the right hand side of each production rule.
Right-linear
– The left hand side of each production is a single non-terminal.
– The right hand side is a string of terminals followed by at most one non-terminal.
– If the right hand side is a single terminal, or a single terminal followed by a
single non-terminal: strictly right-linear.
Left-linear
– The left hand side is as in right-linear grammars: a single non-terminal.
– The right hand side is at most one non-terminal followed by a string of terminals.
– If the right hand side is a single terminal, or a single non-terminal followed by a
single terminal: strictly left-linear.
Regular grammars are left-linear or right-linear.
If right-linear and left-linear rules are mixed in a single grammar, the result will not
necessarily generate a regular language. (Assignment: give an example.)
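The RG → FA conversion mentioned above can be sketched for a strictly right-linear grammar: each variable becomes a state, a rule A → aB becomes a transition from A to B on a, and A → a goes to a new final state F. A minimal Python sketch using the made-up grammar S → aA, A → bA | b (which generates ab+):

```python
# Rules as (lhs, terminal, rhs) triples; rhs None means the rule A -> a.
rules = [("S", "a", "A"), ("A", "b", "A"), ("A", "b", None)]

# Build the transition relation of the equivalent NFA.
delta = {}
for lhs, terminal, rhs in rules:
    delta.setdefault((lhs, terminal), set()).add(rhs if rhs else "F")

def accepts(w):
    states = {"S"}                     # start in the state of the start symbol
    for sym in w:
        states = set().union(*(delta.get((q, sym), set()) for q in states))
    return "F" in states               # accept iff the final state is reached

print(accepts("abb"))   # True:  S => aA => abA => abb
print(accepts("a"))     # False: no complete derivation ends here
```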
35. Chapter four
Context free languages and grammars
chapter plan
context-free language
context-free grammar
parsing and ambiguity
context-free grammars and programming
languages
36. Conti…
Context free languages (CFL)
– More general than regular languages.
– Include the languages used in programming languages.
– Recognized by an automaton: the PDA.
– Generated by type 2 grammars.
Context free grammars (CFG)
– Type 2 grammars.
– Generate CFLs.
– More powerful than regular grammars.
– The left hand side of each production is a single non-terminal.
– The right hand side can be any string in (V ∪ T)* (no restriction).
For a given CFG G, if w is in L(G) then there is a leftmost and a rightmost
derivation of w.
37. Conti…
Context-free grammars are powerful enough to describe the syntax of most
programming languages; in fact, the syntax of most programming languages is
specified using context-free grammars.
On the other hand, context-free grammars are simple enough to allow the
construction of efficient parsing algorithms which, for a given string, determine
whether and how it can be generated from the grammar.
38. Parsing and ambiguity
Parsing a string is finding a derivation (derivation tree) for that string.
It is like recognizing the string.
A CFG is ambiguous if it has more than one parse tree for some string.
– More than one rightmost and/or leftmost derivation for some string.
– At least one string in L(G) which has more than one derivation tree.
Ambiguity is bad:
– It leaves the meaning of some programs ill-defined.
– It creates conflicts.
It is common in programming languages:
– Arithmetic expressions
– If-then-else
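The classic ambiguous grammar E → E+E | E*E | id illustrates this: the string id+id*id has two parse trees, (id+id)*id and id+(id*id). A toy Python routine (my own counting scheme, not a real parser) counts them by trying every top-level operator split:

```python
from functools import lru_cache

TOKENS = ("id", "+", "id", "*", "id")   # the string id+id*id

@lru_cache(maxsize=None)
def parses(i, j):
    """Number of parse trees deriving TOKENS[i:j] from E."""
    if j - i == 1:
        return 1 if TOKENS[i] == "id" else 0   # E -> id
    total = 0
    for k in range(i + 1, j - 1):              # k: position of the top operator
        if TOKENS[k] in "+*":                  # E -> E+E or E -> E*E
            total += parses(i, k) * parses(k + 1, j)
    return total

print(parses(0, 5))   # 2: (id+id)*id  and  id+(id*id)
```

An unambiguous expression grammar removes this by encoding operator precedence into separate non-terminals (expression/term/factor).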
39. Context free language and programming languages
Most programming languages are CFLs (CFL = L(CFG)).
A program that we write is a string of a CFL.
For this CFL there is a grammar (CFG) that generates it.
The CFG is predetermined within the compiler (i.e. the parser).
The syntax (parse) tree describes the syntactic structure of the programming
language.
Abstract syntax trees are created during the compilation process for the program (code).
The structure of the string (program) is described by the syntax (parse) tree.
The parser then checks whether the string (program) is syntactically correct.
If the written code is generated by the CFG in the compiler, the program is
syntactically correct (no syntax error); otherwise it fails (syntax error).
41. CFG Simplification
In a CFG, it may happen that not all the production rules and symbols are needed
for the derivation of strings (useless productions and useless symbols).
Besides, there may be some null productions and unit productions.
Elimination of these productions and symbols is the simplification of CFGs.
Simplification essentially comprises the following steps −
Reduction of CFG
Removal of Unit Productions
Removal of Null Productions
42. Reduced CFG
The reduced CFG, G', is the CFG we get after removing useless symbols and productions.
CFGs are reduced in two phases − a two-step procedure to obtain G' with L(G') = L(G).
Phase 1 − Remove those variables which do not derive any terminal string.
Objective − Obtain an equivalent grammar G' from the CFG G such that each
variable derives some terminal string.
Derivation Procedure −
Step 1 − Include all symbols, W1, that derive some terminal, and initialize i = 1.
Step 2 − Include all symbols, Wi+1, that derive Wi.
Step 3 − Increment i and repeat Step 2, until Wi+1 = Wi.
Step 4 − Include all production rules that have Wi in them.
43. Conti…
Phase 2 − Derivation of an equivalent grammar, G'', from the CFG, G', such that
each symbol appears in a sentential form.
Objective − Remove those variables and terminals which are not present in any string
derivable from the start symbol S.
Derivation Procedure −
Step 1 − Include the start symbol in Y1 and initialize i = 1.
Step 2 − Include all symbols, Yi+1, that can be derived from Yi, and
include all production rules that have been applied.
Step 3 − Increment i and repeat Step 2, until Yi+1 = Yi.
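The two phases can be sketched in Python on a toy grammar (made up for illustration) in which B never derives a terminal string and C is unreachable from S:

```python
# Productions: S -> A, A -> a | aB, B -> Bb, C -> c
productions = [("S", ["A"]), ("A", ["a"]), ("A", ["a", "B"]),
               ("B", ["B", "b"]), ("C", ["c"])]
terminals = {"a", "b", "c"}

# Phase 1: keep variables that derive some terminal string (B does not).
generating = set()
changed = True
while changed:
    changed = False
    for lhs, rhs in productions:
        if lhs not in generating and all(s in terminals | generating for s in rhs):
            generating.add(lhs)
            changed = True
p1 = [(l, r) for l, r in productions
      if l in generating and all(s in terminals | generating for s in r)]

# Phase 2: keep symbols reachable from the start symbol S (C is not).
reachable = {"S"}
changed = True
while changed:
    changed = False
    for lhs, rhs in p1:
        if lhs in reachable:
            for s in rhs:
                if s not in terminals and s not in reachable:
                    reachable.add(s)
                    changed = True
reduced = [(l, r) for l, r in p1 if l in reachable]

print(reduced)   # [('S', ['A']), ('A', ['a'])]
```

Doing Phase 1 before Phase 2 matters: removing non-generating rules first can make further symbols unreachable.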
44. Removal of ε-productions
A → ε is a null production.
If ε ∉ L(G), we can remove the null productions from G.
Let G2 = (V2, T, P2, S) be the grammar obtained by removing the ε-productions from
G = (V, T, P, S).
If L(G) contains ε, then L(G2) = L(G) − {ε};
else L(G2) = L(G).
Removal of unit productions
Unit productions are of the form X → Y (a single variable on both sides),
where X and Y are variables.
X → X can be removed immediately.
45. Summary
To sum up, every CFL without ε can be generated by a grammar without
Useless symbols,
ε-productions, and
Unit productions.
The grammar G that generates the above language is obtained by the following steps:
1. Eliminate all ε-productions.
2. Eliminate all unit productions.
3. Eliminate all useless symbols.
This sequence guarantees that unwanted variables and productions are removed.
Hence, the grammar is simplified.
46. Normal forms for CFG
You know from the general production rule that a CFG has a single non-terminal on the
left hand side, and anything can be on the right hand side.
To be applicable, an arbitrary CFG must have some specific form to describe the
language.
– The RHS must have some particular way of describing the language, i.e. must
be restricted.
Chomsky Normal Form –– CNF
Greibach Normal Form –– GNF
Backus–Naur Form –– BNF
Kuroda Normal Form –– KNF
47. Conti…
Every context-free grammar that doesn't generate the empty string can be
transformed into an equivalent one in normal form.
Here, equivalent means the two grammars generate the same language.
Motivations for CNF (theoretical and practical implications):
1. Every string of length n is derivable in 2n−1 steps (in CNF); if you know the length you
can start parsing, hence this is really useful.
2. The proof of the pumping lemma for CFLs.
3. Every CFG has an equivalent NPDA; if you know the grammar in CNF then it is easy
to construct a machine that recognizes the same language.
4. The CYK parsing algorithm, an algorithm for membership. For instance, given a CFG, one
can use the CNF to construct an O(n^3)-time algorithm (CYK) to decide whether a given
string is in the language represented by that grammar or not.
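The CYK algorithm itself fits in a few lines; a sketch for a tiny CNF grammar assumed for illustration (S → AB, A → a, B → b, so the language is {ab}):

```python
from itertools import product

unit = {"a": {"A"}, "b": {"B"}}      # terminal rules: A -> a, B -> b
binary = {("A", "B"): {"S"}}         # binary rule:    S -> AB

def cyk(w, start="S"):
    n = len(w)
    if n == 0:
        return False                 # CNF grammars here derive no empty string
    # table[i][j] = set of variables deriving the substring w[i : i+j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, sym in enumerate(w):      # spans of length 1: terminal rules
        table[i][0] = set(unit.get(sym, set()))
    for length in range(2, n + 1):           # longer spans, bottom-up
        for i in range(n - length + 1):
            for split in range(1, length):   # split point inside the span
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for pair in product(left, right):
                    table[i][length - 1] |= binary.get(pair, set())
    return start in table[0][n - 1]

print(cyk("ab"))   # True
print(cyk("ba"))   # False
```

The three nested loops over span length, start position, and split point give the O(n^3) bound quoted above.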
48. Chomsky Normal Form (CNF)
In formal language theory, a context-free grammar G is said to be in Chomsky
normal form if all of its production rules are of the form:
A → BC, or
A → a, or
S → ε,
where A, B, and C are non-terminal symbols, a is a terminal symbol,
S is the start symbol, and
ε denotes the empty string.
Also, neither B nor C may be the start symbol.
Additionally, we permit the rule S → ε, where S is the start variable, for technical
reasons; this means that we allow it as one of many possible rules.
49. Conti…
If G is a CFG which generates a language consisting of at least one string along
with ε, then there is another CFG, G2 (in CNF), such that L(G2) = L(G) − {ε}, with no
null productions, unit productions or useless symbols.
Example, in Chomsky Normal Form:
S → SA | a
A → AS | b
Not in Chomsky Normal Form: a grammar with a rule such as S → ABa.
50. From any context-free grammar (which doesn't produce ε) not in Chomsky Normal
Form
we can obtain:
an equivalent grammar in Chomsky Normal Form
In general:
The Procedure
First remove:
1. Nullable variables
2. Unit productions
3. Long productions (right hand side longer than 2)
4. Useless symbols
51. Conversion to Chomsky Normal Form
Example:
S → ABa
A → aab
B → Ac
This grammar is not in Chomsky Normal Form; convert it to Chomsky Normal Form.
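A sketch of the mechanical part of that conversion: the TERM step (replace terminals in long rules by new variables) and the BIN step (break right hand sides longer than 2 into binary rules). The variable names Ta, Tb, Tc, V1, V2 are my own:

```python
# Grammar S -> ABa, A -> aab, B -> Ac; uppercase = variable, lowercase = terminal.
grammar = {"S": [["A", "B", "a"]], "A": [["a", "a", "b"]], "B": [["A", "c"]]}

def to_cnf(grammar):
    new, counter, term_var = {}, 0, {}
    def var_for(t):
        # TERM: a fresh variable Tt with the single rule Tt -> t
        if t not in term_var:
            term_var[t] = f"T{t}"
            new[f"T{t}"] = [[t]]
        return term_var[t]
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            # TERM: in rules longer than 1, replace each terminal by its variable
            body = [s if s.isupper() else (s if len(rhs) == 1 else var_for(s))
                    for s in rhs]
            # BIN: peel off the last two symbols into a fresh variable
            while len(body) > 2:
                counter += 1
                v = f"V{counter}"
                new.setdefault(v, []).append(body[-2:])
                body = body[:-2] + [v]
            new.setdefault(lhs, []).append(body)
    return new

for lhs, rhss in sorted(to_cnf(grammar).items()):
    for rhs in rhss:
        print(lhs, "->", " ".join(rhs))
```

A full conversion would also run the ε-, unit- and useless-symbol eliminations first, as listed on the previous slide; this sketch assumes they are already done.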
52. Observations
Chomsky normal forms are good for parsing and proving
theorems
It is easy to find the Chomsky normal form for any context-free
grammar