25 March 2023 Hande Çelikkanat
A (Long) Introduction
to AntLR
Slides adapted from:
– AntLR Reference Manual by Terence Parr
antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
– AntLR Tutorial by Ashley J.S. Mills
http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
– An Introduction to AntLR by Terence Parr
http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html
– An AntLR Tutorial by Scott Stanchfield
javadude.com/articles/antlrtut/
AntLR
ANother Tool for Language Recognition
(or anti-LR??)
an LL(k) parser and translator generator tool
which can create
– lexers
– parsers
– abstract syntax trees (ASTs)
in which you describe the language grammatically
and in return receive a program that can recognize and
translate that language
Lexer
A source file is streamed to the lexer character by character by
some kind of input interface.
The lexer groups characters into tokens that are meaningful to
the parser.
A token may be
– a keyword
– an identifier
– a symbol
– an operator
The lexer also removes comments and whitespace from the program,
which are meaningless to the parser.
So it creates a stream of tokens, which are received one by one by the
parser.
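The grouping step described above can be sketched by hand in Java. This is only an illustration of what a lexer does, not code generated by AntLR; the class name LexerSketch and the token labels ID, INT and SYM are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of a lexer; NOT AntLR-generated code.
class LexerSketch {
    // Groups characters into token strings and discards whitespace.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace never reaches the parser
            } else if (Character.isLetter(c)) {
                int start = i; // identifier: a letter followed by letters or digits
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add("ID(" + input.substring(start, i) + ")");
            } else if (Character.isDigit(c)) {
                int start = i; // integer: one or more digits
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("INT(" + input.substring(start, i) + ")");
            } else {
                tokens.add("SYM(" + c + ")"); // any other single character
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x1 = 42 + y"));
    }
}
```

The parser would then consume this list one token at a time instead of looking at raw characters.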
Parser
The parser organizes the tokens into the allowed sequences defined by the
grammar of the language.
If the parser encounters a sequence of tokens that matches none of the allowed
sequences, it issues an error.
A design choice is whether to try to recover from the error by making
assumptions.
Parsers may either do syntax-directed translation on the fly,
or convert the sequences of tokens into an Abstract Syntax Tree (AST).
An AST is a structure which
– keeps information in an easily traversable form (such as an operator at a node,
operands at the children of the node)
– ignores form-dependent superficial details
More on ASTs later...
The parser also generates one or more symbol tables, which contain information
about the tokens it encounters.
What does a grammar file look like?
It is composed of rules
ANTLR accepts three types of grammar specifications
parsers
lexers
tree-parsers (also called tree-walkers)
Uses LL(k) analysis for all
So the grammar specifications are similar, and the
generated lexers and parsers behave similarly
Sample File Divided (1/3)
• A grammar file may contain an arbitrary number of parsers, lexers, and
tree-parsers
– a separate class file will be generated for each
– e.g., YourLexerClass.class, YourParserClass.class,
YourTreeParserClass.class
• Header:
– a preamble that is placed at the top of each of these
classes
– an import, maybe?
Sample File Divided (2/3)
• Options
– file-wide
– charVocabulary = '\0'..'\377'; // defines the alphabet (used by complement and wildcard)
– k=2; // means two characters of lookahead
• Class specific:
{ ... header for parser class only ...}
class MyParser extends Parser;
options { ...parser options... }
{
parser class members
}
parser rules
Sample File Divided (3/3)
• Rules in EBNF notation:
(taken from the AntLR tutorial by Ashley J.S. Mills)
You simply list a set of lexical rules that match tokens. The tool automatically
generates code to map the next input character(s) to a rule likely to match:
in effect, a big "switch" that routes recognition flow to the appropriate rule.
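A rough picture of such a "switch", written by hand in Java with two characters of lookahead (k=2). The code that AntLR actually generates looks different; the class name SwitchLexerSketch and the method nextTokenType are hypothetical, and the token names are borrowed from the operator rules in these slides.

```java
// Hypothetical hand-written dispatcher; AntLR's real generated code differs.
class SwitchLexerSketch {
    // Returns the name of the rule predicted at position pos, using up to
    // two characters of lookahead (la1 and la2).
    static String nextTokenType(String in, int pos) {
        char la1 = in.charAt(pos);
        char la2 = pos + 1 < in.length() ? in.charAt(pos + 1) : '\0';
        switch (la1) {
            case ':': return la2 == '=' ? "BECOMES" : "COLON"; // ":=" vs ':'
            case '<': return la2 == '=' ? "LTE" : "LT";        // "<=" vs '<'
            case ';': return "SEMI";
            case '+': return "PLUS";
            case '-': return "MINUS";
            case '*': return "TIMES";
            case '/': return "DIV";
            default:
                if (Character.isDigit(la1)) return "INT";
                throw new IllegalArgumentException("unexpected char: " + la1);
        }
    }
}
```

The second lookahead character is only consulted where two rules share a first character, which is why a larger k is needed for operators such as ":=" and "<=".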
Lexer
With one restriction:
• Rules defined within a lexer grammar must have a name beginning
with an uppercase letter
taken from AntLR tutorial of Ashley J.S Mills
Lexer Rules
You can define operators like:
BECOMES : ":=" ;
COLON : ':' ;
SEMI : ';' ;
EQUALS : '=' ;
LBRACKET : '[' ;
RBRACKET : ']' ;
LPAREN : '(' ;
RPAREN : ')' ;
LT : '<' ;
LTE : "<=" ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
And then you can define a token class such as:
OPS : (PLUS | MINUS | TIMES | DIV) ;
Actions
Blocks of source code (expressed in the target language) enclosed in curly braces
Executed
after the preceding production element has been recognized
before the recognition of the following element
Typically used to generate output, construct trees, or modify a symbol table
An action's position dictates when it is executed relative to the surrounding grammar elements.
If it is the first element of a production, it is executed before any other element in that production, but only if
that production is predicted by the lookahead
rule_name :
(
{init-action}:
{action of 1st production} production_1
| {action of 2nd production} production_2
)?
;
The init-action is executed regardless of what (if anything) is matched in the optional subrule.
The init-actions are placed within the loops generated for subrules (...)+ and (...)*.
Tip: Skipping Tokens
Whitespace has no role in the grammar:
WS :
(' ' | '\n' | '\t')
{ $setType(Token.SKIP); } // action
;
The action means: do not pass this token to the parser. Recognize
it and then throw it away.
Same for comments ;)
Tip: Newline Stuff
The line number of the input is used for reporting errors
It must be incremented by hand when the lexer encounters a
newline:
WS :
( ' ' | '\t' | '\f'
// handle newlines
| (
"\r\n" // DOS/Windows
| '\r' // Macintosh
| '\n' // Unix
)
// increment the line count; this action is executed only in the newline case
{ newline(); }
)
{ $setType(Token.SKIP); }
;
Parser
class ExprParser extends Parser;
expr:
mexpr ((PLUS|MINUS) mexpr)* ;
mexpr :
atom (STAR atom)* ;
atom:
INT
| LPAREN expr RPAREN ;
• Rules defined within a parser grammar must have a name beginning
with a lowercase letter
Tip: Keywords and Literals (1/2)
Many languages have a general "identifier" lexical rule, and keywords that are special
cases of the identifier pattern
A typical identifier token may be defined as:
ID : LETTER (LETTER | DIGIT)*;
So how can AntLR understand that "if" is not an identifier?
You put fixed keywords into a literals table,
which is checked after each token is matched.
Any double-quoted string used in a parser is automatically entered into the literals
table of the associated lexer.
subprogramBody :
(basicDecl)*
(procedureDecl)*
"begin"
(statement)*
"end" IDENT ;
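The literals-table check can be pictured with a small hand-written Java sketch. It is not AntLR's implementation: the class LiteralsTableSketch, the method lookup and the token type constants are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a literals table; not AntLR's own code.
class LiteralsTableSketch {
    // Made-up token type numbers for this example.
    static final int IDENT = 1, LITERAL_BEGIN = 2, LITERAL_END = 3;

    // Keywords that appeared as double-quoted strings in the parser.
    static final Map<String, Integer> LITERALS = new HashMap<>();
    static {
        LITERALS.put("begin", LITERAL_BEGIN);
        LITERALS.put("end", LITERAL_END);
    }

    // Called after the identifier rule has matched some text: if the text
    // is a known keyword, its keyword type replaces the default type.
    static int lookup(String text, int defaultType) {
        return LITERALS.getOrDefault(text, defaultType);
    }
}
```

With this table, the text "begin" is reported as a keyword token, while "beginning" still comes out as an ordinary identifier.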
Tip: Keywords and Literals (2/2)
option testLiterals
By default, ANTLR will generate code in all lexer rules to test each
token against the literals table
However, you may suppress this code generation in the lexer by using
a grammar option:
class L extends Lexer;
options { testLiterals=false; }
...
If you turn this option off for a lexer, you may re-enable it for specific
rules
ID options { testLiterals=true; }
: LETTER (LETTER | DIGIT)*;
Tip: Token Object Creation
You will sometimes want to access information about the token being
matched
Label lexical rules and obtain a Token object representing the text,
token type, line number, etc... matched for that rule reference
Lexer rule:
INT : ('0'..'9')+ ;
Lexer rule that labels its reference to INT:
INDEX :
'[' i:INT ']'
{System.out.println(i.getText());} ;
Tip: Syntactic / Semantic Predicates
There are other situations where you have to turn on and
off certain rules
depending on prior context or semantic information
Use “predicates” to decide
Syntactic Predicates
ANTLR (tree) parsers usually use only a single symbol of lookahead, which is normally
not a problem as intermediate forms are explicitly designed to be easy to walk
However, there is occasionally the need to distinguish between similar tree structures
Syntactic predicates can be used to overcome the limitations of limited fixed lookahead
For example, distinguishing between the unary and binary minus operator:
expr: ( #(MINUS expr expr) )=> #( MINUS expr expr )
| #( MINUS expr )
...
;
The order of evaluation is very important as the second alternative is a "subset" of the
first alternative
Syntactic predicates are a form of selective backtracking and, therefore, actions are
turned off while evaluating a syntactic predicate so that actions do not have to be
undone
Semantic Predicates
Semantic predicates
– at the start of an alternative: decides whether or not to match
– in the middle of productions: throw exceptions when they evaluate to
false
stat:
{isTypeName(LT(1))}? ID ID ";" // declaration "type varName;"
| ID "=" expr ";" // assignment
;
decl: "var" ID ":" t:ID
{ isTypeName(t.getText()) }? // throws an exception if it evaluates to false
;
Eg: Keeping State Information
Context-sensitive recognition example:
If you are matching tokens that separate rows of data such as "----",
you probably only want to match this if the "begin table" sequence
has been found
BEGIN_TABLE :
'[' {this.inTable=true;} // enter table context
;
ROW_SEP :
{this.inTable}? "----" // semantic predicate
;
END_TABLE :
']' {this.inTable=false;} // exit table context
;
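The effect of the inTable flag can be imitated in plain Java. The following is a hand-written sketch under simplifying assumptions, not what AntLR generates: the class TableLexerSketch is hypothetical, and every character that matches no rule is reported as a CHAR token.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written imitation of the context-sensitive lexer rules above.
class TableLexerSketch {
    private boolean inTable = false; // state kept between tokens

    List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '[') {
                inTable = true;               // enter table context
                out.add("BEGIN_TABLE");
                i++;
            } else if (c == ']') {
                inTable = false;              // exit table context
                out.add("END_TABLE");
                i++;
            } else if (inTable && in.startsWith("----", i)) {
                out.add("ROW_SEP");           // predicate true: rule is enabled
                i += 4;
            } else {
                out.add("CHAR(" + c + ")");   // fallback for everything else
                i++;
            }
        }
        return out;
    }
}
```

Dashes outside the brackets fall through to the fallback branch, because the predicate is false there and the ROW_SEP alternative is switched off.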
The Java Code
The code to invoke the parser:
import java.io.*;
class Main {
public static void main(String[] args) {
try {
// use DataInputStream to grab bytes
MyLexer lexer = new MyLexer(new DataInputStream(System.in));
MyParser parser = new MyParser(lexer);
int x = parser.expr();
System.out.println(x);
} catch(Exception e) {
System.err.println("exception: "+e);
}
}
}
Running AntLR
In Linux
runantlr <antlr_file>.g
javac *.java
java Main
In Windows
Eclipse has a very easy-to-use plugin for AntLR
See http://antlreclipse.sourceforge.net/ for very detailed
instructions
The plugin will run AntLR on the grammar file
Expression Evaluation 1:
Syntax-Directed Translation
To evaluate the expressions on the fly as the tokens come in, add actions to the parser:
class ExprParser extends Parser;
expr returns [int value=0] {int x;} :
value=mexpr
(
PLUS x=mexpr {value += x;}
| MINUS x=mexpr {value -= x;}
)* ;
mexpr returns [int value=0] {int x;} :
value=atom
( STAR x=atom {value *= x;} )* ;
atom returns [int value=0] :
i:INT {value=Integer.parseInt(i.getText());}
| LPAREN value=expr RPAREN ;
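To see what the actions amount to, here is a hand-written recursive-descent evaluator in Java with one method per grammar rule. It is a sketch, not the parser AntLR would generate: the class ExprEvalSketch is hypothetical, it reads characters directly instead of tokens, and it strips all whitespace up front as a simplification.

```java
// Hand-written recursive-descent evaluator mirroring the rules above.
class ExprEvalSketch {
    private final String in;
    private int pos = 0;

    ExprEvalSketch(String input) {
        this.in = input.replaceAll("\\s+", ""); // simplification: drop whitespace
    }

    static int eval(String input) {
        ExprEvalSketch p = new ExprEvalSketch(input);
        int v = p.expr();
        if (p.pos != p.in.length())
            throw new IllegalArgumentException("trailing input at " + p.pos);
        return v;
    }

    // expr : mexpr ( (PLUS | MINUS) mexpr )* ;
    int expr() {
        int value = mexpr();
        while (pos < in.length() && (peek() == '+' || peek() == '-')) {
            char op = in.charAt(pos++);
            int x = mexpr();
            if (op == '+') value += x; else value -= x; // the rule's actions
        }
        return value;
    }

    // mexpr : atom ( STAR atom )* ;
    int mexpr() {
        int value = atom();
        while (pos < in.length() && peek() == '*') {
            pos++;
            value *= atom();
        }
        return value;
    }

    // atom : INT | LPAREN expr RPAREN ;
    int atom() {
        if (pos < in.length() && peek() == '(') {
            pos++;
            int value = expr();
            expect(')');
            return value;
        }
        int start = pos;
        while (pos < in.length() && Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("expected INT at " + pos);
        return Integer.parseInt(in.substring(start, pos));
    }

    private char peek() { return in.charAt(pos); }

    private void expect(char c) {
        if (pos >= in.length() || in.charAt(pos) != c)
            throw new IllegalArgumentException("expected " + c + " at " + pos);
        pos++;
    }
}
```

Because mexpr sits below expr in the call chain, multiplication binds tighter than addition without any explicit precedence table.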
Expression Evaluation 2:
via AST Intermediate Form
A more powerful strategy than syntax-directed translation is
to build an AST:
intermediate representation that holds all or most of the
input symbols and has encoded, in the structure of the
data, the relationship between those tokens
For this kind of tree, you will use a tree walker to compute
the same values as before, but using a different strategy
The utility of ASTs becomes clear when you must do
multiple walks over the tree to figure out what to
compute or to do tree rewrites, morphing the tree
towards another language.
Abstract Syntax Trees
Abstract Syntax Tree: Like a parse tree, without unnecessary
information
Two-dimensional trees that can encode the structure of the input as
well as the input symbols
Either
homogeneous: all objects of the same type; e.g., CommonAST in
ANTLR
or heterogeneous: multiple types such as PlusNode, MultNode...
An AST for (3+4) might be represented as a PLUS node whose children are 3 and 4
No parentheses are included in the tree!
AST Construction
To get ANTLR to generate a useful AST :
– turn on the buildAST option
– add a few suffix operators
class ExprParser extends Parser;
options { buildAST=true; }
expr: mexpr ((PLUS^|MINUS^) mexpr)* ;
mexpr : atom (STAR^ atom)* ;
atom: INT | LPAREN! expr RPAREN! ;
No changes in the Lexer.
AST Operators
AST root operator
By default, AntLR adds the tokens it matches as siblings in a flat list, so no token becomes a root on its own
We usually want to control this, e.g., to make operators the roots of their operands
A token suffixed with the “^” root operator forces that token as the root of the
current tree:
expr: mexpr ((PLUS^|MINUS^) mexpr)* ;
AST exclude operator.
Tokens / rule references suffixed with the exclude operator are not included
in the AST
e.g., for parentheses:
atom: INT | LPAREN! expr RPAREN! ;
AST Parsing and Evaluation
Rule format is like #(A B C);
which means "match a node of type A, and then descend into its list of children
and match B and C".
This notation can be nested arbitrarily, using #(...) for child trees
eg, #(A B #(C D) );
class ExprTreeParser extends TreeParser;
expr returns [int r=0] { int a,b; } :
#(PLUS a=expr b=expr) {r = a+b;}
| #(MINUS a=expr b=expr) {r = a-b;}
| #(STAR a=expr b=expr) {r = a*b;}
| i:INT {r = (int)Integer.parseInt(i.getText());} ;
Important: The tree parser looks for sufficient matches, not exact matches. As long as the tree satisfies the
pattern, a match is reported, regardless of how much is left unparsed
For example, #( A B ) also matches the larger tree #( A #(B C) D).
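The tree walk can be imitated in plain Java to show the idea. This sketch does not use the antlr library: AstWalkSketch and its Node class are hypothetical stand-ins for a homogeneous AST, and node text takes the place of token types.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone illustration of walking an expression AST.
class AstWalkSketch {
    // Homogeneous node: every node is of the same Java type.
    static final class Node {
        final String text;
        final List<Node> children;
        Node(String text, Node... children) {
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    // One case per #( ... ) pattern of the tree-parser rule.
    static int eval(Node n) {
        switch (n.text) {
            case "+": return eval(n.children.get(0)) + eval(n.children.get(1));
            case "-": return eval(n.children.get(0)) - eval(n.children.get(1));
            case "*": return eval(n.children.get(0)) * eval(n.children.get(1));
            default:  return Integer.parseInt(n.text); // INT leaf
        }
    }

    // LISP-style rendering: root first, then the children.
    static String toLisp(Node n) {
        if (n.children.isEmpty()) return n.text;
        StringBuilder sb = new StringBuilder("(").append(n.text);
        for (Node c : n.children) sb.append(' ').append(toLisp(c));
        return sb.append(')').toString();
    }
}
```

For the input (3+4)*5 the tree would be a "*" node over a "+" subtree and the leaf 5; the parentheses survive only as tree structure.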
in Java
The code to launch the parser and the tree walker:
import java.io.*;
import antlr.CommonAST;
import antlr.collections.AST;
class Calc {
public static void main(String[] args) {
try {
CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
CalcParser parser = new CalcParser(lexer);
parser.expr(); // Parse the input expression
CommonAST t = (CommonAST)parser.getAST();
System.out.println(t.toStringList()); // Print the resulting tree out in LISP notation
CalcTreeWalker walker = new CalcTreeWalker(); // Traverse the tree created by the parser
int r = walker.expr(t);
System.out.println("value is "+r);
} catch(Exception e) {
System.err.println("exception: "+e);
}
}
}
AST Construction by Hand
In some cases, you may want to transform a tree yourself, e.g., to optimize addition with zero
class CalcTreeWalker extends TreeParser;
options {
buildAST = true; // "transform" mode
}
expr:
! #(PLUS left:expr right:expr) // '!' turns off auto transform
{
if ( #right.getType()==INT && Integer.parseInt(#right.getText())==0 ) // x+0 = x
{
#expr = #left;
}
else if ( #left.getType()==INT && Integer.parseInt(#left.getText())==0 ) // 0+x = x
{
#expr = #right;
}
else // x+y
{
#expr = #(PLUS, left, right);
}
}
| #(STAR expr expr) // use auto transformation
| i:INT
;
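The same rewrite can be expressed as an ordinary recursive function over a small binary tree. This is a sketch outside AntLR: ZeroAddSketch and its Node class are hypothetical, and a leaf counts as zero only if its text is exactly "0".

```java
// Plain-Java version of the "addition with zero" tree rewrite.
class ZeroAddSketch {
    static final class Node {
        final String text;
        final Node left, right; // both null for a leaf
        Node(String text, Node left, Node right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
        static Node leaf(String text) { return new Node(text, null, null); }
        boolean isZero() { return left == null && text.equals("0"); }
        @Override public String toString() {
            return left == null ? text : "(" + text + " " + left + " " + right + ")";
        }
    }

    // Returns a new tree in which x+0 and 0+x are replaced by x.
    static Node simplify(Node n) {
        if (n.left == null) return n;          // leaf: nothing to do
        Node l = simplify(n.left);
        Node r = simplify(n.right);
        if (n.text.equals("+")) {
            if (r.isZero()) return l;          // x+0 = x
            if (l.isZero()) return r;          // 0+x = x
        }
        return new Node(n.text, l, r);         // x+y, or any other operator
    }
}
```

Simplifying the children first lets nested cases such as (0+3)+0 collapse all the way down to 3 in a single pass.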
in Java
The code to launch the parser and tree transformer is:
import java.io.*;
import antlr.CommonAST;
import antlr.collections.AST;
class Calc {
public static void main(String[] args) {
try {
CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
CalcParser parser = new CalcParser(lexer);
parser.expr(); // Parse the input expression
CommonAST t = (CommonAST)parser.getAST();
System.out.println(t.toLispString()); // Print the resulting tree out in LISP notation
CalcTreeWalker walker = new CalcTreeWalker();
walker.expr(t); // Traverse the tree created by the parser
t = (CommonAST)walker.getAST(); // Get the result tree from the walker
System.out.println(t.toLispString());
} catch(Exception e) {
System.err.println("exception: "+e);
}
}
}
Left Recursion Solved
E → E + T | T written in AntLR as expr: expr PLUS term | term;
The generated code would call expr() forever without consuming any input:
expr()
{
expr();
match(PLUS);
term();
}
Eliminate left recursion by
E → TE’
E’ → +TE’ | ε
results in:
expr: term (PLUS term)* ;
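In hand-written Java, the loop form becomes a while loop rather than a recursive call, so it always terminates. The sketch below is illustrative only: LeftRecursionSketch is a hypothetical class, each term is a single token, and the result is a parenthesized string that shows how the operands are grouped.

```java
// Loop-based parse of: expr : term (PLUS term)* ;
class LeftRecursionSketch {
    // Returns a fully parenthesized string showing how operands are grouped.
    static String parse(String[] tokens) {
        if (tokens.length == 0) throw new IllegalArgumentException("empty input");
        int pos = 0;
        String tree = tokens[pos++];                      // first term
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++;                                        // match(PLUS)
            if (pos >= tokens.length)
                throw new IllegalArgumentException("missing term after +");
            tree = "(" + tree + " + " + tokens[pos++] + ")"; // fold to the left
        }
        if (pos != tokens.length)
            throw new IllegalArgumentException("unexpected token: " + tokens[pos]);
        return tree;
    }
}
```

Each pass through the loop wraps everything parsed so far as the left operand, so the result is still left-associative, just as with the original left-recursive rule.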
Links
• AntLR Reference Manual by Terence Parr
antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
• AntLR Tutorial by Ashley J.S. Mills
http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
• An Introduction to AntLR by Terence Parr
http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html
• An AntLR Tutorial by Scott Stanchfield
javadude.com/articles/antlrtut/