|
Assignment 3
|
|
|
Parser - Syntax Tree Generation
|
Due: November 11
|
Objectives
In this course you will be developing an interpreter for a high-level
language, called HL, which manipulates ordered lists.
In this assignment you will be using JavaCC and JJTree
to build the component of HL's parser which generates ASTs for HL programs.
This should familiarize you with the use of compiler generators
to produce parse trees and abstract syntax trees.
The
learning objectives
addressed in this assignment are
4Cd, 4Bc, 4Bd, 4Ab. 4De is also inferred from 4Cd.
Some of these learning objectives will be identified in the description of this assignment,
and some in the tests.
Preparation
The first three steps of the preparation for this assignment are the same as for the previous assignments,
but there is some additional work because you will be using JJTree to generate ASTs in addition to JavaCC.
- Create a directory on the moons where you will do your work, with a subdirectory called Tests,
for example:
mkdir 710A3
cd 710A3
mkdir Tests
- Copy all the files in the
Handouts directory
into your working directory.
These files are located on the moons at ~cps710/public_html/doc/F2024/A3/Handouts:
cp ~cps710/public_html/doc/F2024/A3/Handouts/* .
cp ~cps710/public_html/doc/F2024/A3/Handouts/src/* .
- Make the four testing scripts t, runtestsa, runtestsb, runtestsc executable by giving them user x perms:
chmod u+x t runtests*
- JJTree Set up:
In this assignment you will be using JJTree.
JJTree is a JavaCC preprocessor which generates code to construct ASTs.
It is bundled with JavaCC.
JJTree is going to generate many AST classes, one for each type of AST node.
These classes will be stored in a subdirectory called AST.
You will want to delete all of these classes each time you
make modifications to your AST (this has been implemented in the makefile).
Therefore it is useful to have them all in the same directory,
particularly since you won't be modifying them for this assignment.
JJTree does have options to generate packages, but
there are open tickets on these features and in packaging in general
in the JavaCC forum. Try at your own risk.
Instead, you can simply set things up to put all the AST modules
in one directory.
All that is needed to set up the workspace for JJTree is to
create the subdirectory AST,
and move into that directory the SimpleNode.java file shown in class
which has a new toString() method to better print tokens:
mkdir AST
mv SimpleNode.java AST
- HL.jjt setup: (explanation only - no steps required)
For this assignment, you will not be able to use your own solution to the second assignment because the grammar has been changed
significantly. Instead you will be using the file
HL.jjt which is handed out.
This section is an explanation of the relationship between HL.jj and HL.jjt.
In terms of content, the grammar that was handed out in assignment 2 has been cleaned up
and it is now LL1 except for a few lookaheads, including some syntactic lookaheads that
will not slow down the interpretation noticeably,
practically speaking given how the language is likely to be used.
You can compare this grammar with the one in assignment 2 to see the choices made to resolve the
problems you encountered in Assignment 2.
The grammar has also been extended to better incorporate Boolean values where they might be desirable for HL programmers.
In terms of structure, HL.jjt is more or less the same file as HL.jj except that it has a .jjt extension instead of a .jj extension,
and a few other changes to support JJTree:
- The following JavaCC/JJTree directives (options) have been added at the beginning of the file:
MULTI=true; // This will generate one AST class for each non-suppressed non-terminal
JJTREE_OUTPUT_DIRECTORY="AST"; // This will put all your AST classes in the AST directory
VISITOR=true; // This won't be used until the next assignment, but will be needed to make your assignment compile properly
Comment: according to the JJTree documentation, you can also set these
as command-line options. However, it is simpler to put everything inside HL.jjt.
- In HL.jjt, the start() production has been broken down into two productions start() and S().
start() is generic code to return the AST constructed during parsing,
and S() is the starting production for HL (which was called start in assignment 2).
The code for start() is:
SimpleNode start() #void :
{}
{ S() { return (SimpleNode) (jjtree.popNode()); }
| < EOF > {throw new ParseException("End of File.");}
}
JJtree keeps a stack of the ASTs it constructs and when parsing is complete,
the top of the stack contains the AST that has been created.
start() returns this AST to the function that called it in TestHL.java.
It also handles end of file.
- Run make:
Explanation of the
makefile:
Make sure that everything compiles properly before proceeding further.
Assignment
By default, JJTree constructs parse trees.
In this assignment, you will be slowly converting these parse trees into desired ASTs.
Therefore, the outputs will change depending on how far you get in the assignment.
This is explained further in the testing section.
For the rest of the assignment, you should modify HL.jjt to construct all your ASTs.
In this assignment you will not modify any other files.
Here now are the requirements for the ASTs and their associated learning objectives.
They are listed in increasing order of difficulty.
Please check the
Tests directory
for examples of these explanations.
The tests in that directory are in the same order as the requirements.
Shorter names
4Cd:
Because JavaCC productions are Java function calls, some of the names used for them, such as "return_stat"
were lengthened to avoid using Java keywords. Other productions were renamed to be consistent with them.
Without changing the production names in the grammar, make sure that the names of all the AST classes generated
for statements do not contain the _stat component, but start instead with an upper case letter.
For example, when the return_stat() production is parsed, an ASTReturn structure will be created.
Leaves of the ASTs
In this section we will define all the leaves of the ASTs: some are nodified tokens and some are null.
- Comparators
4Cd: You should have one ASTcomparison class
(where the nodes have 3 subtrees) to deal with all comparisons.
The middle node is a nodified comparator token specifying the type of comparison,
The names are the same as the token names.
For example ASTLESSEQ for the <LESSEQ> token described by the string "<=".
- Boolean constants
4Cd:
The two Boolean constants True and False should also be nodified tokens with no children, defined exactly like nodified comparators.
For example ASTTRUE for the <TRUE> token described by the string "#1".
- Absolute value
4Cd: By default the absolute value of an expression is parsed exactly like an expression surrounded by parentheses.
Instead, it should be parsed as an ASTabsolute_value with one child.
- Nulls
Some productions have optional components. In particular:
- the domain in set former expressions
- the elif and else clauses of an if statement
In these ASTs, missing components will be expressed with an ASTNULL.
4Cd: Rework the grammar of these types of statements to make sure that
their ASTs have exactly the same structure with the same number of children
whether or not the optional components are there. More specifically:
- An ASTset_former always has 3 children: the index, the domain, and the condition,
but if the domain is not specified, then the second child needs to be ASTNULL.
(Note that at the C level, this will be the first of two children because the index will not be kept until 4Bd is implemented).
- An ASTIf always has 3 children: condition, then clause, and else clause,
but the else clause is the node ASTNULL when there are no elifs or else.
This means that you must restructure your grammar so that elifs are simply represented as ASTIf nodes,
without changing the syntax of the language.
I.e. programmers will still be able to use elif constructs exactly the same way,
but they will be represented internally as ASTIf constructs.
- Meaningful tokens
4Bd: The remaining leaves of the ASTs, are tokens that contain values: IDNUM, IDSET, IDBOOL, NUMBER, STRING.
They should be nodified as explained in the
Building AST's with JJTree Handout.
For these, you will need to construct an AST node class for each of these types of tokens.
The name of these node classes should be idnum, idset, idbool, number, and string.
Since the tokens will be discarded after the equivalent nodes are constructed,
you will need to transfer to these nodes the value of the token that was constructed during scanning.
Take a look at the source code for the two classes Node and SimpleNode to see what is in place to
support this transfer.
Note: This B-level objective is not very hard,
and you should therefore implement it as soon as you decide to attempt the B-level objectives,
because it will facilitate your testing greatly:
once you've programmed this requirement, you will be able to see the name of the identifiers,
and the values of the numbers and strings that you input into your tests.
Cleanups
Later sections will describe how boolean and arithmetic operations will be rationalized.
This section deals with simpler changes to other structures.
- Simple cleanups
4Bc:
Since you are building an AST,
you should remove useless nodes which only represent syntactic constructs that
have no inherent meaning.
To start with, nodes for productions that simply group other productions
that are fully described on their own should be discarded:
S, statement_LL1, statement, identifier, value, term, and simple_term.
4Bc:
In a similar vein, the only production in not_clause() that needs to be in an ASTnot
is the one that is negated with a "!".
4Bc:
You should also simplify the ASTs that represent sets.
There should only be two ASTs for sets: ASTset_former for the sets described with set formers and
ASTset for sets described simply as a list of elements.
- List ASTs
4Bc:
Nodes which represent lists
(ASTbody, ASTclause, ASTident_list, ASTvalue_list, ASTprint_list, ASTexp_list)
should have as many children as are parsed.
In other words, they should be n-ary nodes containing the elements of the list.
When these lists are empty (when allowed by the grammar), then they have 0 children.
Note that ASTprint_list, and ASTexp_list cannot be empty when they are used children of
ASTPrint, and ASTFor trees respectively. This should not change.
4Bc:
By default, ASTvar_decl and ASTset trees have a single child which is a list.
These two ASTs should be turned into n-ary nodes whose children are the ones
who were previously their grandchildren.
Operations
The remaining requirements are all for learning objective 4Ab.
They can all be implemented by manipulating the grammar or using more advanced features of jjtree.
4Ab:
You should now rationalize the names and structures of the AST nodes that represent
operations as described here:
- Identifiers
In HL , the type of a variable is implied by its name. This is simple for the HL programmer, and,
as you can tell from the grammar, this also provides the added benefit that some type checking can be performed
at parse time, for example by separating assignments involving Booleans from other assignments.
However, later on it will be more useful to treat all identifiers equally with their type being one of their properties.
For that purpose all identifers should be represented with a single type of AST, ASTidentifier
(whose value is the name as implemented for 4Bd) which now has one child representing the type of the identifier.
As you can see in the test answers at the A level, the names of these three different children are "typenum", "typeset" and "typebool".
- Names of AST for operations:
ASTand, ASTor, ASTnot, ASTsum, ASTmul, ASTdiv, ASTmod
will represent the and, or, sum and difference, multiplication, division, and mod operations.
- AST minimizations:
ASTs for operations should only show up in a syntax tree if their associated operation is actually in the sentence
being parsed. (This issue has already been dealt with for ! in an earlier requirement).
- Operator Precedence:
The precedence rules for all the operations in HL are:
Highest Precedence
| ( ), | |
| * / %
| + -
| <, <=, >, >=, ==, !=, =in, !in
| !
| &
| |
| Lowest Precedence
|
These precedences are already built into the grammar.
- Associativity:
All operations are left associative.
This associativity is not built into the grammar
and will have to be implemented as explained underneath.
In the description of the ASTs that follows you will notice some inconsistencies between similar operators:
Some of the binary operators &, |, +, -, *, /, %
will be represented as binary operations with ASTs with only two subtrees,
and others will be represented as n-ary operations with ASTs that can have more than two subtrees.
Real interpreters may be more consistent, but these discrepancies
have been introduced for educational purposes, to learn to manipulate
grammars and JJTree directives differently to arrive at different AST structures.
These ASTs will also end up having to be evaluated differently in the next assignment,
providing a further opportunity to work with different types of data structures.
- Boolean Operations:
- ASTnot is a unary operator. Its node has a single child.
- & and | are left-associative n-ary operators, where n>0.
Their nodes (ASTand, ASTor respectively) have as many children as are parsed in the boolean expression.
For example a | b | c |d would be represented as [ASTor a b c d].
(here a, b, c, d are placeholders for more complicated expressions involving comparisons.)
- Arithmetic Operations:
- *, /, and % are left associative binary operators with the same precedence:
ASTmul, ASTdiv, and ASTmod have two children,
and when there are multiple operations, the subtrees should be on the left.
For example, a * b / c would be represented as [ASTdiv [ASTmul a b] c].
- + and - are unary operators as well as left associative binary operators.
However, because + and - have the same precedence, a string consisting of these operations
will be considered to be a sum, where the terms are either positive or negative.
These sums will be represented with a single n-ary node called ASTsum, where n>0.
- The second to last children of ASTsums will be unary nodes ASTpos and ASTneg representing unary + and -.
- The first child will either be an ASTneg if the first term in the sum was negative,
or the term itself otherwise.
For example:
a-b+c would be represented as [ASTsum a [ASTneg b] [ASTpos c]],
-a-b+c would be represented as [ASTsum [ASTneg a] [ASTneg b] [ASTpos c]],
-a would be represented as [ASTsum [ASTneg a]],
Debugging your assignment
JavaCC has a built-in tracing functionality that you may find useful for debugging your work.
It is invoked by setting:
DEBUG_PARSER=true;
In the options section of HL.jj program right at the top where you have also specified the case.
Testing your assignment
Your entire assignment will be tested on the scs moons as described here:
- We will first retrieve your submission and, using the makefile in that directory,
we will compile it and link it with all the java source files we gave you in the
Handouts/src directory:
TestHL.java, SimpleNode.java, and all the *Token.java files needed for scanning.
- We will then run the runtest script for the version of the assignment that you submitted
(runtestsa for A3A.zip, runtestsb for A3B.zip, runtestsc for A3C.zip).
You can find these three test files and the associated single test files called t in the
Handouts directory
As for previous assignments these scripts runs a battery of tests that build on each other.
The test files and answers can be found in the
Tests directory.
However, there is an added level of complexity to the testing of this assignment:
The ASTs that you build will be very different depending on how far you get in the learning objectives.
Therefore, we will run different test scripts based your indication of how far you got in your development.
The input files are the same, but there are three types of result files, one for each level
as you can see in the structure of the
Tests directory.
The expected outputs at each level assume that all the learning outcomes have been met at that level.
If that is not the case, we will look at your outputs to see how far you have proceeded.
Submitting the assignment
As for the previous assignments, we are using scripts to grade you.
Therefore please use exact names listed below, including the name of the zip file.
As explained in the previous section on testing,
the output of you program will be very different depending on how far you have gone in the learning
objectives.
Therefore you will need to indicate in your submission which learning objectives
you are attempting to meet to make sure that we can test your submission properly:
-
Zip up HL.jjt into one of the following:
- A3C.zip if you are only attempting to meet the 4Cd learning objective,
- A3B.zip if you are attempting to meet the 4Cd, 4Bc and 4Bd learning objectives, but not 4Ab.
- A3A.zip if you are attempting to meet all the learning objectives of this assignment.
As explained above, we will be testing your assignment with
the extra java files provided in the
Handouts directory
-
Make sure that your solution passes the tests described in the section above on the moons.
- Submit electronically
A3C.zip, A3B.zip or A3A.zip
References
This page is maintained by
Sophie Quigley
(cps710@cs.torontomu.ca)
Last modified
Tuesday, 05-Nov-2024 02:07:10 EST