We could also use some more arithmetic operators. We've seen how to cope with operators before though and there is little that is new to be said about these.
The grammar is fairly large, but quite simple. (Honest. More than half the productions are concerned with boolean and arithmetic expressions). To help with the readability, I've capitalised all non-terminals.
The main problem with this degree of complexity in the language is that
although it is now possible create moderately complex programs, it is
impractical to do so because of the way it treats the ends of lines. They are
only handled at the top level of the parser, so each line has to be a complete
item. It is not possible to split 'if' statements or
'loop's across lines. This will need to be corrected before we
consider this to be a usable language.
| proglet | -> | StmtList | |
| StmtList | -> | Statement [ StmtList ] | |
| Statement | -> |
ø | IfStmt | LoopStmt | print ArithExpr | Assignment | ExitStmt | run | |
| IfStmt | -> | if BoolExpr then StmtList [ else StmtList ] endif | |
| LoopStmt | -> | [ label ] loop StmtList endloop [ LabelName ] | |
| ExitStmt | -> | exit [ LabelName ] [ ( when | unless ) BoolExpr ] | |
| LabelName | -> | name | |
| BoolExpr | -> | BoolOrExpr | |
| BoolOrExpr | -> | BoolAndExpr [ bool_or BoolAndExpr ]* | |
| BoolAndExpr | -> | BoolNegExpr [ bool_and BoolNegExpr ]* | |
| BoolNegExpr | -> | [ bool_not ] BoolPrimary | |
| BoolPrimary | -> | ArithExpr ( lt | le | eq | ne | gt | ge ) ArithExpr | |
| ArithExpr | -> | BitOrExpr | |
| BitOrExpr | -> | BitAndExpr [ bit_or BitAndExpr ]* | |
| BitAndExpr | -> | AddExpr [ bit_and AddExpr ]* | |
| AddExpr | -> | MulExpr [ ( plus | minus ) MulExpr ]* | |
| MulExpr | -> | PowExpr [ ( mul | div | rem | mod | l_shift | r_shift) PowExpr ]* | |
| PowExpr | -> | NegExpr [ pow NegExpr ]* | |
| NegExpr | -> | [ plus | minus | bit_not ] Primary | ( l_par ArithExpr r_par ) | |
| Primary | -> | number | ( name [ l_par ArithExprList r_par ] ) | |
| Assignment | -> | name ( asgn | plus_asgn | minus_asgn ) ArithExpr | |
| ArithExprList | -> | ArithExpr [ comma ArithExpr ]* |
-> "print" |
loop | -> "loop" | ||||
| if | -> "if" |
endloop | -> "endloop" | |||
| then | -> "then" |
exit | -> "exit" | |||
| else | -> "else" |
when | -> "when" | |||
| endif | -> "endif" |
unless | -> "unless" | |||
| lt | -> "<" |
le | -> "<=" | |||
| eq | -> "==" |
ne | -> "!=" | |||
| gt | -> ">" |
ge | -> ">=" | |||
| bool_and | -> "&&" |
bit_and | -> "&" | |||
| bool_or | -> "||" |
bit_or | -> "|" | |||
| bool_not | -> "!" |
bit_not | -> "~" | |||
| plus | -> "+" |
minus | -> "-" | |||
| mul | -> "*" |
div | -> "/" | |||
| mod | -> "mod" |
rem | -> "rem" | |||
| l_shift | -> "<<" |
r_shift | -> ">>" | |||
| asgn | -> "=" |
pow | -> "^" | |||
| plus_asgn | -> "+=" |
minus_asgn | -> "-=" | |||
| l_par | -> "(" |
r_par | -> ")" | |||
| label | -> name ":" | comma | -> "," | |||
| name | -> letter [ letter | digit ]* | comment | -> "#" [ character ]* "\n" | |||
| number | -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ] | |||||
This helps keep the parser simple and detects errors that bit earlier. The later we detect an error, the harder it gets to give a meaningful error message. Since keywords are indistinguishable from names until they are complete, we handle them by checking completed names against a table of keywords, and returning a value associated with that particular keyword. These values are declared in an enumerated type. We could return the index into the table and rely on particular values for the enumerated type, but that approach means keeping the order of words in the table in step with the order of enumeration of the values. Using the table to associate the values with the words is more reliable and more obvious to anyone reading the lexer code.
Operators are also known to the lexer now. Previously, we simply assumed that
if a token wasn't a name or a number, then it must be an operator. The
mod and rem operators are particular names, so these
are checked for along with other keywords. (See
'rem' versus 'mod'
for more more detail on these operators.) Many of the characters used for
operators can be used in combinations to create other operators. To recognize
these the routine FollowedBy() checks for these combinations and
returns the correct code.
The syntactic grammer allows loops to be labelled, but we refer to these labels using 'LabelName's. A LabelName is returned as the value of the label. The difference is the presence of the colon. If we wanted to recognise a label in the syntax by returning the name and the colon as two tokens it would become difficult to distinguish between a label and an assignment without using two tokens of lookahead. Since the lexer has already seen the character following the name, it will know which it is provided we do not allow any white-space between the name and the colon in a label.
Since layout becomes important with complex programming structures, and everyone likes to lay things out their own way, we now treat newlines as just another form of white-space, relying entirely on keywords to guide the syntax analysis. It is also high time we added comments to our proglets.
If' statements.if statements is most easily explained using
the following piece of pseudo-code: ...
If Not condition Then GoTo else_part
then_part: StmtList
goto end_if
else_part: StmtList
end_if: ...
If the condition is true, the jump to the else_part is not taken
and the 'then' StmtList is executed. At the end of this we
have to skip over the 'else' StmtList if there is one. When we
first reach the 'if' we do not know where the
'else' StmtList will begin, or whether there will be one, but
we can reserve a place to fill in the address once we do know it. We then parse
the 'then' StmtList. Since no statement can start with
'else' or 'endif', we will return from parsing
statements when we reach either of these. We can then patch the address we
reserved space for.
If the next keyword is 'else', we need to do something similar
to jump over the 'else' StmtList. Reserve a space for the jump
instruction and patch in the address when we reach the 'endif'.
A doddle.
if x then.... It has to be
if x != 0 then....
do...while x and repeat...until x, where the
first version repeats the loop if the condition is true, and the second if it
is false. With four variations of loop though, we can start confusing
ourselves. If we are scanning the body of a do...while x loop
and we come across the keyword while, is this the end of our loop
or the beginning of a nested while x...endwhile loop?
We could probably come up with enough keywords to make the situation
unambiguous, but we would then have an overly complex language.
The form of loop I've gone for is similar to an Ada loop. The simplest form is
an infinite loop, but we can leave the loop using an 'exit'
statement at any point within it - not just at the start or
the end. 'Exit' statements come in conditional and unconditional
forms, with positive and negative senses of the condition being available using
'exit when' and 'exit unless' respectively.
As with 'if' statements, we want to jump to the end of a block
when we don't yet know where the end of that block is, so we have to leave a
space somewhere to patch the appropriate address in once we do know it. The
obvious place to do this is at the 'exit' statement, but what if
we have two or more 'exit' statements? We would have to remember
where each one is to be able to patch it. An easier scheme to implement is to
deposit the end-of-loop address at the start of the loop. When we come to such
a reference during execution we would just skip over it without doing anything.
When parsing the body of the loop we always know where it started, so we can
have as many exit statements as we like simply by fetching the end-of-loop
address from the reference at the start of the loop, and we only have to patch
one point once we know where the end of the loop is.
The grammar actually treats 'exit' statements as statements in
their own right - i.e. they are not handled as part of the loop. While this
makes multiple 'exit' statements easier to express, it has two
other consequences. First, an exit may occurr outside any loop, and
we have to give some thought to how we interpret this. Do we disallow it, or
exit the proglet? Secondly, if the start-of-loop address is held in storage
local to the ParseLoop() procedure, we will not be able to access
it from ParseExit(). It has to be stored 'globally' to the parser.
Since loops cannot overlap other than by nesting one inside another, a stack-type structure would be appropriate for storing this information. An un-named exit then refers to the last entry on the stack. For a named exit, we have to scan the entries looking for a match to the name. If, when we push a new named loop onto the loop-stack, we check that no other loop has the same name, it does not matter in what order we search the loop-stack for a name. This stack is only used during parsing, to help with address patches. There is no run-time loop-structure.
rem' versus 'mod'% for one of these was rejected on the grounds that some
languages use % for mod and some for rem,
and it would not be clear which was intended. Using words, it is obvious.
(a rem b) only makes sense if b is an integer, so we
round it off to the nearest integer before doing the rest of the operation.
(The original value of b is unchanged. We only round a copy of it
to carry out the rem).
Just in case you're not clear what the difference is,
(a mod b) always lies in the range
0 <= x < b, whereas rem is defined such that
(int(a / b) * b + (a rem b)) = a.
(-7 mod 3) is therefore 2, but (-7 rem 3) is -1.
++ and --, or inc and
dec. Allowing them as side effects in other expressions is
unpleasant, and I decided that, although inc and dec
statements aren't difficult to add, 'a += 1' is almost as
concise as 'inc a' and a lot more flexible.
It is easy to read
a = a + 1
but it is harder to read
Long_Variable_Name_1 = Long_Variable_Name_2 + 1
Writing Long_Variable_Name_1 += 1
makes it clear that we are incrementing a variable and excludes the possibility
of mis-spelling the name the second time round and accidentally reading from
something else.
run statement. for (t = StartTime; t <= StopTime; t += StepTime) {...}
a sequence of runs at time 0, 1,...10 is simple to set up. An attempt to do the
sequence 0,0.1,...1.0 is probably doomed to failure though. The time-step 0.1
is impossible to represent exactly in binary - it is a repeating fraction
0.00011001100... - so there is an increasing error in the values used for
t. The nearest I get to 1.0 on my computer is 1.000000119 - an
error of about 10-7 after just ten additions. If the
StopTime entered by the user is to be included in the sequence of
runs, we must accept this rounding error and take steps to compensate for it.
This just requires a small adjustment to StopTime by adding
StepTime / 2.0 to it. To definitely exclude
StopTime from the sequence, it is necessary to subtract
this amount from it.