Home
Parsing
Eval

eval6

The next enhancement almost brings us to a practical programming language by adding control statements for conditional execution and looping, and hordes of extra operators. Control statements imply boolean expressions, so we must add some boolean operators.

We could also use some more arithmetic operators. We've seen how to cope with operators before though and there is little that is new to be said about these.

The grammar is fairly large, but quite simple. (Honest. More than half the productions are concerned with boolean and arithmetic expressions). To help with the readability, I've capitalised all non-terminals.

The main problem with this degree of complexity in the language is that although it is now possible create moderately complex programs, it is impractical to do so because of the way it treats the ends of lines. They are only handled at the top level of the parser, so each line has to be a complete item. It is not possible to split 'if' statements or 'loop's across lines. This will need to be corrected before we consider this to be a usable language.


Eval6 source.
Eval6 header.
Sample application source.
Eval6 data.
Eval6 executable.
Syntactic grammar 6
proglet ->StmtList
StmtList ->Statement [ StmtList ]
Statement-> ø |
IfStmt | LoopStmt |
print ArithExpr | Assignment |
ExitStmt | run
IfStmt ->if BoolExpr then StmtList [ else StmtList ] endif
LoopStmt ->[ label ] loop StmtList endloop [ LabelName ]
ExitStmt ->exit [ LabelName ] [ ( when | unless ) BoolExpr ]
LabelName ->name
BoolExpr ->BoolOrExpr
BoolOrExpr ->BoolAndExpr [ bool_or BoolAndExpr ]*
BoolAndExpr ->BoolNegExpr [ bool_and BoolNegExpr ]*
BoolNegExpr ->[ bool_not ] BoolPrimary
BoolPrimary ->ArithExpr ( lt | le | eq | ne | gt | ge ) ArithExpr
ArithExpr ->BitOrExpr
BitOrExpr ->BitAndExpr [ bit_or BitAndExpr ]*
BitAndExpr ->AddExpr [ bit_and AddExpr ]*
AddExpr ->MulExpr [ ( plus | minus ) MulExpr ]*
MulExpr ->PowExpr [ ( mul | div | rem | mod | l_shift | r_shift) PowExpr ]*
PowExpr ->NegExpr [ pow NegExpr ]*
NegExpr ->[ plus | minus | bit_not ] Primary | ( l_par ArithExpr r_par )
Primary ->number | ( name [ l_par ArithExprList r_par ] )
Assignment ->name ( asgn | plus_asgn | minus_asgn ) ArithExpr
ArithExprList->ArithExpr [ comma ArithExpr ]*
Lexical grammar 6
print -> "print" loop -> "loop"
if -> "if" endloop -> "endloop"
then -> "then" exit -> "exit"
else -> "else" when -> "when"
endif -> "endif" unless -> "unless"
lt -> "<" le -> "<="
eq -> "==" ne -> "!="
gt -> ">" ge -> ">="
bool_and -> "&&" bit_and -> "&"
bool_or -> "||" bit_or -> "|"
bool_not -> "!" bit_not -> "~"
plus -> "+" minus -> "-"
mul -> "*" div -> "/"
mod -> "mod" rem -> "rem"
l_shift -> "<<" r_shift -> ">>"
asgn -> "=" pow -> "^"
plus_asgn -> "+=" minus_asgn-> "-="
l_par -> "(" r_par -> ")"
label -> name ":" comma -> ","
name -> letter [ letter | digit ]* comment -> "#" [ character ]* "\n"
number -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ]

We were cheating a little earlier by not having 'proglet' as the start symbol. As things get more complex, cheating becomes more and more likely to get you into trouble though. That is why I've shown the whole grammar this time. The addition of not only the control structures but several new operators has greatly increased the size of the grammar, but all the individual productions are simple. You shouldn't be frightened off by the size of this one.

The lexer.

Because of the explosion in the number of keywords we use, and in the complexity of the grammar, the lexer has been revamped to recognise each keyword.

This helps keep the parser simple and detects errors that bit earlier. The later we detect an error, the harder it gets to give a meaningful error message. Since keywords are indistinguishable from names until they are complete, we handle them by checking completed names against a table of keywords, and returning a value associated with that particular keyword. These values are declared in an enumerated type. We could return the index into the table and rely on particular values for the enumerated type, but that approach means keeping the order of words in the table in step with the order of enumeration of the values. Using the table to associate the values with the words is more reliable and more obvious to anyone reading the lexer code.

Operators are also known to the lexer now. Previously, we simply assumed that if a token wasn't a name or a number, then it must be an operator. The mod and rem operators are particular names, so these are checked for along with other keywords. (See 'rem' versus 'mod' for more more detail on these operators.) Many of the characters used for operators can be used in combinations to create other operators. To recognize these the routine FollowedBy() checks for these combinations and returns the correct code.

The syntactic grammer allows loops to be labelled, but we refer to these labels using 'LabelName's. A LabelName is returned as the value of the label. The difference is the presence of the colon. If we wanted to recognise a label in the syntax by returning the name and the colon as two tokens it would become difficult to distinguish between a label and an assignment without using two tokens of lookahead. Since the lexer has already seen the character following the name, it will know which it is provided we do not allow any white-space between the name and the colon in a label.

Since layout becomes important with complex programming structures, and everyone likes to lay things out their own way, we now treat newlines as just another form of white-space, relying entirely on keywords to guide the syntax analysis. It is also high time we added comments to our proglets.

The parser

Depending on how the grammar has been written, it may not be obvious whether we can always decide what kind of statement is coming up. We may have to follow through several productions to find out what the following terminal symbol is. Provided every possible sequence of productions leads to a unique terminal, we always decide which rule to apply. We can rewrite the grammar rules to make this more obvious.

'If' statements.

The way we organise if statements is most easily explained using the following piece of pseudo-code:
               ...
               If Not condition Then GoTo else_part
    then_part: StmtList
               goto end_if
    else_part: StmtList
    end_if:    ...
If the condition is true, the jump to the else_part is not taken and the 'then' StmtList is executed. At the end of this we have to skip over the 'else' StmtList if there is one. When we first reach the 'if' we do not know where the 'else' StmtList will begin, or whether there will be one, but we can reserve a place to fill in the address once we do know it. We then parse the 'then' StmtList. Since no statement can start with 'else' or 'endif', we will return from parsing statements when we reach either of these. We can then patch the address we reserved space for.

If the next keyword is 'else', we need to do something similar to jump over the 'else' StmtList. Reserve a space for the jump instruction and patch in the address when we reach the 'endif'.

A doddle.

Boolean expressions.
These are effectively an extension of arithmetic expressions. All boolean values are the result of comparing one arithmetic value with another. To test whether something is equal to zero, it is necessary to do so explicitly. No shortcuts as in C with if x then.... It has to be if x != 0 then....

Loops.

I have tried to make the format of the loop flexible and clear. I wanted to have loops which test the loop condition before executing the loop and others which test after. I also wanted to have positive and negative sense versions of the loop test - i.e. something like do...while x and repeat...until x, where the first version repeats the loop if the condition is true, and the second if it is false. With four variations of loop though, we can start confusing ourselves. If we are scanning the body of a do...while x loop and we come across the keyword while, is this the end of our loop or the beginning of a nested while x...endwhile loop? We could probably come up with enough keywords to make the situation unambiguous, but we would then have an overly complex language.

The form of loop I've gone for is similar to an Ada loop. The simplest form is an infinite loop, but we can leave the loop using an 'exit' statement at any point within it - not just at the start or the end. 'Exit' statements come in conditional and unconditional forms, with positive and negative senses of the condition being available using 'exit when' and 'exit unless' respectively.

As with 'if' statements, we want to jump to the end of a block when we don't yet know where the end of that block is, so we have to leave a space somewhere to patch the appropriate address in once we do know it. The obvious place to do this is at the 'exit' statement, but what if we have two or more 'exit' statements? We would have to remember where each one is to be able to patch it. An easier scheme to implement is to deposit the end-of-loop address at the start of the loop. When we come to such a reference during execution we would just skip over it without doing anything. When parsing the body of the loop we always know where it started, so we can have as many exit statements as we like simply by fetching the end-of-loop address from the reference at the start of the loop, and we only have to patch one point once we know where the end of the loop is.

The grammar actually treats 'exit' statements as statements in their own right - i.e. they are not handled as part of the loop. While this makes multiple 'exit' statements easier to express, it has two other consequences. First, an exit may occurr outside any loop, and we have to give some thought to how we interpret this. Do we disallow it, or exit the proglet? Secondly, if the start-of-loop address is held in storage local to the ParseLoop() procedure, we will not be able to access it from ParseExit(). It has to be stored 'globally' to the parser.

Since loops cannot overlap other than by nesting one inside another, a stack-type structure would be appropriate for storing this information. An un-named exit then refers to the last entry on the stack. For a named exit, we have to scan the entries looking for a match to the name. If, when we push a new named loop onto the loop-stack, we check that no other loop has the same name, it does not matter in what order we search the loop-stack for a name. This stack is only used during parsing, to help with address patches. There is no run-time loop-structure.

'rem' versus 'mod'

The use of % for one of these was rejected on the grounds that some languages use % for mod and some for rem, and it would not be clear which was intended. Using words, it is obvious.

(a rem b) only makes sense if b is an integer, so we round it off to the nearest integer before doing the rest of the operation. (The original value of b is unchanged. We only round a copy of it to carry out the rem).

Just in case you're not clear what the difference is, (a mod b) always lies in the range 0 <= x < b, whereas rem is defined such that (int(a / b) * b + (a rem b)) = a.

(-7 mod 3) is therefore 2, but (-7 rem 3) is -1.

Bit operations.

Given that our only numeric type is actually floating point, bit-wise operations are slightly awkward. They have to be done by converting fp to integer, do the operation, and convert back to fp. This is rather inefficient, but it's easy to read when that's what we want to do. The alternatives are to have two numeric types - integer and floating point, leading to complexities in the arithmetic - or to write obscure code. Since these sort of operations are fairly likely in the situations I see this sort of language being used in, I consider them essential. This is certainly good enough for bit-operations on 16-bit values, so it will do for most situations. We could mask the results so that only the bottom 16 bits of any result are non-zero. Knowing that the results is always 16 bits may be helpful (or at least reassuring) to the user.

New assignment operators

I thought about adding increment and decrement operations into the language, either using ++ and --, or inc and dec. Allowing them as side effects in other expressions is unpleasant, and I decided that, although inc and dec statements aren't difficult to add, 'a += 1' is almost as concise as 'inc a' and a lot more flexible.

It is easy to read

    a = a + 1
but it is harder to read
    Long_Variable_Name_1 = Long_Variable_Name_2 + 1
Writing
    Long_Variable_Name_1 += 1
makes it clear that we are incrementing a variable and excludes the possibility of mis-spelling the name the second time round and accidentally reading from something else.

The run statement.

This now accepts an argument list, as suggested previously. The use of a floating point value as the control variable in the loop is something to watch out for though. With the loop set up as
    for (t = StartTime; t <= StopTime; t += StepTime) {...}
a sequence of runs at time 0, 1,...10 is simple to set up. An attempt to do the sequence 0,0.1,...1.0 is probably doomed to failure though. The time-step 0.1 is impossible to represent exactly in binary - it is a repeating fraction 0.00011001100... - so there is an increasing error in the values used for t. The nearest I get to 1.0 on my computer is 1.000000119 - an error of about 10-7 after just ten additions. If the StopTime entered by the user is to be included in the sequence of runs, we must accept this rounding error and take steps to compensate for it. This just requires a small adjustment to StopTime by adding StepTime / 2.0 to it. To definitely exclude StopTime from the sequence, it is necessary to subtract this amount from it.
Eval
Parsing
Home
Valid HTML 4.01 Any comments or queries, write to me.
Site last updated 14 March, 2011