Home
Parsing
Eval

eval5

The new feature in the grammar here is the ability to assign values to our own variables, but this brings about a whole new way of looking at the user-input. It is no longer sufficient to look at one line in isolation. To make use of variables implies assigning values and retrieving them. In other words, we have taken a significant step towards having a real language, with multi-line programs.

There is no longer an obvious, single value to return to the user. For the moment, we're going to concentrate on printing the values of expressions but we'll come back to the subject of retrieving results later.


Eval5 source.
Eval5 header.
Sample application source.
Eval5 executable.
Syntax grammar 5
statement -> ( "print" expression | assignment | "run" ) "\n"
assignment -> name "=" expression
expression -> term [ ( "+" | "-" ) term ]*
term -> factor [ ( "*" | "/" ) factor ]*
factor -> [ "+" | "-" ] ( primary | "(" expression ")" )
primary -> number | ( name [ "(" ExpressionList ")" ] )
ExpressionList -> expression [ "," expression ]*
Lexical grammar 5
number -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ]
name -> letter [ letter | digit ]*

Although we do not yet have control statements that allow loops, we are getting to the stage when a series of steps is only worth entering if it is going to be executed several times before being discarded. Re-parsing the source each time would be innefficient, so we are going to change the actions associated with the language structures. They now produce intermediate code which is executed afterwards using a separate interpreter. Such schemes are fairly common, the best known ones being UCSD Pascal compilers which produced p-code, and more recently Java applets are compiled into byte-codes which are the machine code of a 'virtual' machine. Since the virtual machine doesn't exist, it has to be simulated by interpreting the byte-codes, but this is much more efficient than interpreting the original source.

The largest structure we enter still has to be on one line. Suppose we want to be able to enter an expression on several lines. At the time that we run out of input and need to get more, the call sequence might be quite deep. If it is ParseExpression() : ParseTerm() : ParseFactor() : ParsePrimary() : GetNextToken(), each procedure may be holding information about some pending operation. We cannot return to the caller without losing that information, and we couldn't recover our state afterwards. We have the choice of insisting on only complete structures being passed, or of installing a callback procedure that allows the lexer to get more input. For the moment, insisting on single line structures will do, but we will need to do something about this when we add control structures.

Introducing a print statement seems a rather backward step from the point of view of usability. It might be nice to be able to write a mixture of expressions, whose value we want printed, and assignments, whose value we don't want printed. At first sight,

Grammar 5-ish
statement -> ( expression | assignment ) "\n"
assignment -> name "=" expression
expression -> term [ ( "+" | "-" ) term ]*
term -> factor [ ( "*" | "/" ) factor ]*
factor -> [ "+" | "-" ] ( primary | "(" expression ")" )
primary -> number | name [ "(" ExpressionList ")" ]
ExpressionList -> expression [ "," expression ]*
looks like a reasonable grammar, but it has problems.

If a statement starts with a number, it is obviously an expression, but if it starts with a name, we still don't know whether we are looking at an assignment or an expression. We have to look at the second token to decide. An assignment will have "=" as the second token, otherwise we have an expression. Unfortunately, if it is an expression we want the calling sequence to be ParseExpression() : ParseTerm() : ParseFactor() : ParsePrimary() to get the correct action for finding a name in an expression, but the way of getting there means backtracking two symbols. We therefore have two choices. We can write the ParseStatement() routine and the lexical analyser in such a way that they can cope with backtracking multiple symbols, or we can re-structure the grammar so that we can decide what to do on the basis of one token.

Possible changes to the grammar include i) having the assignment operator return a value just like any arithmetic operator so that it is part of an expression, and ii) introducing a 'print' keyword. The former would be in keeping with the C way of doing things and would allow statements like

    a = b = c = 0 
to initialise several variables at once. When mis-used though, this can lead to obscure and easily missed assignments which make programs harder to read than they need be. Also, we don't want to print the value of an expression if it consists of just an assignment, and coming up with a neat grammar which treats "=" as an arithmetic operator and which doesn't require an untidy bodge in the semantics to suppress the printing of the value is not easy. This is not an option to be undertaken lightly.

Adding an explicit print statement only requires us to replace the first of the grammar rules with

    statement -> ( "print" expression | assignment ) "\n"
to get an easy-to-parse grammar that forces us to write relatively clean and tidy code. We have good confirmation that the statement was supposed to be an assignment if it has an "=" in the right place, or was supposed to be an expression value to be printed if it starts with "print". This sounds like a good option. We could always offer '?' as shorthand for 'print' as in BASIC.

Proglets

To avoid confusion over whether a reference to a 'program' is a reference to 'eval' or to a user-program, I will refer to user-programs as 'proglets'. This conveys an impression that these will generally be small programs, but they don't have to be. Having a compiled version of the proglet is all very well, but you might want to enter several proglets and run them later. When the compiled proglet is stored in global data structures, this is not practical. Putting it into a C/C++ struct would certainly keep everything together, but converting the interpreter into a C++ class would be better. It keeps the notation inside the routines simple as we have direct access to the members of the structure. The alternative, if you are not using C++, would be to pass the structure - or a pointer to it - from routine to routine and to access the members within it. We can keep down the clutter by using C++.

Properly encapsulated, we can associate particular proglets with particular aspects of our application program.

Run-time error checking

In eval5, all the execution happens one step at a time, in a single routine, so when an error occurrs we know exactly which step caused it. This gives us more information we could put into error messages. If we want to really make the messages meaningful though, we will have to remember where each line started so that we can give a line number. We might even want to retain the source. To indicate where on the line the error occurred as well as on which line would need a complete token by token mapping between the source and the compiled code. The source could be duplicated by the top-level parse() routine, and the finer detail is best built up in AddInstr(). This would all use considerably more memory than we have used up until now, but it still doesn't amount to much really. It all depends on how accurate you want your messages. Note that the order of execution does not correspond to a left-right scan of the source lines because of the infix to postfix conversion we do for expressions.

I won't expand on the current error messages until we can reasonably be expected to write proglets sufficiently complex that we will need to be able to save and restore them.

Running the proglet

Detecting the end of a proglet.
Proglets may now consist of several lines, so how do we know whether we can run it? Is it complete or barely started? Since only the end-user can actually know, we have to rely on being told explicitly to run the proglet or on running out of input. To do the former, I have added another keyword to the language - 'run'. The parse routine returns a flag indicating whether the proglet is complete - i.e. did it see the word 'run' - and the proglet can be run at any time after that.
Distinguishing between runs.
Proglets still don't provide any control structures such as loops or if statements. Partly as a compromise, and partly because it can be useful in its own right, I have added a parameter to the run() routine that allows the caller to set the value of a 'predefined constant' for the proglet. I have chosen to call this 'time' because that is what it stood for in a real application that employed this technique. The proglet can then be executed as a procedure with one parameter.

The user may wish to set the conditions under which the proglet will be run, so the 'run' statement could be extended to take parameters. These parameters would be left on the stack and could be retrieved and passed back to the calling program. For example, run 1, 0.1, 10 might indicate that the function should be called for values of 1, 1.1, 1.2, ..., 10

There is a problem with this idea though. When 'run' is entered by the user, we are in the 'parse' phase of operation. We don't know what values have been entered, so it seems we can't pass these values back. It is not too difficult to fudge things to get the result we want though. If we make a temporary proglet out of the parameter list, we can do a one-off execution of that proglet and then pop the results off the stack to return them.

The use of commas to separate the parameters is not vital, but it may reduce the incidence of end-user mistakes about exactly where one expression ends and the next one starts. It also allows us to re-use the ParseArgList() routine. At present, that routine expects a ")" at the end of the argument list, but we could easily pass in a parameter that told it to expect some other token instead. The calling routine has no trouble knowing which terminator to look for so this is not a problem.

The precise syntax of the run command doesn't really matter. Use whatever you feel happy with, or think up some other way of terminating the proglet source that suits your application.

Return values
At present run() doesn't return any information to the caller, so its use is rather limited, but we could pass values back out from run() or we could retrieve the values of particular variables. However we return values, run() should always return a status value to indicate whether the others have any meaning or not. Run-time errors could invalidate all our results.

Note that any return variables must not be declared as predefined or they will be treated as constants.


Eval
Parsing
Home
Valid HTML 4.01 Any comments or queries, write to me.
Site last updated 14 March, 2011