Eval1 converts an expression in normal, infix, notation to the equivalent in postfix notation. This conversion basically requires us to parse the input expression, the first step in evaluating the expression, before we write it out in another format.
Eval1 is only 55 lines long, and doesn't actually evaluate anything, but it does illustrate one way to go about parsing a statement based on a simple grammar, known as a recursive-descent parser. It reads in 'expressions' consisting of single letters separated by arithmetic operators, and rewrites them in postfix order. It continues to do so until it reaches either a "." or the end-of-file, which is entered as control/z on PCs, and control/d on UNIX systems.
This simple example works best if it is reading from a file rather than from the keyboard as the echoing of keys to the screen gets mixed up with the output from the program.
The way it works is like this. Input is divided up into tokens by GetNextToken().
Each token has to be read in before we can decide what to do with it, so the
first thing we do is read in the first token. At this point we are know that we
are at the beginning of an expression, (assuming there are no errors) so we
call ParseExpression(). The first part of an expression is always
one of the terms of that expression, so we call ParseTerm(). The
first part of a term is always one of its factors, so we call ParseFactor().
Now we have a decision to make. A factor can either be a primitive symbol, or
it may be a bracketed expression. We can decide which easily enough because we
do not have the complexity of an entire expression to cope with. We only need
to know whether the current token is an open bracket or not to decide which
form the factor takes. Assuming that our first expression is a simple one like
a+b., our processing simply consists of printing the current token
and fetching the next one.
We now return to parsing the rest of the current factor. Again, this is easy
because we know we are dealing with a factor. A factor is a sequence of symbols
separated by "*". If the current token is not a "*", we are finished parsing
the factor and return to parsing the rest of the expression. Now the current
token matches one that ParseExpression() is looking for, so we
enter the loop portion of the code. Remember what that token actually is for
later use and fetch another token. This token may be just the start of
something complex, so deal with this by calling ParseTerm() again.
Whatever it consists of, when we return from that call the current token will
either be one that ParseExpression() is looking for, or we have
come to the end of the expression. In either case, another term of the
expression will have been dealt with (printed) and we need to do something with
the operator we stored earlier, so print that too. If we have reached the end
of the expression, return. That's about it for that expression.
Recursive descent parsers are simple. First, the structure of the program corresponds closely to the structure of the grammar. Second, the actions required are easily inserted into the parsers so that, as soon as a structure has been recognised, those actions can be taken. Third, the use of variables local to procedures saves the programmer from having to build data structures to store parser state information, because it goes onto the run-time stack of the program. They do have the disadvantage that combining the parser and the actions in this way makes things more obscure when we only want to look at one in isolation, but all our actions will be very simple.
Before we can see how to construct a parser, we have to write the grammar for the language we wish to parse. Eval1 is based on an example from the "Pascal user manual and report", pp.113-117. The language it copes with is expressed using a grammar, which lists legitimate ways of constructing a sentence or statement. Abstract ideas or meaningful structures in the sentence are shown on the left hand side of the productions or rules of the grammar. Each production shows how a non-terminal symbol breaks down into some combination of terminal and non-terminal symbols. The non-terminals appearing on the right hand side generally represent simpler abstract concepts, and terminals cannot be broken down any further. Every non-terminal will have a production associated with it. In the absence of any indication otherwise, the start symbol is assumed to be the one which comes first in the grammar. In the case of grammar 1, an expression.
Source.
Executable.
Sample input data.
| expression | -> term [ ( "+" | "-" ) term ]* | |
| term | -> factor [ "*" factor ]* | |
| factor | -> name | ( "(" expression ")" ) | |
| name | -> letter |
"->" can be read "consists of", "a | b" means "either a or b", and square brackets indicate optional components. "[ a ]" means "zero or one instance of 'a'", [ a ]*" means "zero or more repetitions of a", and "[ a ]+" means "one or more repetitions of a". Terminal symbols (literal text) are given in quotes, and parenthesis groups items in the same way as in arithmetic expressions. See Grammar notation for more detail.
An expression is therefore a list of terms, separated by plus or minus operators. A term is a list of factors, separated by multiply operators, and a factor is either a name or a parenthesised expression. A name is a single letter. The termination condition has not been shown in the grammar. That would not be appropriate here since the termination is not actually detected by the parser, which acts on only one line at a time.
There are really two stages to parsing a source program, lexical analysis and syntax analysis. These are both doing the same sort of thing but with vastly different depths of understanding. The first refers to dividing the input character stream into words (symbols or tokens). For eval1 the symbols are single characters anyway, but the principle will be used in later versions and still holds here. The majority of the work lies in syntax analysis, so the program that does this is usually referred to as the parser.
Compare the procedure ParseExpression() with the grammar rule for
expressions. The rule has two parts. The first part says that an expression
always starts with a term, which translates into a call to ParseTerm().
The second part of the rule says that the first term may be followed by zero or
more instances of a plus or minus sign followed by another term. The 'zero or
more' instances of something translates into a while loop. The
condition in the while loop checks that the next token is a valid
continuation for an expression - i.e. that the next token matches one of the
terminal symbols given at the start of the second part of the rule - and the
body of the loop parses the rest of the rule. From this close correspondance
between the grammar and the parser, it can be seen that creating a parser for a
particular grammar is almost a mechanical process. Several programs have been
written which actually do write a parser given a grammar, the best known of
which is yacc (Yet Another Compiler Compiler), which is available on several
platforms, including PCs. Although it is available, most people do not
have access to it. This method is available to anyone with a compiler.
For the parsing method presented, it is necessary to write the grammar in such a way that there is always a way to decide what rule to apply from looking at the next token. When the token no longer matches any of the terminals, we might decide that it forms part of a non-terminal. Obviously, a rule can only mention one possible non-terminal at any point or we would not be able to decide which to try. It is sometimes possible to rewrite grammar rules to factor out common sub-expressions which do have leading terminals that allow decisions to be made. See here for more details.
$ type eval1.dat (a+b)*(c-d) a+b*c-d ( a * b)* c-d a+b*(c-d) a*a*a*a b+c*(d+c*a*a)*b+a . $ eval1 ab+cd-* abc*+d- ab*c*d- abcd-*+ aa*a*a* bcdca*a*+*b*+a+ $