Back

Grammar notation

The syntax of the grammars used here is quite simple, but probably not quite the same as any used in text books. In grammar 3 below, each rule has been numbered for ease of reference, but we won't generally do so.

The grammar syntax is

  1. Literal characters are quoted.
  2. Anything which is not quoted is a reference to another rule.
  3. Optional items are enclosed in square brackets, i.e. there will be 0 or 1 occurrence of that item.
  4. Square brackets followed by an asterisk indicates 0 or more occurrences of that item.
  5. Square brackets followed by a plus indicates 1 or more occurrences of that item.
  6. Vertical bars separate items in a list of possible selections. There are generally only two possibilities but there is no reason why you can't have more.
  7. Parentheses are used in the same way as in arithmetic expressions, to group parts of a rule which must be evaluated separately.

Grammar 3
1statement -> expression "\n"
2expression -> term [ add_op term ]*
3term -> factor [ mult_op factor ]*
4factor -> [ "+" | "-" ] ( primary | "(" expression ")" )
5mult_op -> "*" | "/"
6add_op -> "+" | "-"
7primary -> number | name
8number -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ]
9name -> letter [ letter | digit ]*

Or in words,

I'll assume you know what letters and digits are.

The rule given for numbers is only a suggestion to guide anyone who has to parse their own floating point numbers. The rules followed by a library function may not match this precisely but this is unimportant provided the user knows what is required of him/her/it.

For anyone momentarily confused by the description of numbers as literal numerical constants, possible item descriptions include combinations of

literal
symbolic
numerical
string
character
constants
variables
Other data types (the middle column) are possible, but literal variables of any kind are not possible - or at least not to be contemplated by any sane programmer.

The method given above for writing a grammar is not the only one used. In 'purist' terms, the expression rule

expression-> term [ ( "+" | "-" ) term ]*
could be written
expression-> term OptTerm
OptTerm -> ø
OptTerm -> AddOp factor OptTerm
AddOp -> "+"
AddOp -> "-"
where alternative forms of a rule are written out as separate rules and ø is the empty set. The method of writing is different, but the meaning is identical.

Glossary

These are non-technical definitions of terms that I have used. They are presented more or less in order of increasing complexity. A longer, more technical list of parser- and grammar-related terms follows.
Lexical analysis
is the grouping of characters to form words, known as symbols or tokens, which are used as input to the parser. These tokens form 'atomic' chunks of code. Smaller chunks cease would to convey any meaning whatever. "3.14" and "theta" are each single tokens which consist of 4 and 5 characters respectively, but a digit on its own may indicate tens, hundreds, or be part of a variable name.
This stage may also categorize the token as being a name, number, operator or punctuation for instance, and I use it to return binary representations of numbers, but attaching meaning to tokens is generally best left to the parser.
Lexical analysis is the first step in parsing a statement, and is really only a very simple form of parsing.
A lexical analyzer, lexer, or tokenizer
carries out lexical analysis - i.e. tokenizes the source.
Syntax analysis or parsing
is the analysis of a source program, which may be a single line, to extract the underlying structure of that program. No meaning is attached to anything at this stage, but the data is genarally stored for later use. The analysis phase that does attach meaning to the syntax - semantic analysis - is sometimes mixed in along with the parser, and the intermediate data structure will then be simple and short-lived, but if semantic analysis is a separate process then the data structure may be complex and may even be written to a file.
A syntax analyser or parser
carries out syntax analysis - i.e. parses the stream of tokens output by the lexical analyzer.
An interpreter
carries out certain tasks based on some representation of a program. That representation may be the source as written by the user, or it may be an intermediate data structure produced by a parser. In the latter case, parts of the data structure it produces may used (interpreted) and discarded before parsing is complete.
An executable program - a .exe file on PCs or VAXs - is really just a data structure which can be interpreted by a specialised piece of hardware known as a CPU, but an interpreter is generally a program which operates on a more abstract level. It can be thought of as imitating a virtual machine - one which doesn't actually exist.
A compiler
is a parser whose output data structure is produced and stored in such a way that it may be efficiently executed many times without having to parse the source program again.
prefix, postfix and infix notation
refers to the position of an operator relative to its operands.
prefixinfixpostfix
+ a b a + ba b +
The normal way of writing things is infix, but postfix is the easiest to write an evaluation program for.
Prefix and postfix notation do not require parentheses to show how to evaluate an expression. Infix does.
Prefix notation was called 'polish' notation after J.Lukasiewicz who introduced it in the 1920's, and this led to postfix being referred to as 'reverse polish'.
A grammar
is a description of a language. It consists largely of a set of production rules showing valid ways to transform the non-terminal (abstract) symbols on the left-hand side into the mixture of terminal (concrete) and non-terminal (hopefully less abstract) symbols on the right-hand side. There must be a production rule for every non-terminal. The set of terminal and non-terminal symbols, along with a designated start symbol, complete the grammar. Note that a grammar is not a complete description of a language, it only specifies what a program should look like without giving any meaning to anything. So, although conforming to the grammar is necessary for a program to be valid, it is not sufficient. For example, many languages, require variables to be declared prior to being used, so "a = 27" may conform to the grammar but will not be valid unless "a" has been declared to be a numerical variable. This requires checks that are outside the scope of most grammars. There is a compromise here between the ease or difficulty of checking things during parsing and during execution.
In the example grammar
Grammar 3
statement -> expression "\n"
expression -> term [ add_op term ]*
term -> factor [ mult_op factor ]*
factor -> [ "+" | "-" ] ( primary | "(" expression ")" )
mult_op -> "*" | "/"
add_op -> "+" | "-"
primary -> number | name
number -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ]
name -> letter [ letter | digit ]*
we have
non-terminalsterminalsstart symbol
statement,
expression,
term,
factor,
mult_op,
add_op,
primary,
identifier,
number
  
"\n",
"+", "-",
"*", "/",
"(", ")",
".",
"e", "E",
letter, digit
statement 
A parser might not actually see letters and digits at all because the lexical analyzer returns names and numbers, so whether these should be terminal symbols, or not in the grammar at all, is rather variable.
Right-recursive forms of a grammar are to be preferred over left-recursive ones. See left- or right-recursion?.
Recursive-descent parsing
is a top-down method of syntax analysis in which each rule of a grammar is represented by a procedure which only knows about that rule. The procedure which starts off the parsing process corresponds to the rule which represents the whole program. That decides which rule to apply out of those that might be valid and applies it, and so on. At each step, there must either be only one rule which applies or we must be able to choose the correct rule by looking at the next token in the source. When we reach a rule which is looking for a terminal symbol, we may take some action. As called procedures return - corresponding to the successful recognition of some structure - other actions may be taken. If these actions correspond to the semantics of those structures, the program is being interpreted as it is parsed. If they correspond to the production of a sequence of code, or some intermediate data structure, it is more of a compilation.
The 'recursive' bit comes about because grammar is recursive. For instance, expressions contain sub-expressions.
The 'descent' bit refers to the top-down approach of starting with the rule representing the whole program and searching for smaller and smaller bits.
The syntax
of a program restricts what it may look like, detailing the superficial form of a program, laying down rules of punctuation and showing how a program may be built up out of items and characters.
The semantics
of a program is what it means, defining the actions which must be carried out when the program is executed.
A syntax tree or parse tree
is a graphical method of showing the structure of a program or fragment of a program. For instance, Parse tree

Glossary

I've collected together the following, relatively technical, definitions because some books on parsers or compilers don't bother to define them or don't define them clearly, and I have difficulty understanding little bits of things if I don't at least know where they stand in relation to 'the big picture'. As an illustration of this complaint, note that most of the following definitions, from "Understanding and writing compilers", are given in chapter fifteen of that book.

In the heirarchy of grammars classifications, due to Noam Chomsky in 1957, there is a progression from most general (type 0) to most restrictive (type 3), each being a subset of the preceeding type. Note that other books define numerous other types of grammar, but these ones are common.

  1. A sentence is a string (a sequence) of symbols.
  2. A language is a set of sentences.
  3. A grammar is a finite set of productions, which contain both terminal and non-terminal symbols. One of the non-terminal symbols is the sentence symbol of the grammar (the start symbol)
  4. A production is of the form
        A -> alpha
    
    where A is a non-empty string of non-terminal symbols and alpha is a string (possibly empty) of terminal and/or non-terminal symbols. A production expresses the assertion that wherever the string A occurs in a partially derived sentence, it may be replaced by the string alpha.
  5. A type 0 grammar contains productions of the form A -> alpha where A is a non-empty string of non-terminal symbols and alpha is a string of terminal and/or non-terminal symbols.
  6. A type 1 or context sensitive grammar contains only productions of the form A -> alpha where A is a non-empty string of non-terminal symbols and alpha is a string of terminal and/or non-terminal symbols, and the length of the right-hand-side string alpha is not less than the length of the left-hand-side string A.
    Alternatively, the productions of type 1 grammars may be defined to be of the form beta A gamma -> beta alpha gamma where each production replaces a single non-terminal symbol in a particular context.
  7. A type 2 or context free grammar contains only productions of the form A -> alpha where A is a single non-terminal symbol and alpha is a string of terminal and/or non-terminal symbols.
  8. A type 3 or regular expression grammar contains only productions which are one of the forms
        A -> a
        A -> a B
    
    in which A and B are single non-terminal symbols and a is a single terminal symbol.
    Alternatively, the second of these productions may be defined to be of the form A -> C d, i.e. left-recursive rather than right-recursive. A type 3 grammar must obey one definition or the other: if some definitions are left- and some right-recursive the grammar is type 2.
  9. If a string alpha may be changed into a string beta by a single application of a single production, then beta is directly derived from alpha, written alpha -> beta
  10. A derivation is a sequence of strings alpha0, alpha1,alpha2...alphan, such that alphai-1 -> alphai, 0<i<=n.
  11. If we know that there is at least one step in a derivation (n>=1) then alphan is eventually derived from alpha0, written alpha0 ->+ alphan
  12. If we know that there may be no steps in a derivation (n>=0) then alphan is derived from alpha0, written alpha0 ->* alphan
  13. A sentential form is a string which is derived from the sentence symbol - i.e. if S ->* alpha then alpha is a sentential form.
  14. A sentence is a sentential form which contains only terminal symbols.
  15. If A ->+ alpha A then the symbol A is right-recursive.
  16. If A ->+ A beta then the symbol A is left-recursive.
  17. If A ->+ alpha A beta then the symbol A is self-embedding.
  18. Two grammars are equivalent if they describe the same language - i.e. if they generate exactly the same set of sentences.
  19. An ambiguous grammar is one with which it is possible to show two or more distinct derivations for a single sentence - i.e. two or more distinct parse trees.
  20. If a language is described by a type 2 grammar in which there are no self-embedding symbols, then the language can also be described by a type 3 grammar.
  21. Bottom-up analysis looks for patterns in the input which corespond to phrases in the grammar.
  22. Top-down analysis assumes that the input will be a <statement>. The grammar shows how a <statement> phrase can be built up of sub-phrases, and to find a statement the analyser merely searches for each of the sub-phrases in turn. To find a sub-phrase the analyser searches for each of its sub-sub-phrases, and so on until the search is reduced to a matter of searching for a single item, which it looks for in the input.

Bibliography

These are not necessarily recommendations, more a list of books-I-have-read, and which have contributed in varying degrees to my knowledge of compilers and grammars.
  1. "Compiler contruction: theory and practice", William A. Barret and John D. Couch, Science Research Associates Inc., 1979, ISBN 0-574-21160-8.
  2. "Pascal: User manual and report", Kathleen Jensen and Niklaus Wirth, Springer-Verlag, 1985, ISBN 0-387-96048-1 / 3-540-96048-1.
  3. "The UNIX programming environment", Brian W. Kernighan and Rob Pike, Prentice-Hall, 1984, ISBN 0-13-937681-X
  4. "C - a reference manual", Samuel P. Harbison and Guy L. Steele Jr., 1984, ISBN 0-13-110008-4
  5. "Understanding and writing compilers", Richard Bornat, MacMillan, 1979, ISBN 0-333-21732-2.
  6. "Compiler construction, an advanced course", F.L.Bauer and J.Eickel, Springer-Verlag, 1976, ISBN 0-387-08046-0 / 3-540-08046-5.
  7. "Lecture notes in computer science #50 - A Concurrent Pascal compiler for minicomputers", Alfred C. Hartmann, Springer-Verlag, 1977, ISBN 0-387-08240-9 / 3-540-08240-9.
  8. "Compilers: Principles, techniques and tools", Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, Addison-Wesley, 1986, ISBN 0-201-10088-6.

Left- or right-recursion?

With left-recursive productions like expr -> expr "+" term there is a danger that the procedure which parses an expr will recursively apply itself until it runs out of stack space. It can be shown (apparently) that left-recursive grammars are a sub-set of right-recursive grammars, implying that left-recursive productions can always be re-written in right-recursive forms.

For instance,

    expr -> ( expr "+" term ) | term
which is clearly left-recursive, can be rewritten as
    expr -> expr OptionalTerm
    OptionalTerm -> ( "+" term ) | E
where E is the empty set, which is clearly right-recursive. Both forms indicate that an expr consists of a series of terms separated by +s, so the grammars are equivalent.

Now, we treat the first term we come to as a term, not as part of a larger expr, and we can take some action on that basis. In this way, the un-parsed part of the source will definitely continue to get smaller.

The following are all ways of expressing the same grammar:

Grammar 1
expression-> [ term ( "+" | "-" ) ]* term
term -> [ factor "*" ]* factor
factor -> name | "(" expression ")"
name -> letter
Grammar 1
expression-> [ expression ( "+" | "-" ) ] term
term -> [ term "*" ] factor
factor -> name | "(" expression ")"
name -> letter
Grammar 1
expression-> term [ ( "+" | "-" ) term ]*
term -> factor [ "*" factor ]*
factor -> name | "(" expression ")"
name -> letter
Grammar 1
expression-> expression [ ( "+" | "-" ) term ]
term -> term [ "*" factor ]
factor -> name | "(" expression ")"
name -> letter
The first two are left-recursive, and the second, right-recursive. See "Compilers. Principles, techniques and Tools", pp47-48.
From "Compiler construction, an advanced course", pp85-86

In order of increasing power, we have LR(0), SLR(1), LALR(1), LR(1).

A grammar is LR(k) if each sentence that it generates can be deterministically parsed in a single scan from left to right with at most k symbols of lookahead. This means that each reduction needed for the parse must be detectable on the basis of left context, the reducible phrase itself, and the k terminal symbols to its right.

By contrast, LL(k) parsers must select the production to be used in a reduction on the basis of left context, and the first k terminal symbols of the reducible phrase combined with its right context.

Thus, LR(k) parsers defer decisions until after complete reducible phrases are found (a characteristic of "bottom-up" parsers), while LL(k) parsers must predictively select a production on the basis of its first few symbols. Both techniques share the property of using complete left context in making decisions - a characteristic commonly associated with "top-down" parsers.

It is easily seen that any LL(k) grammar is also LR(k). It is less obvious that for each k there are LR(1) grammars that are not LL(k).


From "Compilers: Principles, techniques and tools", pp191-192.

A grammar whose parsing table has no multiply defined entries is said to be LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step of the parsing action decisions.

No ambiguous or left-recursive grammar can be LL(1). It can also be show that a grammar G is LL(1) if and only if whenever A -> alpha | BETA are two distinct productions of G with alpha and BETA representing sequences of terminals and non-terminals that do not start with A, the following conditions hold:

  1. For no terminal a do both alpha and BETA derive strings beginning with a.
  2. At most one of alpha and BETA can derive the empty string.
  3. if BETA ->* ø, where ø is the empty string, then alpha does not derive any string beginning with a terminal in FOLLOW(A).

Clearly, grammar (4.11) for arithmetic expressions is LL(1). Grammar (4.13), modelling if-then-else statements is not.

Grammar 4.11
E -> TE'
E'-> +TE' | ø
T -> FT'
T'-> +FT' | ø
F -> (E) | id
Grammar 4.13
S-> iEtS | iEtSeS | a
E-> b

These bits of definitions come from "Compiler contruction: theory and practice", but I can't find them formally presented anywhere.
Left-linear and right-linear grammars
These are subsets of context-free grammars.
If upper-case letters denote non-terminals and lower-case letters terminals, then these types of grammar only contain rules of the form
left-linearright-linear
A -> xB
or
A -> x
A -> Bx
or
A -> x
Left-recursive and right-recursive grammars
Again using upper-case letters to denote non-terminals and lower-case letters terminals, these types of grammar can contain rules of the form
left-recursiveright-recursive
A ->+ AwA ->+ wA
(CC:T&P p149)
Context-free and context-sensitive grammars
Recursive descent parsers
(CC:T&P p161) says these are closely related to LL(1) parsers. For a recursive-descent parser to work, its source grammar must be LL(1). See also p176.
An LL(1) parser is a "Left-to-right, Left-most" parser with 1 token of look-ahead. This is (also?) a deterministic top-down parser. (CC:T&P pp58-60)
Does 'LL(k)' refer to the grammar, the parser, or is it inherently both?
A bottom-up parser would be LR(k). A non-deterministic parser would have to make a guess and be prepared to backtrack and try something else if that production is found not to match the source program.
(CC:T&P p137) says that a deterministic top-down is not possible for every context-free grammar, but is sufficient for most purposes.
Bottom-up parsers - ch5(precedence parsers) & ch6(LR(k) parsers).
LR(k) grammars include LL(k) grammars (p275)
Back
Valid HTML 4.01
Any queries, write to me.
Last updated 14 March, 2011