Back
Grammar notation
The syntax of the grammars used here is quite simple, but probably not quite
the same as any used in text books. In grammar 3 below, each rule has been
numbered for ease of reference, but we won't generally do so.
The grammar syntax is
- Literal characters are quoted.
- Anything which is not quoted is a reference to another rule.
- Optional items are enclosed in square brackets, i.e. there will be 0 or 1 occurrence of that item.
- Square brackets followed by an asterisk indicates 0 or more occurrences of that item.
- Square brackets followed by a plus indicates 1 or more occurrences of that item.
- Vertical bars separate items in a list of possible selections. There are
generally only two possibilities but there is no reason why you can't have more.
- Parentheses are used in the same way as in arithmetic expressions, to group
parts of a rule which must be evaluated separately.
Grammar 3
| 1 | statement | -> expression "\n" |
| 2 | expression | -> term [ add_op term ]* |
| 3 | term | -> factor [ mult_op factor ]* |
| 4 | factor | -> [ "+" | "-" ] ( primary | "(" expression ")" ) |
| 5 | mult_op | -> "*" | "/" |
| 6 | add_op | -> "+" | "-" |
| 7 | primary | -> number | name |
| 8 | number | -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ] |
| 9 | name | -> letter [ letter | digit ]* |
Or in words,
-
Rule 1. Each statement consists of an expression followed by
a newline. Newlines are what is used in C as a line terminator, where other languages
may use a carriage-return or just return the line length with no particular terminator.
-
Rule 2. An expression consists of a term, possibly followed by
a series of other terms separated by add_ops.
-
Rule 6. An add_op is either an addition operator or a subtraction operator.
-
Rule 3. A term consists of a factor, possibly followed by a
series of other factors separated by mult_ops.
-
Rule 5. A mult_op is either a multiplication operator or a division operator.
-
Rule 4. A factor is either a bracketed expression or a primary
expression. In either case, it may be proceeded by a plus or minus sign. These
signs are unary operators - i.e. take only one operand as opposed to the two
operands required for addition, multiplication etc.
-
Rule 7. A primary expression is a simple entity, which may be either
a number or the name of a symbol.
-
Rule 8. A number - a literal numerical constant - can be split
into two main components - the mantissa and the exponent. The mantissa must have
an 'integer' part consisting of one or more digits, and may have a fractional part
consisting of a decimal point followed by 0 or more digits. The exponent consists
of a lower- or upper-case 'e' followed by one or more digits which may be separated
from the 'e' by a sign.
-
Rule 9. A name must start with a letter, but may be followed by a
sequence of letters and digits.
I'll assume you know what letters and digits are.
The rule given for numbers is only a suggestion to guide anyone who has to parse
their own floating point numbers. The rules followed by a library function may
not match this precisely but this is unimportant provided the user knows what is
required of him/her/it.
For anyone momentarily confused by the description of numbers as literal numerical
constants, possible item descriptions include combinations of
literal symbolic | numerical string character | constants variables |
Other data types (the middle column) are possible, but literal variables of any
kind are not possible - or at least not to be contemplated by any sane programmer.
The method given above for writing a grammar is not the
only one used. In 'purist' terms, the expression rule
| expression | -> term [ ( "+" | "-" ) term ]* |
| could be written |
| expression | -> term OptTerm |
| OptTerm | -> ø |
| OptTerm | -> AddOp factor OptTerm |
| AddOp | -> "+" |
| AddOp | -> "-" |
where alternative forms of a rule are written out as separate rules and ø
is the empty set. The method of writing is different, but the meaning is identical.
Glossary
These are non-technical definitions of terms that I have used. They are
presented more or less in order of increasing complexity.
A longer, more technical list of parser- and grammar-related terms follows.
- Lexical analysis
- is the grouping of characters to form words, known as symbols or
tokens, which are used as input to the parser. These tokens form
'atomic' chunks of code. Smaller chunks cease would to convey any meaning
whatever. "3.14" and "theta" are each single tokens which consist of 4 and 5
characters respectively, but a digit on its own may indicate tens, hundreds, or
be part of a variable name.
This stage may also categorize the token as being a name, number, operator or
punctuation for instance, and I use it to return binary representations of
numbers, but attaching meaning to tokens is generally best left to the parser.
Lexical analysis is the first step in parsing a statement, and is really only a
very simple form of parsing.
- A lexical analyzer, lexer, or tokenizer
- carries out lexical analysis - i.e. tokenizes the source.
- Syntax analysis or parsing
- is the analysis of a source program, which may be a single line, to extract
the underlying structure of that program. No meaning is attached to anything
at this stage, but the data is genarally stored for later use. The analysis
phase that does attach meaning to the syntax - semantic analysis - is
sometimes mixed in along with the parser, and the intermediate data structure
will then be simple and short-lived, but if semantic analysis is a separate
process then the data structure may be complex and may even be written to a file.
- A syntax analyser or parser
- carries out syntax analysis - i.e. parses the stream of tokens output by the
lexical analyzer.
- An interpreter
- carries out certain tasks based on some representation of a program. That
representation may be the source as written by the user, or it may be an
intermediate data structure produced by a parser. In the latter case, parts of
the data structure it produces may used (interpreted) and discarded before
parsing is complete.
An executable program - a .exe file on PCs or VAXs - is really just a data
structure which can be interpreted by a specialised piece of hardware known as a
CPU, but an interpreter is generally a program which operates on a more abstract
level. It can be thought of as imitating a virtual machine - one which
doesn't actually exist.
- A compiler
- is a parser whose output data structure is produced and stored in such a way
that it may be efficiently executed many times without having to parse the
source program again.
- prefix, postfix and infix notation
- refers to the position of an operator relative to its operands.
| prefix | infix | postfix |
| + a b | a + b | a b + |
The normal way of writing things is infix, but postfix is the easiest to write
an evaluation program for.
Prefix and postfix notation do not require parentheses to show how to evaluate
an expression. Infix does.
Prefix notation was called 'polish' notation after J.Lukasiewicz who introduced
it in the 1920's, and this led to postfix being referred to as 'reverse polish'.
- A grammar
- is a description of a language. It consists largely of a set of production
rules showing valid ways to transform the non-terminal (abstract)
symbols on the left-hand side into the mixture of terminal (concrete) and
non-terminal (hopefully less abstract) symbols on the right-hand side. There
must be a production rule for every non-terminal. The set of terminal and
non-terminal symbols, along with a designated start symbol, complete the grammar.
Note that a grammar is not a complete description of a language, it only
specifies what a program should look like without giving any meaning to anything.
So, although conforming to the grammar is necessary for a program to be
valid, it is not sufficient. For example, many languages, require
variables to be declared prior to being used, so "a = 27" may conform to the
grammar but will not be valid unless "a" has been declared to be a numerical
variable. This requires checks that are outside the scope of most grammars.
There is a compromise here between the ease or difficulty of checking things
during parsing and during execution.
In the example grammar
Grammar 3
| statement | -> expression "\n" |
| expression | -> term [ add_op term ]* |
| term | -> factor [ mult_op factor ]* |
| factor | -> [ "+" | "-" ] ( primary | "(" expression ")" ) |
| mult_op | -> "*" | "/" |
| add_op | -> "+" | "-" |
| primary | -> number | name |
| number | -> [ digit ]+ [ "." [ digit ]* ] [ ( "e" | "E" ) [ "+" | "-" ] [ digit ]+ ] |
| name | -> letter [ letter | digit ]* |
we have
| non-terminals | terminals | start symbol |
statement, expression, term, factor, mult_op, add_op,
primary, identifier, number | | |
"\n", "+", "-", "*", "/", "(", ")", ".", "e", "E",
letter, digit | statement | |
A parser might not actually see letters and digits at all because the
lexical analyzer returns names and numbers, so whether these should be terminal
symbols, or not in the grammar at all, is rather variable.
Right-recursive forms of a grammar are to be preferred over left-recursive ones.
See left- or right-recursion?.
- Recursive-descent parsing
- is a top-down method of syntax analysis in which each rule of a grammar
is represented by a procedure which only knows about that rule. The procedure
which starts off the parsing process corresponds to the rule which represents
the whole program. That decides which rule to apply out of those that
might be valid and applies it, and so on. At each step, there must
either be only one rule which applies or we must be able to choose the correct
rule by looking at the next token in the source. When we reach a rule which
is looking for a terminal symbol, we may take some action. As called procedures
return - corresponding to the successful recognition of some structure - other
actions may be taken. If these actions correspond to the semantics of those
structures, the program is being interpreted as it is parsed. If they
correspond to the production of a sequence of code, or some intermediate data
structure, it is more of a compilation.
The 'recursive' bit comes about because grammar is recursive. For instance,
expressions contain sub-expressions.
The 'descent' bit refers to the top-down approach of starting with the rule
representing the whole program and searching for smaller and smaller
bits.
- The syntax
- of a program restricts what it may look like, detailing the
superficial form of a program, laying down rules of punctuation and showing how
a program may be built up out of items and characters.
- The semantics
- of a program is what it means, defining the actions which must be
carried out when the program is executed.
- A syntax tree or parse tree
- is a graphical method of showing the structure of a program or fragment of
a program. For instance,

Glossary
I've collected together the following, relatively technical, definitions because
some books on parsers or compilers don't bother to define them or don't define
them clearly, and I have difficulty understanding little bits of things if I
don't at least know where they stand in relation to 'the big picture'. As an
illustration of this complaint, note that most of the following definitions,
from "Understanding and writing compilers",
are given in chapter fifteen of that book.
In the heirarchy of grammars classifications, due to Noam Chomsky in 1957, there
is a progression from most general (type 0) to most restrictive (type 3), each
being a subset of the preceeding type. Note that other books define numerous
other types of grammar, but these ones are common.
- A sentence is a string (a sequence) of symbols.
- A language is a set of sentences.
- A grammar is a finite set of productions, which contain
both terminal and non-terminal symbols. One of the
non-terminal symbols is the sentence symbol of the grammar (the
start symbol)
- A production is of the form
A -> alpha
where A is a non-empty string of non-terminal symbols and
alpha is a string (possibly empty) of terminal and/or non-terminal
symbols. A production expresses the assertion that wherever the string A
occurs in a partially derived sentence, it may be replaced by the string
alpha.
- A type 0 grammar contains productions of the form
A -> alpha where A is a non-empty string of
non-terminal symbols and alpha is a string of terminal and/or
non-terminal symbols.
- A type 1 or context sensitive grammar contains only
productions of the form A -> alpha where A is
a non-empty string of non-terminal symbols and alpha is a string of
terminal and/or non-terminal symbols, and the length of the right-hand-side
string alpha is not less than the length of the left-hand-side string
A.
Alternatively, the productions of type 1 grammars may be defined
to be of the form
beta A gamma -> beta alpha gamma
where each production replaces a single non-terminal symbol in a particular
context.
- A type 2 or context free grammar contains only productions
of the form A -> alpha where A is a single
non-terminal symbol and alpha is a string of terminal and/or non-terminal
symbols.
- A type 3 or regular expression grammar contains only
productions which are one of the forms
A -> a
A -> a B
in which A and B are single non-terminal symbols and a
is a single terminal symbol.
Alternatively, the second of these productions may be defined to be of the form
A -> C d, i.e. left-recursive rather
than right-recursive. A type 3 grammar must obey one definition or the other:
if some definitions are left- and some right-recursive the grammar is type 2.
- If a string alpha may be changed into a string beta by a single
application of a single production, then beta is directly
derived from alpha, written alpha -> beta
- A derivation is a sequence of strings alpha0,
alpha1,alpha2...alphan,
such that alphai-1 -> alphai, 0<i<=n.
- If we know that there is at least one step in a derivation (n>=1) then
alphan is eventually derived from
alpha0, written
alpha0 ->+ alphan
- If we know that there may be no steps in a derivation (n>=0) then
alphan is derived from alpha0, written
alpha0 ->* alphan
- A sentential form is a string which is derived from the sentence
symbol - i.e. if S ->* alpha then alpha is
a sentential form.
- A sentence is a sentential form which contains only terminal symbols.
- If A ->+ alpha A then the symbol A
is right-recursive.
- If A ->+ A beta then the symbol A
is left-recursive.
- If A ->+ alpha A beta then
the symbol A is self-embedding.
- Two grammars are equivalent if they describe the same language -
i.e. if they generate exactly the same set of sentences.
- An ambiguous grammar is one with which it is possible to show two
or more distinct derivations for a single sentence - i.e. two or more distinct
parse trees.
- If a language is described by a type 2 grammar in which there are no
self-embedding symbols, then the language can also be described by a type 3 grammar.
- Bottom-up analysis looks for patterns in the input which corespond
to phrases in the grammar.
- Top-down analysis assumes that the input will be a <statement>. The
grammar shows how a <statement> phrase can be built up of sub-phrases, and to
find a statement the analyser merely searches for each of the sub-phrases in turn.
To find a sub-phrase the analyser searches for each of its sub-sub-phrases, and
so on until the search is reduced to a matter of searching for a single item,
which it looks for in the input.
Bibliography
These are not necessarily recommendations, more a list of books-I-have-read, and
which have contributed in varying degrees to my knowledge of compilers and
grammars.
- "Compiler contruction: theory and practice", William A.
Barret and John D. Couch, Science Research Associates Inc., 1979,
ISBN 0-574-21160-8.
- "Pascal: User manual and report", Kathleen Jensen and
Niklaus Wirth, Springer-Verlag, 1985,
ISBN 0-387-96048-1 / 3-540-96048-1.
- "The UNIX programming environment", Brian W. Kernighan and
Rob Pike, Prentice-Hall, 1984, ISBN 0-13-937681-X
- "C - a reference manual", Samuel P. Harbison and Guy L.
Steele Jr., 1984, ISBN 0-13-110008-4
- "Understanding and writing compilers", Richard Bornat,
MacMillan, 1979, ISBN 0-333-21732-2.
- "Compiler construction, an advanced course", F.L.Bauer
and J.Eickel, Springer-Verlag, 1976,
ISBN 0-387-08046-0 / 3-540-08046-5.
- "Lecture notes in computer science #50 - A Concurrent
Pascal compiler for minicomputers", Alfred C. Hartmann, Springer-Verlag, 1977,
ISBN 0-387-08240-9 / 3-540-08240-9.
- "Compilers: Principles, techniques and tools", Alfred V.
Aho, Ravi Sethi and Jeffrey D. Ullman, Addison-Wesley, 1986,
ISBN 0-201-10088-6.
Left- or right-recursion?
With left-recursive productions like
expr -> expr "+" term there is a danger that the procedure which parses an expr will
recursively apply itself until it runs out of stack space. It can be shown
(apparently) that left-recursive grammars are a sub-set of right-recursive
grammars, implying that left-recursive productions can always be re-written in
right-recursive forms.
For instance,
expr -> ( expr "+" term ) | term
which is clearly left-recursive, can be rewritten as
expr -> expr OptionalTerm
OptionalTerm -> ( "+" term ) | E
where E is the empty set, which is clearly right-recursive. Both
forms indicate that an expr consists of a series of terms separated
by +s, so the grammars are equivalent.
Now, we treat the first term we come to as a term, not as
part of a larger expr, and we can take some action on that basis. In this
way, the un-parsed part of the source will definitely continue to get smaller.
The following are all ways of expressing the same grammar:
Grammar 1
| expression | -> [ term ( "+" | "-" ) ]* term |
| term | -> [ factor "*" ]* factor |
| factor | -> name | "(" expression ")" |
| name | -> letter |
Grammar 1
| expression | -> [ expression ( "+" | "-" ) ] term |
| term | -> [ term "*" ] factor |
| factor | -> name | "(" expression ")" |
| name | -> letter |
Grammar 1
| expression | -> term [ ( "+" | "-" ) term ]* |
| term | -> factor [ "*" factor ]* |
| factor | -> name | "(" expression ")" |
| name | -> letter |
Grammar 1
| expression | -> expression [ ( "+" | "-" ) term ] |
| term | -> term [ "*" factor ] |
| factor | -> name | "(" expression ")" |
| name | -> letter |
The first two are left-recursive, and the second, right-recursive. See
"Compilers. Principles, techniques and Tools",
pp47-48.
From "Compiler construction, an advanced course",
pp85-86
In order of increasing power, we have LR(0), SLR(1), LALR(1), LR(1).
A grammar is LR(k) if each sentence that it generates can be deterministically
parsed in a single scan from left to right with at most k symbols of lookahead.
This means that each reduction needed for the parse must be detectable on the
basis of left context, the reducible phrase itself, and the k terminal symbols
to its right.
By contrast, LL(k) parsers must select the production to be used in a reduction
on the basis of left context, and the first k terminal symbols of the reducible
phrase combined with its right context.
Thus, LR(k) parsers defer decisions until after complete reducible phrases are
found (a characteristic of "bottom-up" parsers), while LL(k) parsers must
predictively select a production on the basis of its first few symbols. Both
techniques share the property of using complete left context in making
decisions - a characteristic commonly associated with "top-down" parsers.
It is easily seen that any LL(k) grammar is also LR(k). It is less obvious that
for each k there are LR(1) grammars that are not LL(k).
From "Compilers: Principles, techniques and tools",
pp191-192.
A grammar whose parsing table has no multiply defined entries is said to be
LL(1). The first "L" in LL(1) stands for scanning the input from left to right,
the second "L" for producing a leftmost derivation, and the "1" for using one
input symbol of lookahead at each step of the parsing action decisions.
No ambiguous or left-recursive grammar can be LL(1). It can also be show that a
grammar G is LL(1) if and only if whenever
A -> alpha | BETA are two distinct productions of G
with alpha and BETA representing sequences of terminals and
non-terminals that do not start with A, the following conditions hold:
- For no terminal a do both alpha and BETA derive strings
beginning with a.
- At most one of alpha and BETA can derive the empty string.
- if BETA ->* ø, where ø is the empty string, then
alpha does not derive any string beginning with a terminal in
FOLLOW(A).
Clearly, grammar (4.11) for arithmetic expressions is LL(1). Grammar (4.13),
modelling if-then-else statements is not.
Grammar 4.11
| E | -> TE' |
| E' | -> +TE' | ø |
| T | -> FT' |
| T' | -> +FT' | ø |
| F | -> (E) | id |
|
Grammar 4.13
| S | -> iEtS | iEtSeS | a |
| E | -> b |
|
These bits of definitions come from
"Compiler contruction: theory and practice",
but I can't find them formally presented anywhere.
- Left-linear and right-linear grammars
- These are subsets of context-free grammars.
If upper-case letters denote non-terminals and lower-case letters terminals,
then these types of grammar only contain rules of the form
| left-linear | right-linear |
A -> xB or A -> x |
A -> Bx or A -> x |
- Left-recursive and right-recursive grammars
- Again using upper-case letters to denote non-terminals and lower-case letters
terminals, these types of grammar can contain rules of the form
| left-recursive | right-recursive |
| A ->+ Aw | A ->+ wA |
(CC:T&P p149)
- Context-free and context-sensitive grammars
- Recursive descent parsers
- (CC:T&P p161) says these are closely related to LL(1) parsers. For a
recursive-descent parser to work, its source grammar must be LL(1). See also p176.
An LL(1) parser is a "Left-to-right, Left-most" parser with 1 token of
look-ahead. This is (also?) a deterministic top-down parser. (CC:T&P pp58-60)
Does 'LL(k)' refer to the grammar, the parser, or is it inherently both?
A bottom-up parser would be LR(k). A non-deterministic parser would have to
make a guess and be prepared to backtrack and try something else if that
production is found not to match the source program.
(CC:T&P p137) says that a deterministic top-down is not possible for every
context-free grammar, but is sufficient for most purposes.
Bottom-up parsers - ch5(precedence parsers) & ch6(LR(k) parsers).
LR(k) grammars include LL(k) grammars (p275)
Back
Any queries, write to me.
Last updated 14 March, 2011