This is intended to be a hyperfast review of the basic LR(k) and LALR(k) terminology and concepts; enough to be able to understand how a parser generator program works. To illustrate the concepts, I have written Lisp software for a LR(1) and LALR(1) parser generator and parser.
Given the tokens and productions of a grammar, there is an algorithm which can either generate a deterministic finite automaton (DFA) to parse the grammar efficiently or else determine our grammar is not LR(k)/LALR(k).
Each state in our DFA corresponds to a prefix of a possible derivation of our input string.
A state contains a set of items. Depending upon the next k symbols of lookahead, an item tells us to either
We'll end up with a sequence of productions tracing out the derivation of the sentence in reverse, or stop in an error state.
A formal grammar is defined to be the tuple,
where
= finite set of non-terminal symbols
= finite set of terminal symbols
= set of productions of the form,
S = non-terminal start symbol
denote terminals.
are strings of terminals.
denote non-terminals.
are either terminal or non-terminal symbols.
are strings of either.
is the length of the string x.
is the empty string.
means directly derives.
i.e. if
then
and
means transitive and reflexive closure of derives.
In derivations, rm denotes rightmost derivation, which means expand non-terminal symbols starting with
those on the right. lm denotes leftmost derivation.
A language L is context free if all productions
satisfy
That is, the expansion of the non-terminal symbol A does not depend on
the symbols surrounding it (its context).
The concatenation of two sets of strings is defined by
The first k terminals derived from the string
is defined by
First, define
for any string of terminals and non-terminals,
recursively by string concatenation,
For terminals, define FIRST to be the terminal itself,
It remains to compute FIRST for all non-terminals,
We do this iteratively. We define the function F() which will approximate FIRST. For terminals, define
For non-terminals, we define the zeroth approximation to FIRST by
Then for all non-terminals A simultaneously, we iterate the formula,
Eventually,
at which point we terminate and say,
The epsilon-free first k derived symbols, EFFk(), are found
by computing FIRST as above, but tagging all symbols which are derived using
epsilon productions of the form,
as non-epsilon free, and keeping only the epsilon-free derived terminals.
G is LR(k) iff the same lookahead and the same prefix imply a unique left end of the handle, right end of the handle and unique reduction.
Formally, the three properties,



implies



The above is equivalent to the statement knowing the prefix
and scanning the k terminal symbol lookahead,
we can locate
An item
is valid for viable prefix
if
That is,
is a derivation consistent with what we've seen so far in parsing (the prefix) and
consistent with the lookahead.
An item with nothing after the dot indicates a reduction by a production is OK at
this stage, but an
item with a symbol after the dot indicates we should shift more symbols before
making any reductions.
A language G is LR(k) iff for a viable prefix
given
is valid for
and there is no other valid item
valid for
with
(We may have
)
In other words, the sets of items cannot contain any shift-reduce or reduce-reduce
conflicts.
Define an operator which generates the set of items having the same prefix consistent
with the lookahead:


Denote the transitive closure of this operator as,
The reason for the closure is this: if an item
is valid for viable prefix
and lookahead
There is a derivation,
Then a production of the form
implies the derivation
which means the new item
is also valid for the same viable prefix
where
Define the GOTO on a set of items A as
We can construct the set of items valid for each viable prefix iteratively.
The set of items valid for viable prefix
is
where the initial state is
This gives us a GOTO graph having the sets of items as nodes, the GOTO's on symbols as transitions between states, and the viable prefix for a set of states as the path from the initial state to that node.
The algorithm is as follows:
repeat
for each item
for each
if
and
then
until no more items can be added to C.
We generate ACTION and GOTO tables which will be used by our finite state automaton (FSA) for parsing.
Let
and
Generate the ACTION table as




Define the GOTO table as
We start in the initial state,
where
Then, starting from a state consisting of a parse stack on the left, and an input stack on the right of remaining tokens to be parsed,
We define the transition functions of the DFA on as follows:
push the new symbol and new state and transition to:
where
transition from
where
to the state:
where
and
Halt and accept the input with no error.
where the viable prefix is,
and the error tokens are,
If the grammar is ambiguous, we don't have to rewrite the grammar to avoid it; many times we can resolve the conflicts in many cases by adding precedence and associativity rules to productions and input symbols. YACC does this.
If a state contains items having a shift/reduce conflict,
we reduce if the precedence of the production
is higher than the precedence of the lookahead in the input,
The default precedence of the production is the precedence of the rightmost
terminal of
Otherwise, we can set it to be the precedence of any symbol which gives a correct parse.
If there is a tie in precedence, we reduce if the production is left associative and shift if the production is right associative. Nonassociative breaks ties by declaring error.
If there is a reduce/reduce conflict, we reduce by the earlier listed production in the list.
We can add error productions to the language of the form,
where A is a major non-terminal, and
is allowed.
When the parser gets into an error state, it pops the parse stack until it finds a
state containing the item
shifts the imaginary token,
then discards input symbols until it sees lookahead
.
Then the parser performs a reduction on A and calls an error routine associated with
the error production.
If
the reduction happens immediately.
First a definition,
The core of an item
is
If we can merge the cores of the sets of items in the LR(k) goto graph without generating conflicts to give a much smaller graph, the grammar is of type LALR(k).
(1) Construct LR(1) sets of items I,
(2) For each core in C find all sets of items having that core and replace these sets by their union (remove duplicate cores and merge lookaheads).
(3) Create the action table as before. If there are any conflicts, the language is not LALR(1) and we exit.
(4) Otherwise, construct the goto table as follows.
For a merged set of items,
the cores of
are identical also since
have the same cores.
Let
then
And our LALR(1) collection of items is done. If an LR(1) parser has detected an error on the input, the corresponding LALR(1) parser may do additional reductions. However it will not shift the error symbol onto the stack, and will declare error before the next shift.
We can't create any new shift/reduce conflicts by merging cores. Suppose we did create a new shift/reduce conflict by merging sets of items:
and
Since cores were the same before the merge, some item must have had
and
for some lookahead c. But this is an LR(1) shift/reduce conflict, a contradiction. However, we can create a reduce/reduce conflict by merging cores.
Our sample grammar with lookahead k = 1 has two productions,
First compute FIRST and epsilon-free FIRST of all the token symbols,
For all i, we have for the terminals, Fi(ε) = {ε}, Fi(a) = {a}, Fi(b) = {b}.
For the non-terminal S, F0(S) = {ε}, since that's the only production which derives a terminal symbol on the left. Now we iterate
F1(S) = {ε} U { F0(S)+1F0(a)+1 F0(S)+1 F0(b), F0(ε) } = { {ε}+1{a}+1..., ε } = {ε,a}
F2(S) = {ε} U { F1(S)+1F1(a)+1 F1(S)+1 F1(b), F1(ε) } = { {ε,a}+1{a}+1..., ε } = {ε,a}
The set doesn't change anymore, so we stop and declare, FIRST(S) = F2(S) = {ε,a}
When we compute EFF, we use a similar algorithm, but any derivations with leading epsilons were disallowed, so EFF(ε) = EFF(S) = {}.
| Symbol | FIRST1() | EFF1() |
| ε | {ε} | {} |
| a | {a} | {a} |
| b | {b} | {b} |
| S | {ε, a} | {} |
Now we compute the sets of LR(1) items using GOTO and CLOSURE operations.
| State | GOTO() from state | Viable Prefix | Kernel | Closure | |||||||||||||||
| 0 | - | ε | 1. [S' →.S,ε] |
|
|||||||||||||||
| 1 | State 0 GOTO(S) | S |
|
|
|||||||||||||||
| 2 | State 1 GOTO(a) | Sa |
|
|
|||||||||||||||
| 3 | State 2 GOTO(S) | SaS |
|
|
|||||||||||||||
| 4 | State 6 GOTO(a), State 3 GOTO(a) | SaSa |
|
|
|||||||||||||||
| 5 | State 3 GOTO(b) | SaSab |
|
|
|||||||||||||||
| 6 | State 4 GOTO(S) | SaSab |
|
|
|||||||||||||||
| 7 | State 6 GOTO(b) | SaSaSb |
|
|
Let's tabulate the GOTO's on the states and symbols, which we'll need to calculate the state after a shift (for GOTO's on terminal symbols) and also the state after a reduce (for GOTO's on non-terminal symbols).
| State | ON symbol GOTO state | |||
| S | e | a | b | |
| 0 | 1 | - | - | - |
| 1 | - | - | 2 | - |
| 2 | 3 | - | - | - |
| 3 | - | - | 4 | 5 |
| 4 | 6 | - | - | - |
| 5 | - | - | - | - |
| 6 | - | - | 4 | 7 |
| 7 | - | - | - | - |
Now we compute the ACTION tables by looking at the types of items. If the item has nothing after the dot, it's a reduction provided we see the item's lookahead. Otherwise, we compute u = EFF() of all the symbols after the item's dot including the item's k-symbol lookahead. If we see the u in the input, we call for a shift, and a goto the next state on u.
| State | Items | EFF Lookahead (shift only) | Lookahead Symbol/Action |
| 0 |
|
|
|
| 1 |
|
|
|
| 2 |
|
|
|
| 3 |
|
|
|
| 4 |
|
|
|
| 5 |
|
|
|
| 6 |
|
|
|
| 7 |
|
|
|
Collecting together the shift and reduce actions for each state, and the goto on each non-terminal gives us the ACTION and GOTO tables: the heart of the parser. Error states are easily associated with error messages.
| State | ACTION | GOTO | ||
| ε | a | b | S | |
| 0 | r2 | r2 | error - b not expected | 1 |
| 1 | accept | s2 | error - b not expected | error |
| 2 | error - expecting a or b | r2 | r2 | 3 |
| 3 | error - expecting a or b | s4 | s5 | error |
| 4 | error - expecting a or b | r2 | r2 | 6 |
| 5 | r1 | r1 | error - b not expected | error |
| 6 | error - premature end of input | s4 | s7 | error |
| 7 | error | r1 | r1 | error |
Now let's give an example of parsing a grammatical sentence.
| Parse Stack | Action | |
| (0 | ε a a b b) | r2: S → ε | |
| (0 S 1 | a a b b) | s2: | |
| (0 S 1 a 2 | a b b) | r2: S → ε | |
| (0 S 1 a 2 S 3 | a b b) | s4 | |
| (0 S 1 a 2 S 3 a 4 | b b) | r2: S → ε | |
| (0 S 1 a 2 S 3 a 4 S 6 | b b) | s7 | |
| (0 S 1 a 2 S 3 a 4 S 6 b 7 | b) | r1: S → SaSb | |
| (0 S 1 a 2 S 3 | b) | s5 | |
| (0 S 1 a 2 S 3 b 5| ) | r1: S → SaSb | |
| (0 S 1| ) | accept: S' → S |
We've just traced out a rightmost derivation in reverse:
S' → S → SaSb → SaSaSbb → SaSabb → Saabb → aabb
Now let's give an example of parsing an UNgrammatical sentence.
| Parse Stack | Action |
| (0 | a a b) | r2: S → ε |
| (0 S 1 | a a b) | s2: |
| (0 S 1 a 2 | a b) | r2: S → ε |
| (0 S 1 a 2 S 3 | a b) | s4 |
| (0 S 1 a 2 S 3 a 4 | b) | r2: S → ε |
| (0 S 1 a 2 S 3 a 4 S 6 | b) | s7 |
| (0 S 1 a 2 S 3 a 4 S 6 b 7 | b) | error: Expecting a or b next in the derivation S'=>SaSaSb... |
Now we compute the sets of LALR(1) items by merging the cores of the LR(1) items above. New state numbers show which LR(1) states were merged. There are no reduce/reduce conflicts, and the parse tables become much smaller.
| State | GOTO From State | Viable Prefix | Items |
| 0 | - | ε |
|
| 1 | 0 GOTO(S) | S |
|
| 24 | 1 GOTO(a), 3 GOTO(a) | Sa |
|
| 36 | 24 GOTO(S) | SaS |
|
| 57 | 36 GOTO(b) | SaSb |
|
Collecting together the shift and reduce actions for each state, and the goto on each non-terminal gives us the ACTION and GOTO tables: the heart of the parser. Error states are easily associated with error messages.
| State | ACTION | GOTO | ||
| ε | a | b | S | |
| 0 | r2 | r2 | error - b not expected | 1 |
| 1 | accept | s24 | error - b not expected | error |
| 24 | error - expecting a or b | r2 | r2 | 36 |
| 36 | error - expecting a or b | s24 | s57 | error |
| 57 | r1 | r1 | r1 | error |
Copyright © 1986-2009 by Sean Erik O'Connor. All Rights Reserved. last updated 01 Jan 09.