Review of LR(k) and LALR(k) Parsing Theory

Sean E. O'Connor

Introduction

This is intended to be a hyperfast review of the basic LR(k) and LALR(k) terminology and concepts; enough to be able to understand how a parser generator program works. To illustrate the concepts, I have written Lisp software for a LR(1) and LALR(1) parser generator and parser.

Given the tokens and productions of a grammar, there is an algorithm which can either generate a deterministic finite automaton (DFA) to parse the grammar efficiently or else determine our grammar is not LR(k)/LALR(k).

Each state in our DFA corresponds to a prefix of a possible derivation of our input string.

A state contains a set of items. Depending upon the next k symbols of lookahead, an item tells us to either

  1. Reduce a set of symbols after the prefix using one of the productions and goto the next state.
  2. Shift more symbols of input onto a stack and goto the next state.

We'll end up with a sequence of productions tracing out the derivation of the sentence in reverse, or stop in an error state.

Terminology

A formal grammar is defined to be the tuple $G = \left( N, \Sigma, P, S \right)$ where N = finite set of non-terminal symbols, $\Sigma$ = finite set of terminal symbols, P = set of productions of the form $\alpha \rightarrow \beta$, S = non-terminal start symbol. The collection $\{ a, b, c, d, \ldots, 0, 1, \ldots, 9 \}$ denote terminals, while $\{ u, v, w, x, y, z \}$ are strings of terminals. $\{ A, B, C, D, S, \ldots \}$ denote non-terminals, $\{ U, V, W, X, Y, Z \}$ are either terminal or non-terminal symbols. $\{ \alpha, \beta, \gamma, \delta, \ldots \}$ are strings of either. $| x |$ is the length of the string x. $\epsilon$ is the empty string. $\Rightarrow$ means directly derives, i.e. if we have a production $\beta \rightarrow \delta$ then $\alpha \beta \gamma \Rightarrow \alpha \delta \gamma$ and $\stackrel{*}{\Rightarrow}$ means transitive and reflexive closure of derives. In derivations, rm denotes rightmost derivation, which means expand non-terminal symbols starting with those on the right. lm denotes leftmost derivation.

A language L is context free if all productions $\alpha \rightarrow \beta$, satisfy $| \alpha | = 1$, i.e. the left hand side of a production is a single non-terminal, i.e. a production is independent of the symbols surrounding it (its context).

The concatenation of two sets of strings is defined by $$ L_1 \oplus_k L_2 = \{ w | \exists x \in L_1, y \in L_2, \text{ such that } {\LARGE \{ } \begin{array}{rl} w = xy \text{ if } |xy| \le k \text{ else } \\ w = \text{ first k symbols of xy } \end{array} $$

First k Derived Terminals

The first k terminals derived from the string $\alpha$ is defined by $FIRST_k \left( \alpha \right) = \{ x | \alpha \mathrel{ \substack{* \\ \Rightarrow \\ lm} } x \beta \: and \: |x| = k \: or \: \alpha \mathrel{ \substack{* \\ \Rightarrow \\ lm} } x \: and \: |x| \lt k \} $

First, define equation143. for any string of terminals and non-terminals, equation145. recursively by string concatenation,

equation147.

For terminals, define FIRST to be the terminal itself,

equation149.

It remains to compute FIRST for all non-terminals, equation151.

We do this iteratively. We define the function F() which will approximate FIRST. For terminals, define

equation153.

For non-terminals, we define the zeroth approximation to FIRST by

equation155.

Then for all non-terminals A simultaneously, we iterate the formula,

equation157.

Eventually,

equation159.

at which point we terminate and say,

equation161.

The epsilon-free first k derived symbols, EFFk(), are found by computing FIRST as above, but tagging all symbols which are derived using epsilon productions of the form, equation039. as non-epsilon free, and keeping only the epsilon-free derived terminals.

LR(k) Definition

G is LR(k) iff the same lookahead and the same prefix imply a unique left end of the handle, right end of the handle and unique reduction.

Formally, the three properties,

  1. equation041.
  2. equation043.
  3. equation045.

implies

  1. equation047.
  2. equation049.
  3. equation051.

LR(k) Property

The above is equivalent to the statement knowing the prefix equation053. and scanning the k terminal symbol lookahead, equation055. we can locate

  1. The right end of the handle.
  2. The left end of the handle.
  3. Which reduction to make.

Valid Items

An item equation057. is valid for viable prefix equation059. if equation061.

That is, equation063. is a derivation consistent with what we've seen so far in parsing (the prefix) and consistent with the lookahead. An item with nothing after the dot indicates a reduction by a production is OK at this stage, but an item with a symbol after the dot indicates we should shift more symbols before making any reductions.

LR(k) Equivalent Definition

A language G is LR(k) iff for a viable prefix equation065. given equation067. is valid for equation069. and there is no other valid item equation071. valid for equation073. with equation075.

(We may have equation077. ) In other words, the sets of items cannot contain any shift-reduce or reduce-reduce conflicts.

Closure on Sets of Items

Define an operator which generates the set of items having the same prefix consistent with the lookahead: equation079.

  1. equation081.
  2. Otherwise, equation083.

Denote the transitive closure of this operator as, equation085.

The reason for the closure is this: if an item equation114. is valid for viable prefix equation116. and lookahead equation118.

There is a derivation, equation120. Then a production of the form equation122. implies the derivation equation124. which means the new item equation126. is also valid for the same viable prefix equation128. where equation130.

GOTO on Sets of Items

Define the GOTO on a set of items A as equation087.

Sets of Items Construction

We can construct the set of items valid for each viable prefix iteratively. The set of items valid for viable prefix equation089. is equation091.

where the initial state is equation093.

This gives us a GOTO graph having the sets of items as nodes, the GOTO's on symbols as transitions between states, and the viable prefix for a set of states as the path from the initial state to that node.

The algorithm is as follows:

equation245.

repeat

for each item equation246.

for each equation247. if equation248. and equation249. then equation250. until no more items can be added to C.

Fig. 1.

LR(k) Parsing Table Generation

We generate ACTION and GOTO tables which will be used by our finite state automaton (FSA) for parsing.

Let equation095. and equation097.

Generate the ACTION table as equation099.

  1. equation101. equation103.
  2. equation105. equation107.
  3. equation109.
  4. equation111.

Define the GOTO table as equation113.

LR(k) Parsing Algorithm

We start in the initial state,

equation163.

where

equation165.

Then, starting from a state consisting of a parse stack on the left, and an input stack on the right of remaining tokens to be parsed,

equation167.

We define the transition functions of the DFA on as follows:

equation169.

push the new symbol and new state and transition to:

equation171.

where equation173.

equation175.

transition from

equation177.

where equation179. to the state:

equation181.

where equation183. and equation185.

equation187.

Halt and accept the input with no error.

equation189.

equation191.

where the viable prefix is, equation193. and the error tokens are, equation195.

Ambiguous Grammars

If the grammar is ambiguous, we don't have to rewrite the grammar to avoid it; many times we can resolve the conflicts in many cases by adding precedence and associativity rules to productions and input symbols. YACC does this.

If a state contains items having a shift/reduce conflict,

equation197. equation199.

we reduce if the precedence of the production equation201. is higher than the precedence of the lookahead in the input, equation203. The default precedence of the production is the precedence of the rightmost terminal of equation205. Otherwise, we can set it to be the precedence of any symbol which gives a correct parse.

If there is a tie in precedence, we reduce if the production is left associative and shift if the production is right associative. Nonassociative breaks ties by declaring error.

If there is a reduce/reduce conflict, we reduce by the earlier listed production in the list.

Error Recovery

We can add error productions to the language of the form,

equation207. where A is a major non-terminal, and equation209. is allowed.

When the parser gets into an error state, it pops the parse stack until it finds a state containing the item equation211. shifts the imaginary token, equation213. then discards input symbols until it sees lookahead equation215. . Then the parser performs a reduction on A and calls an error routine associated with the error production. If equation217. the reduction happens immediately.

Generating LALR(1) Sets Of Items

First a definition,

The core of an item equation240. is equation241.

If we can merge the cores of the sets of items in the LR(k) goto graph without generating conflicts to give a much smaller graph, the grammar is of type LALR(k).

(1) Construct LR(1) sets of items I,

equation219.

(2) For each core in C find all sets of items having that core and replace these sets by their union (remove duplicate cores and merge lookaheads).

equation220.

(3) Create the action table as before. If there are any conflicts, the language is not LALR(1) and we exit.

(4) Otherwise, construct the goto table as follows.

For a merged set of items,

equation223.

the cores of

equation224.

are identical also since

equation225.

have the same cores.

Let

equation226.

Fig. 4.

then

equation227.

And our LALR(1) collection of items is done. If an LR(1) parser has detected an error on the input, the corresponding LALR(1) parser may do additional reductions. However it will not shift the error symbol onto the stack, and will declare error before the next shift.

We can't create any new shift/reduce conflicts by merging cores. Suppose we did create a new shift/reduce conflict by merging sets of items:

equation228.

and

equation229.

Since cores were the same before the merge, some item must have had

equation230.

and

equation231.

for some lookahead c. But this is an LR(1) shift/reduce conflict, a contradiction. However, we can create a reduce/reduce conflict by merging cores.

Example: Generating LR(1) Sets Of Items and Parsing Sentences

Our sample grammar with lookahead k = 1 has two productions,

  1. S → SaSb
  2. S → ε

First compute FIRST and epsilon-free FIRST of all the token symbols,

For all i, we have for the terminals, Fi(ε) = {ε}, Fi(a) = {a}, Fi(b) = {b}.

For the non-terminal S, F0(S) = {ε}, since that's the only production which derives a terminal symbol on the left. Now we iterate

F1(S) = {ε} U { F0(S)+1F0(a)+1 F0(S)+1 F0(b), F0(ε) } = { {ε}+1{a}+1..., ε } = {ε,a}

F2(S) = {ε} U { F1(S)+1F1(a)+1 F1(S)+1 F1(b), F1(ε) } = { {ε,a}+1{a}+1..., ε } = {ε,a}

The set doesn't change anymore, so we stop and declare, FIRST(S) = F2(S) = {ε,a}

When we compute EFF, we use a similar algorithm, but any derivations with leading epsilons were disallowed, so EFF(ε) = EFF(S) = {}.

Symbol FIRST1() EFF1()
ε {ε} {}
a {a} {a}
b {b} {b}
S {ε, a} {}

Now we compute the sets of LR(1) items using GOTO and CLOSURE operations.

State GOTO() from state Viable Prefix Kernel Closure
0 - ε 1. [S' →.S,ε]
From Item Item Lookahead
1 2. [S → .SaSb,ε] FIRST(ε)
2 3. [S → .SaSb,a] FIRST(aSbε)
1 4. [S → .ε,ε] FIRST(ε)
2 5. [S → .ε,a] FIRST(aSbε)
1 State 0 GOTO(S) S
  1. [S' → S.,ε]
  2. [S → S.aSb,ε]
  3. [S → S.aSb,a]
From Item Item Lookahead
None None None
2 State 1 GOTO(a) Sa
  1. [S → Sa.Sb,ε]
  2. [S → Sa.Sb,a]
From Item Item Lookahead
1 3. [S → .SaSb,b] FIRST(bε)
2 4. [S → .SaSb,a] FIRST(ba)
1 5. [S → .ε,b] FIRST(bε)
3 6. [S → .ε,a] FIRST(aSbb)
3 State 2 GOTO(S) SaS
  1. [S → SaS.b,ε]
  2. [S → SaS.b,a]
  3. [S → S.aSb,b]
  4. [S → S.aSb,a]
From Item Item Lookahead
None None None
4 State 6 GOTO(a), State 3 GOTO(a) SaSa
  1. [S → Sa.Sb,a]
  2. [S → Sa.Sb,b]
From Item Item Lookahead
1 3. [S → .SaSb,b] FIRST(ba)
2 4. [S → .ε,b] FIRST(bb)
2 5. [S → .ε,b] FIRST(bb)
3 6. [S → .SaSb,a] FIRST(aSbb)
5 State 3 GOTO(b) SaSab
  1. [S → SaSb.,ε]
  2. [S → SaSb.,a]
From Item Item Lookahead
None None None
6 State 4 GOTO(S) SaSab
  1. [S → SaS.b,a]
  2. [S → SaS.b,b]
  3. [S → S.aSb,b]
  4. [S → S.aSb,a]
From Item Item Lookahead
None None None
7 State 6 GOTO(b) SaSaSb
  1. [S → SaSb.,a]
  2. [S → SaSb.,b]
From Item Item Lookahead
None None None

Fig. 2.

Let's tabulate the GOTO's on the states and symbols, which we'll need to calculate the state after a shift (for GOTO's on terminal symbols) and also the state after a reduce (for GOTO's on non-terminal symbols).

State ON symbol GOTO state
  S e a b
0 1 - - -
1 - - 2 -
2 3 - - -
3 - - 4 5
4 6 - - -
5 - - - -
6 - - 4 7
7 - - - -

Now we compute the ACTION tables by looking at the types of items. If the item has nothing after the dot, it's a reduction provided we see the item's lookahead. Otherwise, we compute u = EFF() of all the symbols after the item's dot including the item's k-symbol lookahead. If we see the u in the input, we call for a shift, and a goto the next state on u.

State Items EFF Lookahead (shift only) Lookahead Symbol/Action
0
  1. S' → .S,ε
  2. S' → .SaSb,ε
  3. S' → .SaSb,a
  4. S' → .,ε
  5. S' → .,a
  1. EFF(Sε)=0
  2. EFF(SaSbε)=0
  3. EFF(SaSba)=0
  4. -
  5. -
  1. -
  2. -
  3. -
  4. ε reduce 2
  5. a reduce 2
1
  1. S' → S.,ε
  2. S → S.aSb,ε
  3. S → S.aSb,a
  1. -
  2. EFF(aSbε)=a
  3. EFF(aSba)=a
  1. accept
  2. a shift 2
  3. a shift 2
2
  1. S → Sa.Sb,ε
  2. S → Sa.Sb,a
  3. S → .SaSb,a
  4. S → .SaSb,b
  5. S → .,a
  6. S → .,b
  1. EFF(Sbε)=0
  2. EFF(Sba)=0
  3. EFF(SaSba)=0
  4. EFF(SaSbb)=0
  5. -
  6. -
  1. -
  2. -
  3. -
  4. -
  5. a reduce 2
  6. b reduce 2
3
  1. S → SaS.b,ε
  2. S → SaS.b,a
  3. S → S.aSb,a
  4. S → S.aSb,b
  1. EFF(bε)=b
  2. EFF(ba)=b
  3. EFF(aSba)=a
  4. EFF(aSbb)=a
  1. b shift 5
  2. b shift 5
  3. a shift 4
  4. a shift 4
4
  1. S → Sa.Sb,a
  2. S → Sa.Sb,b
  3. S → .SaSb,a
  4. S → .SaSb,b
  5. S → .,a
  6. S → .,b
  1. EFF(Sba)=0
  2. EFF(Sbb)=0
  3. EFF(SaSba)=0
  4. EFF(SaSbb)=0
  5. -
  6. -
  1. -
  2. -
  3. -
  4. -
  5. a reduce 2
  6. b reduce 2
5
  1. S → SaSb.,ε
  2. S → SaSb.,a
  1. -
  2. -
  1. ε reduce 1
  2. a reduce 1
6
  1. S → SaS.b,a
  2. S → SaS.b,b
  3. S → S.aSb,a
  4. S → S.aSb,b
  1. EFF(ba)=b
  2. EFF(bb)=b
  3. EFF(aSba)=a
  4. EFF(aSbb)=a
  5. -
  6. -
  1. b shift 7
  2. b shift 7
  3. a shift 4
  4. a shift 4
7
  1. S → SaSb.,a
  2. S → SaSb.,b
  1. -
  2. -
  1. a reduce 1
  2. b reduce 1

Collecting together the shift and reduce actions for each state, and the goto on each non-terminal gives us the ACTION and GOTO tables: the heart of the parser. Error states are easily associated with error messages.

State ACTION GOTO
  ε a b S
0 r2 r2 error - b not expected 1
1 accept s2 error - b not expected error
2 error - expecting a or b r2 r2 3
3 error - expecting a or b s4 s5 error
4 error - expecting a or b r2 r2 6
5 r1 r1 error - b not expected error
6 error - premature end of input s4 s7 error
7 error r1 r1 error

Now let's give an example of parsing a grammatical sentence.

Parse Stack Action
(0 | ε a a b b) r2: S → ε
(0 S 1 | a a b b) s2:
(0 S 1 a 2 | a b b) r2: S → ε
(0 S 1 a 2 S 3 | a b b) s4
(0 S 1 a 2 S 3 a 4 | b b) r2: S → ε
(0 S 1 a 2 S 3 a 4 S 6 | b b) s7
(0 S 1 a 2 S 3 a 4 S 6 b 7 | b) r1: S → SaSb
(0 S 1 a 2 S 3 | b) s5
(0 S 1 a 2 S 3 b 5| ) r1: S → SaSb
(0 S 1| ) accept: S' → S  

We've just traced out a rightmost derivation in reverse:

S'S → SaSb → SaSaSbb → SaSabb → Saabb → aabb

Now let's give an example of parsing an UNgrammatical sentence.

Parse Stack Action
(0 | a a b) r2: S → ε
(0 S 1 | a a b) s2:
(0 S 1 a 2 | a b) r2: S → ε
(0 S 1 a 2 S 3 | a b) s4
(0 S 1 a 2 S 3 a 4 | b) r2: S → ε
(0 S 1 a 2 S 3 a 4 S 6 | b) s7
(0 S 1 a 2 S 3 a 4 S 6 b 7 | b) error: Expecting a or b next in the derivation S'=>SaSaSb...

Example: Generating LALR(1) Sets Of Items

Now we compute the sets of LALR(1) items by merging the cores of the LR(1) items above. New state numbers show which LR(1) states were merged. There are no reduce/reduce conflicts, and the parse tables become much smaller.

Fig. 3.

State GOTO From State Viable Prefix Items
0 - ε
  1. [S' → .S,ε]
  2. [S' → .SaSb,ε]
  3. [S' → .SaSb,a]
  4. [S' → .,ε] ε reduce 2
  5. [S' → .,a] a reduce 2
1 0 GOTO(S) S
  1. [S' → S.,ε] accept
  2. [S → S.aSb,ε] a shift 2
  3. [S → S.aSb,a] a shift 2
24 1 GOTO(a), 3 GOTO(a) Sa
  1. [S → Sa.Sb,ε]
  2. [S → Sa.Sb,a]
  3. [S → Sa.Sb,b]
  4. [S → .SaSb,a]
  5. [S → .SaSb,b]
  6. [S → .,a] a reduce 2
  7. [S → .,b] b reduce 2
36 24 GOTO(S) SaS
  1. [S → SaS.b,ε] b shift 57
  2. [S → SaS.b,a] b shift 57
  3. [S → SaS.b,b] b shift 57
  4. [S → S.aSb,a] a shift 24
  5. [S → S.aSb,b] a shift 24
57 36 GOTO(b) SaSb
  1. [S → SaSb.,ε] ε reduce 1
  2. [S → SaSb.,a] a reduce 1
  3. [S → SaSb.,b] b reduce 1

Collecting together the shift and reduce actions for each state, and the goto on each non-terminal gives us the ACTION and GOTO tables: the heart of the parser. Error states are easily associated with error messages.

State ACTION GOTO
  ε a b S
0 r2 r2 error - b not expected 1
1 accept s24 error - b not expectederror
24 error - expecting a or b r2 r2 36
36 error - expecting a or b s24 s57 error
57 r1 r1 r1 error

References


Copyright © 1986-2014 by Sean Erik O'Connor. All Rights Reserved.     last updated 01 Jan 14.