This is intended to be a hyperfast review of the basic LR(k) and LALR(k) terminology and concepts; enough to be able to understand how a parser generator program works. To illustrate the concepts, I have written Lisp software for a LR(1) and LALR(1) parser generator and parser.
Given the tokens and productions of a grammar, there is an algorithm which can either generate a deterministic finite automaton (DFA) to parse the grammar efficiently or else determine our grammar is not LR(k)/LALR(k).
Each state in our DFA corresponds to a prefix of a possible derivation of our input string.
A state contains a set of items. Depending upon the next k symbols of lookahead, an item tells us to either
We'll end up with a sequence of productions tracing out the derivation of the sentence in reverse, or stop in an error state.
A formal grammar is defined to be the tuple $G = \left( N, \Sigma, P, S \right)$ where N = finite set of nonterminal symbols, $\Sigma$ = finite set of terminal symbols, P = set of productions of the form $\alpha \rightarrow \beta$, S = nonterminal start symbol. The collection $\{ a, b, c, d, \ldots, 0, 1, \ldots, 9 \}$ denote terminals, while $\{ u, v, w, x, y, z \}$ are strings of terminals. $\{ A, B, C, D, S, \ldots \}$ denote nonterminals, $\{ U, V, W, X, Y, Z \}$ are either terminal or nonterminal symbols. $\{ \alpha, \beta, \gamma, \delta, \ldots \}$ are strings of either. $ x $ is the length of the string x. $\epsilon$ is the empty string. $\Rightarrow$ means directly derives, i.e. if we have a production $\beta \rightarrow \delta$ then $\alpha \beta \gamma \Rightarrow \alpha \delta \gamma$ and $\stackrel{*}{\Rightarrow}$ means transitive and reflexive closure of derives. In derivations, rm denotes rightmost derivation, which means expand nonterminal symbols starting with those on the right. lm denotes leftmost derivation.
A language L is context free if all productions $\alpha \rightarrow \beta$, satisfy $ \alpha  = 1$, i.e. the left hand side of a production is a single nonterminal, i.e. a production is independent of the symbols surrounding it (its context).
The concatenation of two sets of strings is defined by
$$L_1 \oplus_k L_2 = \{ w  \exists x \in L_1, y \in L_2$$ such that $${\LARGE \{ } \begin{array}{rl} w = xy \text{ if } xy \le k \text{ else } \\ w = \text{ first k symbols of xy } \end{array} $$
The first k terminals derived from the string $\alpha$ is defined by $FIRST_k \left( \alpha \right) = \{ x  \alpha \mathrel{ \substack{* \\ \Rightarrow \\ lm} } x \beta \: and \: x = k \: or \: \alpha \mathrel{ \substack{* \\ \Rightarrow \\ lm} } x \: and \: x \lt k \} $
First, define $FIRST_k()$ for any string of terminals and nonterminals, $\beta = X_1 \ldots X_n \in (N \bigcup \Sigma)^*$ recursively by string concatenation,
$FIRST_k( \beta ) = FIRST_k( X_1 ) \oplus_k \ldots \oplus_k FIRST_k( X_n )$
For terminals, define FIRST to be the terminal itself,
$FIRST_k( a ) = \{ a \} \quad \forall a \in \Sigma \cup \{ \epsilon \}$
It remains to compute FIRST for all nonterminals, $FIRST_k( X ) \quad \forall X \in N$
We do this iteratively. We define the function F() which will approximate FIRST. For terminals, define $F_i( a ) = \{ a\} \quad \forall a \in \Sigma, \quad \forall i \ge 0$
For nonterminals, we define the zeroth approximation to FIRST by $$F_0( A ) = \{ x  x \in \Sigma^{*k},$$ $$A \rightarrow x \: \alpha \in P, \: x = k \text{ or } x \lt k \text{ and } \alpha = \epsilon \}$$
Then for all nonterminals A simultaneously, we iterate the formula, $$F_{i+1}( A ) = F_i( A ) \cup \{ x  A \rightarrow Y_1 \ldots Y_n \in P$$ and $$x \in F_i( Y_1 ) \oplus_k \ldots \oplus_k F_i( Y_n ) \} \quad i \ge 0$$
Eventually,
$F_i( A ) \subset F_{i+1}( A ) \subset \ldots \subset F_l( A ) = F_{l+1}( A ) \quad \forall A \in N$
at which point we terminate and say,
$FIRST_k( A ) = F_l( A ) \quad \forall A \in N$
The epsilonfree first k derived symbols, $EFF_k()$, are found by computing FIRST as above, but tagging all symbols which are derived using epsilon productions of the form, $A \rightarrow \epsilon$ as nonepsilon free, and keeping only the epsilonfree derived terminals.
G is LR(k) iff
implies $\alpha = \gamma$, $A = B$ and $x = y$
That is, if we see a string of the form $\alpha \beta w$ and we are given only $FIRST_k( w )$, we must have used the unique production $A \rightarrow \beta$ and we must have had the same unique prefix $\alpha \beta$.
Or equivalently, knowing the prefix $\alpha \beta$ and only the k terminal symbol lookahead, we can locate
An item $[ A \rightarrow \beta_1 \bullet \beta_2, u]$ is valid for viable prefix $\gamma = \alpha \beta_1$ if $\exists S \underset{rm}{\overset{*}{\Rightarrow}} \alpha \beta_1 \beta_2 w \ni u = FIRST_k( w )$ That is, $\alpha A w \Rightarrow \alpha \beta_1 \beta_2 w$ is a derivation consistent with what we've seen so far in parsing (the prefix) and consistent with the lookahead. An item with nothing after the dot indicates a reduction by a production is OK at this stage, but an item with a symbol after the dot indicates we should shift more symbols before making any reductions.
A language $G$ is $LR(k)$ iff for a viable prefix $\gamma = \alpha \beta$ given $[ A \rightarrow \beta \bullet , u ]$ is valid for $\gamma$ and there is no other valid item $[ A_1 \rightarrow \beta_1 \bullet \beta_2, \nu ]$ valid for $\gamma$ with $u \in EFF_k( \beta_2 \nu )$
(We may have $\beta_2 = \epsilon$ ) In other words, the sets of items cannot contain any shiftreduce or reducereduce conflicts.
Define an operator $\delta$ which generates the set of items having the same prefix consistent with the lookahead:
$\delta \{ [ A \rightarrow \alpha \bullet X \beta, u ] \} =$
Denote the transitive closure of this operator as, $CLOSURE( \{ [ A \rightarrow \alpha \bullet X \beta, u ] \} )$
The reason for the closure is this: if an item $[ A \rightarrow \alpha \bullet X \beta, u ]$ is valid for viable prefix $\gamma = \delta \alpha$ and lookahead $u = FIRST_k( w )$
There is a derivation, $\exists S \underset{ rm }{ \overset{ * }{ \Rightarrow } } \delta A w \Rightarrow \delta \alpha X \beta w$ Then a production of the form $X \rightarrow \eta$ implies the derivation $S \underset{ rm }{ \overset{ * }{ \Rightarrow } } \delta A w \Rightarrow \delta \alpha \eta \beta w$ which means the new item $[X \rightarrow \bullet \eta, \nu]$ is also valid for the same viable prefix $\gamma = \delta \alpha$ where $\nu = FIRST_k( \beta w )$
Define the GOTO on a set of items $A$ as $$GOTO( A, X ) = CLOSURE( \{ [ A \rightarrow \alpha X \bullet \beta, u ] \} )$$ where $$[A \rightarrow \alpha \bullet X \beta, u] \in A$$
We can construct the set of items valid for each viable prefix iteratively. The set of items valid for viable prefix $X_1 \ldots X_i$ is $V_k( X_1 \ldots X_i ) = GOTO( V_k( X_1 \ldots X_{i1}), X_i)$ where the initial state is $V( \epsilon ) = CLOSURE( \{ [S' \rightarrow \bullet S, \epsilon ] \} )$ This gives us a GOTO graph having the sets of items as nodes, the GOTO's on symbols as transitions between states, and the viable prefix for a set of states as the path from the initial state to that node.
The algorithm is as follows: Begin with $C = CLOSURE[S' \rightarrow \bullet S, \epsilon]$
Repeat for each item $I \in C$ for each $X \in G$ if $GOTO( I, X ) \neq \varnothing$ and $GOTO( I, X ) \not\in C$ then $C = C \cup GOTO( I, X )$ until no more items can be added to $C$.
We generate ACTION and GOTO tables which will be used by our finite state automaton (FSA) for parsing.
Let $s_m = \text{ arbitrary state label for a set of items }$ and $u = \text{ k symbol lookahead }$
Generate the ACTION table as $ACTION[ s_m, u ] =$
Define the GOTO table as $GOTO[ s_m, X_i ] = GOTO( A_{s_m}, X_i )$
We start in the initial state, $( s_0  a_i a_{i+1} \ldots a_n \$)$
where $s_0 = V( \epsilon ) = CLOSURE( [S' \rightarrow \bullet S, \epsilon])$
Then, starting from a state consisting of a parse stack on the left, and an input stack on the right of remaining tokens to be parsed,
$(s_0 X_1 s_1 \ldots X_m s_m  a_i a_{i+1} \ldots a_n \$)$
We define the transition functions of the DFA on as follows:
$ACTION[s_m, a_i \ldots a_{i+k1}] = shift \: s$
push the new symbol and new state and transition to:
$(s_0 X_1 s_1 \ldots X_m s_m a_i s  a_{i+1} \ldots a_n \$)$
where $s = GOTO( s_m, a_i \ldots a_{i+k1}]$
If $ACTION[s_m, a_i \ldots a_{i+k1}] = reduce \quad A \rightarrow \beta$
transition from
$(s_0 X_1 s_1 \ldots X_{mr} s_{mr} X_{mr+1} s_{mr+1} \ldots X_m s_m  a_i a_{i+1} \ldots a_n \$)$
where $\beta = X_{mr+1} \ldots X_m$ to the state:
$(s_0 X_1 s_1 \ldots X_{mr} s_{mr} A s  a_i a_{i+1} \ldots a_n \$)$
where $r =  \beta $ and $s = GOTO[ s_{mr}, A ]$
If $ACTION[ s_m, a_i \ldots a_{i+k1} ] = accept$
Halt and accept the input with no error.
If $ACTION[ s_m, a_i \ldots a_{i+k1} ] = error$
Call $ErrorHandler[ s_m, a_i \ldots a_{i+k1}]$
where the viable prefix is $X_1 \ldots X_m$ and the error tokens are $a_i \ldots a_{i+k1}$
If the grammar is ambiguous, we don't have to rewrite the grammar to avoid it; many times we can resolve the conflicts in many cases by adding precedence and associativity rules to productions and input symbols. YACC does this.
If a state contains items having a shift/reduce conflict,
$ACTION[s_m, a_i \ldots a_{i+k1}] = reduce \quad A \rightarrow \beta$
$ACTION[s_m, a_i \ldots a_{i+k1}] = shift \quad s$
we reduce if the precedence of the production $A \rightarrow \beta$ is higher than the precedence of the lookahead in the input, $a_i \ldots a_{i+k1}$ The default precedence of the production is the precedence of the rightmost terminal of $\beta$ Otherwise, we can set it to be the precedence of any symbol which gives a correct parse.
If there is a tie in precedence, we reduce if the production is left associative and shift if the production is right associative. Nonassociative breaks ties by declaring error.
If there is a reduce/reduce conflict, we reduce by the earlier listed production in the list.
We can add error productions to the language of the form, $A \rightarrow \text{ error } \alpha$ where A is a major nonterminal, and $a = \epsilon$ is allowed.
When the parser gets into an error state, it pops the parse stack until it finds a state containing the item $A \rightarrow \bullet \text{ error } \alpha$ shifts the imaginary token, $\text{ error }$ then discards input symbols until it sees lookahead $\alpha$. Then the parser performs a reduction on $A$ and calls an error routine associated with the error production. If $\alpha = \epsilon$ the reduction happens immediately.
First a definition,
The core of an item $[A \rightarrow \beta_1 \bullet \beta_2, u]$ is $[A \rightarrow \beta_1 \bullet \beta_2]$.
If we can merge the cores of the sets of items in the $LR( k )$ goto graph without generating conflicts to give a much smaller graph, the grammar is of type $LALR( k )$.
(1) Construct $LR(1)$ sets of items $I$:
$C = \{ I_0, \ldots, I_n \}$
(2) For each core in C find all sets of items having that core and replace these sets by their union (remove duplicate cores and merge lookaheads).
$C' = \{ J_0, \ldots, J_n \}$
(3) Create the action table as before. If there are any conflicts, the language is not LALR(1) and we exit.
(4) Otherwise, construct the goto table as follows.
For a merged set of items, $J = I_1 \cup I_2 \ldots \cup I_k$ the cores of $goto( I_1, X ) \ldots goto(I_k, X )$ are identical also since $I_1 \ldots I_k$ have the same cores.
Now let $K = GOTO( I_1, X ) \cup \ldots \cup GOTO( I_k, X )$ then $goto( J, K ) = K$
And our LALR(1) collection of items is done. If an LR(1) parser has detected an error on the input, the corresponding LALR(1) parser may do additional reductions. However it will not shift the error symbol onto the stack, and will declare error before the next shift.
We can't create any new shift/reduce conflicts by merging cores. Suppose we did create a new shift/reduce conflict by merging sets of items:
$[A \rightarrow \alpha \bullet , a] \quad (reduce \quad A \rightarrow \alpha)$
and
$[B \rightarrow \beta \bullet a \gamma, b] \quad (shift \quad a)$
Since cores were the same before the merge, some item must have had
$[A \rightarrow \alpha \bullet , a] \quad (reduce \quad A \rightarrow \alpha)$
and
$[B \rightarrow \beta \bullet a \gamma, c] \quad (shift \quad a)$
for some lookahead c. But this is an LR(1) shift/reduce conflict, a contradiction. However, we can create a reduce/reduce conflict by merging cores.
Our sample grammar with lookahead k = 1 has two productions,
First compute FIRST and epsilonfree FIRST of all the token symbols,
For all i, we have for the terminals, F_{i}(ε) = {ε}, F_{i}(a) = {a}, F_{i}(b) = {b}.
For the nonterminal S, F_{0}(S) = {ε}, since that's the only production which derives a terminal symbol on the left. Now we iterate
F_{1}(S) = {ε} U { F_{0}(S)+_{1}F_{0}(a)+_{1} F_{0}(S)+_{1} F_{0}(b), F_{0}(ε) } = { {ε}+_{1}{a}+_{1}..., ε } = {ε,a}
F_{2}(S) = {ε} U { F_{1}(S)+_{1}F_{1}(a)+_{1} F_{1}(S)+_{1} F_{1}(b), F_{1}(ε) } = { {ε,a}+_{1}{a}+_{1}..., ε } = {ε,a}
The set doesn't change anymore, so we stop and declare, FIRST(S) = F_{2}(S) = {ε,a}
When we compute EFF, we use a similar algorithm, but any derivations with leading epsilons were disallowed, so EFF(ε) = EFF(S) = {}.
Symbol  FIRST_{1}()  EFF_{1}() 
ε  {ε}  {} 
a  {a}  {a} 
b  {b}  {b} 
S  {ε, a}  {} 
Now we compute the sets of LR(1) items using GOTO and CLOSURE operations.
State  GOTO() from state  Viable Prefix  Kernel  Closure  
0    ε  1. [S' →.S,ε] 


1  State 0 GOTO(S)  S 



2  State 1 GOTO(a)  Sa 



3  State 2 GOTO(S)  SaS 



4  State 6 GOTO(a), State 3 GOTO(a)  SaSa 



5  State 3 GOTO(b)  SaSab 



6  State 4 GOTO(S)  SaSab 



7  State 6 GOTO(b)  SaSaSb 


Let's tabulate the GOTO's on the states and symbols, which we'll need to calculate the state after a shift (for GOTO's on terminal symbols) and also the state after a reduce (for GOTO's on nonterminal symbols).
State  ON symbol GOTO state  
S  e  a  b  
0  1       
1      2   
2  3       
3      4  5 
4  6       
5         
6      4  7 
7         
Now we compute the ACTION tables by looking at the types of items. If the item has nothing after the dot, it's a reduction provided we see the item's lookahead. Otherwise, we compute u = EFF() of all the symbols after the item's dot including the item's ksymbol lookahead. If we see the u in the input, we call for a shift, and a goto the next state on u.
State  Items  EFF Lookahead (shift only)  Lookahead Symbol/Action 
0 



1 



2 



3 



4 



5 



6 



7 



Collecting together the shift and reduce actions for each state, and the goto on each nonterminal gives us the ACTION and GOTO tables: the heart of the parser. Error states are easily associated with error messages.
State  ACTION  GOTO  
ε  a  b  S  
0  r2  r2  error  b not expected  1 
1  accept  s2  error  b not expected  error 
2  error  expecting a or b  r2  r2  3 
3  error  expecting a or b  s4  s5  error 
4  error  expecting a or b  r2  r2  6 
5  r1  r1  error  b not expected  error 
6  error  premature end of input  s4  s7  error 
7  error  r1  r1  error 
Now let's give an example of parsing a grammatical sentence.
Parse Stack  Action 
(0  ε a a b b)  r2: S → ε 
(0 S 1  a a b b)  s2: 
(0 S 1 a 2  a b b)  r2: S → ε 
(0 S 1 a 2 S 3  a b b)  s4 
(0 S 1 a 2 S 3 a 4  b b)  r2: S → ε 
(0 S 1 a 2 S 3 a 4 S 6  b b)  s7 
(0 S 1 a 2 S 3 a 4 S 6 b 7  b)  r1: S → SaSb 
(0 S 1 a 2 S 3  b)  s5 
(0 S 1 a 2 S 3 b 5 )  r1: S → SaSb 
(0 S 1 )  accept: S' → S 
We've just traced out a rightmost derivation in reverse:
S' → S → SaSb → SaSaSbb → SaSabb → Saabb → aabb
Now let's give an example of parsing an UNgrammatical sentence.
Parse Stack  Action 
(0  a a b)  r2: S → ε 
(0 S 1  a a b)  s2: 
(0 S 1 a 2  a b)  r2: S → ε 
(0 S 1 a 2 S 3  a b)  s4 
(0 S 1 a 2 S 3 a 4  b)  r2: S → ε 
(0 S 1 a 2 S 3 a 4 S 6  b)  s7 
(0 S 1 a 2 S 3 a 4 S 6 b 7  b)  error: Expecting a or b next in the derivation S'=>SaSaSb... 
Now we compute the sets of LALR(1) items by merging the cores of the LR(1) items above. New state numbers show which LR(1) states were merged. There are no reduce/reduce conflicts, and the parse tables become much smaller.
State  GOTO From State  Viable Prefix  Items 
0    ε 

1  0 GOTO(S)  S 

24  1 GOTO(a), 3 GOTO(a)  Sa 

36  24 GOTO(S)  SaS 

57  36 GOTO(b)  SaSb 

Collecting together the shift and reduce actions for each state, and the goto on each nonterminal gives us the ACTION and GOTO tables: the heart of the parser. Error states are easily associated with error messages.
State  ACTION  GOTO  
ε  a  b  S  
0  r2  r2  error  b not expected  1 
1  accept  s24  error  b not expected  error 
24  error  expecting a or b  r2  r2  36 
36  error  expecting a or b  s24  s57  error 
57  r1  r1  r1  error 
Copyright © 19862017 by Sean Erik O'Connor. All Rights Reserved. last updated 01 Jan 17.