Lecture 19

Shift/Reduce Conflicts and the Parsing Game

If bison runs into trouble, it will be because of conflicts in the parsing automaton that it builds. It goes beyond the scope of this course to deal with this issue in full detail, but even a partial understanding is tremendously helpful in being able to debug grammars.

Let's start with a simple example—a grammar for generating non-empty sequences of balanced parentheses.

S -> ( )      // rule 1
S -> S S      // rule 2
S -> ( S )    // rule 3

The way the parser is usually explained is that it scans input, and maintains a stack. That is actually not as helpful as it might be in forming an understanding.

Instead, let me propose the PARSING GAME.

The parsing game is a solitaire game in which we're given a context free grammar, and a string of terminals, to the left of which we have placed the dot. A play of the parsing game consists of a sequence of moves that take place around the dot:

  1. A shift move consists of moving a symbol across the dot from right-to-left.
  2. A reduce move can be made if the symbols to the immediate left of the dot are the right hand side (RHS) of some production rule. In this case, these symbols can be replaced by the variable that makes up the left hand side (LHS) of that production rule.

We win the parsing game if, after finitely many steps, we achieve the configuration

S .

where S is the start symbol of the grammar.

An example will make this clearer—let's assume we have the balanced parenthesis grammar, and we want to parse the string ()(()). Consider the following annotated play of the parsing game:

. ( ) ( ( ) )    // initial state
( . ) ( ( ) )    // shift (
( ) . ( ( ) )    // shift )
S . ( ( ) )      // reduce by rule 1, S -> ()
S ( . ( ) )      // shift (
S ( ( . ) )      // shift (
S ( ( ) . )      // shift )
S ( S . )        // reduce by rule 1, S -> ()
S ( S ) .        // shift )
S S .            // reduce by rule 3, S -> ( S )
S .              // reduce by rule 2, S -> S S, victory!!

At the end of a successful parse, we should have the start symbol followed by the dot. If we write down the initial state and all of the reduce states, and erase the dots, we have a reversed right-most derivation:

()(()) <= S(()) <= S(S) <= SS <= S

Connecting the pieces (and waving our hands a bit), we have a one-to-one correspondence between winning plays of the parsing game and right-most derivations, and there is also a one-to-one correspondence between right-most derivations and parse trees. Composing these relations gives us a one-to-one relationship between winning plays of the parsing game and parse trees.

Indeed, the plays of the parsing game amount to a post-order traversal of the parse tree (each node is built only after all of its children have been built), and this is useful to understand. The stack, by the way, consists of the symbols to the left of the dot, while the symbols to the right of the dot are those not yet processed by the automaton (i.e., unread input).
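The two moves can be made concrete with a small Python sketch (the representation and names here are my own, not anything yacc/bison actually does): a configuration is a (stack, unread input) pair, with the dot sitting between the two. We can replay the winning play for ()(()) shown above.

```python
# The balanced-parenthesis grammar, with rules numbered as in the text.
RULES = [("S", ["(", ")"]),       # rule 1: S -> ( )
         ("S", ["S", "S"]),       # rule 2: S -> S S
         ("S", ["(", "S", ")"])]  # rule 3: S -> ( S )

def shift(stack, rest):
    """Move one symbol across the dot, from right to left."""
    return stack + [rest[0]], rest[1:]

def reduce_by(stack, rest, rule):
    """Replace the RHS of `rule` at the top of the stack by its LHS."""
    lhs, rhs = RULES[rule]
    assert stack[-len(rhs):] == rhs, "rule does not apply here"
    return stack[:-len(rhs)] + [lhs], rest

# Replay the winning play for ()(()) from the text:
config = ([], list("()(())"))
config = shift(*config)              # ( .
config = shift(*config)              # ( ) .
config = reduce_by(*config, rule=0)  # S .
config = shift(*config)              # S ( .
config = shift(*config)              # S ( ( .
config = shift(*config)              # S ( ( ) .
config = reduce_by(*config, rule=0)  # S ( S .
config = shift(*config)              # S ( S ) .
config = reduce_by(*config, rule=2)  # S S .
config = reduce_by(*config, rule=1)  # S .  -- we win
print(config)                        # (['S'], [])
```

The assert in reduce_by captures the rule of the game: a reduce is legal only when the rule's right-hand side sits immediately to the left of the dot.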

What this simple example does not make clear is that there are choices in the algorithm. In a way, this is obvious enough—we can always shift until we get to the end of the string. But this is usually a losing strategy:

. ( ) ( ( ) )
( . ) ( ( ) )
( ) . ( ( ) )
( ) ( . ( ) )
( ) ( ( . ) )
( ) ( ( ) . )
( ) ( ( ) ) .

But now no rule applies, and we are stuck.

Here is a more realistic example: consider the input string ()()().

We begin easily enough:

. ( ) ( ) ( )    initial state
( . ) ( ) ( )    shift (
( ) . ( ) ( )    shift )
S . ( ) ( )      reduce by rule 1, S -> ()
S ( . ) ( )      shift (
S ( ) . ( )      shift )
S S . ( )        reduce by rule 1, S -> ()

At this point, we have a choice, and the derivation can continue in two different ways:

S S . ( )    where we left off
S . ( )      reduce by rule 2, S -> S S
S ( . )      shift (
S ( ) .      shift )
S S .        reduce by rule 1, S -> ()
S .          reduce by rule 2, S -> S S

or

S S . ( )    where we left off
S S ( . )    shift (
S S ( ) .    shift )
S S S .      reduce by rule 1, S -> ()
S S .        reduce by rule 2, S -> S S
S .          reduce by rule 2, S -> S S

These are fundamentally different parses, as they are associated with different parse trees: the first groups the three S's to the left, as in (S S) S, while the second groups them to the right, as in S (S S).

This is an example of an essential shift/reduce conflict, one that reflects an ambiguity in the language, i.e., the existence of two distinct parse trees deriving the same string. yacc and bison, by default, resolve shift/reduce conflicts in favor of shifts, so that the second parse is what actually happens. [Historically, this was done so that if/if/else constructs would bind as expected, but it is also crucial for dealing with grammars that have epsilon productions.]

An even worse possibility is a reduce/reduce conflict. Consider the following (silly) grammar:

S -> A    rule 1
S -> B    rule 2
A -> a    rule 3
B -> a    rule 4

This grammar generates only the string "a". Consider a hypothetical parse:

. a    initial state
a .    shift

At this point, we have two reduce possibilities:

a .    where we left off
A .    reduce by rule 3, A -> a
S .    reduce by rule 1, S -> A

Or

a .    where we left off
B .    reduce by rule 4, B -> a
S .    reduce by rule 2, S -> B

Not surprisingly, yacc/bison resolve reduce/reduce conflicts in favor of the highest priority rule, by default the first applicable rule. Nevertheless, reduce/reduce conflicts are almost always indicative of a failure of analysis, and can produce unexpected results. No production grammar should ever contain a reduce/reduce conflict.
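The conflict can be seen mechanically: at the configuration "a .", two different rules match the top of the stack. A quick Python check (a sketch of mine, reusing the list-based representation of configurations from earlier):

```python
# The silly grammar above, with rules numbered as in the text.
RULES = [("S", ["A"]),   # rule 1
         ("S", ["B"]),   # rule 2
         ("A", ["a"]),   # rule 3
         ("B", ["a"])]   # rule 4

stack = ["a"]  # the configuration "a ." -- the whole input has been shifted

# Which rules have their RHS at the top of the stack?
applicable = [i + 1 for i, (lhs, rhs) in enumerate(RULES)
              if stack[-len(rhs):] == rhs]
print(applicable)   # [3, 4]: both A -> a and B -> a apply
```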

So now, let's get to the hard part... It is possible to deal with the alternatives in parsing by backtracking (this is basically what was happening with the functional parsers we wrote in Haskell last quarter), but this comes at a potentially exponential cost. If you're willing to backtrack, you can parse any unambiguous context-free grammar, and indeed any context-free grammar if you're willing to go with a disambiguation rule like first-parse.
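To make the backtracking idea concrete, here is a tiny recognizer for the parenthesis grammar written as an exhaustive search over parsing-game moves (a Python sketch of mine, not how yacc/bison works): at each configuration it tries every applicable reduce, then a shift, abandoning branches that fail. Its cost can blow up exponentially in the branching.

```python
# The balanced-parenthesis grammar, rules numbered as in the text.
RULES = [("S", ["(", ")"]),       # rule 1
         ("S", ["S", "S"]),       # rule 2
         ("S", ["(", "S", ")"])]  # rule 3

def wins(stack, rest):
    """Can some sequence of moves reach the winning configuration S . ?"""
    if stack == ["S"] and not rest:
        return True
    # Try every reduce that applies at the top of the stack...
    for lhs, rhs in RULES:
        if stack[-len(rhs):] == rhs and wins(stack[:-len(rhs)] + [lhs], rest):
            return True
    # ...and then try a shift.
    return bool(rest) and wins(stack + [rest[0]], rest[1:])

print(wins([], list("()(())")))   # True  : balanced
print(wins([], list("(()")))      # False : unbalanced
```

The search terminates because every move strictly decreases len(stack) + 2*len(rest), but the number of explored branches can grow exponentially with the input length.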

The traditional algorithmic approach is quite different. The parser is compelled to provide a strategy for the parsing game that is based on very local information about the neighborhood of the dot. This limitation gains efficiency (e.g., an LR(1) parser runs in linear time and space), at the cost of generality (not all unambiguous grammars can be parsed).

In the case of yacc/bison, the LALR(1) parsing algorithm builds a finite state automaton that runs on the left hand part of the input, considers the next symbol of input (the lookahead symbol), and makes a particular shift/reduce decision based on this information. The states of the automaton are collections of items, where each item looks like a production rule with a dot somewhere in its right hand side (corresponding to the dot in the discussion above).

For example, consider the following balanced parenthesis grammar:

%{
  /* paren.y */
%}

%%

S : '(' ')'
  | S S
  | '(' S ')'
  ;

If we try to build a parser,

$ bison -v paren.y
paren.y contains 1 shift/reduce conflict.

we get a conflict. The -v flag causes bison to produce verbose output in the file paren.output. If we look at that file, we'll see some bookkeeping information, and a list of automaton states. One of those states has the shift/reduce conflict described above:

state 5

    S -> S . S    (rule 2)
    S -> S S .    (rule 2)

    '(' shift, and go to state 1
    '(' [reduce using rule 2 (S)]
    $default reduce using rule 2 (S)

    S go to state 5

Note that there are two different actions associated with having '(' on the stack, a shift, and a [bracketed] reduce. The bracketed rule is ignored, and we can see the shift/reduce conflict resolved in favor of shift here.

Intuitively, this conflict points to exactly the situation we saw above -- if we have a configuration with two S's to the immediate left of the dot, and the next symbol is one that could begin an S, there is an ambiguity.

S S . ( )

The old school approach was to tweak the grammar for a while, until there were a few well understood shift/reduce conflicts (but no reduce/reduce conflicts), and to be happy with that. Indeed, yacc/bison have associativity and precedence rules that can be associated with the rules of a grammar that permit a controlled tie breaking of rules.

For example, let's consider an easy grammar for arithmetic expressions:

%token NUMBER

%%

expression : NUMBER
    | expression '+' expression
    | expression '-' expression
    | expression '*' expression
    | expression '/' expression
    | '(' expression ')'
    ;

This grammar contains 16(!) shift/reduce conflicts, which seems like a lot. But..., examining the state table tells the tale,

state 10

    expression -> expression . '+' expression    (rule 2)
    expression -> expression '+' expression .    (rule 2)
    expression -> expression . '-' expression    (rule 3)
    expression -> expression . '*' expression    (rule 4)
    expression -> expression . '/' expression    (rule 5)

    '+' shift, and go to state 5
    '-' shift, and go to state 6
    '*' shift, and go to state 7
    '/' shift, and go to state 8

    '+' [reduce using rule 2 (expression)]
    '-' [reduce using rule 2 (expression)]
    '*' [reduce using rule 2 (expression)]
    '/' [reduce using rule 2 (expression)]
    $default reduce using rule 2 (expression)

The automaton will be in this state, e.g., if it has

expression '+' expression . '*' NUMBER

on the stack. In this case, the shift/reduce conflict is really a symptom of an operator precedence issue

2 + 3 * 5

do we mean (2 + 3) * 5 [reduce], or 2 + (3 * 5) [shift] at

2 + 3 . * 5?

If we're following the usual precedence rules, 2 + (3 * 5) is the right answer here, but this doesn't mean we should favor reduces over shifts (consider the mirror image).

The old-school solution is to provide associativity/precedence rules.

%token NUMBER

%left '+' '-'
%left '*' '/'

%%

expression : NUMBER
    | expression '+' expression
    | expression '-' expression
    | expression '*' expression
    | expression '/' expression
    | '(' expression ')'
    ;

The rules (there is a %right as well as a %left, and even a %nonassoc for operators that aren't associative) are given, lowest precedence first.

yacc/bison report no shift/reduce conflicts for this grammar. A quick check of the automaton is reassuring, e.g.,

state 10

    expression -> expression . '+' expression    (rule 2)
    expression -> expression '+' expression .    (rule 2)
    expression -> expression . '-' expression    (rule 3)
    expression -> expression . '*' expression    (rule 4)
    expression -> expression . '/' expression    (rule 5)

    '*' shift, and go to state 7
    '/' shift, and go to state 8

    $default reduce using rule 2 (expression)

This state applies in a situation like this:

expression '+' expression . op NUMBER

If op is '+' or '-', left associativity tells us to reduce. If op is '*' or '/', the rule says to shift, which will give us a parse that is equivalent to

expression '+' ( expression op expression)

i.e., what we intend.

These days, the use of precedence/associativity rules is discouraged in favor of writing an unambiguous grammar in the first place.

E.g., let's consider the initial balanced parenthesis grammar:

S -> ( )      // rule 1
S -> S S      // rule 2
S -> ( S )    // rule 3

The conflict here comes from rule 2 -- in a sentential form like S S S, which pair do we group first? Here we can remove the ambiguity by thinking in terms of generating a (non-empty) list of grammar symbols, each of which generates a balanced string within a single enclosing pair of parentheses:

S -> S T
S -> T
T -> ( )
T -> ( S )

This basically forces juxtapositions of balanced parenthesis strings to associate to the left. The resulting grammar is unambiguous, and yacc/bison builds a conflict-free parser for it.
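We can check the claim of unambiguity by brute force. Since winning plays of the parsing game correspond one-to-one to parse trees, counting winning plays counts trees. The following Python sketch (my own, extending the search idea from before) counts winning plays for the input ()()() under both grammars:

```python
# The ambiguous grammar, and the unambiguous replacement with helper symbol T.
AMBIG = [("S", ["(", ")"]), ("S", ["S", "S"]), ("S", ["(", "S", ")"])]
UNAMB = [("S", ["S", "T"]), ("S", ["T"]),
         ("T", ["(", ")"]), ("T", ["(", "S", ")"])]

def count_plays(rules, stack, rest):
    """Count the winning plays of the parsing game from this configuration."""
    total = 1 if stack == ["S"] and not rest else 0
    # Every applicable reduce at the top of the stack...
    for lhs, rhs in rules:
        if stack[-len(rhs):] == rhs:
            total += count_plays(rules, stack[:-len(rhs)] + [lhs], rest)
    # ...and a shift, if input remains.
    if rest:
        total += count_plays(rules, stack + [rest[0]], rest[1:])
    return total

print(count_plays(AMBIG, [], list("()()()")))   # 2 : two parse trees
print(count_plays(UNAMB, [], list("()()()")))   # 1 : a unique parse tree
```

The two plays found for the ambiguous grammar are exactly the two continuations shown earlier for ()()().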

A generalization of the same approach enables us to handle precedence relationships unambiguously.

Two critical patterns are the grammars for left- and right-associative sequences:

left : TERMINAL
    | left TERMINAL
    ;

right : TERMINAL
    | TERMINAL right
    ;

Note that if associativity doesn't matter, it's better (if a bit counter-intuitive) to prefer left-associativity to right-associativity, because the reductions come incrementally instead of all at the end, resulting in a smaller stack.
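The stack-size difference is easy to demonstrate. Here is a Python sketch (my own simplified simulation, not bison's actual automaton) that parses n copies of a terminal 't' under each grammar and reports the maximum stack depth reached:

```python
def left_list(tokens):                 # left : TERMINAL | left TERMINAL
    stack, max_depth = [], 0
    for t in tokens:
        stack.append(t)                # shift
        max_depth = max(max_depth, len(stack))
        if stack == ["t"]:
            stack = ["left"]           # reduce: left -> TERMINAL
        elif stack[-2:] == ["left", "t"]:
            stack = stack[:-2] + ["left"]   # reduce: left -> left TERMINAL
    return max_depth

def right_list(tokens):                # right : TERMINAL | TERMINAL right
    stack = list(tokens)               # every token must be shifted first...
    max_depth = len(stack)
    stack[-1] = "right"                # reduce: right -> TERMINAL
    while stack[-2:] == ["t", "right"]:
        stack = stack[:-2] + ["right"] # reduce: right -> TERMINAL right
    return max_depth

print(left_list(["t"] * 1000))   # 2    : the stack never holds more than two symbols
print(right_list(["t"] * 1000))  # 1000 : all tokens pile up before any reduce
```

With left recursion the parser reduces after every shift, so the stack stays at constant depth; with right recursion no reduce is possible until the entire input is on the stack.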

We then build a grammar from the lowest-level of precedence to the highest.

expression : term
    | expression '+' term
    | expression '-' term
    ;

These rules assert that an expression is a left-associative list of terms, joined by '+' and '-'.

term : factor
    | term '*' factor
    | term '/' factor
    ;

Likewise, a term is a left associative list of factors, joined by '*' and '/'.

Finally,

factor : NUMBER
    | '(' expression ')'
    ;

A factor is either a NUMBER, or a parenthesized expression.

The resulting, unambiguous grammar is processed by yacc or bison without any conflicts. This is not hard, although there is some work in building up an ontology for the various precedence levels. The names used here are traditional.
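The layered expression/term/factor grammar translates directly into code, one function per precedence level. Here is a Python sketch of mine (a recursive-descent evaluator rather than a bison parser, but following the grammar exactly) showing that the layering produces the usual precedence:

```python
def evaluate(s):
    # Crude tokenizer: numbers, + - * /, and parentheses, whitespace-separated.
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expression():          # expression : term | expression ('+'|'-') term
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            value = value + term() if op == "+" else value - term()
        return value

    def term():                # term : factor | term ('*'|'/') factor
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = tokens[pos]; pos += 1
            value = value * factor() if op == "*" else value / factor()
        return value

    def factor():              # factor : NUMBER | '(' expression ')'
        nonlocal pos
        if peek() == "(":
            pos += 1
            value = expression()
            pos += 1           # consume ')'
            return value
        value = int(tokens[pos]); pos += 1
        return value

    return expression()

print(evaluate("2 + 3 * 5"))     # 17 : '*' binds tighter than '+'
print(evaluate("(2 + 3) * 5"))   # 25 : parentheses override precedence
```

Because term() sits below expression(), multiplicative operators are gathered before the additive level ever sees them -- the same effect the grammar layering has in the bison parser.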

Exercise 19.1 Consider the following bison program:

%{
  // prop_formula.y
  // parsing for boolean expressions
%}

%token AND OR IMPLIES NOT OPEN_PAREN CLOSE_PAREN VAR

%%

expression : VAR
    | NOT expression
    | expression OR expression
    | expression AND expression
    | expression IMPLIES expression
    | OPEN_PAREN expression CLOSE_PAREN
    ;

%%

This is intended to be a recognizer for propositional formulas, but it has 12(!) shift/reduce conflicts. It can be disambiguated as follows:

  1. NOT has highest precedence
  2. AND is next, and it is left associative,
  3. OR is next, and it is left associative,
  4. IMPLIES has lowest precedence, and it is right associative.

Produce a correct bison grammar. Extra credit: do so without using %left, %right.