# Lecture 18

## Parsing, II, Intro to Yacc/Bison

The core idea is that languages can be specified by context free grammars.

A context free grammar is a four tuple G = (V,T,P,S) where

• V - is a finite set of atomic variables, a.k.a., nonterminals
• T - is a finite set of atomic terminals; V and T are disjoint.
• P - is a finite set of productions. A production has the form V -> [V|T]*, i.e., a variable is to be replaced by a finite (possibly empty) string of variables and terminals.
• S - an element of V, is the start symbol.

A quick example:

```
S -> SS | (S) | ε
```

is a terse way to describe a context free grammar that generates the balanced parenthesis strings. [N.B., epsilon (ε) here denotes the empty string.]

We can use the grammar to derive a string in the language by repeatedly substituting variables based on production rules, e.g.

```
S => SS => (S)S => ()S => ()(S) => ()((S)) => ()(())
```

Such a derivation is often described in terms of a parse tree, which has the advantage of avoiding the temporal overspecification of a derivation, i.e., committing to an order in which independent substitutions are made.

A deep algorithmic result is that for certain context free grammars, this generation process can be efficiently reversed, and a parse tree can be reconstructed.

```
()(()) <= (S)(()) <= S(()) <= S((S)) <= S(S) <= SS <= S
```
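To make "reversing the generation process" concrete, here is a minimal hand-written recognizer in C. It is only a sketch under one stated assumption: it uses the equivalent unambiguous grammar `S -> (S)S | ε` rather than the ambiguous `S -> SS | (S) | ε`, because that form can be parsed top-down with one character of lookahead. The function names `parse_S` and `balanced` are made up for this illustration, and this is not the bottom-up technique that `yacc` uses.

```c
#include <assert.h>
#include <stddef.h>

/* Recognizer for balanced parentheses, using the equivalent
   unambiguous grammar  S -> ( S ) S | epsilon.
   parse_S consumes the longest prefix of *p derivable from S.
   It returns 1 on success and 0 if a ')' is missing. */
static int parse_S(const char **p) {
    while (**p == '(') {            /* S -> ( S ) S */
        (*p)++;                     /* consume '(' */
        if (!parse_S(p)) return 0;  /* the nested S */
        if (**p != ')') return 0;   /* expected ')' */
        (*p)++;                     /* consume ')', loop handles the trailing S */
    }
    return 1;                       /* S -> epsilon */
}

/* A string is in the language iff S derives all of it. */
int balanced(const char *s) {
    assert(s != NULL);
    const char *p = s;
    return parse_S(&p) && *p == '\0';
}
```

Calling `balanced("()(())")` accepts the string derived above, while `balanced("(()")` and `balanced(")(")` are rejected.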

We will consider `yacc`/`bison`, parser generators for LALR(1) grammars.

Central to the idea of `yacc` is that variables and terminals can have values, which essentially propagate from the leaves of the parse tree up to the root. In simple examples (e.g., the calculator program from the book, or today's similar example), the values are numbers. In more complicated (and typical) examples, the values could be the full parse tree associated with the descendants of that symbol.

When we write a `yacc` program, we provide production rules, together with actions that typically assign a value to the variable on the left-hand side of a production rule when that rule is reduced.

### An example -- an atomic weight calculator.

We'd like to be able to easily compute atomic weights of simple formulae, e.g., H2SO4. An issue is that these formulae can have recursive structure, e.g., the chemical formula of octane is CH3(CH2)6CH3.

For starters, we need a table of chemical elements and weights:

```
# mass.data
# ...
1 H 1.00794 Hydrogen
2 He 4.002602 Helium
3 Li 6.941 Lithium
4 Be 9.012182 Beryllium
...
```

Note that a table like this is easily processed using `sed` to filter comments and blank lines, and then `scanf` to process each line. Thus, fancy parsing approaches are not required. We have C files `element-db.[ch]` that provide a simple procedural interface to our database. Here's the interface:

```
// element-db.h
// Interface to the element database

#ifndef GUARD_ELEMENT_DB
#define GUARD_ELEMENT_DB

typedef struct element_record {
    struct element_record *next;
    char *abbrev;
    char *name;
    int atomic_number;
    double mass;
} element_record, *element_ptr, *element_db;

element_db create_element_db(char *file_name);
const element_ptr lookup_element(element_db, char *abbrev);
void delete_element_db(element_db *db);

#endif
```
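The `scanf`-style processing of one table line might look like the sketch below. It is not the actual `element-db.c`: the flat `element_line` struct and the helper `parse_element_line` are hypothetical, and the sketch assumes four whitespace-separated fields, a one-word element name, and that comments and blank lines have already been filtered out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flat record for one table line. The real
   element-db.c may store things differently. */
typedef struct {
    int    atomic_number;
    char   abbrev[4];    /* up to 3 letters plus the terminating NUL */
    double mass;
    char   name[32];
} element_line;

/* Parse one line such as "1 H 1.00794 Hydrogen".
   Returns 1 on success, 0 if the line lacks any of the four fields. */
int parse_element_line(const char *line, element_line *out) {
    assert(line != NULL && out != NULL);
    /* The field widths keep sscanf from overflowing the buffers. */
    return sscanf(line, "%d %3s %lf %31s",
                  &out->atomic_number, out->abbrev,
                  &out->mass, out->name) == 4;
}
```

A comment line such as `# mass.data` fails at the first `%d` conversion, so the helper returns 0 for it; a caller could use that to skip anything `sed` missed.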

Here is the core—a `yacc`/`bison` program that describes the syntax of chemical formulae.

```
%{
// parse.y
#include <stdio.h>
#include <stdlib.h>

void yyerror(char *);
int yylex(void);
%}

%union {
    double mass;
    int count;
}

%token ERROR OPEN_PAREN CLOSE_PAREN EOLN EOFTOK
%token<mass> ELEMENT
%token<count> NUMBER
%type<mass> primary secondary compound

%start input

%%

input : lines EOFTOK { YYACCEPT; }
      ;

lines :
      | lines line
      ;

line : compound EOLN { printf("%f\n",$1); }
     | error EOLN { fprintf(stderr,"invalid\n"); }
     | EOLN
     ;

primary : ELEMENT { $$ = $1; }
        | OPEN_PAREN compound CLOSE_PAREN { $$ = $2; }
        ;

secondary : primary { $$ = $1; }
          | primary NUMBER { $$ = $1 * $2; }
          ;

compound : secondary { $$ = $1; }
         | compound secondary { $$ = $1 + $2; }
         ;

%%

void yyerror(char *msg) {}
```

Things to note:

1. Three sections, like lex/flex. Declarations, Rules, and C-stuff.
2. Production rules with actions.
3. `$$` syntax
4. Explicit declaration of tokens (terminals), implicit declaration of variables
5. The value carrying system: `%union`, `%token <tag>` and `%type <tag>` declarations.
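To see how the actions move values from the leaves of the parse tree up to the root, it may help to write the same three rules (`primary`, `secondary`, `compound`) by hand. The sketch below is a recursive-descent evaluator in plain C, which is not what `yacc` generates (that is a table-driven bottom-up parser); each `return` is annotated with the action it mirrors. The helper names `mass_of` and `formula_mass` are hypothetical, the four-entry table stands in for the element database using the values from `mass.data`, and blanks are not skipped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Tiny stand-in for the element database (values from mass.data). */
static double mass_of(const char *abbrev) {
    static const struct { const char *abbrev; double mass; } table[] = {
        { "H", 1.00794 }, { "He", 4.002602 },
        { "Li", 6.941 },  { "Be", 9.012182 },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].abbrev, abbrev) == 0) return table[i].mass;
    return -1.0;  /* unknown element */
}

static double compound(const char **p);

/* primary : ELEMENT | OPEN_PAREN compound CLOSE_PAREN
   Every function here returns -1.0 on error. */
static double primary(const char **p) {
    if (**p == '(') {
        (*p)++;
        double m = compound(p);
        if (m < 0 || **p != ')') return -1.0;
        (*p)++;
        return m;                           /* $$ = $2 */
    }
    if (!isupper((unsigned char)**p)) return -1.0;
    char abbrev[4] = { 0 };                 /* one upper, up to two lower */
    size_t n = 0;
    abbrev[n++] = *(*p)++;
    while (n < 3 && islower((unsigned char)**p)) abbrev[n++] = *(*p)++;
    return mass_of(abbrev);                 /* $$ = $1 */
}

/* secondary : primary | primary NUMBER */
static double secondary(const char **p) {
    double m = primary(p);
    if (m < 0 || !isdigit((unsigned char)**p)) return m;   /* $$ = $1 */
    int count = 0;
    while (isdigit((unsigned char)**p)) count = count * 10 + (*(*p)++ - '0');
    return m * count;                       /* $$ = $1 * $2 */
}

/* compound : secondary | compound secondary */
static double compound(const char **p) {
    double total = secondary(p);
    while (total >= 0 && (isupper((unsigned char)**p) || **p == '(')) {
        double m = secondary(p);
        if (m < 0) return -1.0;
        total += m;                         /* $$ = $1 + $2 */
    }
    return total;
}

/* Returns the mass of the whole formula, or -1.0 if it is malformed. */
double formula_mass(const char *s) {
    assert(s != NULL);
    const char *p = s;
    double m = compound(&p);
    return (m >= 0 && *p == '\0') ? m : -1.0;
}
```

For instance, `formula_mass("(LiH)2")` first computes the mass of `LiH` inside the parentheses, passes it up through `primary`, and only then multiplies by 2 in `secondary`, which is the same order in which the `yacc` actions fire.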

Where do the tokens come from? `lex`/`flex`, of course:

```
%{
// tokens.l
// tokenization description for chemical compounds.

#include <stdio.h>
#include "atomic-weight.h"
#include "element-db.h"
#include "parse.h"

int yyparse();
%}

%option noyywrap

element [[:upper:]][[:lower:]]{0,2}

%%

\(            { return OPEN_PAREN; }
\)            { return CLOSE_PAREN; }
{element}     {
                  element_ptr p = lookup_element(db,yytext);
                  if (p == NULL) {
                      return ERROR;
                  }
                  yylval.mass=p->mass;
                  return ELEMENT;
              }
[ \t]         {}
[[:digit:]]+  { yylval.count = atoi(yytext); return NUMBER; }
.             { return ERROR; }
\n            { return EOLN; }
<<EOF>>       { return EOFTOK; }

%%
```

When `lex` is working together with `yacc`, the characteristic action is to set the value of the appropriate variant of the union variable `yylval`, if needed, and to return an appropriate token value. The token values come from `parse.h`, which is created by `yacc`.
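The "set the value, then return the token" protocol can be illustrated without any generated code. The following hand-written scanner is a sketch only: `next_token`, `lex_value`, and `value_type` are hypothetical stand-ins for `yylex`, `yylval`, and the type produced by `%union`; the numeric token codes are illustrative rather than the ones `yacc` writes to `parse.h`; and only single-letter hydrogen is recognized as an element.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Stand-ins for what yacc would put in parse.h. The numeric
   token codes here are illustrative, not the generated ones. */
enum token { EOFTOK, ERROR, OPEN_PAREN, CLOSE_PAREN, EOLN, ELEMENT, NUMBER };
typedef union { double mass; int count; } value_type;  /* mirrors %union */

value_type lex_value;   /* plays the role of yylval */

/* Return the next token from *p, advancing *p past it.
   Only "H" is treated as a known element to keep the sketch short. */
enum token next_token(const char **p) {
    assert(p != NULL && *p != NULL);
    while (**p == ' ' || **p == '\t') (*p)++;         /* skip blanks */
    char c = **p;
    if (c == '\0') return EOFTOK;
    if (isdigit((unsigned char)c)) {
        char *end;
        lex_value.count = (int)strtol(*p, &end, 10);  /* set the value ... */
        *p = end;
        return NUMBER;                                /* ... then return the token */
    }
    (*p)++;
    switch (c) {
    case '(':  return OPEN_PAREN;     /* no value needed */
    case ')':  return CLOSE_PAREN;
    case '\n': return EOLN;
    case 'H':  lex_value.mass = 1.00794; return ELEMENT;
    default:   return ERROR;
    }
}
```

Scanning `"H2\n"` yields `ELEMENT` (with `lex_value.mass` set), then `NUMBER` (with `lex_value.count` set to 2), then `EOLN`, then `EOFTOK`. Tokens such as the parentheses carry no value, so the union is left untouched for them.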

A full set of sources for the weight program, including the data file, Makefiles, etc., is available in zip format.

Exercise 18.1 Implement a `yacc` program that parses well-nested strings of brackets over `()`, `[]`, and `{}`. For each input line, you should report the maximum depth of nesting.

```
$ para
{()}
2
{}[[()()]]
3
{)
syntax error
```