Lecture 17

Administrativia

SAK's office hours for today are cancelled. I have a bonus meeting. Whee.

Parsing, I. Intro to Flex

Part of the “Unix way” is to write small languages that handle particular problems: this has the effect of moving information from programs into data files, where it is more easily managed. Sometimes these small languages end up being Turing complete, which is to say, general purpose, like awk. More often, they fill a particular specialized niche, like sed, or sieve, and are not Turing complete.

Often the hardest part of a programming problem is to make it convenient to describe a particular instance. E.g., consider a maze solving program. The algorithmic part isn't that hard, but “getting the maze into the computer” is. We want a textual representation. Designing a small language seems like the way to go. But then, the problem reduces to that of parsing the language, i.e., transforming a character-stream representation of the maze into an in-memory data structure that we can feed to our search algorithm.

Unix provides several tools that work together to let us build robust, efficient language-processing tools.

The first of these was yacc—for “yet another compiler compiler”—which is typically used to convert a token stream into a syntax tree.

Historically, yacc is a very old program—the original version of yacc was written before C (I think in B, but it might have even been BCPL). A tour through the yacc code, or for that matter the code that it generates, visits a very strange place indeed.

The second of these is lex—for “lexical analyzer”—which is typically used to convert an input stream into a token stream. Tokens are typically things like key words, numbers, identifiers, operators, punctuation symbols, etc. Comment elision is typically handled at this stage too.
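To make the idea of a token stream concrete, here is a minimal sketch (not one of the examples below) of the kind of rules such a lexer might use. The token codes NUMBER and IDENT, and the idea of returning them to a caller, are illustrative assumptions rather than anything we build in this lecture:

%{
/* hypothetical token codes; a real parser would define these */
enum { NUMBER = 256, IDENT = 257 };
%}
%option noyywrap
%%
[0-9]+                  { return NUMBER; }
[a-zA-Z_][a-zA-Z0-9_]*  { return IDENT; }
[ \t\n]+                { /* skip whitespace */ }
.                       { return yytext[0]; /* operators, punctuation */ }
%%

A parser (for instance, one generated by yacc or bison) would call yylex() repeatedly, seeing an input like count = count + 1 as the token stream IDENT '=' IDENT '+' NUMBER.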

These days, we tend to use the more modern flex and bison, which ship as part of the standard GNU/Linux toolchain.

Pedagogically, it makes sense to cover lex/flex first.

Some examples:

%{
// rot13.l
// simple minded implementation of the Caesar cypher
#include <stdio.h>
#include <stdlib.h>
%}
%option noyywrap
%%
[a-z]   { putchar('a' + (yytext[0] - 'a' + 13) % 26); }
[A-Z]   { putchar('A' + (yytext[0] - 'A' + 13) % 26); }
%%
int main() {
    yylex();
    exit(0);
}

Things to note. A lex program has three sections, separated by %%'s. The first section is a preamble, which consists of a mix of lex declarations and embedded C code (delimited by %{ ... %}). Note that lex does not have comments!! But the embedded C code does...

The second section consists of rule-action pairs. If an input character cannot be matched by one of our regular expressions, a default rule echoes it to stdout. This program is very unusual in not trying to process every character. Note that in the action part, the matched text is found in the yytext array. In flex (unlike lex), this is actually a char* to a dynamically allocated buffer that grows as needed.

The third section consists of C code. The effect of the lex code is to produce a function yylex(), which we call. This will (by default) return at EOF.
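Assuming the scanner has been built as rot13 (using the Makefile below), a quick sanity check might look like this:

$ echo 'Hello, world' | ./rot13
Uryyb, jbeyq
$ echo 'Uryyb, jbeyq' | ./rot13
Hello, world

Note that the comma and the space are passed through by the default rule, since no pattern matches them.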

To compile this, we have the following rules in our Makefile:

LEX= flex

rot13: rot13.o

clean:
	rm -f rot13 rot13.o

Note that make has a default rule for converting .l files to .c files, but it invokes the lexical analyzer through the variable LEX, which defaults to the (obsolete) lex rather than flex; hence the LEX= flex line above.
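Concretely, with GNU make the build of rot13 chains through the built-in .l to .c, .c to .o, and link rules, so the commands it runs look roughly like this (the exact flags depend on your CFLAGS/LDFLAGS settings):

flex -t rot13.l > rot13.c
cc -c -o rot13.o rot13.c
cc rot13.o -o rot13

GNU make treats the generated rot13.c as an intermediate file and removes it once the build finishes.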

Our next example processes a text file, returning one word per line (case-normalized). You might remember that we did this task using a sed/tr/tr pipeline earlier:

%{
// words.l
// extract words from a file
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
%}
%option noyywrap
word    [a-zA-Z]+
%%
{word}  {
            for (char *cp = yytext; *cp != 0; ++cp) {
                *cp = tolower(*cp);
            }
            printf("%s\n", yytext);
        }
\n      {}
.       {}
%%
int main() {
    yylex();
    exit(0);
}

Note that we can define regular expressions in the declaration part of the flex file, and refer to them in the rules part. Note that . captures any character except for a newline, which must be explicitly caught.
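For example, assuming the program has been built as words:

$ echo 'Hello, World!' | ./words
hello
world

The comma, the exclamation point, the space, and the newline are all consumed (and discarded) by the \n and . rules.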

At this point, we need to get a bit more sophisticated about our Makefile, since we're using a C99-ism: declaring a variable within the control portion of a for statement.

CFLAGS= -std=c11
LEX= flex

words: words.o

clean:
	rm words words.o

This makes the point that flex isn't actually that particular about the “action” language. In effect, the target language has to be able to compile C code (the “envelope” of the function), and its treatment of blocks and statements has to be similar to C's. So “C-family” is good enough, which means that flex (and bison) can be, and is, used with ANSI C, C99, C11, C++, Objective-C, etc.

Our next example is a flex-based implementation of the standard Unix wc program, albeit with a distinctive definition of word:

%{
// wc -- a simple word counting program
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int chars = 0;
int words = 0;
int lines = 0;
%}
%option noyywrap
alpha   [a-zA-Z]
word    {alpha}+
%%
{word}  { chars += strlen(yytext); ++words; }
\n      { ++chars; ++lines; }
.       { ++chars; }
%%
int main() {
    yylex();
    printf("%8d %8d %8d\n", lines, words, chars);
    exit(0);
}

There's not a huge amount to see here—except for the global declarations of chars, words, and lines variables.
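A quick check on a trivial input (assuming the binary is named wc and invoked as ./wc so it doesn't collide with the system one):

$ printf 'hello world\n' | ./wc
       1        2       12

One line, two words, and twelve characters, counting the space and the newline.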

Exercise 17.1 (Non-graded, don't turn in) This version of the wc program produces a slightly different word count (although the same character and line count) from the standard /usr/bin/wc program. Explain the difference, and provide a short test file that illustrates it. Modify this program so that it agrees with /usr/bin/wc on word count, as well as character and line count.

It is also possible to have explicit lexer states (flex calls them start conditions). The following program might be used to strip comments from C code.

%{
/* decomment.l
 * remove comments from C source files.
 */
#include <stdio.h>
#include <stdlib.h>
%}
%option noyywrap
%x COMMENT
%%
<INITIAL>"/*"   { /* Match the beginning of a multi-line comment */
                  BEGIN COMMENT;
                }
<INITIAL>"//".* { /* Match a single-line comment */ }
<INITIAL>.|\n   { putchar(yytext[0]); }
<COMMENT>"*/"   { /* Recognize the end of a comment */
                  BEGIN INITIAL;
                }
<COMMENT>.      { /* Dispense with characters in multi-line comments */ }
<COMMENT>\n     { putchar('\n'); }
%%
int main(int argc, char **argv) {
    yylex();
    exit(0);
}

This is more substantial. Now, we're going to use multiple states to deal with a more complex lexical situation.

Note the %x declaration, which introduces an exclusive state, and the BEGIN directives in the actions for shifting from one tokenizing state to another.
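To see the state machine in action, here is a small hypothetical input (the file name demo.c is just illustrative) and what decomment produces for it:

$ cat demo.c
int x = 1;  /* initialise
               x */
int y = 2;  // and y
$ ./decomment < demo.c
int x = 1;  

int y = 2;  

The blank line comes from the <COMMENT>\n rule, which preserves line structure even while discarding the comment's text (so that line numbers in compiler error messages still line up); everything before each comment, trailing spaces included, is echoed by the INITIAL rules.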

Exercise 17.2 Email the instructor a brief note indicating the wiki projects that you expect to work on between now and the end of the quarter. Your subject line should be “[16200] Homework 17.2.”

Exercise 17.3 Fix up decomment. The version of decomment above has a significant bug ... it is possible that a C program will contain a string literal (or character literal, but that is a really obscure case that you can ignore for now) that contains one of the comment-begin substrings. It's easy enough to expose this bug by running decomment on its own source; even though decomment targets C, lex source is close enough to C for the test to work. Correct this flaw.

Note: for the Haskellians, the program alex is essentially a lex replacement. Like lex, alex processes a file (with a syntax intentionally reminiscent of lex's), producing a Haskell source file. One complexity is that for alex to handle user states, it's important to thread a monad through the lexer. I found the documentation to be somewhat lacking, but was successful in figuring it out.