Lecture 14

Abstraction Barriers

The Histogram problem is typically the source of more emails from students than all of the previous homework exercises combined. Some of the emails are about technicalities of C, but I'd say that the greater issue is a general lack of understanding of the notion of an abstraction barrier. This is an absolutely critical issue in controlling complexity in large software systems, so let's spend a few moments on the subject.

As programs get larger, the complexity of the code is often driven by an increase in interactions between various program concepts. As these interactions increase, the code becomes more and more rigid: any change to one part of the program can interact with all of the rest, and so significant restructurings become more and more difficult. This is how programs die.

The standard solution is through the creation of abstraction barriers. The idea here is to limit the interaction between various portions of the program to well-defined interfaces (typically, these coincide with module/compilation unit boundaries). The responsibility of a “service” module is to implement the interface, while the “client” is limited to accessing the interface. What is really happening here is an application of the old “divide and conquer” strategy of problem solving. By factoring a programming project into a number of modules that interact across well-defined interfaces, we essentially turn a large difficult task into a number of smaller, simpler tasks.
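To make this concrete, here's a minimal sketch of what such an interface might look like for the Histogram problem. The file name histogram.h, the type histogram_t, and the function names are illustrative, not the homework's actual interface:

/* histogram.h -- all a client ever sees of the histogram module.
 * The representation is hidden behind an incomplete struct type,
 * so clients can only go through these functions. */
#ifndef HISTOGRAM_H
#define HISTOGRAM_H

typedef struct histogram *histogram_t;    // opaque handle

histogram_t hist_create(void);                            // a new, empty histogram
void        hist_add(histogram_t h, const char *word);    // record one occurrence
int         hist_count(histogram_t h, const char *word);  // occurrences seen so far
void        hist_destroy(histogram_t h);                  // release all resources

#endif

A client of this interface can build and query histograms without ever knowing how struct histogram is laid out; that knowledge lives entirely on the service side of the barrier.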

An unexpected side benefit, as we'll see in Lecture 15, is that in a well-structured program we can completely reimplement one module without impacting others, possibly gaining important efficiencies. There's another kind of parsimony here. A common experience in writing code is that there are often multiple approaches possible to a given problem. There's often a simple solution, which just might be good enough, but it's hard to know without testing; and then a complex solution for more demanding situations. A modular approach allows us to write a correct program using only simple solutions, to determine whether or not the resulting program meets our performance needs, and if it does not, to do performance analyses that will identify particular “hot-spots,” i.e., performance bottlenecks in the code. We can then direct our efforts specifically to the portions of the program where improvements will have the greatest benefit.

get_word

It's possible to write a naïve get_word function that's good enough for today's homework, e.g.,

#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define WORDMAX 1024

char *get_word()
{
    static char buf[WORDMAX+1];
    int c = '\0';

    // skip nonletters
    while (!isalpha(c = getchar())) {
        if (c == EOF) {
            return NULL;
        }
    }

    int i = 0;
    do {
        buf[i++] = tolower(c);
        assert(i < WORDMAX);
    } while (isalpha(c = getchar()) && c != EOF);

    buf[i] = '\0';
    return buf;
}

because even "Supercalifragilisticexpialidocious" at 31 characters doesn't take 1024 characters. Neither does "antidisestablishmentarianism" at 28 characters long purported to the longest word in English. Of course, there are longer words still in other languages -- Wiki claims 79 for German, 139 for Afrikaans, 173 for Greek (used by Aristophanes!), and a mind-bending 431 for Sanskrit. So 1024—an exemplar of the "allocate a large buffer and prey" approach—is a bit generous, but probably a whole lot less generous than we expected.

Can we do better? Can we write a version of get_word that doesn't overflow? The rest of the lecture is a long answer to that question.

Static Dynamically Allocated Buffer

This version of get_word uses a static (i.e., local but persistent), dynamically allocated buffer, into which it reads a word of input. If it encounters a word that is longer than the buffer, it attempts to allocate a larger buffer.

The calling convention is that the static buffer is returned (so there's no obligation on the caller to free it), but this comes at the cost that the contents of the buffer will be overwritten at the next call to get_word. It is the caller's responsibility to copy the returned result (e.g., using strdup) if they want to save it.

#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

char *get_word(FILE *fp)
{
    static char *buf = NULL;
    static int buf_size = 0;

    // allocate our input buffer
    if (buf == NULL) {
        buf_size = 256;
        buf = (char *) calloc(buf_size, sizeof(char));
        assert(buf != NULL);
    }

    int ch;

    // scan past anything that isn't a letter.
    while (!isalpha(ch = fgetc(fp))) {
        if (ch == EOF) return NULL;
    }

    // read the word
    int index = 0;
    do {
        // invariants:
        //   index < buf_size
        //   buf points to a buffer of size buf_size
        //   ch contains the next character to be added to the buffer
        buf[index++] = (char) tolower(ch);

        // make sure we can push another character,
        // reallocating the buffer if necessary.
        if (index == buf_size) {
            buf_size *= 2;
            buf = realloc(buf, buf_size * sizeof(char));
            assert(buf != NULL);
        }
    } while (isalpha(ch = fgetc(fp)) && ch != EOF);

    buf[index] = '\0';
    return buf;
}
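To see this calling convention from the client's side, here's a small hypothetical client (show_words is an invented name): it may use the returned buffer freely until the next call, must strdup anything it wants to keep, and must never free what get_word returns.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *get_word(FILE *fp);   // the static-buffer version above

void show_words(FILE *fp)
{
    char *first = NULL;
    char *word;

    while ((word = get_word(fp)) != NULL) {
        if (first == NULL) {
            first = strdup(word);   // copy it now, or lose it on the next call
            assert(first != NULL);
        }
        printf("%s\n", word);       // using the buffer before the next call is fine
    }

    if (first != NULL) {
        printf("first word: %s\n", first);
        free(first);                // we own the copy; get_word owns its buffer
    }
}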

A serious problem with this version of get_word in the context of modern systems is that it is not re-entrant (i.e., you cannot have two calls to get_word in flight in the same program at the same time). This is a huge problem in multi-threaded environments. E.g., we might imagine a "scholarly library" program that provides an interface for searching and performing computations on a variety of books. A scholar interested in word use in, say, Shakespeare, might end up with several concurrent histogram computations, and this would be a disaster: all of the threads would be writing into the same static buffer. How can we make this thread safe? What costs do we incur?

There are times when it's just not possible to be both a floor wax and a dessert topping. The essential obstacle we face in writing a re-entrant get_word is the use of the static buffer buf. Our problem is that this is essential to our allocation strategy, which tries to limit the total number of calls to malloc and free to be logarithmic in the length of the longest string. One strategy we might use, if this were critically important, is to somehow implement a per-thread buffer. This is certainly possible, but it doesn't scale particularly well.
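For the record, here's a sketch of the per-thread-buffer idea, assuming a C11 compiler that supports _Thread_local; this is one possible realization, not the homework's.

#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

// Sketch: same algorithm as above, but each thread gets its own copy
// of the static state, so concurrent calls no longer share a buffer.
char *get_word(FILE *fp)
{
    static _Thread_local char *buf = NULL;
    static _Thread_local int buf_size = 0;

    if (buf == NULL) {
        buf_size = 256;
        buf = calloc(buf_size, sizeof(char));
        assert(buf != NULL);
    }

    int ch;
    while (!isalpha(ch = fgetc(fp))) {
        if (ch == EOF) return NULL;
    }

    int index = 0;
    do {
        buf[index++] = (char) tolower(ch);
        if (index == buf_size) {
            buf_size *= 2;
            buf = realloc(buf, buf_size * sizeof(char));
            assert(buf != NULL);
        }
    } while (isalpha(ch = fgetc(fp)) && ch != EOF);

    buf[index] = '\0';
    return buf;
}

The scaling problem is that every thread that ever calls get_word now holds its own buffer, sized to the longest word that thread has seen, for as long as the thread lives.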

The most sensible solution seems to require a change in policy. The version of get_word above owns the buffer it returns, and so relies on the client calling strdup or a similar function to retain the value, but not calling free to release the buffer. Making this re-entrant flips this around: what was forbidden before (calling free on the returned buffer) now becomes mandatory, as the caller will own the result. But as the circumstances have changed, so too has what we might hope to accomplish. We don't want to return pointers to unnecessarily large allocation blocks, since this will have us chewing through memory unnecessarily quickly. We can avoid that by calling strdup on our buffer, and freeing the underlying buffer ourselves. But we can still avoid unnecessary trips through the allocator by following the old programming precept of making the common case efficient, using a stack-allocated buffer that's large enough in all but pathological cases.

#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFSIZE 256

char *get_word(FILE *fp)
{
    char store[BUFSIZE];   // a local buffer, which is almost always good enough...
    char *buf = store;
    int buf_size = BUFSIZE;
    int ch;

    // scan past anything that isn't a letter.
    while (!isalpha(ch = fgetc(fp))) {
        if (ch == EOF) return NULL;
    }

    // read the word
    int index = 0;
    do {
        // invariants:
        //   index < buf_size
        //   buf points to a buffer of size buf_size
        //   ch contains the next character to be added to the buffer
        buf[index++] = (char) tolower(ch);

        // make sure we can push another character,
        // reallocating the buffer if necessary.
        if (index == buf_size) {
            buf_size *= 2;

            // is our buffer stack allocated or dynamically allocated?!
            if (buf == store) {
                buf = (char *) malloc(buf_size);
                assert(buf != NULL);
                memcpy(buf, store, BUFSIZE);
            } else {
                buf = realloc(buf, buf_size * sizeof(char));
                assert(buf != NULL);
            }
        }
    } while (isalpha(ch = fgetc(fp)) && ch != EOF);

    buf[index] = '\0';

    char *result = strdup(buf);
    assert(result != NULL);

    // if our buffer is dynamically allocated, free it
    if (buf != store) {
        free(buf);
    }

    return result;
}
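From the client's side, the obligations have inverted; a hypothetical word-counting client (count_words is an invented name) now looks like this:

#include <stdio.h>
#include <stdlib.h>

char *get_word(FILE *fp);   // the re-entrant version above

void count_words(FILE *fp)
{
    long n = 0;
    char *word;

    while ((word = get_word(fp)) != NULL) {
        n++;            // ...do whatever we like with the word...
        free(word);     // mandatory now: the caller owns the result
    }

    printf("%ld words\n", n);
}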

Exercise 14.1 Implement the Animal Game in C.

The Animal Game is an interactive, text-based game that illustrates a simple learning strategy. The basic idea is that you think of an animal. The program then asks a sequence of yes/no questions, and eventually it makes a guess. If it guesses right, it wins; if it guesses wrong, it loses. The learning comes from losses, where the program expands its knowledge base by adding a new animal, along with a discriminating question.

Here's a sample transcript, starting with very little knowledge at all...

$ ./animals
Welcome to the Animal Game!
Does your animal have four legs? (y/n) y
Is your animal a horse? (y/n) n
I lose. Your animal is
> a dog
Please state a question that is true of a dog but false of a horse.
> Does your animal bark?
Would you like to play again? (y/n) y
Does your animal have four legs? (y/n) y
Does your animal bark? (y/n) y
Is your animal a dog? (y/n) n
I lose. Your animal is
> a seal
Please state a question that is true of a seal but false of a dog.
> Does your animal live in the sea?
Would you like to play again? (y/n) y
Does your animal have four legs? (y/n) y
Does your animal bark? (y/n) y
Does your animal live in the sea? (y/n) y
Is your animal a seal? (y/n) y
I win!
Would you like to play again? (y/n) n
Goodbye.
$

Don't worry about serialization (i.e., saving the Animal Game database from run to run).
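As a hint (one possible representation, not a requirement): the knowledge base is naturally a binary decision tree, with questions at the internal nodes and animals at the leaves. The field names below are illustrative.

// One possible (not required) representation of the knowledge base:
// a binary decision tree. Internal nodes hold a question; leaves
// hold an animal's name.
typedef struct node {
    char *text;          // the question, or the animal's name at a leaf
    struct node *yes;    // taken on a "yes" answer (NULL at a leaf)
    struct node *no;     // taken on a "no" answer (NULL at a leaf)
} node_t;

Losing a round then amounts to replacing the losing leaf with a new internal node holding the user's question, with the new animal's leaf on one branch and the old leaf on the other.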