Lecture 13

The C Programming Language, IV

Arrays: a pseudo-type

Often, we want to create a collection of items of the same type, indexed by number rather than name, i.e., an array. We can define an array very simply:

int a[3];

defines a to be a three-element array of integers. All C arrays are zero-indexed, i.e., the constituents of the array a are a[0], a[1], and a[2]. I refer to arrays as a pseudo-type because, in most expressions, a is effectively a constant of type int *, and its effective value is &a[0]. There is not really an array type in C; instead, arrays arise implicitly from C's ability to perform address arithmetic on pointers.

This has some odd consequences: an array has no intrinsic notion of size, although (if the declaration of a is in scope) we can compute the number of elements in the array via

int asize = sizeof(a)/sizeof(int);

using the C compiler's sizeof operator. Such things are sometimes done, although it's not clear that they're a good idea.
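To make the "in scope" caveat concrete, here is a minimal sketch. The macro name ARRAY_COUNT and both function names are my own inventions for illustration. It shows that sizeof sees the whole array only where the array's declaration is visible; once the array has been passed to a function, only a pointer remains.

```c
#include <stddef.h>

// Element count of a true array. Only meaningful where the array's
// declaration is in scope. ARRAY_COUNT is a hypothetical helper name.
#define ARRAY_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))

size_t count_in_scope(void)
{
    int a[3];
    return ARRAY_COUNT(a);   // 3: sizeof sees the whole array here
}

size_t size_seen_by_callee(int *p)
{
    return sizeof(p);        // size of a pointer, whatever array was passed
}
```

Dividing by sizeof((arr)[0]) rather than sizeof(int) keeps the computation correct if the element type is later changed.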

Recall that C assigns a size to each data type. This size must reflect alignment considerations. If

typedef struct particle_store {
    double mass;
    double xpos, ypos;
    double xvel, yvel;
} particle_store, *particle;

particle_store a[10];
particle ap = a; // a is a constant pointer.

then ap+1 is also a pointer, but to the next particle, not the next byte, i.e.,

(char *) (ap + 1) == ((char *) ap) + 1 * sizeof(particle_store)

The notation a[i] abbreviates *(a+i). This is perfectly general, and negative indices are possible (and occasionally useful...). One crucial thing to note is that C arrays are not bounds checked, which means that buffer overflow attacks against naïvely coded C programs are endemic.
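The equivalence between indexing and pointer arithmetic can be checked directly. The following is only a sketch. The function names and the five-element array are hypothetical, chosen for illustration.

```c
#include <stddef.h>

// Returns 1 if a[i] and *(a + i) name the same object (valid for i in 0..4).
int index_matches_pointer_form(size_t i)
{
    static int a[5] = {10, 20, 30, 40, 50};
    return &a[i] == a + i && a[i] == *(a + i);
}

// A negative index is legitimate as long as the result stays inside the
// array the pointer points into. Nothing checks this for you.
int element_before(const int *p)
{
    return p[-1];            // same as *(p - 1)
}

// Adding 1 to a pointer advances it by one element, not by one byte.
size_t stride_in_bytes(void)
{
    double d[2];
    return (size_t)((char *)(d + 1) - (char *)d);   // == sizeof(double)
}
```

Calling element_before(&v[2]) on an array v is fine. Calling element_before(v) reads outside the array, and the language will not stop you.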

In addition to pointer offset arithmetic, one can take the difference between two pointers of the same type that point into the same allocation unit (which gives an integer result expressed in terms of offsets, i.e., elements rather than bytes). E.g.,

&a[j] - &a[i] == j-i;
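A common use of pointer subtraction is to turn a pointer found during a scan back into an index. A minimal sketch follows. The helper name index_of is hypothetical. The result type of a pointer difference is ptrdiff_t, from stddef.h.

```c
#include <stddef.h>

// Return the index of the first element equal to key, or -1 if absent.
ptrdiff_t index_of(const int *base, size_t n, int key)
{
    for (const int *p = base; p != base + n; ++p) {
        if (*p == key)
            return p - base;   // pointer difference, counted in elements
    }
    return -1;
}
```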

Arrays are often allocated using calloc:

particle av = (particle) calloc(100, sizeof(particle_store)); // dynamically allocate an array of 100 particles...
// ... uses of av[i] ...
free((void *) av); // free the array...

N.B., both malloc and calloc return NULL on failure (e.g., when there is no more memory available for allocation). An important difference is that calloc zeros memory, whereas malloc does not. If you want to zero raw memory, the standard function memset is handy (the older bzero does the same job, but is deprecated).
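Since allocation can fail, the result should be checked before use. The sketch below contrasts the two allocation routes. Both function names are hypothetical, and int elements are used so that "all bits zero" is guaranteed to mean the value 0.

```c
#include <stdlib.h>
#include <string.h>

// Allocate n ints, all zero. Returns NULL on failure.
int *make_zeroed(size_t n)
{
    return calloc(n, sizeof(int));
}

// The same effect with malloc, zeroing the raw memory by hand.
int *make_zeroed_by_hand(size_t n)
{
    int *p = malloc(n * sizeof *p);   // contents are indeterminate here
    if (p != NULL)
        memset(p, 0, n * sizeof *p);
    return p;
}
```

The caller owns the result in both cases and releases it with free.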

Strings

A string is a NUL ('\0') terminated array of characters. Thus, the natural type for strings is char *. Literal strings are read only:

char *msg = "Buy more socks!";
msg[10] = 'r'; // probably a runtime error.

Note that the probable runtime error comes from the fact that literal strings are usually put in read-only data pages, not because the array reference itself is checked.
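If a modifiable string is wanted, one option is to initialize a char array from the literal, which copies the characters into ordinary writable memory. A minimal sketch, in which the function name patch_copy is hypothetical:

```c
#include <stddef.h>
#include <string.h>

// Writes a patched copy of the message into out.
// Returns 1 on success, 0 if out is too small.
int patch_copy(char *out, size_t outsize)
{
    char msg[] = "Buy more socks!";   // an array, initialized from the literal
    msg[10] = 'r';                    // fine: msg is writable storage
    if (outsize < sizeof msg)         // sizeof msg counts the '\0' too
        return 0;
    memcpy(out, msg, sizeof msg);
    return 1;
}
```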

The type of main can be understood now:

int main (int argc, char **argv)

The second argument is a NULL terminated array of pointers to strings.
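Because the array itself is NULL terminated, it can be walked without consulting argc at all. The two helpers below are hypothetical and only sketch the idea. A real main would simply pass its own argv to them.

```c
#include <stddef.h>
#include <string.h>

// Count the entries by walking to the terminating NULL pointer.
int count_args(char **argv)
{
    int n = 0;
    while (argv[n] != NULL)
        ++n;
    return n;
}

// Total number of characters over all arguments; each argument is
// itself a '\0' terminated string, so strlen applies.
size_t total_arg_length(char **argv)
{
    size_t total = 0;
    for (char **p = argv; *p != NULL; ++p)
        total += strlen(*p);
    return total;
}
```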

The standard library contains a large number of functions that facilitate string processing. One thing to understand is that there are two (and in some cases three) layers to the string handling functions.

The oldest layer, which you should never use, is represented by functions like

strcpy(char *dest, const char *source);

These functions simply assume that the destination array is large enough to accept the source string, which is a bad assumption to make. For example:

char buf[10];
// ...
strcpy(buf, "Supercalifragilisticexpialidocious");

This results in a “buffer overflow,” in which memory intended for other purposes is overwritten because the memory available for the buffer was inadequate. If you're lucky, buffer overflows like these will only cause your program to crash. In many cases, however, they result in security vulnerabilities that enable attackers who exploit them to hijack your program and take over the computer it's running on.

The source string might have been provided to the program as input from an attacker — think of a Web server that is open to requests from anyone in the world. In these cases, the programmer can't predict how long the source string will be. However much memory the programmer allocates for the destination, an attacker can just make the source string longer. As a result, functions like strcpy can't be used securely and should be avoided.

A later layer is represented by functions like

strlcpy(char *dest, const char *source, size_t size);

This is a string copy, but the programmer specifies the size of the destination buffer, and the function will never exceed it. It will truncate the source string if it has to. Thus, for example:

#define BUFSIZE 10
char buf[BUFSIZE];
strlcpy(buf, "Supercalifragilisticexpialidocious", BUFSIZE);

You might also encounter:

strncpy(char *dest, const char *source, size_t count);

This is also limited to copying count bytes. The problem is that it isn't guaranteed to add the terminating '\0': if the source is at least count bytes long, the destination is left unterminated. Thus, to use it safely, you have to add the terminator yourself:

#define BUFSIZE 10
char buf[BUFSIZE];
strncpy(buf, "Supercalifragilisticexpialidocious", BUFSIZE);
buf[BUFSIZE - 1] = '\0';
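strlcpy originated in BSD and is not part of ISO C, so it may be missing on some systems. Where it is unavailable, one portable way to get the same "truncate but always terminate" behavior is snprintf. The helper below is a sketch under that assumption. The name copy_bounded is hypothetical, not a library function.

```c
#include <stdio.h>

// Copy source into dest (capacity size), truncating if necessary and
// always terminating. Returns 1 if the whole string fit, 0 otherwise.
int copy_bounded(char *dest, size_t size, const char *source)
{
    if (size == 0)
        return 0;                                   // nowhere to put even the '\0'
    int needed = snprintf(dest, size, "%s", source);
    return needed >= 0 && (size_t)needed < size;    // needed excludes the '\0'
}
```

Reporting truncation to the caller matters: a silently shortened file name or command is often a bug in its own right.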

A final layer is represented by functions like

char *strdup(const char *src);
int asprintf(char **ret, const char *fmt, ...);
char *fgetln(FILE *stream, size_t *len);

which, rather than accepting a destination buffer as an argument, return an appropriately-sized result buffer.

The disadvantage of these “third layer” functions is that they're inconsistent in how they expect the programmer to manage the result buffer's memory. Functions like strdup and asprintf allocate a new string and require the programmer to deallocate it via free. On the other hand, functions like fgetln, which reads the next line from a file, return a pointer to a buffer that is conceptually owned by the function. This means that the programmer doesn't have to worry about deallocation. But it also means that the buffer will be overwritten by subsequent or concurrent calls, making use in a multithreaded environment problematic.
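The two ownership conventions can be imitated in a few lines. Both functions below are hypothetical stand-ins written for illustration. copy_string behaves like strdup, so the caller must free the result. label_for behaves like fgetln in that it hands back a buffer it owns and reuses.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Caller-owned result: a fresh allocation, or NULL on failure.
char *copy_string(const char *src)
{
    size_t n = strlen(src) + 1;       // include the '\0'
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, src, n);
    return copy;
}

// Function-owned result: the same static buffer is returned every time,
// so each call overwrites what the previous call returned.
const char *label_for(int id)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "item-%d", id);
    return buf;
}
```

With label_for, a pointer saved from the first call will show the second call's text, which is exactly the hazard described above.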

Unions (sum types)

A much less frequently used type constructor is the union type, which describes a data structure that holds one object, but of different possible types. E.g.,

union int_or_float { int ival; float fval; };

defines a type that holds either an integer or a float (note that the union doesn't keep track of which, and so unions are sometimes used to facilitate bit-level operations on types that don't natively support them). Unions typically occur within structs, together with enum fields that serve as designators for the union's intended type. A variable of enum type can take on one of a finite number of symbolic values (which equate to integer values, naturally). And naturally, enforcing the restriction on the values of an enum variable is the programmer's responsibility.
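As an aside on the parenthetical remark about bit-level operations: a float has no shift or mask operators, but its representation can be read through a union member of integer type. The sketch below assumes a 32-bit IEEE 754 float, which is common but not required by the language. The union and function names are hypothetical.

```c
#include <stdint.h>

union float_bits {
    float    fval;
    uint32_t bits;
};

// Reinterpret the stored bytes of f as an unsigned integer (no conversion).
uint32_t bits_of(float f)
{
    union float_bits u;
    u.fval = f;
    return u.bits;
}

// Under the IEEE 754 assumption, the top bit is the sign.
int sign_bit(float f)
{
    return (int)(bits_of(f) >> 31);
}
```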

E.g., the following is more realistic:

struct int_or_float {
    enum { iof_int, iof_float } tag;
    union {
        int ival;
        float fval;
    } variant;
};

Note that unions (and enums, and structs) don't need to be named, and often aren't. There is a problem with this code, however, and that is that the names declared in the enum type are global, appearances notwithstanding, i.e., iof_int and iof_float enter the global namespace, not some type-local namespace. This often forces naming conventions that encode the type within the symbolic name, as above.

Note also that C11 allows the union type to be anonymous, thus:

struct int_or_float {
    enum { iof_int, iof_float } kind;
    union {
        int intv;
        float floatv;
    };
};
...
struct int_or_float iof;
iof.kind = iof_float;
iof.floatv = 1.0;
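In use, every read of the union is guarded by a test of the tag. Here is a minimal sketch built on the anonymous-union form, so it needs a C11 compiler. The three function names are hypothetical.

```c
struct int_or_float {
    enum { iof_int, iof_float } kind;
    union {
        int intv;
        float floatv;
    };
};

struct int_or_float make_int(int v)
{
    struct int_or_float r;
    r.kind = iof_int;        // set the tag and the matching member together
    r.intv = v;
    return r;
}

struct int_or_float make_float(float v)
{
    struct int_or_float r;
    r.kind = iof_float;
    r.floatv = v;
    return r;
}

// Consult the tag before touching the union.
double as_double(struct int_or_float v)
{
    switch (v.kind) {
    case iof_int:   return (double)v.intv;
    case iof_float: return (double)v.floatv;
    }
    return 0.0;              // reached only if kind holds an invalid value
}
```

Routing all construction through make_int and make_float is one way to keep the tag and the stored member from drifting apart.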

C programmers often fake unions. It is guaranteed that if two structs have the same initial structure, then these initial segments will be laid out in the same way. Thus, e.g., you might see something like this:

enum shape_t { shape_square, shape_rectangle, shape_circle, ... };

struct shape_base {
    enum shape_t shape; // enum shape...
};

struct shape_square {
    enum shape_t shape; // must be shape_square
    float left, bottom;
    float side;
};

struct shape_rectangle {
    enum shape_t shape; // must be shape_rectangle
    float left, bottom;
    float width, height;
};

In real world code (e.g., drawing code), there would be a switch statement driven by the tag, and each case would begin with a cast to a more specific type, e.g.,

void draw_shape(struct shape_base *sp) {
    switch (sp->shape) {
    case shape_square:
        draw_square((struct shape_square *) sp);
        break;
    case shape_rectangle:
        draw_rectangle((struct shape_rectangle *) sp);
        break;
    ...
    }
}
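A closely related variant, which is not the layout shown above, embeds the base struct as the first member of each specific struct. The language guarantees that a pointer to a struct, suitably converted, points to its first member, so the casts in both directions are well defined. The sketch below uses that variant with an area computation standing in for drawing. The function name shape_area is hypothetical.

```c
enum shape_t { shape_square, shape_rectangle };

struct shape_base {
    enum shape_t shape;
};

struct shape_square {
    struct shape_base base;   // base.shape must be shape_square
    float left, bottom;
    float side;
};

struct shape_rectangle {
    struct shape_base base;   // base.shape must be shape_rectangle
    float left, bottom;
    float width, height;
};

// Dispatch on the tag, then cast to the more specific type.
float shape_area(const struct shape_base *sp)
{
    switch (sp->shape) {
    case shape_square: {
        const struct shape_square *sq = (const struct shape_square *)sp;
        return sq->side * sq->side;
    }
    case shape_rectangle: {
        const struct shape_rectangle *r = (const struct shape_rectangle *)sp;
        return r->width * r->height;
    }
    }
    return 0.0f;              // unknown tag
}
```

A caller passes &sq.base for a struct shape_square sq, which avoids a cast at the call site.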

C also has bit-fields. Through bit-fields, it is possible to produce more compact structure types. These days, bit-fields are rarely used outside of device drivers, as they lead to portability and maintenance issues, and memory constraints are less procrustean now, in the days of gigabyte-scale main memories, than they were in the kilobyte-scale memories of the days of legend and myth (or memory, for codgers like me).
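For the record, a bit-field is declared by following a member name with a colon and a width in bits. The struct and function below are hypothetical, a sketch only. How the fields are packed into storage is left to the implementation, which is one source of the portability issues just mentioned.

```c
struct packed_flags {
    unsigned int visible : 1;   // holds 0..1
    unsigned int mode    : 3;   // holds 0..7
    unsigned int level   : 4;   // holds 0..15
};

struct packed_flags make_flags(unsigned visible, unsigned mode, unsigned level)
{
    struct packed_flags f;
    f.visible = visible;        // values too large for an unsigned field
    f.mode = mode;              // are reduced modulo 2^width
    f.level = level;
    return f;
}
```

Storing 9 in the 3-bit mode field, for instance, leaves 1 there.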

Exercise 13.1 Write a simple word-frequency program in C. [We did this earlier using a tr/sed/awk pipeline.]

Part 1. Your program should have a function

char *get_word();

This should read the next word (i.e., non-empty sequence of letters) from input, returning either a pointer to a buffer that contains that word, or NULL if there are no more words to read.

Part 2. You should create a flexible procedural interface for managing (word,count) pairs, e.g.

typedef struct {
    char *word;
    int count;
} dict_entry_type;

typedef struct { ... } dict_type;

void dict_initialize(dict_type *d);
void dict_increment(dict_type *d, char *word);
dict_entry_type **dict_get_entries(dict_type *d);

The final function returns a dynamically allocated array of pointers to dict_entry_types, suitable for passing to qsort(3).
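One detail that trips people up: qsort hands the comparison function pointers to the array's elements, and here the elements are themselves pointers, so the comparator receives pointers to pointers. The sketch below shows only that part. The ordering (count descending, ties broken alphabetically) is an assumption on my part, and the names by_count_desc and sort_entries are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *word;
    int count;
} dict_entry_type;

// pa and pb point at array elements, i.e., at dict_entry_type pointers.
static int by_count_desc(const void *pa, const void *pb)
{
    const dict_entry_type *a = *(const dict_entry_type * const *)pa;
    const dict_entry_type *b = *(const dict_entry_type * const *)pb;
    if (a->count != b->count)
        return (a->count < b->count) - (a->count > b->count);   // larger first
    return strcmp(a->word, b->word);                            // then A to Z
}

void sort_entries(dict_entry_type **entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], by_count_desc);
}
```

Comparing with < and > rather than subtracting the counts avoids integer overflow on extreme values.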

Part 3. You should return the frequency table.

As a test, run your program on a reasonably sized input (e.g., cat together all the man pages in /usr/man/man1 to create a test file), and compare histogram output and performance between this version and the tr/sed/awk pipeline. Obviously, outputs should be identical!