Lecture 15

A performance analysis of hist

Simply re-implementing a program (like hist) in a lower-level language is no guarantee of performance improvements. Let's start by considering our original implementation, using a tr-tr-sed-awk-sort pipeline. We'll use as a data source the King James Version Bible, downloaded from Archive.org. Note here that more recent timings with MacOS X 10.9 are different and much worse :-(.

$ time sh/hist < kjv.txt > kjv.sh.txt

real    0m0.421s
user    0m1.047s
sys     0m0.019s

This is not too shabby. Note that processor time (user + system) is 1.066s, which exceeds wall-clock time. Welcome to the brave new world of multi-processor systems, which this simple program was able to take advantage of because the processes in the main pipeline could be executed on different processors.

Next, we'll consider two straightforward C implementations, following the basic design from Homework 13. First, an unsorted add-at-the-end linked-list version:

$ time linked-list/hist kjv.txt > kjv.ll.txt

real    0m3.190s
user    0m3.184s
sys     0m0.005s

then an unsorted array-based version:

$ time array/hist kjv.txt > kjv.ar.txt

real    0m3.123s
user    0m3.115s
sys     0m0.008s

It's hard to feel good about this. Writing the C code is anything but trivial, and yet the naïve shell script is outperforming us by 7.5x in wall-clock (real) time, and by 3.0x in processor time.

So, why are our versions so slow compared to the shell version? The key difference is that awk is doing something a lot more clever in its associative-array implementation than a simple linear list. Presumably, we can be cleverer too. Organizing our dictionary as a binary search tree has a lot of promise. We can do a little back-of-the-envelope calculation. By running wc on our output files, we see that there are 12,569 distinct words in the KJV. We'd expect that a typical lookup is going to work through a couple thousand words (our add-at-the-back strategy will tend to put frequently used words near the front). So a good guess is that we'll make about 24M calls to strcmp over the course of processing the input. If we use a binary search tree, we expect something closer to log2(12569), which is about 13.6, comparisons per lookup, or about 175K calls to strcmp. So we might hope for a 100x speedup, but this is a bit naïve, as other bottlenecks might appear.

$ time bst/hist kjv.txt > kjv.bst.txt

real    0m0.155s
user    0m0.151s
sys     0m0.003s

Not quite 100x, but not bad. And 3x faster than the shell-based version. But can we do better?!

Hashing

Hashing is an often-used technique for providing data structures that are both fast (often nearly constant time) and simple. For example, suppose that we have a domain D, and we want to represent subsets S of D. Further, suppose that in typical applications, we expect that S will have roughly n elements.

We could implement S via a linked list, but only at the cost of O(n) complexity for routine operations like element insertion, deletion, and membership tests. If, in addition, D had a natural order <, we could exploit this to represent S via some sort of tree structure. In this case, the basic set operations will require time proportional to the depth of the tree. If we are clever, we can maintain a balanced tree structure, so that the depth will be O(log n), but this often requires algorithms of considerable complexity. A somewhat more common approach is to build the tree without rebalancing, in the hope that the tree will turn out balanced enough anyway. This hope is usually realized, although it needs to be understood that this assumption breaks on structured (i.e., already sorted) input.

Hashing represents an alternative. The idea is that we will create an efficient function h: D -> { 0, 1, ..., m-1 }, where m ~ n. We will then use an array of m linked lists, and each item d in D will be stored (or not) in the list with index h(d).

If h is a good hash function, then we expect that most indexes will correspond to short (e.g., 0, 1 or 2 element) lists, and indeed that most elements d in S will map to short lists. (This would not be the case if half of the elements of S mapped to the same list...).
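As a concrete illustration, here is a sketch (not code from the lecture; the table size M and the hash function h are placeholders) of a set of strings represented this way:

#include <stdlib.h>
#include <string.h>

#define M 1024                          /* number of buckets (the "m" above) */

struct node {                           /* one element of S, chained in its bucket */
    char        *item;
    struct node *next;
};

static struct node *table[M];           /* array of m linked lists */

extern unsigned int h(const char *s);   /* some hash function; reduced mod M below */

int member(const char *d)               /* membership: walk only the list d hashes to */
{
    for (struct node *p = table[h(d) % M]; p != NULL; p = p->next)
        if (strcmp(p->item, d) == 0)
            return 1;
    return 0;
}

void insert(const char *d)              /* insertion: prepend if not already present */
{
    if (member(d))
        return;
    struct node *p = malloc(sizeof *p);
    p->item = strdup(d);
    p->next = table[h(d) % M];
    table[h(d) % M] = p;
}

If h spreads the items out evenly, both operations inspect only a handful of nodes, regardless of how big S gets.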

All of the advantages that we associate with hashing are dependent on having a good hash function. So where do good hash functions come from?

Let's get a bit more precise. Let's suppose that we're trying to hash strings. We can view a string as representing a large non-negative integer in binary. Let's define string2int: strings -> integers to be this integer-representation function.

A simple (and useful) hash function is

h(s) = string2int(s) % m

A complication is that this seems to require infinite precision arithmetic, which C doesn't support. But we don't actually need infinite precision arithmetic. Consider the following algorithm:

unsigned int hash(char *s)
{
    unsigned int result = 0;
    char ch;

    while ((ch = *s++)) {
        result = ((result << 8) + ch) % HASH_SIZE;   // (*)
    }
    return result;
}

This is typical terse C, but the code corresponds to a Horner's rule calculation of the (very large) integer represented by the bits of s modulo HASH_SIZE.
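As a quick sanity check (a sketch, not from the lecture; the test string and the HASH_SIZE value are arbitrary), for a string short enough that string2int fits in 64 bits we can compare the incremental computation against a direct one:

#include <assert.h>
#include <stdio.h>

#define HASH_SIZE 39989u

unsigned int hash(char *s)               /* same function as above, repeated for self-containment */
{
    unsigned int result = 0;
    char ch;

    while ((ch = *s++)) {
        result = ((result << 8) + ch) % HASH_SIZE;
    }
    return result;
}

int main(void)
{
    /* string2int("cat") = 'c'*256^2 + 'a'*256 + 't' */
    unsigned long long big = ((unsigned long long)'c' << 16)
                           + ((unsigned long long)'a' << 8)
                           + (unsigned long long)'t';

    assert(hash("cat") == big % HASH_SIZE);   /* Horner's rule gives the same residue */
    printf("hash(\"cat\") = %u\n", hash("cat"));
    return 0;
}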

The only potential problem in this code is the line marked (*). When will this result in an arithmetic overflow? If we're working with 32-bit unsigned integers, it will overflow when result * 256 + 255 exceeds 2^32 - 1, which limits us to values of HASH_SIZE of at most 2^24 = 16,777,216 -- in other words, a 24-bit hash function.

Usually 24-bits is plenty, and if that's all there was, we could wrap up the discussion here.

But sometimes 16 million hash buckets isn't enough, and sometimes we're not willing to pay the cost for the implicit division on line (*).

Let's suppose that we're trying to hash words in the dictionary. My version of /usr/share/dict/words has 234,936 words. If our interest is the cfreq program, we don't expect to run into anything like this number of words -- arguably something in the 20-40K range is more reasonable.

Let's suppose that we choose

#define HASH_SIZE 39989

where 39989 is a prime; there are theoretical reasons for preferring prime moduli.

If we hash all of /usr/share/dict/words, the largest bucket has 18 words. This compares to a “typical” bucket size of 234936/39989, which is roughly 6. Moreover, a binary search tree would take ceiling(log2(234936)) = 18 comparisons, so hashing's worst case is the same as pretty much every case with a balanced tree. The largest bucket is also in excellent theoretical agreement with the largest bucket we'd expect if the hash function were truly random. This isn't bad at all, and it argues for calling this problem solved, and moving on to the next.

But let's not...

It turns out that by adding a lot more mathematical sophistication, but building on the same set of ideas as above, we can build hash functions which use the full 32 bits (if we want them, or even 64 or 128), and which replace the relatively expensive division with a few cheap bit-operations. What's not to like?

Mathematicians have long studied finite fields. These are mathematical structures which have many of the additive and multiplicative properties of the rationals (or reals, or complexes, which are infinite fields), but which are (as their name suggests) finite.

The simplest examples are arithmetic modulo a prime -- which was the basis of our first hash function.

But this is not the only way. We can also consider extension fields -- these are new fields that conceptually extend an old field by adjoining a root of a polynomial that is irreducible over it.

This is exactly where the complex numbers come from: we start with the real numbers, and add a root (called i) to the equation x^2 + 1 = 0.

Mathematically, these are quotient structures. We start by taking the set of polynomials F[x] over our base field F, and then “modding out” by the irreducible polynomial p(x). Formally, we say that two polynomials r(x) and s(x) are equivalent if p(x) divides their difference. It is an easy exercise to see that this notion of equivalence is an equivalence relation (i.e., it is reflexive, symmetric, and transitive), and that it gives rise to sensible notions of addition and multiplication (which is to say that these operations are well defined on equivalence classes).

The resulting structure is trivially a ring (i.e., it supports operations of addition, subtraction, and multiplication, which follow the expected laws), and with a bit more work (Euclid's algorithm, as it applies to polynomials) can be seen to be a field (i.e., it also supports division by nonzero elements). Moreover, [x] (the equivalence class of x) is a root of p(x) = 0 in F[x]/(p(x)).
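A tiny worked example (not from the lecture): take F to be the integers mod 2 and p(x) = x^2 + x + 1, which is irreducible over F. Then in F[x]/(p(x)),

$$[x]\cdot[x+1] = [x^2+x] = [(x^2+x+1)+1] = [1],$$

so [x] and [x+1] are multiplicative inverses, and the four classes [0], [1], [x], [x+1] form a field with 2^2 = 4 elements.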

This is all very elegant mathematics, but what does it have to do with hashing?

Well, let's suppose that our base field is the integers mod 2. Then our field operations are

  +  0  1
  0  0  1
  1  1  0

and

  *  0  1
  0  0  0
  1  0  1

These just happen to have a simultaneous existence as the operation tables for exclusive or, and logical and. Moreover, polynomials can be compactly represented by 0,1 valued strings. I.e., we will view

$$x^5 + x^3 + x + 1 =_{\hbox{rep}} 101011,$$

which is perhaps more obvious if we write the polynomial down in full form.

$$1 x^5 + 0 x^4 + 1 x^3 + 0 x^2 + 1 x^1 + 1 x^0.$$

We can compute a remainder by simply doing long division. In this case, it is much more succinct to just write the coefficients. Let's suppose, for example, that we wanted to compute the remainder of dividing 1101101011 by 101011. This corresponds to the following division:

1101101011
101011
----------
 111011
 101011
 ------
  100000
  101011
  ------
    101111
    101011
    ------
       100

I.e., the remainder is the polynomial x^2. The cool thing about working mod 2 is that subtraction and addition are the same thing, so we can just add and throw away carries, which is to say, we can use bit-wise exclusive or, a single machine instruction on most architectures. Moreover, we never have to guess a trial divisor -- we just line up the lead bit.
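Here is a small sketch (not from the lecture) of the same division carried out with shifts and xors; it divides 1101101011 by 101011 and prints the remainder:

#include <stdio.h>

int main(void)
{
    unsigned int dividend = 0x36B;   /* 1101101011 in binary (10 bits) */
    unsigned int divisor  = 0x2B;    /* 101011 in binary (6 bits)      */

    /* Line the divisor's lead bit (bit 5) up under each set bit of the
       dividend from bit 9 down to bit 5, and xor it off.               */
    for (int shift = 4; shift >= 0; --shift)
        if (dividend & (1u << (shift + 5)))
            dividend ^= divisor << shift;

    printf("remainder = %u (binary 100, i.e., x^2)\n", dividend);
    return 0;
}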

Now, the remainders will always be of lesser degree than the divisor polynomial p(x), and every polynomial of lesser degree is already a remainder, so the size of the resulting field will be exactly 2^n when p has degree n (one coefficient for each of the n terms from x^(n-1) down to the constant).

Combining this peculiar remaindering context with the long-division (Horner's rule) intuition of the first hash algorithm, we arrive at the following algorithm, which offers no obvious advantages over the first.

#define HASH_BITS ...   // low order coefficients of p.

unsigned hash(char *s)
{
    unsigned int result = 0;
    unsigned char ch;

    while ((ch = *s++)) {
        // shift 8 bits, one bit at a time...
        for (int i = 0; i < 8; ++i) {
            result = ((result << 1) & 0xffffffff) ^ ((result >> 31) ? HASH_BITS : 0);
        }
        result ^= ch;
    }
    return result;
}

This is a bit tricky -- we multiply by 2 (i.e., shift left by 1 position), and then, if the high-order (32nd) bit of the word was set, we xor off the value that corresponds to the low-order terms of the polynomial. This is the right arithmetic, but it's hard to believe that this for-loop offers any advantage over division.

But we can do better. We can shift off the entire high-order byte, and precompute the associated remainders. Let's suppose that we have

unsigned int hash_table[256],

where hash_table[xxxxxxxx] contains the remainder of the polynomial represented by

xxxxxxxx 00000000 00000000 00000000 00000000.

That's 8 bit positions, followed by 32 zero bits. Then this algorithm becomes, modulo a few other minor optimizations:

unsigned int hash_table[256] = { ... };

unsigned int hash(char *s)
{
    unsigned int result = 0;
    unsigned char ch;

    while ((ch = *s++)) {
        result = ((result & 0x00ffffff) << 8) ^ hash_table[result >> 24] ^ ch;
    }
    return result;
}
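For completeness, here is one way the table itself might be precomputed (a sketch, not from the lecture): entry i holds the remainder of the 40-bit polynomial “i followed by 32 zero bits”, obtained by pushing the byte through the bit-at-a-time step of the previous version. The particular HASH_BITS value here is just an illustrative choice.

#define HASH_BITS 0x04c11db7u   /* illustrative choice of low-order coefficients of p */

unsigned int hash_table[256];

void init_hash_table(void)
{
    for (unsigned int i = 0; i < 256; ++i) {
        unsigned int r = i << 24;                        /* byte i in the high-order position */
        for (int bit = 0; bit < 8; ++bit)                /* shift in 8 zero bits, reducing as we go */
            r = (r << 1) ^ ((r >> 31) ? HASH_BITS : 0);
        hash_table[i] = r;
    }
}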

A minor theoretical issue is that it is a common convention to “shift in 4 null bytes” after processing s, so that the last character of s can potentially affect every bit of the hash (a property called avalanching). This turns out to be equivalent to xor-ing the character to be read into the high-order (rather than low-order) byte, and gives rise to

unsigned int hash_table[256] = { ... };

unsigned int hash(char *s)
{
    unsigned int result = 0;
    unsigned char ch;

    while ((ch = *s++)) {
        result = ((result & 0x00ffffff) << 8) ^ hash_table[(result >> 24) ^ ch];
    }
    return result;
}

In real life, we'd need to set the initial value of result to something (fixed) non-zero if we wanted to be algebraically correct, but doing so doesn't change the equivalence classes (buckets) of the hash function, so there's no advantage (from the point of view of hashing) to doing so.

So now, let's put a couple of twists on this.

There is a well-known checksum procedure, CRC-32, which is widely used, e.g., to detect transmission errors in Ethernet packets, and which is essentially a special case of this construction. Some well-meaning, but sadly misinformed, people have tried to use CRC-32 as a cryptographic hash function, i.e., a hash function for which it is difficult to produce a file that hashes to a given value, as might be used in a digital signature scheme.

Unfortunately, the simple linear properties of the CRC32 system mean that it is quite easy to produce strings that have a particular hash value. If you're mathematically inclined, figure out why this is so. If you're mathematically inclined, and verbally playful, I claim that you can without too much difficulty write a program that takes a 32-bit value as input, and produces a decent sonnet that hashes to that value. (The 2013 class actually did this!)

And this brings us to the final twist. The “bad” properties of the CRC32 seem to come from the tremendous amount of structure that we've built into the hash_table. What if we just used a random collection of 256 32-bit values? It would seem that this would have the good mixing properties of CRC32, but lack the linear structures that make inversion easy.
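A sketch (not from the lecture; the seeding and the use of random() are arbitrary choices) of how such a table might be filled:

#include <stdlib.h>

unsigned int hash_table[256];

void init_random_table(unsigned int seed)
{
    srandom(seed);   /* fixed seed, so the hash function is reproducible from run to run */
    for (int i = 0; i < 256; ++i)
        /* combine two calls so that all 32 bits are covered */
        hash_table[i] = ((unsigned int)random() << 16) ^ (unsigned int)random();
}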

A final consideration with hashing is how many buckets to use. This is generally a tradeoff between initialization cost and bucket size. We'll choose a pragmatic 2^16, so that on the KJV example the average non-empty bucket holds about one word.
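With a power-of-two bucket count, reducing the 32-bit hash to a bucket index is a bit mask rather than a division (a sketch; the name NBUCKETS is not from the lecture):

#define NBUCKETS (1u << 16)              /* 2^16 buckets */

unsigned int bucket_of(char *word)
{
    return hash(word) & (NBUCKETS - 1);  /* low-order 16 bits of the hash */
}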

Running the code gives:

$ time hash/hist kjv.txt > kjv.hash.txt

real    0m0.085s
user    0m0.080s
sys     0m0.003s

So we've gained roughly a 5x increase in wall-clock speed over the original shell pipeline (and nearly 2x over the BST version), and with code that is conceptually clear.

Exercise 15.1 Instrument the hashed version of this program, and produce a histogram of bucket sizes. What is the largest bucket? What is the average length of a non-empty bucket list? If you know the relevant mathematics (Poisson distributions, binomial probabilities), compare the observed histogram to what you'd expect if the underlying hash function were truly random.