Lecture 6

Awk

Awk is the prototypical scripting language. Scripting languages generally share a few characteristics: they're interpreted, which can make for very rapid debugging cycles; variables are declared implicitly; the language handles memory management; values are freely coerced from one base type to another; and associative arrays (a.k.a. dictionaries) are a central, high-utility abstraction.
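For a quick taste of what that means in practice, here's a throwaway one-liner (not one of the lecture's examples, just an illustration): the variable n is never declared, the fields read from input are coerced from strings to numbers as needed, and all storage is managed by the interpreter.

$ printf '3 4\n10 20\n' | awk '{ n += $1 * $2 } END { print n }'
212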

Truth be told, I struggle every year with the question of whether or not to bother with awk. There's no denying that, as a language, it offends my sense of orderliness, elegance, and generality. But there's also no denying that for doing simple report generation, awk is a killer language. It is just too useful to skip.

The utility of awk comes from the ubiquity of a particular kind of data set, consisting of a plain text file in which individual lines represent records, and in which the individual words represent fields. This particular representation of data often hits a sweet spot in computer/human interactions—computers can easily read and write this kind of data, and yet it remains human readable. Moreover, the field separator FS can be set either by the program or on the command line, so awk is equally capable of dealing with simple CSV data, and is often the tool of choice for quick-and-dirty processing of simple CSV files.
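For example, either of the following prints the second column of a simple CSV file (data.csv is just a stand-in for whatever comma-separated file you have on hand, assuming no quoted fields with embedded commas):

$ awk -F, '{ print $2 }' data.csv
$ awk 'BEGIN { FS = "," } { print $2 }' data.csv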

Awk was initially conceived of as a report-generating language, which could process data in this representation and generate summary reports. BTW, awk is named after its creators, Al Aho, Peter Weinberger, and Brian Kernighan, all formerly of AT&T Bell Labs Research, the cradle of Unix. Rumor has it that the initial proof-of-concept implementation was done over a weekend, starting from the sed code-base, motivated by the need to meet yet another internal reporting deadline.

As usual, there are good online tutorials/references, and a very good O'Reilly book.

Things to know — # is a comment character. This means that awk programs can be wrapped up as interpreter files, although it's more common to specify the program as a command-line argument.

#!/usr/bin/awk -f
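The more common command-line form looks something like this (somefile is just a placeholder):

$ awk '{ print $1 }' somefile

which prints the first field of every line of somefile; with no file arguments, awk reads standard input.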

The evaluation model of awk is unusual, but useful. An awk program is a set of guard/action pairs. The program processes a file on a line-by-line basis. For each line, the guards are evaluated in turn. For each guard that is satisfied, its corresponding action is carried out. Make sure you got that! This is not a case-like construct, nor a Haskellish set of alternatives. Every statement block whose guard is satisfied gets evaluated. E.g., consider

#!/usr/bin/awk -f
# foobar

/foo/ { ++foo }
/bar/ { ++bar }

END {
    print "Foo lines: " foo
    print "Bar lines: " bar
}

Note that the first two guards consist of pattern matching statements, and trigger when a line of input contains the pattern. Note also that variables are not declared, but come into being with use, and sensibly initialized. There are other kinds of guards, and guards can be combined via the usual boolean operators.
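A few illustrations, separate from the foobar program: a guard can be a test on a built-in variable such as NF or NR, and guards can be combined with &&, ||, and !.

NF == 0             { ++blank }    # a test on the number of fields: count blank lines
NR <= 10 && /error/ { print }      # boolean combination: "error" within the first ten lines
!/foo/              { ++nonfoo }   # negated pattern match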

Here's a sample run of the foobar program:

$ ./foobar
foo bar foo
^D
Foo lines: 1
Bar lines: 1
$

Note that the special guard END matches after standard input is exhausted. There is a corresponding BEGIN guard, which matches before standard input is read.

#!/usr/bin/awk -f
# echoargs -- print our command line arguments, one per line.

BEGIN {
    for (i = 1; i < ARGC; ++i) {
        print ARGV[i]
    }
}

Here is a more complicated example that I use for calculating MPG and related statistics for my car.

Data Set:

# Prius Mileage Record
# Odometer reading, Gallons Pumped, Cost
14
327 6.058 16.77
716 9.741 28.24
1072 7.341 20.33
1375 7.519 20.37
1717 4.710 10.83
1905 8.599 22.26
2245 7.888 19.55
2542 7.646 19.72
2819 8.510 21.95

The format of the data file follows some of the usual principles -- we use # as a comment character. Nonempty lines constitute records.

In this case, my assumption is that the mileage record will be imperfect, i.e., that some entries will be missed. Therefore, the record consists of a collection of segments. Each segment begins with a line that contains only a mileage (the start mileage for that segment), and zero or more continuation lines (giving a mileage at fill up, gallons pumped, cost, and in some cases, the date).

Here is a simple awk program that produces a data report:

#!/usr/bin/awk -f

function commit_segment() {
    if (seg_records > 0) {
        miles += (seg_end - seg_start)
    }
    seg_records = 0
}

/#/ { $0 = substr($0,1,index($0,"#")-1) }    # trim comments

NF == 1 {
    commit_segment()
    seg_start = $1
}

NF == 3 || NF == 4 {
    ++seg_records
    seg_end = $1
    gallons += $2
    dollars += $3
}

END {
    commit_segment()
    printf("miles: %5.0f\n",miles);
    printf("gallons: %9.3f\n",gallons);
    printf("dollars: %8.2f\n", dollars);
    printf("miles/gallon: %8.2f\n",miles/gallons);
    printf("miles/dollar: %9.3f\n",miles/dollars);
    printf("dollars/mile: %9.3f\n",dollars/miles);
}

and a sample run (from long ago, as you can see by computing the average cost of gas per gallon...)

$ ./milage < prius.txt
miles: 15452
gallons:   352.550
dollars:  1013.06
miles/gallon:    43.83
miles/dollar:    15.253
dollars/mile:     0.066

Note the use of printf instead of print, in order to get more control over output formatting.
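The difference matters for reports: print falls back on awk's default numeric output format (the OFMT variable, "%.6g" unless you change it), while printf lets you pin down widths and precision so the columns line up. In the END block above, compare:

print "miles/gallon:", miles/gallons             # default numeric formatting (%.6g)
printf("miles/gallon: %8.2f\n", miles/gallons)   # fixed width, two decimal places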

The structure of the awk program is pretty straightforward. The most interesting aspect is the use of a function definition, which enables me to avoid a bit of code duplication. Ordinarily, I would handle this with another guard/action pair, but END (like BEGIN) can't be combined with the other guards.

Note that the syntax of the action part of awk is basically “sloppy C.”

One of the nicer aspects of awk is that it includes associative arrays, i.e., arrays that are indexed by strings. This turns out to be extraordinarily useful, and all of the wannabe awk-killers have included associative arrays as language primitives.
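Here is a minimal, self-contained illustration (separate from the real example below): indices are arbitrary strings, membership can be tested with in without creating an entry, and entries can be removed with delete.

#!/usr/bin/awk -f
BEGIN {
    price["apple"] = 0.50; price["kiwi"] = 0.75     # indexed by strings
    if ("apple" in price) print "apples in stock"   # membership test
    delete price["kiwi"]                            # remove an entry
    for (fruit in price)                            # iterate, in no particular order
        print fruit, price[fruit]
}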

The next example makes use of associative arrays. This program is intended to create a word-frequency table from standard input. Here, we use awk within the context of a shell script, and use other filters to simplify the awk program:

#!/bin/bash
# freq -- compute word frequencies from standard input

PATH=/bin:/usr/bin

tr '[:upper:]' '[:lower:]' |
tr -sc '[:lower:]' '\n' |
sed '/^[ \t]*$/d' |
awk '
    { ++freq[$1] }
    END {
        for (word in freq) {
            printf("%-16s %d\n",word,freq[word])
        }
    }
' |
sort +1nr +0

Note: GNU's version of sort expects different arguments, and some OSs (including MacOS X) have replaced the traditional AT&T sort program with GNU's sort, in which case you should use

sort --key=2nr --key=1 -

There are a few things to notice here. First, the use of POSIX character classes makes this program portable to non-English locales, although strong claims in this regard would require testing.

The next is the use of associative arrays within awk. The initial tr-tr-sed part of the pipeline guarantees that every line of input will have the form [:lower:]+, so the awk part of the program is justified in assuming precisely one word per line.

The for loop in the END part of the awk program iterates over the valid indices of the associative array freq. Finally, the sort command specifies a sort order in which items are sorted first by frequency and, within equal frequencies, alphabetically.

Note that awk and sort have inconsistent views as to how fields are numbered. For awk, fields are numbered $1, $2, ... , NF, and $0 represents the whole line. For AT&T sort, fields are numbered 0, 1, 2, ... .
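To make that concrete, here is the same column of the freq output (the count) as each tool names it:

awk '{ print $2 }'    # awk: the count is field $2
sort +1nr             # AT&T sort: the same field is numbered 1
sort -k 2nr           # GNU sort: numbered 2 again (1-origin, like awk)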

Exercise 6.1 Write a program (specifically, a shell script with an embedded awk program) to detect duplicate files in a file structure. These may be files that are really identical (i.e., they refer to the same inode, either because of hard or symbolic links), or merely have the same contents.

If we do a file-by-file comparison, this requires ${n \choose 2}$ comparisons, which is a quadratic amount of work. This is bad if you have a lot of files (I have >800k files in my home directory on my research computer, which is admittedly rather a lot). A better approach is to use a signature scheme. For each file, produce a signature, which is to say a relatively short string that depends on the contents of the file. Only files with the same signature can be equal to one another. This has the effect of dividing our files into lots of buckets—we still have a potentially quadratic cost per bucket, but the buckets will tend to be very small.

Moreover, if the signature scheme is good enough, we can punt and assume that if the signatures are the same, so are the files.

Some useful tools that are lying around...

1. find — useful for recursive traversal of a directory structure.

$ find <dir> [<args>...]

The most important args break down into two categories — filters, which restrict the files (and/or directories) that find finds; and action commands. The default action is to print relative paths for the files found, one per line. This is very convenient for shell scripts.
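For example (index.html here is just a placeholder), -type, -name, and -newer are filters, while -print is the explicit spelling of the default action:

$ find . -type f -name '*.html' -newer index.html -print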

2. xargs — we often want to take the standard output of a pipeline and use it as command-line arguments for another command. Historically, there have been three solutions to this. For example, let's say we wanted to find the total number of bytes associated with .html files in our current directory and below.

The Old Way:

% ls -l `find -X . -type f -name \*.html`

A problem with the old way is that it is hard to nest. So...

The New Way:

% ls -l $(find -X . -type f -name \*.html)

This is ok. But then we have the xargs way...

% find -X . -type f -name \*.html | xargs ls -l

N.B., the -X flag to find says to ignore files with names that are problematic for shell substitution (and xargs), e.g., files with spaces, newlines, or other embarrassing characters in their names.

3. shasum — a program for creating cryptographically secure fingerprints. These are our signatures. Unless you're working within the file structure of a professional cryptographer, it is safe to assume that if two files have the same shasum -a 256 signature, then they are identical, and you are permitted to make this assumption.
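For what it's worth, shasum's output is already in a convenient line-and-field shape: one line per file, with the hex digest first and the file name second (the file names below are just placeholders).

$ shasum -a 256 notes.txt todo.txt
<64 hex digits>  notes.txt
<64 hex digits>  todo.txt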

And some additional considerations...

It's very tempting to use the shell to run shasum once per file, e.g.,

for file in `find -X . -type f`
do
    shasum -a 256 $file
done | ...

This is probably a bad idea. Programs like shasum often have a relatively long startup latency (certainly, this was true of the md5 program we used before 2015), and you pay it once per file if you do it this way. A better idea is to give shasum all the files at once.
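In other words, combine the tools above and let xargs do the batching, along the lines of

find -X . -type f | xargs shasum -a 256 | ...

where the trailing ... stands for the rest of your script, which gets to process the signature/filename lines.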

Finally, here is a link to a zipped archive that should unpack into 20 directories, each containing 20 files. Illustrate the use of your program by finding the duplicates within the top-level example directory.