Lecture 5

Modern Bourne-Compatible Shells, Control Structures

Simple Conditionals

The simple conditional if ... then ... else ... is basic to any programming language, although there is a certain amount of gratuitous variation, e.g., does the language support an else if construct, and if so, what keywords are used (you'll see elif, elsif, and else if this quarter). This is one of the reasons why it helps to look at a bit of old code if you find yourself suddenly working in a language that you haven't used for a while. In the shell, the original construct is:

if <pipe>
then
    ...
elif <pipe>
then
    ...
else
    ...
fi

Note here the use of fi (if spelled backwards) as a statement delimiter. There may be zero or more elif blocks, and the else block is optional.

The <pipe> was usually a call to the test program (/bin/test). As the shell has evolved, two additional forms have appeared that you should be familiar with:

New:

if [ ... ]

Newer:

if [[ ... ]]

Note here that both the [..] and [[..]] forms require that their key tokens (the brackets themselves) be separated by a space from their content.
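For example (a minimal illustration; the exact error message varies by shell):

$ [ -f /etc/passwd ] && echo yes
yes
$ [-f /etc/passwd ] && echo yes
bash: [-f: command not found

Without the space, the shell sees a single word [-f and goes looking for a command by that name.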

Although these seem similar, they actually differ in important ways. In the original Bourne shell, test was not a built-in feature of the language, but instead was just a program. The if [ ... ] form was a first attempt at making the shell look more like other programming languages (notably C), and oddly enough, adding this construct did not require making any changes to the shell itself! So how was it done? Back in the days of yore, when programmers used teletypes and wore crewcuts and white shirts with pocket protectors, they just linked /bin/[ to /bin/test, and tweaked test to expect a terminal ] on the command line if it was called as [. (Note that [ is still in /bin!) Thus, the [..] form provided a bit of syntactic sugar, but otherwise has the same quirks and idioms as the test form.

[Photos: Gordon Kindlmann as a 1980's-era computer scientist, and Prof. Gordon Kindlmann as a 1980's computer chess wizard.]

The [[..]] form, however, is built into modern shells, which allows the shell to define semantics as well as syntax. (Note here that test and [ are also built in these days, but this is for efficiency reasons, and a conscious decision was made to preserve their earlier semantics.) As a general rule, the [[..]] form is more robust and easier to work with, and so only this form should be used in new code.
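To see the robustness difference, consider a variable whose value contains a space (a minimal sketch; f is a made-up variable):

f="two words"
if [ $f = "two words" ]; then echo old; fi     # fails: $f is split into two words,
                                               # and [ complains "too many arguments"
if [[ $f == "two words" ]]; then echo new; fi  # works: no word splitting inside [[ .. ]]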

Unfortunately, it's unlikely that you'll be able to limit yourself to dealing with only new code. Inevitably, you're going to run into the old forms in legacy code, and there's one particular idiom of the old forms that's both ubiquitous and opaque. Suppose we want to test whether a particular shell variable is undefined, and to give it a default value if it is. In modern shells, we might have code like this:

export VISUAL=${VISUAL:-$EDITOR}
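A quick demonstration of the ${...:-...} default mechanism at the command line:

$ unset VISUAL
$ EDITOR=vi
$ echo ${VISUAL:-$EDITOR}
vi
$ VISUAL=emacs
$ echo ${VISUAL:-$EDITOR}
emacs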

Unfortunately, the old shell didn't support default values like this. So, we might have been tempted to write something like this:

if [ $VISUAL = "" ]
then
    VISUAL=$EDITOR
    export VISUAL
fi

but this doesn't work! In the case that we're most interested in, where VISUAL is undefined, $VISUAL expands to the empty string, resulting in a call to

$ [ = "" ]

which, so far as the [ program is concerned, is a syntax error. How can we work around the problem that one of our command line arguments to [ just disappears? The obvious thing to do is to quote the variable, i.e.,

if [ "$VISUAL" = "" ] ...

This works, but it's not what's usually done. Instead, expect to see

if [ x$VISUAL = x ] ...

This works because the shell understands x$VISUAL as the concatenation of "x" and $VISUAL, which results in just "x" if VISUAL is undefined. Or, to put it rigorously if a bit opaquely, the set of strings forms a cancellative semigroup under concatenation.

With the [[ .. ]] form, if you wanted to skip the default mechanism, you could have written

if [[ $VISUAL == "" ]]

and things would just work. Note the use of == with the [[ .. ]] form, vs. = with the test and [ .. ] forms. In fact, the = form works too (at least in bash), but this has the effect of giving = two distinct meanings depending on context (assignment vs. string equality), so I prefer the == form.

The test forms have lots of predicates. Among the most useful are the file existence predicates, which look strange but quickly become very readable:

if [[ -f $file ]]

... is $file the path of an existing regular file?

if [[ -d $dir ]]

... is $dir the path of an existing directory? Etc.
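Here's a hedged sketch of how these predicates read in context (the path and behavior are made up for illustration):

config=$HOME/.myscriptrc          # hypothetical configuration file
if [[ -f $config ]]
then
    . $config                     # it exists as a regular file: source it
elif [[ -d $config ]]
then
    echo "$config is a directory, not a file" >&2
    exit 1
fi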

Case

case word in
pattern|pattern)
    cmd
    cmd
    ;;
pattern)
    cmd
    cmd
    ;;
*)
    cmd
    cmd
    ;;
esac

Select and run a set of commands based on pattern matching. The terminating form esac (“case” backwards, like “fi”) reflects a relatively short period in programming language aesthetics, making it fairly easy for future language archaeologists to accurately date the introduction of these features into the shell. The patterns are "glob" patterns, a restricted form of regular expressions that includes character sets ([..]), wildcard characters (?) and wildcard strings (*), and must match the entire word.
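For example, a sketch that dispatches on a file name's suffix (the variable and messages are made up):

case $file in
*.c|*.h)
    echo "C source"
    ;;
*.sh)
    echo "shell script"
    ;;
*)
    echo "something else"
    ;;
esac

Note that *.c must match the entire file name, not just a part of it, which is exactly what we want here.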

For

There are at least three forms. By far the most common is the textual form:

for var in <words>
do
    ...
done

Note here that the words can be generated in lots of different ways, e.g., as a literal list, or via pattern matching of file names.
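For instance (minimal sketches; the *.c glob expands to whatever C files exist in the current directory):

for color in red green blue       # a literal list
do
    echo $color
done

for src in *.c                    # pattern matching of files
do
    echo "found C file: $src"
done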

There's also a concise form for iterating over command line arguments

for var
do
    ...
done
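This iterates over the script's positional parameters, i.e., it behaves like for var in "$@". A minimal sketch:

# args.sh -- echo each command line argument on its own line
for arg
do
    echo "$arg"
done

$ ./args.sh one two three
one
two
three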

and an arithmetic form that is close to C's (but much less common—the shells are at best awkward contexts for arithmetic programming)

for (( [expr1]; [expr2]; [expr3] ))
do
    ...
done
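A minimal sketch of the arithmetic form (bash syntax):

for (( i = 0; i < 5; i += 1 ))
do
    echo $i
done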

Other iterative control structures

while <pipe>
do
    ...
done

until <pipe>
do
    ...
done
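For example, these two loops produce identical output; the only difference is the sense of the test (a minimal sketch using the arithmetic predicates of [[ .. ]]):

i=1
while [[ $i -le 3 ]]
do
    echo $i
    i=$((i + 1))
done

i=1
until [[ $i -gt 3 ]]
do
    echo $i
    i=$((i + 1))
done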

There are others. And this raises a point. Don't feel that you have to learn, let alone use, every feature that a programming language offers before you can get to work with it. But do take the effort to read code that others have written, and be prepared to consult texts, references, online tutorials, and/or local experts if you see a usage that is unfamiliar.

Traps

Processes can be sent signals asynchronously. Processes have an array of signal handlers, one per signal. We can set signal handlers on a per-signal basis in the shell. This is often done either

  1. to provide an “exit hook,” or
  2. to increase robustness.

For example, let's suppose we need to create temporary files. We want to make sure that they go away after we're done executing.

tempfile=/tmp/myfile.$$

To make sure that tempfile gets deleted when we exit, we might have

trap 'rm -f ${tempfile}' EXIT

Note that after processing an interrupt, we start the next command. So if we want to clean up and exit, we need to be explicit:

trap 'rm -f ${tempfile}; exit 1' INT TERM

Note that while you might have one handler for multiple signals, each signal can have only a single handler. But this handler can be a function call, and so can be as complex as necessary.
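Putting the pieces together, here is a hedged sketch of a script that uses a function as its handler (the function name cleanup is made up):

#!/bin/bash

tempfile=/tmp/myfile.$$

cleanup() {                        # hypothetical handler function
    rm -f ${tempfile}
}

trap cleanup EXIT                  # exit hook: runs on normal exit
trap 'cleanup; exit 1' INT TERM    # clean up and exit on interrupt/terminate

date > ${tempfile}
# ... work with ${tempfile} ...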

The full list of signals can be found in signal(3).

Argument processing in shell scripts...

There are several conventions for specifying command line arguments.

Classic Unix

Command line arguments are single letters, and are introduced by a hyphen, e.g.,

$ ls -l -d .

Multiple flags can be combined into a single flag block:

$ ls -ld

Some flags take an argument. The argument that follows a flag can either be separated from it by whitespace

$ myscript -d extra-sauce

or not

$ myscript -dextra-sauce

GNU

The GNU project (GNU's Not Unix), cf. gnu.org, is an attempt to implement a free replacement for Unix. GNU has a particular political agenda (free software), which is admirable enough, and it has produced a great deal of extremely useful software for Unix-like systems. Many of the standard “userland” programs in common use on modern Unix-like systems are GNU project rewrites of classic programs (often with greatly increased robustness and functionality). The GNU convention is that command line arguments can be words, in which case they are introduced by a double hyphen (--), and cannot be combined.

$ ls --long --directory .

If these options take an argument, that argument either follows after a space, or is separated by an = sign without spaces, e.g.,

$ myscript --input foo.in --output=foo.out

X-Windows

The X programs use long arguments, with a single hyphen.
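For example, xterm accepts long, single-hyphen options:

$ xterm -geometry 80x24 -foreground white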

The classic Unix approach works well when there are a few commonly used flags, and the script/program is often invoked from the shell (i.e., it is typed in by the user). The GNU approach is definitely superior in scripts, because the longer names assist readability.

Command line arguments in practice

The Unix approach has the advantage of succinctness, but it becomes problematic when there are a large number of flags (there's an old UNIX joke -- what three lower case letters are not valid flags for ls? j, y, and z, of course). Of course, long flag names are also problematic if there are a lot of flags; we don't want CLI commands to look like one-line novels. So both GNU and, even more so, X-Windows provide alternative means for setting commonly used flags via environment variables and/or global configuration files.

Note that the Unix and GNU approaches are compatible, and it is common for GNU (and GNU-influenced) programs to support both long (for scripts) and short (for CLI) forms of frequently used flags.

The trickier business is that there is less standardization about flag semantics. A common approach, however, is that flags precede non-flag (typically, file name) arguments, and that all flags are processed before the non-flag arguments. The getopt(1) program is often useful in this case.
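A hedged sketch of the usual hand-rolled approach: a while/case loop that consumes flags until the first non-flag argument (the flag names --force and --input are made up; for short flags alone, the getopts builtin does this work for you):

force=no
input=""
while [[ $# -gt 0 ]]
do
    case $1 in
    -f|--force)
        force=yes
        ;;
    --input)
        shift                      # the flag's argument follows
        input=$1
        ;;
    --)                            # conventional end-of-flags marker
        shift
        break
        ;;
    -*)
        echo "unknown flag: $1" >&2
        exit 1
        ;;
    *)
        break                      # first non-flag argument: stop
        ;;
    esac
    shift
done
# whatever remains in "$@" is the list of non-flag (file name) arguments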

Final thought on scripting

Bulletproofing scripts is an arms race against an ever growing collection of idiots, in whose number we occasionally find ourselves.

Exercise 5.1 Write a shell script del, intended as “safe” version of rm.

del [<file> | --list | --empty]*

Deleting a file moves it to the $TRASH directory (which defaults to ~/.trash), creating $TRASH if necessary.

If the --list flag is encountered, the contents of $TRASH are listed on stdout. If the --empty flag is encountered, the files in $TRASH are rm'ed. E.g.,

del foo.c --list --empty

will

  1. move foo.c to ~/.trash,
  2. list the contents of ~/.trash, and
  3. empty ~/.trash.

Note that this can be made much more useful with a bit more effort, but these are suggestions for further play, rather than a part of the assignment.

Think about putting a pick filter in front of the rm'ing (cf. the text). It should be possible to bypass this, e.g.,

del --force --empty

If you're really ambitious, think about adding a --recycle option that will return a file to its original location. This requires having some way to specify metadata. [Note that MacOS X does this through the "put back" menu, but MacOS relies on the ability to specify metadata as a part of a file. This is not a generic Unix capability, however. Think about how we might do this.] Metadata also allows you to deal robustly with the case where you delete two files with the same name from different directories.