Lecture 2

Administrivia

Wiki

We have a few suggestions...

This is your garden. Tend it. Infrastructure work often pays off. If you do any infrastructure work, note it as a contribution on your personal profile page.

If you want credit for the problems you propose, put your name on them. Indeed, please sign your wiki submissions—both in context, and in the author field. This has a number of virtues:

  1. Please remember that we're trying to form a community. It helps a lot to know who you're talking to.
  2. Attributed comments help me in assessing participation.

A good model for managing student comments within a wiki page is to separate each comment by a horizontal rule. This can be done by introducing a line consisting of four hyphens into the markup.

----

You only need four, and it makes for cleaner markup.

Likewise, the SourceBlock syntax markup module has been added to the wiki. The basic markup takes the form

(:source lang=XXX :) [@ ... code ... @]

There's a link on the wiki sidebar to SourceBlock, and you can always use the edit feature to look at markup. Remember that with a wiki, essentially nothing is hidden. If you see something on the wiki formatted in a way you like but don't know how to do, hit the edit button and look at the markup!

Piazza

We've done a bulk-enrollment of the students in CMSC-16200 into the class Piazza site. If you want to enroll under a different email address, or drop your enrollment, it should be possible to do so via Piazza.

Short Courses

The Computer Science Instructional Laboratory (CSIL) is running a number of mini-courses, some of possible interest. Please consult the CSIL webpage for the most recent updates. Note that three of the four announced instructors are 162 alumni.

Files

There are two key concepts in Unix—a file and a process. Pretty much everything builds on these two ideas. Linux tries to simplify this even further, by making processes special cases of files.

A file is a finite sequence of bytes. A byte consists of 8 bits. A crucial issue is "how do we interpret those bits and bytes?" One way to interpret a byte is as a natural number in the range 0..255 (2^8-1). Another way is as a (signed) integer in the range -128..127. Why is there one more negative number than positive number?! [The high order bit is interpreted as -128. The arithmetic is easier this way than with a sign bit, because it's really just $\mathbb{Z}/256$, where the congruence classes are represented by an integer of minimal magnitude.]
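
To see the two interpretations side by side, here's a minimal C sketch (the signed results assume a typical two's-complement machine, where out-of-range conversions wrap mod 256):

(:source lang=c :)
[@
#include <stdio.h>

int main(void) {
    unsigned char u = 0xFF;              /* the byte 1111 1111              */
    signed char   s = (signed char) u;   /* the same bits, read as signed   */

    printf("as unsigned: %u\n", (unsigned) u);  /* prints 255               */
    printf("as signed:   %d\n", (int) s);       /* prints -1 on a two's-complement machine */

    /* 128 and -128 are congruent mod 256; only -128 fits in a signed char. */
    printf("(signed char) 128 == %d\n", (int) (signed char) 128);
    return 0;
}
@]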

Often, we'll want to represent numbers over a larger domain. E.g., contemporary computers support, in hardware, standard operations on signed and unsigned integer values of 1, 2, 4, and 8 bytes, as well as operations on floating point numbers (which typically have 4- or 8-byte representations, but standardized 10- and 16-byte formats also exist).
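
If you want to see the sizes on your own machine, a quick C check along these lines will do (the floating point sizes aren't fixed by the C standard, but 4 and 8 bytes are typical):

(:source lang=c :)
[@
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* The fixed-width types correspond to the 1-, 2-, 4-, and 8-byte hardware sizes. */
    printf("int8_t  : %zu byte(s)\n", sizeof(int8_t));
    printf("int16_t : %zu byte(s)\n", sizeof(int16_t));
    printf("int32_t : %zu byte(s)\n", sizeof(int32_t));
    printf("int64_t : %zu byte(s)\n", sizeof(int64_t));

    /* Floating point sizes are platform-dependent; 4 and 8 are typical. */
    printf("float   : %zu byte(s)\n", sizeof(float));
    printf("double  : %zu byte(s)\n", sizeof(double));
    return 0;
}
@]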

But computers manipulate a lot more than numbers, and arguably the “killer applications” that first drove computers into the mainstream were dominated by the manipulation of text, and hence by the problem of digitally representing human languages. This turns out to be both easier and harder than you'd expect. The cultures in which computers grew up all had languages with mature alphabetic writing systems, which meant that what might have been the hardest problem, representing natural-language text, reduced to the problem of representing sequences drawn from a limited set of characters.

The history of character encoding, i.e., choosing a particular scheme for representing characters via bits and bytes, is complicated and a bit sad, because the same errors were made over and over again: the first error was in not provisioning for enough characters, the second was in allowing inconsistent encodings to coexist. The history we're about to give is both US-centric and grossly simplified, but it will do for now.

Historically, the first scheme was Hollerith—this was essentially a six-bit scheme, but the physical representation was a punch card that consisted of an array of 12 rows and 80 columns. Each column represented a character (typically printed at the top of the card by the card-punch machine), with between one and three punches. In theory, this would give rise to $298 = \binom{12}{1} + \binom{12}{2} + \binom{12}{3}$ code points [299 if we count no punches], but the Hollerith encoding was more structured, and historically only 64 of the nominal 299 code points were used. This was enough to represent all 26 letters of the English alphabet (upper case only), the 10 digits, and a limited set of punctuation and relation symbols. EBCDIC was a historically significant, backward-compatible extension of Hollerith, used (and still used!) on IBM mainframes and their associated peripherals. Six-bit character encoding schemes may seem quaint today, but they were common until fairly recently. E.g., the DEC-20 [the computer system Professor Kurtz used when he first got here] had a 36-bit word size (an interesting architectural decision) and early versions of its OS used six 6-bit characters per word. This is also why you sometimes see all-caps used in old computing texts. It was all we had! (cf. Dilbert)

The preferred character encoding scheme for Unix is ASCII. This was a 7-bit system (giving rise to 128 code points), which was based on fixed-length teleprinter codes. There's a bit of an interesting story here—Morse code (you might remember the ··· --- ··· pattern of SOS) is a variable-length code. For example, the code for the commonly used 'e' character was a simple dot (·), whereas the code for the more rarely used 'q' was dash-dash-dot-dash (--·-)—and yes, Professor Kurtz actually learned Morse code in the late 60's because in those days, you had to learn either Morse code or semaphore to make First Class scout, and semaphore was deprecated. It's worth noting here that Morse code is not prefix-free, e.g., the code for 'p' is dot-dash-dash-dot, which begins with the code for 'e'. Unique decoding in the Morse code world was obtained not merely from the sequence of dots and dashes, but also from brief "half-beat" pauses that Morse operators put between characters, so in a strict sense, Morse code relied on trits rather than bits. The advantage of a variable-length code was that you could encode frequently used characters with a small number of bits, and therefore transmit messages faster. This was important when most telegraph operators were humans, and the limiting factor was wet-ware signal-processing capability.

But variable-length codes have problems. If you change a single bit, that can result in a “framing error,” where the boundaries between code words as intended by the encoder and as interpreted by the decoder get out of sync. Fixed-length codes don't have this problem. Note, though, that Morse code was robust against framing errors because a space was always a code delimiter, whereas fixed-length codes will get out of sync if a bit is added or dropped. Consider DNA...

Anyway, humans can type faster than they can signal by key, and so teletype and teleprinter machines were invented. These machines were constrained by human typing performance, which was orders of magnitude slower than even 1950's-era electronics, so there was no need to maximize the underlying signal rate (at least, not until people figured out how to run multiple independent signals down a single wire...); the driving force in teletype/teleprinter design was simplicity and reliability, and this gave rise to 5-bit fixed-length (Baudot/ITA2) codes. The 7-bit ASCII code is an extension of these earlier codes within the AT&T family of codes, just as EBCDIC was an extension of Hollerith within the IBM family of codes. But ASCII had the advantage that AT&T allowed its standardization, which meant that vendors other than AT&T could build interoperable peripherals based on ASCII without an AT&T license. [This notion of standards-based vs. proprietary formats for data representation is still a big deal, cf., recent versions of MS Word, which use a (nominally open) application of XML for data serialization in order to meet governmental open-standards mandates.]

ASCII added enough code points to enable representation of both upper and lower case. One nice characteristic of ASCII is that each of the ranges of upper case letters, lower case letters, and digits maps to a contiguous set of code points, e.g., the ASCII code point for the character '0' is 48. If we know this, we can infer that the ASCII code point for the character '2' is 50 = 48 + 2. Likewise, given that the code point for 'A' is 65 + 0 = 65, we can infer that the code point for 'Z' is 65 + 25 = 90. This was not true of EBCDIC.
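
Here's a small C illustration of why this contiguity is convenient. Note that these tricks assume an ASCII-style encoding; they wouldn't work under EBCDIC, where the letters aren't contiguous:

(:source lang=c :)
[@
#include <stdio.h>

int main(void) {
    /* Digits are contiguous: '0'..'9' occupy code points 48..57.          */
    char d = '7';
    int  value = d - '0';           /* 55 - 48 == 7                        */

    /* Letters are contiguous too, so case conversion is a fixed offset.   */
    char c = 'q';
    char upper = c - 'a' + 'A';     /* 'q' (113) -> 'Q' (81)               */

    printf("'%c' as a number: %d\n", d, value);
    printf("'%c' in upper case: %c\n", c, upper);
    return 0;
}
@]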

ASCII also has so-called “control codes,” which are effectively a third alphabetical case, running from ^-A (at code point 1) through ^-Z (at code point 26). A few of these control codes remain important—^-J (10) is the newline character, ^-M (13) is the carriage return character, and ^-I (9) is the tab character.
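
In C, the familiar escape sequences are just names for these code points; a quick sanity check (assuming an ASCII execution character set):

(:source lang=c :)
[@
#include <stdio.h>

int main(void) {
    /* The usual C escape sequences name ASCII control-code points. */
    printf("newline '\\n' = %d\n", '\n');   /* 10, i.e., ^-J */
    printf("return  '\\r' = %d\n", '\r');   /* 13, i.e., ^-M */
    printf("tab     '\\t' = %d\n", '\t');   /*  9, i.e., ^-I */
    return 0;
}
@]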

A text file (in traditional Unix) was a finite sequence of bytes in which the high-order bit was always zero. The remaining 7 bits of each byte were used to represent a character in the ASCII encoding. Line breaks were represented by a single LF character (10). There was no special character that indicated end of file—the end of a file was determined by its length, which is stored in the inode, a data structure we'll talk about in the next lecture.
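
Here's a small C sketch in the K&R style that leans on this: EOF is not a byte stored in the file, but an out-of-band value that getchar() returns once the file's recorded length has been exhausted.

(:source lang=c :)
[@
#include <stdio.h>

/* Count lines and bytes on standard input. */
int main(void) {
    long bytes = 0, lines = 0;
    int  c;                          /* int, not char, so EOF (typically -1) is distinguishable */

    while ((c = getchar()) != EOF) { /* EOF is a condition, not a character in the file */
        bytes++;
        if (c == '\n')               /* LF, code point 10, ends a Unix line */
            lines++;
    }
    printf("%ld lines, %ld bytes\n", lines, bytes);
    return 0;
}
@]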

But that unused, high-order “meta” bit was an attractive nuisance.

The two great consumer operating systems of the 1980's (Apple's Classic Macintosh System and Microsoft's Windows) followed an unfortunate, if somewhat predictable path. Although both relied on Unix in various ways, both made decisions that were intentionally incompatible with Unix. (Note Windows' use of '\' as a path delimiter, and note that internally, MacOS Classic used ':', although this was not something that end users often knew. Another difference was that line breaks in Classic MacOS were represented by a CR rather than an LF, whereas DOS/Windows used and still uses a CRLF pair.) Apple fell from grace first, allocating unused code points to represent characters from European languages that were not part of the US-centric ASCII encoding. Microsoft followed for essentially the same reason, but its encoding was gratuitously different from Apple's. As these OSs propagated to other regions, new character sets were created, recapitulating the Tower of Babel. Lest you think this is an exaggeration, MacOS X 10.10 knows about, via iconv -l, 427 distinct names for 144 distinct encodings. As the regretful programming koan goes, “The nice thing about standards is that there are so many of them...”

Unicode

These days, these character sets are all being superseded by Unicode. Unicode was originally designed as a 16-bit (2-byte) system, but this has proven inadequate (under-provisioning strikes again!), and it is now structured as 17 planes of 2^16 code points each, i.e., a 21-bit scheme (17 x 65,536 = 1,114,112 code points) with a bit of growth room. The character sets of western languages are in code plane 0, a.k.a. “the basic multilingual plane,” or BMP.

So how does this work in the real world? Not simply...

There are three basic coding schemes, with variations. The simplest is UTF-32. In UTF-32, we simply use the low-order 21 bits to represent a Unicode code point, and leave the high-order 11 bits set to 0. This is wasteful of memory, and so is understandably popular with storage manufacturers, but it is otherwise rarely used.

Unicode is often represented in memory in UTF-16 format, which is natural given its original design as a 16-bit system. In UTF-16, code plane 0 (BMP) characters stand for themselves, i.e., each BMP character gets encoded as a single 16-bit word. There is a "hole" in the BMP, which can be exploited so that characters in other planes are encoded as two 16-bit words, essentially by encoding 10 bits per word and using the high-order 6 bits to place these words in the BMP "hole." Unfortunately, the encoding process is not quite that straightforward, but this is the basic idea. There are variants of UTF-16 and UTF-32 which specify big-endian or little-endian byte order for serialization, and a few other quirks, so that iconv actually lists 9(!) distinct encodings for the Unicode “standard.” Just remember, “The nice thing about standards...”
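
For the curious, here's a C sketch of the actual rule (the "not quite that straightforward" part is that hex 10000 is subtracted before the remaining 20 bits are split into two 10-bit halves); the function name utf16_encode is just for illustration:

(:source lang=c :)
[@
#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 into out[]; returns the number
 * of 16-bit units written (1 for the BMP, 2 for the other planes). */
int utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {             /* BMP characters stand for themselves */
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;                  /* 20 significant bits remain          */
    out[0] = 0xD800 + (cp >> 10);   /* high surrogate: top 10 bits         */
    out[1] = 0xDC00 + (cp & 0x3FF); /* low surrogate: bottom 10 bits       */
    return 2;
}

int main(void) {
    uint16_t u[2];
    int n = utf16_encode(0x1F600, u);     /* a plane-1 character (an emoji) */
    printf("%d unit(s):", n);
    for (int i = 0; i < n; i++)
        printf(" %04X", (unsigned) u[i]);
    printf("\n");                         /* prints: 2 unit(s): D83D DE00   */
    return 0;
}
@]

The high surrogates (D800..DBFF) and low surrogates (DC00..DFFF) are exactly the BMP "hole" mentioned above.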

Finally, there is UTF-8, which encodes Unicode characters as a sequence of 1, 2, 3, or 4 bytes. This is often used for serialization, and makes an interesting case study, especially within the context of byte-based file systems.

The Unicode characters with 1-byte representations are those with code points in the range 0..127, i.e., the characters with 7-bit representations, which happen to be exactly the historical ASCII characters. This has the effect of blessing all of those ASCII-encoded text files as Unicode files, with their meaning unchanged. Sweet.

To deal with characters that don't have 7-bit representations, a simple multi-byte scheme is used. The first byte will have high-order bits of 0, 110, 1110, or 11110, depending on whether the character will be represented by 1, 2, 3, or 4 bytes respectively. Continuation bytes of multibyte characters always have high-order bits 10, with the significant bits filled in. (Note that this scheme means that a single-bit error affects only a single character; in particular, if it causes a framing error, the decoder can detect it and resynchronize immediately.) For example, consider the character ‘⅓’—yes, there is a Unicode character that represents the fraction 1/3—at the (hex) code point 2153 == 8531 decimal. In binary, we have

code_point(⅓) = 10 0001 0101 0011,

i.e., 14 bits. Can this be encoded in two bytes? If so, it would have the form

110x xxxx 10xx xxxx,

But this has only 5 + 6 = 11 bits, which isn't enough. How about 3 bytes?

1110 xxxx 10xx xxxx 10xx xxxx

This gives us 4 + 6 + 6 = 16 bits, which *is* enough. We now simply extend our string to 16 bits by adding zeros on the left, and then we fill in the x's with these bits:

1110 0010 1000 0101 1001 0011

which is E2 85 93 in hex, and you might be able to find it listed that way on character tables.
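
The whole UTF-8 scheme fits comfortably in a few lines of C. Here's a sketch (the function name utf8_encode is just for illustration, and it does no validation, e.g., it won't reject surrogate code points or values above hex 10FFFF):

(:source lang=c :)
[@
#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[]; returns the number
 * of bytes written (1..4). */
int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {                          /* 7 bits: plain ASCII              */
        out[0] = (unsigned char) cp;
        return 1;
    } else if (cp < 0x800) {                  /* up to 11 bits: 110x..., 10...    */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {                /* up to 16 bits: 1110..., 10, 10   */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else {                                  /* up to 21 bits: 11110, 10, 10, 10 */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
}

int main(void) {
    unsigned char b[4];
    int n = utf8_encode(0x2153, b);           /* the code point for '⅓'           */
    for (int i = 0; i < n; i++)
        printf("%02X ", (unsigned) b[i]);
    printf("\n");                             /* prints: E2 85 93                 */
    return 0;
}
@]

Feeding it the code point 2153 reproduces the E2 85 93 we just computed by hand.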

K&R doesn't have to deal with any of this—theirs was an ASCII world. You will... A consequence is that the Unix documentation, which is typically of a very high quality, uses “character” and “byte” as if they were synonyms. They are not, and this is a distinction that we're going to have to learn how to make. There is, by the way, a very useful application and library (iconv) that does character encoding translation.

Pipes and grep

Pipes and grep are important. Note that these days, grep isn't a single program, but rather a collection of programs (grep, egrep, and fgrep). Each has advantages, but egrep is usually the best choice, and when it's not the best, it's a close second.

Assignment: read the man page for the grep family.

Assignment: K&P: study Chapter 2, read ahead in Chapter 3. Our lectures will focus on information that augments the content of the text—we assume you can learn by reading. We're also going to assume that you've found your way to a modern Bourne-shell-compatible shell (bash, ksh, etc.) by now. If not, you need to do this ASAP.

Exercise 2.1 Provide a Unix pipeline that will find all of the words in the system dictionary that contain

  1. All of the vowels [aeiou] in order (n.b., other vowels may interpolate, e.g. "adventitious" should be included).
  2. All of the vowels occur (but the order is indeterminate).

The dictionary (under both MacOS X and the UCCS Linux systems) is located in /usr/share/dict/words. How large is the dictionary on your system?

For this exercise, indicate the OS and version used. For both parts, you should provide the actual command used, the total number of words discovered (for which wc might be useful), and the first 10 words discovered (for which head may be useful). We don't want pages and pages of output.

An issue to think about: how to handle vowels that are capitalized. You should strive for an elegant approach.

Some systems (specifically, Cygwin and Ubuntu) don't come with a words file. In this case, download a words file from another source (e.g., one of the CS department's Linux systems, or search the web) and install it in the appropriate location.