Lecture 7


The graders had a number of comments regarding the solutions to Exercise 5.1:

  1. Quite a number of people did not use the variable $TRASH correctly: they either just used ~/.trash, or they instantiated the variable and set it to ~/.trash.
  2. A large number of people did not iterate through the list of arguments, but instead referred to specific arguments. Therefore, while their code could handle the particular example given on the website, it could not handle a more generic call. In addition, there were people who could not iterate through the arguments correctly, such as incorrect usage of "while[[$# -gt 0]]", "shift" and "getopt".
  3. Finally, many people did not check that the things they were deleting/moving were files, and not directories, as stipulated by the problem.


Awk is a great tool, but it has its limitations, which become increasingly apparent as data complexity and program size increase. In the jargon of the discipline, “Awk does not scale.” Part of the problem here is that Awk’s usage rapidly exceeded its original design. Intended as a rough-and-ready report generating tool, Awk was soon pressed into a rapid-prototyping role.

There have been a number of languages that have tried, with varying degrees of success, to position themselves as awk-killers. These languages tend to share a few characteristics, some in common with Awk, some distinct from it.

In common with Awk, they tend to be dynamically typed (i.e., values have types, but variables don't), they provide powerful string management and associative arrays (and therefore implicit memory management), and they don't require the use of functions (i.e., evaluation of a file consists of evaluating any “bare” top-level code, which does not need to be wrapped in a main function). They are all interpreted, which in practice is a tradeoff that favors interactive development and rapid prototyping over execution speed.

As distinct from Awk, they generally are better structured as programming languages, with richer data-definitions, well-defined module systems that facilitate the use of extensive libraries, and extension schemes that make it possible to call code that is written in lower-level, higher performing languages. These are also languages that have been quick to adopt features introduced into other languages, once they prove their value, albeit with varying degrees of elegance. [Note that this is a slam against Perl, where these new features have been visibly bolted on, giving it a unique Franken-language feel.]

A consequence of better scalability is that these languages have supplanted Awk as rapid-prototyping languages. Invariably, though, they are less well tuned to the task Awk was originally designed for, the rough-and-ready processing of one-record-per-line, whitespace-delimited files, and so none has truly killed Awk.

It's understandable if functional programmers gloat over all of this. Nevertheless, no functional language has emerged as an even remotely credible awk-killer.

Probably the most successful of these languages in lines-of-code written is Perl. The language which seems most successful as a programming language (i.e., the one that has been best able to adapt features from other languages, without becoming distorted along the way), and the one that I think you will most benefit from learning, is Python.

This is an awkward time for Python, a language that continues to evolve in interesting ways. One of the fundamental questions for an evolving language is how to deal with the past. Does it continue to support obsolete features, or quirky semantics? If it chooses to do so indefinitely in the name of "backwards compatibility," it invariably becomes more and more complex and cumbersome, as well as unpredictable. If it chooses to "clean house" regularly, old programs become unusable quickly, and it's hard to build momentum.

Python deals with this issue pragmatically. Minor versions support the past, but major versions are essentially clean starts. And right now, we are in the midst of a major version transition. Python has a clear past: Python 2, best realized through Python 2.7.x, which adopts Python 3-isms when it can. And Python has a clear future: Python 3, whose latest stable version is 3.4.x. The more difficult question is Python's present.

My recommendation to you will be to use and learn Python 3, especially if you're new to the language. But expect some rockiness: few distributions have Python 3 as their default, and there will be some work to do if you want it.

Getting Started

First things first. You're bright kids. You've learned a few programming languages already. You're not going to learn your next by passively listening to a lecture. The right way to learn your next twenty programming languages will be to sit down with a good book (or set of YouTube links), and a computer with a good text editor and a compiler or interpreter for the language you're trying to learn, and to alternate working through the book/videos with programming play. Beazley's book is good for this.

Easy stuff is easy:

    >>> a = 10
    >>> b = 20
    >>> c = a+b
    >>> print(c)
    30

Note that one of the more in-your-face differences between Python 2 and Python 3 is that in Python 2, print is a keyword, whereas in Python 3 it's an ordinary function. Thus, the parentheses above are mandatory in Python 3 as a part of the ordinary function call syntax, but optional (and in practice, rarely used and even frowned upon) in Python 2.
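To make "ordinary function" a bit more concrete, here is a small Python 3 sketch. The names buf, captured, and show are mine, chosen only for illustration; the point is just that print takes keyword arguments and can be bound to another name like any other function.

```python
import io

# In Python 3, print is an ordinary function: it takes keyword
# arguments and can be passed around like any other value.
buf = io.StringIO()
print("a", "b", "c", sep="-", end="!\n", file=buf)
captured = buf.getvalue()
print(captured, end="")        # a-b-c!

show = print                   # bind the function to another name
show("still", "print")
```

None of this is possible with the Python 2 print statement, which is part of the reason for the change.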

# is for comments.

Python 3 uses Unicode strings, although it also supports byte-oriented data as an immiscible type; Python 2 used unicode and 8-bit strings, and they were miscible. The basic syntax is common though, using single or double quoting. Triple single- or double- quoting allows multi-line uses:

    >>> print """Quoth the raven, 'Nevermore.'"""
    Quoth the raven, 'Nevermore.'
    >>>

Unicode versions use a prefix u in Python 2, e.g.,

    >>> a = u"\u2663"
    >>> print a
    ♣

In Python 3, you can skip the u.
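The "immiscible" text and byte types of Python 3 mentioned earlier can be sketched as follows. This is a minimal illustration, assuming Python 3 and a UTF-8 encoding; the variable names are mine.

```python
text = "\u2663"                 # a str: one Unicode code point
data = text.encode("utf-8")     # a bytes object: three bytes in UTF-8

print(len(text), len(data))     # 1 3

# str and bytes do not mix implicitly in Python 3.
try:
    text + data
    mixed = True
except TypeError:
    mixed = False
print(mixed)                    # False

# Conversion is always explicit.
print(data.decode("utf-8") == text)   # True
```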

Unfortunately, Python 2's default source encoding (ASCII) does not permit the use of Unicode in source files, so simply writing

    print "♣"

in a Python 2 script doesn't work. (Python 3 changes the default source encoding to UTF-8.) Of course, the operative word in the foregoing is default, which hints that we can change it, and we can:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # spade

    print u"♣"

The Python 3.x version of this differs in three ways, two of which are expected (the dropping of the Unicode-specifying u prefix, and the added parentheses around the argument to the print function), but one requires a bit more comment:

    #!/usr/bin/env python3.4
    # -*- coding: utf-8 -*-
    # spade

    print("♣")

That use of /usr/bin/env is ugly, but it's standard. And so far as I can tell, it's standard because python3.x gets installed in idiosyncratic directories on a per-distribution basis, so hard-coding the path breaks portability. Moreover, this illustrates one of the disadvantages of dropping backward compatibility. If the new Python eventually becomes just python, all those old scripts break. But calling it anything else is ugly too. It does not seem to me that the community has yet figured out how it's going to handle this.

On "narrow" builds of Python 2 (and of Python 3 before 3.3), Unicode strings are held in memory as UTF-16, which causes some complications if you want to specify a character that's not in the Basic Multilingual Plane. On those builds, you'd use a surrogate pair:

    >>> a = u"\ud834\udf01"
    >>> print a
    𝌁

In Python 3.3ff, strings are sequences of full code points and surrogate pairs are no longer combined, so you specify the whole code point with an eight-hex-digit \U escape, e.g.,

    >>> "\U0001D301"
    '𝌁'

This is the symbol for “heavenly earth” in Tai Xuan Jing, the language of the “I Ching” divination system. Of course, we can include the symbol itself in the input file, if we include the right encoding directives:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    # heavenly-earth

    msg = u"𝌁"
    print msg, len(msg)

As you should expect, this prints

𝌁 2

The 2 comes from the internal UTF-16 representation.

Python 3.3ff (mutatis mutandis) gets this right, i.e., with a length of 1.
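A quick way to check the Python 3.3ff behavior for yourself is sketched below. It assumes Python 3.5 or later (for bytes.hex); the variable names are mine. Note that the surrogate pair from the Python 2 example shows up again, but only when we explicitly encode to UTF-16.

```python
s = "\U0001D301"                      # DIGRAM FOR HEAVENLY EARTH
print(len(s))                         # 1 in Python 3.3 and later
print(hex(ord(s)))                    # 0x1d301

# The surrogate pair reappears only when encoding to UTF-16.
utf16 = s.encode("utf-16-be")
print(len(utf16))                     # 4 bytes, i.e. two 16-bit code units
print(utf16.hex())                    # d834df01
```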

Getting this all to work requires that your terminal program and python's output agree on how to encode unicode.

Strings can be concatenated by +, and formatted by %:

    >>> print "%s, %s!" % ('Hello','world')
    Hello, world!

In Python 3.x, the % formatting operator still works (it has not actually been deprecated), but the more flexible format method is generally preferred:

    >>> print("{}, {}!".format("Hello","world"))
    Hello, world!

Note that the curly braces {} are placeholders for format arguments, and that they can include format specifiers (of which we'll see more later).
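As a small preview, here is a sketch of the placeholder styles and a few common format specifiers. The values and the name row are made up for illustration; the specifier mini-language has many more options than are shown here.

```python
# Positional, indexed, and named placeholders.
print("{}, {}!".format("Hello", "world"))            # Hello, world!
print("{1}, {0}!".format("world", "Hello"))          # Hello, world!
print("{greeting}, {who}!".format(greeting="Hello", who="world"))

# A format specifier follows a colon inside the braces:
# <8 left-aligns in 8 columns, >6.2f right-aligns a float with two
# decimals in 6 columns, 04d zero-pads an integer to 4 digits.
row = "{:<8}|{:>6.2f}|{:04d}".format("pi", 3.14159, 42)
print(row)                                            # pi      |  3.14|0042
```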

Strings are a sequence type, and can be subscripted, etc. Note that strings are immutable, that sequences support negative indexing, and that slice selection (and, for mutable sequences such as lists, slice assignment) is possible with more or less the expected semantics.

    >>> a='hello'
    >>> a[:-1]
    'hell'
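A few more subscripting cases, as a sketch in Python 3. The names changed and b are mine; the last part shows what immutability means in practice.

```python
a = "hello"
print(a[0], a[-1])      # h o
print(a[1:3])           # el
print(a[:-1])           # hell
print(a[::-1])          # olleh

# Strings are immutable, so item or slice assignment fails...
try:
    a[0] = "j"
    changed = True
except TypeError:
    changed = False
print(changed)          # False

# ...and a modified copy is built instead.
b = "j" + a[1:]
print(b)                # jello
```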

Conditionals are easy—although at this point it is easier to move to an editor. (Henceforth, I'll stick to Python 3.)

    a = 2
    if a == 0:
        print('zero')
    elif a == 1:
        print('one')
    elif a == 2:
        print('two')
    else:
        print('many')

Note here a general feature of Python syntax. Block structure is indicated by indentation rather than keyword, and blocks are introduced by a colon on the preceding line. Tabs are discouraged in favor of spaces; if present, they are treated as though tab stops are 8 characters wide, and Python 3 rejects indentation that mixes tabs and spaces inconsistently. The intent here is to have a programming syntax that hews closely to the informal conventions of pseudo-code. Indeed, it's reasonable to view pseudo-code as a holy grail of scripting language design: we want a language that makes the translation of pseudo-code into runnable code as lightweight as possible.

Other control structures: while, for, and exceptions.

    while test:
        body
        body
        body

for is used to iterate over a collection (or in Python 3, more generally, an iterator):

    >>> for c in 'foobar':
    ...     print(c)
    ...
    f
    o
    o
    b
    a
    r
    >>>

Remember, strings are sequences, and sequences are a kind of collection.
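Exceptions were mentioned above without an example, so here is a minimal sketch that combines a while loop with try/except/finally. The function parse_ints is hypothetical, written just for this illustration; the only library behavior relied on is that int() raises ValueError on malformed input.

```python
def parse_ints(tokens):
    """Convert tokens to ints, collecting the ones that fail."""
    good, bad = [], []
    i = 0
    while i < len(tokens):          # while: repeat as long as the test holds
        try:
            good.append(int(tokens[i]))
        except ValueError:          # raised by int() on malformed input
            bad.append(tokens[i])
        finally:
            i += 1                  # runs whether or not an exception occurred
    return good, bad

print(parse_ints(["1", "2", "x", "40"]))   # ([1, 2, 40], ['x'])
```

In real code you would write this with a for loop; the while is here only to show the syntax.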

Basic types: bool, int, long (Python 2 only; in Python 3, int is arbitrary-precision), float, complex.
Basic collections: list, tuple, dict, set.

These work pretty much as you would expect, with consistent vocabularies.

    >>> a = []
    >>> a.append(['a','b','c'])
    >>> a.append([1,2,3])
    >>> a.append(['baby','you','and','me'])
    >>> a
    [['a', 'b', 'c'], [1, 2, 3], ['baby', 'you', 'and', 'me']]
    >>> a[1][0]
    1
    >>> a[-1]
    ['baby', 'you', 'and', 'me']
    >>> a[2][1:-1] = ['let','go','of']
    >>> a
    [['a', 'b', 'c'], [1, 2, 3], ['baby', 'let', 'go', 'of', 'me']]

Note that (unlike Haskell), the elements of a list can have distinct types. Note also that the way slice assignments work with negative indices is a bit counter-intuitive.
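Since slice assignment with negative indices trips people up, here is the same operation taken apart step by step. This is just a sketch; the list words and the extra cases at the end are mine.

```python
words = ['baby', 'you', 'and', 'me']

# A negative index counts from the end, so 1:-1 is the same slice as 1:3
# here: everything except the first and last elements.
print(words[1:-1])                    # ['you', 'and']

# Assigning to a slice replaces that stretch of the list, and the
# replacement may be longer or shorter than what it replaces.
words[1:-1] = ['let', 'go', 'of']
print(words)                          # ['baby', 'let', 'go', 'of', 'me']

# An empty slice is an insertion point; an empty right-hand side deletes.
words[1:1] = ['please']
print(words)                          # ['baby', 'please', 'let', 'go', 'of', 'me']
words[1:-1] = []
print(words)                          # ['baby', 'me']
```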

Dictionaries are like arrays, but can be indexed with any immutable hashable value. Built-in immutable objects are hashable.

    >>> d = {}
    >>> d['foo'] = 'bar'
    >>> d['foo']
    'bar'
    >>> d[(1,2)] = 'baz'
    >>> d[(1,1+1)]
    'baz'
    >>> for k,v in d.items():
    ...     print(k,v)
    ...
    (1, 2) baz
    foo bar

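Note that d.items() returns a list of key-value tuples in Python 2, and an iterable view of them in Python 3. Iterating over d itself yields just the keys. The output order above comes from an older interpreter, where it was arbitrary; since Python 3.7, dictionaries iterate in insertion order. Also note that the for loop hints at structure assignment with implicit tupling.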
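The hashability requirement and the two styles of iteration can be sketched like this. The variable names are mine, and I sort by the string form of the key only so that the output does not depend on the interpreter's dictionary ordering.

```python
d = {}
d[(1, 2)] = 'tuple key'        # tuples of hashable values are hashable
d['foo'] = 'string key'

# A list is mutable and therefore unhashable, so it cannot be a key.
try:
    d[[1, 2]] = 'list key'
    accepted = True
except TypeError:
    accepted = False
print(accepted)                # False

# Iterating over the dict yields keys; items() yields (key, value) pairs,
# which the for statement unpacks into two variables.
keys = sorted(d, key=str)
pairs = sorted(d.items(), key=lambda kv: str(kv[0]))
print(keys)                    # [(1, 2), 'foo']
for k, v in pairs:
    print(k, v)
```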

    def fib(n):
        a,b = 0,1 # note that fib(-1) = 1
        for i in range(0,n):
            a,b=a+b,a
        return a

    >>> fib(10)
    55
    >>> fib(100)
    354224848179261915075

A cute idea, adapted from Ruby (which got it from Icon, which arguably got it from Prolog), is the notion of a generator, which is the easy way to create an iterator.

    def fibgen(n):
        a,b=0,1 # note that fib(-1) = 1
        while a < n:
            yield a
            a,b=a+b,a

    >>> for i in fibgen(100):
    ...     print(i)
    ...
    0
    1
    1
    2
    3
    5
    8
    13
    21
    34
    55
    89
    >>>

The effect of this is to create an iterator over the yield'ed values, in which evaluation of the generator is suspended at each yield, and restarted when the next element is demanded. Haskell programmers will want to understand this as a way of titrating laziness into a language that is otherwise eager.
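The suspend-and-resume behavior is easiest to see by driving the generator by hand with the built-in next function, rather than with a for loop. A minimal sketch, in Python 3, repeating fibgen so that the block is self-contained:

```python
def fibgen(n):
    a, b = 0, 1
    while a < n:
        yield a
        a, b = a + b, a

g = fibgen(10)          # nothing runs yet; we just get a generator object
print(next(g))          # 0   (runs up to the first yield, then suspends)
print(next(g))          # 1   (resumes after the yield)
rest = list(g)          # drain whatever is left
print(rest)             # [1, 2, 3, 5, 8]

try:
    next(g)             # an exhausted generator raises StopIteration
except StopIteration:
    print("done")
```

A for loop does exactly this behind the scenes, stopping quietly when StopIteration is raised.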

Hidden in here is a simple convention that makes Python very powerful. Syntactic features like for are not restricted to built-in types, but can also be exploited by user-defined types and user-defined code through specific interfaces. We'll see how this works later.
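As a preview, here is a minimal sketch of a user-defined type that works with for. Countdown is a hypothetical class of mine; the interface being exploited is the __iter__ method, which for invokes (via the built-in iter) to obtain an iterator.

```python
class Countdown:
    """Iterable that counts down from start to 1."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # for calls iter(obj), which calls this method; writing it as
        # a generator gives us the iterator for free.
        n = self.start
        while n > 0:
            yield n
            n -= 1

for i in Countdown(3):
    print(i)

print(list(Countdown(3)))    # [3, 2, 1]
print(2 in Countdown(3))     # True: 'in' falls back to iteration
```

Anything that accepts an iterable (list, sorted, sum, and so on) will accept a Countdown as well.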

Here's a simple concordance program, in Python:

    #!/usr/bin/env python3.4
    # concord - a line-based concordance program

    import sys
    import re

    concord = {} # a map from words to lists of line occurrences

    # process input

    for lineno,line in enumerate(sys.stdin,start=1):
        for word in set(re.split("[^a-z]+",line.lower())) - {''}:
            concord.setdefault(word,[]).append(str(lineno))

    # produce output

    for word in sorted(concord.keys()):
        lines = ' '.join(map(str,concord[word]))
        print("{:<16} {}".format(word,lines))

Why do we push the words on a line through a set?! It is a simple way to ensure that each word counts only once per line. Note that this example contains a number of Python 3-isms, so translating it to Python 2 isn't completely trivial. Especially note the use of a format specifier in the last line.
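To see the effect of the set, here is the core of the program run on two lines of in-memory sample text instead of standard input. The helper words_on_line and the sample lines are mine, introduced only for this sketch.

```python
import re

def words_on_line(line):
    """Return the distinct lowercase words on one line."""
    return set(re.split("[^a-z]+", line.lower())) - {''}

lines = ["The cat saw the other cat.", "A cat!"]
concord = {}
for lineno, line in enumerate(lines, start=1):
    for word in words_on_line(line):
        concord.setdefault(word, []).append(lineno)

print(sorted(words_on_line(lines[0])))   # ['cat', 'other', 'saw', 'the']
print(concord['cat'])                    # [1, 2]
print(concord['the'])                    # [1]
```

Without the set, "cat" and "the" would each be recorded twice for line 1.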

I've also done an implementation of the duplicate-finding problem from the last lecture in Python: dups. This illustrates the use of generators, docstrings, and Python's mechanism for creating executable modules.

Exercise 7.1 Write a Python version of the freq program from the last lecture. Indicate clearly whether you're using Python 2 or Python 3. Note that your program should replace the entire shell script, not just the awk bit.