Lecture 8

Administrivia

The Computer Science Instructional Lab (CSIL) is currently hiring new staff. If you're interested, here's the job description.

Python, II

Back in the old days of industrial programming there was a clear conceptual distinction between data (numbers, strings, arrays, and such) and code. Code was static, and entrusted to the stewardship of relatively highly compensated programmers and their managers, and subject to a careful regime of specification and testing. Data was dynamic, required ongoing maintenance, and was typically entrusted to the stewardship of relatively lowly compensated secretaries, data-entry clerks, and their supervisors. This distinction remains apparent in the standard Unix preference for moving information out of code and into human-readable configuration files. On one hand, this tended to increase flexibility, on the other, it tended to reduce costs.

Still, from a functional programmer's point of view, this is a bad bargain. Code should be dynamic, just like data. Indeed, code is a kind of data, and the distinction between the two is artificial. Back in the old days, this was not a particularly compelling argument. Functional programs were generally interpreted, and the interpreters were intentionally kept as simple as possible, whereas the traditional languages were compiled, and the computers themselves were relatively slow. Thus, the choice of language could mean a factor of 10 or more in the speed of the resulting code, favoring languages with more primitive control and data abstractions for production work. There's an old joke from that era, that Lisp programmers know the value of everything, but the cost of nothing. The prejudices that underlie that joke notwithstanding, ideas that first appear in "academic" languages have a way of percolating down into traditional production-oriented languages, and so we come to object-oriented languages. An object-oriented language is, not surprisingly, organized around a collection of run-time objects. These objects are encapsulations of data and code, or as object-oriented programmers would say, state and behavior. What is important here is that behaviors here are (in some sense) specific to the objects themselves, i.e., different objects might have distinct code implementing the same named behavior. E.g., in a typical application programming environment for a computer with a graphical-user interface, the various interface items (windows, menus, buttons, check-boxes, etc) will often have common behaviors (e.g., drawing) implemented in very different ways.

There are a growing number of object-oriented languages out there, each with its own aesthetic, culture, and user community, of which Python is one.

Object-oriented languages are typically involve a notion of class, and one can think of the objects of the system as being organized into classes. But there are different understandings here. In typical object-oriented languages (Simula, C++, Java), an object's behaviors and their implementation is determined by its class. This has the advantage of very little space overhead—when an object receives a message (this is object-oriented nomenclature for having one of its behavior functions called), it looks up the implementation in a type-specific method table. Python's view is a bit different, although the difference isn't always exercised. In Python, a class is an object factory, i.e., classes build objects, providing them with initial state and behaviors, but both are mutable. In particular, objects may acquire new behaviors, and new implementations of old behaviors at run time. We won't do anything today that involves this dynamic behavior, but you should know that it exists, and think about how it might be used.

A simple object-oriented example

# Queue.py class Queue(object): def __init__(self): self.contents = [] def push(self,x): self.contents.append(x) return x def pop(self): return self.contents.pop(0) def isempty(self): return len(self.contents) == 0

[This, by the way, is also simple module...] Note that Python 2 has two distinct class systems: old classes and new classes. Python 3 has only new classes, making the explicit reference to the inheritance from the object class a bit superfluous.

What's going on here...

  1. A class is essentially a user-defined type. A Queue can be thought of as a type that builds on the built-in object type, and so contains the state of that type, as well as its functions (methods). Another, perhaps more accurate understanding is that a Queue is an object in its own right, a factory object that builds instances of type Queue. Note that a derived type (like Queue here) can override (i.e., re-implement) methods defined by object. In particular, __init__ is reimplemented here.
  2. Methods are defined with an initial self argument that refers to the object. Python in unusual among OO languages in requiring that you do this—a more common design decision is that the object associated with a method call is implicit. Note that there are actually four scopes at play in a class method: the method's scope (accessed by plain variables), the object's scope (referenced via self), global scope (referenced via a global declaration in the method), and class scope (referenced via ClassName), although only the first two are used in this example.
  3. The __init__ is used to initialize a newly constructed Queue. “Double underscore” methods in Python are reserved for methods that interact with languages features.

It is useful to think of an object in python as a wrapper around a dictionary that maps member names to values.

And in use...

>>> import Queue >>> q = Queue.Queue() >>> q.is_empty() True >>> q.push(1) 1 >>> q.push(2) 2 >>> q.contents [1, 2] >>> q.pop() 1 >>> q.pop() 2 >>> q.pop() Traceback (most recent call last): File "", line 1, in File "./Queue.py", line 11, in pop return self.contents.pop(0) IndexError: pop from empty list >>>

OK, maybe we don't like that error—we want to call to the attention that this is an attempt to pop from an empty queue. So let's define our own exception class for the queue. Note here that the specifics of exception handling are one of the places where Python 3 made significant changes, so this code would need to be reworked for Python 2.

# Queue.py class Queue: def __init__(self): self.contents = [] def push(self,x): self.contents.append(x) return x def pop(self): try: return self.contents.pop(0) except IndexError: raise QueueException("queue is empty") def is_empty(self): return len(self.contents) == 0 class QueueException(RuntimeError): pass

Note here that pass is Pythonese for an empty block.

And let's force an error...

>>> import Queue >>> q = Queue.Queue() >>> q.pop() Traceback (most recent call last): File "./Queue.py", line 11, in pop return self.contents.pop(0) IndexError: pop from empty list During handling of the above exception, another exception occurred: Traceback (most recent call last): File "", line 1, in File "./Queue.py", line 13, in pop raise QueueException("queue is empty") Queue.QueueException: queue is empty >>>

That backtrace pretty much tells the tale—and the exception makes pretty clear what the problem is.

Now, let's say that we want to make it possible to iterate over a queue. One of the nice things about Python is that by adding a few special methods, this is easy. The standard way is to construct an iterator, which we want to iterate over contents. But contents is already a list, and it already has an iterator! So we can just subcontract to the list iterator. Note that this doesn't even require that we know the iterator protocol.

class Queue(object): ... def __iter__(self): return self.contents.__iter__()

And...

>>> import Queue >>> q = Queue.Queue() >>> q.push(1) 1 >>> q.push(2) 2 >>> for item in q: ... print(item) ... 1 2 >>>

One problem with this approach is that it can give unexpected results if we change the queue during the for loop. We can fix that...

class Queue(object): ... def __iter__(self): return self.contents[:].__iter__()

One of the cute things about Python is that pretty much all of the functionality of the language: operations like +, -, comparison operators, even function calls, have corresponding protocols, and objects that implement those protocols can participate in that functionality as if they were pre-defined types. This is similar in intent to Haskell's type class concept, although the implementation details are quite different.

Here's a scheme-ish example, inspired by an example in Abelson and Sussman's "Structure and Interpretation of Computer Programs." Note that defining __call__ makes it possible for an object to use function-call syntax.

#!/usr/bin/env python3.3 # accumulator class Accumulator: def __init__(self): self.x = 0 def __call__(self,y): self.x += y return self.x acc=Accumulator() print(acc(3)) print(acc(4))

Yup... this is just a complicated way to print 3 and 7.

For my part, I tend to think of objects as lambdas with multiple entry points, but I'm strange that way...

Exercise 8.1 While the political debate simmers over the role that anthropogenic forcing factors have on global warming, the hurricanes in the Atlantic rage. The destructiveness of the 2004 and 2005 storm years has been given as evidence for global warming. A serious problem in evaluating this is the extreme variability in the number and strength of hurricanes over time. The NOAA has a good (not perfect) data set that records hurricanes since 1851. What I want you to do is write a program that will attempt to analyze the strength of each hurricane season based on NOAA data.

I propose the following easily computed aggregate measure. There is a well known scale of hurricane strength—the Saffir-Simpson scale was developed in 1969, long before the current controversies began. I want you to measure the aggregate storm activity in Saffir-Simpson days. For example, the NOAA data has four daily sustained wind speed readings for each storm, given in knots (note here that I want you to look at the synoptic 0000, 0600, 1200, and 1800 entries, and to ignore any non-synoptic entries). Map each wind speed to the Saffir-Simpson scale (check Wikipedia for the precise definition). Divide each such entry by 4.0, and add it to the year's total (each entry is taken as a surrogate for 1/4 of a day's total activity).

Report the annual Saffir-Simpson day totals for each year.

The data set can be found at http://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2013-052714.txt

You might consider using curl(1) or wget(1), depending on your system, rather than a browser, to get this file. An explanation of the format of the data can be found in http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf.

This data set is pretty typical for scientific datasets accumulated over many years.

Programming notes: You should write the data analysis program in Python. Your program should include (at least)

  1. A Storm class, which has a saffir_simpson_day function; and
  2. A generator function for reading Storms from a data file.

If you want to do more:

  1. Figure out how to use gnuplot to produce plots of your storm data. Note that gnuplot is a pretty complicated program in its own right. Googling for “gnuplot tutorial” is a good idea. Also note that gnuplot is not a part of the standard distribution. It is installed on the department's linux systems. If you want it on your personal system, you'll have to install it. I use MacPorts on MacOS X, YMMV. An alternative would be to write PostScript directly, which is often what I do.

  2. Play Republican/Democrat advocate. An issue with the Saffir-Simpson Day scale is that it is a poor surrogate for storm damage/energy. Is a class 5 storm really exactly 5 times as destructive as 5 class 1 storms? It seems unlikely. Indeed, this considering maximum sustained speed only doesn't account for storm size.

    The advocacy comes in through the creation of a “damage transfer function.” This should be strictly monotonically increasing in whatever storm parameters you chose to consider (e.g, HURDAT2 looks at wind radii by quadrant). Consider the 10 most damaging storm years, as determined by your damage transfer function. To play Republican advocate, try to minimize the number of years since 2000 that occur in the data; to play Democratic, try to maximize it. One of the nice attributes of the HURDAT2 data is that it gives wind speed radii, which allow you to take into account storm size.

The current data format HURDAT2 is new. Prior to 2013, the data was in HURDAT format, a plain-text version of an 80-column Fortran dataset. The description of these files is still available, and makes interesting reading: http://www.aoml.noaa.gov/hrd/data_sub/hurdat.html. I think you'll especially like the FORTRAN code for parsing this data—it will give you an idea of what programming was like when I was first learning it.