Lecture 20

Administrivia

Final Presentations

It's time to start thinking seriously about the final presentations. Each section of the class will have three slots: Wednesday of 10th week -- 40 minutes; Friday of reading period -- 1 hour; Final Exam slot -- 2 hours. Please pick a day, a title, and send your instructor email. The rule as regards the non-exam slots is first-come, first-served. Please let us know whether you'll need to use the projector. If you are planning to use the projector, please make sure you know how to use it!

CMSC 22311

CMSC 22311, Functional Systems in Haskell, will consider advanced topics in Haskell and Haskell-based systems. This course will be offered in the Spring Quarter of 2015, but won't be offered during the 2015-16 academic year. The prerequisites for 22311 have been tweaked to permit 15400 as a co-requisite rather than a pre-requisite, so that students who took 161 this year can enroll. The registrar's system should reflect the change soon, but you can petition into the course until it does.

Languages and Meta-languages

Much of the utility of the lex/yacc (flex/bison) toolchain comes from the recursive nature of context-free languages, which makes them a natural choice for building textual representations of data objects that are themselves recursively defined. But yacc and lex are not the easiest tools in the world to use. A consequence has been a de facto partitioning of the serialization problem into two regimes:

One simple and fairly successful such meta-language is JSON (JavaScript Object Notation).

JSON has a simple data ontology, built out of a few primitive data-types (strings, numbers, booleans, and null) and a couple of closure operations (objects and arrays).

The basics of JSON are simple: strings are represented by sequences of Unicode characters (excluding unescaped double-quotes and control characters), enclosed in double quotes. The prohibited characters can be escaped via the printf backslash mechanism. Numbers follow the standard floating point grammar. Booleans are either true or false (case matters!), and null is null. Objects are represented by a comma-separated sequence of binding pairs (a key string, a colon, and a value) enclosed in braces { and }. Arrays are represented by a comma-separated sequence of values, enclosed in brackets [ and ]. Whitespace (except within strings) is ignored. The following example comes from Wikipedia:

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": 10021
    },
    "phoneNumber": [
        { "type": "home", "number": "212 555-1234" },
        { "type": "fax", "number": "646 555-4567" }
    ]
}
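The string rule above (double quotes, with backslash escapes for the prohibited characters) is most of what a program needs in order to emit JSON by hand. Here is a minimal sketch in C of that one rule. The names json_quote and put are hypothetical, the code assumes its input is already valid UTF-8, and it is not a substitute for a real JSON library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append the NUL-terminated string s to dst (capacity cap, *n bytes used).
 * Returns 0 on success, -1 if s does not fit. */
static int put(char *dst, size_t cap, size_t *n, const char *s)
{
    size_t len = strlen(s);
    if (*n + len + 1 > cap)          /* keep room for the terminating NUL */
        return -1;
    memcpy(dst + *n, s, len + 1);
    *n += len;
    return 0;
}

/* Hypothetical helper: write src into dst as a JSON string literal.
 * Returns the number of bytes written (not counting the NUL),
 * or -1 if dst is too small. Assumes src is valid UTF-8. */
int json_quote(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    char buf[8];

    assert(src != NULL && dst != NULL);
    if (put(dst, cap, &n, "\"") < 0)
        return -1;
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        const char *piece = buf;
        switch (*p) {
        case '"':  piece = "\\\""; break;   /* unescaped quotes are prohibited */
        case '\\': piece = "\\\\"; break;
        case '\n': piece = "\\n";  break;
        case '\t': piece = "\\t";  break;
        case '\r': piece = "\\r";  break;
        default:
            if (*p < 0x20) {                /* other control characters */
                snprintf(buf, sizeof buf, "\\u%04x", (unsigned)*p);
            } else {                        /* ordinary byte: copy as-is */
                buf[0] = (char)*p;
                buf[1] = '\0';
            }
        }
        if (put(dst, cap, &n, piece) < 0)
            return -1;
    }
    if (put(dst, cap, &n, "\"") < 0)
        return -1;
    return (int)n;
}
```

Calling json_quote on the 13-character street address from the example produces the same text wrapped in double quotes, 15 bytes in all. Objects and arrays can then be emitted by writing the braces, brackets, colons, and commas around such pieces.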

A system like JSON has several advantages. It is simple and robust. It can describe most kinds of data efficiently. Most languages have libraries that support JSON directly, and it's an easy language to parse in any event. But JSON isn't perfect. We need more powerful tools.

XML

Important note: The University has a site license for Safari—a books on-line system started by O'Reilly publishing which includes most of the O'Reilly books, including “XML in a Nutshell.” Just go to http://proquestcombo.safaribooksonline.com.proxy.uchicago.edu/ and follow your nose.

XML stands for “extensible markup language.” XML is intended to solve a number of problems, but for our purposes it is best thought of as a language that facilitates the textual representation of structured information.

The core language is very simple. An XML document begins with an XML declaration

<?xml version="1.0" encoding="utf-8"?>

Note that “<?” must be the first two characters in the file, and that XML language processors have to be able to figure out whether they're looking at one of the standard 8-bit, 16-bit, or 32-bit encodings before they get to the encoding string. We've already talked about UTF-8, which is the most robust standard for representing Unicode on a byte-oriented, English-biased system. The objects that XML describes (the DOM, Document Object Model) are themselves defined in terms of Unicode, irrespective of the particular character set used to represent the XML file.

It is possible (but unlikely) that you'll see version 1.1, and very likely that you'll see other encodings (ISO-8859-1, a.k.a. “Latin-1” or “Windows,” is the most likely), but in your own work you should prefer UTF-8 whenever possible in the name of standards, transparency, and interoperability.
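To make the "figure out the encoding before reading the encoding string" step concrete, here is a minimal sketch in C that guesses the code-unit width from the first few bytes of a file. It looks for a byte-order mark or for the characters “<?” as they would appear in each width, in the spirit of the autodetection appendix of the XML 1.0 recommendation. The function name xml_sniff_width and the table layout are hypothetical, and a real processor handles more cases than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One recognizable file prefix and the code-unit width it implies. */
struct sig {
    unsigned char bytes[4];
    size_t n;       /* how many of bytes[] are significant */
    int width;      /* 8, 16, or 32 */
};

/* 32-bit patterns come first: the UTF-32 little-endian byte-order mark
 * begins with the same two bytes as the UTF-16 little-endian one. */
static const struct sig sigs[] = {
    {{0x00, 0x00, 0xFE, 0xFF}, 4, 32},  /* UTF-32 byte-order mark, big-endian */
    {{0xFF, 0xFE, 0x00, 0x00}, 4, 32},  /* UTF-32 byte-order mark, little-endian */
    {{0x00, 0x00, 0x00, 0x3C}, 4, 32},  /* '<' as one 32-bit unit, big-endian */
    {{0x3C, 0x00, 0x00, 0x00}, 4, 32},  /* '<' as one 32-bit unit, little-endian */
    {{0xFE, 0xFF},             2, 16},  /* UTF-16 byte-order mark, big-endian */
    {{0xFF, 0xFE},             2, 16},  /* UTF-16 byte-order mark, little-endian */
    {{0x00, 0x3C, 0x00, 0x3F}, 4, 16},  /* "<?" in 16-bit units, big-endian */
    {{0x3C, 0x00, 0x3F, 0x00}, 4, 16},  /* "<?" in 16-bit units, little-endian */
    {{0xEF, 0xBB, 0xBF},       3, 8},   /* UTF-8 byte-order mark */
    {{0x3C, 0x3F, 0x78, 0x6D}, 4, 8},   /* "<?xm" in an ASCII-compatible encoding */
};

/* Hypothetical helper: return 8, 16, or 32 for the guessed code-unit
 * width, or 0 if the prefix is not recognized. */
int xml_sniff_width(const unsigned char *b, size_t len)
{
    assert(b != NULL);
    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (len >= sigs[i].n && memcmp(b, sigs[i].bytes, sigs[i].n) == 0)
            return sigs[i].width;
    }
    return 0;
}
```

Once the width (and byte order) is known, the processor can decode enough of the declaration to read the encoding name itself, which is why the declaration has to come first in the file.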

The XML declaration is optionally followed by a DOCTYPE declaration, a mechanism that binds a document type definition (DTD) to the document, although this is discouraged for new applications in favor of more modern and flexible constraint systems. After that comes a single XML element, which takes the form of

<tag> ... markup ... </tag>

The markup consists of text together with tags, but the tags have to come in balanced pairs. Important terminology: an XML document is well-formed if it starts with a proper XML declaration, and it consists of a single element, with well-nested content. It is valid with respect to a schema if it satisfies all of the schema's constraints.

This latter point is worth understanding, and it's part of the reason why XML is a win. A Dyck language is a balanced parenthesis language over finitely many distinct types of parentheses (remember the homework assignment?). The Chomsky-Schützenberger theorem says that every context-free language can be expressed as a homomorphic image of the intersection of a Dyck language with a regular language. XML tags can (and should!) be thought of as distinct types of parentheses: XML has infinitely many potential types of parentheses, while JSON has only two or three (if we count double quotes). And the role of an XML schema parallels the role of the regular language in the Chomsky-Schützenberger theorem (whether or not this was planned). Indeed, most of the popular XML schema mechanisms (e.g., DTDs) involve both context-free and regularity constraints.

What this means is that putative XML files can be pre-flighted for meaningful structure, as general-purpose tools can use textual specifications to make sure that they're properly formatted (i.e., well-formed) and semantically meaningful (i.e., valid) in our specific context.
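The "tags as parentheses" view suggests the usual stack-based recognizer for a Dyck language: push each start tag, and require each end tag to match the top of the stack. The following is a minimal sketch in C, under loud simplifying assumptions: it checks nesting only, it skips comments and “<?...?>” forms, and it does not handle CDATA sections, DOCTYPE declarations, or a “>” inside an attribute value. The name tags_balanced and the two size limits are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DEPTH 64    /* assumed limit on nesting depth */
#define MAX_NAME  64    /* assumed limit on tag-name length */

/* Hypothetical helper: true if every start tag in s is closed by a
 * matching end tag, in properly nested order. Names are compared
 * byte for byte, so the check is case sensitive. */
bool tags_balanced(const char *s)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    assert(s != NULL);
    while ((s = strchr(s, '<')) != NULL) {
        const char *end;

        if (strncmp(s, "<!--", 4) == 0) {           /* comment: skip it */
            if ((end = strstr(s + 4, "-->")) == NULL)
                return false;
            s = end + 3;
            continue;
        }
        if (s[1] == '?') {                          /* <?xml ...?> and friends */
            if ((end = strstr(s + 2, "?>")) == NULL)
                return false;
            s = end + 2;
            continue;
        }
        if ((end = strchr(s, '>')) == NULL)
            return false;

        bool closing = (s[1] == '/');               /* </name> */
        bool empty = (end[-1] == '/');              /* <name/> */
        const char *name = s + (closing ? 2 : 1);
        size_t len = strcspn(name, " \t\r\n/>");
        if (len == 0 || len >= MAX_NAME)
            return false;

        if (closing) {                              /* must match the top of the stack */
            if (depth == 0 || strlen(stack[depth - 1]) != len ||
                strncmp(stack[depth - 1], name, len) != 0)
                return false;
            depth--;
        } else if (!empty) {                        /* start tag: push its name */
            if (depth == MAX_DEPTH)
                return false;
            memcpy(stack[depth], name, len);
            stack[depth][len] = '\0';
            depth++;
        }
        s = end + 1;
    }
    return depth == 0;                              /* nothing left open */
}
```

This covers only the Dyck-language half of the story. Checking that the right elements appear in the right places is the schema's job.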

Some syntax:

An example:

<?xml version="1.0" encoding="utf-8"?>
<course-schedule>
    <quarter term="Autumn" year="2006">
        <course>
            <number>CMSC 10500</number>
            <title>Fundamentals of Computer Programming I</title>
            <instructor>Mulmuley</instructor>
        </course>
        <course>
            <number>CMSC 15300</number>
            <title>Foundations of Software</title>
            <instructor>MacQueen</instructor>
        </course>
        <course>
            <number>CMSC 27100</number>
            <title>Discrete Mathematics</title>
            <instructor>Klivans</instructor>
        </course>
        <course>
            <number>CMSC 27700</number>
            <title>Mathematical Logic I</title>
            <instructor>Mileti</instructor>
        </course>
        <course>
            <number>CMSC 28000</number>
            <title>Introduction to Formal Languages</title>
            <instructor>Kurtz</instructor>
        </course>
    </quarter>
    <quarter term="Winter" year="2007">
        <course>
            <number>CMSC 16200</number>
            <title>Honors Introduction to Computer Science II</title>
            <instructor>Kurtz</instructor>
        </course>
        <course>
            <number>CMSC 27200</number>
            <title>Theory of Algorithms</title>
            <instructor>Felzenszwalb</instructor>
        </course>
        <course>
            <number>CMSC 27500</number>
            <title>Graph Theory</title>
            <instructor>Simon</instructor>
        </course>
        <course>
            <number>CMSC 27800</number>
            <title>Mathematical Logic II</title>
            <instructor>Mileti</instructor>
        </course>
    </quarter>
    <quarter term="Spring" year="2007">
        <course>
            <number>CMSC 28100</number>
            <title>Introduction to Complexity Theory I</title>
            <instructor>Fortnow</instructor>
        </course>
    </quarter>
</course-schedule>

This is a (shortened) version of an XML file that described the theory faculty's teaching plan from a few years ago. Things to note: tags and attributes.

Note that the syntax of an attribute is very rigid: it must take one of the forms name="value" or name='value'.

The distinction between the flavors of quotes is pretty minimal: you don't have to escape double quotes within a single-quoted string, etc.

XML is case sensitive (unlike HTML). The form of indenting used here is pretty typical for human-maintained XML.

Please understand that I am not telling you the full story—just, as usual, enough to be dangerous.

The flexibility of XML can be an issue here. In most applications, you want to constrain the XML to match a particular model of the data. (Such constrained sets of XML are called “applications,” just to confuse outsiders.) I've been working with a constraint language called Relax NG, which is supported by the extremely useful xmllint tool. There is a nice on-line tutorial for Relax NG. Here is an .rng file that describes the intended data model for the teaching plan:

<?xml version="1.0" encoding="utf-8"?>
<!-- courses.rng -->
<element name='course-schedule' xmlns="http://relaxng.org/ns/structure/1.0">
    <zeroOrMore>
        <element name="quarter">
            <attribute name="term"/>
            <attribute name="year"/>
            <zeroOrMore>
                <element name="course">
                    <element name="number"> <text/> </element>
                    <element name="title"> <text/> </element>
                    <choice>
                        <element name="instructor">
                            <optional> <attribute name="type"/> </optional>
                            <text/>
                        </element>
                        <element name="not-offered">
                            <attribute name="year"/>
                        </element>
                    </choice>
                </element>
            </zeroOrMore>
        </element>
    </zeroOrMore>
</element>

The validity of an XML file relative to a .rng schema can be tested using xmllint.

$ xmllint --noout courses.xml --relaxng courses.rng

Having written out a schema, it is usually straightforward to design C (or other language) data structures that parallel the XML.

#include <stdbool.h>

struct course_schedule {
    int n_quarters;
    struct quarter **quarters;
};

struct quarter {
    char *term;
    int year;
    int n_courses;
    struct course **courses;
};

struct course {
    char *number;
    char *title;
    bool offered;
    union {
        char *instructor;   // if offered == true
        int year;           // if offered == false
    } offered_choice;
};
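Going the other way, from the C structures back to XML, is equally mechanical. Here is a minimal sketch that writes a single struct course as a course element matching the schema above. The function names (append_raw, append_escaped, course_to_xml) are hypothetical, the struct is repeated so that the fragment stands on its own, and the output is written without indentation to keep the code short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct course {                     /* same shape as the struct above */
    char *number;
    char *title;
    bool offered;
    union {
        char *instructor;           /* if offered == true */
        int year;                   /* if offered == false */
    } offered_choice;
};

/* Append s to the NUL-terminated string in dst; -1 if it does not fit. */
static int append_raw(char *dst, size_t cap, const char *s)
{
    size_t n = strlen(dst), len = strlen(s);
    if (n + len + 1 > cap)
        return -1;
    memcpy(dst + n, s, len + 1);
    return 0;
}

/* Append text content, escaping the characters XML reserves in text. */
static int append_escaped(char *dst, size_t cap, const char *src)
{
    for (; *src; src++) {
        char one[2] = { *src, '\0' };
        const char *piece = one;
        if (*src == '&')      piece = "&amp;";
        else if (*src == '<') piece = "&lt;";
        else if (*src == '>') piece = "&gt;";
        if (append_raw(dst, cap, piece) < 0)
            return -1;
    }
    return 0;
}

/* Hypothetical serializer: returns 0 on success, -1 if dst is too small. */
int course_to_xml(const struct course *c, char *dst, size_t cap)
{
    char year[16];

    assert(c != NULL && dst != NULL && cap > 0);
    dst[0] = '\0';
    if (append_raw(dst, cap, "<course><number>") < 0 ||
        append_escaped(dst, cap, c->number) < 0 ||
        append_raw(dst, cap, "</number><title>") < 0 ||
        append_escaped(dst, cap, c->title) < 0 ||
        append_raw(dst, cap, "</title>") < 0)
        return -1;
    if (c->offered) {               /* first branch of the schema's <choice> */
        if (append_raw(dst, cap, "<instructor>") < 0 ||
            append_escaped(dst, cap, c->offered_choice.instructor) < 0 ||
            append_raw(dst, cap, "</instructor>") < 0)
            return -1;
    } else {                        /* second branch: <not-offered year="..."/> */
        snprintf(year, sizeof year, "%d", c->offered_choice.year);
        if (append_raw(dst, cap, "<not-offered year=\"") < 0 ||
            append_raw(dst, cap, year) < 0 ||
            append_raw(dst, cap, "\"/>") < 0)
            return -1;
    }
    return append_raw(dst, cap, "</course>");
}
```

The offered flag selects which member of the union is live, which mirrors the choice element in the schema. A serializer for struct quarter would loop over its courses and wrap the results in a quarter element carrying the term and year attributes.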

Exercise 20.1 Write some interesting data in XML. Develop a reasonable Relax NG schema that constrains your data, and make sure that your XML validates. Provide an analogous set of C datatypes.

If you started with the datatypes, and the problem of serialization (i.e., the problem of moving dynamic data into a persistent, and necessarily linear, storage format), you might be inclined to do something that is less verbose, and more specific to your application.

But XML has huge advantages. The biggest is that most programming languages provide generic data abstractions for representing XML (typically through the Document Object Model, a.k.a., DOM), and routines for reading and writing these structures.

This means that a typical XML-oriented workflow is going to look something like this

Workflow Diagram

This is kind of nice, since the “application code” is written in the base language (e.g., C), and you don't need to deal with parser generator tools (like lex/yacc). Moreover, the inputs and outputs of this process are data-structures within the language.

As a user, if your data is represented in XML, it is very likely that a wide variety of programming tools can operate on it, increasing flexibility and reducing vulnerability.

Let's have a go at this a bit more systematically.

XML represents a tree. There are a lot of different types of nodes on that tree—the three most important are

Mapping between these trees and XML syntax is fairly straightforward—the main issue being when whitespace matters. By default, all whitespace matters.
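To see why whitespace is the main issue, here is a minimal sketch of a generic tree in C with only two kinds of node, elements and text (attributes are left out to keep it short). All of the names here (struct node, is_blank_text, count_children) are hypothetical and are not taken from any particular DOM library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum node_kind { NODE_ELEMENT, NODE_TEXT };

/* A deliberately small, hypothetical tree node. */
struct node {
    enum node_kind kind;
    const char *text;               /* element name, or the text itself */
    struct node *first_child;       /* NULL for text nodes */
    struct node *next_sibling;
};

/* The four characters XML treats as whitespace. */
static bool is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* True if n is a text node consisting only of whitespace. */
bool is_blank_text(const struct node *n)
{
    assert(n != NULL);
    if (n->kind != NODE_TEXT)
        return false;
    for (const char *p = n->text; *p; p++) {
        if (!is_xml_space(*p))
            return false;
    }
    return true;
}

/* Count the children of e, optionally ignoring whitespace-only text. */
size_t count_children(const struct node *e, bool skip_blank)
{
    size_t count = 0;
    assert(e != NULL);
    for (const struct node *c = e->first_child; c != NULL; c = c->next_sibling) {
        if (skip_blank && is_blank_text(c))
            continue;
        count++;
    }
    return count;
}
```

With indented XML, a course element that contains one number element on its own line has three children by default: a text node for the newline and indentation, the number element, and another text node for the trailing newline. Only when blank text is skipped does the count drop to one, which is usually what the application wants.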