Lecture 21

Administrivia

Final Presentations

Please sign up for a final presentation slot if you've not done so already. Note that there's now a "presentation" link on the class web pages. The times are provisional, subject to change, etc.

XPath

XPath is a language that enables you to select a set of nodes in an in-memory model of an XML file. Why this is interesting will become clear later, but you should think of XPath as a “little language,” much like the format description (printf/scanf) sublanguage of C.

Formally, XPath is a simple expression language. The values of XPath expressions can have one of four types: node sets, booleans, numbers, and strings.

One of these is not like the others. There are a few things to know. Strings can be delimited with either single or double quotes. The basic collection of comparison operators (=, <, <=, >, >=, !=) exists for numbers and strings, although in XPath 1.0 the ordering operators (<, <=, >, >=) convert string operands to numbers before comparing. [Note that string comparison is Unicode based, and should not be expected to correspond to the natural collating order of a given language.]

Boolean expressions:

Numbers:

Strings:

The real game is node sets. As far as XPath is concerned, there are seven types of nodes: root, element, text, attribute, namespace, processing instruction, and comment.

Note what's not there—CDATA sections, entities, DOCTYPEs, etc. XPath has a DOM-centric, rather than serialization-centric view of XML, and these should all be thought of as part of the machinery for serialization, but disappear in the deserialization process.

The node set expressions are the heart of the matter—the reason why XPath exists.

An initial concept—XPath expressions are evaluated relative to a specific (individual) context node, which has one of the seven types above. There are two types of node set syntax, the abbreviated and the unabbreviated. The abbreviated is the most common—and XPath expressions written in the abbreviated style look intentionally like Unix paths.

Let's consider our standard example:

<?xml version="1.0" encoding="utf-8"?>
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course>
      <number>CMSC 10500</number>
      <title>Fundamentals of Computer Programming I</title>
      <instructor>Mulmuley</instructor>
    </course>
    <course>
      <number>CMSC 15300</number>
      <title>Foundations of Software</title>
      <instructor>MacQueen</instructor>
    </course>
    <course>
      <number>CMSC 27100</number>
      <title>Discrete Mathematics</title>
      <instructor>Klivans</instructor>
    </course>
    <course>
      <number>CMSC 27700</number>
      <title>Mathematical Logic I</title>
      <instructor>Mileti</instructor>
    </course>
    <course>
      <number>CMSC 28000</number>
      <title>Introduction to Formal Languages</title>
      <instructor>Kurtz</instructor>
    </course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course>
      <number>CMSC 16200</number>
      <title>Honors Introduction to Computer Science II</title>
      <instructor>Kurtz</instructor>
    </course>
    <course>
      <number>CMSC 27200</number>
      <title>Theory of Algorithms</title>
      <instructor>Felzenszwalb</instructor>
    </course>
    <course>
      <number>CMSC 27500</number>
      <title>Graph Theory</title>
      <instructor>Simon</instructor>
    </course>
    <course>
      <number>CMSC 27800</number>
      <title>Mathematical Logic II</title>
      <instructor>Mileti</instructor>
    </course>
  </quarter>
  <quarter term="Spring" year="2007">
    <course>
      <number>CMSC 28100</number>
      <title>Introduction to Complexity Theory I</title>
      <instructor>Fortnow</instructor>
    </course>
  </quarter>
</course-schedule>

Abbreviated syntax

Basic path selectors:

We can combine path selectors by using / and //, pretty much as expected, and by using | for union.

For example, in

<foo>
  <bar>
    <baz/>
  </bar>
  <baz/>
</foo>

the XPath expression //bar/baz selects the first baz element, but not the second.
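The difference between a path through bar and a bare descendant search can be checked with a short Python sketch. It uses the standard library's xml.etree.ElementTree, which implements only a subset of XPath and evaluates paths relative to an element, so //bar/baz is written .//bar/baz. The id attributes are an assumption added purely to tell the two baz elements apart; they are not part of the example above.

```python
import xml.etree.ElementTree as ET

# The id attributes are hypothetical labels, added only so that the two
# baz elements can be told apart in the output.
doc = ET.fromstring('<foo><bar><baz id="inner"/></bar><baz id="outer"/></foo>')

# ElementTree paths are relative to the element they are called on,
# so the lecture's //bar/baz becomes .//bar/baz here.
under_bar = [e.get("id") for e in doc.findall(".//bar/baz")]

# For comparison: every baz element, at any depth, in document order.
all_baz = [e.get("id") for e in doc.findall(".//baz")]

print(under_bar)  # ['inner']
print(all_baz)    # ['inner', 'outer']
```

Only the baz that is a child of bar matches the first path; the second path matches both.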

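Likewise //number | //title selects all of the number and title elements of the course example. (XPath 1.0, which xmllint implements, does not accept the shorter //(number|title); XPath 2.0 does.)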

Selector sets can be further refined using predicates. Note that individual nodes have values, which correspond to their content, stripped of markup. Predicates follow a path expression, and are placed in []'s.

Thus, e.g.,

//quarter[@term='Autumn']

selects the first of the quarter elements. Note that we can also use [] for array indexing (1-based), and so could have selected the same node using

//quarter[1]

These can be mixed and matched:

//quarter[@term='Autumn']/course

selects all of the autumn quarter course nodes.
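As a rough check of these three expressions, here is a Python sketch using the standard library's xml.etree.ElementTree, which supports attribute and position predicates but only a subset of XPath overall. The document is a trimmed copy of the course schedule (titles and most courses dropped to keep it short), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the course schedule; not the full example.
SCHEDULE = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
</course-schedule>
"""
root = ET.fromstring(SCHEDULE.strip())

# //quarter[@term='Autumn'] : select by attribute value.
by_attribute = root.findall(".//quarter[@term='Autumn']")

# //quarter[1] : select by (1-based) position among sibling quarter elements.
by_position = root.findall(".//quarter[1]")

# Both expressions pick out the very same element.
same_node = by_attribute[0] is by_position[0]

# //quarter[@term='Autumn']/course : a predicate followed by a further step.
autumn_numbers = [
    course.findtext("number")
    for course in root.findall(".//quarter[@term='Autumn']/course")
]

print(same_node)       # True
print(autumn_numbers)  # ['CMSC 10500', 'CMSC 28000']
```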

Unabbreviated syntax

There are a number of "axes" for search. We've seen child (/), descendant-or-self (//), parent (..), and self (.).

What if we wanted to find all of the quarters in which CMSC 16200 was taught?

//number[.='CMSC 16200']/ancestor::quarter

The ancestor:: axis selector tells us to search "up" from the current node. There are also following-sibling, preceding-sibling, and several others. The general syntax is axis::selector.

This same example could be done without using the ancestor axis:

//quarter[.//number='CMSC 16200']
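To see that the two formulations agree, the following Python sketch spells out what each one computes. It assumes the standard library's xml.etree.ElementTree, which has no ancestor axis and does not accept a nested path inside a predicate, so both are emulated in plain Python. The helper ancestors and the trimmed schedule are illustrative assumptions, and the [.='...'] predicate needs Python 3.7 or later.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the course schedule; not the full example.
SCHEDULE = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
</course-schedule>
"""
root = ET.fromstring(SCHEDULE.strip())

# ElementTree elements do not know their parents, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node, tag):
    """Hypothetical helper: walk upward, collecting ancestors with the given tag."""
    found = []
    while node in parent_of:
        node = parent_of[node]
        if node.tag == tag:
            found.append(node)
    return found

# Emulates //number[.='CMSC 16200']/ancestor::quarter
matches = root.findall(".//number[.='CMSC 16200']")
terms_by_ancestor = [q.get("term") for n in matches for q in ancestors(n, "quarter")]

# Emulates //quarter[.//number='CMSC 16200']
terms_by_predicate = [
    q.get("term")
    for q in root.iter("quarter")
    if any(n.text == "CMSC 16200" for n in q.iter("number"))
]

print(terms_by_ancestor)   # ['Winter']
print(terms_by_predicate)  # ['Winter']
```

The first form starts at the matching number and climbs; the second starts at each quarter and looks down. A full XPath 1.0 engine would evaluate either expression directly.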

The xmllint program has a “shell” mode that enables you to browse an XML file much as if it were a filesystem.

$ man xmllint
$ xmllint -shell mydata.xml
/ > ls //instructor
t--       11 N. Hinrichs
t--        5 Kurtz
t--        8 Razborov
t--        9 Montalban
t--        5 Kurtz
t--        8 Mulmuley
t--        5 Babai
t--        5 Simon
t--        5 Simon
t--        5 Kurtz
t--        5 Simon
t--       11 Hirschfeldt
t--        5 Babai
t--        7 Chuzhoy
t--        5 Soare
t--        8 Razborov
t--        5 Babai
t--        8 Mulmuley
t--        8 Mulmuley
t--        5 Babai
t--        5 Soare
t--        8 Razborov
t--        5 Soare
/ >

Alternatively, we can use the xpath command, but it works a bit differently. If we use the same search string, we'll just get the nodes (it is somewhat like the distinction between "ls" and "ls -d").

/ > xpath //instructor
Object is a Node Set :
Set contains 23 nodes:
1  ELEMENT instructor
    ATTRIBUTE type
      TEXT
        content=nontheory
2  ELEMENT instructor
3  ELEMENT instructor
4  ELEMENT instructor
    ATTRIBUTE type
      TEXT
        content=nontheory
5  ELEMENT instructor
6  ELEMENT instructor
7  ELEMENT instructor
8  ELEMENT instructor
9  ELEMENT instructor
10  ELEMENT instructor
11  ELEMENT instructor
12  ELEMENT instructor
    ATTRIBUTE type
      TEXT
        content=nontheory
13  ELEMENT instructor
14  ELEMENT instructor
15  ELEMENT instructor
16  ELEMENT instructor
17  ELEMENT instructor
18  ELEMENT instructor
19  ELEMENT instructor
20  ELEMENT instructor
21  ELEMENT instructor
22  ELEMENT instructor
23  ELEMENT instructor

To get at the underlying text nodes, we have to use the text() selector:

/ > xpath //instructor/text()
Object is a Node Set :
Set contains 23 nodes:
1  TEXT
    content=N. Hinrichs
2  TEXT
    content=Kurtz
3  TEXT
    content=Razborov
4  TEXT
    content=Montalban
5  TEXT
    content=Kurtz
6  TEXT
    content=Mulmuley
7  TEXT
    content=Babai
8  TEXT
    content=Simon
9  TEXT
    content=Simon
10  TEXT
    content=Kurtz
11  TEXT
    content=Simon
12  TEXT
    content=Hirschfeldt
13  TEXT
    content=Babai
14  TEXT
    content=Chuzhoy
15  TEXT
    content=Soare
16  TEXT
    content=Razborov
17  TEXT
    content=Babai
18  TEXT
    content=Mulmuley
19  TEXT
    content=Mulmuley
20  TEXT
    content=Babai
21  TEXT
    content=Soare
22  TEXT
    content=Razborov
23  TEXT
    content=Soare
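The same distinction between an element node and the text inside it shows up when working from a program. This is a small Python sketch using the standard library's xml.etree.ElementTree, which does not implement the text() selector; reading each element's .text attribute is the closest equivalent there. The three-course document is a made-up fragment in the shape of the lecture's schedule, not the mydata.xml file used in the session above.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the shape of the lecture's schedule.
SCHEDULE = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
</course-schedule>
"""
root = ET.fromstring(SCHEDULE.strip())

# //instructor : the element nodes themselves.
instructors = root.findall(".//instructor")
tags = [e.tag for e in instructors]

# //instructor/text() : the text inside each element, read here via .text.
names = [e.text for e in instructors]

print(tags)   # ['instructor', 'instructor', 'instructor']
print(names)  # ['Mulmuley', 'Kurtz', 'Kurtz']
```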

Exercise 21.1 There is a file courses.xml on the web site. Use xmllint to print

  1. the names of all of the courses that I teach (instructor="Kurtz")
  2. the list of quarters in which I teach.

Note that the format of the file is not exactly the same as I've discussed in this lecture.