Please sign up for a final presentation slot if you've not done so already. Note that there's now a "presentation" link on the class web pages. The times are provisional, subject to change, etc.
XPath is a language that enables you to select a set of nodes in an in-memory model of an XML file. Why this is interesting will become clear later, but you should think of XPath as a “little language,” much like the format description (printf/scanf) sublanguage of C.
Formally, XPath is a simple expression language. The values of XPath expressions can have one of four types:
- node sets
- booleans
- numbers
- strings
One of these is not like the others. There are a few things to know. Strings can be delimited with either single or double quotes. The basic collection of comparison operators (`=`, `!=`, `<`, `<=`, `>`, `>=`) exists for numbers and strings. [Note that string comparison is Unicode based, and should not be expected to correspond to the natural collating order of a given language.]
- relational expressions
- Usual arithmetic operators (`+`, `-`, `*`, `div`, `mod`)
- relational operators
- other functions (see the XPath spec), mostly based on matching and substring selection.
The real game is node sets. As far as XPath is concerned, there are seven types of nodes:
- root nodes
- element nodes
- text nodes
- attribute nodes
- namespace nodes
- processing instruction nodes
- comment nodes
Note what's not there—CDATA sections, entities, DOCTYPEs, etc. XPath has a DOM-centric, rather than serialization-centric view of XML, and these should all be thought of as part of the machinery for serialization, but disappear in the deserialization process.
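As a rough illustration of this point, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The choice of Python and the tiny `note` document are assumptions made for illustration, not part of the lecture. After parsing, the CDATA section and the entity reference are gone, and only ordinary text remains.

```python
import xml.etree.ElementTree as ET

# A made-up document that spells the same text two ways:
# once inside a CDATA section, once with a built-in entity.
doc = "<note><a><![CDATA[x < y]]></a><b>x &lt; y</b></note>"
root = ET.fromstring(doc)

# After deserialization both are plain text; the parser keeps no
# record of which serialization trick was used.
a_text = root.find("a").text
b_text = root.find("b").text
print(a_text, "|", b_text)
```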
The node set expressions are the heart of the matter—the reason why XPath exists.
An initial concept—XPath expressions are evaluated relative to a specific (individual) context node, which has one of the seven types above. There are two types of node set syntax, the abbreviated and the unabbreviated. The abbreviated is the most common—and XPath expressions written in the abbreviated style look intentionally like Unix paths.
Let's consider our standard example:
    <?xml version="1.0" encoding="utf-8"?>
    <course-schedule>
      <quarter term="Autumn" year="2006">
        <course>
          <number>CMSC 10500</number>
          <title>Fundamentals of Computer Programming I</title>
          <instructor>Mulmuley</instructor>
        </course>
        <course>
          <number>CMSC 15300</number>
          <title>Foundations of Software</title>
          <instructor>MacQueen</instructor>
        </course>
        <course>
          <number>CMSC 27100</number>
          <title>Discrete Mathematics</title>
          <instructor>Klivans</instructor>
        </course>
        <course>
          <number>CMSC 27700</number>
          <title>Mathematical Logic I</title>
          <instructor>Mileti</instructor>
        </course>
        <course>
          <number>CMSC 28000</number>
          <title>Introduction to Formal Languages</title>
          <instructor>Kurtz</instructor>
        </course>
      </quarter>
      <quarter term="Winter" year="2007">
        <course>
          <number>CMSC 16200</number>
          <title>Honors Introduction to Computer Science II</title>
          <instructor>Kurtz</instructor>
        </course>
        <course>
          <number>CMSC 27200</number>
          <title>Theory of Algorithms</title>
          <instructor>Felzenszwalb</instructor>
        </course>
        <course>
          <number>CMSC 27500</number>
          <title>Graph Theory</title>
          <instructor>Simon</instructor>
        </course>
        <course>
          <number>CMSC 27800</number>
          <title>Mathematical Logic II</title>
          <instructor>Mileti</instructor>
        </course>
      </quarter>
      <quarter term="Spring" year="2007">
        <course>
          <number>CMSC 28100</number>
          <title>Introduction to Complexity Theory I</title>
          <instructor>Fortnow</instructor>
        </course>
      </quarter>
    </course-schedule>
- `/`: selects the root node.
- `/course-schedule`: selects the course-schedule element node that is the container node for this XML file.
- `//course`: selects all course nodes. Note that in the abbreviated syntax, `//` means “all my descendants and me.”
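To try these selections from a program, here is a minimal sketch in Python with the standard library's `xml.etree.ElementTree`. Two assumptions to keep in mind: the XML is an abbreviated copy of the example (titles omitted, four of the ten courses), and ElementTree implements only a subset of XPath with no separate root-node object, so the document element stands in for `/course-schedule` and `.//course` is its spelling of `//course`.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the course-schedule example (an assumption of this
# sketch: titles omitted, only four courses kept).
XML = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Spring" year="2007">
    <course><number>CMSC 28100</number><instructor>Fortnow</instructor></course>
  </quarter>
</course-schedule>
"""

# The document element plays the role of /course-schedule.
root = ET.fromstring(XML)

# ".//course" means: every course element below the context node.
courses = root.findall(".//course")
numbers = [c.findtext("number") for c in courses]

print(root.tag)
print(numbers)
```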
Basic path selectors:
- `/`: the root.
- `.`: the context node.
- `..`: the parent node of the context node.
- `name`: all child element nodes of the context node with tag "name".
- `*`: all child elements.
- `@name`: the attribute node named "name" of the context node.
- `@*`: all attributes of the context node.
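The selectors above can be tried out with a short Python sketch, again assuming the standard library's `xml.etree.ElementTree` and an abbreviated copy of the example document. ElementTree does not return attribute nodes, so `get()` and `attrib` stand in for `@name` and `@*` here. It also cannot follow `..` above the element a search starts from, which is why the parent step is run from the document element.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the course-schedule example (assumption: titles
# omitted, only four courses kept).
XML = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Spring" year="2007">
    <course><number>CMSC 28100</number><instructor>Fortnow</instructor></course>
  </quarter>
</course-schedule>
"""

root = ET.fromstring(XML)
quarter = root.find("quarter")               # context node: the Autumn quarter

self_node = quarter.find(".")                # .     the context node itself
courses = quarter.findall("course")          # name  child elements with that tag
children = quarter.findall("*")              # *     all child elements
parents = root.findall("quarter/course/..")  # ..    parents of the course nodes

term = quarter.get("term")                   # stands in for @term
all_attrs = dict(quarter.attrib)             # stands in for @*

print(len(courses), term, all_attrs)
print([p.get("term") for p in parents])
```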
We can combine path selectors by using `/` and `//` pretty much as expected, and by using `|` for union.
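For example, in

    <foo>
      <bar>
        <baz/>
      </bar>
      <baz/>
    </foo>

the XPath expression `//bar/baz` selects the first baz element, but not the second. In the course example, `//number | //title` selects all of the number and title elements. (The parenthesized form `//(number|title)` is XPath 2.0 syntax; XPath 1.0 processors such as xmllint reject it.)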
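A small Python sketch of the same two ideas, assuming the standard library's `xml.etree.ElementTree`. The `id` attributes are added here only to tell the two baz elements apart; they are not in the original example. ElementTree has no `|` operator, so the union is emulated by walking the tree in document order and keeping the wanted tags.

```python
import xml.etree.ElementTree as ET

# The foo/bar/baz example, with hypothetical id attributes added so the
# two baz elements can be distinguished in the output.
foo = ET.fromstring("<foo><bar><baz id='inner'/></bar><baz id='outer'/></foo>")

# //bar/baz: only a baz that is a child of a bar.
nested = foo.findall(".//bar/baz")

# Emulated union of //bar and //baz: iter() visits nodes in document
# order, so the result is ordered and free of duplicates.
wanted = {"bar", "baz"}
union = [e for e in foo.iter() if e.tag in wanted]

print([e.get("id") for e in nested])
print([e.tag for e in union])
```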
Selector sets can be further refined using predicates. Note that individual nodes have values, which correspond to their content, stripped of markup. Predicates follow a path expression, and are placed in square brackets. For example, `//quarter[@term="Autumn"]` selects the first of the quarter elements. Note that we can also use `[]` for array indexing (1-based), and so could have selected the same node using `//quarter[1]`.
These can be mixed and matched: `//quarter[@term="Autumn"]/course` selects all of the autumn quarter course nodes.
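Predicates of this kind can be checked with a short Python sketch, assuming the standard library's `xml.etree.ElementTree` (which supports attribute and position predicates) and an abbreviated copy of the example document. The particular expressions are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the course-schedule example (assumption: titles
# omitted, only four courses kept).
XML = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Spring" year="2007">
    <course><number>CMSC 28100</number><instructor>Fortnow</instructor></course>
  </quarter>
</course-schedule>
"""

root = ET.fromstring(XML)

# Attribute predicate: quarters whose term attribute is "Autumn".
by_attr = root.findall(".//quarter[@term='Autumn']")

# Position predicate: indexing is 1-based, so [1] is the first quarter.
by_index = root.findall(".//quarter[1]")

# Predicate followed by a further step: courses of the Autumn quarter.
autumn_courses = root.findall(".//quarter[@term='Autumn']/course")
autumn_numbers = [c.findtext("number") for c in autumn_courses]

print(by_attr[0] is by_index[0])
print(autumn_numbers)
```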
There are a number of "axes" for search. We've seen child (`/`), descendant-or-self (`//`), parent (`..`), and self (`.`).
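What if we wanted to find all of the quarters in which CMSC 16200 was taught? An expression such as `//course[number="CMSC 16200"]/ancestor::quarter` does it.
The `ancestor::` axis selector tells us to search "up" from the current node. There are also axes such as `following-sibling`, `preceding-sibling`, and several others. The general syntax is `axis::node-test`.
This same example could be done without using the ancestor axis: `//quarter[course/number="CMSC 16200"]`.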
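Both approaches can be sketched in Python with the standard library's `xml.etree.ElementTree`, with one caveat: ElementTree has no axis syntax, so the "search up" step is emulated with a hand-built parent map. The helper `ancestors` is a hypothetical name introduced for this sketch, and the XML is an abbreviated copy of the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the course-schedule example (assumption: titles
# omitted, only four courses kept).
XML = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Spring" year="2007">
    <course><number>CMSC 28100</number><instructor>Fortnow</instructor></course>
  </quarter>
</course-schedule>
"""

root = ET.fromstring(XML)

# Map each element to its parent, since ElementTree keeps no parent links.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(node):
    """Yield the parent, grandparent, ... of node (hypothetical helper)."""
    while node in parent_of:
        node = parent_of[node]
        yield node

# Searching up: find the course first, then climb to its quarter.
hits = root.findall(".//course[number='CMSC 16200']")
via_ancestor = [a for c in hits for a in ancestors(c) if a.tag == "quarter"]

# Searching down: keep the quarters that contain such a course.
via_predicate = [q for q in root.findall("quarter")
                 if q.find("course[number='CMSC 16200']") is not None]

print([q.get("term") for q in via_ancestor])
print([q.get("term") for q in via_predicate])
```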
The xmllint program has a “shell” mode that enables you to browse an XML file much as if it were a filesystem.
    $ man xmllint
    $ xmllint -shell mydata.xml
    / > ls //instructor
    t--       11 N. Hinrichs
    t--        5 Kurtz
    t--        8 Razborov
    t--        9 Montalban
    t--        5 Kurtz
    t--        8 Mulmuley
    t--        5 Babai
    t--        5 Simon
    t--        5 Simon
    t--        5 Kurtz
    t--        5 Simon
    t--       11 Hirschfeldt
    t--        5 Babai
    t--        7 Chuzhoy
    t--        5 Soare
    t--        8 Razborov
    t--        5 Babai
    t--        8 Mulmuley
    t--        8 Mulmuley
    t--        5 Babai
    t--        5 Soare
    t--        8 Razborov
    t--        5 Soare
    / >
Alternatively, we can use the xpath command, but it works a bit differently. If we use the same search string, we'll just get the nodes (it is somewhat like the distinction between "ls" and "ls -d").
    / > xpath //instructor
    Object is a Node Set :
    Set contains 23 nodes:
    1  ELEMENT instructor
        ATTRIBUTE type
          TEXT
            content=nontheory
    2  ELEMENT instructor
    3  ELEMENT instructor
    4  ELEMENT instructor
        ATTRIBUTE type
          TEXT
            content=nontheory
    5  ELEMENT instructor
    6  ELEMENT instructor
    7  ELEMENT instructor
    8  ELEMENT instructor
    9  ELEMENT instructor
    10  ELEMENT instructor
    11  ELEMENT instructor
    12  ELEMENT instructor
        ATTRIBUTE type
          TEXT
            content=nontheory
    13  ELEMENT instructor
    14  ELEMENT instructor
    15  ELEMENT instructor
    16  ELEMENT instructor
    17  ELEMENT instructor
    18  ELEMENT instructor
    19  ELEMENT instructor
    20  ELEMENT instructor
    21  ELEMENT instructor
    22  ELEMENT instructor
    23  ELEMENT instructor
To get at the underlying text nodes, we have to use the `text()` node test:

    / > xpath //instructor/text()
    Object is a Node Set :
    Set contains 23 nodes:
    1  TEXT
        content=N. Hinrichs
    2  TEXT
        content=Kurtz
    3  TEXT
        content=Razborov
    4  TEXT
        content=Montalban
    5  TEXT
        content=Kurtz
    6  TEXT
        content=Mulmuley
    7  TEXT
        content=Babai
    8  TEXT
        content=Simon
    9  TEXT
        content=Simon
    10  TEXT
        content=Kurtz
    11  TEXT
        content=Simon
    12  TEXT
        content=Hirschfeldt
    13  TEXT
        content=Babai
    14  TEXT
        content=Chuzhoy
    15  TEXT
        content=Soare
    16  TEXT
        content=Razborov
    17  TEXT
        content=Babai
    18  TEXT
        content=Mulmuley
    19  TEXT
        content=Mulmuley
    20  TEXT
        content=Babai
    21  TEXT
        content=Soare
    22  TEXT
        content=Razborov
    23  TEXT
        content=Soare
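The same element-versus-text distinction shows up when querying from a program. Below is a minimal Python sketch, assuming the standard library's `xml.etree.ElementTree` and the abbreviated course-schedule example from earlier in the lecture (not the 23-instructor file shown in the session above). ElementTree has no `text()` node test, so the `.text` attribute of each element stands in for it.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the course-schedule example (assumption: titles
# omitted, only four courses kept).
XML = """
<course-schedule>
  <quarter term="Autumn" year="2006">
    <course><number>CMSC 10500</number><instructor>Mulmuley</instructor></course>
    <course><number>CMSC 28000</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Winter" year="2007">
    <course><number>CMSC 16200</number><instructor>Kurtz</instructor></course>
  </quarter>
  <quarter term="Spring" year="2007">
    <course><number>CMSC 28100</number><instructor>Fortnow</instructor></course>
  </quarter>
</course-schedule>
"""

root = ET.fromstring(XML)

# Like "xpath //instructor": the element nodes themselves.
instructors = root.findall(".//instructor")

# Like "xpath //instructor/text()": the text carried by each element.
names = [e.text for e in instructors]

print([e.tag for e in instructors])
print(names)
```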
Exercise 21.1 There is a file courses.xml on the web site. Use xmllint to print
- the names of all of the courses that I teach (instructor="Kurtz")
- the list of quarters in which I teach.
Note that the format of the file is not exactly the same as I've discussed in this lecture.