Lecture 22

Administrivia

If you haven't signed up for a final presentation, please do so soon. Also, please check the presentation schedule to make sure you're scheduled on the date you expect, and with (or without) a projector, as you require. If your talk title is currently "TBA," please think about your presentation, and send me a title as soon as possible.

ElementTree—A Python API for XML

ElementTree has been part of the Python standard library since Python 2.5. We will consider the use of ElementTree in Python 3.

There are two standard approaches w.r.t. language-specific XML handling APIs:

  1. Closely follow the W3C's DOM specification. This approach has the advantage that programmers can move from language to language and use familiar APIs with a consistent vocabulary. But it has the disadvantage that it can't take advantage of any language-specific data structures or abstractions, and so often feels un-idiomatic: the moral equivalent of manipulating XML with chopsticks.
  2. Develop a language-specific view of an XML InfoSet, which takes advantage of language-specific abstractions. This ends up "feeling natural" so long as you're working in the host language, but can make it challenging to port XML handling code from one language to another, and requires learning a language-specific XML API.

ElementTree (like Java's JDOM) follows the second approach, using native Python constructs and concepts to represent XML. There is a semi-reasonable ElementTree tutorial online.
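
To make the contrast between the two approaches concrete, here is a minimal sketch that reads the same attribute and the same text, first through the standard library's DOM-style module (xml.dom.minidom) and then through ElementTree. The sample document is made up for this comparison.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample document, used only for this comparison.
xml_text = '<meet name="Totally Rad"><gymnast><name>Amy</name></gymnast></meet>'

# Approach 1: DOM-style API (W3C vocabulary: getAttribute, firstChild, data).
dom = minidom.parseString(xml_text)
dom_meet = dom.documentElement
dom_name = dom_meet.getAttribute('name')
dom_first = dom_meet.getElementsByTagName('name')[0].firstChild.data

# Approach 2: ElementTree (Python vocabulary: get, find, text).
root = ET.fromstring(xml_text)
et_name = root.get('name')
et_first = root.find('gymnast/name').text

print(dom_name, dom_first)
print(et_name, et_first)
```

Both halves extract the same two strings; the difference is in how much of the vocabulary is Python's and how much is the DOM's.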

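The core type is Element. An Element has the following members: tag (the element name, a string), attrib (a dictionary of attributes), text (the text before the first child, or None), tail (the text after the element's closing tag, or None), and a sequence of child elements.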
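
A minimal sketch of inspecting an Element's members (tag, attrib, text, tail, and the child sequence), assuming a one-element document invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document for illustration.
bars = ET.fromstring('<bars place="2">9.050</bars>')

print(bars.tag)      # the element name, a string
print(bars.attrib)   # the attributes, a dictionary
print(bars.text)     # the text before the first child (or None)
print(bars.tail)     # the text after the closing tag (None here)
print(len(bars))     # the number of child elements
```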

Let's do a very simple example. Here is a bit of XML that describes scores from a Gymnastics meet (this was a team my daughter coached...):

<?xml version="1.0" encoding="utf-8"?>
<meet name="Totally Rad" location="Kalamazoo, MI" date="2007-02-25">
  <gymnast level="prep-opt/silver">
    <name>Amy</name>
    <vault></vault>
    <bars>8.550</bars>
    <beam></beam>
    <floor></floor>
  </gymnast>
  <gymnast level="prep-opt/silver">
    <name>Christen</name>
    <vault>8.250</vault>
    <bars place="2">9.050</bars>
    <beam>7.850</beam>
    <floor place="2">9.125</floor>
    <all place="8"/>
  </gymnast>
  <gymnast level="prep-opt/bronze">
    <name>Jackie</name>
    <vault place="2">8.150</vault>
    <bars place="3">8.150</bars>
    <beam>8.150</beam>
    <floor place="4">8.600</floor>
    <all place="3"/>
  </gymnast>
  <gymnast level="prep-opt/silver">
    <name>Miranda</name>
    <vault></vault>
    <bars place="8">8.700</bars>
    <beam place="2">9.150</beam>
    <floor place="9">8.700</floor>
  </gymnast>
</meet>

The computational goal is to produce a “pretty” textual output as follows, with the gymnasts sorted in descending order by all-around score:

Totally Rad
Feb 25, 2007

name         vault    bars    beam   floor     all
---------    -----   -----   -----   -----   -----
Christen     8.250   9.050   7.850   9.125  34.275
Jackie       8.150   8.150   8.150   8.600  33.050
Miranda      .....   8.700   9.150   8.700  26.550
Amy          .....   8.550   .....   .....   8.550

Here's the code:

#!/usr/bin/env python3.3
# meet -- compute a result table for a gymnastic meet

import xml.etree.ElementTree as ET
from datetime import date
from sys import argv

template = '{:<10} {:>7} {:>7} {:>7} {:>7} {:>7}'

def num(text):
    if text is None:
        return 0.0
    return float(text)

def nstr(s):
    if s is None:
        return '.....'
    return s

class Gymnast(object):
    events = ['vault','bars','beam','floor']

    def __init__(self,element):
        self.name = element.find('name').text
        for field in Gymnast.events:
            setattr(self,field,element.find(field).text)
        self.all = sum(map(lambda field: num(getattr(self,field)), Gymnast.events))

    def __str__(self):
        return template.format(nstr(self.name),
                               nstr(self.vault),
                               nstr(self.bars),
                               nstr(self.beam),
                               nstr(self.floor),
                               '{: 6.3f}'.format(self.all))

    def __lt__(self,other):
        return other.all < self.all

# grab the XML
doc = ET.parse(argv[1])
meet = doc.getroot()

# print the header
print(meet.get('name'))
year,month,day = meet.get('date').split('-')
print(date(int(year),int(month),int(day)).strftime('%b %d, %Y'))
print()
print(template.format('name','vault','bars','beam','floor','all'))
print(template.format('---------','-----','-----','-----','-----','------'))

# print the table entries
gymnasts = list(map(Gymnast,doc.findall('.//gymnast')))
gymnasts.sort()
for g in gymnasts:
    print(g)

Things to note:

import xml.etree.ElementTree as ET

This loads the xml.etree.ElementTree module under the name ET, which saves us a lot of typing. We've seen this sort of thing before in Haskell.

doc = ET.parse(argv[1])

This is huge, and a big part of the point of XML APIs. We reduce the complex task of opening an XML file, parsing it, and building an in-memory representation of the document it contains into a single line of code.

meet = doc.getroot()

Get the root element of the ElementTree. Note that meet has type Element, and this is the key!
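
A small sketch of the parse / getroot pair that runs without a file on disk. It assumes an in-memory stand-in for the meet file (ET.parse accepts either a filename or a file object), and the tiny document here is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the meet file on disk.
source = io.StringIO('<meet name="Totally Rad" date="2007-02-25"><gymnast/></meet>')

doc = ET.parse(source)   # an ElementTree wrapping the whole document
meet = doc.getroot()     # the root Element

print(type(doc).__name__, type(meet).__name__)
print(meet.tag, len(meet))
```

The distinction matters later: most of the interesting methods (get, find, findall) live on Element, while the ElementTree wrapper mostly handles reading and writing whole documents.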

Note the use of the get method rather than just using the attrib field — this is an efficiency hack. ElementTree doesn't bother building the dictionary unless you use it.

print(meet.get("name"))

But we could have written

print(meet.attrib["name"])

if we wanted to remember a bit less, were willing to type a bit more, and didn't mind a (minor, in this case) performance hit.
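
Apart from performance, the two spellings differ in how they treat a missing attribute. A minimal sketch, assuming a cut-down meet element with no date attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down element: note that there is no "date" attribute.
meet = ET.fromstring('<meet name="Totally Rad" location="Kalamazoo, MI"/>')

via_get = meet.get('name')
via_attrib = meet.attrib['name']

# get returns None (or a supplied default) when the attribute is missing ...
missing = meet.get('date')
fallback = meet.get('date', 'unknown')

# ... while indexing the attrib dictionary raises KeyError.
try:
    meet.attrib['date']
    raised = False
except KeyError:
    raised = True

print(via_get, via_attrib, missing, fallback, raised)
```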

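We have two XPath-like methods: find, which returns the first matching element (or None if there is no match), and findall, which returns a list of all matching elements.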

Note that we can use len(element) and element[i] for array-like access to immediate children.

Note the use of find + text in the Gymnast constructor.

Note the use of findall in the list constructor for gymnasts.
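
The notes above on find, findall, len, and indexing can be sketched together as follows. The two-gymnast document is a trimmed, made-up version of the meet file.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed-down meet document.
meet = ET.fromstring(
    '<meet>'
    '<gymnast><name>Amy</name><vault></vault><bars>8.550</bars></gymnast>'
    '<gymnast><name>Jackie</name><vault>8.150</vault><bars>8.150</bars></gymnast>'
    '</meet>')

first = meet.find('gymnast')              # first matching child
everyone = meet.findall('.//gymnast')     # list of all matching descendants
names = [g.find('name').text for g in everyone]

amy_vault = first.find('vault').text      # None: the element exists but is empty
no_beam = first.find('beam')              # None: there is no such child at all

# Array-like access to immediate children.
child_count = len(meet)
second_name = meet[1][0].text

print(names, amy_vault, no_beam, child_count, second_name)
```

Note the two different ways of getting None: an empty element has text None, while a missing element makes find itself return None, in which case asking for .text would raise AttributeError.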

If, at the end of all of this, you're struck by the thought that this could have been done more quickly and easily with a simple plain-text data format and a dozen lines of AWK, I have two quick thoughts:

  1. You're well on your way to being a Unix hacker/master, and
  2. There's a cost and a benefit associated with generality.

Note that namespaces are handled by Clark syntax, e.g.,

element.find(".//{http://xml.namespace.org/mynamespace}tag")

This makes it a bit easier to deal with ad-hoc XML than “principled” (i.e., namespace-aware) XML.
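
A minimal sketch of Clark syntax in action. The namespace URI and the m: prefix are placeholders invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and document.
NS = 'http://example.org/meet'
root = ET.fromstring(
    '<m:meet xmlns:m="http://example.org/meet">'
    '<m:gymnast><m:name>Amy</m:name></m:gymnast>'
    '</m:meet>')

# Tags are stored with the namespace URI expanded in braces.
print(root.tag)   # {http://example.org/meet}meet

# So searches have to spell the namespace out as well ...
name = root.find('.//{%s}name' % NS)

# ... and the unqualified tag name finds nothing.
plain = root.find('.//name')

# find also accepts a prefix-to-URI dictionary, which saves some typing.
same = root.find('.//m:name', {'m': NS})

print(name.text, plain, same is name)
```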

Two oddities:

1. ElementTree doesn't use text nodes, and this creates an issue in the context of "mixed content" uses of XML, à la XHTML. Consider:

<p>This is a paragraph. <span class="emph">With a bold sentence!!</span> And then back to normal.</p>

What does ElementTree do with this?

We have just two elements, a "p" and its child "span": p.text is "This is a paragraph. " (including the trailing space); span.text is "With a bold sentence!!". Where does " And then back to normal." go? Answer: span.tail (!). I.e., text content that immediately follows a closing tag and precedes any other open or close tag becomes the tail field of the just-closed element.

This seems awkward, but I don't have any real experience, and it's not clear that there's a better alternative.
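
The text/tail split can be checked directly. This sketch parses the paragraph above and, as one possible way of reassembling the mixed content, uses itertext, which walks text and tail strings in document order.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p>This is a paragraph. <span class="emph">With a bold sentence!!</span>'
    ' And then back to normal.</p>')
span = p[0]

print(repr(p.text))      # 'This is a paragraph. '
print(repr(span.text))   # 'With a bold sentence!!'
print(repr(span.tail))   # ' And then back to normal.'

# Reassemble the full text content in document order.
full = ''.join(p.itertext())
print(full)
```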

2. ElementTree doesn't keep child->parent pointers. If you need them, you'll have to add them yourselves. This isn't hard to do...
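
One way to add the missing parent pointers, sketched here under the assumption that the tree isn't modified afterwards, is to build a child-to-parent dictionary in a single pass. The name parent_of and the small document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for illustration.
root = ET.fromstring(
    '<meet>'
    '<gymnast><name>Amy</name></gymnast>'
    '<gymnast><name>Jackie</name></gymnast>'
    '</meet>')

# Visit every element and record it as the parent of each of its children.
parent_of = {child: parent for parent in root.iter() for child in parent}

name = root.find('.//name')
print(parent_of[name].tag)               # gymnast
print(parent_of[parent_of[name]].tag)    # meet
print(root in parent_of)                 # False: the root has no parent
```

The dictionary is a snapshot: if elements are added or removed later, it has to be rebuilt or updated by hand.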

Note that modifying an ElementTree InfoSet (the model used by ElementTree) is pretty easy. Tag names can be changed by assignment. Attributes can be added or deleted using dictionary or shortcut methods. New elements can be added or deleted using sequence methods, e.g., append, insert, remove, and friends.
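
A short sketch of each kind of modification mentioned above. The team attribute, the new beam score, and the renamed tag are invented for illustration.

```python
import xml.etree.ElementTree as ET

gymnast = ET.fromstring(
    '<gymnast level="prep-opt/silver"><name>Amy</name><vault></vault></gymnast>')

gymnast.tag = 'athlete'                  # rename the tag by assignment
gymnast.set('team', 'Kalamazoo')         # add an attribute with the shortcut method
del gymnast.attrib['level']              # delete an attribute via the dictionary

beam = ET.Element('beam')                # build a new element ...
beam.text = '9.000'
gymnast.append(beam)                     # ... and append it as the last child

gymnast.remove(gymnast.find('vault'))    # remove an existing child

result = ET.tostring(gymnast, encoding='unicode')
print(result)
```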

And finally, XML can be written by calling the write method on an ElementTree object (the tree wrapper returned by ET.parse, not an individual Element).
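
A minimal sketch of writing, assuming we build a small tree from scratch and write it to an in-memory buffer rather than a real file (write also accepts a filename). The document contents are made up.

```python
import io
import xml.etree.ElementTree as ET

# Build a tiny hypothetical document from scratch.
root = ET.Element('meet', name='Totally Rad')
ET.SubElement(root, 'gymnast', level='prep-opt/silver')

# write lives on the ElementTree wrapper, so wrap the root Element first.
tree = ET.ElementTree(root)
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-8', xml_declaration=True)

written = buffer.getvalue().decode('utf-8')
print(written)

# Round trip: what we wrote parses back to the same structure.
reread = ET.fromstring(buffer.getvalue())
print(reread.get('name'), reread[0].get('level'))
```

If all you have is an Element and you just want a string, ET.tostring(element, encoding='unicode') avoids the wrapper entirely.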

Exercise 22.1 Write a bit of XML, and code that does something with it using ElementTree. It might be cool to try to use the XML you wrote for Exercise 20.1. The level of the example in class is fine, but try to do something a bit different. E.g., imagine a simple XML syntax for expressions, and write an evaluator...