Lecture 25

libxml2: an XML handling API for C

I tend to view XSLT as “Awk for Trees.” It's a nice, capable (even Turing complete) language, but it doesn't solve the problem of getting XML in and out of C programs. That's what libxml does. [N.B., I'm discussing libxml2, specifically.]


What you'd like to see in an API is a simple interface built on top of a small number of well-chosen abstractions. Most APIs treat objects as opaque, either de facto or de jure, and the effect is that all access to the objects provided by the API is through a well-defined (and generally hierarchically structured) set of procedures.

Generally speaking, you should be able to tackle simple tasks by learning a few of the basic API abstractions, and using the simplest (and earliest listed) constructor, destructor, accessor, and mutator procedures. It should be possible to solve more complicated tasks in a controlled way: learning a few supporting abstractions that enable you to modify the behavior of the basic objects, digging a bit deeper into the API documentation, etc.

Libxml is not like this. It is not clear to what extent this is reflective of the intrinsic complexity of XML, and to what extent it is reflective of a failure of abstraction and/or documentation. With libxml, the foundational structure (xmlNode) is not opaque—but the documentation does not make it as clear as it might which fields should never be accessed by the application developer, which fields should be treated as read-only, and which fields it is safe to alter. My strategy in dealing with this is to use procedural interfaces when possible, and to model my direct access to API objects on example code provided by the libxml developers. This is not a perfect world, but it is the world we have. Libxml isn't perfect, but it is the tool we have, and if you work within its spirit, it isn't bad. And it is fast—very, very fast.

Let's begin, though, with a simple example, a repeat of an early XSLT example. Consider the following XML (simple.xml):

<?xml version="1.0" encoding="utf-8"?>
<root>
  <foo>
    <bar>bar #1</bar>
  </foo>
  <bar>bar #2</bar>
</root>

Our goal is to write a C program that enables us to select a set of nodes via an XPath expression, and to print their values, one per line, e.g.,

$ xpath-value "//bar" simple.xml
bar #1
bar #2
$

Here is a libxml-based C program, xpath-value.c:

// xpath-value.c
// Print the contents of the nodes selected from an XML file
// by an XPath expression.
// Intended usage: $ xpath-value path-expression file.xml

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main(int argc, char **argv) {
    assert(argc == 3);
    char *xpathexp = argv[1];
    char *xmlfile = argv[2];

    xmlDocPtr doc = xmlParseFile(xmlfile);                        // 1
    assert(doc != NULL);
    xmlXPathContextPtr context = xmlXPathNewContext(doc);         // 2
    assert(context != NULL);
    xmlXPathObjectPtr result =
        xmlXPathEvalExpression((xmlChar *) xpathexp,context);     // 3
    assert(result != NULL);
    assert(result->type == XPATH_NODESET);
    xmlNodeSetPtr nodeset = result->nodesetval;
    // nodesetval may be NULL when nothing matched, so guard the count
    int count = (nodeset != NULL) ? nodeset->nodeNr : 0;

    xmlBufferPtr buffer = xmlBufferCreate();                      // 4
    assert(buffer != NULL);
    for (int i = 0; i < count; ++i) {                             // 5
        xmlNodePtr contentNode = nodeset->nodeTab[i];             // 6
        // call outside assert() so it still runs under -DNDEBUG
        int status = xmlNodeBufGetContent(buffer,contentNode);    // 7
        assert(status == 0);
        printf("%s\n",(const char *) xmlBufferContent(buffer));   // 8
        xmlBufferEmpty(buffer);                                   // 9
    }

    xmlBufferFree(buffer);                                        // 10
    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    exit(0);
}

Key lines & concepts...

  1. An xmlDocPtr is a pointer to an in-memory version of an XML document. Note that we can load an XML document into memory with a single, simple function call. Later, we'll see that you can write out a DOM with a single, simple function call. These are the real triumphs of libxml.
  2. To evaluate an XPath expression, we need an evaluation context. For this application, the appropriate context is the document node, so we don't need to make any changes to the xmlXPathContext.
  3. Here is where we evaluate the XPath expression. The result of evaluating the XPath expression can have one of several types; we're expecting that the type will be a node set.
  4. An xmlBuffer is just an automatically resizing string that we can write to using various functions in libxml.
  5. For every index of a node in our node set...
  6. Get the relevant contentNode. There is something crucially important going on here, which we should not gloss over. xmlNodePtr is, naturally enough, a pointer to an xmlNode. But what is an xmlNode? The answer is that it can be any of a number of things: it could be an element, an attribute, a text node, etc. To find out which, we would look at the type field, and possibly cast the xmlNodePtr to a more specific type. In practice, casting is not usually needed, since the xmlNode struct includes the most commonly used fields (even though they're not always needed). In this particular case, we don't actually need to know the type. We will later...
  7. Write the content of the contentNode to the buffer, making sure that there were no errors...
  8. Print the content of the buffer on a line.
  9. Empty the buffer (relevant for the 2nd and successive times through the loop)...
  10. Finally, call libxml to free each of the objects we asked it to allocate.

A final issue is the actual build process. Here is the Makefile:

CFLAGS= -std=c99 `xml2-config --cflags`
LDLIBS= `xml2-config --libs`
PATH=/bin:/usr/bin:.

xpath-value: xpath-value.o

xpath-value.o: xpath-value.c

test: xpath-value
	xpath-value "//bar" simple.xml

clean:
	rm -f xpath-value
	rm -f xpath-value.o

Note particularly the calls to xml2-config, which prints the compiler and linker flags for the installed libxml2. These could be expanded out by hand, but this is more portable.

Now, let's go on to consider a more realistic problem—the use of XML as a serialized data format. Recall our workflow from Lecture 20:

[Figure: workflow diagram]

Note that it is perfectly conceivable that your application format could be DOM, which makes writing the conversion particularly easy :-). The problem with this “solution” is that DOM is only occasionally an appropriate data structure, and working with DOM directly can greatly complicate the internals of data representation/storage, and cause a great deal of unnecessary conversion.

Let's work on a simple example ... Let's suppose we want to write a maze editor, and we're going to use XML to represent the maze. I'm going to focus just on serialization/deserialization, not the application logic of the editor.

Let's suppose we have a simple maze (represented here in ASCII art):

+ - + - + - +   + - + - +
|           |           |
+ - + - +   +   + - +   +
|       |           |   |
+ - +   + - + - +   +   +
|   |   |       |   |   |
+   +   + - +   +   + - +
|                       |
+ - + - + - + - + - + - +

Here's an internal representation—on the surface, we have a 4 x 6 array of open cells (which don't require any representation at all). Together with this, we have an array of 5 x 6 horizontal walls,

     |   |   |   |   |   |
     v   v   v   v   v   v
-> + - + - + - +   + - + - +
   |           |           |
-> + - + - +   +   + - +   +
   |       |           |   |
-> + - +   + - + - +   +   +
   |   |   |       |   |   |
-> +   +   + - +   +   + - +
   |                       |
-> + - + - + - + - + - + - +

and 4 x 7 vertical walls.

   |   |   |   |   |   |   |
   v   v   v   v   v   v   v
   + - + - + - +   + - + - +
-> |           |           |
   + - + - +   +   + - +   +
-> |       |           |   |
   + - +   + - + - +   +   +
-> |   |   |       |   |   |
   +   +   + - +   +   + - +
-> |                       |
   + - + - + - + - + - + - +

And, although again no representation is required, we have 5 x 7 pillars (represented by + symbols in this diagram).

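The C data type that we will use to represent such a maze might be

#include <stdbool.h>   // for bool

struct maze_t {
    int rows, cols;
    bool **hwall;   // (rows + 1) x cols horizontal walls
    bool **vwall;   // rows x (cols + 1) vertical walls
};
typedef struct maze_t maze_t;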

From our application's point of view, the maze API consists of the following primitives:

maze_t *create_maze(int rows, int cols);
void delete_maze(maze_t *maze);
bool get_vwall(maze_t *maze, int row, int col);
void set_vwall(maze_t *maze, int row, int col, bool wall);
bool get_hwall(maze_t *maze, int row, int col);
void set_hwall(maze_t *maze, int row, int col, bool wall);
void print_maze(maze_t *maze,bool coords);

We're going to avoid the actual implementation details...
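For readers who want something concrete to link against, here is one possible sketch of the storage primitives. It is not necessarily the implementation used in this course: the helpers alloc_grid and free_grid are hypothetical, the wall arrays are assumed to be (rows + 1) x cols and rows x (cols + 1) as described above, and print_maze is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct maze_t {
    int rows, cols;
    bool **hwall;   // (rows + 1) x cols
    bool **vwall;   // rows x (cols + 1)
};
typedef struct maze_t maze_t;

// Hypothetical helper: allocate an nrows x ncols grid with every wall absent.
static bool **alloc_grid(int nrows, int ncols) {
    bool **grid = malloc(nrows * sizeof *grid);
    assert(grid != NULL);
    for (int r = 0; r < nrows; ++r) {
        grid[r] = calloc(ncols, sizeof **grid);   // calloc zeroes, i.e. false
        assert(grid[r] != NULL);
    }
    return grid;
}

// Hypothetical helper: release a grid created by alloc_grid.
static void free_grid(bool **grid, int nrows) {
    for (int r = 0; r < nrows; ++r) {
        free(grid[r]);
    }
    free(grid);
}

maze_t *create_maze(int rows, int cols) {
    maze_t *maze = malloc(sizeof *maze);
    assert(maze != NULL);
    maze->rows = rows;
    maze->cols = cols;
    maze->hwall = alloc_grid(rows + 1, cols);
    maze->vwall = alloc_grid(rows, cols + 1);
    return maze;
}

void delete_maze(maze_t *maze) {
    free_grid(maze->hwall, maze->rows + 1);
    free_grid(maze->vwall, maze->rows);
    free(maze);
}

bool get_hwall(maze_t *maze, int row, int col) {
    assert(0 <= row && row <= maze->rows && 0 <= col && col < maze->cols);
    return maze->hwall[row][col];
}

void set_hwall(maze_t *maze, int row, int col, bool wall) {
    assert(0 <= row && row <= maze->rows && 0 <= col && col < maze->cols);
    maze->hwall[row][col] = wall;
}

bool get_vwall(maze_t *maze, int row, int col) {
    assert(0 <= row && row < maze->rows && 0 <= col && col <= maze->cols);
    return maze->vwall[row][col];
}

void set_vwall(maze_t *maze, int row, int col, bool wall) {
    assert(0 <= row && row < maze->rows && 0 <= col && col <= maze->cols);
    maze->vwall[row][col] = wall;
}
```

The bounds checks in the accessors are a design choice of this sketch: a malformed XML file with an out-of-range row or col then fails loudly instead of writing outside the arrays.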

We will have an XML representation of the maze above.

<?xml version="1.0" encoding="utf-8"?>
<maze rows="4" cols="6">
  <hwall row="0" col="0"/>
  <hwall row="0" col="1"/>
  <hwall row="0" col="2"/>
  <hwall row="0" col="4"/>
  <hwall row="0" col="5"/>
  <hwall row="1" col="0"/>
  <hwall row="1" col="1"/>
  <hwall row="1" col="4"/>
  <hwall row="2" col="0"/>
  <hwall row="2" col="2"/>
  <hwall row="2" col="3"/>
  <hwall row="3" col="2"/>
  <hwall row="3" col="5"/>
  <hwall row="4" col="0"/>
  <hwall row="4" col="1"/>
  <hwall row="4" col="2"/>
  <hwall row="4" col="3"/>
  <hwall row="4" col="4"/>
  <hwall row="4" col="5"/>
  <vwall row="0" col="0"/>
  <vwall row="0" col="3"/>
  <vwall row="0" col="6"/>
  <vwall row="1" col="0"/>
  <vwall row="1" col="2"/>
  <vwall row="1" col="5"/>
  <vwall row="1" col="6"/>
  <vwall row="2" col="0"/>
  <vwall row="2" col="1"/>
  <vwall row="2" col="2"/>
  <vwall row="2" col="4"/>
  <vwall row="2" col="5"/>
  <vwall row="2" col="6"/>
  <vwall row="3" col="0"/>
  <vwall row="3" col="6"/>
</maze>

This is what is sometimes called a "sparse representation"—we only represent the walls that are actually there. An interesting alternative would be to assume the exterior walls, and add hexit and vexit tags.

Let's tackle the deserialization process first. This is going to involve loading an XML document, and then interpreting the resulting DOM to build our maze...

static int getIntAttr(xmlNodePtr node, char *name);

maze_t *create_maze_from_XML(char *xmlfile) {
    // load the DOM...
    xmlDocPtr doc = xmlParseFile(xmlfile);
    assert(doc != NULL);

    // get the root element, make sure it's what we expect...
    xmlNodePtr root = xmlDocGetRootElement(doc);
    assert(root != NULL);
    assert(root->type == XML_ELEMENT_NODE);
    assert(xmlStrcmp((xmlChar *) "maze",root->name) == 0);

    // extract the attributes from this node....
    int rows = getIntAttr(root,"rows");
    int cols = getIntAttr(root,"cols");
    assert(0 < rows);
    assert(0 < cols);
    assert(rows < ROWMAX);   // ROWMAX and COLMAX: size limits assumed
    assert(cols < COLMAX);   // to be defined elsewhere

    // create an empty maze of the appropriate size...
    maze_t *maze = create_maze(rows,cols);

    // traverse the children and populate the walls...
    for (xmlNodePtr child = root->children; child != NULL; child = child->next) {
        if (child->type == XML_ELEMENT_NODE) {
            // we only care about element nodes...
            if (xmlStrcmp((xmlChar *) "vwall",child->name) == 0) {
                int row = getIntAttr(child,"row");
                int col = getIntAttr(child,"col");
                set_vwall(maze,row,col,true);
            } else if (xmlStrcmp((xmlChar *) "hwall",child->name) == 0) {
                int row = getIntAttr(child,"row");
                int col = getIntAttr(child,"col");
                set_hwall(maze,row,col,true);
            }
        }
    }

    // delete the xmlDoc...
    xmlFreeDoc(doc);
    return maze;
}

static int getIntAttr(xmlNodePtr node, char *name) {
    xmlChar *content = xmlGetProp(node,(xmlChar *) name);
    assert(content != NULL);
    int value = (int) strtol((char *) content,NULL,10);
    xmlFree(content);   // xmlGetProp returns a copy that the caller must free
    return value;
}
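One thing the code above glosses over is that strtol with a NULL end pointer silently accepts input such as "4x" or an empty string. If the XML comes from an untrusted source, a stricter conversion may be preferable. The following is a minimal sketch of one way to do that; parse_int_strict is a hypothetical helper, not part of libxml or of the maze API.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

// Hypothetical helper: parse a base-10 int from text.
// Returns 0 and stores the value in *out on success.
// Returns -1 if there are no digits, if characters follow the number,
// or if the value does not fit in an int.
static int parse_int_strict(const char *text, int *out) {
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return -1;   // nothing was converted, or trailing characters remain
    }
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) {
        return -1;   // out of range for long or for int
    }
    *out = (int) value;
    return 0;
}
```

An attribute reader built on this would pass the (char *) cast of the xmlGetProp result to parse_int_strict and report a failure to its caller, rather than asserting.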

Mostly, this code is self-explanatory. There are some things to note...

  1. libxml defines xmlChar to be unsigned char, whereas plain char is usually signed. The important thing to know here is that libxml represents strings internally in UTF-8, so every xmlChar * string is UTF-8 encoded. That's not a problem for us, but we do have to get the casts right. Note that libxml defines xmlStrcmp. This is basically strcmp, but it expects xmlChar *'s.
  2. xmlNodes have children, parent, next (sibling), and prev (sibling) fields. The traversal of child elements in the code above is pretty typical -- go to the first child, then follow the ->next links.
  3. Note how we're filtering the children for element nodes -- there are a *lot* of text nodes in the file that contain nothing but whitespace. We don't care about them. Moreover, if someone wants to add comment or processing-instruction nodes, we don't care about those either.
  4. Note that we're not doing anything with namespaces. There is a namespace field (xmlNs *ns) in each xmlNode that we could pay attention to, if we wanted to.

Next, for serialization.

static void setIntProp(xmlNodePtr node, char *name, int v);

xmlDocPtr create_doc_from_maze(maze_t *maze) {
    // create the document
    xmlDocPtr doc = xmlNewDoc((xmlChar *) "1.0");
    assert(doc != NULL);

    // create the root node, and set its attributes.
    xmlNodePtr root = xmlNewDocNode(doc,NULL,(xmlChar *) "maze",NULL);
    assert(root != NULL);
    setIntProp(root,"rows",maze->rows);
    setIntProp(root,"cols",maze->cols);

    // attach the root to the document.
    xmlDocSetRootElement(doc,root);

    // add the hwall elements...
    for (int r = 0; r <= maze->rows; ++r) {
        for (int c = 0; c < maze->cols; ++c) {
            if (get_hwall(maze,r,c)) {
                xmlNodePtr child = xmlNewDocNode(doc,NULL,(xmlChar *) "hwall",NULL);
                assert(child != NULL);
                setIntProp(child,"row",r);
                setIntProp(child,"col",c);
                xmlAddChild(root,child);
            }
        }
    }

    // add the vwall elements...
    for (int r = 0; r < maze->rows; ++r) {
        for (int c = 0; c <= maze->cols; ++c) {
            if (get_vwall(maze,r,c)) {
                xmlNodePtr child = xmlNewDocNode(doc,NULL,(xmlChar *) "vwall",NULL);
                assert(child != NULL);
                setIntProp(child,"row",r);
                setIntProp(child,"col",c);
                xmlAddChild(root,child);
            }
        }
    }
    return doc;
}

static void setIntProp(xmlNodePtr node, char *name, int v) {
    // snprintf is standard C99; asprintf is a GNU/BSD extension that
    // -std=c99 may hide, so a fixed buffer is used instead.
    char value[32];
    snprintf(value,sizeof value,"%d",v);
    xmlSetProp(node,(xmlChar *) name,(xmlChar *) value);
}

Again, now that the patterns are clear, the code isn't too bad. Of course, once we have a doc node, we want to be able to write it to a file. As you might guess, this is a one-liner:

xmlSaveFormatFile("output.xml",doc,1);

The only thing that requires explanation here is the final argument -- which is simply a pretty-printing flag. You'd set this to zero if you couldn't tolerate extra white-space content, or don't care about human-readability.