The 10-Minute XPath Tutorial - Automating System Administration with Perl

by David N. Blank-Edelman

Before we launch into XPath, we need to get three caveats out of the way.

First, in order to understand this appendix, you’ll need to have at least a moderate grip on the subject of XML. Be sure to read Appendix A, The Eight-Minute XML Tutorial if you haven’t already.

This excerpt is from Automating System Administration with Perl, Second Edition.

Second, XPath is a language unto itself. The XPath 1.0 spec consists of 34 relatively dense pages; the XPath 2.0 spec is 118 pages long. This appendix is not going to attempt to do any justice to the richness, expressiveness, and complexity of XPath (especially v2.0). Instead, it is going to focus on the subset of XPath that will be immediately useful to a Perl programmer.

Finally, this appendix will be sticking to XPath 1.0. As of this writing there are no solid Perl modules that I know of that support XPath 2.0.

With all of that aside, let’s get to questions like “What is XPath?” and, perhaps more importantly, “Why should I care?” XPath is a W3C spec for “a language for addressing parts of an XML document.” If you ever have to write code that attempts to select or extract certain parts of an XML document, XPath may make your life a great deal easier. It is a fairly terse but quite powerful language for this task and has a lovely “make it so” quality to it. If you can describe what data you are looking for using the XPath language (and you usually can), the XPath parser can fetch it for you, or allow you to point your program at the right part of the XML document. You can often achieve this with a single line of Perl.

XPath Basic Concepts

There are several basic concepts that you need to understand to be able to start using XPath. Let’s look at them one at a time in order of increasing complexity.

Basic Location Paths

To understand XPath, you have to start with the notion that an XML document can be parsed into a tree structure. The elements of the document (and the other stuff, but we’ll leave that out for now) serve as the nodes of the tree. To make this clearer, let’s pull in the sample XML file from Chapter 6, Working with Configuration Files. I’ll reprint it here so you don’t have to keep flipping back and forth to refer to it:

<?xml version="1.0" encoding="UTF-8"?>
<network>
    <description name="Boston">
        This is the configuration of our network in the Boston office.
    </description>
    <host name="agatha" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>agatha.example.edu</arec>
            <cname>mail.example.edu</cname>
            <addr>192.168.0.4</addr>
        </interface>
        <service>SMTP</service>
        <service>POP3</service>
        <service>IMAP4</service>
    </host>
    <host name="gil" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>gil.example.edu</arec>
            <cname>www.example.edu</cname>
            <addr>192.168.0.5</addr>
        </interface>
        <service>HTTP</service>
        <service>HTTPS</service>
    </host>
    <host name="baron" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>baron.example.edu</arec>
            <cname>dns.example.edu</cname>
            <cname>ntp.example.edu</cname>
            <cname>ldap.example.edu</cname>
            <addr>192.168.0.6</addr>
        </interface>
        <service>DNS</service>
        <service>NTP</service>
        <service>LDAP</service>
        <service>LDAPS</service>
    </host>
    <host name="mr-tock" type="server" os="openbsd">
        <interface name="fxp0" type="Ethernet">
            <arec>mr-tock.example.edu</arec>
            <cname>fw.example.edu</cname>
            <addr>192.168.0.1</addr>
        </interface>
        <service>firewall</service>
    </host>
    <host name="krosp" type="client" os="osx">
        <interface name="en0" type="Ethernet">
            <arec>krosp.example.edu</arec>
            <addr>192.168.0.100</addr>
        </interface>
        <interface name="en1" type="AirPort">
            <arec>krosp.wireless.example.edu</arec>
            <addr>192.168.100.100</addr>
        </interface>
    </host>
    <host name="zeetha" type="client" os="osx">
        <interface name="en0" type="Ethernet">
            <arec>zeetha.example.edu</arec>
            <addr>192.168.0.101</addr>
        </interface>
        <interface name="en1" type="AirPort">
            <arec>zeetha.wireless.example.edu</arec>
            <addr>192.168.100.101</addr>
        </interface>
    </host>
</network>

If we parse this into a node tree, it will look something like Figure B.1, “XML document node tree”.

The root of the tree points to the document’s root element (<network></network>). The other elements of the document hang off of the root. Each element node has associated attribute nodes (if it has any attributes) and a child text node that represents the contents of that element (if it has any character data in it). For example, if the XML said <element attrib="value">something</element>, the XPath parse would have one <element></element> node with an attribute node of attrib and a text node holding the string something. Be sure to stare at Figure B.1, “XML document node tree” until the XML document-to-node tree idea is firmly lodged in your head, because it is crucial to the rest of this material.

If this diagram reminds you of the tree-like diagrams in Chapter 2, Filesystems, that’s good. The resemblance is intentional. XPath uses the concept of a location path to navigate to a node or set of nodes in a document. Location paths start either at the top of the tree (an absolute path) or at some other place in the tree (a relative path). Just like in a filesystem, “/” at the beginning means “start at the root of the tree,” “.” (dot) refers to the current node (also known as the “context node”), and “..” (dot-dot) refers to the parent of the context node.

If you want, you can think of location paths as a way to point at a specific node or set of nodes in a diagram. For example, if we wanted to point at the <description></description> node, the location path would be /network/description. If we used a location path of /network/host, we would be referring to all of the <host></host> nodes at that level of the tree. Pointing at a node any further down the tree would require a way to distinguish between the different <host></host> nodes. How to do that leads to a whole other XPath topic; we’ll hold off on that question for just a moment so we can look at a few more of the navigational aspects of walking a node tree.

Figure B.1. XML document node tree

The information in our sample file consists of more than just markup tags; the file has real data in it. The elements themselves often have attributes (e.g., <interface name="en1" type="AirPort">) or act as labels for data (e.g., <addr>192.168.0.4</addr>). How do we get to those parts of the document? To get to an element’s attributes, we use an @ in front of the attribute name. For example, /network/description/@name gets us name="Boston". To access the contents of an element’s text node, we end the location path with text(), as in /network/description/text(). This returns the data This is the configuration....
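To make these first location paths concrete, here is a minimal sketch in Python rather than Perl, using the standard library's xml.etree.ElementTree module. That module understands only a small subset of XPath 1.0 (no leading /, no text(), no trailing @name), so each XPath from the text appears as a comment above a rough ElementTree equivalent. The trimmed sample and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: two hosts, details pared down.
SAMPLE = """\
<network>
    <description name="Boston">
        This is the configuration of our network in the Boston office.
    </description>
    <host name="agatha" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>agatha.example.edu</arec>
            <addr>192.168.0.4</addr>
        </interface>
        <service>SMTP</service>
    </host>
    <host name="gil" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>gil.example.edu</arec>
            <addr>192.168.0.5</addr>
        </interface>
        <service>HTTP</service>
    </host>
</network>
"""

# fromstring() hands back the <network> element, so every path below is
# written relative to it; ElementTree rejects a leading "/" on an element.
network = ET.fromstring(SAMPLE)

# /network/description
description = network.find("description")

# /network/description/@name
site = description.get("name")

# /network/description/text()  (surrounding whitespace trimmed for display)
blurb = description.text.strip()

# /network/host  (every <host> child of <network>)
host_names = [host.get("name") for host in network.findall("host")]

print(site)        # Boston
print(blurb)       # This is the configuration of our network in the Boston office.
print(host_names)  # ['agatha', 'gil']
```

A full XPath 1.0 engine would take the location paths exactly as written in the text; the translation step here exists only because of ElementTree's limits.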

Wildcards in XPath can function similarly to their filesystem analogs. /network/host/*/arec/text() finds all element nodes[136] under a <host></host> node that have <arec></arec> sub-nodes and then returns the contents of those <arec></arec> elements. In this case, we get back the DNS A resource record name associated with each interface:

agatha.example.edu
gil.example.edu
baron.example.edu
mr-tock.example.edu
krosp.example.edu
krosp.wireless.example.edu
zeetha.example.edu
zeetha.wireless.example.edu

Attributes can be wildcarded in a similar fashion by using @*. /network/host/@* would return all of the attributes of the <host></host> elements.

There’s one last piece of syntax worth mentioning before we get to the next section. XPath has what I call a “magic” location path operator. If you use two slashes (//) anywhere in the location path, it will search from that point down in the tree to try to locate the subsequent path elements. For example, if we say //arec/text(), we will get back the same set of interface A resource record names as in our previous example, because the operator will search from the root of the tree down to find all of the <arec></arec> elements that have text nodes. You can also place double slashes in the middle of a location path, as in /network//service/text(). Our sample file has a very shallow node tree, but you can imagine how the ability to describe a path without specifying all of the intervening parts of the tree might come in handy.
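The wildcard and double-slash forms can be tried out with a short sketch. It is written in Python with the standard library's xml.etree.ElementTree, which supports * and // but only relative to an element (hence the leading dot) and has no text() or @* steps, so the text and attributes are read off the returned elements instead. The trimmed sample and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: one server and one two-interface client.
SAMPLE = """\
<network>
    <host name="agatha" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>agatha.example.edu</arec>
            <cname>mail.example.edu</cname>
            <addr>192.168.0.4</addr>
        </interface>
        <service>SMTP</service>
        <service>POP3</service>
    </host>
    <host name="krosp" type="client" os="osx">
        <interface name="en0" type="Ethernet">
            <arec>krosp.example.edu</arec>
            <addr>192.168.0.100</addr>
        </interface>
        <interface name="en1" type="AirPort">
            <arec>krosp.wireless.example.edu</arec>
            <addr>192.168.100.100</addr>
        </interface>
    </host>
</network>
"""

network = ET.fromstring(SAMPLE)  # the <network> element

# /network/host/*/arec/text()
arecs = [arec.text for arec in network.findall("./host/*/arec")]

# /network/host/@*  (ElementTree keeps attributes in a dict on each element)
host_attrs = [dict(host.attrib) for host in network.findall("host")]

# //arec/text()
arecs_anywhere = [arec.text for arec in network.findall(".//arec")]

# /network//service/text()
services = [service.text for service in network.findall(".//service")]

print(arecs)
print(host_attrs[0])
print(arecs_anywhere == arecs)  # True: both paths reach the same <arec> nodes
print(services)                 # ['SMTP', 'POP3']
```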

Predicates

In the last section we daintily stepped over the question of how one specifies which branch or branches of a tree to follow if the elements at that level in the tree have the same name. In our example document, we have six <host></host> elements at the third level of the tree. They have different attributes and the data in each is different, but that doesn’t help if the location path is constructed with just element names. If we say /network/host, the word host is (in the parlance of the spec) acting as a “node test.” It selects which branch or branches under network to take when moving down the tree in our location path. But the node test in this example isn’t giving us the granularity we need to select a single branch.

That’s one place where XPath predicates come into play. Predicates allow you to filter the set of possible nodes provided by a node test to get just the ones you care about. /network/host returned all of the host nodes; we’d like a way to narrow down that set. Predicates are specified in square brackets ([]) in the location path itself. You insert a predicate right at the point where a filtering decision has to be made.

The simplest predicate example looks like an index number, as in /network/host[2]/interface/arec/text(). This location path returns the A resource record name(s) of the interface(s) on the second host node (second in document order). If you were standing and looking at all of the host nodes, the predicate would tell you which branch of the tree to take: in this case, the one in the second position.

Warning

Perl programmers should be familiar with this index-like syntax, but don’t get too comfortable. Unlike in Perl, the index numbers in XPath start with 1, not 0.
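The off-by-one difference is easy to see side by side. This is a small sketch in Python (standard library xml.etree.ElementTree, which does support integer position predicates after a tag name); the same contrast applies to Perl arrays. The trimmed sample and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: three hosts, interfaces only.
SAMPLE = """\
<network>
    <host name="agatha" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>agatha.example.edu</arec>
        </interface>
    </host>
    <host name="gil" type="server" os="linux">
        <interface name="eth0" type="Ethernet">
            <arec>gil.example.edu</arec>
        </interface>
    </host>
    <host name="krosp" type="client" os="osx">
        <interface name="en0" type="Ethernet">
            <arec>krosp.example.edu</arec>
        </interface>
    </host>
</network>
"""

network = ET.fromstring(SAMPLE)  # the <network> element

# /network/host[2]/interface/arec/text()  (XPath positions count from 1)
second_host_arecs = [
    arec.text for arec in network.findall("./host[2]/interface/arec")
]

# The same host through an ordinary list, which counts from 0.
second_host = network.findall("host")[1]

print(second_host_arecs)        # ['gil.example.edu']
print(second_host.get("name"))  # gil
```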

If index numbers were the only possible predicate, that would be a bit ho-hum. But here’s where XPath starts to get really cool. XPath has a relatively rich set of predicates available for use. The next level of predicate complexity looks something like this: /network/host[@name="agatha"]. This selects the correct <host></host> by testing for the presence of a specific attribute with a specific value.[137]

Predicates aren’t always found at the very end of a location path, either. You can work them into a larger location string. Let’s say we wanted to find the names of all of the Linux servers in our network. To get this information we could write a location path like /network/host[@os="linux"]/service/../@name. This location path uses a predicate to select all the <host></host> elements that have an os attribute of linux. It walks down the branch for each of the nodes in that set that have a <service></service> subelement (i.e., selecting only the hosts that are servers). At this point we’ve walked the tree all the way down to a <service></service> node, so we use ../@name to get to the name attribute of its parent (the <host></host> that contains the <service></service> we just found).
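Here is a rough sketch of the attribute predicates in action, in Python with the standard library's xml.etree.ElementTree. That module handles [@attr='value'] and a mid-path .. step, but not a trailing @name, so the name attribute is read from each returned element. The trimmed sample and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: interfaces omitted, four hosts kept.
SAMPLE = """\
<network>
    <host name="agatha" type="server" os="linux">
        <service>SMTP</service>
        <service>POP3</service>
    </host>
    <host name="gil" type="server" os="linux">
        <service>HTTP</service>
    </host>
    <host name="mr-tock" type="server" os="openbsd">
        <service>firewall</service>
    </host>
    <host name="krosp" type="client" os="osx">
    </host>
</network>
"""

network = ET.fromstring(SAMPLE)  # the <network> element

# /network/host[@name="agatha"]
agatha = network.find("./host[@name='agatha']")

# /network/host[@os="linux"]/service/../@name
# Walk down to each <service>, step back up to its parent <host>,
# then read the name attribute from that host.
linux_servers = [
    host.get("name")
    for host in network.findall("./host[@os='linux']/service/..")
]

print(agatha.get("type"))  # server
print(linux_servers)       # ['agatha', 'gil']
```

Note that agatha appears once even though it has two <service></service> children: stepping up from both lands on the same parent node.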

We can test the contents of a node like this: //host/service[text()='DNS']. This location path says to start at the root of the tree looking for branches that have a <service></service> node embedded in a <host></host> node. Once XPath finds a branch that fits this description, it compares the contents of each of those service nodes to find the one whose contents are “DNS”.

The location path is being nicer to the parser than it needs to be by calling text(). If we use a “.” (dot, meaning the current node) instead of text(), XPath will perform the comparison against that node’s contents. For an element that holds nothing but character data, as here, the two tests give the same result.
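A content test of this kind can be sketched in Python with the standard library's xml.etree.ElementTree, which accepts the dot form of the predicate (in Python 3.7 and later) but not text(). The trimmed sample and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: two hosts, services only.
SAMPLE = """\
<network>
    <host name="gil" type="server" os="linux">
        <service>HTTP</service>
        <service>HTTPS</service>
    </host>
    <host name="baron" type="server" os="linux">
        <service>DNS</service>
        <service>NTP</service>
    </host>
</network>
"""

network = ET.fromstring(SAMPLE)  # the <network> element

# //host/service[.='DNS']
dns_services = network.findall("./host/service[.='DNS']")

# Step back up with .. to learn which host offers that service.
dns_hosts = [
    host.get("name")
    for host in network.findall("./host/service[.='DNS']/..")
]

print([service.text for service in dns_services])  # ['DNS']
print(dns_hosts)                                   # ['baron']
```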

Testing for equality is only one of the comparison operators. Our sample data doesn’t offer a good way to demonstrate this, but predicates like [price > 31337] can be used to select nodes as well.

It’s starting to look like a real computer language, no? It gets even closer when we bring functions into the picture. XPath defines a whole bunch of functions for working with node sets, strings, Boolean operations, and numbers. In fact, we’ve seen some of them in action already, because /network/host[2]/interface/arec/text() really means /network/host[position()=2]/interface/arec/text().

Just to give you a taste of this, here’s a location path that selects the HTTP and HTTPS service nodes (allowing for any whitespace that might creep in around the service name): //host/service[starts-with(normalize-space(.),'HTTP')]. The string function starts-with() does just what you would expect it to: it returns true if the thing being compared (the contents of the current node) begins with the string provided in the second argument. The XPath spec has a list of the available functions, though it is a little less beginner-friendly than one might like. Searching for “XPath predicate” on the Web can lead to other resources that help explain the spec.
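To show what that predicate computes, here is a sketch that emulates it by hand in Python. The standard library's xml.etree.ElementTree has no XPath functions, so normalize_space() and string_value() below are hypothetical helpers standing in for the XPath behavior, not part of any library. The whitespace around HTTPS is added for illustration and is not in the original sample file.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the whitespace around HTTPS is an assumption added
# here to give normalize-space() something to do.
SAMPLE = """\
<network>
    <host name="agatha" type="server" os="linux">
        <service>SMTP</service>
    </host>
    <host name="gil" type="server" os="linux">
        <service>HTTP</service>
        <service>
            HTTPS
        </service>
    </host>
</network>
"""


def string_value(element):
    """Stand-in for the string-value of '.': all text inside the element."""
    return "".join(element.itertext())


def normalize_space(text):
    """Rough stand-in for XPath's normalize-space(): trim both ends and
    collapse internal runs of whitespace to single spaces. Python's split()
    treats a few more characters as whitespace than XPath does."""
    return " ".join(text.split())


network = ET.fromstring(SAMPLE)  # the <network> element

# //host/service[starts-with(normalize-space(.),'HTTP')]
web_services = [
    normalize_space(string_value(service))
    for service in network.findall("./host/service")
    if normalize_space(string_value(service)).startswith("HTTP")
]

print(web_services)  # ['HTTP', 'HTTPS']
```

With a complete XPath 1.0 engine, the single location path in the comment would do all of this work.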

Abbreviations and Axes

This appendix started with the simplest core ideas of XPath, and each section along the way has incorporated more complexity and nuance. Let’s add one last level of subtlety by circling back to the original discussion of location paths. It turns out that all of the location paths we’ve seen so far have been written in what the spec calls an “abbreviated syntax.” The unabbreviated syntax is one of those things that you almost never need, but when you do, you really need it. We’re going to look at it quickly here just so you know it is available if you get into one of those situations.

So what exactly got abbreviated in the location paths we’ve seen so far? When we said /network/host[2]/service[1]/text(), it actually meant:

  1. Start at the root of the tree.

  2. Walk toward the children of the root node (i.e., down the tree), looking for the child node or nodes with the element name network.

  3. Arrive at the <network></network> node. This becomes the context node.

  4. Walk toward the children of the context node, looking for the child node or nodes with the element name host.

  5. Arrive at the level in the tree that has several <host></host> nodes. Filter to choose the node in the second position. This becomes the context node.

  6. Walk toward the children of the context node, looking for the child node or nodes with the element name service.

  7. Arrive at the level in the tree that has several <service></service> nodes. Filter to choose the node in the first position. This becomes the context node.

  8. Walk toward the text node associated with the context node. Done.
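The steps above can be sketched as an explicit walk, one step per line. This is Python with the standard library's xml.etree.ElementTree; the step() helper is a hypothetical illustration of what a child step with a position filter does, not how any XPath engine is actually implemented. The trimmed sample is also an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: two hosts, services only.
SAMPLE = """\
<network>
    <host name="agatha" type="server" os="linux">
        <service>SMTP</service>
        <service>POP3</service>
    </host>
    <host name="gil" type="server" os="linux">
        <service>HTTP</service>
        <service>HTTPS</service>
    </host>
</network>
"""


def step(context_nodes, name, position=None):
    """For each context node, select its child elements called `name`.
    If `position` is given, keep only the child at that 1-based position
    among the matches under each context node."""
    selected = []
    for node in context_nodes:
        matches = [child for child in node if child.tag == name]
        if position is not None:
            matches = matches[position - 1:position]
        selected.extend(matches)
    return selected


network = ET.fromstring(SAMPLE)

# Steps 1-3: from the root, take the child named network.
# (ElementTree has no separate root node, so we check the tag directly.)
context = [network] if network.tag == "network" else []

# Steps 4-5: children named host, filtered to the second position.
context = step(context, "host", position=2)

# Steps 6-7: children named service, filtered to the first position.
context = step(context, "service", position=1)

# Step 8: the text node of whatever is left.
texts = [node.text for node in context]

print(texts)  # ['HTTP']
```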

If we were to write that out in the unabbreviated syntax, it would look like the following (this is all one long location path split onto two lines):

/child::network/child::host[position()=2]/child::service[position()=1]/
child::text()

The key things we’ve added in this path are the axes (the plural of axis; we’re not talking weaponry here). For each step in the location path, we can include an axis to tell the parser which direction to go in the tree relative to the context node. In this case we’re telling it at each step to follow the child:: axis; that is, to move to the children of the context node. We’re so used to filesystem paths that describe a walk from directory to subdirectory to target file that we don’t think too hard when faced with the /dir/sub-dir/file syntax. This is why the abbreviated XPath syntax works so nicely. But XPath doesn’t restrict us to moving from child node to child node down the tree. We’ve seen one example of this freedom already with the // syntax. In the spec, // is short for /descendant-or-self::node()/, so when we say /network//cname, we are really indicating /child::network/descendant-or-self::node()/child::cname. That is:

  1. Start from the root.

  2. Move to its child nodes to find a <network></network> node or nodes. When we find one, it becomes the context node.

  3. Look at the context node and every node below it in the tree, and select any <cname></cname> children of those nodes.

The other three axes you already know how to reference in abbreviated form are self:: (.), parent:: (..), and attribute:: (@). The unabbreviated syntax lets us use all of the other axes—eight more, believe it or not: ancestor::, following-sibling::, preceding-sibling::, following::, preceding::, namespace::, descendant::, and ancestor-or-self::.

Of these, following-sibling:: is probably the most useful, so I’m only going to describe and demonstrate that one. The references section of this appendix points you at other texts that have good descriptions of the other axes. The following-sibling:: axis tells the parser to move sideways to the nodes that share the context node’s parent and come after the context node in document order; in other words, it references the context node’s later siblings. If we wanted to write a location path that tried to find all of the hosts with multiple interfaces, we could write (again, as one long line):

/child::network/child::host/child::interface/following-sibling::interface/
parent::host/attribute::name

This essentially says, “Walk down from the network node until you find a host with an interface node as its child, then see if it has a sibling interface at the same level in the tree. If it does, walk back up to the host node and return its name attribute.”
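The same walk can be emulated by hand to see what the axis is doing. The sketch below is Python with the standard library's xml.etree.ElementTree, which has no following-sibling:: axis; following_siblings() is a hypothetical helper written for this illustration, and the trimmed sample and variable names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: one single-interface host and the
# two clients that each have a second interface.
SAMPLE = """\
<network>
    <host name="agatha" type="server" os="linux">
        <interface name="eth0" type="Ethernet"/>
        <service>SMTP</service>
    </host>
    <host name="krosp" type="client" os="osx">
        <interface name="en0" type="Ethernet"/>
        <interface name="en1" type="AirPort"/>
    </host>
    <host name="zeetha" type="client" os="osx">
        <interface name="en0" type="Ethernet"/>
        <interface name="en1" type="AirPort"/>
    </host>
</network>
"""


def following_siblings(parent, node):
    """Child elements of `parent` that come after `node` in document order."""
    children = list(parent)
    return children[children.index(node) + 1:]


network = ET.fromstring(SAMPLE)  # the <network> element

multi_homed = []
for host in network.findall("host"):
    for interface in host.findall("interface"):
        later = following_siblings(host, interface)
        if any(sibling.tag == "interface" for sibling in later):
            # Walk back up to the host and record its name attribute.
            multi_homed.append(host.get("name"))
            break  # one entry per host, like a duplicate-free node set

print(multi_homed)  # ['krosp', 'zeetha']
```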

Further Exploration

If you find XPath really interesting and you want to get even deeper into it, there are definitely some places you can explore outside the scope of this appendix. Be sure to read the specification and other references listed in the next section. Learn about the other predicates and axes available to you. Become acquainted with XPath 2.0, so when a Perl module that can use it becomes available, you’ll be ready. And in general, just play around with the language until you feel comfortable with it and it can become another handy tool in your toolchest.

References for More Information

http://www.w3.org/TR/xpath and http://www.w3.org/TR/xpath20 are the locations of the official XPath 1.0 and 2.0 specifications. I’d recommend reading them after you’ve had a chance to read a good tutorial or two (like those listed here).

XML in a Nutshell, Third Edition, by Elliotte Rusty Harold and W. Scott Means (O’Reilly), and Learning XML, Second Edition, by Erik T. Ray (O’Reilly), both have superb sections on XPath. Of the tutorials I’ve seen so far, they are the best.

http://www.zvon.org/xxl/XPathTutorial/General/examples.html is a tutorial that consists mostly of example location paths and how they map onto a sample document. If you like to learn by example, this can be a helpful resource.

There are various tools that allow you to type an XPath expression and see what it returns based on a sample document. Some parsers (e.g., the libxml2 parser) even ship with tools that provide this functionality. Get one, as they are really helpful for creating and debugging location paths. The one I use most of the time is built into the Oxygen XML editor.

Another cool tool for working with XML documents via XPath is XSH2 by Petr Pajas, the current maintainer of XML::LibXML. It lets you manipulate XML documents using XPath 1.0 as easily as you can manipulate files using filesystem paths.



[136] Earlier in this appendix I mentioned that XPath parses the document into a set of nodes that include both the elements and “other stuff.” The wildcard * matches just element nodes, whereas node() matches all kinds of nodes (element nodes and the “other stuff”).

[137] Before we go any further, it is probably worthwhile making something implicit in this discussion explicit: if a node test fails (e.g., if we tried to find the node or nodes at /network/admin/homephonenumber in this document), it doesn’t return anything. There’s no error, the program doesn’t stop, etc.
