Before we launch into XPath, we need to get three caveats out of the way.
First, in order to understand this appendix, you’ll need to have at least a moderate grip on the subject of XML. Be sure to read Appendix A, The Eight-Minute XML Tutorial if you haven’t already.
This excerpt is from Automating System Administration with Perl, Second Edition.
Second, XPath is a language unto itself. The XPath 1.0 spec consists of 34 relatively dense pages; the XPath 2.0 spec is 118 pages long. This appendix is not going to attempt to do any justice to the richness, expressiveness, and complexity of XPath (especially v2.0). Instead, it is going to focus on the subset of XPath that will be immediately useful to a Perl programmer.
Finally, this appendix will be sticking to XPath 1.0. As of this writing there are no solid Perl modules that I know of that support XPath 2.0.
With all of that aside, let’s get to questions like “What is XPath?” and, perhaps more importantly, “Why should I care?” XPath is a W3C spec for “a language for addressing parts of an XML document.” If you ever have to write code that attempts to select or extract certain parts of an XML document, XPath may make your life a great deal easier. It is a fairly terse but quite powerful language for this task and has a lovely “make it so” quality to it. If you can describe what data you are looking for using the XPath language (and you usually can), the XPath parser can fetch it for you, or allow you to point your program at the right part of the XML document. You can often achieve this with a single line of Perl.
There are several basic concepts that you need to understand to be able to start using XPath. Let’s look at them one at a time in order of increasing complexity.
To understand XPath, you have to start with the notion that an XML document can be parsed into a tree structure. The elements of the document (and the other stuff, but we’ll leave that out for now) serve as the nodes of the tree. To make this clearer, let’s pull in the sample XML file from Chapter 6, Working with Configuration Files. I’ll reprint it here so you don’t have to keep flipping back and forth to refer to it:
<?xml version="1.0" encoding="UTF-8"?>
<network>
  <description name="Boston">
    This is the configuration of our network in the Boston office.
  </description>
  <host name="agatha" type="server" os="linux">
    <interface name="eth0" type="Ethernet">
      <arec>agatha.example.edu</arec>
      <cname>mail.example.edu</cname>
      <addr>192.168.0.4</addr>
    </interface>
    <service>SMTP</service>
    <service>POP3</service>
    <service>IMAP4</service>
  </host>
  <host name="gil" type="server" os="linux">
    <interface name="eth0" type="Ethernet">
      <arec>gil.example.edu</arec>
      <cname>www.example.edu</cname>
      <addr>192.168.0.5</addr>
    </interface>
    <service>HTTP</service>
    <service>HTTPS</service>
  </host>
  <host name="baron" type="server" os="linux">
    <interface name="eth0" type="Ethernet">
      <arec>baron.example.edu</arec>
      <cname>dns.example.edu</cname>
      <cname>ntp.example.edu</cname>
      <cname>ldap.example.edu</cname>
      <addr>192.168.0.6</addr>
    </interface>
    <service>DNS</service>
    <service>NTP</service>
    <service>LDAP</service>
    <service>LDAPS</service>
  </host>
  <host name="mr-tock" type="server" os="openbsd">
    <interface name="fxp0" type="Ethernet">
      <arec>mr-tock.example.edu</arec>
      <cname>fw.example.edu</cname>
      <addr>192.168.0.1</addr>
    </interface>
    <service>firewall</service>
  </host>
  <host name="krosp" type="client" os="osx">
    <interface name="en0" type="Ethernet">
      <arec>krosp.example.edu</arec>
      <addr>192.168.0.100</addr>
    </interface>
    <interface name="en1" type="AirPort">
      <arec>krosp.wireless.example.edu</arec>
      <addr>192.168.100.100</addr>
    </interface>
  </host>
  <host name="zeetha" type="client" os="osx">
    <interface name="en0" type="Ethernet">
      <arec>zeetha.example.edu</arec>
      <addr>192.168.0.101</addr>
    </interface>
    <interface name="en1" type="AirPort">
      <arec>zeetha.wireless.example.edu</arec>
      <addr>192.168.100.101</addr>
    </interface>
  </host>
</network>
If we parse this into a node tree, it will look something like Figure B.1, “XML document node tree”.
The root of the tree points to the document’s root element
(<network></network>).
The other elements of the document hang off of the root. Each element
node has associated attribute nodes (if it has any attributes) and a
child text node that represents the contents of that element (if it has
any character data in it). For example, if the XML said <element
attrib="value">something</element>, the XPath parse
would have one <element></element> node with
an attribute node of attrib and a
text node holding the string something. Be sure to stare at Figure B.1, “XML document node tree” until the XML document-to-node tree
idea is firmly lodged in your head, because it is crucial to the rest of
this material.
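To make the element/attribute/text relationship concrete, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree purely as an illustration (an assumption of this example; the book itself works in Perl). ElementTree exposes attributes as a dictionary and character data as a property rather than as separate node objects, so treat it as an approximation of the XPath node tree, not the real thing.

```python
import xml.etree.ElementTree as ET

# Parse the one-element example from the text.
node = ET.fromstring('<element attrib="value">something</element>')

print(node.tag)     # the element node's name
print(node.attrib)  # its attribute nodes, exposed as a dictionary
print(node.text)    # the character data held by its child text node
```

The three values printed correspond to the element node, its attribute node, and its text node.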
If this diagram reminds you of the tree-like diagrams in Chapter 2, Filesystems, that’s good. The resemblance is intentional. XPath uses the concept of a location path to navigate to a node or set of nodes in a document. Location paths start either at the top of the tree (an absolute path) or at some other place in the tree (a relative path). Just like in a filesystem, “/” at the beginning means “start at the root of the tree,” “.” (dot) refers to the current node (also known as the “context node”), and “..” (dot-dot) refers to the parent of the context node.
If you want, you can think of location paths as a way to point at
a specific node or set of nodes in a diagram. For example, if we wanted
to point at the <description></description> node, the location
path would be /network/description.
If we used a location path of /network/host, we would be referring to all of
the <host></host> nodes
at that level of the tree. Pointing at a node any further down the tree
would require a way to distinguish between the different <host></host> nodes. How to do
that leads to a whole other XPath topic; we’ll hold off on that question
for just a moment so we can look at a few more of the navigational
aspects of walking a node tree.
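The two location paths above can be tried out with a short sketch. It assumes Python's standard-library ElementTree, which supports only a subset of XPath 1.0 and evaluates paths relative to an element, so /network/host is written here as ./host with the <network> element as the context node. The SAMPLE string is a trimmed, hypothetical stand-in for the full file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file (hosts reduced to empty elements).
SAMPLE = """\
<network>
  <description name="Boston">Boston office network.</description>
  <host name="agatha" type="server" os="linux"/>
  <host name="gil" type="server" os="linux"/>
  <host name="krosp" type="client" os="osx"/>
</network>
"""

network = ET.fromstring(SAMPLE)  # context node: the <network> element

descriptions = network.findall("./description")  # like /network/description
hosts = network.findall("./host")                # like /network/host

print(len(descriptions))
print([host.get("name") for host in hosts])
```

The first path matches a single node, while the second matches every <host> at that level, which is exactly the ambiguity the text goes on to resolve.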
The information in our sample file consists of more than just
markup tags; the file has real data in it. The elements themselves often
have attributes (e.g., <interface name="en1"
type="AirPort">) or act as labels for data (e.g.,
<addr>192.168.0.4</addr>). How do we get to those
parts of the document? To get to an element’s attributes, we use an
@ in front of the attribute name. For
example, /network/description/@name gets us name="Boston". To access the contents of an
element’s text node, we end the location path with text(), as in /network/description/text(). This returns the
data “This is the configuration...”.
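A rough equivalent of those two lookups, sketched with Python's standard-library ElementTree (an assumption of this example). Its XPath subset has no @name or text() steps, so get() and the text property stand in for them.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file: just the <description> element.
SAMPLE = """\
<network>
  <description name="Boston">
    This is the configuration of our network in the Boston office.
  </description>
</network>
"""

network = ET.fromstring(SAMPLE)
description = network.find("./description")

# Stand-in for /network/description/@name
name = description.get("name")

# Stand-in for /network/description/text(). The raw text node keeps the
# newlines and indentation from the file, so strip() is applied for display.
text = description.text.strip()

print(name)
print(text)
```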
Wildcards in XPath can function similarly to their filesystem
analogs. /network/host/*/arec/text()
finds all element nodes[136] under a <host></host> node that have
<arec></arec> sub-nodes and then returns
the contents of those <arec></arec> elements. In this
case, we get back the DNS A resource record name associated with each
interface:
agatha.example.edu
gil.example.edu
baron.example.edu
mr-tock.example.edu
krosp.example.edu
krosp.wireless.example.edu
zeetha.example.edu
zeetha.wireless.example.edu
Attributes can be wildcarded in a similar fashion by using
@*. /network/host/@* would return all of the
attributes of the <host></host> elements.
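Both wildcards can be approximated in a few lines. This sketch assumes Python's standard-library ElementTree and a trimmed two-host version of the sample file; ElementTree understands * but not @*, so each element's attribute dictionary stands in for the attribute wildcard.

```python
import xml.etree.ElementTree as ET

# Two hosts taken from the sample file, with <cname> elements omitted.
SAMPLE = """\
<network>
  <host name="agatha" type="server" os="linux">
    <interface name="eth0" type="Ethernet">
      <arec>agatha.example.edu</arec>
      <addr>192.168.0.4</addr>
    </interface>
    <service>SMTP</service>
  </host>
  <host name="krosp" type="client" os="osx">
    <interface name="en0" type="Ethernet">
      <arec>krosp.example.edu</arec>
      <addr>192.168.0.100</addr>
    </interface>
    <interface name="en1" type="AirPort">
      <arec>krosp.wireless.example.edu</arec>
      <addr>192.168.100.100</addr>
    </interface>
  </host>
</network>
"""

network = ET.fromstring(SAMPLE)

# "*" matches any element child of <host>; only children that contain an
# <arec> contribute to the result (the <service> element has none).
arec_names = [arec.text for arec in network.findall("./host/*/arec")]

# Stand-in for /network/host/@*: every attribute of every <host>.
host_attributes = [dict(host.attrib) for host in network.findall("./host")]

print(arec_names)
print(host_attributes)
```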
There’s one last piece of syntax worth mentioning before we get to
the next section. XPath has what I call a “magic” location path
operator. If you use two slashes (//)
anywhere in the location path, it will search from that point down in
the tree to try to locate the subsequent path elements. For example, if
we say //arec/text(), we will get
back the same set of interface A resource record names as in our
previous example, because the operator will search from the root of the
tree down to find all of the <arec></arec>
elements that have text nodes. You can also place double slashes in the
middle of a location path, as in /network//service/text(). Our sample file has
a very shallow node tree, but you can imagine how the ability to
describe a path without specifying all of the intervening parts of the
tree might come in handy.
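Here is a small sketch of the double-slash search, assuming Python's standard-library ElementTree and a trimmed sample. ElementTree paths are relative to an element, so //arec becomes .//arec when evaluated from the <network> element.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file.
SAMPLE = """\
<network>
  <host name="agatha" type="server" os="linux">
    <interface name="eth0" type="Ethernet">
      <arec>agatha.example.edu</arec>
    </interface>
    <service>SMTP</service>
    <service>POP3</service>
  </host>
  <host name="mr-tock" type="server" os="openbsd">
    <interface name="fxp0" type="Ethernet">
      <arec>mr-tock.example.edu</arec>
    </interface>
    <service>firewall</service>
  </host>
</network>
"""

network = ET.fromstring(SAMPLE)

# Like //arec/text(): no need to spell out host/interface along the way.
arec_names = [arec.text for arec in network.findall(".//arec")]

# Like /network//service/text(): search below <network> at any depth.
service_names = [service.text for service in network.findall(".//service")]

print(arec_names)
print(service_names)
```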
In the last section we daintily stepped over the question of how one
specifies which branch or branches of a tree to follow if the elements
at that level in the tree have the same name. In our example document,
we have six <host></host> elements at the
third level of the tree. They have different attributes and the data in
each is different, but that doesn’t help if the location path is
constructed with just element names. If we say /network/host, the word host is (in the parlance of the spec) acting
as a “node test.” It selects which network branch or branches to take
when moving down the tree in our location path. But the node test in
this example isn’t giving us the granularity we need to select a single
branch.
That’s one place where XPath predicates come
into play. Predicates allow you to filter the set of possible nodes
provided by a node test to get just the ones you care about. /network/host returned
all of the host nodes; we’d like a way to narrow down that set.
Predicates are specified in square brackets ([]) in the location path itself. You insert a
predicate right at the point where a filtering decision has to be
made.
The simplest predicate example looks like an index number, as in
/network/host[2]/interface/arec/text(). This
location path returns the A record name(s) for the interface(s) of the second host node
(second in document order). If you were standing and looking at all of
the host nodes, the predicate would tell you which branch of the tree to
take: in this case, the one in the second position.
Perl programmers should be familiar with this index-like syntax, but don’t get too comfortable. Unlike in Perl, the index numbers in XPath start with 1, not 0.
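The off-by-one difference is easy to see side by side. This sketch assumes Python's standard-library ElementTree (which, like XPath, counts positions from 1 inside a path) and a trimmed three-host sample; Python list indexing plays the role of the 0-based Perl array here.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file: three hosts, one interface each.
SAMPLE = """\
<network>
  <description name="Boston">Boston office network.</description>
  <host name="agatha" type="server" os="linux">
    <interface name="eth0" type="Ethernet"><arec>agatha.example.edu</arec></interface>
  </host>
  <host name="gil" type="server" os="linux">
    <interface name="eth0" type="Ethernet"><arec>gil.example.edu</arec></interface>
  </host>
  <host name="baron" type="server" os="linux">
    <interface name="eth0" type="Ethernet"><arec>baron.example.edu</arec></interface>
  </host>
</network>
"""

network = ET.fromstring(SAMPLE)

# Position predicate inside the path: 1-based, so [2] is the second <host>.
second_host_arecs = [a.text for a in network.findall("./host[2]/interface/arec")]

# The same host reached through a 0-based list index.
second_host_name = network.findall("./host")[1].get("name")

print(second_host_arecs)
print(second_host_name)
```

Note that the <description> element does not count toward the position: the predicate applies only to the nodes that passed the host node test.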
If index numbers were the only possible predicate, that would be a
bit ho-hum. But here’s where XPath starts to get really cool. XPath has
a relatively rich set of predicates available for use. The next level of
predicate complexity looks something like this: /network/host[@name="agatha"]. This
selects the correct <host></host> by testing for the
presence of a specific attribute with a specific value.[137]
Predicates aren’t always found at the very end of a location path,
either. You can work them into a larger location string. Let’s say we
wanted to find the names of all of the Linux servers in our network. To
get this information we could write a location path like /network/host[@os="linux"]/service/../@name.
This location path uses a
predicate to select all the <host></host> elements that have
an os attribute of linux. It walks down the branch for each of
the nodes in that set that have a <service></service>
subelement (i.e., selecting only the hosts that are servers). At this
point we’ve walked the tree all the way down to a <service></service> node, so we
use ../@name to get to the name attribute of its parent (the <host></host> that contains the
<service></service> we
just found).
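The Linux-servers query can be sketched as follows, assuming Python's standard-library ElementTree and a trimmed sample. ElementTree supports the attribute predicate and the .. step but not a trailing @name, so the final attribute lookup is done with get().

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file: interfaces omitted.
SAMPLE = """\
<network>
  <host name="agatha" type="server" os="linux">
    <service>SMTP</service>
    <service>POP3</service>
    <service>IMAP4</service>
  </host>
  <host name="gil" type="server" os="linux">
    <service>HTTP</service>
    <service>HTTPS</service>
  </host>
  <host name="mr-tock" type="server" os="openbsd">
    <service>firewall</service>
  </host>
  <host name="krosp" type="client" os="osx"/>
</network>
"""

network = ET.fromstring(SAMPLE)

# Filter hosts by os, walk down to <service>, then step back up with "..".
# Each parent is returned once, even if it has several <service> children.
linux_servers = network.findall("./host[@os='linux']/service/..")

# Stand-in for the trailing /@name step.
linux_server_names = [host.get("name") for host in linux_servers]

print(linux_server_names)
```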
We can test the contents of a node like this: //host/service[text()='DNS']. This location path says to start at the root of
the tree looking for branches that have a <service></service> node embedded
in a <host></host> node.
Once XPath finds a branch that fits this description, it compares the
contents of each of those service nodes to find the one whose contents
are “DNS”.
The location path is being nicer to the parser than it needs to be
by calling text(). If we just use a
“.” (dot) instead of text() (meaning
the current node), XPath will perform the comparison against its
contents.
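A content test in the dot form can be tried with Python's standard-library ElementTree (an assumption of this sketch). Its XPath subset accepts [.='DNS'] on Python 3.7 and later but has no text() function, so only the shorter spelling is shown; the sample is trimmed to two hosts.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file: interfaces omitted.
SAMPLE = """\
<network>
  <host name="gil" type="server" os="linux">
    <service>HTTP</service>
    <service>HTTPS</service>
  </host>
  <host name="baron" type="server" os="linux">
    <service>DNS</service>
    <service>NTP</service>
  </host>
</network>
"""

network = ET.fromstring(SAMPLE)

# Like //host/service[.='DNS']: keep only the service whose contents are DNS.
dns_services = network.findall(".//host/service[.='DNS']")

print([service.text for service in dns_services])
```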
Testing for equality is only one of the comparison operators. Our
sample data doesn’t offer a good way to demonstrate this, but predicates
like [price > 31337] can be used
to select nodes as well.
It’s starting to look like a real computer language, no? It gets
even closer when we bring functions into the picture. XPath defines a
whole bunch of functions for working with node sets, strings, Boolean
operations, and numbers. In fact, we’ve seen some of them in action already, because /network/host[2]/interface/arec/text() really
means /network/host[position()=2]/interface/arec/text().
Just to give you a taste of this, here’s a location path that
selects the HTTP and HTTPS service nodes (allowing for any whitespace
that might creep in around the service name): //host/service[starts-with(normalize-space(.),'HTTP')].
The string function starts-with()
does just what you would expect it to: it returns true if the thing
being compared (here, the contents of the current node after normalize-space() has trimmed and collapsed its whitespace) begins with the string
provided in the second argument. The XPath spec has a list of the
available functions, though it is a little less beginner-friendly than
one might like. Searching for “XPath predicate” on the Web can lead to
other resources that help explain the spec.
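To show what that predicate computes, here is a sketch that emulates it in plain Python. Python's standard-library ElementTree does not implement XPath functions, so normalize_space() and string_value() below are hypothetical helpers written for this example, and the stray whitespace around HTTP in the sample was added deliberately.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file, with whitespace added around HTTP.
SAMPLE = """\
<network>
  <host name="agatha" type="server" os="linux">
    <service>SMTP</service>
  </host>
  <host name="gil" type="server" os="linux">
    <service>
      HTTP
    </service>
    <service>HTTPS</service>
  </host>
</network>
"""

def normalize_space(value):
    """Approximate XPath normalize-space(): trim and collapse XML whitespace."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")

def string_value(element):
    """Approximate the XPath string-value of an element: all descendant text."""
    return "".join(element.itertext())

network = ET.fromstring(SAMPLE)

# Emulates //host/service[starts-with(normalize-space(.),'HTTP')]
matches = [
    normalize_space(string_value(service))
    for service in network.iterfind(".//host/service")
    if normalize_space(string_value(service)).startswith("HTTP")
]

print(matches)
```

With a full XPath 1.0 engine the filtering would happen inside the location path itself; the list comprehension only mirrors the logic.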
This appendix started with the simplest core ideas of XPath, and each section along the way has incorporated more complexity and nuance. Let’s add one last level of subtlety by circling back to the original discussion of location paths. It turns out that all of the location paths we’ve seen so far have been written in what the spec calls an “abbreviated syntax.” The unabbreviated syntax is one of those things that you almost never need, but when you do, you really need it. We’re going to look at it quickly here just so you know it is available if you get into one of those situations.
So what exactly got abbreviated in the location paths we’ve seen
so far? When we said /network/host[2]/service[1]/text(), it
actually meant:
Start at the root of the tree.
Walk toward the children of the root node (i.e., down the
tree), looking for the child node or nodes with the element name
network.
Arrive at the <network></network> node. This
becomes the context node.
Walk toward the children of the context node, looking for the
child node or nodes with the element name host.
Arrive at the level in the tree that has several <host></host> nodes. Filter to
choose the node in the second position. This becomes the context
node.
Walk toward the children of the context node, looking for the
child node or nodes with the element name service.
Arrive at the level in the tree that has several <service></service> nodes.
Filter to choose the node in the first position. This becomes the
context node.
Walk toward the text node associated with the context node. Done.
If we were to write that out in the unabbreviated syntax, it would look like the following (this is all one long location path):
/child::network/child::host[position()=2]/child::service[position()=1]/child::text()
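The step-by-step walk can also be spelled out by hand. This sketch assumes Python's standard-library ElementTree, which does not accept the child:: axis syntax, so each step is written as an explicit filter over the context node's children and then compared with the abbreviated path. The sample is trimmed to two hosts.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file: interfaces omitted.
SAMPLE = """\
<network>
  <host name="agatha" type="server" os="linux">
    <service>SMTP</service>
    <service>POP3</service>
    <service>IMAP4</service>
  </host>
  <host name="gil" type="server" os="linux">
    <service>HTTP</service>
    <service>HTTPS</service>
  </host>
</network>
"""

network = ET.fromstring(SAMPLE)

# child::network -- the document root has a single child, <network>.
root_children = [network]
context = [node for node in root_children if node.tag == "network"][0]

# child::host[position()=2] -- XPath positions are 1-based.
hosts = [child for child in context if child.tag == "host"]
context = hosts[2 - 1]

# child::service[position()=1]
services = [child for child in context if child.tag == "service"]
context = services[1 - 1]

# child::text()
walked = context.text

# The abbreviated form, in ElementTree's relative-path spelling.
abbreviated = network.find("./host[2]/service[1]").text

print(walked, abbreviated)
```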
The key things we’ve added in this path are the axes (plural of
axis, we’re not talking weaponry here). For each step in the location
path, we can include an axis to tell the parser which direction to go in
the tree relative to the context node. In this case we’re telling it at
each step to follow the child:: axis;
that is, to move to the children of the context node. We’re so used to
filesystem paths that describe a walk from directory to subdirectory to
target file that we don’t think too hard when faced with the /dir/sub-dir/file syntax. This
is why the abbreviated XPath syntax works so nicely. But XPath doesn’t
restrict us to moving from child node to child node down the tree. We’ve
seen one example of this freedom already with the // syntax. When we say /network//cname, we are
really indicating /child::network/descendant-or-self::node()/child::cname.
That is:
Start from the root.
Move to its child nodes to find a <network></network> node or
nodes. When we find one, it becomes the context node.
Gather the context node and every node below it in the tree (the descendant-or-self:: axis), then select any
<cname></cname> children of those nodes.
The other three axes you already know how to reference in
abbreviated form are self:: (.), parent:: (..), and attribute:: (@). The unabbreviated syntax lets us use all
of the other axes—eight more, believe it or not: ancestor::, following-sibling::, preceding-sibling::, following::, preceding::, namespace::, descendant::, and ancestor-or-self::.
Of these, following-sibling::
is probably the most useful, so I’m only going to describe and
demonstrate that one. The references section of this appendix points you
at other texts that have good descriptions of the other axes. The
following-sibling:: axis tells the
parser to move over to the next element(s) in the tree at that level.
This references the siblings that follow the context node in document order. If we wanted to write a
location path that tried to find all of the hosts with multiple
interfaces, we could write (again, as one long line):
/child::network/child::host/child::interface/following-sibling::interface/parent::host/attribute::name
This essentially says, “Walk down from the network node until you
find a host with an interface node as its child, then see if it has a
sibling interface at the same level in the tree. If it does, walk back
up to the host node and return its name attribute.”
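The same multiple-interface search can be emulated in plain Python. The standard-library ElementTree has no following-sibling:: axis, so following_siblings() below is a hypothetical helper written for this sketch, and the sample is trimmed to three hosts.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the sample file: <arec> and <addr> omitted.
SAMPLE = """\
<network>
  <host name="agatha" type="server" os="linux">
    <interface name="eth0" type="Ethernet"/>
    <service>SMTP</service>
  </host>
  <host name="krosp" type="client" os="osx">
    <interface name="en0" type="Ethernet"/>
    <interface name="en1" type="AirPort"/>
  </host>
  <host name="zeetha" type="client" os="osx">
    <interface name="en0" type="Ethernet"/>
    <interface name="en1" type="AirPort"/>
  </host>
</network>
"""

def following_siblings(parent, node):
    """Return the children of parent that come after node in document order."""
    children = list(parent)
    position = children.index(node)
    return children[position + 1:]

network = ET.fromstring(SAMPLE)

multi_homed = []
for host in network.findall("./host"):
    for interface in host.findall("./interface"):
        later = following_siblings(host, interface)
        # following-sibling::interface -- any later sibling named interface?
        if any(sibling.tag == "interface" for sibling in later):
            # parent::host/attribute::name, recorded once per host
            multi_homed.append(host.get("name"))
            break

print(multi_homed)
```

A host with a single interface never matches, because its only <interface> has no following sibling of the same name.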
If you find XPath really interesting and you want to get even deeper into it, there are definitely some places you can explore outside the scope of this appendix. Be sure to read the specification and other references listed in the next section. Learn about the other predicates and axes available to you. Become acquainted with XPath 2.0, so when a Perl module that can use it becomes available, you’ll be ready. And in general, just play around with the language until you feel comfortable with it and it can become another handy tool in your toolchest.
http://www.w3.org/TR/xpath and http://www.w3.org/TR/xpath20 are the locations of the official XPath 1.0 and 2.0 specifications. I’d recommend reading them after you’ve had a chance to read a good tutorial or two (like those listed here).
XML in a Nutshell, Third Edition, by Elliotte Rusty Harold and W. Scott Means (O’Reilly), and Learning XML, Second Edition, by Erik T. Ray (O’Reilly), both have superb sections on XPath. Of the tutorials I’ve seen so far, they are the best.
http://www.zvon.org/xxl/XPathTutorial/General/examples.html is a tutorial that consists mostly of example location paths and how they map onto a sample document. If you like to learn by example, this can be a helpful resource.
There are various tools that allow you to type an XPath expression and see what it returns based on a sample document. Some parsers (e.g., the libxml2 parser) even ship with tools that provide this functionality. Get one, as they are really helpful for creating and debugging location paths. The one I use most of the time is built into the Oxygen XML editor.
Another cool tool for working with XML documents via XPath
is XSH2 by Petr Pajas, the current
maintainer of XML::LibXML. It lets you manipulate them
using XPath 1.0 as easily as you can manipulate files using filesystem
paths.
[136] At the beginning of the appendix I mentioned that XPath parses
the document into a set of nodes that include both the elements and
“other stuff.” The wildcard *
matches just element nodes, whereas node() matches all kinds of nodes (element
nodes and the “other stuff”).
[137] Before we go any further, it is probably worthwhile making
something implicit in this discussion explicit: if a node test fails
(e.g., if we tried to find the node or nodes at /network/admin/homephonenumber in this
document), it doesn’t return anything. There’s no error, the program
doesn’t stop, etc.
Copyright © 2009 O'Reilly Media, Inc.