XML::XPathEngine is a Perl distribution that allows you to use XPath expressions to navigate tree-like data structures. Here’s what I learned from experimenting with it.
Why XPath?
Why would you want to do that? There are lots of questions that are easy to answer by describing the answer that you want instead of how to do it. SQL and Prolog are two examples of languages that make this possible. XPath is another.
I write books occasionally. I almost always use PseudoPod as it is easy to use, text-based, and has a fairly good tool suite. Occasionally I need to fetch a list of titles or authors or code snippets.
It’s tempting to write a regular expression to parse a list of documents. But when the task gets more complex than looking at a single line to find the right information (when I need something like “Find all of the section headers followed by a link anchor”), building and maintaining the state machine required for such parsing is tedious and time-consuming.
That’s especially true when the XPath expression to find such nodes might be /Header/following-sibling::Link.
My Code
As part of an abortive attempt to build PDFs directly from PseudoPod documents with PDF::API2 (to summarize: this was several months ago, and writing my own line-breaking and page-filling algorithms wasn’t even as much fun as it sounds), I wrote a series of modules to build a document object model (DOM) for PseudoPod documents. Pod::PseudoPod::DocumentParser (not on the CPAN) subclasses the PseudoPod parser to build up this DOM.
When I stopped work on the code, I had classes to model Documents, Headers, Blocks, Paragraphs, Sections, and Text. That’s not all a PseudoPod DOM might use, but it was enough to try adding XML::XPathEngine support.
Experience
XML::XPathEngine expects the node objects it works on to support several methods.
getName(): Returns the name of the element this node represents.
getRootNode(): Returns the document.
getParentNode(): Returns the parent that contains this node element.
getChildNodes(): Returns a reference to an array of children, if this node has any.
getNextSibling(): If this node has siblings, returns the next one.
getPreviousSibling(): If this node has siblings, returns the previous one.
isElementNode(): Returns true if this node represents an element, not text.
isTextNode(): Returns true if this node represents text, not an element. (Think of XML, where you have elements and text.)
getValue(): Returns the value represented by this node.
cmp(): The documentation didn’t say I needed to add this one, but I received errors about sorting problems without it. My version just returns -1 for each call, as there’s no particular child sorting order I care about.
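To make the list above concrete, here is a minimal sketch of a node class that provides the methods named in it. My::Node, add_child(), and the hash keys are hypothetical names invented for this illustration; nothing here comes from Pod::PseudoPod or from XML::XPathEngine itself, and a real engine may call further methods that this sketch does not cover.

```perl
use strict;
use warnings;

# My::Node is a hypothetical class for illustration only.
package My::Node;

use Scalar::Util 'weaken';

sub new {
    my ($class, %args) = @_;
    return bless {
        name     => $args{name},                     # e.g. 'Header'; undef for text
        value    => defined $args{value} ? $args{value} : '',
        is_text  => $args{is_text} ? 1 : 0,
        parent   => undef,
        index    => 0,                               # position among siblings
        children => [],
    }, $class;
}

# Attach a child, recording its parent and its position.
sub add_child {
    my ($self, $kid) = @_;
    $kid->{index} = @{ $self->{children} };
    weaken( $kid->{parent} = $self );                # avoid a reference cycle
    push @{ $self->{children} }, $kid;
    return $kid;
}

sub getName       { $_[0]{name} }
sub getValue      { $_[0]{value} }
sub getParentNode { $_[0]{parent} }
sub getChildNodes { $_[0]{children} }                # array reference
sub isTextNode    { $_[0]{is_text} }
sub isElementNode { $_[0]{is_text} ? 0 : 1 }

# Walk up the parent links until there is no parent left.
sub getRootNode {
    my $node = shift;
    $node = $node->{parent} while $node->{parent};
    return $node;
}

sub getNextSibling {
    my $self   = shift;
    my $parent = $self->{parent} or return;
    return $parent->{children}[ $self->{index} + 1 ];
}

sub getPreviousSibling {
    my $self   = shift;
    my $parent = $self->{parent} or return;
    return if $self->{index} == 0;                   # -1 would wrap to the last child
    return $parent->{children}[ $self->{index} - 1 ];
}

# Installed through the symbol table because "cmp" is also an operator name.
{
    no strict 'refs';
    *{ __PACKAGE__ . '::cmp' } = sub { -1 };
}

package main;

my $doc    = My::Node->new( name => 'Document' );
my $header = $doc->add_child( My::Node->new( name => 'Header', value => 'Why XPath?' ) );
my $para   = $doc->add_child( My::Node->new( name => 'Paragraph' ) );
my $text   = $para->add_child( My::Node->new( is_text => 1, value => 'Some text.' ) );

print $header->getNextSibling->getName, "\n";        # Paragraph
print $text->getRootNode->getName, "\n";             # Document
```

Recording each child's index when it is attached keeps the sibling lookups short, at the cost of having to renumber if children are ever removed or reordered.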
I also added two helper methods to make the sibling-finding methods work better:
find_before(): This finds the parent node, grabs all of its children, and loops through the list to find the current node. Then it returns the immediately preceding sibling. There are better ways to write this algorithm, but this was the simplest and my DOM trees aren’t complex enough that performance suffers.
find_after(): This finds the parent node, grabs all of its children, and returns the sibling immediately after the current node.
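One of the "better ways" might be to find the node's index once and let a single helper serve both directions. The following is only a sketch under that assumption: sibling_at() is a hypothetical function name, and it works on a plain array reference of children rather than on the PseudoPod classes.

```perl
use strict;
use warnings;
use List::Util   'first';
use Scalar::Util 'refaddr';

# Hypothetical helper: return the sibling $offset positions away from $child
# in the array reference $kids, or nothing when there is no such sibling.
sub sibling_at {
    my ($kids, $child, $offset) = @_;

    # Compare by address so overloaded stringification cannot confuse the match.
    my $index = first { refaddr( $kids->[$_] ) == refaddr($child) } 0 .. $#$kids;
    return unless defined $index;

    my $target = $index + $offset;
    return if $target < 0 || $target > $#$kids;      # a negative index would wrap
    return $kids->[$target];
}

my @kids = map { +{ name => $_ } } qw( Header Link Paragraph );

print sibling_at( \@kids, $kids[1], -1 )->{name}, "\n";   # Header
print sibling_at( \@kids, $kids[1], +1 )->{name}, "\n";   # Paragraph
```

This is still a linear scan, so it changes the shape of the code more than its cost; the bounds check matters because Perl treats index -1 as the last element.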
Of course, because I rarely use Java or XML, my nodes already had methods with different names (getRootNode() corresponds to my get_document(), for example). I created a helper module that imports these methods into each node’s class:
    package Pod::PseudoPod::XPath;

    use strict;
    use warnings;

    # Install the methods XML::XPathEngine calls into the importing class.
    sub import
    {
        my ($class, %args) = @_;

        $args{children} ||= 'children';
        $args{value}    ||= '';

        # Default only when the key is absent, so an explicit element => 0
        # (for a text node class) is not overwritten.
        $args{element} = 1 unless exists $args{element};
        $args{text}    = 0 unless exists $args{text};

        my $caller      = caller();
        my $kids_method = 'get_' . $args{children};

        no strict 'refs';
        *{ $caller . '::getName'            } = sub { $args{name} };
        *{ $caller . '::getRootNode'        } = sub { shift->get_document( @_ ) };
        *{ $caller . '::getParentNode'      } = sub { shift->get_parent( @_ ) };
        *{ $caller . '::getChildNodes'      } = sub { shift->$kids_method( @_ ) };
        *{ $caller . '::getNextSibling'     } = \&getNextSibling;
        *{ $caller . '::getPreviousSibling' } = \&getPreviousSibling;
        *{ $caller . '::find_before'        } = \&find_before;
        *{ $caller . '::find_after'         } = \&find_after;
        *{ $caller . '::isElementNode'      } = sub { $args{element} };
        *{ $caller . '::isTextNode'         } = sub { $args{text} };
        *{ $caller . '::getValue'           } = $args{value}
            ? sub { my $gv = 'get_' . $args{value}; shift->$gv( @_ ) }
            : sub { '' };
        *{ $caller . '::cmp'                } = sub { -1 };
    }

    sub getNextSibling
    {
        my $self   = shift;
        my $parent = $self->get_parent or return;    # the root has no siblings
        return $parent->find_after( $self );
    }

    sub getPreviousSibling
    {
        my $self   = shift;
        my $parent = $self->get_parent or return;    # the root has no siblings
        return $parent->find_before( $self );
    }

    # Called on the parent: return the child just before $child, if any.
    sub find_before
    {
        my ($self, $child) = @_;
        my $before;

        # Use getChildNodes() so a custom 'children' argument is respected.
        for my $kid ( @{ $self->getChildNodes() || [] } )
        {
            return $before if $before and $kid eq $child;
            $before = $kid;
        }

        return;
    }

    # Called on the parent: return the child just after $child, if any.
    sub find_after
    {
        my ($self, $child) = @_;
        my @kids = @{ $self->getChildNodes() || [] };

        while (my $kid = shift @kids)
        {
            return shift @kids if $kid eq $child;
        }

        return;
    }

    1;
It’s not entirely beautiful code, and if I were to release it formally I would rethink some of the design decisions, but it worked in my tests. I consider it livable for a short experiment with the module.
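The core trick in the helper is that import() writes closures into the symbol table of whichever class loads it. Below is a cut-down sketch of that technique on its own. My::XPathGlue, My::Header, and My::Text are hypothetical names, and only three of the methods are installed. One detail worth care is defaulting: an idiom such as `||=` would replace an explicit `element => 0` with 1, so the sketch checks `exists` instead.

```perl
use strict;
use warnings;

# My::XPathGlue is a hypothetical, cut-down stand-in for a helper module.
package My::XPathGlue;

sub import {
    my ($class, %args) = @_;

    # Default only when the caller did not pass the key, so that an explicit
    # false value such as element => 0 survives.
    $args{element} = 1 unless exists $args{element};
    $args{text}    = 0 unless exists $args{text};

    my $caller = caller();

    # Each closure captures this call's own %args.
    no strict 'refs';
    *{ $caller . '::getName' }       = sub { $args{name} };
    *{ $caller . '::isElementNode' } = sub { $args{element} };
    *{ $caller . '::isTextNode' }    = sub { $args{text} };
}

# In separate files this would be: use My::XPathGlue name => 'Header';
package My::Header;
BEGIN { My::XPathGlue->import( name => 'Header' ) }
sub new { bless {}, shift }

package My::Text;
BEGIN { My::XPathGlue->import( name => 'Text', element => 0, text => 1 ) }
sub new { bless {}, shift }

package main;

for my $node ( My::Header->new, My::Text->new ) {
    printf "%s: element=%d text=%d\n",
        $node->getName, $node->isElementNode, $node->isTextNode;
}
```

Because import() is an ordinary class method, the same call could also be made later at run time to decorate a class only when XPath support is actually wanted.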
With everything in place, I could then create a PseudoPod DOM object and run XPath queries against it:
    use XML::XPathEngine;
    use Pod::PseudoPod::DocumentParser;

    # Parse the book into a PseudoPod DOM.
    my $parser = Pod::PseudoPod::DocumentParser->new();
    $parser->parse_file( 'book.pod' );
    $parser->finish();

    my $document = $parser->get_document();

    # Find every top-level Header node and print its value.
    my $xp       = XML::XPathEngine->new();
    my $nodelist = $xp->find( '/Header', $document );

    while ( my $node = $nodelist->shift() )
    {
        print "Header: ", $node->getValue(), "\n";
    }
I tried a few more complex expressions too. Some worked well, but others revealed weaknesses either in my implementation (likely; my explanation of the difference between element and text node types is obviously fuzzy) or in my PseudoPod document model. That’s fine. This was an experiment and I’m happy with what I’ve learned.
Gotchas
Your data structure doesn’t have to be tree-like, as long as you support the methods XML::XPathEngine requires. This does mean that you have to have some way of translating between your representation and the tree-like structure. Depending on how often you search, how complicated your expressions are, where the nodes you want to find sit in the tree, and how expensive it is to translate between your data structure and the expected interface, this could be slow.
You also need an object interface to your data structure. Again, this doesn’t mean that you have to use objects to represent your data structure, only that XML::XPathEngine calls methods on nodes. If you can calculate the metadata about your data structure without building the expensive structure itself, you may have a huge advantage over trying to navigate the real data structure.
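One way to read the last two points is as a thin adapter that wraps the underlying data only when asked. The sketch below assumes a nested hash as the "real" data structure. My::HashNode and its constructor arguments are hypothetical, and it implements only a subset of the node methods (no siblings or root), so it illustrates the translation layer rather than a complete node.

```perl
use strict;
use warnings;

# My::HashNode is a hypothetical adapter: it wraps a nested hash on demand
# and never copies the underlying data.
package My::HashNode;

use Scalar::Util 'weaken';

sub new {
    my ($class, %args) = @_;
    my $self = bless { %args }, $class;              # expects name, data, parent
    weaken( $self->{parent} ) if $self->{parent};    # avoid a reference cycle
    return $self;
}

sub getName       { $_[0]{name} }
sub getParentNode { $_[0]{parent} }
sub isElementNode { 1 }
sub isTextNode    { 0 }

# Scalars are leaf values; hashes have no value of their own.
sub getValue {
    my $self = shift;
    return ref( $self->{data} ) ? '' : $self->{data};
}

# Children are wrapped the first time they are asked for, then cached.
sub getChildNodes {
    my $self = shift;
    return $self->{kids} ||= do {
        my $data = $self->{data};
        ref($data) eq 'HASH'
            ? [ map { My::HashNode->new( name => $_, data => $data->{$_}, parent => $self ) }
                sort keys %$data ]
            : [];
    };
}

package main;

my %book = (
    title   => 'Example Book',
    chapter => { title => 'Why XPath?', pages => 12 },
);

my $root = My::HashNode->new( name => 'book', data => \%book );

for my $kid ( @{ $root->getChildNodes } ) {
    printf "%s has %d children\n", $kid->getName, scalar @{ $kid->getChildNodes };
}
```

Sorting the keys gives the children a stable order, which a hash does not otherwise provide; whether that order means anything depends on the data.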
In my experience, having to add references to the parent and root nodes in my tree required some code changes — mostly to constructor calls. There is probably a way to do without the need for the explicit root node method (probably a method that calls get_parent() recursively up the tree until it hits a Document would suffice), but I went with the simplest approach I could think of.
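The parent-walking idea could look something like the sketch below. It assumes only that every node answers get_parent() and that the root returns nothing. My::TreeNode and get_root() are hypothetical names for this illustration, and the loop stops at the first parentless node rather than checking for a Document class.

```perl
use strict;
use warnings;

# My::TreeNode is hypothetical; only get_parent() mirrors a name from the text.
package My::TreeNode;

sub new {
    my ($class, %args) = @_;
    return bless { parent => $args{parent} }, $class;
}

sub get_parent { $_[0]{parent} }

# Follow get_parent() until a node reports no parent; that node is the root.
sub get_root {
    my $node = shift;
    while ( my $parent = $node->get_parent ) {
        $node = $parent;
    }
    return $node;
}

package main;

my $document = My::TreeNode->new;
my $section  = My::TreeNode->new( parent => $document );
my $para     = My::TreeNode->new( parent => $section );

print "found the root\n" if $para->get_root == $document;
```

A loop is used instead of literal recursion; the result is the same, and the cost is one method call per level of nesting on every lookup, which is the trade-off against storing an explicit root reference.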
It would have been very nice to be able to apply all of the necessary support methods, including get_parent() and get_root(), within a role so that I could decorate all of the PseudoPod DOM classes at runtime only as necessary, but I didn’t see an easy way to do that.
Recommendations
For finding nodes in a tree-like data structure, XPath is really the way to go. I’m not aware of any easier general-purpose description language. While the documentation for XML::XPathEngine 0.03 is slim, it only took me a couple of hours of fiddling to get something working. I look forward to future updates and will consider using this module again for similar projects.

XPath rocks.
As for decorating: maybe Class::Trait would have been what you needed?
I know it was just an example, but for using XPath on POD documents, you could have used my Pod::XPath module, which I built a few years ago to address this exact situation.
Aristotle, I probably would use Class::Trait if I were to keep this code and approach.
Darren, thanks for the pointer. Offhand, do you know if the module would work with PseudoPod?