Related link: http://www.nelson.monkey.org/~nelson/weblog/tech/python/xpath.html

Nelson says: There’s the stock Python install, which barely does anything [for XML]. That’s overstated. Plain old SAX and minidom may not be ideal, but they’re usable. Various bugs in PySAX and minidom (see, for example, this article) have unfortunately plagued the standard library, but starting with Python 2.3, I think that they deliver what’s promised. The main problem is that what they promise doesn’t fit Python’s shoes all that well. PySAX’s very literal translation of Java’s class/method callback model feels very stilted in a language that now has the likes of generators and nested scopes. I suspect that if PySAX were in development now, things would be very different. It’s to some extent a legacy problem. I used to recommend SAX to those who need performance, but I think my own recent work (represented in Amara) and that of Fredrik Lundh (in ElementTree) may be enough to render PySAX obsolete. As for minidom, it could do with a lot more Python-friendly sugar so that people don’t have to think in the W3C’s over-elaborated API, but once you get the hang of DOM, you can pretty much do whatever you need with it.
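To make the callback-versus-generator contrast concrete, here is a minimal sketch, not taken from any of the libraries' own documentation. It assumes Python 3 and the ElementTree that now ships in the standard library as xml.etree.ElementTree; the sample document and the names TitleHandler, titles_with_sax and titles_with_iterparse are made up for illustration.

```python
import io
import xml.sax
from xml.etree import ElementTree

# Hypothetical sample data, invented for this sketch.
SAMPLE = (b"<feed><entry><title>First</title></entry>"
          b"<entry><title>Second</title></entry></feed>")


class TitleHandler(xml.sax.ContentHandler):
    """SAX style: state has to be carried across separate callbacks."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._buffer = None


def titles_with_sax(data):
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles


def titles_with_iterparse(data):
    """Generator style: iterparse yields each element once it is complete."""
    for _event, elem in ElementTree.iterparse(io.BytesIO(data)):
        if elem.tag == "title":
            yield elem.text
```

Both functions walk the document in a streaming fashion, but the generator version needs no handler class and no hand-managed "am I inside the element?" flag.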

So rather than completely writing off the stdlib XML facilities, as Nelson did, I damn it with faint praise. Not a difference worth much bother? Perhaps. Moving on, here’s Nelson again:

PyXML, which has an ugly hack to confusingly install on top of the default Python libraries. But if you follow the advice of Python’s most visible XML expert, Uche Ogbuji, you may think there’s something wrong with PyXML and install 4Suite instead, which is the same as PyXML only different.

I’ve done a horrible job of explaining 4Suite if people are thinking it’s in any way similar to PyXML. The two could hardly be more different. Maybe Nelson means that the XPath libraries are the same? This isn’t true either. Years ago we did copy the 4Suite code base to PyXML, and it was massaged to make it fit better into PyXML overall. Since then, the XPath in 4Suite has evolved into an entirely different beast: much faster, more extensible, and with a cleaned up API.

Or should you use Amara instead? Fair question. When I developed Amara I considered lumping all that code back into 4Suite, but I thought it better to release it as a separate 4Suite add-on. For one thing, I think it has a very different flavor: focusing on Python idioms rather than what-would-W3C-do (which we’d been peeling away from gradually in 4Suite, anyway).

I think I can make a workable soundbite for the cause: If you’re coming more from a Python background, and XML is just something that’s getting in your way, try Amara. If you’re coming from an XML background, and you think in DOM, XSLT and all that, try 4Suite. Does anyone find that soundbite useful? Based on it, I think Nelson should be trying Amara rather than just 4Suite. I should point out that Amara is very fast as well (and 4Suite has made huge strides from when it was too slow to bear: it’s now very respectable, if not blistering).

ElementTree which is brilliantly fast and simple to use, but limited

Hmm. Several times I’ve made the mistake of claiming some limitation in ElementTree, and then along comes Fredrik to straighten me out. ElementTree is a lot more versatile than one might think at first glance. So why did I develop Amara? Why didn’t I just use ElementTree? I did for a while, and I always felt that ElementTree does a great job of loosening DOM shackles in favor of something more Python-flavored (hats off to Fredrik, who tried to convince me that DOM is not good enough for Python long before I saw the light). But I honestly think ElementTree doesn’t go quite far enough. Amara follows the principle that once I decide to shrug off DOM, I want to be able to use every possible nifty tool in Python’s arsenal to make the XML feel native to the language. I want something closer to Gnosis Utilities Objectify, but using a much more declarative framework. I think that Amara’s unique niche is a combination of extreme Python-friendliness and declarativity. I think that XML without declarativity results in far too much code, and code that is too brittle, even in Python.
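As a small illustration of what "loosening DOM shackles" can look like, here is a hedged sketch that reads the same value once through minidom and once through ElementTree. It assumes Python 3 and the standard library's xml.etree.ElementTree; the sample document and both function names are invented for the example, and it does not attempt to show Amara's API.

```python
from xml.dom import minidom
from xml.etree import ElementTree

# Hypothetical sample data, invented for this sketch.
DOC = "<book><title>Python &amp; XML</title><author>A. Writer</author></book>"


def author_with_minidom(text):
    doc = minidom.parseString(text)
    author = doc.getElementsByTagName("author")[0]
    # DOM models text as child nodes, so the pieces are joined by hand.
    return "".join(node.data for node in author.childNodes
                   if node.nodeType == node.TEXT_NODE)


def author_with_elementtree(text):
    # ElementTree exposes an element's text directly.
    return ElementTree.fromstring(text).findtext("author")
```

Both return the same string; the difference is only in how much of the W3C node model the caller has to keep in mind.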

xmltramp, which is even more hacky.

I’ll risk the flames and be honest. I don’t think xmltramp is (yet) industrial strength. It’s a lot hackier than ElementTree, Gnosis, generateDS, 4Suite or even Amara. It looks and probably feels great in the first foray, but I don’t think that experience will scale to heavy usage. Besides, it doesn’t support XPath.

But what’s missing is a clear single simple library to use.

I don’t believe a single choice is appropriate. I want many options. I think people who want just one way to process XML are limited by sketchy experience with XML. Just as I wouldn’t expect one single library for text processing in Python (and I expect no one would suggest such a thing), I can’t imagine how anyone could shoehorn all the breadth and variety of XML use cases into a single idiom, or even two or three. XML is ridiculously versatile, and this necessitates broad choice. I do a lot with XML and, consequently, I often use 3-4 different tools in any given day.

PyXML seems the most standard, but it seems very slow and it tries to be more DOM-like than Python-like. I hate DOM.

I don’t promote PyXML as any sort of standard. To me the only standard is Python’s stdlib, and PyXML is not in it. It’s just a choice, and a flawed one for some of the reasons you mention. I think PyXML was important, but has been overtaken by events. I’m not entirely blameless in that matter, and I’m sorry I never had the energy to work on PyXML as hard as on, say, 4Suite, but I think at this point it’s too late.

[with PyXML] from xml.dom.ext.reader import Sax2

Yuck. That’s the ancient DOM code included in PyXML. Many people make the mistake of invoking it. It is dreadfully slow and consumes a dreadful amount of memory. Always use PyXML’s minidom. Just replace the above with:

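# Requires PyXML (Python 2), which supplies the xml.xpath module
from xml.dom import minidom
from xml import xpath
doc = minidom.parse('foo.opml').documentElement
# Select every xmlUrl attribute node in the document
for url in xpath.Evaluate('//@xmlUrl', doc):
  print url.value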

You’ll get a lot more speed, but all my other downer comments on PyXML still apply. There are better options.
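For readers who don't have PyXML installed, roughly the same extraction can be sketched with ElementTree alone. This is an illustration under stated assumptions rather than a recommendation from the original post: it assumes Python 3 and the standard library's xml.etree.ElementTree, and the sample OPML and the function name feed_urls are invented. ElementTree's XPath subset cannot select attribute nodes directly, so instead of '//@xmlUrl' the sketch selects the elements that carry the attribute and then reads it.

```python
from xml.etree import ElementTree

# Hypothetical OPML fragment, invented for this sketch.
SAMPLE_OPML = """<opml version="1.1">
  <body>
    <outline text="Feed A" xmlUrl="http://example.org/a.xml"/>
    <outline text="Folder">
      <outline text="Feed B" xmlUrl="http://example.org/b.xml"/>
    </outline>
  </body>
</opml>"""


def feed_urls(opml_text):
    root = ElementTree.fromstring(opml_text)
    # ".//*[@xmlUrl]" matches every descendant element that has an
    # xmlUrl attribute, at any depth, in document order.
    return [elem.get("xmlUrl") for elem in root.findall(".//*[@xmlUrl]")]
```

To read from a file such as foo.opml instead of a string, ElementTree.parse('foo.opml').getroot() would take the place of ElementTree.fromstring.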

the awfulness of the libxml2 API

I couldn’t agree more. libxml2 is a miracle of function, but alas in a form that doesn’t suit Python one bit. I know that folks are working on better libxml2 wrappers, but familiar as I am with the C code, I honestly don’t believe they can produce anything truly Pythonesque without losing all the performance gains.

So that’s all the chatter. But code speaks louder, and I’ll offer some in a subsequent entry.