
Weblog:   The Python community has too many deceptive XML benchmarks
Subject:   Where angels fear to tread
Date:   2005-01-26 04:09:57
From:   RobertKern
I know I'm going to regret getting in the middle of this.


Let me say this first: I have no investment in whose XML tool is the fastest or easiest to use or most compliant or whatever other standard you choose to apply. Like Phillip, I avoid XML where possible.


I also want to say that I am incredibly disappointed in the extreme lack of maturity some of you are displaying. It's as if you're looking to get offended by the other guys. This is not how adult communities behave.


No, I take that back: this is how adult communities behave all too often. But it's not how they should behave, and not how I've come to expect the Python community to behave.


Now, this article has, I think, a few valid points. First, I think that benchmark results should, wherever possible, come with the code and data that generated them, especially when they are part of the announcement of the package being benchmarked.


Second, I think having a number of different kinds of benchmarks, performed with a variety of packages, is important. I don't think that benchmark results that lack this kind of breadth are worthless, though.


Third, Uche correctly points out that benchmark numbers are not the only factor to consider in choosing a tool. However, no one, not even Fredrik, is disputing this.


The primary area where this article misses the mark is the claim that Fredrik was being deceptive. Fredrik did not post a deceptive benchmark. He posted an incomplete set of benchmark results: one that did not include the actual code he used to derive his numbers, although he documented his procedure elsewhere. I would encourage Fredrik to post the benchmark code the next time he advertises cElementTree with benchmark numbers. Not only would that encourage confidence in the numbers, it would also serve as a useful tool for others. Apparently, plenty of people don't know how to benchmark Python code properly, and fewer still (certainly not I before this incident) know why the standard solution, timeit.py, is inadequate for these tools. With the code posted, people could see what the current "best practices" are for ElementTree for the operations timed, and contribute benchmarks for the tools that were not included.


The article also makes the incorrect claim that what Fredrik benchmarks is useless. It's not. The parse time and memory used are important components of the whole XML-wielding program and should be measured. What's more, these are factors that are shared by pretty much every program; I may not need to find text or construct a tree or extract certain tags in my program, but I certainly need to read in the data. Now these aren't the only quantities that should be measured, but the measurement is not useless just because it's the only one offered there. And Fredrik certainly wasn't hiding the fact that that was all that he was timing.
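
To make "parse time and memory used" concrete, here is a minimal sketch of how those two quantities might be captured for a single parse. It is not Fredrik's procedure or anyone's posted benchmark: the function names and the synthetic document are hypothetical, and it uses the standard library's xml.etree.ElementTree together with tracemalloc, which only sees allocations made through Python's memory allocators.

```python
import io
import time
import tracemalloc
import xml.etree.ElementTree as ET


def make_document(n_items=1000):
    """Build a small synthetic XML document (hypothetical test data)."""
    items = "".join("<item id='%d'>text %d</item>" % (i, i) for i in range(n_items))
    return ("<root>%s</root>" % items).encode("utf-8")


def measure_parse(data):
    """Return (elapsed seconds, peak traced bytes, root element) for one parse.

    Tracing allocations slows the parse down, so a real benchmark would
    probably measure time and memory in separate runs.
    """
    tracemalloc.start()
    start = time.perf_counter()
    tree = ET.parse(io.BytesIO(data))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak, tree.getroot()


elapsed, peak, root = measure_parse(make_document())
print("parsed %d items in %.6f s, peak traced memory %d bytes"
      % (len(root), elapsed, peak))
```

Every XML-reading program pays these two costs whatever it does with the tree afterwards, which is the sense in which the measurement is shared.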


The article also comes to the conclusion that the benchmarks offered by Fredrik were "deceptive" based on the evidence of Uche's own benchmarks, which yielded different numbers than Fredrik's. Following the article's logic, the goal was to measure something important that Fredrik's benchmarks didn't measure: find all tags with a certain text string. The benchmarks were done, the numbers were rather different than Fredrik's, and so he was being deceptive in posting his numbers. If the article's benchmarks were adequate measurements, this line of argument would make some sense. However, the article's benchmarking strategy does not accurately measure comparable times.


The fact that the article's benchmarks are open, with full code and documented timing strategy, does not change the fact that they are wrong. Furthermore, it does not change the fact that concluding that someone else is being deceptive because their results (accurately obtained) don't match up with your results (not accurately obtained) is wrong. Posting an article to O'Reilly falsely accusing someone else of deception instead of hashing it out in private or a semi-private forum like the XML-SIG is also wrong.


But you don't have to take my word for it. I redid the benchmarks from the article with a proper timing harness. The results from 5 runs of each package are given, in seconds, in the file timings.csv. The information about my system is given in comments. I couldn't run the saxtools version; I get an exception as documented. I also tried the Gnosis code that David Mertz posted, but Gnosis_Utils-1.1.1 doesn't seem to define one of the functions needed. I didn't implement the lxml version because I didn't feel like building it.
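
For readers who want to try something similar, the following is a rough sketch of a harness that times 5 separate runs per package and writes the results, in seconds, as CSV. It is an assumption-laden illustration, not the harness behind timings.csv: the function names, the synthetic document, and the column layout are all invented here, and only the standard library's xml.etree.ElementTree is timed.

```python
import csv
import io
import time
import xml.etree.ElementTree as ET

RUNS = 5  # number of timed runs per package, as in the comment above


def parse_with_elementtree(data):
    """Operation under test: parse the document into a tree."""
    return ET.fromstring(data)


def time_runs(func, data, runs=RUNS):
    """Time each run separately and return the list of elapsed seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return timings


def write_timings(results, out):
    """Write one CSV row per package: name followed by its run times."""
    writer = csv.writer(out)
    writer.writerow(["package"] + ["run%d" % (i + 1) for i in range(RUNS)])
    for name, timings in results.items():
        writer.writerow([name] + ["%.6f" % t for t in timings])


# Hypothetical synthetic input; a real comparison would use the same
# document and the same operation for every package being compared.
data = b"<root>" + b"<item>x</item>" * 1000 + b"</root>"
results = {"ElementTree": time_runs(parse_with_elementtree, data)}

buf = io.StringIO()
write_timings(results, buf)
csv_text = buf.getvalue()
print(csv_text)
```

Other packages could be added as further entries in `results`, each wrapped in a function with the same signature so that every one is timed the same way.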


The results are broadly along the lines of what Fredrik posted. Say what you like about his attitude and his "bluster," the man doesn't lie with his benchmarks.
