Mike Champion asked on the XML-DEV mailing list this week:
To be honest the XBC (XML Binary Characterization) WG has been waiting for some community push-back
and input on both the negatives and positives of binary XML … but so far the negatives
haven’t been coming and that worries me a bit..
My guess is that most people think like me: if “Binary XML” is just a form of compression that allows a tightly coupled interface to SAX, then why not? Or if it allows some substantially different characteristic such as random access, then maybe it has a good place as an adjunct. If it does not have enough bang per buck, it will only be a niche thing. Not that there is anything wrong with niches. (And we should expect that companies who don’t do well out of XML will try to spoil it and develop alternatives, while companies doing well from it will try to stifle innovation. That’s show biz!) Does ASN.1 currently hurt XML?
I think that, post XML Schemas, the W3C brand is fairly diminished as far as new specs are concerned. XBC could easily go the way of XPointer, XML 1.1 and XML Fragment Interchange: like a quarrelsome but beautiful neighbour, decorative but to be avoided.
If the XBC discussion takes off, expect all the usual baloney, in particular the extremely fragrant sausage that if we reduce the number of different tags that XML recognizes, it will speed up parsers in some significant way. (Baloney because there are efficient ways of implementing parsers so that tests for rare tags don’t penalize the common cases. For example, an optimized parser could detect that a document has no DOCTYPE declaration, and then switch to an implementation that does not need to do any buffer reallocation to handle entity inclusion. Java even compiles dense switch statements to jump tables, which makes simple parsing fast.)
My view from the armchair is that chip manufacturers (Intel, AMD, et al.) need to step up to the plate here: the Unicode character tables and properties, and Unicode transcoders for the most common character sets, should be hardcoded inside CPU chips. I have seen at least one East Asian CPU with character tables built-in, so it is not a far-fetched idea. People do not say “Maths operations take a lot of CPU power, let’s ditch less common math functions”, do they?
Now that XML is ubiquitous and mission critical, of course we should expect all sorts of ingenious ways to speed it up. But the prime area that is being missed, it seems to me, is how to improve XML support inside CPUs.
What kind of form could it take? The simplest form might be to provide an operation that takes an unsigned short (i.e. a UTF-16 code unit) and returns an int containing bits representing each binary Unicode property and its status as an XML delimiter, just by simple table lookup. (Actually, I would provide two operations: one for UTF-16 BMP which also copes with ASCII and ISO 8859-1 because they are code compatible, and one for UTF-32.) Since XML documents tend to be small, for both SAX processing and XML->DOM processing, I quite expect that not much XML parser machine code would survive in the cache between invocations of the parser or SAX. So providing a built-in table will marginally improve cache behaviour as well as allowing faster parsing without giving up on decent and suspicious parsing. Since IO between the CPU and bus is the current bottleneck, this improvement, though certainly limited and sporadic, is in the right kind of area.
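To make the idea concrete, here is a minimal software sketch of that lookup in Python. It is only an illustration of the table-lookup shape, not a proposal for an actual instruction: the flag names, the bit layout and the `properties` function are all my own assumptions, and the table is filled from Python's `unicodedata` module rather than from anything a CPU vendor ships.

```python
import unicodedata

# Hypothetical flag bits; the names and layout are illustrative only.
LETTER = 1 << 0
DIGIT = 1 << 1
XML_WHITESPACE = 1 << 2
XML_DELIMITER = 1 << 3

XML_WHITESPACE_CHARS = "\t\n\r "
XML_DELIMITER_CHARS = "<>&\"'=]"


def build_bmp_table():
    """Precompute one int of property bits per BMP code unit."""
    table = [0] * 0x10000
    for code_unit in range(0x10000):
        ch = chr(code_unit)
        category = unicodedata.category(ch)
        flags = 0
        if category.startswith("L"):
            flags |= LETTER
        if category == "Nd":
            flags |= DIGIT
        if ch in XML_WHITESPACE_CHARS:
            flags |= XML_WHITESPACE
        if ch in XML_DELIMITER_CHARS:
            flags |= XML_DELIMITER
        table[code_unit] = flags
    return table


BMP_TABLE = build_bmp_table()


def properties(code_unit):
    """The proposed operation: one table lookup, no branching."""
    return BMP_TABLE[code_unit]
```

A parser would then test bits instead of comparing characters one by one, for example `properties(ord("<")) & XML_DELIMITER`. A hardware version would presumably expose many more properties; four are enough to show the shape.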
If I were Intel or AMD, and looking for a way to add value to my CPUs, I would look into building the Unicode character tables especially to speed up XML processing.
Derek Denny-Brown made a good point on XML-DEV:
Most of the CPU cost of parsing
is related to the abstract model of XML, not the text parsing: Duplicate
attribute detection, character checking, namespace resolution/checking.
Every binary-xml implementation I have researched which improves CPU
utilization does so by skipping checks such as these. At that point you
are no longer talking about XML.
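To illustrate the kind of check the quote above refers to, here is a rough Python sketch of duplicate attribute detection with namespaces taken into account. The function names and the `namespaces` dictionary (prefix to URI) are hypothetical, invented for this example; a real parser would keep its in-scope namespace bindings in its own structures.

```python
def expanded_name(qname, namespaces):
    """Map a qualified attribute name to a (namespace URI, local name) pair."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # Unprefixed attributes are in no namespace.
        return ("", qname)
    if prefix not in namespaces:
        raise ValueError("unbound namespace prefix: " + prefix)
    return (namespaces[prefix], local)


def find_duplicate_attribute(attribute_names, namespaces):
    """Return the first clashing pair of attribute names, or None."""
    seen = {}
    for qname in attribute_names:
        key = expanded_name(qname, namespaces)
        if key in seen:
            return (seen[key], qname)
        seen[key] = qname
    return None
```

Note that `a:href` and `b:href` clash when both prefixes are bound to the same URI, which is why this cannot be done by comparing the raw text of the names. None of this work goes away when the bytes on the wire change.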
Of course, Unicode is evolving. But nowadays only on the fringes, and really only outside the Basic Multilingual Plane (BMP: the first 65,536 code points). XML delimiter-based parsing is quite cheap: at any one time, there are usually only two significant characters to look for. These are & or < in data content, " or & (or ' or &) in attribute values, ] in CDATA content, whitespace or > in tag names, and whitespace or = in attribute names.
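As a rough sketch of how little the scanner needs to look for, the contexts listed above can be written as one small pattern each. The context names and the `next_delimiter` function are my own labels for this illustration, and the sets follow the list above literally (so, for instance, the / of an empty-element tag is not handled).

```python
import re

# One pattern per parsing context, following the delimiters listed above.
DELIMITERS = {
    "content": re.compile(r"[&<]"),
    "attr_double": re.compile(r'["&]'),
    "attr_single": re.compile(r"['&]"),
    "cdata": re.compile(r"\]"),
    "tag_name": re.compile(r"[ \t\r\n>]"),
    "attr_name": re.compile(r"[ \t\r\n=]"),
}


def next_delimiter(text, start, context):
    """Return the index of the next significant character, or -1 if none."""
    match = DELIMITERS[context].search(text, start)
    return match.start() if match else -1
```

Everything between `start` and the returned index can be copied or skipped in one go, which is where the cheapness comes from.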
It is the characters that indicate malformed XML that add checking cost: finding !@#$%^*()_+={}[];;"',<?/ or any other non-name character in an element or attribute name. XML pairs its Draconian error handling with trivial inspectability of the data: this is congenial for programmers, in comparison to a binary format which may not have enough hints to allow meaningful reconstruction of the file for inspection. (Add comment about babies and bathwater here.)
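For a feel of what that name checking involves, here is a small Python sketch. The character ranges are the ones in the Name production of XML 1.0 (Fifth Edition), which I use because they are compact; earlier editions define names through much longer character-class tables. The function name `first_illegal_index` is made up for this example.

```python
import re

# NameStartChar and NameChar ranges from XML 1.0 (Fifth Edition).
NAME_START = (
    r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_REST = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
NAME_PREFIX = re.compile("[" + NAME_START + "][" + NAME_REST + "]*")


def first_illegal_index(name):
    """Return the index of the first character that breaks the name, or -1."""
    match = NAME_PREFIX.match(name)
    end = match.end() if match else 0
    return -1 if name and end == len(name) else end
```

So `first_illegal_index("price$")` points at the `$`, while `first_illegal_index("xml:lang")` returns -1. It is exactly this per-character range test that a built-in property table would turn into a single lookup.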
Perhaps the rise of East Asian economic power may also have some impact here: when most CPUs drove PCs with ASCII documents, there was little reason to think about hardware support for large-character-set property tables. Now that everyone, notably XML, has converged on Unicode, and that China/Korea/Japan/Taiwan are such big players, this might be a useful feature.