Wrong in every way
Related link: http://www.xml.com/pub/a/2004/07/21/dive.html
Mark Pilgrim is usually excellent, so I was perplexed to read his latest XML.COM article "XML on the Web has failed" in which he uncovers the horrible fact that XML is unreliable when used unreliably. Knock me down with a feather! Mr Pilgrim could correct his article by substituting text/xml for xml in most places, and by removing the snide-seeming opening and closing comments.
To set the scene, here are some quotes from the IETF discussions on RFC 3023, five years ago:
- Most XML objects have no business being registered under text... (There is a) lack of ability to dispatch on parameters in most applications that support MIME. (Ned Freed)
- Users need application/xml rather than text/xml to ensure end-to-end integrity because the out-of-band approach has so far failed to provide that integrity... Out-of-band signalling of the encoding of a file is to some extent a hack to cope with formats that are not adequately self-describing. I would have no problem with removing text/xml entirely (Rick Jelliffe)
- Fallbacks to text/plain are not useful in general, and fallbacks to 7-bit US-ASCII risk mangling the data. (Chris Lilley)
Pilgrim's argument is roughly:
- XML must be served as text/xml, and with the correct charset parameter in the MIME header to be well-formed
- But, oh no! From a large collection of RSS servers we see that about half is ASCII and the other half is wrongly labelled or not well-formed.
- Therefore all XML is broken, because XML promises you can "publish in any character encoding".
- Doh, this was something that slipped by the developers of XML and the RFCs, and which we will know better next time.
All those are wrong:
- XML should not be served as text/xml. In fact, RFC 3023 and RFC 3470 recommend against it (except for "casual" inspection purposes, not typical XML). Most recently, Sir Tim and the W3C Technical Architecture Group's latest draft guidelines make it very clear that application/* with no charset parameter should be used. Even the XML spec, which does not concern itself with transmission issues, recommends against it (for "files"). When the data is a sequence of bytes, such as a file with no metadata or application/xml, you use the encoding declaration; when the data is a sequence of characters (such as a Java String or text/xml), the encoding declaration is irrelevant. Programmers just need to be aware of whether they are dealing with bytes or characters.
- If I sampled a Japanese aggregator and then said "Because none of the sources use ASCII, ASCII has failed!" people would be surprised. Of course when Mark looks at a site with almost entirely US sources, he finds a very high incidence of ASCII data: this has absolutely nothing to do with the characteristics of XML. That 20% of RSS is badly tagged is hardly surprising, and no reflection on XML either. That the XML header and the MIME charset disagree may not actually be significant, simply because few systems use the charset parameter anyway, since it has always been unreliable: HTML browsers, for example, use a mix of information such as locale-specific defaults, the most recent pages, and byte patterns to guess the encoding used. Often, systems will just read data in using the locale-default encoding, so people are often not aware that their plumbing is wrong. Unless things have changed recently, the de facto default encoding for text/* is the system's locale-default encoding: in fact, MIME's text/* is a red flag that data may be fiddled with.
- Putting aside the bogosity of making a conclusion about all XML from one application (RSS), XML simply has never promised that you can publish reliably in any arbitrary encoding. The quotation marks around "publish in any character encoding" may give some people the impression that this must be an official goal of XML. Who is this quoting? I couldn't find it on Google. XML processors are not required to support every encoding, and character references are not available in markup anyway. It seems a bit rich to fault XML for not satisfying a promise it never made.
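The difference between the two media types can be made concrete. What follows is a minimal sketch in Python, assuming the RFC 3023 precedence rules: for text/xml the charset parameter is authoritative and its absence means US-ASCII, whereas for application/xml an absent charset leaves the decision to the document itself. The helper names `declared_encoding` and `effective_encoding` and the sample headers are hypothetical, chosen only for illustration.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration
# at the start of an ASCII-compatible byte stream.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(body: bytes):
    """Return the in-band encoding label, or None if there is none."""
    m = _DECL.match(body)
    return m.group(1).decode("ascii") if m else None

def effective_encoding(content_type: str, body: bytes) -> str:
    """Hypothetical helper: the label a receiver would trust under the
    RFC 3023 rules assumed in the text above."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset:
        # An explicit out-of-band label overrides the document.
        return charset.lower()
    if msg.get_content_type() == "text/xml":
        # text/* without a charset falls back to US-ASCII.
        return "us-ascii"
    # application/xml without a charset: the document speaks for itself.
    return (declared_encoding(body) or "utf-8").lower()

body = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'
print(effective_encoding("text/xml", body))                 # us-ascii
print(effective_encoding("text/xml; charset=utf-8", body))  # utf-8
print(effective_encoding("application/xml", body))          # iso-8859-1
```

Only the last case yields the label under which the file was actually written; in the first two, the webserver configuration silently contradicts the author of the file.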
My Life as a Dog
But some of the statements concern me directly. I first proposed the auto-detection algorithm that became XML's Appendix F. My first jobs as a programmer involved writing automatic rate-detecting UARTs for microcontrollers, and I had years of experience figuring out what information can be detected from bit patterns, watching signal traces on protocol analysers and other gizmos. Work in East Asia gave me knowledge of other encodings. I think I solved a big problem well. I take great satisfaction whenever I read people attributing XML's success in large part to its internationalization, because that is the part I influenced most (though I always hold that XML's success is just as much due to the taste of the people who were good at filtering out bad ideas as to the lightbulbs from those of us who could contribute new ideas, if you catch my drift.)
Naturally it bothers me to be informed that we were not aware of some issues that, in fact, we were actively trying to solve. How dare someone not love and adore me!
So Pilgrim could not be more wrong than when he claims that these encoding problems somehow slipped us by. It was because we already knew then that out-of-band metadata was inadequate on several levels that XML adopted the in-band labelling solution. Mark's mistake is in thinking that the charset parameter of a text/* content-type in the MIME header of a resource retrieved using HTTP can, will or should work reliably.
There are five different approaches to character encoding:
- Ostrich-like: it is all too complicated, it doesn't concern me, let the foreigners solve their own problems. Obviously, the W3C takes the 'World' in 'World Wide Web' very seriously, so this approach is a non-starter.
- Puritan: let everyone adopt UTF-8 (or UTF-16) and ban everything else. This approach was favoured by many Unicode people, but has the disadvantage of being unworkable. Legacy data and locale-specific defaults are a fact to be dealt with, not an embarrassment. (However, XML allows the best aspect of this: it provides an on-ramp to Unicode.)
- Ungenerous: let everyone adopt ASCII. This is the approach underneath, for example, the painfully slow progress towards allowing non-ASCII characters in domain names. I used to hear (even from Asians) the comment "Everyone should learn English" as the answer to encoding issues: this is little better. (Again, XML allows this to a good extent, allowing ASCII with character references.)
- Magical: waving a magic wand in an HTTP header. Some people seriously believe that the only use-case XML needs to concern itself with is as packets generated at an HTTP server and immediately used by an HTTP client. However, XML is often stored in files before and after transmission, and XML is used for configuration files. This introduces the real possibility of configuration error, because the person who creates a file may be different from the person who configures the webserver. Furthermore, typical file-reading and -writing APIs do not force the programmer to be aware of the encoding used: programmer ignorance is positively encouraged.
- Systematic: encoded data must be inseparable from encoding metadata. Given that most file systems (or APIs) do not support metadata, the only feasible way to do this is to provide in-band signalling of the metadata. Consequently, out-of-band signalling should be used only as a supplement.
This is the route that XML took: its success in neutralizing the character encoding issue is clear; whereas ten years ago the hundreds of character encodings represented a swamp that prevented interoperability, nowadays encodings are regarded as local optimizations that are just another design trade-off. It has utterly changed the landscape.
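To show how little machinery in-band signalling needs, here is a minimal sketch in Python of the kind of auto-detection described in the XML specification's Appendix F. It is a simplification under stated assumptions, not the normative algorithm: the function name `sniff_encoding` is hypothetical, `cp037` stands in for "some EBCDIC flavour", and only the first few hundred bytes are inspected.

```python
import re

# Byte-order marks. UTF-32 LE is listed before UTF-16 LE because
# its mark begins with the same two bytes.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Without a mark, the opening "<" or "<?" still betrays the family.
_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"\x4c\x6f\xa7\x94", "cp037"),  # "<?xm" in EBCDIC (assumed flavour)
]

# The encoding declaration as it appears in ASCII-compatible encodings.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    for prefix, name in _BOMS + _PATTERNS:
        if data.startswith(prefix):
            return name
    # ASCII-compatible family: the declaration names the exact encoding.
    m = _DECL.match(data[:200])
    # With no declaration at all, XML's default is UTF-8.
    return m.group(1).decode("ascii").lower() if m else "utf-8"

print(sniff_encoding(b'<?xml version="1.0" encoding="Shift_JIS"?><a/>'))
# shift_jis
print(sniff_encoding("<?xml version='1.0'?><a/>".encode("utf-16-le")))
# utf-16-le
```

Nothing here consults a header, a locale or a configuration file: the bytes carry their own label, which is the whole point of the systematic approach.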
Obviously, Pilgrim is a controversialist. But discovering that when you use XML in an unreliable way, XML is unreliable should be the source of no wonder. That out-of-band signalling of encoding (such as MIME's text/*) has a particular set of problems is not new information that we poor, ignorant, simple-minded geeks didn't know about in 1996, and which we have embarrassingly discovered since: it was our departure point then. That is why the encoding declaration exists!
The continued popularity of text/* for XML makes it important that XML processors fail when they find infeasible byte sequences; and this is the reason why XML 1.1 took a positive step to banning the C1 control characters from appearing literally. Some people think it is bad that this introduces a superficial syntactic incompatibility with XML 1.0, but the advantage in error detection warrants it.
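Both kinds of error detection can be illustrated with a short sketch in Python. It assumes the standard library's `xml.etree.ElementTree` parser as a stand-in for "an XML processor"; the helper names `parses` and `restricted_controls` are hypothetical. The first case is ISO-8859-1 data sent with no label, so the UTF-8 default applies and the stray byte is infeasible. The second is Windows-1252 data labelled as ISO-8859-1, where the curly quotes turn into C1 control characters.

```python
import xml.etree.ElementTree as ET

def parses(doc: bytes) -> bool:
    """True if the parser accepts the byte stream as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

def restricted_controls(text: str) -> list:
    """DEL and C1 controls that XML 1.1 does not allow to appear
    literally (U+0085, NEL, is exempt)."""
    return [hex(ord(c)) for c in text
            if 0x7F <= ord(c) <= 0x9F and ord(c) != 0x85]

# Latin-1 bytes with no declaration: 0xE9 cannot start a valid
# UTF-8 sequence here, so the parser refuses the document.
unlabelled = b"<name>Caf\xe9</name>"
labelled = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'
print(parses(unlabelled))  # False
print(parses(labelled))    # True

# Windows-1252 curly quotes read as ISO-8859-1 become C1 controls,
# which is exactly the symptom a stricter character range exposes.
wrongly_decoded = b"\x93quoted\x94".decode("iso-8859-1")
print(restricted_controls(wrongly_decoded))  # ['0x93', '0x94']
```

In the first case the failure is loud and immediate; in the second, nothing fails at the byte level, which is why restricting the literal C1 range is useful as an extra tripwire.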
Finally, you can lead a horse to water, but you cannot make it drink. Use application/xml not text/xml. If you don't know what encoding your system uses and so you don't know what encoding declaration to use, force it to UTF-8 and remove guesswork from the equation.
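As one way of forcing UTF-8, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The element name and sample text are hypothetical; the point is only that the writer is told the encoding explicitly instead of inheriting a locale default, and that the declaration travels with the bytes.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("name")
root.text = "Caf\u00e9 \u6771\u4eac"  # "Café" plus two kanji, as a test case

# Serialize to bytes with an explicit encoding and an encoding
# declaration, rather than relying on whatever the platform defaults to.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="utf-8", xml_declaration=True)
payload = buf.getvalue()

# The receiver needs no header and no locale knowledge to read it back.
assert ET.fromstring(payload).text == root.text
```

The resulting `payload` is what should be sent as application/xml, with no charset parameter to contradict it.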
Categories: Web
Comments (5)
Read More Entries by Rick Jelliffe.

Shrugging disagreement
ASN.1 is good for what ASN.1 was designed to do. But on its own it does not help those that want to exchange information defined to be XML (SVG, SOAP, etc), which is what binary XML is about. I also feel that ASN.1 could strongly benefit from getting the same treatment SGML got.
The newer X.69x specs certainly help in the "supporting XML" area, but in order for a binary XML format to be generally useful, it needs to have a similar (not necessarily as high of course, but still) amount of industry consensus to that which XML has. With ISO alone having three standards in that area (MPEG BiM, X.69x, soon the X3D one, and perhaps even a fourth one for GML at some point in the foreseeable future, depending on the XBC WG's output) we clearly aren't there yet -- as a whole the field is seriously balkanised.
When the XBC WG finishes its work next March, I hope that we will have laid out clear criteria to allow people to make informed choices in that area, so that even if the W3C decides to not produce a Recommendation for binary interchange of XML information, it will become easier to trim down the amount of dissent in the industry.
Shrugging disagreement
Heh. I can understand, since I've also sometimes tried to curb my sharp tongue. I probably didn't mean to complain that you were too nice, but more to say I was yearning for someone to be as frank as the situation warranted.
As far as binary XML goes, I suppose I should ask the Simon St.Laurent question: what about ASN.1?
--Uche
Shrugging disagreement
Zounds! It seems that whenever I speak bluntly people are unhappy, and whenever I try to be nice someone will complain ;)
To be honest, I went for the low-key approach because it's not my forte topic and I had no time to look up the hammering references, yet knew enough to rebuke and saw no one else doing it. I'm glad Rick picked up the ball though, he did a far better job than I did.
I shall now hop back to characterizing the need, or lack thereof, for binary XML. I promise, no matter what happens, no one will ever be allowed to transmit it as text/* :)
"Wrong in every way" just about says it
Yes, Murata-san is always more interested in finding positive ways forward, rather than criticizing. You can see this with his promotion of RELAX NG, for example: not strident, but persistent and visionary.
"Wrong in every way" just about says it
I followed Berjon's link to Pilgrim's article last week, and I was also astonished at how riddled with inaccuracies and exaggerations the article was. The snide and smarmy tone is really just a fluffy topping on such a fundamentally incorrect set of premises and conclusions.
I was also amazed at how mild most criticisms of the article were, from Berjon's shrugging disagreement to Murata-san's measured rebuke. I was waiting for someone to give no quarter and tell it like it really is, and I'm glad you beat me to the punch. I knew from general experience where Pilgrim's falsehoods generally lay, but I would have needed some research to have been as precise as you are in debunking the article.
Anyway, for my part, I think it's just an indication of XML's success that the "XML is doomed" prophecies are heating up. The funny thing is how few of the prophets of doom seem to get the right order of things (IMO). If XML is indeed doomed, it is not because of the superlative first generation of XML specs, but rather because of the pernicious trend towards complexity in the second generation (XPath 2.0, XSLT 2.0, WXS, XQuery, etc.)
But at least the foundation is solid. I've yet to find a credible argument to the contrary.
--Uche