Related link: http://www.javareport.com/article.asp?id=9797

I think the most pervasive problem in XML adoption is ignorance, and even wilful sabotage, of the international foundation on which XML is built. In several recent incidents, both in my consulting work and in my OSS/community work, I have come across systems that ignore or break XML’s Unicode character model.

I’ve almost grown tired of saying it, but it is worth saying until I’ve worked through my very last nerve: the single most important aspect of XML is its character model. Ditch XML and use something else before you mess with that. A tremendous amount of damage is done by people who can’t see past the pointy brackets as the point of XML.

Yes, Unicode is hard. There is nothing to be done about this. We have a myriad of languages, writing systems and local conventions, and they complicate just about everything. That’s our wacky, wondrous world for you. Nevertheless, as a software professional in this age, there is no excuse not to buckle down and learn the rigors of i18n. I don’t mean to be a pedant about this: I know a lot less about i18n than I wish I did, and I fall short of good i18n in much of my code. However, I respect the problem, and I strive to work on my skills in the area and on my discipline in applying them in software development.

If you use XML in your work, please read “The skew.org XML Tutorial. A reintroduction to XML with an emphasis on character encoding”, by Mike Brown (a truly brilliant article). You might also want to check out my article “Proper XML Output in Python”. Even if you’re not a Python programmer, you might find some use in its discussion of common character problems when generating XML.
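
To give a flavor of the sort of character problems I mean, here is a minimal sketch of careful XML output in Python. It is not code from either article. The helper names (`strip_illegal_xml_chars`, `serialize_note`) and the `note` element are made up for illustration, and it assumes Python 3.8 or later for the `xml_declaration` argument. Silently dropping illegal characters is just one possible policy; you may well prefer to raise an error instead.

```python
import re
import xml.etree.ElementTree as ET

# Characters that XML 1.0 does not allow in a document at all: C0 controls
# other than tab, LF and CR, plus surrogates and U+FFFE/U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text):
    """Drop characters that can never appear in XML 1.0 (hypothetical helper)."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def serialize_note(text, encoding="utf-8"):
    """Serialize text inside a <note> element, returning bytes.

    The serializer handles markup escaping and the encoding declaration,
    so the declared encoding always matches the actual bytes.
    """
    root = ET.Element("note")
    root.text = strip_illegal_xml_chars(text)
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


if __name__ == "__main__":
    # Non-ASCII text, a stray NUL and markup characters, all in one string.
    print(serialize_note("caf\u00e9 \x00<ok>"))
    # With an ASCII target, unencodable characters become character references.
    print(serialize_note("caf\u00e9", encoding="us-ascii"))
```

The point of the design is to keep text as Unicode until the very last step, and to let the serializer do the escaping and encoding rather than gluing byte strings and pointy brackets together by hand.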