Understand Encodings in XML? This true case is a good test...
If you can diagnose this, I dub you a Hero of the XML Revolution. Hint: it is probably the most common problem for English-language XML documents.
This story is true; guess what the problem is.
- You have an XML file "x.xml" on your file system, with an XML encoding declaration saying UTF-8. The file is not XHTML, just some home-made elements.
- You are demonstrating your spiffy XML application to a hushed crowd.
- When you open the XML file in IE, it stops and gives a cryptic error message about a particular location. This is severely embarrassing, so you phone me and give me a vague report.
- You tell me that when you open the file in Firefox, it loads without error, but at the point that IE identified there is an odd character: a black triangle with a white question mark.
- When you open the file in Topologi Markup Editor, it reports "Illegal character found: FFFD" and has a processing instruction with FFFD at that point.
- When I check out that location in the hex editor for Topologi, I see a non-ASCII byte xA0 in among all the ASCII bytes (which are all less than x80).
Have you figured it out? Which tools have the right behaviour? The answer is below, after the space.
The problem is that the file has a byte pattern, the solitary xA0 byte, that is not possible in UTF-8. This byte is the non-breaking space code in most or all of the ISO 8859 character sets, and in their extended forms used in Windows and IBM systems. UTF-8 has the property that bytes with the high bit set are never solitary. So we can diagnose two possibilities: either the encoding header is wrong and the file is actually "ISO8859-1" or whatever (simply fixed), or the file was UTF-8 and someone edited it using an ISO8859-1 text editor, which has corrupted the file (which is trickier to fix).
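As a rough illustration of the diagnosis, here is a minimal Python sketch that finds the first byte that cannot be part of valid UTF-8. The sample bytes and the function name `first_invalid_utf8_offset` are made up for this example; they are not from the original file.

```python
# Hypothetical sample: ASCII text with one stray 0xA0 byte, as in the story.
data = b"<p>price:\xa0100</p>"


def first_invalid_utf8_offset(raw: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        return err.start
    return None


offset = first_invalid_utf8_offset(data)
print(offset, hex(data[offset]))

# Read as ISO 8859-1 instead, the same byte is simply a non-breaking space.
as_latin1 = data.decode("iso-8859-1")
print(as_latin1[offset] == "\u00a0")
```

If the function returns an offset, the file is not the UTF-8 it claims to be; whether the header or the content is at fault still has to be decided by a human.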
IE is correct for saying there is a well-formedness error and stopping at that point.
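The same halting behaviour can be seen in other conforming parsers. The sketch below uses Python's standard `xml.etree.ElementTree` (which sits on top of expat); the `<note>` document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declared UTF-8, but containing a solitary 0xA0 byte.
doc = b'<?xml version="1.0" encoding="UTF-8"?>\n<note>non\xa0breaking</note>'

try:
    ET.fromstring(doc)
    well_formed = True
except ET.ParseError as err:
    well_formed = False
    print("not well-formed at (line, column):", err.position)

# The "simple fix" case: correct the declaration and the same bytes parse.
fixed = doc.replace(b"UTF-8", b"ISO-8859-1")
text = ET.fromstring(fixed).text
```

Changing only the declaration is the right fix when the header was wrong all along; it is the wrong fix if the file is otherwise genuine UTF-8 with one corrupted spot.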
Where does the black triangle with question mark come from? It is the standard glyph (picture) for the Unicode character U+FFFD, the replacement character, with the semantic "buggered if I know". Firefox must be using a transcoder that does not halt on encoding errors, but replaces each bad sequence with this character. This is fair enough for browsing HTML, but not right for XML. But it is much better than the behaviour of old transcoders, which would silently strip the character or replace it with '?'.
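The three transcoder behaviours described here (halt, substitute U+FFFD, silently strip) map onto the error modes of Python's built-in decoder. This is only an analogy with a made-up byte string; it says nothing about how Firefox or IE are actually implemented.

```python
raw = b"caf\xa0au lait"  # hypothetical bytes with one solitary 0xA0

# 1. Strict: halt at the bad byte (what an XML processor should do).
try:
    raw.decode("utf-8")
    halted = False
except UnicodeDecodeError:
    halted = True

# 2. Replace: substitute U+FFFD, the replacement character.
replaced = raw.decode("utf-8", errors="replace")

# 3. Ignore: silently drop the bad byte, like the old transcoders.
stripped = raw.decode("utf-8", errors="ignore")

print(halted, repr(replaced), repr(stripped))
```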
The transcoder in Topologi gives the warning and puts an inline PI to show where the error occurred. But, as is typical, it gets the information from the transcoder in the form of the Unicode character FFFD not in the form of the bad byte. But this is correct behaviour for a markup editor, which deals in text not WF XML or the infoset.
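A transcoder does not have to lose the bad byte. As a sketch of the alternative, Python lets you register a decoding error handler that records the offending bytes before substituting U+FFFD. The handler name `report_and_replace` and the `bad_bytes` list are inventions for this example and have nothing to do with Topologi's internals.

```python
import codecs

bad_bytes = []  # (offset, byte value) pairs recorded by the handler


def report_and_replace(err):
    """Hypothetical handler: note the bad bytes, then emit U+FFFD."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    for i in range(err.start, err.end):
        bad_bytes.append((i, err.object[i]))
    # Return the replacement text and the position to resume decoding at.
    return ("\ufffd", err.end)


codecs.register_error("report_and_replace", report_and_replace)

text = b"abc\xa0def".decode("utf-8", errors="report_and_replace")
print(repr(text), [(off, hex(b)) for off, b in bad_bytes])
```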
Did you get that? If you didn't, don't panic. But this is a very common problem. If you are receiving data from an AJAX system with forms, you may well get data with the wrong header. Using UTF-8 as the default is better than using ISO 8859-1 as the default, because a UTF-8 transcoder will tell you when something is screwy, while you can expect an ISO 8859-1 or CP1252 transcoder to be as silent as the grave and promiscuous as the town bike.
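The asymmetry between the two defaults is easy to demonstrate. In this Python sketch the word "naïve" is just an arbitrary test string. ISO 8859-1 is used rather than CP1252 because Python's own `cp1252` codec happens to reject a handful of unassigned bytes.

```python
# ISO 8859-1 assigns a character to every byte value, so decoding never fails.
every_byte = bytes(range(256))
decoded_all = every_byte.decode("iso-8859-1")

# UTF-8 data read as ISO 8859-1: no error, just quietly wrong text.
utf8_bytes = "na\u00efve".encode("utf-8")      # b'na\xc3\xafve'
mojibake = utf8_bytes.decode("iso-8859-1")     # two junk characters for one

# ISO 8859-1 data read as UTF-8: the decoder complains.
latin1_bytes = "na\u00efve".encode("iso-8859-1")  # b'na\xefve'
try:
    latin1_bytes.decode("utf-8")
    complained = False
except UnicodeDecodeError:
    complained = True

print(len(decoded_all), repr(mojibake), complained)
```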
Different applications handle this problem in different ways. Many encoding libraries still do not generate errors or exceptions that the XML processor can use to report the well-formedness error of a bad encoding. Some people dislike that XML requires that a WF XML document should have no encoding errors; they are typically people who can write their names using ASCII characters only.
Categories: Web
Read more entries by Rick Jelliffe.

Encoding Aspirations
"Aspirational encoding" made me laugh, Tony :-)
The markup equivalent of the insane comefrom statement (http://www.fortran.com/fortran/come_from.html) perhaps?
Often a result of XSLT
I have seen bare #160s in content most often as a result of XSLT processing. Someone is building an HTML document and wants an &nbsp; but they don't want to declare a DOCTYPE, so instead of doing a literal result in the XSLT, they use &#160; or the like. When processed, this character comes directly across in the output.
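What "comes directly across" depends on the output encoding. This small Python sketch (not XSLT, and only an illustration) shows the bytes that U+00A0 turns into under different serializations.

```python
nbsp = "\u00a0"  # the character that &#160; refers to

as_utf8 = nbsp.encode("utf-8")         # two bytes in UTF-8
as_latin1 = nbsp.encode("iso-8859-1")  # the solitary A0 byte

# Serializing to ASCII with character references keeps the output safe
# whatever encoding the reader later assumes.
as_charref = nbsp.encode("ascii", errors="xmlcharrefreplace")

print(as_utf8, as_latin1, as_charref)
```

A bare A0 byte only becomes a problem when the output is written as ISO 8859-1 (or similar) but labelled, or assumed to be, UTF-8.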
Encoding Aspirations
I don't think so -- it really separates people who have never written anything other than US-ASCII from the rest (the majority) of the world.
Encoding Aspirations
What I have found is the people routinely put "UTF-8" at the top of XML files, even if they have edited them in a plain-text non-UTF editor. So they are providing an "aspirational" setting for the encoding, rather than an "actual" setting.
I suppose what this really tells you is that too many developers are still doing XML by copying examples, without really understanding what they are doing. Understanding character encoding is definitely something that seems to separate people who can diagnose and fix XML issues from people who can't.
Cheers, Tony.
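One way to catch an "aspirational" declaration before it embarrasses anyone is to check the bytes against the declared encoding. The Python sketch below is a rough check under stated assumptions: the helper name `declared_encoding_is_honest` is invented, it only handles ASCII-compatible encodings (no BOM or UTF-16 sniffing), and it falls back to UTF-8 when there is no declaration.

```python
import re

# Matches the encoding pseudo-attribute in an XML declaration.
DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding_is_honest(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly in the declared encoding."""
    m = DECL.match(raw)
    encoding = m.group(1).decode("ascii") if m else "utf-8"
    try:
        raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Bad bytes, or an encoding name Python does not recognise.
        return False
    return True


good = '<?xml version="1.0" encoding="UTF-8"?><a>x\u00a0y</a>'.encode("utf-8")
bad = b'<?xml version="1.0" encoding="UTF-8"?><a>x\xa0y</a>'
print(declared_encoding_is_honest(good), declared_encoding_is_honest(bad))
```

Note that this can only catch a false claim of UTF-8 (or another strict encoding). A file that declares ISO 8859-1 will always pass, for the reason given in the article.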