Hello, my name is Michael Day and I’m here to blog about XML, CSS, web standards, declarative programming, Unicode and other topics of interest to XML.com readers. Since a lengthy biography of me is not one of these topics, I shall limit myself to one sentence: I am the founder of YesLogic and the designer of Prince, an XML + CSS formatter and a great way of getting web content onto paper.
Now that we’ve got that out of the way, I would like to get straight into talking about XML parsing and Unicode encodings. In Prince we use libxml2 for all of our XML and HTML parsing needs, and have been very happy with it. However, it’s always interesting to see new approaches to XML parsing that may offer greater speed or convenience than existing methods.
State Machines
Last year, Tim Bray released the state machine that he used for parsing XML in Lark. His state machine operates on Unicode characters (technically character classes in some cases), so it requires a separate decoding step to turn the incoming stream of bytes into characters first. What about parsing XML with a state machine that operates directly on bytes instead of characters? Would that lead to any opportunities for clever performance optimisations?
Now, there is a good reason for defining an XML state machine in terms of Unicode characters: it means that you only need to define it once, whereas if you define it in terms of bytes then you will need to define it multiple times, once for each encoding that you wish to support. However, given a state machine that operates on characters and the definition of an encoding, it should be possible to programmatically generate an equivalent state machine for that encoding that operates directly on bytes, so we can pretend that this issue is not too serious.
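To make the idea of generating a byte-level machine a little more concrete, here is a rough sketch in Python. It is not how Lark or any real parser works: the two-state toy table, the function names `to_byte_machine` and `run`, and the tuple-shaped intermediate states are all assumptions made for illustration. It also handles only individual characters, whereas a real XML machine works with character classes and ranges, which would need considerably more care.

```python
def to_byte_machine(char_table, encoding):
    """Expand a character-level transition table into a byte-level one.

    char_table maps state -> {character: next_state}.
    The result maps state -> {byte value: next_state}. States reached in
    the middle of a multi-byte character are (source_state, bytes_so_far)
    tuples, so characters sharing a byte prefix share those states.
    """
    byte_table = {}
    for state, edges in char_table.items():
        for char, target in edges.items():
            encoded = char.encode(encoding)
            current = state
            for i, byte in enumerate(encoded):
                is_last = i == len(encoded) - 1
                nxt = target if is_last else (state, encoded[: i + 1])
                byte_table.setdefault(current, {})[byte] = nxt
                current = nxt
    return byte_table


def run(byte_table, start, data):
    """Feed bytes through the generated table and return the final state."""
    state = start
    for byte in data:
        state = byte_table[state][byte]
    return state


# Hypothetical toy machine: are we inside a tag or in text content?
char_table = {
    "text": {"<": "tag", "a": "text", "\u00e9": "text"},
    "tag": {">": "text", "a": "tag"},
}

utf8_machine = to_byte_machine(char_table, "utf-8")
# "utf-16-le" is used rather than "utf-16" so that no BOM is emitted.
utf16_machine = to_byte_machine(char_table, "utf-16-le")

sample = "a\u00e9<a"
state_utf8 = run(utf8_machine, "text", sample.encode("utf-8"))
state_utf16 = run(utf16_machine, "text", sample.encode("utf-16-le"))
```

Both generated machines should end in the same state as the character-level machine would for the same input. The cost is visible in the number of states: every multi-byte character adds intermediate states to the byte-level table.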
When I sat down and tried sketching out a simple XML state machine that operates on bytes I immediately hit a snag in Appendix F.1 of the XML specification: Detection Without External Encoding Information. The problem occurs when an XML parser examines the document to determine its encoding. If the state machine starts reading the document and finds that the first byte is “FF” and the second is “FE”, what state should it be in? Ideally, it should be able to say that it has just read a little-endian UTF-16 byte order mark, and continue to parse the document. However, if the following two bytes are both zero, then it means that the “FF FE” was actually the start of a little-endian UTF-32 byte order mark. Checking this requires two bytes of look-ahead, or hundreds of additional states and transitions in the state machine, which sucks. A much cleaner solution springs to mind: don’t bother supporting UTF-32. Who uses it, anyway?
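For comparison, here is roughly what the look-ahead version looks like when done as a separate step before parsing, rather than inside the state machine. This is only a sketch in Python: the function name `sniff_bom` and its return shape are made up for illustration, and it covers byte order marks only, not the other cases in the specification's detection table (such as documents that begin with “<?xml” and no byte order mark).

```python
import codecs

# Longer byte order marks must be tried before shorter ones that are
# their prefixes: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(prefix):
    """Return (encoding name, BOM length) for the first bytes of a document.

    Returns (None, 0) when no byte order mark is recognised. The caller
    must supply at least four bytes for the UTF-32 check to be reliable.
    """
    for bom, name in BOMS:
        if prefix.startswith(bom):
            return name, len(bom)
    return None, 0


utf16_doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
utf32_doc = codecs.BOM_UTF32_LE + "<a/>".encode("utf-32-le")
```

Treating “FF FE 00 00” as UTF-32 relies on the fact that a UTF-16 document cannot legally continue with U+0000 after the byte order mark, since that character is not allowed in XML.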
Anyone use UTF-32? Anyone at all?
So I have a question for you, the reader: do you ever use the UTF-32 encoding for your XML documents? I’m thinking the answer is no, given that most software defaults to UTF-8, UTF-16, or one of the regional encodings, but if you do use UTF-32 and feel that you have a damn good reason for doing so then I’d like to hear it.


I've never even heard of UTF-32 anywhere. Although, a quick Google returns about 371,000 results. Yikes! The question I have in mind suddenly is, WHY would anyone need UTF-32? I'll have to go see what it's good for. Personally, I probably won't ever use it.
Michael,
Welcome to XML.com and I hope to follow your columns with great pleasure.
I don't get the problem really, from the spec:
"All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1"
It is true that most support more than that, but basically most serious projects have rules about exchanging XML in UTF-8 so as to increase interoperability (in my experience, at any rate).
So my suggestion is: support UTF-8 and UTF-16.
Of course, also from the spec: "Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them." My suggestion is that, outside of publishing - which I suppose you have a lot to do with - entities don't seem to be used that much anymore. Unfortunate, but people seem to want the ability to refer to the external resource via markup.
So this started off interesting, parsing XML with a state machine, but then you stopped on an artificial barrier.
Any more ruminations on this subject would be interesting.
Especially using Raven to generate the state machine: then you could write the state machine once and compile Java, C, and Ruby versions of it.
Hi David,
I will have more to say on this topic at a later date. I first wanted to answer the question "can you generate an XML parser by applying an XSLT transform to a description of the XML grammar expressed in XML?"; it seems that the answer is "Yes, but XSLT really isn't very convenient for this sort of thing" so I've gone back to the drawing board for the time being.
By the way, what is Raven?