Hello, my name is Michael Day and I’m here to blog about XML, CSS, web standards, declarative programming, Unicode and other topics of interest to XML.com readers. Since a lengthy biography of me is not one of these topics, I shall limit myself to one sentence: I am the founder of YesLogic and the designer of Prince, an XML + CSS formatter and a great way of getting web content onto paper.

Now that we’ve got that out of the way, I would like to get straight into talking about XML parsing and Unicode encodings. In Prince we use libxml2 for all of our XML and HTML parsing needs, and have been very happy with it. However, it’s always interesting to see new approaches to XML parsing that may offer greater speed or convenience than existing methods.

State Machines

Last year, Tim Bray released the state machine that he used for parsing XML in Lark. His state machine operates on Unicode characters (technically character classes in some cases), so it requires a separate decoding step to turn the incoming stream of bytes into characters first. What about parsing XML with a state machine that operates directly on bytes instead of characters? Would that lead to any opportunities for clever performance optimisations?

Now, there is a good reason for defining an XML state machine in terms of Unicode characters: it means that you only need to define it once, whereas if you define it in terms of bytes then you will need to define it multiple times, once for each encoding that you wish to support. However, given a state machine that operates on characters and the definition of an encoding, it should be possible to programmatically generate an equivalent state machine for that encoding that operates directly on bytes, so we can pretend that this issue is not too serious.
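
To make the idea of generating a byte-level machine a little more concrete, here is a rough Python sketch. It is a toy under stated assumptions, not how Lark or any real parser does it: the character machine is assumed to be a plain dictionary from (state, character) to next state, with no character classes, and the names to_byte_machine and run are made up for this illustration. Each multi-byte character is expanded into a chain of intermediate states, one per leading byte.

```python
def to_byte_machine(char_transitions, encoding):
    """Expand a character-level transition table into a byte-level one.

    char_transitions maps (state, character) -> next state.
    The result maps (state, byte value) -> next state.
    Assumption: the encoding name has no implicit BOM (e.g. "utf-16-le",
    not "utf-16"), so each character encodes to just its own bytes.
    """
    byte_transitions = {}
    for (state, ch), target in char_transitions.items():
        encoded = ch.encode(encoding)
        current = state
        for i, b in enumerate(encoded):
            is_last = i == len(encoded) - 1
            # Intermediate states are named by the origin state plus the
            # bytes consumed so far, so characters that share a byte
            # prefix also share the same intermediate states.
            nxt = target if is_last else (state, encoded[: i + 1])
            byte_transitions[(current, b)] = nxt
            current = nxt
    return byte_transitions


def run(byte_transitions, state, data):
    """Feed raw bytes through the byte-level machine, one byte at a time."""
    for b in data:
        state = byte_transitions[(state, b)]
    return state


# A two-transition toy machine, expanded for two different encodings.
chars = {("start", "<"): "tag", ("start", "é"): "text"}
utf8_machine = to_byte_machine(chars, "utf-8")
utf16_machine = to_byte_machine(chars, "utf-16-le")
```

With this toy table, run(utf8_machine, "start", "é".encode("utf-8")) ends in the "text" state after passing through one intermediate state. A real XML machine would need character classes (byte ranges rather than single characters), which this sketch leaves out.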

When I sat down and tried sketching out a simple XML state machine that operates on bytes I immediately hit a snag in Appendix F of the XML specification: Detection Without External Encoding Information. The problem occurs when an XML parser examines the document to determine its encoding. If the state machine starts reading the document and finds that the first byte is “FF” and the second is “FE”, what state should it be in? Ideally, it should be able to say that it has just read a little-endian UTF-16 byte order mark, and continue to parse the document. However, if the following two bytes are both zero, then it means that the “FF FE” was actually the start of a little-endian UTF-32 byte order mark. Checking this requires two bytes of look-ahead, or hundreds of additional states and transitions in the state machine, which sucks. A much cleaner solution springs to mind: don’t bother supporting UTF-32. Who uses it, anyway?
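
For comparison, here is roughly what the look-ahead approach looks like outside a state machine, as a small Python sketch. The function name sniff_bom and the encoding labels are my own choices for illustration, and it only covers byte order marks, not the rest of the Appendix F detection rules. The important detail is the ordering: the UTF-32 marks have to be tried before the UTF-16 ones, precisely because “FF FE” is a prefix of “FF FE 00 00”.

```python
from typing import Optional, Tuple

# Longest marks first: the little-endian UTF-32 mark starts with the
# same two bytes as the little-endian UTF-16 mark.
BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding label, BOM length), or (None, 0) if no BOM is found."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name, len(mark)
    return None, 0
```

Given the bytes FF FE 3C 00 this reports little-endian UTF-16 with a two-byte mark, while FF FE 00 00 3C 00 00 00 is reported as little-endian UTF-32 with a four-byte mark. Dropping the two UTF-32 entries from the table is all it takes to remove the ambiguity, which is exactly the temptation described above.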

Anyone use UTF-32? Anyone at all?

So I have a question for you, the reader: do you ever use the UTF-32 encoding for your XML documents? I’m thinking the answer is no, given that most software defaults to UTF-8, UTF-16, or one of the regional encodings, but if you do use UTF-32 and feel that you have a damn good reason for doing so then I’d like to hear it.