Here is XML’s grammar expressed as an automaton in XML syntax. It is based on Tim Bray’s Lark grammar, which Tim has recently released.
I’ve XML-ified it to give a headstart to anyone interested in making their own XML processor: you could write an XSLT script to generate code for example. The XML version is based on a finite state machine, but with a couple of extra attributes that act on stacks. There are various Lark-specific actions included in the transitions: these provide a good guide for some of the processing required for a complete XML processor. However, of course you might prefer to annotate it with your own actions.
A big thanks to Tim for the original productions!
There are plenty of gaps you need to fill: character encoding, newline handling, entity handling, start- and end-tag matching, attribute defaulting, and validation. But the grammar is a nice big chunk; I hope an open source version of the the automaton will make it easier for Desparate Perl Hackers (and other significant figures from mythology) to get a more complete implementation within their particular constraints. But especially, I hope it will help spur experiementation with efficient implementation techniques. Not so much because I want to nobble “Binary XML” but because we need efficient XML even if we have Efficient XML Interchange.


Rick
What do users gain by having an XML representation of the states? Are there any available tools that can take advantage of the XML representation?
You could use it with XSLT to generate code, for example. Or have it in a DOM and build a FSM/PDA visitor. You could even create an XSLT stylesheet to generate ANTLR code or even Bison/Lex code, I suppose!
In the case of Tim's original format, there is no off-the-shelf tool either, so the XML format improves things a little at the syntax level only.
Modern Pentiums have deep pipelines which make it very difficult to know whether one algorithm will necessarily perform better than another (of the same order.)
For example, is it better for a table-based FSM to have byte tables and a conditional jump based on their value or have jump tables (where the table contains the branch address)? The byte-based takes a quarter the size, but cache lines decrease the advantage to some extent, and there is an test or branch. And different processors act differently. So empirical testing is required to choose from several different candidates, and to make generating alternative parsers feasible we need to have a headstart like the XML-in-XML, I think. Now, I am not embarking on this project...but if I were then XML-in-XML or equivalent would be necessary to cut down the workload. (Also, it makes program logic clearer, as little languages tend to do.)