Here is XML’s grammar expressed as an automaton in XML syntax. It is based on Tim Bray’s Lark grammar, which Tim has recently released.

I’ve XML-ified it to give a head start to anyone interested in making their own XML processor: you could, for example, write an XSLT script to generate code from it. The XML version is based on a finite state machine, but with a couple of extra attributes that act on stacks. There are various Lark-specific actions included in the transitions: these provide a good guide to some of the processing required for a complete XML processor. Of course, you might prefer to annotate it with your own actions.
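
To give a feel for what a state machine with stack attributes looks like in practice, here is a minimal sketch in Python. Note that the element and attribute names (automaton, state, transition, on, to, push, pop) are invented for this illustration and are not the names used in the downloadable file or in Lark; the toy language it recognises (runs of "a" with nested parenthesised runs of "b") is likewise just an assumption chosen to keep the table small. The point is only that a push attribute can save a state to come back to, and a pop attribute can return to it.

```python
import xml.etree.ElementTree as ET

# Hypothetical automaton format, invented for this sketch.
# It is NOT the schema of the real file.
AUTOMATON_XML = """
<automaton start="outer">
  <state name="outer">
    <transition on="a" to="outer"/>
    <transition on="(" to="inner" push="outer"/>
  </state>
  <state name="inner">
    <transition on="b" to="inner"/>
    <transition on="(" to="inner" push="inner"/>
    <transition on=")" pop="true"/>
  </state>
</automaton>
"""


def load_automaton(xml_text):
    """Build a lookup table: (state, char) -> (to, push, pop)."""
    root = ET.fromstring(xml_text)
    table = {}
    for state in root.findall("state"):
        for t in state.findall("transition"):
            key = (state.get("name"), t.get("on"))
            table[key] = (t.get("to"), t.get("push"), t.get("pop") == "true")
    return root.get("start"), table


def accepts(xml_text, text):
    """Run the automaton over text, one character at a time."""
    start, table = load_automaton(xml_text)
    state, stack = start, []
    for ch in text:
        move = table.get((state, ch))
        if move is None:
            return False          # no transition for this input
        to, push, pop = move
        if push is not None:
            stack.append(push)    # remember the state to return to
        if pop:
            if not stack:
                return False      # nothing to return to
            state = stack.pop()   # resume the saved state
        else:
            state = to
    # Accept only if we are back at the start with nothing pending.
    return state == start and not stack


if __name__ == "__main__":
    for sample in ["a(b(b)b)a", "a(b", "a)"]:
        print(repr(sample), accepts(AUTOMATON_XML, sample))
```

A code generator, whether written in XSLT or anything else, would walk the same kind of table and emit a switch statement or jump table instead of interpreting it at run time.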

A big thanks to Tim for the original productions!

There are plenty of gaps you need to fill: character encoding, newline handling, entity handling, start- and end-tag matching, attribute defaulting, and validation. But the grammar is a nice big chunk; I hope an open source version of the automaton will make it easier for Desperate Perl Hackers (and other significant figures from mythology) to get a more complete implementation within their particular constraints. But especially, I hope it will help spur experimentation with efficient implementation techniques. Not so much because I want to nobble “Binary XML” but because we need efficient XML even if we have Efficient XML Interchange.

Download file