Kenneth Chiu from SUNY Binghamton this week sent me a couple of exciting papers on recent techniques for XML parsing. He and his collegues have been looking at paralllelizing parsing of (largish) XML documents to suit multicore processors. One of the papers Parallel XML Parsing Using Meta-DFAs is listed as available on the ACM website and the other Simultaneous Transducers for Data-Parallel XML Parsing seems (I didn’t check) available as part of the large Proceedings 22 IEEE International Parallel and Distributed Processing Conference 2008.
I am always surprised, as I was with the recent work from Simon Fraser University on pipelined parsing techniques, that we are in 2008 and there are still new techniques cropping up.
One thing that struck me about the Chiu papers concerned an application of his parallel ideas outside of the multicore world to the world of interactive text-based XML editors: for example, Topologi’s Markup Editor or other “coloring editors.” These are hardly the glamour end of town, but interesting none the less.
Chiu etc have various ideas based on making new state-machine-like structures that allow all possible parse states of a document to be represented simultanously: these are presented through two different ideas in each paper: the meta-DFA and the simultanous finite transducer (SFT). I defer to those papers for details.
For the application I am thinking of, consider a “<" in the middle of some XML document. In a conventional XML processor, you can just parse using a simple state machine to find whether it is acting as a delimiter (or is it inside a CDATA section, etc.?) In a interactive editor, the state machine has to be much more complicated, because it also has to cope with what happens when there are errors: how do you recover and resynch? In Topologi's case, we register particular status-bar messages with every error transition, so that when the cursor is at an error location, a message is provided. (The next layer of editing above this is to provide also some kind of user interface for fixing the error.)
Now for large interactive documents, memory utilization and snappy response you really don't want to build a parse tree. So various techniques exist. James Clark's RELAX NG mode for emacs uses a check-pointing system, where at regular intervals (e.g. 1000 lines or 1k characters or whatever) the state machine state at that point is available. Jumping to a point only requires the editor to find the last checkpoint and provide a detailed parse only from there to the checkpoint after the current entry.
Making an edit invalidates following checkpoints, but because a state machine is being used, you only need to reparse text until you find the next checkpoint that agrees with your parse: at that point you are in synch. And, in fact, you only re-parse forward when you actually need it, since you don't want trivial edits to force parsing in sections that are way beyond the current display area.
Topologi's Markup Editor takes a slightly different approach, from JEdit and Java's document APIs, where each line has a checkpoint memoizing the current parse state. This reduces the amount of reparsing during editing to a minimum (in fact, we found we had to add delays to the display so that users would be aware that changes had taken place!) But there are the same optimizations: when a change is made, the rest of the lines on the screen are reparsed stopping if there is resynchronization, and an index is kept to the line off the screen to notify where there is some invalidity.
Generally, this works well, but there are a few pathological uses cases. Consider a large XML document (say 100 meg) with no comments. At the beginning the user starts typing a comment, gets as far as the open delimiter, then wants to go to some point late in the document to confirm something. In this case, all the document between the comment open delimiter and the jumped-to location needs to be reparsed, but with no real benefit.
What Chiu et al's work suggests is that you can have a kind of parallelized parse, where each potential delimiter is parsed in all possible states, and each possible transition noted. Think of it as if parsing produced a linked list which associated each delimiter with an array of parse roles and next parse states.
For example, if we had the text
<xxx> then our parse list might have a node pointing to the first “<” and a pointer to the following node which points to the “>”. The each node has an array with one entry per mode in the original DFA: what is < when you are in the prolog, what is it when you are inside a tag, what is it inside a CDATA section, what is it inside an attribute value, what is it inside data content, and so on. Then, for each of these entries there is a index to the next mode from the original DFA which is used to index in the array of the next node of the parse list. (To keep things under control, XML parsers usually have a lookahead or peek function, which reduces the number of states: the same technique can be used here.) So in our example, the < in the prolog is a start-tag open delimiter, in which case the next delimiter > should be interpreted using the entry for being inside a start tag. If we started in a comment, then the < delimiter would be just data, and the next delimiter > would be interpreted using the inside-a-comment entry.
(You would presumably use singletons for these, to keep size under control.)
So in our previous example of adding a
<!-- delimiter, then reparsing from the top of the document to the new jumped-to location only involves following the parsed delimiters. To get the line-based check-pointing only involves checkpointing the possible transitions at the start of each line together with the transitions that the lead to at the end of the line. (Other optimizations are possible, in particular for resynchronizing.)
So this can reduce the blow out for long documents (though both it and the old method are both linear, so neither suffers from combinatorial explosion). That may be a nice optimization for coloring editors, but they are not something that grabs people’s minds!
What may be more interesting is the idea that you can build an “all-possible parse” DOM on top of this parallel parse list. That has a bit more interest for editors which have, for example, on-the-fly well-formedness and validity feedback. Now I am quite aware that in most cases, a syntax-error prevention mechanism is better UI technique (for many uses): for example if you want to add a comment, you can only put it inside an element, or you have to select some range, but you can never type just an opening comment delimiter.
But for interactive editors, you already have to cope with no-well-formed transitions, so the DFA is already at the level of transition-complexity of the parallelized DFAs in Chiu’s work. What the parallel approach allows is things like saying if the document ends up being non-well-formed, we want to trace back through the transitions from a well-formed result to the nearest point to the editing point, so that WF errors only are shown for the most reduced range possible. For example, if the document was
<!-- unterminated comment <x>text</x>, then rather than showing all the document as non-well-formed with an indication that the end-of-document had been reached, it would back track to the last feasible well-formed point, which is the initial
< in this case (taking the text
!-- unterminated commentas data content when backtracking.)
The difficulty of interactive editing of XML is that errors are identified where they are found, not necessarily where they are caused. These borrowed-from-parallel techniques perhaps could allow a different approach.