Throwing our Cache Away: We need a mini-XSLT with better text processing
Related link: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag/html/scal…
I taught a couple of full-day courses last week, on XSLT and XML Schemas. The students were pretty positive banking programmers, and I only managed to disgrace myself once, when I became stroppy at a student I misheard: he was trying to be helpful, but I thought he was insanely and repeatedly insisting we should skip 120 pages of the text. Ah, the joys of fading ears. The other students rated me 5 or 4 out of 5, but he only gave me 1. This is the first time that my hearing has led to a problem like this, and it is quite saddening. (I had warned the class and I apologized when I realized what had happened, no flames please!)
Some other random thoughts: XML Schemas is a hard subject to make fun; the biggest lightbulbs above students' heads went on when I explained the priority and mode attributes of templates: the sooner they are introduced the better. Allette's course uses (by kind permission) a version of Ken Holman's diagram of XSLT axes: it really is very useful.
One piece of advice I gave (apart from "Wait a couple of years before adopting XSLT 2/XQuery 2 or anything with 'PSVI'") was the importance of caching or precompiling schemas, stylesheets and XPaths. But, at the URI above, I read today that MSDN puts this even more strongly: caching is so much the norm that not doing it represents a "failure"! I don't think this is blaming the victim, despite how it reads; it is undoubtedly good advice.
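To make the caching advice concrete, here is a minimal sketch in Python. It is not taken from the MSDN article: the class name CompiledCache and the compile_fn parameter are hypothetical, and compile_fn stands in for whatever compile step your processor offers (a function that parses and compiles a stylesheet, schema or XPath from a file). The cache recompiles only when the file's modification time changes.

```python
import os
from typing import Any, Callable, Dict, Tuple


class CompiledCache:
    """Cache compiled artifacts (stylesheets, schemas, XPaths) by file path.

    compile_fn is a hypothetical stand-in for the processor's own
    compile step; it takes a path and returns the compiled object.
    """

    def __init__(self, compile_fn: Callable[[str], Any]) -> None:
        self._compile_fn = compile_fn
        # path -> (modification time when compiled, compiled object)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, path: str) -> Any:
        mtime = os.path.getmtime(path)
        entry = self._entries.get(path)
        if entry is not None and entry[0] == mtime:
            return entry[1]                 # hit: reuse the compiled object
        compiled = self._compile_fn(path)   # miss or stale file: compile once
        self._entries[path] = (mtime, compiled)
        return compiled
```

A caller would build one cache per kind of artifact at start-up and ask it for the compiled object on every request, so the parse-and-compile cost is paid once per file version rather than once per transformation.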
But I wonder whether we would be better off with a standard, declarative language to cope with two of XSLT's real problems: first that XSLT systems are not really streamable (in the sense that a tree is never built and that the output can start to appear before the end of the input), and second that the kinds of text functions provided by XSLT are fairly weak.
Because XSLT[1,2] is the only game in town, it gets used when a more light-weight language might be adequate. I suppose the kind of language I am thinking of might use CSS selectors rather than XPath, and provide functions for renaming (but not rearranging) names, and for altering data values (but not sorting). Converting currencies and dates, adding progressive counts for numbered lists and for running totals, making headings use title case, collapsing or guaranteeing spaces between elements, suppressing elements or performing XIncludes. That sort of thing.
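To give a feel for how small such a thing could be, here is a rough prototype using the SAX classes in Python's standard library. It is only a sketch of three of the operations listed above (renaming, title case, progressive counts), not a proposal for the language itself: the names LittleFilter, title_case and transform, and the choice of an "n" attribute for the count, are all made up for illustration. No tree is built; the only text held back is the content of a heading, so that title-casing does not depend on how the parser chunks character data.

```python
import io
from xml.sax import parseString
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def title_case(text):
    """Capitalise the first letter of each space-separated word."""
    return " ".join(w[:1].upper() + w[1:] for w in text.split(" "))


class LittleFilter(XMLFilterBase):
    """Rename elements, title-case headings, add running counts."""

    def __init__(self, renames, headings, counted):
        super().__init__()
        self.renames = renames    # old element name -> new element name
        self.headings = headings  # elements whose text gets title case
        self.counted = counted    # elements given a running "n" attribute
        self.count = 0
        self.in_heading = 0       # nesting depth inside heading elements
        self.buffer = []          # heading text seen but not yet emitted

    def _flush(self):
        if self.buffer:
            super().characters(title_case("".join(self.buffer)))
            self.buffer = []

    def startElement(self, name, attrs):
        self._flush()
        if name in self.counted:
            self.count += 1
            attrs = AttributesImpl({**dict(attrs.items()), "n": str(self.count)})
        if name in self.headings:
            self.in_heading += 1
        super().startElement(self.renames.get(name, name), attrs)

    def endElement(self, name):
        self._flush()
        if name in self.headings:
            self.in_heading -= 1
        super().endElement(self.renames.get(name, name))

    def characters(self, content):
        if self.in_heading:
            self.buffer.append(content)   # hold until the text run is complete
        else:
            super().characters(content)   # pass straight through


def transform(xml_bytes, renames, headings, counted):
    """Run the filter over xml_bytes and return the serialised result."""
    out = io.StringIO()
    filt = LittleFilter(renames, headings, counted)
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    parseString(xml_bytes, filt)
    return out.getvalue()
```

For example, transform(b"<doc><head>a small title</head><li>x</li></doc>", renames={"li": "item"}, headings={"head"}, counted={"li"}) renames li to item, numbers it, and title-cases the heading. A real little language would express those three rule tables declaratively, perhaps with CSS selectors in place of bare element names.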
The discipline of requiring that you don't make choices based on forward references was what allowed OmniMark (a text processing language that dominated most real large SGML production in the 90s) to have such excellent performance. (It is important to note the distinction between forward references for data values, which can be handled by diverting data and resolving the references at the end of the document, and decisions requiring forward references.) Even today, I note that Advent/3B2's Pure typesetting system (which seems to be pretty fast: they say Boeing typesets hundreds of thousands of pages a day with it) gets a lot of its speed by assuming that the input is pretty much in publication order.
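The distinction between forward references to data values and decisions that need forward references can be illustrated with a small sketch. This is my own illustration of the diversion idea in Python, not OmniMark's actual mechanism; the names DivertingWriter, reference and resolve are hypothetical. Output streams straight through until a value that is not yet known is referenced; from then on output is held, and the held pieces are released once the value is supplied at the end.

```python
import io


class _Ref:
    """Marker for a value that will be supplied later."""

    def __init__(self, name):
        self.name = name


class DivertingWriter:
    """Stream output until a forward reference appears, then divert."""

    def __init__(self, sink):
        self.sink = sink
        self.diverted = []   # held strings and _Ref markers, in output order
        self.values = {}     # reference name -> resolved text

    def write(self, text):
        if self.diverted:
            self.diverted.append(text)   # behind a pending reference: hold
        else:
            self.sink.write(text)        # nothing pending: emit immediately

    def reference(self, name):
        """Emit a value that is not known yet."""
        self.diverted.append(_Ref(name))

    def resolve(self, name, value):
        self.values[name] = str(value)

    def close(self):
        """Release held output; raises KeyError if a reference is unresolved."""
        for piece in self.diverted:
            if isinstance(piece, _Ref):
                self.sink.write(self.values[piece.name])
            else:
                self.sink.write(piece)
        self.diverted = []


# Usage: a total printed before the items it sums.
out = io.StringIO()
writer = DivertingWriter(out)
writer.write("<report><total>")
writer.reference("total")
writer.write("</total>")
total = 0
for amount in (5, 7):
    total += amount
    writer.write("<item>%d</item>" % amount)
writer.resolve("total", total)
writer.write("</report>")
writer.close()
```

This works because the structure of the output never depends on the missing value. A decision such as "emit this element only if a later sibling exists" cannot be handled by substitution at the end, which is why a streaming language has to rule it out.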
Perhaps some tricky implementation of XSLT could figure out if a stylesheet is streamable and switch to a streaming strategy. But I suspect we would be better off with some little language that could have a very small implementation: indeed, an implementation tightly coupled to the XML parsing stack might be fastest, rather than, say, a SAX-in, SAX-out filter.
Has someone made something like this already?
Comments (4)
Read More Entries by Rick Jelliffe.

Plus ca change?
I think Tim's point there is more about XML being hard (at least that some APIs don't match convenient idioms) whereas mine is that XML will perpetually be called slow as long as we don't have a standard streaming little brother to XSLT, and one that has better (publishing oriented) formatting options.
Streaming, XSLT-like processing
Yes, STX looks good from the input side. But vanilla STX seems to have been made with the "the data is already OK" approach of XSLT: no way to format numbers (e.g. in roman numerals), no title case, no locale-sensitive casing, and so on. I will look to see if there are some extension functions.
Plus ca change?
Tim Bray wrote a similar column a year ago, that caused quite a stir:
http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog
I had a look on Google for pages linking to that to see if anyone had done anything about it, but no such luck. The most promising light on the horizon is that the Perl6 people are thinking about it...
http://groups.google.com/groups?threadm=D53FA5D4-5EF1-11D7-BB31-00050245244A%40cognitivity.com
Streaming, XSLT-like processing
http://stx.sourceforge.net/