The W3C XML Schema WG is looking at an XSD 1.1, with so far only the mildest of changes. Maybe that is for the better. At one stage I heard they were considering putting in guard statements on content model particles, using the streaming XPath subset that makes KEY/KEYREF so sad.
I was initially excited by the prospect of more Schematron-isms creeping inexorably into the infrastructure, but I changed my mind on consideration. In fact, it seems the worst thing to do: it would only compound XSD’s tendency towards extra complication without much power. Instead, I think the W3C XSD WG should attempt to align XSD content models with RELAX NG’s more powerful content models. I was heartened in this regard to see that Michael Sperberg-McQueen, a leading XSD conspirator, had a nice paper (at Extreme XML?) on the parser technique adopted by RELAX NG implementations: I think what holds back rapprochement is to some degree uncertainty about the theoretical implications.
I just read a great 2004 paper on this, “Expressiveness and Complexity of XML Schemas” by Martens, Neven, Schwentick and Jan Bex. Boy, what a great paper! It is really nice to see a paper that has strong theory, strong awareness of the real world, and a willingness to empirically sample the web.
The gist of the paper is that XSD is only a slight advance on DTD in expressive power, but that there is a way to improve its power (impacting the UPA, Unique Particle Attribution, and the EDC, Element Declarations Consistent, constraints) and move it closer to the power of RELAX NG (unranked regular tree languages).
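To make the EDC constraint concrete, here is a minimal Python sketch, assuming a content model has been flattened to a list of (element name, type name) pairs. The function name and the example types are hypothetical, and real XSD processors work on the full particle structure rather than a flat list. It simply reports any element name that turns up with more than one type in the same content model, which is what EDC forbids and what RELAX NG allows.

```python
from collections import defaultdict


def edc_violations(particles):
    """Return element names declared with more than one type in one content model.

    `particles` is a flattened list of (element_name, type_name) pairs;
    this flattening is a simplifying assumption, not how XSD defines particles.
    """
    types_by_name = defaultdict(set)
    for name, type_name in particles:
        types_by_name[name].add(type_name)
    return sorted(name for name, types in types_by_name.items() if len(types) > 1)


# Hypothetical content model: (title:ShortTitle, para*, title:LongTitle)
model = [("title", "ShortTitle"), ("para", "Para"), ("title", "LongTitle")]
print(edc_violations(model))  # ['title']
```

A model like this one, where the two title elements differ in type by position, is the kind of thing a regular tree grammar can describe but XSD rules out.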
But the paper has a lot of other interesting things to offer: it samples over 800 XSD schemas and finds that few actually use anything more than DTDs could have provided. The ones that did use something more seem to have used only one-level local element declarations: unexpected confirmation of the idea in my blog on PVL yesterday that a parent/child single-step path is good enough for validation, apparently not for 80/20 but for 99/1.
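The parent/child idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not PVL or anything from the paper: the rule table, its element names and the `validate` function are all hypothetical, content models are written as regular expressions over space-terminated child names, and text content and attributes are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: (parent name, element name) -> regex over the
# space-terminated names of the element's children. None marks the root.
LOCAL_RULES = {
    (None, "book"): r"(title )(chapter )+",
    ("book", "title"): r"",
    ("book", "chapter"): r"(title )(para )*",
    ("chapter", "title"): r"(subtitle )?",
    ("title", "subtitle"): r"",
    ("chapter", "para"): r"",
}


def validate(elem, parent_name=None, errors=None):
    """Check each element's children against the rule for its (parent, name) pair."""
    errors = [] if errors is None else errors
    model = LOCAL_RULES.get((parent_name, elem.tag))
    if model is None:
        errors.append(f"no rule for {elem.tag!r} under {parent_name!r}")
        return errors
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        errors.append(f"bad content in {elem.tag!r} under {parent_name!r}")
    for child in elem:
        validate(child, elem.tag, errors)
    return errors


good = ET.fromstring(
    "<book><title/><chapter><title><subtitle/></title><para/></chapter></book>"
)
bad = ET.fromstring(
    "<book><title><subtitle/></title><chapter><title/></chapter></book>"
)
print(validate(good))
print(validate(bad))
```

The same name, title, gets a different content model under book than under chapter, which is all that a one-level local declaration buys you over a DTD.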
Disturbingly, 70% of the schemas were not correct in some way: the sample was 2 years ago and some tools that were notorious generators of bad schemas have been corrected now, so I hope things have improved, but still…sheesh!
But finally the paper gets onto a lengthy discussion of having streaming XPaths either as the left-hand side of the grammar or as a guard inside a grammar, as far as I can work out. In other words, this paper both discusses the kind of alignment with RELAX NG that I favour but also discusses the alternative(?) of using ancestor-based XPaths, and surprisingly fits them into RELAX NG too. Nail my hat to the ceiling! Streaming XPath guards and RELAX NG convergence are not as antithetical, perhaps, as they might appear.
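As I read it, the left-hand-side version amounts to choosing an element's rule from its ancestor path, which a streaming parser already knows when it sees the start tag. Here is a rough Python sketch of that reading. The rule list, the path patterns and the `rule_for` function are my own hypothetical illustration, with plain regular expressions over a slash-separated path standing in for the streaming XPath subset; none of it is taken from the paper.

```python
import re

# Hypothetical guarded rules: (regex over the ancestor path, rule label).
# The first pattern that matches the whole path wins.
GUARDED_RULES = [
    (r"/book/title", "plain title"),
    (r"/.*/appendix/(.*/)?title", "labelled title"),
    (r"/.*/title", "title with optional subtitle"),
]


def rule_for(path):
    """Return the label of the first rule whose pattern matches the full path."""
    for pattern, rule in GUARDED_RULES:
        if re.fullmatch(pattern, path):
            return rule
    return None


print(rule_for("/book/title"))
print(rule_for("/book/appendix/section/title"))
print(rule_for("/book/chapter/title"))
```

Because the choice depends only on ancestors, never on siblings or descendants, it can be made in a single pass, which is presumably why this kind of guard fits streaming validation.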
It’s been fun to do some more research recently. New (to me) terms I have enjoyed have included Island Grammars (think open content models or lax validation) and “twig patterns” (what XPaths are). I’ve also been impressed by papers by Byron Choi and Intel’s Jimmy Zhang: names to watch out for.