While doing some training last week, it again became clear to me that some XML users are currently in a bind over insignificant whitespace. They have documents with indented XML, but they don’t want to have to validate or transform just to strip out the whitespace in element content.
I have also been thinking about a similar problem from a different angle: how to make an ultra-simple, efficient validator that does not use a grammar, just the XML processor’s stack, for basic parent/(child|attribute|pi|data|element) validation. This is part of an ongoing interest in implementing XSD as Schematron assertions. One problem with that approach is that Schematron (using XPath 2, for example) pretty much requires random access to a built tree. But for any kind of high-transaction-rate work, you really want to be able to fail early if there are foreign elements, billion laughs attacks, etc. In my company’s Interceptor product, we have Schematron processing as a third stage, after basic size and evil-string detection and then WF/validation checking. But schema validation does blow out the timing more than is desirable.
So, for your delight, I present the Path Validation Language (PVL), a thought experiment in minimal schema languages, rather like UNIX access control lists (ACLs). You more or less make an ACL entry for each information item in each significant context. I’ve tried to make it so that XSD, DTD, and RELAX NG schemas (and perhaps even the streamable parts of some Schematron schemas) could be readily simplified into a PVL schema. Obviously it could also be extended to allow simple datatyping or attribute defaulting, but if you go too far you may as well have the real thing. (Though I increasingly think that grammars get in the way of schemas: better to have path-based datatype attribution, for example.)
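To make the ACL analogy concrete, here is a minimal sketch in Python of a validator that keeps nothing but the stack of open elements and fails on the first item that has no entry for its context. The `RULES` table, the `order`/`item` vocabulary, and the names `StackValidator` and `validate` are all hypothetical, invented for illustration; this is not PVL syntax, just one possible internal data structure. It also assumes that whitespace-only text is insignificant everywhere, which a real rule set would presumably decide per context.

```python
import xml.sax

# Hypothetical rule table (NOT actual PVL syntax): for each parent context,
# the set of (kind, name) items allowed directly inside it.
# The key None stands for the document root context.
RULES = {
    None: {("element", "order")},
    "order": {("element", "item"), ("attribute", "id")},
    "item": {("attribute", "sku"), ("data", None)},
}


class PathValidationError(Exception):
    """Raised on the first item that has no entry for its context."""


class StackValidator(xml.sax.ContentHandler):
    """Checks each SAX event against RULES using only the element stack."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.stack = []  # names of the currently open elements

    def _context(self):
        return self.stack[-1] if self.stack else None

    def _check(self, context, kind, name):
        if (kind, name) not in self.rules.get(context, set()):
            path = "/" + "/".join(self.stack)
            raise PathValidationError(
                f"{kind} {name!r} not allowed in context {context!r} at {path}"
            )

    def startElement(self, name, attrs):
        # The element must be allowed under its parent ...
        self._check(self._context(), "element", name)
        # ... and each attribute must be allowed on the element itself.
        for attr in attrs.getNames():
            self._check(name, "attribute", attr)
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # Assumption of this sketch: whitespace-only text is insignificant.
        if content.strip() == "":
            return
        self._check(self._context(), "data", None)


def validate(xml_text, rules=RULES):
    """Return None if the document passes, else the first error message."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), StackValidator(rules))
    except PathValidationError as err:
        return str(err)
    return None
```

Because the exception is raised from inside the SAX callbacks, parsing stops at the first offending item rather than after a tree has been built. For example, `validate('<order><evil/></order>')` returns an error message as soon as `evil` is opened, while an indented document that uses only the listed items returns `None`.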
Actually the syntax is unimportant (it could be a PI, a config file, XML syntax, or just an internal data structure produced by compiling a schema). The more interesting thing is the question of whether we actually need something much simpler than DTDs (which can either be written or generated from schema languages) for situations where XSD (or even RELAX NG!) is too complex or inefficient.
Fans of schema languages may be interested in the other schema languages I have been involved in, real or toy: Schematron, ISO DSDL, Hook, and XSD. I spent the best part of 1999 in Taiwan basically thinking through various schema issues: see Weaker Validation Models (think Jing’s partial validation), Axis Models and Path Models (think RELAX NG’s attribute content models), and Richer Anonymous Content Types (think XSD’s all). Schematron is currently the best of them, because it hits the hacker’s sweet spot: it is implementable by a single person, is 80/20 and anti-grandiose, follows the XML goals (especially human-readability), and yet addresses several issues (phases, progressive validation, custom error messages) which periodically present roadblocks with other schema languages.