A couple of years back I had a very surprising experience with a junior programmer, who had just joined our team. I had asked him to work on some code until there were no more JUnit errors. A few hours later he proudly showed there were no errors, and explained it was easier than he expected because he just commented out the tests! Then he paused, regarded my startled expression for a few seconds and quickly blushed deeply. Doh!
Poor old Alex Brown has been in and out of favour with the extreme anti-OOXML-ists (perhaps I should use a new acronym, such as EAOOXMLista, to say for the hundred thousandth time that not every anti-OOXML person is extreme?) over the last few weeks. First, he didn’t somehow stop the DIS29500 BRM (exactly how?) from doing its job. So he is bad. Then he works with SC34 to organize getting more improvements made to OOXML and ODF. Again, bad. Then he says “The question behind the question, for a lot of the current OOXML debate, seems to be: can Microsoft really be trusted to behave? We shall see” which earned him the quote of the day on ConsortiumInfo. So presumably he is good.
Then he does a smoke test of validation conformance of Office and the various OOXMLs, and reports the validation errors he found. So he is deemed good. Now he has validated various versions of Open Office and ODF and reported the validation errors he found. And that makes him the devil again.
Unless there is some tussle between evil twins going on, I’d like to suggest that Alex is just trying to faithfully fulfill his normal committee responsibilities, which include checking through standards. Alex has long been involved in Data Quality issues for publishing professionally, and has been very involved in the development of ISO DSDL at SC34 (which includes RELAX NG and Schematron).
So what is it that Alex found about ODF that has caused the fuss? It is quite technical, but the gist is this, as I understand it: if a schema is not itself valid, no documents can be formally valid against it.
(When the invalid part of the schema is only detected at run-time when exercised by a particular instance document structure, and the document does not contain such a triggering instance, the implementation may report that the document is valid, but that is a false positive. And you may look at the schema and say “I know what was intended, and the false positive is in fact correct against the intent of the schema” but this is a lucky accident, i.e. hacking, not formal validity.)
The particular issue is quite interesting because it relates to an area in the W3C XML Schema standard where the user requirements for XSD could not be supported by the facet model used, and where XSD fudges it. OASIS RELAX NG also, to an extent, inherited this problem.
The problem is with attributes of type ID in the ODF schema. Alex Brown has provided a very simple fix, which I hope gets adopted into ODF 1.2.
The problem with IDs is this. XML inherits ID type attributes from SGML. They have various constraints, which include that they are XML names (tokens), that their values are unique within the document, and that an element can only have one ID attribute.
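To make those three constraints concrete, here is a minimal Python sketch of what an instance-level check could look like. It is not how any real validator does it: the `check_ids` function, the `id_attrs` set (standing in for "attributes a schema has declared as ID") and the ASCII-only name pattern are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only stand-in for an XML name; the real production
# in the XML spec allows many more Unicode characters.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def check_ids(xml_text, id_attrs):
    """Return a list of messages for violations of the three ID constraints.

    id_attrs is a hypothetical set of attribute names that some schema
    has declared to be of type ID.
    """
    problems = []
    seen = set()
    for elem in ET.fromstring(xml_text).iter():
        ids_here = [(n, v) for n, v in elem.attrib.items() if n in id_attrs]
        # Constraint: an element can only have one ID attribute.
        if len(ids_here) > 1:
            problems.append(f"<{elem.tag}> has more than one ID attribute")
        for name, value in ids_here:
            # Constraint: the value must be a name token.
            if not NAME_RE.fullmatch(value):
                problems.append(f"{name}='{value}' is not a valid name")
            # Constraint: the value must be unique within the document.
            if value in seen:
                problems.append(f"ID value '{value}' is not unique")
            seen.add(value)
    return problems


doc = """<doc>
  <p id="a1"/>
  <p id="a1"/>
  <q id="b2" xid="c3"/>
  <r id="9z"/>
</doc>"""

for message in check_ids(doc, {"id", "xid"}):
    print(message)
```

Note that all three checks here only fire when a particular document trips them; none of them says anything about whether the schema itself is sound.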
When XSD came to make its datatyping the XSD WG made a nice theoretical distinction between lexical space and value space: these are entirely context-free distinctions, which relate only to the atomic values of the individual pieces of text. XSD also provided another mechanism to declare that certain data values should be unique. But the constraints that an ID attribute value must be document-unique and that an element may only have a single ID attribute are left out in the cold by this model, and are not directly in the XSD specs. Blink and you’ll miss them: there is a little handwaving going on, but it is a good pragmatic workaround. The spec references the XML specification, and that these non-facet constraints on IDs are intended is made explicit in the (non-normative) Primer which forms Part 0 of the spec:
the scope of an ID is fixed to be the whole document.
and, more importantly, the XSD Structures Spec Part 1 specifies the ID/IDREF table as part of the PSVI.
ODF uses RELAX NG, and ISO RELAX NG specifically allows (s. 9.3.8 data and value pattern) datatyping to validate using more than just the atomic string:
services may make use of the context of a string. For example, a datatype representing a QName would use the namespace map.
(This seems to be a difference from the original OASIS RELAX NG, which AFAICS started with a more atomic view of datatypes.)
So when an ODF schema says an attribute is an ID type, we expect for full validation it will have all the XSD/XML semantics, and that for full validation of the schema conflicts would be pointed out. If you don’t want these semantics, you just use the base type
xs:NCName, which has the lexical and value space but adds none of the other constraints.
So we come to the concrete problem that a couple of content models allow wildcarded attributes in any namespace, and many of the attributes in the namespaces in question have ID attributes. So the argument (which you can follow on Alex Brown’s and Rob Weir’s blogs) is what class of error this should be: all the implementations of RELAX NG and Alex say this makes the schema invalid (in ISO Schematron I specifically included definitions for a “good schema” and a “correct schema” as well as a “valid schema” in order to make these nuances clearer); Rob thinks it shouldn’t be an error (“thinks” is too weak a term) and seems to think it should only be an error if an element actually has two ID attributes. I think this is also a legitimate possible approach that the standards could take (but they don’t).
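The difference between the two positions can be sketched with a toy model in Python. This is emphatically not RELAX NG: the dict-based content model, the attribute names `a:id`, `b:id` and `c:id`, and both functions are hypothetical, and the rule is simplified to the "two ID attributes on one element" framing used above.

```python
# Hypothetical ID-typed attributes that a wildcard could let in.
ID_ATTRS_ELSEWHERE = {"b:id", "c:id"}


def possible_ids(content_model):
    """All ID-typed attributes the toy content model could admit."""
    ids = set(content_model["declared_ids"])
    if content_model["any_attribute"]:
        ids |= ID_ATTRS_ELSEWHERE
    return ids


def static_check(content_model):
    """Schema-level view: fail if two ID attributes are merely *possible*."""
    return len(possible_ids(content_model)) <= 1


def instance_check(content_model, attributes):
    """Instance-level view: fail only if an element *actually* has two IDs."""
    ids = possible_ids(content_model)
    return sum(1 for name in attributes if name in ids) <= 1


# A content model with one declared ID attribute plus an attribute wildcard.
model = {"declared_ids": {"a:id"}, "any_attribute": True}

print(static_check(model))                                # False
print(instance_check(model, {"a:id": "x1"}))              # True
print(instance_check(model, {"a:id": "x1", "b:id": "x2"}))  # False
```

Under the static view the schema is rejected before any document is seen; under the instance view the same schema is accepted and only the third case, a document that really carries two IDs on one element, is reported.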
Alex has found the fix for ODF, but I think RELAX NG and XSD could well have some extra clarification text (non-normative) to stop basic mistakes. If a schema, whether DTD, XSD or RELAX NG, says something is an ID, it has all the semantics of an XML ID.
So what was the point about the programmer turning off tests to make some code fault-free? That is Rob Weir’s suggestion on how to make the ODF documents valid: turn off ID testing! Brilliant! So what is the point of ODF 1.0 making these things IDs in the first place if that was not the intended semantics?
I suspect this is actually another example of where it would have been more satisfactory all around to have these constraints in Schematron. For example, not use ID type but xs:NCName (this is not real code, but to give the idea…you’d use a regex and this assumes a consistent naming convention in ODF and sub-vocabularies wrt attribute naming):
<sch:rule context="whatever">
  <sch:report role="duplicate-ids" test="count(@*[ends-with(name(), 'id')]) > 1">
    There should not be more than one attribute called id.
  </sch:report>
</sch:rule>
This seems to give the intended constraint against duplication, but makes it a run-time instance-driven problem, not a static schema error. Another assertion would handle uniqueness.
So my take: Alex is right that the schema has a flaw, and right to point it out and offer a fix; Rob is right that it is unnecessary for this to be a static error (which is the positive point I would infer from his over-reacting blog), but wrong that the way to fix it is to turn off validating that constraint.