Charles Goldfarb’s idea of using grammars to represent documents has proven itself useful in many situations, and the DTD legacy lives on in ISO RELAX NG and W3C XSD. However, there are many structures that regular grammars, as conventionally implemented, cannot cope with. And it is possible to get a certain cart-before-the-horse mentality about grammars, where any structure that cannot be represented by a grammar is regarded as bad ipso facto.

Instead, we should be striving towards systems that free us, so that what is congenial to the mind is easy to do on the computer.

I was looking at Ant files recently and they provide another good example. Ant files are configuration files for a modern make system, open source through Apache and most associated with Java development. Ant files mostly use a defined set of elements and attributes, for which you could write a grammar-based schema quite easily.

But you can extend the elements inline in the document itself. For example, I am working on (updating Christopher Lauret and Willy Ekasalim’s) Ant task for Schematron, to be available as an Ant extension. In Ant, you just need this:

  <target name="test-fileset" description="Test with a Fileset">
    <taskdef name="schematron" classname="com.schematron.ant.SchematronTask"
        classpath="../lib/ant-schematron.jar"/>
    <schematron schema="../schemas/test.sch" failonerror="true" debugmode="false">
      <fileset dir="../xml" includes="*.xml"/>
    </schematron>
  </target>

Here the taskdef element declares that there is a task called schematron, which can then be used as an element later.

In Schematron you could validate this by the following:

      <sch:pattern>
          <sch:title>Check allowed elements</sch:title>

          <sch:rule context="target/*[name() = ancestor::*/taskdef/@name]">
              <sch:assert test="true()">
              The target element may contain user-defined tasks.
              </sch:assert>
          </sch:rule>

          <sch:rule context="target/*">
              <sch:assert test="self::bunzip2 or self::bzip2 or self::depend or self::javac or ..."
                  diagnostics="unknown-name">
              The target element should only have built-in Ant tasks apart from user-defined tasks.
              </sch:assert>
          </sch:rule>

      </sch:pattern>
...

      <sch:diagnostic id="unknown-name">
          The element <sch:name/> is not one of the built-in types in Ant (at least, as at Ant 1.7.0).
      </sch:diagnostic>

Unless I have made a mistake with the XPath, this does the following:

  • The first rule finds every element that is a child of target and for which there is an in-scope taskdef element declaring that name. In scope means a taskdef that is a child of any ancestor of the element. The assertion in this rule can never fail; it just filters out properly defined extension elements so that they do not fire the second rule.
  • The second rule, which applies to any other element under target, checks against the full list of the built-in Ant tasks.
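
As a hedged illustration of how the two rules divide the work, here is a made-up build file fragment. The project and target names, the attribute values on javac, and the frobnicate element are all invented for this sketch; only the taskdef line comes from the example above. The taskdef is placed directly under project to show that a declaration on any ancestor counts as in scope.

  <project name="demo" default="build">
    <taskdef name="schematron" classname="com.schematron.ant.SchematronTask"
        classpath="../lib/ant-schematron.jar"/>
    <target name="build">
      <!-- First rule: the name matches project/taskdef/@name, so the
           second rule never sees this element. -->
      <schematron schema="../schemas/test.sch"/>
      <!-- Second rule: javac is in the built-in list, so the assertion holds. -->
      <javac srcdir="src" destdir="classes"/>
      <!-- Second rule: hypothetical element with no taskdef; the assertion
           fails and the unknown-name diagnostic reports "frobnicate". -->
      <frobnicate/>
    </target>
  </project>

This assumes that nothing in the elided part of the built-in list happens to match frobnicate.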

That grammars cannot represent this is not so much a lost opportunity for better validation (after all, the Ant program itself can generate messages) as a real shortfall for documentation: I cannot see one place in the Ant documentation in which all the structural rules are consolidated. I suppose if you are not used to going to a schema first, then you might not miss it, but I think one of the major convenience factors of DTDs, RELAX NG compact syntax, and Schematron is that they give you a terse collection of the structural rules in one place, like a help card for programmers.

I have added a little diagnostic message too: just to let the user know what the unexpected element actually was. It isn’t part of the main assertion so that the assertions are “pure” positive descriptions of what should be.

Now, let’s assume you are a Vigorous Grammar Fanboy (VGF). You object: why not just have a container element like user-task for all the points where you want these, along the lines of the CustomXML elements in OOXML, where the name of the desired element is effectively in an attribute rather than in the actual element name? First, because it is ugly. Second, because it emphasizes that this is an extension element, which is of interest during setup and then extraneous information afterwards. Third, because it then gets in the way of using the element name to determine the contents of the element. And fourth, because it is not what the original writers found idiomatic, direct and minimal. Or was that point one again?
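
For comparison, a sketch of the container style the VGF is asking for might look something like the following. The user-task element is the name floated above; the name attribute carrying the task name is an assumption made for illustration, not anything Ant supports.

  <target name="test-fileset" description="Test with a Fileset">
    <!-- Hypothetical: the real task name is demoted to an attribute value. -->
    <user-task name="schematron" schema="../schemas/test.sch" failonerror="true">
      <fileset dir="../xml" includes="*.xml"/>
    </user-task>
  </target>

A grammar can now say where user-task is allowed, but a DTD, for instance, still cannot vary the permitted attributes and children according to the value of name, which is the third objection.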

But you, the VGF, are not content with that. Oh no, you are relentless, like a killer whale attacking a seal pup on the beach. You say “Err, isn’t this what namespaces are for?” And, indeed, Ant is starting to add support for namespaces, which may in time supersede this. My answer: namespaces are difficult for the kind of developers who are making Ant tasks: they are probably not addressing XML problems at all. And namespaces pose more problems for users. In fact, the Ant declaration system is one of binding a local name to a class, and so it is no more prone to name clashes than if namespaces had been used (i.e. conflicts over the same element name are no different from conflicts over the same prefix).

So a quick comment to developers: if you have used XML for configuration files or other things, and then found that XSD doesn’t have enough power to represent what you have, it is most likely that ISO Schematron can do the job, and do it with clearer diagnostics.