One of the really neat things about the XML specification is not just that it makes its design goals explicit (I gave a twist to this idea in the Schematron standard by mentioning various non-goals too) but that the goals were really well chosen.

A decade ago, Tim Bray wrote up his Annotated version of the XML spec, which includes some hypertext comments to the Goals section.

Recently, I have several times heard people quoting the XML goals to support various opinions on what makes a good or bad markup language (schema). In particular, goal #10, “Terseness in XML markup is of minimal importance”, gets used to claim that abbreviated element names go against the spirit of XML (a blithe spirit indeed). (See here for example.)

But if we look at the XML Spec, we see that these are not general goals for XML documents to follow, but goals for the committee designing XML the technology: they are explicitly design goals. Tim’s comments are useful here; on goal #10 he writes:

The historical reason for this goal is that the complexity and difficulty of SGML was greatly increased by its use of minimization, i.e. the omission of pieces of markup, in the interest of terseness. In the case of XML, whenever there was a conflict between conciseness and clarity, clarity won.

I have always attributed the goals to Jon Bosak. Tim mentions that “Jon’s stewardship of the XML process has been marked by a combination of deft political maneuvering with steadfast insistence on the principle of doing things based on principle, not expediency”, where I think “principle” requires having clear goals and pursuing them. (Regular readers of this blog might see that my Reasonable Principles for Reviewing Open XML and other Standards follows this line. If you get hold of the Standards Australia comments on the DIS 29500 ballot, you can see that most of them try to state the general principle behind the specific problem.)

But Dave Hollander and Michael Sperberg-McQueen mention how the goals were the foundation for the XML design effort too. The goals were a fait accompli, settled by the XML ERB, by the time the larger XML WG formed (another good thing about Jon Bosak: he welcomed all sorts of stakeholder involvement) but I don’t recall any of us on the WG (which would now be called an Interest Group, not to be confused with the current XML WG which took up from the old ERB) ever complaining about the goals.

Alice through the Looking Glass

Looking at the goals (and see Tim’s comments if you don’t trust mine) you can see that most of the goals are specific responses to problems either with SGML or with the SGML process at ISO at the time. (ISO standards were supposed to have 10 year reviews which would be an opportunity for changes to be addressed, outside the ordinary maintenance process. But some influential and vital members of the ISO group had been committed to keeping SGML unchanged for as long as possible, and many of the other members who wanted change wanted changes that would support technologies such as ISO HyTime better: these would be changes that made SGML more complicated and variegated rather than simpler, to the frustration of all.)

1. XML shall be straightforwardly usable over the Internet.

SGML had a particular issue that it was, by design, retargetable. Before Unicode and URLs, every different system had different character sets and different ways of locating files. So SGML provided a mechanism for labelling that an entity (resource) would need some system-specific fix in order to be useful, and a mechanism for naming entities regardless of their location (PUBLIC identifiers).

Because of this goal, SDATA entities were removed from SGML, as was the use of unresolved entities (entities with PUBLIC identifiers but no SYSTEM identifiers). It was unfeasible to expect users to fix documents to suit their local systems: that is geekstuff. The use of Unicode and URLs was a no-brainer from this goal.

2. XML shall support a wide variety of applications.

While some people had been using SGML for non-publishing uses (Dave Peterson at MIT, for example, had been using it for numerical data from 1985 IIRC) its complexity and strangeness made it difficult for, in particular, people from the database world. Now, as it turns out, their problems can fruitfully be solved by treating them as publishing applications. But this has been a success of XML and HTML, not SGML per se.

3. XML shall be compatible with SGML.

In fact, ISO 8879 was changed to allow this: in particular, to allow documents with no DTDs, hex numeric character references (I had tried to get them introduced in my successful 1996 Corrigendum to SGML, but horsetraded them away to get support for the main requirement, to support CJK characters better) and the empty-element form <x/>.
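As a minimal sketch of those three features together, assuming Python and its standard-library xml.etree.ElementTree parser (the sample document and the helper name parse_sample are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: no DOCTYPE at all, a hex numeric character
# reference (&#x4E2D; is a CJK ideograph), and an empty-element tag <x/>.
SAMPLE = "<doc>&#x4E2D;<x/></doc>"

def parse_sample(text=SAMPLE):
    """Parse the sample; return (text before <x/>, tag of the first child)."""
    root = ET.fromstring(text)
    return root.text, root[0].tag
```

Here parse_sample() returns the single character U+4E2D and the tag name "x": the parser needs no DTD to cope with any of it.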

4. It shall be easy to write programs which process XML documents.

SGML parsers were a pain to write. An SGML processor was really a compiler compiler where you could change delimiters, keywords and a whole lot of different behaviours. Note that process here is a defined term: an XML processor is the parser and support utilities. This goal does not state that it is against XML’s goals to write complicated programs that use XML data!

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

SGML had an ancillary document, the SGML declaration, which told you which features a particular document needed. In theory, you could then look up the SGML system declaration for an application and see whether it matched. In fact, XML can be regarded as largely a particular SGML declaration, superseding the default Reference Concrete Syntax defined by ISO 8879:1986 (and taking on board the Extended Reference Concrete Syntax proposals which I and the CJK DOCP group were promoting).

Now, in fact, there are two big optional features in XML: DTDs and non-UTF-* character encodings. Many early home-made XML parsers did not support DTDs or ignored them, and many supported only limited numbers of character encodings.
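To show what those two features look like in practice, here is a small sketch assuming Python's standard-library xml.etree.ElementTree (which sits on the expat parser). The document is invented: it declares an entity in an internal DTD subset and announces ISO-8859-1 as its encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using both "optional" features: an internal DTD
# subset that declares an entity, and a non-UTF encoding named in the
# XML declaration (the byte 0xE9 is e-acute in ISO-8859-1).
DOC = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE d [<!ENTITY greeting "hello">]>\n'
    b'<d>&greeting; caf\xe9</d>'
)

def read_text(data=DOC):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(data).text
```

A parser that skipped the DTD would be stuck on &greeting;, and one that only knew UTF-8 would reject the 0xE9 byte, which is roughly the situation with the early home-made parsers.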

6. XML documents should be human-legible and reasonably clear.

As Tim notes, this is a goal which blocks off any attempt to allow binary data and non-graphical characters in XML. Text is king.

Before XML, I organized an effort to make up a set of rules for the Unicode characters that could be used in names in markup: this Native Language Markup list was part of the Extended Reference Concrete Syntax, and was adopted and improved by XML 1.0 (and downgraded in XML 1.1). Out of this effort came a strong belief that XML should not contain non-graphical or control characters: this ended up being reworked into a W3C and Unicode Consortium note: Unicode in XML and other Markup Languages.

But the issue crops up periodically. Indeed, it is one area where I think OOXML goes seriously wrong: in a few places it provides a mechanism for circumventing XML’s character repertoire restrictions. I think the idea that just because someone generated an automatic name and used the backspace character as part of it, this should be regarded as acceptable practice in the standard is completely bogus. Several National Bodies have commented on it: I hope ECMA will have the good sense to remove it or severely deprecate it at the least. For example, it is clearly a security hole to allow backspace in names, where the visible name may be coded differently than its readers expect: a kind of spoofing.
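The repertoire restriction is easy to observe. A minimal sketch, assuming Python's standard-library xml.etree.ElementTree as the XML 1.0 processor (the helper is_well_formed and the sample strings are made up for illustration):

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the standard-library parser accepts `text` as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# U+0008 (backspace) is outside the XML 1.0 character repertoire, whether
# it appears literally or is smuggled in as a numeric character reference.
PLAIN = "<name>abc</name>"
LITERAL_BACKSPACE = "<name>ab\x08c</name>"
REFERENCED_BACKSPACE = "<name>ab&#x8;c</name>"
```

Only PLAIN is accepted. Any escape mechanism layered on top of XML to carry such characters is, by construction, working around what the processor itself refuses.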

7. The XML design should be prepared quickly.

SGML’s “10 year review” had not even really started properly after 10 years. In fact, XML was the 10 year review of SGML!

8. The design of XML shall be formal and concise.

SGML has attracted criticism that it did not use academic formalisms, and was difficult to characterize with formalisms. I don’t know why this isn’t a criticism of the formalisms just as much: cart before the horse. Anyway, XML being simpler is much more friendly to simple theoretical formalisms, and consequently easier to write parsers for using compiler compilers. (In a sense, XML represents a move to unbundle the markup language from the compiler compiler technology. In my idle moments, I wonder whether compiler compiler systems would have been more capable of handling SGML if the SGML spec had been freely available on the internet in PDF or whatever: the lack of an open standard for SGML meant that the academic/private-hacker community (apart from James Clark) did not connect with the standard or its challenges.)

9. XML documents shall be easy to create.

Tim wrote in 1998:

The main goal was in fact to design XML in such a way that it would be tractable to design and build XML authoring systems. Our success in meeting this design goal remains to be established in the marketplace.

In 2008, the success is completely established: it is difficult to find anything anywhere which doesn’t use XML, even when it is a mad choice!

10. Terseness in XML markup is of minimal importance.

SGML was designed with great attention to the requirements of users, i.e. typists. Minimizing the number of keystrokes it took to mark up a raw text file was a large part of the economic value proposition of SGML. SGML allowed you to leave off many delimiters, omit many tags, and gave many kinds of shortcuts so that you could just use simple keyboard symbols instead of explicit tags.

It is tempting to think of this as an old-fashioned concern which we, in our age of RIAs and off-shored outsourcing, don’t need to worry about. But what XML did was, in fact, to cast adrift the users of Wiki-like markup into a standards-free world, which has incredibly harmed the adoption of Wiki-like markup. And when we look at the upheaval in the current HTML 5 discussions, a central meme is that XML’s restricted syntax is simply inappropriate for vanilla HTML. (For an alternative, see my ECS.)

This goal #10 has been the cause for much of XML’s success: at a stroke it allowed many SGML features to be removed without much fuss: DATATAG, SHORTREF, OMITTAG, SHORTTAG. Coupled with this goal was the realization that to a major extent, HTTP compression was the correct layer for reducing the transmission size of documents, rather than XML language features. (Of course, it is not true that terseness is of no importance in language syntax standards: the prefix mechanism in XML Namespaces is a terseness mechanism after all!) And the removal of these features meant that the DTD was no longer necessary, a big win which many people had been seeking.
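The point about compression being the right layer can be checked with a rough experiment. This sketch assumes Python's gzip module as a stand-in for HTTP compression; the record structure and both sets of element names are invented, and the exact byte counts will vary with the data.

```python
import gzip

def build(record, name, balance, count=200):
    """Build a made-up document of `count` records with the given element names."""
    rows = "".join(
        f"<{record}><{name}>Customer {i}</{name}>"
        f"<{balance}>{i * 7}</{balance}></{record}>"
        for i in range(count)
    )
    return f"<accounts>{rows}</accounts>".encode("utf-8")

def sizes(doc):
    """Return (raw size, gzip-compressed size) in bytes."""
    return len(doc), len(gzip.compress(doc))

# Same data, once with long descriptive names and once with terse ones.
verbose = build("customerAccountRecord", "customerFullName", "accountBalanceAmount")
terse = build("r", "n", "b")

if __name__ == "__main__":
    print("verbose raw/gzip:", sizes(verbose))
    print("terse   raw/gzip:", sizes(terse))
```

The long names cost tens of kilobytes in the raw documents, but because the same tags repeat over and over, most of that difference disappears once both are compressed.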

But to treat this design goal as somehow indicating any policy about how long a name in a schema should be goes beyond the intent of those goals, at least as I ever understood it. The goal of Native Language Markup was to allow people to mark up documents using their customary names and symbols. This is different from the goal of literate programming, which is where I think people are getting confused.

In fact, what we are seeing with XML is that for international standards, and in nations that use Indo-European languages, or that have English as an official language, or (such as Indonesia) where the simple 26-letter alphabet has been adopted for transcription, schemas do restrict themselves to names from the ASCII repertoire. No surprises there. However, for national and local documents for other languages or scripts, Native Language Markup is a big success, very spectacularly in the Chinese UOF spec and in Murata-san’s schemas for Japanese local governments.
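For readers who have never seen Native Language Markup, here is a small sketch, again assuming Python's standard-library xml.etree.ElementTree. The address record and its Japanese element names (address, prefecture, city) are invented for illustration and are not taken from any of the schemas mentioned above.

```python
import xml.etree.ElementTree as ET

# Made-up record whose element names are Japanese words:
# address / prefecture / city.
SAMPLE = "<住所><都道府県>神奈川県</都道府県><市>横浜市</市></住所>"

def element_names(text=SAMPLE):
    """Return the root element's name followed by its children's names."""
    root = ET.fromstring(text)
    return [root.tag] + [child.tag for child in root]
```

The names come back exactly as written, and lookups such as root.find("市") work the same way they would for ASCII names.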