Markup design fascinates me. What is it that makes one format easier to use than another? Why, even within that subset of markup that uses XML syntax, are some markup languages elegant and others unreadable? When is it best to use XML, when YAML, when a custom format?
Not all XML is created equal, and I think the biggest distinction between a good markup language and a bad one comes down to whether the XML was designed as a markup language or whether it’s a serialisation of a completely different model. Practically all the XML serialisations that I’ve seen of object-oriented models, or relational models, or graph models, have been dreadful as markup languages.
So what are the characteristics of XML that a good markup language should take advantage of? Here’s my list:
- custom names: XML is extensible: you can make up your own names for the elements that you use. Bad markup languages use generic names like table or record, which only mean anything when coupled with a name provided in an attribute or child element. Good markup languages use element names to provide semantics about element content.
- mixed content: this is XML’s killer feature, something that is essential for document-oriented content (whether it appears in a document or not). Good markup languages use mixed content to their advantage.
- nesting: XML can be nested to any level, and good markup languages take advantage of that by grouping similar things together and by using inheritance to scope the applicability of particular attributes. Good markup languages also take advantage of the context in which a particular element or attribute appears to determine its meaning, rather than giving each possibility a distinct name.
- attributes and elements: good XML uses elements for the main document content and attributes for metadata on that content. You can’t always make this distinction cleanly (because you can’t have attributes on attributes, or attributes containing elements), but a good markup language will attempt to do so.
- untyped data: this is one of XML’s strengths, because it means that not all data needs to be reduced to atomic data types such as numbers, strings and booleans, and it’s perfectly acceptable to use other formats for small amounts of data within an XML document. A good markup language knows where to stop marking things up.
- namespaces: namespaces might not be to everyone’s taste, but they enable one markup language to re-use other markup languages. Reuse helps everyone: it lowers the amount of design you have to do, it prevents authors from having to learn another way of marking something up, it enables programmers to reuse their code. Reusing languages such as XHTML, SVG or MathML should be a no-brainer, and we can do so easily because we have namespaces.
What does this mean in practice? Here’s an OOXML example. OOXML is one of the ugliest markup languages in existence. Why? Because it’s basically a dump of the internal format used by the Office applications. A sample paragraph from a document I have lying around:
<w:p>
<w:pPr>
<w:tabs>
<w:tab w:val="clear" w:pos="709" />
<w:tab w:val="clear" w:pos="1418" />
<w:tab w:val="left" w:pos="360" />
<w:tab w:val="left" w:pos="1080" />
</w:tabs>
</w:pPr>
<w:r>
<w:t>13.</w:t>
</w:r>
<w:r>
<w:tab wx:wTab="60" wx:tlc="none" wx:cTlc="0" />
<w:t>Section 1(2)(a) of the Act adds new subsections (1A), (1B) and (1C) into section 66 (</w:t>
</w:r>
<w:r>
<w:rPr>
<w:i />
</w:rPr>
<w:t>further consideration of case of conditionally discharged patient</w:t>
</w:r>
<w:r>
<w:t>) of the 1984 Act. ...</w:t>
</w:r>
</w:p>
There are some things that OOXML does right. It uses meaningful element names; it uses nesting to a certain extent; it only uses the content of elements for the actual content of the document, with other values going into attributes; and its use of namespaces isn’t bad. However, there are many more ways in which it fails badly: it doesn’t use mixed content, even with document-oriented content; it separates every piece of information into its own element or attribute rather than creating a more compact format; its reuse of existing formats sucks.
An alternative would be:
<w:p tabs="709C 1418C 360L 1080L">
13.<w:tab wx:wTab="60" />Section 1(2)(a) of the Act adds
new subsections (1A), (1B) and (1C) into section 66
(<w:i>further consideration of case of conditionally
discharged patient</w:i>) of the 1984 Act. ...
</w:p>
which as well as being shorter is more readable. This uses mixed content and it uses a custom data format for tab information rather than marking everything up using elements and attributes. (You could additionally use a default namespace for added readability.)
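The trade-off with a compact format like this is that the consumer needs a small parser for the tabs value, while mixed content makes the paragraph text trivial to recover. A minimal sketch in Python, under stated assumptions: the tabs value is taken to be whitespace-separated tokens of digits followed by one letter (C for clear, L for left, inferred from the example only), the function name parse_tabs is hypothetical, and the sample paragraph is shortened and uses unprefixed element names so that it parses without namespace declarations.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed micro-format: "<position><letter>" tokens separated by whitespace.
# Only C and L appear in the example; this mapping is an assumption.
TAB_KINDS = {"C": "clear", "L": "left"}
_TAB_TOKEN = re.compile(r"(\d+)([A-Z])")


def parse_tabs(value: str) -> List[Tuple[int, str]]:
    """Turn a value such as "709C 1418C" into (position, kind) pairs."""
    tabs = []
    for token in value.split():
        match = _TAB_TOKEN.fullmatch(token)
        if match is None or match.group(2) not in TAB_KINDS:
            raise ValueError(f"unrecognised tab stop: {token!r}")
        tabs.append((int(match.group(1)), TAB_KINDS[match.group(2)]))
    return tabs


# Shortened, unprefixed stand-in for the mixed-content paragraph above.
para = ET.fromstring(
    '<p tabs="709C 1418C 360L 1080L">13.<tab/>Section 66 '
    "(<i>further consideration</i>) of the 1984 Act.</p>"
)

tabs = parse_tabs(para.get("tabs", ""))
# With mixed content, the running text is just the text nodes in document order.
text = "".join(para.itertext())

print(tabs)
print(text)
```

The parser is a dozen lines, which is roughly the price of not marking up every tab stop as its own element.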
Here’s another example, this time a configuration file from my favourite XML editor:
<options>
<serialized>
<map>
<entry>
<String xml:space="preserve">clear.undo.buffer.before.format.and.indent</String>
<Boolean xml:space="preserve">false</Boolean>
</entry>
<entry>
<String xml:space="preserve">editor.detect.indent.on.open</String>
<Boolean xml:space="preserve">false</Boolean>
</entry>
<entry>
<String xml:space="preserve">editor.detect.line.width.on.open</String>
<Boolean xml:space="preserve">false</Boolean>
</entry>
<entry>
<String xml:space="preserve">editor.hard.line.wrap</String>
<Boolean xml:space="preserve">false</Boolean>
</entry>
...
<entry>
<String xml:space="preserve">scenario.associations</String>
<scenarioAssociation-array>
<scenarioAssociation>
<field name="name">
<String xml:space="preserve">Standard Chunking</String>
</field>
<field name="type">
<String xml:space="preserve">XSL</String>
</field>
<field name="url">
<String xml:space="preserve">../temp/phase1/aspen_20010009_en.htm</String>
</field>
</scenarioAssociation>
...
</scenarioAssociation-array>
</entry>
...
</map>
</serialized>
</options>
Now this is the kind of XML that Jeff Atwood railed against recently, and rightly so. You can tell that this XML hasn’t been thought about as a markup language because of the generic names like map, entry and field. Just getting those out of the way makes the markup a lot better as a markup language:
<options
clear.undo.buffer.before.format.and.indent="false"
...>
<editor detect.indent.on.open="false"
detect.line.width.on.open="false"
hard.line.wrap="false"
... />
<scenario.associations>
<association type="XSL">
<name>Standard Chunking</name>
<url>../temp/phase1/aspen_20010009_en.htm</url>
</association>
...
</scenario.associations>
...
</options>
There’s a reason that people use generic markup languages like the original configuration file above: it’s easy to marshal data into it and unmarshal data out of it, and it’s not as if the configuration is going to be shared with other applications. But that mentality tightly couples your current implementation with the configuration file: bad news if your application’s data structures change down the road.
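One way to loosen that coupling is to read the designed format through a thin, explicit mapping layer rather than a generic serialiser. The following Python sketch is illustrative only: the ScenarioAssociation class and the two reader functions are hypothetical names, not anything from the editor itself, and the sample is a trimmed copy of the redesigned configuration with the "..." elisions left out so that it parses.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List

# Trimmed copy of the redesigned configuration (elided parts omitted).
SAMPLE = """\
<options clear.undo.buffer.before.format.and.indent="false">
  <editor detect.indent.on.open="false"
          detect.line.width.on.open="false"
          hard.line.wrap="false" />
  <scenario.associations>
    <association type="XSL">
      <name>Standard Chunking</name>
      <url>../temp/phase1/aspen_20010009_en.htm</url>
    </association>
  </scenario.associations>
</options>
"""


@dataclass
class ScenarioAssociation:
    """Hypothetical in-memory type; it can change without changing the file."""
    name: str
    kind: str
    url: str


def read_editor_flags(root: ET.Element) -> Dict[str, bool]:
    """Map the attributes of <editor> to booleans."""
    editor = root.find("editor")
    attrs = editor.attrib if editor is not None else {}
    return {key: value == "true" for key, value in attrs.items()}


def read_associations(root: ET.Element) -> List[ScenarioAssociation]:
    """Build one object per <association>, wherever it is nested."""
    return [
        ScenarioAssociation(
            name=node.findtext("name", default=""),
            kind=node.get("type", ""),
            url=node.findtext("url", default=""),
        )
        for node in root.iter("association")
    ]


root = ET.fromstring(SAMPLE)
print(read_editor_flags(root))
print(read_associations(root))
```

If the application later renames or restructures its classes, only these reader functions need to change; the file on disk stays the same.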
XML isn’t bad — far from it — but its flexibility means that XML can be used badly. To use it well, my advice is to take advantage of its strengths, and aim for readability above everything else.

Jeni Tennison's on XML.com!!!
w00t! :D
(That's all I can comment on at the moment... Haven't read your post. That's next. Just was excited to see your first post so felt the need to say so :D)
Ugg! I'd never noticed the config file format in Oxygen before now. Yikes!
Fortunately the tool *ROCKS*! :D
Good post. I agree with the comments about the office xml format. Trying to write an XSLT that will produce clean (X)HTML from Word is a pain! That being said, I don't find Open Office/Open Document Format XML that much better, either....
Well said! I've wrapped myself around a tree a few times over WordprocessingML 2003, OOXML, and InDesign's serialized INX files.
Beyond the terse markup, there's a whole other issue in that most serialized XML is not necessarily valid XML. This is especially true of the INX format - you'll get a well-formed document, but not necessarily the consistency of valid XML required for downstream processes.
Thank you for clarifying the difference between "bad" and "used badly."
I wonder if OOXML's lack of mixed content is a hangover from MSXML's policy of stripping insignificant whitespace...
I can see the usefulness of these rules for document data, but if I need to store something that is inherently a graph of structured data, the rules make no sense at all. So, basically, your advice is that people who store or exchange anything but tree structured documents should just go away and leave XML alone.
But why? Because representations of graphs in XML don't look pretty and are hard to read and edit by hand? That's a bad reason because graph structured data is hard to read and edit by hand in any format I know of. I believe this problem is inherent to the way the human brain works. Physical containment (like element subtrees) is easier to grasp than networks, because networks lack the locality our senses need.
I see no reason why I should not make use of the other strengths of XML beyond those resulting from tree structure and mixed content, like Unicode and existing mature parsers.
@fauigerzigerk,
>> I see no reason why I should not make use of the other strengths of XML beyond those resulting from tree structure and mixed content, like Unicode and existing mature parsers.
This is a fair point. And you're right, there are times when there isn't a pretty way to move your data into XML. I think the problem occurs when people are forced to look at the XML and attempt to make sense of it, at which point you get the classic "Oh, that's ugly! Isn't there a better way?" which then leads to rants similar to Jeff Atwood's. Of course we all know what happens after such rants: a global "conversation" which ultimately ends up leading back to your point,
>> I see no reason why I should not make use of the other strengths of XML beyond those resulting from tree structure and mixed content, like Unicode and existing mature parsers.
Of course I can't help but agree.
@Andrew,
>> I wonder if OOXML's lack of mixed content is a hangover from MSXML's policy of stripping insignificant whitespace...
Hmmm.... Interesting point. As far as XSLT is concerned, while it's impossible for it to become as complicated given <xsl:text>foo</xsl:text> doesn't allow anything other than plain text with escaped markup, I certainly have a tendency to put all text that isn't generated into an xsl:text element. And as Anup points out, ODF isn't any prettier (okay, maybe it's a little prettier ;-)), so maybe this is really a simple matter of guaranteeing a lossless document format when moving from one application to another?
Food for thought...
@fauigerzigerk,
I certainly don't think that people who want to use XML for graph structures should go away and use something else! All I'm arguing is that everyone should think about the way the XML they use is designed as a markup language rather than simply dumping out a graph or other data-oriented structure in a generic serialisation.
Yes, graph structures are inherently harder for humans to read because they aren't linear; that's when you have to work particularly hard on the design of your XML to make it as usable as possible (for programmers as well as authors).
> There are some things that OOXML does right. It uses meaningful element names
I fail to see how "r", "pPr", "i" and "t" are "meaningful element names".
@Theo,
They're more meaningful than 'element' :) They may be short, but at least they stand for something that reflects their semantics (r = run, pPr = paragraph properties, i = italic, t = text).
I don't agree about ODF not being better than OOXML. I think it's a lot cleaner and easier to understand.
I just did a simple test with a one-page document containing a table, footnotes, and several styles. It was created in OpenOffice. I saved it as Word 2003 XML, which resulted in a 63 KB file. I unzipped the ODT file, and the content file is 34 KB, about half the size. I know this is Word 2003, which is not OOXML, but I suppose it's very similar, looking at the examples here.
Jeni,
I want to publicly second David's enthusiastic response. Welcome to XML.com - I for one am eagerly looking forward to your posts!
-- Kurt Cagle
One post: #1 on the Hot 25: http://weblogs.oreillynet.com/
/me is looking forward to the next #1 post. :D