Markup design fascinates me. What is it that makes one format easier to use than another? Why, even within that subset of markup that uses XML syntax, are some markup languages elegant and others unreadable? When is it best to use XML, when YAML, when a custom format?

Not all XML is created equal, and I think the biggest distinction between a good markup language and a bad one comes down to whether the XML was designed as a markup language or whether it’s a serialisation of a completely different model. Practically all the XML serialisations that I’ve seen of object-oriented models, or relational models, or graph models, have been dreadful as markup languages.

So what are the characteristics of XML that a good markup language should take advantage of? Here’s my list:

  • custom names: XML is extensible, so you can make up your own names for the elements that you use. Bad markup languages use generic names like table or record, which only mean anything when coupled with a name provided in an attribute or child element. Good markup languages use element names to provide semantics about element content.

  • mixed content: this is XML’s killer feature, something that is essential for document-oriented content (whether it appears in a document or not). Good markup languages use mixed content to their advantage.

  • nesting: XML can be nested to any level, and good markup languages take advantage of that by grouping similar things together and by using inheritance to scope the applicability of particular attributes (an xml:lang attribute, for example, applies to everything inside the element that carries it). Good markup languages also take advantage of the context in which a particular element or attribute appears to determine its meaning, rather than giving each possibility a distinct name.

  • attributes and elements: good XML uses elements for the main document content and attributes for metadata on that content. You can’t always make this distinction cleanly (because you can’t have attributes on attributes, or attributes containing elements), but a good markup language will attempt to do so.

  • untyped data: this is one of XML’s strengths, because it means that not all data needs to be reduced to atomic data types such as numbers, strings and booleans, and it’s perfectly acceptable to use other formats for small amounts of data within an XML document. A good markup language knows where to stop marking things up.

  • namespaces: namespaces might not be to everyone’s taste, but they enable one markup language to re-use other markup languages. Reuse helps everyone: it lowers the amount of design you have to do, it prevents authors from having to learn another way of marking something up, it enables programmers to reuse their code. Reusing languages such as XHTML, SVG or MathML should be a no-brainer, and we can do so easily because we have namespaces.
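
To make the mixed content point a bit more concrete, here’s a minimal sketch of how a program sees it, assuming Python’s standard xml.etree.ElementTree module. The element names para and term are made up for this illustration; the point is only that the prose stays intact as text while the markup picks out the parts that matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical document-oriented fragment: "para" and "term" are invented
# names used only for this illustration.
MIXED = "<para>Good markup uses <term>mixed content</term> for prose.</para>"


def plain_text(xml_source: str) -> str:
    """Return the character data of a fragment with all markup removed."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())


def marked_spans(xml_source: str, name: str) -> list[str]:
    """Return the text of every element called `name` in the fragment."""
    root = ET.fromstring(xml_source)
    return ["".join(element.itertext()) for element in root.iter(name)]


print(plain_text(MIXED))
print(marked_spans(MIXED, "term"))
```

Nothing about the sentence had to be broken into separate fields to make the marked-up term addressable, which is what I mean by using mixed content to your advantage.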

What does this mean in practice? Here’s an OOXML example. OOXML is one of the ugliest markup languages in existence. Why? Because it’s basically a dump of the internal format used by the Office applications. Here’s a sample paragraph from a document I have lying around:

    <w:p>
      <w:pPr>
        <w:tabs>
          <w:tab w:val="clear" w:pos="709" />
          <w:tab w:val="clear" w:pos="1418" />
          <w:tab w:val="left" w:pos="360" />
          <w:tab w:val="left" w:pos="1080" />
        </w:tabs>
      </w:pPr>
      <w:r>
        <w:t>13.</w:t>
      </w:r>
      <w:r>
        <w:tab wx:wTab="60" wx:tlc="none" wx:cTlc="0" />
        <w:t>Section 1(2)(a) of the Act adds new subsections (1A), (1B) and (1C) into section 66 (</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t>further consideration of case of conditionally discharged patient</w:t>
      </w:r>
      <w:r>
        <w:t>) of the 1984 Act. ...</w:t>
      </w:r>
    </w:p>

There are some things that OOXML does right. It uses meaningful element names; it uses nesting to a certain extent; it only uses the content of elements for the actual content of the document, with other values going into attributes; and its use of namespaces isn’t bad. However, there are many more ways in which it fails badly: it doesn’t use mixed content, even with document-oriented content; it separates every piece of information into its own element or attribute rather than creating a more compact format; its reuse of existing formats sucks.

An alternative would be:

    <w:p tabs="709C 1418C 360L 1080L">
      13.<w:tab wx:wTab="60" />Section 1(2)(a) of the Act adds
      new subsections (1A), (1B) and (1C) into section 66
      (<w:i>further consideration of case of conditionally
        discharged patient</w:i>) of the 1984 Act. ...
    </w:p>

which as well as being shorter is more readable. This uses mixed content and it uses a custom data format for tab information rather than marking everything up using elements and attributes. (You could additionally use a default namespace for added readability.)
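
A compact format like the tabs attribute is only worth it if it stays easy to process. As a rough sketch in Python, and assuming (from this one example only) that each token is a position followed by a single letter, with C standing for clear and L for left, the parsing might look like this. The TabStop type, the KINDS table and the parse_tabs function are all hypothetical names.

```python
import re
from typing import NamedTuple


class TabStop(NamedTuple):
    position: int
    kind: str


# Assumed mapping, inferred from the example above; a real format would
# need to define the full set of codes.
KINDS = {"C": "clear", "L": "left"}

TOKEN = re.compile(r"([0-9]+)([A-Z])")


def parse_tabs(value: str) -> list[TabStop]:
    """Parse a whitespace-separated list such as "709C 360L"."""
    stops = []
    for token in value.split():
        match = TOKEN.fullmatch(token)
        if match is None or match.group(2) not in KINDS:
            raise ValueError(f"unrecognised tab stop: {token!r}")
        stops.append(TabStop(int(match.group(1)), KINDS[match.group(2)]))
    return stops


for stop in parse_tabs("709C 1418C 360L 1080L"):
    print(stop.position, stop.kind)
```

A dozen lines of parsing is a small price for a paragraph element that an author can actually read, though where to draw that line is a judgement call for each format.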

Here’s another example, this time a configuration file from my favourite XML editor:

    <options>
        <serialized>
            <map>
                <entry>
                    <String xml:space="preserve">clear.undo.buffer.before.format.and.indent</String>
                    <Boolean xml:space="preserve">false</Boolean>
                </entry>
                <entry>
                    <String xml:space="preserve">editor.detect.indent.on.open</String>
                    <Boolean xml:space="preserve">false</Boolean>
                </entry>
                <entry>
                    <String xml:space="preserve">editor.detect.line.width.on.open</String>
                    <Boolean xml:space="preserve">false</Boolean>
                </entry>
                <entry>
                    <String xml:space="preserve">editor.hard.line.wrap</String>
                    <Boolean xml:space="preserve">false</Boolean>
                </entry>
                ...
                <entry>
                    <String xml:space="preserve">scenario.associations</String>
                    <scenarioAssociation-array>
                        <scenarioAssociation>
                            <field name="name">
                                <String xml:space="preserve">Standard Chunking</String>
                            </field>
                            <field name="type">
                                <String xml:space="preserve">XSL</String>
                            </field>
                            <field name="url">
                                <String xml:space="preserve">../temp/phase1/aspen_20010009_en.htm</String>
                            </field>
                        </scenarioAssociation>
                        ...
                    </scenarioAssociation-array>
                </entry>
                ...
            </map>
        </serialized>
    </options>

Now this is the kind of XML that Jeff Atwood railed against recently, and rightly so. You can tell that this XML hasn’t been thought about as a markup language because of the generic names like map, entry and field. Just replacing those with names that mean something makes it a lot better as a markup language:

    <options
      clear.undo.buffer.before.format.and.indent="false"
      ...>
      <editor detect.indent.on.open="false"
              detect.line.width.on.open="false"
              hard.line.wrap="false"
              ... />
      <scenario.associations>
        <association type="XSL">
          <name>Standard Chunking</name>
          <url>../temp/phase1/aspen_20010009_en.htm</url>
        </association>
        ...
      </scenario.associations>
      ...
    </options>

There’s a reason that people use generic markup languages like the original configuration file above: it’s easy to marshal data into it and unmarshal data out of it, and it’s not as if the configuration is going to be shared with other applications. But that mentality tightly couples your current implementation to the configuration file: bad news if your application’s data structures change down the road.
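
One way to loosen that coupling is to read the designed format through a small, explicit mapping rather than a generic unmarshaller. Here’s a minimal sketch in Python using the standard xml.etree.ElementTree module and a trimmed copy of the redesigned file, with the elided parts left out. The EditorSettings class, the as_bool and load_editor_settings functions, and the choice to treat a missing attribute as false are all assumptions made for the example, not anything the editor actually does.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Trimmed copy of the redesigned configuration, with elided parts omitted.
CONFIG = """
<options clear.undo.buffer.before.format.and.indent="false">
  <editor detect.indent.on.open="false"
          detect.line.width.on.open="false"
          hard.line.wrap="false" />
</options>
""".strip()


@dataclass
class EditorSettings:
    # Hypothetical internal structure; it can be renamed or reshaped
    # without touching the file format.
    detect_indent: bool
    hard_wrap: bool


def as_bool(value: str) -> bool:
    """Convert the strings "true" and "false" used in the file."""
    if value not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {value!r}")
    return value == "true"


def load_editor_settings(xml_source: str) -> EditorSettings:
    """Map the editor element's attributes onto EditorSettings."""
    root = ET.fromstring(xml_source)
    editor = root.find("editor")
    attrs = editor.attrib if editor is not None else {}
    # Assumption: a missing attribute means "false".
    return EditorSettings(
        detect_indent=as_bool(attrs.get("detect.indent.on.open", "false")),
        hard_wrap=as_bool(attrs.get("hard.line.wrap", "false")),
    )


print(load_editor_settings(CONFIG))
```

The mapping function is the only place that knows both the file’s vocabulary and the application’s data structures, so either side can change without dragging the other along.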

XML isn’t bad, far from it, but its flexibility means that it can be used badly. To use it well, my advice is to take advantage of its strengths, and aim for readability above everything else.