As several commentators pointed out, the simple examples in my previous blog, Comparing Office Document Formats, show quite a large difference in size and complexity between the office formats. But it is useful not to jump to conclusions. Don’t be scared of wrapper elements: HTML has too few, and is popular but impoverished because of it.

The bottom line for data formats is “is the information extractable?” not “is the markup pretty?” Complexity is certainly undesirable, but the choices are not simple. For example, if you decide to have really simple elements that each serve only one signalling purpose, or to favour elements over attributes, you will probably end up with deeper nesting that may scare the horses: people will be perturbed. Yet in a sense you are uncomplexifying: at least you are increasing cohesion and decreasing coupling.

As a concrete example, last week at an XSLT course I was giving, I produced a simple stylesheet for converting the Office Open XML document back to HTML. As it turned out, it was pretty simple, just a few rules like this:

<xsl:template match="w:document/w:body/w:p[w:pPr/w:pStyle/@w:val='Heading1']">
   <h1><xsl:apply-templates/></h1>
</xsl:template>

The XPath is long, but not complicated; the contents of the template are really simple. Now the equivalent rule for ODF would perhaps be:

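<!-- Names of the styles derived from Heading 1 -->
<xsl:variable name="Heading1Style"
     select="//style:style[@style:parent-style-name='Heading_20_1']/@style:name" />
<!-- Note: a variable reference in a match pattern needs XSLT 2.0 or later -->
<xsl:template match="text:h[@text:style-name=$Heading1Style]">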
   <h1><xsl:apply-templates/></h1>
</xsl:template>

So which is simpler here: ODF or Office Open XML? ODF reads much better, but getting the semantics back from the styles seems to involve an extra level of indirection for headings. When we look at lists, the opposite is true: ODF has nice explicit markup for list containers, while in Office Open XML (and XSL-FO) you have to scrabble around for @ilvl attributes to try to reconstruct the list containers.
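To give a feel for what reconstructing list containers involves, here is a minimal sketch in Python rather than XSLT. It assumes the flat list paragraphs have already been reduced to (level, text) pairs taken from the ilvl values; the function name nest_list_items and the sample items are hypothetical, and real documents would also need numbering and list-identity information that is ignored here.

```python
import xml.etree.ElementTree as ET


def nest_list_items(items):
    """Rebuild nested <ul> containers from flat (level, text) pairs.

    `items` is assumed to be a list of (level, text) tuples, where level
    is a non-negative integer (0 = outermost list).
    """
    root = ET.Element("ul")
    open_lists = [root]  # open_lists[d] is the currently open <ul> at depth d
    for level, text in items:
        # Close any containers deeper than this item's level.
        del open_lists[level + 1:]
        # Open new containers inside the most recent <li> until deep enough.
        while len(open_lists) <= level:
            parent_ul = open_lists[-1]
            if len(parent_ul) == 0:
                # Nothing to hang the sublist on, so add an empty item.
                ET.SubElement(parent_ul, "li")
            open_lists.append(ET.SubElement(parent_ul[-1], "ul"))
        ET.SubElement(open_lists[level], "li").text = text
    return root


# Hypothetical sample: two top-level items, the first with two sub-items.
items = [(0, "Fruit"), (1, "Apple"), (1, "Pear"), (0, "Veg")]
print(ET.tostring(nest_list_items(items), encoding="unicode"))
```

The point is not that this is hard, only that the containers are implicit in the flat markup and have to be inferred, whereas ODF states them outright.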

Which is why I think it is better to consider the bottom line: can all the information be round-tripped, even if with effort? That, I think, is the question that anyone with archiving and data conversion requirements should be considering, more than any initial eye-rolling, however understandable.