I’ve blogged recently comparing the two contenders for the standard office XML format crown: the Sun/IBM-sponsored Open Document Format (ODF) and the Microsoft-sponsored MS Office Open XML format (MSOOX). I’ve also blogged recently on various metrics for XML: magic numbers that provide objective evidence to help characterize things like complexity in documents, and so help with evaluation and estimation. A reader, unsurprisingly, asked if I could combine the two threads and provide some metrics on ODF and MSOOX.
Fair enough! Here are some XML metrics for a large document with almost 180,000 words, tables, lists, sidebars and some graphics. I chose a large document so that bootstrap effects would be minimized. I used the ODF v.1.0 specification, converting it from .SXW to .DOC and .ODT in Open Office 2.0, then converting the .DOC to .DOCX in Word 2007 beta. Then I used a COTS archiver to treat the ODT and DOCX files as ZIP archives, and extracted the XML files containing the basic text and markup: content.xml (ODF) and word/document.xml (MSOOX). I chose to start from the .SXW format because I didn’t want to have any MS dependencies in the data, .DOC being proprietary.
I also resaved the document to .DOC, re-opened it and re-exported it to .DOCX and extracted the word/document.xml file. Resaving data is a good trick when doing data conversion, because it removes extraneous information or structures from the source: the first .DOC is what Open Office thinks .DOC looks like, while the second .DOC is how Microsoft does things.
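The extraction step can also be scripted rather than done by hand with an archiver. Here is a minimal Python sketch using only the standard library. The member names content.xml and word/document.xml are the ones mentioned above; the helper names and the tiny in-memory demo archive are made up for illustration, so that the sketch runs without the real files.

```python
import io
import zipfile

# Member names as given in the article.
ODF_MEMBER = "content.xml"
MSOOX_MEMBER = "word/document.xml"


def extract_main_xml(archive, member):
    """Return the raw bytes of one member of a ZIP-based office file.

    `archive` may be a filesystem path or a file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)


def make_demo_archive(member, payload):
    """Build a tiny in-memory ZIP (hypothetical stand-in for a real .ODT/.DOCX)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member, payload)
    buffer.seek(0)
    return buffer


demo = make_demo_archive(ODF_MEMBER, b"<doc>hello</doc>")
xml_bytes = extract_main_xml(demo, ODF_MEMBER)
print(len(xml_bytes), "bytes extracted from", ODF_MEMBER)
```

With real files, the same helper would be called with the path of the .ODT or .DOCX file in place of the demo archive.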
I used the upcoming release of the Topologi Complexity Detective to create the metrics. The reports on the ODF document are here Download file; the reports on the original MSOOX document are here Download file, and the better reports on the resaved MSOOX documents are here Download file. Comments below.
A few words of caution. First, neither ODF nor MSOOX is completely finished or stable; the numbers may be different in 2008. Second, this is only one document from one provenance; the numbers may be different when documents are entered natively or come from different sources. Third, the files are the products of software, so to some extent they test the applications rather than formats per se; the numbers may be different for different applications. Fourth, the version of Word being used is a beta and some parts of Open Office are also probably immature: DOCBOOK export failed, for example. (So to some extent this is a test of how some beta software produces data in a beta format, done to beta-test a utility using some beta metrics…I will update the article if any errors are found.)
Application Characteristics
Opened in Open Office, the document is about 736 pages. In Office 2007, the document formats at 732 pages. It doesn’t seem a significant difference.
For load times, I logged off and logged on again to ensure a fresh session. I opened the applications and used the open menu rather than double clicking, so that application load time was not involved. For Open Office, the .SXW and .ODT files each took about six seconds to load (this was quite load dependent: at another time I noticed the same document taking about 14 seconds to load; I believe this may be due to Open Office being paged back into memory). For the Word 2007 beta, the (resaved) .DOC and .DOCX returned their initial page display faster than that; however, the rest of the file loaded in the background, and full loading took about 35 and 45 seconds respectively.
Comment It seems that consideration of file loading needs to be slightly more nuanced than simple times, then: if you count to when you first see some text, Microsoft was much faster; however, if you count to when the document is fully loaded, Microsoft was significantly slower.
File Size
Here are the file sizes:
- .SXW (original): 434K
- .ODT (ODF 1.0): 438K
- content.xml (ODF): 4,383K
- .DOC (MS): 4,432K
- .DOCX (MSOOX): 764K
- word/document.xml (MSOOX): 7,775K
- .DOC (MS resave): 3,142K
- .DOCX (MSOOX resave): 733K
- word/document.xml (MSOOX resave): 7,472K
Comment For a large file, the .ODT file is much smaller than the equivalent .DOCX file. This can be almost entirely attributed to the relative sizes of the XML files: the ODF XML file is much smaller than the equivalent MSOOX XML file. However, the differences in those file sizes are dwarfed by the difference in their size compared to the .DOC size, which is five to ten times larger. Resaving the .DOC file resulted in a file size reduction of about 29% (4,432K to 3,142K).
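For anyone who wants to check the arithmetic behind these comparisons, here is a small Python sketch. The sizes are copied from the list above; the dictionary keys and variable names are just labels chosen for this sketch.

```python
# File sizes in K, copied from the list above.
sizes_k = {
    "odt": 438,
    "content.xml": 4383,
    "doc": 4432,
    "docx": 764,
    "document.xml": 7775,
    "doc_resave": 3142,
    "docx_resave": 733,
    "document.xml_resave": 7472,
}


def ratio(a, b):
    """How many times larger file `a` is than file `b`."""
    return sizes_k[a] / sizes_k[b]


doc_vs_odt = ratio("doc", "odt")                    # .DOC against the ODF package
doc_vs_docx = ratio("doc", "docx")                  # .DOC against the MSOOX package
docx_vs_odt = ratio("docx", "odt")                  # the two ZIP packages
xml_vs_xml = ratio("document.xml", "content.xml")   # the two main XML files
resave_reduction = 1 - ratio("doc_resave", "doc")   # shrinkage from resaving .DOC

print(f".DOC is {doc_vs_odt:.1f}x the .ODT and {doc_vs_docx:.1f}x the .DOCX")
print(f".DOCX is {docx_vs_odt:.2f}x the .ODT; document.xml is {xml_vs_xml:.2f}x content.xml")
print(f"Resaving the .DOC shrank it by {resave_reduction:.0%}")
```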
XML Metrics
So here are the XML metrics.
Element and Attribute Count
| Category | ODF | MSOOX (resave) |
| Element | 103 | 95 |
| Attributes | 325 | 150 |
| Total Metrics Value | 428 | 245 |
Comments For the same document, MSOOX and ODF seem to require about the same number of unique elements. However, MSOOX requires substantially fewer attributes. (I will look further sometime, but I’d suspect that MSOOX is using richer data values rather than markup. It also seems that the ODF content.xml file contains more style information; both the ODF and MSOOX ZIP structures have other files for containing stylesheets.) At a minimum, we can say that processing ODF and MSOOX will involve different tasks: they are organized differently, and if the extra attributes in ODF are indeed due to a finer grain of markup then we can say that some kinds of document processing using XML APIs will be easier using ODF.
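A count of this kind is easy to approximate with a few lines of Python. This is a rough sketch, not the Complexity Detective's algorithm: I am assuming that "unique attributes" means distinct (element, attribute) pairs, and the sample document and its urn:example:text namespace are invented for illustration.

```python
import xml.etree.ElementTree as ET


def element_attribute_counts(xml_text):
    """Count distinct element names and distinct (element, attribute) pairs.

    Assumption: attributes are counted per element type, so the same
    attribute name on two different elements counts twice.
    """
    root = ET.fromstring(xml_text)
    elements = set()
    attributes = set()
    for node in root.iter():
        elements.add(node.tag)
        # Namespace declarations are not reported in node.attrib.
        for name in node.attrib:
            attributes.add((node.tag, name))
    return len(elements), len(attributes)


# Invented sample, only to exercise the function.
SAMPLE = """<doc xmlns:t="urn:example:text">
  <t:p t:style="Body" t:id="a">One <t:span t:style="Em">two</t:span></t:p>
  <t:p t:style="Body">Three</t:p>
</doc>"""

n_elements, n_attributes = element_attribute_counts(SAMPLE)
print("elements:", n_elements, "attributes:", n_attributes,
      "total:", n_elements + n_attributes)
```

Running the same function over the extracted content.xml and word/document.xml would give figures comparable in spirit to the table, though not necessarily identical to the tool's.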
Field Count Metric
The field count metric here is a version of the field count metric presented in the blog before. The original metric required knowledge about which attributes were IDs, xml:space or other metadata, which requires a schema, annotations and perhaps some hand-counting. The metric here shortcuts matters by saying that the first attribute in each element is not a field.
| Category | ODF | MSOOX (resave) |
| Number of Elements with Data Content (excluding Whitespace) | 44213 | 90743 |
| Number of Attributes (excluding First Attribute and Namespace Declarations) | 12033 | 25407 |
| Total Metrics Value | 57246 | 121543 |
Comments The MSOOX numbers are about double those of the ODF. The reason for this seems to be that MSOOX uses an element value rather than an attribute value for style information, plus some mysterious base64-encoded data called “fldData” (field data), which is used for almost every chunk of text. I included this metric because I was concerned that Microsoft’s highly nested style might inflate its document complexity metric, based on tests with tiny documents, but it turns out not to be the case.
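The shortcut version of the field count described above can be sketched in Python as follows. This is my reading of the description, not the tool's code: an element counts if it directly contains non-whitespace text, and every attribute after the first counts as a field. The sample document is invented.

```python
import xml.etree.ElementTree as ET


def has_data_content(node):
    """True if the element directly contains non-whitespace text."""
    if node.text and node.text.strip():
        return True
    # Text that follows a child element still belongs to the parent.
    return any(child.tail and child.tail.strip() for child in node)


def field_count(xml_text):
    """Return (data elements, attributes beyond the first, total)."""
    root = ET.fromstring(xml_text)
    data_elements = 0
    extra_attributes = 0
    for node in root.iter():
        if has_data_content(node):
            data_elements += 1
        # The first attribute of each element is treated as metadata, not a
        # field; namespace declarations never appear in node.attrib.
        extra_attributes += max(len(node.attrib) - 1, 0)
    return data_elements, extra_attributes, data_elements + extra_attributes


# Invented sample, only to exercise the function.
SAMPLE = """<doc xmlns:t="urn:example:text">
  <t:p t:id="a" t:style="Body">One <t:span t:style="Em">two</t:span> three</t:p>
  <t:p t:style="Body">   </t:p>
</doc>"""

print(field_count(SAMPLE))
```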
Document Complexity Metric
| Category | ODF | MSOOX (resave) |
| Element | 103 | 95 |
| Required Attributes | 157 | 95 |
| Optional Attributes | 168 | 55 |
| Required Children | 16 | 19 |
| Optional Children | 112 | 73 |
| Required as First Child | 26 | 23 |
| Total Metrics Value | 582 | 360 |
Comment According to these numbers, the ODF document is more complicated than the MSOOX document. This in part reflects the use of generic elements rather than specific elements, and as mentioned it may reflect a tendency in MSOOX to do more using rich data values rather than explicit markup.
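When only an instance document is available and no schema, the required/optional split has to be inferred. One plausible way, and this is purely an assumption on my part rather than a description of how the Complexity Detective works, is to call an attribute "required" if it appears on every occurrence of its element in the document and "optional" otherwise. A short Python sketch of that inference, with an invented sample:

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def classify_attributes(xml_text):
    """Return (required, optional) counts of (element, attribute) pairs.

    Assumption: "required" means "present on every occurrence of the
    element in this one document".
    """
    root = ET.fromstring(xml_text)
    occurrences = Counter()            # element name -> times seen
    attr_hits = defaultdict(Counter)   # element name -> attribute -> times seen
    for node in root.iter():
        occurrences[node.tag] += 1
        for name in node.attrib:
            attr_hits[node.tag][name] += 1

    required = optional = 0
    for tag, hits in attr_hits.items():
        for name, count in hits.items():
            if count == occurrences[tag]:
                required += 1
            else:
                optional += 1
    return required, optional


# Invented sample: t:style is always present on t:p, t:id only sometimes.
SAMPLE = """<doc xmlns:t="urn:example:text">
  <t:p t:style="Body" t:id="a">One <t:span t:style="Em">two</t:span></t:p>
  <t:p t:style="Body">Three</t:p>
</doc>"""

print(classify_attributes(SAMPLE))
```

The same idea would extend to required and optional children, but a single document can only ever show what was used, not what a schema permits.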
Weighted Document Complexity Metric
The Topologi Complexity Detective allows you to weight various factors to reflect the experience in your organization, when deriving metrics for cost or time estimation in projects. The following weighting is one such set, based on a particular client’s experience for a certain kind of task.
| Category | Weight | ODF | MSOOX (resave) |
| Element | 2 | 103 | 95 |
| Required Attributes | 2 | 157 | 95 |
| Optional Attributes | 1 | 168 | 55 |
| Required Children | 1 | 16 | 19 |
| Optional Children | 1 | 112 | 73 |
| Required as First Child | 1 | 26 | 23 |
| Total Metrics Value | - | 842 | 550 |
Comment According to these numbers, the ODF document is more complicated than the MSOOX document.
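The totals in both complexity tables are plain sums, weighted or unweighted, so they are easy to reproduce or to re-run with your own weights. A minimal Python sketch, with counts and weights copied from the tables above (the function and variable names are mine):

```python
CATEGORIES = [
    "Element",
    "Required Attributes",
    "Optional Attributes",
    "Required Children",
    "Optional Children",
    "Required as First Child",
]

# Counts copied from the Document Complexity Metric table.
COUNTS = {
    "ODF": [103, 157, 168, 16, 112, 26],
    "MSOOX (resave)": [95, 95, 55, 19, 73, 23],
}

# Weights copied from the weighted table; swap in your own organization's.
WEIGHTS = [2, 2, 1, 1, 1, 1]


def metric_total(counts, weights=None):
    """Sum the category counts, optionally multiplying each by a weight."""
    if weights is None:
        weights = [1] * len(counts)
    return sum(c * w for c, w in zip(counts, weights))


for name, counts in COUNTS.items():
    print(name, "unweighted:", metric_total(counts),
          "weighted:", metric_total(counts, WEIGHTS))
```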
What do these numbers mean?
The trouble with metrics, of course, is that people bring their own presuppositions to interpret them. The numbers are rarely univocal: if there are a large number of unique elements, for example, does this mean that they are really targeted and rich, or uncontrolled and sloppy with overlap, or confused and inelegant, or easy to read since the names are clearer, or more difficult to remember, or more difficult to use since they may all have different usages?
Where metrics are strong is that they show us where things differ from our expectations. They provide a kind of objective evidence that lets us identify anomalies: in this case, the difference in field count led me to look at the fldData elements, and it seems possible that the Word 2007 beta saves a lot of information in base64-encoded form that ODF exposes as attribute values. Now I wouldn’t get too alarmed about this (this means you, Pamela Jones! :-) ) because MSOOX is not finished, and it would seem to me to be a very sensible implementation technique to move over progressively from binary to XML while the thing is under development. It would not be, of course, a good thing if this continued into the final standard (and the major implementations that export to it). If there is this progressive implementation going on, then that pulls the rug out from under these metrics!
The numbers seem to support the interpretation that beta MSOOX may be quite a bit less complex than ODF 1.0 at this stage, at least in the sense of using fixed structures more, and simpler in the sense of using fewer elements and attributes. ODF is flatter and has a smaller file size but seems to include more style headers than MSOOX does. The metrics indicate that the use of attributes may be significantly different between the two formats, which matters, for example, for people doing data conversion estimation. On the application level, Open Office loads the ODT file much faster than the Word 2007 beta loads the DOCX file.
Finally, the fact that we have two (and presumably more) word processors that can produce and consume XML for a decent-sized book is such a great step forwards from a decade ago. A medal to both teams! Boiled down, based on these numbers (and I need to double check my thinking here, and this is completely blue sky!) I wouldn’t be surprised if MSOOX were easier to convert from (because of its regularity, scale and low complexity) while ODF were easier to convert into (because of its richness and flexibility), once the initial hurdle of converting anything to or from either of them has been leapt.


Hey Rick,
Interesting stuff (as usual). Is it just me, or are the linked word-1 and word-2 files identical? -m
Thanks, and well spotted. I've corrected the second link and updated the blog. The difference is less than a percent.
Great article. I've been converting some spreadsheets to OpenOffice.org content.xml files and loading data into a database, and I was pleasantly surprised how easy it was. (Better than CSV, and I easily got the "Track Changes" annotations and cell styles which the client was using.) I'd love to see a similar comparison for spreadsheets.
Hey Rick, do you think you could post a link to the document itself? I'm really curious to see the cases where you are getting the base 64 encoded data...
Thanks.
-Brian
I believe OpenOffice 2.0 does support ODF 1.0, while you write several times to compare ODF 1.1. Am I mistaken?
K
-Off topic - message to the webmaster:
This comment form butchers less than and greater than signs. It needs a little decode/encode magic.
Lets see if my signature makes it this time?
K<o>
P.s.: Yes!
Here is an example. The SXW original has
<text:p text:style-name="Text body">Chapter
<text:reference-ref text:reference-format="chapter" text:ref-name="Introduction">1</text:reference-ref> contains an introduction to the
<text:user-field-get text:name="CommitteeName">OpenDocument</text:user-field-get> format. The structure of documents that conform to
the <text:user-field-get text:name="CommitteeName">OpenDocument</text:user-field-get> specification is explained in chapter
<text:reference-ref text:reference-format="chapter" text:ref-name="Document Structure">2</text:reference-ref>.
The MSOOX has
<w:r w:rsidR="00DE46CC">
<w:instrText xml:space="preserve"> REF Ref_Introduction \n \h
</w:instrText>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:t>1
</w:t>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="end"/>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:t xml:space="preserve"> contains an introduction to the OpenDocument format.
The structure of documents that conform to the OpenDocument
specification is explained in chapter
</w:t>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="begin">
<w:fldData
xml:space="preserve">CNDJ6nn5us4RjIIAqgBLqQsCAAAACAAAABkAAABSAGUAZgBfAEQAbwBjAHUAbQBlAG4AdAAlADIAMABTAHQAcgB1AGMAdAB1AHIAZQAAAAAA
</w:fldData>
</w:fldChar>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:instrText xml:space="preserve"> REF Ref_Document%20Structure \n \h
</w:instrText>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:t>2
</w:t>
</w:r>
<w:r w:rsidR="00DE46CC">
<w:fldChar w:fldCharType="end"/>
</w:r>
I should note that this may well be a fault of OpenOffice's output converter rather than Word. Because of the impenetrable .DOC stage, it is impossible to know or fix. (Well, at least there is open source!)
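For the curious, the fldData payload in the example above can be decoded with a few lines of Python. This is only a sketch: the base64 string is copied from the excerpt, but searching the bytes for a UTF-16LE string starting with "Ref_" is a heuristic I am assuming, not a documented layout of the field data, and the function name is made up.

```python
import base64

# fldData content copied from the MSOOX excerpt above.
FLD_DATA = (
    "CNDJ6nn5us4RjIIAqgBLqQsCAAAACAAAABkAAABSAGUAZgBfAEQAbwBj"
    "AHUAbQBlAG4AdAAlADIAMABTAHQAcgB1AGMAdAB1AHIAZQAAAAAA"
)


def find_utf16_string(payload, prefix):
    """Find a NUL-terminated UTF-16LE string that starts with `prefix`.

    Heuristic only: returns None if the prefix is not present.
    """
    needle = prefix.encode("utf-16-le")
    start = payload.find(needle)
    if start < 0:
        return None
    end = start
    # Walk two bytes at a time until the 16-bit NUL terminator.
    while end + 1 < len(payload) and payload[end:end + 2] != b"\x00\x00":
        end += 2
    return payload[start:end].decode("utf-16-le")


raw = base64.b64decode(FLD_DATA)
bookmark = find_utf16_string(raw, "Ref_")
print(len(raw), "bytes; embedded string:", bookmark)
```

At least for this one field, the embedded string is the same bookmark name that appears in the instrText run that follows it, which suggests the binary payload here duplicates reference information rather than hiding extra content. Other fields may of course hold something different.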
You are right, the DOCTYPE declaration has "-//OpenOffice.org//DTD OfficeDocument 1.0//EN". I will correct the text, but it doesn't alter anything.
This article really leaves me even less convinced of the value of such metrics than before. What is desirable isn't more complexity or less complexity, but the right complexity. That isn't going to be determined by these numbers but by actually analyzing the content and their representations in detail, and looking at how well the problem domain is actually represented, how completely, how precisely and how intelligible the result is. All these metrics seem to lead to is vague speculations, and to encourage you to look more closely at this or that detail to determine the why of some numeric difference. An equally big design difference might lead to equal numbers for different reasons, and if you use the numbers as your guide of where to look you'll never look there. It seems to me the moral of the story is that if you want a real answer you have to look at it in detail, and you might as well just start out that way. As it is, you didn't really get to a real answer, just a number of incomplete suppositions.
[edges] Yes, you are completely right that a metric then requires analysis, but I don't think I said otherwise.
But your phrase "vague speculation" hits the nail on the head: without some objective measures, your analysis and the evidence on which the analysis is based is speculation or guestimation. With a metric, we can provide some objective evidence; to put it another way, if we make statements about something but cannot come up with any objective metrics to back up our statement, then a manager can reasonably suspect that we are on flimsy ground. I have seen projects where a simple metric resolved an issue that the client thought was some kind of personality conflict between two consultants: the metric put the onus on the party saying "there is no difference between these two schemas" to have to show why the numbers (i.e. the objective evidence) varied so much. Metrics help save us from consultants; or, at least, good quality consultants are happy to modify their opinions in the face of more evidence.
In the case of ODF/MSOOX, we might easily say "Oh, of course ODF is simpler" out of prejudice and yet, on several fairly straightforward measures, it is not the case.
You seem to think it is a flaw if a metric encourages you to look into some detail; on the contrary, that is part of their function and why they can be useful.
I also agree that "right complexity" has a place; however, the "right"ness belongs to analysis, but the "complexity" belongs to metrics. There may indeed be better metrics, and they may involve measuring programmer productivity rather than the schema itself of course; but that is not a point against metrics in general.
Those fldChar and fldData elements are most probably used for cross referencing/toc/indexing.
Word can create such things based on styles (headings,...) or special fields (field values).
It seems to me that OpenOffice.Org Writer is inserting fields when exporting to .doc.
Using styles could result in different markup and different metrics.
BTW, Ecma released draft 1.4 of Office Open XML on 23rd of August 2006 and MS Office 2007 is not (fully) compatible to this version.
Starting with a ODF predecessor the conversion to .DOC could have been a reason for the poor size of the .DOC file compared to the ODT file.
If I understand you correctly, you have converted a .SXW file to the .DOC file using OpenOffice. That seems a weird way to start this test, as OpenOffice (or any such converter) isn't likely to create an efficient, small, complex .DOC file but rather a larger, less complex file, as that is easier for conversion purposes.
hAl, yes, the most that can be said is that at least one workflow produces these figures. As to whether other workflows produce other figures, and how significant the figures are, I leave to the reader. I wasn't trying to "show up" either OOX or ODF, I just took the most straightforward path on my system.
this is very helpful explaining the next office 2007 tools
Tim: Thanks.
Of course, these numbers are based on old versions of the technologies, not the versions of 2007 (e.g. ODF 1.1 or DIS29500). And I expect the 2008 versions (ODF 1.2 and IS 29500) will be different again. So that is a big caveat on these numbers.
1.what differance between document & Format?
2.What differance Quality Manual and Quality Procedure?