We are used to thinking of formats as rivals (either X or Y) or adjuncts (X for this use, Y for that use), but what if there is an entirely different way of approaching office file formats? What if we learn from the success of Apple’s fat binary system, where a single executable carried code for more than one processor architecture, and progress to a ZIP-based system where different standards can co-exist and share media files in the same package?
For background on this, see my 1999 paper How to Promote Organic Plurality on the WWW, which introduces three ideas (data kidnap, workflow kidnap and data lockout) as more useful, specific concepts than the usual data lock-in (which, being an over-broad concept, tends to generate over-broad solutions). The basic idea is that technology needs to be layered so that each layer can allow a multiplicity of alternatives, rather than monolithic solutions. Think of the Internet stack and all the RFCs giving alternatives. The paper was written as part of thinking about XML Schemas at the time of their development, and I think our later experience with XSD has entirely borne out its conclusions. XSD has succeeded where it is modular (e.g. data types) and had trouble where it is monolithic.
ODF, Open XML and Java web applications (.WAR files) are all based on ZIP archives. Change the extension to .zip, and you can poke about inside with just COTS ZIP utilities. So it seems that we could already, say, save a simple word processing document as HTML, ODF and Open XML, then merge them into a single file after paying attention to various path and metadata issues, and removing duplicated media files. If we give that file the .odt extension, it will open as ODF; .docx, it will open as Open XML; .war, it can be installed as a servlet serving HTML pages.
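To make the mechanics concrete, here is a minimal sketch of that merge in Python: it copies every part of each package into one ZIP, storing identical media files only once. It glosses over the path rewriting and metadata issues just mentioned (ODF, for instance, wants its mimetype entry stored first and uncompressed), and the file names are illustrative.

```python
import hashlib, zipfile

MEDIA = (".png", ".jpg", ".jpeg", ".gif")

def merge(sources, target):
    seen = set()     # digests of media files already written
    written = set()  # part names already written
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as out:
        for src in sources:
            with zipfile.ZipFile(src) as zf:
                for name in zf.namelist():
                    data = zf.read(name)
                    if name.lower().endswith(MEDIA):
                        key = hashlib.sha256(data).hexdigest()
                        if key in seen:
                            continue  # identical image already stored once
                        seen.add(key)
                    if name not in written:
                        out.writestr(name, data)
                        written.add(name)

merge(["report.odt", "report.docx", "report.war"], "report.zip")
```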
Does this give us a file that is three times the size of a single-format file? Probably not, not only because the media files (which can easily dwarf the text components) will be shared, but also because there will be fewer unique strings to compress: a unique string in the original document’s text will appear in all three formats. Also, the modern formats allow embedded XML for forms or spreadsheet data sources, which can also be shared. So I don’t see it as impractical from the size point of view.
The kinds of adjustments that would need to be made include adding the appropriate MIME-based content type information to the ODF manifest metadata and to the Open Packaging Conventions content types file. But the result would be a single file that could, with the appropriate extension change, be read by any of the other systems.
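As a sketch of that bookkeeping, assuming the merged package from above: the part name and media type follow the published ODF manifest format and the standard Open XML content type, but a real implementation would of course use an XML parser rather than string splicing.

```python
import zipfile

def register_part(manifest_xml, full_path, media_type):
    """Append a file-entry to an ODF META-INF/manifest.xml document."""
    entry = (f'<manifest:file-entry manifest:full-path="{full_path}" '
             f'manifest:media-type="{media_type}"/>')
    return manifest_xml.replace("</manifest:manifest>",
                                entry + "\n</manifest:manifest>")

# Tell ODF readers about the Open XML main part living in the same package.
with zipfile.ZipFile("report.zip") as zf:
    manifest = zf.read("META-INF/manifest.xml").decode("utf-8")
manifest = register_part(
    manifest, "word/document.xml",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml")
```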
And, more interestingly, if we made up a new extension (.superXML?), an application could open the file and then select whichever format it was happiest with. For example, an application might only cope with HTML and ODF, and so would choose one of those. Or an application might decide to open the file using whatever was the native format of the application that created the document: for example, if the document was created by Open Office, the receiving application might decide that ODF matches the feature set of Open Office better than Open XML does, and so import using that.
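A minimal sketch of that negotiation, again in Python: the signature parts used to detect the ODF and Open XML renditions are real, but the HTML entry name and the preference list are assumptions for illustration.

```python
import zipfile

SIGNATURES = {              # a part whose presence marks each rendition
    "ODF":      "content.xml",
    "Open XML": "[Content_Types].xml",
    "HTML":     "index.html",
}

def negotiate(path, preferences):
    """Return the first format the application prefers that the package holds."""
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    available = [fmt for fmt, part in SIGNATURES.items() if part in names]
    for fmt in preferences:
        if fmt in available:
            return fmt
    raise ValueError(f"no supported rendition in {path}: {available}")

print(negotiate("report.superxml", ["ODF", "HTML"]))  # an app that prefers ODF
```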
A different road to harmonization
With this kind of framework in place, the road to harmonization becomes clear, because harmonization is no longer a question of “Which format do we choose as the round hole, and which formats have to become square pegs?” but a question of “What modules do they have in common now? What modules can be split out of one to help the other?” So by supporting plurality and modularity, we can actually find the points of similarity and quarantine the differences to ever-smaller alternative fragments.
Let’s give a practical example. Font matching is the feature where an application opens a file and, upon discovering that some font needed by the document is missing, tries to find a near match. Various mechanisms can be used, but the most basic matching criterion is of course whether the font contains the characters for the language (I mean “script”, of course) being used. It is no good using a Russian font for a Thai document.
ODF betrays its pre-Unicode and UNIX roots here: it uses a non-Unicode-based system that takes the locale character set of the original document (or of the font) and matches on that. So it will say “This font has an ISO 8859-1 mapping table, therefore we will look for another font with an ISO 8859-1 mapping table.” This is pretty crappy in theory, actually, because Unicode extends so many of the locale-based character sets, but ultimately OK, because these things are only optional hints and the more hints the better.
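For the flavour of it, here is a minimal sketch of charset-hint matching along those lines; the font names and the table of installed fonts are made up, and ODF itself carries such hints in its font-face declarations rather than anywhere like this.

```python
# Match a missing font by its declared character-set mapping table.
installed = {
    "Nimbus Roman": "iso-8859-1",   # a Latin-1 font
    "Norasi":       "tis-620",      # a Thai font
}

def substitute(missing_charset):
    """Return an installed font declaring the same mapping table, if any."""
    for name, charset in installed.items():
        if charset == missing_charset:
            return name
    return None

print(substitute("iso-8859-1"))  # Nimbus Roman: the Thai font is never offered
```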
Open XML uses the more modern Open Font Format standard, ISO/IEC 14496-22, for font matching, which allows matching both by Unicode block and by major script family. The Open Font Format comes from OpenType, which in turn is a container for including both Adobe PostScript fonts and Microsoft TrueType fonts: in fact, it is another example of this kind of containment mechanism.
Interestingly, it is this use of ISO/IEC 14496-22 that shows up one of the problems with ISO DIS 29500 (i.e. Open XML). You may remember that anti-Open XML people have raised the issue of bitmasks in Open XML, with the lunatic fringe going as far as saying that Open XML was riddled with bitmasks and that these were impossible to validate or manipulate in XSLT; and me then rushing to Schematron’s defence and showing how it was entirely possible, if not trivial, in Schematron and XSLT. Well, the main place that bitmasks are found in Open XML is actually the font/sig element that is used for font matching, and the bitmask values are those specified by ISO/IEC 14496-22. There is no reason that I can see for an application to tease apart the bitmask numbers, certainly not into 96 separate attributes for something that humans will not be interested in: the numbers are just magic numbers that come from the original font and are matched against the prospective substitute fonts. In the same way, you don’t want separate values for R, G and B, because a combined RGB value is more convenient for manipulation. (So the problem with DIS 29500 is not that it uses bitmasks in this element, but that it only gives a vague reference to the standard that the bitmasks are based on, when it should have a clear normative reference; I don’t think anyone else has picked up the implicit ISO Open Font reference, hooray for me. Yet again, this requires just an editorial fix rather than a technical fix.)
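To show just how little teasing apart is needed, here is a minimal sketch of bitmask matching in Python. The hex values follow the shape of the usb0..usb3 (Unicode range) attributes on the Open XML sig element; the candidate fonts and the scoring rule are made up for illustration.

```python
def coverage(sig):
    """Combine the four 32-bit Unicode-range fields into one 128-bit integer."""
    usb = [int(sig[k], 16) for k in ("usb0", "usb1", "usb2", "usb3")]
    return usb[0] | usb[1] << 32 | usb[2] << 64 | usb[3] << 96

def shared_blocks(a, b):
    """Count the Unicode blocks that both fonts claim to cover."""
    return bin(coverage(a) & coverage(b)).count("1")

# The missing font's signature, and two prospective substitutes.
missing = {"usb0": "E00002FF", "usb1": "4000ACFF", "usb2": "00000001", "usb3": "00000000"}
candidates = {
    "CandidateA": {"usb0": "E00002FF", "usb1": "40000000", "usb2": "00000000", "usb3": "00000000"},
    "CandidateB": {"usb0": "00000003", "usb1": "00000000", "usb2": "00000000", "usb3": "00000000"},
}
best = max(candidates, key=lambda name: shared_blocks(missing, candidates[name]))
print(best)  # CandidateA shares far more Unicode blocks with the missing font
```

The whole match is a single AND over the combined values, which is exactly why splitting the mask into dozens of attributes would buy nothing.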
So what should be done? Should Unicode people say to ODF “You need to replace your antique system with something better”? Should Linux people say to Open XML “You need to replace your cross-platform system with something that handles Linux-only legacy fonts better”?
With a common system based on plurality, we can say “Well, why not modularize both out as separate resources in the ZIP archive, so that each application has more resources to use?” Now Open XML is probably ahead of ODF here, because it tries to split the document up into many different files in the archive and is already divided into multiple namespaces. So it would be great if ODF adopted the same kind of modularity too. An ODF application could then, if it chooses, look in the Open XML font tables for better information. And a Linux system that is using the Open XML format can include information that better supports legacy fonts on Linux.
Practical issues that need to be addressed to get to plurality
The overarching idea is not so much that each document will have a grab-bag selection of different formats, but that each document will have at least one complete version in a standard format *plus* any alternative and additional information from other formats that the application can provide, so that a receiving application can choose the best modules it can, and information interchange becomes less dependent on the limits of one particular standard.
I have mentioned before that no serious application suite can afford to ignore any common standard format. So in a couple of years’ time I am sure we will see Open XML and ODF import/export as part of the base packages for all the suites. Indeed, governments and power buyers should demand this from vendors for the distros they buy. (I suspect this will indeed become a purchasing requirement: see the European Open Documents Exchange Formats workshop of February 2007, where (p. 12) “Representatives from public administrations requested over and over again that industry take steps to overcome interoperability problems between ISO 26300 (ODF) and Office Open XML and to implement both standards in their products.” The writing is on the wall.) But my idea goes beyond mere transformation, to a model of enabling selective augmentation.
Now even though it seems we can probably make an archive with these different formats now, the difficulty is with writing them. Applications currently won’t update the format parts they don’t understand, of course. So if you update an ODF document that also has an embedded Open XML version, the Open XML version will be out of sync. This is an area for standards, and in particular for the maintenance of OPC and ODF: should the extra parts be removed, and how do we signal their status in markup?
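One way to signal it, purely as a sketch: record a digest of the primary part when the alternates are generated, so a reader can at least detect drift. The META-INF/alternates.txt convention here is invented for illustration; nothing in ODF or OPC defines it.

```python
import hashlib, zipfile

with zipfile.ZipFile("report.superxml") as zf:
    # Hypothetical record: "<primary part> sha256 <hex digest at generation time>"
    part, algo, expected = zf.read("META-INF/alternates.txt").decode().split()
    actual = hashlib.sha256(zf.read(part)).hexdigest()
    if actual != expected:
        print("Alternate renditions are stale; fall back to the primary format.")
```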
Adopting the multi-format approach then has flow-on effects for other formats, such as PDF. PDF would need to be unshelled, so that the various pages and resources were exposed as different files in the ZIP archive.
Of course, adopting this approach would not preclude different formats from cross-pollinating and converging where possible. But sometimes there are differences that cannot be reconciled, and supporting plurality means that no solutions are gratuitously ruled out by bureaucratic dictates for single standards. (Obviously I think the “Highlander” principle expressed here (p. 6) is in danger of being terribly simplistic and impractical, unless the one true format itself allows plurality at subsequent layers.)
Open XML already has some capability for allowing alternative chunks within a file, and ODF of course allows foreign elements, so you could poke some alternative or extra information in there. But my view is that this is something that needs to be engineered at the standards level, with vendor buy-in, to push competition between standards bodies and their stakeholders one level up the protocol stack. Every level is a victory, and I think this is a race where we need to win one step at a time. The hare and the tortoise.
What steps might this involve? Well, for a start, I think that most of the Open Packaging Conventions (OPC) should be adopted. An on-ramp could be made for it, to allow current ISO ODF documents to fit in. The big difference is that ODF uses direct references to entities in the package, while Open XML uses OPC, which uses indirect references. So the idea would be an identifier resolution system where ODF applications first treat the reference as a local relative URL, then, if that fails, look it up in the OPC package, then, if that fails, treat it as an external URL (of course, delimiters will provide extra hints to speed this up). Furthermore, Open Office rewrites the identifiers it uses to be GUID-like rather than human-readable names, so it would be nice to mirror SGML’s PUBLIC/SYSTEM identifier distinction here: SGML got it right.
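A minimal sketch of that fallback resolution, assuming OPC’s usual _rels layout; a real resolver would parse the relationships XML rather than merely searching it, and the error handling is token.

```python
import posixpath, zipfile
from urllib.parse import urlparse

def resolve(zf, base_part, ref):
    """Try ref as a package-local relative URL, then as an OPC relationship
    id in the base part's .rels file, then as an external URL."""
    local = posixpath.normpath(posixpath.join(posixpath.dirname(base_part), ref))
    if local in zf.namelist():
        return ("package", local)
    rels = posixpath.join(posixpath.dirname(base_part), "_rels",
                          posixpath.basename(base_part) + ".rels")
    if rels in zf.namelist() and ref in zf.read(rels).decode("utf-8"):
        return ("relationship", ref)
    if urlparse(ref).scheme:
        return ("external", ref)
    raise KeyError(ref)

with zipfile.ZipFile("report.superxml") as zf:
    print(resolve(zf, "content.xml", "Pictures/logo.png"))
```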
But I don’t think these issues are insurmountable. The question we need to ask is not “How do we enforce monolithic technologies?” but “How do we take the sting out of multiplicity?” It is not a question of trying to have your cake and eat it too, but rather that it is foolish and unworkable to merely throw half the cake out. Oh, that is getting far too aphoristic.
The pluralistic approach of this .superXML format also makes it easy to address issues such as equations, bibliographic citations and metadata, where the needs of laymen are entirely different from the needs of professionals. The primary standard formats can adopt simple, layman-oriented structures (Dublin Core, etc.) while encouraging specialist formats with higher quality requirements.
The recent bomb in the ODF world, Gary Edwards’ claims that Sun successfully blocked the addition of features to ODF that would be needed for full interchange with Office, is explosive not only because it demonstrates how ODF was (properly, in my view) developed to cope with the particular features of the participants, not really as a universal format, but also because it props up Microsoft’s position that Open XML is required because it exposes particular features that ISO ODF is not capable of exposing, both because ODF is still in progress and because sometimes the features are simply incompatible in their details.
(For subsequent discussion by Edwards, see Game over for OpenDocument, noting that I agree strongly with some of his later comments but disagree strongly with their thrust. In particular, I think that what Edwards is actually complaining about, here and in these comments, is that there needs to be an application standard and profiles as well as a document standard. Without agreeing or disagreeing, comments like “It was nothing short of a total breakdown and abandonment of the consensus process” about the ODF development process, coming from a participant, are interesting in light of the various process claims about Open XML. Obviously sometimes people who lose out in committees feel cheated, obviously some people are congenitally independent enough that they don’t accept the umpire’s decisions, but equally obviously sometimes committees end up going in directions that earlier participants were not happy with, and sometimes there are indeed mistakes in procedure made. In linking to Edwards’ articles, I’d recommend that readers decline to make unnecessary judgments unless they can thoroughly investigate all the different sides of the story themselves. IYKWIM.)
I raise it because when we have an either/or mentality, we force ourselves into a nasty political world of intense scrutiny and spin, which diverts effort away from practical technical issues. A move to a pluralist technology has the capacity to reduce the heat without entrenching one side or the other further. The moves to require a single, unified standard for applications that have different feature sets and traditions are based on the wrong analysis of how large open systems get to thrive. They thrive by being small layers which enable a plurality of subsequent layers, and time, user support and vendor support work out which ones get widespread traction.