Three programmers gathered at the cubicle next to mine yesterday, clucking and snorting as is their wont. I looked over to ask what was going on. “A bug in Java,” they said. The problem was with ZIP files, specifically some differences between ZIP files made by different methods.

They had some files with non-breaking spaces (U+00A0) in the file name. Not something that I would do myself, but the number of people who want to use non-ASCII characters in their filenames is surely now much greater than the number of people just content with ASCII-only names. Aha, so file this under internationalization (I18n)!

The problem, it seems, was that WinZip stored the filenames using the system default encoding, but Java would read the filenames as UTF-8. So sometimes the parts of a ZIP file would have the non-breaking space, and other times the same file saved by a different route would have 0xFF at that position. Now this is the kind of behaviour and problem that you would have expected a decade ago, but I was surprised it still occurred.
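To make the mismatch concrete, here is a minimal sketch in Python (chosen only for brevity; the same bytes arise in any language). It assumes the archiver's "system default" was the legacy code page CP437, which is where a non-breaking space becomes the single byte 0xFF; the file name and the helper `decode_as_utf8` are made up for illustration.

```python
# Assumption: the archiver's default encoding is CP437 (IBM PC code page).
NBSP = "\u00a0"
name = "report" + NBSP + "2008.txt"  # hypothetical file name

raw_cp437 = name.encode("cp437")  # what a CP437-writing archiver would store
raw_utf8 = name.encode("utf-8")   # what a UTF-8-writing archiver would store


def decode_as_utf8(raw):
    """Decode the way a strict UTF-8-only reader would; None if malformed."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(hex(raw_cp437[6]))          # the NBSP position under CP437
print(raw_utf8[6:8])              # the NBSP position under UTF-8
print(decode_as_utf8(raw_cp437))  # a UTF-8-only reader cannot recover the name
```

The same name therefore has two different byte representations, and a reader that assumes UTF-8 cannot decode the CP437 form at all.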

Checking through Sun’s bug database, we find that this bug (or its clone) is actually the second most requested (2008-13-28). The engineer who evaluated the problem gives the excuse that Sun decided to use UTF-8 for JAR files (which use ZIP) and seems a little surprised to discover that ZIP files may actually be created by other systems too.

Looking at the bug report, we also find it was first reported 07-JUN-1999. Almost nine years ago. The bug report says it is only reported up to Java 1.4.2; however, I cannot see anything in Java 1.6 that addresses it.

So what has happened? Several things:

  • Apache put out a zip implementation as part of Ant that supports different encodings. So people who needed it can use that.
  • Since September 2006 the ZIP spec has formally included a bit to state that the file name is stored using UTF-8.
  • It seems other manufacturers have increasingly used UTF-8.
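As an illustration of that flag, the sketch below uses Python's standard `zipfile` module, which (as far as I know) sets general purpose bit 11 (0x800) whenever an entry name cannot be stored as plain ASCII. The helper `name_flags` and the file names are hypothetical; this shows one implementation's behaviour, not what every archiver does.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: name and comment are UTF-8


def name_flags(filename):
    """Write a one-entry ZIP in memory, read it back, and report
    the entry name plus whether the UTF-8 name flag was set."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(filename, b"payload")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
        return info.filename, bool(info.flag_bits & UTF8_NAME_FLAG)


print(name_flags("plain.txt"))    # ASCII-only name
print(name_flags("a\u00a0b.txt")) # name containing a non-breaking space
```

A reader that honours the flag can tell the two cases apart without guessing at the writer's code page.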

So for almost 10 years the Java version of ZIP has been broken for internationalization purposes. The fix seems to be caught in limbo (are they waiting for non-UTF-8 encodings to go away, perhaps?), and so people are forced to go to other implementations. WORA undermined! Indeed, this seems another example where Java is simply too large for Sun to maintain adequately.

But what about this angle: the current ZIP spec has an appendix on file names and encoding. It says:

The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437.

Which means that Sun’s policy of merely writing UTF-8 is now going against what the ZIP spec says.
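Read together with the UTF-8 bit mentioned earlier, the appendix suggests a simple decoding rule for readers: if general purpose bit 11 is set, treat the name as UTF-8, otherwise fall back to CP437. The following is a sketch of that rule as I read it, not a quotation from the spec; the function name `decode_entry_name` is my own.

```python
UTF8_NAME_FLAG = 0x800  # general purpose bit 11


def decode_entry_name(raw, flag_bits):
    """Decode a raw ZIP entry name: UTF-8 when bit 11 is set,
    otherwise the historical default, IBM Code Page 437."""
    encoding = "utf-8" if flag_bits & UTF8_NAME_FLAG else "cp437"
    return raw.decode(encoding)


# The same logical name, stored two ways (hypothetical entries):
print(decode_entry_name(b"a\xffb", 0) == "a\u00a0b")
print(decode_entry_name(b"a\xc2\xa0b", UTF8_NAME_FLAG) == "a\u00a0b")
```

A writer that stores UTF-8 bytes without setting the flag leaves a conforming reader to decode them as CP437, which yields mojibake rather than the intended name.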

Software maintenance and juggling issues on a budget are not easy. However, I think it is more than plausible that had Sun gone ahead and submitted Java to ISO for standardization a decade ago, this issue would have been fixed long ago, because ISO National Bodies give very high precedence to issues such as internationalization, accessibility, modularity, and conformance. So the lack of proper encoding support in the ZipEntry API would undoubtedly have come to the fore in the very first round: Japan never lets this kind of thing slip, for example.

By exactly the same token, if the ZIP format had been put through as a standard, proper encoding support would undoubtedly have been raised as part of the first review. Standardizing either would have been good enough to have a technical fix agreed on and published, and pressure applied for a fix ahead of the demands of corporate featuritis. But standardizing both would still be best.

After Sun backed off last time, leaving so many people who had participated feeling burnt, it is hard to see that standards people won’t be deeply suspicious of them. And Sun people may not be keen to submit even to a “bullshit process” based on pragmatism and incrementalism. But Java would clearly, IMHO, be in a much better position today if it had been standardized. And so would ZIP.

Standardization as a kind of audit

What standardization of a living technology gives stakeholder companies is more than just bragging rights and ammunition to shoot their rivals with and to confuse procurement people with, tempting as those things may be. It also gives an objective audit program dictated not from the corporate POV but from (to a greater or lesser extent, depending on interest) the market and relatively disinterested third parties. Any long-term software project gets encrusted in the personal politics and idiosyncrasies of the development team, and needs a circuit-breaker. This is a view of standardization as a kind of major technical audit, particularly of the documentation but also of areas that are becoming more market-critical: standards use and compliance, openness, responsiveness, accessibility, internationalization, integratability, testability, and so on.

These are all things that established technologies need. Now of course you can get audits in each of these areas by hiring experts. That is good, but you don’t get the breadth or provable transparency that National Body participation can bring. And expert opinions still have to be evaluated in the context of the power relationships of the company, the very same relationships that allowed the problem to arise (these might be as simple as CJK requirements not having an adequate champion, or I18n not being a profit center that can demand changes). And you can get benefits from using boutique standards bodies in which vendors or their representatives can have voting rights: W3C, Ecma, OASIS, and so on. That is good too, but it is open to domination by one side or the other.

Which leaves the ISO family (e.g. ISO/IEC JTC1) as being effective forums for this kind of audit. People who think that ISO standardization is always a pushover should consider the current OOXML debate: you have MS and friends on one hand and IBM and friends on the other both pushing as hard as they can, and yet as I write neither can establish clear dominance. And these are the largest players in the world. Whether DIS 29500 mark II passes or fails it will be because national bodies decided on technical issues, not pack alliances, as far as I can tell. I am sure that neither MS nor IBM is feeling comfortable at the moment: and this is the strength of the ISO kind of procedure, regardless of the outcome.

We have all had enough experience of open source to be aware of its strengths and weaknesses now. Making something open source does not automatically mean that bugs and so on will be fixed. No silver bullet. As I wrote in this blog a couple of years ago in “Sun should open source Swing”:

it is not enough to Open Source something: the mechanism for speedy response to bug fixes and releases is crucial too.

And neither will auditing a technology by making it a standard. Nothing is automatic. But error-full systems emerge from single-strategy maintenance regimes, and the dinosaur systems such as Java and Office are full of examples of this. The ISO standardization process has many qualities to commend it to large companies as a tool for shaking things up and circuit-breaking. And we still need an ISO standard for ZIP too.