I was reading the Ant (the make system) documentation today, and in the section on copy I came across this horrible note:

Important Encoding Note: The reason that binary files when filtered get corrupted is that filtering involves reading in the file using a Reader class. This has an encoding specifying how files are encoded. There are a number of different types of encoding - UTF-8, UTF-16, Cp1252, ISO-8859-1, US-ASCII and (lots of) others. On Windows the default character encoding is Cp1252, on Unix it is usually UTF-8. For both of these encodings there are illegal byte sequences (more in UTF-8 than for Cp1252).

How the Reader class deals with these illegal sequences is up to the implementation of the character decoder. The current Sun Java implementation is to map them to legal characters. Previous Sun Java (1.3 and lower) threw a MalformedInputException. IBM Java 1.4 also throws this exception. It is the mapping of the characters that causes the corruption.

On Unix, where the default is normally UTF-8, this is a big problem, as it is easy to edit a file to contain non US-ASCII characters from ISO-8859-1, for example the Danish oe character. When this is copied (with filtering) by Ant, the character gets converted to a question mark (or some such thing).

There is not much that Ant can do. It cannot figure out which files are binary - a UTF-8 version of Korean will have lots of bytes with the top bit set. It is not informed about illegal character sequences by current Sun Java implementations.

One trick for filtering files containing only US-ASCII is to use the ISO-8859-1 encoding. This does not seem to contain illegal character sequences, and the lower 7 bits are US-ASCII. Another trick is to change the LANG environment variable from something like “us.utf8” to “us”.
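To make the note concrete, here is a minimal sketch of the same failure in Python rather than Java. The `filter_copy` function is a hypothetical stand-in for a filtering copy (decode, filter, re-encode), not Ant's actual code, and Python's `errors="replace"` is used as an analogue of a decoder that maps illegal sequences to legal characters.

```python
# The Danish "ø" is the single byte 0xF8 in ISO-8859-1; that byte is not valid UTF-8.
original = "rød".encode("iso-8859-1")  # b'r\xf8d'


def filter_copy(data: bytes, encoding: str, errors: str = "replace") -> bytes:
    """Hypothetical stand-in for a filtering copy: decode, (filter), re-encode."""
    text = data.decode(encoding, errors=errors)
    # A real filter would substitute tokens in `text` here.
    return text.encode(encoding)


def strict_fails(data: bytes) -> bool:
    """Strict decoding reports the malformed byte instead of silently mapping it."""
    try:
        filter_copy(data, "utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


lossy = filter_copy(original, "utf-8")      # malformed byte becomes U+FFFD
safe = filter_copy(original, "iso-8859-1")  # every byte maps to some character

print(lossy == original)       # the UTF-8 round trip changed the bytes
print(safe == original)        # the ISO-8859-1 round trip did not
print(strict_fails(original))  # a strict decoder raises instead
```

The ISO-8859-1 trick works because that encoding assigns a character to every one of the 256 byte values, so decoding and re-encoding cannot lose anything. The price is that the filter sees the wrong characters for any non-ASCII text that was really in some other encoding.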

Now, let’s put aside the question of why anyone would copy using text operations rather than binary operations. The larger question is why on earth, in 2007 and ten years after XML came out, we are still using text files that don’t label their encoding?

Let me put it another way: if you make up or maintain a public text format, and you don’t provide a mechanism for clearly stating the encoding, then, on the face of it, you are incompetent. If you make up or maintain a public text format, it is not someone else’s job to figure out the messy encoding details, it is your job.

If avoiding the issue is the wrong approach, what is the right approach? One of the right approaches is to adopt Unicode character encodings (UTF-8, UTF-16) as the only allowed formats. (This is what RELAX NG compact syntax does, for example.)
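As a rough sketch of what “Unicode only” could look like in a reader, the hypothetical function below accepts UTF-16 when a byte order mark is present, otherwise insists on well-formed UTF-8, and refuses everything else. The function name and the BOM-based detection rule are my assumptions for illustration; this is not taken from any particular format's specification.

```python
import codecs


def decode_unicode_only(data: bytes) -> str:
    """Accept UTF-16 (signalled by a BOM) or UTF-8; reject anything else."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and strips it.
        return data.decode("utf-16")
    # Strict decoding: malformed input raises UnicodeDecodeError.
    # "utf-8-sig" also tolerates an optional UTF-8 BOM.
    return data.decode("utf-8-sig")


print(decode_unicode_only("rød".encode("utf-8")))
print(decode_unicode_only("rød".encode("utf-16")))
```

The point of the design is that a file in a legacy encoding fails loudly at read time, instead of being silently mangled further downstream.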

Another right-ish approach would be for every text format to adopt explicit labelling. The disadvantage of this, however, is that, as with HTML’s <meta> element, it is unsatisfactory to have to parse deep into the document in order to be able to parse the document. And to have recognition software that understands the conventions of each format is impossible.

However, it is possible to generalize XML’s encoding header into a delimiter-independent form that can be adopted. My 2003 suggestion for XTEXT gives the details. I don’t see any disadvantages to XTEXT: in the post-XML world, programmers have moved from being puzzled by encoding labels to understanding that they are a valuable part of the furniture.
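To illustrate the general idea only, here is a sketch of a reader that looks for an XML-style `encoding="..."` label anywhere on the first line, whatever comment delimiters surround it. The label syntax, the 256-byte window and the function names are assumptions of mine for illustration; this is not the XTEXT specification.

```python
import re

# Hypothetical label syntax, modelled on the encoding part of an XML declaration.
LABEL = re.compile(rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def sniff_encoding(data: bytes, default: str = "utf-8", window: int = 256) -> str:
    """Return the encoding named on the first line, or `default` if none is found."""
    first_line = data[:window].split(b"\n", 1)[0]
    match = LABEL.search(first_line)
    return match.group(1).decode("ascii") if match else default


def read_labelled(data: bytes) -> str:
    """Decode the whole file using the encoding it declares for itself."""
    return data.decode(sniff_encoding(data))


properties = b'# encoding="iso-8859-1"\nname=r\xf8d\n'
print(sniff_encoding(properties))
print(read_labelled(properties))
```

Because the label is matched as ASCII bytes, a sniffer like this only works for encodings that are ASCII-compatible in their first line; UTF-16 would still need a byte order mark or similar signal.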

An XTEXT-aware Ant (or default readers that recognize XTEXT conventions) would allow the problem to go away incrementally, as developers and maintainers adopt it. But the trouble is a lack of leadership by the people developing or maintaining text formats: they don’t see themselves as part of a larger community of text users, I guess, or don’t believe that there is any advantage in participating in one. I suspect that this is ultimately because the developers of text formats are people who think in terms of ASCII, or who don’t have contact with use cases where different character sets are possible. The problem is pushed downstream. Not only incompetent but lazy?

Am I being too harsh? I hope so. In particular, in this day and age of international standards, the burden for fixing this has shifted from the developers to user-community representatives: it is something that governments and non-ASCII-locale standards bodies need to consider.

When I say “You are incompetent”, an entirely satisfactory rejoinder is to say “Yes I am: I can only respond to demand from people who are affected by this issue, and the standards and procurement processes are the place for these demands to be manifested!”

But buck-passing won’t fix anything. If we know the problem won’t go away, why can’t we (we consumers or we developers) deal with it?