Unicode 5.0 was released a week ago: congratulations to all concerned. Unicode now has about 99,000 characters defined, though many of the improvements in Unicode 5.0 concern how characters are used (their properties or display algorithms) rather than additions to the repertoire. There are only 1369 new characters compared to Unicode 4.1, and there is no milestone for implementations comparable to Unicode 3.1 in 2001, when the number of characters broke the 16-bit range.
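
To make the 16-bit milestone concrete, here is a small Python sketch (my own illustration, not anything from the Unicode release itself). It counts how many 16-bit code units UTF-16 needs for a character; the helper name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs to encode one character."""
    # utf-16-be writes no byte-order mark, so every two bytes are one code unit.
    return len(ch.encode("utf-16-be")) // 2


# U+0041 LATIN CAPITAL LETTER A sits inside the 16-bit range.
bmp_char = "\u0041"
# U+10330 GOTHIC LETTER AHSA is from the Gothic block, added in Unicode 3.1.
supplementary_char = "\U00010330"

for ch in (bmp_char, supplementary_char):
    print(f"U+{ord(ch):04X} needs {utf16_code_units(ch)} code unit(s)")
```

Anything above U+FFFF needs a surrogate pair, which is why Unicode 3.1 mattered so much to implementers who had assumed one character per 16-bit unit.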

I find Unicode very inspirational. Of course the mad scripts like Tifinagh, not to mention the beautiful Burmese, have their own fascination. But the diligence and effort in Unicode demonstrate a community with a love of communication and a refined respect for culture. There are three main drivers for enhancements:

  • For Western text, the basics have long been in place and the emphasis is on additions for specialist publishing, academic and historical scripts: maths characters and Phoenician, for example.
  • For text from the industrializing nations, the emphasis is on completeness and coping with national variation: variant glyphs between China, Korea, Vietnam and Japan; the pronunciation used by Koreans; and an improved bidirectionality algorithm for Arabic, for example.
  • As the codes, algorithms and properties for national languages sort themselves out, it becomes politically possible to address the requirements for minority scripts: Balinese, for example.

Some points of possible interest:

  • There is a new time-limited evaluation version of Asmus Freytag’s UniBook browser for Unicode 5 as well.
  • The BIDI (bidirectionality) algorithm has been updated, and now includes consideration of XML. However, Unicode 5.0 allows but discourages use of “higher-level protocols” like XML; this goes against the W3C technical note on XML and BIDI, which merely discourages mixing. It looks like the tide currently is in favour of using the language code (e.g. xml:lang) to determine paragraph directionality, and then Unicode BIDI rules within text, rather than ever having explicit BIDI markup on elements. (Can anyone confirm this?)
  • An alternate form of the line-break algorithm which can be implemented as a regular expression.
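
For readers who have not looked at BIDI before, here is a rough Python sketch of the character properties involved, using the standard `unicodedata` module (which reflects whatever Unicode version your Python was built with, not necessarily 5.0). The function names are my own, and `guess_paragraph_direction` is only a simplified first-strong-character heuristic: it ignores explicit directional controls and is not a full implementation of the Unicode algorithm.

```python
import unicodedata


def bidi_classes(text):
    """Return the Bidi_Class property value of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def guess_paragraph_direction(text):
    """Guess the base direction from the first strongly directional character.

    Simplified: explicit embedding and override controls are not handled.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return "ltr"  # no strong character found: fall back to left-to-right


sample = "123 \u0627\u0628 abc"  # digits, two Arabic letters, Latin letters
print(bidi_classes(sample))
print(guess_paragraph_direction(sample))
```

A higher-level protocol such as xml:lang or explicit markup would replace the guess with a declared direction, leaving the Unicode rules to handle only the ordering inside the paragraph.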

Not all the technical reports have been updated on the Unicode website yet. (For example, there is a link to TR 14 version 19, but it only exists as version 18 at the current time.) But I expect it will sort itself out soon. ICU has not been updated for Unicode 5.0 yet, but it supports Unicode 4.1, which is almost the same.

The Wikipedia entries on language scripts are excellent, and a great introduction if you feel your knowledge of Limbu, Mithilakshar or Phags-pa lacks lustre.