May 2004 Archives

Rick Jelliffe

XML 1.1 is fairly controversial: it just adds some niche features that most people don’t need. We recently added some XML 1.1 support to Topologi’s XML editor; I was surprised by how straightforward it was, though the effort was perhaps premature.

Easier than I expected…

XML 1.1 has four aspects: NEL (a newline character used on some IBM mainframes), extra name characters (some pretty obscure characters), improved rules for control characters (more controls allowed, but safer because they must be references not literals), and coping with the different version number. Adding support for converting NEL to newlines on text import is trivial. Adding support for the extra name characters was straightforward because our editor was “broadminded” on those issues anyway (but see below on surrogates), ditto with controls, and the bump in the XML version number causes no problems at this stage. So moving our product to XML 1.1 was not disruptive or difficult.
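
To show how small the NEL change is, here is a minimal sketch in Python (our editor is Java, and this is not its code). It assumes the set of line-end forms that XML 1.1 folds into a single line feed in its end-of-line handling rules; the name normalize_line_ends is made up for illustration.

```python
import re

# Line-end forms that XML 1.1 end-of-line handling folds into U+000A.
# Two-character sequences come first so they are not split into two newlines.
_XML11_EOL = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_ends(text: str) -> str:
    """Return text with every XML 1.1 line-end form replaced by a line feed."""
    return _XML11_EOL.sub("\n", text)


if __name__ == "__main__":
    sample = "mainframe\x85record\r\nwindows\rold mac"
    print(normalize_line_ends(sample).split("\n"))
```

Running this on import, before the text reaches the parser or the editing buffer, means nothing downstream ever has to know that NEL exists.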

We use the latest version of Xerces (customized for error messages and a couple of tweaks) for XML processing. The latest version of Xerces can detect and switch to XML 1.1, and we didn’t find any problems.

We use IBM’s open source ICU4J for internationalization classes and normalization, because these track the latest version of Unicode better than Java 1.4.2. The trick with ICU4J is to strip it down to the smallest JAR possible: their site describes how to do this. Again, this was straightforward if tedious. (I hope the ICU4J team will one day provide every permutation of JAR online for download, to save users’ time.)

The biggest scary issue in XML 1.1 is that it moves beyond using just Unicode characters that fit into the sixteen-bit UTF-16 characters used internally by Java. UTF-16 has a kind of trick to cope with those kinds of characters: there are two dedicated ranges of code points in UTF-16 that, when combined, map to characters that are greater than U+FFFF. These ranges are the surrogate characters. This is not nearly as odd as it seems, because many characters already require more than one Unicode code point to represent.
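
The surrogate trick is just arithmetic. As a rough sketch in Python (function names are mine, chosen for illustration): subtract 0x10000 from the code point, put the top ten bits in the high-surrogate range starting at U+D800 and the bottom ten bits in the low-surrogate range starting at U+DC00.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)     # MUSICAL SYMBOL G CLEF
    print(hex(high), hex(low))
```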

To support surrogates in an editor you need to do three things: first, make sure that your underlying APIs support the most recent version of Unicode and that you pass around Strings (or, at least, CharSequence or Segment) rather than chars; second, make sure that your navigation and editing operations do not let you mess up a surrogate pair; and third, make sure that the correct fonts are available and your metrics understand surrogates.
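
The second point, navigation that cannot split a pair, might look something like the sketch below. It is Python standing in for Java: the text is modelled as a list of sixteen-bit code units, as a Java String would hold it, and the helper names (utf16_units, snap_caret, next_caret) are hypothetical, not taken from our editor.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def snap_caret(units, index):
    """If index falls between a high and a low surrogate, move it before the pair."""
    if 0 < index < len(units) and is_high(units[index - 1]) and is_low(units[index]):
        return index - 1
    return index


def next_caret(units, index):
    """Move the caret one character right, stepping over a whole surrogate pair."""
    if index >= len(units):
        return len(units)
    paired = (is_high(units[index]) and index + 1 < len(units)
              and is_low(units[index + 1]))
    return index + (2 if paired else 1)


if __name__ == "__main__":
    units = utf16_units("a\U0001D11Eb")    # 'a', a surrogate pair, 'b'
    print(len(units), snap_caret(units, 2), next_caret(units, 1))
```

Deletion and selection need the same treatment: any operation that takes an index should snap it first, so a lone surrogate can never be left behind in the buffer.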

…but premature?

And it is this last step where everything falls apart, at the moment. There are almost no fonts available that actually have the characters that surrogates might point to. Microsoft have one that is licensed for use with a particular product, and there is the open source Code2001 font, but apart from that almost nothing. (In fact, the absence of a variety of fonts made it impossible for us to have confidence that Java’s proportional font metrics are 100% right for the surrogate ranges: that is another story.)

I think this is pretty telling, and is why we needn’t be too hysterical or proactive about XML 1.1. Only people who set up their computers specially (or who have applications or runtimes specifically preconfigured with these fonts) can use the surrogates anyway. And only people dealing with importing certain mainframe data need worry about the NEL feature. So we just won’t see a flood of XML 1.1 data. (In fact, by the time we finished adding XML 1.1 support to our product, we had convinced ourselves that probably no-one would use 1.1 anyway.)

XML 1.1 and the ‘XML Stack’

One thing XML 1.1 does expose is that XML’s simple n.n versioning system didn’t come with enough policy to make it convenient: it would be better if there were a policy such as “XML processors must fail with a WF error when there is a different major number; XML processors must not fail only because of a difference in the minor number”, so that an XML 1.0 processor would only fail on an XML 1.1 document when there is some syntax apart from the version number that fails. (Other standards groups should heed XML’s version failure here.)
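
The policy suggested above fits in a few lines. This is only a sketch of that hypothetical rule in Python, not what the XML 1.0 Recommendation requires; the function name and the three verdict strings are invented for illustration.

```python
import re

_VERSION = re.compile(r"([0-9]+)\.([0-9]+)")


def version_verdict(declared, supported_major=1, supported_minor=0):
    """Decide what a processor should do with a declared n.n version number.

    "reject"  : malformed version or different major number (a WF error)
    "attempt" : same major, newer minor; parse and fail only on real syntax errors
    "accept"  : a version the processor fully supports
    """
    match = _VERSION.fullmatch(declared)
    if match is None:
        return "reject"
    major, minor = int(match.group(1)), int(match.group(2))
    if major != supported_major:
        return "reject"
    if minor > supported_minor:
        return "attempt"
    return "accept"


if __name__ == "__main__":
    for declared in ("1.0", "1.1", "2.0"):
        print(declared, version_verdict(declared))
```

Under this rule an XML 1.0 processor given a 1.1 document carries on, and stops only if it meets something it really cannot parse, such as a name character it does not know.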

IBM’s Noah Mendelsohn recently spoke at the W3C meeting about this: “Making the XML Stack work with XML 1.1”. Now Noah has two habits of thought that you can set your watch by: first he really wants to think that XML provides guaranteed interoperability (it only does if you are conservative), and second he is quite quick to invoke the bogeymen of Unimagined Complications. These two habits can conspire so that he frequently is (or adopts the role of being) extremely loath to move from any established position: let’s not fix what isn’t broken. In the case of XML 1.1 and protocol stacks, I think Noah should be a little conflicted: adopting XML 1.1 may have some value (IBM asked for it), but then again it requires change to other parts of the stack.

My take on whether the “XML stack” should adopt XML 1.1? The inadequate definition in XML of how to treat the minor number has fatally compromised XML 1.1’s deployability. So “XML 1.1 only” is not a workable option; “XML 1.0 only” is workable, but “limited XML 1.0 guaranteed; any XML 1.0 or 1.1 available but not guaranteed” is best. XML is so useful because it has both this reliable core for general use and also an outer layer that increases its convenience and usefulness for particular, maybe ‘niche’ tasks. XML has demonstrated that developers are smart enough to stick to the core when they need to—they don’t need subsets of standards nannying them.

Rick Jelliffe

One day last week, I had over 30,000 spam emails, many of them with 180K attachments. I remember a Hong Kong spammer claiming I should be happy that so many people are interested in informing me of their products, but I was not happy. For a start, deleting this number of files bogged down our IMAP server.

This flood happened as I was training the spam filters for a new mail reader: I wonder if I have been getting them at this rate for some time? Does everyone get this amount?

Whatever, spamming is stealing so much bandwidth and wasting so many people’s time that there clearly needs to be much more policing going on.

Or is email dead, as some claim, and we need to move to P2P and RSS alternatives?

Damien Stolarz

Related link: http://news.com.com/Is+Torvalds+really+the+father+of+Linux%3F/2100-7344_3-521665…

This article cites a really weird attack on the authorship of Linux… I remember that Linux was based on the Minix stuff early on, but Minix was designed as a learning OS in the Tanenbaum books… They make this absurd argument that Linux couldn’t have been written by Torvalds because it would be impossible to write that fast - even though it was. In 1991 I had a class on operating systems at UCLA and they expected all the CS students to rewrite parts of the Minix code… it will be incredible if this sort of nonsense wins in court.

Rick Jelliffe

Related link: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag/html/scal…

I taught a couple of full day courses last week, on XSLT and XML Schemas. The students were pretty positive banking programmers, and I only managed to disgrace myself once, when I became stroppy at a student I misheard: he was trying to be helpful, but I thought he was insanely and repeatedly insisting we should skip 120 pages of the text. Ah, the joys of fading ears. The other students evaluated me to 5 or 4 out of 5, but he only gave me 1. This is the first time that my hearing has led to a problem like this, and it is quite saddening. (I had warned the class and I apologized when I realized what happened, no flames please!)

Some other random thoughts: XML Schemas is a hard subject to make fun; the biggest lightbulbs above students’ heads went off when explaining the priority and mode attributes of templates: the sooner they are introduced the better. Allette’s course uses (by kind permission) a version of Ken Holman’s diagram of XSLT axes: it really is very useful.

One piece of advice (apart from “Wait a couple of years before adopting XSLT 2/XQuery 2 or anything with ‘PSVI’”) I gave was the importance of caching or precompiling schemas, stylesheets and XPaths. But, in the URI above, I read today that MSDN puts this even more strongly: caching is so much the norm that not doing it represents a “failure”! I don’t think this is blaming the victim, despite how it reads; it is undoubtedly good advice.

But I wonder whether we would be better off with a standard, declarative language to cope with two of XSLT’s real problems: first that XSLT systems are not really streamable (in the sense that a tree is never built and that the output can start to appear before the end of the input), and second that the kinds of text functions provided by XSLT are fairly weak.

Because XSLT[1,2] is the only game in town, it gets used when a more light-weight language might be adequate. I suppose the kind of language I am thinking of might use CSS selectors rather than XPath, and provide functions for renaming (but not rearranging) names, and for altering data values (but not sorting). Converting currencies and dates, adding progressive counts for numbered lists and for running totals, making headings use title case, collapsing or guaranteeing spaces between elements, suppressing elements or performing XIncludes. That sort of thing.

The discipline of requiring that you don’t make choices based on forward references was what allowed OmniMark (a text processing language that dominated most real large SGML production in the 90s) to have such excellent performance. (It is important to note the distinction between forward references for data values, which can be handled by diverting data and resolving the references at the end of the document, and decisions requiring forward references.)
Even today, I note that the Advent/3B2’s Pure typesetting system (which seems to be pretty fast: they say Boeing typesets hundreds of thousands of pages a day with it) gets a lot of its speed by assuming that the input is pretty much ordered in the publication order.

Perhaps some tricky implementation of XSLT could figure out if a stylesheet is streamable and switch to a streaming strategy. But I suspect we would be better off with some little language that could have a very small implementation: indeed, an implementation tightly coupled to the XML parsing stack might be fastest, rather than, say, a SAX-in-SAX out filter.
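
To make the idea concrete, here is a rough sketch of the forward-only style, written in Python as exactly the kind of SAX-in SAX-out filter I suspect would not be the fastest option. It renames elements and adds a running count to each item element without ever building a tree. The class name, the item element and the n attribute are all invented for illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameAndCount(XMLFilterBase):
    """Forward-only filter: renames elements and numbers each <item>."""

    def __init__(self, parent, renames):
        super().__init__(parent)
        self._renames = renames   # old element name -> new element name
        self._count = 0           # running count, needs no look-ahead

    def startElement(self, name, attrs):
        new_attrs = dict(attrs.items())
        if name == "item":
            self._count += 1
            new_attrs["n"] = str(self._count)
        super().startElement(self._renames.get(name, name), new_attrs)

    def endElement(self, name):
        super().endElement(self._renames.get(name, name))


def transform(xml_text, renames):
    """Run xml_text through the filter and return the serialized result."""
    out = io.StringIO()
    reader = RenameAndCount(xml.sax.make_parser(), renames)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(transform("<ul><item>a</item><item>b</item></ul>",
                    {"ul": "list", "item": "entry"}))
```

Every decision here depends only on what has already been seen, so output can start before the input ends. Sorting or rearranging would break that property, which is why the little language I have in mind would leave them out.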

Has someone made something like this already?

Rick Jelliffe

Related link: http://www.salon.com/tech/col/rose/2004/05/11/military_software/index1.html

Quoted in Scott Rosenberg’s article “Code that kills, for real” in Salon Premium, from the (DoD-sponsored) Systems and Software Technology Conference:

“XML and Web services are crucial for protecting America,” according to deputy undersecretary of defense Sue Payton.

The Salon article (see URL above, but you may need to subscribe or suffer an ad for something first) also has some nice links on shiny defense technology.

But it reminds me of that old joke: the speaker at a software engineering conference asks the 600 developers there who would feel safe in an aircraft which was flown by software they themselves had written. Only one hand was raised. The other 599 developers quickly rushed to find out what this developer’s secret was. “Nothing,” he replied. “I would be safe because I know it wouldn’t even leave the parking bay!”

Regretfully, one Web Service that might be a good place to start would be a Web Service for Whistle Blowers and the Red Cross. Perhaps this is the next technology needed for US military tactical support: tools to reduce the opportunity for counter-propaganda by the enemy, not by hiding the problem but by dealing with it fast. I also suspect there will be a bigger role after recent events for online updates and access to procedures: soldiers need to know that if the book says “thou shalt not torture” they should not torture. The message needs to be available enough that soldiers will know that “I was just following orders” or “I was not properly trained” will not protect them.

People at the top should have less likelihood/excuse/escape-clause that information had not percolated up to them, and people on the bottom should have less likelihood/excuse/escape-clause that information had not percolated down to them.

Back to the issue of military information systems. For soldiers, I reckon a combination of e Ink with eBook DTD and RSS-based updating when recharging would be agile and workable, both for human and technical procedures. It needs to be light, simple, complete and current: it needs graphics and no moving parts, but not interactivity or response.

Class 5 IETMs* (PDF) are a great idea, but the addition of interactivity and connectivity is a real killer, because you then need active programs (not just a passive browser) with input devices, web connectivity (nowadays) and so on. Pile on batteries, screens, CD-ROMs, etc. and the weight and support complexity shoots up. At the moment, it looks to me that interactivity is better handled by mobile phones; tablets or notebooks are just too bulky, at least with modern battery technology.
I suggest that Class 5 IETMs should be considered a long-term or niche goal: in many militaries (I am thinking of two that I know about first-hand) there are severe problems just getting static reference material formatted, collated and distributed for major weapons systems, given that often each system has unique mods. (Loose-leaf updates, for example, are just plain dangerous, but some militaries are justifiably cautious about adopting alternative PC hardware with the unnecessary cost and risk of operating systems.)
Get this Class 4-ish browser cheap and deployed to, say, every sergeant and above (as the baby step to Class 5), make the data also available over HTML to notebooks, and use mobile phones where possible.

(This problem of how to get complex, but relatively slow-changing data (manuals, rules-of-engagement, procedures) deployed is different from the problem of how to cope with the flood of fragmentary, fast-moving information from the battle-field and intelligence, of course.)

* IETM: Interactive Electronic Technical Manual, pronounced I Eat ‘Em

Damien Stolarz

Related link: http://www.streamingmedia.com/article.asp?id=8605

“Finally, it’s hard to minimize the importance of the DVD Forum’s provisional approval for Microsoft’s VC-9 technology, essentially Windows Media Video 9, along with two other technologies, H.264 and MPEG-2, as mandatory on next-generation playback devices.”

Wow. I just wrapped up the manuscript on my upcoming book on Internet Video where I have a whole chapter on MPEG-4 as a wonderful standards organization, fighting against the forces of proprietariness. However, I began to see the writing on the wall when so many $39 DVD players began playing both .asf and DivX ;-) movies. It’s not like Microsoft used dirty tricks to force the low-end DVD players to include Windows Media - they did it because all the ripped movies were in these formats. Thus, it really was driven by customer demand, and with that DVD foothold, it’s not surprising that the DVD Forum has now blessed Windows Media.

Another factor working for the ‘proprietary’ Windows media standard is that it has really become a de-facto standard, which is more powerful than a real standard. Video CD, Super Video CD, CD-Text, MP3, all sort of organically have become standards… none blessed by the standards bodies really until after they make good penetration in the marketplace.

A third factor is the speed of computers, and the ability to flash the ROM of DVD players, or simply sell new ones. With DVDs at $12-$25 and DVD players themselves around $39, it’s easy to see how ridiculous the pricing game is. Give away the hardware and sell the software. Thus, the original goal of video standards, to ensure that “1994’s DVD player will play 2004’s DVDs”, becomes less relevant when WalMart can practically bundle the DVD player if you buy 5 DVDs. And, whereas it used to be that a computer couldn’t play MPEG-2, only special hardware chips could, Moore’s law has made it so that it’s now easier to make a sort of general-purpose DVD player, which could have a new codec added by updating some software and flashing the ROM. It’s not that customers will flash the ROM for their DVD players - they’ll just go buy new ones. It’s that there’s no cost to the manufacturer to create essentially a brand new product - just software - and voila, you can ship a “NEW” version of your DVD player, this one playing Super-Duper Video CDs that are MPEG-4 burned to a CD-R, etc. etc. etc. Thus, the economics of “ship now and patch later” work in the hardware world; you DON’T have to get the unit perfect before you ship it; and new codecs CAN be “downloaded” on demand (by having the customer buy a new DVD player).

This may spell doom for MPEG in the mobile space too. While we definitely need to agree for at least 18 months at a time what the current video codec is, Macromedia is aggressively pushing mobile-Flash out to cell phones; Sorenson has its MPEG-4 and Non-MPEG-4 codecs, both of which make it money; and Microsoft can change its codec constantly and pretty much count on all the devices catching up.

Plus, the incentive of hardware vendors *is* to get you to purchase a new device. Thus, a standard that allowed yesteryear’s device to play today’s content is NOT aligned with the financial incentives of any hardware manufacturer (resulting in landfills full of digital electronic trash). If a new, faster codec (software) requires a new, faster cellphone (hardware), Nokia and everybody else are delighted: they sell new phones.

As a recycler with a stack of 1980s computers that still operate, and printers from the ’80s that still print PostScript, I think it could really be argued that it’s not just MPEG-4, but any large deliberative ‘standard’ that could be at risk, because Moore’s law, plus mass consumerism and economies of scale, have reduced the friction of adopting new standards to a minimum.

Hardware is irrelevant now; firmware is king.

Wow.

Software will completely take over hardware eventually, no? Comments?

Rick Jelliffe

After a horrible Windows virus infection last year, I moved from Windows to Linux. Fed up. I have been a UNIX user and sometimes administrator (HP, Sun, DEC) for about 20 years: as well as owning Macs and PCs at various times, I have been the proud owner of the wonderful AT&T UNIX PC (which had the best keyboard ever created for editing) and a SparcStation. My company had carefully avoided relying on any applications that would tie us to a particular operating system: Eclipse, Java, Open Office, Mozilla, Bugzilla and so on, just in case we needed to make this kind of move.

9.1: I chose Mandrake 9.1 last year purely because it was supposed to be easiest to install. Everyone has something better to do than system administration of personal computers! I had played with a Red Hat 7.2 in 2002, and was not really happy with the out-of-the-box fonts and Java support. However, we had been able to run VSS client fine under WINE emulation with FAT filesystem.

9.1 had installed relatively easily, but I found that the distribution had some bad holes in it: the kernel sources appeared to be a different version from the binary, so when I tried to rejig HZ I couldn’t get a working recompiled kernel. More importantly, the lack of a reliable writable NTFS meant that I couldn’t really use WINE emulation to run most of the applications I needed to. Linux needs good NTFS for WINE to work; WINE needs better documentation/wizards and study to make work; and Mandrake needed better guidance on its init.d and pserver.

VSS not working was no problem, I had decided to use CVS anyway. But I couldn’t get pserver working, to make the CVS available to remote Eclipse clients. Oh dear, a real spanner in the works, because we all like to work on each other’s code regularly. I found the Web a very unreliable source of information: configuration information regularly did not mention which distro and application version it applied to. (In retrospect, I should have just got a good book to help me.) Cleaning up and recovering from the Windows virus probably cost me a man-week, and the problems with CVS and VSS have probably cost me the same.

I have been looking at Subversion or moving to BSD, but walking through my newsagent I impulse bought May’s LinuxFormat magazine, from the UK, which includes a Mandrake 10.0 Community Edition distro. Maybe I should risk an upgrade and then see if CVS pserver works easily?

10.0: I must say, it has been the easiest install or upgrade I have ever experienced. The only hiccup was that I needed to delete some old vmlinuxes at the appropriate point in the dialog. Apart from that it was all smooth: maybe only two or three prompts to start and then to ask for more disks.

When it asked for disk 3, suddenly I discovered that the magazine had given me a second disk 2 rather than disk 3. I knew that Mandrake organizes its disks so that this one would hold only applications, so I could cancel at that point without fear. Mandrake upgrades connect to the web and get more recent versions than the disk anyway.

The result: everything seems fine. The system boots; all my files (I use a lot of partitions) are fine; internet, partitions, printer and user settings are maintained; sound has started working magically; and the most recent Open Office seems to work. The next step is to check through Eclipse and whip CVS (or Subversion) into shape.

Except… the only downside was Mozilla. It installed a nice new Mozilla 1.6, but the mail application does not come up. (Similarly, Evolution would not fire up, out-of-the-box.)
I downloaded a Mozilla binary direct from Mozilla, and it had the same problem: something to do with not being able to use some XUL file. I am guessing the problem is either because of the missing disk 3 (but why, then, would the fresh install of Mozilla fail?) or because I was installing it over an existing version: perhaps I needed to delete my ~/.mozilla/ or something.

Anyway, in the end I downloaded Thunderbird, the standalone Mozilla mail client, which works beautifully. My settings and my backup mail archives had been deleted, but we keep everything related to business on IMAP so I only lost a few friends’ email addresses: I had forgotten these were not on the IMAP server, but I can get them back.

So, all in all, this was a completely non-stressful install. I had expected that Mozilla might take a little extra time anyway. (My definition of stressful is whether it distracts me from my foreground work on another system, and I am happy it didn’t congest my naive multitasking wetware.) I am looking forward to seeing what performance impact the new kernel and pthreads have on Topologi’s editor. Traditionally UNIX has had a very smooth feel, especially when switching between applications; contrast with Windows, which always felt bursty to me, especially when working with more than one application; both Mandrake 9.1 and Red Hat 7.2 felt jerky to me, which was disappointing because of my good experiences with “real” UNIXes.

If I do find any big problems with Mandrake 10.0, I’ll blog them. And, yes, I realize that Mandrake is intended as an end-user distro rather than a hacker’s distro. But like Sun’s JDS, this is just a wonderful distro to install; I suspect the same is true for the other Linuxes (and BSD?). And the typical applications are consolidating in number and maturing in quality.

Valé Windows: So far, in the 6 or so months I have been using it, I only recall missing Windows once: when I had to do some simplified Chinese typesetting. It seems that Linuxes may still have the crappy notion of localization rather than internationalization, which is no good for all the users who need to work in files of a different script than their locale. But we still have our old Windows license and we can boot that up and run it when we need it: it’s not going anywhere.

What are your experiences with Mandrake 10.0?

Rick Jelliffe

Related link: http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=72515234-1f57-475d-8374-ad…

Dare Obasanjo (http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=72515234-1f57-475d-8374-ad1f0965dde9) wants to add “standard” to the list of S-Cursed words (http://www-106.ibm.com/developerworks/library/w-rdf/, see box). This curse makes a word prevent (http://www.nntp.perl.org/group/perl.perl6.language.regex/543) useful communication rather than enable it. Uche Ogbuji lists currently-cursed words as semantics (http://lists.xml.org/archives/xml-dev/199806/msg00424.html), schema (http://lists.xml.org/archives/xml-dev/200010/msg00937.html) and syntax.

We shall be swamped (http://www.matthewyglesias.com/archives/002797.html).

Does S have a future?