Articles Archives

Rick Jelliffe


SwiXML is Wolf Paulus’ XML User Interface language (XUI or XUL) which uses the regularity of the Java Swing GUI libraries to allow a very lightweight implementation: XML elements are used for JComponents, XML attributes are used for properties (e.g. <frame size="640,480"/> would be JFrame.setSize(640, 480)), and there is provision for layout managers, ids, custom JComponents and so on. Substance is Kirill Grouchnikov’s Look and Feel library for Swing components, which allows various subtle and lurid modern effects: the latest release, version 5, has just come out. Kirill has also written a good series of articles, available at his blog, on what he has learned about running a successful, responsive open source project.

So, how can I use SwiXML with Substance 5?

Selecting the look and feel with SwiXML is easy: you just add the appropriate plaf attribute to the top frame element. You can find a list of the various looks and feels available in Substance in the documentation or under the skin directory of the source tree. So your top element is, for example,

<frame plaf="org.jvnet.substance.skin.SubstanceOfficeBlue2007LookAndFeel" ...>
...
</frame>

In many cases, that is all that is needed. However, I found a gotcha with an easy fix.

In Swing, you are supposed to do everything concerned with the GUI on the Event Dispatch Thread (EDT). This prevents some kinds of thread-related problems, for example where one thread tries to access an object already destroyed by another. Some of the Substance code has checks which generate errors if a method was not invoked on the EDT.

However, all the example code from SwiXML is run straight from the main thread. This is OK in theory, because you are just instantiating the objects, so there is little chance of mishap; but it is exactly what trips the Substance checks.

The simple answer is to run the SwiXML initialization in the EDT, and invokeLater() is our friend here. Instead of

   Component myGUI = new SwingEngine(this).render("xml/myGui.xml");

you use

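  // Declared as a field of the enclosing class: an anonymous inner class
  // cannot assign to a non-final local variable.
  Component myGUI;

  // "this" inside run() would be the Runnable, so keep a reference to the client.
  final Object client = this;
  SwingUtilities.invokeLater(new Runnable() {
      public void run() {
          try {
              myGUI = new SwingEngine(client).render("xml/myGui.xml");
          } catch (Exception e) {
              e.printStackTrace();
          }
      }
  });
  Thread.yield();
  Thread.yield();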

Unfortunately, even though Substance is highly configurable, it mainly uses various kinds of system or UIManager or rootNode properties rather than setters. It is possible to make a home-made property element in SwiXML; however, because Swing elements are initialized before being attached to their parents, there is no way to access the client properties of the intended target object, so it is not well suited. (It seems you may be better off just specifying the look and feel mix you want in Java rather than SwiXML at the current time.) But the simple selection of the Substance LAF seems to work fine.

SwiXML already has support for the JGoodies Forms library built-in, which some of the Substance demos also use. I hope to get around to using the Flamingo library with SwiXML soon.

What are those Thread.yield() calls? The first is to encourage the EDT to run immediately (yield() is only a hint to the scheduler), just because we want to be in a known good state. The second is because it is possible, though not likely (though in Java who can say?), that Substance itself may use invokeLater() calls on the EDT where the runnable will be queued on the EDT but scheduled after the return from the first Thread.yield(). The second call is at least harmless, and perhaps best practice: some people think that just because in Java you often don’t have to think much about scheduling (or memory management) it means you never have to: quite wrong for desktop apps.
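Where the main thread really must not continue until the GUI has been built, SwingUtilities.invokeAndWait() is a possible alternative to yielding, since it blocks the caller until the Runnable has finished on the EDT. The following is only a sketch of how that might be wrapped, not SwiXML or Substance code: EdtInit and runOnEdt are hypothetical names, and the Callable stands in for the new SwingEngine(...).render(...) call.

```java
import java.lang.reflect.InvocationTargetException;
import java.util.concurrent.Callable;
import javax.swing.SwingUtilities;

final class EdtInit {
    /** Runs the task on the EDT, waits for it to finish, and returns its result. */
    static <T> T runOnEdt(final Callable<T> task) {
        final Object[] result = new Object[1];
        Runnable wrapper = new Runnable() {
            public void run() {
                try {
                    result[0] = task.call();
                } catch (Exception e) {
                    throw new IllegalStateException("GUI initialization failed", e);
                }
            }
        };
        try {
            if (SwingUtilities.isEventDispatchThread()) {
                wrapper.run();                          // already on the EDT: run directly
            } else {
                SwingUtilities.invokeAndWait(wrapper);  // blocks until run() has returned
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the EDT", e);
        } catch (InvocationTargetException e) {
            throw new IllegalStateException("GUI initialization failed", e.getCause());
        }
        @SuppressWarnings("unchecked")
        T value = (T) result[0];
        return value;
    }
}
```

The render call would then go inside the Callable's call() method, and the returned Component is available to the main thread as soon as runOnEdt() returns. The trade-off is that the main thread blocks, which is usually what you want at start-up but not elsewhere.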

Rick Jelliffe


First some jargon (from the Glossary of Typesetting Terms or Harrod’s Librarians’ Glossary; full props to Google). Castoff: The calculation of the number of typeset pages a manuscript will make, based on a character count. Proof: An impression made from type before being finally prepared for printing. Proofs are made on long sheets of normal page width… Galley proof: Proof of text before it is made up into pages…just as long as can be conveniently photocopied – usually 13 inches. Compose: To set type-matter ready for printing.

Deciding on breaks

I am ancient enough to have used galley proofs, the long pages of the text of a book before it had been made up into the final pages and run off on a printer (or rather, by a printery). Something like them still exists in the draft modes on some modern word processors, I suppose. There has always been a chicken and egg problem in documents which contain dynamic forward references that expand to section or page numbers (e.g. See page 99): how do you know how much space to reserve for the page number? A reference on a tightly-set line or full page may cause different page breaks if it is a two or three digit number, for example. A traditional way to deal with this was to allow a lot of space around page references (to reduce the impact) and to take two passes over the document, the first to estimate the pages and the space required for each reference, and the second to actually compose the document using the calculated space as fixed and squeezing the generated text if necessary.
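The two-pass approach can be sketched in a few lines. This is a toy model under loudly stated assumptions, not any real composition system: blocks are reduced to character counts, lines and pages have fixed capacities, each block holds at most one page reference, and TwoPassCastoff, paginate and compose are hypothetical names.

```java
final class TwoPassCastoff {
    static final int LINE_WIDTH = 60;      // characters per line (assumed)
    static final int LINES_PER_PAGE = 40;  // lines per page (assumed)

    /**
     * chars[i] is the text length of block i and refWidth[i] the space
     * reserved in it for a page reference. Returns the 1-based page on
     * which each block starts.
     */
    static int[] paginate(int[] chars, int[] refWidth) {
        int[] startPage = new int[chars.length];
        int line = 0;                                       // lines consumed so far
        for (int i = 0; i < chars.length; i++) {
            startPage[i] = line / LINES_PER_PAGE + 1;
            int total = chars[i] + refWidth[i];
            line += (total + LINE_WIDTH - 1) / LINE_WIDTH;  // ceiling division
        }
        return startPage;
    }

    /** target[i] is the block that block i refers to, or -1 for no reference. */
    static int[] compose(int[] chars, int[] target) {
        // Pass 1: estimate, allowing a generous three characters per reference.
        int[] width = new int[chars.length];
        for (int i = 0; i < chars.length; i++) {
            width[i] = target[i] >= 0 ? 3 : 0;
        }
        int[] estimate = paginate(chars, width);
        // Freeze each reference's width from the estimated page of its target.
        for (int i = 0; i < chars.length; i++) {
            if (target[i] >= 0) {
                width[i] = String.valueOf(estimate[target[i]]).length();
            }
        }
        // Pass 2: compose with the widths fixed; a longer number than was
        // reserved would have to be squeezed into the same space.
        return paginate(chars, width);
    }
}
```

Note that the second pass can still move a block to a different page from the one estimated; only the reserved width is frozen, which is what stops the process from oscillating.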

The idea that you could divide the same text into different length pages is obvious, and quite early on even the electronic typesetting programs allowed draft modes (or provided alternative macros) for producing proofs. Given the requirement of some publishers for double-spaced manuscripts, the idea of separating structure and presentation, ascribed to Charles Goldfarb and (independently) Brian Reid, does not seem a big leap to us nowadays. Multi-publishing and retargeting became commonplace in the SGML arena, with the advent of declarative stylesheets looming for a long while, but the next really big step came with the WWW and the impact of resizable windows on formatting.

One of the most important ideas following from the separation of presentation (into stylesheets) and content has been the formalization of the page-flow model (frames), which was championed by Frame Corporation’s FrameMaker though the simpler concept of regions was of course older. The idea is you “pour” the text into the frames and they flow, break and cause new pages where they will.

Loose

In my blog yesterday, I mentioned that the transformational approach of stylesheets in XML (the DSSSL, XSL-FO streams) is only loosely coupled with the typesetting engine (or formatting engine…some people think that word processors don’t do typesetting, I don’t want to get hung up on terminology), so there are some kinds of page design rules that are impossible, if only because the developers cannot be aware of every design rule anyone might want to make.

The separation also impacts another area: document interoperability. I have written several blogs referring to Markup’s Dirty Little Secret, which is that because everyone’s system and each system’s algorithms and resources and capabilities are different, you cannot expect perfect fidelity to the extent of the same line and page breaks when exchanging XML+stylesheet documents (such as OOXML, ODF, DocBook, you name them). This goes quite against the expectations of some users (though I think people are much more realistic about this now than two years ago) and quite against the hard requirements of others (for example, people who need fixed page numbering for legal requirements).

In yesterday’s blog, Standardization as a collective loss of imagination?, I suggested that users may need to assert themselves to keep the standardization of the current round of office application formats out of a particular pitfall: losing sight of the centrality of page (and document and information) design, of how to help people communicate rather than how to add the latest pet feature from some vendor. Not that pets are not fun and valuable.

Hinting at our priorities

The tie-in between that suggestion and the page-fidelity problem (which is really an interoperability issue) is that I think we need some more imagination about whether our current re-pour-each-time model of formatting is actually good enough if we genuinely want substitutability of office applications. People don’t want to be sold a turkey.

Now SGML did provide processing instructions, a kind of markup that still exists in XML, for applications to add extra information that belonged to formatting, for example. The ArborText Publisher program used them very successfully, with processing instructions that let you force page and line breaks in certain places. That is one way of integrating page markup, but it is not what I am suggesting (for various reasons).

At the moment, I think that a much better approach would be to add a kind of castoff hint as an attribute to each block-level object (paragraph, list item, table cell, etc). This would be added to the XML markup by the formatting engine as a hint, to enable a subsequent formatter to try to get the same results.

The first time data came into a document, the normal composition mechanisms would apply. But the document’s block structures would also be decorated with these hints at save time. And subsequent opens of the document would use these hints as well when composing the pages. For example, the castoff hint might be as simple as giving the bounding box of the block on the page. The composing system would use the differences between these bounding boxes and the bounding boxes it wanted to use as penalties to adjust line feathering (or even margins, padding, breakpoints, spacing, text size).
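To make the penalty idea concrete, here is one way it might be scored. Everything in this sketch is an assumption for illustration: the CastoffHint class, its fields, the weight given to landing on a different page, and the idea of choosing between a handful of candidate layouts are all hypothetical and not part of ODF, OOXML or any existing formatter.

```java
final class CastoffHint {
    final int page;                       // page the block was on at save time
    final double x, y, width, height;     // bounding box on that page, in points

    CastoffHint(int page, double x, double y, double width, double height) {
        this.page = page;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Penalty for composing the block at `actual` when this hint was saved. */
    double penalty(CastoffHint actual) {
        // Landing on a different page matters far more than small shifts
        // within a page; the weight of 1000 is an arbitrary assumption.
        double pagePenalty = Math.abs(page - actual.page) * 1000.0;
        return pagePenalty
            + Math.abs(x - actual.x) + Math.abs(y - actual.y)
            + Math.abs(width - actual.width) + Math.abs(height - actual.height);
    }

    /** Returns the index of the candidate layout that best matches the hint. */
    static int closest(CastoffHint hint, CastoffHint[] candidates) {
        int best = 0;
        for (int i = 1; i < candidates.length; i++) {
            if (hint.penalty(candidates[i]) < hint.penalty(candidates[best])) {
                best = i;
            }
        }
        return best;
    }
}
```

A formatter could generate the candidates by trying slightly different feathering or spacing and keeping whichever comes closest to the saved box; a minimal implementation could just ignore the hint.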

Auto-sizing is not completely unknown: WordPerfect had a patent on automatically adjusting various page parameters to make sure some range of text fitted on a single page. And many people are aware of the behaviour of some page-oriented systems, such as presentation programs, which automatically resize text (including nested text lists) to fit into the available space.

It could be user selectable whether to freeze the page according to the block hints or just use them as hints, or ignore them. As a hint, it wouldn’t interfere with minimal implementations.

Rick Jelliffe


Regularly as clockwork, every five years another group attempts to make a new standard language for typesetting: FOSI, DSSSL, XSL-FO, and ODF (plus the less grandiose scopes of CSS (styling) and OOXML (legacy)). I predict that in a decade we will see the same thing. In the past, these efforts came from the user side rather than the vendor side, and were driven by user requirements rather than vendor requirements. But requirements for standards now predominantly come from statements like “Our product X supports feature Y and therefore the standard should support it” rather than “Our document A uses typesetting feature B therefore the standard should support it”: the cart is driving the horse. There is more vendor buy-in because the new standards demand and achieve so little.

In part it is understandable: the catch-up mentality does not necessarily encourage imagination.

Comparison Matrix

One very common tool for organized standards groups is a feature matrix: rather than just ad hoc consideration of this feature or that feature as proposed by vendors, the idea is to make a list of the general features required by the users’ document sets, or by the technologies being evaluated, or by the products chosen to get first-class support. Traditionally, standards groups for typesetting and publishing have included actual typesetters (at ISO, Martin Bryan actually worked in type, for example).

A really good example of this can be seen in a document from a decade ago, Final DSSSL Survey and Assessment Report for the DOD CALS IDE Project (Kidwell, Richman). This is a good introduction both to the Output Specification (FOSI) formatting language used by US military typesetting in the 1990s, and to the ISO Document Style Semantics and Specification Language (DSSSL), which has been available as standard on many Linux systems through James Clark’s open source Jade program.

The feature matrix can be found in Comparison Matrix which shows how well the standards support the document requirements: we learn that the US military requires both cartoons and running feet. This is the kind of table that I think should be driving requirements for ODF (and OOXML); preferable to the approach of feature- (or vendor- or product-) centric comparison matrix and much preferable to ad hoc feature requests.

FOSI and DSSSL

The US military adopted FOSI because it was under consideration by (what is now) ISO/IEC JTC1 SC34; however, SC34 ultimately went with an extended version of Scheme under the (terrible) name of DSSSL. FOSI was deeply unlovable and never flourished outside its early adopters, who were locked in; DSSSL was like the other power-user oriented standards from SC34 of the time: it never found much commercial adoption but had uptake in the publishing industry that SC34 catered to. James Clark, the DSSSL editor, later merged it with CSS ideas and split it into XSLT and XSL-FO at W3C, using an XML+XPath syntax rather than the S-expression syntax.

Where DSSSL and FOSI (MIL-PRF-28001 Output Specification) differed in particular was that DSSSL adopted a strict transformation approach: this is of course a UNIX-ism since the days of nroff, and the idea was that you could output to particular page description languages (RTF, MIF, etc.) Consequently there was no way for the DSSSL processor to make decisions based on typesetting metrics on the fly; instead the race was on for a set of abstract properties that could describe common cases. This fits in well with the checkbox mentality of desktop publishing tools, but was entirely counter to the typesetting-as-programming approach of the 1970s and 1980s generation of systems. (Systems such as troff and TeX used macro facilities, so that creating a typesetting system for a document could involve all sorts of custom smarts to capture the design and fit in with the data; very high-end systems such as Interleaf even had full-blown LISP available for processing. Some of these systems are still around in their niches, XYwrite and 3B2 for example; however, they face a rising tide where quality and power are increasingly mysterious to the market.)

FOSI did allow or require some kind of interrogation of the pages while they were being typeset: while this can certainly allow much more expert typesetting and decision-making, it also must be tightly coupled to the formatting engine, which effectively prevents any network effects.

What do I mean by expert typesetting?

To give an idea of what I mean by expert (also known as “quality” or “industrial”) typesetting and decision-making, consider the case of typesetting a Yellow Pages (phone directory for businesses categorized by type of business). Imagine you have to produce a Yellow Pages document using your favorite tool. The page designer and sales force come up with a design and timetable. The layout will be five columns. Entries may not span pages. Some entries take up part of a column and should be put as near to alphabetical order as possible, but rather than break they can be placed before or after their alphabetical position, with previous or subsequent entries swapped before them. And there may be two, three, four or five column display ads, which also have this arrangement. And there can even be ads that take a half page but span over two pages.

And it is important that ads should not be orphaned or widowed, with one ad on a previous page by itself or on a subsequent page. And there are 6,000 pages of this. And you get the final data 24 hours before you have to deliver it.

Now how would you do that in ODF, or OOXML, or any of the standard declarative languages? You simply cannot: there is always an extra rule or concept that will not fit. (There are a few modern systems that do allow this kind of flexibility: using JavaScript in Adobe’s InDesign, for example. The program uses XPaths to locate information, but can also access the page model.)
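Just one of those rules, never breaking an entry but letting a nearby entry jump the queue to fill a gap, is easy to state as a program and awkward to state as a property. The sketch below is an illustration only, assuming entries are reduced to heights in alphabetical order and ignoring pages, display ads and spans entirely; ColumnFill, fill and the window parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnFill {
    /**
     * Places entries (heights given in alphabetical order) into columns of a
     * fixed height. An entry is never split: if the next entry does not fit,
     * a later entry at most `window` positions ahead that does fit is pulled
     * forward; failing that, a new column is started.
     * Returns, per column, the original indexes of the entries placed in it.
     */
    static List<List<Integer>> fill(int[] heights, int columnHeight, int window) {
        List<Integer> pending = new ArrayList<Integer>();
        for (int i = 0; i < heights.length; i++) {
            pending.add(i);
        }
        List<List<Integer>> columns = new ArrayList<List<Integer>>();
        List<Integer> current = new ArrayList<Integer>();
        int used = 0;
        while (!pending.isEmpty()) {
            int pick = -1;
            for (int k = 0; k < pending.size() && k <= window; k++) {
                if (used + heights[pending.get(k)] <= columnHeight) {
                    pick = k;
                    break;
                }
            }
            if (pick >= 0) {
                int entry = pending.remove(pick);   // remove by position
                current.add(entry);
                used += heights[entry];
            } else if (current.isEmpty()) {
                throw new IllegalArgumentException("entry taller than a column");
            } else {
                columns.add(current);               // close this column, start another
                current = new ArrayList<Integer>();
                used = 0;
            }
        }
        if (!current.isEmpty()) {
            columns.add(current);
        }
        return columns;
    }
}
```

With heights 5, 4, 3, 2, a column height of 8 and a window of 1, the third entry is pulled ahead of the second to fill the first column. The real job piles page boundaries, widows and orphans, and multi-column ads on top of this, each needing access to the page as it is being built.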

Declarative abstractions are worthy replacements for programs and scripts but have different coverage

Now the history of (SGML and) XML is the effort to key presentation cues from structural information: the benefit of marking up “invisible” containers is that they are often not invisible. The current approach of both ODF and OOXML, allowing foreign container elements (in different syntaxes) but not providing facilities to format based on them, is the worst of all worlds: in QUASIWYG systems, users will be loath to do anything (well) which does not have a resulting visual/stylistic result in the on-screen draft. And (as was pioneered in pre-Adobe FrameMaker and taken up in CSS) the abstraction of frames (floating or relative, linked boundaries into which text can be flowed) also provides many hooks for making declarative properties that otherwise might require programming.

The way that standards for public declarative publishing formats (whether HTML or ODF) should go, in my opinion, is by progressively asking the question: how can we make it easier for users to do what they want to do? In the old days, this was easy: you had physical paper (from mechanical typesetting, for example) or device-independent page designs (the Yellow Pages, for example) and you then programmed it by inserting commands in with the text. SGML and generalized markup came along and said: describe the data in markup, then move the processing out to a presentation system, except for Processing Instructions where you still need specific overrides inline. After this re-factoring came libraries, where common code or functions were provided with the base system; then consolidation, where the code for the libraries was hidden from the user; and then exposure, where the programming capabilities were removed and only the declarative portions left. RTF and MIF are examples, but so are OOXML and ODF.

At this point, users of transformation systems (such as XML with XSLT) have a lot of capabilities, even for overcoming the differences between the underlying typesetting engines of systems (see Different classes of typesetting engines and Markup’s Dirty Little Secret), but they have none for the kind of page-based calculations required by the Yellow Pages.

Now you could continue to make abstractions: nested keeps with partial float for re-ordering, for example. And in the past, there was a hope we might progress there, because the driving factor for markup languages and style languages was to cope with the kinds of designs which simple word processors failed at. But as I said, the cart seems to be driving the horse: I have no objection to document formats for existing and legacy applications (nor obviously to having them as voluntary standards, readers will not be surprised to read).

Universal pretensions without an assertively inclusive process merely disenfranchise the weak and the foreign

However, and this was something that I saw as a flaw in the XML Schemas process, the more that you claim your format as a universal format, the more that you need to cope with cases that may be “niche” to vendors (i.e. that didn’t fit in with their development or profit model) but which are significant in their own right. When a technology, standard or not, mandated or not, does not provide a capability needed for a job, it will not (because it cannot) be used.

Let’s take a concrete example. In about 1999 I spent a year looking at the various requirements for Chinese and XML, at Academia Sinica in Taiwan. As part of this, I looked at how Chinese typesetting was actually done before the advent of computerization. I first made a collection (example below) of some interesting, but not at all atypical, tables, some of which have visual structures that Japanese will recognize. (In effect, when you have the equivalent of a very small word size, there are other graphical possibilities that don’t go well in Western text.)

t-b2.png

Then I suggested a possible structure that could be used to reconcile them. I was surprised at the reaction: Westerners universally made comments like “Oh, but those are *bad* tables and bad practise” and “They show confusion and unstructuredness”. Microsoft did add diagonal headers to Word 2000, but the SGML (and pre-SGML) idea that you should look at the artifacts and let design lead, rather than merely let vendors’ developers lead, had by the start of this decade died a rather sad death, it seemed to me.

Since then, the Chinese have gone their own way with a fork of ODF called UOF which features, as far as I can make out, Chinese element names (yah!) and extra markup for Chinese-specific requirements that other systems didn’t support. In April 2007, a request came in for ODF/OpenOffice to add it: Diagonal Header Specification (which has a particularly wonderful and mad table example). I don’t know what the status is at OASIS though, or if Sun has even passed it on: as I mentioned before, they are still discussing 2005 and 2006 user requests, which is what set my alarm bells off. (And in July 2007 Bert Bos raised the related issue of text rotation, to dismiss it again for CSS at W3C: theoretically not all diagonal splits in tables require rotation or typesetting along the diagonal path, but the requirements for diagonal splits and for rotated headers spring from the same grapheme/glyph qualities of ideographic scripts.)

Putting page design back at the centre

The only way to put the horse back in front of the cart is to put page design (in all its detailed aspects) at the centre of the process. Get stakeholders involved who are prepared to contribute the kind of checklists that the Comparison Matrix has (many will have them already).

However, I fear that this may only push the issue back without changing it, if the external stakeholders themselves have their opinions formed by what commercial pre-fabbed systems such as Office provide. In Different classes of typesetting engines I mention how the different implementation approaches lend themselves to different declarative properties. People realize that Office has pretty minimal keep-together control, but instead merely substitute some other product’s capabilities. We are quite lucky that countries like China are now getting fed up with the lack of imagination and responsiveness by Western developers and standards makers: it provides one of the few chinks in the protective armour of vendors who only want change when it is driven by them.

So this has been a rather codgery item: are there any good signs? Well, I have praised Office’s Smart Art before, and it is exactly an example of what goes on the page driving the technology: it is not making the format the driver, but inventing a new class of page object (just as a table is a page object). Now Smart Art actually has crappy arbitrary structure, but the direction it can take is clear, and ODF could leapfrog it, if they could be bothered. There are hundreds of thousands of pages with simple diagrams, and once you decide to support what people actually do (and look at where people find things tedious), that is putting the documents first.

So the only driver I see for this, again, is for large user-side organizations to participate in and dominate all the standards bodies, to work out their checklists, and to force through the changes to ODF/OOXML/CSS/HTML that are required to conform to how people make documents when their focus is on good or natural page design (usability) rather than on incrementalism and conservatism.

Rick Jelliffe


European Commissioner for Competition Policy Neelie Kroes gave a really interesting talk this week at OpenForum Europe: sounds like a breakfast that would have been very stimulating.

There was some very obvious tough talk directed at Microsoft (she is the person with the stick rather than the carrots, after all), and most of the commentary I have seen has focussed on that. But there were a few other points that I found interesting with respect to comments I have been making.

Standards for market dominating technologies

Readers may remember that I have been pushing that All Interface Technologies by Market Dominators should be QA-ed, RAND-z Standards! By interface technologies I mean the boundary or exposed technologies: protocols, APIs, file formats.

Dr Kroes writes about so-called de facto standards:

First, the de facto standard could be subject to the same requirements as more formal standards:

* ensuring the disclosure of necessary information allowing interoperability with the standard;

* ensuring that other market participants get some assurance that the information is complete and accurate, and providing them with some means of redress if it is not;

* ensuring that the rates charged for such information are fair, and are based on the inherent value of the interoperability information (rather than the information’s value as a gatekeeper).

The process of subjecting a standard to the same requirements as a formal standard is called, err, standardization.

Note, I strictly use “standard” in the sense of the offered voluntary standard: standardization means being documented, QA-ed, RAND-z, etc and on the books; it certainly does not mean (in my usage) that it is mandated for use (from the demand side of the standards market). If I can fend off some flames before they arrive: at ISO/IEC JTC1 there are types of lesser standards, such as Technical Reports, that may have less scary implications for the panic-ridden and are certainly more appropriate than full standards in some cases. I include these as “standards”.

So I don’t see any difference in what Dr Kroes has suggested and my comment; indeed I think it is a very welcome and logical step forward. Indeed, she mentions it in the context of what competition authorities may be obliged to do!

When a market develops in such a way that a particular proprietary technology becomes a de facto standard, then the owner of that technology may have such power over the market that it can lock-in its customers and exclude its competitors.

Where a technology owner exploits that power, then a competition authority or a regulator may need to intervene. It is far from an ideal situation, but that it is less than ideal does not absolve a competition authority of its obligations to protect the competitive process and consumers.

Dr Kroes does however earlier use “standardization” in a loose way, though I don’t imagine it would cause anyone to choke on their croissants. I agree with “It is simplistic to assume that because standardisation sometimes brings benefits, more standardisation will bring more benefits.” on the vacuous lines that too much of anything is bad, but the two different meanings of “standardization” should not be lumped together. If standardization means “putting a technology on the books ready for voluntary use or voluntary disdain”, then I don’t see that we are anywhere near the point of having too many standards, nor that they are complete enough or updated enough (and I think Dr Kroes may not mean this, given the comments quoted above). However, standardization in the sense of adopting or mandating a standard is an entirely different question, and I certainly agree with her for that meaning.

In case people were wondering about MS’s increasing embrace of ODF, the writing is on the wall. Dr Kroes says:

In addition, where equivalent open standards exist, we could also consider requiring the dominant company to support those too.

I certainly support that: see The Norwegians get it!

Cartels

Sometimes I feel like I am the only voice peeping out “cartelization is a dominating regulatory issue” for standards bodies. Standards organizations have little and perhaps no obligation (or, at least, capability) to redress monopoly positions of technologies in a market, and indeed, as the previous section mentions, standardization (if RAND-z and proper) can actually ameliorate monopoly positions (and they may have a duty to assist in making voluntary standards for that technology); however, standards bodies must be careful not to operate as cartels of any kind.

Dr Kroes mentions cartels early, in her opening sentences:

Credible competition policy requires competition law enforcement. Cartel cases, merger cases, abuse of dominance cases.

and cuts to the chase later:

…standardisation agreements should be based on the merits of the technologies involved. Allowing companies to sit around a table and agree technical developments for their industry is not something that the competition rules would usually allow. So when it is allowed we have to look carefully at how it is done.

If voting in the standard-setting context is influenced less by the technical merits of the technology but rather by side agreements, inducements, package deals, reciprocal agreements, or commercial pressure … then these risk falling foul of the competition rules.

Now this brings up an interesting question. I raised the issue of cartelization, in particular the aspect of vendor collusion of a majority against their dominant competitor in Is our idea of open standards good enough?

The question may seem provocative to even ask, but sooner or later it must be asked. Are standards made by organizations where vendor stakeholders can and do outnumber non-corporate stakeholders acceptable or sound?

We can take OASIS, ECMA, W3C or any of the boutique consortia that allow corporate members (or their individual proxies). Why should we believe that a standard is sound enough to mandate merely on the absence of discovered side agreements, inducements, etc, if it has been made by a committee dominated by vendors (at the quorum level of real participation)?

It seems to me that only the various international standards bodies, which have direct voting by National Bodies rather than by individual stakeholders (in particular, vendors), provide the workable immunity from direct control by vendors (singly or in collusion) that needs to be required for mandatory standards. It can certainly be argued that the boutique consortia may have standards approved ultimately by a larger member vote than the working group that created the standard, and that the membership was not dominated by vendors; but that is something that requires certification or monitoring, whereas with ISO it is manifestly the default case because of National Body voting.

So the National Body system prevents “cartelization-in-the-large”, where the final votes have a good measure of independence. However, no system I have seen completely prevents “cartelization-in-the-small”: this is where the small working groups that prepare the drafts initially have vendor domination. Again, it is not always the case: but look at the composition of the ODF TC at OASIS and the OOXML ECMA TC45 over the last two years and you can catch my drift.

Furthermore, in practice not all members are equal: government members of committees are very likely to be there to advance a particular government agenda (accessibility, say) rather than as providers of alternative technical solutions to those the vendors come up with. A working group may have effective vendor domination at the technology selection level even though the vendors do not control the requirements.

There are some other possible approaches too. For example, some standards bodies allocate chairs in working groups by a fixed number of representatives per sector: some academics, some government, some industry, which has some merit.

All this is why I wrote

But the issue of public and archival formats for government and agency documents is clearly one where governments have a vital interest: the customer is always right. This is why I believe governments need to look beyond the current academic definitions of “open standards” and re-frame the issue as “How do we achieve verifiably vendor-neutral standards?”

Maintenance

There is one part where some implications need to be thought through a little more, perhaps. In the sentence after the When a market section quoted above, Dr Kroes says

In essence the competition authority has to recreate the conditions of competition that would have emerged from a properly carried out standardisation process.

Dr Kroes uses process but means a terminating process, I think. But standardization of a technology is a continuing process, not a one-off event: standards have lifecycles, and waving a magic wand of standardization on a market dominating technology to give it some number or status will do little to help it unless there is an ongoing process of development, correction, evolution, convergence, and so on.

And an ongoing process requires an organization. A standards organization. So when the competition authority “recreates” the conditions of competition that emerged from a properly carried out standardization process (she says this in the context of de facto standards that have had no official process, by the way) this must ultimately involve passing the maintenance on to a standards body and verifying it where there is some concern. (There is certainly scope for Competition Commission action here: if governments and user groups and academia do not participate in standards bodies, say out of some mix of sloth, underinvestment, underskilling, and lack of vision (rather than just because of being poor) it would be great if the Competition Commission could compel or encourage at least matching participation by non-vendors in standards groups of interest. But that is just pure fancy, I know!)

And, of course, this maintenance has to be done with some openness. And openness means not only openness to the needs of stakeholders, but a responsiveness to outside requests. A prioritization of vendor requirements for new features over external user requests for corrections should be taken as ipso facto evidence of vendor domination of the standards group, and/or a failure in openness. Andy Updegrove has recently been talking up the need for metrics for judging the effective operation of standards bodies, a good idea, and metrics for openness and lack of vendor domination in quorums should certainly be one objective measure of this. Despite how it sounds, actually people in almost every standards body are keen for more participation.

Rick Jelliffe

There is a new avenue for participation in the ODF effort at OASIS: ODF Implementation, interoperability and conformance which I commend.

Conventionally, people speak of syntactical conformance and semantic conformance, where the first is easy and the second is hard. In fact, because computers can only deal in symbols, the second is impossible. So the issue for automated conformance testing becomes “how can we reflect the semantic operations into syntactical artifacts: into symbols we can investigate.”

So the semantic conformance problem then resolves into just another validation issue. And we have lots of nice schema languages, notably Schematron, which can help out there. (And using general purpose languages at a pinch, no worries!)

To put it another way, it is an issue of data capture.

For ODF, I would recommend they adopt a strategy of progressive but complete verification.

For ODF import and export, this is easy: have a good RELAX NG schema (make it quite forgiving), use NVDL and DSRL if needed, then use Schematron phases to allow various levels of validity to be detected. The trouble with the monolithic valid/invalid distinction is that there may easily be invalidities in things you don’t care about. An implementation of a word processor may have problems in its support for spreadsheets, but that should be a minor issue, not flagged as a showstopper. Schematron’s phase mechanism groups patterns of assertions so that you can have a much more useful chunked view of the strengths and weaknesses of a system.
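As a rough illustration of the chunked view that phases give, here is a minimal Python sketch. It is not Schematron and not anything from the ODF TC: the pattern names, the phase names and the dictionary standing in for a document are all hypothetical.

```python
# Hypothetical stand-ins for Schematron patterns: each check returns a list of
# failure messages for a "document" (here just a dict of made-up findings).
def check_text(doc):
    return [] if doc.get("paragraphs_ok") else ["paragraph markup is broken"]

def check_spreadsheet(doc):
    return [] if doc.get("formulas_ok") else ["spreadsheet formulas unsupported"]

PATTERNS = {"text": check_text, "spreadsheet": check_spreadsheet}

# A phase is a named selection of patterns, as in Schematron.
PHASES = {
    "word-processor": ["text"],
    "full": ["text", "spreadsheet"],
}

def validate(doc, phase):
    """Run only the patterns active in the phase; report failures per pattern."""
    return {name: PATTERNS[name](doc) for name in PHASES[phase]}

doc = {"paragraphs_ok": True, "formulas_ok": False}
print(validate(doc, "word-processor"))
print(validate(doc, "full"))
```

Under the assumed "word-processor" phase the spreadsheet weakness never shows up; under "full" it is reported against its own pattern rather than sinking the whole document.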

But this leaves the issue of screen display. How can that be tested? Given my characterization of the issue as being one of data capture, the answer is that ODF needs to specify a page dump format, which can then be tested with automated tests. What would this format look like? Think PDF in XML: tiny-SVG may be good enough—anything where you can get the page position of each character (or string) and graphic on a page.

For example, let us suppose we want to test a table implementation. Now we can use RELAX NG to say that there should be tables, rows, cells etc. And we can use Schematron to say that various numeric constraints should hold. And that gets us a long way into validating that good ODF is being generated and accepted. And we can have tests for whether bad ODF is accepted, and so on.

But what about the graphical component? Having a simple page object dump allows testing that, for example, if you have a string a and a string b in two adjacent cells of the same row in a table (in the same script and of the same metrics, etc), then the (X,Y) co-ordinates of their base points conform to (Xa < Xb) and (Ya ~= Yb).
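To make the constraint concrete, here is a small Python sketch of the check against a page dump. The Glyph record, the coordinate values and the tolerance are assumptions made for illustration; no such dump format exists in ODF today.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    """One string from a hypothetical page dump: its text and base point."""
    text: str
    x: float
    y: float

def same_row_in_order(a: Glyph, b: Glyph, tolerance: float = 0.5) -> bool:
    """True if a lies left of b (Xa < Xb) on roughly the same baseline (Ya ~= Yb)."""
    return a.x < b.x and abs(a.y - b.y) <= tolerance

# Made-up base points for strings in two adjacent cells of one table row.
cell_a = Glyph("a", x=72.0, y=700.0)
cell_b = Glyph("b", x=180.0, y=700.2)
print(same_row_in_order(cell_a, cell_b))
```

How loose the tolerance in "~=" should be is exactly the kind of thing a spec would have to pin down.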

And you can use Schematron for that kind of validation. The advantage of having this built into the spec is that then the ODF spec can use mathematical properties and constraints rather than just natural language. The disadvantage of this approach is that it imposes a burden on the implementer, in particular if the graphic library cannot be trapped conveniently to provide the information; however, it certainly should be possible to generate this information from the PDF (in a reverse of the Magellan software!) especially if using a nice PDF subset like PDF/A.

Rick Jelliffe

I would like to propose a new test which you can use to see whether your favoured spout* of technical information is biased (or possibly just a re-printer of press releases, if there is a difference) or not. Here it is:

  1. They reported that the UK Unix Users group had taken the British Standards Institution to the UK High Court, and
  2. They didn’t report in the same detail the outcome: that the High Court utterly rejected it.

Surprisingly, the Inquirer gets the guernsey here, in the marvelously titled UK unix beardies appeal for $cash. No sign of it so far on CNET, ComputerWorld, ConsortiumInfo, Slashdot (references welcome). (Groklaw perhaps did not have space for this, given that it has two interesting posts in its news about IBM’s RoadRunner supercomputer which is “to ensure the safety and reliability of the nation’s nuclear weapons stockpile.” Terrific boxes! Perhaps the High Court needs to put out its findings disguised as product press releases in order to get into independent media?)

Quoting from the Inquirer:

Mr Justice Lloyd Jones rejected the UKUUG’s application for a judicial review last Thursday, giving the group until the break of dawn this Friday to raise a legal fund for an appeal.

“This application does not disclose any arguable breach of the procedures of BSI or of rules of procedural fairness,” said Justice Jones on Thursday.

“In any event, the application is academic in light of the adoption of the new standard by ISO,” he added.

A note on terminology: in JTC1, a standard is accepted by a ballot and consequently published. This general process is called adoption. So IS29500 has been accepted as an ISO standard, but not yet published. The UKUUG’s reported comment that

OOXML had not been ratified as a standard, it had merely been put on the fast track to certification.

is mumbo-jumbo.

Have you ever wondered if the laws of evolution apply to computer languages? When you walk down the aisle at your favorite bookstore, does it seem like there are actually more computer languages than last year? What forces are driving each of these new languages to evolve?

In 1835 Charles Darwin visited the Galapagos Islands. There he collected what he thought were about a dozen distinct species of birds. Upon returning to England he discovered that each of these species had evolved from a single species of finch. On the various Galapagos Islands the requirements for food gathering were different, but consistent over hundreds of thousands of years: enough time for a single species to adapt to meet consistent requirements.

Consider the Raccoon: omnivores that have proved to be one of the most adaptable mammals on Earth. The Raccoon’s range has rapidly expanded into urban areas due to their ability to quickly adapt to new requirements before other animals have had time for the wheels of evolution to turn.

So goes it with computer languages. Some procedural languages can be quickly adapted to fill in the needs for a new niche. When the web was young, procedural languages like Java and JavaScript quickly filled in the need for a variety of tasks. As the requirements for building web applications stabilized, declarative systems like CSS, XForms and XQuery started to push procedural languages back into niche-areas. As these declarative languages stabilize and become worldwide standards, graphical tools are being created to allow non-programmers to create, manipulate and extend these systems.

This is why many of us believe there will always be some need for procedural programming, but certainly not for building standard web applications that are controlled by style sheets and user interaction forms. Like the finch, declarative languages need a little longer to evolve. It sometimes takes years for a small vocabulary of functional specification patterns to emerge and be given labels. Additionally, it can take years for the standards bodies to agree on the best way to deliver these new languages in a set of semantically precise data elements that have unambiguous interpretations. Finally, it may take another few years for IT managers to realize that they really do lower costs if they avoid vendor-specific implementations and adopt worldwide standards.

When CSS first came out you may have been a little reluctant to let web designers play with a rules engine. As XForms becomes ubiquitous you may be resisting change because you have invested so much time and energy learning how to debug JavaScript (without a debugger). You cannot hold back the forces of evolution…and now we all need to adapt to the declarative world or risk our own extinction.

If you are interested in more on this topic, see my presentation from the 2007 Semantic Technology Conference, The Semantics of Declarative Systems.

Rick Jelliffe

ISO Namespace Validation Dispatching Language (NVDL) is a little language for taking an XML document, sectioning it off into single namespace sections, attaching or detaching these sections in various ways, and then sending the resulting sections to the appropriate validation scripts.

NVDL solves several problems that come up with namespaces, and as with DSRL takes a very different approach than XSD takes (not saying one is better or worse: they have different capabilities and therefore may even be used together). One of these problems is the problem that often the official schema has a wildcard to say “at this point you can put any element”, but you really want to limit this to your own elements only and you don’t want to edit the official schemas (and thereby create versioning and configuration issues).

Another of these issues can be found in ODF. It allows foreign elements anywhere, and in order to validate against the schemas you have to strip these out. However, this does not mean just removing the foreign elements and their children: you have to leave the non-foreign descendants in place.

Now this is something that W3C XSD cannot really handle well. You can have a wildcard to allow foreign elements, and process them laxly so that when you come to an ODF namespace you start validating, but you don’t have the capability of validating that these elements are correct against the content model you want on the parent of the wildcard. You lose synch.

Here is the section of ODF 1.1 clause 1.5 which gives the constraint:

Documents that conform to the OpenDocument specification may contain elements and attributes not specified within the OpenDocument schema. Such elements and attributes must not be part of a namespace that is defined within this specification and are called foreign elements and attributes.

Conforming applications either shall read documents that are valid against the OpenDocument schema if all foreign elements and attributes are removed, or shall write documents that are valid against the OpenDocument schema if all foreign elements are removed before validation takes place.

Hmmm, seems like a job for NVDL.

Here is a rough NVDL script to do this. (It is untested, but thanks to members of the DSDL maillist for vetting it.)

This script just takes the content.xml file and removes all elements from a foreign namespace. It uses wildcards a bit. Then it sends the result to be validated using the schema. Note that this is a very coarse sieve: there is no need to get too smart with which namespaces are actually allowed under the main office namespace, because validation will handle that. The purpose of the script is to minimally preprocess the file so that the right elements get dispatched to the appropriate validator.

<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0" startMode="root">

	<mode name="root">

		<!-- Validation for content.xml -->
		<namespace ns="urn:oasis:names:tc:opendocument:xmlns:office:1.0">
			<validate schema="super-odf.rng"
				useMode="odf"/>
		</namespace>

	</mode>

	<mode name="odf">

		<namespace ns="urn:oasis:names:*">
			<attach/>
		</namespace>

		<namespace ns="http://purl.org/*">
			<attach/>
		</namespace>

		<namespace ns="http://www.w3.org/*">
			<attach/>
		</namespace>

		<anyNamespace>
			<unwrap/>
		</anyNamespace> 

	</mode>

</rules>

So there you have it: a nice declarative way to specify the validation pre-processing which can be actually run with the various NVDL processors around the place.
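For readers without an NVDL processor to hand, the unwrapping step can be approximated procedurally. The following Python sketch is only an illustration of the idea, not an NVDL implementation: it treats the three namespace prefixes from the script as non-foreign, lifts the children of every other element into its parent, and ignores text inside foreign elements. The function names and the example.com namespace are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes treated as non-foreign, mirroring the wildcards in the script.
KNOWN_PREFIXES = ("urn:oasis:names:", "http://purl.org/", "http://www.w3.org/")

def is_foreign(elem):
    """An element is foreign if its namespace matches none of the known prefixes."""
    tag = elem.tag
    ns = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    return not ns.startswith(KNOWN_PREFIXES)

def unwrap_foreign(parent):
    """Replace each foreign child by its own (already cleaned) children, in place."""
    new_children = []
    for child in list(parent):
        unwrap_foreign(child)                  # clean the subtree first
        if is_foreign(child):
            new_children.extend(list(child))   # keep non-foreign descendants
        else:
            new_children.append(child)
    parent[:] = new_children

doc = ET.fromstring(
    '<office:body xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" '
    'xmlns:my="http://example.com/my">'
    '<my:wrapper><text:p>kept</text:p></my:wrapper>'
    '</office:body>'
)
unwrap_foreign(doc)
print([child.tag for child in doc])
```

The result would then be handed to a RELAX NG validator. The appeal of the NVDL version is that the same intent is stated declaratively in a handful of rules.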

Now we could duplicate this script to handle the other XML files in an ODF ZIP archive: to say that stylesheet files should start with the appropriate namespaces etc. (I think it would be possible to combine them all into one file, actually, so that different root namespaces would cause the stripped document to be dispatched to be validated by different schemas as appropriate.)

Rick Jelliffe

ISO Document Semantics Renaming Language (DSRL) is one of Martin Bryan’s contributions to the ISO Document Schema Definition Languages project at JTC1 SC34 WG1. This brings together various technologies by Murata Makoto, James Clark, Martin Duerst, Jeni Tennison, and others (including me) to try to build a layered solution to validation using a variety of “little languages”.

I don’t need to go into the advantages of little languages, though I will say that I think that one major concern is that large languages disenfranchise the solo and part-time developer. This is perhaps no concern if you are a large corporation (though it will become so as the maintenance crunch sets in) but it is a definite issue otherwise. Of course, there are disadvantages too: we might hope that a little language would be easier to reason about than a large language, but a little language may concentrate on depth rather than breadth, and this extra bang-per-buck can add to the complexity of understanding every case. Furthermore, the little languages still need to be combined, and this has its own perils. But admitting these possibilities does not diminish the usefulness of the approach.

A common issue with standards is how to cope with changes from the pre-standard technology to the standard one. Schematron was a typical case: moving from Schematron 1.6 to ISO Schematron involved:

  • Swapping to a new namespace
  • In the pattern element, replacing the attribute called name with a title subelement
  • Removing the sch:key element but recommending xsl:key instead.

All these changes are cosmetic as far as functionality is concerned, but prevent a Schematron 1.n schema being a valid ISO Schematron schema.

This kind of renaming problem is not just reserved for the initial step of making a standard. During the life of a schema, different values and names may come into fashion. Sometimes people decide to take a broom through a schema to consolidate names and allowed values.

And this is where DSRL (pronounced DISRULE as in being against a central authority) comes in. It is a simple declarative language that basically maps “from” names and values to “to” names and values. You can make maps for namespaces, element names, attribute names, PI targets, element values and attribute values (including token lists). Most topically now, in relation to recent ODF discussions, you can also declare maps for the default values for attributes and elements: in fact, it is now looking like the ODF facilities for DTD-compatible attribute value default declarations are fraught with complexity and ugliness such that they should be avoided. One really interesting, but problematic, feature is the ability to provide declarations for undeclared entity references in the document (a feature often requested by the publishing industry) and the ability to rename entity references (which may be quite useful now that SC34 has given the ISO standard entity sets for special characters to the W3C MathML group to maintain: they have a high premium on HTML compatibility even when wrong).

DSRL is now at a very late draft stage, and I expect it will be finalized over this year. DSRL is declarative: it provides mappings, and even though it could be used to rename items in schemas, Martin Bryan’s open source XSLT implementation of it takes the more direct route of renaming the document. The implementation is available in the ZIP file at the DSDL.ORG site.

For a flavour, here are the renaming rules as given above for the changes from Schematron 1.n to ISO Schematron.

<dsrl:maps
     xmlns:dsrl="http://purl.oclc.org/dsdl/dsrl"
     xmlns:sch="http://www.ascc.net/xml/schematron"
     xmlns:iso="http://purl.oclc.org/dsdl/schematron"
     xmlns:xslt="http://www.w3.org/1999/XSL/Transform"  >

  <dsrl:element-map>
     <dsrl:from>sch:schema</dsrl> <dsrl:to>iso:schema</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:title</dsrl> <dsrl:to>iso:title</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:phase</dsrl> <dsrl:to>iso:phase</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:active</dsrl> <dsrl:to>iso:active</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:pattern</dsrl> <dsrl:to>iso:pattern</dsrl>
     <dsrl:attribute-map> <dsrl:name>name</dsrl:name></dsrl:attribute-map>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:rule</dsrl> <dsrl:to>iso:rule</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:extends</dsrl> <dsrl:to>iso:extends</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:assert</dsrl> <dsrl:to>iso:assert</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:report</dsrl> <dsrl:to>iso:report</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:diagnostics</dsrl> <dsrl:to>iso:diagnostics</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:diagnostic</dsrl> <dsrl:to>iso:diagnostic</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:let</dsrl> <dsrl:to>iso:let</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:p</dsrl> <dsrl:to>iso:p</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:span</dsrl> <dsrl:to>iso:span</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:value-of</dsrl> <dsrl:to>iso:value-of</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:name</dsrl> <dsrl:to>iso:name</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:dir</dsrl> <dsrl:to>iso:dir</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:emph</dsrl> <dsrl:to>iso:emph</dsrl>
  </dsrl:element-map>
 <dsrl:element-map>
     <dsrl:from>sch:key</dsrl> <dsrl:to>xsl:key</dsrl>
  </dsrl:element-map>
</dsrl:maps>

What does it do? Replacing a namespace is quite rare, so the declaration is not as simple as could be conceived: you rename each element explicitly. The last entry handles the special case of sch:key.

The sch:pattern element has an attribute name which ISO Schematron regularized to be a title element, but there is no way to declare this in DSRL, because it is not a general purpose transformation language like XSLT (though it can be translated into XSLT, as in Martin’s implementation, which follows the Schematron pattern). In a sense, DSRL specifies the mapping, not the transformation: it is just as convenient in reverse (mapping from new schema documents back to old names), or for renaming schemas rather than documents, given a suitable implementation. So the best we can do is just to strip that attribute out: it is not required for validation.
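To show how little machinery such a mapping needs, here is a Python sketch that applies the same element renaming directly to a document. It is an illustration only, not Martin's XSLT implementation and not a DSRL processor: the function name apply_map is made up, the map is built from the element names listed in the rules for Schematron 1.n, and the pattern name attribute is simply dropped as described.

```python
import xml.etree.ElementTree as ET

SCH = "http://www.ascc.net/xml/schematron"
ISO = "http://purl.oclc.org/dsdl/schematron"
XSL = "http://www.w3.org/1999/XSL/Transform"

# One explicit entry per element, as in the DSRL maps.
NAMES = ["schema", "title", "phase", "active", "pattern", "rule", "extends",
         "assert", "report", "diagnostics", "diagnostic", "let", "p", "span",
         "value-of", "name", "dir", "emph"]
ELEMENT_MAP = {f"{{{SCH}}}{n}": f"{{{ISO}}}{n}" for n in NAMES}
ELEMENT_MAP[f"{{{SCH}}}key"] = f"{{{XSL}}}key"   # the special case

def apply_map(root, element_map):
    """Rename elements in place; strip the name attribute from old patterns."""
    for elem in root.iter():
        if elem.tag == f"{{{SCH}}}pattern":
            elem.attrib.pop("name", None)
        elem.tag = element_map.get(elem.tag, elem.tag)
    return root

old = ET.fromstring(
    '<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">'
    '<sch:pattern name="Checks"><sch:rule context="a"/></sch:pattern>'
    '</sch:schema>'
)
apply_map(old, ELEMENT_MAP)
print(old.tag, old[0].tag)
```

Because the map is just data, inverting the dictionary gives the reverse renaming, from ISO names back to the old ones, which is the sense in which a mapping is more reusable than a transformation. The dropped attribute, of course, does not come back.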

I think an important aspect of DSRL is that it shows that the SC34 WG1 is asking fundamentally different questions than the W3C XML Schemas WG, which is not to say that one is necessarily asking better questions at all! In XSD you have various facilities like import, redefine, equivalence groups, type derivation by restriction and extension, but there is no systematic facility to allow name and value mapping: to say “What I used to call xxx:yyy I am now calling aaa:bbb!” XSD is not interested in PIs or entities, of course.

So where is WG1 going with all this? The DSDL project is taking time: there has been no shortage of distractions. It has no support from large companies, much as we would welcome this, no publicity or marketing budget, and has to stand or fall squarely on its technical merits, in the context of a market which would really prefer if there were some way to shoehorn XSD into doing this. Now, of course, in a rational world the large corporate (open and closed source) developers would see DSRL as a simple pre-processor to XSD that can help many migration and maintenance issues: as an adjunct. But we are not holding our breaths!

But my vision is that in the near term, with DSRL completing the base DSDL quartet of RELAX NG, NVDL, DSRL and Schematron, standards developers will start to take them on board as a package:

  • ISO NVDL selecting the particular schemas for different namespaces and culling foreign elements as desired
  • ISO DSRL renaming, localization and providing default values to handle common evolution cases
  • ISO RELAX NG performing grammar-based validation, extended with its XSD data types
  • ISO Schematron performing more complex and detailed validation

A couple of years ago we finally arrived at the point where people had come to pretty realistic apprehensions about the proper limits of XSD functionality, and I think we are now arriving at the same kind of level of maturity with RELAX NG. As these limits become commonplace, I think the need for NVDL and DSRL (for XSD and for RELAX NG) will similarly become better known.

My prediction is that it will increasingly occur to community standards bodies that their standards have quite a number of constraints or gotchas which are poorly expressed in English but much clearer (and machine verifiable) when expressed using DSRL (and NVDL and Schematron.)

Rick Jelliffe

Charles Goldfarb’s idea of using grammars to represent documents has proven itself useful in many situations, and the DTD legacy lives on in ISO RELAX NG and W3C XSD. However, there are many structures that regular grammars, as conventionally implemented, cannot cope with. And it is possible to get a certain cart-before-the-horse mentality about grammars, where any structure that cannot be represented by a grammar is regarded as bad ipso facto.

However, we need to be striving towards systems that free us so that what is congenial to the mind is easy to do on the computer.

I was looking at Ant files recently and they provide another good example. Ant files are configuration files for a modern make system, open source through Apache and most associated with Java development. Ant files are mostly a defined set of elements and attributes which you could have a grammar-based schema for quite easily.

But you can extend the elements inline in the document itself. For example, I am working on (updating Christopher Lauret and Willy Ekasalim’s) Ant task for Schematron, to be available as an Ant extension. In Ant, you just need this:

 <target name="test-fileset" description="Test with a Fileset">
    <taskdef name="schematron" classname="com.schematron.ant.SchematronTask"
        classpath="../lib/ant-schematron.jar"/>
  	<schematron schema="../schemas/test.sch" failonerror="true" debugmode="false">
  	  <fileset dir="../xml" includes="*.xml"/>
  	</schematron>
  </target>

Where the taskdef element defines that there is a task called schematron, and this can then be used as an element later.

In Schematron you could validate this by the following:

      <sch:pattern>
          <sch:title>Check allowed elements</sch:title>

          <sch:rule context="target/*[name() =  ancestor::*/taskdef/@name]">
                  <sch:assert  test="true()">
                  The target element may contain user-defined tasks.
                </sch:assert>
          </sch:rule>

          <sch:rule context="target/*" >
             <sch:assert test="self::bunzip2  or self::bzip2 or self::depend or self::javac or ..."
                diagnostics="unknown-name" >
             The target element should only have built-in Ant tasks apart from user-defined tasks.
             </sch:assert>
          </sch:rule>

     </sch:pattern>
...

    <sch:diagnostic id="unknown-name" >
               The element <sch:name/> is not one of the built-in types in Ant (at least, as at Ant 1.7.0).
    </sch:diagnostic>

Unless I have made a mistake with the XPath, what this does is:

  • The first rule finds every element that is a child of target for which there is an in-scope taskdef element for that name. In-scope means any taskdef directly underneath any ancestor. The assertions in this rule can never fail, and they just filter out properly defined extension elements so that they do not fire the second rule.
  • The second rule, which applies to any other element under target, checks against the full list of the built-in Ant tasks.
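The same two-step logic can be sketched procedurally in Python, which may help in checking the XPath against your own build files. This is an illustration under assumptions, not part of the Ant task: the function name unknown_tasks is made up, the BUILT_IN set is only a tiny subset of the real Ant task list, and the frobnicate element is an invented unknown task.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; the real list of built-in Ant tasks is much longer.
BUILT_IN = {"bunzip2", "bzip2", "depend", "javac", "taskdef"}

def unknown_tasks(build_xml):
    """Return names of target children that are neither built-in nor taskdef'd."""
    root = ET.fromstring(build_xml)
    parent = {child: p for p in root.iter() for child in p}
    problems = []
    for target in root.iter("target"):
        # Rule 1: collect taskdef names under the target or any of its ancestors.
        in_scope = set()
        node = target
        while node is not None:
            in_scope.update(t.get("name") for t in node.findall("taskdef"))
            node = parent.get(node)
        # Rule 2: everything else must be a built-in task.
        for task in target:
            if task.tag not in in_scope and task.tag not in BUILT_IN:
                problems.append(task.tag)
    return problems

BUILD = """<project>
  <target name="test-fileset">
    <taskdef name="schematron" classname="com.schematron.ant.SchematronTask"/>
    <schematron schema="../schemas/test.sch"><fileset dir="../xml"/></schematron>
    <frobnicate/>
  </target>
</project>"""
print(unknown_tasks(BUILD))
```

Compared with the Schematron pattern, the procedural version needs an explicit parent map and loop to express what ancestor::*/taskdef/@name says in one step.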

That grammars cannot represent this is not just a lost opportunity for better validation: after all, the Ant program itself can generate messages. But it is a real shortfall for documentation: I cannot see one place in the Ant documentation in which all the structural rules are consolidated. I suppose if you are not used to going to a schema first, then you might not miss it, but I think one of the major convenience factors of DTDs, RELAX NG compact syntax, and Schematron can be the convenient and terse collection of structural rules, like a help card for programmers.

I have added a little diagnostic message too: just to let the user know what the unexpected element actually was. It isn’t part of the main assertion so that the assertions are “pure” positive descriptions of what should be.

Now, let’s assume you are a Vigorous Grammar Fanboy (VGF). You object, why not just have a container element like user-task for all the points where you want these, along the lines of the CustomXML elements in OOXML where the name of the desired element is effectively in an attribute not the actual element name? First, because it is ugly. Second, because it emphasizes that this is an extension element, which is of interest during setup and then extraneous information afterwards. Third, because then you are messed up with using the element name to determine the contents of the element anyway. And fourth because it is not what the original writers found idiomatic, direct and minimal. Or was that point one again?

But you, the VGF, are not content with that. Oh no, you are relentless, like a killer whale attacking a seal pup on the beach. You say “Err, isn’t this what namespaces are for?” And, indeed, Ant is starting to add support for namespaces which may in time supersede this. My answer: namespaces are difficult for the kind of developers who are making Ant tasks: they are probably not addressing XML problems at all. And namespaces pose more problems for users. In fact, the Ant declaration system is one of binding a local name to a class, and so it is no more prone to name clashing than if namespaces had been used (i.e. conflicts with the same element name are no different from conflicts from the same prefix.)

So a quick comment to developers: if you have used XML for configuration files or other things, and then found that XSD doesn’t have enough power to represent what you have, it is most likely that ISO Schematron can do the job, and do it with clearer diagnostics.

Rick Jelliffe

One of the advantages of Schematron is that because the assertion text and diagnostics text are part of the schema, not built into the validator, a user who is distanced from the markup (e.g. by a GUI) can be given diagnostic information in terms of the application domain and even the GUI rather than just in terms of the invisible XML.

But nevertheless very often system-specific details can emerge despite our best efforts, like farts in an elevator and about as desirable.

PRESTO is a set of conventions I have been working on recently, and one of the advantages that recently came out from it is that because the URLs are meaningful to users, it is not so tragic if they emerge into the user’s notice. Just like a good Schematron diagnostic, they give system-level information in terms of how the user thinks, as much as that may be practical, rather than being limited to throwing deployment-contaminated muck at the user.

XRX is a new web development architecture that is a milestone in elegant simplicity. XRX stands for:

XForms on the client
REST interfaces
and XQuery on the server

Because XRX uses a single model for data (XML) it avoids the translation complexity of other architectures. The simplicity and elegance of XRX allows developers to focus on other value-added features of web application development and enables non-programmers to create a rich web interaction experience without the need to use procedural programming languages.

Our Request: An Open Mind

Before you begin …take a deep breath. Some of the thoughts expressed in this article may be unsettling, especially for those of you, like me, who have invested years into the learning of procedural programming languages like Java and JavaScript. Those skills may become obsolete if trends, predicted in this article, happen.

This begins a series of articles to describe how one can become an early adopter of this innovative technology. As organizations become more aware, the ability to quickly build rich-client web applications will spread beyond programmers to less technical audiences thus empowering a new class of web application developers. As you proceed, we ask you keep an open mind about how emerging technology will affect IT as well as business end-users.

Use Case for Real Estate Forms

For the last five years, developers have queried their peers “Can we create rich web applications using only XML technologies?” In January 2007, Kurt Cagle encouraged me to use XForms with an open source native XML database/web-server called eXist. The eXist developers selected an innovative architecture where every XQuery is directly callable from a REST interface, which is exactly what XForms applications need to directly send and receive data to the database. Kurt’s suggestion came at a very opportune time. I was working on a project with real-estate transactions that had many associated complex real-estate forms. Traditional methods required approximately 40 inserts into separate tables within a relational database. The use of XForms and eXist resulted in one line of XQuery code:

store(collection, file, data)

That was it. Simple. Elegant.

I was hooked. After spending over 20 years building applications with a variety of procedural languages I found my preferred architecture. I have seen the power of XForms and eXist and can’t conceive of returning to my procedural programming ways. It is my hope, that I can convey to you my excitement about this architecture.

This is not the first time an attempt to use a non-translation architecture has been made. In the late ’90s tens of millions of dollars funded object-oriented database initiatives with the hope that objects on fat-clients or middle tiers could be stored and queried without translation. However, for all the promises that object-oriented databases made, they lacked standard interfaces and query languages. Further, IT strategists could not overcome their fear of proprietary system lock-in. As web-clients expanded, object-oriented databases soon became niche products for specific industries.

It is only in the last year that the combination of XForms, REST and XQuery has piqued the interest of application architects trying to optimize software development lifecycles. XRX promises to change not only the role of the software developer but also the role of Subject Matter Experts (SMEs) and Business Analysts.

Proof of Architecture: FireFox and eXist

XForms, REST, and XQuery have matured in an environment that is not dominated by a single vendor or product. XRX did not originate in the labs of Silicon Valley, which seems to favor traditional brute-force procedural languages like Java and JavaScript. Rather, it was championed by a group of international software developers led by a German, Wolfgang Meier.

This collaboration of developers from IBM, Xerox, Novell and other organizations started by building an impressive XForms extension to FireFox. As people combined these disparate systems with REST interfaces, the overall architectural benefits began to emerge. One should not think that XRX is a mature technology. To date, there are no fully integrated development environments for the XRX model and, due to vendor and browser support issues, integrated development tools will be slow in coming. However, if you believe that superior application architecture will trump vendor lock-in strategies, you should closely examine XRX, even in its current form.

XRX represents the confluence of mature declarative client architecture in XForms and the ability of persistence engines to easily store and query XML datasets. The term declarative identifies XForms as a set of XML elements that tell a client “what” the functionality of an interface is, and leaves the “how” to a standardized software system. With XRX, a single line of XML can declare your desired functionality and allows graphical tools to manipulate these blocks of code resulting in non-programmer tools.

The Translation Pain Chain

To understand the elegant simplicity of XRX, look at the problem of English language translation. Select any passage from any book and enter it into a translation program such as Google Translate. Perform a translation from English to Spanish and from Spanish to German. Then reverse the process by translating the German to Spanish and the Spanish back to the original English. The result will have little resemblance to the original text and will require manual cleanup.

Here is a roundtrip four-step translation using Google Translate of the Gettysburg Address from English to Spanish to German to Spanish and back into English:

Score six fifty-six years ago our fathers came to this continent a new nation, conceived in liberty and dedicated to the idea that all men are created equal.

We are now in the midst of a great civil war, testing whether that nation or any nation so conceived and so dedicated, can long endure. We have mounted a major battlefield in this war. We have come to dedicate a portion of this area and as a final resting place for those who here gave their lives that that nation might live. It is entirely appropriate and proper that we should do this.

But in a broader sense, we can not dedicate-we can not consecrate, we can not on this sacred ground. The brave men, living and dead, have fought enshrined here, far above our poor power to add or subtract. The World little note nor long remember what we say here, but can never forget what they did here. It gives us life and that is not a case pending struggled here, have so far progressed so noble. Rather us to be here dedicated to the great task before us-that from these honored dead we take increased devotion to that cause was, during the last full devotion that we here highly resolve that these deaths are not in vain - that this nation under God, shall have a new birth of freedom and that government of the people by the people and for the people not perish from the earth.

Now compare this process with what web application developers are doing today in a three-tier stack using Java or .Net systems. Each time one writes a web application using standard HTML forms, the key-value pairs in the form must be converted to a set of middle tier objects using an object-oriented language. Once the objects are in memory, they are translated from the object type libraries to a set of tabular data streams that use the database type libraries and then inserted in the correct order into one or more relational database tables. When a user wants to view or update the data, the application must gather the data from all of the tables, put it into objects and then translate it back into a set of key-value pairs to be displayed in a web form.
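To make the chain concrete, here is a minimal sketch in Python of the translations described above, next to the single-model alternative. Everything in it is an assumption made for illustration: the Listing class, the function names, the two-table layout and the sample data are hypothetical, and the "database" is just a variable standing in for a real store.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Listing:
    """Hypothetical middle-tier object for one real-estate listing."""
    address: str
    price: int
    rooms: list


def form_to_object(form):
    # Translation 1: flat key-value pairs from an HTML form -> object.
    rooms = [v for k, v in sorted(form.items()) if k.startswith("room_")]
    return Listing(form["address"], int(form["price"]), rooms)


def object_to_rows(listing, listing_id):
    # Translation 2: object -> rows for two relational tables.
    listing_row = (listing_id, listing.address, listing.price)
    room_rows = [(listing_id, i, name) for i, name in enumerate(listing.rooms)]
    return listing_row, room_rows


def rows_to_form(listing_row, room_rows):
    # Translations 3 and 4, collapsed: rows -> key-value pairs for display.
    _, address, price = listing_row
    form = {"address": address, "price": str(price)}
    for _, i, name in room_rows:
        form[f"room_{i}"] = name
    return form


# Three-tier round trip: every hop needs its own hand-written mapping.
form = {"address": "12 Elm St", "price": "250000",
        "room_0": "kitchen", "room_1": "bedroom"}
listing_row, room_rows = object_to_rows(form_to_object(form), listing_id=1)
round_tripped = rows_to_form(listing_row, room_rows)

# Single-model round trip: the same XML instance is what the form edits,
# what travels over the wire and what gets stored.
doc = ("<listing><address>12 Elm St</address><price>250000</price>"
       "<room>kitchen</room><room>bedroom</room></listing>")
stored = doc  # stand-in for handing the document to an XML database
parsed = ET.fromstring(stored)
rooms_from_xml = [room.text for room in parsed.findall("room")]
```

The three-tier path only round-trips cleanly because all three mapping functions agree on field names, types and ordering; each of them has to be maintained whenever the form changes. The XML path has no mapping code to maintain.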

The Disruptive Change of Elegant Simplicity

If you have studied advanced math and physics, you are most likely familiar with Maxwell’s Equations. James Clerk Maxwell discovered four simple, elegant mathematical equations to describe the relationship between electricity and magnetism. Prior to Maxwell’s discovery, the fields of electricity and magnetism were considered separate, and each used disparate complex mathematics to show their relationship. Maxwell demonstrated that, by looking at problems from a new perspective, many pages of equations can be represented in four simple and elegant equations that can be printed on a T-shirt in a science museum gift shop.

We believe that XRX will do for web development what Maxwell’s equations did for the study of electricity and magnetism. Briefly stated:

XForms+REST+XQuery = XRX = High ROI for Web Developers

XRX gives developers the luxury of using the same data selection language (XPath) on both the client and server. The same expressions can be used in your MVC bind on the client and in Schematron data validation rules on the server. This, however, is not the motivation for migrating to XRX. Declarative techniques that use XML structures tend to accelerate the creation of domain-specific languages (DSLs). DSLs are easier to manage with forms and graphical user interfaces, which makes them more usable by SMEs and BAs. XRX is the front runner in the declarative revolution and among the forces empowering non-programmers. This is not to say that XRX will not have opposition. Vendors selling operating-system-specific client APIs or SQL products will resist XRX technologies for the foreseeable future. An entire community of AJAX developers has grown up around the lack of declarative technologies in our browsers. But in the long term these opponents will be required to compete against a simpler and superior architecture. Future articles will explore the hidden benefits of the XRX architecture and the challenges XRX presents to large-scale application developers.

Summary

In the past, the ability to create rich-client web applications was limited to small groups of highly trained and motivated application developers proficient in procedural scripting languages. XRX and declarative programming will expand this community to include a much larger audience, and with it a shift in power will occur. However, in most organizations this will occur only if IT leadership is interested in empowering business units to solve technically challenging problems and create high-quality user experiences. XRX evangelists are needed to break down the walls between IT and the business. We hope this and future articles will be useful as a tool and as a guide for the faithful.

Rick Jelliffe


JFK’s line after the Bay of Pigs, that “Victory has a thousand fathers, but defeat is an orphan”, has a less adversarial and more useful popular version, “Success has a thousand fathers, failure is an orphan”, and that is what came to my mind when thinking about the Microsoft announcements on first-class support for ODF, direct involvement in the OASIS process, and extending the OSP license to ODF.

It is a great opportunity for hatchets to be buried, and I endorse everything that Patrick Durusau has written this week on it: “Not With A Bang, But With A Whimper” Ending the document format war that never was. Microsoft adopts OpenDocument Format. (and also see Dr Durusau’s “Divorce, Trust and Microsoft” Immediate steps towards building trust with Microsoft in the OpenDocument community.) Alex Brown also has an item Microsoft Moves to Support ODF Standard that I concur with too. (For background, Dr Durusau’s comments hearken back to Dr Brown’s OpenXML vs ODF in SC34: The Phoney War.)

Pandora

A year ago, I wrote a blog called Fantasy Press Releases in which I called for MS to support standards out-of-the-box, as many people did. It looks like we will get this before another year is out. Excellent, excellent. I don’t know why they don’t do the PDF support earlier though: surely if it is just a matter of packaging up the existing plug-ins there should be no problems? I cannot see any convincing reason not to support IS29500 Compatibility in the Service Pack 2 either. It would be good for everyone if they put out an early version of SP2 ASAP with the PDF support in it, under some kind of beta scheme. (One thing that I have learned about MS is that it does take about three years to go from plan to execution: this was of course a reason why support for ODF in Office 12 was unreasonable, it was at the wrong stage in their development cycle. [A report has said they are skipping from Office 12 to Office 14: pfah] )

Almost a year ago, in Remembering George the Animal Steele! Why the Open Source community should support an ISO Office Open XML standard (or, at least, not oppose it!) I wrote:

In my view, the drivers for ODF will continue unabated even after/if Open XML becomes a standard.

So, in my jaded view, ODF will not make Office go away, ISO ODF will not make Ecma Open XML go away, and ISO Open XML will not make ISO ODF go away. So I see no downside in Open XML becoming an ISO standard: it ropes Microsoft into a more open development process, it forces them to document their formats to a degree they have not been accustomed to (indeed, the most satisfactory aspect of the process at ISO has been the amount of attention and review that Open XML has been given), and it gives us in the standards movement the thing that we have been calling for for decades (see my blog last week that compared what Slashdotters were calling for in 2004 with the path that MS has taken).

I think this is what we are seeing. The people who saw OOXML as being some kind of defense against ODF (whether they were on the anti-MS side or the MS side) were wrong. The thing that makes MS support ODF is market demand: significant users saying “We want to use ODF”. It is this positive demand, not emotional anti-IS29500 rhetoric, that is prevailing.

To try to put it again, there is a supply for standards and a demand for standards: adding IS29500 to the standards that can be supplied does not alter the dynamics and drivers for the demand of ODF. In fact, in the long run it increases the demand, because the file format information is out in the open, relatively unencumbered, and there will be many governments who will take the line “We know that ODF 1.0 was not complete enough, and we know that ODF 1.1 is better, and that ODF 1.2 is looking very good: it is reasonable for us to anticipate that ODF 1.2 will be generally adequate for our requirements with the extra IS29500 input, and that we can start working towards ODF 1.2 by encouraging ODF 1.1 use.”

(On the issue of why MS will support ODF 1.1 not ISO ODF 1.0, the people to ask are the OASIS ODF TC: why didn’t they do their correct maintenance and submit ODF 1.1 to ISO? It would be paradoxical if MS participation energizes the ODF TC to treat ISO as something more than a rubber stamp!)

Goodbye to all that

The decision to broaden the OSP license to ODF (which is no surprise, this is something that participation in the OASIS group would require) does bring up an interesting point. During the OOXML discussions, there was frequent FUD that ODF was preferable to OOXML because OOXML may have IP problems: one problem I had with that was that surely if there were patents held by MS applying to techniques for implementing office applications, these would apply just as much to ODF implementations as OOXML implementations? In fact, more so, because the OSP applied to OOXML. The same issue is true vice versa: Sun’s equivalent to the OSP for its IP in OpenOffice applies to ODF but AFAIK not to OOXML implementations. (When you get to particular media formats that are outside the scope of the standards, there is a different argument, of course: the two shouldn’t be conflated.)

OASIS

Finally there is the issue of MS joining the OASIS ODF TC. I have argued fairly consistently about the benefits of having MS at the table, and I think we owe Patrick Durusau a really good amount of honour here, for demonstrating that it is possible for self-motivated technical experts, who are above the marketing fray and who open themselves to criticism by refusing to budge from their vision despite partisan attack, to have moved the OASIS ODF TC to a point where MS thinks there is some point in participating in it.

However, frankly, I have my doubts. While I welcome the move, my regular readers will know that I think partisan participation in standards bodies (i.e. where one mob actively blocks the technical requirements of another mob on the grounds “I don’t want to advantage my competitors”) is untenable for a standards body. That there is a significant danger that this attitude will prevail can be seen from the response of (my fanboys) the ODF Alliance’s Marino Marcich, with its talk of “governments will continue to adopt a ‘buyer beware’ attitude” and so on. It will be a challenge for companies who have made “open” a codeword for “anti-Microsoft” to figure out a new marketing position: but where you get “open” people running public conferences on openness under Chatham House secrecy rule and sending emails threatening legal consequences to committee experts if they dare not follow the corporate line, I don’t have high expectations. The word “openness” has become like the “war on terror”: don’t look at the details or what is actually being done too closely!

Will leopards who have made their livelihood pouncing on MS every time it admits or reveals a problem over the last year change their spots: will they learn to have a pragmatic and cooperative attitude where the outcome of a good standard is more important than scoring marketing points along the way? We shall see. I have my hopes and doubts.

What about the dangly bits?

My other reservation about MS’s announcement resonates with something else in Marcich’s reported comment: “ODF Alliance managing director Marino Marcich said the proof of Microsoft’s commitment to openness would be whether ODF support is on a par with Open XML.”

We know that ODF 1.1 does not do everything that OOXML can support. So when it is the default format, what happens to the extras? There are a couple of possibilities. Office could just throw them away. That will frustrate users, who expect documents to open the same as when they close, and you would expect that users will be savvy enough to save in whichever format round-trips adequately. Office could embed foreign elements into the ODF: this is of course what ODF allows, but then it will freak out people who apply the “embrace and extend” hammer to every issue. Or it could add OOXML files into the ODF ZIP file, with dual formatting, along the lines I raised in Can a file be ODF and OOXML at the same time?

Let’s take a concrete example. As far as I know, ODF has no equivalent to OOXML’s Smart Art feature. Smart Art is one of those features which makes old-time SGML-ers say “At last, after 20 years, this is the kind of thing we have been talking about” and represents IMHO the most radical innovation in structured GUI design in the last 20 years (given that there have been no real advances in structured editors since 1988’s SoftQuad Author/Editor.) What Smart Art does is allow a list to be edited structurally in a simple nesting list editor, then styled into scores of different diagram types: Venn diagrams, circular lists, all sorts of things. In the old SGML days, this is the kind of thing we would do by transforming from an SGML structure into a troff pic script, for example, but with much slicker graphics.

I have found Smart Art really is a great advance for productivity and maintainability (not having to keep the graphics files in a separate format for the drawing application), and it is something for which I wish there were Open Source equivalents. Now if it is so good, why isn’t it on the ODF radar (and I trust readers will correct me if I have missed it!)? When saving out to a format that does not support SmartArt, Office currently converts it to a graphic, but tries to incorporate metadata or extra information to allow “rehydration” (which is MS’ buzzword for when you roundtrip data through a less-capable format with embedded extras which allow reconstruction of the original format.)

SmartArt is addictive. If it is lost by going through a different application or format that does not support it, or maintained as a graphic, you are liable to replace the graphic with another SmartArt graphic when you re-open the file, with steely annoyance.

One thing about the DIS29500 debates that observers have found perplexing has been the idea that a 6000 page standard has too much information. I don’t know that many people realized that in some cases (and I am not saying this is the only issue) it was a code for “Our product cannot match your feature list” which itself has several sub-issues (“We don’t want for our products to march to MS’ drum”, “We can only get interoperability by limiting features to a common subset”, “Our development procedures are too chaotic to have any goalposts other than adding one level of features to what we already have”, and so on.) SmartArt is definitely in this category.

So I would see Smart Art (under whatever guise) as a touchstone issue for seeing how well MS’ participation in the OASIS ODF TC takes them towards real convergence. I certainly expect that there are many issues such as the formula issue Dr Durusau raised earlier that will benefit quite fast. In fact, just as having IS29500 is helping ODF, I think MS participation in OASIS ODF will also help improve IS29500.

As I said, Smart Art is a really important advance in (QUASIWYG) editing of structured information and it shifts the text/graphic barrier in a really interesting and useful way: AFAIK it is not on the list for ODF 1.2, and it will be interesting to see whether the ODF process can handle innovations that come from the MS side. I think the deafening chorus from users, especially governments, throughout the DIS29500 discussions that IS29500 was acceptable only because it could help towards convergence is something that the ODF old blood may need to take stock of: Microsoft seems to be taking it seriously, yikes!

Do the right thing

Developers/standardizers on both sides need to be whacked on their heady heads with a mackerel until they learn that Not Invented Here is not acceptable. I think people accept that until now there have been reasonable excuses: that Office could not implement ODF before it existed, that Office could not use ODF as its default format until ODF had even minimal features and completeness, that OpenFormula could be syntactically incompatible with everyone else’s spreadsheet syntax, that ODF’s graphics could cherry pick SVG without really providing actual SVG compatibility (SVG Tiny please?), and so on. (Actually, I don’t mean NIH in the sense that there absolutely cannot be multiple syntaxes or technologies for the same thing if there is some historical reason or feature difference, I am primarily talking about rejecting features merely because of their provenance.) The state of the schemas for DIS 29500 mark 1 and ODF 1.0 just reveals their level of maturity and production-level adoption, and there is nothing wrong with being an adolescent. ODF and OOXML will grow up, and they need the partisan spirit and the NIH attitude to be kept under control to do so.

But it remains to be seen whether the OASIS ODF TC can sustain MS participation. I have written before that where there is direct participation in a standards body by rivals who take an uncooperative stance, it is difficult for the work to go ahead without it becoming a ganging-up exercise. (See Is our idea of open standards good enough? for more on this.) If MS proposes things to support Office better in ODF, and Sun and IBM don’t want to have to support those things, what happens? If it were the W3C, with direct member voting, you could expect MS to be rolled and eventually go away out of frustration/pique. The ISO model is one of direct membership for technical work, but indirect membership for final votes (i.e. it is the National Bodies which vote, not corporations or other stakeholders), and that creates a different dynamic that can produce a fairer result.

With all that said, Happy Father’s Day to the many people who have gotten us this far: I think it is positive news!

Kurt Cagle


Balisage is probably not a term on everyone’s tongue. Its original usage comes from the Navy - for a ship to travel “balisage” means that they are using special dimmed lights for navigation while in enemy territory, a term also known as Silent Running. It has, however, acquired a second meaning more appropriate to computer science in general and XML in particular. Balisage is the use of XML to enable document processing without “giving away” data to a proprietary application’s format. Balisage in this sense is somewhat edgy and subversive, striking at the boundaries where Open Source and Open Standards meet to form Open Data.

It’s perhaps appropriate then that the former Extreme XML conference, long known as the hardest core of XML moots, should take on the name of one of the central tenets of the Open Data movement. Balisage brings together some of the foremost minds in the areas of content management, semantics and ontology, information processing, application development and security to explore how best to build on the shape of this emerging technology. The shift in name also reflects a broader shift going on in the field, as people realize that while XML is core to most of what they are discussing, it is what is being done with XML (and with the harmonics of that activity) that is becoming most important, not the format itself.

Rick Jelliffe


I had thought that we all would be buckling down to productive work by now, but I see that there still is some attention being paid to the idea that there is a correlation between OOXML Yes votes and the corruption index. The most recent form of this (which I consider plays to racist views) replaces the corruption index with GDP (and then mentions that the corruption index is correlated to the GDP anyway, wink wink nudge nudge.)

So I thought readers might be interested in seeing a quick graph, in which the final national body votes are displayed against the per capita GDP. It is all a bit crude graphically: the horizontal axis gives 87 national bodies, and the vertical axis gives the $ per capita GDP from the World Bank figures. The green triangle gives the data points for each NB. On the 100,000 line above each green triangle is a little icon showing whether the national body voted Accept (blue square), Abstain (red diamond) or Reject (yellow triangle).

(Please ignore the icons down on the horizontal bar, that is just my general ineptness. Sorry it is PDF, I couldn’t figure out how to export the long diagram any other way.)

To quickly check the distribution, let’s see how the numbers are distributed when we divide into four quarters.

Quarter            Accept   Abstain   Reject
1 (highest GDP)    16       4         1
2                  15       6         2
3                  16       2         4
4                  15       4         3

So I would like to make the bold interpretation: national bodies did not vote on per capita GDP lines. A curve could be fitted to describe the relationship, but it has no explanatory power either.
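For readers who want to check this for themselves, here is a rough sketch in Python that takes the counts exactly as printed in the table above and computes the Accept share in each quarter, plus a plain Pearson chi-square statistic for whether vote and GDP quarter look independent. The variable names are mine, and with cells this small the chi-square approximation is crude, so treat it as a sanity check rather than a proper analysis.

```python
# Counts copied from the table above: (Accept, Abstain, Reject) per GDP quarter.
votes = {
    1: (16, 4, 1),
    2: (15, 6, 2),
    3: (16, 2, 4),
    4: (15, 4, 3),
}

# Share of Accept votes within each quarter.
accept_share = {q: row[0] / sum(row) for q, row in votes.items()}

# Row, column and grand totals for the 4 x 3 contingency table.
row_totals = {q: sum(row) for q, row in votes.items()}
col_totals = [sum(row[i] for row in votes.values()) for i in range(3)]
grand_total = sum(row_totals.values())

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected assumes vote and quarter are independent.
chi_square = 0.0
for q, row in votes.items():
    for i, observed in enumerate(row):
        expected = row_totals[q] * col_totals[i] / grand_total
        chi_square += (observed - expected) ** 2 / expected

# (4 - 1) * (3 - 1) = 6 degrees of freedom; the 5% critical value is about 12.59.
CRITICAL_5_PERCENT_DF6 = 12.59

for q in sorted(accept_share):
    print(f"Quarter {q}: {accept_share[q]:.0%} Accept")
print(f"chi-square = {chi_square:.2f} (5% critical value {CRITICAL_5_PERCENT_DF6})")
```

On these counts the Accept share sits between roughly 65% and 76% in every quarter, and the statistic comes out near 4, far below the threshold at which one would start to suspect a relationship.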

What did they vote on? I have an equally daring view: all sorts of reasons, but mainly just the boring technical ones. Technical people discussed and judged. Standards people reviewed the technical people’s judgments and made their judgments.

But that is no fun. Let’s entertain the idea that voting was on some subterranean basis: what could it be? I’d say that the more that a national body came from an English-speaking “peripheral” country (e.g. not UK or USA) the more chance that it would not vote “accept”. And the more that a national body came from a country with a socialist (or previously non-aligned) government (China, India, Cuba, Venezuela) the more chance it would vote “Reject”. And perhaps the more that a NB came from a member of Bush’s Coalition of the Willing, the less chance the NB would vote “accept”: it is a funny place for frustrations with having to follow the American lead to emerge, but I think the international mood for independence is very strong, and perhaps it affects the way people see even these standards issues.

However, the way to get standards that are less US dominated, is for non-US people to participate in the various technical groups that are developing these standards: OASIS ODF TC including OpenFormula TC, Ecma TC45, ISO SC34, the ISO work on PDF, etc.

Rick Jelliffe


Now that ODF and OOXML are both set to be on the ISO/IEC books, it is useful to consider what the next productive steps are.

For genuine ODF Supporters who are concerned that ODF has languished a little out of the limelight during 2007, there are a lot of useful things to be done. You don’t even need to join the OASIS groups or your local National Body or SC34 to begin.

Here are some things that I suggest will help the ODF effort coming into ODF 1.2.

  • Lobby the component standards groups, notably the W3C, to have official RELAX NG schemas available. Without schemas, there is no validation, and without validation there is no conformance testing, and without conformance testing there is no interoperability. (Or, at least, it becomes significantly more difficult in each case.) I believe SMIL is an example of this. If possible, actually have the schema ready and waiting, to make it easy: you will feel more of an achievement to have part of the standard that you can say “I contributed that”!
  • In a similar vein, lobby the component standards groups to harmonize their standards with ODF. SVG is the one in particular that seems needed. It would be great if not only would the W3C SVG group add the few missing attributes and so on, but perhaps also make a profile of SVG to match ODF better (this is not a concrete suggestion, just something whose usefulness could be checked up by someone wanting to get involved).
  • Speaking of SVG, some open source XSLT transforms for going from ODF’s “SVG” to standard SVG would be good.
  • Join in the KOffice and Open Office efforts, especially in areas that affect you or for which you have expertise. Maths is a good area, for example.
  • Check through the parts of the IS29500 spec that are of interest, when it comes out, and figure out whether they are things that are decorations (which can be handled merely by foreign elements in ODF) with the current ODF behaviour an adequate fallback, or features that are currently unsupported in ODF, that will need attention. Share your results with the SC34 committee and with the OASIS and ECMA committees.
  • Patrick Durusau has made a request: he thinks that checking how well some of the detailed descriptions of formula functions in IS29500 accord with the reality of Office as currently implemented would be really helpful. This would help both IS29500 get improved and provide better information for IS26300.
  • Join in a conformance testing group: make up test documents. Ideally a test library will have some tests that test one thing per document, which makes a very large number of documents, and others that test cascaded errors. So I wonder if algorithmically generating test documents from schemas is viable.
  • Get your National Body to submit more Defect Reports, so that SC34 does not lose impetus. Remember that when something becomes a standard, maintenance becomes a community job not “their” job.

Of course, if you were not interested in being constructive but in trying to frustrate yourself there are other things you could do. You could, for example, mount a court action asking for something that you know to be impossible (e.g. withdraw a vote on a ballot that has been closed), with reasons that you know won’t stand up (e.g. that a committee of long-term experts changes its vote after being satisfied that there have been enough changes to proceed with a standard), on odd legal grounds (if the voluntary standards group is not subject to administrative law, not being under the government), and where you know that your standards body’s final vote is a credible one (because it was shared by more than an absolute majority of other National Bodies around the world.) Why would someone do that, my readers might be asking themselves? Embarrassment? Sour grapes? Vindictiveness? Marketing?

I certainly hope that national standards bodies will stand by their committee members and provide financial support during court cases, for the time and expenses of the private individuals who will be dragged away from their work. This kind of intimidation, to use courts and the threat of legal action to force a result after you have lost the technical argument, should be seen for what it is.

Now please, I am not saying that I have confidence in every NB’s vote. While I believe that every NB acted intra vires and therefore legal overturns are futile, I was not pleased with the Norwegian national vote (for just the same reason as I was not pleased that several NBs voted for ODF bypassing their technical committees too) and the Brazilian vote (after an IBM representative blogged that he had convinced them that if they had *any* outstanding technical problems they should vote no: if that is true, the NB secretariat should have picked this up in committee and told its members that perfection is not a requirement for a standard.) But, I don’t see them as acting outside their powers.

And, most importantly, it is a different class of problem to have a standard accepted than to have it blocked.

National and international standards bodies are highly aware that their activities and importance are tolerated and encouraged only because they create markets. The minute a national or international standards effort becomes a servant of some clique or cartel, to the exclusion of others, it loses its fundamental justification. (I say “effort” because a body may have thousands of efforts on the boil at any time.) For standards bodies, exclusive behaviour is a mortal sin; in comparison, too much inclusiveness (i.e. by having multiple standards where in a perfect world we could imagine having only one) is only a mild (and bearable) fault. (And, indeed, in most cases I consider support of plurality, to allow the market to choose, a positive virtue.)

Rick Jelliffe

Patrick Durusau has fun on his site with a posting satirizing the strategies of some opponents and proponents of OOXML at ISO as Beavis (for the former) and Butt-head (for the latter.) Wikipedia has a good explanation of the Great Cornholio for people who don’t get the reference.

His key passage is probably

I think the Butt-head side seriously abused a process that had been designed by the Beavis side for their own abuse but that hardly qualifies as an objection to OpenXML.

The notional trigger for the commentary is a worthwhile article by IBM’s Arnaud Le Hors, “A Standards Quality Case Study: W3C”, which asks many good questions. I don’t know that he is correct about Candidate Recommendation, however. From what I have seen, OOXML would have passed the Candidate Recommendation requirements from W3C: it clearly has an implementation, and the differences between what Office 2007 does and what IS29500 says are largely cosmetic. My understanding of the W3C CR regime was that it proved that a technology was implementable, not that every part had been implemented: consider SVG or XSL-FO for example. And I was a little puzzled that Ecma’s process should be considered lacking because of its emphasis on timeliness, but later on OpenId was lauded because it is by a group of interested individuals that share a common interest and decide to solve it swiftly in a somewhat informal way using the internet to its full advantage. (emphasis added)

My angle is this: there needs to be a marketplace for standards bodies (plurality), so that stakeholders can choose the one that matches their requirements. And this in turn allows a marketplace for standards (plurality), so that users can choose the one that matches their requirements. And it is only to be expected that when there are competitive standards, vendors will attempt to use the standards process for marketing, to differentiate why their doovalakey is better than their rival’s thingamajig. Caveat emptor: competition entails sorting through rival claims.

David A. Chappell

For the past several years, I have been involved in many healthy discussions centered around the benefits of adopting technology and its supporting tools and infrastructure. Never once had I thought of measuring the benefits in terms of tonnage of hardware or in kilowatt-hours (kWh).

Until now….

Rick Jelliffe

Open Geospatial Consortium has put out Google’s KML as one of their industrial standards. Congratulations to all concerned!

KML is an XML language focused on geographic visualization, including annotation of maps and images. Geographic visualization includes not only the presentation of graphical data on the globe, but also the control of the user’s navigation in the sense of where to go and where to look.

OGC seems like a thriving ecosystem, under their OpenGIS® brand. OGC has a strong liaison with ISO TC 211 (Geographic information/Geomatics) and has been transposing their standards across when stable. OGC has a strong government and specialty-vendor influence. (Geomatics is not a word I had come across until today.) The OpenGIS® Reference Model (PDF) seems a good place to start figuring the GIS standards out. I liked the statement in their FAQ answering Q: How do you compare the action of the OGC with that of the ISO and other standards organizations?:

A: The standards tracks of OGC and ISO are fully coordinated through shared personnel and through various resolutions of ISO TC211 and OGC. They are often complementary and where they overlap, there is no competition, but common action (e.g. in the geometry model). OGC provides fast-paced specification development and promotion of standards adoption, similar to other industry standards consortia such as W3C, IETF, and OMG. ISO is the dominant de jure international standards development organization (SDO), providing international government authority important to institutions and stockholders. Through OGC’s cooperative relationship with ISO, many of OGC’s OpenGIS Specifications either have become ISO standards or are on track to become ISO standards.

The OGC process includes a 30-day window for public comment. (The ISO process has a period of at least 6 months for National Body comment, and sometimes much more.)

What I found interesting about the KML spec was that it represents a connecting point between two different standards ecosystems: in particular with the Khronos Group, which is

a member-funded industry consortium focused on the creation of open standard, royalty-free APIs to enable the authoring and accelerated playback of dynamic media on a wide variety of platforms and devices.

The Khronos Group is the body that now maintains Silicon Graphics’ OpenGL graphics system, and is strongly influenced by the mobile device industry. For my own research, I made a little diagram to try to express the Khronos ecosystem (and to experiment whether there was much point in using UML for this kind of thing.)

To summarize the diagram: let’s concentrate on three kinds of standards: XML file formats, Rendering APIs for 2D or 3D graphics, and Delivery or Codec APIs which provide session or lower-level services.

[Diagram: 2D3DStandards.png]

From the bottom left, we see that OGC KML uses the Khronos Collada Digital Asset and FX Exchange Schema. This is a Sony-derived technology for transporting 3D assets between applications. It is used by 3D modeling software. In turn, Collada can be considered to some extent to be a serialization of OpenGL data. Of particular interest to XML-ers is OpenVG, which has as a design goal the support of SVG Tiny 1.2:

OpenVG™ is a royalty-free, cross-platform API that provides a low-level hardware acceleration interface for vector graphics libraries such as Flash and SVG. OpenVG is targeted primarily at handheld devices

The diagram has a cluster of various vector-related XML (or soon to be XML) formats: the SVG family of SVG and its two profiles SVG Basic and SVG Tiny, and SVG’s near relative the ODF drawing language. What struck me is that the ease of implementation of a standard is really related to the function points (and tooling up, etc) of a complete implementation, and this in turn is directly related to how close the underlying libraries are to the markup language.

I would call OpenVG a glue standard, which is where you have one or more existing underlying standard APIs which have grown under their own steam, and one or more standard formats, and to ease implementation you make an intermediate API based on abstracting the various standards’ features. In industries where there are multiple entrenched document formats which have a lot of similarity, these kinds of glue standards are one practical way forward.

It is often the case that where you have a problem with obstinate multiplicity of standards, the way forward is not by insisting on a single standard, but in aggressively supporting plurality in such a way as to neutralize the problem. I’d see the XML encoding header (and, indeed, Unicode itself and perhaps SGML/XML too) as an example of exactly the same strategy.

I don’t expect to see it, but it is interesting to consider whether the same approach would make sense for ODF/OOXML harmonization. To what extent is “harmonization” an assumption about how to achieve a particular desired result (in particular, more guaranteed interoperability or less gratuitous non-interoperability) rather than being an outcome in its own right? If the outcome desired is interoperability, then that could also be addressed by having everyone support everything (to reduce my point to the absurd): and that tactic can only be prosecuted by making implementation as easy as possible (by having APIs that are as close as possible to the various file formats.)

Eric Larson

Test Driven Development is a relatively popular methodology nowadays and I think XML tools can play a crucial role in better testing. Testing frameworks are more than capable of using and testing XML-based applications, but just in case you have ever had trouble, here are a few tips.

Kurt Cagle

XML.COM Newsletter

There’s a problem with living life on the bleeding edge. For all of the exhilaration of being one of the first to play with a new technology (or in some cases even to create that new technology), it’s also very lonely there - by definition, most people will be encountering the same technology about the time that you’ve come to think of that technology as old hat or even (gods forfend) passe. This means that sometimes it’s easy to lose sight of what happens when these tools and techniques hit the real world of the workaday developer.

XML is a case in point. It’s ten-year old technology and has become the data lifeblood of the Internet (along with its younger sibling JSON). Every so often someone up on the xml-dev mailing list will pop up and say “Is XML Dead?” - to which the rest of the old guard will pipe up in defense of XML or agree that, yes, XML’s existential crisis is upon it, and it will soon be pushing up the daisies, if it’s not there already. As one colleague of mine put it - XML’s just not interesting anymore.


Read the full newsletter.

Subscribe to XML.com newsletter.

Rick Jelliffe

Three programmers gathered at the next cubicle to mine yesterday, clucking and snorting as is their wont. I looked over to ask what was going on. “A bug in Java” they said. The problem was with ZIP files, specifically some differences between ZIP files made by different methods.

They had some files with non-breaking spaces (U+00A0) in the file name. Not something that I would do myself, but the number of people who want to use non-ASCII characters in their filenames is surely now much greater than the number of people just content with ASCII-only names. Aha, so file this under internationalization (I18n)!

The problem was, it seems, that WinZIP stored the filenames using the system default encoding. But Java would read the filename using UTF-8. So sometimes ZIP file parts would have the non-breaking space, and other times the same file saved by a different route would have 0xFF at that position. Now this is the kind of behaviour and problem that you would expect a decade ago, but I was surprised it still occurred.
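To make the mismatch concrete, here is a minimal sketch of what happens when a name written in a single-byte code page is read back as UTF-8. The class and method names are hypothetical, and ISO-8859-1 is only a stand-in for "the system default encoding" (the exact byte depends on the code page: in IBM Code Page 437, for instance, the non-breaking space is the byte 0xFF). It does not touch the real java.util.zip classes; it only shows the byte-level disagreement.

```java
import java.nio.charset.StandardCharsets;

public class ZipNameEncodingDemo {
    // Encode a file name the way a writer using a single-byte code page might.
    // ISO-8859-1 is an assumption standing in for the platform default.
    static byte[] encodeLegacy(String name) {
        return name.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decode raw name bytes the way a UTF-8-only reader would.
    // Malformed sequences are replaced with U+FFFD rather than rejected.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "a\u00A0b.txt";          // contains a non-breaking space
        byte[] raw = encodeLegacy(name);        // NBSP becomes the single byte 0xA0
        String seen = decodeAsUtf8(raw);        // 0xA0 alone is not valid UTF-8
        System.out.println("round trip intact: " + name.equals(seen));
    }
}
```

The writer and the reader each behave sensibly on their own terms; the name only breaks because nothing in the archive tells the reader which encoding the writer chose.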

Checking through Sun’s bug database, we find that this bug (or its clone) is actually the second most requested (2008-13-28). The engineer who evaluates the problem gives the excuse that Sun decided to use UTF-8 for JAR files (which use ZIP) and seems a little surprised to discover that ZIP files may actually be created by other systems too.

Looking at the bug report, we also find it was first reported 07-JUN-1999. Almost nine years ago. The bug report says it is only reported up to Java 1.4.2, however I cannot see anything in Java 1.6 that addresses it.

So what has happened? Several things:

  • Apache put out a zip implementation as part of Ant that supports different encodings. So people who needed it can use that.
  • Since September 2006 the ZIP spec has formally included a bit to state that the file name is stored using UTF-8.
  • It seems other manufacturers have increasingly used UTF-8.
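The flag bit mentioned in the list can be sketched as follows. This is an illustration only, not how any particular ZIP library is written: the class and method names are hypothetical, and it assumes the bit in question is bit 11 of the general purpose bit flag and that the caller has already pulled the flag word and raw name bytes out of an entry header. The legacy charset is passed in because the right fallback is a guess the reader has to make.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ZipNameCharset {
    // Bit 11 of the general purpose bit flag: file name and comment are UTF-8.
    static final int UTF8_NAME_FLAG = 1 << 11;

    // Pick the charset for an entry name: UTF-8 when the flag is set,
    // otherwise whatever legacy code page the caller has decided to assume.
    static Charset nameCharset(int generalPurposeFlags, Charset legacy) {
        return (generalPurposeFlags & UTF8_NAME_FLAG) != 0
                ? StandardCharsets.UTF_8
                : legacy;
    }

    // Decode the raw name bytes of an entry using the charset chosen above.
    static String decodeName(byte[] rawName, int generalPurposeFlags, Charset legacy) {
        return new String(rawName, nameCharset(generalPurposeFlags, legacy));
    }

    public static void main(String[] args) {
        // IBM437 is not guaranteed to be installed in every JRE, so fall back.
        Charset legacy = Charset.isSupported("IBM437")
                ? Charset.forName("IBM437")
                : StandardCharsets.ISO_8859_1;
        byte[] utf8Name = "a\u00A0b.txt".getBytes(StandardCharsets.UTF_8);
        String decoded = decodeName(utf8Name, UTF8_NAME_FLAG, legacy);
        System.out.println("flagged name intact: " + decoded.equals("a\u00A0b.txt"));
    }
}
```

Archives written before the flag existed, or by tools that never set it, still leave the reader guessing, which is why the fallback remains a parameter.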

So for almost 10 years the Java version of ZIP has been broken for internationalization purposes, the fix seems to be caught in limbo (are they waiting for non-UTF-8 encodings to go away, perhaps?), and so people are forced to go to other implementations. WORA undermined! Indeed, this seems another example where Java is simply too large for Sun to maintain adequately.

But what about this angle: the current ZIP spec has an appendix on file names and encoding, which says

The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437.

Which means that Sun’s policy of merely writing UTF-8 is now going against what the ZIP spec says.

Software maintenance and juggling issues on a budget are not easy. However I think it is more than plausible that had Sun gone ahead and submitted Java to ISO for standardization a decade ago, this issue would have been fixed long ago. Because ISO National Bodies give very high precedence to issues such as internationalization, accessibility, modularity, and conformance. So the lack of proper encoding support in the ZipEntry API would undoubtedly have come to the fore in the very first round: Japan never lets this kind of thing slip, for example.

By exactly the same token, if the ZIP format had been put through as a standard, proper encoding support would undoubtedly have been raised as part of the first review. Standardizing either would have been good enough to have a technical fix agreed on, published and pressure applied for a fix ahead of the demands of corporate featuritis. But standardizing both would still be best.

After Sun backed off last time, leaving so many people who had participated feeling burnt, it is hard to see that standards people won’t be deeply suspicious of them. And Sun people may not be keen to submit even to a “bullshit process” based on pragmatism and incrementalism. But Java would clearly, IMHO, be in a much better position today if it had been standardized. And so would ZIP.

Standardization as a kind of audit

What standardization of a living technology gives stakeholder companies is more than just bragging rights and ammunition to shoot their rivals with and to confuse procurement people with, tempting as those things may be. It also gives an objective audit program dictated not from the corporate POV but from (to a greater or lesser extent, depending on interest) the market and relatively disinterested third parties. Any long-term software project gets encrusted in the personal politics and idiosyncrasies of the development team, and needs a circuit-breaker. This is a view of standardization as a kind of major technical audit, particularly of the documentation but also of areas that are becoming more market-critical: standards use and compliance, openness, responsiveness, accessibility, internationalization, integratability, testability, and so on.

These are all things that established technologies need. Now of course you can get audits in each of these areas by hiring experts. That is good, but you don’t get the breadth or provable transparency that National Body participation can bring. And expert opinions still have to be evaluated in the context of the power relationships of the company, the very same relationships that allowed the problem to arise (these might be as simple as CJK requirements not having an adequate champion or I18n not being a profit center that can demand changes.) And you can get benefits from using boutique standards bodies in which vendors or their representatives can have voting rights: W3C, Ecma, OASIS, and so on. That is good too, but it is open to domination by one side or the other.

Which leaves the ISO family (e.g. ISO/IEC JTC1) as being effective forums for this kind of audit. People who think that ISO standardization is always a pushover should consider the current OOXML debate: you have MS and friends on one hand and IBM and friends on the other both pushing as hard as they can, and yet as I write neither can establish clear dominance. And these are the largest players in the world. Whether DIS 29500 mark II passes or fails it will be because national bodies decided on technical issues, not pack alliances, as far as I can tell. I am sure that neither MS nor IBM is feeling comfortable at the moment: and this is the strength of the ISO kind of procedure, regardless of the outcome.

We have all had enough experience of open source to be aware of its strengths and weaknesses now. Making something open source does not automatically mean that bugs and so on will be fixed. No silver bullet. As I wrote in this blog a couple of years ago in “Sun should open source Swing”:

it is not enough to Open Source something: the mechanism for speedy response to bug fixes and releases is crucial too.

And neither will auditing a technology by making it a standard. Nothing is automatic. But error-full systems emerge from single-strategy maintenance regimes, and the dinosaur systems such as Java and Office are full of examples of this. The ISO standardization process has many qualities to commend it to large companies as a tool for shaking things up and circuit-breaking. And we still need an ISO standard for ZIP too.

Rick Jelliffe

I was told recently that of the 250 or so standards that Ecma has put through the fast-track process to National Bodies at ISO/IEC, only three have failed. I thought it would be interesting to read up a little more on them.

Ecma (shooting the messenger)

Ecma makes standards on a wide variety of subjects, and has particularly strong involvement with the European and Japanese computer hardware industry. In a response to a comment on another item, I posted this list, which is of the current groups and chairman’s affiliations, to give an idea of its scope:

  • C# (Chairman from Microsoft)
  • ECMAScript (Chairman from Mozilla)
  • Business Communications (Chairman from Siemens)
  • Near Field Communications (Chairman from Sony)
  • High Rate Short Range Wireless Communications (Chairman from Sony)
  • Environmental Design Considerations (Chairman from IBM)
  • Acoustics (Chairman from HP)
  • Electromagnetic Compatibility (Chairman from Intel)
  • Optical disks and disk cartridges (Chairman from Toshiba)
  • Universal 3D (U3D) (Chairman from Boeing)
  • Holographic Information Systems (Chairman from Fujifilm)
  • OOXML (Chairman from Microsoft)
  • XPS (Chairman from Global Graphics)

Now I knew that the C++/CLI effort had failed (for what seem good reasons to me.) But I was not so sure of other efforts.

I found this article from 10 years ago, “Sun Uses ECMA as Path to ISO Java Standardization”, which I will look at in more detail in a moment. But there is an interesting passage halfway down the page:

In 1996 Microsoft Corp was able to shoot down another ECMA standard, the Public Windows Initiative, at this stage, thus preventing it from becoming an ISO standard. The PWI was a Sun effort to get Windows APIs put into the public domain. … Microsoft was able to mount a successful campaign against PWI at ISO on this issue.

What do we learn from that? That Ecma was happy to serve as a neutral forum. That Sun was happy to try to make use of the Fast-Track procedure when it suited them, for competitive reasons. That in fact IP buy-in from the critical stakeholder is necessary. And that MS has made a 179 degree turn on standards since a decade ago. (I am always amused at how often anti-OOXML material will, when it fails in a current objection, resort to decade-old material as if it were fresh and compelling. The company then was fleeing standardization; now they are participating and allowing significant changes. You do not have to trust or like them to acknowledge that.)

Control of the API

ISO standards are a very scary proposition for large companies. Many of them are not comfortable with any position other than dominance and stability. The control of the API is terribly important to them, and they regard loss of control of the API as a risk (whereas it can be a circuit-breaker and new-market enabler.) This is one reason why all the large companies try to favour the member-based boutique standards bodies: W3C, OASIS, Ecma, because there is more chance that they can establish a beachhead and make participation at those bodies unattractive or futile for their competitors. The need for stability is sometimes stronger than the need for dominance: when you see calls for “equilibrium” to be maintained in a market, you know that is a buzzword for maintaining the status quo. (And it is not always the market leader: it can be a smaller player in fear of losing their share just as much.)

It goes in cycles. The wheel turns and sooner or later the big companies are forced to deal with ISO and national bodies, and they find this lack of control very unpleasant. Sooner or later they find some reason to split back to more dominatable bodies, and they jump ship.

It is not all venal (or even venial) or negative though: for example, look at SGML: Sun’s Jon Bosak (and many others) were unhappy with the way and speed that SGML maintenance was proceeding and we went to W3C as a forum for making a simple profile and addressing a lot of peripheral issues, and XML in turn became the foundation for the update of SGML. There is always an interplay between what the boutique, specialist bodies are interested in, and what the national-body-based regimes such as ISO are interested in: industry activity is actually really important, because it clarifies what the ISO groups should be doing.

The downside is that when these large, usually-US-based multinationals hop over to their boutique bodies, they have to try to justify their jump by slagging off at ISO/IEC. This is a predictable behaviour: it has happened in the past, it is happening now, and it will happen in the future. Some parts of the complaints are often reasonable, some parts are often merely self-serving, but it is not a new behaviour.

Ecma and Java

Now back to Java. Originally Sun put up Java to become an ISO standard using the PAS process (the fast-track process that ODF used) using the Object Management Group (another boutique group) as the submitter. Then Sun changed its mind and decided to submit it to become an Ecma standard (and thence to ISO on Fast-Track) because

“In examining our standardization options, our primary goal always has been to preserve the industry’s substantial investment in evolving and using the Java technology,” said Dr. Baratz. “By pairing the collaborative Java Community Process with ECMA’s proven standards process, we can achieve international standardization while preserving rapid innovation and cross-platform compatibility.”

According to this article Sun chose to go with Ecma, because it was flexible enough to allow maintenance to continue on through the Java community process as it stood then. Other articles suggest that one reason for Sun’s reluctance to be involved at ISO was their strong desire to keep effective control. One particularly interesting aspect of the article is that it mentions the potential danger from Sun’s point of view of HP, Microsoft and so on doing exactly what Sun had attempted to do with PWI: make up their own version of the standard and submit it to ISO!

Of course, what Sun was concerned about was Microsoft’s attempts to destroy Java’s Write-Once, Run Anywhere promise by grafting on their own graphics primitives into J++ and splitting the market. This is of course how IBM put a nail in Java’s coffin for the desktop, by doing exactly the same thing with their SWT graphics library, as used in Eclipse: it is not a part of standard Java and Java applications that use it are not WORA applications.

The fight between Sun, IBM and Microsoft over their effective graphics libraries shows a couple of things that are very instructive. For a start, it shows that they all try to use standards for their own competitive purposes. It is no news: the challenge is to try to use the standards process to channel them into behaviours that benefit society and the market.

It also shows the futility of non-layered standards. The WORA spiel is really compelling, and it is something that I bought into with my company Topologi, but all systems that have to grow need to support what I call Organic Plurality. Systems with modularity in the wrong spots die but can cause problems in their death throes: it seems that with Java, the graphics interface was exactly such a spot, unfortunately for the vision. (For another aspect of this, see The Software World of 2010: Its about the Suite.)

But thirdly it shows that the big players have been involved in these kinds of standards games for years. For a while, and under the noxious impact of the MPEG group, the large companies got excited by the idea that they could use standards bodies to become revenue-generators by standardizing on Royalty-bearing technologies.

Pigs at the trough

In the middle part of this decade, there were attempts at OASIS for this, and many of us spoke out against the large companies trying to do this, and we were successful. For people with short memories, the background of this was the attempts to get non-RAND-z technologies adopted for DRM proposals: the major pigs with their snouts in the trough at that time were ContentGuard (ex Xerox), Microsoft and IBM, all the usual suspects. (Readers may also be interested to note that Patrick Durusau got involved in the OASIS DRM effort, on the side of the angels: he has a very hard-headed attitude to all the large companies, and not one that endeared him to Microsoft or IBM.) By 2004, the OASIS DRM group wound up without getting this endorsement for the non-RAND-z technologies. RAND-z won!

David Berlind has quite a good article on why a non-RAND-z standards organization is a “patent shelter” and not open: it is great that OASIS has straightened up here, and I hope SC34 continues its long-standing RAND-z policy. But it is especially great that companies like Microsoft, IBM and Sun, which a few short years ago were all excessively concerned with trying to keep control and use standards as patent-shelters are behaving well now. However, just because Microsoft, IBM and Sun have little credibility in the world of standards for altruism’s sake, it does not mean that they should be blocked from participating legitimately in standards. To the contrary, we need to have institutions to allow these behemoths to act as good citizens: RAND-z standardization is a great vehicle for a behemoth!

The futility of monocultures

Back to our Java story. In late 1997, SC22’s Java study group had recommended that Sun should submit Java through the “more traditional” processes. Sun eventually did shift to use the Ecma route, but apparently out of fears it would lose control. Then

In another effort to block other companies and interests from developing Java platforms that do not meet its strict guidelines, Sun Microsystems on March 1, 2000, declined an offer from ECMA to standardise Java. ECMA, which is a standards organisation in Geneva, Switzerland, denounced Sun because the company refused the standardisation proposal. (TechRepublic)

Industry gossip was that Sun wanted to make their source code a normative part of the standard and they withdrew when they found it would not be possible through Ecma (or ISO or anywhere!): nice try fellows! I’d love to get some confirmation or another angle on this. But clearly the issue is one of control: integrity, interoperability are all nice side-effects. The trains always ran on time under Mussolini: we should not pretend that centralized control and monocultures do not have some benefits.

However, when we look at the way large companies act with respect to standards bodies, one very large question should arise: it is a variant on Adam Smith’s aphorism (or was it G.B. Shaw) that every profession turns into a conspiracy against the public interest. Monopolistic, cartel and collusive behaviour is undesirable (I don’t use “wrong” here because it carries a moral implication which distracts people from the point and lets them drink from the waters of Lethe from the sweet cup of self-righteousness) because it results in sub-optimal market operation.

So why are standards allowed: surely they are collusive, and interfere with the market?

Public policy

The traditional answer is that public policy encourages standards because, and as far as, they create markets. When the Torx screwdriver company got its hexagonal screwdriver heads adopted as a standard, they may have been wanting to encourage a market in screws, not competitors in screwdrivers, but they were creating a market nonetheless. OASIS lawyer Andy Updegrove, whom I criticize a lot for his flakey reporting and bias, has really good legal material at his website which quotes the (U.S.) Fifth Circuit Court of Appeals decision in Consolidated Metal Products v. American Petroleum Institute in 1988:

A trade association by its nature involves collective action by competitors. Nonetheless, a trade association is not by its nature a “walking conspiracy”, its every denial of some benefit amounting to an unreasonable restraint of trade. In particular, it has long been recognized that the establishment and monitoring of trade standards is a legitimate and beneficial function of trade associations.

One key aspect of the setting of standards is that they cannot be needlessly exclusionary: this is why there is always the need for multiple boutique bodies, because when a company is unable to get satisfactory inclusion of its technologies or requirements because existing members have “stacked” the process against it (and it should be noted that this is a negative stacking aimed at blocking: there seems to be no such thing as stacking a standards body in favour of a legitimate technology, quite the reverse: a standards body is there to foster agreements) then that company can go elsewhere. The need for a market in standard technologies requires a slew of supporting markets, including a competitive market for member-based standards organizations. (It’s turtles all the way down, as the joke says!)

When we get to ISO/IEC JTC1 we run out of competitive standards bodies. At the international level, there is quite a clear difference between the kinds of work that, for example, IEEE takes on and the work that ISO takes on. So if allowing plurality rather than blocking is at the very core for justifying standards (I mean voluntary technical standards used by industry, not regulations or which side of the road to drive on) as market-creators and preventing standards from being feet-in-the-door for cartels, what happens at the apex, at ISO/IEC JTC1 for example, when there are no competitor bodies?

The answer is simple: plurality. ISO/IEC cannot be in the business of allowing cartelization, since the only justification for standards is because they actually prevent cartelization by creating markets.

Trapping a bear

In this light, I hope my support for OOXML getting standardized, even though I recommend ODF for public government documents, becomes clearer. The need to support plurality goes to the very heart of the mission of international standards bodies. It is one thing to speak of technical issues; it is another to make the blanket statement “We already have a standard that is good enough for us, therefore you don’t need the standard that you think would meet your needs”. Because that is just code for “We want to prevent your technology from operating in its market by limiting the market to our favoured technology”. That kind of blocking behaviour needs to be exposed and rejected.

The large US multinationals have always been trying to use standards bodies to compete, and they have always shopped around, and none of them like giving up control. The recent defection of some of the leading lights of the Open Document Foundation away from ODF springs out of exactly this issue: the charge that Sun has tried to keep too much control. They all try to play this game, it is not new.

So what can we do? We have to be like bear trappers. The bear is bigger than us, has an off-putting odour, and a taste for honey. But when the bear wanders into a cage, you don’t say “Oh, Mr Bear, you are too big” or “Oh, Mr Bear, you stink” or “Oh, Mr Bear, all you want is to raid the honeypot, such a naughty and greedy animal does not deserve to be trapped!” You close the trapdoor and jubilate. The history of these large companies is that they all try to find the route where they can maintain the maximum control, and very often they will get skittish at the amount of control they have to give up. Even Ecma, which is pilloried at the moment for being some kind of a rubber-stamp, would have required giving up too much control for Sun with their Java effort: and you would not want to think that Ecma were necessarily the most accommodating here.

A lot of the anti-OOXML material over the last year has been along the lines “Don’t you know how bad MS is” spouted by companies who have been playing exactly the same kinds of games. Think SWT, think DRM, and so on. But standardization can be a real game changer: one of the few game-changers on the horizon. The chance to capture a large mass technology into the review and influence of the international standards organizations comes very rarely and IMHO is not a chance that should be squandered on petty ideological or competitive points. Open Source millionaires and closed source search engine companies, all of them are in the same boat as the rival office suite developers: competitors with vested interests to block the development of multiple markets.

The thing is that competition between these kinds of standards is not just good, it is essential. I have just been looking at the new feature list for OpenOffice 3.0, due mid year, and it finally includes tables in Spreadsheets. Now it has been incredible to me that this has not been there before: I don’t know how you can make a presentation without tables. But tables in spreadsheets was not something encouraged by ODF before OOXML came on the scene. (It is not a feature suggested for spreadsheet applications in the informative feature table in ISO ODF, in particular.) And the recent changes in OOXML have surely occurred in part to catch up with ODF: it is not one sided. The competition is forcing each technology to be improved in places that their original champions did not consider important.

Given the utterly toxic relations between the various players at the moment, which makes any talk of sitting down at the same standards body ludicrous, what we need is a frog race. Rival technologies whose stakeholders are attempting to leapfrog each other, but with each jump taking them closer to the goals we have set: open standards, with better QA, harmonized and mappable where possible, supporting plurality, extension and adequate profiles, with decent validation and test suites. The anti-OOXML side tries to claim that the best way to openness is through enforcing a monoculture, but the experience of the last two years, and the substantial improvements in the ODF and OOXML technologies that have occurred and are pending, are clear indications that standards need to harness the competitive energies of the stakeholders rather than dissipate them in prolonged committee-room chicanery aimed at maintaining the current “equilibrium”.

Kurt Cagle

I have recently accepted the position as Site Editor for the XML.com site, becoming responsible for the content appearing throughout the site as well as helping to guide functionality and look and feel for this particular portion (and to a certain extent the other sites in the O’Reilly Network). Having contributed to xml.com for several years, I feel honored to get a chance now to steer the editorial direction of the site, but I also need help doing it.

What I’m looking for right now, more than anything, are bloggers who are interested in and passionate about XML and who would like to use XML.com as a forum to share their ideas. Given the breadth of the XML field at this point, what I’m looking for in terms of skills or expertise is equally broad; specialists (and generalists) in:

  • XML Data Technologies (XQuery, LINQ, XForms, etc.)
  • Semantic Web, both formal (RDF Stack) and informal (microformats, folksonomies, and so forth)
  • User Interface, User Experience and RIA Components (AJAX, XUL, Silverlight, Flex, CDF/WICD, etc.)
  • Publishing and Syndication (AtomPub, Office Formats, DocBook, DITA)
  • SOA Services (SOAP, WSDL, Messaging and Marshalling, ESB, etc.)
  • XML Data Modeling (Schema design, taxonomies, methodologies)

These are currently unpaid positions, though we’re working on plans to change that, but the site is widely recognized as being one of the pre-eminent authorities on XML technologies on the web, and we hope to provide as much editorial freedom as possible to all of our bloggers.

So if you are interested in writing a regular blog on the hottest trends in XML, give me a shout at kurt@oreilly.com with what you’d like to do and, if you have any, some samples of writings on the web.

Rick Jelliffe

IBM/Lotus’ Rob Weir has a timely blog up entitled How many defects remain in OOXML? Timely, because of course, the clock is ticking on the OOXML vote, so this is coming up to his last chance to throw some mud. This is a subject I am interested in, and have blogged on before, so I think it might be useful to make a comment.

The Set Up

First let’s look at the set-up material:

DIS 29500, Office Open XML, was submitted for Fast Track review by Ecma as 6,045 page specification. (After the BRM, it is now longer, maybe 7,500 pages or so. We don’t know for sure, since the post-BRM text is not yet available for inspection.)

Longer? Well what has happened is that

  1. Normative schemas (with structural improvements to run better on the open source XSD validators) that were in external files are now included in the text: there is no change in the amount of information in the standard despite the extra pages! In fact, because at the same time the schema fragments in the draft are now (post-BRM) informative, there has actually been a decrease in the amount of normative text.
  2. Non-normative material on accessibility has been added, again not requiring the kind of review or thought that normative text requires.
  3. Extra explanatory material requested by NBs has been added, but this text was specified in the Editor’s responses or explicitly by the BRM. It simply isn’t the case that NBs don’t know what this text is: see the BRM outcome documents.

I have blogged before against the simplistic use of page length: That diagram (Let me ring your bell), and I refer interested readers to that.

Next, comes:

Based on the original 6,045 page length, a 5-month review by JTC1 NB’s lead to 48 defect reports by NB’s, reporting a total of 3,522 defects.

Now what you might not realize from this is that the 5-month review is actually a title or nickname for one phase of the review, not the actual time limit. The initial text was released in December 2006, and national bodies didn’t actually submit their ballots until September 2007. So National Bodies had 9 months, not 5. (And interested parties could have participated for the prior year-long process at ECMA, which included a public draft.)

The total of 3,522 defects sounds impressive, except that most of them were duplicates, many just cut-and-paste duplicates by lazy or novice reviewers who somehow were under the misapprehension that in the ISO process the squeaky wheels would get the most oil. ECMA grouped them into 1027 unique issues; however, my estimate was that many more could be grouped together (this is borne out by the repetition of answers within the Editor’s disposition of comments), leaving about 750 really unique issues.

Next comes the material on a defect count per page. (To give an idea of why simplistic use of numbers here will be actively misleading: adding the extra pages of schema material will actually reduce the average number of errors per page, without decreasing the absolute number of problems.)
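The dilution effect is easy to check with a few lines of arithmetic. The sketch below uses figures quoted in this post (6,045 pages originally, roughly 7,500 pages post-BRM, 1027 unique issues); holding the issue count fixed while the page count grows is an assumption made purely for illustration.

```python
# Illustrative arithmetic only: the same absolute number of issues
# spread over more pages gives a lower per-page rate.
def defects_per_page(defects: int, pages: int) -> float:
    return defects / pages

unique_issues = 1027   # Ecma's grouping of the NB comments, as cited above
pages_before = 6045    # original DIS 29500 page count
pages_after = 7500     # rough post-BRM figure from the quoted text

rate_before = defects_per_page(unique_issues, pages_before)
rate_after = defects_per_page(unique_issues, pages_after)

print(f"before: {rate_before:.3f} issues per page")
print(f"after:  {rate_after:.3f} issues per page")
```

The per-page rate drops even though not a single issue has been fixed, which is why a raw defects-per-page figure says little on its own.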

I have blogged before On error rates in drafts of standards and I refer interested readers to that. Note that I give an estimate of the number of errors that you would expect to be caught (in one pass) at about 1,000, which is exactly what we have. In particular, note (ISO SQL Editor) Jim Melton’s comments, which I will repeat:

Or perhaps most people were somewhat intimidated by the prospect of (thoroughly) reviewing a 6,000 page document. To put this in perspective for those who know SQL’s size and complexity, the sum of all nine parts of SQL is about 3950 pages. A ballot on SQL frequently receives several thousand comments, and we’ve been balloting versions of SQL for 20 years!

In fact, virtually every large spec I’ve ever had the “pleasure” to review leads to “thread-pulling”, in which every page yields at least “one more” bug, and following up on that one leads to more, and following up on those leads to still more, etc. I would personally be stunned if 30 dedicated, knowledgeable reviewers of a 6,000 page spec on its first public review were unable to find at least 3,000 unique significant problems and at least 40,000 minor and editorial problems. But that’s just me…

Under the kind of criteria that our Big Blue friend is proposing, the ISO SQL standard, which is one of the most widely implemented, important and mission-critical of all ISO IT standards, would not be of high enough quality to make the grade! Next Mr Weir says:

If we believed that the 5-month review represented a complete review of the text of DIS 29500, by those with relevant subject matter expertise, then we would have some confidence that all, or at least most, defects were detected, reported and repaired.

Did you see the sleight-of-hand there? The outcome “repaired” is not the only possible outcome! The big possibility that Weir misses is that a defect can be allocated to maintenance: the ballot to become a standard is not the end of the process but merely the start! But there is absolutely no reference to this. Why? To panic people into assuming this is the last and only chance to get things perfect.

(Weir does have another post Contra Durusau, notable for a really sleazy reference to Seattle. He takes an unrelenting anti-maintenance line, rather surprising in the light that the same arguments can apply to ODF which is his alternative. It does not suit his argument that there are many standards with successful maintenance.)

The Trick

One of the constant themes over the last year has been the theme of panic. QUICK: You only have one month to find contradictions. QUICK: You only have five months to find defects. You only have a few weeks to evaluate the Editor’s comments. Every person has to read or review the whole standard. Every national body needs to have an explicit detailed position on every issue. And so on. Always under the assumption that the current stage is the last and only chance for change.

In every case this panic has been unnecessary FUD-mongering, because at ISO there is always the scope for improving a standard. [The normal caveat that you want to get it as right as possible first time because you cannot bolt the stable door once the horse has bolted does not apply with the same strength as with a from-scratch standard because the horse has already bolted. In fact the horse has been off and running for the last 20 years! So “getting it right” relates to documentation and harmonization rather than the general shape.]

What happens when a draft gets accepted as a standard? It gets subjected to the normal committee maintenance procedures. There is indeed a special step which can be taken where a standard gets deemed stabilized and so not subject to maintenance, but there is absolutely no way that IS29500 (or IS26300) are candidates for that yet!

Maintenance sounds a dreary word, but what it means is that National Bodies (and liaison bodies) can submit defect reports to SC34. And I would hope there is a backlog of these issues: a trouble with a stretched out Fast-track such as we have had is that it means there is in effect six months where Defect Reports have to sit on the shelf waiting until the standard is accepted before being processed. That there have been more defects or improvements discovered since the ballot was taken is not a source of wonder or horror: of course there will be more issues discovered: how could it be otherwise?

But it is a complete mistake, and at worst disinformation, to think that defects must remain outstanding, or that the standard is set in stone at the time of voting. Indeed, ISO ODF is largely predicated on there being ongoing maintenance to fill in the gaps and fix problems that are found. The thing is that standards based on deployed technologies do not need reviews based on “is this technology bogus and unimplementable” in the way that blue-sky standards do: in the case of Open XML and ODF and PDF you can open up a file and look at it and see whether the big and middle picture is workable. (And you can go further and validate the XML with the schemas, for fine-grained and objective compliance testing, of course.)

At ISO/IEC JTC1, the rule is that the Editor has to handle defect reports “promptly”. (“Promptly” needs to be measured in quarters of years, it won’t be weeks. But it won’t be years or decades, which is how long some bugs have persisted in Office without the circuit-breaking of National Body scrutiny.) SC34 participants have been discussing many issues relating to getting maintenance agile and pro-active, and National Bodies who are interested in document standards need to get involved.

What you have in the ISO process is equivalent, if the NBs want it, to a Ballot Resolution Meeting every six months in perpetuity. Defect Reports can include detailed suggestions for change, and it is even possible to bundle them as Draft Amendments and get that fast-tracked.

There is a lot of talk about “ECMA should resubmit it for another fast-track” or “ECMA should resubmit it for slow-track” and so on. I regard a lot of this talk as disingenuous, because it is frequently suggested by commentators who you know are not interested in corralling OOXML into a standard no matter how technically excellent it can become. It looks like a compromise but it is intended to block progress not help it. Now I have no general objection to standards taking years to complete, but for a deployed technology the correct process is the maintenance process not the committee draft process.

Every standard that gets adequate review will have reams of defects reported. That is just as much a function of the intensity of review as the underlying quality of the standard. Indeed, you could use a reverse metric: any standard which does not have at least one defect per 6 pages reported (for example) should be suspected of having inadequate review. DIS 29500 has had thousands of people reading it and reviewing it. Thousands, not hundreds. A big swathe has been dealt with; a big swathe has been dealt with partially and can be improved further; and there is a big swathe of issues that are not defects at all but extra features which clearly belong to maintenance not initial review.

But the idea that this is it, this is Microsoft’s only accountability moment where they get a pass or fail is propaganda, not the ISO process. It is completely true that the maintenance procedure needs continued interest and continued pressure, but it is not true that this is the last chance to improve the standard as if it will be frozen for all time.



Update

In comments below, ISO SQL editor Jim Melton has clarified his comments. I was glad to see him say “Please note also that I have taken no position at all on the merits of standardizing the technology in the spec, nor even the merits of the technology itself.” What Jim says, however, is that he would expect a full multi-year review of a new 6,000 page spec to almost certainly reveal upwards of 5000 unique issues.

I have three responses to that. First, that Ecma 376 already had a year of review before ISO, so it is inappropriate to count the number of issues as from a de novo standard: we should be open to the possibility that in fact we did not find thousands more problems because they are not there. (However, Jim’s original comment about pulling threads is really appropriate.)

Second, that the error rates in a standard have to be tied to the number of normative pages not just the raw page count: OOXML is unusual as a standard in having so much repeated and non-normative material: indeed, Patrick Durusau in 20 hours was able to condense the WordProcessingML material by 74% to 452 pages: assuming that the other parts have similar rates that gives us about 1500 normative pages, which by Jim’s metric should reveal only 1250 unique issues. Compare this to the approx 1,000 issues that were dealt with (and the large number of issues dealt with en masse such as fixing ISO-ese shalls and shoulds and fixing examples) and the review is actually looking pretty good even on Jim’s metrics, isn’t it!
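For readers who want to check the scaling, here is a small sketch of the arithmetic. It assumes, as the paragraph above does, that the 74% condensation rate found for WordProcessingML carries over to the rest of the draft and that expected issues scale linearly with normative pages; both are simplifying assumptions, not measurements.

```python
# Back-of-envelope scaling of Jim Melton's expectation to normative pages.
total_pages = 6000             # nominal size of the draft
expected_issues_total = 5000   # Jim's figure for a full review of 6,000 pages
condensation = 0.74            # reduction achieved on WordProcessingML

# Pages left if the whole draft condensed at the same rate.
condensed_pages = round(total_pages * (1 - condensation))
print(condensed_pages)         # about 1,500 normative pages

# Linear scaling of the expected issue count to 1,500 normative pages.
normative_pages = 1500
expected_issues_normative = expected_issues_total * normative_pages // total_pages
print(expected_issues_normative)
```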

And my third point is the same one I have made elsewhere. The maintenance process is the best place to deal with remaining issues. If you look at some of the FUD lists floating around of new issues, you see an indiscriminate grab-bag of new feature requests, denials of the scope of OOXML which emphasizes legacy features, function changes, as well as (hopefully) some errors proper. These are not showstoppers, but they all should be dealt with sooner rather than later because of their importance. And sooner means by maintenance of the standard, not by pre-standardization faffing around and filibustering.

Update 2

A website picked up on this exchange and quoted Jim’s

You’ve written 6000 pages of specification largely in secret (and, I understand, recently added over 1500 more pages) and given the world five months to read, absorb, understand, review, critique, and establish informed positions on it.

So I think it is useful to restate the problems with this.

  • 6,000 pages The pre-BRM draft standard (DIS 29500 mark I) had over 6,000 pages plus several hundred more for schema files that were not printed in the text. However, the text of a standard has normative parts which state actual requirements and informative parts which give extra information to help users. An estimate from the editor of a “rival” standard is that about 75% of the content of DIS 29500 mark I was informative or could be condensed without loss. The additional pages (and I have seen no reliable count that it is 1500 pages: that seems just puff) are mainly due to taking schemas that are currently normative and putting them into the standard; however, at the same time repeated fragments of schemas in the draft text are being made informative, so actually there is a net decrease in normative material.

    So really what we have is a standard of about 1500 normative pages (perhaps 2,000 pages including schemas) with about 4500 pages of additional information to help explain it. The blanket figure of 6000 both disguises that the text has an enormous amount of material to aid understanding and allows inflated views of the amount of work needed to find errors in the normative sections. Furthermore, there is an enormous amount of repetition, so review comments from one section often apply without change to other sections.
  • Secret Actually, Ecma put out a public draft for comment.
  • Five months No, the “five month period” is the nickname, and it actually took nine months until the ballot. So not 5 months to review 6,000 normative pages, but 9 months to review effectively 1,500 normative pages. What is the difference? Well, if we remove 1 month for administrative palaver, it is the difference between 6,000/4 = 1,500 pages per month and 1,500/8 = 187 pages per month.
  • Five months No, actually there was an additional period after the ballot where National Bodies could look at each other’s comments and participate in the Ballot Resolution Meeting, which takes it to over a year in total, not including the previous year of development at Ecma.
  • read, absorb, understand, review, critique, and establish informed positions But every individual National Body does not need to have a definite opinion on each individual issue: abstain is fine on issues that are not of interest or are outside the expertise. I don’t know how the ISO SQL Steering Committee works, but in SC34 national bodies try hard not to act outside their competence and are careful to abstain rather than spoil the process: they find the best experts they can and encourage development of national expertise and awareness of their particular national interests: Japan on internationalization, fonts and formal schemas for example. The review happens not because everyone involved knows everything, but because collectively and cooperatively all the issues get adequate coverage. For example, there may only be three or four National Bodies with deep experts on maths, and several more with general experts who can get the drift pretty well, and a few more with industry contacts and other liaisons, and that is more than adequate for review.
  • Given the world SC34 has been operational in one form or another for almost 25 years. People who are interested in this area have had a long time to get involved, learn the procedures, get national committees going, participate in various standards to learn the ropes and make networks. Both when ODF and OOXML were first proposed for fast-tracking there were good signs for people who were interested to get involved. The idea that somehow DIS 29500 has been foisted on an unsuspecting and unready public shifts the responsibility away from the people who should have been participating and up-to-speed. If a National Body (or government or other stakeholder) ignores developing skills and experts who will be ready to participate when the time comes, of course they will not have enough time: but it is their fault! If you are running in a race, arrive late, and the starter’s gun goes off while you are still putting on your shoes, you cannot complain “I didn’t have enough time!”
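The pages-per-month comparison in the “Five months” item above can be reproduced with a short sketch. The one-month deduction for administration is the same assumption made in the text; the function name is just for illustration.

```python
def pages_per_month(pages: int, nominal_months: int, admin_months: int = 1) -> float:
    """Review load per month after setting aside time for administration."""
    return pages / (nominal_months - admin_months)

claimed = pages_per_month(6000, 5)   # 6,000 pages in a "five month" review
actual = pages_per_month(1500, 9)    # about 1,500 normative pages over nine months

print(claimed)  # 1500.0
print(actual)   # 187.5
```

On these figures the real review load is about one eighth of the load the quoted complaint implies.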
Rick Jelliffe

This is an open letter to all companies who achieved market success in the 1980s and 1990s with PC-based applications.

The recent controversy over ODF and Office Open XML at ISO shows both that there is substantial interest in document formats, and that there is also substantial commercial rivalry. I do not believe I am on my own in thinking that the writing is on the wall: the days of private proprietary formats, especially binary formats, are numbered and perhaps have already expired.

There are of course many millions of documents archived in these older formats, and it will be a major challenge for archivists to figure out workable and cost-effective strategies for maintaining or grandfathering these documents into newer formats, especially more-or-less lossy standard formats.

Corporations who were market leaders in the 1980s and 1990s for PC applications have a responsibility to make sure that documentation on their old formats is not lost. Especially for document formats before 1990, the benefits of the format as some kind of IP-embodying revenue generator will have lapsed now in 2008. However the responsibility for archiving remains.

So I call on companies in this situation, in particular Microsoft, IBM/Lotus, Corel, Computer Associates, Fujitsu, Philips, as well as the current owners of past names such as Wang, and so on, to submit your legacy binary format documentation for documents (particularly home and office documents) and media, to ISO/IEC JTC1 for acceptance as Technical Specifications.* Handing over the documentation to ISO care can shift the responsibility for archiving and making available old documentation from individual companies, provide good public relations, and allow old projects to be tidied up and closed.

The recent controversy over Office Open XML and ODF has occurred in part because both were submitted to become International Standards, which is appropriate for living formats. However, there is still a substantial public interest that would be served by existing documentation of legacy formats being submitted as Technical Specifications or Technical Reports, which, as classes of documents that are less than a standard, will be less controversial but still useful for putting this valuable information into the public arena. As publicly available specifications, ISO/IEC would make the material available free on their website: free access is a very important outcome.

For nations where the 17 year patent time applies, there seems little reason why formats from 1990 and before could not be quickly submitted and dealt with in this way. However, given the enormous benefits that openness brings in increasing the size of the pie, I suggest that even recent formats, for example formats before 2001, should also be submitted to ISO as Technical Specifications in this way with some appropriate RAND-z IP covenant or license.

Examples of these formats that spring to mind include:

  • All Microsoft Office binary and text and media formats, including RTF and Visio
  • All IBM/Lotus binary and text and media formats, including Visicalc
  • All Corel formats, including WordPerfect

Furthermore, I call on archiving and regulatory bodies to investigate encouraging and supporting this kind of activity. As well as office document formats, there are substantial legacy collections of financial and engineering documents which would also benefit from the same treatment. It should go without saying, but the Macintosh, Amiga, OS/2, and applications on the many different versions of UNIX may also have hosted popular applications whose documentation may be in danger of being lost unless it is lodged with a suitable formal international technical library, such as ISO/IEC.

The ISO/IEC Technical Specification is a good, low-fuss medium for making sure that older formats do not disappear, and without requiring costly rewrites or changes.

*Contact your local national standards body for advice on this, or your local SC34 committee member. Do not get too caught up in whether the document is a Technical Report or Technical Specification.

Rick Jelliffe

In the markup world, the jargon is that inline markup is the tags that delimit ranges of text in a document (e.g., Plain Old XML), while out-of-line markup is where the structures and labels are in one place but the subjects of the structures and labels are in another place (e.g., XLinks). Of course, you can have XPaths which drill down to some piece or bundle of information with inline markup, but where there is out-of-line markup there is potentially another XPath that can drill down through the out-of-line markup and end up labelling the same information.

What may not be obvious is that a web system that uses the PRESTO approach is in effect using URLs that act like XPaths on virtual out-of-line markup. “Virtual” because no actual tree is ever explicated (necessarily): notionally PRESTO uses resolver rewriting.

That good markup practice is to mark up the information directly, without fluff and tricks and in as pleasant a way as possible, is universally acknowledged; and that there are many kinds of information structure where the markup cannot be a neat model of the data such that all elements represent objects of the same analytical importance is also widely known and regretted. (Think of the distinction in XSD between the components (the objects of the schemas) and the tags used for each component, for example. Or the *Pr containers in OOXML.)

A PRESTO URL should give the view in terms of the (conceptual) components, not the specific tags used if the resource is stored as an XML document. And not necessarily every tag, certainly. But every concept (every significant concept) should have a URL, even if there is no representation available or only a pretty crappy one.

So if in PRESTO a URL represents a kind of XPath to a virtual out-of-line markup view of some data, then it is possible to have a virtual schema for that virtual markup: in effect, you could have a schema for the URL. For example, given the virtual schema (as RELAX NG compact syntax here):

  element address {
     element tent { text },
     element oasis  { text },
     element wadi { text },
     element desert { text }
  }

we could allow PRESTO URLs like

   http://www.eg.com/address
   http://www.eg.com/address/tent
   http://www.eg.com/address/oasis
   http://www.eg.com/address/wadi
   http://www.eg.com/address/desert

In PRESTO, these should be available regardless of how the data is stored, because the idea is to model the user’s conceptions. (And if an exact match is not available, to provide the best fit. This certainly creates a task allocation between front-end and back-end systems that may not be workable for some organizations or tasks. No sweat.)
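To make the idea of a schema for the URL concrete, here is a minimal sketch that derives the URL list above from the virtual schema. The nested dictionary is a hypothetical stand-in for the RELAX NG schema, and the function name and base URL handling are assumptions for illustration only; a real system would read the schema itself.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the virtual schema: each key is an
# element name and each value is a dict of its child elements.
VIRTUAL_SCHEMA: Dict[str, dict] = {
    "address": {"tent": {}, "oasis": {}, "wadi": {}, "desert": {}}
}

def presto_urls(schema: Dict[str, dict], base: str) -> List[str]:
    """Return one URL per element in the schema, parents before children."""
    urls: List[str] = []
    for name, children in schema.items():
        url = f"{base}/{name}"
        urls.append(url)
        urls.extend(presto_urls(children, url))
    return urls

for url in presto_urls(VIRTUAL_SCHEMA, "http://www.eg.com"):
    print(url)
```

The point of the sketch is only that the set of legal URLs falls out of the virtual schema, whatever the storage behind it looks like.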

But what about cardinality? Here is a schema more typical of literature:

   element law {
       element title { text },
       element part {
            element title { text },
            ( element p { text } |
              element list {
                  element item { text }+
              }
            )*
       }*
   }

The XPath for accessing a particular part’s title would be /law/part[2]/title so the PRESTO URLs would need some kind of convention.

In PRESTO we *might* have URLs for

     http://www.eg.com/law/
     http://www.eg.com/law/title
     http://www.eg.com/law/part
     http://www.eg.com/law/part2/title
     http://www.eg.com/law/part2/p3
     http://www.eg.com/law/part2/list4
     http://www.eg.com/law/part2/list3/item4

Now, I am not sure I understand the issues well enough to say which system for indexing is absolutely best. But I think the advantage of http://www.eg.com/law/part2/title over http://www.eg.com/law/part2/title is that it is probably a more common case that your system is interested in /law/part[2]/title rather than all titles of parts /law/part/title. But it is a matter of the particular use case and the consequent virtual schema.
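As a rough illustration of the digit-suffix convention used in the URLs above, the following sketch turns a PRESTO URL into the XPath it stands for. It assumes that a trailing run of digits on a path segment is a 1-based position (so part2 means part[2]); the function name is hypothetical, and the convention would clearly need refining for element names that themselves end in a digit.

```python
import re
from urllib.parse import urlparse

# Assumed convention: a trailing run of digits on a path segment is a
# 1-based position, so "part2" means part[2]. Names that really end in
# a digit (such as "h1") would be misread by this simple rule.
_SEGMENT = re.compile(r"^([A-Za-z_][\w.-]*?)(\d+)?$")

def presto_url_to_xpath(url: str) -> str:
    """Translate a PRESTO-style URL path into an XPath over the virtual markup."""
    steps = []
    for segment in urlparse(url).path.split("/"):
        if not segment:
            continue  # skip empty pieces from leading or trailing slashes
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"unsupported segment: {segment!r}")
        name, index = match.groups()
        steps.append(f"{name}[{index}]" if index else name)
    return "/" + "/".join(steps)

print(presto_url_to_xpath("http://www.eg.com/law/part2/title"))
print(presto_url_to_xpath("http://www.eg.com/law/part2/list3/item4"))
```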

(Another possibility is just to bite the bullet and allow XPath syntax directly in the URLs, with appropriate percent escaping. For example http://www.eg.com/l/law/part%5B2%5D/title. Is this reinventing XPointer? Well, in a way, except that in XPointer you are locating a file then drilling down according to the actual markup: in PRESTO the information is merely hierarchically accessible, and you are using the Use Case concepts to zero in on it.)
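The percent escaping mentioned here is mechanical. As a small sketch, Python’s standard urllib.parse.quote leaves the slashes alone by default and escapes the square brackets; the base URL below is just the example host used throughout this post.

```python
from urllib.parse import quote, unquote

xpath = "/law/part[2]/title"

# quote() keeps "/" by default and percent-escapes "[" and "]".
escaped = quote(xpath)
url = "http://www.eg.com" + escaped
print(url)

# The server side can recover the XPath by reversing the escaping.
recovered = unquote(escaped)
print(recovered)
```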

Rick Jelliffe

One question that comes up really regularly when I have been yacking about the PRESTO approach with people over the last month, is that people don’t see how Objects fit into it. They get Persistent URIs, they get REST, but the Object part is not so obvious. (Actually, I have had several people email me that the approach is one they have been tending towards in their work too.)

One reason, of course, is that the term Object-Oriented is generic and used for a family of related ideas, rather than being a single neat idea. But the PRESTO idea is that the public URLs should reflect an object-oriented modeling of the data and systems, and that you should have URLs for every object in your system even if there is no satisfactory representation of that resource.

Wikipedia says that an object can be viewed as “an independent little machine with a distinct role or responsibility”, which is a good start, but I have always thought a key value of objects is that they can help model the system according to the concepts in the user’s/developer’s/domain’s mind or usage. The aspects of being an object that PRESTO is interested in are encapsulation (the idea that entities should be self contained, with data and methods tightly coupled) and introspection (the idea that you can ask an object about its contents: methods, children, etc.). [UPDATE: Oi! NOT INHERITANCE, NOT RPC, NOT INTERFACES, NOT COUPLING STATE, NOT POLYMORPHISM] Bjarne Stroustrup has commented recently that problems which can be composed into a hierarchy are good candidates for Object-Oriented solutions (sorry, no reference here: it was in a Linux magazine I was reading today, maybe Linux Developer…has a Sun Solaris distro on the DVD.)

In pattern terms, PRESTO is a Facade pattern applied to URLs. In terms of UML, we might see PRESTO as saying that public-facing URLs should be constructed based on some entity analysis such as Use Cases or Package Models.

But the key way to think about it is just basic object concepts. The PRESTO approach says to form URLs so that each “directory” in the URL is an object, and its contents are sub-objects, data or other resources. Methods are not expressed as queries, but declaratively by identifying their result: so you don’t say http://www.eg.com/document/?getGraphic but http://www.eg.com/documents/graphic which then allows you to say http://www.eg.com/documents/graphic/title and so on.
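To make the "each directory is an object" idea concrete, here is a minimal sketch in Python. The in-memory tree, the title text and the function names are all hypothetical; a real system would sit in front of whatever back-end holds the documents.

```python
# Hypothetical object tree: each "directory" is an object whose entries
# are sub-objects or data.
site = {
    "documents": {
        "graphic": {
            "title": "Figure 1: System overview",
        },
    },
}

def resolve(tree, path):
    """Walk a URL path one segment at a time; return None if nothing is there."""
    node = tree
    for segment in path.strip("/").split("/"):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node

def children(tree, path):
    """Introspection: list the names available directly under a path."""
    node = resolve(tree, path)
    return sorted(node) if isinstance(node, dict) else []

print(resolve(site, "/documents/graphic/title"))
print(children(site, "/documents"))
```

Note that the "method" getGraphic never appears: the result is simply named as a child of the object it belongs to.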

Of course there are often many alternative ways of organizing or categorizing data, which is why you appeal to use cases to guide you to the best form. Indeed, you might have alternative PRESTO URLs for the same data resource.

One piece of software that is highly useful for implementing a PRESTO system is the Tuckey UrlRewriteFilter, which is good for Java-based web servers. We are finding that regex-based URL mapping makes the whole thing quite easy and painless, in particular when retrofitting a PRESTO facade on top of an existing web site. The difficulty is largely where it belongs: in figuring out which objects are most interesting or obvious to the users. This is where modeling the particular Use Cases or even Configuration Items comes in.
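The flavour of regex-based mapping can be sketched in a few lines of Python. This is not the UrlRewriteFilter configuration format: the rules, the legacy doc.jsp URLs and the function name are invented for illustration only.

```python
import re

# Hypothetical rules: public PRESTO-style path pattern -> legacy back-end URL.
# More specific patterns come first.
RULES = [
    (re.compile(r"^/documents/(\w+)/graphic/title$"), r"/legacy/doc.jsp?id=\1&get=graphicTitle"),
    (re.compile(r"^/documents/(\w+)/graphic$"), r"/legacy/doc.jsp?id=\1&get=graphic"),
    (re.compile(r"^/documents/(\w+)$"), r"/legacy/doc.jsp?id=\1"),
]

def rewrite(path):
    """Return the back-end URL for a public path, or None if no rule matches."""
    for pattern, target in RULES:
        match = pattern.match(path)
        if match:
            # expand() substitutes captured groups (\1) into the target template.
            return match.expand(target)
    return None

print(rewrite("/documents/act42/graphic"))
```

The public URLs stay clean and hierarchical, while the existing site keeps whatever query-string interface it already had.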

Rick Jelliffe

The story so far

  • In the 1990s and earlier, Microsoft was notoriously prominent in its desire to keep its binary formats proprietary: it provided RTF for text-based interoperability but RTF did not allow full round-tripping of data.
  • In 2000, Microsoft started providing XML data dumps for spreadsheet data, and each subsequent version of MS Office has used XML more, with Office 2003 providing quite full support, to the extent that now the default save formats, on the Windows platform at least, are all XML-in-ZIP files, the latest generation with the name Office Open XML (which people often write as OOXML).
  • In 2004 a European Union agency recommended to MS that it should continue down the XML route and open up its formats by submitting them to some international standards body. (At the same time, a recommendation was issued for OASIS to submit ODF to ISO.)
  • In December 2005 Microsoft founded a technical committee at the ECMA standards body, TC45, which worked for a year and released ECMA 376 in December 2006; during this time the specification, which included much text based on documentation for the older binary formats, grew from about 2,000 pages to over 6,000 pages. A public draft was issued in mid 2006. (At the same time, around December 2005, OASIS submitted ODF 1.0 for ISO consideration using a variant fast-track procedure: it was accepted with scant National Body review in mid 2006.)
  • At this time (December 2006) ECMA 376 was submitted to ISO/IEC JTC1, the international standards organization, for “Fast-Track” adoption as a standard: the fast-track process is used for standards which have been drafted at other organizations, and enter the process as Final Draft International Standards. At this stage, National Bodies had about eight months to review the standard and come to an initial position. Many National Bodies invested significant effort in attempting various reviews, however this period was also characterized by the raising of many spurious issues. (In early 2007, an update to ODF called ODF 1.1 was released at OASIS but not resubmitted to ISO, with improved accessibility features.)
  • In September 2007, the initial ballot of National Bodies resulted in a significant number of “No with comment” votes, which triggered a Ballot Resolution Meeting (BRM). The BRM had been widely expected, due to the expected large number of comments. In the ISO process, a “No with comment” has also been called a “Conditional Yes”, but many journalists and commentators at this stage preferred oversimplification to reality. Over 3,000 individual comments were received; however, the majority of these were repeated form-letter comments, part of an organized campaign, rather than coming from fresh National Body reviews.
  • In mid January 2008, the Editor for DIS 29500 released a promised Disposition of Comments document, containing suggested fixes from ECMA for addressing the National Bodies’ issues: these ranged from simple acceptance, to alternative approaches, to rejection of the issue, with their justification for these. ECMA had bundled the issues into about 1000 different responses. I wrote earlier, “The Editor’s Disposition of Comments …is usually the starting point for comment resolution, and, given that most comments are uncontroversial, is often the end-point too.”
  • In early 2008 Microsoft released the binary format documentation under its OSP covenant, and promised the mappings between the binaries and OOXML: this seems to be in direct response to requests for this from NBs, though the mappings are not in scope for DIS29500’s text.
  • In late February 2008, a week-long Ballot Resolution Meeting was held in Geneva, Switzerland. It was attended by 120 individual delegates from about 34 different National Standards Bodies. The outcome of the meeting was a series of editor’s instructions to allow a new draft of the standard to be created: usually these instructions are completely specific, though there may be some general ones, for example to use one term rather than another globally. (At time of writing, March 2008, OASIS has been working on ODF 1.2, which is slated to improve several important ODF weak spots, in particular relating to formulas and metadata. It is mooted for re-submission to ISO during 2008.)
  • The results of the BRM are available online, and National Bodies now have one month (until the end of March 2008) to decide if the changed draft meets their requirements. For the new draft to pass, it will require 5 National Bodies (of the “P” class) to switch from Abstain or No votes (remembering that No with Comments may mean “Conditional Yes”).
  • Of the 1027 Editor’s responses, the BRM addressed 189 responses by specific resolutions and discussions of the BRM, and the rest using a paper ballot where each National Body in attendance voted: this accepted 825 of the Editor’s recommendations and rejected 13. (The issue of a paper ballot had been abstain on issues of lesser interest to them.
  • If the new draft is adopted as a standard, it does not remain static but can be “maintained” by the relevant ISO/IEC JTC1 committee, SC34, Document Processing and Description Languages. Procedures exist for National Bodies to submit Defect Reports, which again attract the Editor’s attention and National Body voting acceptance, so the kind of process seen at the BRM becomes an ongoing effort, if there is enough interest by National Bodies.

The upshot is that, if DIS29500 mark II and ODF 1.2 both get accepted as standards, by the end of 2008 we should have two standards which together can thoroughly cover the field of representing current and legacy office documents, each representing one of the two dominant commercial traditions, with both under active and significantly open maintenance to fill in the remaining gaps and to repair pending broken parts, with clear cross-mapping to allow interconversion, with an increasing level of modularity so that they can share their component parts, and at least with a feasible agenda of co-evolution and other kinds of convergence.

And if we play our cards well, both traditions will have significant competitive motivation to accommodate the technical requirements of their competitors. Viola, harmonization? (Violà, harmonisation?)

The big picture changes

The “big picture” changes very often concern issues of conformance and modularity.

  • The draft is being split into 4 Standards:
    1. Fundamentals: a large standard for the core of OOXML
    2. OPC (Open Packaging Conventions): the details on using ZIP and referencing
    3. Markup Compatibility and Extensibility
    4. Transitional Migration Features: contains VML and features not recommended for new documents. Problematic terms like “legacy” and “deprecated” have now been avoided.
  • Six document conformance classes have been created: Core and Transitional classes for WordProcessing documents, Spreadsheet documents and Presentation documents.
  • Six application conformance classes have been created: Base and Full classes for word processors, spreadsheet and presentation applications.
  • The scope sections have been clarified.
  • Normative references are to be complete.
  • Use of standard formats for syntax: BNF
  • Use of standard measures for typesetting lengths
  • Use of standard format for dates
  • Use of IANA/ISO names for language and country codes
  • Development of a prefix mechanism for spreadsheet formulas, presaging a full namespace modularity system like Open Formula’s.
  • Encouragement for applications to save equations as MathML even if they also save in the OMML maths.
  • Many casual references to MS-tradition technology removed and replaced by references encouraging W3C technologies for interchange

The small picture changes

The small-picture changes frequently are aimed to make the draft more “ISO-ish” and therefore make maintenance and future development at ISO/IEC JTC1 easier.

  • All known typos will be fixed
  • All known errors in examples will be fixed
  • All schema fragments will be marked informative to prevent clashing
  • ISO standard conformance language will be used: shalls and shoulds

The middle picture changes

The changes from the BRM usually relate to either correcting bugs or better documentation. Additions to functionality tended to be limited to providing better accessibility and better internationalization, rather than completing or expanding the general feature set. The Editor’s Disposition of Comments clearly tried to reduce the amount of gratuitous breakage of documents or applications, and the explicit resolutions of the BRM continued this policy IMHO.

  • Accessibility features to support better tabbing (in the fashion of HTML’s tabinfo) and table labelling. An informative reference to guide developers in accessibility features is being added.
  • Multiple changes to support right-to-left writing, half-width character terminology and less US-centric artwork and measures
  • The schemas have been re-written to be more compatible with the frailties of various XSD implementations. The XSD schemas will be included in the text as annexes with line numbers. There will be both Strict and Transitional schemas, following the model of HTML. The RELAX NG schemas have been regenerated accordingly and much improved: many people may find them preferable to the XSD schemas.
  • Hundreds of clearer explanations of multiple elements and functions.
  • Almost all bitfields will be replaced by specific attributes. (The bitfield which accords with ISO Open Font remains.)
  • Fixes to the CONVERT() function and a mathematically proper ceiling function, ISO.CEILING() for spreadsheets
  • A mechanism to prevent applications from executing files with incorrect types, to prevent viruses
  • Strings may not have non-XML graphical characters in them
  • Different hashing algorithms

Plus hundreds more.

Other Issues

Many other related issues were also discussed in the hallways at Geneva. For example, the German DIN standards body is preparing a cross-mapping list to match features in OOXML and ODF: there really is very little information on this currently, despite the confident assertions that ODF can/cannot handle everything that OOXML does and vice versa. The Italian standards body is seeking to work on conformance suites for testing: obviously the schemas and BNF grammars allow validation testing of instances for document conformance, so I presume the test suites will be more concerned with application conformance. ISO/IEC JTC1 SC34 has been making various preparations to establish an effective and responsive maintenance regime: ODF could also benefit from this effort.

With over 1,000 changes, I certainly will have missed out some items of interest. Will these be enough to sway the necessary five National Bodies? The changes certainly provide objective extra information favourable to DIS29500 supporters, and the sheer number of changes suggests that ECMA is not going for a first-past-the-post strategy but trying to demonstrate a broader commitment to improvements even from antagonistic National Bodies. But though the anti-OOXML faction doesn’t have any new information to provide a counterbalance (discarding the frantic and self-justifying posturings over the BRM) I expect that they will try to explain their longstanding objections more carefully and acutely, since they do raise many good points.

Impressions

I thought the BRM went very smoothly, for a large high-stakes meeting, and I was happy to make some old and new friendships. In substance, the BRM was a typical ISO meeting of this kind: collegiality, druthers, voting, discussion, corridor meetings, rounding up supporters for measures, trying to track down definitive answers on technical issues, and so on. In accidents, it was very unusual due to size, content and ramifications not to mention the new blood pool.

I think we did pretty well in the Australian delegation, in getting many of our issues addressed completely and most of our issues addressed in part, but (like any standard!) the more you look the more holes you see. There are so many improvements that can and should be made by pro-active maintenance. At various times we had particular help from CA, MY, JP, UK, CZ, FI, US, and several others, so an unofficial thanks to those delegates from this delegate.

Rick Jelliffe

I’ve been trying to think of the best way of characterizing the basic classes of typesetting engines. Here’s roughly where I am up to.

There are basically three approaches used by typesetting systems:

Grids
The oldest approach. The page is divided up into grids, and each paragraph gets injected line by line to fit between various gridlines. Further gridlines may be placed relative to positions in the paragraph (e.g. the end). In a grid system, tables and lists are really just an arrangement of paragraphs with particular grid relationships rather than being objects in their own right. Troff, Word 1.0 and XSL-FO regions are examples of this kind of approach.
Frames
The page is divided into linked (typically rectangular) areas and the text is poured into them. A table would be considered a frame of frames. Adobe FrameMaker and ISO DSSSL are examples of this approach.
Cells
Cells are objects which have certain fixed and variable properties, such as size etc, and have various relationships between other cells: TeX’s box and glue metaphor is a good example, but ideas of gravity or magnetism are also appropriate. Typesetting involves finding an optimal solution from a system or subsystem of cells. Cells may contain other cells, allowing hierarchical properties. The cell approach can allow very dynamic typesetting.

Each kind of typesetting engine has different ways to get the same kind of effect. Take the example of how a system knows when to break a paragraph at the bottom of the page, or move it to the next. A primitive grid system would have some kind of “requires” attribute on the paragraph, for example to say “This paragraph requires at least two lines free at the bottom of the page, otherwise cast off the page and start the paragraph on a new one.” A primitive frame system might have “widow and orphan” controls, which looked at how the text was spread between the frames. A cell system might have “keep with next” and “keep with previous” properties for each paragraph, and sort out which kind of breaking resulted in the least penalty.

Modern typesetting systems are rarely pure versions of each, of course: the need for extra features, convenience and interoperability leads developers to graft or cherry-pick approaches. For example, a copy-fitting system might be basically grid-based, but use a penalty system and feedback to rejiggle the grid settings for better fit. The extent (how many paragraphs, columns, pages, etc.) and granularity (which objects, frames or grids can be rejiggled) play a large role in determining how much human intervention will be required to achieve high quality typesetting. Think of a Yellow Pages directory: to get good results for these, you need to look beyond what is on the immediate spread to previous (and therefore following) spreads as well, for the optimal placement of floating display material that keeps in sync with the current running heads.

And even within the same kind of system, there are many possible variations, which page designers will be very aware of. For example, when a paragraph says “Keep 1cm space after me” and the next paragraph says “Keep 2cm space before me”, some systems will work by adopting the greater (2cm) while others will adopt the sum (3cm). We might imagine that primitive grid systems could tend to the latter, while frame systems could tend to the former (and cell systems might do some negotiation or compromise: 1.5cm?) But at this level, it is every man for himself.
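The three outcomes for that 1cm/2cm example can be written down as a tiny Python sketch. The function and policy names are hypothetical labels, not taken from any particular typesetting system.

```python
def gap_between(space_after_cm, space_before_cm, policy):
    """Resolve the vertical gap between two consecutive paragraphs."""
    if policy == "max":      # adopt the greater of the two requests
        return max(space_after_cm, space_before_cm)
    if policy == "sum":      # honour both requests in full
        return space_after_cm + space_before_cm
    if policy == "average":  # one possible compromise
        return (space_after_cm + space_before_cm) / 2
    raise ValueError(f"unknown policy: {policy}")

for policy in ("max", "sum", "average"):
    print(policy, gap_between(1.0, 2.0, policy))
```

The same markup therefore gives 2cm, 3cm or 1.5cm depending on which policy the engine happens to implement.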

One feature of typesetting systems that dominates their design and capabilities is whether they are streaming or in-memory. A streaming implementation has very little lookahead (and probably very little memory of recent pages), and complicated typesetting will be performed by mixes of diversions (where text perhaps in some semi-processed state is stored for later use) or by multiple passes or by checkpoints (a range is read in-memory to allow various typesetting options to be tried and the optimal one put out, the range being discarded: to overcome the limitations of stream-based processing). It is quite rare to find systems that have typesetting rules allowing or using very significant lookahead: even cell-based systems try to localize properties to being object-properties (e.g. paragraph properties) or immediate-location properties (e.g. frame or page properties).

[UPDATE: I am removing any comments not on the topic of typesetting engines. Though of course I really appreciate the readers who defend me, please don’t post comments about individuals. There may be malicious hypocrites at loose in the world, but they can be exposed on other blog items! ]

Rick Jelliffe

PRESTO is not something new: its basic ideas are presupposed in a lot of people’s thinking about the web, and many people have given names to various parts, but I don’t know that anyone has given a name to this package. In any case, this combination of ideas, which seems to me to be the sweet spot of practicality for large public document sets, seems to have escaped the way that we approach many problems and systems. However, the question I ask is “How else are you going to do it?”

The elevator pitch for PRESTO is this:

“All documents, views and metadata at all significant levels of granularity and composition should be available in the best formats practical from their own permanent hierarchical URIs.”

I would see PRESTO as the kind of methodology that a government could adopt as a whole-of-government approach, in particular for public documents and of these in particular for legislation and regulations. The problem is not “What is the optimal format for our documents?” The question is “How can we link to the important grains of information in a robust, technology-neutral way that only needs today’s COTS tools?” The format wars, in this area, are asking exactly the wrong question: they focus us on the details of format A rather than format B, when we need to be able to name and link to information regardless of its format: supra-notational data addressing.

PRESTO is a combination of three ideas:

  • Permanent URLs
  • REST
  • Object-oriented

Legal documents such as legislation have three characteristics: they are highly structured, they are highly voluminous, but they have highly varying value. So many documents do benefit from the classic SGML treatment, with semantic Full Monty markup, but many others are accessed so rarely there is little benefit in having high-level markup for them. And in fact many documents may be scanned images with no text at all, and full markup entails re-keying.

So what PRESTO does (and people familiar with SGML PUBLIC identifiers will get the drift, and even more so people familiar with ISO Topic Maps) is to say that there is a real importance in being able to have permanent names even for resources that don’t have a really brilliant representation available.

In fact, the legal document may not exist physically at all: it may be a base document and an amendment document. So we want a permanent URL for the idea of that document, and we want our system to deliver the best fit it can when we want to get the representation. And we want to allow multiple formats, because often the best representation may be client-dependent.

Some people might understand it better if we say that PRESTO is about naming and structuring the configuration items for document sets, and forms a precondition for vendor-neutral implementations, and to support plurality. What PRESTO does is say that when we drill down into a document, we do not want to drill down using media-dependent or presentation-dependent accidents, but according to the editorial/rhetorical (i.e. “semantic”) substance.

So why do I say “How else are you going to do it?”

The reason is that if you want to build a large information system for these kinds of documents, and you want to be truly vendor neutral (which is not the same thing as saying that preferences and delivery-capabilities will not still play their part), and you want to encourage incremental, decentralized, ad hoc and planned developments, in particular mash-ups, then you need Permanent URLs (to prevent link rot), you need REST (for scale etc.) and you need object-oriented (in the sense of bundling the methods for an object with the object itself, rather than having separate verb-based web services which implement a functional programming approach: OO here also including introspection, so that when you have a resource you can query it to find the various operations available).

What would a concrete example be? Let’s say we are a government and we have adopted PRESTO so all our legislation is online with these kinds of permanent URLs, including every numbered thing inside the legislation. Then we want to be able to ask “What other laws reference Part 4 of this Act?” In PRESTO, we say “OK, the object here is Part 4, so we want to extend the URL for Part 4 to add a name which means the list of references.” So we would have a URL like http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4/Referenced so that this gives a new URL, hierarchically based on the object it was dependent on. What we don’t do is http://www.eg.gov/functions/getReferences?to=/laws/ChildProtectionAct1904/1993/Part4 (which is procedural/functional) and not http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4?query=Referenced (some people would think this is OK, I don’t have a particularly strong view at the moment.)

Now what happens when we try to access this resource, using an HTTP GET for example? Well, that depends entirely on what information the back-end has to go on. It might be an HTTP 404 error. It might be an HTML file with a list of links. It might be an XML file of XPaths. It is up to the client to cope with the data that is sent, not the server to send in a standard, universal format. But if we allow introspection, we can then ask the resource for a list of the resources available (and HTTP content negotiation can be used too, potentially.)

I guess a rule of thumb for a document system that conformed to this PRESTO approach would be that none of the URLs use # (which indicates that you are groping for information inside a system-dependent level of granularity rather than being system-neutral) or ? (which indicates that you are not treating every object you can think about as a resource in its own right that may itself have metadata and children.)
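That rule of thumb, together with the "extend the object's URL with a name" step from the Part 4 example, can be sketched in Python. Both function names are hypothetical; the URLs are the ones used above.

```python
def extend(object_url, name):
    """Name a result hierarchically, as a child of the object it depends on."""
    return object_url.rstrip("/") + "/" + name

def follows_rule_of_thumb(url):
    """True if the URL uses neither '#' nor '?'."""
    return "#" not in url and "?" not in url

part4 = "http://www.eg.gov/laws/ChildProtectionAct1904/1993/Part4"
referenced = extend(part4, "Referenced")

print(referenced, follows_rule_of_thumb(referenced))
print(follows_rule_of_thumb(part4 + "?query=Referenced"))
```

This only checks the surface form of a URL, of course: it says nothing about whether the names chosen actually match the objects users care about.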

Keith Fahlgren

As someone who arrived much later to the XML party than most of my peers & mentors, this week’s series of XML @ 10 years posts has been a wonderful history lesson. Today, Norm Walsh posted an even more surprising quote:

I joined O’Reilly on the very first day of an unprecedented two-week period during which the production department, the folks who actually turn finished manuscripts into books, was closed. The department was undergoing a two-week training period during which they would learn SGML and, henceforth, all books would be done in SGML.

Rick Jelliffe

One simplification I have made in the XSLT code presented so far is that, except for datatypes, I have elided the issue of diagnostics. Yet the ability to provide better diagnostics is one of the value propositions for Schematron. So let’s quickly add in some diagnostics!

In Schematron schemas, a distinction is made between assertions, which are positive natural language statements about what should be found in a schema (and if possible why!) and diagnostics which provide information about errors for users. So the schema might say “Element X should be followed by element Y” and the diagnostics might say “Element X was followed by element Z”. The user gets both pieces of information.
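To illustrate the two kinds of message, here is a sketch in plain Python rather than Schematron. The check, the element names X, Y and Z and the message wording simply follow the example in the paragraph above; none of it is generated by the converter.

```python
import xml.etree.ElementTree as ET

ASSERTION = "Element X should be followed by element Y"

def check_x_followed_by_y(parent):
    """Return (assertion, diagnostic) pairs for each X not followed by Y."""
    messages = []
    kids = list(parent)
    for i, child in enumerate(kids):
        if child.tag != "X":
            continue
        follower = kids[i + 1].tag if i + 1 < len(kids) else None
        if follower != "Y":
            found = f"element {follower}" if follower else "nothing"
            # The assertion says what should be true; the diagnostic says
            # what was actually found in this particular document.
            messages.append((ASSERTION, f"Element X was followed by {found}"))
    return messages

doc = ET.fromstring("<root><X/><Z/></root>")
report = check_x_followed_by_y(doc)
print(report)
```

The assertion text is fixed and document-independent, while the diagnostic is computed from the instance being validated.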

But in a well-written schema, the assertions can pretty much be printed off without their XPath paraphernalia as bullet points and read as software requirements or human-usable documentation. See Autogenerating standards from Schematron schemas for an XSLT script that does this.

So here is our basic diagnostics section. These are each linked to from the appropriate assertions using the diagnostics attribute to reference the diagnostic element’s ID.

		<sch:diagnostics>
			<sch:diagnostic id="d1">This element was found:
				"<sch:value-of select="*/name()"/>".</sch:diagnostic>

			<sch:diagnostic id="typo-element">This element was found:
				"<sch:name/>" in "<sch:value-of select="parent::*/name()"/>".</sch:diagnostic>

			<sch:diagnostic id="typo-attribute">This attribute was found:
				"<sch:name/>" on "<sch:value-of select="parent::*/name()"/>".</sch:diagnostic>

			<sch:diagnostic id="expected-element">This element was found:
				"<sch:name/>" in "<sch:value-of select="parent::*/name()"/>".</sch:diagnostic>

			<sch:diagnostic id="expected-attribute">This attribute was found:
				"<sch:name/>" on "<sch:value-of select="parent::*/name()"/>".</sch:diagnostic>

			<sch:diagnostic id="unexpected-immediate-follower">This element was found:
				"<sch:value-of select="following-sibling::*[1]/name()"/>".</sch:diagnostic>

			<xsl:comment>Generating Diagnostics for xs:all/xs:elements</xsl:comment>
			<xsl:for-each select="xs:element[.//xs:all]//xs:all/xs:element">
				<xsl:variable name="ancestor-element" select="ancestor::xs:element/@name"/>
				<xsl:variable name="element-name" select="if (@name) then @name else @ref"/>
				<sch:diagnostic id="{concat('d2-',$ancestor-element,'-',$element-name)}">
				<sch:value-of select="count({$element-name})"/>
					"<xsl:value-of select="$element-name"/>" elements were found</sch:diagnostic>
			</xsl:for-each>

			<!-- generate a diagnostic for each standard datatype -->
			<xsl:call-template name="generate-standard-datatypes-diagnostics"/>
		</sch:diagnostics>

This is the last in this round of articles on the XSD to Schematron converter about schema generation, probably. Thanks to JSTOR and Allette Systems for sponsoring its development. I hope to be transferring the code to SourceForge under GPL in February, though I want to divide the main code out into nice separate files first, for maintainability. I am very interested in finding anyone interested in taking over or contributing to the project: I have a backlog of Schematron matters to attend to!

Rick Jelliffe

This article is part of a series describing how to convert from W3C XML Schemas to ISO Schematron. They are very different schema languages! This time we look at some code with quite complex XPaths: we want to validate that the element that follows another element in a document is one that “goes after” the first. But not necessarily immediately after: the schema might require extra elements in between, for example.

Why would we want to do that? Well, because we are approaching this systematically, and gradually expressing constraints from the most general to the most specific. XML schema content models are rather difficult, even when simplified in the way we already do when pre-processing the schema. By having a pattern that validates consecutive elements for partial order we can cope with all manner of inter-nested choice and element groups and cardinalities. We leave testing of required immediate following elements to another test (See the previous in this series Required pairs in sequences for a start.) By plugging the hole with a big rock, we need smaller pebbles for the remaining gaps.

(I have previously raised the use of partial order for schemas in my single-element schema language Hook, and readers.)

Output

Let’s start off by showing what we want to achieve. We want to generate from any XSD content model rules like the following:


      <sch:rule context="Address/StreetOrPOBox">
         <sch:assert test="not (following-sibling::*)  or
                   (following-sibling::*[1][name() ='Suburb']  or
                    following-sibling::*[1][name() ='State']  or
                    following-sibling::*[1][name() ='Postcode'])">
			When in a  "Address" element, the element "StreetOrPOBox" can only be followed
			(perhaps with other elements intervening)
			by the following elements: Suburb, State, Postcode
         </sch:assert>
      </sch:rule>

And for elements which cannot have any followers, we want to generate rules like the following:

      <sch:rule context="Address/Postcode">
         <sch:assert test="not(following-sibling::*)">
		When in a "Address" element, the element "Postcode" should not be
		followed by any other element.
	 </sch:assert>
      </sch:rule>
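To see what the two generated rules above are checking, here is a rough Python equivalent using the standard ElementTree module. The follower table is hand-written for the Address example and the function name is hypothetical; the real converter derives this information from the XSD content model.

```python
import xml.etree.ElementTree as ET

# Hypothetical follower table: element name -> names allowed as the next
# sibling. An empty set means "no followers allowed".
ALLOWED_NEXT = {
    "StreetOrPOBox": {"Suburb", "State", "Postcode"},
    "Postcode": set(),
}

def follower_errors(parent):
    """Report each child whose next sibling is not an allowed follower."""
    errors = []
    kids = list(parent)
    # Pair every child with its immediate following sibling, like
    # following-sibling::*[1] in the generated assertions.
    for child, nxt in zip(kids, kids[1:]):
        allowed = ALLOWED_NEXT.get(child.tag)
        if allowed is not None and nxt.tag not in allowed:
            errors.append(f"{child.tag} cannot be followed by {nxt.tag}")
    return errors

good = ET.fromstring("<Address><StreetOrPOBox/><Suburb/><Postcode/></Address>")
bad = ET.fromstring("<Address><Postcode/><StreetOrPOBox/></Address>")
print(follower_errors(good))
print(follower_errors(bad))
```

A last child always passes, which corresponds to the not(following-sibling::*) branch of the assertions.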

Main Loop

Here is the start of the named template

	<xsl:template name="generate-following-elements-checking-rule">

In this we first select all the element declarations or references in the XSD schema which are particles in a content model. Remember that we have pre-processed the schema modules into a single file so we don’t need to worry about import, include and global complexType declarations. And we are not supporting some features at this stage, such as wildcards, substitution groups and dynamic typing, which simplifies our life quite a lot, though unfortunately not entirely.

	<!--  For every use of an element in any content model -->
	<xsl:for-each select="//xs:schema/xs:element//xs:element[not(parent::xs:all)]">
		<!-- Sort them so that local declarations come before globals, and so that deep path
		declarations come before shallow ones -->
		<xsl:sort select="count(ancestor::xs:element)" order="descending" />

Note that we are not worrying about elements in an xs:all group, because these have no partial order constraints that are not tested by the patterns for allowed elements and required elements. We sort our particles longest first so that local declarations are tested before global ones.

Now in this scope we make some convenience variables:

		<!--  Store the name of the parent element -->
		<xsl:variable name="parent-element-name" select="ancestor::xs:element[1]/@name"/>
		<xsl:variable name="parent-element" select="ancestor::xs:element[1]"/>
		<!--  Store the context path -->
		<xsl:variable name="path-to-parent">
			<xsl:for-each select="ancestor::xs:element"
                            ><xsl:value-of select="@name"/>/</xsl:for-each>
		</xsl:variable>

Handling repeating choice elements

Next, we handle a special case. Probably there are more special cases like this, and identifying them would help trim the output Schematron schema and reduce redundant messages.

This is the common case of (a | b | c)*, a single repeating choice group. As with xs:all, there is no need to generate rules for these (though they could be generated).



	<xsl:choose>
	<!--  Handle special case -->
	<xsl:when test="parent::xs:choice
		[@maxOccurs='unbounded' or @maxOccurs > 1 ]
		[parent::xs:complexType
			[count(xs:choice)=1]
			[count(xs:sequence)=0]
		or parent::xs:element
			[count(xs:choice)=1]
			[count(xs:sequence)=0]]
		[count(child::xs:choice)=0]
		[count(child::xs:sequence)=0]">
		<!--  If the parent is a repeating choice element and its parents only have that choice,
			and that choice element only has element particles for children
			then we can treat it as a special case: it has no extra positional constraints that the
			presence constraints don't catch. -->

		<!--  only generate the rule when we come to the first subelement -->
		<xsl:if test="not(preceding-sibling::xs:element)">
			<xsl:comment> No sequence constraints for element <xsl:value-of select="$parent-element-name"/>.</xsl:comment>
		</xsl:if>
		</xsl:when>

Identify followers

Now comes the heart of the matter. We want to identify various kinds of followers, each in variables containing the sequence of possible elements; we use various XPaths to locate the possible elements and put them into variables. This is fraught with error!

At the end, we collect them into a variable followers which hopefully has all the elements we need.

Remember that we are looping through all the element particles in all the content models (except for xs:all and (a | b | c)* models) one by one.

The first variable repeating-cousins holds all the element particles which belong to the same parent as the current element, but which have some kind of repetition anywhere between them and the parent element: it could be on the element itself or on a parent sequence or choice. Any of these elements can follow our candidate element, by partial order.

The second variable subsequent-cousins traverses up the document tree-of-nodes from our candidate element and finds every time there is a sequence element: all elements that are direct particles are selected.

The third variable subsequent-nephews is a more elaborate version of this: it selects all the descendant element particles of following groups in sequences.

		<!--  Handle the normal case -->
		<xsl:otherwise>

		<xsl:variable name="repeating-cousins"
		    select="$parent-element//*
			[@maxOccurs='unbounded' or @maxOccurs > 1]
			[.//*=current() or .=current()]
			/descendant-or-self::xs:element
				[ancestor::xs:element[1] is $parent-element]" />

		<xsl:variable name="subsequent-cousins"
			select="ancestor-or-self::*
				[parent::xs:sequence]
				[ancestor::xs:element[1] is $parent-element]
				/following-sibling::xs:element "/>

		<xsl:variable name="subsequent-nephews"
			select="ancestor-or-self::*
				[parent::xs:sequence]
				[ancestor::xs:element[1] is $parent-element]
				/following-sibling::*//xs:element "/>			

		<xsl:variable name="followers"
			select="$repeating-cousins | $subsequent-cousins | $subsequent-nephews" />

Now we are set up to generate our rules.

Note: I suspect that people would expect the code to work by generating a transitive closure for the reachable following elements of each particle, finding the possible immediately following sibling elements, then finding their possible immediately following sibling elements, and so on repeatedly. But recursion in this situation seems to me to be prone to exploding (in time, if not in memory) based on some other recent work I was doing on XML schema re-factoring. However, the method above (if it is correct!) uses no recursion and may be better for that reason.

Rules for elements with followers

The odd use of concat() in the context attribute is just to cope with some element particles being locally declared and others being globally declared.

The code here is not difficult. There is a little xsl:if section to customize the assertion text when there is only one possible follower.

    <!--  Make a rule using the current context path -->
		<sch:rule context="{concat($path-to-parent, (@name | @ref))}">
		  <!--  select  all the elements that are under any choice or sequence group which allows
		  	repetition and has the current element under it-->

		<xsl:if test=" $followers " >
		  	    <sch:assert>
		  	    	<xsl:attribute name="test">
		  	    	<xsl:text>not (following-sibling::*)  or (</xsl:text>
				<xsl:for-each select=" $followers ">
				    <xsl:choose>
				    	<xsl:when test="@name"
						>following-sibling::*[1][name() ='<xsl:value-of select="@name  " />']</xsl:when>
				    	<xsl:when test="@ref"
						>following-sibling::*[1][name() ='<xsl:value-of select="@ref  " />']</xsl:when>
				    </xsl:choose>	

				 <xsl:if test="position()!=last()"> or </xsl:if>
				</xsl:for-each>
				<xsl:text>)</xsl:text>
				  </xsl:attribute>
			When in a  "<xsl:value-of select=" $parent-element-name" />" element,
			the element "<xsl:value-of select="concat(@ref, @name)" />" can only be followed
				<xsl:if test="count( $followers ) != 1">(perhaps with other elements intervening)</xsl:if>
				by the following elements:
				<xsl:for-each select=" $followers">
					<xsl:value-of select="@name | @ref" />
					<xsl:if test="position()!=last()">, </xsl:if>
				</xsl:for-each>
			</sch:assert>
		</xsl:if>

Rules for elements with no followers

Finally, we handle the case of elements at the end of the content model. This is the case where there are no elements in the follower set.

		<xsl:if test=" not( $followers ) ">
			<sch:assert test="not(following-sibling::*)">
			When in a "<xsl:value-of select=" $parent-element-name" />" element,
			the element "<xsl:value-of select="concat(@ref, @name)"/>" should not be
			followed by any other element.
			</sch:assert>
		</xsl:if>

		</sch:rule>

	  </xsl:otherwise>

	  </xsl:choose>
	</xsl:for-each>
</xsl:template>

(Acute people might be wondering whether we need any rules to test that an element must start with a particular element. When the same element can appear more than once in a content model, that might indeed be useful. When it only appears once, and is required, the required element rules will report it. Where it is optional, if it goes in the wrong place these partial order rules should catch it. So I am not sure it is an important case at this grain, for us: we are not really making any effort to handle multiple particles, though certainly I expect most patterns we have used so far will cope with them. But it remains another issue to audit!)

Rick Jelliffe

What we want to do is to have a Schematron pattern that just checks a very specific thing: when the use in a document of one element requires that another element immediately follows it.

Actually, I am skipping over a stage here, because this code is quite small, fun and instructive. Which is perhaps another way of saying that the code we are skipping over (for now) is quite complex. The stage we are skipping over for now has assertions to test partial order (like Topologi’s feasible validation mode and that of Jing, James Clark’s RELAX NG validator): it passes any element which could go after the current element (in its parents), not just the element that can immediately follow it. Having the test for partial order is useful for progressive validation (for example for feasible validation where we have a document that we know is incomplete, but we just want to know if it is OK as far as it goes) but more importantly it lets us divide and conquer our task.

Back to our simple case… In XML Schemas terms, this is when there is an xs:sequence element which contains two consecutive xs:element particles, with occurrence constraints set so that the first cannot repeat while the second is required.

First here is the kind of code we will have in our Schematron schema:

   <sch:pattern id="Required_Immediate_Followers">
      <sch:title>Required Immediate Followers (Simple)</sch:title>

      <sch:rule context="Address/StreetOrPOBox">
         <sch:assert test="following-sibling::*[1][self::Suburb]">
		When in a "Address" element, the element "StreetOrPOBox" should be immediately followed by
		 the element "Suburb". </sch:assert>
      </sch:rule>
     ...
   </sch:pattern>

And here is the beta XSLT code to generate it from our (expanded and munged) XML schema:

	<xsl:template name="generate-immediate-following-elements-checking-rule">

		
    	<xsl:for-each select="//xs:element
		    	[not(@maxOccurs='unbounded') and not(@maxOccurs > 1) and not(@maxOccurs=0)]
		    	[@minOccurs='unbounded' or not(@minOccurs=0)]
    			[parent::xs:sequence]
    			[following-sibling::*[1]
    				[self::xs:element
    					[@maxOccurs='unbounded' or not(@maxOccurs=0)]
                                        [@minOccurs='unbounded' or not(@minOccurs=0)]]]">
    			 	<!--  Store the name of the parent element -->
		<xsl:variable name="parent-element-name" select="ancestor::xs:element[1]/@name"/>
		<xsl:variable name="parent-element" select="ancestor::xs:element[1]"/>
		<!--  Store the context path -->
		<xsl:variable name="path-to-parent">
			<xsl:for-each select="ancestor::xs:element"><xsl:value-of select="@name"/>/</xsl:for-each>
		</xsl:variable>

		  	<sch:rule context="{concat($path-to-parent, (@name | @ref))}">
		  		<sch:assert diagnostics="unexpected-immediate-follower">
		  			<xsl:attribute name="test">following-sibling::*[1][self::<xsl:value-of
		  			select="concat(following-sibling::*[1]/@name, following-sibling::*[1]/@ref)"/>]
                                       </xsl:attribute>
		  			When in a "<xsl:value-of select=" $parent-element-name" />" element,
                                       the element "<xsl:value-of select="concat(@ref, @name)"/>" should be
                                       immediately followed by  the element  "<xsl:value-of
                                      select="concat(following-sibling::*[1]/@name, following-sibling::*[1]/@ref)"/>".
		  		</sch:assert>
		  	</sch:rule>

    	</xsl:for-each>

   </xsl:template>

One thing to note is the variable path-to-parent: we will see this used again later. It allows us to have local declarations as deep as we need. Another thing to note is that whenever we test the XML Schemas attributes maxOccurs and minOccurs we first have to do a string test for “unbounded” (or a test using number()) because they have a union data type allowing numbers and “unbounded”.

Looking at this code I see an immediate potential flaw: in XPath 1.0 you would only need to check the maxOccurs and minOccurs attributes for numeric values: the tests would gracefully fail if “unbounded” was used in the original schema. However, XPath 2.0 will raise an error when it tries to convert “unbounded” to a number, so we put the test for string first (the attribute value will be first tested as a string, then as a number). This relies on short-circuiting: the success of the first test means the second test is not evaluated. But, oh dear, short-circuiting is not guaranteed in XPath 2.0 (it is XPath 1.0 behaviour). So I will have to make these tests into little if ... then ... expressions. This is one place XSLT 2.0 really gets it wrong: it should add the short-circuiting constraint because it makes life sooo much easier for programmers. I am enjoying exploring XSLT 2, but this thing is just dumb and un-idiomatic. If it ain’t broke don’t fix it, and so on. (Having said all that, SAXON acts the way I want here, and short-circuits or at least does not freak out. Keen readers: please let me know if my understanding is wrong here!)
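A minimal sketch of what such an if ... then ... else test might look like, assuming an XSLT 2.0 processor. This stylesheet is illustrative only and is not part of the generator above: it just lists the names of the element particles that may repeat. XPath 2.0 does guarantee that errors in the branch of a conditional that is not selected are ignored, so the cast to xs:integer is never applied to “unbounded”.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <!-- Illustrative only: print the name of every element particle that can repeat -->
  <xsl:template match="/">
    <!-- The string test is the condition; the numeric cast lives in the else
         branch, so it is only evaluated when the value is not 'unbounded'.
         If maxOccurs is absent, the else branch yields an empty sequence,
         which the predicate treats as false. -->
    <xsl:for-each select="//xs:element
        [if (@maxOccurs = 'unbounded')
         then true()
         else xs:integer(@maxOccurs) gt 1]">
      <xsl:value-of select="(@name, @ref)[1]"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The same shape should work for the minOccurs tests. Where no cast is wanted at all, number(@maxOccurs) > 1 is another option, since number() returns NaN rather than raising an error.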

This simple test actually handles a lot of the required constraints in content models, and obviously it can be improved on: for example, when the first element can repeat, the assertion needs to be broadened to allow it to follow itself. Or what if the second particle is another sequence, or a choice? Or what if the second particle is optional? And what if the same particle appears several times in the content model? (See my initial article on this, Converting Content Models to Schematron, for some ideas.)

However, it does not generate false negatives, which is what we want as we create our finer sieve.
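As a sketch of the first improvement mentioned above: if StreetOrPOBox were allowed to repeat (an assumption made purely for illustration, it is not in the example schema), the generated rule might be broadened so that the element may also be followed by itself. The message wording is hypothetical.

```xml
<sch:rule context="Address/StreetOrPOBox"
          xmlns:sch="http://purl.oclc.org/dsdl/schematron">
   <!-- The immediate follower may be the same element again, or the required next element -->
   <sch:assert test="following-sibling::*[1][self::StreetOrPOBox or self::Suburb]">
      When in an "Address" element, the element "StreetOrPOBox" should be immediately
      followed by another "StreetOrPOBox" or by the element "Suburb".
   </sch:assert>
</sch:rule>
```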

Rick Jelliffe

We can improve on the diagnostics given by the rules in the previous article in this series, Progressive validation for complex content models.

Diagnosing Similar Names

One of the most common typos is simply to make a mistake in upper-case/lower-case. We can generate Schematron code to check this:

<sch:rule context="*[upper-case(local-name())=upper-case('Address')]">
         <sch:report test="true()">The unexpected element "<sch:name/>" has been used,
            which is close to an element in the schema: the element "Address".
	</sch:report>
 </sch:rule>

And here is the XSLT for generating those Schematron rules:

	<xsl:for-each select="//xs:element[@name]">
		<xsl:sort select="@name"/>
		<xsl:variable name="theLocalName" select="replace( @name, '^(.*):(.*)', '$2' )" />
		<xsl:if test="string-length( $theLocalName ) > 0">
			<sch:rule context=
                             "{concat(
                                  '*[upper-case(local-name())=upper-case(&quot;',
                                  $theLocalName,
                                  '&quot;)]')}">
				<sch:report test="true()" role="note"
                                >The unexpected element "<sch:name/>" has been used, which is close to an
				element in the schema: the element "<xsl:value-of select="@name"/>"
                                <xsl:if test="contains(@name, ':')"> in the
				{<xsl:value-of select="ancestor::xs:schema/@targetNamespace"/>} namespace</xsl:if>.
				</sch:report>
			</sch:rule>
		</xsl:if>
	</xsl:for-each>

This code actually catches two problems: have you made an upper-/lower-case typo, or have you used an element with a name from the current namespace but in a different namespace?

Actually, the code as it is will generate a false positive if the same element name is used in multiple namespaces. So I will give it a role attribute of “Note” (as in Note, Caution, Warning). The role attribute lets you know what function a particular assertion plays in its rule or pattern.

These generated rules get put in the pattern that checks for typos, after the checks for defined names, but before the wildcard catch-all entry at the end: this way elements that have correct names and namespaces are dealt with before these rules, and any names that have other problems get dealt with by the default. In Schematron, a schema is made from patterns: each pattern contains rules, and each rule contains assertions (assert or report elements): every assertion in a rule is tested in the context (an XPath that may match nodes of interest from the document) provided by the rule; the rules however form a case statement, so that if some node matches one rule it won’t be tested by a subsequent rule in the same pattern.
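To make the ordering concrete, here is a rough sketch of how such a pattern might be laid out. The pattern id and the catch-all message are made up for illustration, and upper-case() assumes an XPath 2.0 query binding.

```xml
<sch:pattern id="Typos" xmlns:sch="http://purl.oclc.org/dsdl/schematron">

   <!-- 1. Correctly named elements are swallowed here and go no further -->
   <sch:rule context="Address">
      <sch:assert test="true()">The element name "<sch:name/>" is defined.</sch:assert>
   </sch:rule>

   <!-- 2. Near misses: only reached by elements that rule 1 did not match -->
   <sch:rule context="*[upper-case(local-name())=upper-case('Address')]">
      <sch:report test="true()" role="note">The unexpected element "<sch:name/>" has been used,
         which is close to an element in the schema: the element "Address".</sch:report>
   </sch:rule>

   <!-- 3. Wildcard catch-all: only reached by elements not matched above -->
   <sch:rule context="*">
      <sch:report test="true()">The element "<sch:name/>" is not defined in the schema.</sch:report>
   </sch:rule>

</sch:pattern>
```

Because the rules act as a case statement, an element named "address" falls through the first rule, is reported by the second, and never reaches the catch-all.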

Towards terser, more declarative schemas

It is almost axiomatic that automatically generated code is ugly and unfriendly. Look at compiler generators for example. Of course, getting consistent code that does the same thing many times is why you use a code generator like Schematron in the first place rather than writing the XSLT yourself, in many cases.

But it is certainly possible to make the code more friendly and more declarative. In Converting Schematron to XML Schemas I showed how to use abstract rules to provide extra declarative information so that there is enough information to convert back to a kind of W3C XML Schema. It doesn’t go so far, but the idea is that abstract rules (and abstract patterns, together with the role attribute) provide the abstraction for grouping assertions and representing types.

I won’t go into the code, it is trivial, but the idea is that there are quite a few rules or assertions that don’t have any dynamic content (sometimes it is handled by the diagnostic element, other times we don’t expect the rule to ever generate messages, see Expressing untested and untestable constraints in Schematron) and we can use abstract patterns to make things much more declarative, readable and terse.

Here is an example, for the rules that swallow element names that are defined in the current namespace:

      <sch:rule id="DefinedElement" abstract="true">
         <sch:assert test="true()">The element name "<sch:name/>" is defined.</sch:assert>
      </sch:rule>

      <sch:rule context="Address">
         <sch:extends rule="DefinedElement"/>
      </sch:rule>

      <sch:rule context="AgeNextBirthday">
         <sch:extends rule="DefinedElement"/>
      </sch:rule>

And here is an example for detecting various kinds of text content:

	<sch:rule abstract="true" id="NoDataContent-ns1">
         <sch:assert test="string-length(normalize-space(string-join(text(), ''))) = 0"
                     diagnostics="d1">Element "<sch:name/>" should have no text content.</sch:assert>
      </sch:rule> 

      <sch:rule abstract="true" id="NoElementContent-ns1">
         <sch:assert test="count(*|processing-instruction()|comment()) = 0" diagnostics="d1
         ">Element "<sch:name/>" should be completely empty (no XML comments, PIs, or elements).</sch:assert>
      </sch:rule>

      <sch:rule abstract="true" id="NoContents-ns1">
         <sch:extends rule="NoDataContent-ns1"/>
         <sch:extends rule="NoElementContent-ns1"/>
         <sch:assert test="count(processing-instruction()|comment()) = 0" diagnostics="d1"
                >Element "<sch:name/>" should be completely empty (no XML comments, PIs).</sch:assert>
      </sch:rule>

      <sch:rule context="BestTime">
         <sch:extends rule="NoElementContent-ns1"/>
      </sch:rule>

      <sch:rule context="Gender">
         <sch:extends rule="NoDataContent-ns1"/>
      </sch:rule>

      <sch:rule context="Female">
         <sch:extends rule="NoContents-ns1"/>
      </sch:rule>

      <sch:rule context="Male">
         <sch:extends rule="NoContents-ns1"/>
      </sch:rule>

Much easier to read than having all those assertions expanded!

David A. Chappell

I just published part 2 of an article exploring the “Next Generation Grid Enabled SOA”. This one is sub-titled “Not Your MOM’s Bus”.

Abstract: In our previous article we discussed how SOA grids can be used to break the convention of stateless-only services for scalability and high availability (HA) by allowing stateful conversations to occur across multiple service requests, whether between disparate service boundaries or load-balanced groups of cloned service instances.

In this article we will challenge traditional applications of message-oriented middleware (MOM) for achieving high levels of quality of service (QoS) when sharing data between services in an enterprise service bus (ESB). We will further compare and contrast a state-based, in-memory storage and notification model, and investigate the intelligent co-location of processing logic with or near its grid data in large payload scenarios. Finally, we will also explain when to substitute an SOA Grid for existing MOM technologies as driven by the following question: “If you have an SOA grid that can reliably hold application state data and the necessary systems can access it, why continue to utilize conventional messaging?”

Cheers,
Dave

Rick Jelliffe

One of the really neat things about the XML specification is not just that it makes its design goals explicit (I gave a twist to this idea in the Schematron standard by mentioning various non-goals too) but that the goals were really well chosen.

A decade ago, Tim Bray wrote up his Annotated version of the XML spec, which includes some hypertext comments to the Goals section.

Recently, I have heard several times people quoting the XML goals to support various opinions on what makes a good or bad markup language (schema). In particular, goal #10 Terseness is of minimal importance gets used to claim that abbreviated element names go against the spirit of XML (a blithe spirit indeed). (See here for example.)

But if we look at the XML Spec, we see that these are not general goals for XML documents to follow, but goals for the committee designing XML the technology: they are explicitly design goals. Tim’s comments are useful here, on goal #10 he writes

The historical reason for this goal is that the complexity and difficulty of SGML was greatly increased by its use of minimization, i.e. the omission of pieces of markup, in the interest of terseness. In the case of XML, whenever there was a conflict between conciseness and clarity, clarity won.

I have always attributed the goals to Jon Bosak. Tim mentions that “Jon’s stewardship of the XML process has been marked by a combination of deft political maneuvering with steadfast insistence on the principle of doing things based on principle, not expediency”, where I think “principle” requires having clear goals and pursuing them. (Regular readers of this blog might see that my Reasonable Principles for Reviewing Open XML and other Standards follows this line. If you get hold of the Standards Australia comments on DIS29500 ballot, you can see that most of them try to state the general principle behind the specific problem.)

But Dave Hollander and Michael Sperberg-McQueen mention how the goals were the foundation for the XML design effort too. The goals were a fait accompli by the XML ERB by the time the larger XML WG formed (another good thing about Jon Bosak: he welcomed all sorts of stakeholder involvement) but I don’t recall any of us on the WG (which would now be called an Interest Group, not to be confused with the current XML WG which took up from the old ERB) ever complaining about the goals.

Alice through the Looking Glass

Looking at the goals (and see Tim’s comments if you don’t trust mine) you can see that most of the goals are specific responses to problems either with SGML or with the SGML process at ISO then. (ISO standards were supposed to have 10 year reviews which would be an opportunity for changes to be addressed, outside the ordinary maintenance process. But some influential and vital members of the ISO group had been committed to keeping SGML unchanged for as long as possible, and many of the other members who wanted change wanted changes that would support technologies such as ISO HyTime better: these would be changes that made SGML more complicated and variegated rather than simpler, to the frustration of all.)

1. XML shall be straightforwardly usable over the Internet.

SGML had a particular issue that it was, by design, retargetable. Before Unicode and URLs, every different system had different character sets and different ways of locating files. So SGML provided a mechanism for labelling that an entity (resource) would need some system-specific fix in order to be useful, and a mechanism for naming entities regardless of their location (PUBLIC identifiers).

Because of this goal, SDATA entities were removed from SGML in making XML, as was the use of unresolved entities (entities with PUBLIC identifiers but no SYSTEM identifiers). It was unfeasible to expect users to fix documents to suit their local systems: that is geekstuff. The use of Unicode and URLs was a no-brainer from this goal.

2. XML shall support a wide variety of applications.

While some people had been using SGML for non-publishing uses (Dave Peterson at MIT for example had been using it for numerical data from 1985 IIRC) its complexity and strangeness made it difficult for, in particular, people from the database world. Now, as it turns out, their problems can fruitfully be solved by treating them as publishing applications. But this has been a success of XML and HTML, not SGML per se.

3. XML shall be compatible with SGML.

In fact, ISO 8879 was changed to allow this. In particular, to allow documents with no DTDs, hex numeric character references (I had tried to get them introduced in my successful 1996 Corrigendum to SGML, but horsetraded them away to get support for the main requirement, to support CJK characters better) and the empty-element form <x/>.

4. It shall be easy to write programs which process XML documents.

SGML parsers were a pain to write. An SGML processor was really a compiler compiler where you could change delimiters, keywords and a whole lot of different behaviours. Note that process here is a defined term: an XML processor is the parser and support utilities. This goal does not state that it is against XML’s goals to write complicated programs that use XML data!

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

SGML had an ancillary document, the SGML declaration, which told you which features a particular document needed. In theory, you could then look up the SGML system declaration for an application and see whether it matched. In fact, XML can be regarded as largely a particular SGML declaration, superseding the default Reference Concrete Syntax defined by ISO 8879:1986 (and taking on board the Extended Reference Concrete Syntax proposals which I and the CJK DOCP group were promoting).

Now, in fact, there are two big optional features in XML: DTDs and non-UTF-* character encodings. Many early home-made XML parsers did not support DTDs or ignored them, and many supported only limited numbers of character encodings.

6. XML documents should be human-legible and reasonably clear.

As Tim notes, this is a goal which blocks off any attempt to allow binary data and non-graphical characters in XML. Text is king.

Before XML, I organized an effort to make up a set of rules for the Unicode characters that could be used in names in markup: this Native Language Markup list was part of the Extended Reference Concrete Syntax, and was adopted and improved by XML 1.0 (and downgraded in XML 1.1). Out of this effort came a strong belief that XML should not contain non-graphical or control characters: this ended up being reworked into a W3C and Unicode Consortium note: Unicode in XML and other Markup Languages.

But the issue crops up periodically. Indeed, it is one area where I think OOXML goes seriously wrong: in a few places it provides a mechanism for circumventing XML’s character repertoire restrictions. I think the idea that just because someone generated an automatic name and used the backspace character as part of it, this should be regarded as acceptable practice in the standard is completely bogus. Several National Bodies have commented on it: I hope ECMA will have the good sense to remove it or severely deprecate it at the least. For example it is clearly a security hole to allow backspace in names, where the visible name may be coded differently than its readers expect: a kind of spoofing.

7. The XML design should be prepared quickly.

SGML’s “10 year review” had not even really started properly after 10 years. In fact, XML was the 10 year review of SGML!

8. The design of XML shall be formal and concise.

SGML has attracted criticism that it did not use academic formalisms, and was difficult to characterize with formalisms. I don’t know why this isn’t a criticism of the formalisms just as much: cart before the horse. Anyway, XML being simpler is much more friendly to simple theoretical formalisms, and consequently easier to write parsers for using compiler compilers. (In a sense, XML represents a move to unbundle the markup language from the compiler compiler technology. In my idle moments, I wonder whether compiler compiler systems would have been more capable of handling SGML if the SGML spec had been freely available on the internet in PDF or whatever: the lack of an open standard for SGML meant that the academic/private-hacker community (apart from James Clark) did not connect with the standard or its challenges.)

9. XML documents shall be easy to create.

Tim wrote in 1998

The main goal was in fact to design XML in such a way that it would be tractable to design and build XML authoring systems. Our success in meeting this design goal remains to be established in the marketplace.

In 2008, the success is completely established: it is difficult to find anything anywhere which doesn’t use XML, even when it is a mad choice!

10. Terseness in XML markup is of minimal importance.

SGML was designed with great attention to the requirements of users: i.e. typists. Minimizing the number of keystrokes it took to mark up a raw text file was a large part of the economic value proposition of SGML. SGML allowed you to leave off many delimiters, omit many tags, and gave many kinds of shortcuts so that you could just use simple keyboard symbols instead of explicit tags.

It is tempting to think of this as an old-fashioned concern which we, in our age of RIAs and off-shored outsourcing, don’t need to worry about. But what XML did was, in fact, to cast adrift the users of Wiki-like markup into a standards-free world, which has incredibly harmed the adoption of Wiki-like markup. And when we look at the current upheaval in the HTML 5 discussions, a central meme is that XML’s restricted syntax is simply inappropriate for vanilla HTML. (For an alternative, see my ECS.)

This goal #10 has been the cause for much of XML’s success: with a stroke it allowed many SGML features to be removed without much fuss: DATATAG, SHORTREF, OMITTAG, SHORTTAG. Coupled with this goal was the realization that to a major extent, HTTP compression was the correct layer for reducing the transmission size of documents, rather than XML language features. (Of course, it is not true that terseness is of no importance in language syntax standards: the prefix mechanism in XML Namespaces is a terseness mechanism after all!) And the removal of these features meant that the DTD was no longer necessary, a big win which many people had been seeking.

But to treat this design goal as somehow indicating any policy about how long a name in a schema should be goes beyond the intent of those goals, at least as I ever understood it. The goal of Native Language Markup was to allow people to mark up documents using their customary names and symbols. This is different to the goal of literate programming, which is where I think people are getting confused.

In fact, what we are seeing with XML is that for international standards, and in nations from the Indo-European language groups, or with English as an official language, or (such as Indonesia) where the simple 26-letter alphabet has been adopted for transcription, schemas do restrict themselves to names from the ASCII repertoire. No surprises there. However, for national and local documents for other languages or scripts, Native Language Markup is a big success, very spectacularly in the Chinese UOF spec and in Murata-san’s schemas for Japanese local governments.

David A. Chappell

- Grid computing will grip the attention of enterprise IT leaders, although given the various concepts of hardware grids, compute grids, and data grids, and different approaches taken by vendors, the definition of grid will be as fuzzy as ESB. This is likely to happen at the end of 2008.

- At least one application in the area of what Gartner calls “eXtreme Transaction Processing” (XTP) will become the poster child for grid computing. (see Gartner Research ID # G00151768 - Massimo Pezzini). This “killer app” for grid computing will most likely be in the financial services industry or the travel industry. Scalable, fault tolerant, grid enabled middle tier caching will be a key component of such applications.

- Event-Driven Architectures (EDA) will finally become a well understood ingredient for achieving realtime insight into business process, business metrics, and business exceptions. New offerings from platform vendors and startups will begin to feverishly compete in this area.

Rick Jelliffe


Sean McGrath’s Master Foo On Structured Documents makes a similar point to my Standardize the jellybeans not the jars, and is worth a read.

However, there is one big problem with open content models, and using generic containers: many automated XML tools only use schema information and not instance information to do their stuff. This is a problem I am facing right at the moment, actually: a customer wants to use Brand X tool which lets you map from controls on a form to elements in a schema, but also wants to use an industry-standard schema which distinguishes items by data values rather than by element names.

For example, the tool would like a document like this:

    <Customer>
        ...
        <homephone>1234</homephone>
        <businessphone>1324</businessphone>
        <fax>123</fax>
        ...
    </Customer>

but the industry standard has

    <Party>
        <Person>
            <PersonTypeCode tc="1">Customer</PersonTypeCode>
            ...
            <Phone>
                <PhoneTypeCode tc="1">Home</PhoneTypeCode>
                <DialNumber>1234</DialNumber>
           </Phone>
            <Phone>
                <PhoneTypeCode tc="2">Business</PhoneTypeCode>
                <DialNumber>1234</DialNumber>
           </Phone>
            <Phone>
                <PhoneTypeCode tc="12">Fax</PhoneTypeCode>
                <DialNumber>1234</DialNumber>
           </Phone>
      </Person>
   </Party>

In the first case the XPath to the fax number is
//Customer/fax

In the second case the XPath is
//Party/Person[PersonTypeCode='Customer']/Phone[PhoneTypeCode/@tc="12"]/DialNumber

This kind of issue is a common problem, and the answer is almost always either to forgo the graphical tools (sometimes the application’s backend can handle more complicated XPaths than the IDE GUI can) or to transform the data in and out so that the application works with data in an optimal form (which requires having a customized schema for the particular application or class of application). In many cases, it seems that the large standard schemas are either “jack of all trades but master of none” or that they really are designed for neutral data interchange, and adopters should expect to have to do some information-preserving transforms in and out.
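To make the "transform in" option concrete, here is a minimal sketch in Python (standard library only) that flattens the industry-standard shape into the shape the tool wants. The mapping table PHONE_NAMES and the function name flatten are my own inventions for illustration, and the sample data is cut down from the example above; a real transform would need the full code list from the standard.

```python
import xml.etree.ElementTree as ET

# Cut-down sample in the industry-standard shape (values are illustrative).
STANDARD = """<Party>
  <Person>
    <PersonTypeCode tc="1">Customer</PersonTypeCode>
    <Phone><PhoneTypeCode tc="1">Home</PhoneTypeCode><DialNumber>1234</DialNumber></Phone>
    <Phone><PhoneTypeCode tc="2">Business</PhoneTypeCode><DialNumber>1324</DialNumber></Phone>
    <Phone><PhoneTypeCode tc="12">Fax</PhoneTypeCode><DialNumber>123</DialNumber></Phone>
  </Person>
</Party>"""

# Hypothetical mapping from phone type codes to the flat element names.
PHONE_NAMES = {"1": "homephone", "2": "businessphone", "12": "fax"}


def flatten(party_xml):
    """Return one flat element per Person, named after its PersonTypeCode text."""
    party = ET.fromstring(party_xml)
    flat_people = []
    for person in party.findall("Person"):
        flat = ET.Element(person.findtext("PersonTypeCode"))
        for phone in person.findall("Phone"):
            code = phone.find("PhoneTypeCode").get("tc")
            name = PHONE_NAMES.get(code)
            if name is None:
                continue  # unknown code: skipped here, a real transform must not lose it
            ET.SubElement(flat, name).text = phone.findtext("DialNumber")
        flat_people.append(flat)
    return flat_people


customer = flatten(STANDARD)[0]
```

After this, the simple path works: customer.findtext("fax") picks out the fax number without any predicate on a type code. Going back out to the standard form is the same mapping run in reverse.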

Either way, Sean’s blog is in the ballpark.

Rick Jelliffe


Noah Mendelsohn on XML-DEV got me thinking again about the obvious dual of the recent blogs here: how to convert Schematron into XSD. Putting aside the natural question of why we might want to do this, here’s my stab at an answer.

Because Schematron is more powerful and more general than XSD, and because it uses different abstractions (phases and patterns rather than a grammar), it is not possible to convert every arbitrary Schematron schema into a useful schema in XSD.

However, it is certainly possible to devise some conventions which allow translation to some extent.

I’ll leave out the abstract declarations, but here is an example for HTML of how it might look. First, let’s treat the different types of complex content as if they were just like facets.

<sch:pattern role="Declarations">
<sch:rule context="xhtml:html"  role="element-declaration">
   <sch:rule extends="container"  role="element-content-type" />
   <sch:rule extends="metadata" role="attribute-group-reference" />
</sch:rule>

<sch:rule context="xhtml:head"  role="element-declaration">
   <sch:rule extends="container"  role="element-content-type" />
   <sch:rule extends="metadata" role="attribute-group-reference" />
</sch:rule>

<sch:rule context="xhtml:meta"  role="element-declaration">
   <sch:rule extends="empty"  role="element-content-type" />
</sch:rule>

<sch:rule context="xhtml:p"  role="element-declaration">
   <sch:rule extends="mixed"  role="element-content-type" />
   <sch:rule extends="metadata" role="attribute-group-reference" />
   <sch:rule extends="css" role="attribute-group-reference" />
</sch:rule>

<sch:rule context="xhtml:span"  role="element-declaration">
   <sch:rule extends="text"    role="element-content-type" /> <!-- for example-->
   <sch:rule extends="css" role="attribute-group-reference" />
</sch:rule>

<sch:rule context="xhtml:img"  role="element-declaration">
   <sch:rule extends="empty"  role="element-content-type" />
   <sch:rule extends="css" role="attribute-group-reference" />
</sch:rule>

<sch:rule context="xhtml:img/@width"  role="attribute-declaration">
   <sch:rule extends="dimension-type"  role="simple-type" />
</sch:rule>

<sch:rule context="xhtml:img/@height"  role="attribute-declaration">
   <sch:rule extends="dimension-type"  role="attribute-simple-type" />
</sch:rule>

</sch:pattern>

This gives us the framework. In this pattern, we use abstract rules for a mix-in style of multiple inheritance.

You should be able to see how each of these can be mechanically converted into partial XSD declarations for elements and attributes, such as

<xsd:element name="html">
   <xsd:complexType mixed="false" >
      <xsd:group ref="open" />
      <xsd:attributeGroup ref="metadata" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="meta">
   <xsd:complexType mixed="false" >
      <xsd:attributeGroup ref="open" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="head">
   <xsd:complexType mixed="false" >
      <xsd:group ref="open" />
      <xsd:attributeGroup ref="metadata" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="p">
   <xsd:complexType mixed="true" >
      <xsd:group ref="open" />
      <xsd:attributeGroup ref="metadata" />
      <xsd:attributeGroup ref="css" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="span">
   <xsd:complexType mixed="true" >
      <xsd:attributeGroup ref="css" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="img">
   <xsd:complexType mixed="false" >
      <xsd:attribute name="height" type="dimension-type" />
      <xsd:attribute name="width" type="dimension-type" />
   </xsd:complexType>
</xsd:element>

So far so good. To get better control of optionality of attributes, we could extend a pattern with a different name, I suppose. But the idea is the same: we use the sch:role attribute to provide the metadata needed to allow an effective transformation.

What about content models? In the version above, any element with subelements just has an “open” content model, presumably declared with some wildcard.

This is where abstract patterns can come in. First, let’s alter the schema we generate to make groups with the same name as the element.

<xsd:element name="html">
   <xsd:complexType mixed="false" >
      <xsd:group ref="html" />
      <xsd:attributeGroup ref="metadata" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="meta">
   <xsd:complexType mixed="false" >
      <xsd:attributeGroup ref="open" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="head">
   <xsd:complexType mixed="false" >
      <xsd:group ref="head" />
      <xsd:attributeGroup ref="metadata" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="p">
   <xsd:complexType mixed="true" >
      <xsd:group ref="p" />
      <xsd:attributeGroup ref="metadata" />
      <xsd:attributeGroup ref="css" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="span">
   <xsd:complexType mixed="true" >
      <xsd:attributeGroup ref="css" />
   </xsd:complexType>
</xsd:element>

<xsd:element name="img">
   <xsd:complexType mixed="false" >
      <xsd:attribute name="height" type="dimension-type" />
      <xsd:attribute name="width" type="dimension-type" />
   </xsd:complexType>
</xsd:element>

We could have:

<sch:pattern is-a="container"   role="group-declaration">
  <sch:param  name="context" value="xhtml:html"/>
  <sch:param  name="content-model" value="( xhtml:head?, xhtml:body )"/>
  <sch:param  name="required-children" value="xhtml:body"/>
  <sch:param  name="optional-children" value="xhtml:head"/>
</sch:pattern>

<sch:pattern is-a="container"   role="group-declaration">
  <sch:param  name="context" value="xhtml:head"/>
  <sch:param  name="content-model" value="( xhtml:title, xhtml:meta*, (xhtml:style | xhtml:script)* )" />
  <sch:param  name="required-children" value="xhtml:title"/>
  <sch:param  name="optional-children" value="xhtml:meta | xhtml:style | xhtml:script"/>
</sch:pattern>

<sch:pattern is-a="container"   role="group-declaration">
  <sch:param  name="context" value="xhtml:p"/>
  <sch:param  name="content-model" value="( i | b | span | img )* "/>
  <sch:param  name="required-children" value=""/>
  <sch:param  name="optional-children" value="i | b | span | img"/>
</sch:pattern>

The purpose of abstract patterns is to allow patterns to be parameterized to bring out and name the parts of the patterns of interest in the schema-developer’s world-view. Create your own schema language!

In this case, we have three uses of abstract patterns, and the role attribute tells our notional converter program that we want to generate an xsd:group from these. (We are giving the element content model in the conventional syntax, so some converter would have to translate between syntaxes, but that is just mechanical.) The final two parameters give (actually repeat) the information in the content model in a way that would be more conducive for use in an XPath.

So our convert program would take these and generate

<xsd:group name="html">
   <xsd:sequence>
      <xsd:element ref="head" minOccurs="0" />
      <xsd:element ref="body" />
  </xsd:sequence>
</xsd:group>

and so on. Of course, there would be many other approaches possible. But you should be able to see from this example how Schematron can, in fact, be highly declarative, if that is what you want.
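As a rough illustration of that "just mechanical" translation step, here is a sketch in Python (standard library only). It only copes with a flat, comma-separated sequence whose particles carry an optional ?, * or + suffix, and it drops the namespace prefix to get the ref name. The function name sequence_to_group and the OCCURS table are my own; nested groups and choices are deliberately rejected rather than guessed at.

```python
import re
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD)

# Occurrence suffix in the conventional syntax -> (minOccurs, maxOccurs).
OCCURS = {"": ("1", "1"), "?": ("0", "1"), "*": ("0", "unbounded"), "+": ("1", "unbounded")}


def sequence_to_group(name, content_model):
    """Build an xsd:group from a flat sequence such as '( xhtml:head?, xhtml:body )'."""
    inner = content_model.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    group = ET.Element(f"{{{XSD}}}group", {"name": name})
    sequence = ET.SubElement(group, f"{{{XSD}}}sequence")
    for particle in inner.split(","):
        match = re.fullmatch(r"\s*(?:\w+:)?(\w+)([?*+]?)\s*", particle)
        if match is None:
            raise ValueError(f"only flat sequences are handled, got: {particle!r}")
        local_name, suffix = match.groups()
        min_occurs, max_occurs = OCCURS[suffix]
        attrs = {"ref": local_name}
        if min_occurs != "1":  # 1 is the XSD default, so leave it implicit
            attrs["minOccurs"] = min_occurs
        if max_occurs != "1":
            attrs["maxOccurs"] = max_occurs
        ET.SubElement(sequence, f"{{{XSD}}}element", attrs)
    return group


html_group = sequence_to_group("html", "( xhtml:head?, xhtml:body )")
```

Calling ET.tostring(html_group) serializes the result. The head content model above, with its nested choice, raises ValueError in this sketch, which is where a real converter would need a proper parser.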

Rick Jelliffe


Brian Reid, the old fuddy-duddy fighting back, turns out to be the Brian Reid, of Scribe fame. Scribe was an early word-processing application that was one of the first practical and public tools for showing that descriptive markup, rather than procedural markup, was workable.

The story is that IBM’s Charles Goldfarb was due to make his big presentation on their GML system to a conference in Switzerland in 1981, but Reid had a paper at the same conference which turned out to present a lot of the same material: stolen thunder! GML morphed into SGML and now XML, while Scribe influenced the direction of word processors by showing the practicality of styles (think CSS!).

Reid revisited his 1981 paper at an SGML/XML conference keynote in 1998, which is still online, though large (a 10 MB PPT?). The paper includes some interesting thoughts on why markup is wrong-headed.

Reid’s dispute with Google is quite interesting to me. Last year or so, I was looking around to see whether there were any interesting opportunities in Sydney, and Google contacted me to come in for an interview. When I arrived, it turned out to be a rather odd interview for a programming job which was pretty much unconnected with anything I had been doing for the last 20 years, but the questions would have been good for a recent graduate, sort of like interviewing Donald Duck for a position as an egg, if you know what I mean; the people seemed super nice, but I think there was a bit of mutual mystification as to why I was there. The impression I got was very much of a mono-culture: the founders wanted people like the founders. It seemed that standards were not on the company’s horizon at all.

But perhaps this is a new way to break up monopolies: everyone above 35 goes to one company, all the rest go to another!

Rick Jelliffe


Now we come to the most interesting part: how do we generate Schematron schemas that implement the constraints from an XML Schema? A question that often comes up is whether Schematron is strictly more powerful than XML Schemas or just often so; some academics have offered tentative opinions, and the conclusion I had reached was that it probably was not: for any implementation in Schematron you could probably make a content model so baroque and monstrous that Schematron would not capture some aspect of it. But you would have to try hard.

However, with Schematron using XPath2 or XSLT2 as its query language (rather than the default XSLT1) things are much clearer: I think there is a really simple technique available that captures all the cardinality, optionality, and sequence constraints.

The Regular Expression technique

The technique? Convert a content model into a regular expression; build a string containing, as space-separated tokens, the name of each child element found in the instance; then validate that string against the regular expression! The regular expression language used in XSLT2 allows sequence, choice, cardinality, and repetition, and these are the same as in XSD. You don’t need a special finite-state automaton library if you have the regular expression library. (The special cases of xsi:type and substitution groups are no problem: xsi:type could be handled by another pattern because the element must still be valid against the declared type, while substitution groups can be handled during the prior pre-processing and be long gone by this stage. Nillability is not something I have thought about much: it certainly can be done but I don’t know the impact on the XPaths.)
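Here is the idea sketched in Python rather than XSLT2, just to show how little machinery it needs. The content model is the one for head used elsewhere in this series (title, then any number of meta, then any mix of style and script); the names HEAD_MODEL, names_string and is_valid are mine, and a real converter would generate the regular expression from the XSD rather than hand-write it.

```python
import re


def names_string(child_names):
    """Join child element names into one string, each name followed by a space."""
    return "".join(name + " " for name in child_names)


# Hand-translated regular expression for: title, meta*, (style | script)*
# Each token carries its trailing space so that names cannot run together.
HEAD_MODEL = re.compile(r"(title )(meta )*((style )|(script ))*")


def is_valid(child_names):
    """True when the whole sequence of child names matches the content model."""
    return HEAD_MODEL.fullmatch(names_string(child_names)) is not None
```

So is_valid(["title", "meta", "meta", "script"]) holds, while is_valid(["meta", "title"]) does not. Notice that a failure tells you nothing more than "no match", which is exactly the diagnostics problem.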

So, if we wanted to, we could implement this in our XSD to Schematron converter and say hooray.

But the problem is that even though we could validate all the constraints, we would get lousy diagnostics. What would be the point? I guess it would still be useful as a fallback, as another phase for confidence building and to check if anything had fallen between the cracks of the method we will be using, but it is not so interesting to me that we have it in our plans to implement. If someone else wants to implement it in XSLT and contribute it, it might be a fun and small project!

We could break the regular expression in various interesting ways, however: we could make one version of it that made everything optional, and so implement feasible validation, as found for example in Jing for RELAX NG. (However, there might be ambiguity issues here, so it might not work each time.) Feasible validation is an approach I came up with a couple of years ago, based on the idea that it can be useful to validate only certain constraints: very often you might mark up a document and fill in the metadata at the last stage, so you don’t want validation to fail because of some problem at the start of an element’s children when you are working on subsequent elements. A validator should not dictate a workflow!
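A sketch of the "everything optional" variant, again in Python and again with a hand-written expression for the head content model (title, meta*, (style | script)*). The names FEASIBLE_HEAD and is_feasible are hypothetical, and this is only my reading of feasible validation as described above: missing elements are tolerated, but elements that are present must still be in an allowed order.

```python
import re

# Same model as the strict version, but the required title is now optional.
FEASIBLE_HEAD = re.compile(r"(title )?(meta )*((style )|(script ))*")


def is_feasible(child_names):
    """True when the children could still be completed into a valid head."""
    tokens = "".join(name + " " for name in child_names)
    return FEASIBLE_HEAD.fullmatch(tokens) is not None
```

With this, is_feasible(["meta"]) holds even though the title has not been entered yet, while is_feasible(["meta", "title"]) still fails because no amount of later filling-in will fix the order.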

Validators in editors frequently implement partial validation, where they don’t complain about child elements missing at the end of a content model. This is useful if you are entering the document in element order, but not otherwise.

Now another approach with the regular expression method would be to break it into smaller expressions, for example a string of three tokens with anything allowed before or after: trigrams. Indeed, that is something pretty similar to what we do later, in effect, but not using regular expressions.

A more Schematron-ish way

So what is the Schematron-ish way to approach the problem? Well, it is to concentrate on two things: first, What is the most useful way of expressing and organizing diagnostics to help the user? and second, What is the model of user interaction built into the schema? Actually, in my opinion, you cannot answer the first without answering the second, and the second dictates the first.

Rather than talk theory, I’ll show you the approach and you should be able to figure out what I mean by user interaction and so on, with these use cases.

Use Cases

  • The user wants to check for typos: names that are spelled incorrectly
  • The user wants to check for containment: that elements and attributes belong to the correct parents
  • The user wants to check that all required elements and attributes are present
  • The user wants to check that each element is in the required position

This is another example of progressive validation. It allows the user to systematically find certain kinds of mistakes, and partitions them off. Because Schematron will usually report all the errors it finds anywhere in a document, it has the advantage that it is very easy to see systematic errors, if they are presented together; grammar-based validators often just die at the first error. But assertion-based schemas using paths may generate too many diagnostics, as the same error causes multiple assertions to fail.

So Schematron has a feature called phases. Phases let you group some patterns together and give them a name, and then you can instruct the validator to validate only the patterns in that phase. This allows workflows, progressive validation, incremental markup, transformation checking, variant document types, and so on. Very useful.

Each of these use cases may take one or more patterns to implement; however, we will make a phase for each of them. (Actually, we have gone phase craaazy, which will be in a later posting.) Here is the phase declaration to validate just the typos, for example:

<sch:phase id="phase-typo">

	<sch:active pattern="Element_Name_Typo">

				Pattern for checking for typos in element names.

	</sch:active>

	<sch:active pattern="Attribute_Name_Typo">

				Pattern for checking for typos in attribute names.

	</sch:active>

	<sch:p>This phase has all the patterns for checking typos in names.</sch:p>

</sch:phase>

As you can see, we are not validating using a state machine or similar grammar system at all.

The patterns

Here are the patterns; we have factored out the guts to make the commonality between this boilerplate more obvious.

 <!-- pattern 5: Element name typos Elements	-->

<xsl:comment>

			============================================================

		                          ELEMENT NAMES 

			============================================================

</xsl:comment>

<sch:pattern id="Element_Name_Typo">

	<sch:title>Typos in Element names</sch:title>

	<xsl:call-template name="generate-elements-typo-checking-rule"/>

</sch:pattern>

<sch:pattern id="Element_Name_Expected">

	<sch:title>Expected in Element names</sch:title>

	<xsl:call-template name="generate-elements-expected-checking-rule"/>

</sch:pattern>

<sch:pattern id="Element_Name_Required">

	<sch:title>Required in Element names</sch:title>

	<xsl:call-template name="generate-elements-required-checking-rule"/>

</sch:pattern>

<!-- pattern 6: Attributes name typos Attributes	-->

<xsl:comment>

			============================================================

	                         Attributes NAMES

			============================================================

</xsl:comment>

<sch:pattern id="Attribute_Name_Typo">

	<sch:title>Typos in Attributes names</sch:title>

	<xsl:call-template name="generate-attributes-typo-checking-rule"/>

</sch:pattern>

<sch:pattern id="Attribute_Name_Expected">

	<sch:title>Expected in Attributes names</sch:title>

	<xsl:call-template name="generate-attributes-expected-checking-rule"/>

</sch:pattern>

<sch:pattern id="Attribute_Name_Required">

	<sch:title>Required in Attributes names</sch:title>

	<xsl:call-template name="generate-attributes-required-checking-rule"/>

</sch:pattern>


Typos

The typo patterns are very easy. Here is the one for elements.

<xsl:template name="generate-elements-typo-checking-rule">

	<xsl:for-each select="//xs:element[@name]">

		<xsl:sort select="@name"/>

		<sch:rule context="{@name}">

			<sch:assert test="true()">
			The <sch:name/> element is defined in this schema.</sch:assert>

		</sch:rule>

	</xsl:for-each>

	<sch:rule context="*">

		<sch:report test="true()" diagnostics="typo-element">
		Only elements declared in the schema may be used.</sch:report>

	</sch:rule>

</xsl:template>

In this case, we generate a rule for each element, but with only a vacuous true() assertion test; there is still a useful assertion behind it though, in the assertion text. Elements with typos fall through and are caught by the wildcard test of the last rule.

And finally, here is a simple diagnostics element, to report the miscreant.

	<sch:diagnostic id="typo-element">
		The following element was found <sch:name/>.
	</sch:diagnostic>
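If it helps to see the same logic outside Schematron, here is a sketch in Python (standard library only): collect the declared element names from the XSD, then report every element in the instance whose name is not among them. The function names are mine, and namespaces in the instance are ignored to keep it short.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def declared_names(schema_xml):
    """Names of all xs:element declarations that carry a name attribute."""
    schema = ET.fromstring(schema_xml)
    return {decl.get("name") for decl in schema.iter(XS + "element") if decl.get("name")}


def undeclared_elements(schema_xml, instance_xml):
    """Element names used in the instance but not declared in the schema."""
    known = declared_names(schema_xml)
    instance = ET.fromstring(instance_xml)
    return [elem.tag for elem in instance.iter() if elem.tag not in known]


SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="html"/>
  <xs:element name="head"/>
  <xs:element name="body"/>
</xs:schema>"""

typos = undeclared_elements(SCHEMA, "<html><haed/><body/></html>")
```

Here typos comes out as a list holding just "haed", which plays the part of the wildcard rule plus the diagnostic.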

We’ll continue with handling more of the use cases in another blog.

Rick Jelliffe


A year ago I wrote in this blog a precursor to this series, Converting Content Models to Schematron, in which I outlined one approach. This blog item is an update on that, in particular for special cases; clearing the decks of them leaves us free to look at XML content models. The special cases are:

  • Empty elements
  • Text content (untyped)
  • Element content
  • XSD ALL content models

Empty Elements

Empty elements are easy. (Update: 2007-11-09)

<xsl:template match="xs:element[xs:complexType
                    [not(xs:simpleContent)]
                    [not(@mixed='true')]
                    [not(.//xs:element)]]"        priority="100">
	<sch:rule>
		<xsl:call-template name="generate-element-context"/>
		<xsl:comment>Check Empty Elements: They can't have
			1, text nodes 2, elements 3, comments 4, processing-instructions </xsl:comment>
		<sch:assert test="count(*|processing-instruction()|comment()|text()) = 0" diagnostics="d1">
		Element <sch:name/> should have no content.</sch:assert>
	</sch:rule>
</xsl:template>

Text Elements (Untyped)

Text elements are easy too.

<xsl:template match="xs:element[xs:complexType[xs:simpleContent]]" priority="99">
	<sch:rule>
		<xsl:call-template name="generate-element-context"/>
		<xsl:comment>Check Text Only: They can't have
			1, elements </xsl:comment>
		<sch:assert test="count(*) = 0" diagnostics="d1">
		Element <sch:name/> should have text content and attributes only, but no sub-elements.
		(They may have processing instructions and comments.)</sch:assert>
	</sch:rule>
</xsl:template>

Element Content

For element content elements, we’ll just check that they don’t have text, for this pattern. (We will check whether the elements it has are allowed in a different pattern, in a future blog.)

<xsl:template match="xs:element
                    [xs:complexType[not(@mixed='true')][not(xs:simpleContent)]]" priority="98">
	<sch:rule>
		<xsl:call-template name="generate-element-context"/>
		<xsl:comment>Check None Text found: They can't have
			1, any text content </xsl:comment>
		<sch:assert test="string-length(normalize-space(string-join(text(), ''))) = 0" diagnostics="d1">
		Element <sch:name/> should have no text content.</sch:assert>
	</sch:rule>
</xsl:template>

The ALL Content Model

The ALL content model, in XSD, is a way of saying that all the elements are required (or optional) but they can be in any order. To do this with a grammar runs the risk of a combinatorial explosion. The ALL content model is very straightforward to implement in Schematron, but we have to break it into its component assertions.

First, the ALL content model is closed (we don’t implement wildcards). So we check that the total number of child elements is equal to the sum of the counts of the allowed elements. If the element requires all of A, B and C, then we assert count(A) + count(B) + count(C) = count(*), which is another example of how in Schematron you solve many problems by counting.


<xsl:template match="xs:element[.//xs:all]" priority="90">

<xsl:comment>======= Handle XS:ALL ========</xsl:comment>

<sch:rule>

	<xsl:call-template name="generate-element-context"/>

	<xsl:comment>check allowed elements</xsl:comment>

	<sch:assert  >

		<xsl:attribute name="test">

			<!-- get names of each allowed element -->

			<xsl:for-each select=".//xs:all/xs:element">

				<xsl:text>count(</xsl:text>

				<xsl:value-of select="if (@name) then @name else @ref" />

				<xsl:text>)</xsl:text>

				<xsl:if test="following-sibling::xs:element"> + </xsl:if>

			</xsl:for-each>

			<xsl:text> = count(*)</xsl:text>

		</xsl:attribute>

			The element <xsl:value-of select ="@name"/> can only have the following elements:

		<!-- get names of each allowed element -->

		<xsl:for-each select=".//xs:all/xs:element">

			<xsl:value-of select="if (@name) then @name else @ref" />

			<xsl:if test="following-sibling::xs:element">, </xsl:if>

		</xsl:for-each>.

	</sch:assert>

Next we generate an assertion that each element only occurs with the cardinality of the maxOccurs and minOccurs.

	<xsl:for-each select=".//xs:all/xs:element">

		<xsl:variable name="ancestor-element" select="ancestor::xs:element/@name"/>

		<xsl:variable name="element-name" select="if (@name) then @name else @ref"/>

		<xsl:variable name="MAXOccurs" select="if (@maxOccurs) then @maxOccurs else '1'"/>

		<xsl:variable name="MINOccurs" select="if (@minOccurs) then @minOccurs else '1'"/>

		<xsl:choose>

			<xsl:when test="$MAXOccurs = $MINOccurs">

				<sch:assert diagnostics="{concat('d2-',$ancestor-element,'-',$element-name)}">

					<xsl:attribute name="test">

							count(<xsl:value-of select="$element-name"/>) = <xsl:value-of select="$MAXOccurs"/>

					</xsl:attribute>

						There should be <xsl:value-of select="$MAXOccurs"/> of element <xsl:value-of select="$element-name"/>

				</sch:assert>

			</xsl:when>

			<xsl:otherwise>

				<sch:assert  >

					<xsl:attribute name="test">

							count(<xsl:value-of select="$element-name"/>) &lt;= <xsl:value-of select="$MAXOccurs"/>

					</xsl:attribute>

						There should be at most <xsl:value-of select="$MAXOccurs"/> of element <xsl:value-of select="$element-name"/>

				</sch:assert>

				<sch:assert diagnostics="{concat('d2-',$ancestor-element,'-',$element-name)}">

					<xsl:attribute name="test">

							count(<xsl:value-of select="$element-name"/>) >= <xsl:value-of select="$MINOccurs"/>

					</xsl:attribute>

						There should be at least <xsl:value-of select="$MINOccurs"/> of element <xsl:value-of select="$element-name"/>

				</sch:assert>

			</xsl:otherwise>

		</xsl:choose>

	</xsl:for-each>

</sch:rule>

</xsl:template>

So every element with an ALL type only requires a single rule to implement.
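The counting logic is easy to check by hand, so here is a sketch of it in Python (standard library only). The function name check_all and the shape of the allowed table (name mapped to a minOccurs, maxOccurs pair) are assumptions of mine; the messages echo the assertion texts generated above.

```python
import xml.etree.ElementTree as ET


def check_all(parent, allowed):
    """Return failed-assertion messages for an ALL content model.

    allowed maps each permitted child name to a (minOccurs, maxOccurs) pair.
    """
    problems = []
    counts = {name: len(parent.findall(name)) for name in allowed}
    # Closed model: count(A) + count(B) + ... must equal count(*).
    if sum(counts.values()) != len(list(parent)):
        problems.append("only these elements are allowed: " + ", ".join(allowed))
    for name, (min_occurs, max_occurs) in allowed.items():
        if counts[name] < min_occurs:
            problems.append(f"there should be at least {min_occurs} of element {name}")
        if counts[name] > max_occurs:
            problems.append(f"there should be at most {max_occurs} of element {name}")
    return problems


ALLOWED = {"A": (1, 1), "B": (1, 1), "C": (1, 1)}
good = check_all(ET.fromstring("<r><C/><A/><B/></r>"), ALLOWED)
bad = check_all(ET.fromstring("<r><A/><A/><X/></r>"), ALLOWED)
```

The first document passes in any order. The second fails four assertions at once (the stray X, too many A, no B, no C), which is the kind of complete report you want from progressive validation.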

Now we want to add some more information for better diagnostics, so for each of the count rules we implement

<sch:assert diagnostics="{concat('d2-',$ancestor-element,'-',$element-name)}">

and we generate the corresponding diagnostics to give an actual count of the overpopulation:

 <xsl:for-each select="xs:element[.//xs:all]//xs:all/xs:element">

	<xsl:variable name="ancestor-element" select="ancestor::xs:element/@name"/>

	<xsl:variable name="element-name" select="if (@name) then @name else @ref"/>

	<sch:diagnostic id="{concat('d2-',$ancestor-element,'-',$element-name)}">  elements were found</sch:diagnostic>

</xsl:for-each>

In Schematron, we make a distinction between the assertion text, which is a positive statement of what is true, and diagnostics, which give extra help to humans. Very often people new to Schematron want to put diagnostic messages as the assertion text. (Indeed, some of the programmers working on this project did it, so it is not an obvious thing sometimes.) To get the idea, think about what happens if you want to generate a paper document with the schema printed out, with one bullet point per assertion: the diagnostics information would not make much sense, while usually good assertions would be perfectly readable and useful for domain experts.

Housekeeping

Finally, here are a couple of useful housekeeping elements, to be used in the same pattern as above: these give warnings about which element declarations are actually handled, to prove the converter.

<xsl:template match="xs:element[@ref]" priority="1" >
	<xsl:message>PROGRAMMING ERROR: trying to process an element reference.</xsl:message>
</xsl:template>

<xsl:template match="xs:element" >
	<xsl:message>I don't know how to handle this kind of element declaration yet.</xsl:message>
</xsl:template>

Rick Jelliffe


There are three rules concerning documents with xs:ID and xs:IDREF.

First, they must contain token values, that accord with the XML naming conventions. We check this already as part of the simple type checking. (The empty string is not allowed.)

Second, no attribute of type ID can have the same value as another attribute of type ID.

Third, for every attribute of type IDREF there must be an attribute of type ID with the same value. (There can be multiple IDREFs with the same value, but only one ID with that value.) That is what this entry is about.

Here is how to check IDREFs. First of all, we make a variable collecting the names of all attributes declared with type IDREF. Then we make a variable containing the names of all attributes declared with type ID. Then we make a list with just the distinct ID attribute names, just to make life easier. (There are other ways to do this, of course.)

	<xsl:variable name="idref-list">

		<root>

			<xsl:for-each select="//xs:attribute[@name][@type='xs:IDREF']">

				<xsl:sort select="@name"/>

				<idref><xsl:value-of select="@name"/></idref>

			</xsl:for-each>

		</root>

	</xsl:variable>

	<xsl:variable name="id-list">

		<root>

			<xsl:for-each select="//xs:attribute[@name][@type='xs:ID']">

				<xsl:sort select="@name"/>

				<id><xsl:value-of select="@name"/></id>

			</xsl:for-each>

		</root>

	</xsl:variable>

	<xsl:variable name="id-distinct-list">

		<root>

			<xsl:for-each select="$id-list/root/id">

				<xsl:if test="position() = 1 or . != preceding-sibling::id[1]">

					<id><xsl:value-of select="."/></id>

				</xsl:if>

			</xsl:for-each>

		</root>

	</xsl:variable>

Now that we have all our input data nicely available in variables, generating IDREF rules is easy. For each attribute that can contain an IDREF we check it against each attribute that can contain an ID. (Now this would be better factored out into an abstract rule, but it is easier to read this.)


	<xsl:for-each select="$idref-list/root/idref">

		<xsl:if test="position() = 1 or . != preceding-sibling::idref[1]">

			<sch:rule context="*/@{.}">

				<sch:assert>

					<xsl:attribute name="test">

						<xsl:for-each select="$id-distinct-list/root/id">

							<xsl:text>//@</xsl:text>

							<xsl:value-of select="."/>

							<xsl:text> = . </xsl:text>

							<xsl:if test="position() != last()"> or </xsl:if>

						</xsl:for-each>

					</xsl:attribute>

					Element <sch:name/> 's IDRef hasn't been found. IDRef: <sch:value-of select="."/>.

				</sch:assert>

			</sch:rule>

		</xsl:if>

	</xsl:for-each>

You can get an idea from this how the ID uniqueness checking could be generated. KEY/KEYREF and UNIQUENESS checks in XSD already use XPath, and don’t use types, so they also should be straightforward to integrate.
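For comparison, the check that the generated rules perform can be sketched in a few lines of Python (standard library only). The function name dangling_idrefs is mine, and the attribute names id and target in the sample are hypothetical; as in the generated Schematron, the lists of ID and IDREF attribute names would come from the schema.

```python
import xml.etree.ElementTree as ET


def dangling_idrefs(doc_xml, id_attrs, idref_attrs):
    """Return (element, attribute, value) for each IDREF with no matching ID."""
    root = ET.fromstring(doc_xml)
    # Every value carried by any attribute that the schema types as ID.
    ids = {
        elem.get(attr)
        for elem in root.iter()
        for attr in id_attrs
        if elem.get(attr) is not None
    }
    return [
        (elem.tag, attr, elem.get(attr))
        for elem in root.iter()
        for attr in idref_attrs
        if elem.get(attr) is not None and elem.get(attr) not in ids
    ]


DOC = '<doc><sec id="s1"/><link target="s1"/><link target="s9"/></doc>'
missing = dangling_idrefs(DOC, id_attrs=["id"], idref_attrs=["target"])
```

Only the second link is reported, since s9 matches no ID. Swapping the set for a count of each ID value would give the uniqueness check.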

Jennifer Golbeck


Ok. So perhaps this is not a conspiracy because it’s out in the open, but eBay’s role in keeping feedback ratings artificially high is something worth discussing.

My argument is not about retaliatory feedback, but let’s discuss that briefly. Anyone who has used eBay much knows that feedback retaliation happens. You get treated badly, you leave feedback that says so, and the recipient leaves you bad feedback, sometimes even lying. This is a disincentive for leaving anything negative in the first place. eBay could take steps to make the system more fair, but they don’t. In fact, they have an incentive to leave the system exactly like it is. Retaliation discourages people from leaving bad feedback, and less bad feedback makes the entire marketplace look more trustworthy.

But that could be a bit of me creating a conspiracy, and perhaps eBay has better intentions than the previous paragraph gives them credit for. I considered this a possibility until one of my most recent forays into the depths of their system.

I was sold a counterfeit item on eBay. I paid about $100 for it and when it arrived it was obviously fake. After the seller did not respond to my emails, I filed a claim with PayPal (which is owned by eBay, for those of you who are not familiar). They offer “seller protection” designed to make sure you don’t get ripped off. PayPal sent some messages back and forth and after about a month told me I would need to get the item appraised and send them evidence that it was counterfeit. This can cost hundreds of dollars, and discourages a cheated buyer from proceeding with the process, but let’s allow that it is necessary. On principle, I continued and found someone to certify that I had received a fake. After more than two months of fighting, PayPal finally resolved the dispute on my behalf and sent me a refund. That is all well and good. However, what really got me was the email they sent notifying me of all this:

PayPal has received the item in dispute. A refund will be issued to your
PayPal account within 5 business days.

PayPal regrets any inconvenience you may have experienced.

This claim has been resolved amicably. Please consider this when leaving feedback for this seller.

Thank you for your cooperation.

Sincerely,

Protection Services Department

“Amicably”? What was amicable about this claim? I spent a lot of money and effort trying to get a refund that the seller refused me for months. Why would PayPal tell me to consider this claim amicable when I leave feedback? Well, they have the same incentive as before. A marketplace with no negative feedback looks safer. But none of us want to participate in a system where a seller who regularly sends out counterfeit items is ranked highly, simply because eventually buyers can get their money back by the actions of a third party.

I do not hold out any hope that eBay, which is thriving, will correct its ways. I think it could eventually lead to a third party system (and several have popped up) for creating real and honest feedback about buyers and sellers. I would certainly use such a service - I might even pay for it - because I want to truly know how much to trust people I interact with. I don’t want to be falsely reassured that everything will be ok, even though that seems to be the tactic eBay is betting on for continued success.

Rick Jelliffe

XSD allows you to derive your own simple datatypes by restricting the lexical space or the value space of the type. The rule about derivation by restriction is that everything that is valid against the derived type is also valid against the base type.

And this gives us our method. Remember from the previous blog in this series that we implement a built-in datatype like this

   <sch:rule context="imametadataman">
     <sch:extends rule="xsd-datatype-byte"/>
   </sch:rule>

If we want to say that imametadataman should have a facet of minExclusive of 32, we just implement the facet restriction by adding an assertion:

   <sch:rule context="imametadataman">
     <sch:extends rule="xsd-datatype-byte"/>
     <sch:assert test=". > 32"> The value for <sch:name/> should be greater than 32 </sch:assert>
   </sch:rule>

Type derivation by restriction can be directly implemented by Schematron abstract rules. There is a mismatch in terminology: we restrict the type (in the XSD) by extending the constraints (in Schematron).

Here is some code to give the flavour of how easy it is to handle each type. (The assertion text needs work, and there is lots of scope for beautification, but you should get the idea.)

	<xsl:when test="xs:simpleType/xs:restriction[@base]">
			<xsl:variable name="baseon" select="xs:simpleType/xs:restriction/@base"/>
			<sch:rule>
				<xsl:choose>
					<xsl:when test="self::xs:attribute[parent::xs:schema]">
						<xsl:attribute name="abstract">true</xsl:attribute>
						<xsl:attribute name="id">
							<xsl:choose>
								<!-- attribute has no namespace -->
								<xsl:when test="ancestor::namespace/@uri=''">
									<xsl:value-of select="concat('global_', @name)"/>
								</xsl:when>
								<!-- attribute has namespace (normal case) -->
								<xsl:otherwise>
									<xsl:value-of select="concat('global_', ancestor::namespace/@prefix, '_', @name)"/>
								</xsl:otherwise>
							</xsl:choose>
						</xsl:attribute>
					</xsl:when>
					<xsl:otherwise>
						<xsl:choose>
							<xsl:when test="self::xs:element">
								<xsl:call-template name="generate-element-context"/>
							</xsl:when>
							<xsl:otherwise>
								<xsl:call-template name="generate-attribute-context"/>
							</xsl:otherwise>
						</xsl:choose>
					</xsl:otherwise>
				</xsl:choose>
				<!-- get base value -->
				<!--  FIX THIS: should use namespace URI not prefix! -->
				<xsl:choose>
					<xsl:when test="starts-with($baseon,'xs:') or
									starts-with($baseon,'xsd:') or
									starts-with($baseon,'xsi:')">
						<sch:extends rule="{concat(ancestor::namespace/@prefix, '-xsd-datatype-', substring-after($baseon, ':'))}"/>
					</xsl:when>
					<xsl:when test="contains($baseon,':')">
						<xsl:variable name="prefix"
							select="substring-before($baseon, ':')"/>
						<xsl:variable name="typename"
							select="substring-after($baseon, ':')"/>
						<sch:extends rule="{concat($prefix, '_', $typename)}"/>
					</xsl:when>
					<xsl:otherwise>
						<sch:extends rule="{concat(ancestor::namespace/@prefix, '_', $baseon)}"/>
					</xsl:otherwise>
				</xsl:choose>
				<!-- check the underneath of restriction -->
				<xsl:if test="xs:simpleType/xs:restriction/xs:enumeration">
					<sch:assert>
						<xsl:attribute name="test">
							<xsl:for-each select="xs:simpleType/xs:restriction/xs:enumeration">
								<xsl:text>(. = "</xsl:text>
								<xsl:value-of select="normalize-space(@value)"/>
								<xsl:text>")</xsl:text>
								<xsl:if test="following-sibling::xs:enumeration">
									<xsl:text> or </xsl:text>
								</xsl:if>
							</xsl:for-each>
						</xsl:attribute> The value of <sch:name/> should be one of
						<xsl:for-each select="xs:simpleType/xs:restriction/xs:enumeration">
							<xsl:value-of select="@value"/>
							<xsl:if test="following-sibling::xs:enumeration">
								<xsl:text>, </xsl:text>
							</xsl:if>
						</xsl:for-each>. (It is of type "
						<xsl:value-of select="normalize-space(@name)"/>".)
					</sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:minLength">
					<!-- the facet value is copied into the generated test by the {...} attribute value template -->
					<sch:assert test="string-length(.) &gt;= {xs:simpleType/xs:restriction/xs:minLength/@value}"> A
						simpleType(
						<xsl:value-of select="@name"/>)'s value must not be shorter than
						<xsl:value-of select="xs:simpleType/xs:restriction/xs:minLength/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:maxLength">
					<sch:assert test="string-length(.) &lt;= {xs:simpleType/xs:restriction/xs:maxLength/@value}"> A
						simpleType(
						<xsl:value-of select="@name"/>)'s value must not be longer than
						<xsl:value-of select="xs:simpleType/xs:restriction/xs:maxLength/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:length">
					<sch:assert test="string-length(.) = {xs:simpleType/xs:restriction/xs:length/@value}"> The length of
						this simpleType(
						<xsl:value-of select="@name"/>)'s value must be
						<xsl:value-of select="xs:simpleType/xs:restriction/xs:length/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:whiteSpace">
					<sch:assert test="true()"> WhiteSpace would be treated as 'preserve',
						'replace' or 'collapse' </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:totalDigits">
					<xsl:comment>Only digits are counted (sign and dot are removed). Leading and trailing zeros are not yet discounted.</xsl:comment>
					<sch:assert test="string-length(replace(string(.),'[^0-9]','')) &lt;= {xs:simpleType/xs:restriction/xs:totalDigits/@value}"> The number of digits for <sch:name/>
						should be at most <xsl:value-of select="xs:simpleType/xs:restriction/xs:totalDigits/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:minExclusive">
					<sch:assert test=". &gt; {xs:simpleType/xs:restriction/xs:minExclusive/@value}"> The value for <sch:name/> should be
						bigger than <xsl:value-of select="xs:simpleType/xs:restriction/xs:minExclusive/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:minInclusive">
					<sch:assert test=". &gt;= {xs:simpleType/xs:restriction/xs:minInclusive/@value}"> The value for <sch:name/> should be
						bigger than or equal to <xsl:value-of select="xs:simpleType/xs:restriction/xs:minInclusive/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:maxExclusive">
					<sch:assert test=". &lt; {xs:simpleType/xs:restriction/xs:maxExclusive/@value}"> The value for <sch:name/> should be
						smaller than <xsl:value-of select="xs:simpleType/xs:restriction/xs:maxExclusive/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:maxInclusive">
					<sch:assert test=". &lt;= {xs:simpleType/xs:restriction/xs:maxInclusive/@value}"> The value for <sch:name/> should be
						smaller than or equal to <xsl:value-of select="xs:simpleType/xs:restriction/xs:maxInclusive/@value"/> </sch:assert>
				</xsl:if>
				<xsl:if test="xs:simpleType/xs:restriction/xs:pattern">
					<xsl:comment>This assertion checks xs:pattern. There can be more than one xs:pattern; the value is valid when any one of them matches.</xsl:comment>
					<xsl:variable name="testString">
						<xsl:for-each select="xs:simpleType/xs:restriction/xs:pattern">
							<xsl:variable name="apost" select='"&apos;"'/>
							<!-- xs:pattern is implicitly anchored but matches() is not, so wrap the pattern in ^( )$ -->
							<xsl:value-of select="concat('matches(.,', $apost, '^(', @value, ')$', $apost, ')')"/>
							<xsl:if test="position() != last()"> or </xsl:if>
						</xsl:for-each>
					</xsl:variable>
					<sch:assert>
						<xsl:attribute name="test">
							<xsl:value-of select="$testString"/>
						</xsl:attribute> The value for <sch:name/> should match
						<xsl:choose>
							<xsl:when test="count(xs:simpleType/xs:restriction/xs:pattern) = 1">
								the pattern:
							</xsl:when>
							<xsl:otherwise>
								one of patterns:
							</xsl:otherwise>
						</xsl:choose>
						<xsl:for-each select="xs:simpleType/xs:restriction/xs:pattern">
							<!-- HACK: This is strange to make span into a list value, but better than nothing -->
							<sch:span class="li"><xsl:value-of select="@value"/></sch:span>
						</xsl:for-each>
					</sch:assert>
				</xsl:if>
			</sch:rule>
		</xsl:when>
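
For readers who do not think in XSLT, here is a rough Python sketch of what the listing above is doing for two kinds of facet: building the XPath test string that ends up in the generated sch:assert. The function names and the operator table are my own illustration, not part of the converter, and only the facets handled in the listing are covered.

```python
# Hypothetical helper names; this only mirrors the string-building idea above.
RANGE_OPERATORS = {
    "minExclusive": ">",
    "minInclusive": ">=",
    "maxExclusive": "<",
    "maxInclusive": "<=",
}


def enumeration_test(values):
    """Build '(. = "a") or (. = "b")' from a list of enumeration values."""
    clauses = []
    for value in values:
        normalized = " ".join(value.split())  # same idea as normalize-space()
        clauses.append(f'(. = "{normalized}")')
    return " or ".join(clauses)


def range_facet_test(facet, value):
    """Build the XPath test for one range facet, e.g. '. > 32'."""
    return f". {RANGE_OPERATORS[facet]} {value}"


print(enumeration_test(["red", " green ", "blue"]))
print(range_facet_test("minExclusive", "32"))
```

The point is only that each facet turns into one small XPath expression; the surrounding rule and the sch:extends for the base type are generated separately.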

We are not implementing simple type derivation by union or list at the moment, because it is outside our primary requirements. I expect derivation by list would benefit from XPath2’s extra power. Derivation by union needs more thought.

But at least this puts us in the position where I think (have I missed something? never impossible!) we can say that Schematron’s power to validate datatypes is strictly greater than XSD’s power for datatypes derived by restriction; Schematron (i.e. using XPath2) can express all the XSD constraints and more.

But is Schematron more powerful to model type derivation? We want to be able to draw pretty diagrams of type derivation. Well, actually because derivation by restriction is simply implemented by abstract rules, in fact Schematron is equally capable of modeling the derivation structure. And, if we add @role attributes to the assertions with the name of the facet being restricted, actually Schematron models the facet system too: to the extent that (if you know the particular conventions used) you could re-generate versions of the original XSD datatype declarations.

But is Schematron better for diagnostics? Well, here comes the rub. In fact, for the datatypes Schematron does not bring any great improvement, in itself, in the kinds of diagnostics that can be generated by an XSD system that was targeted at humans (does any exist?). It does potentially bring a lot more ease of customization (compared to compiled XSD validators, but this is a benefit of scripting), but basically it is just working with a fairly well-enumerated set of properties, in the facets. We will see that it has a lot more scope for smarter diagnostics when validating so-called complex content.

And, we are not necessarily restricted to even XPath2’s power. It is possible to use an extended version of the query language that invokes functions from the Java (or Eiffel or whatever) platform. But this goes beyond our modest scope of a fairly complete implementation of XSD in a handful of XSLT scripts!

Finally, there is a little potential wrinkle here that needs to be worked out. What if our value for imametadataman is -333? We will get an assertion failure both for the byte constraint and for the >32 constraint. There is a danger that a multiply derived datatype will generate a flood of redundant error messages. There are two answers: one is to say “we already treat the built-in derived types as single abstract rules, so there won’t really be much multiple derivation with the same facet, it’s not a big problem!” Another answer is that the assertion for a facet restriction should only test the actual restricted range, and not any range for the base type. So the assertion for >32 also cops out for data outside the byte range (below -128, say) and leaves that assertion failure for the base type’s abstract rule to provide. (I think this second approach is nicer.)
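
To make the two answers concrete, here is a small Python sketch (not Schematron, and not our generated code) that models the byte rule and the >32 facet as functions returning messages. The function names are made up for illustration; the byte range of -128 to 127 is the xs:byte value space.

```python
BYTE_MIN, BYTE_MAX = -128, 127  # value space of xs:byte


def check_byte(value):
    """Base type rule: the value must be in the xs:byte range."""
    if BYTE_MIN <= value <= BYTE_MAX:
        return []
    return ["The value should be an xs:byte."]


def check_greater_than_naive(value, bound=32):
    """First approach: the facet assertion tests the whole number line."""
    return [] if value > bound else [f"The value should be greater than {bound}."]


def check_greater_than_scoped(value, bound=32):
    """Second approach: stay silent outside the base type's range."""
    if not (BYTE_MIN <= value <= BYTE_MAX):
        return []  # leave the complaint to the base type's rule
    return [] if value > bound else [f"The value should be greater than {bound}."]


def validate(value, facet_check):
    """Run the base rule and one facet check, collecting all messages."""
    return check_byte(value) + facet_check(value)


print(validate(-333, check_greater_than_naive))   # two messages
print(validate(-333, check_greater_than_scoped))  # one message
```

With the scoped version, -333 produces only the byte message, while a value such as 20 (a valid byte that is too small) still produces only the facet message.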

Rick Jelliffe

Because we are using XSLT2 as our query language for the generated Schematron schema, validating built-in simple types from XSD is almost trivial. If you want to validate that, say, an element is a valid boolean, then we can use the test . castable as xs:boolean

What we get is an abstract rule declaration for each built-in type.

<sch:rule abstract="true" id="xsd-datatype-boolean">
  <sch:assert test=". castable as xs:boolean">
    <sch:name/> elements or attributes should have an xs:boolean type value.
  </sch:assert>
</sch:rule>

which can then be used by an element like this: say the element imametadataman is boolean:

   <sch:rule context="imametadataman">
     <sch:extends rule="xsd-datatype-boolean"/>
   </sch:rule>

(What the optimization in the previous entry in this series does is merge these rules, so that instead of multiple rules extending the same type we can just have a single rule with a combined context.)

   <sch:rule context="imametadataman | ockadocknocka ">
     <sch:extends rule="xsd-datatype-boolean"/>
   </sch:rule>

So hurray for XSLT2 and XPath2!
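
If the abstract rule idea is new to you, here is a minimal Python sketch of the same arrangement: one shared check per datatype, and element names that simply point at it. The table names and the validate function are hypothetical, and castable_as_boolean is only a rough stand-in for the XPath test (it accepts true, false, 1 and 0 after trimming whitespace).

```python
def castable_as_boolean(text):
    """Rough stand-in for the XPath 2.0 test `. castable as xs:boolean`."""
    return text.strip() in {"true", "false", "1", "0"}


# One "abstract rule" per datatype: a check plus its assertion text.
ABSTRACT_RULES = {
    "xsd-datatype-boolean": (
        castable_as_boolean,
        "{name} elements or attributes should have an xs:boolean type value.",
    ),
}

# Concrete rules: a context (here just an element name) extending an abstract rule.
EXTENDS = {
    "imametadataman": "xsd-datatype-boolean",
    "ockadocknocka": "xsd-datatype-boolean",
}


def validate(name, text):
    """Return the assertion messages that fail for one element's text."""
    rule_id = EXTENDS.get(name)
    if rule_id is None:
        return []  # no rule fires for this element
    check, message = ABSTRACT_RULES[rule_id]
    return [] if check(text) else [message.format(name=name)]


print(validate("imametadataman", "true"))
print(validate("imametadataman", "yes"))
```

Adding a new boolean element is then one more entry in the second table, which is exactly why the generated schemas contain so many tiny rules.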

Not so fast Boy Wonder

There is a rub, however. XSLT2 defines a basic conformance level (which is what the free SAXON XSLT2 transformer uses) that uses the basic XPath2 features. However, the XPath2 working group apparently went mad with their desire to make life simple for implementers, and decided the basic level of XPath2 would understand (for castable) the built-in primitive types of XSD but not the built-in derived types. Err, well except for integer. So then of course, because it is silly and confusing for them to be missing, the diligent implementer like Michael Kay of SAXON has to add support as a custom extension: no-one’s life is made simpler.

So in order to use castable without the gratuitous omissions, SAXON has to be invoked with a special attribute, which in turn has meant I have had to alter the Schematron skeleton to generate that code. (I’ll release it in the next few days.) I hope the XPath2 committee realizes that the more distinctions they make, the more complex their technology and the more difficult for us punters. I am less than impressed. Boo for XSLT2 and XPath2!

Peek at the code

Anyway, here is the basic code, which is part of the larger converter script. This is the most straightforward part of the whole project! Hurray for XSD Datatypes! First we have a list of all the type names, so we can refer to them later. Move constants to headers!

<xsl:stylesheet version="2.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:sch="http://purl.oclc.org/dsdl/schematron"
>

<xsl:output method="xml" encoding="UTF-8" indent="yes" omit-xml-declaration="no"/>

<!-- supported by Basic XSLT 2.0 processor and XPath 2.0 -->
<xsl:variable name="standard-datatypes">
	<datatype>anyAtomicType</datatype>
	<datatype>anyURI</datatype>
	<datatype>anySimpleType</datatype>
	<datatype>anyType</datatype>
	<datatype>base64Binary</datatype>
	<datatype>boolean</datatype>
	<datatype>date</datatype>
	<datatype>dateTime</datatype>
	<datatype>dayTimeDuration</datatype>
	<datatype>decimal</datatype>
	<datatype>double</datatype>
	<datatype>duration</datatype>
	<datatype>gDay</datatype>
	<datatype>gMonth</datatype>
	<datatype>gMonthDay</datatype>
	<datatype>gYear</datatype>
	<datatype>gYearMonth</datatype>
	<datatype>hexBinary</datatype>
	<datatype>integer</datatype>
	<datatype>QName</datatype>
	<datatype>string</datatype>
	<datatype>time</datatype>
	<datatype>untyped</datatype>
	<datatype>untypedAtomic</datatype>
	<datatype>yearMonthDuration</datatype>
</xsl:variable>

<!-- not supported by Basic XSLT 2.0 processor -->
<xsl:variable name="extended-datatypes">
	<datatype>byte</datatype>
	<datatype>ENTITIES</datatype>
	<datatype>ENTITY</datatype>
	<datatype>float</datatype>
	<datatype>ID</datatype>
	<datatype>IDREF</datatype>
	<datatype>IDREFS</datatype>
	<datatype>int</datatype>
	<datatype>language</datatype>
	<datatype>long</datatype>
	<datatype>Name</datatype>
	<datatype>NCName</datatype>
	<datatype>negativeInteger</datatype>
	<datatype>NMTOKEN</datatype>
	<datatype>NMTOKENS</datatype>
	<datatype>nonNegativeInteger</datatype>
	<datatype>nonPositiveInteger</datatype>
	<datatype>normalizedString</datatype>
	<datatype>NOTATION</datatype>
	<datatype>positiveInteger</datatype>
	<datatype>short</datatype>
	<datatype>token</datatype>
	<datatype>unsignedByte</datatype>
	<datatype>unsignedInt</datatype>
	<datatype>unsignedLong</datatype>
	<datatype>unsignedShort</datatype>
</xsl:variable>

...

Now generate a set of abstract rules for each of these types. The unrestricted string type never needs validation, so its assertion test is always true(). We also generate a custom diagnostics element for each abstract type too.

	<xsl:for-each select="$standard-datatypes/datatype">
		<xsl:variable name="dataType" select="."/>
		<sch:rule abstract="true" id="{concat('xsd-datatype-', $dataType)}">
			<sch:let name="norm" value="normalize-space(.)"/>
			<!-- Facet: check if it is a float type -->
			<xsl:choose>
				<xsl:when test=" $dataType = 'string' ">
			<!--  strings don't need checking -->
			<sch:assert test="true()"
				diagnostics="{concat($dataType, '-diagnostic')}">
				<sch:name/><xsl:text> elements or attributes should have a </xsl:text>
				<xsl:value-of select="$dataType"/><xsl:text> type value.</xsl:text>
			</sch:assert>
				</xsl:when>
				<xsl:otherwise>
			<sch:assert test="{concat('$norm castable as xs:', $dataType)}"
				diagnostics="{concat($dataType, '-diagnostic')}">
				<sch:name/><xsl:text> elements or attributes should have a </xsl:text>
				<xsl:value-of select="$dataType"/><xsl:text> type value.</xsl:text>
			</sch:assert>
			</xsl:otherwise>
			</xsl:choose>
		</sch:rule>
	</xsl:for-each>

And here is the code for generating a diagnostic element. The intent is that a user can tailor these if needed. (In Schematron we make a distinction between the assertion text, which is a positive statement of what should be true in the document, and the diagnostic, which contains specific messages for describing, locating and correcting the problem.)


<!-- generate diagnostics for standard datatypes check -->
<xsl:template name="generate-standard-datatypes-diagnostics">
	<xsl:for-each select="$standard-datatypes/datatype">
		<xsl:variable name="dataType" select="."/>
		<sch:diagnostic id="{concat($dataType, '-diagnostic')}">
			<xsl:text> "</xsl:text><sch:value-of select="."/>
			<xsl:text>" is not a value allowed for xs:</xsl:text>
			<xsl:value-of select="$dataType"/><xsl:text> datatypes.</xsl:text>
		</sch:diagnostic>
	</xsl:for-each>
</xsl:template>

And finally we use it when we find an element with a built-in simple type:

<sch:extends rule="{concat('xsd-datatype-', substring-after($baseon, ':'))}"/>

where $baseon is the prefixed built-in simple type name.

On top of this, of course, there is great scope for adding much better diagnostics for problems with the datatypes. But not yet.

Rick Jelliffe

I’m going to jump to the end now. Here is a small XSLT stylesheet that is rather specialist, and probably not much use outside this application, but if anyone else is autogenerating large Schematron schemas using a lot of simple abstract rules, it may be useful.

Our XSD to Schematron implementation generates large Schematron schemas. We knew they would, but small is beautiful. This script trims the generated schema down by merging some kinds of rules.

We use Schematron abstract rules for handling data typing: there can be quite a lot of rules containing just the same sch:extends element and value. In a trial for merging these we brought a 9000 line generated schema down to about 5000 lines; not bad.
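
The merging idea itself is simple enough to sketch in a few lines of Python. This is not the stylesheet, just an illustration under the assumption that each rule to be merged consists of a context and a single sch:extends target; the function name and the tuple representation are mine.

```python
from collections import defaultdict


def merge_rules(rules):
    """Merge (context, extended_rule_id) pairs that extend the same abstract rule.

    Contexts sharing an abstract rule are joined with ' | ', the XPath union
    operator, so several small rules collapse into one.
    """
    grouped = defaultdict(list)
    for context, extended in rules:
        if context not in grouped[extended]:  # drop exact duplicates
            grouped[extended].append(context)
    return [(" | ".join(contexts), extended) for extended, contexts in grouped.items()]


rules = [
    ("imametadataman", "xsd-datatype-boolean"),
    ("count", "xsd-datatype-byte"),
    ("ockadocknocka", "xsd-datatype-boolean"),
]
for context, extended in merge_rules(rules):
    print(f'<sch:rule context="{context}"><sch:extends rule="{extended}"/></sch:rule>')
```

Rules that carry their own assertions as well as an sch:extends would need more care, since merging them changes which assertions apply to which context.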

Download file COMPRESS.XSLT

Rick Jelliffe

Here are some XSLT scripts for macro-expanding a set of XSD schemas into a single file with references removed, as a more optimal form for schema interrogation and conversion.

Converting an XSD schema into Schematron involves three stages:

  • Preparing the XSD schemas so they are in an optimal form for transforming out from
  • Converting the grammar and datatype constraints of this prepared schema into Schematron for elements and datatypes
  • Converting the other constraints such as KEY and ID into Schematron.

This blog item gives some beta XSLT code for the first part. A pipeline of three XSLT scripts is used:

  • INCLUDE: starting from a schema, substitute all the included and imported schemas in-place. (<redefine> is not supported in this version.)
  • FLATTEN: move schemas for different namespaces to the top-level, removing duplicates.
  • EXPAND: substitute references to complexType, group, attributeGroup and remove declarations (substitution groups and wildcards are not supported in this version.)

The result is a document with a top-level element of <schemas> containing <namespace> elements, each containing an XML Schema module for a single namespace. These modules contain element, attribute and simpleType declarations, but structural references have been replaced. This resolved form makes the job of converting to Schematron much easier, because there are fewer cases to consider and simpler paths. And all the schemas are gathered into a single file.

I have put the beta XSLT files here. It will go to sourceforge or somewhere eventually: watch this space. But I have been frustrated by the lack of tools that expand out XSD schemas, so this code may be useful for other things (I may rewrite Topologi’s XSD to RELAX NG converter to use this as the front end, for example):

I would like to acknowledge JSTOR as the sponsor for this code. Thanks to Matt Stoeffler. It is licensed under GPL as open source.

Rick Jelliffe

In my blog Converting Content Models to Schematron I outlined some code ideas. Recently we (Topologi) have been working on an actual implementation for a client: a series of XSLT 2 scripts that we want to release as open source in a few months’ time.

Why would you want to convert XSD to Schematron?

The prime reason is to get better diagnostics: grammar-based diagnostics basically don’t work, as the last two decades of SGML/XML DTD/XSD experience make plain. People find them difficult to interpret and they give the response in terms of the grammar, not the information domain. And error messages are reported in terms of where the error was detected, not where the error was. For example, given a content model (a, (b, c)?, c, d ) and a document <a/><c/><c/><d/> you will get an error “Expected a d” at the location of the second c element; however the problem really is that the b is missing.

Schematron converted from a grammar still does not have much info to go on. Of course, the Schematron scripts should be easier to customize for tailored assertions and diagnostics. But also the phase mechanism is very useful: we can implement multiple different ways of checking the grammar and let the user decide on which one provides the best information.

A secondary reason is that Schematron only needs an XSLT implementation. There is still quite a suspicion that XML Schema implementations are partial or broken. Japan Industrial Standards’ comment on Open XML was that they could not in fact even get the Schemas to run under Xerces and another major implementation. XSLT is much more common. However, we have decided to use XSLT2, and SAXON in particular, because it offers us some short cuts.

One shortcut that is quite fun is this possibility (I am not sure whether we will implement this method this round, it is outside our initial brief): by converting the children element names of an element into a string, such as “H1 p div div div table ht p” for example, and then converting a grammar such as (H1 | H2 | H3 | P | div | table)* into a regular expression equivalent, we can actually use the built-in regex recogniser of the XPath2 functions to validate the document. Just using a vanilla XSLT2. And this even copes with the minOccurs/maxOccurs cardinality constraints, too.
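
Here is a quick sketch of that regex trick, in Python rather than XPath2 so it can be tried at the command line. All the helper names (name, seq, choice, occurs, is_valid) are invented for this illustration, and the encoding (each child name followed by one space) is just one workable assumption, not necessarily what we will implement.

```python
import re


def children_string(names):
    """Encode the child element names as 'a b c ' (each name followed by a space)."""
    return "".join(n + " " for n in names)


def name(n):
    """A single element particle."""
    return f"(?:{re.escape(n)} )"


def seq(*parts):
    """xs:sequence: the particles one after another."""
    return "(?:" + "".join(parts) + ")"


def choice(*parts):
    """xs:choice: any one of the particles."""
    return "(?:" + "|".join(parts) + ")"


def occurs(part, lo=0, hi=None):
    """minOccurs/maxOccurs as a regex quantifier; hi=None means unbounded."""
    upper = "" if hi is None else str(hi)
    return f"(?:{part}){{{lo},{upper}}}"


def is_valid(model, names):
    """True when the whole child sequence matches the content model regex."""
    return re.fullmatch(model, children_string(names)) is not None


# (H1 | H2 | H3 | P | div | table)*
block = occurs(choice(*(name(n) for n in ["H1", "H2", "H3", "P", "div", "table"])))
# (a, (b, c)?, c, d)
abcd = seq(name("a"), occurs(seq(name("b"), name("c")), 0, 1), name("c"), name("d"))

print(is_valid(block, ["H1", "P", "div", "div", "table"]))
print(is_valid(abcd, ["a", "c", "c", "d"]))
```

Note that the second model rejects the a, c, c, d document from the example earlier, but a bare regex match only says yes or no: it does not tell you that the b is what went missing, which is the diagnostics problem all over again.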

This is rather exciting as these things go because it means that we can have a fallback validator that completely covers all the constraints of a grammar system, without leaving Schematron or the world of assertions. The downside? If implemented in a simple way, you only get the same kinds of diagnostics as a conventionally implemented XSD system will give you. But the advantage of having a complete Plan B means that we can concentrate on useful messages for the Plan A.

I’ll blog on how we implemented it over the next few weeks. Basically, we have a two-stage architecture: the first stage (3 XSLTs) takes all the XSD schema files and does a big series of macro processes on them, to make a single document that contains all the top-level schemas for each namespace, with all references resolved by substitution (except for simple types which we keep). This single big file gets rid of almost all the complications of XSD, which in turn makes it much simpler to then generate the Schematron assertions.

We have so far made the preprocessor, implemented simple type checking (including derivation by restriction) and the basic exception content models (empty, ALL, mixed content), with content models under way at the moment. I think the pre-processor stage might be useful for other projects involving XML Schemas.

Actually, the difficulty has been in an unexpected direction. XML Schemas is so unpleasant to work with that one programmer asked to be taken off the project because it was simply too much to cope with, and another has left the company (to take up an overseas appointment) but not before also getting frustrated, boggled and bogged down by XSD! Things like complex type with simple content derived by extension from a simple type with simple content etc become a maze or rat’s nest. (Hopefully we have that under control and we’ll be able to attend to our backlog of other work ASAP: we have been pretty poor.)

It is interesting that in all the last almost eight years of Schematron, I don’t recall anyone complaining it was too difficult. Instead, I regularly get surprised to hear of quite important projects where it has been quietly used without fuss or drama, and just chugs away doing its thing, with everyone involved feeling (and being) in control. This week for example I heard about UK taxation office’s use of Schematron for checking incoming documents being lodged. I think some of the reason for the success might be that because Schematron is small, it can be kept under control and understood, and that because there is zero support from the large software players, it is never used as part of an attempt to up-sell big hardware or message busses or protocols or enterprise systems etc.: it gets used for POX (Plain Old XML) sites.

Jennifer Golbeck

A while back, I named Gazzag.com my enemy for spamming people to join their social network. They have a new enemy partner: Quechup.com

Quechup allows users to import contacts from their email accounts (such as Gmail). The unsuspecting user provides their login, and Quechup does retrieve their contacts. It then sends an email to every one of those contacts from the user telling them that the user has requested them to join Quechup.

This is annoying and just plain wrong.

Some people will say that users should be more careful about sharing this information with a website, or that they should read the fine print more carefully. However, I reiterate my claim that no reasonable person will want a site to email everyone they know. We all have professional or personal contacts that we would not choose to invite to a social network. I don’t believe that even a giant, dedicated, flashing warning page that alerts users to the fact that all of their contacts are about to be automatically invited in their name is sufficient. This practice, as I said before, is just evil.

I assume the people who run these websites believe that they will automatically get lots of new members by spamming users’ contacts. Instead, they create a lot of angry users who go out of their way to email their friends and actively discourage them from signing up for a network.

Users develop trust in a website for many reasons. Some are simple - appearance, comfort with the community of people there, nice features, etc. Some are more technical - a good privacy policy, parental controls, and the like. Many users come into websites with a base level of trust. This sort of mass invitation violates that trust, and is a sure way to spread negative publicity about your site (such as this post). Users should have explicit, obvious, and protective control over who is invited in their name. By default, no one should be invited. The user should have to knowingly undertake a process to select every person they want invited.

I hope other networks learn from the mistakes of Gazzag and Quechup. The spamming techniques used by these sites are a worst practice of the social networking world.

Kurt Cagle

In a recent interview with Rohit Khare, Director of CommerceNet Labs, Jon Udell may have been responsible for introducing a new meme into the noosphere that will be as important in its time as AJAX was in 2004. Rohit Khare gave an influential presentation describing ALIT, which utilized SOAP messages for transferring events between systems, but in the intervening years, his thinking has shifted to a new system based not upon SOAP but upon RESTful RSS and Atom feeds, for which he has coined the term Syndication Oriented Architecture, or SynOA.

Kurt Cagle

I have a confession to make. I’ve never had a formal class in C++ (though I’ve written quite a few). At no point did I ever have a professor spend several days trying to make me understand the significance of *,**,*void, &, ., -> or all those other rather strange glyphs that make reading C++ much like trying to understand the Chicago Manual of Style with 95% of the words removed.

The other day, as I was reviewing my Stroustrup, it occurred to me that a whole lot of programmers out there, especially in the web space, as likely as not never sat through that C++ class either, and so I began a thought experiment - if your only experience of programming was web development, how exactly could someone teach you about C++ in those terms? Curiously enough, the more I dug into this conceit, the more I realized that this was actually a useful exercise in understanding some fairly deep notions about how we deal with the concept of reference in programming and on the web.

Simon St. Laurent

I mentioned this a month ago, but that was, well, a month ago, and the deadline is tomorrow. The XML 2007 Call for Papers ends tomorrow.

Proposals need to include speaker information, a short abstract, and a suggestion for its track. We have four tracks this year:

  • Documents and Publishing

  • XML on the Web (I’m chairing that track.)

  • Enterprise XML

  • XML Training

Lauren Wood (the previous chair of this conference) has posted advice for proposal submissions that I heartily recommend.

Rick Jelliffe

Vote “No”? But aren’t I supposed to be Microsoft’s biggest fanboy? Well, what I mean is a conditional approval, not a rejection. There are some things that can be fixed and should be fixed, and an ISO Ballot Resolution Meeting is the best forum to make sure it happens.

I’ve been quite active in the debate on adopting Office Open XML as a standard,* and this blog has frittered away many bits on explaining why (because it would be useful in my industry, which is industrial publishing and markup, and we have been demanding it for a long time) and why many of the specific reasons given against OOXML are flimsy (how many self-assured people have raised “autoSpaceLikeWord95” who have no idea what a fullwidth character is, for example?) But not all was plain sailing: along the way I have pointed out several flaws that I thought needed to be corrected. A mild diversion has been to look at the various claims of bribery or faulty procedure bandied about.

On my travels, when I have been asked about how National Bodies should vote, I have always said that there is nothing wrong with a “No with Comments” vote, if the comments were doable. Indeed, this is exactly the vote that I have recommended to my national body, Standards Australia.

The actual list of comments I sent is here. Please note that these are just one person’s comments, not the official position. I have no idea how Standards Australia will vote, but I strongly urge them to vote “No with Comments”, specifically with my comments. I have tried in the comments to address many of the issues that people raised, and to limit the comments to issues that are relevant to Australia (which Standards Australia is quite keen on.)

Now when reading these comments, please realize that the intent is to state the technical and editorial position as clearly as possible. (When I say something is unacceptable, that is only in the context of the suggested fix to make it acceptable, not any claim that something cannot be fixed by the normal BRM process.) The whole point of these comments is that IMHO the big flaws in the standard are fixable (and fixable by the current processes) and that the edge-cases are not critical and can be left to maintenance.

In my comments I have attempted to expose the principles behind each comment, and to limit them to comments relevant to Australian industry. I definitely concentrate on getting the high-level issues right: the name of the standard, the organization of it, the conformance section, the over-abundance of non-normative text, the need to allow standard notations, and a future-proofing issue. My view is that getting these high-level issues right takes the sting out of the tail of many individual problems and edge-cases, and addresses many of the technical issues that people have raised piecemeal.

Rick Jelliffe

This decade has seen a tectonic shift in technology: the new information applications which are succeeding are those in which information is based on simple topics; the major new document formats are those which allow the packaging of a topic.

The organization of information into simple interlinked topics, typically something that can be described in a single phrase, is the common factor between such seemingly disparate but succeeding technologies as the web-based Wikipedia, Amazon, Google, Ebay, Flickr, MySpace, YouTube, blogs and RSS. It has also had a strong impact in non-WWW areas: the ITIL Configuration Item, the SCORM Learning Object, the S1000D Descriptive Module and integrated UML systems, for example.

The difference from the WWW in general is that though web technologies indeed encourage small pages, there is no necessity that pages are about one topic in particular. So the WWW is an excellent basis for implementing topic-based systems, but not itself one. Similarly, RDF may allow resources to be linked, but these are not necessarily at the level of topics. Another way of looking at topics is that a lack of topicality is what makes a poor index item poor.

There has been a decade-long process at ISO SC34 to make and develop a series of standards based on topics, for example the Topic Map standard, IS 13250. This is good technology (like XLink and RDF) to look at when considering how to implement a topic-based system.

The rise of Topics represents a great challenge to operating system and desktop suite vendors. When we look at Windows, or Mac or Linux window managers, we see that they really interact with the user at the wrong level. They say that the topic the user is interested in is applications and files. But how many people nowadays start their computer interaction with a web browser pointed to Google? There are still people whose organizing topic of interest in their computer interaction is the file or application, of course, but they have been swamped by people who are interested in the topic.

There are interfaces which present the user with different organizing topics: most notably the Sugar interface of the One Laptop Per Child ($100 computers) in which the primary metaphors are the person (and their private activities and journal), the neighborhood, and the group (and group activities and bulletin board.) The interaction topics are “people, places, objects, actions”. But as with the desktop, these are not topics in general, just the topics of one domain (a fairly compelling domain, that of children and communities).

Indeed, we can see the large successful web applications as being topic-based interfaces each for particular domains and scopes. A lot of the Web 2.0 or Social Interface systems talk focuses on the human or social or write-able web aspects; my question is this: should we think of Topics as the “how” and the social aspect as the “why”, or should we think of Topics as the “why” and the social aspects as the “how”?

Moreover, should Linux, Windows, Mac and all seriously respond to the rise of Topical Interfaces by ditching the desktop metaphor? I tend to think yes: in terms of my support/runner/plug-in model, topic interfaces belong at the “suite” level, and a desktop interface is just another suite.

One reason I found (and still do find) the Windows desktop so cumbersome to use compared to a UNIX shell or the old Mac desktop was that it never seemed to provide me with the topics I was interested in. When the topic was “Installed programs” it lets me look at a menu from the start button, but not all programs are there; I have to switch to a completely different system, the file explorer, and look in Program Files and figure out from the files and directories what applications are there. We have to fight with the army we have, not the army we want, but we won’t win unless we have the army we need.

Topical Interfaces have eclipsed the Desktop Interface and are severely challenging the central position of the file, because increasingly the value of some information is in its linked-in-ness to some larger system. From this point of view, the recent trend (JAR, WAR, EAR, ODF, Open XML, SCORM, etc) to use ZIP and therefore package together all the files needed for one application session can be seen as an attempt to turn documents themselves into a container for a bounded topic. OOXML’s Open Packaging Convention (OPC) represents the high-point (though not the state of the art, for which see RDF and ISO Topic Maps) in this trend, adding a linking and typing mechanism (relationships) within the ZIP package. However, the moves to make a platform out of the office suite and out of the Web browser (and the various Java Rich Client Platforms such as Eclipse, NetBeans, and so on) fall short of providing the integrated, topic-based interfaces.
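
To make the “relationships within the ZIP package” idea concrete, here is a minimal Python sketch. It builds a toy package in memory and reads back the typed links from its root relationships part. The part names and the single relationship are invented for illustration, and a real OPC package also needs a [Content_Types].xml part, which is left out here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship parts.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hypothetical, minimal relationship part: one typed link to the main document.
ROOT_RELS = (
    f'<Relationships xmlns="{RELS_NS}">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    "</Relationships>"
)


def build_package() -> bytes:
    """Build a toy package in memory (a real one also needs [Content_Types].xml)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


def list_relationships(package_bytes: bytes) -> list:
    """Return (id, type, target) for each relationship in the root relationships part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("_rels/.rels"))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root.findall(f"{{{RELS_NS}}}Relationship")
    ]


for rel_id, rel_type, target in list_relationships(build_package()):
    print(rel_id, "->", target)
```

The point of the sketch is only that the links and their types travel inside the container, so a consumer can discover the parts of the “topic” without guessing from file names.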

The two worlds need to converge: we need Topical Interfaces which let us navigate between and within topics and perform transactions, but which also allow each Topic to be bundled and shipped around as a document.

Rick Jelliffe

Schematron is an ISO standard (ISO/IEC IS 19757-3) schema language for expressing assertions about the presence or absence of patterns in a document, usually using XPath. ISO standards are supposed to contain verifiable statements about some technology. And there is a schema for ISO standards (refer to How to write your own ISO Standard). So why not combine them? Executable specifications may provide the best form of verifiability!

I’ve made a little stylesheet that converts Schematron schemas into ISO Standard annexes. Each pattern becomes a separate clause; assertions are treated as constraints, and report statements are treated as errors that must be reported. The stylesheet handles abstract rules and abstract patterns (though these are starting to go into XPath territory and so are borderline ugly), and the @see attribute. Phases are treated as conformance profiles. Diagnostics are stripped out; they might perhaps have some use in application standards rather than document standards.
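
The mapping itself is simple enough to sketch outside XSLT. The following Python fragment is not the stylesheet, just a rough illustration of the pattern-to-clause idea under some assumptions: the tiny schema is invented, the clause numbering (“A.1”) and the wording of the output lines are my own, and abstract patterns, phases and @see are ignored.

```python
import xml.etree.ElementTree as ET

# ISO Schematron namespace in ElementTree's {uri}local form.
SCH = "{http://purl.oclc.org/dsdl/schematron}"

# Hypothetical one-pattern schema, invented for this illustration.
SCHEMA = """
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="titles">
    <title>Titles</title>
    <rule context="chapter">
      <assert test="title">A chapter shall have a title.</assert>
      <report test="title[2]">A chapter has more than one title.</report>
    </rule>
  </pattern>
</schema>
"""


def schema_to_clauses(schema_xml: str) -> list:
    """Turn each pattern into a clause heading followed by its constraints."""
    lines = []
    root = ET.fromstring(schema_xml)
    for number, pattern in enumerate(root.findall(f"{SCH}pattern"), start=1):
        # Fall back to the pattern id when there is no title element.
        title = pattern.findtext(f"{SCH}title", default=pattern.get("id", ""))
        lines.append(f"A.{number} {title}")
        for rule in pattern.findall(f"{SCH}rule"):
            context = rule.get("context")
            for child in rule:
                # Collapse whitespace in the natural-language statement.
                text = " ".join("".join(child.itertext()).split())
                if child.tag == f"{SCH}assert":
                    lines.append(f"  Constraint (on {context}): {text}")
                elif child.tag == f"{SCH}report":
                    lines.append(f"  Reported error (on {context}): {text}")
    return lines


print("\n".join(schema_to_clauses(SCHEMA)))
```

Note that the XPath in @test never reaches the output: the text a domain expert reads is the natural-language statement, which is the design choice the approach depends on.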

As well as its assertions, Schematron allows quite a bit of rich text and titles. The stylesheet handles bullet and numbered lists, and most kinds of inline styling. The output is validated against the RELAX NG Compact schema from the draft TR that I was using. (I had to clean up numbered lists a little: the draft stylesheet provided its own autonumbering when using <ol>.)

So is this a serious idea? Actually, yes. Schematron was developed with the human aspect of schemas as a very high priority, unlike any other schema language that I am aware of. By design, it is intended to be useful for generating documentation suitable for domain experts rather than XPath developers. (I am working on a commercial product that provides this as part of a collaborative schema development environment; the betas look good.)

So I hope that as more organizations take up Schematron to specify part or all of their standards, they will adopt this kind of approach, so that they end up with standards with no gaps between what is required and what is validatable. Note that you can still make Schematron assertions even when there is no XPath to check it: so Schematron does not back you into the corner that other schema languages do, where you have no high level constructs to document constraints beyond the capability of the validation expression language: refer to Expressing untested and untestable constraints in Schematron.

The stylesheet and an example

Schematron Validation Reporting Language (SVRL) is a small language specified as part of ISO Schematron for representing the output of a validation. It can then be transformed for lots of other uses.
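
As a small illustration of one such “other use”, here is a Python sketch that pulls the failed assertions out of an SVRL report. The report is a hypothetical hand-written one, not output from the files below, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# SVRL namespace in ElementTree's {uri}local form.
SVRL = "{http://purl.oclc.org/dsdl/svrl}"

# Hypothetical validation report, invented for this illustration.
REPORT = """
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:active-pattern id="titles"/>
  <svrl:fired-rule context="chapter"/>
  <svrl:failed-assert test="title" location="/book/chapter[2]">
    <svrl:text>A chapter shall have a title.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
"""


def failed_assertions(svrl_xml: str) -> list:
    """Return (location, message) pairs for every failed assertion in the report."""
    root = ET.fromstring(svrl_xml)
    return [
        (failure.get("location"), failure.findtext(f"{SVRL}text", default="").strip())
        for failure in root.iter(f"{SVRL}failed-assert")
    ]


for location, message in failed_assertions(REPORT):
    print(f"{location}: {message}")
```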

First: here is the Schematron schema for SVRL, unchanged from the ISO standard except I added three IDs that were missing (the XSLT expects patterns to have IDs): Download file

Next, here is the XSLT script: Download file

Here is the output from the script, using the SC34 schema: Download file

And here is that output then converted to HTML, using the draft previewing script from ISO. (The SourceForge project has an XSL-FO generator): Download file

As a bonus, here is a blank XSLT template with all the Schematron elements exposed, for anyone who wants to make their own complex pretty-printer/transformer for Schematron schemas:
Download file

The annex generated is, I think, pretty acceptable as a draft standard, especially since the schema was written as a real schema and not as text in a standard per se. Obviously some things can be improved, such as being consistent with ‘should’ and ‘is’, but I think this is a viable, useful and efficient approach to improving the quality of standards for XML vocabularies and document types.

Rick Jelliffe

You too can write your own ISO standard! Here are the steps:

1) Download the ISO/IEC Directives Part 2 Rules for the structure and drafting of International Standards. These give the general editorial guidelines. Read it all.

2) Download the documentation for the XML schema for ISO Standards, which is in Technical Report 9573-11. A good draft is available from the SC34 website. Read it all.

3) Download the Open Source schemas and stylesheets, which are available at SourceForge and embody a lot of the rules of the ISO/IEC Directives Part 2. They have been contributed to over the years by such people as Murata Makoto, Martin Bryan, Ken Holman and James Clark, and used in many standards: I used them for ISO Schematron, for example. (If you want to use Word templates or whatever, these are available from ISO, but this is an XML list so it doesn’t deal with that.) Install and configure your production environment to use them.

4) Try to follow these writing guidelines:

  • When writing, think about clarity. A good rule of thumb is “Will this sentence be easily translatable into a language that does not have the words “the”, “a” and “it” or which does not have the future or past tense available?” and “Can a recent graduate understand this?” Note in particular that you must use “shall”, “should”, “must” in very particular ways, that you need to use the definitions section as much as possible, that you need to clearly distinguish normative text from informative text (which is not the same as required and optional/discretionary, and different again from the legal “Required Parts”), you need to be clear about different levels of conformance, and that you need to be careful with normative and non-normative references (see the Directives!)
  • Download any other standards in a similar domain, and try to re-use the phrasing and declarations from them. When writing, try to use the standard vocabulary that ISO suggests in standards such as IS 2382. If you use terminology that differs from these, make sure it is in your definitions section. Note that there are some trick words that have specialized meanings: so “define” is what you do, but “declare” is how you do it (loosely).
  • A standard should only contain verifiable statements. That rules out most adjectives, unless they are defined, and is why standards tend to have Germanic agglomerations of nouns. Where possible, try to specify the requirement in an executable form, such as a schema language, then use the text to fill in the gaps. Where possible, try to specify the requirement using a formalism, such as predicate logic or BNF or UML, especially if there is an unambiguous notation or a standard for these. Where possible use diagrams; however, only use them if there is a common or standard diagramming type for which a reference is available.
  • When writing, avoid dependencies on other standards. Reference the most general version of other standards possible. Unless there is a good reason, allow the other standards to be maintained without this then making your standard outdated. Avoid specifying or summarizing other standards: completely in normative text, and as little as possible in informative text unless the other standard is not freely available.

5) Write your draft.

6) Track down IP issues to the best of your ability. Also, try to have it reviewed for Internationalization, Security and Accessibility issues: the more that these are designed in from the beginning, the smoother things will be downstream. Most importantly, you need to show that there is some market (users) for this standard, that it is not some crackpot technology. One important thing that will influence reviewers is whether there is developer buy-in: is there an open-source implementation, is there some company willing to produce products that use the specification, and so on. If you want commercial buy-in, think about the carrots (an economic case why it would benefit vendors) and sticks (getting regulators or procurement departments to require it.)

7) Decide whether it should be an ISO/IEC International Standard, an ISO/IEC International Standard through fast-track, a Publicly Available Specification, an ISO/IEC Technical Report, a National Standard, a Consortium Standard, or just something on your own website. If you decide to take it through ISO you have to find or become a champion: you can go to your local national standards body and get them to propose it (or adopt it as a national standard first), you can find a friendly committee person on the relevant committee and get them to propose it from their Working Group, or you can find some boutique standards body that has liaison with ISO (such as OASIS or W3C) and put it through their processes. You need to find an editor who is participating on the committees and can travel to enough meetings. (See if your national body offers any travel subsidies; demand that the ISO working group use teleconferencing.) You should expect that your draft may be substantially changed, especially if you have not written it according to step 4). At this stage, remember that you are not alone: there will be other committee people and interested people around the world who can provide advice, only rarely crazy, and you cannot be too proprietorial: some parts of the standard will improve in your eyes, some parts will get worse in your eyes, but that is all OK because it becomes a collective effort. Especially remember that a really stupid comment from someone is undoubtedly a sign that your deathless prose is crap and needs to be fixed. Don’t take criticisms of the draft personally, and learn committee skills: how to challenge clearly, take the stated requirements of others seriously, and acquiesce gracefully. Not understanding something or losing an argument does not involve a loss of face, but you have to give face when winning on an issue too. Don’t “play to win”; instead “play to win/win” (I am embarrassed to write that!)

8) When a draft is produced, contact the various technical committees around the world to help answer questions. Actually, the ISO committee process itself provides a good forum for this; if you are fast-tracking you may need to do extra work to explain the draft.

9) Ask the committee to ask ISO to get the standard added to ISO’s free list. A standard that is not on the WWW is at a total disadvantage.

10) Assuming the vote on the Final Draft was “yes”, you now have your standard! Congratulations, that has only taken three years or so. Now you have to commit a little time over the next few years to maintain it and make the corrections that come up, and to try to get buy-in from the public. If you have a “grass-roots” standard like ISO DSDL (RELAX NG, Schematron etc) which does not fit into the plans of the military-industrial complex, then your expectations need to be modest and you need to think about how to encourage activity in the Open Source eco-system. Remember a good standard is one that meets its particular users’ needs, not one that takes over the world.

However, your name won’t be in the standard (unlike W3C or OASIS), or in the bibliographic entries. So don’t do it, or participate on committees, if you want to see your name on Amazon.

Rick Jelliffe

The licensing of IP for standards has four aspects: what the (case and statute) law says, what the standards bodies require, what the IP owner grants, and how the developer (adopter) is acting. Standards themselves never seem to have useful information about patent IP, and even their copyright boilerplate needs to be checked against licenses given by the copyright holder: W3C and ISO don’t like you copying their standards, Ecma does, for example.

For an introduction to the legal aspects, see ConsortiumInfo.org, which is by a lawyer for OASIS. The Dell case is pertinent.

For an introduction to the standards body aspects, see Standards Law, which is by a lawyer for Microsoft. It has a reference to the ISO requirements. For the boutique standards bodies: OASIS, Ecma, W3C

For examples of the kind of grants that companies make see
Microsoft Open Specification Promise, IBM Open Source Portal, Sun’s OpenDocument Patent Statement. Adobe has not put their equivalent online if it has been finalized, as far as I can see. (Microsoft also has a “Covenant not to sue”; however, this seems to have disappeared from its website in a rearrangement of links. They need to get it put back online.)

So what does the user have to do with it? Some licenses provide particular conditions relating to private or not-for-sale use: the GNU licenses for example. Other times licenses are revoked if you try to sue the IP owner: these defensive patents are bargaining chips in legal wrangling.

One key term to understand is RAND: Reasonable and Non-Discriminatory Licensing. It is pretty much the bottom line for standards organizations. However, RAND licenses are controversial, and in the views of many of us, something that should be avoided by modern standards bodies in the age of Open Source and Free Software which, like standards, have strong counter-monopolistic and even communitarian aspects.

Another concept to understand is the Open Standard. Not all standards from standards organizations are Open Standards under anyone’s definition, especially older standards and standards which involve semi-scientific research and development (compression patents, for example) where the IP holder would only license a vital technology under RAND or not at all. (There is some creep on what an Open Standard is, to conflate it with Open Source or free implementations.)

And it should go without saying that someone cannot grant a license to IP they do not themselves hold. So all covenants and licenses only extend as far as the material in question. This is important for extensible formats such as ODF and Open XML, because the ZIP container allows any kind of media or binary file.

See the IBM material for a definition of Necessary Claims and Required Portions.

Rick Jelliffe

Just when I thought I had escaped, I had a request yesterday from Microsoft to join in a call with a journalist from ZDNET Asia about a blog An open document standard for China. Preparing for this gave me a good chance to review the use of Native Language Markup in Open XML: the area is quite arcane so it is a good topic for a blog (good because you probably won’t get the information elsewhere and good because your feedback can help if I have missed something.) I have included some asides and personal background, probably not even of interest to my mother, in small print that you can skip.


The Peter Junge blog basically warms up Rob Weir’s Swiss cheese (hmm, something wrong with that phrase): impossible in its thrust (a single file format that can cope with all cases?), alarmist in general (“This kind of legacy is full of pitfalls for the open source developer.”), over-reaching in its analogies (see my Power plugs and low-hanging fruit), too strong in its conclusions (look at how “may” and “might” are used to say “will”) and misleading in its use of details (what has footnoteLayoutLikeWW8 (Emulate Word 6.x/95/97 Footnote Placement etc) to do with open-source developers in particular, especially since the spec gives the advice that “Typically, applications shall not perform this compatibility”? It is a flag, not a requirement, for goodness’ sake.)

Native Language Markup

Native Language Markup is the use in markup of names and symbols from the user’s native language. This implies the use of the user’s native script (characters). It is different from “natural language” because names in markup may still have artificial limitations (such as no spaces or apostrophes) or use contracted forms that would not appear in natural language.


Native Language Markup was a term I developed in the early 1990s, when Allette Systems gave me a project to figure out why SGML was not popular in Asian countries. I came back with various items, and collected them into the ERCS (Extended Reference Concrete Syntax): these included things like allowing native characters in tag names (SGML had large character set limitations for names then), hexadecimal numeric character references, the ability to reference any character by its Unicode number, and an initial set of the characters in Unicode that were suitable for use in markup. These were endorsed by a standards-related expert group, the CJK DOCP group, and when XML development started, were adopted into XML. This was recognized by a kind comment of Gavin Nicol in the 1999 Journal of Markup Theory and Practice: “The importance of native language markup, and the role the SGML declaration plays in an SGML system, are fairly well understood these days, partly due to the tireless efforts of Rick Jelliffe on the ERCS, and partly due to a lot of work done on HTML I18N (Internationalization).” Now, of course, I am not saying that I invented the idea that words you can read are more useful than words you cannot read! ERCS was a set of concrete technical proposals, and Native Language Markup is a name for the issue. Anyway, the bottom line is that this is a subject that I think is really important.

Native language markup has proved itself. Murata Makoto demonstrated at a conference last year how the Japanese government’s XML was using it, as was China’s UOF format. It is not just an issue of translation: many languages have terms which do not have a satisfactory English equivalent. Nor is transliteration a useful approach: many languages require a romanization system with accents or tone marks to be useful. A technology that does not allow non-ASCII characters imposes a burden on non-ASCII users and limits acceptance to the highly educated and foreign-literate.
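
For readers who have never seen it, native language markup looks like this in practice. The Python sketch below parses a small invented document whose element and attribute names are Japanese; the vocabulary is hypothetical and not taken from any of the formats mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using Japanese element and attribute names.
DOCUMENT = '<文書 言語="ja"><題名>報告書</題名></文書>'

root = ET.fromstring(DOCUMENT)
title = root.find("題名")

# Names in the user's own script behave like any other XML names.
print(root.tag, root.get("言語"), title.text)
```

Nothing special is needed on the processing side: XML names are defined in terms of Unicode characters, so an ordinary parser treats these names like any others.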

However, native language markup becomes inappropriate whenever there is a cross-over between language groups. Most Australians stop learning new characters about the age of 5 or 6; Chinese language markup is not easy for us! So for international standards for fixed schemas, there is no practical alternative but adopting ASCII and English wordings.

ISO/IEC JTC1 SC34 has recognized this, so as part of the IS 19757 Document Schema Definition Languages standard, there is a technology spearheaded by UK’s Martin Bryan called the Document Semantics Renaming Language (DSRL or Dis-rule). This is a convenient language (Martin has an XSLT implementation) that allows conversion of the markup in documents (or schemas, potentially) to and from different languages (as well as other uses.) Non-ASCII-using nations looking at adopting ISO standards should look at whether they should also adopt a DSRL mapping into native formats. For example, developers could work in the document using native language markup, then convert the document to the ISO standard form before shipping. Or, internally in a country, the localized form could be used, but translated to the international form for shipping. My belief is that DSRL should become a standard part of the XML processing chain, because it addresses all sorts of versioning and localization issues.
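
The idea of a renaming step can be sketched in a few lines of Python. To be clear, this is not DSRL syntax and not Martin’s implementation: the name map, the function names and the sample vocabulary are all hypothetical, and only element names are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from native-language element names to international ones.
NATIVE_TO_INTERNATIONAL = {"文書": "document", "題名": "title"}


def rename_elements(xml_text: str, name_map: dict) -> str:
    """Rename every element found in name_map; content and other names pass through."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = name_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


def invert(name_map: dict) -> dict:
    """Build the reverse mapping, for converting back to the localized form."""
    return {new: old for old, new in name_map.items()}


native = "<文書><題名>報告書</題名></文書>"
shipped = rename_elements(native, NATIVE_TO_INTERNATIONAL)
restored = rename_elements(shipped, invert(NATIVE_TO_INTERNATIONAL))
print(shipped)
print(restored)
```

Because the mapping is reversible, the same table serves both directions: localized for authoring, international for interchange.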

The evolution of standards

Character sets have posed a big problem for standards makers.

  • In the 1960s/1970s generation of technologies, 7-bit character sets were used: the ASCII/EBCDIC generation. Technology standards rooted in the 60s had to cope with 7-bit data transmission.
  • In the 1970s/1980s generation, communications systems moved from 7-bits to 8-bit clean systems. Typically with this generation and under the influence of the C programming language, instead of characters systems adopted a byte mentality, where a string was a sequence of bytes. Standards from this period naturally followed. However, because international data exchange was not important, the standards from this time pay no attention to identifying which character encoding was in use.
  • In the 1980s/1990s generation, attempts were made to extend the existing systems to cope with extra characters. This would involve adding overloading character escape mechanisms to allow character references to the local character set, or variable-width character sets which are ASCII compatible for single bytes but which allow multiple bytes for non-ASCII characters: UTF-8, Big 5, Shift-JIS are examples of standards that reflect this. These fitted into the constraints of 7-bit and 8-bit clean systems. However, the standards infrastructure was aimed at localization not internationalization: the advent of the PC retarded the reach of the internet initially, but by the advent of the WWW suddenly there was a world-wide data incompatibility problem: the standards and systems did not adequately support a way for resources to say which character set was used. An example of this was HTML forms: for a long time, there was no definition of which character encoding should be used when sending forms data. In the standards world of the time, there was a real split between the internationalists, who said that everyone should adopt Unicode, and the nationalists, who said that every country should adopt locally-optimized formats.
  • The 1990s/2000s generation is the XML generation. It has been recognized that internationalization needs to be pervasive (standards should have first class support), systematic (based on Unicode), and friendly (allow people to be conservative in what they send but generous in what they receive.) With XML, we defined XML in terms of Unicode characters, but allowed the user to use any encoding they wished: this was safe for data, because XML allows character references in terms of Unicode character numbers, and because the XML encoding header provided an effective in-band way to make sure that the character encoding of a document could be maintained. This effectively satisfied the requirements of both the nationalists and internationalists. Another example of this approach is the XML approach to URLs: in this case we deliberately went against the standard URL syntax to allow non-ASCII characters in system identifiers and namespace names, because native language markup is more important than compliance with that standard (or, at least, because conversion to ASCII-only transfer syntaxes should be a library function, not a document-writer’s job). Sometimes the existing standards are sub-optimal and have to be ignored, even by other standards!
  • One of the last links in the chain for documents came with the much delayed release of the IRI specification. This officially standardized the system that XML adopted (and the address bar of browsers naturally had been using) of extending URLs to allow any characters. Protocols that used URLs would still use percent delimited ASCII, and address bars would still display any character, but with IRIs it becomes easier for the standards world to specify exactly what is needed. Nevertheless, the terminology IRI has not become common yet, with the result that people often say URL when they actually mean IRI, and with the subsequent result that sometimes standards drafters write URL when they mean IRI.
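
The XML-generation point about character references and the encoding header can be seen with any XML library. Here is a minimal Python sketch, with an invented element and text: the same content is serialized as pure ASCII, as ISO-8859-1 with an encoding declaration, and as UTF-8, and all three parse back to the same Unicode string.

```python
import xml.etree.ElementTree as ET

TEXT = "naïve 日本語"  # one Latin-1 character plus three CJK characters


def serialize(encoding: str) -> bytes:
    """Serialize a one-element document in the requested character encoding."""
    element = ET.Element("title")
    element.text = TEXT
    # Characters the target encoding cannot represent become numeric character references.
    return ET.tostring(element, encoding=encoding)


ascii_bytes = serialize("us-ascii")     # everything non-ASCII becomes &#...; references
latin1_bytes = serialize("iso-8859-1")  # carries an encoding declaration; CJK still escaped
utf8_bytes = serialize("utf-8")         # characters stored directly

for data in (ascii_bytes, latin1_bytes, utf8_bytes):
    # Whatever the bytes on the wire, the parser recovers the same Unicode text.
    print(data, "->", ET.fromstring(data).text)
```

This is the sense in which the choice of encoding became safe for data: the encoding is a property of the transfer, not of the document’s meaning.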

Native Language Markup in Open XML

The Open XML schemas use ASCII and English wording. Anything else would be rejected at ISO of course. Data values, for content and attribute values, allow non-English characters. Typically this is formalized so that things that may appear on user interfaces (such as style names) have both a print name and an internal identifier: this allows documents to be localized as far as their user interface information but international as far as their internal identifiers: good for off-shore document processing for example.

Formulas in spreadsheets are an interesting area. In order to be user friendly, the function names of course need to be meaningful to users. However, a standard cannot contain every language variant (I am told that Word 2007 has over 100 different localized versions.) So Open XML takes the view that this is the application’s responsibility: the Spanish version of a spreadsheet can present the formula to the user using Spanish words for example, but the markup is generated with the common form. (This is a respectable option: dates are usually handled as 8601 format rather than localized forms, in international standards along the same lines.)
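
A rough Python sketch of that division of labour follows. The two-entry Spanish table and the function names are my own illustration, not anything defined by Open XML, and real localization also has to deal with argument separators and decimal marks, which this ignores.

```python
import re
from datetime import date

# Hypothetical display-name table for a Spanish user interface.
SPANISH_TO_COMMON = {"SUMA": "SUM", "PROMEDIO": "AVERAGE"}

# A function name is a run of letters (or dots) immediately followed by "(".
_FUNCTION_NAME = re.compile(r"[A-Za-zÁÉÍÓÚÑáéíóúñ.]+(?=\()")


def translate_functions(formula: str, name_map: dict) -> str:
    """Replace each known function name; unknown names are left untouched."""
    return _FUNCTION_NAME.sub(
        lambda match: name_map.get(match.group(0).upper(), match.group(0)),
        formula,
    )


def invert(name_map: dict) -> dict:
    """Reverse the table, for presenting stored formulas back to the user."""
    return {common: local for local, common in name_map.items()}


typed = "=SUMA(A1:A3)+PROMEDIO(B1:B3)"
stored = translate_functions(typed, SPANISH_TO_COMMON)
shown = translate_functions(stored, invert(SPANISH_TO_COMMON))
print(stored)
print(shown)

# Dates follow the same principle: one common stored form, localized on display.
print(date(2007, 8, 31).isoformat())
```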

IRIs

One area that deserves special attention, because it is so intricate, is the availability of IRIs in Open XML. This is an issue that has received a bit of attention, and was the issue in Peter Junge’s blog that has triggered this blog. The bottom line is that in the current draft text you can use any character for a relative IRI inside the package or to your file system (relative references) but for external references the current spec says the markup should use URL syntax.

Note that this does not mean that a URL on a user interface cannot use Chinese characters. Nor does it mean that Chinese characters cannot be percent encoded into a URL. This is an issue of Native Language Markup, at the software developer level.
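
For the record, the percent-encoding step is a small library function. The Python sketch below shows the general shape of mapping an IRI to URL syntax using percent-encoded UTF-8. It is a simplification of my own, not the algorithm in the draft’s annex: the host is assumed to be ASCII already (internationalized host names need separate treatment), and the example address is invented.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Characters left untouched: URI delimiters, plus "%" so existing escapes survive.
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters in the path, query and fragment as UTF-8."""
    parts = urlsplit(iri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # assumed to be ASCII here; real hosts need IDNA handling
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE + "?"),
        quote(parts.fragment, safe=_SAFE + "?"),
    ))


iri = "http://example.com/文档/report.xml"
uri = iri_to_uri(iri)
print(uri)           # http://example.com/%E6%96%87%E6%A1%A3/report.xml
print(unquote(uri))  # decodes back to the original IRI
```

So the Chinese characters are not lost, only spelled differently in the markup; the question is whether the developer or the library has to do the spelling.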

I suspect this is just a drafting error, and I am pretty certain that JIS (Japanese Industrial Standards) at least will call for its correction in the final text. It is a strong enough requirement to force a “conditional yes” vote. The current datatypes use anyURI.

There are more details on Open Packaging Convention below.

Let’s look at what Peter Junge’s blog said:

Another standard that Microsoft does not support, is the RFC 3987 specification, which defines UTF-8 capable Internet addresses. Consequently, OOXML does not support the use of Chinese characters within a Web address.

It is a textbook example of what is wrong with so much of the anti-Open XML material.

Let’s look at the first sentence. Now RFC 3987 is the IRI spec (which was co-authored by Michel Suignard of Microsoft). If you look at DIS 29500, Part 1, Annex A, Resolving Unicode Strings to Part Names, it has clauses such as “Creating an IRI from a Unicode string” and “Creating a URL from an IRI”. If you look at Part 1 Section 8.2.1, that annex is invoked. So the simple statement that Microsoft does not support it is incorrect. (The explanation of IRIs is technically pretty garbled, but it is not easy to express in a single phrase.) The second sentence is incorrect too: you can have Chinese characters in a web address, as long as they use URL syntax and are percent encoded.
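
A small Java sketch can show that a percent-encoded address still carries the Chinese characters. It uses the standard java.net.URI class, whose getRawPath() returns the path as written and whose getPath() returns it with the escaped octets decoded. The class name PercentEncodedAddress and the example.org address are assumptions made up for this illustration.

```java
import java.net.URI;

public class PercentEncodedAddress {
    /** Returns the path exactly as it appears in the URL (still escaped). */
    public static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    /** Returns the path with percent-encoded UTF-8 octets decoded. */
    public static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // An ASCII-only URL whose path encodes U+6587 U+6863.
        String url = "http://example.org/%E6%96%87%E6%A1%A3.docx";
        System.out.println(rawPath(url));
        System.out.println(decodedPath(url));
    }
}
```

Nothing is lost in the ASCII form: software that wants to show the characters to a user can recover them from the encoded address.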

Now there may be some way to weasel word this, so that it really says

Another standard that (DIS 29500) Microsoft does not support (in one case), is the RFC 3987 specification, which defines UTF-8 capable Internet addresses (internationalized WWW resource identifiers which map to ASCII based standard URLs using percent-encoded UTF-8). Consequently, (draft) OOXML does not support the (indirect) use of Chinese characters within an (external) Web address (in markup).

But an ordinary reader of a blog simply is not technically equipped to understand this. How anyone would write this if they ever had read the draft is beyond me. I mean that seriously. If you are writing a blog, and making comments about IRIs, how simple is it to download Part 2 of the spec, open it up in Acrobat or your PDF reader, and search for IRI?

Chinese Native Language Markup

I think the main trouble with Peter Junge’s blog comes from a misunderstanding of the ISO process and the position of voluntary standards. I don’t think he knows what a standard is.

When he says “I hope China will not support OOXML in its ISO voting, but force Microsoft to consider talks for one harmonized office document standard for the whole world.” it sounds nice and tough, but the ISO process is not geared to that kind of win/lose approach. In the ISO situation, when you find an error that can be fixed (such as this IRI mistake) you don’t throw the whole thing out: you point out the problem, propose a fix, and work together. Just because Open XML gets added to the library of voluntary standards at ISO, it does not mean that the Chinese national body is thereby forced to adopt Open XML in preference to UOF in any circumstance. Chinese businesses will be sending and receiving documents from overseas in formats outside the control of the standards bodies, and governments have little interest in making arbitrary restrictions on world trade nowadays; it is better for that data to be in a standard format than a non-standard one. Non-Chinese countries are not going to adopt UOF but still they will produce and receive documents: I am sure that the UOF people are entirely aware of this.

What the current generation of document standards (ODF, Open XML, UOF) does is expose all the different functionalities required. This is a great pre-requisite for getting Chinese and other requirements publicized.


Now the area of East Asian native language markup is one that is particularly important to me. I started off in SGML while working in Japan, I had a lot of contact with really wonderful East Asian experts because of my involvement in ERCS and CJK DOCP, and I ran the “Chinese XML Now!” project at Academia Sinica, Taipei in 1999/2000. This was a project (academic/practical, *not* political !!!) to try to work through issues relating to XML and Chinese. (The Chinese XML Now page is now old, and I hope there are much better sites now, but it did have a few million real hits as far as I could work out.) Schematron, now an ISO standard, came out of this work, because I wanted to develop a schema language that did not depend on tokenized grammar rules: lay Chinese understand their language in terms of characters, not words per se.


One part of this project was for me to represent Academia Sinica (*not* Taiwan) at various non-national level standards groups. One outcome was that in the XML Schema Working Group (representing Academia Sinica) I championed, and suggested the name for, anyURI, to allow better native language markup than URIs; the IRI standard was not available then.


ASCC’s reason for hiring me was a little shocking: my boss, a really incisive and surprising man, told me that Westerners on standards organizations do not listen to Asians (from Asia articulating Asian-only requirements), and so they wanted me to advocate for them (and for Chinese language requirements in general) because a white person would be more acceptable.


Now this is not really a claim of personal racism at all: it is partly due to language barriers, partly due to time zone and travel problems, partly due to the difficulty that people from respect-based cultures have in contention-based committee systems, the problem that people from seniority-based cultures have in expert committees, the problem that people from face-based cultures have in ad hoc discussions, and also the difficulty in getting up to speed with issues and procedures as a newcomer.


In the ISO SC34 committee on Document Processing and Description Languages, there has been an effort to schedule meetings in Asia (Korea last year), issues have to be tabled six weeks before meetings in order to prevent surprises and give people a chance to translate and discuss, and in general votes on important issues are not taken on the same day they are proposed, in order to allow consultation with national technical committees in different time zones. But nevertheless, learning how to operate effectively in committees dominated by Western-style relationships is a real difficulty. (From my recent trip to India, this clearly doesn’t apply to Indians! So I mean “Asian” in the Australian sense of mainly East and South East Asians, not in the UK sense of “Indic”.)


These kinds of things are good, yet there will, in my opinion, always be a difficulty there. Of course, Westerners will learn how to interact with Asians better, and Asians will learn how to participate with Westerners better. But it is up to the nations that use a particular language or script to work out their requirements and communicate them effectively. UOF is at minimum a good exemplar of this. The Japanese kinsoku rules are perhaps another example.

Open Packaging Conventions, ZIP and IRIs

Open XML Part 2 Open Packaging Conventions sets out all details of packaging in Open XML: the profile of ZIP to use, the part referencing system, digital signatu