August 2005 Archives

M. David Peterson

Related link: http://www.aspectxml.org

[UPDATE: 2005.09.07 05:32 p.m. MST]

I have gone back through this post and deleted all the existing “[UPDATE: …]” entries such that this post can simply flow from beginning to end. I still need to finish off the explanation of the weaving process and then follow-up with the next entry in this series, that of bringing atom feeds together with the concept of site assets that are woven into the result output using the AOP approach used in the AspectXML project. I plan to finish this post off and then start work on the next post, publishing it tomorrow as one complete tutorial instead of the bits and pieces here and there that this post became. Please keep an eye out for [Part 3] Assets, Atom Feeds, and AspectXML - The Triple Threat of Web Development? around this same time tomorrow.

[ORIGINAL POST]

[A Quick Note: Saxon 6.5.4 was used to transform the files in this post. When we move to a true AspectXML implementation we will be using Saxon 8.5.1, and some changes will need to be made to our element naming practices to keep within the guidelines of the current XSLT 2.0 Working Draft specification. At the moment this is all conceptual, but when we move into the real-world implementation next week our element naming method will need to change.]

In developing this post I realized there needs to be a separation of concepts. The first concept covers the idea of weaving together separate data, layout definition, and output definition files into a final output. Once this concept is understood, it is then much easier to understand how implementing AOP-based concepts can take things to the next level, bringing incredible power with very little additional code with each new project.

To start off, let’s dive straight into some code: [NOTE: I have zipped up the code files and made them available at http://www.aspectxml.org/downloads/dataWeave.zip]

File One: page.xml


<page>
  <layout>
    <template name="master"/>
    <output type="table"/>
  </layout>
  <content>
    <header>This is the header</header>
    <left_navigation>This is the left navigation</left_navigation>
    <content>This is the main content area</content>
    <footer>This is the footer</footer>
  </content>
</page>

Explanation of File One: page.xml

In a nutshell, page.xml represents a single page within any given website. It tells our transformation file which template to use during the transformation process, what type of markup should be output, and the content of each section. The name of each content element is matched against the ‘source’ attribute of the ‘row’ or ‘data’ elements contained within the specified template.
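To make the name matching concrete, here is a minimal Python sketch. Python is used purely for illustration, since the post itself does this in XSLT, and the function name `content_by_name` is hypothetical. It parses the page.xml shown above and builds a lookup from each child of /page/content to its text, which is the value a ‘source’ attribute would select.

```python
import xml.etree.ElementTree as ET

# The page.xml listing from this post, inlined so the sketch is self-contained.
PAGE_XML = """
<page>
  <layout>
    <template name="master"/>
    <output type="table"/>
  </layout>
  <content>
    <header>This is the header</header>
    <left_navigation>This is the left navigation</left_navigation>
    <content>This is the main content area</content>
    <footer>This is the footer</footer>
  </content>
</page>
"""


def content_by_name(page_root):
    """Map each child element name of /page/content to its text."""
    return {child.tag: (child.text or "").strip()
            for child in page_root.find("content")}


page = ET.fromstring(PAGE_XML)
template_name = page.find("layout/template").get("name")
output_type = page.find("layout/output").get("type")
sections = content_by_name(page)

print(template_name, output_type)
print(sections["left_navigation"])
```

A layout element with source="left_navigation" would then simply look up `sections["left_navigation"]`.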

The true goal of page.xml is to be as completely content-centric as possible, leaving the details of things like size, color, etc… to the chosen template. By separating these specific details we can, in essence, build on top of a system that is common within our development workplace where one department, usually marketing, can focus on the content, specify what type of format that content should be presented in, and where each specific piece of content should go within a page. By giving ownership of this file to the marketing department (or whichever department happens to own the content for your given website(s)), the art department and the development department can focus on their own core agendas without concern as to what the content will be or how it’s going to be integrated into the system once the content is complete, approved, and ready to be merged.

Actually, looking at this, maybe a better usage of the output/@type attribute would be ‘web’, or ‘print’ or something to that effect. But that takes all of two changes, one on page.xml, and one on outputDefinition.xml as you will see below. In reality the names really don’t matter… actually, a better way to say that is names actually do matter quite a bit; but while one word or phrase might work really well for you and your marketing folks, the same may not be true within another company that just so happens to be based in Bangladesh. As such, keeping this flexible was an important consideration in its design.

File Two: layoutDefinition.xml


<templates>
  <template name="master">
    <layout>
      <object>
        <row>
          <column>
            <row source="header"/>
            <row>
              <column x="175px" y="auto">
                <data source="left_navigation"/>
              </column>
              <column x="600px" y="auto">
                <data source="content"/>
              </column>
            </row>
            <row source="footer"/>
          </column>
        </row>
      </object>
    </layout>
  </template>
</templates>

Explanation of File Two: layoutDefinition.xml

Here’s where the art department comes into play. It is very common within the systems we have in place for the art department to have very specific needs in regards to size, shape, color, etc… And for good reason. Their job is to make things look as good as they possibly can, and they need the type of control given to them to allow them to make these specifications without the need to know the various markup languages as well.

In today’s world HTML and CSS are fairly well known within the art department folks. But as we move towards a world where web sites become more application, or speaking in more web-centric terms, weblication specific, this is not always going to be the case in regards to their knowledge of markup languages such as XUL, XAML, in some ways SVG (although vector graphics are something the art folks probably have a better understanding of than us hacker folks). And when it comes to binding together the various pieces of a weblication through various code-behind and other binding mechanisms the separation gap is only going to increase.

As you can see I chose to use a standard ‘column’ ‘row’ layout architecture, with a ‘data’ element thrown in to specify when something is data specific instead of layout specific. Keep in mind two things. First, this was a quick “proof of concept” throw-together. Second, while most people can easily understand the concepts of columns and rows, I am not totally convinced that this is the best overall approach, as it doesn’t allow for a UI language as rich as what we find in XHTML, XAML, XUL, and SVG. While I recognize that requiring the art department to learn an entire UI markup language goes against the entire purpose behind the separation of layoutDefinition.xml and outputDefinition.xml, I do believe there can exist a happy medium: the ability to get a little more specific as to what goes where and what it should look like, alongside a simple, easy-to-understand UI language that allows limitless expression without the requirement to become a UI-markup hacking guru.

In fact, through the recent (yesterday, as a matter of fact) influence and suggestion of both Russ and Sylvain, followed up with positive comments from Don, Uche, and Kurt, I have decided to revitalize my WWULF project (World Wide Ubiquarian Lingua Franca), part of which is focused on developing exactly this kind of UI language, one that can in turn be transformed into any mainstream markup language such as XHTML/CSS, XUL, XUL/SVG, SVG/CSS, and XAML, depending on the client/device making the request and just what that client/device has the capacity to render and implement in regards to client-side functionality.

While I had every intention to pick this project back up at some point in the not too distant future, with the current onslaught of UI-markup languages coming at us from every direction, coupled with the somewhat “Graphic Dev Tool” requirement of SVG, now seems like the perfect time to put some effort into this project. Expect to hear more about this in the not too distant future.

File Three: outputDefinition.xml


<output>
  <definition type="table">
    <element name="object">
      <translation value="table"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="column">
      <translation value="td"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="row">
      <translation value="tr"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="data">
      <translation value="p"/>
    </element>
  </definition>
  <definition type="div">
    <element name="object">
      <translation value="div"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="column">
      <translation value="div"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="row">
      <translation value="div"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
  </definition>
</output>

Explanation of File Three: outputDefinition.xml

In regards to the outputDefinition.xml file, let me first state this is a very simple implementation meant only to showcase the general ideas behind such a file, what it contains, and why. This particular file structure is where we hackers come into play. This is our playground, so to speak, as this is where we take the layoutDefinition.xml file that contains the various template definitions available to the marketing folks and, based on the desired output format (again, specified by the marketing folks), match the elements contained in the layoutDefinition.xml file to their proper translation for the various possible output formats.

As mentioned in the last paragraph of the page.xml explanation, the names I chose for definition/@type really should be more specific to the output device or device platform (e.g. device=PC platform=WindowsXP/Firefox) or, at the very least, focused on web, print, etc… Such names do work well for alternative CSS stylesheets embedded within the head tag of our HTML files, but that is because the device/platform has already been accounted for by that point. The web or print declaration on an alternative stylesheet is a way of further specifying the layout, colors, font-size, etc… when the end user wants to print a web page in a more printer-friendly format, or view it on something other than your standard XGA (1024×768) monitor, which at the moment, I believe, happens to be the most common resolution in use.

This of course is way off topic. Focusing more on the file itself, you will notice the output element contains two definition child elements, each with the aforementioned type attribute. Below this you will notice the numerous ‘element’ children, each with a name attribute that is compared against the current element in focus when processing the layoutDefinition.xml file. Looking at this now, I realize that the ‘translation’ child element is really unnecessary: I could just as easily have made the translation an attribute of its ‘element’ parent, set to what is currently the ‘value’ attribute of the ‘translation’ element. (Way too confusing a sentence, which is reason enough to fix this problem. ;)

But in regards to understanding the weaving process, this is really a moot point, especially given the fact that all of this will be changing fairly drastically when we bring AspectXML into the picture. So, for now, please disregard the obvious bad usage format. While it’s definitely still capable of showcasing the weaving process, my apologies for its confusing nature.

The portion of this file to truly take note of is the name attribute of the ‘element’ node and this attribute’s value. You will notice the ‘row’, ‘column’, and ‘data’ values contained in this attribute, the same element names we find in our layoutDefinition.xml file. It should be fairly obvious that our stylesheet compares the value of the name attribute to the name() of the current element being processed within the layoutDefinition.xml file, and then looks up the proper translation for the output.

You will also notice the attributes/attribute elements contained below the translation element. It will become more obvious when we walk through the transformation file how these come into play, but in regards to the attribute names and values contained within the ‘attribute’ elements, as already mentioned, this is really how I should have set up the ‘element’ node for translation, using the ‘name’ attribute to match against, and the ‘value’ attribute as what it should be translated to. But, again, this is more conceptual than actual implementation, so hopefully you can forgive me for my obvious oversight of the way things should have been.
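The lookup described in the last two paragraphs can be sketched in a few lines of Python. This is only an illustration of the idea, not the stylesheet’s actual mechanism; the function name `load_translations` is hypothetical, and the inlined XML is a trimmed copy of outputDefinition.xml containing just the ‘column’ and ‘data’ entries.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of outputDefinition.xml (only two 'element' entries kept).
OUTPUT_DEF_XML = """
<output>
  <definition type="table">
    <element name="column">
      <translation value="td"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="data">
      <translation value="p"/>
    </element>
  </definition>
</output>
"""


def load_translations(output_root, output_type):
    """Return {layout name: (output tag, {layout attr: output attr})}."""
    table = {}
    for definition in output_root.findall("definition"):
        if definition.get("type") != output_type:
            continue
        for element in definition.findall("element"):
            tag = element.find("translation").get("value")
            attrs = {a.get("name"): a.get("value")
                     for a in element.findall("attributes/attribute")}
            table[element.get("name")] = (tag, attrs)
    return table


translations = load_translations(ET.fromstring(OUTPUT_DEF_XML), "table")
print(translations["column"])
print(translations["data"])
```

Keyed this way, a ‘column’ element with an ‘x’ attribute resolves to a ‘td’ element with a ‘width’ attribute, which is the same pairing the ‘name’ and ‘value’ attributes express in the XML.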

File Four: dataWeave.xsl


<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:variable name="output" select="/page/layout/output/@type"/>
<xsl:variable name="layoutDef" select="document('layoutDefinition.xml')"/>
<xsl:variable name="outputDef" select="document('outputDefinition.xml')/output/definition[@type = $output]"/>
<xsl:variable name="template" select="/page/layout/template/@name"/>
<xsl:variable name="content" select="/page/content"/>

<xsl:strip-space elements="*"/>

<xsl:output method="xml" indent="yes"/>

<xsl:template match="/">
  <xsl:apply-templates select="$layoutDef/templates/template[@name = $template]/layout/*"/>
</xsl:template>

<xsl:template match="*">
  <xsl:variable name="name" select="name()"/>
  <xsl:variable name="source" select="@source"/>
  <xsl:variable name="content" select="$content/*[name() = $source]"/>
  <xsl:variable name="outputDef" select="$outputDef/element[@name = $name]"/>
  <xsl:element name="{$outputDef/translation/@value}">
    <xsl:apply-templates select="@*[not(name() = 'source')]" mode="attMatch">
      <xsl:with-param name="attributes" select="$outputDef/attributes"/>
    </xsl:apply-templates>
    <xsl:if test="$source"><xsl:value-of select="$content"/></xsl:if>
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>

<xsl:template match="@*" mode="attMatch">
<xsl:param name="attributes"/>
<xsl:variable name="currentAttName" select="name()"/>
<xsl:variable name="translatedAttName" select="$attributes/attribute[@name = $currentAttName]/@value"/>
  <xsl:attribute name="{$translatedAttName}"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>

</xsl:stylesheet>

Before we get into the how, let’s look at what happens when you apply dataWeave.xsl to page.xml with the output type set to ‘table’:


<?xml version="1.0" encoding="UTF-8"?>
<table>
   <tr>
      <td>
         <tr>This is the header</tr>
         <tr>
            <td width="175px" height="auto">
               <p>This is the left navigation</p>
            </td>
            <td width="600px" height="auto">
               <p>This is the main content area</p>
            </td>
         </tr>
         <tr>This is the footer</tr>
      </td>
   </tr>
</table>

So what just happened? To understand this we will need to walk through the transformation file step by step.
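As a rough companion to the stylesheet, here is a minimal Python sketch of the same weave. It is an illustration under stated assumptions rather than a port of dataWeave.xsl: the function name `weave` is hypothetical, the three files are inlined in compact form, unused attribute mappings are trimmed from the output definition, and the layout is assumed to have a single root element.

```python
import xml.etree.ElementTree as ET

PAGE = ET.fromstring("""<page>
  <layout><template name="master"/><output type="table"/></layout>
  <content>
    <header>This is the header</header>
    <left_navigation>This is the left navigation</left_navigation>
    <content>This is the main content area</content>
    <footer>This is the footer</footer>
  </content>
</page>""")

LAYOUTS = ET.fromstring("""<templates>
  <template name="master"><layout>
    <object><row><column>
      <row source="header"/>
      <row>
        <column x="175px" y="auto"><data source="left_navigation"/></column>
        <column x="600px" y="auto"><data source="content"/></column>
      </row>
      <row source="footer"/>
    </column></row></object>
  </layout></template>
</templates>""")

OUTPUTS = ET.fromstring("""<output>
  <definition type="table">
    <element name="object"><translation value="table"/></element>
    <element name="column"><translation value="td"/>
      <attributes>
        <attribute name="x" value="width"/>
        <attribute name="y" value="height"/>
      </attributes>
    </element>
    <element name="row"><translation value="tr"/></element>
    <element name="data"><translation value="p"/></element>
  </definition>
</output>""")


def weave(page, layouts, outputs):
    """Walk the chosen layout and emit translated elements with content."""
    out_type = page.find("layout/output").get("type")
    tmpl_name = page.find("layout/template").get("name")
    content = {c.tag: (c.text or "").strip() for c in page.find("content")}
    definition = next(d for d in outputs.findall("definition")
                      if d.get("type") == out_type)
    rules = {e.get("name"): e for e in definition.findall("element")}
    template = next(t for t in layouts.findall("template")
                    if t.get("name") == tmpl_name)

    def translate(node):
        rule = rules[node.tag]  # match on the layout element's name
        result = ET.Element(rule.find("translation").get("value"))
        attr_map = {a.get("name"): a.get("value")
                    for a in rule.findall("attributes/attribute")}
        for name, value in node.attrib.items():
            if name != "source":  # 'source' selects content, it is not output
                result.set(attr_map[name], value)
        source = node.get("source")
        if source:
            result.text = content[source]
        for child in node:
            result.append(translate(child))
        return result

    return translate(template.find("layout")[0])


woven = weave(PAGE, LAYOUTS, OUTPUTS)
print(ET.tostring(woven, encoding="unicode"))
```

The recursion in `translate` plays the role of the match="*" template: look up the translation by element name, rename the remaining attributes, pull in content when a ‘source’ attribute is present, then descend into the children.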

[IN PROGRESS]


[EXTENSION TO POST]

First the file updates:

page.xml


<page xmlns:join="http://aspectxml.org/atom/join">
  <layout>
    <template name="master"/>
    <output type="table"/>
  </layout>
  <content>
    <header>This is the header</header>
    <left_navigation>This is the left navigation</left_navigation>
    <content>This is the main content area.
      <join:atom src="http://www.xsltblog.com/atom.xml" type="dropdown">This is a joined atom feed</join:atom>
    </content>
    <footer>This is the footer</footer>
  </content>
</page>

when transformed with:

dataWeave.xsl


<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:join="http://aspectxml.org/atom/join"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  version="1.0"
  exclude-result-prefixes="join atom dc">

<xsl:variable name="output" select="/page/layout/output/@type"/>
<xsl:variable name="layoutDef" select="document('layoutDefinition.xml')"/>
<xsl:variable name="outputDef" select="document('outputDefinition.xml')/output/definition[@type = $output]"/>
<xsl:variable name="template" select="/page/layout/template/@name"/>
<xsl:variable name="content" select="/page/content"/>

<xsl:strip-space elements="*"/>

<xsl:output method="xml" indent="yes"/>

<xsl:template match="/">
  <xsl:apply-templates select="$layoutDef/templates/template[@name = $template]/layout/*"/>
</xsl:template>

<xsl:template match="*">
  <xsl:variable name="name" select="name()"/>
  <xsl:variable name="source" select="@source"/>
  <xsl:variable name="content" select="$content/*[name() = $source]"/>
  <xsl:variable name="outputDef" select="$outputDef/element[@name = $name]"/>
  <xsl:element name="{$outputDef/translation/@value}">
    <xsl:apply-templates select="@*[not(name() = 'source')]" mode="attMatch">
      <xsl:with-param name="attributes" select="$outputDef/attributes"/>
    </xsl:apply-templates>
    <xsl:if test="$source"><xsl:apply-templates select="$content"/></xsl:if>
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>

<xsl:template match="@*" mode="attMatch">
<xsl:param name="attributes"/>
<xsl:variable name="currentAttName" select="name()"/>
<xsl:variable name="translatedAttName" select="$attributes/attribute[@name = $currentAttName]/@value"/>
  <xsl:attribute name="{$translatedAttName}"><xsl:value-of select="."/></xsl:attribute>
</xsl:template>

  <xsl:template match="text()">
   <xsl:value-of select="."/>
  </xsl:template>

  <xsl:template match="join:atom[@type = 'dropdown']">
    <xsl:variable name="textValue" select="."/>
    <xsl:apply-templates select="document(@src)" mode="dropdown">
      <xsl:with-param name="textValue" select="$textValue"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="atom:feed" mode="dropdown">
    <xsl:param name="textValue"/>
    <xsl:value-of select="$textValue"/>
    <a onclick="if (document.getElementById('childMenu').style.display == 'none') document.getElementById('childMenu').style.display = 'block';
      else document.getElementById('childMenu').style.display = 'none';
      return true;">
      <xsl:value-of select="$textValue"/>
      <ul id="childMenu" style="display:none">
        <xsl:apply-templates select="atom:entry"/>
      </ul>
    </a>
  </xsl:template>

  <xsl:template match="atom:entry">
    <li><a href="{atom:link/@href}"><xsl:value-of select="atom:title"/></a></li>
  </xsl:template>
</xsl:stylesheet>

will result in:


<?xml version="1.0" encoding="utf-8"?>
<table>
   <tr>
      <td>
         <tr>This is the header</tr>
         <tr>
            <td width="175px" height="auto">
               <p>This is the left navigation</p>
            </td>
            <td width="600px" height="auto">
               <p>This is the main content area.
      <a onclick="if (document.getElementById('childMenu').style.display == 'none') document.getElementById('childMenu').style.display = 'block';       else document.getElementById('childMenu').style.display = 'none';        return true;">
                     <ul id="childMenu" style="display:none">
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/_part_2_assets_1.html"> [Part 2] Assets, Atom Feeds, and AspectXML - The Triple Threat of Web Development?</a>
                        </li>

                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/past_present_an.html">Past, Present, and Future of Mozilla SVG</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/siteredesign_in.html">Site-Redesign in Mid-Stream... Please excuse the non functionality and, even more so, the currently 'ugliness'...  I promise it will get better :)</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/gizmo_project_y.html">Gizmo Project: Yet Another Skype Wanna-Be?  First Take: This Is What Skype Should 'Wanna-Be'</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/ann_saxon_851_i.html">[ANN] Saxon 8.5.1 is available [via saxon-help]</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/this_belongs_in_1.html">This belongs in 'Code of the Day' but in reality should be *AT LEAST* Code of the Month, if not Year</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/c_30_and_the_fu_1.html">C# 3.0 and the future of the CLI</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/i_love_it_when.html">I Love it When Opportunity Knocks; Especially When It Knocks On the Door of the House You Moved Into A Month or Two Back, Giving You Just Enough Time to Unpack, Get Settled, and Have The Sara Lee Ready and Waiting</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/via_my_first_or_1.html">via My First O'Reilly Blog Post | Assets, Atom Feeds, and AspectXML - The Triple Threat of Web Development?</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/interesting_com_1.html">Interesting Comments from DonXML Regarding Acrylic</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/a_very_light_bl.html">A very light blogging week</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/via_bill_de_har.html">via Bill de hÓra | One-Click HTML?</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/via_ebay_i_now.html">via EBay | I Now Know What I Am Going To Get Edd Dumbill For Christmas [Link Source: Engadget]</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/via_tim_bray_in.html">via Tim Bray | In the Works Tim...  No, You Are Not the Only One Who has Connected the Dots</a>
                        </li>
                        <li>
                           <a href="http://www.xsltblog.com/archives/2005/08/im_not_sure_wha.html">I'm not sure what it may have been that Kurt stated in his Keynote at SVG 2005, but apparently someone is upset enough to have invoked a DOS attack on the transcript he posted earlier</a>
                        </li>
                     </ul>
                  </a>
               </p>
            </td>
         </tr>
         <tr>This is the footer</tr>
      </td>
   </tr>
</table>


We actually need to perform some CSS magic using an external CSS file to make the above unordered list appear as a menu, as opposed to a standard unordered list. But this is less important at the moment than understanding how we got to this point in the first place. So I plan to focus on this part of the tutorial first, at which point we can work on turning the above into something much more usable on the web.
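For readers who want to see the feed-to-list step in isolation, here is a minimal Python sketch of what the atom:feed and atom:entry templates do. It is an illustration only: the function name `feed_to_menu` is hypothetical, and the sample feed is invented so the example runs without fetching anything over the network.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample feed, standing in for the document fetched from @src.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>First post</title>
    <link href="http://example.org/first.html"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="http://example.org/second.html"/>
  </entry>
</feed>"""


def feed_to_menu(feed_xml, menu_id="childMenu"):
    """Build a hidden <ul> with one linked <li> per atom:entry."""
    feed = ET.fromstring(feed_xml)
    menu = ET.Element("ul", {"id": menu_id, "style": "display:none"})
    for entry in feed.findall("atom:entry", ATOM_NS):
        item = ET.SubElement(menu, "li")
        href = entry.find("atom:link", ATOM_NS).get("href")
        link = ET.SubElement(item, "a", {"href": href})
        link.text = entry.findtext("atom:title", default="",
                                   namespaces=ATOM_NS)
    return menu


menu = feed_to_menu(SAMPLE_FEED)
print(ET.tostring(menu, encoding="unicode"))
```

As in the stylesheet, the entry elements have to be addressed through the Atom namespace; an unqualified search for ‘entry’ would find nothing.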

[IN PROGRESS]

Are there any particulars of the above code that seem strange or difficult to understand without having a complete step-by-step walkthrough of the transformation file? If so, please post them as comments and I will be sure to put special emphasis on them as I continue forward with this post throughout this evening.

Dan Zambonini

I’ve re-worked the ‘Is programming art?‘ question: If programming was music, which genre would represent each programming language? I’d like your suggestions below, but here’s a couple to get you started:

Perl

Perl would undoubtedly be Jazz. Freeform Jazz, in fact. None of this easy listening, “I didn’t think that I liked Jazz but it’s not that bad” kind of Jazz. No siree — it would be the crazy stuff that makes you want to throw your CD player into industrial acid. Take a look at this:

  $h = 1 if (@Files <= 1 && !(($d && $Dirs_specified) || $r || $X));
  $reset = 1 if (((@Files > 1) || $X) && $Perlexpr =~ /[$@%][a-z]/ && !$x);
  $h = 0 if $H;
  if ($c || $l || $L || $O || $q || $Z) {
    $a = 1;
    $b = $N = $S = $T = 0;
  }
  $y = 1 if (($l || $L || $q) && !$No_slurp);
  $N = 0 if $T;
  $F = 0 if ($F && $Perlexpr !~ /\bF\b/);
  $P = 0 if ($P && $Perlexpr !~ /\bP\b/);

[Image: Jazz musical score]

I’m guessing you didn’t even notice that I switched from a CPAN Perl extract into some Jazz musical score.

I might start using the term Perl Jazz more often to describe this kind of code (not to be confused with Perl Poetry), although it might already be the name of an unwholesome movie; I’ll need to check (if you see me looking at strange websites, I’m researching this).

It can’t be long until someone creates software that lets you write your Perl scripts in Musical Score notation. Or maybe even digitise the output of a saxophone into Perl. You heard it here first. (Horrible thought: Imagine working in an office where teams of programmers spent all day writing Perl applications through mass Jazz saxophone recitals. That should make you feel better about your current working environment.)

Assembly Language

Given the age of Assembly Language, it is perhaps surprising that its closest musical relative is quite recent – Dance music. You could be forgiven if you thought that some Dance tracks had been inspired by code extracts from an assembler book:

mov, mov, jump, jump, mov, push, push

By the way, I’m trying to establish a new dance sub-genre. Given the success of House and Garage, I’m hoping to continue the DIY theme and be the first successful artist in the Tongue and Groove category. I think it could be big, although yet again, I’m going to need to check if the name has been taken by an unsavory movie.

Any Others?

Let me know if you can think of any others. They have to be music genres though, not performing artists. As much as I’d like to compare ASP.NET to Celine Dion and Java to Meatloaf, it’s outside the rules of this challenge.

Bob DuCharme

Tim Bray recently pointed out Roger Sperberg’s mention of links from a Maureen Dowd op-ed piece in the New York Times. While I’ve seen the Times turn URLs mentioned in articles into links before, these were links using inline text phrases as link anchors. For example, in a piece titled Hey, What’s That Sound?, she includes the phrase “Newt Gingrich told Adam Nagourney and David D. Kirkpatrick for a Times article on G.O.P. jitters about the shadow of Iraq” and another saying “The man who won a Nobel Peace Prize for making a botched exit and humiliating defeat look like a brilliant act of diplomacy wrote an op-ed article in The Washington Post drawing the analogy the White House dreads…” Each links to the referenced article. (Two notes about access to nytimes.com: free registration is required, but even then free access to specific pieces seems to only last a week, which is why I’ve reproduced such long quotes here. If you’re reading this before September 3rd, see this more recent Dowd column for more linking examples.) It’s interesting that Dowd created both a link to an article in her own newspaper and another link to an article in a different newspaper. The link to the Washington Post had a target="_0" attribute in it, and that was the only extra metadata in either link.

I have many questions, and would have written about this earlier except for my unsuccessful efforts to find someone at the Times to answer them: How long have they been doing this? How does Maureen Dowd indicate where she wants her links and where they should link to? Is she using a tool that creates (X)HTML, or did she describe the links she wanted in a cover letter to her editor? When did they decide on the style of linking to Times articles in the same window and outside articles in a separate window? What other styles do they have in place? (The Times is very big on defining and enforcing consistent styles.)

A Times article speculating on Google’s plans has two more kinds of links. Mentions of some companies link to a script that redirects you to a Times-branded version of a Dow Jones Marketwatch page about that company. (I’d provide you with a link to a sample one, but when I strip my ID information from the URL, the link no longer works.) The names Google, Microsoft, Yahoo, and Apple all get such links, but the company names T-Mobile and General Magic don’t—I assume because they don’t have MarketWatch pages.

T-Mobile’s product Sidekick and the term Wi-Fi, when used in the article, are yet another kind of link. The Wi-Fi link triggers queries of two New York Times databases and displays a web page constructed from the results, with articles from the technology section listed first and relevant product reviews “in partnership with CNET” listed second. The reviews, of course, each include a “where to buy” link. (I didn’t have to modify the URLs to add them to this paragraph, because there’s no reason for the Times to restrict access to screens that may lead to potential new revenue.)

The Sidekick link is a variation on the Wi-Fi one: it queries the “Products” database, but not the articles one. By comparing the two URLs and tweaking the Sidekick one, I easily created a link that queried both databases for “Sidekick”. So, someone somewhere made a conscious decision to have the “Wi-Fi” link query both databases but the Sidekick link query only one.

In addition to the various design decisions that were made, business model decisions for nytimes.com were also made as they worked out a linking structure to let customers navigate information and maybe spend some money. The decisions that were made were interesting enough that I just went back through what I’ve written here and removed three uses of the word “interesting.” I’m sure that watching the old gray lady continue to make and evaluate these decisions about linking will continue to be interesting.

Jim Alateras

I am currently working on a project that is using an active object model (also known as adaptive/dynamic object model) in its architecture. An active object model provides runtime extensibility of the object model. The core domain object model can be very generic and a declarative approach can be used to define specific domain object types. Essentially you can define new types without making any programming changes.

The Active Object Model web site provides plenty of information on the subject. Ralph Johnson’s paper on The Dynamic Object Model Architecture provides a solid introduction covering a number of common design patterns (i.e., type-squared and type-cubed) used for developing such a model. Developing these architectures can be more challenging but they can add longevity and flexibility to your object model.
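To give a feel for the idea, here is a minimal Python sketch of a runtime-defined type, loosely in the spirit of the patterns mentioned above. It is an illustration only: the class names `PropertyType`, `EntityType`, and `Entity` and the ‘Patient’ definition are hypothetical and are not taken from the referenced papers or from OpenEHR.

```python
class PropertyType:
    """Describes one property: its name and the type its values must have."""

    def __init__(self, name, value_type):
        self.name = name
        self.value_type = value_type


class EntityType:
    """A domain type defined as data at runtime instead of as a class."""

    def __init__(self, name, property_types):
        self.name = name
        self.property_types = {p.name: p for p in property_types}

    def create(self, **values):
        return Entity(self, values)


class Entity:
    """An instance whose allowed properties come from its EntityType."""

    def __init__(self, entity_type, values):
        self.entity_type = entity_type
        self._values = {}
        for name, value in values.items():
            self.set(name, value)

    def set(self, name, value):
        prop = self.entity_type.property_types.get(name)
        if prop is None:
            raise KeyError(f"{self.entity_type.name} has no property {name!r}")
        if not isinstance(value, prop.value_type):
            raise TypeError(f"{name} must be {prop.value_type.__name__}")
        self._values[name] = value

    def get(self, name):
        return self._values[name]


# A declarative definition; in a real system this might come from a file.
DEFINITIONS = {"Patient": {"name": str, "age": int}}

registry = {
    type_name: EntityType(type_name,
                          [PropertyType(n, t) for n, t in props.items()])
    for type_name, props in DEFINITIONS.items()
}

patient = registry["Patient"].create(name="Ada", age=36)
print(patient.get("age"))
```

Adding a new domain type here means adding an entry to `DEFINITIONS`, with no new class to write, which is the kind of extensibility described above.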

Adaptive object models are commonly used in medical applications and the folks at OpenEHR are doing some great work both on the specification and implementation fronts. On the standards side they have released the Archetype Definition Language (ADL), which defines a mechanism for declaring archetypes based on a reference information model. They have also released an ADL Parser and a reference implementation, both in Java. Although parts of the implementation are specific to the Electronic Health Record industry, it is still useful for other applications.

I don’t believe that the ADL specification will be widely adopted outside the EHR domain (I could be wrong, of course) and a specification based on more widely used specifications, such as OWL/RDF, may have been a better approach. The article, Mapping Archetypes to OWL, describes an approach for mapping ADL constructs to equivalent OWL constructs. Currently, I don’t have the background to understand how far we can go with OWL and tools like Jena, but I will look into it when I have some spare cycles.

David A. Chappell

Related link: http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci111840…

Another interview/writeup on the subject of the Apache Synapse proposal has appeared on the SearchWebServices.com web site -
http://searchwebservices.techtarget.com/originalContent/0,289142,sid26_gci1118404,00.html
Dave

David A. Chappell

Related link: http://www.ddj.com/documents/s=9824/ddj050822pc/

I recently did an interview with Dr. Dobb’s Journal on the subject of Web services, Apache Axis, and the newly proposed Synapse project. The interview was recorded and is available on their site as a podcast -
http://www.ddj.com/documents/s=9824/ddj050822pc/

Dave

M. David Peterson

Related link: http://www.aspectxml.org

[UPDATE: I am currently working on the follow-up post, the second in a planned 4-5 post series. I have also been pointed at the TIBET project and have since been in contact with the TIBET folks about gaining early access to the bits, which apparently are in the final testing phase for release in the next 2-4 weeks. They are looking at finding a way to give me access and plan to get back to me ASAP. I would REALLY like to include the work from this project as part of this series, and if I can gain access to these bits before the end of the business day, I will hold things off for a day so that I can play around with TIBET and see if it is in a state that brings benefit to this series.

At the end of the business day (about 7 hours from now where I am located, Mountain Standard Time, USA) I will either post the next entry of this series, or an update stating that I have gained access to the bits and will instead post it tomorrow, hopefully with inclusion of the TIBET bits to whet your appetite.]

[UPDATE: [2005.09.01 12:13pm MST, USA] As you may have already noticed, the second part of this series was posted yesterday. I am currently working on the explanation of how everything works, but for those of you familiar with XSLT, the code is in place and should, for the most part, be pretty self-explanatory. Expect an updated post with full explanations in the next hour to hour and a half.]

[ORIGINAL POST]
Whether or not we immediately realize it, our company websites, corporate and personal blogs, community and personal websites, discussion forums, listservers, etc. all contain common elements that appear on a regular basis. And yet in most cases we do not take the time to, in essence, externalize these assets so that they can be updated in one place when simple things change, such as:

- hyperlinks
- images
- an individual’s contact information:
– their telephone
– email
– home address
– etc…

We instead choose to update these, in many cases, by hand. Yet the need to make changes to each instance of these assets throughout the entire site is simply unnecessary and furthermore can easily be automated and made part of the standard build process.

How?

Enter Assets, Atom Feeds, and AspectXML.

In many ways, if we were to catalog the items that tend to be fairly common within our web domains, along with anything related to them (e.g. photographs, hyperlinks to various related places on the web), and then store these items in separate Atom data feeds, with an entry for each related item, published and made easily accessible, we could make maintaining the content of a web site mere child’s play.

Scenario 1: A product image changes: Update the data feed, process the site using AspectXML and the predefined weaving templates, and anywhere within the site this image is used will automatically be updated to point to the proper image (as well as the proper hyperlink if that has changed as well)
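As a rough illustration of Scenario 1, the Ruby sketch below replaces asset markers in pages with markup built from a single catalog. It is not AspectXML and not the weaving templates from this series: the `ASSETS` hash merely stands in for entries that would be read from an Atom feed, and the `{asset:ID}` placeholder syntax, the field names, and the example.org URLs are all invented for the example.

```ruby
# Hypothetical asset catalog, standing in for entries read from an Atom feed.
# Each key plays the role of an entry id; the fields are assumptions.
ASSETS = {
  "product-photo" => {
    href:  "http://example.org/products/widget",
    src:   "http://example.org/img/widget-v2.png",
    title: "Widget"
  }
}

# Replace every {asset:ID} marker (an invented placeholder syntax) with markup
# built from the catalog. An unknown id raises KeyError, so a missing asset
# stops the build instead of shipping a dead link.
def weave(page, assets)
  page.gsub(/\{asset:([\w-]+)\}/) do
    asset = assets.fetch(Regexp.last_match(1))
    %(<a href="#{asset[:href]}"><img src="#{asset[:src]}" alt="#{asset[:title]}"/></a>)
  end
end

pages = {
  "index.html"    => "<p>New this week: {asset:product-photo}</p>",
  "products.html" => "<li>{asset:product-photo}</li>"
}

pages.each { |name, body| puts "#{name}: #{weave(body, ASSETS)}" }
```

Changing the `src` value in the catalog and re-running the loop updates both pages, which is the whole appeal of keeping the asset in one place.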

Scenario 2: What about hyperlinks: The 404 error page is something I know we have all come to love and accept as part of our web-based lives. But do you think that if it suddenly just disappeared, we would miss it? While I have no doubt there are a sentimental few who probably would, I would tend to believe the majority of us would eventually forget what 404 even stood for, becoming more folklore that is told around campfires to scare children into minding the “ways of the web” or face “the dreaded 404!”

…or maybe not (the scare children bit… not the site-wide hyperlink update for each instance of an asset ;)

Scenario 3: Another area of value comes when a link on a page may have more than one hyperlink, image, or meta information associated with it that cannot be crammed together each and every time the asset appears on the site. By containing this asset within an Atom data feed and then using an entry for each related item, a simple dropdown menu can be shown when the user clicks the asset’s link to then expose all of the possible places this asset links to, all or at least some related photos, and any other information that might be of interest.

When you really stop to think how often we use and reuse the same assets on a website, having a system that enables us to keep track of these assets, where they are located, and then keeping each and every one of them up to date with all of the latest related information is something that could very easily become the most important part of our build process each and every time we rebuild our web sites.

With that, I will leave you to ponder the above, ask any questions, protest my beliefs that this really can, will, and actually does work and work well, or flat out call me a lying thief who stole the idea from your notebook while you weren’t looking (you really should manage your assets a lot better you know… Hey, maybe this tutorial series might help! ;)

I am currently planning part 2 of this series for the same time next week. If that changes I will make a post a day ahead to notify you of when it will be posted instead.

Cheers :)

<M:D/>

NOTE: This post is the first of what I plan as a 4-5 post series covering the basic concepts, the code, the real world implementation and maintenance, and the impact implementing this system can have on small, medium, and large websites alike. Your questions, comments, and/or concerns are always welcome.

So what do you think? Can AspectXML (an AOP-based project developed by Russ Miles and myself) and Atom feeds that contain the latest information related to commonly used site assets help in bringing sanity to content management on your web site(s) or am I completely off my rocker? (for this, not other reasons… While your other reasons may be valid, that’s just flat out mean bringing them out in a public forum like this… have you no shame! ;)

David A. Chappell

Since making the announcement yesterday morning, a number of trade publications have reported on Synapse. Here is a list of what I have seen, which includes eWeek, InfoWorld, CNET, LooselyCoupled, InternetNews, and TheRegister -
Dave

The Register (UK)

ESB project planned by Apache

By Gavin Clarke in San Francisco
Published Friday 19th August 2005 02:43 GMT
http://www.theregister.co.uk/2005/08/19/esb_apache/
The Apache Software Foundation (ASF) is drilling deeper into infrastructure software with plans for an open source enterprise service bus (ESB) project.
ASF is next week expected to announce the ESB project, backed by a brace of software vendors and led by integration specialist Sonic Software. A source familiar with the project told The Register the ASF is offering customers an alternative to closed-source ESBs to “keep vendors honest.”

InternetNews.com - USA

Enterprise Service Bus Effort Under Apache Incubation

http://www.internetnews.com/dev-news/article.php/3528941
The Apache Software Foundation is now incubating a web services broker/ Enterprise Service Bus (ESB) project called Synapse. The …

CNET News.com

Apache expands Web services reach

Martin LaMonica
August 21, 2005
http://news.com.com/Apache+expands+Web+services+reach/2100-7344_3-5840388.html?tag=nefd.top
The Apache Software Foundation is expected to launch on Monday an open-source integration server project, part of a bigger effort to create a full suite of Web services infrastructure software.
Called Apache Synapse, the proposed project will create server software that processes XML documents as they travel between two machines.
Called a “Web services broker” or an enterprise service bus (ESB), Synapse will be designed to perform tasks such as translating between different XML document formats and routing information based on its contents….

InfoWorld

Apache kick-starts open source Web services

Synapse project aims to deliver a scalable, distributed services broker based on Web services standards
Eric Knorr
August 22, 2005
http://www.infoworld.com/article/05/08/22/34NNapache_1.html

Go beyond a few basic protocols, and confusion still reigns in the wild world of Web services and SOA. Not just the towering, complex stack of Web services specs, but also fundamental questions about how those specs should work together and how Web services should be deployed and managed.

A new open source answer to those questions, dubbed the Apache Synapse project, has arrived from an unlikely location: Sri Lanka. That country, not known as a technology hotbed, is the home of WSO2, a Web services venture founded by leaders of the Apache Web services project. Synapse, which WSO2 is publicly submitting to the Apache Software Foundation today, is intended to produce a lightweight, scalable, distributed services broker based on Web services standards. The kernel will be X-broker, donated by software vendor Infravio, which will participate in the project along with middleware players Blue Titan, Iona, and Sonic Software….

eWeek

WSO2 Announces Web Services Mediation Project

Darryl K. Taft
August 22, 2005
http://www.eweek.com/article2/0,1895,1850787,00.asp
WSO2, a startup focusing on open-source Web services, is expected to announce Monday a new project to create a Web services mediation framework known as Synapse and submit it to the Apache Software Foundation’s incubator program.
Sanjiva Weerawarana, founder, chairman and CEO of Colombo, Sri Lanka-based WSO2 (which means Web Services Oxygen), said his company is being joined by four other founding companies that are leaders in the Web services and ESB (enterprise service bus) spaces: Blue Titan, Infravio, IONA and Sonic Software.
Weerawarana said Synapse is an open-source implementation of a Web service mediation framework and components for building and deploying SOAs (service-oriented architectures). It sets a framework for working between two Web services and enables users to define routing and other factors between services.
“Synapse is a Web services mediation framework that will provide a common core to be utilized by ESBs, Web services management platforms and other types of SOA infrastructure,” said Dave Chappell, vice president and chief technology evangelist at Sonic Software….

LooselyCoupled.com

Synapse to spark web services connections

by Phil Wainewright
August 22nd, 2005
http://www.looselycoupled.com/stories/2005/synapse-infr0822.html

A group of vendors are bidding to standardize a core piece of web services infrastructure on open-source code. According to its backers, the result could simplify development of enterprise web services projects, enhancing interoperability and scalability as web services adoption grows.

Apache-sponsored Synapse open-source project aims to build a universal web services intermediary:

- It will be a base-level component of ESB and services management products
- Adoption will ease interchangeability of infrastructure products
- Independent of Java, it will be optimized for Linux
- Enterprise developers are expected to use it for build and test
- It enforces deployment of a services mediation layer ….

David A. Chappell

A group of open source contributors have announced the submission of the “Synapse” project for incubation under the Apache Software Foundation. The contributing organizations that have joined forces include Blue Titan, Infravio, IONA, Sonic Software and a newly formed company called WSO2.

Synapse is a Web Services mediation framework that allows users to get in the middle between service requesters and providers and perform various tasks, including transformation and routing, and that helps to promote loose coupling between services.
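For readers new to the idea of mediation, here is a toy Ruby sketch of what “getting in the middle” can mean: a component that picks a destination based on message content and transforms the message on the way through. It is emphatically not Synapse code or its API (the project is only at the proposal stage); the `Mediator` class, its method names, and the example.org endpoints are all made up for illustration.

```ruby
# A toy mediator: it sits between requesters and providers, chooses an
# endpoint by message content and transforms the message on the way.
# This is an illustration only, not Synapse's design or API.
class Mediator
  Rule = Struct.new(:matcher, :transform, :endpoint)

  def initialize
    @rules = []
  end

  # Rules are tried in the order they were added; the default transform
  # passes the message through unchanged.
  def add_rule(endpoint, matcher:, transform: ->(message) { message })
    @rules << Rule.new(matcher, transform, endpoint)
  end

  # Returns [endpoint, transformed_message] for the first matching rule.
  def mediate(message)
    rule = @rules.find { |r| r.matcher.call(message) }
    raise ArgumentError, "no route for message" unless rule
    [rule.endpoint, rule.transform.call(message)]
  end
end

mediator = Mediator.new
mediator.add_rule("http://example.org/orders-v2",
                  matcher:   ->(m) { m[:type] == "order" },
                  transform: ->(m) { m.merge(version: 2) })
mediator.add_rule("http://example.org/default", matcher: ->(_m) { true })

endpoint, outgoing = mediator.mediate({ type: "order", id: 7 })
puts "#{endpoint} <- #{outgoing}"
```

Because the requester only ever talks to the mediator, the provider behind an endpoint can change without the requester knowing, which is the loose coupling mentioned above.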

Synapse will be completely based on WS-* specifications, and will serve as a common framework that can be used across ESBs, Web Service Brokers, Web Services Management platforms, and other types of SOA infrastructure. Synapse will leverage other open source initiatives such as Apache Axis 2, Sandesha (WS-RM), and WSS4J.

It’s a real pleasure to be working with a set of heavy hitters from the open source community such as Sanjiva Weerawarana, Glen Daniels, Davanum Srinivas (Dims), and Paul Fremantle, and with other industry luminaries such as Miko Matsumura and Eric Newcomer. Sanjiva and Paul recently left IBM, and Dims recently departed CA, to form WSO2. I’m happy to say that Glen is still with us :)


The home page for Synapse at the Apache site can be found at –
http://wiki.apache.org/incubator/SynapseProposal

Dave

Andrew Savikas

Related link: http://ask.slashdot.org/article.pl?sid=05/08/09/170209&tid=215&tid=95&tid=4

This post from Slashdot caught my eye because it hit close to home. (The post asked for tips on cleaning up and standardizing Word files for eventual posting to the Web.) I’ve been cleaning up Word files for some time now, and it’s never the same job twice.

The poster asked for advice, but rather than reply directly on Slashdot and get lost in the crowd, I thought I’d post some tips from the trenches right here. When I say “you”, I mean “the guy who asked Slashdot”.

  • If you’re working with Word 2003 for Windows, consider doing the bulk of the cleanup with XSLT. As MS rolls out the new XML-based file format, you’ll be ahead of the game.
    You can very quickly strip out a ton of cruft with just a few XSLT templates.


    For example, the following dozen or so lines of XSLT remove all “direct” formatting from a Word document. (Direct formatting means things like using the “B” button to apply bold to some text, rather than using a character style like “Strong”.) This stylesheet (written by Evan Lenz, co-author of the excellent book Word 2003 XML) is taken directly from Hack 97 in Word Hacks:

    <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="w:p/w:pPr/*[not(self::w:pStyle)]"/>
    
      <xsl:template match="w:r/w:rPr/*[not(self::w:rStyle)]"/>
    
    </xsl:stylesheet>
    

    For more on how to do stuff like this, see chapter 10 in Word Hacks.

  • Even if you’re not using Word 2003 on Windows, consider using XSLT for a lot of the cleanup. Saving out as HTML, then running it through Tidy will give you a starting point for using XSLT.
  • Also, Word VBA macros are a natural choice for much of the cleanup you described. While VBA is far from the darling of Slashdot readers, it is fairly simple to learn, and the ability to record macros and then edit them is a Good Thing.

    As I’m sure you know, no two manuscripts are the same, so having a broad selection of short, single-purpose utility macros can be a huge time saver. You can even run them on a batch of files from a DOS command prompt, which is very useful. I take a look at a set of files, take note of which of the 2-dozen or so utility macros need to be run, then fire up a Ruby script to run them for me.

    In fact, here’s a simplified version (no usage message, etc.) of the Ruby script I use. Enjoy:

    # batchmacro.rb
    # Born 3/14/2005
    # Andrew Savikas, O'Reilly Media, Inc.
    
    # Use win32ole package
    require 'win32ole'
    require 'getoptlong'
    
    $macros_to_run = ""
    
    # Process command-line options
    opts = GetoptLong.new(
      [ "--macro", "-m", GetoptLong::REQUIRED_ARGUMENT]
    
    )
    
    # process the parsed options
    opts.each do |opt, arg|
        if opt == "--macro" then
            $macros_to_run = arg
        end
    end
    
    if ARGV.size == 0 then
            exit
    end
    
    # Get current instance of Word, or launch new one if needed
    begin
      wrd = WIN32OLE.connect('Word.Application')
    rescue WIN32OLERuntimeError
      puts("no instance of Word running ... launching new one. Please wait ...")
      wrd = WIN32OLE.new('Word.Application')
      $close_word = true
    end
    
    wrd.Visible = 1
    
    # Everything else is a document on which to run macro
    ARGV.each do |file|
      doc = wrd.Documents.Open(File.expand_path(file))
      $macros_to_run.split.each do |macro|
        puts("Running #{macro} on #{file} ...")
        wrd.Run(macro)
      end
      doc.Save()
      doc.Close()
    end
    
    if $close_word then
      puts "Closing Word ..."
      wrd.Quit()
    end
    puts "Done."
    
  • Several Slashdotters suggested using RegExp’s to help clean things up. If you’re using Word 2000 or later on Windows, you can use Perl-style RegExp’s right from a Word macro. Of course, if Perl’s more your style, go ahead and use Perl from within Word, just like you can do in emacs or vim.

In practice, the wide variety in the content and quality of the Word manuscripts processed will likely call for all of these tools. Before most O’Reilly manuscripts (yes, most of them are written in Word — at the authors’ request) hit the shelves, they’ve been poked, prodded and processed by VBA, Ruby, Perl, sed, and sometimes OpenOffice. The bigger the toolbox, the quicker the job.

How do you get those Word docs in shape?

Bob DuCharme

Related link: http://napsterization.org/stories/archives/000513.html

Mary Hodder made this interesting post last weekend, and it’s been a hot topic ever since. In her words,

Currently, blogs are measured in systems like Technorati or ranked in PubSub by links or by number of subscribers to a feed in Feedster. In particular, these are the not very interesting, subtle or telling measures used to make indexes like the Technorati Top 100 or the PubSub 100 or the Feedster 100. In Particular, the Technorati Top 100 is based purely on inbound links. All of these lists tend to favor those who blog in more general, popular topic areas, and not those who are specialists in an area.

For many bloggers the relevant sphere of influence is not overall popularity, as those indexes express. It’s influence and connection within a community. And the relevant measure of connection isn’t the number of connections — it’s the depth and impact of those connections.

She wants to quantify the depth and impact of those connections by looking at more than just the number of inbound links and the number of a particular blog’s subscribers. Her blog posting includes a table with nineteen kinds of information that could be taken into account, and she floats the possibility of an open-source algorithm to combine this information into a score. (She also includes a nice picture of the napkin from the Paris restaurant where she and her dinner companions sketched out their initial ideas.)

About half of her table’s entries describe links and tagged URLs within particular contexts that can be considered metadata about those links and URLs—for example, links to a blog, links to a specific post, links to links to a post, the ratio of outbound links to generated traffic, outbound blogroll links, and more.

Her idea about developing this open-source algorithm, whose resulting score has since been dubbed “The Paris Index,” has generated a lot of discussion this week, and she summarized it today in a post with a title that I had to love: Lotta Linkin Going On… Or Not. (The XML- and RDF-oriented weblog crowd that I follow most closely was well-represented in Mary’s summary by Shelley Powers.)

The plans surrounding the Paris Index are a sophisticated new development in the evolution of one of Larry Page’s original ideas that led to Google’s PageRank algorithm: that we can derive link metadata from link context and then build useful applications from that metadata. Mary’s table of potential metadata is an important step beyond the scribbles on her napkin, and I look forward to seeing where further steps lead.

Simon St. Laurent

Related link: http://extrememarkup.com/

The brain-stretching continues, with one last morning of high-end talks at Extreme.

(Again, I’ll be updating this over the morning. Fortunately, I’m largely sneeze and Sudafed-free today.)


The first speaker, Erik Hennum, is presenting on a “unified type hierarchy” for DITA, an “information typing architecture” that combines topics, maps, and context but is not Topic Maps.

DITA comes with a set of basic types for topics, but deriving new topics permits the use of much more specific content structures - tasks instead of paragraphs, for instance. More specialized approaches also let applications create more specific interfaces.

Currently, they’re doing derivation through architectural class attributes - DITA processing matches on class attributes rather than on element names. The system can be extended, but only by substitution - there’s no addition of new content.

As Hennum put it, “that’s working, but it could work better.” Hennum looked through a list of requests from DITA vocabulary designers, especially around content models and attributes. To address these many issues, Hennum proposes substantial extension of the existing type system, and suggests that the model could be represented in UML or OWL while being implemented separately in XML Schema or RELAX NG for instance validation if needed. He’s also showing how this might work in XSLT 2.0.

[paper]


Next up is Eric van der Vlist, who is demonstrating XML/RDF query. He’s presenting “100% angle brackets,” showing purely examples rather than slides.

The particular data van der Vlist is showing comes from LDAP, and is a mixture of tree structure and graphs. He proposed exporting a graph view of LDAP to RDF, “using RDF outside of the domain of the Semantic Web.” The LDAP tree turned into a very flat structure, though van der Vlist tried to minimize the “RDF tax,” keeping it readable as XML.

The user needed to query the data, and van der Vlist explored options, including LDAP filters, XQuery, and the W3C’s SPARQL query language for RDF, but all of them seemed too complicated for the task at hand.

Instead, van der Vlist turned to a query by example (QBE) approach. Starting from a simple approach of showing the query engine what he hopes to retrieve, van der Vlist developed it into a more robust approach with functions, joins, and conditions. There have been a few odd issues with RDF syntax expectations, but they haven’t been hard to work around.
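The general shape of query by example is easy to sketch, even though van der Vlist’s version works on XML/RDF rather than on in-memory data. The Ruby below is an assumption-laden illustration, not his implementation: records are nested hashes, the example is a partial hash, and a record matches when everything in the example is present in the record. The `PEOPLE` data and the function names are invented. Regular expressions and lambdas in the example stand in for the “conditions” mentioned above.

```ruby
# Query by example over nested hashes: a record matches when every key in
# the example is present in the record with a matching value.
def matches?(record, example)
  case example
  when Hash
    record.is_a?(Hash) &&
      example.all? { |key, want| record.key?(key) && matches?(record[key], want) }
  when Regexp
    record.is_a?(String) && example.match?(record)
  when Proc
    example.call(record)   # arbitrary condition supplied by the caller
  else
    record == example
  end
end

def query(records, example)
  records.select { |record| matches?(record, example) }
end

# Invented, LDAP-flavoured sample data.
PEOPLE = [
  { cn: "Ada",   ou: "engineering", manager: { cn: "Grace" } },
  { cn: "Alan",  ou: "research",    manager: { cn: "Grace" } },
  { cn: "Grace", ou: "engineering" }
]

# "Show me engineers whose manager's name starts with G."
found = query(PEOPLE, { ou: "engineering", manager: { cn: /^G/ } })
names = found.map { |person| person[:cn] }
puts names.inspect
```

The attraction is that the query looks like the data you hope to get back, which is a much smaller conceptual step for users than learning a full query language.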

[paper]


I almost made it through the conference without missing a session, but I didn’t quite make it. I was checking out during the second-to-last session, unfortunately. Jeff Beck reported on the challenges involved in creating PubMed Central (PMC), a system for giving the public access to research results funded by the National Institutes of Health (NIH). There’s an underlying XML repository, as well as a process for submitting material and adding information to the system. One piece that stuck out was that they use tagging guidelines that are more restrictive than the DTD they use to validate, and use XSLT in a Schematron-like way as a “style checker.” Integrating the material with links to publishers’ sites also looks like a challenge. They’ve also used XSL-FO to generate PDFs. One interesting policy question here is that the grantees submit material voluntarily, not the publishers. So far they have about 1000 submissions, but the system is designed to support many more.


Closing the conference, as he always does, C. Michael Sperberg-McQueen spoke about “Getting it in writing.” The description of the session is “The letter killeth, but the spirit giveth life. Or was it the other way around?” (The quote is from 2 Corinthians 3:6.)

He started with a story about jazz great Charlie Mingus coming to a session with a few measures left blank for improvisation, and being told he was getting lazy. Next he turned to the notion that “getting it in writing” signals a lack of trust, and connected this (as well as the UK’s unwritten constitution) to the W3C’s early hopes for a lack of formal process, something which hasn’t proven workable.

Sperberg-McQueen worried about an “endemic mistrust of democracy on the part of technical people,” perhaps brought on by experience in high school, but also noted that people mistrust writing things down. Some of the time CMSMCQ thinks that the spirit is more important than the letter, but not always. He talked about the story in Plato’s Phaedrus where Thoth presents writing, which the recipient realizes fosters reminiscence, not memory. Readers seem omniscient but know nothing - their answers will be the same without any concern for circumstance or audience.

However, “Individual memory is weaker, but the system is stronger,” from Sperberg-McQueen’s perspective. “Until artificial intelligence bears fruit, if it ever does…. markup makes computers look well-informed.” He concluded with some discussion of “true names”, something I think should be left strictly to magicians, and the conference was over.

Substitute? Extend? Restrict?

Simon St. Laurent

Related link: http://extrememarkup.com/

Even though I can’t stop sneezing, I still need to hear the morning sessions about overlap at Extreme - it’s the subject that strikes me as producing the most creative thought in the whole markup area. (Elliotte Rusty Harold gives some excellent background on the subject.)

Like yesterday, I’ll be updating this article throughout the day.


Syd Bauman opened the morning with discussion of the ways that the Brown Women Writers Project has been dealing with overlapping structures in documents. Bauman acknowledged that:

XML is probably not the best way to model humanities texts, but it’s what we’re using today.

The overlap problem is an old one, and the Text-Encoding Initiative (TEI) has been trying to deal with it for a long while. Bauman cited Hx and Steve DeRose’s CLIX work as prior work before setting off on a brief discussion of YAMFORX (yet another method for overlap representation in XML), complete with images of, er, yams with fork legs standing in various places.

Empty elements marking start and end positions are a classic approach to indicating starts and ends of structures which can’t be nested cleanly. To indicate which start and end elements go together, identifiers are stored in sID and eID attributes.

DTD-based validation can’t check that these elements are used properly (and I don’t think XML Schema could either), but Bauman showed RELAX NG for doing part of the rule set, and Schematron can validate all of it.
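To give a feel for the milestone technique and the kind of pairing rule a schema has trouble expressing, here is a small Ruby sketch. It is not Bauman’s RELAX NG or Schematron, and the element names in the sample (`poem`, `line`, `quote`) are invented; only the sID/eID pairing idea comes from the talk. The check is procedural and deliberately simple: it reports end markers with no open start and start markers that are never closed.

```ruby
# A quotation that spans two lines cannot nest inside <line>, so it is
# marked with empty elements paired by sID/eID. Element names are invented.
SAMPLE = <<~XML
  <poem>
    <line>He said <quote sID="q1"/>go now,</line>
    <line>and do not return<quote eID="q1"/> to me.</line>
  </poem>
XML

# Collect [kind, id] for each milestone in document order. A regex is
# enough for this sketch; real input deserves a proper XML parser.
def milestones(xml)
  xml.scan(/<\w+\s+(sID|eID)="([^"]+)"\s*\/>/)
end

# Identifiers whose start and end markers do not pair up: an end marker
# with no open start, or a start marker that is never closed.
def unpaired_ids(xml)
  open_ids = []
  problems = []
  milestones(xml).each do |kind, id|
    if kind == "sID"
      open_ids << id
    elsif open_ids.delete(id).nil?
      problems << id
    end
  end
  (problems + open_ids).uniq
end

puts unpaired_ids(SAMPLE).inspect
```

A grammar-based schema language can say “this element may carry sID or eID”, but “every sID has a later eID with the same value” is a document-wide constraint, which is why a rule-based checker is the natural fit for the full rule set.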

After sorting through the schema side, Bauman turned to practical use of this technique in TEI, looking at how to integrate this with the existing vocabulary. Processing involves two levels - one for the regular markup, and then extra work to support the YAMFORX pieces, possibly converting the regular XML to YAMFORX and converting the YAMFORX to regular XML.

Bauman concluded with a list of what still needs to be done to make this practical, but it’s a promising start.

[paper]


Next up was Paul Caton, also from Brown, discussing how LMNL could address similar problems. While LMNL has been a frequent subject at Extreme, it hasn’t seen much implementation, but Caton’s work in developing Limner suggests to him that LMNL has promise.

Caton finds LMNL important “simply because it’s here now… because it helps us appreciate the context in which it exists,” showing us a lot about traditional markup. Caton noted that everything in markup cycles around, with ideas constantly being reinvented, the leading edge becoming the trailing edge becoming the leading edge.

Caton showed a book he’d used for his master’s thesis fifteen years ago, marked up by hand with multi-color highlighting and inserted paper notes. Caton wants to be able to do such annotation collaboratively, with tools for separating layers of annotations.

After a brief description of LMNL’s layers, owner layers, ranges, overlays, and interaction with text, Caton looked at how he could apply these tools to his own application, a web-based tool for creating a variety of different kinds of markup. The left hand shows the text, while the right hand allows the creation of ranges through a form that takes range and overlay information as well as text at the start and end of the range. The information is then stored in MySQL, and the application can then provide a LMNL representation of the text, both as regular text with highlighting and as a more abstract view showing the layers graphically. (Caton is working on developing additional graphic representations using multiple planes.)

Caton also brought up attributed range algebra, work by Gavin Thomas Nicol that builds on his core range algebra work. There’s no layer model, just ranges and sequences. This may avoid some data model issues in LMNL. (I’m a fan of LMNL but haven’t found the data model compelling. One of these days I’d like to get back to working with LMNL, as it seems to make possible a lot of projects I’ve wanted to do for years.)

Caton also showed a frightening diagram of the many conversations in this space - perhaps a good sign of the intensity of the conversation, but also a clear sign that there is much left to be done in this area.


After the complexities of overlap, we moved to the challenges of difference calculation. Erich Schubert presented an effort to calculate differences among XML documents in a way that ensures the results are interpretable by humans.

After noting that the most frequently used output of GNU diff is a verbose form readable by people, Schubert looked at the reasons text difference algorithms don’t quite work with marked-up documents, and then examined why many tree-based approaches work better on XML but aren’t very usable by humans trying to sort out what’s changed.

Schubert’s preferred approach builds on query-by-example, which can support looser matching and handle questions like finding content that has moved within a document. Comparing nodes as graphs also supports a variety of structural possibilities, and Schubert explored a number of different ways to see nodeset correspondence. The paper also describes a number of places this work could go, from ways to optimize performance to integration with other types of data and databases.

The software implementing this is available as open source (along with the slides), if you want to explore in greater depth. It looks like a few people will be doing just that, as someone shouted “we want this!” during the applause.

[paper]


Sudafed has stopped my sneezing, but it’s strange typing when I can’t feel the tips of my fingers. If this entry lurches off into Hunter S. Thompson territory, let me know.


All this extreme structure is cool, but sometimes people just want to go extremely fast. Steve DeRose presented analysis of a lot of different XML operations, looking at the constant battle to make these things run efficiently and quickly. DeRose focused on tree management - DOM, storage management, and location identifiers after the document has been parsed.

DeRose started by explaining the processing costs of various kinds of processing, and how algorithm design affects these costs.

As a convenient first target, DeRose picked on the SGML & operator in DTDs, which produced factorial expansions of its contents during naive processing. A 23-item list in a document he was processing produced 10 to the 24th possibilities.

Next, he moved on to XPath and DOM operations, examining the different axes and their possibilities. XPath tends to return lists of nodes, while DOM typically returns single nodes. DOM also doesn’t have a native notion of preceding or following nodes. These differences make XPath implementations do extra work when XPath is built on top of DOM. DeRose suggested that XPath processing can be optimized by storing extra information about axes with nodes, reducing the number of nodes that need to be traversed for a given operation.

As XML documents grow larger, the number of nodes grows, and frequently the depth of the node tree may also grow. DeRose found ‘typical’ documents to go eight levels, while military manuals using CALS tables went 13. Similarly, most nodes have a relatively small number of child elements, but projects like dictionaries may have thousands and thousands of siblings at a single level.

DeRose concluded by looking at a few different ways to store XML data. Raw XML source is usually a forward-only approach, unless the program happens to save data along the way, which works less well for larger documents. That doesn’t have to be a full DOM tree - it could, for example, be an index to where elements start and stop within the document. (Unicode normalization and entities can make that extra interesting.)

Relational databases can collect information along the way, remembering parts of the document or all of the information - but “relational databases are not very efficient in their use of space,” creating new problems. DeRose did suggest some possible information especially worth keeping available, but it’s still a hike. Relational databases also impose some new costs, because XML is ordered and relational database tables, by definition, are not.

DeRose’s answer built on earlier work by Dongwook Shin on child sequences. They work really quickly by keeping track of nodes’ positions at multiple levels, but are also somewhat brittle, requiring renumbering when changes are made to the document.
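To make the child-sequence idea concrete, here is a minimal sketch in Python. It is not DeRose’s or Shin’s implementation: the function names are made up for illustration, and it assumes that only element children are counted, using 1-based positions. It shows both why lookups are cheap and why an edit can leave a stored address pointing at the wrong node.

```python
import xml.etree.ElementTree as ET

def child_sequence(root, target):
    """Return the 1-based child positions leading from root to target, or None."""
    if root is target:
        return []
    for position, child in enumerate(root, start=1):
        tail = child_sequence(child, target)
        if tail is not None:
            return [position] + tail
    return None

def resolve(root, sequence):
    """Follow a child sequence down from root; each step is a direct index."""
    node = root
    for position in sequence:
        node = node[position - 1]
    return node

doc = ET.fromstring(
    "<book><title/><chapter><para/><para id='p2'/></chapter></book>"
)
target = doc.find(".//para[@id='p2']")

seq = child_sequence(doc, target)
print(seq)                          # second child of the second child
print(resolve(doc, seq) is target)

# The brittle part: insert a new first paragraph into the chapter and the
# stored sequence now lands on a different element, so it must be renumbered.
chapter = doc[1]
chapter.insert(0, ET.Element("para"))
print(resolve(doc, seq) is target)
print(child_sequence(doc, target))
```

Resolving an address never searches the tree, which is where the speed comes from; the cost is that every stored sequence after an insertion point has to be recomputed.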

In questions, Daniela Florescu suggested that database vendors have already solved a lot of these problems, and the markup community needs to catch up to their work.

[paper]


In the afternoon, Mirco Hilbert and Andreas Witt presented their take on the overlap issues, returning to one of the oldest pieces in the conversation, SGML’s CONCUR feature, which they have implemented in their Multi-Layered XML (MuLaX).

CONCUR itself allows documents to be marked up using multiple DTDs. When processed, an application only sees one of the structures at a time. MuLaX goes beyond DTDs, supporting XML Schema and RELAX NG, though it uses a similar approach of reporting a single structure as an annotation layer and a similar syntax. While you could perhaps do something similar with namespaces, there are times when the same namespace might reasonably be used in multiple annotation layers, and namespaced XML still doesn’t support overlap.

One of the more interesting angles on MuLaX was discussion of multiple processing models, highlighting how the same data could take different routes to arrive at similar results. Also of interest, in their broader work they used multiple approaches to overlap, not just MuLaX.

[paper]


The next presentation took a fresh look at XML parsing and processing. Virtually every parser to date has reported the information stored in XML documents in a single pass from beginning to end, whether generating SAX events or DOM trees. Antonio Sierra’s Free Cursor Mobility (FCM) breaks that pattern, offering more flexibility, as well as an opportunity to avoid the need to store a complete object version of the document in memory.

Sierra’s approach was designed for smaller platforms without major resources, which drove the decision to move a cursor within the document rather than duplicating the information in memory. After exploring the pros and cons of DOM, Push (SAX), and Pull parsing, Sierra showed how his FCM processor behaves.

FCM builds on the pull parser approach, using an iterator-based API that can move forward and backward through the document. The API looks a bit like regular Java iterators, with hasNext() and hasPreviousBrother() methods, as well as ways to move the cursor in the document - to the parent element, for instance, or the previous sibling. It also allows programs to skip around, not parsing the contents of elements the program finds less interesting.


The last two sessions of the day focused on W3C work, both in progress and to come.

Felix Sasaki opened with a talk on “Schema Languages and Internationalization Issues”, combining two traditionally thorny issues. Some of the features they wanted to see supported - room for language identifiers, directionality indicators and Ruby markup - seem like things that can be integrated into vocabularies easily, but much of this seems like material that goes beyond schemas. On the other hand, he’s also looking at issues that could cause problems in processing chains including schemas.

Mostly it looks like he’s trying to explain how namespaces, pattern-based descriptions, and modularization may make it easier to implement an Internationalization Tag Set (ITS). It looks like they’re hoping for namespace sectioning, using something like Namespace Routing Language, or possibly using things like schema annotations or architectural forms.

In questions, Eric van der Vlist suggested that processing instructions might be an alternate path for some of these projects, avoiding the complications of schemas. He pushed strongly for not breaking existing schemas.

[paper]


For the last session, Liam Quin, W3C Activity Lead, is asking the audience “what are we not doing that we should be doing, and what are we doing that we should not be doing?” He’s opened with a brief introduction of what’s up at the W3C, including possible profiling of schemas, XLink 1.1, XSLT 2.0, XML Query, and XSL-FO 1.1.

Tommie Usdin expressed concerns about partial interoperability.

I, of course, proposed that the W3C had done well when they were simplifying SGML into XML, DSSSL into XSL, and (well, sort of) HyTime into XLink. The W3C has since morphed into design-by-committee, and so I suggested getting that XSLT 2/XQuery/XPath 2 mess out the door and shutting down for a few years. The world could catch up to what’s been done, and when we go to subset again (or heck, create new features) we’ll have experience.

Ann Wrightson proposed that the W3C focus on making sure that schema reduction works, and that implementations actually interoperate.

John Cowan proposed shutting down the XML Core Working Group. (Of course, he noted that he was asking just after he’d finished doing what he’d wanted to accomplish there, XML 1.1!)

Scott Tsao of Boeing expressed his concerns that the W3C is moving quickly, but education of people like the vendors he has to work with is far behind the W3C’s work. He said “ultimately, my message is to slow down developing new standards.” He also related issues Boeing had had using XML Schema - “as it turned out, several XML editors that we are evaluating - none of them are able to support the test cases I put together quickly from Part 0.” Tsao seems to be pushing test-driven development at the W3C, which sounds intriguing.

Quin seemed hopeful about the growth of test cases at the W3C.

An audience member whose name I didn’t catch asked Quin about RELAX NG and W3C XML Schema. Quin felt it was “a strength of XML that we have multiple ways to do things,” and the audience seemed to agree that consolidating XML Schema and RELAX NG probably wasn’t a good idea.

Igor Ikonnikov suggested that the W3C slow down with incremental changes but prepare in the long term for some larger-scale qualitative changes, noting Daniela Florescu’s comments from yesterday as one possibility. He also suggested a certification program, which got support from another audience member. Quin noted some difficulties of member-funded organizations doing certification.

C. Michael Sperberg-McQueen wanted to ensure that the schema user experience workshop was better represented - not so much a profile, but rather patterns of usage. He said “there was no support for anything like a subset.” Ann Wrightson responded that she doesn’t currently have confidence in interoperability, a “substantial area… in which people like me can operate with confidence.”

Jon Bosak said “certification is a wonderful thing and would benefit us all,” but said there were problems around liability. Ken Holman also expressed concerns that products got tuned for test suites and not for the real world.

Emilia Georgieva asked for something like XSLT “Enterprise Edition” - documenting XSLT use in large situations. Her work at eBay involves 4 million lines of XSLT code, a maintenance nightmare. She asked for formal documentation on best practices, especially for issues like naming things, and better ways to integrate them with tools. She suggested that companies she has worked with are shying away from XSLT because of these issues.

Quin said that the W3C was especially interested in how XSLT 2.0 would play in such environments, and suggested she send in a formal comment.

Steve Newcomb made some general comments as a “message to Tim [Berners-Lee]”. He acknowledged that he was disgusted with standards bodies but said he still sees the positive, hoping for some progress toward doing the right thing. Newcomb suggested that “unfortunately or fortunately for the W3C, because of Tim’s role as absolute monarch… we have a situation where the people perish or live depending on Tim’s vision. I would like to see Tim articulate a way forward for the institutional character of the W3C which would make it more than just a vendor consortium, motivated by higher goals than merely avoidance of situations and technologies that would be disruptive to their members’ business models.” Newcomb felt that a message from this ‘monarch’ could have a “salutary effect” on the field, which is currently a “garden of weeds”.

Where’s your overlap? What’s your diff?

Niel M. Bornstein

Related link: http://svn.usefulinc.com/svn/repos/trunk/doap/creators/doap-sharp/

Edd and I delivered a 3-hour tutorial on Mono at OSCON yesterday. It was a good session, although we easily could have continued for another 3 hours. As a result, I didn’t get to demonstrate DOAP#, my year-old project to embed DOAP metadata in .NET assemblies, and produce the appropriate DOAP RDF from the metadata.

The idea behind DOAP# is that you can use some metadata that’s already in AssemblyInfo.cs — like AssemblyTitle, AssemblyDescription, and AssemblyProduct — and add some additional ones — like BugDatabase, Created, Homepage, License, etc — in the form of C# attributes.

That metadata can then be used to automatically generate a DOAP file for the project.
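As a rough illustration of that generation step, here is a sketch in Python rather than C#. It is not DOAP# code: the dictionary stands in for attributes read from an assembly, and the mapping from attribute names to DOAP properties (for example AssemblyTitle to doap:name) is my own assumption, not something taken from the project.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOAP = "http://usefulinc.com/ns/doap#"

# Assumed mapping from assembly-level attribute names to DOAP properties.
LITERAL_PROPS = {
    "AssemblyTitle": "name",
    "AssemblyDescription": "description",
    "Created": "created",
}
RESOURCE_PROPS = {
    "Homepage": "homepage",
    "BugDatabase": "bug-database",
    "License": "license",
}

def doap_from_metadata(meta):
    """Build a DOAP RDF/XML string from a dict of attribute names and values."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("doap", DOAP)
    root = ET.Element(f"{{{RDF}}}RDF")
    project = ET.SubElement(root, f"{{{DOAP}}}Project")
    # Plain values become element text.
    for key, prop in LITERAL_PROPS.items():
        if key in meta:
            ET.SubElement(project, f"{{{DOAP}}}{prop}").text = meta[key]
    # URLs become rdf:resource references.
    for key, prop in RESOURCE_PROPS.items():
        if key in meta:
            ET.SubElement(
                project, f"{{{DOAP}}}{prop}", {f"{{{RDF}}}resource": meta[key]}
            )
    return ET.tostring(root, encoding="unicode")

# Placeholder values, standing in for what would be read from an assembly.
example = {
    "AssemblyTitle": "ExampleLib",
    "AssemblyDescription": "An example library",
    "Homepage": "http://example.org/examplelib",
    "License": "http://example.org/license",
}
print(doap_from_metadata(example))
```

Attributes that are missing from the input are simply left out of the output, which matches the idea of starting from whatever AssemblyInfo.cs already provides and adding more over time.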

The first milestone on my new roadmap is to complete the set of attributes that map to the DOAP RDF schema, and then produce a complete DoapWriter that understands all those attributes.

Next, some of the attributes need to be determined at make dist time rather than hard-coded in AssemblyInfo.cs. So I’ll need a tool to do that.

After that, I’ll probably need to build a framework for dynamic extensions to the DOAP schema. I imagine I’ll need to design some generic attributes for that.

The final phase of my world domination plan would be to submit patches to the major .NET/Mono projects to include the DOAP attributes in their AssemblyInfo.cs files, so that they can automatically produce the appropriate DOAP file in their own distributions.

I’m not sure when any of this will happen, but my vague goal is to hit the end of the year.

Am I a dope?

Simon St. Laurent

Related link: http://extrememarkup.com/

After an evening enjoying Montreal - I recommend the Lac Saint-Jean dessert crêpe at Chez Suzette - it was back to the mental stretching at Extreme Markup.

(I’m going to update this post over the course of the day.)


Ken Holman started the morning off by talking about how a project he’d presented here two years ago had seemed great, but then he found limitations. His original approach, LiterateXSLT™, was based on creating ‘literate results’ - annotated XSL Formatting Object Documents with XPath information about where to find source material to fill them. Then Holman combined the document structure and the annotations to generate a stylesheet.

That worked well, unless you needed to reuse the stylesheet with other data. He shifted the XPath information to separate files, making it possible to reuse the same target layout with multiple source vocabularies. In doing this he was able to “drastically reduce the number of annotations,” making the base document much easier to work with. Instead of expecting all of the information in one document and producing monolithic XSLT, Holman’s new approach - ResultXSLT™ - synthesizes XSLT which calls imported templates. Those imports carry the detailed information about how the vocabulary relates to the expectations of the stylesheet. (Ken’s also done some tricky work using namespaces as a signal for certain kinds of processing rather than as a vocabulary identifier, an approach that deserves a much closer look.)

One extra degree of separation can frequently simplify a larger set of problems. While extra separation does mean extra work, the extra flexibility becomes more useful as the size of the problems grows. (Those ™ marks don’t indicate Ken’s interest in owning the technology - he’s happy to see the technique used by other people.)

(I’ve written here about Ken’s XSLT training in the past.)

[paper]


Matthijs Breebart was next, presenting on an issue that has become more and more common as XML has reached into more and more places. Sometimes those places are readily accessible, but other times those places are accessible only under certain circumstances: on a particular network, or with a particular license. Breebart’s case involved annotating laws with commentary, both public and commercial, coming in a variety of different formats.

The commentary was organized by vendor, not by content. Looking for information about a particular subject required going back and forth between a number of different sites, often to find relatively little new. While vendors use URLs to create permanent links, they all use different systems. Breebart wasn’t thrilled by the prospect of manually reconciling these, and processing URLs to try to sort out the vendor-specific approach wasn’t fun either. The next step was asking vendors to use a standard form - getting everyone in one room.

Modeling data was one project, and then they had to figure out identifiers. Using random strings and a registry had disadvantages. They wound up combining some meaningful information for internal parts of regulations and a meaningless identifier (and number) for regulations themselves. They still have 10,000 identifiers, but having the internal portion of the identifier follow the structure of the document spared them many many more identifiers.

For sharing the information, they used RELAX NG to create a schema, and then generated XML Schema as appropriate. Once they had this set up, they could share the identifier list among all of the parties. They could also create a direct transformation of the XML describing the identifier to a URI, giving them a much more compact approach.

It’s a lot of work to reconcile something which seems simple on the surface, but knowing what to call something makes it vastly easier to reference it without paying a visit. Once the common format is established, it should be easier to ensure that new software supports it.

[paper]


After the coffee break, Ann Wrightson asked a basic question: why is some XML so difficult for humans to read? Wrightson’s question strikes at a key concern of mine, the interest I’ve always had in XML as a meeting place between human and computer understandings.

Wrightson said that “An awful lot of it has to do with how computers communicate with humans,” and then focused on “situation semantics,” looking at signs and basic items of information Wrightson called “infons”. She looked at a variety of ways these work and these break down, using a conversation about rugby for illustration.

Wrightson then carried this over to XML, examining how people can try to fill in the blanks when obscure markup names are used, and the immediate limits people hit when identifiers don’t conform to understood (natural language) expectations. She also examined the value of context, even partial context, for making sense of those identifiers. Abbreviations, numbers, using modeling roles for names, and opaque identifiers, all popular in computing for a lot of reasons, are frequently not helpful.

Wrightson then asked a key question: “Is human readability of XML just ’semantic sugar’?” The answer seems to depend on how much you value keeping humans close to their data. If you’re excited about packing as much information into a document as possible knowing that processors on the other end will devote major resources to presenting it, then maybe it is just semantic sugar. On the other hand, if there is more than one display possibility, and especially if humans will have to interact with it in any of those possibilities…

Wrightson concluded with a lovely bit of Klingon (a translation of a Shakespeare sonnet) marked up with Elvish, giving us all an extra opportunity to contemplate how syntax sugar tastes.

[paper]


Next up will be Walter Perry, about whom Elliotte Rusty Harold said this morning:

a talk from Walter Perry, one of the most iconoclastic thinkers in the XML space. He’s so diametrically opposed to the conventional wisdom that most people can’t even hear what he’s saying. It’s like trying to explain atheism to an eighth grade class in a Texas Christian school.

(I’m no doubt oversimplifying Walter’s positions here or getting them wrong, but here we go.)

Perry began with a quote from Peter Murray-Rust about working in fields we don’t understand (even things so reportedly simple as chemical bonds), and leaving space for other people to work. He then contrasted schemas and indices, and set up some assumptions about search, suggesting that search is about finding semantic value in a particular context.

Internal contexts - like those described by schemas - are the traditional focus of a lot of XML work. External contexts - whether hyperlinks, indexing systems, or the processes which create and consume documents - seem more interesting to Perry. (This seems to me to be where the fracture line between his views and those of the traditionalists opens, and why the conversation is so difficult.)

External processes may be interested in the internal structure of documents, but they’re (at least potentially) less concerned about what the internal structure or type of the document is than they are about how they can use that document and its content. The semantics created by these processes are more interesting to Perry than the lexical details.

To Perry, it’s a problem that schemas currently focus on one kind of consumption and validation, rather than an “Open-World Internetworked View” of “What Processes produce and publish (and at what URIs) documents with an external context that we understand and therefore might use.” Partial processing is a possibility here, as is processing document structure in ways that vary dramatically from their creators’ (or specifier’s) intentions. It’s the combination of the semantic expectations the reader brings with the content of the document that produces meaning, not just the internal definition.

When I first got into XML, I heard lots of stories, good and bad, about SGML and XML consultants who would show up, create a vocabulary, and head home. Processing and vocabulary evolution were implementation details, not part of the core of XML work. Perry seems to reject that approach, insisting that the XML work is going on all of the time, not just during one phase of vocabulary or document creation.

(And I have to love any talk with a slide contrasting Finnegans Wake with a Burger King pickup receipt.)


For the afternoon, the schedule split into two tracks. One is squarely focused on XQuery, while the other is about information integration issues. After seven years of hearing about XQuery without it yet reaching maturity, I’ve decided to tune out XQuery until it’s, well, ready. So on with information integration.

(Elliotte Rusty Harold is covering the XQuery material if you’re interested, and also has more on the morning presentations.)


Lee Iverson kicked off the afternoon talking not about XML, but “what XML is for, so that we can do XML-like things with anything.” He asked whether our current software models - front-end/business rules/database, and model/view/controller - might be preventing a number of useful possibilities. He abstracted them to context/knowledge/data approaches, and examined the ways these pieces are layered.

Iverson confessed to not being an XML person, preferring his data models separated from the syntax, referring to HyTime as an approach that allowed software to treat diverse data sources as having a common structure. Iverson suggested “working with as much as we can manage,” showing a diagram of a generic data model using typed nodes to create a simple and (perhaps) universal data model.

It’s intriguing in some ways, and if you’re looking for universal data models that can operate over a variety of data types (I’m very clearly not), this is definitely a good place to look.

[paper]


Wendell Piez followed, talking about a question that frequently dogs XML projects and applications, the idea that format and content are best kept separated. Piez started with an example of structure that had come up earlier, the sonnet, on a slide titled “but this poem doesn’t validate!”

Piez moved on to books, using the Table of Contents for Marshall McLuhan’s Understanding Media, and looking at how the relationship between title and subtitle can vary and sometimes vanish. Looking more generally, Piez cited tables as a common case where semantics lurk in the content but vary from table to table. Scalable Vector Graphics (SVG) offers even larger questions about semantics lurking in the markup.

Where was Wendell headed with this? Web Graphics Layout Language (WGLL, pronounced WGLL), the vocabulary he created so that he could generate SVG (with XSLT and Cocoon) without getting stuck in the mire of writing SVG directly. It’s a minimal format, inserted into SVG to simplify creation of graphics, layouts, and some simple animations to enhance the interface.

Relating the content back to a more generalized model may be slightly easier than working from SVG, but Piez suggested that it’s an acceptable cost for a lot of cases. It strikes a balance between classic descriptive markup without formatting and moving directly into formatting.

As Piez writes in the paper, it’s time to demystify our perspective on what descriptive markup can do for documents:

Introducing a layered system for the production of digital media provides many advantages for scaling, application design and management, and long-term maintenance. But it doesn’t actually take us closer to the “truth” of a text.

[paper]


For the last sessions, I thought I’d try Michel Biezunski’s presentation on Talking About Talking About Topic Maps. While earlier sessions have gone meta on meta on Topic Maps, this is the first one to do that at the “what are we trying to accomplish” level rather than in a strictly formal sense.

Biezunski opened with his concerns about “ontologies,” concerned that the word is too constraining, too focused on reaching a single agreement about categorization. He’d prefer to take a more pragmatic approach. He also looked at clarifying the distinction between “semantic interoperability” and “semantic integration”, with the former creating harmonized processes and systems while integration is more about aggregating data from a variety of different perspectives.

Biezunski sees connections between the Semantic Web and Artificial Intelligence communities, and sees the Semantic Web as a synthetic processing of data. In his own work, Biezunski keeps finding the boundaries between automatable work and projects requiring human involvement to be fluid, changing depending on the situation, “very delicate”.

After describing a number of terms he uses to reach his discussion of perspective (which he acknowledges as biased and not universal) Biezunski showed a variety of different and complex views of Lower Manhattan. He concluded with a question that seems ridiculous outside of a computing context:

Should there be one common perspective to describe Lower Manhattan?

Many systems seem intent on creating single massively-aggregated perspectives, and Biezunski challenges that, answering with another question: “what for?” While top-down approaches with everything in place work for islands of information, and provide interoperability, bottom-up aggregation of perspectives allows more careful focus on relevant material. Getting back to Topic Maps, Biezunski sees Topic Maps as amenable to multiple views.

[paper]


Steven Newcomb closed out the day with discussion of Subject Map Disclosures, asking “how do we know what subject it is we want to talk about?” and formalizing the perspectives Biezunski described as Subject Map Patterns. Subject maps themselves are sets of unique subject proxies containing potentially multiple perspectives on multiple subjects. Disclosing Subject Maps means providing definitions of property classes.

It sounds like a promising way to present multiple perspectives on information, but the combination of Sudafed kicking in and the remnants of sneezing are making it hard for me to follow how this works out.

Patrick Durusau gets in a great quote I can’t let go by. Riffing on The Princess Bride, Durusau said “Everything has perspective. Anyone who tells you otherwise is selling you an undisclosed perspective.”


(I probably should have noted this earlier, but like Roger Sperberg, I’m here as press.)

Namespace triggers? URI generation? Situational context?

Simon St. Laurent

Related link: http://extrememarkup.com/

It’s early August, so it’s time to think about XML in depth.


B. Tommie Usdin opened the Extreme Markup Languages Conference “in praise of the edge case” by looking at past attendee compliments and complaints, and then exploring the comments reviewers had given papers this year. Her point: this is a conference where ideas aren’t required to be immediately practical, but rather one where difficult questions can surface in the hope of their finding answers eventually.

Usdin talked about the challenges corporate IT managers face in quarterly goals and business structures that require clear budgets. For better or worse, focusing strictly on questions with clear answers that have immediate returns can deaden conversations about potentially exciting but more difficult projects. Things that “can’t be done” become doable over time, and things that once moved from edge case to core - like mixed content in documents - can move back out to edge case status depending on how technology is used.

I come to Extreme for an annual brain-stretching, something that can keep me interested even while I perform much more mundane tasks. Ideas I hear at Extreme sometimes take years to percolate, but when they reach maturity they prove extremely useful. It’s good to have a conference that takes a long-term view of what’s practical rather than jumping on what’s hot this week.


Elliotte Rusty Harold, prolific XML author and keeper of the Cafe Con Leche XML news site, gave a talk on his Randomizer project.

While lots of people find raw XML documents obscure enough already, there are a lot of reasons why people can’t share their documents. Copyrights, security, and simple embarrassment can all get in the way. To make it easier for developers to exchange information, Harold has developed a tool that obscures XML document content and element and attribute names while preserving the structure of the markup.

This should reduce the “I’d tell you but I’d have to kill you” problem in sharing XML documents for debugging purposes, but what was most interesting about the talk was the way the project set off a number of conversations with the audience, questioning how precisely the Randomizer worked and asking for a variety of additional features. Different levels of structure and content randomization came up a few times, and there were a number of questions about Harold’s approach to ensuring that documents couldn’t be converted back to their originals.

Hopefully Randomizer will make a contribution to improving XML software by making it much safer to share use cases and create test suites.
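For a sense of what “obscuring content and names while preserving structure” can look like, here is a small Python sketch. It is emphatically not Harold’s Randomizer: the function names and the specific choices (consistent opaque names, character-for-character text replacement) are my assumptions, and it makes no claim about how hard the result is to reverse.

```python
import random
import string
import xml.etree.ElementTree as ET

def scramble(text, rng):
    """Swap letters and digits for random ones; keep length and punctuation."""
    if text is None:
        return None
    out = []
    for ch in text:
        if ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)

def randomize(root, seed=None):
    """Obscure names and content in place, leaving the tree shape untouched."""
    rng = random.Random(seed)
    names = {}

    def opaque_name(name):
        # The same original name always maps to the same opaque name.
        while name not in names:
            candidate = "n" + "".join(
                rng.choice(string.ascii_lowercase) for _ in range(5)
            )
            if candidate not in names.values():
                names[name] = candidate
        return names[name]

    for element in root.iter():
        element.tag = opaque_name(element.tag)
        attributes = list(element.items())
        element.attrib.clear()
        for key, value in attributes:
            element.set(opaque_name(key), scramble(value, rng))
        element.text = scramble(element.text, rng)
        element.tail = scramble(element.tail, rng)
    return root

doc = ET.fromstring(
    '<order id="A17"><item>Widget</item><item>Gadget 2</item></order>'
)
print(ET.tostring(randomize(doc), encoding="unicode"))
```

Because repeated names stay repeated and text keeps its length, a bug that depends on document shape should still reproduce, which is the point of sharing an obscured copy.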

[paper]


Next up was Angelo Di Iorio of the University of Bologna, taking a crack at getting underneath the complexity of XML Schema (and even DTDs) by using a much smaller set of patterns to define document structures, and by creating schemas by processing annotated model documents, somewhat like Examplotron does. In a bit of irony, the processing uses XML Schema and even extensions to XML Schema, called SchemaPath, which deals with co-occurrence constraints (if X is here, Y must be here) and other kinds of conditional expression.

There were some good ideas along the way. I don’t frequently hear about doing more with less (or at least I haven’t at XML conferences for a few years). In a section on “syntactical minimality and semantic expressivity,” Di Iorio looked at ways that fewer patterns can be used to do a wide variety of different things. A context-focused approach, rather than one where every element and attribute is defined in something of a vacuum, is interesting, especially the possibility of saying that “this extra possibility is [not] allowed in this context.”

[paper]


Anne Cregan opened the afternoon with a talk on reconciling OWL and Topic Maps. OWL, the Web Ontology Language, ties into RDF and the Semantic Web, while Topic Maps have emerged from the worlds of indexes, cross-references, tables of contents, and all kinds of related structures to become a general metadata framework.

The RDF/OWL/Semantic Web community and the Topic Maps community have taken very different approaches to similar problems, and sometimes compete. At an earlier Extreme (which I missed), there was a proposed battle between the two sides, though in the end they seem to have decided that cooperation was a better idea. Cregan suggested that the Topic Maps approach emphasizes humans finding information, while the RDF approach is more about computers managing information.

Cregan explored the parallels and disjunctions of Topic Maps and the RDF/OWL specifications, and decided to see if the Topic Maps Data Model could be reconciled with OWL Description Logic, and concluded that their fundamental similarities as entity-relationship models made it possible. Working in OWL-DL could even be used as a means of enforcing TMDM’s expectations.

As with most difficult projects, there are some caveats. Cregan hasn’t attempted to reverse the process, using Topic Maps to create OWL-DL. There are still some issues about type instances and supertypes that need to be hammered out (currently needing extra code beyond OWL-DL processing, or ugly workarounds).

[paper]


Lars Marius Garshol of Ontopia continued the RDF/Topic Maps discussion, also seeking to reconcile the two approaches. While conversions are possible, there hasn’t yet been a seamless approach. Garshol proposed a unifying model, Q. Topic Maps and RDF could both be converted into this model, and converted back out, making them interoperable. Garshol is also hoping to simplify the Topic Maps data model.

Unlike Cregan’s approach, which used RDF and OWL directly, Garshol preferred to abstract a layer beyond those to more directly accommodate Topic Maps’ greater complexity. Neither RDF itself nor object models seemed like the right answer, so Garshol turned to quads, adding an extra piece to RDF triples. This lets him add identity to RDF triples, and simplifies the modeling.
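As a rough illustration of why an identity slot helps, here is a minimal Python sketch. It is not Garshol's Q model; the `Quad` type, the `ex:` names, and the `s1`/`s2` identifiers are assumptions made up for this example. The point is only that once a statement has its own identifier, another statement can refer to it, which a bare triple cannot offer without extra machinery.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class Quad(NamedTuple):
    id: str          # the extra piece: an identity for the statement itself
    subject: str
    predicate: str
    obj: str


quads = [
    Quad("s1", "ex:puccini", "ex:composed", "ex:tosca"),
    # Because s1 is addressable, a second statement can say something about it.
    Quad("s2", "s1", "ex:scope", "ex:opera-history"),
]


def statements_about(quads, target):
    """Return every quad whose subject is `target` (a resource or a statement id)."""
    return [q for q in quads if q.subject == target]


def as_triple(quad):
    """Drop the identity to recover the plain triple."""
    return Triple(quad.subject, quad.predicate, quad.obj)
```

Calling `statements_about(quads, "s1")` returns the scope statement attached to the first quad, while `as_triple` shows what is lost when the identity is stripped off again.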

While Garshol saw this approach as an improvement, there’s still room for additional development, depending on how much bloat matters, how much you value supporting parts of the model which aren’t normally used, issues of duplicate nodes, problems of round-tripping, your expectations about working with scope, and some odd issues around language tags and URI usage in RDF. In the end, as a surprise, Garshol leapt to quints instead of quads to solve context problems.

[paper]


Kristoffer Rose of IBM spoke next, showing how DFDL (the Data Format Description Language, pronounced “daffodil”) might help XML processors get into a wide variety of file formats, including binary data. DFDL itself is built using annotations to XML Schema, indicating data format information and supporting structure directives. You can, for instance, tell a schema what the separator and the terminator are for a given piece, as well as type information about its content. It’s flexible enough to support things like control characters XML normally prohibits.

Wisely (in my opinion, anyway), they seem to have taken a conservative approach to rules for how these annotations flow through the schema, limiting the effect of the annotations to the element directly annotated. It does support some more ambitious features like scoped default values.

DFDL processing lets developers treat non-XML documents as if they were in XML, returning XML models as if the document had come from an XML parser. It’s an interesting approach, one that has lots of promise for the many forms of data which meet its requirements. I do wonder how widely it will be adopted, but maybe the Global Grid Forum can make it work well enough to be attractive. Draft specifications have been published, but there is still ongoing discussion and only a few implementations have appeared so far.
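To give a feel for the separator and terminator idea, here is a minimal Python sketch that reads delimited text and hands back an XML tree. It is not DFDL and uses none of its annotation syntax; the `parse_delimited` function, the field list, and the sample data are assumptions invented for illustration.

```python
import xml.etree.ElementTree as ET


def parse_delimited(data, record_name, fields, separator=",", terminator="\n"):
    """Build an XML tree from delimited text.

    `fields` is a list of (element name, converter) pairs, standing in for
    the type information a schema annotation would carry.
    """
    root = ET.Element(record_name + "s")
    for raw in data.split(terminator):
        if not raw:
            continue  # skip the empty piece after the final terminator
        record = ET.SubElement(root, record_name)
        for (name, convert), value in zip(fields, raw.split(separator)):
            child = ET.SubElement(record, name)
            # Converting and re-serializing rejects values of the wrong type.
            child.text = str(convert(value))
    return root


# Hypothetical format: fields separated by "|", records terminated by ";".
tree = parse_delimited(
    "widget|3;gadget|12;",
    record_name="item",
    fields=[("name", str), ("quantity", int)],
    separator="|",
    terminator=";",
)
xml_text = ET.tostring(tree, encoding="unicode")
```

The caller sees an ordinary element tree, as if the pipe-and-semicolon data had come from an XML parser, and a non-numeric quantity would raise a `ValueError` rather than slip through.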


Jon Bosak, the “father of XML” who got the project rolling, gave the final talk of the afternoon. Jon, a last-minute substitute, talked about Universal Business Language. UBL’s focus is on business documents, taking existing practices that run on paper (and faxes) and providing a framework to use them electronically. That means reconciling that approach with another existing practice, EDI. As Jon put it:

“the key question is how to make this cheap enough for smaller companies…. How do you get cheap software? Through standardization. And that means trading off all those proprietary features.”

One especially interesting piece is that they want to stay close to paper documents in part so that courts and other non-technical interpreters can look at transactions and understand what happened, or was supposed to happen.

Jon described UBL as “a radically un-new idea. We’re trying to connect with existing systems,” and pointed out that “disruptive” isn’t exactly a popular word among businesses trying to get their work done. UBL is also meant to be stable, providing a foundation like HTML has done. Much to my happiness, they’re also creating a “small business subset,” a profile that lets people work with a smaller set of pieces rather than demanding a huge initial implementation. Public review on that should begin soon.


What would you like to see in the markup conversation?

Michael(tm) Smith


Elliotte Rusty Harold’s Effective XML is one of my favorite XML books. The site for the book describes it as “a collection of guidelines and best practices for using XML. It focuses on using and developing XML applications, with a particular emphasis on aspects of XML that are often misunderstood or misapplied.” Some of the chapters, such as the Allow All XML Syntax chapter, contain advice on avoiding mistakes that others have made, while other chapters, such as the Catalog Common Resources chapter, provide details about XML-related technologies that aren’t quite as widely used as they might be.

Anyway, when I’m doing XSLT work, I sometimes find myself wishing I had a similar Effective XSLT book. XSLT, like most languages, sometimes provides more than one way of accomplishing the same task. Some of the ways are probably better than others. Not being as familiar with XSLT as I should be, I would find it useful to have a book around that provides some guidance.
