Just like last year, I’ll be blogging from XML Conference 2007. Rather than imposing some editorial structure, this’ll simply be a serialization of the things I hear from various speakers in various sessions.

Some random notes:

  • By looking at the slide up when we walked in (though no one said anything), it seems that next year the conference will be in Arlington, VA on December 8-10, 2008. [WARNING: Others have suggested that I was probably reading that wrong and that it’ll still be in Boston.]
  • 300 people are attending this year (fewer than last year).
  • The weather here is nasty, so the smallish turnout for the opening plenary isn’t surprising.

Does XML have a future on the web?

The conference started with three panelists (Douglas Crockford, Michael Day, & C. Michael Sperberg-McQueen) discussing XML’s future with regard to the web.

Michael Day

Michael has spent the last few years working on bringing together print and web rendering technologies for XML (and HTML) with Prince.

Has XML really ever been on the web? Well, not really as an alternative to HTML, which it’s never really replaced for websites (cue discussion of invalid XHTML). If we think of “the web” simply as documents served over HTTP, XML springs to much greater importance. It’s been far easier to get XML integrated into the server-side of the (human-facing) web than getting it properly used by the clients. That said, there have been some surprising jumps of server-side technology into the client (Java and XSLT). [Oops, I got this backward, see Michael’s comment]

Douglas Crockford

Does XML have a future on the web? Yes: see COBOL’s continued existence for an example of a technology’s inability to ever actually leave the enterprise. Rather than guessing whether XML will continue on the web, Douglas thinks it’s more important to look at the trends, and he sees a downward trend for XML and an upward one for JSON. He believes that JSON is far superior to XML as a data format [surprise!]. Aiding that move away from XML, the web community has never really fully adopted well-formedness, which was perhaps a mistake from the start.

A more pressing question is perhaps the future of the web itself, which seems to be tremendously endangered by security concerns and a lack of forward movement.

C. Michael Sperberg-McQueen

XML will have a future on the web, in part because it should have a future on the web. However, Michael is thinking of the “web” as something larger than web browsers or documents served over HTTP. Instead, consider the “web” simply a way of getting at a variety of addressable resources. Some parts of this web may need fancy UIs but other parts may care more about reliability and data integrity. Others may be very interested in internationalization and localization. XML wins as a technology simply by its standardization (as an alternative to the cost of implementing a new notation), by promoting loose coupling (despite some loss of speed), and because of its support of rich information (all browsers supporting XSLT for lossless presentation of XML rather than lossy HTML, for example). “Any notation that has acquired so many enemies… has got to be doing something right.”

Audience Questions

Elliotte Harold: Programmers (especially non-XML specialists) do indeed hate working with XML, but they only hate it because they’re only given access to the DOM. Shouldn’t we kill the DOM instead?
Douglas: The difference between JSON and XML isn’t notation, it’s the (data) structure, which is much closer to what programmers need. While this structure could be imposed over XML with a better API, that complexity isn’t necessary.
D Peters: People are talking about security, “do-overs”, etc. How does this impact the coming “Software as a Service” infrastructure?
Douglas: The current browser implementations have problems because they share all of the information between the current sessions (problems with cookie stealing, replay attacks, and chrome changes). That’s the dangerous web 1.0. Now, we’re intentionally trying to mash stuff up (which we’d always tried to prevent when unintentional). Developing an engine in this environment that prevents the evaluation of “evil” scripts is tremendously hard. Fixing this problem for the web will be very difficult (especially with threats from Adobe and Microsoft, among others). The web started as a document delivery system that morphed into an application delivery system. That’s the part of the web that other (closed) technologies are trying to steal.
Michael S-McQ: OK, security is important, but how can you have specified a format [JSON] that is most easily implemented using eval?
Douglas: JSON is no worse than HTML, so if you trust your server you are OK. More safely, there are libraries on json.org that load JSON securely.
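To make the eval worry concrete, here is a minimal sketch in Python rather than browser JavaScript (the language, the payload strings, and the `parse_untrusted` helper are my assumptions, not anything shown in the session). The idea is that a real parser accepts only data, so a payload that is actually code is rejected instead of executed.

```python
import json

# Hypothetical payloads: one is plain data, the other is code in disguise.
good_payload = '{"conference": "XML 2007", "attendees": 300}'
evil_payload = '__import__("os").getcwd()'


def parse_untrusted(text):
    """Parse text strictly as JSON data; never evaluate it as code."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON, so refuse it rather than trying anything cleverer.
        return None


print(parse_untrusted(good_payload))
print(parse_untrusted(evil_payload))
```

Passing `evil_payload` to `eval` would have run it; the parser simply refuses it.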
Simon St. Laurent: We’ve abandoned the initial goals of XML with some long sidetracks. Microformats are now the most promising bits of XML, where we specify the bits we actually care about.
Michael D: While it might be nice to ignore the past, more attempts should be made at reconciling the split between document presentation and the web.
Michael S-McQ: Microformats were suggested at the start of the XML specification but were rejected. [I don’t really follow his explanation of why.]
Douglas: Be pragmatic: use whatever works/fits.
Tony L: “Aside from the march of the paired delimiters”, how is JSON different from XML? Aren’t they just serialized trees? XML has problems with entities…
Douglas: Yes, they’re both serializations of trees. The major difference (as before) is that JSON’s basic structures map onto what programmers use for data structures, whereas XML’s structures map onto document structure. JSON was “standardized” because people couldn’t use JSON without it being standardized. “I am a standards body.” “Specialization in tools tends to make workers more productive.”
Michael S-McQ: There is isomorphism between XML and JSON, but it only goes so far. JSON has a sweet spot around a subset of XML. The part of XML that lives outside that subset is the stuff that historically had problems fitting into relational models (mixed markup, “variation”). Perhaps that’s what Douglas means when he talks about XML being good for “document structure.”
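A small sketch of where that isomorphism stops, using Python’s standard library (the sample record, the sample paragraph, and the `flatten` helper are mine, purely for illustration). A JSON record lands directly in a dictionary, while mixed content only survives as an ordered sequence of text runs and elements.

```python
import json
import xml.etree.ElementTree as ET

# Record-like data: JSON maps straight onto a dict with typed values.
record = json.loads('{"conference": "XML 2007", "attendees": 300}')

# Mixed content: text interleaved with child elements.
para = ET.fromstring("<p>XML is <em>very</em> good at <b>mixed</b> content.</p>")


def flatten(elem):
    """Return mixed content as an ordered list of text runs and (tag, text) pairs."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append((child.tag, child.text))
        if child.tail:
            # Text that follows the child element, still inside the parent.
            parts.append(child.tail)
    return parts


print(record["attendees"] + 1)
print(flatten(para))
```

The list that `flatten` returns could be written as JSON, but nothing in a plain dictionary model would have suggested that shape.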
Michael D: JSON works well with current programming languages. XML doesn’t fit well with C and Java’s data structures or relational databases. Perhaps in another 10 years it’ll fit better.
Michael Debinko?: Before XML, people who needed to interoperate had to specify syntax and vocabulary. XML made the need to specify syntax uniform. HTML5 seems to be specifying both again. Why?
Michael D: HTML4 should have specified itself long ago.
Douglas: Browsers are standardized. The HTML group is trying to specify browser technology.
Michael S-McQ: The separation of syntax and vocabulary does help in a lot of cases. The problem with HTML5 is that it defines parsing behavior, which offers nothing for authors. If you want consistent handling, write valid HTML4!
?: When do we have to use XML? For what kind of data?
Douglas: I don’t understand data in XML. JSON databases are evolving.
Robin L: Java and XSLT surprisingly ended up on the browser. What are the coming surprises?
Michael D: Dreaming, I would love to see CSS as a stylesheet language for printing high-quality documents.
Douglas: The industry will “discover” horrendous security problems in the current browsers.
Michael S-McQ: Soon the world will rediscover ASCII terminals. IBM will reintroduce the 3270.

Melissa Utzinger (MITRE): Microformats: Catching on like wildfire

What are microformats?

Embed semantics into web pages. Melissa will focus on (X)HTML. They’ve only formally been around since 2005, but now have interest from both small (cork’d, Satisfaction) and large (Yahoo, Google) companies. The first book on microformats was published this year. Web 2.0 is pushing the Semantic Web, but the Semantic Web itself is very hard to learn and complex. Microformats try to squeeze utility out of some limited semantics. Microformats don’t try to solve the problem of the Semantic Web.

Early semantic extensions of HTML just tacked on values into allowed attributes on many HTML elements.

One example

hCard is a microformat to replace vCard. It uses attributes on HTML elements to define contact information. When using hCard, point to the hCard profile in your <head> element.
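For a feel of how little machinery this needs, here is a rough sketch that pulls two hCard properties out of a snippet of HTML with Python’s standard parser. The contact is invented and the `HCardParser` class is my own; only the class names `vcard`, `fn`, and `org` come from hCard. It ignores nested properties, so treat it as a toy.

```python
from html.parser import HTMLParser

# Hypothetical contact marked up with hCard class names.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Corp</span>
</div>
"""


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    PROPERTIES = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


parser = HCardParser()
parser.feed(SAMPLE)
print(parser.card)
```

A browser that knows nothing about hCard still renders the same markup as ordinary text.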

Viewing microformats

There are some browser tools for viewing microformats embedded in the regular web (like Yahoo Local). There’s also a wonderful degradation strategy for microformats, as the underlying HTML can be rendered normally.

Plugins are available for all the major browsers, but two browsers are planning native support: Mozilla Firefox 3.0 (native for developers) & Internet Explorer 8.0. Operator is a common plugin for Firefox.

What about marking up information other than contacts? There are currently 20 microformats with 9 in draft form, among them calendar information. No one standards body controls microformats, with work happening in both the W3C and IETF.

Adopting microformats for the XML community

Two communities want to speak the same language but use different markup:

<Battalion>54th infantry</Battalion>
<Units>96th infantry</Units>

Add an attribute to help them interoperate:

<Battalion class="org">54th infantry</Battalion>
<Units class="org">96th infantry</Units>

Does this seem simple? Well, that’s one of the goals of microformats.
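A consumer can then ignore the element names entirely and look only at the shared attribute. A minimal sketch, assuming each community wraps its element in a `Roster` root (that wrapper and the `organizations` helper are hypothetical):

```python
import xml.etree.ElementTree as ET

# Two vocabularies with different element names but the same class value.
doc_a = ET.fromstring('<Roster><Battalion class="org">54th infantry</Battalion></Roster>')
doc_b = ET.fromstring('<Roster><Units class="org">96th infantry</Units></Roster>')


def organizations(root):
    """Return the text of every element tagged class="org", whatever it is called."""
    return [
        el.text
        for el in root.iter()
        if "org" in el.get("class", "").split()
    ]


print(organizations(doc_a) + organizations(doc_b))
```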

Challenges

  • Users don’t want to install add-ons, but perhaps they wouldn’t care if the UI was seamless.
  • Development tools are not there, especially for validation.

Conclusion

Microformats are new and will hopefully continue to grow rapidly.

To learn more, visit microformats.org. To see websites using microformats internally, visit Google Maps, Yahoo Local/Tech/Flickr/Upcoming, and Technorati Kitchen.

[Surprise, the conference schedule uses the hCard microformat!]

Taylor Cowan: TripBlox: creating travel standards on the web

[Yay, a pure researcher not worried too much about business applicability!] How can ideas of the Semantic Web be applied to travel? Travel writers should be able to allow their content to be aggregatable and discoverable. In particular, blog postings about travel can be broken into interesting pieces (people, places, etc). After breaking these travel descriptions into consistent pieces, everyone’s posts about Colorado Springs can be found. After aggregation, users can be pointed to more specialized sites for more information (mapping, for example). If they share the same activity, maybe they share some feelings and would love to have some relationship in a social network.

How to bring microformats and RDF/OWL together? Designers love microformats, and they do indeed provide semantics. In the back-end, the underlying graphs are stored as RDF triples. RDF isn’t good for humans, but is nice for computers. Microformats have some issues for computers [crazy example of XHTML2vCard XPath].

So, what about travel? People want to have their “wish trip”, travel agents want to promote their Top 10, and others want to publish travel blogs. “On my trip, I want to…” drink wine, lie at the beach… The microformats community is very resistant to developing totally new microformats (they want existing use cases live on the web). Travel microformats don’t have existing examples on the web. Happily, Atom already supports much of what is needed (title, summary, categories, licenses, dates, names). There is already a microformat for Atom (hAtom).

How do you go from hAtom into RDF? Put together a bunch of other tools, first Tidy, then XSLT, ROME, and the Jena API for RDF. Taylor wrote JenaBean to make the Jena API more attractive.
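Taylor’s actual chain is Java-based (Tidy, XSLT, ROME, Jena). As a stand-in, here is a rough Python sketch of the same idea: read an hAtom entry and emit subject/predicate/object triples. The sample entry, the `PROPERTY_MAP` predicate names, and the `hatom_to_triples` helper are my assumptions; it also assumes the XHTML is already well-formed, which is the job Tidy does in the real pipeline.

```python
import xml.etree.ElementTree as ET

# Hypothetical blog entry using hAtom class names (hentry, entry-title, entry-summary).
XHTML = """
<div class="hentry" id="post-1">
  <h2 class="entry-title">Hiking near Colorado Springs</h2>
  <p class="entry-summary">On my trip, I want to drink wine.</p>
</div>
"""

# Illustrative mapping from hAtom classes to predicate names.
PROPERTY_MAP = {"entry-title": "atom:title", "entry-summary": "atom:summary"}


def hatom_to_triples(xhtml):
    """Turn each hentry into (subject, predicate, object) tuples."""
    root = ET.fromstring(xhtml)
    triples = []
    for entry in root.iter():
        if "hentry" not in entry.get("class", "").split():
            continue
        subject = "#" + entry.get("id", "")
        for el in entry.iter():
            for name in el.get("class", "").split():
                if name in PROPERTY_MAP:
                    text = "".join(el.itertext()).strip()
                    triples.append((subject, PROPERTY_MAP[name], text))
    return triples


for triple in hatom_to_triples(XHTML):
    print(triple)
```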

How to get started? microformats.org (suggestions, discussion), w3c.org (working on OWL and RDF), planetrdf.com (other people’s work) and geonames.org (location help).

His ontology is available here.

Mark Jacobson, Charlton Barreto, Jeff Deskins, Laurens van den Oever: Where are XML authoring tools today, where are they going, and what do we want?

What do authors, editors, and copy-editors actually need to do their work? The panel presents their products: XMetal is made for technical publishing and will continue to be, Adobe is adding XML support to help automation and allow reuse (supporting RelaxNG in the future [Woot!]), xOpus only focuses on authoring XML, but it is browser-based and intended for non-technical users.

Two non-technical challenges for XML editors (from Mark J.): many people simply want to use Word and have no interest in adding structure.

Questions

How should authoring tools be effectively (and cheaply) deployed across a large, nationwide organization?
xOpus is a browser-based tool, so would make quite a bit of sense if you’ve already got a CMS (they don’t provide a CMS). It costs ~€180/user (with volume discounts). Adobe is providing hosted services (this?).
“RelaxNG gives the little guy a chance.” [nodding from panel] DocBook and TEI have both gone to RelaxNG (probably for customization). How important is customization (available to a shop with a solitary developer) to the authoring tool? What about XML-back-ended wiki systems (with “upconversion” [ha!])?
Customization is important to xOpus (via XSLT); everyone else says they care about it too. RelaxNG support is coming.
How can we make XML editors’ UIs less confusing (b, i, u buttons) for people who don’t know XML? Many authors find the behavior of XML tools broken.
[muddled response]

Bob DuCharme: XHTML 2 for Publishers: New opportunities for storing interoperable content and metadata

Many small web designers are proud of the fact that they create valid XHTML1. They may not even know what “well-formed” is, but they like the idea of passing some validation checker. They also understand the value of separating content and structure using CSS (to save them work). With the modularization of XHTML1.1, subsets of the markup can stand alone and individual modules can be updated/customized.

XHTML2 is trying to solve a number of problems. It targets a lot of “thou shalt not” guidelines around XHTML1 (metadata, accessibility, etc). XHTML2 also provides a lot more opportunity to encode semantics. It “hits a sweet spot” between the flat dumbness of XHTML1 and the complexity of DocBook. XHTML2 may not be your content master, though it might make sense for your first dip into XML, and it might fill bigger shoes than just sending content to the browser.

XHTML1 has no nesting, only flat siblings. XHTML2 has nesting <section> elements (similar to div but with more semantic meaning). This helps with promoting/demoting sections inside documents (especially when copy+pasting). Another example of better semantics is hr => separator.

XHTML and the Semantic Web? XHTML has elements like address and kbd, but no one was using them. This is why the huge number of requests for adding semantic elements for XHTML2 was punted in favor of user-extensibility using RDFa. In addition to the broader semantics, RDFa brings along the ability to add rich metadata as well.

So, what’s the takeaway? XHTML2 is more appropriate for your workflow and may be more familiar to a wider variety of content authors.

Norman Walsh: XProc: An XML Pipeline Language

W3C recently said that after XML and XSLT, XProc was the most important standard.

XProc Development

XProc comes from a W3C working group that started in 2005 with two goals: produce an XML pipeline language and describe a default processing model for XML documents.

One year ago, Norm thought they’d be finished by now (before their charter ran out on 31 Oct 2007), but they are only at Last Call (and there will be another one). The last working draft was published 29 November 2007. Today there are both open source and commercial implementations of XProc in the works.

What’s New

Some new stuff since last year:

  • A defaulting story has been developed for syntactic simplicity.
  • Parameter handling has been revised.
  • There is now a mechanism for handling complex namespaces
  • XPath 1.0 + 2.0

Common Features

  1. Start with a document or documents
  2. Apply one or more processes, perhaps conditionally, perhaps iteratively
  3. Catch and recover from errors, if they occur
  4. Produce a document or documents
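Those four features can be sketched in a few lines of Python. This is not XProc, just the same shape: documents in, a chain of steps, a fallback when something fails, documents out. The step functions, the `status` attribute, and `run_pipeline` are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def parse(docs):
    """Step: turn strings into element trees (raises ET.ParseError on bad input)."""
    return [ET.fromstring(d) for d in docs]


def drop_drafts(docs):
    """Step: conditionally keep only documents not marked as drafts."""
    return [d for d in docs if d.get("status") != "draft"]


def serialize(docs):
    """Step: turn element trees back into strings."""
    return [ET.tostring(d, encoding="unicode") for d in docs]


def run_pipeline(docs, steps, fallback):
    """Feed the output of each step into the next; recover if parsing fails."""
    try:
        for step in steps:
            docs = step(docs)
        return docs
    except ET.ParseError:
        return fallback


STEPS = [parse, drop_drafts, serialize]
print(run_pipeline(['<doc status="final">ok</doc>', '<doc status="draft">wip</doc>'], STEPS, []))
print(run_pipeline(["<doc>"], STEPS, ["<error/>"]))
```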

Details

XProc tries to be amenable to streaming, fairly declarative, and as simple as possible. It is based on a pipeline, which is built from steps. Each step performs a specific process. Steps are glued together with some help from XPath. “Most steps are atomic, black boxes that perform a task” (XInclude, Load from uri, XSLT 1.0, Render-to-PDF, Compare [XSLT2 deep-equal]). Documents flow through pipelines (not random subtrees). The non-atomic steps are wrappers around sub-pipelines, and are the basic control structures of XProc: Grouping, Conditional evaluation, Exception handling…. Pipelines themselves can become atomic steps for other pipelines.

Steps always have both a name (“db2html”) and a type (“XSLT”). They have ports, which are fixed by the step type. An XSLT step would have “source” and “stylesheet” ports for input, and “result” and “secondary” ports for output. Steps are encoded like so: <p:xslt name="db2html">. Ports for steps are defined by p:declare-step, which encodes input and output ports and also options for a step.
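One way to picture “ports are fixed by the step type” is as a lookup table that every binding is checked against. The following is a Python model of that idea only, not XProc syntax or any implementation’s API; `STEP_TYPES`, `Step`, and `bind` are hypothetical names, and only the XSLT port names come from the talk.

```python
from dataclasses import dataclass, field

# Port names per step type; only the XSLT example from the talk is modeled.
STEP_TYPES = {
    "xslt": {"inputs": ("source", "stylesheet"), "outputs": ("result", "secondary")},
}


@dataclass
class Step:
    name: str  # instance name, e.g. "db2html"
    type: str  # step type, which decides the available ports
    bindings: dict = field(default_factory=dict)

    def bind(self, port, source):
        """Connect an input port to a source, rejecting ports the type lacks."""
        if port not in STEP_TYPES[self.type]["inputs"]:
            raise ValueError(f"{self.type} step has no input port {port!r}")
        self.bindings[port] = source
        return self


# The file names here are placeholders.
db2html = Step("db2html", "xslt").bind("source", "book.xml").bind("stylesheet", "docbook.xsl")
print(db2html.bindings)
```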

Many pipelines are linear or mostly linear with obvious “primary” input and “primary” output (like XInclude). These two observations led XProc to specify some default syntax for chaining together sequential steps.

Inputs come from a URI, an inline document, another step’s port, or [one I didn’t get]. Options can be computed from XPath expressions or from literal markup (untypedAtomic). Steps must specify the options they accept. Parameters are the final bit, with messiness coming from XSLT’s parameters. Unlike options, the names of parameters are not known in advance.

Conditional processing is available via the p:choose syntax. It looks very similar to XSLT’s xsl:choose. Iteration is handled by p:for-each, with a p:iteration-source. Exception handling comes from a simple try-catch model.

Stewart Taylor: XML and XPath in the Wild

This talk focuses on three studies, all centered on scraping XML and XPath from various sources and then looking at the statistics. The motivation was to develop high-performance XML parsing tools. Collecting qualitative and quantitative information on “real” XML was tremendously helpful for testing.

Scraping the documents was done through Google’s API. They collected hundreds to thousands of XHTML, RSS, Voice XML, SVG, SAML, and SMIL files. These documents may have been non-representative if they were tutorial material, but who knows. Simplistic statistical analysis on document and word length was done, but they also used Principal Component Analysis for more rigorous study. Now that they had some statistical analysis from example documents, they fed that to their trained XML generator, which produced the XML for the parser to test. [Stewart discusses the specifics of the XMLRand syntax, which sorta follows XML Schema, with some other interesting bits like (continuous) <rand> and (discrete) <set>.]
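I didn’t capture the XMLRand syntax, so the following is only a guess at the general idea in Python: draw element counts from a continuous distribution and content from a discrete set, then build the XML. The `MODEL` numbers, the word list, and `random_description` are made up for illustration and are not the speaker’s model.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical statistical model: a continuous distribution for paragraph
# counts and a discrete set of words to draw content from.
MODEL = {
    "paragraphs": (3.5, 1.0),  # (mean, standard deviation), illustrative only
    "words": ["lorem", "ipsum", "dolor", "sit", "amet"],
}


def random_description(rng):
    """Build a <description> whose size and content are sampled from MODEL."""
    mean, spread = MODEL["paragraphs"]
    count = max(1, round(rng.gauss(mean, spread)))
    desc = ET.Element("description")
    for _ in range(count):
        p = ET.SubElement(desc, "p")
        length = rng.randint(3, 8)
        p.text = " ".join(rng.choice(MODEL["words"]) for _ in range(length))
    return desc


rng = random.Random(0)  # seeded so a test run is repeatable
sample = random_description(rng)
print(ET.tostring(sample, encoding="unicode"))
```

The output is gobbledygook by design; what matters for parser testing is that its shape follows the measured statistics.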

Results? Shakespeare’s plays have 4 scenes per act, but can be meaningfully modeled with their techniques. More interestingly, RSS 2.0 off of the web had 3.5 paragraphs per description, for example.

[He shows an example of the randomly generated XHTML. It looks pretty funny due to the gobbledygook (random characters), but does look cromulent if you squint… The random SVG is even more hilarious.]

The results on the XPath analysis were more interesting: 97% of XPaths within XSLT were single-step [questions from the audience on what exactly they were looking at]. That’s clearly a design choice, but still fairly striking. Finding XPath in other languages’ source code can be difficult, but thankfully there are searching tools (from Google and others). Some random bits: 18% of studied XPath expressions used predicates, of which half tested an attribute against a static value, and the next most common was [1]. 51% used functions.
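The talk didn’t show how expressions were classified, so here is a crude Python sketch of how one might count these features. The `classify` function and its regular expressions are my own approximation: it does not handle nested predicates or slashes and brackets inside string literals, and a real study would use a proper XPath parser.

```python
import re

# Node tests look like function calls but are not functions.
NODE_TESTS = {"text", "node", "comment", "processing-instruction"}


def classify(xpath):
    """Report whether an XPath is single-step, uses predicates, or calls functions."""
    # Strip predicates first so slashes inside [...] are not counted as steps.
    without_predicates = re.sub(r"\[[^\]]*\]", "", xpath)
    steps = [s for s in without_predicates.split("/") if s]
    called = re.findall(r"([\w-]+)\s*\(", xpath)
    return {
        "single_step": len(steps) == 1,
        "has_predicate": "[" in xpath,
        "has_function": any(name not in NODE_TESTS for name in called),
    }


for expr in ["title", "item[@type='news']", "/rss/channel/item[1]", "count(//item)"]:
    print(expr, classify(expr))
```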

Conclusion: They feel happy with their statistical-model-backed XML generator. Many thousands of open source projects use XML (and DOM is more common than XPath).