O'Reilly.com: Spreading the Knowledge of Innovators

Understanding XML


The Latest from XML.com

Making Models and Watching for Swans

There's a problem with living life on the bleeding edge. For all of the exhilaration of being one of the first to play with a new technology (or in some cases even to create that new technology), it's also very lonely there - by definition, most people will be encountering the same technology about the time that you've come to think of it as old hat or even (gods forfend) passé. This means that it's sometimes easy to lose sight of what happens when these tools and techniques hit the real world of the workaday developer.

XML is a case in point. It's ten-year-old technology and has become the data lifeblood of the Internet (along with its younger sibling JSON). Every so often someone on the xml-dev mailing list will pop up and ask "Is XML Dead?" - to which the rest of the old guard will pipe up in defense of XML or agree that, yes, XML's existential crisis is upon it, and it will soon be pushing up the daisies, if it's not there already. As one colleague of mine put it - XML's just not interesting anymore.

And yet... over the course of the last four months, I have heard about XML data services in the business reporting space (XBRL), talked with cartographers and GIS professionals wrestling with services in GML and that application arena, spent some time looking at aircraft specifications encoded in the S1000D schema, waded through the debate on the acceptance of the OOXML standard (which, at least by hearsay, seems to have been approved by ISO), at least dipped my toe into the HL7 specification once again, and regained some exposure to service-oriented architecture (SOA) and its seeming rival for programmer mindshare, resource-oriented architecture (ROA).


While I think that these efforts definitely give the lie to the contention that XML is dead (if anything, it's only just now learning to walk), they also point out an interesting trend. Many, if not most, of the vertical specifications in particular began to take shape in the explosive early days of XML technology, when everyone felt that if only they had a schema, everyone would begin adopting that schema quickly and the world would very quickly become totally wired - business-to-business (B2B) was just around the corner, and everyone would settle upon the one best, perfect schema.

Inside Lightroom
Visit our Sponsored Developer Resource Pages and learn about cool stuff from our sponsors!

Interested in Sponsoring?
Interested in sponsoring the XML.com newsletter?
Please email us at advertising@oreilly.com for rate and availability information. Thank you!

Only, somehow, it didn't quite work out that way. Semantics reared its ugly head - you say po-tAY-to, I say po-tAH-to, Joe over there just calls them "spuds". People discovered that names were political, and that the power to give something a name translated into budget priorities being set. The more vague the concept, the more contentious the wrangling, regardless of whether you were talking about amortized returns on investment or ventricular calcium vs. potassium ion levels. That's because these names are expressions of models, and as such, have at least an indirect connection to the real world.

The problem with models is that they are, by definition, incomplete descriptions of reality - they can't be complete. The more detailed the model, the greater the number of interacting components (n), and consequently the higher the computational costs (roughly n²). The world's fastest computer is a dedicated weather forecasting system in Japan. It assumes a cell size of 100 miles, and even with something we would think of as unbelievably coarse, it is hard pressed to finish the computational process in anything approaching real time at that level of granularity. Model complexity is non-linear.

This complexity is certainly true when dealing with semantically charged schemas. The challenge when building a schema is to add enough elements to a system to ensure that the resulting model is good enough to more or less predict the future, but not to add so many elements as to increase complexity unduly. Unfortunately, there are no hard and fast rules for identifying that point, beyond creating the models and testing them. This, I think, has been the biggest impediment to widespread adoption of these formats, though as more data points come in about the efficacy of the models, these schemas will end up becoming easier to work with and manage.

Unfortunately, we also have a tendency to become so wrapped up in our models that we fail to take into account the fact that while a model may accurately predict the past, there is no guarantee that it will always predict the future accurately. Indeed, Nassim Taleb, author of "The Black Swan" and a former Wall Street trader, recently discussed the problem with assuming that models do in fact reflect the world when discussing disruptive events (Black Swans, after the swans of that color that were discovered in Australia even though it had been assumed that all swans were white):

Interviewer: I gather you don't have a lot of respect for the effectiveness of Wall Street's "risk management."

Taleb: It is the "science" of risk management that effectively turned everyone involved into a turkey. If the Food and Drug Administration monitored the business of risk management as rigorously as it monitored drugs, many of these "scientists" would be arrested for endangering us. We replaced so much experience and common sense with "models" that work worse than astrology, because they assume that the Black Swan does not exist.

Trying to model something that escapes modelization is the heart of the problem. We like models because they do not require experience and can be taught by a 33-year-old assistant professor. Sometimes you need to say, "No model is better than a faulty model" - like no medicine is better than the advice of an unqualified doctor, and no drug is better than any drug.

As more of the burden of modeling systems falls into the lap of XML specialists (and it definitely is falling there), those same specialists are being driven to become experts not just in the mechanics of XML (such as validating XML against schemas or transforming it with XSLT) but increasingly in the semantics of modeling real-life processes - or at least in training up the people that do. In many respects developing ontologies is as different from writing XML parsers as architecting business intelligence systems is from writing Java libraries, but given that the number of knowledgeable XML experts is still only a tiny fraction of the number of Java or C++ developers, an XML specialist will often need, at some point in their career, to don the ontologist's hat, roll up their sleeves, and learn what they can about systems theory... and to be cognizant of the fact that models are approximations that are only as good as their creators' grasp of the problem domain.

It's a rather sobering thought to realize that being wrong about a model can turn a nearly 100-year-old investment firm worth hundreds of billions of dollars insolvent overnight. Oops.

Insights: The Wizard as Detective

Hey, bud... step into the light of the streetlamp over there. I've got somethin' to tell ya.

Yeah, that's close enough. I wanted to talk to you about a detective. You know the type - hard-bitten, wearing a trench coat, a steel-gray gun in his pocket and a half-burned stogie hanging off his lip. He may look like a bum, but he can see inside your soul, know your every fault and sin, but hey, he's just an ordinary Joe, doing a hard job, trying to save the world one dame at a time...

I've long been a fan of the detective genre. From Conan Doyle's master detective Sherlock Holmes to gumshoe Sam Spade, the stuffy Hercule Poirot, and the mischievous, bird-like Jane Marple, all the way up to the present in the form of Columbo in the '70s or the techno-wizards of CSI, the detective has been an archetype that I think has all but replaced the wizard in contemporary thinking. I suspect that for many programmers detective literature is a close second to hard science fiction, as much because they can empathize with the detective in a way that's hard to do with many other contemporary fictional characters - they are problem solvers extraordinaire, but they are also the agents not of law but of justice, a concept that I think actually resonates well with many analytic types.

Admittedly, my search for good gumshoes has taken me on some strange tangents. One genre that I've particularly come to enjoy is that of the Speculative Detective. Take the gumshoe out of the present day, out of their time or even the world entirely, change the rules, but keep the essence of what they are intact. Randall Garrett's Lord Darcy was one of the earliest prototypes, a Holmes-like genius in a world of magicians, where he himself was not one - at least in any conventional way. In recent years, Glen Cook paid homage to Lord Darcy's creator with the introduction of Garrett, P.I., Sam Spade meets Dungeons and Dragons, though the hard-as-nails dames are far more Dashiell Hammett than they are the late Gary Gygax (about whom I'll have more to say in another post).

Most recently, my favorite wizard/detective has definitely been Jim Butcher's Harry Blackstone Copperfield Dresden, Chicago's only practicing wizard, and a gumshoe's gumshoe. Like many a detective before him, Dresden is as distrusted by the powers that be as he is feared by the supernatural criminals he pursues, a holder of arcane secrets who nonetheless has the integrity to get up every day and help right the world of wrong. The most recent "Dresden Files" book, Small Favor, will be out April 15, 2008.

I'm not normally inclined to wax loquacious about detectives in an XML column, save for another book I found recently, in the Children's section of all places, by author Derek Landy. Skulduggery Pleasant (HarperCollins, 2007) is the story of a gumshoe who's quite literally a living skeleton, and his young witch apprentice, Stephanie. I picked this up to read while watching my eight-year-old daughter look for a diary at the local Chapters store, and I was pleasantly (sorry) surprised to find myself hooked. Landy has put together a hard-bitten (harder than most) protagonist presented with wit and style, the occasional homage, genuine humor, and occasionally profound insight. One quote in particular caught my eye, one that seemed especially suited to this newsletter's audience:

Skulduggery: [Breaking the door] would work if the door was in the same place as the lock, but things are seldom that straightforward.

Stephanie: So we need the key.

"We need the key."

"We don't have to solve a puzzle to get it, do we?"

"We may."

She groaned. "How come nothing's ever simple?"

"Every solution to every problem is simple. It's the distance between the two where the mystery lies."

Here's looking at you, kid.

Adding Scripting and Updates to XQuery

XQuery is a remarkably potent language for performing queries on existing data stores, serving the same Data Manipulation Language (DML) functions that SQL's SELECT statement does for relational databases. However, the concentration strictly on DML has left a fairly gaping hole in XQuery, and has fairly radically diminished its appeal as a generalized data access language. This lack has usually been made up for by vendor extensions to the core language, but without standardization, the particular constructs for performing updates are all over the map, syntax-wise.

For this reason especially, the recent activity within the W3C's XQuery Working Group should be met with considerable applause. Quickly heading towards Recommendation status (likely by late summer of 2008), the XQuery Update Facility (XUF or XQUF, or for those that prefer more legible acronyms, XUpdate) and the XQuery Scripting Extensions (predictably XSX - XSex? ... hmm, how about just XScript?) should turn XQuery from a read-only language into one on a par with languages such as Perl, PHP, Ruby, ASP.NET, and so forth as a full-fledged scripting language, a DDL (Data Definition Language) for the XML set, or so at least the claim goes.

You could be well excused for being skeptical. XML is usually seen, at least by programmers, as a convenient mechanism for storing or persisting data objects, but the idea of a query language as a full-bore scripting environment, either for the Web or elsewhere, seems a bit of a stretch. However, the stretch is actually not as far as it may appear on the surface. One point to consider is that, unlike XSLT, XScript is not side-effect free - you can do things within XQuery functions that cause changes in the underlying environment. While the functional programming purists may cringe at this, frankly, there are simply times when you have to change the environment.

For instance, in a side-effect-free environment, there is no way to add a new XML document into a collection of documents, or into a sequence of nodes. You can't build indexed iterators. You can't even build while loops, because the conditional expression controlling the loop involves a variable that must be changed within the body of the loop, but in a purely functional language, that expression is outside the scope of the body.
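To make that concrete, here is roughly what such a loop looks like under the scripting extensions. This is a sketch: the drafts were still settling on keywords when this was written, so treat the declare/set spelling below (taken from the description later in this article) as an assumption rather than final syntax.

```xquery
(: A counting loop - impossible in side-effect-free XQuery 1.0,
   because $i must change inside the body of the loop. :)
declare variable $i as xs:integer := 1;
declare variable $total as xs:integer := 0;
while ($i <= 10) {
    set $total := $total + $i;  (: mutate state declared outside the body :)
    set $i := $i + 1;
};
<sum>{ $total }</sum>  (: the sum of 1 through 10, i.e. 55 :)
```

Note that $i and $total are declared with a type and then modified with set, exactly the capability a purely declarative expression language cannot offer.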

XScript was developed in conjunction with XUpdate because of the realization that once you can change your state with XUpdate, side effects will occur. For instance, the following example shows how, if you had a list of bids in a document, you could add a new bid that's 10% higher than the highest previous bid to that list:

let $uid := doc("users.xml")/users/user_tuple[name = "Roger Smith"]/userid
let $topbid := max(doc("bids.xml")/bids/bid_tuple[itemno = 1002]/bid)
let $newbid := $topbid * 1.1
return (
    {
        insert nodes
            <bid_tuple>
                <userid>{ data($uid) }</userid>
                <itemno>1002</itemno>
                <bid>{ $newbid }</bid>
                <bid_date>1999-03-03</bid_date>
            </bid_tuple>
        into doc("bids.xml")/bids;
    },
    <new_bid>{ $newbid }</new_bid>
)

In this particular XQuery script, the insert nodes "command" changes the state of the bids.xml document - it has a side-effect that is in this case more important than the result passed back (which just echoes the new bid value for external consumption or confirmation).

XUpdate makes it possible to change the value of a given element or attribute, insert or append nodes into existing XML documents or collections, replace one set of nodes with another in a given document, delete a document, or perform some form of transformation on an element (for instance, renaming an <article-title> node as an <h1> node) while otherwise keeping the rest of the content the same.
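In XUpdate syntax, those operations look like the following (a sketch against the hypothetical bids.xml document used above, plus an invented articles.xml; the update primitives insert, replace, delete, and rename come from the draft specification):

```xquery
(: Change the value of an existing element in place. :)
replace value of node doc("bids.xml")/bids/bid_tuple[itemno = 1002]/bid
    with 165.00,

(: Remove a node - and everything beneath it - entirely. :)
delete node doc("bids.xml")/bids/bid_tuple[bid_date = "1999-03-03"],

(: Rename an element while keeping its content intact,
   e.g. the article-title to h1 transformation mentioned above. :)
rename node doc("articles.xml")//article-title[1] as "h1"
```

As with the insert nodes example, each of these expressions contributes a pending update rather than returning a value; the changes are applied to the underlying documents once the query completes.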

This has implications, of course. Using a slight variation of these commands (through the eXist XML database), I inadvertently modified an AtomPub data feed, introducing all kinds of complications into my debugging until I realized what was going on. It also means that you can do the equivalent of passing variables by reference rather than just by value (useful, if potentially dangerous, when changing the order of sequences, for instance).

The question arises, of course, of what happens when you need to do something that doesn't directly involve working with XML. After all, most server languages tie into multiple systems, handle different kinds of document parsing, work with the server environment, and can access specialized math and text manipulation routines. As it turns out, XQuery has long had support for exactly this type of situation through the use of XPath extensions in custom namespaces. You can set up access to a SQL database and pass SQL commands through a namespaced extension built around local ODBC or JDBC drivers. You can manipulate binary images by passing resource URIs (which are just another form of handle, after all) into image extension functions, and so forth.

The difference is that prior to XScript, anything that changed the environment through such a module extension was, technically speaking, illegal within XQuery 1.0, which assumed a purely declarative environment. With XScript enabled, on the other hand, such extensions can work with impunity - both for good and for ill - but for the most part granting such power is simply giving programmers their due as competent professionals (I find I'm distrustful of computer languages that try to limit operations because they are potentially unsafe). Such update capabilities and side-effect-enabled transactions are necessary, because filters are not applications.

With these capabilities in place, you can work with XQuery exactly as you would any other server-side language: you can take advantage of modularized function libraries, establish type-safe constants, and use a few other new keywords and capabilities that XScript opens up. In addition to the aforementioned while statement, XScript introduces the formal declaration of variables as having a given type and multiplicity, then provides the set keyword for modifying those variables. This differs from the much weaker let statement, which generally doesn't require pre-definition. Similarly, XScript also defines constants, which cannot be changed once declared and will raise an exception if a change is attempted.

Additionally, XScript contains the exit with $result statement, which terminates a given function or query and passes back $result as the result of that function. This can be a major boon to function writing, as otherwise you end up constructing elaborate if-then-else statements, many of which return an empty result set, just because you can't break out of a function at the point you've created your results. Similarly, XScript defines continue and break for use within a while block to better handle execution control.
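Taken together with while and set, exit with makes early returns possible in a way XQuery 1.0 never allowed. Here is a sketch in the draft syntax described above; the function name and data shape are invented for illustration:

```xquery
(: Return the first bid above a threshold, exiting as soon as one is found. :)
declare function local:first-high-bid($bids as xs:decimal*,
                                      $threshold as xs:decimal) {
    declare variable $i as xs:integer := 1;
    while ($i <= count($bids)) {
        if ($bids[$i] > $threshold) then
            exit with $bids[$i]   (: break out of the whole function here :)
        else ();
        set $i := $i + 1;
    };
    exit with ()  (: nothing found: pass back the empty sequence :)
};
```

Without exit with, the "found it" case would have to thread its way back out through the enclosing expressions, which is exactly the elaborate if-then-else scaffolding described above.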

Beyond some enhancements to functions and the introduction of a typeswitch construct that makes it possible to execute content based upon a given variable's data type, the only other major (?) piece is the introduction of the semi-colon as a formal statement separator. This seems like such a small thing, but the semi-colon can make code considerably more legible than open-ended expressions.
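A typeswitch dispatches on the dynamic type of a value, much as a case statement dispatches on its value. A minimal sketch (the element and function names here are invented):

```xquery
(: Describe an item according to its runtime type. :)
declare function local:describe($item as item()) as xs:string {
    typeswitch ($item)
        case element(bid) return "a bid element"
        case xs:decimal   return "a decimal amount"
        case xs:string    return "a plain string"
        default           return "something else"
};
```

This is particularly handy in update and scripting code, where a function may receive nodes of several different kinds and needs to handle each differently.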

XScript is still very much a work in progress, and the utility of XQuery as a scripting language will only become apparent once formal implementations appear. Still, since standards gurus from IBM, Oracle, BEA, and others are the primary editors, it is very likely that these changes will make their way into most major XQuery implementations by the end of the year, and as such they should be watched closely.

Standards Watch

I'm setting up an automation engine for the Standards Watch section, and should have a complete listing for the last couple of weeks in my next newsletter.


To change your newsletter subscription options, please visit http://www.oreillynet.com/cs/nl/home
For assistance, email help@oreillynet.com
O'Reilly Media, Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472   (707) 827-7000