While a remarkable amount of both ink and electronic bandwidth has been expended upon the use of XML in the data realm, there are times when it is necessary to step back for a bit and look at how and where XML is being used today. One thing that becomes obvious when studying the XML landscape is that a significant amount of XML is still being used to describe narrative: telling a story, advising people in the use of a product, structuring reports, and doing other things that focus more on documents than they do on data.

In some respects, this is not all that surprising. In data-centric applications, XML isn’t always the best choice for working with structured content, and indeed there are times when XML is perhaps the worst, most hideously inefficient mechanism for dealing with data. However, XML has become the standard means of writing and marking up structured narrative content in most organizations. That doesn’t mean that XML is dominant for “unstructured” content - that distinction still belongs very much to Microsoft Word, with XML occupying a considerably inferior position there - but for organizations that recognize the benefit of structured content, XML languages such as DITA and DocBook are very quickly becoming the standard for storing information.

I had a chance to see that principle at work this week at the DocTrain conference in Vancouver, British Columbia. Conference chairman Scott Abel (CEO of The Content Wrangler) graciously invited me to the conference, where I had the chance to talk with a number of people working with technical documentation, online content creation and related material. Overall, it opened my eyes fairly dramatically to the hyper-accelerated world of content management a decade after the introduction of XML.

One of the first things that became apparent, and was echoed repeatedly through the conference, was that both the DITA and the DocBook specifications are quite alive and well in organizations, and each is evolving into its own distinct niche: DITA looks to be turning into the default standard for large-scale enterprises, while DocBook works more effectively at the small to intermediate level. What’s perhaps more interesting is that Microsoft Word, even with the XML support provided by OOXML, is not making much of an inroad into the structured document market, in great part because it is fairly difficult to constrain people’s use of the word-processing program to a limited, finite subset of potential styles.

So what, exactly, is a structured document, and why is it so important to organizations? I’m going to answer this in a somewhat roundabout fashion, taking my cue from a story that Scott Abel told me after the conference itself wrapped. Scott specializes in technical communications systems and has been a technical writer himself for a number of years. One day, he discovered that he had problems with cockroaches, and chose to solve the problem with a well-known name-brand insecticide. Before blasting a particular cockroach to oblivion, he noted the instructions:

  1. Remove cap from can.

  2. Shake can twice.

  3. Apply product to a distance of 24″.

After following the instructions (as best he could) and wondering how one defined a shake and why it was so essential that the can be shaken twice (would it explode if it was shaken three times?), he decided to check out the documentation on the website for the product. He was rather astonished to find that the instructions on the website were rendered as graphics, were completely different from what was on the can, and furthermore didn’t make much more sense. He then wrote a piece on his own blog about the odd discrepancy, especially after tracking down the printed documentation on the box that the can came in and finding that those instructions were inaccurate and incompatible with either of the two other versions.

The gist of the blog was that insecticides were controlled by the Food and Drug Administration, which had very clear instructions about labeling on products, and that discrepancies between instruction sets, especially when dealing with noxious substances, could result in fines, recalls and other potentially disastrous consequences.

Cockroach-free, Scott didn’t think much more about the issue until one day when a huge arrangement of flowers was delivered to his door. The flowers were from a lawyer who had found his blog and had used it as supporting evidence for a client’s claim that nothing in the instructions indicated that, when using the spray outdoors, the wind should be at your back. The client in question discovered the missing instruction the hard way - he was left blinded when the insecticide caught in the wind and blew into his eyes.

Documentation is pervasive in today’s industry, but it exists for more than simply keeping armies of technical writers employed. Documentation provides the instructions for using today’s technology, but it also establishes the boundaries of the liabilities that companies may have, and it has implications for marketing and customer management. Social media is changing that relationship somewhat - certain areas benefit from strong and active prosumer communities that are willing to put the time into building documentation for everything from role-playing games to programming languages - but creating “legal” documentation is still an integral part of most companies’ core customer relations.

Structured documentation provides a level of uniformity that makes it possible to reuse content from a single document source. Today that is important because such structured source documents can in turn be transformed into HTML, PDF, PostScript, RTF and Microsoft Word files. Such source documents can also serve to power binary help files, to provide first-level semantics for text-to-speech and VoiceXML applications and so forth - all at the same time. A consistent document language makes it possible to build transformations that import partial content into output for labels on cans or boxes, and provides a single point of authority for translation into foreign languages.

DocBook and DITA both provide XML markup for describing different facets of technical documentation. DocBook actually has its origins, ironically enough, with O’Reilly & Associates as a language used to lay out narrative technical books, based primarily upon the work of Norman Walsh and Robert Stayton. DocBook was originally an SGML specification, and was one of the first non-W3C specifications to be converted to XML, with the formal specification for DocBook then being assigned to OASIS as part of their documentation activity. It is used primarily for describing books, articles, research papers and (with some additions) slides, but its structured layout also makes it attractive for storing technical articles within small to moderate-sized organizations. Indeed, even today, many of the books that O’Reilly produces are laid out first in DocBook, as is the newsletter that you’re currently reading.

DITA, on the other hand, is the Darwin Information Typing Architecture, developed by IBM in order to create individual “topics” of content - such as those that might be used for an online documentation system. The topics in turn are organized by DITA maps that establish a hierarchical structure for the topics. Topics use a basic layout language which borrows somewhat from HTML, but extends it to include figures, examples, notes, screen displays and so forth. DITA works especially well in those cases where narrative content is limited to the domain of a single topic (such as the individual entry within a help application), although efforts are underway to try to extend this to formal business documents, with mixed success.
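
To make the topic and map relationship a little more concrete, here is a minimal sketch in Python, using only the standard library. The two topics, their file names and their wording are invented for illustration (only the wind warning echoes Scott's story), and the markup is a heavily trimmed imitation of DITA rather than a valid DITA document. It simply shows a map's nested topicref elements imposing a hierarchy on otherwise independent topics.

```python
import xml.etree.ElementTree as ET

# Two tiny topics, keyed by the (hypothetical) file names a map would reference.
TOPICS = {
    "spraying.dita": """
<topic id="spraying">
  <title>Applying the spray</title>
  <body>
    <p>Hold the can upright while spraying.</p>
    <note type="warning">Outdoors, keep the wind at your back.</note>
  </body>
</topic>""",
    "storage.dita": """
<topic id="storage">
  <title>Storing the can</title>
  <body><p>Store in a cool, dry place.</p></body>
</topic>""",
}

# A map arranges the topics into a hierarchy through nested topicref elements.
MAP = """
<map>
  <title>Insecticide user guide</title>
  <topicref href="spraying.dita">
    <topicref href="storage.dita"/>
  </topicref>
</map>"""


def outline(map_xml, topics):
    """Return (depth, topic title) pairs in the order the map lists them."""
    result = []

    def walk(node, depth):
        for ref in node.findall("topicref"):
            topic = ET.fromstring(topics[ref.get("href")].strip())
            result.append((depth, topic.findtext("title")))
            walk(ref, depth + 1)  # nested topicrefs become child entries

    walk(ET.fromstring(map_xml.strip()), 0)
    return result


toc = outline(MAP, TOPICS)
for depth, title in toc:
    print("  " * depth + title)
```

The point of the design is that the topics know nothing about each other; rearranging the guide means editing the map, not the topics.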

DITA specializations - extensions designed to deal with a specific domain or subject matter - make it possible to build DITA systems that are highly focused in one area or another. For instance, you could create specializations that are unique to aircraft specifications (such as S1000D) that can be added into technical documentation about a given airframe. DITA systems are typically intended not for intercommunication but as source documents for additional processes (mostly transformations). This means that any specializations that are maintained in core documents can generally be converted into domain-neutral content when DITA resources are being transmitted to external sources.
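
As a rough illustration of that conversion to domain-neutral content, the following Python sketch renames specialized elements back to their most general ancestor. It assumes each element carries a DITA-style class attribute listing its ancestry from general to specific (for example "- topic/p airframe/torque-limit "). The "airframe" module, the element names and the torque figure are all hypothetical, and a real DITA toolchain does considerably more than this.

```python
import xml.etree.ElementTree as ET

# A hypothetical specialized document. Each class attribute lists module/element
# pairs, most general first, so "torque-limit" declares itself a kind of "p".
SOURCE = """
<inspection class="- topic/topic airframe/inspection ">
  <title class="- topic/title ">Wing spar check</title>
  <body class="- topic/body ">
    <torque-limit class="- topic/p airframe/torque-limit ">Do not exceed 40 Nm.</torque-limit>
  </body>
</inspection>"""


def generalize(xml_text):
    """Rename every element to the most general type named in its class attribute."""
    root = ET.fromstring(xml_text.strip())
    for elem in list(root.iter()):
        # Keep only the module/element pairs, skipping the leading "-" token.
        pairs = [tok for tok in elem.get("class", "").split() if "/" in tok]
        if pairs:
            elem.tag = pairs[0].split("/", 1)[1]
    return root


neutral = generalize(SOURCE)
print(ET.tostring(neutral, encoding="unicode"))
```

The class attributes are left untouched here, so a recipient that does understand the specialization could still recover the original element names.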

As a technology, DITA seems to work best in those situations where you’re dealing with content that can be broken into distinct chunks that have to be updated by a large number of authors. For instance, in the case Scott described above, it would make sense for the warning labels for each product in the company’s line to be written as DITA and then rendered via an XSLT transformation to the appropriate format for output. In this way, if the instructions do end up changing, they change consistently across all potential media (layout of the can, written documentation, web documentation and so forth) all at once.
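
Here is a minimal sketch of that single-source idea. In practice the rendering step would most likely be an XSLT stylesheet; plain Python stands in for it here so that the example runs with nothing but the standard library. The task markup is a simplified imitation of a DITA task built from the instructions on Scott's can, and the title and both output formats are my own assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# One source of truth for the instructions (simplified DITA-like task markup).
LABEL = """
<task id="apply-spray">
  <title>Using the spray</title>
  <taskbody>
    <steps>
      <step><cmd>Remove cap from can.</cmd></step>
      <step><cmd>Shake can twice.</cmd></step>
      <step><cmd>Apply product to a distance of 24 inches.</cmd></step>
    </steps>
  </taskbody>
</task>"""


def read_steps(xml_text):
    """Extract the title and the ordered step commands from the source."""
    task = ET.fromstring(xml_text.strip())
    return task.findtext("title"), [cmd.text for cmd in task.iter("cmd")]


def to_can_label(xml_text):
    """Plain-text rendering, as might be printed on the can."""
    title, steps = read_steps(xml_text)
    lines = [title.upper()] + [f"{n}. {s}" for n, s in enumerate(steps, 1)]
    return "\n".join(lines)


def to_html(xml_text):
    """HTML rendering, as might be published on the product website."""
    title, steps = read_steps(xml_text)
    items = "".join(f"<li>{escape(s)}</li>" for s in steps)
    return f"<h1>{escape(title)}</h1><ol>{items}</ol>"


print(to_can_label(LABEL))
print(to_html(LABEL))
```

Because both renderings read from the same source, editing a single step changes the can label and the web page together, which is exactly the discrepancy that was missing a safeguard in Scott's story.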

One question that I asked a number of people at DocTrain concerned the role of Microsoft Office and OOXML (and, similarly, OpenOffice and ODF) in their technical documentation systems. The answers were revealing - while a major portion of the documentation that currently exists in most organizations is still in Word files, the ability to work with OOXML is not that big a factor for the typical attendee, because the problems inherent in Word as a documentation format have more to do with structural integrity and accessibility than they do with pipeline production. It is possible, as is the case with In.vision Research, to create components that will set up some of these constraints so that the constraining benefits of DITA can be employed there, but Word in particular was not designed to be a constraining editor out of the box, so on its own it cannot impose enough structure consistently to make the resulting XML meaningful.

Indeed, this is a strategy I would recommend highly to both Microsoft and Sun - set up a “constrained” mode that can utilize a DTD, Relax NG or XSDL schema to determine what styles are valid at any given point in a document, then only expose those styles. It would require some other modifications (cut and paste would have to become more intelligent, for instance), but the advantage to this approach is huge - it turns the word processors that most people use into XML-friendly word processors. Not to put a damper on good third-party tools such as those provided by In.vision Research or JustSystems’ XMetaL, but an overarching theme that I heard at the conference was that for business documents and documentation in particular, structured documentation beats unstructured pretty much universally - it’s easier to author, easier to repurpose, easier to search, easier to integrate into larger systems. Which format works best where is still somewhat debatable, but the documentation community has taken to XML in a huge way, and more than nearly any other sector, they are working with XML in the way that it was originally intended to be used.
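
To sketch what such a constrained mode might look like underneath, here is a small Python example. The content model is a hand-written dictionary standing in for what an editor would derive from a DTD or Relax NG schema; the element names and both function names are hypothetical, and a real schema would also constrain order and repetition, which this ignores.

```python
# Hypothetical content model: for each container, the styles allowed inside it.
# A real editor would derive this from a DTD or Relax NG schema.
CONTENT_MODEL = {
    "topic": ["title", "body"],
    "body": ["p", "note", "steps"],
    "steps": ["step"],
    "step": ["cmd"],
}


def allowed_styles(path):
    """Styles the editor should expose at the cursor.

    `path` is the chain of enclosing elements, outermost first.
    """
    if not path:
        return ["topic"]  # an empty document may only start with a topic
    return CONTENT_MODEL.get(path[-1], [])


def can_paste(path, style):
    """The 'more intelligent' paste check: reject content the schema forbids here."""
    return style in allowed_styles(path)


print(allowed_styles(["topic", "body"]))
print(can_paste(["topic", "body", "steps"], "p"))
```

The style menu then becomes a function of where the cursor sits, rather than a fixed palette that authors are merely asked to use responsibly.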

While I was at the conference I had the chance to interview one of the seminal figures of XML - Robert Glushko, now a professor at UC Berkeley’s School of Information and author of the superb Document Engineering from MIT Press. There’s a wealth of information in both the book and my interview with him, and I’ll be devoting my next newsletter to covering it in much greater detail.

Once again, I want to thank Scott Abel for inviting me to the conference, and the many people that I talked to at the conference itself for sharing their stories.

I will be in Bellevue, Washington on June 4th to speak at Boeing on “XQuery and XML Databases” at 11am, and will be doing a webcast of that presentation afterwards. For more information, contact me at .