Take My Advice: Don't Learn XML
by Michael(tm) Smith
- An Alternate Course of Study
If you're a developer interested only in the data-oriented side of XML, and if you don't care about document authoring (writing books, articles, manuals, love poems, Web pages, whatever), feel free to ignore this article.
If, on the other hand, document authoring is important to you (you're a technical writer, an HTML markup author, manager of a documentation group, an anonymous pamphleteer) and you're trying to decide whether it would be worthwhile for you to learn XML and use it for authoring documents, stick around. What you learn might save you a lot of time and spare you from some unnecessary frustration.
It's likely you've heard a lot of hype about the advantages that XML-based document authoring is supposed to provide: the capability to create "self-describing data", for example, or to "separate content from presentation" or to do "single-sourcing". Regardless of how the advantages are described, they all basically imply the same goal: the idea of creating document content in such a way that you can more easily store and reuse it for multiple purposes. One common example is to generate multiple delivery formats (HTML, PDF, online help, whatever) from the same source content.
Now, looking at the document-authoring approach you currently use (whether it's based on open source tools or proprietary applications such as Adobe FrameMaker or Microsoft Word), you realize it's got some serious limitations and doesn't come close to providing the kinds of grand "single sourcing" capabilities you've heard so much about. So you're probably hoping to learn about any authoring approach that does provide those capabilities, right? If so, the best piece of advice anybody can give you is this: Don't learn XML.
Why not learn XML? The short answer is: because "learning XML" is a complicated task--a chore much more complicated than learning, say, HTML, and one that you don't need to put yourself through if you're just looking for a better system for document authoring.
So instead of learning XML, I recommend that you complete the following alternate course of study, which I've organized into just three easy "lessons". If you're an advanced learner and the first two lessons get too tedious for you, don't throw the whole thing away in disgust. Just skip ahead to Lesson 3 (and then throw it away in disgust). And don't bother reading the footnotes at all unless you're really looking for trouble.
If you have a background in HTML, it might seem reasonable for you to assume you can learn XML with maybe a little more time and effort than it took you to learn HTML. Unfortunately, that's actually not a reasonable assumption. The fact is that learning XML is going to take you a lot more time and effort. To see what I mean, try the following simple exercise:
Exercise: Count Your X's
The exercise for this lesson is to pick out two or three introductory books on XML, look through the table of contents of each book, and write down every word or phrase you run into that starts with a capital X: XSLT, XPath, and so on. Then make another pass through and look for any non-X acronyms: DOM, RDF, etc. You'll most likely end up with a list that, at a minimum, looks something like this:
XSLT, XPath, XSL-FO, XPointer, XLink, XBase, XInclude, XML Schema, XQuery, XHTML, DTDs, DOM, SAX, RDF, CSS
Now, at this point, if you did some minimal exploration of a couple items on your list, you'd quickly discover that none of them could be said to be part of XML. Instead, they're all related technologies, each defined in hefty specifications quite separate from the specification for XML itself. (Print them out and you'd need a wheelbarrow to carry them all around.) DTDs and CSS, for example, actually predate XML. And XSLT is actually a full-fledged programming language.
So, to get back to the comparison with learning HTML: to learn HTML, all you really need to do is get familiar with a limited set of tags (<p>, <a>, and so on). To be reasonably familiar with XML, on the other hand, you need to be familiar with a complex set of related standards and technologies, like DTDs and XSLT, that are a substantial challenge to learn in their own right. So it would be a big waste of time to set out on an exploration of XML without first considering just what it is that you really want to learn.
To cut through some of the complexity and confusion surrounding XML, take a moment to remind yourself of your real reasons for wanting to learn it. Like any other standard, it's not worth a damn to you or anybody else unless you can put it to practical use. So, you probably have some specific tasks in mind, and you think (for some reason) that XML can help you accomplish those tasks.
If that's the case, hey, wouldn't it be great if you had a diagram or table or something that showed the actual purposes that XML and some of its key related technologies are supposed to serve? Well, polish up an apple for the teacher, because I just happen to have such a table handy1.
|Purpose|Standard or Technology|
|Authoring, structuring, and validating|XML, SGML, DTDs|
|Formatting and publishing|CSS, DSSSL, XSL-FO, XHTML|
|Associating|RDF, XTM (XML Topic Maps)|
|Linking|XLink, XBase, XInclude|
|Searching|XPath, XPointer, XQuery|
Hopefully, the information in this table2 will make it easier for you to place XML and its related standards and technologies in the context of purposes that either relate or don't relate to specific tasks you have in mind--tasks you do now that you're hoping XML might help you do better.
Once you've done that, the next step is to start thinking about how you can lighten your learning load. Here's an exercise that ought to help.
Exercise: Question Yourself
The exercise for this lesson is to use the information in Table 1 to ask yourself a few simple questions:
What sort of programming (real programming) or application-development tasks do I do now that XML could help me get done better?
Hint: If you don't actually do any programming tasks now, well, XML isn't going to help you do those non-programming tasks any better.
What sort of linking and searching capabilities do I currently build into applications?
Hint: If you're mainly a document author, the answer to that question is: None. Linking and searching mechanisms are not things you control at the document level; they must be built into the applications you use for authoring, storing, and viewing documents.
There's nothing wrong with learning about the XML-related standards for those mechanisms, but in the short term at least it's not going to help you get your document-authoring tasks done any quicker.
Is there a standard I can use for building rich semantic associations into documents and between documents?
Hint: Although I'm certain you've got some tasks that this question relates to, I won't say anything about it right now, because it's really a topic for another article.
The point of the question strategy is that after looking at Table 1, and thinking about XML and the rest in terms of specific purposes, you'll probably end up deciding that the standards or technologies with the most practical use to you are those for authoring, structuring, transforming, formatting, and publishing documents.
If this describes the kind of work you do, we can cut to the chase, and at the risk of repeating myself, I can once again say: You don't need to learn XML. For your needs, learning all about XML is overkill: too much information and too little of it related directly to the kind of work you do. You'd be much better off spending your time learning about a system (for lack of a better word) that does relate directly to the document-authoring work you do, if such a system existed.
A Pipe Dream
Imagine if a specific system was available that was designed to give rich structure to the very kinds of document you author, one for which there was also "off the shelf" support for transforming, formatting, and publishing your content as, say, HTML pages or PDF files. And as long as we've got this pipe dream going, why not shoot for the moon and imagine a few more things: let's say that (somehow) our system is already widely implemented (so no need for early adopter anxiety or hassles) and that we can get free support for it from other users who've been working with it for a long time.
I'm sure you've seen through that lame bit of rhetoric and guessed (if you didn't know already) that such a system does already exist. In fact, there are actually a few structured authoring systems (or markup dialects, if you want to call them that) that meet the criteria in the previous paragraph. There's even one markup dialect that meets the criteria so well that it's become the benchmark against which most other structured-authoring alternatives3 are measured. That standard is DocBook4.
O'Reilly & Associates recently announced that the publication license for DocBook: The Definitive Guide has been changed to the GNU Free Documentation License (GFDL), version 1.1. Norm Walsh, the book's coauthor, has made the sources for the book freely available on SourceForge. We hope this change to the license will allow more users to get exposure to the benefits of DocBook.
I'm sure you've already heard plenty of hype about the value of learning XML, so I don't expect you to simply make a leap of faith and take my word for it when I tell you that learning DocBook or TEI5 or some other specific markup dialect (instead of learning about XML in general) may very well be the best route to the document-authoring solution you've been looking for. I reckon that to be convinced, you probably want to know some details. So that's what I'll try to provide in the remainder of this lesson, using DocBook as an example, because it's my preferred dialect. First, I'll give some information about what DocBook is, along with a few details that show what sets it apart from some of the alternatives. (To find out more about TEI, visit the Text Encoding Initiative Web page.)
What DocBook Is
DocBook is simply a markup dialect (or vocabulary6), like HTML. And like HTML, DocBook is defined by a document type definition, or DTD. A DTD specifies certain rules, called content models, that control what elements and attributes can be used in a document and where they can be used. For example, the DocBook DTD specifies a <para> tag for marking up paragraphs, just as the HTML DTD specifies a <p> tag for marking up paragraphs.
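If "content model" sounds abstract, a toy example may help. The sketch below is not taken from the DocBook DTD; the memo vocabulary and its rules are made up for illustration, and the DTD is written inline (in what XML calls the internal subset) so the whole thing fits in one file.

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!-- Hypothetical vocabulary, for illustration only. -->
  <!-- Content model: a memo is one title followed by one or more paragraphs. -->
  <!ELEMENT memo (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!-- Content model: a paragraph mixes plain text with emphasis elements. -->
  <!ELEMENT para (#PCDATA | emphasis)*>
  <!ELEMENT emphasis (#PCDATA)>
  <!-- Attribute rule: a paragraph may carry an optional, unique id. -->
  <!ATTLIST para id ID #IMPLIED>
]>
<memo>
  <title>Content models in brief</title>
  <para id="p1">A validating parser accepts this document.</para>
  <para>It would <emphasis>reject</emphasis> a memo with no title.</para>
</memo>
```

A validating parser checks the document against those declarations. With libxml2's xmllint, for instance, something like `xmllint --valid --noout memo.xml` (the file name is an assumption) should stay silent for this file and complain if you delete the <title> element.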
And as with HTML, you don't really need any special tools to create or edit DocBook documents. Though there are special editing applications that make working with DocBook a lot easier, you could actually use Microsoft Notepad or any other simple text editor to create DocBook documents, just as you can use those same tools to create HTML documents. To show you what DocBook looks like, here's an example of footnote 1 from this article, marked up in DocBook:

<footnote>
<para>The initial basis for the taxonomy in <xref linkend="table.SpecificationsByGroup"/> was the <citetitle pubwork="chapter">Taxonomy of Standards</citetitle> appendix in Erik T. Ray&rsquo;s <citetitle>&url.learnx;</citetitle>, though the grouping/labeling in <xref linkend="table.SpecificationsByGroup"/> differs quite a bit from Ray&rsquo;s, and it adds some non-&w3c; technologies. So, if you disagree with the table, you&rsquo;ve only got me to blame. And if you <emphasis>really</emphasis> disagree with it, hey, come up with your own table, and we&rsquo;ll see if we can get a senate subcommittee to evaluate which one&rsquo;s better.</para>
</footnote>

To explain some of the tags: <para> is something like the <p> tag in HTML; <xref> doesn't really have a direct HTML equivalent, but it's a little like an <a href> anchor; and <citetitle> and <emphasis> are something like HTML's <cite> and <em>. The other bits of markup that start with an ampersand (&) are general entities, sort of like variables, that are defined elsewhere. For example, &rsquo; is part of a standard general entity set, and is defined as a right-hand single quotation mark (Unicode character U+2019). The other entities in the example, &url.learnx; and &w3c;, are custom entities I've defined for my own use. For example, &url.learnx; is defined as:

<ulink url="http://www.oreilly.com/catalog/learnxml/">Learning &xml;</ulink>
That's a hyperlink to the Web site for Erik T. Ray's Learning XML, basically the same as an HTML <a href> anchor.
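For the curious, here's roughly how custom entities like those could be declared. This is a sketch, not a copy of my actual source files: it assumes the declarations live in the document's internal subset, it assumes DocBook XML V4.1.2, and the replacement text for &w3c; and &xml; is a guess made for illustration. Only the &url.learnx; definition comes from the text above.

```xml
<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [
  <!-- Custom general entities; the two short values are assumptions. -->
  <!ENTITY w3c "W3C">
  <!ENTITY xml "XML">
  <!-- An entity's replacement text can contain markup and other entities. -->
  <!ENTITY url.learnx '<ulink url="http://www.oreilly.com/catalog/learnxml/">Learning &xml;</ulink>'>
]>
<article>
  <title>Entity example</title>
  <!-- &rsquo; needs no declaration here; it comes from the entity sets
       that the DocBook DTD pulls in. -->
  <para>Erik T. Ray&rsquo;s <citetitle>&url.learnx;</citetitle> covers more
  than the &w3c; specifications.</para>
</article>
```

The payoff is the same as with variables in a program: change the declaration once, and every &url.learnx; in the document picks up the change.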
So if DocBook is so much like HTML, why would you want to use it instead of just using HTML? Well, the short answer is that unlike HTML, which isn't very useful for marking up richly structured documents, DocBook is specifically designed for marking up structured documents of all kinds, especially documents related to computer software and hardware. Basically, DocBook is built from the ground up to give you a means for marking up content in such a way that you can store it and reuse it as you see fit, what's sometimes called "single-sourcing" or "separating content from style."7
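To make "separating content from style" a little more concrete, here's a toy XSLT stylesheet that turns a handful of DocBook elements into HTML. It is emphatically not the modular DocBook stylesheets that Norm Walsh maintains; it's a minimal sketch that assumes a simple <article> with <sect1> sections, and the choice of HTML output elements is mine.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- The article becomes an HTML page; its title is used twice. -->
  <xsl:template match="article">
    <html>
      <head><title><xsl:value-of select="title"/></title></head>
      <body>
        <h1><xsl:value-of select="title"/></h1>
        <!-- Process everything except the title already handled above. -->
        <xsl:apply-templates select="*[not(self::title)]"/>
      </body>
    </html>
  </xsl:template>

  <!-- Section titles become second-level headings. -->
  <xsl:template match="sect1/title">
    <h2><xsl:apply-templates/></h2>
  </xsl:template>

  <!-- The rough HTML equivalents mentioned in the text. -->
  <xsl:template match="para"><p><xsl:apply-templates/></p></xsl:template>
  <xsl:template match="emphasis"><em><xsl:apply-templates/></em></xsl:template>
  <xsl:template match="citetitle"><cite><xsl:apply-templates/></cite></xsl:template>
</xsl:stylesheet>
```

With an XSLT 1.0 processor such as xsltproc, a command along the lines of `xsltproc toy.xsl article.xml` (both file names are assumptions) should write the HTML to standard output. The DocBook source never changes; a different stylesheet would give you a different output format from the same content.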
That, in a nutshell, is what DocBook is: a sort of super-powered HTML. If that sounds like something you need, and you're anxious to get started with it right away, skip ahead to the First steps section. It's beyond the scope of this article to provide a full tutorial on installing a DocBook-authoring system and getting started with using it, but that section does provide some links to documentation and tutorials that'll help you get started.
Besides, before you decide to jump into DocBook, you really should at least spend a little time finding out what some of the alternatives are (and what sets DocBook apart from them) and then thinking about what the payoff of learning DocBook might be, as compared to learning about XML in general. So bear with me for a minute while I cover some of that ground.
The Standard and the Alternatives
DocBook has basically become the de facto standard for structured authoring of technical documents, and for structured authoring of many other types of documents as well. But some people question the value of learning a standard like DocBook at all8. For example, consider the case of documentation written solely for proprietary, non-open source applications. You might not expect writers and managers who develop such documentation to care too much about the value of learning and using an open structured-authoring standard. After all, they probably don't need to exchange or interchange their documentation with anyone else outside of their own company. And they can get a budget to bring in a professional XML/SGML consultant to help them produce a DTD that will very precisely meet their needs9.
So, in cases like that, the alternative to DocBook is to develop a custom DTD. In fact, for technical documentation, the only real alternatives to DocBook are all the various in-house DTDs that different documentation shops have each designed for their own use. That alternative--coming up with a company-specific DTD--may not sound like such a bad idea, and may not be in many cases. But keep in mind that there are other costs and concerns to consider when deciding on a custom DTD instead of DocBook, like the cost of training new writers10.
For example, a company that builds a document-authoring system around a custom in-house DTD has zero chance of hiring new writers who will already be familiar with its DTD (because that company is of course the only one using it). So it will need to spend money to train new writers in the specifics of that DTD. On the other hand, the company has a very good chance of hiring writers who have already learned DocBook. There are already thousands of DocBook-savvy writers to choose from, and the number is increasing every day. So before you decide to go with an alternative DTD, take some time to consider what sets DocBook apart from the rest.
What Sets DocBook Apart
Some of the features that have earned DocBook an especially large and loyal user base include the fact that it is:
accompanied by sophisticated off-the-shelf support for transforming, formatting, and publishing content as HTML pages and PDF files and in many other formats, including TeX, RTF, FrameMaker MIF, JavaHelp, Microsoft HTML Help, Unix man pages, and TeXinfo
thoroughly documented (in Norman Walsh and Leonard Muellner's DocBook: The Definitive Guide)
widely implemented and extensively tested in production systems around the world, by commercial organizations such as Sun Microsystems, Hewlett-Packard, Novell, SCO, and Red Hat, and by open source groups such as the KDE, GNOME, FreeBSD, Debian, and Linux documentation projects, as well as the Darwin Documentation Project (Darwin is the open source core of Apple's Mac OS X).
freely supported by a network of thousands of users, many of whom have been working with it for years
firmly rooted in open source ideals: DocBook is a truly open standard in that you don't have to pay anyone in order to use it, or to implement support for it in an application. And you don't need to ask anyone for permission to change it for your own use, or even to change it and redistribute your changes to others
carefully designed from the ground up to be highly customizable so you can tailor it (or have it tailored) to your specific needs
commonly supported in many document-authoring applications; several commercial editing- and publishing-tool vendors provide built-in support for DocBook in their applications, and several open source packages are available that provide integrated DocBook support, including the Debian task-sgml package, the docbook-tools package (RPMs), and Paul Kinnucan's XAE.
completely XML/SGML-based and 100 percent XML/SGML-compliant; the good things you've heard or read about XML and SGML in general can be said even more emphatically and specifically about DocBook
I hope that with this look at DocBook, I've managed to make it clear that I think there's a big difference between learning XML and learning a specific markup dialect, and that I think you're much better off starting out your exploration into XML-based structured authoring by learning the specifics of a markup dialect like DocBook than you are trying to study XML in general. But despite the title and tone of this article, I'm not for a second suggesting that you shouldn't eventually learn more about XML, XSLT, and the rest. In fact, you will need to learn more as you use DocBook more. But you can learn what you need to learn as you go12.
As concisely as I can possibly put it, the best reason for starting out with DocBook is that it pays off right away. That is, as a document author, you get immediate benefits from learning DocBook, because you can directly apply what you learn, right away, to the kind of work you do, especially if that work happens to be writing computer software and hardware documentation.
Learn by Doing
I think it's useful to compare the difference between learning XML and learning DocBook to the difference between learning SGML and learning HTML13. Think about the way you learned HTML: You didn't drive to a bookstore to pick up a book on SGML. In fact, it's likely that you didn't know or care that HTML was based on SGML. So I doubt you felt compelled to read the ISO specification for SGML or even to visit the W3C's Web site to read the official HTML specification. Instead, you probably started out by copying other people's Web pages, ones that were similar to what you wanted to create. You looked at the general structure of other pages, said "OK, now I see how to do that", adjusted the structure as needed, then dropped in your own content. And you ended up with your first HTML documents.
I'm suggesting that you learn XML in a similar way: by actually using it, that is, by using DocBook now, right away, to do structured authoring of your documents.
This is not a tutorial on using DocBook, but I will give you some quick details on what first steps to take, and then point you to some tutorials and other resources. The first steps for learning DocBook are simple: pick up the documentation for it, get an XML or SGML editing application14, and with the documentation close at hand, start creating your first documents15.
Documentation and Tutorials
In addition to an exhaustive reference section with dozens of examples, DocBook: The Definitive Guide provides basic information for helping you get started with DocBook, including a quick introduction to XML/SGML concepts and short "how to" information about:
putting together a DocBook article, chapter, book, or reference page
validating the documents you create and interpreting any error messages you might run into
publishing your documents (using DSSSL)
customizing the DocBook DTD to tailor it to your specific needs
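To give you a feel for the first item on that list, here's a minimal sketch of a DocBook article. It assumes the XML version of the DocBook DTD, V4.1.2; the titles, id value, and text are placeholders, and it isn't an example taken from the book.

```xml
<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<article>
  <title>My First DocBook Article</title>

  <!-- Each sect1 needs a title, followed by paragraphs or other content. -->
  <sect1 id="intro">
    <title>Introduction</title>
    <para>This paragraph lives inside a first-level section.</para>
    <para>Inline markup such as <emphasis>emphasis</emphasis> goes
    inside paragraphs.</para>
  </sect1>
</article>
```

If you have a validating parser handy, something like `xmllint --valid --noout article.xml` (file name assumed) is one way to check a file like this against the DTD, provided the parser can fetch or locate the DTD.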
Another resource you might find very useful is Sample Chapter 2, Markup and Core Concepts from Erik T. Ray's Learning XML book, which, in addition to covering most of the XML basics, includes an annotated example of a complete DocBook document. Chapter 5 of Learning XML, "Document Models: A Higher Level of Control" on DTDs, is also worth reading. It includes an annotated example of a DTD that is a subset of DocBook.
As far as tutorials go, here's a final exercise that ought to get you going in the right direction.
Exercise: Initial Explorations
The exercise for this lesson is to grab a cup of coffee, fire up a Web browser, and check out a few online DocBook resources:
The best set of DocBook links is Mark Johnson's DocBookmarks page. Peruse a couple of the Novice Docs tutorial links near the top of the page. Also make sure to check out a few other links further down, like Nik Clayton's FreeBSD documentation primer, Lauri Watts' KDE DocBook handbook, Dave Mason et al.'s GNOME documentation handbook, and Markus Hoenicka's epic SGML for NT tutorial.
Take a look at the DocBook site at OASIS (Organization for the Advancement of Structured Information Standards), the official home of the DocBook DTD. Make sure to visit the OASIS mailing list page and subscribe to the docbook and docbook-apps mailing lists.
Another required stop is the DocBook site maintained by Norm Walsh, who maintains the modular DocBook stylesheets, as well as DocBook customizations for use in creating Web sites and slide presentations. (Walsh has many DocBook roles, including being the principal author of DocBook: The Definitive Guide and the chair of the DocBook Technical Committee at OASIS.) Although Walsh has moved some of this content to the DocBook Open Repository (see the next item), you should still peruse his site. Among the things you'll find are a set of slides for an Introducing DocBook presentation, as well as an article, The Design of the DocBook XSL Stylesheets.
Finally, check out the site for the DocBook Open Repository project16, hosted by SourceForge. At that site, you can browse the CVS repository that holds the development versions of the stylesheets, schemas, and DTDs. You'll also find a page you can use to submit or review bug reports, stylesheet feature requests, and requests for enhancements (RFEs) for the DocBook DTD and schemas.
I wrote this article not only to try to sell other document authors on what I see as the value of learning DocBook, but also to elicit some further discussion. So, if you have comments on the article, please feel free to post them here or on the xml-doc mailing list, where we should really be able to get a good discussion going on the topic.
Michael Smith: Along with being the creator of the Emily Dickinson Random Epigram Machine, Michael moderates the xml-doc mailing list and he is a member of the DocBook Technical Committee and the DocBook Open Repository development team. In August 2001, he'll be permanently relocating to Tokyo to live and work. You can email him at firstname.lastname@example.org.
O'Reilly & Associates published DocBook: The Definitive Guide (October 1999).
An online version of the book is available for free.