 Published on O'Reilly (http://www.oreilly.com/)
 http://xml.oreilly.com/news/learningxml_0201.html


An Interview with Learning XML Author, Erik Ray

by Bruce Stewart
02/27/2001

Extensible Markup Language (XML) is a toolkit for creating, shaping, and using other markup languages. Hailed as the savior that will help all our applications speak to one another, XML seems to mean different things to different people. Erik Ray has written a new book called Learning XML, which explains the theory and philosophy behind this powerful tool, and helps readers wade through the acronym soup. We asked Erik what all the fuss is about.

Stewart: Let's start at the beginning: what is XML?

Ray: OK, my answer is going to be a little long, because I need to explain the why of XML as well as the what.

Most people know that XML stands for the "Extensible Markup Language," and a few have an idea that it's some kind of improvement over the reigning Web language, HTML. Actually, XML isn't a markup language, and it won't replace HTML. It's a toolkit for making markup languages that fit your data better and are more flexible for your needs. It's kind of like that commercial on TV: "We don't make a lot of the stuff you buy; we make a lot of the stuff you buy better." HTML, instead of going away, will become XHTML, basically XML-compliant HTML.

XML comes to the Internet as a set of minimal standards for improving data storage and conveyance. I like to call it the "Tupperware of the Internet," because it comes in lots of sizes and shapes that fit your data perfectly. What do I mean by that? Well, a lot of people who put things up on the Web find that they have to force their documents to fit into HTML, and they end up worrying about how it will look at the same time they're trying to grind out content.

For example, people use the <P> tag to create extra space as well as to delimit paragraphs. Or they use a <TABLE> to create a two-column page. Well, that's what we call "presentational" markup, and it's an anemic way to package your document. It's better to mark up something based on its function than its appearance, and map the style to elements later. At O'Reilly, we write technical books, so we use an XML-derived language called DocBook, which was designed specifically for books. Now we can divide a document into application-specific elements like <CHAPTER>s, <PROGRAMLISTING>s and <PARAMETER>s. If we want to translate that into HTML, we can do that easily by mapping a DocBook element to a particular HTML element. Or we could turn it into print, Braille, CD-ROM, whatever.
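To make the element-mapping idea concrete, here is a minimal Python sketch using the standard library's ElementTree module. The mapping table, the lowercase element names, and the sample fragment are illustrative assumptions, not O'Reilly's production toolchain, and a real conversion would handle far more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from DocBook-style element names to HTML element names.
TAG_MAP = {
    "chapter": "div",
    "title": "h1",
    "para": "p",
    "programlisting": "pre",
    "parameter": "code",
}

def to_html(elem):
    """Recursively copy an element tree, renaming elements via TAG_MAP."""
    # Elements without a mapping fall back to a generic <span>.
    html = ET.Element(TAG_MAP.get(elem.tag, "span"))
    html.text = elem.text
    html.tail = elem.tail
    for child in elem:
        html.append(to_html(child))
    return html

docbook = (
    "<chapter><title>Options</title>"
    "<para>Pass <parameter>-v</parameter> for verbose output.</para>"
    "</chapter>"
)

html_tree = to_html(ET.fromstring(docbook))
html_text = ET.tostring(html_tree, encoding="unicode")
print(html_text)
```

Because the source markup says what each piece is rather than how it looks, swapping in a different table would be enough to target a different output vocabulary.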


Any language derived from XML is guaranteed to satisfy some minimal criteria, which we call being "well-formed." This is a very important benefit for your data. It means you can use any off-the-shelf tool or code library to work with your data. You can more easily do consistency checks, and convert or transform your data into other forms. So, since you aren't tied down to any particular software program, and because XML is an international standard, your data is in a much better position to be reused, repurposed, reworked, archived, and enjoyed by any user.
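As a small illustration of what "off-the-shelf" means in practice, the following Python sketch uses the standard library's XML parser to test whether a piece of text is well-formed. The function name and the sample documents are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<note><to>Bruce</to><from>Erik</from></note>"
bad = "<note><to>Bruce</from></note>"   # mismatched closing tag

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser knows nothing about the <note> vocabulary; it only checks the rules every XML document shares, which is why the same tool works on any XML-derived language.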

Stewart: Where is XML currently in the standards process?

Ray: XML has been a recommended standard since February 1998. It's only a few years old, and it's already spawned dozens of offshoot technologies for describing everything from mathematical equations to conceptual topic maps. That's a testament to the excellent work done by the World Wide Web Consortium (W3C), which hammers out specifications and provides them to the public free of charge. XML is always going to be a work in progress, with new functionality being added all the time; but it's here now and people are using it.

Stewart: What are XML schemas?

Ray: XML has something called a Document Type Definition (DTD), which is a set of rules for a document to follow to be classified as belonging to a particular language. If a document passes all the rules, we say that it's a valid instance of X, where X is the language (DocBook, XHTML, what have you). This lets you do quality control, ensuring that a form is filled out completely and that all the elements are where they are supposed to be.

DTDs have some weaknesses, however. You can require that an element contains some text, for example, but you have no way to restrict the kind of text someone can put there. That's a problem when you want to ensure that there's a date in the <date> field, and not a street address or a smiley face. So, that's where schemas come in.

A schema, like a DTD, enforces structure in a document, such as which elements can go where. It even goes further and checks the type of data in each element. You can specify that a field contains a floating-point number, or a string of alphabetic characters, or even a date format. This is ideal for documents with small fields that have particular formats and needs, such as a query to a database, or a Web form. Also cool is the fact that a schema is itself an XML document that you can edit with an XML authoring tool.
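The kind of datatype checking described here can be imitated in a few lines of Python. This is a hand-rolled sketch, not an XML Schema validator: the FIELD_TYPES table, the element names, and the sample orders are all hypothetical, and a real schema would express the same constraints declaratively in an XML document.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical rules: element name -> function that raises ValueError on bad data.
FIELD_TYPES = {
    "date": date.fromisoformat,   # expects YYYY-MM-DD
    "price": float,               # expects a floating-point number
}

def type_errors(xml_text):
    """Return the names of elements whose text fails its type check."""
    errors = []
    for elem in ET.fromstring(xml_text).iter():
        check = FIELD_TYPES.get(elem.tag)
        if check is None:
            continue              # no type rule for this element
        try:
            check((elem.text or "").strip())
        except ValueError:
            errors.append(elem.tag)
    return errors

ok_order = "<order><date>2001-02-27</date><price>39.95</price></order>"
bad_order = "<order><date>12 Main Street :-)</date><price>39.95</price></order>"

print(type_errors(ok_order))
print(type_errors(bad_order))
```

Both sample documents have the same structure, so a structural check alone would accept them; only the type check catches the street address in the <date> field.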

Schemas won't replace DTDs, which are still a vital part of XML. Rather, they add another capability to a growing pantheon of XML technologies.

Stewart: For a while it seemed that XML was being positioned to replace HTML. Do you think that was the case, and is there any chance of that ever happening?

Ray: No, not a chance. The difference, as I mentioned before, is that XML is a set of rules that markup languages have to follow, sort of like a seal of approval. HTML serves a very useful purpose as a generic Web page description language. XML merely tightens some bolts in HTML, preventing sloppy markup from gumming up program gears, and making available a huge array of XML tools that weren't possible before. If anything, XML will enhance and extend HTML.

It will also offer alternatives to HTML. As many designers and authors have learned over the years, HTML can be frustratingly limited. It was originally designed to put up simple research papers in a physics lab--not support multiple column spreads with overlapping images and industry-specific jargon. For such advanced applications, XML will provide a bridge from HTML to other possibilities.

Stewart: Can Web authors expect HTML to be replaced with the XML-compliant XHTML anytime soon?

Ray: It's a gradual process, driven by incentives such as being able to use XML tools on your Web files. Kip Hampton wrote an article at XML.com about new Perl libraries like XML::XPath that give you all kinds of power over your Web files once they're converted to XHTML. You can make a site mapper, improved search tools, and stuff like that. There are many off-the-shelf products for XML, and more coming all the time, such as authoring environments, quality assurance tools, and digital asset management systems. Getting your files to be XML compliant will benefit you in the long run, so you should start thinking about it, even if there's no short-term requirement to do so. The World Wide Web Consortium offers a tool called HTML Tidy, which makes the task simple.
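As a rough idea of what a site mapper might start from, this Python sketch pulls every link out of an XHTML page with the standard library's XML parser. It is not the Perl approach from the XML.com article; the sample page and the function name are assumptions, and a real page with a DOCTYPE and named entities would need extra handling.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so tag names must be qualified.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

page = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Home</title></head>
  <body>
    <p><a href="/books/">Books</a> and <a href="/articles/">articles</a>.</p>
    <p><a name="top">No link here</a></p>
  </body>
</html>"""

def page_links(xhtml_text):
    """Return (href, link text) for every <a> element that has an href."""
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), a.text)
            for a in root.iter(XHTML_NS + "a")
            if a.get("href") is not None]

links = page_links(page)
print(links)
```

This only works because the page is well-formed; the same parser would reject typical sloppy HTML outright, which is the incentive for converting.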

Stewart: You work in the Tools department at O'Reilly & Associates. What can you tell us about O'Reilly's use of XML?

Ray: O'Reilly was one of the pioneers of using structured markup in the publishing industry. Before most people had heard of SGML, a precursor to XML, we were already producing books in it. O'Reilly employees, including Dale Dougherty, O'Reilly Network publisher, and former O'Reilly employee Norm Walsh, helped to develop the DocBook language, first as an SGML application and then later adapted to XML. Today, it's one of the most well-known and oft-used XML applications. Big names such as Sun Microsystems are using DocBook to write all of their documents.

Using SGML and XML hasn't always been easy. Good off-the-shelf tools have historically been too expensive and feature-poor. We've usually opted for the build-it-yourself option, where we write our own programs to process and print documents. We created a tool called "gmat," which translates DocBook SGML and XML into troff, a presentational formatting language, which can then be transformed into PostScript using existing software. I think that FreeBSD includes gmat in their distribution now.

So, why did we go through all the trouble to adopt XML in our workflow, when it would have been much easier and cheaper to stick with standard tools like FrameMaker and QuarkXPress? Unlike most publishers, we take great pains to make our books last a long time, to be of sufficient quality that people will still want them years from now. The long-term view favors XML. When your data is in good XML, it's poised to be reused in many different ways. Not only can we make printed books, but we can put them on CD-ROMs or publish them to the Web. We're doing that now with a project called Safari, O'Reilly's online, subscription-based venture. By mid-year, we're going to try to put all of our hundreds of books online for people to subscribe to.

Fortunately, XML is easier to write software for than SGML, so new products are appearing at a dizzying rate. We're watching developments with FOP and other XML-to-PDF tools, so that someday we can start using a stylesheet-based program for printing books. XSLT, a new technology for transforming XML into different forms, is a favorite new tool in our arsenal. We use it to format our books in HTML for Web and CD-ROM delivery, as well as for simple filtering to fix errors.

Stewart: What is FOP?

Ray: FOP stands for (XSL) Formatting Objects to Print. It's an XML-to-print solution that uses XSLT and XSL-FO (XSL for Formatting Objects). Since XSL-FO is still a work in progress, FOP is also still in development. For more information, check out the FOP page at the Apache XML project headquarters.

Stewart: Does XSLT bring the type of stylesheets Web authors are familiar with to XML?

Ray: Let me back up a little bit and explain where XSLT came from.

Most Web authors have worked with Cascading Style Sheets (CSS), which is a simple but effective means for applying style to your pages. You can create borders, margins, and other colorful effects by mapping elements to styles with presentational rules. A new kind of stylesheet being developed by the World Wide Web Consortium (W3C) is called the Extensible Stylesheet Language (XSL). It's going to be much more powerful and descriptive than CSS.

XSL is divided into two parts: XSLT (XSL for Transformations) and XSL-FO (XSL for Formatting Objects). XSLT is not a language for applying styles to elements. Rather, it's a language that describes how to transform one form of XML into another. It was designed originally to work with XSL-FO to transform any XML document into a formatted end product. XSLT would convert the XML document into a temporary form called a formatting object tree (basically XSL-FO), which would then be translated into the final, formatted product.

XSL-FO is still being hammered out, but XSLT has been finished for a while now. The cool thing about XSLT is that it can be used for many more purposes than just getting a document into XSL-FO form. It is a general-purpose conversion tool, turning your document into any text form you want. For example, you can convert a DocBook document into HTML to display on the Web. That's how we create our CD-ROM products, and it is also the secret behind Safari, our online book library.

So, to make a long story short: CSS will still be around for a long time because it does its job well; it applies simple styles to simple documents. You can even use it with XML, as I explain in my book, Learning XML. But for more complicated formatting requirements, you'll want to use something like XSL.

Stewart: Tell us a little more about DocBook. What makes it a good tool for producing technical books?

Ray: DocBook is an application of XML. It's a markup language that conforms to XML's strict rules for labeling and structuring data. Its specific purpose is to contain technical books, such as the kinds we produce for computer technology. DocBook contains many specific tags for things like <filename>, <function>, and <command>. By using these tags to mark up certain terms, we can enable advanced capabilities like searching by element. For example, you could search for a <command> that contains the word 'print' (something you'd type at a command line) and distinguish it from something similarly named but from the wrong context, such as a <function> called print(); (something you'd see in a program). This is just another example of how XML lets you mark up things without ambiguity.
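Searching by element can be sketched in a few lines of Python with the standard library's ElementTree module. The sample paragraph and the find_in_element helper are hypothetical, made up to mirror the <command> versus <function> example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<para>Run <command>print report.txt</command> at the prompt, "
    "or call <function>print()</function> from your program. "
    "The <filename>print.log</filename> file records both.</para>"
)

def find_in_element(root, tag, word):
    """Return the text of every <tag> element whose text contains word."""
    return [e.text for e in root.iter(tag) if e.text and word in e.text]

commands = find_in_element(doc, "command", "print")
functions = find_in_element(doc, "function", "print")

print(commands)
print(functions)
```

A plain text search for "print" would hit all three terms in the sample; restricting the search to one element name returns only the hits from the intended context.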


For an introduction to SGML, XML, and the DocBook DTD, plus complete reference information for DocBook, see O'Reilly's DocBook: The Definitive Guide.


Stewart: Do you have anything to do with O'Reilly Network's Meerkat, which is XML-based? If so, could you discuss what makes it such a cool tool?

Ray: I personally am not involved with Meerkat development. Rael Dornfest, another O'Reilly employee, is the techno maven who set that up. Meerkat uses an application of XML called RSS, a resource description language. The way it works is that a content aggregator, like a portal or news service, maintains a list of content providers' Web sites. Each content site has an RSS document that lists the resources available, when they were last changed, etc. The aggregator uses that information to decide what to grab and to announce availability to the public. It's like a syndication system, where the aggregator pulls in information volunteered by content generators.
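The aggregator's side of that exchange might look something like the following Python sketch. It is not Meerkat's code: the feed is a hypothetical document in a simple RSS 0.91-style layout, embedded as a string so the example runs without a network, and the new_items helper is an assumption made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real aggregator would fetch this from each site.
FEED = """<rss version="0.91">
  <channel>
    <title>Example Site</title>
    <link>http://example.com/</link>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def new_items(feed_text, seen_links):
    """Return (title, link) pairs for items whose link is not in seen_links."""
    channel = ET.fromstring(feed_text).find("channel")
    fresh = []
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link not in seen_links:
            fresh.append((item.findtext("title"), link))
            seen_links.add(link)   # remember it so it is announced only once
    return fresh

seen = {"http://example.com/1"}    # links announced on an earlier visit
announcements = new_items(FEED, seen)
print(announcements)
```

The aggregator stores only links and titles; the content itself stays on the provider's site.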

The coolness of this tool is that information is maintained exclusively by the people who know it best: the authors, who retain full control of their own sites. The aggregator only maintains links to the sites, and announces any new stuff that appears, providing a convenient, one-stop place to shop for information. It's the perfect community service, where everyone benefits.

Stewart: Why is XML an important development in Web technology?

Ray: The Web is all about delivering content between millions of computers around the world for entertainment, business, news, banking, reference, and research. XML is all about describing and packaging information for most any application. They're two great tastes that taste great together.

XML eliminates some of the bottlenecks that prevent information exchange from being a painless, instant process. It makes it easier to convert from one form to another, to filter and summarize documents, and to search for needles in haystacks. It goes much further than traditional HTML with the ability to format documents online, in print, and in other media like Braille, audio, whatever.

Designers will love the greater control they have over appearance. Programmers will love the ease with which they can manipulate and process content. Authors will love the extent to which they can personalize and configure the markup. In short, what's not to love?

Stewart: In what ways has XML lived up to its potential to share data among applications and among businesses?

Ray: Things are still in the heady development stage around the world, but some early success stories can be found. Some examples I've seen: WML is an XML language to deliver textual information to cell phones; syndicators use XML to transfer intellectual property from content producers to content consumers using a system called ICE; marketing information is routinely swapped using XML applications; Open eBook is being pushed by Microsoft and various publishers as the premier container for books in digital form. In the next few years, we will be seeing more of XML as it gets snatched up to fill voids in communications chains.

Stewart: Who should read Learning XML?

Ray: My book is aimed at the person who has at least a small amount of knowledge about the Web and HTML and is curious about this new technology but doesn't necessarily need to get into the advanced topics, such as how to write programs that process XML. It gives an overview of what XML can do, its philosophy, and what future trends will be. Those who do want to get more technical can use this book as a springboard to our books for developers, books like Building Oracle XML Applications and Java and XML.

Stewart: What does Learning XML have to offer to the typical Web worker who isn't a developer?

Ray: XML is as much a philosophy as it is a technology, so I try to spend equal time on both facets. I lay the groundwork by describing how markup can and should be used to package a document. Then I launch into several areas important for XML: linking, using stylesheets, transforming XML, DTDs, and internationalization. That's a basic tour of XML's envelope, and that should put any Web worker in a good position for following the explosion of new technologies for the Web.

Stewart: How well do you think XML has been accepted by the Internet community? What other communities are taking up the XML cause?

Ray: It has been accepted quite well by various technical fields. By that I mean lawyers, publishers, the medical community, banking institutions, and other specialized industries have latched onto XML as a way to organize and share data in a seamless way. Anywhere there's information that has to be organized, packaged, and repackaged, there's a place for XML. Developers see that and are quickly jumping on the bandwagon.

For the average netizen, I don't think the shock wave has really hit yet. We're living in an age of buzzwords, where you can say "have you heard of XYZ?" and people will roll their eyes or yawn. Those who know XML know it's going to revolutionize digital data and communication; but for most users it hasn't hit the radar yet.

I think XML is going to continue to be mostly used in vertical applications for a few years before it hits prime time. HTML is doing a good job for now, getting home pages up and disseminating information in a reasonably pretty way. Eventually, the software will catch up to the hype and we'll start to see some really funky sites on the Web.

Some pundits are actually predicting a demise of the typical desktop browser. I can see their point. Instead of limiting yourself to viewing the world through one little window, why not bring the whole world into your home? XML is flexible enough to package almost any information, and serve it to you in all kinds of ways. So you can expect it to show up in the great "integration" of TV and the Web, networked gaming, and downloadable books, for example. It's hard to predict how it will be used at the consumer level, but I bet it will be interesting.

Stewart: How does Microsoft's .NET platform incorporate XML?

Ray: .NET is a new paradigm for Microsoft. Instead of buying software on a CD-ROM and installing it onto your computer's hard drive, you will "subscribe" to a set of software services and use applications loaded over the Internet as you need them. Whether the business will fly, I don't know, but it's an interesting model. XML can be a strong component of any networked system where applications need to share and transmit data.




Microsoft has made it clear to the XML world that they want to be a big player there. They are heavily involved in the standards process, and their enthusiasm for implementing technologies quickly is impressive. They jumped to the lead in standards compliance with Internet Explorer 5.0. It supports XML validation with DTDs and transformation with XSLT stylesheets. If that's an indication of what's to come with .NET, I would say that XML has a bright future in Redmond.

Stewart: Does Microsoft's embracing of XML surprise you? They don't exactly have a reputation for working with open standards.

Ray: With Linux and other open software projects gaining momentum, Microsoft has realized that they need to play nicely with the community at large if they're going to continue to prosper. XML is a way to show that Microsoft can use their marketing and technological muscle to make money without having to appear like they're taking over. It's refreshing to see that the open standards process can thrive and not be at odds with moneymaking enterprises.

Stewart: What do you think the future holds for XML?

Ray: If HTML was the booster for the Web, XML will be the rocket's second stage, taking the world of communication to new heights of ease and flexibility. With its support for languages and Unicode, it will cross boundaries between countries. Its configurability will continue to spawn languages for every purpose. The ease with which software can be written for XML will usher in a new breed of software, one in which applications can share data like never before and where proprietary formats will be a thing of the past. Will it reach a ceiling like HTML has, where we find that it can't do everything we want it to do? I don't know, but XML is a much more slippery and resourceful animal than we have seen in the past. I don't think it will become extinct for a long, long time.


Erik Ray is an XML software specialist and developer at O'Reilly & Associates. He lives with his wife Jeannine and five parrots in Saugus, Massachusetts. Besides writing, he practices kendo, plays go, binds books, and stalks bookstores for rare and antiquarian books.

Bruce Stewart is a freelance technology writer, focusing on Web development and wireless issues. He coauthors the Industry Standard's Wireless News, and is a regular contributor to ZDNet and Web Tools.


O'Reilly & Associates recently released (January 2001) Learning XML.



Copyright © 2007 O'Reilly Media, Inc.