One of the most impressive features of XML (eXtensible Markup Language) is how little you need to know to get started. This appendix gives you some of the key pieces of information you’ll need. The references at the end of Chapter 6, Working with Configuration Files, point you to many excellent resources that you can turn to for more information.
This excerpt is from Automating System Administration with Perl, Second Edition.
Thanks to the ubiquity of XML’s older and stodgier cousin, HTML, almost everyone is familiar with the notion of a markup language. Like HTML, XML consists of plain text interspersed with little bits of special descriptive or instructive text. HTML has a rigid definition for which bits of markup text, called tags, are allowed, while XML allows you to make up your own.
Consequently, XML provides a range of expression far beyond that of HTML. One example of this range of expression is found in Chapter 6, Working with Configuration Files, but here’s another simple example that you should find easy to read even if you don’t have any prior XML experience:
<hosts>
  <machine>
    <name>quiddish</name>
    <department>Software Sorcery</department>
    <room>314WVH</room>
    <owner>Horry Patter</owner>
    <ipaddress>192.168.1.13</ipaddress>
  </machine>
  <machine>
    <name>dibby</name>
    <department>Hardware Hackery</department>
    <room>310WVH</room>
    <owner>Harminone Grenger</owner>
    <ipaddress>192.168.1.15</ipaddress>
  </machine>
</hosts>
Despite XML’s flexibility, it is pickier in places than HTML. There are syntax and grammar rules that your data must follow. These rules are set down rather tersely in the XML specification found at http://www.w3.org/TR/REC-xml/. Rather than poring through the official spec, I recommend you seek out one of the annotated versions, such as Tim Bray’s version (available at http://www.xml.com) or Robert Ducharme’s book XML: The Annotated Specification (Prentice Hall). The former is online and free; the latter has many good examples of actual XML code.
Here are two of the XML rules that tend to trip up people who know HTML:
If you begin something, you must end it. In the preceding example, we started a machine listing with <machine> and finished it with </machine>. Leaving off the ending tag would not have been acceptable XML.
In HTML, tags like <img src="picture.jpg"> are legally allowed to stand by themselves. Not so in XML. This would have to be written as either:
<img src="picture.jpg"> </img>
or:
<img src="picture.jpg" />
The extra slash at the end of this last tag lets the XML parser know that this single tag serves as both a start and an end tag. A pair of start and end tags and the data they contain are together called an element.
Start tags and end tags must mirror one another exactly. Changing the case is not allowed, because XML is case-sensitive. If your start tag is <MaChINe>, your end tag must be </MaChINe> and cannot be </MACHine> or any other case combination. HTML is much more forgiving in this regard.
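The two rules above are easy to test mechanically. Here is a minimal sketch, assuming Python and its standard `xml.etree.ElementTree` module (neither is used elsewhere in this appendix); the helper name `is_well_formed` and the sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each sample exercises one of the rules discussed above.
samples = {
    "matched tags": "<machine><name>quiddish</name></machine>",
    "missing end tag": "<machine><name>quiddish</machine>",
    "empty-element tag": '<img src="picture.jpg" />',
    "unclosed img": '<p><img src="picture.jpg"></p>',
    "case mismatch": "<MaChINe></MACHine>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the matched tags and the empty-element tag should be accepted; the parser rejects the other three samples.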
These are two of the general rules in the XML specification. But sometimes you want to define your own additional rules for an XML parser to enforce (where by “enforce” I mean “complain vociferously” or “stop parsing” while reading the XML data if a violation is encountered). If we use our previous machine database XML snippet as an example, one additional rule we might want to enforce is “all <machine> entries must contain a <name> and an <ipaddress> element.” You may also wish to restrict the contents of an element to a set of specific values.
How these rules get defined is less straightforward than the other material we’ll cover, because there are several complementary and competitive definition “languages” afloat at the moment.
The current XML specification uses a Document Type Definition (DTD), the SGML standby. Here’s an example piece of XML code from the XML specification that has its definition code at the beginning of the document itself:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>
The first line of this example specifies the version of XML in use and the character encoding (UTF-8, an encoding of Unicode) for the document. The next three lines define the types of data in this document. This is followed by the actual document content (the <greeting> element) in the final line of the example.
If we wanted to define how the <hosts> XML code at the beginning of this appendix should be validated, we could place something like this at the beginning of the file:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE hosts [
  <!ELEMENT hosts (machine)*>
  <!ELEMENT machine (name,department,room,owner,ipaddress)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT department (#PCDATA)>
  <!ELEMENT room (#PCDATA)>
  <!ELEMENT owner (#PCDATA)>
  <!ELEMENT ipaddress (#PCDATA)>
]>
This definition says that a hosts element contains zero or more machine elements and that each machine element consists of name, department, room, owner, and ipaddress elements (in this specific order). Each of those elements is described as being #PCDATA (see the section called “Leftovers” for details).
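To make the effect of such a definition concrete, here is a rough sketch that checks the same content model by hand. It assumes Python's standard `xml.etree.ElementTree` module, which parses XML but does not validate against a DTD, so the function `check_hosts` and the list `EXPECTED_CHILDREN` are hypothetical stand-ins for what a validating parser would do for you. The sketch checks only element names and their order.

```python
import xml.etree.ElementTree as ET

# Child order copied from the <!ELEMENT machine (...)> declaration above.
EXPECTED_CHILDREN = ["name", "department", "room", "owner", "ipaddress"]

def check_hosts(text):
    """Return a list of problems; an empty list means the rules are met."""
    root = ET.fromstring(text)  # raises ParseError if not well-formed
    if root.tag != "hosts":
        return [f"root element is <{root.tag}>, expected <hosts>"]
    problems = []
    for i, child in enumerate(root, start=1):
        if child.tag != "machine":
            problems.append(f"child {i} is <{child.tag}>, expected <machine>")
            continue
        found = [c.tag for c in child]
        if found != EXPECTED_CHILDREN:
            problems.append(f"machine {i} has children {found}")
    return problems

good = (
    "<hosts><machine><name>quiddish</name>"
    "<department>Software Sorcery</department><room>314WVH</room>"
    "<owner>Horry Patter</owner><ipaddress>192.168.1.13</ipaddress>"
    "</machine></hosts>"
)
# Well-formed, but missing three of the required child elements.
bad = "<hosts><machine><name>dibby</name><room>310WVH</room></machine></hosts>"

good_problems = check_hosts(good)
bad_problems = check_hosts(bad)
print(good_problems)
print(bad_problems)
```

Both sample documents parse without error, but only the first satisfies the content model. In practice you would hand this job to a validating parser rather than maintain a hand-written check.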
The World Wide Web Consortium (W3C) has also created a specification for data descriptions called schemas for DTD-like purposes. Schemas are themselves written in XML code. Here’s an example of schema code that uses the 1.0 XML Schema recommendation syntax found at http://www.w3.org/XML/Schema (version 1.1 of this recommendation was still in process while this book was being written):
<?xml version='1.0' ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="MachineType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="department" type="xsd:string"/>
      <xsd:element name="room" type="xsd:string"/>
      <xsd:element name="owner" type="xsd:string"/>
      <xsd:element name="ipaddress" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="ListOfMachines">
    <xsd:sequence>
      <xsd:element name="machine" type="MachineType"
                   minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="hosts" type="ListOfMachines" />
</xsd:schema>
Both the DTD and schema mechanisms can get complicated quickly, so we’re going to leave further discussion of them to the books that are dedicated to XML/SGML.
You can’t go very far in XML without learning two important terms. First, XML data is said to be well-formed if it follows all of the XML syntax and grammar rules (matching tags, etc.). Often a simple check for well-formed data can help you spot typos in XML files. That’s an advantage when the data you are dealing with holds configuration information, as in the machine database excerpted in the last section.
Second, XML data is said to be valid if it conforms to the rules we’ve set down in one of the data definition mechanisms mentioned earlier. For instance, if your data file conforms to its DTD, it is valid XML data.
Valid data by definition is well-formed, but the converse does not have to be true. It is possible to have perfectly wonderful XML data that does not have an associated DTD or schema. If it parses properly, it is well-formed, but not valid.
Here are three terms that appear throughout the XML literature and may stymie the XML beginner:
An attribute is a description of an element that is part of the initial start tag. To reuse a previous example, in the element <img src="picture.jpg" />, src="picture.jpg" is an attribute. There is some controversy in the XML world about when to use the contents of an element and when to use attributes. The best set of guidelines on this particular issue is found at http://www.oasis-open.org/cover/elementsAndAttrs.html.
The term CDATA (Character Data) is used in two contexts. Most of the time it refers to everything in an XML document that is not markup (tags, etc.). The second context involves CDATA sections. A CDATA section is declared to indicate that an XML parser should leave that section of data alone even if it contains text that could be construed as markup. CDATA sections look a little strange. Here’s the example from the XML spec:
<![CDATA[<greeting>Hello, world!</greeting>]]>
In this case the <greeting></greeting> tags get treated like just plain characters and not as markup that needs to be parsed.
The string PCDATA itself stands for “Parsed Character Data.” It is another inheritance from SGML; in this usage, “parsed” means that the XML processor will read this text looking for markup signaled by < and & characters.
You can think of this as data composed of CDATA and potentially some markup. Most XML data falls into this classification.
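A parser shows the difference between these three terms quite directly. Here is a small sketch, assuming Python's standard `xml.etree.ElementTree` module. The `<doc>` wrapper element is made up so that the CDATA section from the XML specification sits inside a complete document.

```python
import xml.etree.ElementTree as ET

# Attribute: src="picture.jpg" lives in the start tag, not in the content.
img = ET.fromstring('<img src="picture.jpg" />')
print(img.attrib)

# CDATA section: the parser hands the enclosed text back untouched.
# The <doc> wrapper is a hypothetical element added for this sketch.
cdata_doc = ET.fromstring(
    "<doc><![CDATA[<greeting>Hello, world!</greeting>]]></doc>"
)
print(cdata_doc.text)   # the literal characters, angle brackets included
print(len(cdata_doc))   # number of child elements

# PCDATA: without the CDATA wrapper, the same characters are parsed as markup.
pcdata_doc = ET.fromstring("<doc><greeting>Hello, world!</greeting></doc>")
print(len(pcdata_doc), pcdata_doc[0].tag, pcdata_doc[0].text)
```

In the CDATA case the wrapper element has no children and its text contains the angle brackets. In the PCDATA case the parser finds a real greeting child element.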
Here are two final tips about things that experienced XML users say may trip up people new to XML:
Pay attention to the characters that, as in HTML, cannot be included in your XML data without being represented as entity references. These include <, >, &, ' (single quote), and " (double quote). These are represented using the same convention as in HTML: &lt;, &gt;, &amp;, &apos;, and &quot;. Lots of new users get stymied because they leave an ampersand somewhere in their data and it doesn’t parse.
If you are going to place non-UTF-8 data into your documents, be sure to specify an encoding. Encodings are specified in the XML declaration:
<?xml version="1.0" encoding="iso-8859-1" ?>
A common mistake is to either omit this declaration or declare the document as UTF-8 when it has other kinds of characters in it.
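Both tips can be tried out in a few lines. The following sketch assumes Python's standard library (`xml.sax.saxutils` for escaping and `xml.etree.ElementTree` for parsing); the sample strings and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Tip 1: escape special characters before placing text into a document.
raw = 'Hardware & "Hackery" <R&D>'
escaped = escape(raw)    # replaces &, <, and > with entity references
attr = quoteattr(raw)    # escapes and adds the surrounding quotes

doc = ET.fromstring(f"<department>{escaped}</department>")
round_trip = doc.text    # the parser turns the references back into characters

attr_doc = ET.fromstring(f"<department name={attr} />")
attr_round_trip = attr_doc.get("name")

# Tip 2: the byte 0xE9 is "é" in ISO-8859-1, so the declaration must say so.
latin1 = b'<?xml version="1.0" encoding="iso-8859-1" ?><owner>Ren\xe9</owner>'
owner = ET.fromstring(latin1).text

# The same bytes declared as UTF-8 are not well-formed.
mislabeled = b'<?xml version="1.0" encoding="UTF-8" ?><owner>Ren\xe9</owner>'
try:
    ET.fromstring(mislabeled)
    mislabeled_parses = True
except ET.ParseError:
    mislabeled_parses = False

print(escaped)
print(owner, mislabeled_parses)
```

The escaped text survives a round trip through the parser, and the ISO-8859-1 document parses only when its declaration names the encoding actually used.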
XML has a bit of a learning curve, but this small tutorial should help you get started. Once you have the basics down, you can begin to look at some of the more complex specifications that surround XML, including XSLT (for transforming XML to something else, such as HTML), XPath (a way of referring to a specific part of an XML document; see the next appendix), and SOAP/XML-RPC (used to communicate with remote services using messages written in XML).
See the end of Chapter 6, Working with Configuration Files, for more references on XML-related topics.
Copyright © 2009 O'Reilly Media, Inc.