Organizing XML with Entities

by Erik T. Ray
01/23/2001

The Basics

First, a review of the basics. According to the XML Recommendation [1], "an XML document may consist of one or many storage units ... called entities." So, an entity is a piece of text and markup (called mixed content), or basically any subset of an XML document. The whole document is referred to as the document entity. (The document type definition [DTD] is another entity.) More interesting to you, the XML author, however, are the smaller bits and pieces of a document that can be contained inside entities.

Any entity (except for the document and its DTD) can be named, which gives you the power to call upon a segment of mixed-content text when you need it. As you will see, there are many applications for named entities, from inserting boilerplate text to spreading a document over multiple pages.

General Entities

To use a named (or general) entity, you first have to state your intention to use it in a piece of syntax called an entity declaration. Most often, entities are declared in the internal subset of the DTD. That's a place at the top of the document, inside the <!DOCTYPE> declaration. Here's an example of a document that declares a general entity called "friend":

<?xml version="1.0"?>
<!DOCTYPE memo
[
  <!ENTITY friend "Samuel Jeremiah Bagpipe-Grubbins">
]>
<memo priority="normal">
  <from>Julie</from>
  <to>roommates</to>
  <message>
My good friend, &friend;, will be stopping by to feed 
and talk to my goldfish while I'm away. I've given &friend; 
a set of keys and told him how to work the alarm. So please 
make &friend; feel at home when he's here. Thanks. 
  </message>
</memo>

In this document, we declared an entity called "friend" whose replacement text is "Samuel Jeremiah Bagpipe-Grubbins". Later, the entity is referenced in the form &friend;. The ampersand (&) and semicolon (;) are delimiters that tell the XML parser to treat the name between them as an entity reference. (Because the & symbol has special meaning in entity references, you have to use the predefined entity reference &amp; when you want the & character by itself.) When the XML parser reads this document, it automatically replaces each entity reference with the entity's declared text.
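
As a rough illustration of that replacement step, here is approximately what an application receives for the message element once the parser has expanded the references. The last line shows the built-in &amp; reference in use; the <company> element is a made-up name for this sketch and is not part of the memo.

```xml
<!-- The message element as the application sees it after expansion -->
<message>
My good friend, Samuel Jeremiah Bagpipe-Grubbins, will be stopping by to feed 
and talk to my goldfish while I'm away. I've given Samuel Jeremiah Bagpipe-Grubbins 
a set of keys and told him how to work the alarm. So please 
make Samuel Jeremiah Bagpipe-Grubbins feel at home when he's here. Thanks. 
</message>

<!-- Hypothetical element: a literal ampersand must be written as &amp; -->
<company>O'Reilly &amp; Associates</company>
```

Changing the name in the single declaration updates every place it appears, which is the main appeal of this kind of entity.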

Related Reading

Learning XML
Guide to Creating Self-Describing Data
By Erik T. Ray

This is a powerful way to store and retrieve bits of text. Some reasons you'd do this are:

External Entities

Perhaps the most important role of an entity is to include text from another file in your document. An external entity is declared slightly differently than the internal entity in our previous example because its replacement text is located in another file. The declaration needs to tell the XML parser how to find that text, whether it's in another place on the same computer system, or perhaps on another system somewhere on the Internet.


If this sounds like linking (e.g. hypertext links in HTML), you're partially right. It isn't a way to link one document to another; rather it's a way to link segments of a document together. Once the XML parser finds the replacement text and pops it into the places where a reference to an external entity is found, it treats that text as if it had been there all along. The user has no idea (and shouldn't care) that an external entity pulled in content from another file. This is quite different from XLink [2], XML's linking paradigm. XLink defines the ways a document can link separate documents together, which often involves some user interaction (like clicking on a highlighted word in HTML).

The following example, a fictional lab report written by one Dr. Wyse, is a document consisting of several parts spread over multiple files:

<?xml version="1.0"?>
<!DOCTYPE report
[
  <!ENTITY abstract     SYSTEM "abs.xml">
  <!ENTITY data         SYSTEM "dat.xml">
  <!ENTITY analysis     SYSTEM "ana.xml">
  <!ENTITY conclusion   SYSTEM "con.xml">
  <!ENTITY bibliography SYSTEM "bib.xml">
  <!ENTITY appendix     SYSTEM
      "http://www.scistuff.org/pub/info/tables/tbl34.xml">
  <!ENTITY bio          SYSTEM "/home/penny/bio.txt">
  <!ENTITY header       PUBLIC 
      "-//SCILAB//XML Corp Banner v1.3//EN" 
      "/company/boilerplate/banner3.xml">
  <!ENTITY % equations  SYSTEM "/company/sci/eqs.ent">
  %equations;
]>
<report>
  <date>2001.04.13</date>
  <author>Dr. Penny Wyse</author>

  <!-- Main Document Parts -->

  &header;         <!-- company logo, legal info, etc. -->
  &abstract;       <!-- overview of the experiment -->
  &data;           <!-- experimental results in a big table -->
  &analysis;       <!-- graphs, diagrams, equations, gab -->
  &conclusion;     <!-- what we think happened -->
  &bibliography;   <!-- citations and research sources -->
  <appendix>
    <title>Isotopic Measurements for Einsteinium</title>
    &appendix;     <!-- a useful table I found somewhere -->
  </appendix>
  <colophon>       <!-- brief career history -->
    <authorbio>
      &bio;
    </authorbio>
  </colophon>
</report>

The first thing you'll notice about this example is how sparse it is. Where is all the content? With clever use of entities, Dr. Wyse has spread all the content out among a bunch of files. The header of the report is a piece of boilerplate living in a file somewhere else on the system. The critical components, from abstract to bibliography, are files in the same location as the file printed here. Dr. Wyse also includes a table for her appendix, which is sourced in from a location on the Internet. Finally, she includes a bio from her home directory, where she can maintain her personal information.
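
To make the idea concrete, here is a sketch of what one of those component files, abs.xml, might contain. The element names and placeholder text are assumptions made for this illustration; the article does not show the real file. Note that an external entity file is a fragment: it has no <!DOCTYPE> declaration of its own, though it may begin with an optional text declaration naming its encoding.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical contents of abs.xml, pulled in wherever &abstract; appears -->
<abstract>
  <title>Abstract</title>
  <para>Placeholder text for the overview of the experiment.</para>
</abstract>
```

Whatever the file holds must be well-balanced markup, since the parser treats it as if it had been typed directly at the point of the reference.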

The second thing you'll see is that Dr. Wyse uses different kinds of external entity declarations. Which kind you use depends on how you want to access the resource. The first kind uses a system identifier, which is a URL or a filesystem path that tells the parser exactly where the resource is. The second kind uses a public identifier, which is a name for a resource that is universally recognized and doesn't require that you know precisely where the resource is. We won't go into the details of public identifiers, but system identifiers are quite useful on their own.
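
For comparison, here are the two forms side by side, using declarations taken from the report. One detail worth knowing: in XML, a public identifier must still be followed by a system identifier, which the parser can fall back on if it has no way to resolve the public name.

```xml
<!-- System identifier only: the parser is told exactly where to look -->
<!ENTITY data   SYSTEM "dat.xml">

<!-- Public identifier, followed by a system identifier as the fallback -->
<!ENTITY header PUBLIC
    "-//SCILAB//XML Corp Banner v1.3//EN"
    "/company/boilerplate/banner3.xml">
```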

Dividing an XML Document into Components

So, why did Dr. Wyse butcher her document into so many files when it would be simpler just to keep all that stuff in one place? Dividing an XML document into components has several advantages:

Summary

Entities were introduced with XML's predecessor, the Standard Generalized Markup Language (SGML). But they've proved so valuable to XML authors that they were included in the slimmer XML specification while other features were pared away. Master the use of entities and you'll find that writing documents in XML is an easier and more manageable process. And you can impress your friends at parties with XML tricks. (Well, maybe.)


Notes:

  1. The Extensible Markup Language (XML) Recommendation is written and maintained by the XML Working Group of the World Wide Web Consortium (W3C). (See the W3C's page on XML resources and information.) Version 1.0, Second Edition, of this document is available online.

  2. The XML Linking Language, also called XLink, is another recommendation by the W3C, available online.

  3. MathML is an XML markup language proposed by the W3C for encoding mathematical expressions and functions.

 

Erik Ray is an XML software specialist and developer at O'Reilly & Associates. He lives with his wife Jeannine and five parrots in Saugus, Massachusetts. Besides writing, he practices kendo, plays go, binds books, and stalks bookstores for rare and antiquarian books.
