Question:  D.2 What are these terms DTDless, valid, and well-formed?
Answer:

XML lets you use a Document Type Definition (DTD) to describe the markup (elements and other constructs) available in any specific type of document. However, the design and construction of a DTD can be complex and non-trivial, so XML also lets you work without a DTD. DTDless operation means you can invent markup without having to define it formally, provided you stick to the rules of XML syntax.

To make this work, a DTDless file is assumed to define its own markup by the existence and location of elements where you create them. When an XML application encounters a DTDless file, it builds its internal model of the document structure while it reads it, because it has no DTD to tell it what to expect. There must therefore be no surprises or ambiguous syntax: the document must be `well-formed' (must follow the rules).

Well-formed documents

To understand why this concept is needed, look at standard HTML as an example:

  • The <IMG> element, which is defined (in the SGML DTDs for HTML) as EMPTY, doesn't have an end-tag (there is no such thing as </IMG>); and many other HTML elements (such as <P>) allow you to omit the end-tag for brevity.
  • An XML processor that reads an HTML file without knowing this (because it isn't using a DTD) has no way to tell whether <IMG>, <P>, or many other start-tags should be followed by an end-tag. Once it has lost track of whether it is still inside an element or has finished with it, it cannot know whether the rest of the file is correct.

Well-formed documents therefore require start-tags and end-tags on every normal element, and any EMPTY elements must be made unambiguous, either by using normal start-tags and end-tags, or by affixing a slash to the start-tag before the closing > as a sign that there will be no end-tag.
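
The difference can be seen with any non-validating parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the XML processor; the helper name is_well_formed is made up for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# HTML-style markup: <br> has no end-tag and </p> is omitted.
print(is_well_formed("<p>one<br>two"))            # False
# The same content with the EMPTY element closed by a slash.
print(is_well_formed("<p>one<br/>two</p>"))       # True
# Or with an explicit end-tag and no content.
print(is_well_formed("<p>one<br></br>two</p>"))   # True
```

The parser never consults a DTD here: it accepts or rejects each string purely on syntax.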

All XML documents, both DTDless and valid, must be well-formed. If an XML Declaration is needed (for example, to identify the character encoding or to include a Standalone Document Declaration), it must come at the very start of the document:

<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<foo>
  <bar>...<blort/>...</bar>
</foo>

David Brownell notes: XML that's just well-formed doesn't need to use a Standalone Document Declaration at all. Such declarations are there to permit certain speedups when processing documents while ignoring external parameter entities: basically, you can't rely on external declarations in standalone documents. The declarations that matter are those for entities and attributes. A standalone document must not depend on external declarations for any kind of attribute value normalization or defaulting, otherwise it is invalid.
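
As an illustration of when the XML Declaration is necessary, the sketch below feeds the same ISO-8859-1 bytes to a parser with and without an encoding declaration. It assumes Python's standard xml.etree.ElementTree module; the sample content is invented for the example.

```python
import xml.etree.ElementTree as ET

# "café" stored as ISO-8859-1 bytes: é is the single byte 0xE9.
body = b"<foo>caf\xe9</foo>"

# Without a declaration the parser assumes UTF-8, where a lone 0xE9 is illegal.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# With the encoding declared, the same bytes decode correctly.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>\n' + body
text = ET.fromstring(declared).text

print(undeclared_ok)  # False
print(text)           # café
```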

Rules for well-formedness:

  • All tags must be balanced: that is, every element which may contain character data or sub-elements must have both the start-tag and the end-tag present (omission is not allowed except for empty elements, see below);
  • All attribute values must be in quotes. The single-quote character (the apostrophe) may be used if the value contains a double-quote character, and vice versa. If you need isolated quotes as data as well, you can use &apos; or &quot;. Do not under any circumstances use the automated typographic ( `curly' ) inverted commas substituted by some word processors for quoting attribute values.
  • Any EMPTY elements (e.g. those with no end-tag, like HTML's <IMG>, <HR>, and <BR>) must either end with /> or look like non-EMPTY elements by having a real end-tag (but no content). Example: <BR> would become either <BR/> or <BR></BR> (with nothing in between).
  • There must not be any isolated markup-start characters (< or &) in your text data. They must be given as &lt; and &amp; respectively, and the sequence ]]> may only occur as the end of a CDATA marked section: if you are using it for any other purpose it must be given as ]]&gt;.
  • Elements must nest inside each other properly (no overlapping markup, same as for HTML);
  • DTDless well-formed documents may use attributes on any element, but the attributes are all assumed to be of type CDATA. You cannot use ID/IDREF attribute types for parser-checked cross-referencing in DTDless documents.
  • XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use. With a DTD, every XML processor still recognizes these five, but for interoperability a valid document should declare them like any other entity; all other character entities must be declared. If you need other character entities in a DTDless file, you can declare them in an internal subset without referencing anything other than the root element type (thanks to Richard Lander for this):
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE example [
    <!ENTITY mdash "---">
    ]>
    <example>Hindsight&mdash;a wonderful thing.</example>
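
Several of the rules above can be checked mechanically. This is a rough sketch, assuming Python's standard xml.etree.ElementTree module as a non-validating parser; the is_well_formed helper and the sample strings are invented for the illustration, and only some of the rules are covered.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# (sample, expected result) pairs for some of the rules in the list above.
samples = [
    ("<a><b>text</a>", False),          # end-tag for <b> omitted
    ("<a x=1/>", False),                # attribute value not quoted
    ("<a x='say \"hi\"'/>", True),      # apostrophes around a value containing "
    ("<a>fish & chips</a>", False),     # isolated & in text data
    ("<a>fish &amp; chips</a>", True),  # & given as &amp;
    ("<a><b></a></b>", False),          # overlapping elements
    ("<a>&mdash;</a>", False),          # entity neither predefined nor declared
]
results = [is_well_formed(sample) == expected for sample, expected in samples]
print(all(results))

# The internal-subset example: the entity is declared, so it can be used.
doc = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
<!ENTITY mdash "---">
]>
<example>Hindsight&mdash;a wonderful thing.</example>"""
print(ET.fromstring(doc).text)
```

The last call should print the text with the entity reference replaced by the three hyphens declared in the internal subset.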
    

Valid XML

Valid XML files are well-formed files which have a Document Type Definition (DTD) and which conform to it. They must already be well-formed, so all the rules above apply.

A valid file begins with a Document Type Declaration, but may have an optional XML Declaration prepended:

<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert>
  <headline>...<pic/>...</headline>
  <text>...</text>
</advert>

The XML Specification predefines an SGML Declaration for XML which is fixed for all instances and is therefore hard-coded into most XML software (the declaration has been removed from the text of the Specification and is now in a separate document). The DTD named in the Document Type Declaration must be accessible to the XML processor using the URL supplied in the SYSTEM Identifier, either by being available locally (i.e. the user already has a copy on disk), or by being retrievable via the network.

It is possible (many people would say preferable) to supply a Formal Public Identifier with the PUBLIC keyword, and use an XML Catalog to dereference it, but the Specification mandates a SYSTEM Identifier so this must still be supplied (after the PUBLIC identifier: no further keyword is needed):

<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
  "http://www.foo.org/ad.dtd">
<advert>...</advert>
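
To see how a processor reports the two identifiers in the declaration above, here is a small sketch assuming Python's standard xml.dom.minidom module. minidom is non-validating and does not retrieve the DTD, so this only shows what is read from the Document Type Declaration; it does not test validity, and it does not do any catalog lookup.

```python
from xml.dom import minidom

source = """<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
  "http://www.foo.org/ad.dtd">
<advert>...</advert>"""

# Parse without validating; the DTD itself is never fetched.
doctype = minidom.parseString(source).doctype

print(doctype.name)      # root element type named in the declaration
print(doctype.publicId)  # the Formal Public Identifier
print(doctype.systemId)  # the mandatory SYSTEM Identifier (a URL)
```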

The test for validity is that a validating parser finds no errors in the file: it must conform absolutely to the definitions and declarations in the DTD.


This FAQ is from The XML FAQ, maintained by Peter Flynn