XML lets you use a Document Type Definition (DTD) to describe the markup (elements and other constructs) available in any specific type of document. However, designing and constructing a DTD can be complex, so XML also lets you work without one. DTDless operation means you can invent markup without having to define it formally, provided you stick to the rules of XML syntax.
To make this work, a DTDless file is assumed to define its own markup by the existence and location of elements where you create them. When an XML application encounters a DTDless file, it builds its internal model of the document structure while it reads it, because it has no DTD to tell it what to expect. There must therefore be no surprises or ambiguous syntax: the document must be 'well-formed' (must follow the rules).
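As a rough illustration of a processor building its model from the document alone, the following Python sketch parses a DTDless document with the standard library's xml.etree.ElementTree and lists the element hierarchy it discovered. The element names (recipe, title, step) and the outline helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Invented markup: no DTD declares these elements anywhere.
DOC = """<recipe>
  <title>Toast</title>
  <step n="1">Slice the bread</step>
  <step n="2">Toast it</step>
</recipe>"""


def outline(element, depth=0):
    """Return the element hierarchy as indented lines, one per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# fromstring raises ET.ParseError if DOC is not well-formed.
root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The parser never needed a declaration of recipe or step: the structure is inferred purely from where the start-tags and end-tags occur.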
To understand why this concept is needed, look at standard HTML as an example:
- The <IMG> element, which is defined (in the SGML DTDs for HTML) as EMPTY, doesn't have an end-tag (there is no such thing as </IMG>); and many other HTML elements (such as <P>) allow you to omit the end-tag for brevity.
- If an XML processor reads an HTML file without knowing this (because it isn't using a DTD), and it encounters <P> or many other start-tags, it has no way to know whether to expect an end-tag. It then loses track of whether it is still inside an element or has finished with it, so it cannot tell whether the rest of the file is correct.
Well-formed documents therefore require start-tags and end-tags on every normal element, and any EMPTY elements must be made unambiguous, either by using normal start-tags and end-tags, or by affixing a slash to the start-tag before the closing > as a sign that there will be no end-tag.
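The difference can be sketched with a non-validating parser. The example below uses Python's xml.etree.ElementTree; the is_well_formed helper and the sample fragments are assumptions made for illustration, not part of any XML tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a non-validating XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "HTML-style, end-tags omitted": '<p>One<img src="a.png"><p>Two',
    "EMPTY element with trailing slash": '<doc><p>One</p><img src="a.png"/></doc>',
    "EMPTY element with explicit end-tag": '<doc><img src="a.png"></img></doc>',
    "EMPTY element left open": '<doc><img src="a.png"></doc>',
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the two forms that close the empty element explicitly are accepted; the parser has no DTD from which to learn that img was meant to be EMPTY.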
All XML documents, both DTDless and valid, must be well-formed. They must start with an XML Declaration if necessary (for example, identifying the character encoding or using the Standalone Document Declaration):
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
David Brownell notes: XML that's just well-formed doesn't need to use a Standalone Document Declaration at all. Such declarations are there to permit certain speedups when processing documents while ignoring external parameter entities--basically, you can't rely on external declarations in standalone documents. The types that are relevant are entities and attributes. Standalone documents must not require any kind of attribute value normalization or defaulting, otherwise they are invalid.
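To show what the encoding part of an XML Declaration does in practice, here is a small Python sketch using xml.etree.ElementTree. The element name word and its content are made up for the example; the declaration itself is the one shown above.

```python
import xml.etree.ElementTree as ET

# The document as raw bytes: 0xE9 is e-acute in ISO-8859-1,
# but on its own it is not a valid UTF-8 sequence.
RAW = (
    b'<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>'
    b"<word>caf\xe9</word>"
)

root = ET.fromstring(RAW)
print(root.text)

# Without the declaration the parser assumes UTF-8 and rejects the byte.
try:
    ET.fromstring(b"<word>caf\xe9</word>")
    accepted_without_declaration = True
except ET.ParseError:
    accepted_without_declaration = False

print(accepted_without_declaration)
```

This is the sense in which the declaration is needed only "if necessary": a document in the default encoding parses without it, whereas one in another encoding does not.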
Rules for well-formedness:
Valid XML files are well-formed files which have a Document Type Definition (DTD) and which conform to it. They must already be well-formed, so all the rules above apply.
A valid file begins with a Document Type Declaration, but may have an optional XML Declaration prepended:
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
The XML Specification predefines an SGML Declaration for XML which is fixed for all instances and is therefore hard-coded into most XML software (the declaration has been removed from the text of the Specification and is now in a separate document). The specified DTD must be accessible to the XML processor using the URL supplied in the SYSTEM Identifier, either by being available locally (i.e. the user already has a copy on disk), or by being retrievable via the network.
It is possible (many people would say preferable) to supply a Formal Public Identifier with the PUBLIC keyword, and use an XML Catalog to dereference it, but the Specification mandates a SYSTEM Identifier so this must still be supplied (after the PUBLIC identifier: no further keyword is needed):
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
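The order of the two identifiers can be checked with a short Python sketch built on the standard library's expat binding, which reports the contents of a Document Type Declaration without fetching the DTD. The read_doctype helper is hypothetical, and the complete declaration used here is an assumption: it reuses the system URL from the earlier SYSTEM example.

```python
import xml.parsers.expat


def read_doctype(xml_text):
    """Return the name and identifiers found in the Document Type Declaration."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["name"] = name
        found["system"] = system_id
        found["public"] = public_id

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return found


# Assumed complete declaration: PUBLIC identifier first, then the
# SYSTEM identifier with no further keyword.
WITH_PUBLIC = (
    '<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN" '
    '"http://www.foo.org/ad.dtd">\n<advert/>'
)
SYSTEM_ONLY = '<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">\n<advert/>'

print(read_doctype(WITH_PUBLIC))
print(read_doctype(SYSTEM_ONLY))
```

expat only reports the identifiers here; resolving a Formal Public Identifier through a catalog is a separate step that this sketch does not attempt.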
The test for validity is that a validating parser finds no errors in the file: it must conform absolutely to the definitions and declarations in the DTD.
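The gap between well-formed and valid can be seen by feeding an invalid document to a non-validating parser. The sketch below uses Python's xml.etree.ElementTree, which is built on expat and does not validate; the tiny DTD in the internal subset and the price element are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD: an advert must contain exactly one title.
# The document body breaks that rule but is still well-formed.
DOC = """<?xml version="1.0"?>
<!DOCTYPE advert [
  <!ELEMENT advert (title)>
  <!ELEMENT title (#PCDATA)>
]>
<advert><price>5</price></advert>"""

# No error is raised: the declarations are read but not enforced.
root = ET.fromstring(DOC)
print([child.tag for child in root])
```

A validating parser would be expected to report the undeclared price element and the missing title; checking that requires a validating tool, which this sketch deliberately leaves out.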