Question:  C.6 How can I make my existing HTML files work in XML?
Answer:

Either convert them to conform to some new document type (with or without a DTD) and write a stylesheet to go with them; or edit them to conform to XHTML.

It is necessary to convert existing HTML files because XML does not permit end-tag minimization (missing </p>, etc), unquoted attribute values, and a number of other shortcuts which are normal in most HTML DTDs. However, many HTML authoring tools already produce almost (but not quite) well-formed XML. As a preparation for XML, the W3C's HTML Tidy program can clean up some of the formatting mess left behind by inadequate HTML editors, and even separate out some of the formatting to a stylesheet, but there is usually still some hand-editing to do.

Converting to a new document type

If you want to move your files out of HTML into some other DTD entirely, there are already many native XML application DTDs, and several XML versions of popular SGML DTDs like TEI and DocBook to choose from. There is a pilot site run by CommerceNet (http://www.xmlx.com/) for the exchange of XML DTDs.

Alternatively you could just make up your own markup: so long as it makes sense and you create a well-formed file, you should be able to write a CSS or XSLT stylesheet and have your document displayed in a browser.
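
A quick way to test whether a file with made-up markup is well-formed is to run it through any XML parser. The following is a minimal sketch using Python's standard library; the element names in the sample (recipe, title, step) are invented purely for illustration, and the check covers well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical home-made markup: well-formed, no DTD needed.
recipe = "<recipe><title>Toast</title><step>Toast the bread.</step></recipe>"

# Typical old-style HTML: no end-tags, so it is not well-formed XML.
old_html = "<P>First paragraph<P>Second paragraph<BR>"

print(is_well_formed(recipe))
print(is_well_formed(old_html))
```

A file that passes this check can then be paired with a CSS or XSLT stylesheet for display.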

Converting valid HTML to XHTML

If your HTML files are valid (full formal validation with an SGML parser, not just a simple syntax check), then try validating them as XHTML. If you have been creating clean HTML without embedded formatting then this process should throw up only mismatches in upper/lowercase element and attribute names, and empty elements (plus perhaps the odd non-standard element type name if you use them). Simple hand-editing or a short script should be enough to fix these mismatches.

If your HTML validly uses end-tag omission, this can be fixed automatically by a normalizer program like sgmlnorm (part of SP) or by the sgml-normalize function in an editor like Emacs/psgml (don't be put off by the names, they both do XML).

If you have a lot of valid HTML files, you could write a script to do this in a programming language which understands SGML/XML markup (such as Omnimark, Balise, SGMLC, or a system using one of the SGML libraries for Perl, Python, or Tcl), or you could even use editor macros if you know what you're doing.

Converting invalid HTML to well-formed XHTML

If your files are invalid HTML (95% of the Web) they can be converted to well-formed DTDless files as follows:

  • replace the DOCTYPE Declaration with the XML Declaration <?xml version="1.0" encoding="iso-8859-1" standalone="yes"?> (the pseudo-attributes must appear in this order: version, encoding, standalone). If there was no DOCTYPE Declaration, just prepend the XML Declaration;
  • change any EMPTY elements (eg every <ISINDEX>, <BASE>, <META>, <LINK>, <NEXTID> and <RANGE> in the header, and every <IMG>, <BR>, <HR>, <FRAME>, <WBR>, <BASEFONT>, <SPACER>, <AUDIOSCOPE>, <AREA>, <PARAM>, <KEYGEN>, <COL>, <LIMITTEXT>, <SPOT>, <TAB>, <OVER>, <RIGHT>, <LEFT>, <CHOOSE>, <ATOP>, and <OF> in the body of the document) so that they end with /> instead, for example <img src="mypic.gif" alt="Picture"/>;
  • make all element names and attribute names lowercase;
  • ensure there are correctly-matched explicit end-tags for all non-empty elements; eg every <p> must have a </p>, etc;
  • escape all < and & non-markup (ie literal text) characters as &lt; and &amp; respectively (there shouldn't be any isolated < characters to start with);
  • ensure all attribute values are in quotes.

Be aware that many HTML browsers may not accept XML-style EMPTY elements with the trailing slash, so the above changes may not be backwards-compatible. An alternative is to add a dummy end-tag to all EMPTY elements, so <img src="foo.gif"/> becomes <img src="foo.gif"></img>. This is still valid XML provided you guarantee never to put any text content in such elements. Adding a space before the slash (eg <img src="foo.gif" />) may also fool older browsers into accepting XHTML as HTML.

If your HTML files fall into this category (HTML created by some WYSIWYG editors is frequently invalid) then they will almost certainly have to be converted manually, although if the deformities are regular and carefully constructed, the files may actually be almost well-formed, and you could write a program or script to do as described above. The oddities you may need to check for include:

  • do the files contain markup syntax errors? For example, are there any missing angle-brackets, backslashes instead of forward slashes on end-tags, or elements which nest incorrectly (eg <B>an element starting <I>inside another</B> but ending outside</I>)?
  • are there any URLs (eg in hrefs or srcs) which use backslashes instead of forward slashes?
  • do the files contain markup which conflicts with HTML DTDs, such as headings or lists inside paragraphs, list items outside list environments, header elements like <base> preceding the first <html>, etc?
  • do the files use imaginary elements which are not in any known HTML DTD? (Large numbers of these are used in proprietary markup systems masquerading as HTML.) Although such a file is easy to transform to a DTDless well-formed file (because you don't have to define elements in advance), most proprietary or browser-specific extensions have never been formally defined, so it is often impossible to work out meaningfully where the element types can be used.
  • are there any non-ISO Latin-1 (8859-1) characters or wrongly-coded characters in your files? Look especially for native Apple Mac characters left by careless designers, or any of the illegal characters (the 32 characters at decimal codes 128-159 inclusive) inserted by MS-Windows editors. These need to be converted to the correct characters in ISO 8859-1 or the relevant plane of Unicode (and the XML Declaration should show iso-8859-1 encoding unless you specifically know otherwise).
  • do your files contain malformed (Mosaic/Netscape-style) comments? Comments must look <!-- like this --> with double dashes at each end and no double dashes in between (safest: no multiple dashes in between).
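
Some of the checks in this list are mechanical enough to script. The sketch below covers only four of them (backslashes in end-tags, backslashes in href/src URLs, characters at codes 128-159, and comments containing extra dashes); nesting errors and DTD conflicts need a real parser. It assumes the file has been read with ISO 8859-1 decoding, so that bytes 128-159 show up as the code points U+0080 to U+009F. The function name and the wording of the messages are invented for this example.

```python
import re

# A backslash straight after "<", as in <\b> written for </b>.
BACKSLASH_TAG = re.compile(r"<\\[A-Za-z]")
# A backslash inside an href or src attribute value.
BACKSLASH_URL = re.compile(r'''(?:href|src)\s*=\s*["']?[^"'\s>]*\\''', re.IGNORECASE)
# Code points 128-159, which are not printable characters in ISO 8859-1.
C1_CHARS = re.compile(r"[\x80-\x9f]")
# A comment, capturing everything between <!-- and the first -->.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def find_oddities(text: str) -> list[str]:
    """Return a list of problem descriptions found in the text."""
    problems = []
    if BACKSLASH_TAG.search(text):
        problems.append("backslash in tag")
    if BACKSLASH_URL.search(text):
        problems.append("backslash in URL")
    if C1_CHARS.search(text):
        problems.append("character in range 128-159")
    for match in COMMENT.finditer(text):
        body = match.group(1)
        # XML forbids "--" inside a comment and a dash right before "-->".
        if "--" in body or body.endswith("-"):
            problems.append("malformed comment")
            break
    return problems


print(find_oddities('<img src="images\\pic.gif"><!-- bad -- comment -->'))
```

An empty list does not mean the file is well-formed, only that none of these particular oddities was spotted.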

If you answer Yes to any of these, you can save yourself a lot of grief by fixing those problems first before doing anything else. You will likely then be getting close to having well-formed files.
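
As one example of fixing such a problem up front, the characters at codes 128-159 can often be repaired in bulk. The sketch below assumes that the stray bytes came from a Windows editor using the Windows-1252 code page (an assumption you should verify for your own files; Mac-encoded files need a different source encoding). It re-encodes the text as ISO 8859-1 and writes anything outside that character set as a numeric character reference. The function name is invented for this example.

```python
def recode_windows_text(raw: bytes) -> bytes:
    """Convert Windows-1252 bytes to ISO 8859-1 plus numeric character references."""
    # Assumption: the input really is Windows-1252. The few byte values that
    # are undefined in that code page become U+FFFD (the replacement character).
    text = raw.decode("cp1252", errors="replace")
    # Characters with no ISO 8859-1 equivalent (curly quotes, dashes, etc)
    # are written as references such as &#8217; which any XML parser accepts.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


print(recode_windows_text(b"It\x92s caf\xe9"))
```

Run this before escaping literal & characters, or take care that the escaping step leaves the newly added character references alone.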

Markup which is syntactically correct but semantically meaningless or void should be edited out before conversion. Examples are spacing devices such as repeated empty paragraphs or linebreaks, empty tables, invisible spacing GIFs etc: XML uses stylesheets, so you won't need any of these.

Unfortunately there is rather a lot of work to do if your files are invalid: this is why many professional Webmasters will always insist that only valid or well-formed files are used (and why you should instruct designers to do the same), in order to avoid unnecessary manual maintenance and conversion costs later.


This FAQ is from The XML FAQ, maintained by Peter Flynn