I use HTML Tidy in a well-tuned shell alias that cleans up the HTML of articles and weblogs before I post them. We use a subset of XHTML on the O’Reilly Network, and this wonderful utility turns poor HTML (especially HTML converted from word processor files) into valid XHTML. Once the markup is valid, it’s simple to run it through an XML parser and transform it into something useful and clean.
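A minimal sketch of that tidy-then-parse workflow, in Python rather than as a shell alias. It is not the actual alias, which isn't shown here. It assumes the `tidy` command-line tool is installed and on the `PATH`; the function names `tidy_to_xhtml` and `paragraph_texts` are hypothetical, and the particular set of options is only one plausible choice.

```python
import subprocess
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def tidy_to_xhtml(html):
    """Pipe HTML through the `tidy` CLI (assumed to be on PATH) and return XHTML."""
    result = subprocess.run(
        # -asxhtml: convert to XHTML; -numeric: emit numeric entities, which a
        # plain XML parser can read without a DTD; -quiet: suppress extra output;
        # -utf8: treat input and output as UTF-8.
        ["tidy", "-quiet", "-asxhtml", "-numeric", "-utf8"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    # tidy exits with 0 for clean input, 1 for warnings, and 2 for errors.
    if result.returncode > 1:
        raise RuntimeError("tidy reported errors it could not repair")
    return result.stdout.decode("utf-8")


def paragraph_texts(xhtml):
    """Parse XHTML with a stock XML parser and return the text of each <p>."""
    root = ET.fromstring(xhtml)
    return [
        "".join(p.itertext()).strip()
        for p in root.iter("{%s}p" % XHTML_NS)
    ]


if __name__ == "__main__":
    messy = "<p>Hello <em>world<p>Second paragraph"
    print(paragraph_texts(tidy_to_xhtml(messy)))
```

Because the cleaned output is well-formed XML, the second step needs nothing beyond the standard library's XML parser; any other XML tool would do as well.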
I’ve even used it on hand-written HTML just to make sure things were correct. It’s a great utility I use almost without thinking. Thank you, developers of and contributors to HTML Tidy!