Kitchen-sink standards are developed by committees and have to cope with a wide variety of different applications. If someone’s software does something, there has to be some element or attribute or value stuck in for it. Sometimes the backdoor of properties (open-ended value lists) is used, so that the schema can be simplified at the expense of no longer enumerating the possible values. But schemas like DocBook, TEI, ODF and OpenXML are classic kitchen sinks.
There is an objective way to detect them: check their Structured Document Complexity Metric and if it is over 300, you probably have a kitchen sink. I gave some metrics earlier in Comparing Office Document Formats.
Now the trouble with kitchen-sink schemas is that any particular set of documents will only use a subset of the total possible features. So writing a complete converter that accepts any possible input from a kitchen-sink schema and outputs it to some more targeted document type is a completely wasteful process. YAGNI. But, and here’s the rub, every so often, someone will in fact use one of the elements you didn’t expect.
One way to cope with this is the usage schema. This is a schema derived from sampling representative documents. When new documents come in, you first validate them against the usage schema, and if there is a problem, escalate it to the project management to discuss how to handle it. It is a sign that the data is not what they expected.
There are some tools to generate XSD usage schemas, but you can also generate them using Schematron. The tool I use first generates all three-level XPaths found in the document, then makes a Schematron schema that reports if any node was found that was not caught by these XPaths. Very straightforward, but effective.
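The idea can be sketched in a few lines of Python. This is not the tool mentioned above, only a minimal illustration under some assumptions: "three-level" is taken to mean each element's name plus up to two ancestor names, the documents use no namespaces, and only elements (not attributes or text) are checked. The function names `three_level_paths` and `usage_schematron` are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def three_level_paths(root):
    """Collect the name chain (element plus up to two ancestors) of every element."""
    paths = set()

    def walk(elem, ancestors):
        chain = (ancestors + [elem.tag])[-3:]
        # Chains shorter than three levels start at the document root,
        # so anchor them with "/" to keep "/a/b" distinct from "x/a/b".
        prefix = "/" if len(ancestors) < 2 else ""
        paths.add(prefix + "/".join(chain))
        for child in elem:
            walk(child, ancestors + [elem.tag])

    walk(root, [])
    return paths


def usage_schematron(paths):
    """Build a Schematron schema that reports any element not matched by the sampled paths."""
    known = " | ".join(sorted(paths))
    return f"""<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <!-- Elements on a sampled path match this rule first and pass. -->
    <sch:rule context="{known}">
      <sch:assert test="true()">known path</sch:assert>
    </sch:rule>
    <!-- Everything else falls through to this rule and is reported. -->
    <sch:rule context="*">
      <sch:report test="true()">Element <sch:name/> is outside the sampled usage.</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
"""


# Hypothetical sample document standing in for the representative documents.
sample = ET.fromstring("<doc><sec><p>text<b>bold</b></p></sec></doc>")
print(usage_schematron(three_level_paths(sample)))
```

The sketch relies on the Schematron behaviour that, within one pattern, a node is only tested by the first rule whose context matches it. Known elements are absorbed by the first rule, so the catch-all rule only fires for elements the sample never contained.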
Another use for usage schemas is for software development. If the customer has provided a sample of the output format, then make a usage schema for that and check that the output from your converter validates. Escalate any differences to project management. This gives a way of proving that your program meets their specs, and also of showing where their specs (e.g. the sample output) were inadequate.
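As a rough sketch of that check, the comparison can be done directly on sets of element paths, without going through a schema language at all. The assumptions here are that documents have no namespaces and that a path is an element name plus up to two ancestor names; `element_paths` and `compare_usage` are hypothetical helper names, and the two XML strings are invented examples.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text, depth=3):
    """Return the set of name chains (element plus up to depth-1 ancestors) in one document."""
    paths = set()
    stack = [(ET.fromstring(xml_text), ())]
    while stack:
        elem, ancestors = stack.pop()
        chain = ancestors + (elem.tag,)
        paths.add("/".join(chain[-depth:]))
        stack.extend((child, chain) for child in elem)
    return paths


def compare_usage(sample_xml, output_xml):
    """List the paths that differ between the customer's sample and the converter output."""
    sample = element_paths(sample_xml)
    output = element_paths(output_xml)
    return {
        # Structures the converter emitted that the sample never showed: escalate these.
        "unexpected_in_output": sorted(output - sample),
        # Structures in the sample that this output did not exercise.
        "never_produced": sorted(sample - output),
    }


customer_sample = "<report><title>Q1</title><body><p>ok</p></body></report>"
converter_output = "<report><title>Q1</title><body><note>hm</note></body></report>"
print(compare_usage(customer_sample, converter_output))
```

The `unexpected_in_output` list is the one to take to project management: each entry is either a converter bug or a gap in the sample. The `never_produced` list is a weaker signal, since a single output document need not exercise everything the sample did.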