Structured Document Complexity Metric

The Structured Document Complexity Metric asks the question “How complex is this document set or schema?” for the purposes of project estimation and management. The metric works for a sampled set of documents (XML or SGML) and for grammar-based schemas as well. A schema that perfectly describes a set of document instances will have the same metric as that set, so comparing the two numbers gives a way to judge how closely a schema fits the documents it is meant to describe.

The Structured Document Complexity Metric is based on a count of elements and attributes, with additional points for certain features that may add complexity. Of course, document complexity is only one input in determining costs or estimating the resources required for, say, a transformation. Your mileage may vary. However, at the very least the Structured Document Complexity Metric provides independent verification that a schema is of the order of complexity you are expecting, or that a sample set of documents completely exercises a schema.

The Structured Document Complexity Metric is a single number. For documents, or a set of documents, it is calculated by adding:

  • The total number of unique elements (in SGML terminology, “element types”)
  • The total number of unique attributes
  • An extra point for every element that is required (i.e., always found in its parent element)
  • An extra point for every attribute that is required (i.e., always found on its parent element)
  • An extra point for every element that can only appear in position 1 of its parent
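
The five counts above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation, and it rests on assumptions the text does not settle: attributes are counted as (element, attribute) pairs, "required" and "position 1" are judged per (parent, child) pair across the whole sample, and the root element earns no "required" point because it has no parent. The function name and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def document_set_metric(documents):
    """Return the metric for an iterable of XML document strings."""
    elements = set()           # unique element names
    attributes = set()         # unique (element, attribute) pairs (assumption)
    element_count = Counter()  # instances seen of each element
    attr_count = Counter()     # element instances carrying a given attribute
    child_count = Counter()    # parent instances containing a given child
    first_only = {}            # (parent, child) -> child only ever seen first

    for text in documents:
        root = ET.fromstring(text)
        for node in root.iter():
            elements.add(node.tag)
            element_count[node.tag] += 1
            for name in node.attrib:
                attributes.add((node.tag, name))
                attr_count[(node.tag, name)] += 1
            seen = set()
            for position, child in enumerate(node):
                pair = (node.tag, child.tag)
                seen.add(pair)
                first_only[pair] = first_only.get(pair, True) and position == 0
            for pair in seen:
                child_count[pair] += 1

    # A child or attribute is "required" if every instance of its parent has it.
    required_elements = sum(
        1 for (parent, _), n in child_count.items() if n == element_count[parent]
    )
    required_attributes = sum(
        1 for (owner, _), n in attr_count.items() if n == element_count[owner]
    )
    first_position = sum(1 for flag in first_only.values() if flag)

    return (len(elements) + len(attributes) + required_elements
            + required_attributes + first_position)


# Hypothetical two-document sample, invented for illustration.
SAMPLE_DOCUMENTS = [
    '<address id="a1"><name>X</name><city>Y</city></address>',
    '<address id="a2" lang="ja"><name>Z</name><city>W</city><zip>1</zip></address>',
]
print(document_set_metric(SAMPLE_DOCUMENTS))  # 4 + 2 + 2 + 1 + 1 = 10
```

For this sample the score breaks down as 4 elements, 2 attributes, 2 required elements (name and city), 1 required attribute (id), and 1 element that only ever appears first (name). A larger or more varied sample may lower the score, since an element stops being "required" as soon as one parent instance lacks it.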

In the earlier examples, the JapaneseAddress schema has a metric of 13 (8 elements total plus 4 required elements plus 1 only ever appearing in first position) and the Taiwanese address has a metric of 14. The metric is not particularly revelatory or useful for such a small schema, of course!

For schemas or DTDs, it is calculated by adding:

  • The total number of unique elements (in SGML terminology, “element types”)
  • The total number of unique attributes
  • An extra point for every element that is required (e.g., has a non-zero minOccurs in an XSD schema)
  • An extra point for every attribute that is required (e.g., has #REQUIRED in a DTD)
  • An extra point for every element that can only appear in position 1 of its parent
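
For an XSD schema, the same tally might be sketched as follows. This is a rough illustration under loudly stated assumptions: it only understands simple schemas whose content models are an xs:sequence placed directly inside an xs:complexType, it ignores groups, extensions, and imports, it counts each xs:attribute declaration as one attribute, and it treats every local element declaration with a non-zero minOccurs as required (top-level declarations cannot carry minOccurs and are skipped). The function name and sample schema are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def xsd_metric(xsd_text):
    """Return the metric for a simple XSD schema given as a string."""
    schema = ET.fromstring(xsd_text)
    top_level = {id(child) for child in schema}  # global declarations

    names = set()
    required_elements = 0
    for decl in schema.iter(XS + "element"):
        # Strip any prefix from ref="p:name" so refs and names coincide.
        names.add((decl.get("name") or decl.get("ref", "")).split(":")[-1])
        # minOccurs defaults to 1; it is only meaningful on local declarations.
        if id(decl) not in top_level and decl.get("minOccurs", "1") != "0":
            required_elements += 1

    attribute_decls = 0
    required_attributes = 0
    for decl in schema.iter(XS + "attribute"):
        attribute_decls += 1
        if decl.get("use") == "required":
            required_attributes += 1

    first_position = 0
    for complex_type in schema.iter(XS + "complexType"):
        sequence = complex_type.find(XS + "sequence")
        if sequence is None or sequence.get("maxOccurs", "1") != "1":
            continue
        particles = [c for c in sequence if c.tag != XS + "annotation"]
        # The first particle is pinned to position 1 only if it cannot repeat.
        if (particles and particles[0].tag == XS + "element"
                and particles[0].get("maxOccurs", "1") == "1"):
            first_position += 1

    return (len(names) + attribute_decls + required_elements
            + required_attributes + first_position)


# Hypothetical schema, invented for illustration.
SAMPLE_SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
        <xs:element name="zip" type="xs:string" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="id" use="required"/>
      <xs:attribute name="lang"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""
print(xsd_metric(SAMPLE_SCHEMA))  # 4 + 2 + 2 + 1 + 1 = 10
```

Here the score is 4 elements, 2 attributes, 2 required elements, 1 required attribute, and 1 first-position element. A DTD would need a different parser, since the Python standard library does not expose DTD content models.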

The Structured Document Complexity Metric has been validated against many projects, and is reported to give a better indication of complexity than, say, a simple element count. It has the virtue of being applicable to both schemas and document sets. I know of one company that has used the metric as an input in variation costing.

There are of course other possible factors that could be added: some people consider the presence or possibility of recursion or optional elements before a required sibling to be additional factors, for example. A suggestion has been made to add points for each namespace used, because they indicate some change in semantic domain that may require different processing.



YAGNI

Whenever I ask at a conference or client meeting “How do you estimate how many resources an XSLT transform will take?” the answer is almost invariably a more respectable word for guesswork. You find a similar project, and use that as the basis. These metrics can help provide better information.

Occasionally, some SGML dinosaur arises from the swamp, dripping with experience and magnificently surprised to be still around, and offers a more practical approach: for example, Sean McGrath mentioned that he thought it important to delay implementing rare elements, in a kind of 80/20 YAGNI heuristic.

YAGNI deserves to be tattooed on the breast of every XML manager. We often understand well the implications of the XML Mapping Completeness Ratio, but the implications of a higher-than-expected Structured Document Complexity Metric or a low XML Mapping Additions Ratio are often under-appreciated. In particular, they may show that a kitchen-sink schema has been adopted, which may require substantial extra work that does not spring from any actual business requirement.



Links