The newly-formed DocBook SubCommittee for Publishers is currently researching commonly-used DocBook elements to explore whether a subset of DocBook 5.0 would be generally useful. I’ve been spending a lot of time getting O’Reilly’s (DocBook 4.4) content into our new Atom Publishing Protocol repository, and decided I’d rather explore the commonly used markup in our own content instead of making up my own (unfounded) opinions.

To get relatively recent markup fashion (our markup standards have evolved over the years), I pulled the DocBook 4.4 for 49 books published in 2006 (see the bottom for the titles). The 49 books represent a reasonable mix of our core books: a few Hacks, a few Cookbooks, many Animals.

After downloading the DocBook, I did a simple parse of the aggregated <book>s to count element names. The 49 books yielded a whopping 939,241 element nodes over 24,312 pages, or nearly 20,000 elements per book and nearly 40 per page. Here are the 117 different elements, graphed and sorted by count:

elements_in_49_books.png
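
For readers who want to try the same kind of count on their own content, here is a minimal Python sketch of the "simple parse" idea. It is not the script I actually used; the function name `count_elements` and the tiny inline sample are made up for illustration. It also assumes the XML parses cleanly with the standard library, which real DocBook 4.4 files with DTD-defined entities may not do without extra handling.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO


def count_elements(source):
    """Count element names in an XML file path or file-like object."""
    counts = Counter()
    # iterparse streams the document, so a large aggregated file
    # does not have to be held in memory as one full tree.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop a namespace prefix such as "{http://docbook.org/ns/docbook}para";
        # un-namespaced DocBook 4.x tags pass through unchanged.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
        # Free the children of elements we have already counted.
        elem.clear()
    return counts


# Hypothetical miniature "book", only to show the call.
sample = BytesIO(
    b"<book><chapter><para>Use <literal>ls</literal>.</para>"
    b"<para>Done.</para></chapter></book>"
)
sample_counts = count_elements(sample)
print(sample_counts.most_common())
```

Summing the counters from each book (`Counter` objects support `+`) would give the aggregate totals, and `most_common()` gives the sorted order used in the graph.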

It should be no surprise that <para> comes in first (way off the chart with nearly 200,000 instances), but the fact that <literal> came in second surely shows that O’Reilly’s content skews very technical (and is perhaps not semantically marked up enough). I was surprised that <indexterm> came in third (63,000), but we’ve long thought that our customers value the effort that goes into a good index, and the indexing markup pays off tremendously in applications similar to those shown at labs.oreilly.com.

Sorting the 117 elements alphabetically (then scaling the Y axis logarithmically just for fun) gives a different histogram that might be more useful as a reference:


elements_in_49_books_log.png
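
The alphabetical, log-scaled view can be approximated without any charting tool. The sketch below is an assumption-laden stand-in for the real graph: `log_scaled_rows` is a hypothetical helper, the bars are plain text, and the counts in the example are invented round numbers, not figures from the 49 books.

```python
import math


def log_scaled_rows(counts, width=40):
    """Return (name, count, bar) rows sorted alphabetically by element name.

    Bar length is proportional to log10(count), so rarely used elements
    stay visible next to very common ones.
    """
    # Largest log value sets the full bar width; fall back to 1.0 when
    # every count is 1 (log10 == 0) or the mapping is empty.
    top = max((math.log10(n) for n in counts.values() if n > 0), default=0.0) or 1.0
    rows = []
    for name in sorted(counts):
        n = counts[name]
        scaled = math.log10(n) / top if n > 0 else 0.0
        rows.append((name, n, "#" * round(scaled * width)))
    return rows


# Invented counts, purely illustrative.
toy_counts = {"para": 1000, "literal": 100, "xref": 10, "abbrev": 1}
toy_rows = log_scaled_rows(toy_counts)
for name, n, bar in toy_rows:
    print(f"{name:<10} {n:>6} {bar}")
```

On a linear axis the rarest elements would be invisible next to <para>; the log scale trades exact proportions for being able to see all of them at once.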

Finally, I imposed an artificial categorization on the elements so that I could do a drill-down graph like this (click-thru for the drill-down):


elements_in_49_books_categorized.png

A more comprehensive look at (older) O’Reilly content can be found at the Labs Statistics page.

For more nitty-gritty on how I scraped this together, see this post.

Books used: