The newly formed DocBook SubCommittee for Publishers is currently researching commonly used DocBook elements to explore whether a subset of DocBook 5.0 would be generally useful. I’ve been spending a lot of time getting O’Reilly’s (DocBook 4.4) content into our new Atom Publishing Protocol repository, and decided I’d rather explore the commonly used markup in our own content instead of making up my own (unfounded) opinions.
To capture relatively recent markup fashion (our markup standards have evolved over the years), I pulled the DocBook 4.4 for 49 books published in 2006 (see the bottom for the titles). The 49 books represent a reasonable mix of our core books: a few Hacks, a few Cookbooks, many Animals.
After downloading the DocBook, I did a simple parse of the aggregated <book>s to count element names. The 49 books yielded a whopping 939,241 element nodes over 24,312 pages, or about 19,000 elements per book and nearly 40 per page. Here are the 117 different elements, graphed and sorted:
It should be no surprise that <para> comes in first (way off the chart with nearly 200,000 instances), but the fact that <literal> came in second surely shows that O’Reilly’s content skews very technical (and perhaps is not semantically marked up enough). I was surprised that <indexterm> came in third (63,000), but we’ve long thought that our customers value the effort that goes into a good index, and the indexing markup pays off tremendously in applications similar to those shown at labs.oreilly.com.
Sorting the 117 elements alphabetically (then scaling the Y axis logarithmically just for fun) gives a different histogram that might be more useful as a reference:
Finally, I imposed an artificial categorization on the elements so that I could do a drill-down graph like this (click-thru for the drill-down):
A more comprehensive look at (older) O’Reilly content can be found at the Labs Statistics page.
For more nitty-gritty on how I scraped this together, see this post.
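The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for these graphs: the function names `count_elements` and `count_books` and the tiny sample document are hypothetical, and it assumes each book is available as a well-formed XML string. Real DocBook 4.4 files that rely on DTD-defined entities would likely need a DTD-aware parser rather than the standard-library one used here.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_elements(xml_text):
    """Count element names in one XML document given as a string."""
    counts = Counter()
    for elem in ET.fromstring(xml_text).iter():
        # DocBook 4.x is not namespaced; strip a "{namespace}" prefix
        # anyway so namespaced (DocBook 5) input is counted by local name.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
    return counts


def count_books(xml_texts):
    """Aggregate element counts over several documents."""
    total = Counter()
    for text in xml_texts:
        total += count_elements(text)
    return total


# Hypothetical stand-in for a real book, just to exercise the functions.
sample = """<book><title>Sample</title><chapter><title>One</title>
<para>Use <literal>ls</literal> and <literal>cd</literal>.</para>
<para>Done.</para></chapter></book>"""

totals = count_books([sample])

# Sorted by frequency, as in the first graph.
for name, n in totals.most_common():
    print(name, n)

# Sorted alphabetically, as in the reference histogram.
for name, n in sorted(totals.items()):
    print(name, n)
```

Feeding the resulting counts to any charting tool, with or without a logarithmic Y axis, would give histograms along the lines of the ones shown above.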
Books used:
- Google Maps Hacks, 1e
- Excel Scientific and Engineering Cookbook, 1e
- Active Directory, 3e
- RFID Essentials, 1e
- Visual Basic 2005 in a Nutshell, 3e
- PSP Hacks, 1e
- Baseball Hacks, 1e
- Mind Performance Hacks, 1e
- Repairing and Upgrading Your PC, 1e
- Web Site Cookbook, 1e
- Flickr Hacks, 1e
- Fixing Access Annoyances, 1e
- Fixing PowerPoint Annoyances, 1e
- Programming SQL Server 2005, 1e
- Learning C# 2005, 2e
- Photoshop CS2 RAW, 1e
- Web Design in a Nutshell, 3e
- Google: The Missing Manual, 2e
- Don’t Get Burned on eBay, 1e
- The Art of SQL, 1e
- Fixing Windows XP Annoyances, 1e
- iPhoto 6: The Missing Manual, 1e
- iPod & iTunes: The Missing Manual, 4e
- Ajax Hacks, 1e
- Flash 8: The Missing Manual, 1e
- MySQL Stored Procedure Programming, 1e
- Flash 8: Projects for Learning Animation and Interactivity, 1e
- XAML in a Nutshell, 1e
- Linux Annoyances for Geeks, 1e
- Programming PHP, 2e
- Flash 8 Cookbook, 1e
- Learning SQL on SQL Server 2005, 1e
- Programming Excel with VBA and .NET, 1e
- iMovie 6 & iDVD: The Missing Manual, 1e
- Enterprise SOA, 1e
- Perl Hacks, 1e
- Java I/O, 2e
- Enterprise JavaBeans 3.0, 5e
- Building Scalable Web Sites, 1e
- MCSE Core Required Exams in a Nutshell, 3e
- DNS and BIND, 5e
- Learning PHP and MySQL, 1e
- Computer Security Basics, 2e
- Active Directory Cookbook, 2e
- Ubuntu Hacks, 1e
- Unicode Explained, 1e
- Digital Photography: The Missing Manual, 1e
- Ajax Design Patterns, 1e
- Python in a Nutshell, 2e





It's funny that there is only one single <formalpara> element in the histogram; a lonely, unloved element, outshone by its more popular if less adorned relative, <para>. Which book was it used in? :)
Michael: <formalpara> certainly isn't something we've used a lot in the past, but the one book that did use it, Unicode Explained, shows the relatively ugly approach we've taken more recently (2007 books produced in DocBook) to marking up the printing history of a book:
<printhistory>
<formalpara>
<title>First Edition</title>
<para>June 2006</para>
</formalpara>
</printhistory>
Sean McGrath shared his own findings about element distribution always looking like a power graph here: http://seanmcgrath.blogspot.com/2004_05_23_seanmcgrath_archive.html
In DocBook XML, how do you know how many pages are in a book without rendering out to PDF using XSL-FO? Are pages defined in the original DocBook XML or, if not, how do you determine the number of pages?
John: I used the number of pages in the printed book (regardless of whether it was typeset using DocBook or not). We've designed our customizations to the DocBook-XSL stylesheets as a mirror to our other typesetting systems, so the pagecounts usually end up being similar.
We don't have any notion of the pages in our DocBook markup itself, so the numbers above are just there to give a general sense rather than anything definitive.
Looks like another example of Zipf's law in full effect!
Here's a followup with some newer content: http://www.oreillynet.com/xml/blog/2007/05/docbook_elements_in_the_wild_a.html
Interesting about <formalpara> - I actually use that construct quite often for creating titled bulleted points:
<listitem>
<formalpara>
<title>Relevance</title>
<para>The concept that information must contain some significance to other information</para>
</formalpara>
</listitem>
What do you use for this kind of construct, as I can't imagine this isn't a markup requirement for O'Reilly?
Kurt: The O'Reilly style for that would be to use a <variablelist> rather than other markup (you can see the huge number of <varlistentry> in the graphs above).