Document Engineering
Related link: http://mitpress.mit.edu/catalog/item/default.asp?tid=10476&ttype=2
The most surprising thing I picked up at last week's Open Standards 2005 conference in Sydney came from the C.T.O. of an established company that provides electronic data exchange capability, especially for the shipping industry. He said that the documents they received were mainly EDI, then CSV, with very little XML, but that they converted all the data in-house to XML for easier processing. I expected it would be the other way around: lots of XML data coming in and being processed by old-school database tools.
But my surprise was not so much the low external use-rates of XML (after all, if you already supply your documents in one structured notation, it is tempting to see a move to XML as satisfying only cosmetic rather than business requirements). What raised my eyebrows was how intermixed XML, CSV and EDI all are now: an EDI house uses XML internally.
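To make the "convert it all to XML in-house" step concrete, here is a minimal sketch of what such a conversion might look like for the CSV case. It is not the company's actual process, which was not described; the function name, the element names (shipments, shipment) and the sample columns are all made up for illustration, and it assumes the CSV header names are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(csv_text, root_tag="shipments", row_tag="shipment"):
    """Convert CSV text with a header row into an XML string.

    root_tag and row_tag are hypothetical names chosen for this sketch.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, row_tag)
        for field, value in row.items():
            # Each column name becomes an element name; this assumes the
            # header contains only valid XML names.
            ET.SubElement(record, field).text = value
    return ET.tostring(root, encoding="unicode")

# Invented sample data, not taken from any real feed.
sample = "container,port,weight_kg\nABCU1234567,Sydney,18200\n"
xml_out = csv_to_xml(sample)
print(xml_out)
```

Once everything is in one notation, the downstream tools only need to understand XML, which is presumably the "easier processing" the C.T.O. had in mind.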
Towards a New Discipline
Bob Glushko and Tim McGrath's standout new book Document Engineering: Analyzing and Designing Documents for Business Informatics and Web Services takes this intermixing even further. Glushko is from an SGML publishing and XML e-commerce background; McGrath is from an EDI and UBL background.
They see a new discipline of Document Engineering emerging. A nice summary is at an IBM Research seminar. Document Engineering applies a dataflow approach to the whole organization, identifying and modeling which documents get sent between business processes and their contents. The documents could be transactional documents (EDI, XML invoices) or publications (HTML, PDF, custom DTDs) or even mixed.
It's not Software Engineering, it's not IT, it's not web publishing, it's not Enterprise Architecture, it's not Business Process Re-engineering, but it straddles all these.
Document Engineering is, of course, more sophisticated than simple dataflow. The analysis also includes signals and routing aspects.
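As a rough illustration of the "simple dataflow" starting point only (this is not the book's notation or method, and it ignores the signals and routing aspects), one could record which documents move between which business processes and then ask what each process sends and receives. The process names, document names and the Flow type below are all invented for the sketch.

```python
from collections import defaultdict
from typing import NamedTuple

class Flow(NamedTuple):
    """One document passing from one business process to another (hypothetical model)."""
    sender: str
    receiver: str
    document: str
    notation: str

# Invented example flows mixing transactional and publication documents.
flows = [
    Flow("Sales", "Warehouse", "Order", "XML"),
    Flow("Warehouse", "Carrier", "Shipping instruction", "EDI"),
    Flow("Carrier", "Sales", "Delivery report", "CSV"),
    Flow("Sales", "Customer", "Catalogue", "HTML"),
]

def documents_by_process(flows):
    """Group document names by the process that receives and sends them."""
    inbox = defaultdict(list)
    outbox = defaultdict(list)
    for flow in flows:
        outbox[flow.sender].append(flow.document)
        inbox[flow.receiver].append(flow.document)
    return dict(inbox), dict(outbox)

inbox, outbox = documents_by_process(flows)
print(outbox["Sales"])
```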
Glushko teaches at the Center for Document Engineering at Berkeley, and this book, published by MIT, is definitely intended as an undergraduate textbook for similar courses. I recommend it for anyone involved in adopting a highly-automated, loosely-coupled Service Oriented Architecture.
Taster
The book features little key points (floating outdented paragraphs) throughout to provide easy summaries. Here is a taster:
precision for all applications.
because the overlapping information isn't explicitly identified.
from many document-centric industries.
a slow link in its information flow that nullifies most of the investments to improve other processes.
Practical, well-expressed and timely.
Document Analysis and Design
When XML was created, SGML authors had moved on from issues of syntax and were dealing with issues of how to model information (publishing-related in the first instance) in documents. Most prominently, the mid-90s Prentice Hall books:
- 1995's Developing SGML DTDs: From Text to Model to Markup by Eve Maler and Jeanne El Andaloussi described how to systematically analyze publications and create a DTD from that.
- My 1998 The XML and SGML Cookbook: Recipes for Structured Information dealt with constructing DTDs from typical components (a meme that permeates XML Namespaces and XML Schemas type derivation), emphasizing the need to understand the different possibilities in order to choose the best one.
- 1998's Structuring XML Documents by Dave Megginson dealt with customizing kitchen-sink industry-standard DTDs and using architectural forms.
Document Engineering mainly takes the "analyze then assemble" kind of approach of the Maler book and pays only lip service (in s15.1.1.3, 'review' and 're-use') to the detailed knowledge of alternatives advocated by my book and (in s5.1.1.4) to customizing standard components as in Dave's book. This is, in one sense, fair enough because it is not the place of a textbook to deal with the minutiae of particular schemas. However, the publishing experience is that people who use the "analyze then assemble" approach but who don't have a good grounding in the tradeoffs of the different ways to implement structures frequently make lousy DTDs or schemas.
This mirrors Christopher Alexander's finding that the first people who adopted his pattern language approach to building houses ended up with buildings that looked familiar: if you are only aware of one corner of the solution space you will only sit there.
But the book is primarily concerned with model-based XML, influenced by the database, object-oriented and business rules analysis realms. Other influences are UBL, RosettaNet, CMM, UML and pattern languages.
Quibbles
The only quibbles I have with the book are minor. The word 'context' is used throughout, but it is (perhaps necessarily) such a vague word as to make any sentence using it seem amorphous and suspect. There is a chapter, Analyzing the Context of Use, that helps. And in the discussion of transaction patterns, particularly "Offer and Acceptance", some brief treatment of the legal aspects would be appropriate for undergraduates: what is a legal contract and which country's law applies to international transactions over the web, in particular. I don't want to shock my gentle readers, but the biggest sign of how far XML has emerged from its publishing roots is that the index refers to section numbers, not page numbers: probably unthinkable for an SGML book!
Categories: Web
Read More Entries by Rick Jelliffe.

Page numbers?
Indexing by section number is regarded, in the quality market, as a hack that you use when your indexes are created as a separate process from your typesetting, or when you are using typesetting systems not up to the job of making whole books, or as a sign of a tight deadline.
It is lower quality because it is more cumbersome to find the page: rather than being able to do a binary chop or other approximation method based on page numbers, the user has to locate section headings (where the numbers are) in the text and guess how long each section is in order to find their target.
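A toy model of that difference, under loudly stated assumptions: the reader halves the book each time, every page shows its page number, and a section number can only be read by flipping back to the nearest preceding heading. The function names and the 650-page book with five uneven sections are invented for illustration; real readers guess more cleverly than this.

```python
from bisect import bisect_right

def probes_by_page(total_pages, target_page):
    """Binary chop on page numbers: count how many times the book is opened."""
    lo, hi, probes = 1, total_pages, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if mid == target_page:
            return probes
        if mid < target_page:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("target page is outside the book")

def probes_by_section(heading_pages, total_pages, target_section):
    """Binary chop when only section headings carry the number.

    heading_pages[i] is the page where section i starts (sorted, starting at 1).
    Returns (probes, extra_flips), where extra_flips counts the pages turned
    back at each probe to reach the nearest preceding heading.
    """
    lo, hi, probes, extra_flips = 1, total_pages, 0, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        section = bisect_right(heading_pages, mid) - 1
        extra_flips += mid - heading_pages[section]
        if section == target_section:
            return probes, extra_flips
        if section < target_section:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("target section is outside the book")

# Hypothetical 650-page book with uneven sections.
heading_pages = [1, 40, 41, 300, 600]
page_probes = probes_by_page(650, 417)
section_probes, extra_flips = probes_by_section(heading_pages, 650, 1)
print(page_probes, section_probes, extra_flips)
```

In this model the number of probes is similar either way; the cost of section numbers shows up in the extra page turning needed at each probe just to find out where you are.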
In SGML books, one of the things that needed to be proved was that you could make just as high quality books with structural markup as you could using presentation tags (troff, TeX) or hand-made indexes.
Also, one of the selling points for SGML was the ability to handle large documents: SGML would be contrasted with weaker technologies such as using MS Word, where you were stuck with producing one chapter at a time, and so had to index by section number rather than page number.
When I made my book, which was quite large (about 650 pages), production issues made me divide it into three or so parts, so the index items have a part number then a page number within that, which is an intermediate approach that is quite acceptable as far as predictability for users.
Another issue to be factored into the decision about whether to index by page number or section number is that usually people will use the same thing in the cross-references. If you use page numbers for cross-references, then when you make any last-minute changes to the text, you may have to go through the entire book to check that automated and forced page breaking still produce acceptable results: adding text in one place may cause the IDed object to cross page borders, which may cause references to it to require an extra digit, which may trigger different line and page breaks and flow through the whole document. That is an issue where deadlines fight against best quality, so where there are deadlines it can be prudent not to index by page number.
So my comment is that Document Engineering was not produced with the intent of proving, through the book itself, that marked-up documents can yield just as good a product as hand-indexed books. That is not to say that they haven't made the appropriate production decision; no professional slight was remotely intended! Indeed, I have no idea whether the book was made using declarative markup at all: it may have been written in an unstructured word processor with styles for all I know. (The design looks like the kind of thing that people do with TeX-based typesetting systems, as a guess, but nowadays it is difficult to say.) SGML books were read by publishing people and had to contain, in their production values, confirmations of the subject matter; XML books have a different market and so don't have to have some special thing in them.
(As a further example, in my book I also have bullet lists using "radio buttons" with one item --the default or most important or typical case-- selected. Or a bullet list made from tears, signifying woes. The intent was to demonstrate that SGML/XML didn't prevent you from playfulness or innovation in design. We had to eat our own dogfood, and it tasted nice in parts.)
Page numbers?
How do section numbers vs. page numbers relate to XML vs. SGML?
docengineering.com -- sample chapters
My pleasure, Bob, and congratulations on the book: it really is a great achievement and a major step forward for the industry.
docengineering.com -- sample chapters
Thanks, Rick, for your review of our Document Engineering book. We have set up a site for the book where we have sample chapters and will be putting up lecture notes, news, and other useful stuff...
docengineering.com
-bob glushko
Bad link... but it will be back soon.
Just a heads up, the MIT store is doing a bit of an update today ('doh!), according to the front page. So when it comes back up today or tomorrow, maybe we can see this book.