IBM Information Server includes a business glossary manager that I am implementing for several clients. Some of those clients have existing data dictionaries and glossaries that will need to be imported into the product. IBM Information Server provides an XML format for importing and exporting business glossaries.
There is a lot to examine in this format: the good, the bad and the ugly. Before we begin our dissection, two contextual topics need some discussion. The first is the goals of the format; the second is whether those goals could have been achieved using existing formats.
At a high level, the format has three main goals, which correspond to its three main elements: represent terms and their definitions (via the term element), categorize terms (via the category element) and add custom attributes to categories or terms (via the attribute element). Except for the metadata extension mechanism (custom attributes), this is a simple way to create and organize a dictionary in XML. When examining the schema or the example of the format, it is clear that it is far from a complete standard. For example, the only available data type for custom attributes is String. So it is clear that this format will evolve. A bigger question is: should it? And should it even have been created in the first place?
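To make the three elements concrete, here is a minimal sketch in Python. Only the element names term, category and attribute come from the format as described above. The root element, the name and value XML attributes, the definition child and the sample data are all hypothetical placeholders, so treat this as an illustration of the shape and not as the actual IBM schema.

```python
import xml.etree.ElementTree as ET


def build_glossary(categories):
    """Build a glossary tree from {category_name: [term dicts]}.

    Only the element names 'category', 'term' and 'attribute' come from the
    post. The root element, the 'name'/'value' XML attributes and the
    'definition' child are hypothetical, not taken from the IBM schema.
    """
    root = ET.Element("glossary")  # hypothetical root element
    for cat_name, terms in categories.items():
        cat = ET.SubElement(root, "category", name=cat_name)
        for t in terms:
            term = ET.SubElement(cat, "term", name=t["name"])
            ET.SubElement(term, "definition").text = t["definition"]
            # Custom attributes are written as strings, the only data type
            # the post says is available for them.
            for key, value in t.get("custom", {}).items():
                ET.SubElement(term, "attribute", name=key, value=str(value))
    return root


# Made-up sample data for illustration only.
sample = {
    "Finance": [
        {
            "name": "Accrual",
            "definition": "Revenue recognized\nbefore cash is received.",
            "custom": {"Steward": "J. Doe"},
        },
    ]
}

xml_text = ET.tostring(build_glossary(sample), encoding="unicode")
print(xml_text)
```

Nesting terms inside their category is one plausible way to express categorization; the real format may link the two differently.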
There are quite a few formats for capturing glossaries, dictionaries and thesauri in XML. A colleague of mine, Ken Sall, examined this for the government a few years back. The W3C has SKOS, IBM has subject classification in DITA (though DITA is much broader than glossaries), and XML topic maps can also serve this purpose.
So, although we will continue to explore the details of this format, and even the conversion of some of the other formats mentioned into it, what are your thoughts on it?
Until next time, see you in the trenches… - Mike


The right to exist is in the utility. What processors consume this format, what do they emit, what use is made of the post-format processed products and what do they replace?
Hi Len,
Agree on the issue of utility. Of course, utility can also be served by existing formats. Many times, if you dig back through the history of a product, it comes down to a single developer who wanted to "reinvent the wheel" for the sake of "simplicity" (for them). Frankly, I am tired of the simplicity argument being used as a bludgeon against reuse and interoperability.
Regards,
- Mike
I agree about the single vendor issue. Big system design is where I think the simplicity argument makes the most sense.
Simplicity has virtue where multiple scales are commingling. Some artfulness here can overcome advantages attributed to one-size-fits-all specs and standards.
Looking over some of the proposals and RFPs for systems that have to integrate scaling command and control, I see that the number of orthogonal interfaces and the perception that the system must be proof against the average idiot result in very expensive procurement and lifecycle costs. While portable data makes it reasonable to build these systems, demanding interoperability past the sense-and-respond actions significantly raises complexity.
The problem is ensuring the use cases aren't gerrymandered.
Much depends on the bite sizes of the procurements. Where an agency is my customer, I've no choice but to accept responsibility for multiple interfaces. Where the city is my customer, I can replace more legacy systems and vendors. Where a state is my customer, the scale of implementation is large but homogeneity is much improved.
Two other devils are in the details:
1. Not correctly assessing the skill set of the user and therefore always building for the lowest common denominator (most people are better trained than they admit).
2. Not correctly assessing the median case of incident complexity and assuming local events require major resources instead of adjusting the scale properly.
The couplings of one and two are where the art is. Using human intelligence and training more astutely is the master stroke.
Hi Len,
Interesting post ... I read it a few times but think I may need to read it about 10 more times to understand it fully.
I would like to focus on what you said about simplicity:
"simplicity has virtue where multiple scales are commingling."
I think I understand that case and agree with it. I think a generalization of that case is when you can clearly see that you have an overly complex design because things are continually bolted on at the last minute in a knee-jerk reaction to a new requirement. Simplicity then becomes key to redesigning a more elegant solution that eliminates the cruft.
The opposite of that is what I am talking about here - when you say something needs to be simpler because the developer does not want to be bothered with reasonable complexity.
You may have to expand on some of your other points because I did not grok it all.
Regards,
- Mike
It is about coupling. As the number of components rises, there is a non-linear increase in complexity and cost, given a complex base.
When you look at the formats that have been most successful at scale, the majority have the virtue that, at least in their initial incarnation, they are very simple (e.g., HTML, RSS, and the Air Force messaging format whose name I can't remember). The more we try to communicate in the namespace, the lumpier the system space gets as the namespace is aggregated.
A question is, how do these namespaces become complex? Typically through overreaching, noisy requirements, mammal nonsense, and failing to cut legacy at launch. Gerrymandered use cases are a problem of projects that have commingled marketing with design, ambition with structure, and so on. Think about the awful evolution of CALS.
I don't think 'less is more' or 'simplicity for its own sake' is right. What I do see is that the forces on the design have to be pared down, requirements need to be strict, use cases have to be focused, and so on. Otherwise, at the end, developers are sitting at their desks with a contract punch list ticking off the requirements they meet and those they don't, and on the other side is a customer/procurement official threatening actions or parlaying for more work.
Too often, too many usefully separable systems are procured by the same specification. That is a recipe for failure. Too often the specification was written for the abstract use case, anticipating every possible, even if improbable, failure mode. That is a recipe for very high costs and a system in which 10% of the features are used 90% of the time, while the remainder, because they go unused, are too hard to use given lack of experience, lack of training, or untested intersystem failure modes.
Simpler formats that do one job well succeed. Complex formats that do one job infrequently don't, regardless of how elegant the solution. It isn't that the second class doesn't work; it is that it doesn't fit smoothly into the environment of other slightly jagged systems.
Hi Len,
Excellent dissection of the roots of unnecessary complexity!
With my programming lens on, you are pointing out how projects ignore the proper "separation of concerns" by over-reaching.
There are many such examples of "greedy" standards.
Truly a superb post (+1),
- Mike
Sometimes you cannot avoid complexity in order to provide flexibility. The IBM Business Glossary has two import formats: the simpler CSV, which you can put together in Excel, and the more complex XML. I've used both, and the CSV is simpler but too hard to use! Glossary definitions tend to have a lot of carriage returns in them and CSV files cannot handle them. Glossaries also tend to have additional custom properties, something that can be configured in the IBM Business Glossary and handled by the flexible XML format but not well handled by the fixed CSV format.
So I vote for a flexible XML format, but with additional instructions on how to populate it. We all know how to build a list of terms and definitions in Excel for import, but building complex XML lists is not so easy.
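One way to bridge the two is to keep authoring in a spreadsheet and convert the export to XML with a short script. The sketch below assumes the spreadsheet export quotes any field that contains a line break, which Python's csv module then reads as a single field. The column names, the sample rows and the XML layout (a definition child, name and value attributes) are hypothetical; only the term and attribute element names come from the post.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up spreadsheet export: the first definition spans two lines.
CSV_TEXT = (
    "Term,Definition,Steward\r\n"
    '"Accrual","Revenue recognized\nbefore cash is received.",Finance Office\r\n'
    "Ledger,Book of accounts.,Finance Office\r\n"
)


def rows_to_terms(csv_text):
    """Turn spreadsheet rows into term elements.

    The 'Term' and 'Definition' columns are assumed to exist; every other
    column is written out as a string-valued custom attribute.
    """
    # newline="" leaves line endings alone so the csv module can handle
    # line breaks inside quoted fields itself.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    terms = []
    for row in reader:
        term = ET.Element("term", name=row.pop("Term"))
        ET.SubElement(term, "definition").text = row.pop("Definition")
        for column, value in row.items():
            ET.SubElement(term, "attribute", name=column, value=value)
        terms.append(term)
    return terms


terms = rows_to_terms(CSV_TEXT)
for term in terms:
    print(ET.tostring(term, encoding="unicode"))
```

The sample text has four physical lines but only a header and two records, which is the kind of mismatch a reader that treats each line as one record would trip over.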
Michael, how did you go about building your glossary XML input files?
Hi Vincent,
I am still in the process of collecting our existing dictionary artifacts, but I will probably be using a Java program to do it since I enjoy programming.
I'll be writing more blog entries on this as I progress...
Best wishes,
- Mike