Update: In response to a post made by Danny Ayers, I’ve added some extended information to the bottom of this post that helps clean this theory up a bit more.
[Orig. Post]
Dare Obasanjo aka Carnage4Life - Metadata Quality, Events Databases and Live Clipboard
In the above linked post, Dare Obasanjo quotes from a recent post from Jon Udell:
All this leads up to a question: How can I copy an event from one of these services and paste it into another? My conclusion is that adopting Live Clipboard and microformats will be necessary but not sufficient. We’ll also need a way to agree that, for example, this venue is the same as that venue. At the end, I float an idea about how we might work toward such agreements.
He goes on to share his own feelings, based on his solid experience of working in this space at a fairly intense level for quite some time. As such, I don’t want to come across in a way that suggests that I have the absolute answer, problem solved, let’s all take the day off and go snowboarding one last time before the season officially ends here.
While I’m all for the snowboarding piece, I’d be willing to go even if what I am about to propose is not something that would, or even could, work at a local, national, and international level.
That said, I ask the titled question:
Are We Trying Too Hard, And Simply Overlooking The Obvious?
It seems to me we might be. As such, it seems worth taking a look at the following to determine what, if anything, can be done with it:
Since 1974 (I think that’s the right year… forgive me if I’m off by a year or two either way) the bar code, or U.P.C., has become an integrated part of each and every one of our daily lives. So much so that it’s often easy to forget that the primary purpose for these codes is not (necessarily) to remind us that Big Brother is always watching over us, or even to increase the efficiency of the check-out counter at our local grocery store, although oftentimes this is a side effect.
Borrowing from the insightu.org explanation:
—
Example
UCC-12 (U.P.C.) number that could be assigned by the manufacturer to a one gallon bottle of aquarium cleaner.
The General EAN.UCC Specification specifies that UCC-12 data structure be encoded in the UPC-A or the UPC-E symbology. UPC version A looks like this: [image of a UPC-A bar code, not reproduced here]
—
This same linked document goes on to explain EAN-13, EAN/UCC-14, and beyond. The first applies to areas of the world outside the US and Canada; the second is a way to identify “intermediate packs and shipping containers holding standard configurations of consumer units.” [NOTE: A FASCINATING look into the relationship between bar codes and computing (beyond the immediately obvious) is provided by none other than Charles Petzold in his title “Code”. If you haven’t had a chance to read this title, I HIGHLY recommend it to even the most experienced computer scientist who might read this post. To understand what it TRULY means to be passionate about code, you need to read this title. It is as entertaining as it is educational. In fact, it’s the entertaining part of that last sentence that makes this an ABSOLUTE MUST READ for anyone, geek or non-geek alike, who simply wants to look at life from a completely new and fascinating perspective. :)]
Moving forward: the UC Council, formerly the EAN/UCC, is the organization in charge of GS1 BarCodes and eCom. Extending this effort, the UC Council sits at the head of various other efforts to bring a globally unique identifier into the public sector. One of these efforts is termed the Global Location Number, and its Executive Summary begins with:
—
The GLN (Global Location Number) provides a standard means to identify legal entities, trading parties and locations to support the requirements of electronic commerce. The GLN is designed to improve the efficiency of integrated logistics while contributing added value to the partners involved, as well as to customers. Examples of parties and locations that can be identified with GLNs are:
• Functional entities - e.g., a purchasing department within a legal entity, an accounting department, a returns department, a nursing station, a ward, a customer number within a legal entity, etc.
• Physical entities - e.g., a particular room in a building, warehouse, warehouse gate, loading dock, delivery point, cabinet, cabinet shelf housing circuit boards, room within a building, hospital wing, etc.
• Legal entities/Trading Partner - e.g., buyers, sellers, whole companies, subsidiaries or divisions such as suppliers, customers, financial services companies, freight forwarders, etc.
—
As this document continues it presents an extensive list of additional benefits. I won’t copy it here, but encourage you to visit this same link and learn more about this system as time allows.
I should also point out that there is an extensive listing of additional efforts being made in all areas of life, all headed up by this same non-profit org, whose primary purpose is to help us identify tangible items in ways that will help make our lives more efficient and easier to manage from a personal AND business perspective.
Okay, all of this said, there are other factors that come into play that are not solved by the unique ID systems mentioned above. This list includes, of course (but is not limited to), the question of how we associate this same ID with a naming convention we can all agree upon.
The answer (I guess more of a question, really.):
Why would we want to do that?
It’s obvious that ANY attempt to try to force naming conventions on ANYTHING results in nothing but pure and simple chaos. The very fact that hackers (the good kind, for those who might read this and believe I mean the bad, illegal kind of hacking) are involved with the process is a proven disaster waiting to happen, made obvious by the sheer volume of programming languages in existence, each of which has its own set of naming conventions and syntax variations/break-offs that fit the personal style and interests of the developer(s) involved.
So then how do we fix the supposed problem?
We don’t. As far as I can tell, it isn’t a problem.
Instead we allow folks to give things their own name/tag and implement a simple system that this personalized name/tag can then be mapped back to, which can then, of course, be mapped back to the tag placed on this tangible item by someone else.
Of course, the easiest way to map any two or more tags is to make them globally accessible via none other than an Atom (or RSS) web feed, with either anonymous or permission-based access. That lets us control, at a very granular level, who can and cannot see this information, and set an expiration date so that one-off transactions can take place without creating a long-term “trust” policy with someone you barely know but with whom you want to sync calendars on the proper meeting place for an event you both plan to attend.
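To make the mapping idea a little more concrete, here is a minimal sketch in Python, assuming a plain in-memory index. The class name TagIndex, the whitespace/case normalization, and the venue tags are all hypothetical and only for illustration; the feed and permission layers described above are left out entirely.

```python
from collections import defaultdict


class TagIndex:
    """Maps free-form personal tags to a shared identifier, and back."""

    def __init__(self):
        self._ids_by_tag = defaultdict(set)   # tag -> identifiers
        self._tags_by_id = defaultdict(set)   # identifier -> tags

    @staticmethod
    def _normalize(tag):
        # Assumption: tags are compared case-insensitively, ignoring
        # extra whitespace. A real system might do more (or less).
        return " ".join(tag.lower().split())

    def add(self, tag, identifier):
        key = self._normalize(tag)
        self._ids_by_tag[key].add(identifier)
        self._tags_by_id[identifier].add(key)

    def identifiers_for(self, tag):
        return set(self._ids_by_tag.get(self._normalize(tag), set()))

    def tags_for(self, identifier):
        return set(self._tags_by_id.get(identifier, set()))

    def same_thing(self, tag_a, tag_b):
        # Two tags refer to the same item if they share an identifier.
        return bool(self.identifiers_for(tag_a) & self.identifiers_for(tag_b))


index = TagIndex()
venue_id = "2609f82f87ad8372c6863e8df8d9e0c3"  # placeholder identifier
index.add("Main Street Hall, SLC", venue_id)    # one person's tag
index.add("main st. hall  (salt lake)", venue_id)  # someone else's tag
print(index.same_thing("main street hall, slc", "Main St. Hall (Salt Lake)"))
```

Nobody has to agree on a name here; the two tags only need to point at the same identifier.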
A simple, straightforward implementation of this type of URI-based permission setting is showcased in Amazon’s S3 documentation, under the subjects of “Authenticating REST Requests” and “Authenticating SOAP Requests” (two separate sections of this same document). These show how, using the combination of a public access key ID and a private secret key, you can sign a request that combines a URI with, among several other items, the time after which a request made with this query string is no longer valid.
Contained in the above linked document we find this simple example showcasing the process, using a Base64-encoded HMAC-SHA1 hash (signed with your private key) of the various pieces necessary to make this all work, and do so efficiently, effectively, and securely.
The following canonical string contains the various necessary pieces to properly implement this process (please see the docs for a deeper understanding of what this string includes):
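GET\n\n\n1141889120\n/quotes/nelson
The following code snippet then takes this string and creates a properly signed base64+sha1 result:
import base64
import hmac
import sha      # Python 2 module, as in the documentation's example
import urllib
# Sign the canonical string with the secret key using HMAC-SHA1
h = hmac.new("OtxrzxIsfpFjA7SwPzILwy8Bw21TLhquhboDYROV",
             "GET\n\n\n1141889120\n/quotes/nelson",
             sha)
# Base64-encode the digest, then URL-encode it for use in a query string
urllib.quote_plus(base64.encodestring(h.digest()).strip())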
The result is then URL encoded and combined with the same expiration date and public AWSAccessKeyId to make:
http://s3.amazonaws.com/quotes/nelson?AWSAccessKeyId=44CF9590006BF252F707&Expires=1141889120&Signature=vjbyPxybdZaNmGa%2ByT272YEAiv4%3D
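For anyone who wants to try this on a current interpreter, here is a rough Python 3 sketch of the same steps. The function name sign_query_url is my own invention, and the field order in the canonical string (verb, empty Content-MD5, empty Content-Type, expiry, resource) is my reading of the S3 documentation, so treat it as an assumption and check the docs before relying on it. The keys are the sample values quoted above, not real credentials.

```python
import base64
import hashlib
import hmac
import urllib.parse


def sign_query_url(secret_key, access_key_id, resource, expires):
    """Build a query-string-authenticated GET URL for `resource`."""
    # Canonical string: verb, empty Content-MD5, empty Content-Type,
    # expiry (Unix time), then the resource path.
    canonical = "GET\n\n\n{}\n{}".format(expires, resource)
    digest = hmac.new(secret_key.encode("utf-8"),
                      canonical.encode("utf-8"),
                      hashlib.sha1).digest()
    # Base64-encode the raw digest, then URL-encode it for the query string.
    signature = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return ("http://s3.amazonaws.com{}?AWSAccessKeyId={}&Expires={}&Signature={}"
            .format(resource, access_key_id, expires, signature))


url = sign_query_url("OtxrzxIsfpFjA7SwPzILwy8Bw21TLhquhboDYROV",
                     "44CF9590006BF252F707",
                     "/quotes/nelson",
                     1141889120)
print(url)
```

The printed URL can be compared against the one quoted above. The important part is that the expiry time is inside the signed string, so nobody can extend it without the secret key.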
—
NOTE: In regard to a “signing” mechanism, it would seem to me that SAML would offer a premier solution to just such a situation, as this, among other things, would allow for a proven ability to implement layered security, defined in essence as proxy representation of an entity; e.g., O’Reilly would represent the proxy that would make requests for and on behalf of its employees, without revealing any personal details of the employee making the request. This, of course, is just the tip of the SAML iceberg, but a topic for another post nonetheless.
—
Of course, there are still some areas that can be seen as possible showstoppers. Of all of the possibilities, it would seem to me that the primary area of concern will more than likely stem from the fact that:
- While a TON of work has been done, and plenty of this information is available, there is still no way to map EVERY tangible item from around the globe to a globally unique ID.
There are others, but let’s take this one in particular and see what we can do with it:
The United States Postal Service (and I assume that many other postal services across the world offer something similar) offers web-based APIs that allow you to map a general US or international address into the canonical format found in the system database. In fact, there’s even an XML-based API available. To access this, you first need to sign up for the service.
In the meantime, let’s take a simple example that can be implemented without first signing up for API access to the system (which is free, by the way, although the services that you would normally pay for, e.g. postage/stamps, are obviously charged at their specified rate).
If you visit the web interface to their zip code/address lookup tool and enter a specified address, in a way that makes the most sense to you, in return you will receive their canonical interpretation of the address you entered. While error is always possible, I do know that the system they have in place is about as good as software of this type gets. It’s RARELY unable to convert even the most hacked-up text offered up by human interpretation. That said, there are always going to be exceptions to the rule.
Even still, creating one’s own GUID based on a physical address is just that… simple: a GUID from which, in and of itself, it would be impossible to reconstruct the data it represents, yet which allows those given permission to access what it does represent.
Without revealing the value returned by this specified interface for my own address here in SLC, by placing this information into a file called “guid-physical-us-address” and then running md5sum from the command line, I will receive in return a GUID string that represents this file’s contents.
$ md5sum guid-physical-us-address
2609f82f87ad8372c6863e8df8d9e0c3 *guid-physical-us-address
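The same digest can be produced from code rather than from the command line. Here is a small sketch, assuming we hash the canonical address text directly; the function name address_guid, the made-up address, and the normalization step (uppercasing and collapsing whitespace) are my own assumptions. Note that because of that normalization, the result will not generally match what md5sum prints for a file, since md5sum hashes the file's bytes exactly as stored, trailing newline included.

```python
import hashlib


def address_guid(canonical_address):
    """Return the MD5 hex digest of a canonical address string."""
    # Assumption: uppercase and collapse whitespace so that trivial
    # formatting differences do not change the digest.
    normalized = " ".join(canonical_address.upper().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Made-up address, for illustration only.
a = address_guid("123 Example St\nAnytown UT 84000-0000")
b = address_guid("123  example st   Anytown  UT 84000-0000")
print(a == b, len(a))
```

Two people who start from the same canonical address should arrive at the same 32-character string without ever exchanging the address itself.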
By then taking this same file, posting it to my web server, and invoking the mentioned set of rights controls, I can publish this string value to the various databases that offer indexing services of this type, which can in turn offer various advanced forms of search that use the name I’ve personally applied to this file as the way to index this information. Maybe I ask them to index this information under “M. David Peterson, Salt Lake City, UT, Home Address, URI=http://domain.info/access/2609f82f87ad8372c6863e8df8d9e0c3”. (Even better, of course, but beyond the topic of discussion, would be to use Matrix URIs, something I was reminded of yesterday via a post by Mark Nottingham to the W3C public-web-http mailing list.)
At this stage, anybody interested in gaining access to this information can then search using their own idea for how I may have this information tagged, and make a determination from the results which of the choices is most likely to be accurate. They would then make a request to the related URI using something like:
https://domain.info/access/2609f82f87ad8372c6863e8df8d9e0c3?requestor=https://requestorsdomain.info/query?whoareyou
which my system software could then use to determine whether the information returned by this request query is something I feel comfortable enough with to share the requested information. It also does two other things:
- Keeps ME in control of the associated data
- If that information changes, I create one new file and update the indexing system, with an associated connection identifier in the provided metadata pointing to the new value of the string created by md5sum. The change then propagates to those who have permission to access this information and who have, of course, subscribed to my personal Atom/RSS feed, which will notify their own systems based on their own preferences as to whether or not they want to be updated in the first place. (You know, ex-girlfriends who now hate… dislike… are no longer interested in pursuing any sort of relationship with me, people who think I wear silly hats and smile too much and therefore don’t care to be updated by any changes in my postal, email, or URI-based address, you know, that kind of stuff ;) :D
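The decision step for a request URI like the one above could look something like the following sketch. Everything here is hypothetical: the GRANTS table, the is_allowed function, and the idea of storing an expiry time per requestor are just one way to picture it. It also only checks who the requestor claims to be; a real system would need to verify that claim (for example with a signed request) before trusting it.

```python
import time
from urllib.parse import urlsplit, parse_qs

# Hypothetical grant table: GUID -> {requestor URI: expiry as Unix time}.
GRANTS = {
    "2609f82f87ad8372c6863e8df8d9e0c3": {
        "https://requestorsdomain.info/query?whoareyou": 2000000000,
    },
}


def is_allowed(request_url, now=None):
    """Decide whether the requestor named in the URL may see the data."""
    now = time.time() if now is None else now
    parts = urlsplit(request_url)
    # The GUID is the last segment of the path.
    guid = parts.path.rstrip("/").rsplit("/", 1)[-1]
    requestors = parse_qs(parts.query).get("requestor", [])
    if len(requestors) != 1:
        return False  # missing or ambiguous requestor
    expiry = GRANTS.get(guid, {}).get(requestors[0])
    # Allowed only if a grant exists and has not yet expired.
    return expiry is not None and now < expiry


request = ("https://domain.info/access/2609f82f87ad8372c6863e8df8d9e0c3"
           "?requestor=https://requestorsdomain.info/query?whoareyou")
print(is_allowed(request, now=1500000000))  # before the grant expires
print(is_allowed(request, now=2100000000))  # after the grant expires
```

Because the grant table lives on my side, revoking access or letting a one-off grant lapse never requires touching the public index.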
Of course, this example adds a level of privacy and security control, something that any particular public establishment is not necessarily going to be quite as concerned with. Nonetheless, even with privacy added to the mix, the process is simple enough to implement, and accurate enough, that with the various advanced search capabilities of the top-tier search engines, fine-tuning the search process would, or at the very least should, be mere child’s play.
—
In a nutshell, that’s it. Again, I’m not suggesting this is absolutely something that will work; rather, I’m suggesting that maybe we are trying too hard to apply standardized naming conventions to the problem at hand, when it’s obvious that ANY attempt at such standardization is simply not going to work to any level of success that could be seen as a globally acceptable solution.
Please… Have at it. Poke as many holes as you can into this and let’s see if, after doing so, it still has the ability to float.
She’s all yours :)
—
NOTE: The purpose, of course, for taking the effort to create the USPS-based canonical address and then create an MD5 hash of this file is, among other things, that this allows for various possibilities for “decentralizing” the indexing system, as well as reassociating data files with the owner’s specified URI, without attaching/including this information directly in or on the file itself. For example, by running md5sum against that same file, you can then use a reverse-lookup styled query to gain access to the associated URI. Of course, you could also use SHA1 for this, but I’m not 100% convinced that the extra pieces that come as part of SHA1 are even all that necessary, but then again, I’m not 100% convinced they’re not, either.
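As a rough picture of that reverse lookup, here is a sketch that assumes the index is nothing more than a dictionary keyed by digest. PUBLIC_INDEX, register, and reverse_lookup are hypothetical names, the address bytes are made up, and the domain is the placeholder used earlier in this post.

```python
import hashlib

# Hypothetical public index: MD5 hex digest -> owner's current access URI.
PUBLIC_INDEX = {}


def register(contents, base_uri):
    """Register a file's digest and return the access URI stored for it."""
    digest = hashlib.md5(contents).hexdigest()
    PUBLIC_INDEX[digest] = base_uri + digest
    return PUBLIC_INDEX[digest]


def reverse_lookup(contents):
    """Return the URI registered for these exact bytes, or None."""
    return PUBLIC_INDEX.get(hashlib.md5(contents).hexdigest())


# Made-up file contents, for illustration only.
data = b"123 EXAMPLE ST\nANYTOWN UT 84000-0000\n"
uri = register(data, "https://domain.info/access/")
print(reverse_lookup(data) == uri)          # same bytes find the owner's URI
print(reverse_lookup(b"something else"))    # unknown bytes find nothing
```

The file itself carries no pointer to its owner; anyone holding the same bytes can recompute the digest and ask the index where to go.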
—
Update: I just finished reading an interesting post by Danny Ayers regarding the same article from Jon Udell. Here is a near-complete copy of the comment I just left at the bottom of his post:
Hey Danny,
I *think* there might be a simpler way to take any particular combination of metadata and, while keeping this and any other associated data under our own control such that giving this data out to any particular person remains our own choice, provide a way for anybody to associate any type of tag they might prefer with this information.
http://www.oreillynet.com/xml/blog/2006/04/mapping_data_between_domains_a.html
While the example used is obviously flawed (it would be a little too easy for someone to simply purchase the information contained in the database of the USPS, or any other particular country’s postal system, and then map this data to the MD5 checksum to then query any public indexing service to gain access to the metadata associated with whoever happens to claim that particular address), it seems to me that through a combination of various pieces of semantic information specified for a given tangible item (person, place, thing) you can then use OWL to associate the MD5 or SHA1 checksum with whatever your current URI happens to be,
2609f82f87ad8372c6863e8df8d9e0c3
owl:sameAs https://my-current-uri.info/access?2609f82f87ad8372c6863e8df8d9e0c3
maintaining complete control of that information, while at the same time allowing any public indexing service the ability to associate any metadata that is specified when the MD5 or SHA1 is originally registered with their system. In other words, making it impossible for someone to associate any particular checksum value with a URI and metadata over which they don’t currently maintain some level of control. When someone searches for a particular set of information, and decides “I’m pretty sure that’s the person, place, or thing I am looking for,” they can then send a query to this URI:
https://my-current-uri.info/access/2609f82f87ad8372c6863e8df8d9e0c3?https://requestors-current-uri/aboutme/metadata
which my system software can then use to determine if this is someone with whom I want to share the associated information. What kind of data a system would need to make this determination is an obvious piece of the puzzle: it is necessary in order to gain the benefits of a trust-based system without applying so much burden that gaining the trust to access a set of information becomes a long and detailed process that no one will be willing to put up with at a large scale. But this is a separate topic, and therefore a separate conversation.
—
Any comments from the community at large?



