This idea of creating additional "unauthorized" interfaces to various web sites is fairly widespread, especially among the advanced users who make up O'Reilly's target audience.
We created one such program in our market research group, a program that gives our editors and product managers faster access to competitive information from Amazon. We call this program amarank, since it returns a list of books, sorted by Amazon rank and number of positive reviews, for some set of search criteria. Every publisher, and any bookstore or library worth its salt, uses Amazon for bibliographic research. But what a waste of time to click around in a browser! Amarank lets us ask questions like: "Show me the top 30 books on Java at Amazon" or "How do books on Perl stack up against books on Java?" or "Show me all the books on Jini published since January."
Because we don't want the user to wait while amarank crawls Amazon's site, and we want the data in a form that can be subjected to further processing, the output is sent to a mail program, and ultimately, a spreadsheet such as Excel or Gnumeric.
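To make the shape of that output concrete, here is a minimal sketch in Python of the final step only: ordering records and writing them in a form a spreadsheet can open. It is not amarank's actual code. The record fields, the sample values, and the tie-breaking rule (best rank first, then most positive reviews) are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical records; titles and numbers are placeholders, not real Amazon data.
books = [
    {"title": "Book A", "rank": 1520, "positive_reviews": 12},
    {"title": "Book B", "rank": 310, "positive_reviews": 40},
    {"title": "Book C", "rank": 310, "positive_reviews": 55},
]

def rank_books(records, top_n=30):
    """Order by sales rank (lower is better), then by positive reviews (higher is better)."""
    ordered = sorted(records, key=lambda b: (b["rank"], -b["positive_reviews"]))
    return ordered[:top_n]

def to_csv(records):
    """Render the records as CSV text that Excel or Gnumeric can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "rank", "positive_reviews"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

report = to_csv(rank_books(books))
print(report)
```

The CSV text could then be attached to a mail message, which keeps the slow crawl out of the user's way.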
Some other examples:
- The Finance-QuoteHist Perl module fetches historical quotes from such sources as The Motley Fool and FinancialWeb.
- YahooChart is a Perl module that grabs stock charts from Yahoo! Finance.
- DeepLeap is a new online assistant that, among its many talents, provides a quick search through Google and local movie information via Yahoo! Movies. It can also send information from any web page to your Palm Pilot, reformat a page for your printer, or notify you of page changes via instant messenger, email or pager.
Now, many of these programs are written in Perl, because it's so good at dealing with the barely-structured HTML text that is returned as the data from a web query.
Jon's contention, and mine, is that this whole area is about to explode, as XML takes off, giving more structure to data, and as web sites realize that they can benefit from exposing a more explicit API for access by other web programs.
In what Jon calls the second generation object web, web sites will provide explicit APIs for interaction with other programs.
Jon goes on:
"Last year, Dun and Bradstreet reengineered their whole online business around the idea of wrapping up back-end data sources and middle-tier components in XML interfaces that are accessible via HTTP or HTTPS. Traditionally, D&B customers bought packaged reports, not raw data, and the creation of a customized report was a slow and painful process. The new scheme turned the old one upside down. It defined an inventory of data services, and empowered developers to transact against those services using an XML-over-HTTP protocol. In their case, the mechanism was based on the webMethods' B2B server, but conceptually that's not too different from XML-RPC. Prior to last year, developers who needed custom D&B feeds had to ask D&B to create them, which took forever, and then had to maintain private network links to D&B and speak a proprietary protocol over those links. In the new scheme, D&B publishes a catalog listing a set of data products. A developer fetches the catalog over the Web, using any XML-oriented and HTTP-aware toolkit, and writes a little glue code to integrate that feed into an application."
We've put these ideas to work in a new tool we've built at O'Reilly, Rael Dornfest's RSS aggregator, which he calls Meerkat.
For those of you who don't know RSS, it's an XML-based rendering of stories that a web site wants to make available for syndication. These stories are composed of a title, a link (back to your site), and an optional description or blurb. Anyone who wants to can come along and grab these stories for incorporation into their own sites -- with links back to the full stories on the originating site.
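For readers who would rather see it than read about it, here is a minimal Python sketch of pulling stories out of an RSS file. The feed is a made-up RSS 0.91-style sample (the site and stories are invented), and `parse_stories` is a hypothetical helper, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; example.com and both stories are invented for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Stories from an imaginary site</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1</link>
      <description>A short blurb about the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/stories/2</link>
    </item>
  </channel>
</rss>"""

def parse_stories(rss_text):
    """Return one dict per story: title, link, and description (None when absent)."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description"),
        })
    return stories

for story in parse_stories(SAMPLE_RSS):
    print(story["title"], "->", story["link"])
```

The second story has no description, which is why the parser treats that field as optional.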
Meerkat takes this a few steps further. Rather than just using RSS to load individual stories into various web pages, Rael built a tool that helps our editorial team manage the flow of syndicated content. Meerkat's back-end searches the net for RSS files representing technology/computer/geek/science-related content. These files are stored in a MySQL database, with an editorial interface organized along category/channel/chronological lines and sporting the power of regular expression searches. This interface allows editors to select stories and dispatch them to individual target publications. Rael has also made a public version of this interface available as a general purpose RSS reader.
I'm not going to go into the details of Rael's API. What I want to call out is how important his work was, both from an architectural point of view and from the point of view of building a web site that is consonant with the collaborative computing culture. More important than the fact that he built the site with open source tools like MySQL and PHP is that he engineered Meerkat not only so it could meet his objectives, but also so that it could be used in ways he did not know or intend.
What is going to make this whole area explode is the emergence of standards for XML-based metadata exchange and service discovery. Web-based services will publish their APIs as XML, and will pass data back and forth as XML, over existing Internet protocols such as HTTP and SMTP. This will help to recreate the loosely coupled architecture that I suggested earlier was so important to the original flowering of open source culture around UNIX.
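As a small illustration of what "passing data back and forth as XML" can look like, the sketch below uses XML-RPC, one of the protocols named in this talk, via Python's standard `xmlrpc.client` module. The method name `catalog.search` and its arguments are hypothetical; no real service is implied, and nothing is sent over the network.

```python
import xmlrpc.client

# "catalog.search" and its arguments are hypothetical; no real service is implied.
# dumps() encodes a method call as the XML document that would be POSTed over HTTP.
request_xml = xmlrpc.client.dumps(("Java", 30), methodname="catalog.search")
print(request_xml)

# The receiving side decodes the same document back into native values.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

Neither side needs to know what language or platform the other is written in, which is the loose coupling at issue.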
Once again quoting Jon Udell:
"As Web programmers, we're all in the game of creating -- and using -- network services. A Web server running CGI scripts makes a pretty shabby ORB, but compared to what was available before 1994, it's a wonderful thing. It will get a lot more wonderful with some fairly basic improvements. AltaVista, and Yahoo, and every other site that offers services to the Web ought to be implementing those services, internally, using XML interfaces. When the client is a browser, the XML can be rendered into HTML. When the client is a program, the XML can be delivered directly to that program."
Of course, there are an awful lot of web sites out there that will need to be retrofitted so that they can be used more cooperatively.
In that regard, I saw a very interesting startup here on the JavaOne show floor yesterday, an Irish company called Cape Clear. If I understand it right, they do introspection on JavaBeans or CORBA components, and express the resulting interfaces as XML, which thus makes JavaBeans and CORBA accessible to XML transport protocols like XML-RPC and SOAP.
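I haven't seen how Cape Clear does this, so what follows is only a sketch of the general idea, using a Python class in place of a JavaBean or CORBA component. The `BookCatalog` class, the `describe_interface` helper, and the XML vocabulary it emits are all hypothetical.

```python
import inspect
import xml.etree.ElementTree as ET

class BookCatalog:
    """Hypothetical component standing in for a JavaBean or CORBA object."""

    def find_by_title(self, title):
        return []

    def top_sellers(self, subject, limit):
        return []

def describe_interface(cls):
    """Introspect a class and describe its public methods as an XML document."""
    root = ET.Element("interface", name=cls.__name__)
    # getmembers() returns (name, function) pairs sorted by name.
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and special methods
        method = ET.SubElement(root, "method", name=name)
        for param in inspect.signature(func).parameters:
            if param != "self":
                ET.SubElement(method, "param", name=param)
    return ET.tostring(root, encoding="unicode")

print(describe_interface(BookCatalog))
```

A description like this is what lets a generic XML-RPC or SOAP layer route incoming calls to an existing component without hand-written glue for each method.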
I was very excited to see that Sun just yesterday threw its support behind SOAP. IBM also just released Java-based SOAP support, and what's more, contributed the code to the XML-Apache project. XML is busting out all over, and that's a good thing!