Searching on the Web has long been caught in a web of spiders and crawlers. From early crawlers like Lycos to the state-of-the-art Google, Web search engines have always suffered from a pretty severe lag time. The gap between the time a new document is posted on the Web and the time you can find it in Google is often a period of weeks.
As Gene Kan discovered, though, peer-to-peer networks like Gnutella -- in which nodes on the network receive live queries -- are ready-made for something better than the current state of Web searching. Rather than searching an index of what was there two weeks ago, Kan's InfraSearch technology allowed for searching what is there right now.
In May, InfraSearch was sold to Sun Microsystems, and the InfraSearch team joined Sun's Project JXTA. On Monday, JXTA Search, the result of InfraSearch's work with JXTA, was released at JavaOne. OpenP2P.com contributors Rael Dornfest and Kelly Truelove talked with Kan and Steve Waterhouse, who joined Project JXTA as director of engineering when InfraSearch was acquired.
Rael Dornfest: I thought maybe we might start with a little bit of history. I know that you guys are probably sick of the history, but at the same time most people don't know where InfraSearch came from, its relationship to Gnutella, where it might have diverged, where it fits into Sun's JXTA framework, and so on. Perhaps we could start there.
Gene Kan: Well, InfraSearch started from a bunch of my friends and me sitting around and thinking of how we might prove that Gnutella was interesting for more than music sharing and the like. We saw that, at bottom, Gnutella really was a distributed searching network, and clearly the way to demonstrate this was to build a search engine using Gnutella, and you know that would be the pure expression, right?
So we wanted to demonstrate that Gnutella doesn't care about the data that's trafficked over the network, and that the software and the protocol are completely agnostic to what information is carried. So we thought, well, wouldn't it be awesome if we could just make every node on the network participate not as a file-sharing client but rather as a distributed search client, so that we could tie together boxes from effectively hundreds of different information providers.
And I was really excited about this because the company that I was working for at the time had this problem where all of their data was stored in an application database. None of that was exposable to a traditional search engine or a traditional crawler-based search engine, because the crawlers would get scared when they see the question mark in the URL. So there was this huge mine of data out there that wasn't being accessed, and, you know, pretty much if you look in the URL field of your browser, you can tell that this is a huge problem where the question mark in the URL is preventing crawlers from accessing a large portion of the Web's data.
And so we threw up a prototype that was based entirely on the Gnutella protocol, where you could type in algebraic expressions or simple arithmetic expressions and have a PC calculate the results for you, and a few others. So, you know, we thought this was all pretty cool, and we had a few Valley types that were interested in the technology, and we decided that this would be a pretty cool thing that we should develop further. And that's why we started InfraSearch, the company. Since our inception as a company, we decided that we proved our point with Gnutella and we moved to a proprietary protocol that was specifically tailored to the task of generalized searching, and it's of course XML-based, so that we are more compatible with what's out there on the Web.
Kelly Truelove: So, Gene, there were a couple of aspects of Gnutella. One was that the queries were passed from peer to peer, and the other is that each peer evaluated the query as it saw fit.
Kelly: The InfraSearch prototype -- it's clear from the way it works that, you know, each peer evaluated the query by running it against some site search or something like that, but how was the distribution of queries handled?
Gene: In the prototype, the distribution of queries was Gnutella. You know, it was entirely a Gnutella backend, so you would go to infrasearch.com and a custom Web server would throw up this little search box. When you entered your query, the front end of that Web server would accept your query and the back end would shuttle it out of Gnutella. So the front end of it was HTTP and the back end of it was Gnutella protocol.
Rael: The public demo was rather reminiscent of the early days of Web and WAIS integration. I don't know if WAIS goes back too far for you.
Gene: I never played with WAIS.
Rael: WAIS was basically a search protocol where you could define how your node understood and responded to queries -- sound familiar? While it wasn't a distributed search like Gnutella (it was point to point), your demo was, for me, a nice return to that concept.
Gene: That's cool, didn't know that.
Rael: So, continuing on, how did you get hooked up with Sun and JXTA?
Gene: Well, we were building our technology and in the meantime the JXTA team started to scour the peer-to-peer world, talking with everybody and figuring out what they should do. And they started to talk to us and, basically everything that Ingrid and Mike and Li said, we thought, "Well, yeah, that's pretty much what we want to do, too." And it just evolved, and eventually that turned into an acquisition.
Rael: So that was on the business side. On the technical side, what changes were made along the way? Did you diverge from Gnutella completely? What did you carry along with you?
Gene: Once we incorporated as a company, and we started to really think seriously about search, we stopped looking at the Gnutella protocol because we decided that we could develop something that was much more specifically tailored to the needs that we wanted to address, that was XML-based, and everything like that. Certainly, we've learned a lot with Gnutella, and all those lessons are reflected --
Kelly: So you derived from Gnutella the inspiration of, you know, let each peer process the queries as it sees fit?
Kelly: And the notion of queries being distributed out to peers, although not necessarily the same way that Gnutella distributes them?
Kelly: Not necessarily in the same Gnutella bucket brigade distribution?
Kelly: Why did you move away from Gnutella?
Gene: Probably one of the most compelling reasons was that we thought, in a commercial endeavor it wasn't a good approach to, you know, suppose you're Company A. Do you really want to carry competitor B's query traffic? Or his result traffic? Probably not, right?
Steve Waterhouse: When I joined the company we weren't using the Gnutella protocol at that point. Essentially, what we designed and what we've now completed while at Sun was a distributed search network or a framework for distributed search, and the first area that we targeted this at was the Web -- primarily, deep content contained inside application servers and databases connected to Web servers that is typically not accessed well by a standard crawler approach to searching. You know, all the big search engines essentially go out and crawl the pages, stick them in a big index, and then when you go to search, you're searching against an old piece of data, if you like.
So the alternative idea that the founders of InfraSearch had was to distribute this query out to the edges of the network and let the intelligence of the peer that it's being sent to process it in whatever format was appropriate for that query, and respond. And the way that we did this distribution was using this system within the network which we called "hubs." We have a network of hubs, and when you post a query into the hub, or into the network, it gets picked up by one of the hubs and the hub says to itself, "Is there any provider that I know about that can handle this query?" And if it doesn't know about one, and you can have various different rules for how it should process that, it also shuttles it off to the other hubs and they continue to process it until hopefully it gets answered by one of the different providers. The provider then, in the case of search, runs a query against its database, or index server or whatever it's running, and then returns the result back to the hub and then ultimately back to the requester.
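The hub behavior Steve describes here can be sketched roughly as follows. This is a minimal illustration in Python, not JXTA Search code: the names `Provider`, `Hub`, `can_handle`, `answer`, and `route` are all hypothetical, and the forwarding rule (try local providers first, then ask the other hubs) is only one of the "various different rules" he mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Provider:
    """Hypothetical information provider sitting at the edge of the network."""
    name: str
    can_handle: Callable[[str], bool]  # does this provider understand the query?
    answer: Callable[[str], str]       # run the query against the provider's own data


@dataclass
class Hub:
    """Hypothetical hub: knows some providers and some other hubs."""
    name: str
    providers: List[Provider] = field(default_factory=list)
    peers: List["Hub"] = field(default_factory=list)

    def route(self, query: str, seen: Optional[Set[str]] = None) -> List[str]:
        seen = set() if seen is None else seen
        if self.name in seen:          # this hub already saw the query: avoid loops
            return []
        seen.add(self.name)
        # "Is there any provider that I know about that can handle this query?"
        results = [p.answer(query) for p in self.providers if p.can_handle(query)]
        if results:
            return results
        # Otherwise shuttle the query off to the other hubs until one answers.
        for hub in self.peers:
            results = hub.route(query, seen)
            if results:
                return results
        return []


# A toy calculator provider, in the spirit of the arithmetic demo.
calc = Provider(
    name="calc",
    can_handle=lambda q: q.replace(" ", "").replace("+", "").isdigit(),
    answer=lambda q: str(sum(int(term) for term in q.split("+"))),
)
edge_hub = Hub("edge")
core_hub = Hub("core", providers=[calc])
edge_hub.peers.append(core_hub)
core_hub.peers.append(edge_hub)

print(edge_hub.route("2 + 3"))
```

The requester only talks to `edge_hub`; the answer comes back along the same path, from provider to hub to requester, as in the description above.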
At the stage when we were in discussions with Sun, we had written a purely Web-focused engine -- in other words, HTTP, we were using Java servlets -- but at the same time it was very distributed in the sense that all the different components could be placed around the networks in different ways. And we also designed this to be relatively network agnostic, in terms of the messaging used, and also agnostic in the type of transport, and so this actually fit in really neatly to what we then started working on at Sun, which is part of the JXTA project.
What the leaders of the JXTA group asked us to do was, make this work, not only on the Web but also to make search within a distributed network very efficient. And so those are the two sides of JXTA Search: both deep and wide. Deep in the sense of finding content at the edges, deep into the database on the edge of the Web network; and wide in the sense of helping peers shuttle queries around more efficiently because of a varied distributed network.
Kelly: Let me follow up with a little technical point. The query goes through the hub, the hub sends it out to the appropriate information providers, and the provider can respond directly back to the node issuing the query?
Steve: One of the things that we thought about in terms of providers being able to respond directly back to the client is that the client can just send back the end point identifier for itself when it requests the query, and then the provider can respond directly to it, but having said that, we see some advantages in the hubs.
Rael: Could you have one be the default, subject to overriding?
Steve: Right now the default is that it responds back to the person that asked it or responds back to the hub, but the protocol's actually, we call it "symmetric," in the sense that it doesn't matter whether an individual client sends a request to a provider or whether the client sends a request to a hub. They all look the same to the hub, and the hub looks the same to the client and provider and so on. The hub, instead of thinking of it as a server, you can think of it like an underground peer that has the job of, you know, working out where is the best place to send these queries. If the client knows where to send the query, then it can be as efficient to send it directly to the provider. But we're targeting the case where clients don't necessarily know the best place to send the query, and there the hub does a good job. Does that make sense?
Kelly: Right, so one way of looking at it is as a variation on the super-peer concept?
Steve: Yeah, I guess so.
Kelly: It's metadata based routing.
Kelly: Could you talk a little more about hub-to-hub communication. If I have a query and the hub doesn't recognize it --
Steve: I think one of the challenges, Kelly, in trying to design an absolutely neat framework for something like searching -- it's essentially an ambitious thing to do -- but you're constantly fighting the battle with yourselves of how much to specify and how much to leave unspecified, and we thought a lot about how to do hub-to-hub communication. Some of the issues are things like how does a hub register with another hub? Does it take some average of all the different provider data and aggregate them together and then publish that to the other hubs? Or does it take all the things that it knows about all the providers?
And so we've got a number of different strategies -- not only how to describe hubs to other hubs, but also how to best send queries from one hub to the other. We definitely got some of the concepts from Gnutella -- things like time-to-live for queries and fan-out so you can specify for how many providers, which includes hubs, to send this thing to. But we've essentially left it up to the implementation, so whoever wants to be running a hub out there -- and of course the source code's available -- so people can run hubs.
Whoever's running them can do a couple different things. You could, for example, decide to send all queries to all the hubs that you know about, or you could decide if they can satisfy the query, they're not going to send it to somebody else ... So the short answer is we have the framework and the protocol for the bad things to not happen -- some of the different routing problems Gnutella had in the early days -- but we left the rest of it to the user, or to the runner of the hub.
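To make the two Gnutella-derived controls concrete, here is a small sketch of how time-to-live and fan-out bound the spread of a query. It assumes a simple breadth-first forwarding policy; the function name `flood`, the `links` table, and the hub names are hypothetical, and a real hub operator could choose a different strategy, as Steve says.

```python
from collections import deque
from typing import Dict, List


def flood(links: Dict[str, List[str]], start: str, ttl: int, fan_out: int) -> List[str]:
    """Return the hubs a query reaches, in visit order.

    ttl     -- how many more hops the query may travel from `start`
    fan_out -- the most neighbors any one hub forwards the query to
    """
    visited = [start]
    frontier = deque([(start, ttl)])
    while frontier:
        hub, remaining = frontier.popleft()
        if remaining <= 0:            # time-to-live exhausted: stop forwarding
            continue
        forwarded = 0
        for neighbor in links.get(hub, []):
            if forwarded >= fan_out:  # fan-out limit reached for this hub
                break
            if neighbor in visited:   # never send the same query to a hub twice
                continue
            visited.append(neighbor)
            frontier.append((neighbor, remaining - 1))
            forwarded += 1
    return visited


links = {"A": ["B", "C", "D"], "B": ["E"], "C": ["F"], "E": ["G"]}
print(flood(links, "A", ttl=2, fan_out=2))
```

With `ttl=2` and `fan_out=2`, hub D is skipped by the fan-out limit and hub G is never reached because the query expires after two hops.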
Rael: Let's talk a little about JXTA and Web Services.
Steve: It's funny talking about the phrase "Web services" because the Web has so many connotations. I've been trying to figure out a better term -- something like "distributed services" or "network services," sort of agnostic as to whether it's the Web or Gnutella or JXTA. It doesn't really matter. The underlying issues there are: Can I support this thing in transport or does somebody have to pop through the transport for me to get into the other network? Can I understand the same messaging?
Rael: I've actually floated the term "network services," too. It's a double-edged sword, in the sense that "net services" are usually associated with things like NFS, DNS, NNTP, etc., so it can be a little overloaded. On the other hand, it should be fine in terms of, are we inventing some of the new net services? Will NFS, DNS, messaging, presence management be the new group of Net services?
Kelly: There's a sort of irony in that some of these different terms start to get overloaded pretty fast. I mean even "peer to peer" -- I tend to think about it more now, you know, as a framework for enabling easy communication between things.
Gene: Just to describe a little bit of our protocol. Essentially we have three kinds of messages. We have the request message, and we have a response message format, and we have a registration message format. And the request and responses are pretty clear. They're either coming from a client into a hub, or it could be a client into a provider. And the response is either coming from a provider to the client directly or from provider to the hub and then from the hub to the client and so on.
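As a rough illustration of the three message kinds, the sketch below builds toy XML envelopes with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`request`, `response`, `register`, `query`, `result`, `keyword`, `id`, `provider`) are made up for this example and are not the actual JXTA Search schema, which the interview does not spell out.

```python
import xml.etree.ElementTree as ET
from typing import List


def make_request(query_id: str, text: str) -> str:
    """Client -> hub, or client -> provider (hypothetical envelope)."""
    root = ET.Element("request", {"id": query_id})
    ET.SubElement(root, "query").text = text
    return ET.tostring(root, encoding="unicode")


def make_response(query_id: str, results: List[str]) -> str:
    """Provider -> client, or provider -> hub -> client (hypothetical envelope)."""
    root = ET.Element("response", {"id": query_id})
    for item in results:
        ET.SubElement(root, "result").text = item
    return ET.tostring(root, encoding="unicode")


def make_registration(provider: str, keywords: List[str]) -> str:
    """Provider -> hub: describes what the provider can answer (hypothetical envelope)."""
    root = ET.Element("register", {"provider": provider})
    for word in keywords:
        ET.SubElement(root, "keyword").text = word
    return ET.tostring(root, encoding="unicode")


def kind(message: str) -> str:
    """Tell the three message kinds apart by their root element."""
    return ET.fromstring(message).tag


print(kind(make_request("q1", "2 + 3")))
```

Because a request looks the same whether it is addressed to a hub or to a provider, the same `make_request` output can be sent to either.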
The registration component is interesting. You can compare this with a publish and subscribe model, if you're looking for analogies in the broader distributed network space. Each node registers with the hub -- although the registration actually fits on the node itself and so it could be registering it with a client, for all it knows. And the registration describes the metadata about that node, what it wishes to publish to the network about the content of its most frequently accessed database fields.
Those concepts describe metadata. In other words, some summary of the textual content. But alternatively you could think about it and our resolver -- which is the indexing component of the hub, the indexing and matching component of the hub -- the way it does the indexing is it indexes not only the textual content but also the XML tags that surround that textual content.
So you can play around with the idea that instead of throwing a query into our network and expecting a response, you could throw, for example, an XML-RPC request. You could say, "I want this request to be answered by somebody," and then the hub would look at that request and it would say, "Well, which providers that I know about first of all have even registered for this XML structure, and secondly have either explicitly or just as a wildcard registered for a particular method?" And then it would get routed to different providers around the network who could qualify that; the idea there is you can build this distributed, RPC mechanism without the clients or those other providers having to know exactly who's who.
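The routing idea in this answer can be sketched as a tiny matcher. The XML-RPC request shape (`methodCall` containing `methodName`) follows the XML-RPC specification; everything else, including the `Registration` class, its fields, and the `match` function, is a hypothetical simplification of whatever the real resolver does.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Registration:
    """Hypothetical registration: what a provider asked the hub to send it."""
    provider: str
    root_tag: str        # the XML structure the provider registered for
    method: str = "*"    # an explicit method name, or "*" as a wildcard


def match(registrations: List[Registration], request_xml: str) -> List[str]:
    """Return the providers whose registration fits this request."""
    root = ET.fromstring(request_xml)
    method = root.findtext("methodName", default="")
    return [
        reg.provider
        for reg in registrations
        # First: has the provider registered for this XML structure at all?
        if reg.root_tag == root.tag
        # Second: did it register for this method, explicitly or by wildcard?
        and reg.method in ("*", method)
    ]


registrations = [
    Registration("calc", "methodCall", "math.add"),
    Registration("catchall", "methodCall"),
    Registration("news", "request"),
]
call = "<methodCall><methodName>math.add</methodName></methodCall>"
print(match(registrations, call))
```

The client never names `calc` or `catchall`; it only emits the request, and the hub decides who is qualified to receive it.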
Kelly: I just wonder where things are going next with JXTA Search.
Steve: A good question. Where JXTA's going next is we're giving it to you guys. We're outsourcing it. And the thing that we tried to do with JXTA Search is provide first of all an official searching framework for JXTA, and secondly try to demonstrate to people the advantages of having open and common standards for things like querying and responding. It's pretty amazing to me that we've got this far on the Web and we don't really have any common formats for this stuff. And as a result it's very difficult to build applications, and what we'd like to do is help people build these applications on top of the JXTA Search protocol, or a protocol that evolves from that. And we'd like to work with the community to improve it and make it a bigger, better, more successful thing, hopefully.
As you're probably aware, we're only planning to run a demo. We're not trying to sort of be a search aggregator in our plans with Sun. We're in business and JXTA's sort of creating some cool new technologies for this community.
Rael: How is JXTA Search integrated both team-wise and concept-wise with JXTA? How integrated is the experience for somebody who wants to build on JXTA and use JXTA Search?
Steve: You know, we've made no secret about our backgrounds not being in JXTA. As an independent company, it was impossible for us to be in JXTA, especially since it was evolving as we were evolving.
In terms of how we integrated it technically, what we've done is the existing hub service, the router and resolver, like I said, we designed them to be agnostic to the network transfer protocol and also the messaging format -- so essentially what we did was we adapted what we had and built a JXTA version of the resolver and the router that communicated out of a pipe using JXTA messages instead of using the HTTP post connection that we were using on the Web. As a result we've got a system that's pretty easy to use from both JXTA and also from a Web programming standpoint. I mean, we're not trying to push this angle but maybe a side effect of this is that people can see how easy it is to integrate these different environments such as JXTA and the Web.
Rael: You should obviously see folks building gateways to HTTP or SMTP or whatever vs. a pure JXTA search.
Steve: Yeah, absolutely. And I think that the key thing, as long as you're sharing the same messages -- you know, in our sense we send the same messages to a JXTA peer that we send to a Web server -- as long as they can understand what we're sending them, and as long as we have some method of bridging the different protocol bases, the transfer protocol bases, then you get to go.
The issue is in some of the design of the hub. In the HTTP version of the router, we have to be careful about how we manage HTTP connections, whereas in the JXTA version, it's actually a little bit easier because we can just connect and send and forget about it, and as long as we tag the queries and the responses, we can keep track of what's going on.
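The "tag the queries and the responses" approach can be illustrated with a small bookkeeping class. This is only a sketch of the general correlation technique under assumed names (`PendingQueries`, `send`, `receive`); it does not use or imitate the JXTA pipe API.

```python
import itertools
from typing import Dict, Optional, Tuple


class PendingQueries:
    """Hypothetical router bookkeeping for fire-and-forget messaging."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: Dict[str, str] = {}   # tag -> who asked

    def send(self, requester: str) -> str:
        """Tag an outgoing query and remember who asked; no connection is held open."""
        tag = f"q{next(self._ids)}"
        self._pending[tag] = requester
        return tag

    def receive(self, tag: str, result: str) -> Optional[Tuple[str, str]]:
        """Match a response to its query by tag; unknown tags are dropped."""
        requester = self._pending.pop(tag, None)
        if requester is None:
            return None
        return (requester, result)


router = PendingQueries()
first = router.send("alice")
second = router.send("bob")
# Responses may arrive in any order; the tag says where each one belongs.
print(router.receive(second, "result for bob"))
print(router.receive(first, "result for alice"))
```

An HTTP-based router would instead have to keep each request's connection open until the answer arrives, which is the extra care Steve mentions.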
In some of these JXTA searches, asynchronous communication and grouping work very well when you want to do things like chaining lots of different hubs together so you don't have to keep all these synchronous connections open, whereas other situations that we have on the Web work equally well. So on the one hand there's a sort of convergence of all these different ideas and different protocol spaces and messaging spaces, and at the same time I imagine that people will tend to keep what they know works while at the same time hopefully adopting cool ideas from JXTA.
Rael: Are there going to be other APIs that you're going to actually be focusing on, such as SOAP or XML-RPC, particularly for the Web services stuff?
Gene: You know, I think pretty clearly that's something that is on the map of things to do because the idea for InfraSearch is to make search something that's effectively protocol-independent, right? When you distribute search, you have the capability to make your search not only protocol-independent but also independent of the data format and of the data-storage paradigm. That's one way of looking at what InfraSearch does. And so clearly talking to JXTA Search nodes that are enabled using XML-RPC through a bridge or something like that is something that we would probably want to do.
Really, the problem that JXTA Search solves is this problem where traditional Web-based searches expected data to be represented on the Web in a very specific way on very specific types of systems, and using specifically HTTP and HTML, and that's proven to not be enough.
Rael: What you're really getting at here is resource discovery.
Rael: And the resources are a schema and you already hinted about RDF descriptions, but you're talking basically about distributed resource discovery. How would this relate to UDDI and the like?
Gene: Oh, the .NET question. I was waiting for that.
Rael: I may say "Hailstorm" in a moment, but --
Gene: I think what will happen is that in the long term these things will play out, and in some cases UDDI will prove to be a successful approach. It certainly seems to be an effective way for people to easily publish information about their business, which is what seems to be the major use for it, like a global contact book. And I can imagine that getting extended as people start to use more Web services as a way to publish what you've got available.
And I think the difference with what we've done is that -- and I'm not fully familiar with all the work of UDDI -- but I think the big difference is that we don't have any real structured definition as to how you should work with our network. You know, we have these envelopes and protocols that allow you to send, receive, and register what you've got, but then internal to that you can pretty much do whatever you want, as long as people are willing to accept queries or accept requests for that XML or specifically the data within the XML, and as long as they can be formatted and responded to OK.
So to answer your question in a long way, I don't see them as particularly competitive, and I think this is just going to play out. I think ours is a more attractive scheme for what we were talking about, the ad hoc distributed resource discovery networking, and so on.
Kelly: I wonder what the technical metrics for success for JXTA Search are.
Steve: I don't know, Kelly. I mean actually people like it, think it's a good approach, and I guess the measure of success if you open source something is that people both use it and extend it. And so I guess looking a year out I'd be happy if people were using it, if there were JXTA Search hubs all around the place, if providers were adapted to JXTA Search, if people were using it across their distributed peer-to-peer networks as well as on the deep search of the Web. And that people were making it better. I guess that would make me happy.
Copyright © 2009 O'Reilly Media, Inc.