The Holy Grail: searching for dynamically generated data
Gnutella is a fairly simple protocol. It defines only how a string is
passed from one site to another, not how each site interprets the
string. One site might handle the string by simply running
fgrep on a bunch of files, while another might insert it into
an SQL query, and yet another might assume that it's a set of Japanese
words and return rough English equivalents, which the original
requester may then use for further searching. This flexibility allows
each site to contribute to a distributed search in the most
sophisticated way it can. Would it be pompous to suggest that Gnutella
could become the medium through which search engines operate in the
Status of Gnutella
Gnutella was started by a division of America Online called Nullsoft. America Online cut off support when it heard about the project, afraid of its potential use for copyright infringement. But a programmer named Brian Mayland reverse engineered the protocol and started a new project to develop clients. None of the developers of current software have looked at code from Nullsoft. Gnutella is an open source project with clients registered under the GNU License.
Limitations and risks of Gnutella
Early experiments with Gnutella suggest it is efficient and useful, but has problems scaling. If you send out a request with a time-to-live of 10, for instance, and each site contacts six other sites, up to 106 or 1 million messages could be exchanged.
The exponential spread of requests opens up the most likely source of disruption: denial-of-service attacks caused by flooding the system with requests. The developers have no solution at present, but suggest that clients keep track of the frequency of requests so that they can recognize bursts and refuse further contact with offending nodes.
Furthermore, the time-to-live imposes a horizon on each user. I may repeatedly search a few hundred sites near me, but I will never find files stored a step beyond my horizon. In practice, information may still get around. After all, Europeans in the Middle Ages enjoyed spices from China even though they knew nothing except the vaguest myths about China. All they had to know was some sites in Asia Minor, who traded with sites in Central Asia, who traded with China.
Spencer Kimball, a developer of the Linux client for Gnutella, says this subnetting can serve to protect Gnutella from attack. Gnutella has already suffered service disruptions, mostly because of bugs in clients, and in the future it is certain to be attacked with vicious and sophisticated attempts to bring it down. While some groups of sites have slowed down temporarily or become severed from other groups, the system has never actually come down.
People may misuse Gnutella for other reasons besides denial of service, of course. One site was recently reported to use it for a sting: The site advertised file names that appeared to offer child pornography, then logged the IP address and domain name of every download request. The reason such information was available is that Gnutella uses HTTP; there is no difference between the user information Gnutella offers and that offered by any Web browser.
A final limitation of Gnutella worth mentioning is the difficulty authenticating the source of the data returned. You really have no idea where the data came from -- but that's true of e-mail and news right now too. Clients don't have to choose anonymity; they can identify themselves as strongly as they want. If a Gnutella client chooses to return a URL, that's just as trustworthy as a URL retrieved in any other manner. If a digital signature infrastructure becomes widespread, clients could use that too. I examine reliability and related policy issues in the article The Value of Gnutella and Freenet.