O'Reilly Network    
 Published on O'Reilly Network (http://www.oreillynet.com/)


Gnutella and Freenet Represent True Technological Innovation

by Andy Oram
05/12/2000

Related Articles:

Napster and MP3: La Revolucion or La Larceny?
Music Industry Turns Heat on Net Music Pirates
Why the RIAA is Fighting a Losing Battle
Napster: Popular Program Raises Devilish Issues
Why the RIAA Still Stands a Chance

The computer technologies that have incurred the most condemnation recently -- Napster, Gnutella, and Freenet -- are also the most interesting from a technological standpoint. I'm not saying this to be perverse. I have examined these systems' architecture and protocols, and I find them fascinating. Freenet emerged from a bona fide, academically solid research project, and all three systems are worth serious attention from anyone interested in the future of the Internet.

In writing this essay, I want to take the hype and hysteria out of current reports about Gnutella and Freenet so the Internet community can evaluate them on their merits. This is a largely technical article; I address the policy debates directly in a companion article, The Value of Gnutella and Freenet. I will not cover Napster here because its operation has received more press. It's covered in "Napster: Popular Program Raises Devilish Issues" by Erik Nilsson, and frankly, it is less interesting and far-reaching technically than the other two systems.

In essence, Gnutella and Freenet represent a new step in distributed information systems. Each is a system for searching for information; each returns information without telling you where it came from. They are innovative in the areas of distributed information storage, information retrieval, and network architecture. But they differ significantly in both goals and implementation, so I'll examine them separately from this point on.

Gnutella basics

Each piece of Gnutella software is both a server and a client in one, because it supports bidirectional information transfer. The Gnutella developers call the software a "servent," but since that term looks odd I'll stick to "client." You can be a fully functional Gnutella site by installing any of several available clients; lots of different operating systems are supported. Next you have to find a few sites that are willing to communicate with you: some may be friends, while others may be advertised Gnutella sites. People with large computers and high bandwidth will encourage many others to connect to them.

Evil or Just Controversial?

Open source software such as Gnutella and Freenet is spreading as quickly as a virus. But are they really so unhealthy? Andy Oram points out the advantages--and disadvantages--of controversial technologies in this week's edition of Platform Independent on Web Review.

You will communicate directly only with the handful of sites you've agreed to contact. Any material of interest to other sites will pass along from one site to another in store-and-forward fashion. Does this sound familiar, all you grizzled, old UUCP and Fidonet users out there? The architecture is essentially the same as those unruly, interconnected systems that succeeded in passing Net News and e-mail around the world for decades before the Internet became popular.

But there are some important differences. First, because Gnutella runs over the Internet, you can connect directly with someone who's geographically far away just as easily as with your neighbor. This introduces robustness and makes the system virtually failsafe, as we'll see in a minute.

Second, the protocol for obtaining information over Gnutella is a kind of call-and-response that's more complex than simply pushing news or e-mail. Figure 1 shows the operation of the protocol. Suppose site A asks site B for data matching "MP3." After passing back anything that might be of interest, site B passes the request on to its colleague at site C -- but unlike mail or news, site B keeps a record that site A has made the request. If site C has something matching the request, it gives the information to site B, which remembers that it is meant for site A and passes it through to that site.

Figure 1. How Gnutella retrieves information
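
The exchange in Figure 1 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Gnutella wire protocol: the Node class, its method names, and the file names are all hypothetical, and sites call each other directly instead of using a network. What it shows is the one thing that sets Gnutella apart from mail or news: each site remembers who handed it a request, so a reply can retrace the same path.

```python
class Node:
    """One site that both asks and answers (hypothetical model, not real Gnutella)."""

    def __init__(self, name, files=()):
        self.name = name
        self.files = list(files)
        self.peers = []        # sites this node talks to directly
        self.came_from = {}    # request id -> peer that handed us the request
        self.results = []      # hits received for requests we originated

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def search(self, request_id, text):
        """Originate a request; None marks this node as the origin."""
        self.came_from[request_id] = None
        for peer in self.peers:
            peer.on_request(request_id, text, sender=self)

    def on_request(self, request_id, text, sender):
        if request_id in self.came_from:
            return                               # seen before: drop quietly
        self.came_from[request_id] = sender      # remember the way back
        hits = [f for f in self.files if text.lower() in f.lower()]
        if hits:
            sender.on_reply(request_id, hits)
        for peer in self.peers:                  # pass the request along
            if peer is not sender:
                peer.on_request(request_id, text, sender=self)

    def on_reply(self, request_id, hits):
        if request_id not in self.came_from:
            return                               # reply to a request we never saw
        back = self.came_from[request_id]
        if back is None:
            self.results.extend(hits)            # we asked; keep the answer
        else:
            back.on_reply(request_id, hits)      # relay toward the asker


a = Node("A")
b = Node("B", ["talk.mp3"])
c = Node("C", ["song.mp3", "notes.txt"])
a.connect(b)
b.connect(c)
a.search(1, "MP3")
print(a.results)   # ['talk.mp3', 'song.mp3']
```

Site A never contacts site C, yet it receives C's match because B relays it.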

I am tempted to rush on and describe the great significance of this simple system, but I'll pause to answer a few questions for those who are curious.

  1. How are requests kept separate?

    Each request has a unique number, generated from random numbers or semi-randomly from something unique to the originating site like an Ethernet MAC address. If a request goes through site C on to site D and then to site B, site B can recognize from the identifier that it's been seen already and quietly drop the repeat request. On the other hand, different sites can request the same material and have their requests satisfied because each has a unique identifier. Each site lets requests time out, simply by placing them on a queue of a predetermined size and letting old requests drop off the bottom as new ones are added.

  2. What form does the returned data take?

    It could be an entire file of music or other requested material, but Gnutella is not limited to shipping around files. The return could just as well be a URL, or anything else that could be of value. Thus, people are likely to use Gnutella for sophisticated searches, ending up with a URL just as they would with a traditional search engine. (More on this exciting possibility later.)

  3. What protocol is used?

    Gnutella runs over HTTP (a sign of Gnutella's simplicity). A major advantage of using HTTP is that two sites can communicate even if one is behind a typical organization's firewall, assuming that this firewall allows traffic out to standard Web servers on port 80. There is a slight difficulty if a client behind a firewall is asked to serve up a file, but it can get by the firewall by issuing an output command called GIV to port 80 on its correspondent. The only show-stopper comes when a firewall screens out all Web traffic, or when both correspondents are behind typical firewalls.

  4. How does the system stop searching?

    Like IP packets, each Gnutella request has a time-to-live, which is normally decremented by each site until it reaches zero. A site can also drastically reduce a time-to-live that it decides is ridiculously high. As we will see in a moment, the time-to-live limits the reach of each site, but that can be a benefit as well as a limitation.

  5. How is a search string like "MP3" interpreted?

    That is the $64,000 question, and leads us to Gnutella's greatest contribution.
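
Before turning to that contribution, the bookkeeping from answers 1 and 4 can be sketched as follows. Everything here is an assumption made for illustration: the RequestFilter class, the ceiling of 7 on the time-to-live, and the tiny queue size are not taken from any Gnutella client. The sketch shows a bounded queue of recently seen identifiers, from which old entries drop off as new ones arrive, and a time-to-live that is cut down to a ceiling and then decremented.

```python
import uuid
from collections import deque

MAX_TTL = 7          # assumed ceiling; the article gives no number
SEEN_CAPACITY = 3    # deliberately tiny so the example shows ids aging out


def new_request_id():
    """Random identifier; uuid.uuid1() would instead mix in the MAC address."""
    return uuid.uuid4().hex


class RequestFilter:
    """Decides whether a site should handle a request (hypothetical helper)."""

    def __init__(self, capacity=SEEN_CAPACITY, max_ttl=MAX_TTL):
        self.recent = deque(maxlen=capacity)   # oldest ids fall off automatically
        self.max_ttl = max_ttl

    def admit(self, request_id, ttl):
        """Return None for a repeat, else the time-to-live to forward with.

        A return value of 0 means: answer locally, but do not pass it on.
        """
        if request_id in self.recent:
            return None
        self.recent.append(request_id)
        return max(min(ttl, self.max_ttl) - 1, 0)


f = RequestFilter()
print(f.admit("r1", 5))     # 4
print(f.admit("r1", 5))     # None, a repeat
print(f.admit("r2", 200))   # 6, a ridiculously high value is cut down first
print(f.admit("r3", 1))     # 0, handled here but not forwarded
```

Because the queue has a fixed size, a site needs no timers to forget old requests; the cost is that a very late repeat is treated as new.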

The Holy Grail: searching for dynamically generated data

Gnutella is a fairly simple protocol. It defines only how a string is passed from one site to another, not how each site interprets the string. One site might handle the string by simply running fgrep on a bunch of files, while another might insert it into an SQL query, and yet another might assume that it's a set of Japanese words and return rough English equivalents, which the original requester may then use for further searching. This flexibility allows each site to contribute to a distributed search in the most sophisticated way it can. Would it be pompous to suggest that Gnutella could become the medium through which search engines operate in the 21st century?
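
To make that flexibility concrete, here is a minimal sketch in which three sites receive the same string and each interprets it differently. The function names, the table layout, and the sample data are all hypothetical; the only point is that the requester sends one string and does not need to know how each site handles it.

```python
import sqlite3


def substring_site(files):
    """A site that treats the string much as fgrep would: plain substring match."""
    def handle(query):
        return [name for name in files if query.lower() in name.lower()]
    return handle


def sql_site(titles):
    """A site that drops the string into a parameterized SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tracks (title TEXT)")
    conn.executemany("INSERT INTO tracks VALUES (?)", [(t,) for t in titles])

    def handle(query):
        rows = conn.execute(
            "SELECT title FROM tracks WHERE title LIKE ? ORDER BY title",
            (f"%{query}%",),
        )
        return [title for (title,) in rows]
    return handle


def glossary_site(glossary):
    """A site that looks up each word and returns an explanation of it."""
    def handle(query):
        return [glossary[w] for w in query.lower().split() if w in glossary]
    return handle


def ask_all(sites, query):
    """Send the same string to every site and pool whatever comes back."""
    results = []
    for handle in sites:
        results.extend(handle(query))
    return results


sites = [
    substring_site(["intro.mp3", "readme.txt"]),
    sql_site(["Live Set (mp3)", "Lecture"]),
    glossary_site({"mp3": "compressed audio format"}),
]
print(ask_all(sites, "MP3"))
# ['intro.mp3', 'Live Set (mp3)', 'compressed audio format']
```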

Status of Gnutella

Gnutella was started by a division of America Online called Nullsoft. America Online cut off support when it heard about the project, afraid of its potential use for copyright infringement. But a programmer named Bryan Mayland reverse-engineered the protocol and started a new project to develop clients. None of the developers of current software have looked at code from Nullsoft. Gnutella is an open source project with clients released under the GNU General Public License.

Limitations and risks of Gnutella

Early experiments with Gnutella suggest it is efficient and useful, but has problems scaling. If you send out a request with a time-to-live of 10, for instance, and each site contacts six other sites, on the order of 6^10, or about 60 million, messages could be exchanged.
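
The scale of the problem can be checked with a few lines of arithmetic. The sketch below assumes the worst case, in which every site forwards each request to the same number of sites that have not yet seen it; the function name and parameters are illustrative only.

```python
def worst_case_messages(fan_out, ttl):
    """Upper bound on messages if every site forwards to `fan_out` new sites."""
    per_hop = [fan_out ** hop for hop in range(1, ttl + 1)]
    return per_hop[-1], sum(per_hop)


last_hop, total = worst_case_messages(6, 10)
print(f"{last_hop:,} at the last hop, {total:,} in all")
# 60,466,176 at the last hop, 72,559,410 in all
```

With a fan-out of six and a time-to-live of 10, the bound is about 60 million messages at the final hop alone. A real network should fall well short of it, because sites share neighbors and repeat requests are dropped, but the growth is still exponential in the time-to-live.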

The exponential spread of requests opens up the most likely source of disruption: denial-of-service attacks caused by flooding the system with requests. The developers have no solution at present, but suggest that clients keep track of the frequency of requests so that they can recognize bursts and refuse further contact with offending nodes.
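
The developers' suggestion might look something like the following sketch. It is one possible reading, not code from any Gnutella client: the BurstDetector class, the limit of requests per window, and the decision to ban a peer permanently are all assumptions. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Tracks how often each peer sends requests (hypothetical helper)."""

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)   # peer -> timestamps of recent requests
        self.banned = set()

    def allow(self, peer, now):
        """Return True if a request arriving from `peer` at `now` is accepted."""
        if peer in self.banned:
            return False
        stamps = self.history[peer]
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()                # forget requests outside the window
        stamps.append(now)
        if len(stamps) > self.max_requests:
            self.banned.add(peer)           # a burst: refuse further contact
            return False
        return True


detector = BurstDetector(max_requests=3, window_seconds=1.0)
flood = [detector.allow("flooder", t) for t in (0.0, 0.1, 0.2, 0.3, 10.0)]
steady = [detector.allow("neighbor", t) for t in (0.0, 2.0, 4.0, 6.0)]
print(flood)    # [True, True, True, False, False]
print(steady)   # [True, True, True, True]
```

In a running client the timestamp would come from a clock such as time.monotonic(). A check like this limits what a single offending neighbor can do; it does not help against a flood that arrives spread across many peers.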

Furthermore, the time-to-live imposes a horizon on each user. I may repeatedly search a few hundred sites near me, but I will never find files stored a step beyond my horizon. In practice, information may still get around. After all, Europeans in the Middle Ages enjoyed spices from China even though they knew nothing except the vaguest myths about China. All they had to know was some sites in Asia Minor, who traded with sites in Central Asia, who traded with China.

Spencer Kimball, a developer of the Linux client for Gnutella, says this subnetting can serve to protect Gnutella from attack. Gnutella has already suffered service disruptions, mostly because of bugs in clients, and in the future it is certain to be attacked with vicious and sophisticated attempts to bring it down. While some groups of sites have slowed down temporarily or become severed from other groups, the system has never actually come down.

People may misuse Gnutella for other reasons besides denial of service, of course. One site was recently reported to use it for a sting: The site advertised file names that appeared to offer child pornography, then logged the IP address and domain name of every download request. The reason such information was available is that Gnutella uses HTTP; there is no difference between the user information Gnutella offers and that offered by any Web browser.

A final limitation of Gnutella worth mentioning is the difficulty of authenticating the source of the data returned. You really have no idea where the data came from -- but that's true of e-mail and news right now too. Clients don't have to choose anonymity; they can identify themselves as strongly as they want. If a Gnutella client chooses to return a URL, that's just as trustworthy as a URL retrieved in any other manner. If a digital signature infrastructure becomes widespread, clients could use that too. I examine reliability and related policy issues in the article The Value of Gnutella and Freenet.

Freenet basics

The goals of Gnutella and Freenet are very different. Those of Freenet are more explicitly socio-political and, to many people, deliciously subversive:

The latter feature characterizes both Freenet and Gnutella, and differentiates them from Napster. A court order can shut down Napster (and any mirror site), but shutting down Freenet or Gnutella would be just as hard as prosecuting all those 317,000 Internet users who allegedly exchanged Metallica songs.

Another technical goal of Freenet proves particularly interesting: it spreads data randomly among sites, where the data can appear and disappear unpredictably. In addition to serving the social goals listed above, Freenet offers an intriguing possible solution to the problem of Internet congestion, because popular information automatically propagates to many sites.

Freenet bears no relation to the community networks with similar names of the 1980s and early 1990s. It grew out of a research project launched in 1997 by Ian Clarke at the Division of Informatics at the University of Edinburgh. He has made a paper from that project available online. (Warning: It's a PDF and I had trouble both viewing and printing it from a couple different PDF viewers.)

The Freenet architecture and protocol are similar to Gnutella's in many ways. Each cooperating person downloads a client and sends requests to a few other known clients. Requests are uniquely marked, are handed from one site to another, are temporarily stored on a stack so that data can be returned, and are dropped after each one's time-to-live expires.

The game of find-the-data

The main difference between the two systems is that when a Freenet client satisfies a request, it passes the entire data to the requester. This is an option in Gnutella but is not required. Even more important, as the data passes back along a chain of Freenet clients to the original requester, each client keeps a copy (unless it is a huge amount of data and the client decides that keeping it is not worth the disk space). The client keeps the data so long as other people keep asking for it, but discards the data after some period of time when no one seems to want it.
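
A minimal sketch of such a store follows. It substitutes a simpler rule for the one described above: instead of discarding data after a period without requests, it keeps a fixed number of items and evicts whichever has gone longest without being asked for. The DataStore class and its parameters are hypothetical and are not taken from the Freenet code.

```python
from collections import OrderedDict


class DataStore:
    """Copies kept by one site as data passes back through it (hypothetical)."""

    def __init__(self, capacity, max_item_size):
        self.capacity = capacity
        self.max_item_size = max_item_size
        self.items = OrderedDict()          # least recently requested first

    def offer(self, key, data):
        """Called when data passes through; returns True if a copy was kept."""
        if len(data) > self.max_item_size:
            return False                    # too big to be worth the disk space
        self.items[key] = data
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            self.items.popitem(last=False)  # nobody has asked for this lately
        return True

    def get(self, key):
        """Return the data if we hold it; a request keeps the item alive."""
        data = self.items.get(key)
        if data is not None:
            self.items.move_to_end(key)
        return data


store = DataStore(capacity=2, max_item_size=10)
store.offer("a", b"1")
store.offer("b", b"2")
store.get("a")                  # "a" is requested again, so it stays fresh
store.offer("c", b"3")          # no room left: "b" is discarded
print(list(store.items))        # ['a', 'c']
```

Either rule has the effect described above: material that people keep asking for stays put, and material nobody wants fades away.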

What is accomplished by this practice, apparently so inefficient compared to the Internet? Ian Clarke tells me it is not all that inefficient -- for large amounts of data its efficiency is comparable to that of the Web -- and that in fact it accomplishes quite a number of things:

The last item is particularly interesting architecturally, because the popularity of each site's material causes the Freenet system to actually alter its topology. When a site discovers that it is getting a lot of material routinely from one of its partners, it tends to favor that partner for future requests. Bandwidth increases where it benefits the end users. Building on my Europe/China Silk Road analogy, Clarke says, "Freenet is like bringing China closer to Europe as more and more Europeans ask to trade with it."

Other unique features of Freenet

Freenet is more restrained in the traffic generated than Gnutella, perhaps because it expects to transfer a complete file of data for each successful request. When a Freenet client receives a request it cannot satisfy, it sends the request on to a single peer; it does not multicast to all peers as Gnutella does. If the client receives a failure notice because no further systems are known down the line, or if the client fails to get a response because the time-to-live timed out, it tries another one of its peers. In brief, searching is done depth-first and not in parallel. Nevertheless, Clarke says searches are reasonably fast; each takes a couple seconds as with a traditional search engine. The simple caching system used in Freenet also seems to produce just as good results as the more deliberate caching used by ISPs for Web pages.
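
The depth-first search can be sketched as below. This is a toy model under several assumptions: the FreenetNode class is hypothetical, sites call each other directly, and a shared set of visited names stands in for the unique request identifiers that let real sites recognize a request they have already handled. It shows a request going to one peer at a time, a copy being kept at each site on the way back, and a site moving a peer that delivered to the front of its list.

```python
class FreenetNode:
    """One site in a toy Freenet-like network (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.store = {}
        self.peers = []    # ordered: peers that delivered before come first

    def request(self, key, hops_to_live, visited=None):
        visited = set() if visited is None else visited
        visited.add(self.name)
        if key in self.store:
            return self.store[key]
        if hops_to_live <= 0:
            return None                       # time-to-live used up
        for peer in list(self.peers):         # one peer at a time, not all at once
            if peer.name in visited:
                continue
            data = peer.request(key, hops_to_live - 1, visited)
            if data is not None:
                self.store[key] = data        # keep a copy on the way back
                self.peers.remove(peer)
                self.peers.insert(0, peer)    # favor this peer next time
                return data
        return None                           # every peer failed: report failure


def demo_network():
    """A -- B and A -- C -- D, with the data held only at D."""
    a, b, c, d = (FreenetNode(n) for n in "ABCD")
    a.peers, b.peers, c.peers, d.peers = [b, c], [a], [a, d], [c]
    d.store["HumanRights"] = "report text"
    return a, b, c, d


a, b, c, d = demo_network()
print(a.request("HumanRights", hops_to_live=5))   # report text
print([p.name for p in a.peers])                  # ['C', 'B']
print("HumanRights" in c.store)                   # True
```

Site A tries B first, gets a failure, and only then tries C. After the request succeeds, A asks C first in future, and both A and C hold copies of the data.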

Freenet is being developed in Java and requires the Java Runtime Environment to run. It uses its own port and protocol, rather than running over HTTP as Gnutella does.

Limitations and risks of Freenet

Freenet seems more scalable than Gnutella. One would imagine that it could be impaired by flooding with irrelevant material (writing a script that dumped the contents of your 8-gig disk into it once every hour, for instance) but that kind of attack actually has little impact. So long as nobody asks for material, it doesn't go anywhere.

Furthermore, once someone puts up material, no one can overwrite it with a bogus replacement. Each item is identified by a unique identifier. If a malicious censor tries to put up his own material with the same identifier, the system checks for an existing version and says, "We already have the material!" The only effect is to make the original material stay up longer, because a request for it was made by the would-be censor.
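
In code, the rule amounts to an insert that refuses to replace an existing item. The sketch below is an illustration only; the KeyedStore class and its request counter are hypothetical and say nothing about how Freenet actually records demand.

```python
class KeyedStore:
    """Items that can be added but never overwritten (hypothetical)."""

    def __init__(self):
        self.items = {}
        self.requests = {}    # identifier -> how many times it was asked for

    def insert(self, key, data):
        """Store data under key unless something is already there."""
        if key in self.items:
            self.requests[key] += 1    # the attempt counts as a request
            return False               # "We already have the material!"
        self.items[key] = data
        self.requests[key] = 0
        return True


site = KeyedStore()
print(site.insert("HumanRights", b"report"))   # True
print(site.insert("HumanRights", b"bogus"))    # False
print(site.items["HumanRights"])               # b'report'
```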

The searching problem

The unique identifier is Freenet's current weak point. Although someone posting material can assign any string as an identifier, Freenet chooses for security reasons to hash the string. Two search strings that differ by a single character (like "HumanRights" and "Human-Rights") will hash to very different values, as with any good hashing algorithm. This means that a prosecuting agency that is trying to locate offending material will have great difficulty identifying the material from a casual scan of each site.

But the hashing also renders Freenet unusable for random searches. If you know exactly what you're looking for on Freenet -- because someone has used another channel to say, for instance, "Search for HumanRights on Freenet" -- you can achieve success. But you can't do free text searches.
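
A short sketch shows why. SHA-1 from Python's hashlib is used here purely for illustration, and the key_for and fetch functions are hypothetical. The effect is the same with any good hash: the exact string finds the material, and a near miss finds nothing.

```python
import hashlib


def key_for(description):
    """Hash a human-readable string into the key under which data is filed."""
    return hashlib.sha1(description.encode("utf-8")).hexdigest()


# A site stores material under the hash, never under the readable string.
store = {key_for("HumanRights"): b"report text"}


def fetch(description):
    return store.get(key_for(description))


print(fetch("HumanRights"))     # b'report text'
print(fetch("Human-Rights"))    # None
print(key_for("HumanRights") == key_for("Human-Rights"))   # False
```

Nothing in the stored key reveals the original string either, which is the security property the hashing is meant to provide.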

One intriguing use of Freenet is to offer sites with hyperlinks. Take people interested in bird-watching as an example. Just as an avid birder can offer a Web page with links to all kinds of Web sites, she can offer links that generate Freenet requests using known strings that retrieve data about birds over Freenet. Already, for people who want to try out Freenet without installing the client, a gateway to the Web exists under the name fproxy.

Another area for research is a client that accepts a string and changes it slightly in the hope of producing a more accurate string, then passes it on. The most important task in the Freenet project currently, according to Clarke, is to resolve the search problem.

Letting go

Once again, I refer readers to The Value of Gnutella and Freenet for a discussion of these systems' policy and social implications. I'll end this technical article by suggesting that Gnutella and Freenet continue to loosen the virtual from the physical, a theme that characterizes network evolution. DNS decoupled names from physical systems; URNs will allow users to retrieve documents without domain names; virtual hosting and replicated servers change the one-to-one relationship of names to systems. Perhaps it is time for another major conceptual leap, where we let go of the notion of location. Welcome to the Heisenberg Principle, as applied to the Internet. Information just became free.

Gnutella and Freenet, in different ways, make the location of documents irrelevant; the search string becomes the location. To achieve this goal, they add a new layer of routing on top of the familiar routing done at the IP level. The new layer may appear at first to introduce numerous problems in efficiency and scaling, but in practice these turn out to be negligible or at least tolerable. I think readers should take a close look at these systems; even if Gnutella and Freenet themselves turn out not to be good enough solutions for a new Internet era, they'll teach us some lessons when it's time for yet another leap.

Andy Oram is an editor for O'Reilly Media, specializing in Linux and free software books, and a member of Computer Professionals for Social Responsibility. His web site is www.praxagora.com/andyo.



Copyright © 2009 O'Reilly Media, Inc.