Remember when searching the Internet was hard? The dark days when we relied on dumb-as-sand machine intelligences, like those on the back-ends of AltaVista and Lycos, to rank the documents that matched our keywords? The grim era before Google, when searching was a spew of boolean mumbo-jumbo, NEAR this, NOT that, AND the other?
God, that sucked.
Lucky for the Internet, Google figured out the One True Way to make sense of the Internet, to defeat gamers of the system and send info-free brochureware plummeting to number n - 1 out of n results.
They did it with our help. Google's near-magical ordering of the Internet is built around the notion that computers are good at doing repetitive, uncreative things -- fetishistically counting things, for example -- and rotten at understanding why they're being asked to do these boring tasks. By contrast, human beings are great at understanding why they're doing something, but they're woefully deficient in the do-the-same-thing-perfectly-and-forever department.
AltaVista tried to get computers to do both the repetitive parts (capturing billions of documents) and the creative parts (figuring out what the documents are about). This yielded the largest collection of randomly organized documents in the world, a Web-accessible version of a library where all the books have been re-shelved by axe-grinding illiterates who wanted to make sure that no matter what you were looking for, you'd find porn.
Yahoo tried just the opposite, getting human beings to manually identify and describe all the documents comprising what was meant to be an exhaustive index of all the worthwhile pages on the Web. There were "scaling issues" involved in this laudable effort (for "scaling issues" here, substitute "catastrophic failures"), and over time, Yahoo's directory dwindled to an increasingly marginal sliver of the Internet's vastness. At the rate that Yahoo's army of indexers works, and at the rate that the Internet's unwashed horde of writers is adding to the noosphere, it's only a matter of a few years before every human being alive will have to pass his or her every working hour contributing to Yahoo's index, just to keep its sliver from dwindling into utter pointlessness.
Google bridges the divide between human-generated indexes and machine-generated analysis.
Y'see, the Web is full of people like you and me, making links between documents; human beings, making decisions about documents, voting with their links. When I link to some arbitrary document, it's an indication that I think that it's in some way authoritative. When you link to a document I wrote, you're indicating that I'm in some way authoritative. The Internet is already structured in a meaningful way, but that structure is obscured. Google teases out the relationship between the URLs, examining the webs of authority: this person is linked to by 50,000 others, and he links to this other person over here, which indicates that person one is a pretty sharp individual, one who's inspired 50,000 human beings to take time out of their busy schedules to link to him; and person one thinks that person two is on the ball, which suggests that person two knows what she's on about.
It's a best-of-both-worlds solution. The computers at Google are asked to tirelessly count and re-count the number and destination of links on every page that the Googlebot can lay its user-agent on. Those links are made by human beings, doing what they do best, link by link, drip by drip, layering a film of order over the Internet.
The approach works well. Eerily well. Enter a couple of search terms, and biff-bam, the most authoritative documents containing those keywords are served up in an instant. Nearly every document on the Web has a human decision associated with it for Google to glom onto; that's because nearly every document on the Web has a human author. Human authors don't just put documents onto the Web; they put them into the Web, into the meshed hairball of incoming and outgoing links, indicating not only what keywords the document contains, but also who the document's author believes is authoritative, and vice versa.
It's quite elegant.
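The link-counting idea described above can be sketched in a few lines of Python. This is a simplified, PageRank-style iteration, not Google's actual ranking code: every page splits its "vote" among the pages it links to, and a vote from a highly ranked page counts for more. The toy web, the page names, the damping value of 0.85, and the iteration count are all assumptions made for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages so that a link from a well-linked page counts for more.

    `links` maps each page name to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally authoritative.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three fans link to person_one,
# and person_one links to person_two.
links = {
    "fan1": ["person_one"],
    "fan2": ["person_one"],
    "fan3": ["person_one"],
    "person_one": ["person_two"],
    "person_two": [],
    "loner": [],
}

scores = link_rank(links)
for page, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page:12s} {score:.3f}")
```

In this toy web, person_two has only a single incoming link, but it comes from the well-regarded person_one, so person_two still ends up scoring far above the page nobody links to. The humans supply the judgment; the machine just counts.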
Meatspace ASCII, the revered printed word, has many things going for it:
It's high-resolution: Whether scrawled with a toddler's crayon or hammered out by a quaint, humming Selectric's print-ball, a traditionally printed word is an order of magnitude sharper and better-defined than the phosphors marching across your screen.
It requires no specialized reader: A printed word can be read by any literate human being during daylight hours without any particular technological assist, specialized readers, or even electricity.
It is hard to make obsolete: Printed works don't staledate the way that electronic words do. It's difficult to apply "digital rights management" schemes to the printed word that will stymie generations to come with bizarre cryptosystems that seek to circumvent posterity.
As someone in possession of tens of thousands of books, I understand why people get misty and sentimental about dead-tree libraries. As someone who has moved twice in the past 18 months, I feel compelled to point out that the printed word has a couple of major downsides:
It is fragile: We print books on the same substrate we employ for cleaning our nether regions after excreting. Think about that for a second: Paper is considered degradable enough to flush billions of sheets of it down the crapper every day, and yet we entrust our precious words to a material that auto-incinerates if you put it into contact with oxygen.
Well, so what? We've got mass production techniques that will let us preserve our most important documents by making millions of copies of them. Which brings us to the next problem:
It is bulky: Moving-box companies sell specialized shipping boxes for books, boxes that are smaller than all the other species of boxen. That's because books are freakin' heavy. They're made from trees!
By contrast, of course, bytes are pretty manageable. They've got their own degradability issues -- CDs, magnetic tape, flash, and platters all fall apart pretty quickly -- but that's OK, because bytes are not only comparatively tiny (I can carry 50 novels on my 3-ounce PDA, or 7,000 novels on my 6.5-ounce iPod), but they get tinier every year.
Every year, storage media increase in density, decrease in size, and get cheaper. I can fit all the hard drives of all the computers I've owned, plus all the floppies for all the computers that I owned before hard drives were common, onto the hard drive of my latest laptop, with storage to spare. Hell, most of that stuff will fit on my iPod! The data that previously occupied a roomful of storage devices now fits comfortably in my pocket.
In a world of degradable storage, replicating copies is the surest way to guarantee longevity. Whether your data is in atoms or bits, the more copies you make of it and the more widely you disperse it, the greater the likelihood that your data will persist forever. (That's why Jaron Lanier jokingly proposed encoding printed matter into the DNA of the notoriously prolific cockroach, as a means of ensuring archives through a nuclear war and beyond.)
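The arithmetic behind "more copies, more longevity" is easy to sketch. Assuming, purely for illustration, that each copy independently survives some stretch of history with the same probability, the chance that at least one copy makes it is one minus the chance that they all perish. The function name and the 1 percent survival figure below are hypothetical, and real copies rarely fail independently (a fire takes out the whole shelf).

```python
def survival_probability(p_copy_survives, copies):
    """Chance that at least one of `copies` independent copies survives.

    Assumes every copy survives with the same probability and that
    failures are independent, which is a simplification.
    """
    if not 0.0 <= p_copy_survives <= 1.0:
        raise ValueError("p_copy_survives must be between 0 and 1")
    if copies < 0:
        raise ValueError("copies must not be negative")
    # All copies perish with probability (1 - p) ** copies.
    return 1.0 - (1.0 - p_copy_survives) ** copies


# Hypothetical figure: each copy has a 1 percent chance of surviving.
for copies in (1, 10, 100, 1000):
    chance = survival_probability(0.01, copies)
    print(f"{copies:5d} copies: {chance:.4f}")
```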
With bulky printed words, only the commercially successful (and hence prolific) and very lucky works are likely to survive the voyage through history. All the words we write try to crowd into the lifeboat, but only a lucky few survive.
The historical forgettery is something of a blessing, though. Many's the word that's been penned, in casual correspondence or published works, that is best forgotten. I know that I've written a few things I'd rather no one ever saw. Much of it is embarrassing; most of it is banal. History flenses away the great bulk of utterance and leaves behind a barely manageable archive that we can get our heads around.
Words-as-bytes need not be forgotten! Storage is cheap, storage is compact, and the lifeboat has got plenty of room for every jot and tittle keyed into the Internet. Brewster Kahle built an archive with several copies of the Web at different times, using off-the-shelf PCs and standard drives.
This is a good thing, but it's also a pain in the ass. Our embarrassing excesses, drunken rants, typos and brain farts and flames no longer vanish into our subconscious, but rather hang around like embarrassing relatives, undeniably ours, with us forever.
There's an upside, of course. The enduring presence of our publicly stated positions acts as an accountability system, making us own up to our errors and perhaps encouraging us to think carefully before putting our fingers on our keyboards. Old Usenet clients used to have a standard warning that would appear the first time you used Usenet to send a message, a dire warning to the effect that your words were about to pass from your computer and onto the computers of thousands of other people, and are you really sure that you've expressed yourself adequately?
Jonathan Lethem's Motherless Brooklyn features Lionel Essrog, a private detective with Tourette's Syndrome whose obsessive-compulsive illness makes him ideal for long, boring stakeouts and wiretap parties. Once the compulsion to listen for a keyword in the soup of a rambling conversation or to continually re-check a staked-out doorway for a suspect has been planted in Lionel's Tourettic brain, he is unable to do anything except listen and watch until the compulsion has been satisfied.
Boring, repetitive, endless tasks don't actually require someone with a compulsive disorder to do them; computers can do them just fine. A computer can sieve through the torrent of packets passing over the Internet and look for keywords like "terrorism" and "anthrax" and "fissile" and "child-porn," then flag them for later consideration by law-enforcement officials at spooky three-letter agencies.
Law enforcement doesn't really need any specialized equipment to surveil the average netizen. Google does it better than anything else possibly could (dirty snitch), and it doesn't cost a cent.
But Google only acts on the public data that human beings are free to link to and that the Googlebot is free to discover. Private documents (email, instant messages, internal memos) are off-limits to Google. Even if you manually poured them down the Googlebot's throat, the absence of incoming or outgoing links to these documents means that they won't be placed in any meaningful context in the Googleverse.
Increasingly, law-enforcement agencies are pushing for (or owning up to) the creation of really creepy spyware projects like Echelon, Magic Lantern, and Carnivore, systems that are placed on your computer, at your ISP or at a major Internet backbone, and used to indiscriminately capture all of the data they encounter, shunting it off to shadowy bunkers where the secret masters of the universe can use it to shine a light up the skirts of your privacy and, possibly, that of criminals, too.
People are, rightfully, very upset about all of this. Continuous wiretapping of the entire Internet is a revolting idea, something like the Panopticon, a prison where the warders can see your every move from perfect obscurity. It's enough to make you want to draw your blinds and curl up under the sofa.
But what do they do with all of that data that they collect? Filter it for keywords? Fat chance. The volume of false positives (e.g., people talking about child pornography who aren't child pornographers) far exceeds the volume of actual criminal activity. Even creaky old Lycos gave up on plain-old keyword matching a long, long time ago.
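To see why plain keyword matching drowns in false positives, here is a minimal sketch of such a filter. Everything in it is invented for illustration: the keyword list borrows a few of the words mentioned earlier, and the messages and their "actually criminal" labels are hypothetical, not real intercepted traffic or any agency's actual method.

```python
import re

# Illustrative watch list; a real one is anyone's guess.
KEYWORDS = {"anthrax", "fissile", "terrorism"}


def flag(message, keywords=KEYWORDS):
    """Return True if the message contains any watched keyword."""
    words = set(re.findall(r"[a-z-]+", message.lower()))
    return bool(words & keywords)


# Hypothetical traffic: (message, is_actually_criminal).
TRAFFIC = [
    ("New study on anthrax vaccines published today", False),
    ("My essay on the history of terrorism is due Friday", False),
    ("The band Anthrax is touring again", False),
    ("Fissile material safeguards explained for students", False),
    ("Lunch at noon?", False),
    ("Shipment of anthrax spores arrives Tuesday, tell no one", True),
]

# Everything the filter catches, and the subset it caught wrongly.
flagged = [(message, bad) for message, bad in TRAFFIC if flag(message)]
false_positives = [message for message, bad in flagged if not bad]

print(f"flagged: {len(flagged)} of {len(TRAFFIC)} messages")
print(f"false positives among flagged: {len(false_positives)}")
```

The filter can count keywords forever without tiring, but it has no idea why anyone used them, so the students, journalists, and metal fans all land in the same pile as the one message worth reading.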
Maybe they manually check it. After all, that approach worked for Yahoo, right? Oh, right, it didn't work. Scratch that.
Then they must use some hybrid approach: human editors and AI (Artificial Intelligence or Almost Implemented, take your pick) working in concert to tweeze out the most relevant material as quickly and efficiently as possible.
Cory Doctorow is the co-editor of Boing Boing and the Outreach Coordinator for the Electronic Frontier Foundation.
Copyright © 2009 O'Reilly Media, Inc.