O'Reilly Network    
 Published on O'Reilly Network (http://www.oreillynet.com/)


What to Do About Spam?

by Cory Doctorow
08/06/2002

Editor's Note: Cory Doctorow presents his views on spam-killer technologies in this message, which he posted to the Silklist mailing list on August 5.

From: Cory Doctorow
Date: August 5, 2002
Subject: What to do about Spam?

Bruce Sterling, in his speech "A Contrarian's View of Open Source" at the O'Reilly Open Source Convention, rehashed a debate we've been having about spam and the end-to-end principle.

I think the world of Bruce, and I have some deep disagreements with him. He's wicked-smart, and that makes those disagreements all the more enjoyable. He's challenged my thinking on this, but I'm still unconvinced (though I've come around on some points).

It comes down to this--to a certain extent, Bruce and I are both techno-determinists, except that, in my opinion, I'm an optimistic determinist and he's a pessimistic determinist.

I shared a room with Bob Frankston at a conference last spring and he basically blew my mind. He did it with his "Connectivity" rant, which you can find by googling "connectivity frankston"--I won't try to sum it up here because connectivity can't be summed up; it can only be grasped, as far as I can tell, through repeated exposure to Frankston at close quarters.

It's about the end-to-end principle and messaging protocols. In an end-to-end, messaging world, my machine sends your machine suggestions, without any intervention from any third party, provided that both machines are on the Internet (or, tautologically, the way that you can tell if a machine is "on the Internet"--an ethereal concept at best--is whether it can receive suggestions from any other machine "on the Internet").

Two principles: any two parties can communicate, and what they pass to one another are suggestions.

The koan that Frankston told me that led me to enlightenment was this: "On the Internet, my right to swing my fist doesn't stop just short of your nose, because it can only impact with your nose if you execute the 'punch-yourself-in-the-nose' suggestion. It's your responsibility to figure out which suggestions you want to execute."

Or words to that effect.

When you see things this way, there is no malware, no spam.

Really. I mean, yes, in the real, present-day world, we don't get to choose which suggestions we execute, but that's because we've got bad software.

But the software is getting better. My second revelatory experience was installing Mozilla 1.0 and finding the "block images from this server" context menu item. The lid lifted off of my head and my brains did a traditional folk-dance in celebration of the extreme cleverness of the Moz hacker hivemind.

Because here, at last, is a crude but amazingly effective way for me to easily decide which suggestions I want to honor and which ones I want to ignore. My pal Raffi at the MIT Media Lab has a sterling (heh) exemplar of the power of suggestions versus orders. He's working on a project that polls a bunch of news-sites, like cnn.com, every couple of minutes, and times the latency of the response.

The idea is that the next time a crisis of 9-11 grade rolls around, these news sites will experience a massive DDoS attack from a half-billion panicked netizens and all the sites will go down. The way it goes is, Raffi's automated user-agent sends an HTTP request to cnn.com, and its httpd believes that it is sending an HTML file to a browser, and accordingly, it passes a file over the connection that contains a bunch of suggestions as to what to do with that file. Raffi tosses all those suggestions out, times the response, and decides, based on latency, whether something critical is happening somewhere in the world. Because CNN is passing Raffi a suggestion (and not an order!), Raffi can do something totally unexpected with it--Raffi's like a jazz-critic who listens to the pauses as much as he listens to the notes.
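Here is a minimal sketch of that kind of poller, in Python. It is not Raffi's code, which the post doesn't show. The function names, the median comparison, and the factor-of-five threshold are all assumptions made up for illustration. The point is only that the response body gets thrown away and the clock is the thing being read.

```python
import time
from statistics import median
from typing import Callable, List
from urllib.request import urlopen


def fetch_front_page(url: str) -> bytes:
    """Fetch a page the way a browser would ask for it (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def time_request(fetch: Callable[[], bytes]) -> float:
    """Run one fetch, discard whatever comes back, return elapsed seconds."""
    start = time.perf_counter()
    fetch()  # the page's contents (the "suggestions") are deliberately ignored
    return time.perf_counter() - start


def looks_like_crisis(history: List[float], latest: float, factor: float = 5.0) -> bool:
    """Flag a sample that is much slower than the recent median latency.

    The factor of 5 is an arbitrary illustrative threshold, not a tuned value.
    """
    if not history:
        return False  # nothing to compare against yet
    return latest > factor * median(history)
```

A caller would run something like time_request(lambda: fetch_front_page("http://cnn.com/")) every couple of minutes, keep the recent samples in a list, and pass them to looks_like_crisis.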

Vipul's Razor hints at another universe of suggestion-management. I get a metric fuckload of spam--like, 400 pieces--every day (I get 600 pieces of non-spam every day, too, which is why my time for mailing-lists like this one is pretty limited). Spammers are total shitbirds, sure. But I'm loath to look to legal solutions for this sort of thing. When we apply trespass doctrines -- the leading legal theory for shutting down spammers -- to open services, the Internet basically ceases to exist. Could the silklist exist if the list-manager needed to hunt down the admin of every subscriber's mail-server and secure his/her permission to send silklist messages? Hell, no.

It's like deep-linking. There's an excellent technical means for stopping search engines from spidering your site -- create a robots.txt file at your docroot and biff-bam, you're dark-matter. If you don't want to get linked to by offsite referrers, you need only edit your Apache config, add a couple lines, and hey-presto, no one can visit your site except through the paths you specify.
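To make the robots.txt half of that concrete, here is a small Python sketch using the standard library's urllib.robotparser. The two-line robots.txt, the "ExampleSpider" user-agent, and the example.com URL are illustrative assumptions. Note that the check runs on the spider's side: robots.txt is itself a suggestion, and it works because well-behaved spiders choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks every spider to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite spider asks before it crawls:
allowed = may_fetch(ROBOTS_TXT, "ExampleSpider", "http://example.com/page.html")
```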

But deep-linking opponents don't want to use technical means to stop links to their sites. Rather, they assert a moral right in copyright to the public fact of the existence of their pages. They believe that if I want to state, on my Web page, that there is a page x at location y, that the author of page x should be consulted in advance and his permission secured. Could a search engine exist in this universe? Hell, no.

Would the Web work without search engines? Hell, no.

So, how do we address the flood -- the torrent, the deluge, the avalanche -- of spam that threatens to drown out all civilized discourse with MAKE PENIS FAST? With automated tools that let us manage whose suggestions we want to honor. When I xmit this message, it will travel from my computer to smtp.well.com and thence to lists.vipul.net. Lists.vipul.net will relay that message to a couple hundred SMTP servers around the world, which will in turn deposit the messages in a bunch of standards-defined and proprietary mailboxes -- POP, IMAP, Webmail, whatever. Then you folks will use a piece of client software to download the message.

The message is, in itself, a suggestion: please display this message inside of a mailer, with the following summary info (headers). There is nothing present in this message that can possibly compel your mail-client to display it. The decision to display the message takes place entirely at the "edge" (though the Internet, in my opinion, doesn't have edges -- this is a pernicious lie perpetrated by Hollyweirdniks who think that the Internet is an "entertainment medium", where the "center" sends "content" to the "edge," so that the couch potatoes at the "edge" can passively "consume" it; Gibson nailed the Hollywood conception of a consumer in "Idoru", a person "... best visualized as a vicious, lazy, profoundly ignorant, perpetually hungry organism craving the warm god-flesh of the anointed.

Personally, I like to imagine something the size of a baby hippo, the color of a week-old boiled potato, that lives by itself, in the dark, in a double-wide on the outskirts of Topeka. It's covered with eyes and it sweats constantly. The sweat runs into those eyes and makes them sting. It has no mouth..., no genitals, and can only express its mute extremes of murderous rage and infantile desire by changing the channels on a universal remote. Or by voting in presidential elections.") (man, I get parenthetical after midnight) (to continue this point about the Internet's shape, it's like a Klein bottle -- the "edges" all touch each other, that's what end-to-end means, and so there is no "last mile," only an infinitude of "first miles").

Crap, where was I? Oh yah, mail is a suggestion.

Your mailer and your mailer alone is the point at which a command-decision is made to show you a message. Vipul's Razor works on a dead-simple principle: ask motivated spam-haters (the people who are currently reenacting Lord of the Flies in SMTP black holes and 1984 with regexps in SpamAssassin) to manually tag those messages that are unambiguously spam as spam, then publish cryptographic digests of the found spam so that other people can gain access to them.
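The tag-and-publish idea can be sketched in a few lines of Python. This is not Razor's actual signature scheme, which this post doesn't describe. The SHA-1 digest, the whitespace-and-case normalization, and the DigestCatalogue class are all assumptions for illustration, and an in-memory set stands in for the shared, published database.

```python
import hashlib


def spam_digest(body: str) -> str:
    """Digest a message body after collapsing whitespace and lowercasing.

    The normalization is an assumed, simplistic way to make trivially
    reformatted copies of the same spam hash to the same value.
    """
    normalized = " ".join(body.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DigestCatalogue:
    """Hypothetical stand-in for a shared database of reported spam digests."""

    def __init__(self) -> None:
        self._digests = set()

    def report(self, body: str) -> None:
        """A spam-hater tags a message: only its digest is published."""
        self._digests.add(spam_digest(body))

    def is_known_spam(self, body: str) -> bool:
        """Anyone else can check an incoming message against the catalogue."""
        return spam_digest(body) in self._digests
```

Publishing digests rather than the messages themselves means a mailer can check a match without anyone redistributing the spam. An exact-match digest like this one is also easy to dodge by changing a single word, which is one reason a real system would want something more forgiving.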

But Cory, I hear you say, w-t-f good does that do my sainted grandmother, who wants only to see the Webcam photos I sent her of my newest offspring without wading through 700 Nigerian scam-letters and appeals to send postcards to poor little Craig Shergold? If you emit the phrase "cryptographic digest" within 300 yards of her perfect, wrinkled ears, she will shriek in technophobic horror, assume a fetal position, and it will take the technical wizardry of the finest doctors at Johns Hopkins to revive her.

Or words to that effect.

There's a flip answer to this: Your grandmother will be dead soon. Meanwhile, illiterate street kids can figure out how to do really amazing shit with a computer. Soon, those children will be adults, and your grandmother will be dead. So the future of the Internet is in good shape.

But screw the flip answer, because there's a better one: Give your grandmother a mailer that interoperates with Vipul's Razor. One percent of the Internet's users, if that, create Web pages with links in them. Nevertheless, that one percent is more than sufficient to provide a citation structure that Google can tease forth to organize the whole goddamned Web with eerie accuracy. Hoo-ah!

Likewise, if one percent of the world's cranky, fixated, high-functioning-autistic techno-elite can be coaxed into tagging all the goddamned spam, in real-time, then there will be a kick-ass, up-to-the-minute database that your grandmother's mailer can consult before it shows her the messages it's fetched from the POPd. The mailer takes anything with an x-vipul-spam: Yes header and sticks it in a "Don't look at this unless you're a cranky, fixated, high-functioning, etc." folder, and it is invisible to your grandmother. When spam slips through the filter, your grandmother can (but doesn't have to) use a special "delete" shortcut that tags the message and sends it out to the universe.
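What the mailer's side of this might look like, as a rough Python sketch using the standard email package. The header name comes from the paragraph above. The folder names and the choose_folder function are hypothetical, and the sketch assumes that something upstream has already stamped the header onto the message.

```python
from email import message_from_string

SPAM_HEADER = "X-Vipul-Spam"  # header name as described in the text above


def choose_folder(raw_message: str) -> str:
    """Decide which folder a fetched message should be filed in.

    Header lookup in the email package is case-insensitive, so
    "x-vipul-spam" and "X-Vipul-Spam" are treated the same.
    """
    message = message_from_string(raw_message)
    verdict = message.get(SPAM_HEADER, "")
    if verdict.strip().lower() == "yes":
        return "Suspected Spam"  # hypothetical folder that grandma never opens
    return "Inbox"
```

Nothing is deleted or bounced here: the message is only filed somewhere else, so a wrong verdict can still be found and rescued.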

The Cloudmark people are doing good work on hardening the Razor. There's a reputation metric that kills the ability of careless/malicious users who submit non-spam as spam to actively participate. There are a bunch of mailer plug-ins in the works. It won't be built overnight. Like Moz, it'll probably take several years before we see a ready-for-grandma release of Vipul's Razor, and it will probably have a cute, grandma-friendly name like "Kali, the Destroyer, Devourer of Spammers' Souls."

Or words to that effect.

Damn, it's getting late. Dammit Udhay, stop asking me interesting questions at midnight!

More stuff:

Regexp hackers aren't the devil, and neither is SpamAssassin. Danny and Quinn have taught me this. SpamAssassin is basically a framework for encompassing a variety of anti-spam approaches, and its genetic algorithms mean that it will (if I'm right) eventually evolve away from all the document-parsing/machine-intelligence approaches in favor of a multiplicity of collaborative filters.

Razor isn't quite Google for spam. Google captures implicit, idle decisions -- links -- and uses them to organize the online universe. Razor asks its users to generate explicit decisions outside of their normal course of business. This makes Razor a DMOZ for spam (at best) or a Yahoo for spam (at worst). If collaborative filtering approaches to spam are going to work, they need to find ways to capture more implicit decisions. Figuring out where those implicit decisions about spam can be found is the most important technical challenge of the next 24 months.

Cloudmark has a partial solution. They ask ISPs to reactivate accounts that have been dead for 12-plus months and submit all the mail that reaches them to the Razor. This is the kind of thing that I'm talking about, but it's still a flawed approach, since it assumes that no legitimate list will leave subscribers on its roll after one year's worth of bounce-messages. This is an invalid assumption. Lots of legitimate lists do not prune dead addresses from their rolls, which means that using reanimated zombie accounts for Vipul-fodder will end up tagging legitimate mail traffic as spam.

The winner will be the system that never generates false positives. It is simply not acceptable for mail to vanish or bounce if it's a legitimate communication. A system that advises you to ignore suggestions from your trusted pals undermines the end-to-end principle.

A friend's workplace has a porn-filter in place that keeps mistaking my short stories for pornography and bouncing them back to me. My friend and his co-workers have started to hand out alternate email addresses to people in the outside world, not because they want to receive porn at work, but because they can't afford to have their conversations disrupted by an imperious Perl script (the phrase in my current novel that keeps tripping SpamAssassin is "young Asian cop" -- "young Asian" rings the cherries on SpamAssassin's pornfilter).

Cory Doctorow is the co-editor of Boing Boing and the Outreach Coordinator for the Electronic Frontier Foundation.



Copyright © 2009 O'Reilly Media, Inc.