O'Reilly Network    


 Published on The O'Reilly Network (http://www.oreillynet.com/)
 http://www.oreillynet.com/pub/wlg/3841

Content-based spam filtering is a dead-end path

by Andy Lester
Oct. 3, 2003

In the arms race of spam prevention, content-based filters, including any Bayesian ones you care to throw at the problem, have been beaten. Until we get truly intelligent recognition, where a computer is smart enough to know that a subject of "She will love you for it" is Viagra spam and that "I was at the end of my rope until I found this" is some money scam, the spammers will be able to get any content past the filters.

In addition to the tricks discussed in the ActiveState Field Guide To Spam, spammers have already started foiling the filters by throwing in random real words. I regularly get spam through two levels of filtering (SpamAssassin and Eudora) that looks like this:

      Our rates are the lowest!  You can get 3.45% fixed for 
rough pencil final happy
      30-years!  Follow this link to get the best rates
napkins canine amazed
      in the country, but only for a limited time!

The extra random non-spam text foils the filters. And, since the words are random, tactics to get a checksum or signature on the message are, or will be, useless. I suspect it won't be long before spam comes through with three lines of spam content and a couple K of random words. If we get to where words that are clearly random are somehow caught, then the spammers will turn to pulling random pages off the net for their obscuring text. Maybe they'll throw in, say, a few pages of Macbeth to foil things.
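
To make both effects concrete, here is a toy sketch in Python. It is not how SpamAssassin or Eudora actually score mail. The per-word probabilities are invented, the scoring is a bare-bones naive Bayes combination (a sum of log-odds over every word), and SHA-256 stands in for whatever checksum or signature scheme a filter might use.

```python
import hashlib
import math

# Hypothetical per-word spam probabilities, as a trained filter might hold.
# Every number here is invented for illustration only.
WORD_SPAM_PROB = {
    "rates": 0.90, "lowest": 0.92, "link": 0.88, "limited": 0.90,
    "rough": 0.15, "pencil": 0.10, "final": 0.25, "happy": 0.20,
    "napkins": 0.08, "canine": 0.12, "amazed": 0.15,
}
UNKNOWN_PROB = 0.5  # assumed neutral value for words the filter has never seen


def spam_score(text):
    """Combine per-word probabilities naive-Bayes style (sum of log-odds)."""
    log_odds = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word.strip(".,!%"), UNKNOWN_PROB)
        log_odds += math.log(p / (1.0 - p))
    # Convert the summed log-odds back into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))


def digest(text):
    """Exact-match fingerprint of a message body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


pitch = "Our rates are the lowest! Follow this link, but only for a limited time!"
copy_a = pitch + " rough pencil final happy"
copy_b = pitch + " napkins canine amazed"
padded = pitch + " rough pencil final happy napkins canine amazed"

print("bare pitch score:  ", round(spam_score(pitch), 4))
print("padded pitch score:", round(spam_score(padded), 4))
print("same fingerprint?  ", digest(copy_a) == digest(copy_b))
```

With these made-up numbers, the innocent-looking filler drags the combined score from near-certain spam down below the halfway mark, and two copies of the same pitch with different filler no longer share a fingerprint. Real filters weigh words more carefully than this, but the spammer's lever is the same.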

The answer is to stop the spammers before they get their message in. All content-based filtering depends on the spammer getting the payload to us first; none of it checks the sender at the gate. Fixing that will mean replacing SMTP. Until then, SPF seems to have potential, but it has its drawbacks.
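
As a rough illustration of what "checking them at the gate" could look like, here is a minimal Python sketch of the general idea behind SPF: a domain states which hosts may send mail in its name, and the receiving server compares the connecting IP address against that list before accepting any content. The `PERMITTED_SENDERS` table and the `sender_permitted` function are hypothetical. Real SPF looks the policy up in DNS and has a much richer syntax, so the hard-coded dictionary is only a stand-in.

```python
import ipaddress

# Hypothetical stand-in for the policies that domains would publish.
# Real SPF retrieves these from DNS; they are hard-coded here for illustration.
PERMITTED_SENDERS = {
    "example.com": ["192.0.2.0/24"],
    "example.org": ["198.51.100.25/32"],
}


def sender_permitted(envelope_domain, client_ip):
    """Return True or False if the domain has a policy, None if it has none."""
    networks = PERMITTED_SENDERS.get(envelope_domain.lower())
    if networks is None:
        return None  # no published policy, so the gate cannot decide
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in networks)


# The check needs only the claimed sender domain and the connecting address;
# the message body is never read.
print(sender_permitted("example.com", "192.0.2.10"))
print(sender_permitted("example.com", "203.0.113.5"))
print(sender_permitted("no-policy.example", "192.0.2.10"))
```

The point of the sketch is that nothing in it depends on the message content, so padding a pitch with random words or pages of Macbeth makes no difference to the outcome. The `None` case hints at one of the drawbacks: a domain that states no policy cannot be judged this way.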

Mind you, I'm not throwing away my SpamAssassin install. It helps stop a significant amount of the spam. Unfortunately, content-based filtering is a Band-Aid on the real problem.

Andy Lester is a QA & Release Manager for Socialtext. He is also in charge of PR for The Perl Foundation and maintains over 25 modules on CPAN.

oreillynet.com Copyright © 2006 O'Reilly Media, Inc.