I’ve said it before:
Content-based spam filtering is a dead-end path. Here’s one big example from my mail this morning:
., ,; .R,
@FS fUD jos
DN Gw,
Fzw OUn hdx DLdknFf: qgOKPugU aYkIda @ygoaQr
Dj hN Sam xb tJ. mBT. fSV zek Nw; @Hf
dxd Stk ALQ TZFwKw: qR ol HJb EmpiiA@
sb .Vz XWw chY:: Aw, ju iA GFk aHs,c woi
FsrQua Gcc pW kA IBy HFd ZVx Gsx SME
ziyA riA UNvhcHbgj NZaBdunU TYA NsaQfMzrRB
, ,:;U : Ae , ,;w .:
lze yrP
IegDp.
Your spam filter isn’t going to catch the keyword “Viagra” there, is it? “But the filter knows that those aren’t words,” I hear you say. So here’s a trivial Perl program to translate all that input into names from a list:
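use strict;
use warnings;

# Group the names by length so each token can be swapped for a same-length name.
my %words;
open( my $fh, '<', '/usr/share/dict/propernames' ) or die $!;
while (<$fh>) {
    chomp;
    push( @{$words{length($_)}}, $_ );
}
close $fh;

# Replace every run of non-whitespace characters in the input.
while (<DATA>) {
    s/(\S+)/replace($1)/ge;
    print;
}

# Pick a random name of the same length, or keep the token if none exists.
sub replace {
    my $list = $words{length $_[0]} or return $_[0];
    return $list->[rand @$list];
}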
__DATA__
., ,; .r,
@ln qly tlg
nq aq,
Brg iaB WiW iqpbduk: ifcciWvj Wypdip @rnoqqS
lc st unx mm su. Wyl. eee daa jb; @kS
kjt smp WkW 8hytct: ih xd WiZ Zlantc@
tg .vk WrW cyW:: hy, vx bo WnW gtx,i 0rW
SnjsaS WbW gw oo kkZ rto WeW fvB 0qZ
xbcd ocg tfrotxynk veqWhurb kdy wavkuseax0
, ,:;i : yr , ,;i .:
Zjc ugr
btfau.
which gives back (for one run):
Ed Al Roy
Amy Tom Jim
Ji Len
Spy Lin Alf Roderick Srinivas Rajeev Juliane
Hy Ti Tao Ed Amy Renu Fay Bud Tom Jef
Tim Kyu Mat Nicolas No No Hsi Shannon
Al Ami Tai Judge Hal Al Hy Fay Piete Hsi
Gregge Suu Al Al Ken Art Moe Lar Mwa
Vern Vic Stephanie Teruyuki Rod Cristopher
, King : Ji , Les Hy
Bob Dan
Dannie
The bottom line is that we’ll never be able to handle the spam problem only by content filtering. The good guys will never be able to win the arms race.
The best I can see is that SMTP must be replaced by something that doesn’t allow anyone to send email to anyone else without any accountability.
Content-based filters are rags stuffed in the hole of a leaky boat. Water’s still getting in, and they’re not going to hold forever.
What’s the best solution you see that isn’t based on content analysis?


Whitelist works for me
I have to say that since I changed from Thunderbird to Kmail, the whitelist option works really well for me. I'm in the fortunate(?) position that not many people send me emails so I know who my "friends" are. The situation improved no end though when I wrote a PHP "Send me mail" script for my web site.
I agree that it's all a mess though, and the sooner an alternative to the current system is found and implemented the better. I'm not convinced that software will be the final answer though: there's scope for a hardware solution.
SMTP is essentially a dead end as it exists today
The "problem" revolves around the fact that spam is profitable. As long as it's profitable, cash will convince programmers to work around any spam "solution" that others come up with. And since SMTP is an unauthenticated protocol, anyone can claim to be anyone else, with varying degrees of success.
What solutions exist besides content-based?
SPF? On its face, it stops domain spoofing. But then virus writers are using real email settings to send out mails. SPF won't stop those. Plus, with zombie PCs and virtually unlimited bandwidth, mails will just be routed somewhere else.
Hashcash? That's death to mailing lists. Plus, the hashcash is verified on the client side, so the mail is already sent, and the bandwidth wasted.
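For readers who haven't met the idea, here is a minimal proof-of-work sketch in the spirit of hashcash. It is not the real hashcash stamp format: the `mint` and `verify` names, the `resource:counter` layout and the 12-bit difficulty are assumptions made for illustration. It only shows the asymmetry the scheme relies on, where minting a stamp takes many hashes and checking one takes a single hash.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);    # core module

# Count the leading zero bits of the SHA-1 digest of $string.
sub leading_zero_bits {
    my ($string) = @_;
    my $bits = unpack( 'B*', sha1($string) );    # digest as a string of 0s and 1s
    my ($zeros) = $bits =~ /^(0*)/;
    return length $zeros;
}

# Sender side: try counters until the hash has enough leading zero bits.
sub mint {
    my ( $resource, $difficulty ) = @_;
    my $counter = 0;
    $counter++ until leading_zero_bits("${resource}:$counter") >= $difficulty;
    return "${resource}:$counter";
}

# Receiver side: one hash is enough to check the stamp.
sub verify {
    my ( $stamp, $resource, $difficulty ) = @_;
    return 0 unless index( $stamp, "${resource}:" ) == 0;
    return leading_zero_bits($stamp) >= $difficulty ? 1 : 0;
}

my $stamp = mint( 'someone@example.com', 12 );
print "stamp: $stamp\n";
print "valid: ", verify( $stamp, 'someone@example.com', 12 ), "\n";
```

A mailing list sending to thousands of recipients would have to mint one stamp per recipient, which is the cost the comment above objects to.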
RBLs may help to eliminate some spam, but not effectively enough. Plus, with the virus technique discussed above, the servers that will begin being RBLed are ISP-level IPs, which will adversely affect thousands of customers at the same time.
DNS checks are a dead end: it's $7 to register a domain these days, and point it somewhere where nobody will care.
Laws? Totally ineffective, as they'd have to be global, and that's not likely to happen. Plus, they'd have to be enforced.
The other issue facing any solution is how to eliminate false positives. This problem is hard enough in itself to cause many technical solutions to be scrapped in the interests of not blocking valid email. RBLs proved this a long time ago, which is why many spam solutions use RBLs in conjunction with other techniques.
Pretty much all spam "solutions" are band-aids on a gaping wound. The root problem still exists.
Let's make spam unprofitable. How can we do that?
ISPs should do their part in being good neighbors. This means suspending people who are spewing large volumes of mail, requiring virus protection, helping update computers, etc. True, there's a large capital investment involved in doing it up front, but over time that investment may pay itself off. Plus, it's just good sense to be a good neighbor.
Let's have real penalties for spammers. Jail won't do it; it's got to be monetary to hurt. It'd be wonderful to get the money reversed from the spammers. If the full legal weight came down from Visa, MC, et al on spammers, you can bet many would stop.
And then there's the old standby, vigilante justice. You can't beat a good old-fashioned lynch mob to convince someone they've done something wrong.
Whitelist works for me
The problem with whitelisting is that it doesn't scale. It works for you, as a knowledgeable computer user. It doesn't work for a 500-user Lotus Notes installation where few are savvy. Also, sometimes you can't whitelist, such as when customers (or potential customers) want to contact you.
This content is still spammy.
Your revised content is still spammy, if your filter sees different amounts of spaces as tokens... (Yes, the current filter generation discards spaces, but the next does not need to).
Also, the mail headers should usually still be incriminating, as should the percentage of words that your ham-writing friends would not use. And non-English folks like me (who get 90%+ of their ham in their native tongue) will have no problems filtering out this rubbish...
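To make the "spaces as tokens" idea concrete, here is a small sketch of a tokenizer that keeps each run of spaces as its own token instead of throwing it away. The `tokenize` name and the `<sp:N>` token format are invented for this example; no particular filter is known to work this way.

```perl
use strict;
use warnings;

# Split a line into tokens, turning each run of N spaces into "<sp:N>"
# so that unusual spacing becomes something a filter can learn from.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while ( $line =~ /( +)|(\S+)/g ) {
        push @tokens, defined $1 ? '<sp:' . length($1) . '>' : $2;
    }
    return @tokens;
}

print join( ' | ', tokenize('Fzw   OUn hdx') ), "\n";
```

A statistical filter fed these tokens could then score odd spacing patterns the same way it scores words.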
This content is still spammy.
You can't predict what a valid format or valid content will be. Basing rules on tokens is great, but it leaves out many valid use cases that happen just often enough to make email unusable if implemented. So would this be spammy?
And remember: computing power, especially on a server, is finite. When you deal with an email server for over 15,000 users taking in 2+ million messages a day, spending 100ms per message is not acceptable. More rules = less throughput.
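The figures above can be worked through directly. This is only back-of-the-envelope arithmetic on the numbers quoted (2 million messages, 100 ms each); it assumes the load is spread evenly over the day and that a processor filters one message at a time, so peak hours would need more.

```perl
use strict;
use warnings;

my $messages_per_day = 2_000_000;       # "2+ million messages a day"
my $seconds_per_msg  = 0.100;           # 100 ms of filtering per message
my $seconds_per_day  = 24 * 60 * 60;

my $cpu_seconds = $messages_per_day * $seconds_per_msg;
my $busy_cpus   = $cpu_seconds / $seconds_per_day;

printf "%d CPU-seconds of filtering per day\n", $cpu_seconds;
printf "%.1f processors kept busy around the clock, on average\n", $busy_cpus;
```

That works out to 200,000 CPU-seconds a day, or roughly 2.3 processors doing nothing but filtering.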
And if you put it on the client side, what about people on dialup? People that pay per byte? The server's accepted the message, so the client has to do extra work to decide if the message is really valid.
Even services like Mailblocks don't get around the problem completely. They just offload responsibility to the sender. What if a smart sender bounced messages to a couple domains, which then "clicked" on the URL, which now delivers the message *and* whitelists the address for further spam?
And before you say that isn't possible or feasible, remember that spammers are smart, too. Never underestimate the collective power of teenage boys looking for pr0n.
Does Cloudmark count as "content-based"?
They use a "neighborhood watch" approach. I've been using it on web-published email accounts, and it works great. No false positives, and very few misses. These guys have figured it out. I don't understand why they haven't taken over the world yet.
Does Cloudmark count as "content-based"?
I don't know if it's content-based, but I do know it only runs on Outlook for Windows. That's a gaping hole in coverage.
Then again, Outlook is one of the reasons viruses spread so easily, so maybe they're taking the 80/20 approach.
Does Cloudmark count as "content-based"?
Well, that would explain why they haven't taken over the world yet.
SMTP is essentially a dead end as it exists today
The other issue facing any solution is how to eliminate false positives. This problem is hard enough in itself to cause many technical solutions to be scrapped in the interests of not blocking valid email.
I guess it's just part of the fallout, but these days, at work, we are prepared to lose a bit of valid email. We keep a close eye on logs, and our users are well trained: if they don't get a message they were expecting they file a ticket, and we follow up.
The one thing remaining I need to do is to summarise the daily rejects in a format that's easy for users to understand. A number of scripts available today aren't good enough, because they are bedeviled with their own false positives problem:
"What do you mean you blocked message from aunt.vera@hotmail.com?" And we have to go through the logs and identify that the message was rejected because the HELO connection string was set to the name of our own domain. I need to find a way to explain that clearly.
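As a rough sketch of the kind of check and explanation being described, the function below flags a remote server that announces our own domain name in HELO and returns a sentence a user might understand. The domain `example.org`, the function name and the wording of the message are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical local domain, for illustration only.
my $OUR_DOMAIN = 'example.org';

# Return a plain-language rejection reason, or undef if the HELO looks fine.
# $helo is the name the connecting server announced; $is_local says whether
# the connection came from one of our own machines.
sub helo_rejection_reason {
    my ( $helo, $is_local ) = @_;
    return undef if $is_local;
    if ( lc($helo) eq lc($OUR_DOMAIN) ) {
        return "The sending server introduced itself as '$OUR_DOMAIN', "
             . "which only our own servers should do.";
    }
    return undef;
}

my $reason = helo_rejection_reason( 'example.org', 0 );
print "Rejected: $reason\n" if defined $reason;
```

A daily summary could then list this sentence next to the sender's address rather than the raw log line.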
And even the people whose messages are being rejected don't seem particularly fussed. They are usually pretty happy that someone is taking the time to help them out.
I gave up the holy grail of zero false positives a while back. So far my users are pretty happy with the results.
Whitelist works for me
Neither does just saying SMTP is dead scale.
A replacement technology will take a decade or more to get globally accepted.
That's more than enough time for spammers and other criminals to find holes and prepare to exploit them.
In the meantime, the few providers and users who do use the new system (whatever it will be) will have to use SMTP on the side, find themselves cut off from the rest of the world, or use some form of gateway technology. A gateway effectively removes whatever advantage the new tech has: anything that comes in over SMTP passes through a single trusted gate into your new secure system, validating all spam as legitimate mail there and then.
Personally I use a combination whitelist/blacklist.
I block large swaths of subnets and domains (as well as many other patterns which are used almost exclusively by spammers and virus senders) and whitelist the few people in those domains that I do have a need to communicate with.
On top of that a Bayesian filter gets the last few bad things that do get through.
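The layering described above might look something like the sketch below. Every address, domain and network in it is a made-up example, the prefix match is a crude stand-in for real subnet blocks, and the Bayesian filter is represented by a code ref with an assumed 0.9 threshold.

```perl
use strict;
use warnings;

# All addresses, domains and networks below are made-up examples.
my %whitelist        = map { $_ => 1 } ('friend@blocked.example');
my %blocked_domains  = map { $_ => 1 } ('blocked.example');
my @blocked_prefixes = ('203.0.113.');    # crude stand-in for subnet blocks

# $score_spam is a code ref standing in for a Bayesian filter; it should
# return a spam probability between 0 and 1.
sub classify {
    my ( $from, $ip, $body, $score_spam ) = @_;
    my ($domain) = $from =~ /\@([^\@]+)\z/;
    $domain = defined $domain ? lc $domain : '';

    return 'accept' if $whitelist{ lc $from };    # exceptions come first
    return 'reject' if $blocked_domains{$domain};
    return 'reject' if grep { index( $ip, $_ ) == 0 } @blocked_prefixes;
    return $score_spam->($body) > 0.9 ? 'reject' : 'accept';    # last resort
}

my $never_spam = sub { 0 };
print classify( 'friend@blocked.example', '198.51.100.7', 'hi', $never_spam ), "\n";
print classify( 'other@blocked.example',  '198.51.100.7', 'hi', $never_spam ), "\n";
```

Checking the whitelist first is what lets a few known correspondents through from an otherwise blocked domain.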
Whitelist works for me
Neither does just saying SMTP is dead scale.
A replacement technology will take a decade or more to get globally accepted.
I agree, replacing SMTP is not the simple answer. However, until we stop wasting our time on the arms race of content-based filtering, we are no closer to finding a solution.
Content filter could still beat this example
It's not profitable just to send someone some ASCII art saying "viagra"; at some point you have to give users who might want your product a means of contacting you. The means of contact (link, e-mail, whatever) will get caught by content-based filtering.
Won't it?
Wrong
Yes, you've said it before, and you were wrong then, just as you are now. Yes, of course content-based filtering can never be 100 percent effective. How does this prove that it is not worthwhile? If it significantly reduces the amount of spam that gets through, then I'll prefer it when there are no better options, which is currently the case.
Wrong
I never said it wasn't worthwhile. It's very valuable.
The "no better options", however, will never change so long as we spend all our time and energy on this arms race with the spammers. We need a new solution, but relatively little effort is going into one, compared to effort going into improving content-based systems.
I'm inclined to agree - it's still valuable
Spammers serve a market. As spammers are forced to produce more and more obscure messages, their market must shrink, and their profits along with it. If costs also rise - e.g. because of laws and penalties - it may simply become uneconomic for many spammers. Furthermore, if flashy advertising is worth anything, filtering makes it virtually impossible for spammers to take advantage of.
I also think it's a myth that laws need to be global in order to be effective. If the U.S. enforced a sensible law, not all of the affected spammers would reappear elsewhere. Such success could also make filtering easier, and create a snowball effect in which other countries would be forced to follow suit to maintain access.
We don't have to eliminate every single spam message to achieve success. If we can manage this through market mechanisms, like filtering and law, we can preserve the open nature of email which made it so valuable in the first place.
Does Cloudmark count as "content-based"?
Isn't Cloudmark a commercial implementation of Razor? I believe you can still use this if you have some savvy. SpamAssassin by default looks for a Razor installation.
spell check anti spam
With all the spammers misspelling words intentionally to avoid filtering, why not check the body for misspelled words, and if there are more than 3 spelling errors, filter the mail to a folder? This would not only eliminate much spam but might also encourage good English skills amongst the Internet community.
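The rule proposed here is simple enough to sketch. The version below counts words that are missing from a dictionary and applies the "more than 3" threshold; the function names are invented, and any dictionary path passed to `load_dictionary` is an assumption that varies between systems. Note that it would also count names and jargon as misspellings.

```perl
use strict;
use warnings;

# Build a lookup table from a word list file, one word per line.
sub load_dictionary {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    my %dict;
    while ( my $word = <$fh> ) {
        chomp $word;
        $dict{ lc $word } = 1;
    }
    close $fh;
    return \%dict;
}

# Count the words in $body that are not in the dictionary.
sub count_misspellings {
    my ( $body, $dict ) = @_;
    my @words = $body =~ /([A-Za-z0-9']+)/g;
    return scalar( grep { !$dict->{ lc $_ } } @words );
}

# The proposed rule: more than 3 unknown words sends the mail to a folder.
sub looks_misspelled {
    my ( $body, $dict ) = @_;
    return count_misspellings( $body, $dict ) > 3 ? 1 : 0;
}

# A tiny inline dictionary keeps the example self-contained.
my $dict = { map { $_ => 1 } qw(buy cheap pills now) };
print looks_misspelled( 'Buy ch3ap p1lls n0w v1agra', $dict ) ? "folder\n" : "inbox\n";
```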
Secure SMTP
I believe the only way to solve spam is by limiting identity theft.
This can be achieved by doing the following:
Step 1: Change the SMTP-Auth protocol to require that all emails be signed with a self-generated or commercially generated certificate. This ensures that whitelist and blacklist systems will work, where they are now being circumvented.
Step 2: Create a number of internet "white pages" registers of email addresses and their associated public certificates. These directories have blacklist functionality for registered certificates that are found to be spamming. The onus of blocking spammers is then more on the registry than on the end user.
To register a new address, they can implement the following: charge a fee, impose a delay of 24 hours, only register individuals after checking proof of their social security number, etc.
Step 3: Set the SMTP server to only allow emails from addresses registered in these directories.
Step 4: Each user then only has to deal with the few individuals who are spamming them specifically.
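The receiving server's side of these steps could be sketched as follows. The registry is modelled as an in-memory hash with invented entries and field names, and real signature verification is replaced by a code ref, so this only illustrates the order of the checks, not a working protocol.

```perl
use strict;
use warnings;

# Stand-in for a "white pages" registry: address => certificate record.
# The entries and field names are invented for this sketch.
my %registry = (
    'alice@example.org'   => { cert => 'ALICE-CERT', blacklisted => 0 },
    'spammer@example.net' => { cert => 'SPAM-CERT',  blacklisted => 1 },
);

# Decide whether to accept a message. $verify is a code ref standing in for
# real signature verification; it receives the message and the registered
# certificate and returns true if the signature checks out.
sub accept_message {
    my ( $from, $message, $verify ) = @_;
    my $entry = $registry{ lc $from } or return 0;    # not registered
    return 0 if $entry->{blacklisted};                # registry blacklist
    return $verify->( $message, $entry->{cert} ) ? 1 : 0;    # signature check
}

my $always_valid = sub { 1 };
print accept_message( 'alice@example.org',   'hello', $always_valid ), "\n";
print accept_message( 'spammer@example.net', 'hello', $always_valid ), "\n";
```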