Linux DevCenter    
 Published on Linux DevCenter (http://www.linuxdevcenter.com/)


Bayesian Filtering with bogofilter and Sylpheed Claws

by Oktay Altunergil
01/30/2003

In August 2002, Paul Graham published a paper suggesting that Bayes' probability theorem (see Resources) could be applied to the spam emails we receive. The gist of Graham's paper is that each word in the emails you receive, including the words that make up the email headers, carries a spam value between 0 and 1. This number is calculated by studying a large number of emails that are known to be spam and another set of emails that are known to be legitimate. If a particular word appears only in spam emails, there is a high probability that the next time you see this word in an email message, it will be part of a spam message. Similarly, if a word, such as your secret nickname that only a few people know or the From: address of a coworker, tends to appear only in good emails, a message containing that word is more likely to be legitimate.

Of course, we should score all of the words in a message and combine the scores into a single "spam probability value" for the whole message. That way an email from a friend trying to let you know about "a great business opportunity !!" does not go into your trash bin, and a spam email about "how to copy your DVDs" doesn't go into your good email folder just because it addressed you by your first name.
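The scoring idea described above can be sketched in a few lines of shell and awk. This is only a simplified illustration and not the calculation bogofilter actually performs: the function names word_spam_prob and combine_probs are invented for this example, the per-word value is a plain ratio of counts, and the combining rule follows the product form that Graham's paper describes.

```sh
#!/bin/sh
# Simplified illustration only; the names and formulas here are
# assumptions for this example, not bogofilter's internals.

# word_spam_prob SPAM_COUNT GOOD_COUNT
# Fraction of a word's sightings that were in spam, kept inside
# 0.01..0.99 so that no single word is treated as absolute proof.
word_spam_prob() {
    awk -v s="$1" -v g="$2" 'BEGIN {
        if (s + g == 0) { print "0.50"; exit }   # never seen: neutral
        p = s / (s + g)
        if (p < 0.01) p = 0.01
        if (p > 0.99) p = 0.99
        printf "%.2f\n", p
    }'
}

# combine_probs P1 P2 ...
# Combine per-word values into one score for the whole message:
# product(p) / (product(p) + product(1 - p)).
combine_probs() {
    awk 'BEGIN {
        spam = 1; good = 1
        for (i = 1; i < ARGC; i++) {
            spam *= ARGV[i]
            good *= 1 - ARGV[i]
        }
        printf "%.2f\n", spam / (spam + good)
    }' "$@"
}
```

With these definitions, a word seen nine times in spam and once in good mail gives word_spam_prob 9 1 a value of 0.90, and combine_probs 0.9 0.8 yields 0.97. A strongly "spammy" word and a strongly "good" word cancel out: combine_probs 0.9 0.1 yields 0.50.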

What makes Bayesian filtering special is that false positives -- legitimate emails marked as spam -- are very rare. As Graham points out, spammers can fool every system we put in place, but they still have to deliver their commercial message. This message is exactly what causes them to shoot themselves in the foot. It is trivial to recognize spam email if you take a quick look at the subject and the message body. This action can be emulated very successfully using a Bayesian filter that learns on your behalf and applies what it has learned to your future emails. If you notice the filter making a mistake, you can teach it not to make the same mistake again. After a very short while, the filter will be almost bulletproof.

Shortly after Graham's article appeared, a number of people implemented spam filters that use Bayesian algorithms. For this article we will look at bogofilter, written by Eric S. Raymond. We have chosen bogofilter because of its speed, which comes from its being written in C and using Berkeley DB for storage instead of a plain text file. As long as we're picking software based on speed, it only makes sense to pick Sylpheed (of the "claws" variety) as the email client to demonstrate bogofilter. (See my previous article about Sylpheed and Sylpheed claws.)

Installing bogofilter

It's fairly simple to configure and install bogofilter. You can either download the latest source package or find a package for your operating system. At the time of writing, the latest version is 0.9.0.5, which is available as an RPM or as a FreeBSD package. The Gentoo distribution also has an ebuild for it in its Portage package collection.

If you are installing from the source package, download it into a temporary directory, decompress it, and run ./configure && make in the uncompressed source directory, followed by make install as root. Incidentally, these are the generic instructions for configuring, compiling and installing a source package on Unix and Linux systems. If something goes wrong, I suggest asking somebody with adequate experience for assistance. Usually everything will go as planned, and the installation procedure will create the program binaries and put them in /usr/bin/. It will also create a sample configuration file (with which you need not concern yourself) and place it in the /etc directory.

By default, bogofilter keeps its data in two database files called goodlist.db and spamlist.db. These files are stored in a .bogofilter directory in the user's home directory. You need not create the directory or the files explicitly, since bogofilter creates them when you train it.
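Once some training has been done, a quick look in that directory shows whether both files are in place. A small sketch follows; list_wordlists is a hypothetical helper name, and the default path is simply the .bogofilter directory mentioned above.

```sh
#!/bin/sh
# list_wordlists [DIR]
# Report which of the two database files exist in DIR
# (default: the .bogofilter directory in the home directory).
list_wordlists() {
    dir="${1:-$HOME/.bogofilter}"
    for db in goodlist.db spamlist.db; do
        if [ -f "$dir/$db" ]; then
            echo "$db: present"
        else
            echo "$db: missing"
        fi
    done
}

# Example: list_wordlists
```

Seeing only spamlist.db reported as present would simply mean that no non-spam messages have been registered yet.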

Training bogofilter

As mentioned above, bogofilter, like all other Bayesian filters, does its magic based on the principles of probability. For this reason you need an archive of spam and non-spam emails. The more emails you have gathered, the more finely tuned your filter will be. I normally just ignore spam emails instead of deleting them, so it wasn't very difficult for me to find hundreds of spam emails in my incoming email directory in Sylpheed. We will create two mail directories in Sylpheed and call one of them SPAM and the other NONSPAM. If you clean out your regular incoming email directory by removing each and every spam message, you can do without a dedicated NONSPAM directory. If you choose to do so, make sure you keep this incoming directory free of spam in the future too.

Before starting to train bogofilter, make sure there are at least 100 emails in each folder. This should give you a reasonable quantity and variety. If you don't have enough spam messages (if you delete them as you receive them or if you don't receive any -- those were the days!), you can download a batch of spam messages from a Bayesian spam filtering web site. I recommend against doing this, since every individual receives a different variety of spam messages, and what looks like spam to somebody else might actually be something you regularly receive as good mail. (Many people confuse spam with emails they once asked to receive but don't want anymore.) You will find that the spam accumulated over a few days is enough to tune your filter. Better yet, keep training it as you go along. The result is a highly customized personal filter that will allow bogofilter to think and act just like you would.
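A quick way to check whether each folder has reached the 100-message mark is to count the message files. The following is a minimal sketch; count_messages is a name invented for this example, and it assumes that each message is stored as its own file with a purely numeric name, which is how Sylpheed lays out its mail folders.

```sh
#!/bin/sh
# count_messages DIR
# Count the regular files in DIR whose names consist only of digits
# (one file per message); anything else in the folder is ignored.
count_messages() {
    n=0
    for f in "$1"/*; do
        name=${f##*/}
        case "$name" in
            ''|*[!0-9]*) continue ;;   # skip non-numeric names
        esac
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example:
#   for d in SPAM NONSPAM; do
#       echo "$d: $(count_messages "$HOME/Mail/$d")"
#   done
```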

We will start by training bogofilter to recognize spam words. To do this, we will start a shell and go into the SPAM directory. By default, Sylpheed keeps its emails in a Mail directory in the user's home directory. The SPAM directory inside it contains all of our spam messages, each in its own file. Sylpheed uses an identifying number for each filename. The directory resembles:

grog oktay # cd ~/Mail/SPAM
grog SPAM # ls
1    108  117  126  135  144  153  162  171  180  19   199  26

We will need to feed the whole text of each message, header and body, into the bogofilter command and mark it as spam by using the -s option. Since the number of messages is irrelevant to the Bayesian algorithm, we can run the command in one of two ways.

The following command feeds all spam messages into bogofilter at once. The -v option increases the verbosity of the command and prints out some useful information.

grog SPAM # cat * | bogofilter -s -v
# 93861 words, 3 messages

We can also invoke bogofilter once per message and have it process the messages individually, as can be seen from the partial output below.

grog SPAM # for i in *; do echo "Processing Mail ID #$i"; \
	bogofilter -s -v < "$i"; done
Processing Mail ID #1
# 279 words, 1 message
Processing Mail ID #10
# 113 words, 1 message
Processing Mail ID #100
# 498 words, 1 message
Processing Mail ID #101
# 685 words, 1 message
Processing Mail ID #102

Whichever method you use, bogofilter will create the .bogofilter directory as well as a spamlist.db database file. Please do not try to read or edit this file or the goodlist.db file directly, as they are both in a binary format. Repeat the above steps in the ~/Mail/NONSPAM directory to create the non-spam list database. Since these are non-spam files, you will need to replace the -s option with the -n option, so that the command is now bogofilter -n -v. If everything goes as planned, you will now have both the good words list goodlist.db and the spam words list spamlist.db. We're ready to filter out spam.

Marrying bogofilter to Sylpheed

If you run bogofilter manually on a piece of text (i.e., an email message), it will exit with a status of 0 if it judges the message to be spam and 1 if it judges it to be good. However, it would be inconvenient to run this command manually for every email that we receive. Instead, we will configure Sylpheed to run the command on our behalf each time it receives an email, before delivering the message to the appropriate directory. In Sylpheed-claws, this is done by selecting Configuration from the menu and clicking on Filtering. There are three fields to fill in. The first field is the Condition. Here we execute bogofilter with the current incoming email. Enter the following line:

execute "/usr/bin/bogofilter < %F"

The second field determines which action to take if the email is found to be spam. I recommend leaving this at Move, which moves the spam email to the SPAM folder. You could also Delete the email, or just mark it as spam and deliver it as usual, but I don't recommend either. If you choose Move as the action, you should also specify the mail directory to which the messages will be moved. Using the Select... button, choose the SPAM folder we created earlier. Finally, activate the new filtering rule by clicking Register. Figure 1 shows what the filtering rule should look like.


Figure 1 -- the filtering configuration window
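Before relying on the rule, it can be reassuring to try the same test by hand on a message or two. The sketch below is an illustration under one assumption that is worth checking against the manual page for your version: that bogofilter exits with status 0 for spam and 1 for non-spam. classify_message is a name made up for this example.

```sh
#!/bin/sh
# classify_message FILE
# Run bogofilter on one message and translate its exit status.
# Assumed convention: 0 = spam, 1 = not spam; anything else is
# reported as an error.
classify_message() {
    bogofilter < "$1"
    case $? in
        0) echo "spam" ;;
        1) echo "good" ;;
        *) echo "error"; return 1 ;;
    esac
}

# Example: classify_message "$HOME/Mail/inbox/42"
# (the inbox path and message number are placeholders)
```

A condition that counts as met when the command exits with status 0 would fit this convention, which is why the rule can act directly on messages that bogofilter reports as spam.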

Keeping bogofilter Sharp

The configuration we have implemented so far will probably catch more spam than you might expect. However, the key to success is keeping bogofilter on its toes at all times. Keep training the filter so that it can deal with new types of spam messages and keep identifying non-spam messages for years to come. It would be really convenient to have a "register as spam" button in every email client. In the future they will probably have one. For now, we have to emulate this functionality ourselves. It's really pretty simple.

We will manually move any spam messages that bogofilter misses to the SPAM directory. After you do this, make bogofilter process the messages by running it with the -s option again. It would be too much work to do this by hand, so we will create a cron job that automates the process. This way we can keep moving spam messages to the SPAM folder as we receive them (effectively scheduling them to be marked as spam) and rely on the cron job to take care of the rest for us. You might also want to copy a bunch of good emails into the NONSPAM directory every once in a while, since non-spam words need to be kept up to date as well. Here's what a typical script to train bogofilter every day may look like:

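#!/bin/sh

# /home/oktay/bin/bogolearn.sh
# train bogofilter with new spam and non-spam
# user is assumed to be 'oktay'

BOGOFILTER="/usr/bin/bogofilter"
GOODDIR="/home/oktay/Mail/NONSPAM"
SPAMDIR="/home/oktay/Mail/SPAM"

# register everything in the SPAM folder as spam;
# stop if the folder is missing rather than train on the wrong files
cd "$SPAMDIR" || exit 1
cat * | "$BOGOFILTER" -s

# register everything in the NONSPAM folder as non-spam
cd "$GOODDIR" || exit 1
cat * | "$BOGOFILTER" -n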

The following crontab entry will make this script run every morning at 3:30.

30 3 * * *  /home/oktay/bin/bogolearn.sh
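One thing to keep in mind about the script above is that it feeds the entire contents of both folders to bogofilter on every run. Assuming bogofilter adds to its word counts each time a message is registered, older messages would be counted again every night. A possible variation, sketched below, remembers which messages have already been registered and skips them. The function name train_new and the idea of a separate log file are assumptions made for this example, not part of bogofilter; the log records a checksum for each message, so a message is recognized even if its file is renumbered.

```sh
#!/bin/sh
# train_new DIR FLAG LOG
# Register only the messages in DIR that have not been seen before.
# FLAG is -s (spam) or -n (non-spam). LOG is a plain text file,
# maintained by this function, that records one checksum line per
# message already handed to bogofilter.
train_new() {
    touch "$3"
    for msg in "$1"/*; do
        [ -f "$msg" ] || continue
        sum=$(cksum < "$msg")          # prints checksum and size
        if grep -qxF "$sum" "$3"; then
            continue                   # registered on an earlier run
        fi
        bogofilter "$2" < "$msg"
        echo "$sum" >> "$3"
    done
}

# Example (the log file locations are placeholders):
#   train_new /home/oktay/Mail/SPAM    -s /home/oktay/.bogolearn-spam
#   train_new /home/oktay/Mail/NONSPAM -n /home/oktay/.bogolearn-good
```

The trade-off is one extra file per folder to keep around; deleting a log file simply causes that folder to be registered from scratch on the next run.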

That is all there is to it. You will see that your filter gets better and better every day. You might even start hoping to receive more spam just to see how cool bogofilter is.

Resources

Oktay Altunergil works for a national web hosting company as a developer concentrating on web applications on the Unix platform.


Copyright © 2009 O'Reilly Media, Inc.