Article:   Bayesian Filtering with bogofilter and Sylpheed Claws
Subject:   Errata and More Info
Date:   2003-02-20 12:33:29
From:   oktaya
After the article was published here, I was contacted by David Relson from the bogofilter team.


Below is a verbatim copy of part of the email correspondence between us.


"You mention ESR as the author of the program. He started it last August,
worked with it for a few weeks, and has been busy with other projects since
releasing version 0.7 in early September. Others have been doing the work
since then, with a list being in the AUTHORS file.


The latest version of bogofilter is 0.9.0.5. That was true at the end of
November. From early December through late January, version 0.9.1.2 was
the stable version. More recently 0.10.1.2 has been released with some
significant new capabilities including mime processing and decoding
(base64, quoted-printable, uuencode), the discarding of html comments, and
improved database (wordlist) access and locking.


Your article has reversed the return codes for spam and ham. The unix
convention is to return 0 to indicate program success. Since bogofilter's
purpose is to detect spam, 0 is used to indicate spam and 1 for ham.


When the article switches from registering spam to registering ham, it says
"substitute the -s option for the -n option". I had to reread this a
couple of times because registering ham uses the "-n". Had you worded it
as "change the -s option to a -n option", it would have been much clearer
to me.


There seems to be a problem using your maildir with bogofilter. The "ls"
command shows 13 files in the maildir and the output of "cat * | bogofilter
-s" shows "93861 words, 3 messages". There's a discrepancy between file
and message count. When registering messages (the -s and -n options),
bogofilter expects mailbox formatted input - with "^From " lines separating
the messages. As a guess, the files in your maildir don't have the message
headers, so "cat *" doesn't produce a properly formatted mailbox. Of
course, using the "for" statement (as you have done), works correctly.


Lastly, bogolearn.sh will cause bogofilter to train and retrain on the
messages in the spam and nonspam directories. The best practice is to
remove messages after they have been used for training. Again, using the
"cat" command with a maildir is a bad idea."

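The file-versus-message discrepancy David describes can be checked before training. Below is a minimal sketch that counts the files in a mail folder and the "From " separator lines that "cat *" would feed to bogofilter. The helper names count_files and count_from_lines are my own and are not part of bogofilter. If the two numbers differ, the concatenated files are not a properly formatted mailbox.

```shell
#!/bin/sh
# Compare the number of message files in a folder with the number of mbox
# "From " separator lines found in their concatenation.
# count_files and count_from_lines are illustrative helper names only.

count_files() {
    # Print the number of regular files directly inside directory $1.
    n=0
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

count_from_lines() {
    # Print the number of lines starting with "From " across all files in $1.
    # grep -c exits non-zero when the count is 0, so that status is ignored.
    cat "$1"/* 2>/dev/null | grep -c '^From ' || true
}

# If a directory is given as the first argument, report both counts.
if [ -n "${1:-}" ] && [ -d "$1" ]; then
    echo "files: $(count_files "$1"), From-lines: $(count_from_lines "$1")"
fi
```

Saved under a name of your choice and run with the spam folder as its argument, a lower From-line count than file count suggests that feeding the files one at a time is the safer route.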

He also encourages people to join the project mailing lists at bogofilter@aotto.com and bogofilter-dev@aotto.com. More information can be found on the project website.
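
To make the return-code correction from the email concrete, here is a small sketch that turns bogofilter's exit status into a label: 0 for spam and 1 for ham, as David describes. The helper name label_for_status is my own invention, and any other status is simply reported as "other", since the email does not cover those values.

```shell
#!/bin/sh
# Map a bogofilter exit status to a label, using the values from the email:
# 0 means spam, 1 means ham. Other values are not covered by the email and
# are reported as "other". label_for_status is an illustrative name only.

label_for_status() {
    case "$1" in
        0) echo "spam" ;;
        1) echo "ham" ;;
        *) echo "other ($1)" ;;
    esac
}

# If a message file is given and bogofilter is installed, classify it.
if [ -n "${1:-}" ] && command -v bogofilter >/dev/null 2>&1; then
    status=0
    bogofilter < "$1" || status=$?   # a non-zero status is expected for ham
    label_for_status "$status"
fi
```

Note that a plain "if bogofilter < message" test is true for spam, not for ham, which is the opposite of what the article implied.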


And here's an updated bogolearn script that doesn't use 'cat':
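#!/bin/sh
# Register every message in the spam and non-spam folders with bogofilter,
# one file at a time, instead of piping "cat *" into it.

BOGOFILTER="/usr/bin/bogofilter"
GOODDIR="/root/Mail/NONSPAM"
SPAMDIR="/root/Mail/SPAM"

# Register each file in the spam folder as spam (-s).
cd "$SPAMDIR" || exit 1
echo "Spam:"
for i in *
do
    echo "Processing Mail ID #$i"
    "$BOGOFILTER" -s -v < "$i"
done

# Register each file in the non-spam folder as ham (-n).
cd "$GOODDIR" || exit 1
echo "NonSpam:"
for i in *
do
    echo "Processing Mail ID #$i"
    "$BOGOFILTER" -n -v < "$i"
done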
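
David's last point, that messages should be removed once they have been used for training, is not handled by the script above. One possible approach, sketched here, is to move each message into a "trained" subfolder after it has been registered. The function name train_dir, the "trained" folder name and the TRAINER variable are my own assumptions, and the sketch also assumes that bogofilter exits with status 0 when a registration succeeds, which you should verify against your version.

```shell
#!/bin/sh
# Train on each message once, then move it into a "trained" subdirectory so
# that a later run does not register it again.
# ASSUMPTIONS: train_dir, the "trained" directory name and the TRAINER
# variable are hypothetical; bogofilter is assumed to exit 0 on a
# successful registration.
TRAINER="${TRAINER:-bogofilter}"

train_dir() {
    # $1 = directory of messages, $2 = bogofilter flag (-s spam, -n ham)
    dir=$1
    flag=$2
    mkdir -p "$dir/trained" || return 1
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue   # skips the "trained" directory itself
        if "$TRAINER" "$flag" < "$msg"; then
            mv "$msg" "$dir/trained/"
        else
            echo "training failed, leaving in place: $msg" >&2
        fi
    done
}

# Example calls, using the folders from the script above:
# train_dir /root/Mail/SPAM -s
# train_dir /root/Mail/NONSPAM -n
```

Moving rather than deleting keeps the messages around in case the wordlist ever needs to be rebuilt from scratch.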



Cheers..


Oktay Altunergil
oktay@optonline.net


PS: Sylpheed-Claws now has a spam plugin that uses SpamAssassin. I'm hoping somebody will add bogofilter support as well.