Ottawa PC Users' Group, Inc.
 Product Review 


SpamBayes
by
Chris Taylor

From September to December 2002, I wrote a series of articles and reviews on anti-spam programs. If you missed them, you can find them on the OPCUG web site in the Articles section at http://opcug.ca/public/Articles/2002.htm. Since then, spam has continued to increase at a prodigious rate and anti-spam options have continued to proliferate.

Last February, when Microsoft released beta 2 of Office 2003, they finally added decent anti-spam filtering to Outlook. I found it to be at least as effective as my previous favourite - iHateSpam from Sunbelt Software, and it seemed to work more smoothly. I happily used that until a little over a month ago.

Alan German did a review of McAfee's SpamKiller (http://opcug.ca/public/Reviews/SpamKiller.htm) in the November issue of Ottawa PC News. But there was nothing about SpamKiller that would convince me to switch from the native Outlook 2003 filters.

So, what was it that made me stop using Outlook's built-in filters? A truly wonderful, free, open source program called SpamBayes.

As the name implies, SpamBayes uses Bayesian statistics to determine what is and isn't spam. It is based on algorithms developed by Paul Graham and described in a white paper he penned in August 2002, A Plan for Spam - www.paulgraham.com/spam.html.

According to Webster, Bayesian involves the application of Bayes' theorem and the use of probabilities based on prior knowledge and accumulated experience. Basically, Bayesian anti-spam predicts the likelihood of a new item being spam based upon your assessment of emails received in the past.

An anti-spam program that works on Bayesian principles starts out making no assumptions about spam. It should have no concept of Viagra being any different from airplane or fence. It should have no preconceived notion that a particular header attribute indicates that something is or isn't spam. You train a Bayesian anti-spam program on what you consider to be spam. If done right, such a system is highly accurate, with very few false positives (non-spam emails treated as spam). And believe me, SpamBayes does it right!
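
To make the idea concrete, here is a minimal Python sketch of a filter that starts with no knowledge and learns word statistics only from messages you label. It is an illustration under stated assumptions, not SpamBayes's actual code: the class name TinyBayesFilter, the tokenizer, and the smoothing constants are all made up for this example, and the real program also examines headers and uses a more refined scoring scheme.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy word-frequency spam filter (hypothetical; not SpamBayes's code)."""

    def __init__(self):
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.ham_counts = Counter()   # word -> number of non-spam messages containing it
        self.n_spam = 0               # spam messages trained so far
        self.n_ham = 0                # non-spam messages trained so far

    @staticmethod
    def tokenize(text):
        # Lower-cased set of words, so each word counts once per message.
        return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

    def train(self, text, is_spam):
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def word_prob(self, word):
        # Estimated probability that a message containing this word is spam.
        spam_n = self.spam_counts[word]
        ham_n = self.ham_counts[word]
        seen = spam_n + ham_n
        if seen == 0:
            return 0.5  # never-seen word: no evidence either way
        spam_rate = spam_n / max(self.n_spam, 1)
        ham_rate = ham_n / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Pull the estimate toward 0.5 when the word has rarely been seen.
        return (0.5 + seen * p) / (1 + seen)

    def score(self, text):
        # Combine per-word probabilities in log-odds space; result is 0..100.
        log_odds = 0.0
        for word in self.tokenize(text):
            p = self.word_prob(word)
            log_odds += math.log(p) - math.log(1 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # keep exp() in range
        return 100 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    f = TinyBayesFilter()
    f.train("cheap pills buy now", is_spam=True)
    f.train("meeting agenda for tuesday", is_spam=False)
    print(f.score("buy cheap pills"))     # leans toward spam
    print(f.score("agenda for meeting"))  # leans toward non-spam
```

Note that nothing in the sketch knows in advance that any particular word is suspicious. Every judgement comes from the messages passed to train().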

SpamBayes is available for free on SourceForge at spambayes.sourceforge.net. There are different packages available. If you use Outlook 2000, XP, or 2003 (not Outlook Express), go for the Outlook plug-in - it is by far the easiest to install and operate. If you use any other mail program that accesses a POP3 or IMAP mail server, you need to install a copy of Python, the source code for SpamBayes (a set of Python scripts), and either a POP3 or IMAP proxy, depending on how you get your email from your ISP's server. All required files may be downloaded from the SpamBayes site. I use Outlook 2003, so I went with the Outlook plug-in. Note that while all versions use the same algorithms to determine what is and isn't spam, the way you install and interact with the program differs considerably depending on the version.

The easiest way to get SpamBayes working effectively is to make sure you have a couple of training folders ready. You should have one or more folders with 250 to 500 emails that you consider to be spam and one or more folders with 250 to 500 emails that you consider to be non-spam. To the extent possible, you should ensure they represent all types of emails you typically receive in each category - spam and non-spam. Remember, you are training SpamBayes to understand what you think is spam and non-spam.

The 250 to 500 numbers are broad approximations. Using far fewer than 250 may cause SpamBayes to make more errors initially. Using more than 500 is unlikely to make a big difference in the accuracy of spam detection, but it will not do any harm.

Installation was very straightforward with only the standard prompts for the location to install the program. The next time I started Outlook, the SpamBayes Configuration wizard automatically popped up to help me get things started.

As I had already prepared training folders, I chose the option to do an immediate training.  SpamBayes chugged away for about ten minutes analyzing the emails. SpamBayes looks at the content of the body as well as header content and attributes. After it finished training, it allowed me to select the folders to use for items it is certain are spam and for items that may be spam. I chose folders called Junk E-mail and Junk Suspects.

You do not have to train SpamBayes based on pre-existing emails - you can also have SpamBayes "learn as you go". In this mode, initially all emails are treated as possible spam and moved to the Junk Suspects folder. As you correct the program and classify emails as either spam or non-spam, SpamBayes learns and gets more and more accurate.

Having a possible spam category in addition to certain spam is very nice and sets SpamBayes apart from many other anti-spam programs. The world of spam is frequently not black and white. Sometimes an email may contain some or even many characteristics of emails you consider spam, yet not be spam. It is a matter of degree.

SpamBayes rates new items on a spaminess scale from 0 to 100. By default, if an item reaches a spaminess rating of 90, it is treated as certain spam and moved to the Junk E-mail folder. If it has a spaminess rating between 15 and 90, it is treated as possible spam and moved to the Junk Suspects folder. While these values are adjustable, you should use the program for a while with the defaults before adjusting them and then only adjust in small increments. Of the couple of dozen people I know using SpamBayes, all are quite happy with the default values.
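
The default thresholds described above amount to a simple three-way decision, sketched here in Python. The function name route and the constant names are hypothetical, chosen for this example rather than taken from SpamBayes. The folder names are the ones I chose during setup, and "Inbox" stands in for leaving the message where it is.

```python
SPAM_CUTOFF = 90    # default: at or above this, treat as certain spam
UNSURE_CUTOFF = 15  # default: at or above this (but below 90), possible spam


def route(spaminess, spam_cutoff=SPAM_CUTOFF, unsure_cutoff=UNSURE_CUTOFF):
    """Return the destination folder for a score on the 0 to 100 scale."""
    if not 0 <= spaminess <= 100:
        raise ValueError("spaminess must be between 0 and 100")
    if spaminess >= spam_cutoff:
        return "Junk E-mail"
    if spaminess >= unsure_cutoff:
        return "Junk Suspects"
    return "Inbox"


if __name__ == "__main__":
    for rating in (97, 50, 3):
        print(rating, "->", route(rating))
```

Raising unsure_cutoff sends fewer messages to Junk Suspects but lets more borderline spam stay in the Inbox, which is one reason to adjust only in small steps.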

When the current folder is anything other than the Junk E-mail or Junk Suspects folders, the SpamBayes toolbar shows a Delete As Spam button. If I find spams that were not caught, I select them and click this button. SpamBayes analyses the messages to add to its body of knowledge about spam and then moves them to the Junk E-mail folder.

If the current folder is the Junk E-mail folder, the toolbar sports a Recover from Spam button. If I find false positives in the Junk E-mail folder, I select them and click this button. SpamBayes analyses the messages so it will be less likely to treat similar items as spam in the future. Then it moves the items to the Inbox.

Finally, if my current folder is the Junk Suspects folder, the toolbar has both buttons, allowing me to classify items there as either spam or non-spam.

Since the initial ten-minute training session, I have not run into a single false positive in the Junk E-mail folder. Everything there has been spam. I know people who get 300 spams per day who say the same thing. Very impressive!

The Junk Suspects folder typically does collect some spam and some non-spam. This is to be expected, especially when first using the program. Fortunately, not too many messages end up here so they are pretty easy to deal with. As you classify messages found here as either spam or non-spam, SpamBayes learns from these actions and continually improves. 

SpamBayes does not have whitelists (addresses that should never be treated as spammers) or blacklists (addresses that should always be treated as spammers). My experience bears out the assertion that these are really not required in a Bayesian anti-spam program. But their absence does lead to one curious effect.

Say you get the Daily Dilbert and SpamBayes moves it to the Junk Suspects folder (I hope SpamBayes would never move the Daily Dilbert to the Junk E-mail folder!). Of course, you use the Recover from Spam toolbar button to move it back to your Inbox and train SpamBayes that this is not actually spam.

But you may find that it is caught the next day...and the next. At first I found this sort of thing puzzling. But then it dawned on me. Perhaps on the first day, the Daily Dilbert got a spaminess rating of 50. When I clicked Recover from Spam, SpamBayes trained on it, so the next day it considered the message less likely to be spam. Maybe then it got a spaminess rating of 35. Then, because it was corrected again, perhaps on the third day it got a rating of 20. Finally, on the fourth day, it fell below the rating of 15 and was no longer treated as possible spam.
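
This gradual slide can be imitated with a toy word-frequency model in Python. Everything in the sketch is an assumption made for illustration: the ToyFilter class, its scoring formula, and the tiny made-up training messages are not SpamBayes's own, and the numbers it prints will not match the hypothetical ratings above. The point is only that each correction nudges the rating down a little rather than clearing the message at once.

```python
import math
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a Bayesian filter; not SpamBayes's code."""

    def __init__(self):
        # For each class (True = spam), how many messages contain each word.
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}  # messages trained per class

    def train(self, text, is_spam):
        self.counts[is_spam].update(set(text.lower().split()))
        self.totals[is_spam] += 1

    def spaminess(self, text):
        log_odds = 0.0
        for word in set(text.lower().split()):
            s = self.counts[True][word]
            h = self.counts[False][word]
            if s + h == 0:
                continue  # never-seen word: no evidence either way
            s_rate = s / max(self.totals[True], 1)
            h_rate = h / max(self.totals[False], 1)
            p = s_rate / (s_rate + h_rate)
            f = (0.5 + (s + h) * p) / (1 + s + h)  # shrink thin evidence toward 0.5
            log_odds += math.log(f / (1 - f))
        return 100 / (1 + math.exp(-log_odds))


def simulate(days=8, unsure_cutoff=15):
    """Score the same newsletter daily, correcting the filter while it is caught."""
    filt = ToyFilter()
    for _ in range(3):  # contrived starting knowledge
        filt.train("your daily free offer", is_spam=True)
        filt.train("your daily club news", is_spam=False)
    newsletter = "your daily dilbert free"
    history = []
    for _ in range(days):
        score = filt.spaminess(newsletter)
        history.append(score)
        if score >= unsure_cutoff:
            filt.train(newsletter, is_spam=False)  # the Recover from Spam click
    return history


if __name__ == "__main__":
    for day, score in enumerate(simulate(), start=1):
        print(f"day {day}: spaminess {score:.1f}")
```

In this toy, the word shared with earlier spam keeps pulling the rating up while each correction adds a little more non-spam evidence, so it takes several rounds before the newsletter drops under the cutoff.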

SpamBayes allows you to see the spaminess rating of any email. I found it somewhat interesting to look at the ratings, particularly for those emails that ended up in the Junk Suspects folder.

So far, after over a month using SpamBayes, I am very impressed with it. I have found the Outlook plug-in to be very easy to install and use. It has not caused any problems with Outlook and has been amazingly effective at getting spam out of my Inbox. Well over 95% of spam ends up being shuttled directly to the Junk E-mail folder. Virtually all remaining spam has gone to the Junk Suspects folder.  Absolutely no non-spams have ended up in the Junk E-mail folder and well under 1% of non-spams have been sent to the Junk Suspects folder. I am almost to the point where I will delete the contents of the Junk E-mail folder without reviewing to look for false positives.

System Requirements:
For the Outlook plug-in, you need Outlook 2000, XP, or 2003 and Windows 98 or better. You need no additional software.

For use with other email programs, you will need to install a copy of Python, the source files for SpamBayes (which are Python script files), and either a POP3 or IMAP proxy. Links to all required files are at the SpamBayes site on SourceForge at spambayes.sourceforge.net.


Bottom Line:

SpamBayes
Freeware - open source, released under the Python Software Foundation license.
from SpamBayes
Web site: http://spambayes.sourceforge.net


Copyright and Usage
Ottawa Personal Computer Users' Group (OPCUG), Inc.
3 Thatcher Street, Ottawa, ON  K2G 1S6

The opinions expressed in these reviews may not necessarily
represent the views of the OPCUG or its members.