From September to
December, 2002, I wrote a series of articles and reviews
on anti-spam programs. If you missed them, you can find
them on the OPCUG web site in the Articles section
at
http://opcug.ca/Articles/2002.htm.
Since then, spam has continued to increase at a prodigious
rate and anti-spam options have continued to proliferate.
Last February, when
Microsoft released beta 2 of Office 2003, they finally
added decent anti-spam filtering to Outlook. I found it
to be at least as effective as my previous favourite -
iHateSpam from Sunbelt Software, and it seemed to work
more smoothly. I happily used that until a little over a
month ago.
Alan German did a review
of McAfee's SpamKiller (http://opcug.ca/Reviews/SpamKiller.htm) in the November issue of Ottawa
PC News. But there was nothing about SpamKiller that
would convince me to switch from the native Outlook 2003
filters.
So, what was it that made
me stop using Outlook's built-in filters? A truly
wonderful, free, open source program called SpamBayes.
As the name implies,
SpamBayes uses Bayesian statistics to determine what is
and isn't spam. It is based on algorithms developed by
Paul Graham and described in a white paper he penned in
August 2002, A Plan for Spam -
www.paulgraham.com/spam.html.
According to Webster,
Bayesian involves the application of Bayes' theorem and
the use of probabilities based on prior knowledge and
accumulated experience. In practice, a Bayesian anti-spam
filter predicts the likelihood that a new item is spam
based on how you have classified emails received in the past.
An anti-spam program that
works on Bayesian principles starts out making no
assumptions about spam. It should have no concept of "Viagra"
being any different from "airplane" or "fence". It
should have no preconceived notion that a particular
header attribute indicates that something is or is not
spam. Instead, you train the program by showing it what you
consider to be spam and what you do not. If done right,
such a system is highly accurate and produces very few
false positives, where a non-spam email is treated as spam.
And believe me, SpamBayes does it right!
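To make the idea concrete, here is a minimal sketch of a
Bayesian filter in Python. It is an illustration only, not
SpamBayes' actual code: the class name ToyBayesFilter, the
simple word tokenizer, and the smoothing are all my own
assumptions, and the real program's tokenizing and scoring
are more elaborate. The point is that an untrained filter
rates every word as neutral, and only your training shifts
the numbers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase word tokens in text (a deliberate simplification)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class ToyBayesFilter:
    """Hypothetical toy filter; names and formulas are illustrative assumptions."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each token
        self.ham_counts = Counter()   # number of non-spam messages containing each token
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Record the tokens of one message the user has classified."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        """Estimate the probability that a message containing token is spam."""
        # Add-one smoothing keeps unseen tokens at exactly 0.5 (no prior opinion).
        p_in_spam = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        p_in_ham = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return p_in_spam / (p_in_spam + p_in_ham)

    def spaminess(self, text):
        """Combine the token probabilities into a score from 0 to 100."""
        log_spam = 0.0
        log_ham = 0.0
        for token in tokenize(text):
            p = self.token_prob(token)
            log_spam += math.log(p)
            log_ham += math.log(1.0 - p)
        # Clamp the exponent so very long messages cannot overflow math.exp.
        diff = max(-700.0, min(700.0, log_ham - log_spam))
        return 100.0 / (1.0 + math.exp(diff))


toy = ToyBayesFilter()
print(toy.spaminess("viagra airplane fence"))  # untrained: every word is neutral
toy.train("cheap viagra now", is_spam=True)
toy.train("viagra discount offer", is_spam=True)
toy.train("fence repair quote for the airplane hangar", is_spam=False)
toy.train("meeting agenda for Tuesday", is_spam=False)
print(toy.spaminess("viagra offer"))
print(toy.spaminess("meeting about the fence"))
```

Before any training, the sketch has no opinion about
"viagra", "airplane" or "fence". After four training
messages, it rates the first test message as more likely
spam and the second as less likely.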
SpamBayes is available
for free on SourceForge at spambayes.sourceforge.net.
There are different packages available. If you use
Outlook 2000, XP, or 2003 (not Outlook Express), go for
the Outlook plug-in - it is by far the easiest to install
and operate. If you use any other mail program that
accesses a POP3 or IMAP mail server, you need to install
a copy of Python, the source code for SpamBayes (which
consists of Python scripts), and either a POP3 or IMAP proxy,
depending on how you get your email from your ISP's
server. All required files may be downloaded from the
SpamBayes site. I use Outlook 2003, so I went with the
Outlook plug-in. Note that while all versions use the
same algorithms to determine what is and isn't spam, the
way you install and interact with the program differs
considerably depending on the version.
The easiest way to get
SpamBayes working effectively is to make sure you have a
couple of training folders ready. You should have one or
more folders with 250 to 500 emails that you consider to
be spam and one or more folders with 250 to 500 emails
that you consider to be non-spam. To the extent possible,
you should ensure they represent all types of emails you
typically receive in each category - spam and non-spam.
Remember, you are training SpamBayes to understand what
you think is spam and non-spam.
The 250 to 500 numbers
are broad approximations. Using far fewer than 250 may cause
SpamBayes to make more errors initially. Using more than
500 is unlikely to make a big difference in the accuracy
of spam detection. It will not, however, be harmful to
use more than 500.
Installation was very
straightforward with only the standard prompts for the
location to install the program. The next time I started
Outlook, the SpamBayes Configuration wizard automatically
popped up to help me get things started.
As I had already prepared
training folders, I chose the option to do an immediate
training. SpamBayes chugged away for about ten
minutes analyzing the emails. SpamBayes looks at the
content of the body as well as header content and
attributes. After it finished training, it allowed me to
select the folders to use for items it is certain are
spam and for items that may be spam. I chose
folders called Junk E-mail and Junk Suspects.
You do not have to train
SpamBayes based on pre-existing emails - you can also
have SpamBayes "learn as you go". In this mode,
initially all emails are treated as possible spam and
moved to the Junk Suspects folder. As you correct
the program and classify emails as either spam or
non-spam,
SpamBayes learns and gets
more and more accurate.
Having a possible
spam category in addition to certain spam is
very nice and sets SpamBayes apart from many other anti-spam
programs. The world of spam is frequently not black and
white. Sometimes an email may contain some or even many
characteristics of emails you consider spam, yet not be
spam. It is a matter of degree.
SpamBayes rates new items
on a spaminess scale from 0 to 100. By default, if
an item reaches a spaminess rating of 90, it is treated
as certain spam and moved to the Junk E-mail
folder. If it has a spaminess rating between 15 and 90,
it is treated as possible spam and moved to the Junk
Suspects folder. While these values are
adjustable, you should use the program for a while with
the defaults before adjusting them and then only adjust
in small increments. Of the couple of dozen people I know
using SpamBayes, all are quite happy with the default
values.
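The routing rule described above can be sketched in a few
lines of Python. This is an illustration, not SpamBayes'
own code: the function name route and the constant names
are hypothetical, and how the program treats a score of
exactly 15 is my assumption. Only the default thresholds
of 90 and 15 and the folder names come from my own setup.

```python
# Default thresholds as described in this article.
CERTAIN_SPAM_THRESHOLD = 90
POSSIBLE_SPAM_THRESHOLD = 15


def route(score,
          certain=CERTAIN_SPAM_THRESHOLD,
          possible=POSSIBLE_SPAM_THRESHOLD):
    """Return the folder for a message with the given spaminess score (0 to 100)."""
    if score >= certain:
        return "Junk E-mail"      # certain spam
    if score >= possible:         # assumption: the lower boundary is inclusive
        return "Junk Suspects"    # possible spam
    return "Inbox"


for score in (97, 50, 3):
    print(score, route(score))
```

Raising the lower threshold sends fewer messages to Junk
Suspects at the risk of letting more spam into the Inbox,
which is one reason to adjust only in small increments.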
When the current folder is
anything other than the Junk E-mail or Junk
Suspects folders, the SpamBayes toolbar shows a Delete
As Spam button. If I find spam that was not caught,
I select the messages and click this button. SpamBayes analyses
them to add to its body of knowledge about spam
and then moves them to the Junk E-mail
folder.
If the current folder is
the Junk E-mail folder, the toolbar sports a Recover
from Spam button. If I find false positives in the Junk
E-mail folder, I select them and click this button.
SpamBayes analyses the messages so it will be less likely
to treat similar items as spam in the future. Then it
moves the items to the Inbox.
Finally, if my current
folder is the Junk Suspects folder, the toolbar
has both buttons, allowing me to classify items there as
either spam or non-spam.
Since the initial 10-minute
training session, I have not run into a single
false positive in the Junk E-mail folder.
Everything there has been spam. I know
people who get 300 spams per day who say the same thing.
Very impressive!
The Junk Suspects
folder typically does collect some spam and some
non-spam. This is to be expected, especially when first
using the program. Fortunately, not too many messages end
up here so they are pretty easy to deal with. As you
classify messages found here as either spam or non-spam,
SpamBayes learns from these actions and continually
improves.
SpamBayes does not have whitelists
(addresses that should never be treated as spammers) or blacklists
(addresses that should always be treated as spammers). My
experience bears out the assertion that these are really
not required in a Bayesian anti-spam program. But their
absence does lead to a curious effect.
Say you get the Daily
Dilbert and SpamBayes moves it to the Junk Suspects
folder (I hope SpamBayes would never move the Daily
Dilbert to the Junk E-mail folder!). Of course, you
use the toolbar button Recover from Spam to move
it back to your Inbox and train SpamBayes that this is
not actually spam.
But you may find that it
is caught the next day...and the next. At first I found
this sort of thing puzzling. But then it dawned on me.
Perhaps the first day, the Daily Dilbert got a
spaminess rating of 50. When I clicked Recover from
Spam, SpamBayes trained on it, and the next day it
thought it was less likely to be spam. Maybe then it got
a spaminess rating of 35. Then, because it was corrected
again, perhaps the third day it got a rating of 20.
Finally, on the fourth day, it fell below the rating of
15 and was no longer treated as possible spam.
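This gradual decline can be reproduced with a toy model
in Python. Everything in it is hypothetical: the tokens,
the starting counts, the scoring formula and the function
names are my own assumptions for illustration, not
SpamBayes internals, and the number of days it takes
depends entirely on those made-up counts. What it shows
is that each correction nudges the score down a little
rather than clearing the message in one step.

```python
import math


def spaminess(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Score a message from 0 to 100 using smoothed per-token spam/ham odds."""
    log_odds = 0.0
    for token in tokens:
        p_in_spam = (spam_counts.get(token, 0) + 1) / (n_spam + 2)
        p_in_ham = (ham_counts.get(token, 0) + 1) / (n_ham + 2)
        log_odds += math.log(p_in_spam / p_in_ham)
    log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in math.exp
    return 100.0 / (1.0 + math.exp(-log_odds))


def simulate_recover_from_spam(tokens, spam_counts, ham_counts, n_spam, n_ham,
                               possible_threshold=15.0, max_days=30):
    """Return the daily scores of one recurring message that the user keeps
    reclassifying as non-spam until it drops below possible_threshold."""
    ham_counts = dict(ham_counts)  # work on a copy; leave the caller's data alone
    scores = []
    for _ in range(max_days):
        score = spaminess(tokens, spam_counts, ham_counts, n_spam, n_ham)
        scores.append(score)
        if score < possible_threshold:
            break  # the message now lands in the Inbox
        # The user clicks "Recover from Spam": train on the message as non-spam.
        for token in tokens:
            ham_counts[token] = ham_counts.get(token, 0) + 1
        n_ham += 1
    return scores


# Made-up starting knowledge: 20 spam and 20 non-spam messages seen so far.
newsletter = ["dilbert", "free", "click", "subscribe"]
spam_seen = {"free": 12, "click": 10, "subscribe": 6}
ham_seen = {"free": 3, "click": 4, "subscribe": 5}

for day, score in enumerate(
        simulate_recover_from_spam(newsletter, spam_seen, ham_seen, 20, 20),
        start=1):
    print(f"day {day}: spaminess {score:.0f}")
```

With a larger body of existing training data, each single
correction carries less weight, so a borderline newsletter
can take several days to settle.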
SpamBayes
allows you to see the spaminess rating of any email. I
found it somewhat interesting to look at the ratings,
particularly for those emails that ended up in the Junk
Suspects folder.
So far, after over a
month using SpamBayes, I am very impressed with it. I
have found the Outlook plug-in to be very easy to install
and use. It has not caused any problems with Outlook and
has been amazingly effective at getting spam out of my
Inbox. Well over 95% of spam ends up being shuttled
directly to the Junk E-mail folder. Virtually all
remaining spam has gone to the Junk Suspects
folder. Absolutely no non-spams have ended up in
the Junk E-mail folder and well under 1% of
non-spams have been sent to the Junk Suspects
folder. I am almost to the point where I will delete the
contents of the Junk E-mail folder without
reviewing to look for false positives.
System Requirements:
For the Outlook plug-in, you
need Outlook 2000, XP, or 2003 and Windows 98 or better.
You need no additional software.
For use with other email
programs, you will need to install a copy of Python,
the source files for SpamBayes (which are Python script
files), and either a POP3 or IMAP proxy. Links to all
required files are at the SpamBayes site on SourceForge
at spambayes.sourceforge.net.
Bottom Line:
SpamBayes (Freeware/Open Source)
Released under the Python Software Foundation license
SpamBayes
http://spambayes.sourceforge.net
Originally published: January, 2004