Automated Score Tuning for SpamAssassin

2006/01/18

While building an institute-wide spam filter based on SpamAssassin, we collected a set of 77,286 ham and spam mails from eight users, developed our own training methods and evaluated them against commercial and free spam filtering systems, including Symantec BrightMail 6. A more recent report puts the work in a larger context, adds new empirical results and proposes a convenient approach to mail data collection.

Scripts to train SpamAssassin

One of the outcomes of this research was a way to adapt SpamAssassin to a specific mail collection. Unlike sa-learn, which only trains the naive Bayes model, SA-Train.pl additionally learns a set of score values via a linear SVM. I found out early that a single set of score values is not sufficient for all applications. The approach outlined here is similar to SA-Train with the simple training methodology.

  • SA-Collect.pl: collects a set of the most recent mails from a mail archive. Only folders in mbox format, or maildirs (provided each mail begins with ^From\ ), are currently supported. If you use Mozilla or mutt/pine/elm you should be fine... even Thunderbird might work.
  • SA-Train.pl: the main learner script. Computes a CV on the input data, or just learns the full model (default). In the latter case, the user_prefs and bayes_* files are written to the current directory and just need to be copied to the right place. Don't forget to adapt bayes_path in user_prefs (see the excerpt below): if it is not set correctly, you will get no BAYES_* tests in the mail header, and performance will suffer (except for -S).
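
For illustration, this is roughly what the relevant part of a generated user_prefs looks like; the path and the score values here are made up:

    # bayes_path is the filename prefix under which the bayes_* files live;
    # adapt it after copying the files into place
    bayes_path /home/alice/.spamassassin/bayes
    # the learned score values follow, e.g.
    score BAYES_99 4.5
    score BAYES_00 -2.5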

SA-Train.pl uses Algorithm::SVM V0.12 instead of the Perl port of WEKA's SMO which I used before. It should be around three times faster, and memory usage has been reduced roughly twentyfold. Until Algorithm::SVM V0.12 appears on CPAN, you can download the latest version here (source code).
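
To give an impression of how the SVM part works, here is a minimal, self-contained sketch using Algorithm::SVM. This is not SA-Train itself: the rule-hit vectors below are made up, whereas SA-Train derives them by running the SpamAssassin rule set over the mail collection.

    #!/usr/bin/perl -w
    use strict;
    use Algorithm::SVM;
    use Algorithm::SVM::DataSet;

    # Each mail becomes a vector of rule hits (made-up values here),
    # labelled +1 for spam and -1 for ham.
    my @tset;
    for my $i (1 .. 5) {
        push @tset, Algorithm::SVM::DataSet->new(Label =>  1, Data => [1, 0, $i % 2]);
        push @tset, Algorithm::SVM::DataSet->new(Label => -1, Data => [0, 1, $i % 2]);
    }

    # Linear kernel, C-SVC: the weights of the resulting hyperplane
    # play the role of per-rule score values.
    my $svm = Algorithm::SVM->new(Type => 'C-SVC', Kernel => 'linear');
    $svm->train(@tset);

    print "prediction for first mail: ", $svm->predict($tset[0]), "\n";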

Bugs, comments and extension requests are welcome. If you use this for research purposes, please cite one of the papers - preferably the IDA Journal paper.

Performance

Since the latest version, SA-Train also offers the simplified training procedures -S and -B, which work as follows: -S ignores the naive Bayes model and just learns the linear SVM, which is somewhat similar to the process used to obtain the default scores for each new version of SpamAssassin; -B outputs static weights for the BAYES_* tests and ignores all other tests (i.e. sets their scores to 0), which essentially reduces to a pure naive Bayes learner (similar to SpamBayes).
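
To make -B concrete, the generated user_prefs would conceptually look like this (the score values are invented for illustration):

    # only the BAYES_* tests carry weight...
    score BAYES_00 -5.0
    score BAYES_50 1.0
    score BAYES_99 6.0
    # ...and every other rule gets score 0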

Based on our sample of 77,286 mails with a spam/ham ratio of roughly 1, and using five-fold cross-validation, these are the results for the three main settings -S, -B and the default (neither -B nor -S specified). All values are in percent (i.e. multiplied by 100):

                          (default)          -S                 -B
    Ham error (FP rate)   0.420 ± 0.047      1.495 ± 0.171      0.089 ± 0.015
    Spam error (FN rate)  0.423 ± 0.030      3.550 ± 0.232      2.549 ± 0.276

Another way to look at the performance is with ROC curves, plotting ham error against spam error at all possible thresholds. This needs a logarithmic scale, otherwise the differences between the three settings would be too small to see.

In the resulting plot, -B and the default setting are rather close. Since -B is the fastest, I suggest trying it first. The default threshold (required_hits) of 5 reduces ham error at the cost of increased spam error; if you want performance similar to the default setting, change the threshold to 2.
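
That is a one-line change in user_prefs:

    required_hits 2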

You could also try the default setting with different complexity parameters for the SVM (the -c parameter), which you should vary systematically. Using different thresholds on the SVM output - as we did here - is not really sensible, as it forfeits the optimality of the underlying optimization problem.
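
Such a systematic sweep is easy to express with Algorithm::SVM's retrain() and built-in cross-validation; here is a sketch on the same made-up data as above:

    #!/usr/bin/perl -w
    use strict;
    use Algorithm::SVM;
    use Algorithm::SVM::DataSet;

    # Made-up rule-hit vectors, as in the earlier sketch.
    my @tset;
    for my $i (1 .. 5) {
        push @tset, Algorithm::SVM::DataSet->new(Label =>  1, Data => [1, 0, $i % 2]);
        push @tset, Algorithm::SVM::DataSet->new(Label => -1, Data => [0, 1, $i % 2]);
    }

    my $svm = Algorithm::SVM->new(Type => 'C-SVC', Kernel => 'linear');
    $svm->train(@tset);

    for my $c (0.01, 0.1, 1, 10, 100) {
        $svm->C($c);                  # set the complexity parameter
        $svm->retrain();              # retrain with the new value
        my $acc = $svm->validate(5);  # 5-fold cross-validated accuracy
        print "C=$c: accuracy $acc\n";
    }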

What works worst is not using the Bayesian model at all, which is what -S does. This is reasonably similar to what the SpamAssassin developers do to compute the default scores, which also perform very badly on this sample.

Mail data collection

Here is how mails should be collected at an institute, i.e. how to generate a model for a large(r) number of users:

  • Empty all files where SpamAssassin puts automatically recognized spam mails (possibly make a backup first, so no misclassified mails get lost).
  • Ask all institute members to collect spam that gets through into separate mailboxes (e.g. spam).
  • After 1-2 weeks, put all SA-recognized mails and the member-collected mails into one mailbox per user. Check this mailbox extensively for misclassified mails!
  • Use SA-Collect.pl to get the same number of ham mails from non-spam mailboxes. This ensures a spam-to-ham ratio of 1.0 (a quick way to verify the ratio is sketched after this list). Again, check this mailbox extensively for misclassifications!
  • Using SA-Train.pl, train via -x 2, 5 or 10 - depending on how long you can wait. ;-) Afterwards, a full training run (the default, -x 0/1) will produce the user_prefs and bayes_* files for SpamAssassin.
  • You will need at least 1000 mails - the more, the better. SA-Train at OFAI (www.ofai.at) uses around 50,000 mails and remains stable for 6-12 months.
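
As promised above, a quick sanity check for the spam-to-ham ratio: this little script (not part of the SA-Train package) simply counts the ^From\  separators in two mbox files.

    #!/usr/bin/perl -w
    # usage: perl check-ratio.pl spam.mbox ham.mbox
    use strict;

    sub count_mails {
        my ($mbox) = @_;
        open my $fh, '<', $mbox or die "cannot open $mbox: $!";
        my $n = 0;
        while (<$fh>) { $n++ if /^From / }
        close $fh;
        return $n;
    }

    my ($spam, $ham) = map { count_mails($_) } @ARGV[0, 1];
    die "ham mailbox is empty\n" unless $ham;
    printf "spam: %d  ham: %d  ratio: %.2f\n", $spam, $ham, $spam / $ham;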

This collection procedure gives excellent results, is less costly than incremental training (no more checking tens of thousands of spam mails for the elusive false positive), and the filter should work from day one with the error rates estimated via CV. Humans are of course not perfect at this task: human error estimates range from 0.25% to 1%, so everything within this range is competitive.

Datasets

For confidentiality reasons, I cannot give away my full mailbox collection. However, I have chosen to contribute a small part of my personal emails in a reasonably safe form: as word vectors of their contents, as results from SpamAssassin rule sets, and as the actual full sender email addresses (see downloads).