SpamAssassin - automating sa-learn with IMAP folders
Among the useful things we have found for our clients is a method for building a learning spam filter using SpamAssassin and a mail server that supports IMAP folders, such as Dovecot. Simply adding SpamAssassin with a standard configuration to incoming mail on a mail server can dramatically decrease the amount of spam users receive, but it will not catch all of the spam sent to the server.
The reason for the lack of complete filtering is clear. Spammers play a cat-and-mouse game with spam filters, always attempting to modify messages in such a way as to avoid filtering. As filters change, spammers experiment until they find ways through, and they change their tactics as each new technique is detected.
Because spammers' tactics keep changing, Emergent Path recommends that clients who maintain their own mail servers enable the Bayesian filtering engine in SpamAssassin and automate the learning process with the sa-learn command.
sa-learn is a command-line program that takes arguments classifying messages as either ham (legitimate messages) or spam (unwanted messages). Because it is a command-line program, it can easily be automated using cron on Unix/Linux systems. We recommend running a scheduled process on the mail server, daily in most cases (the right frequency depends on the volume of mail and the number of mail servers involved), that runs sa-learn over user-classified spam and ham to train SpamAssassin.
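As a rough illustration of the cron side, a crontab entry along the following lines would run a training script once a night. The script path, log path, and run time are hypothetical placeholders, not values from any particular installation, and the entry is assumed to live in the crontab of whichever user owns the Bayes database.

```crontab
# Hypothetical entry: run the training script at 02:30 every night
# and append its output to a log file for later review.
30 2 * * * /usr/local/sbin/sa-learn-folders.sh >> /var/log/sa-learn-folders.log 2>&1
```

Capturing the output in a log makes it easier to confirm that messages are actually being learned before relying on the job.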
A sample script might look something like the script below. This is a simple example and not necessarily a final production script:
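#!/bin/bash
# Learn from user-classified mail; --no-sync defers the slow database sync.
sa-learn --showdots --no-sync --spam /var/mail/domains/*/*/Maildir/.MakeSpam/cur/
sa-learn --showdots --no-sync --ham /var/mail/domains/*/*/Maildir/.MakeHam/cur/
# Write the deferred changes to the Bayes database in one pass.
sa-learn --sync
# Remove the processed messages (-f avoids an error when a folder is empty).
rm -f /var/mail/domains/*/*/Maildir/.MakeSpam/cur/*
rm -f /var/mail/domains/*/*/Maildir/.MakeHam/cur/*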
In this example script, each user who wants to tag spam creates an IMAP folder in the root of their account called MakeSpam, and a second folder called MakeHam for legitimate mail that was wrongly filtered. (The example assumes a typical mail directory structure of /var/mail/domains/<domain_name>/<account>/Maildir/ for the root location of each user's mail folders, in which these folders are stored on disk as .MakeSpam and .MakeHam.) For any spam messages that got through filtering to the inbox, the user drags those messages to the MakeSpam folder and leaves them there. When the script above runs (via cron on the server), the messages are classified as spam and then deleted. Over time this system will help SpamAssassin improve its hit rate on spam messages.
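Because the script deletes messages after learning from them, it can be worth checking first what it would pick up. The following is a minimal dry-run sketch under the directory layout assumed above: it only counts the messages waiting in each training folder and changes nothing. The MAIL_ROOT variable and the count_pending function are names introduced here for illustration.

```shell
#!/bin/bash
# Dry run: report how many messages are waiting in each training folder.
# MAIL_ROOT is a hypothetical variable; it defaults to the layout
# assumed in this article and can be overridden for testing.
MAIL_ROOT="${MAIL_ROOT:-/var/mail/domains}"

count_pending() {
    # $1 is the IMAP folder name without the leading dot, e.g. MakeSpam
    local folder="$1" dir count
    for dir in "$MAIL_ROOT"/*/*/Maildir/."$folder"/cur; do
        [ -d "$dir" ] || continue   # skip the pattern itself if nothing matched
        count=$(find "$dir" -type f | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$dir" "$count"
    done
}

count_pending MakeSpam
count_pending MakeHam
```

Each output line is a folder path followed by a tab and the number of messages found there. Accounts without the folder are skipped silently.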
The procedure for marking messages as ham after they have been classified as spam varies slightly depending on your SpamAssassin configuration. If SpamAssassin is set up to move spam to a Junk or Spam folder and only add a header to the message, the user can move the message to the MakeHam folder; when the script runs, it will learn those messages as ham (good) and take them into account when scoring future mail. If SpamAssassin is set to create a new message and include the original message as an attachment, the user may need to extract the original message from the attachment and place it in the MakeHam folder.
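Where messages arrive wrapped in a SpamAssassin report, one possible alternative to asking users to extract the attachment by hand is to strip the markup on the server with the spamassassin command's -d (--remove-markup) option. The sketch below writes cleaned copies to a scratch directory and leaves the originals untouched. The unwrap_to function name and the paths in the usage comment are hypothetical, and whether this step is needed at all depends on your configuration.

```shell
#!/bin/bash
# Sketch: write markup-free copies of every message in one folder.
# Assumes the spamassassin command is installed; -d removes the
# report and headers that SpamAssassin added to a message.

unwrap_to() {
    # $1: source cur/ directory, $2: scratch directory for cleaned copies
    local src="$1" out="$2" msg
    mkdir -p "$out" || return 1
    for msg in "$src"/*; do
        [ -f "$msg" ] || continue   # nothing to do for an empty folder
        spamassassin -d < "$msg" > "$out/$(basename "$msg")" || return 1
    done
}

# Example use (hypothetical paths):
#   unwrap_to /var/mail/domains/example.com/alice/Maildir/.MakeHam/cur /tmp/ham-clean
#   sa-learn --ham /tmp/ham-clean
```

Working on copies keeps file ownership and permissions in the user's mail folder unchanged, which matters if the script runs as a different user than the mail server.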
Automated systems like this one can take time to develop and are sometimes tedious and error-prone to get right and keep right. We always recommend starting small with minimal functionality, proving that functionality over time, and adding to the functionality at a later date.
