Monday, July 04, 2005

Using Spamassassin and procmail to filter spam

See also: Mail Server FAQ
Table of contents
1 Procmail and Spamassassin

1.1 SpamAssassin: Automagic teaching

1.1.1 One spamassassin autolearning setup
1.1.2 An Alternative Autolearning Setup
1.1.3 Performance issues with training the bayesian filter
Procmail and Spamassassin

We'll use the combination of procmail (http://www.procmail.org) and Spamassassin (http://www.spamassassin.org) to filter spam from our incoming mail. Procmail allows us to feed our incoming mail through different programs, then place each message in a different mailbox based on the output of those programs.

You can set up procmail and Spamassassin systemwide, but I prefer to do it per user, for the sake of flexibility. If you do set it up systemwide (http://www.spamassassin.org/sitewide.html), the users can always customize their own rules. Refer to this IBM Developerworks article (http://www-106.ibm.com/developerworks/linux/library/l-spam/?t=gr,lnxw03=StampSpam) to get more detail on Spamassassin. Once you have installed Spamassassin and procmail, we want to set up a .procmailrc in our home directory. This will take the mail from Postfix, filter it through spamassassin, then deliver it to the Maildirs. For the sake of definition, procmail becomes our MDA (Mail Delivery Agent). Since procmail, rather than Postfix, will now be delivering to the Maildirs, double check that delivery still works once this is set up.

[user@mail ~]$ cat .procmailrc
#.procmailrc - by jorge - Updated by glenn 6/8/05

# Set our default shell, which may not be necessary, as it
# uses the user's default
SHELL=/usr/bin/bash

# Set our default Maildir. You can also set the HOME variable, which
# may be set automatically by procmail
MAILDIR=/home/jorge/Maildir

# Set our default mailbox, which is where mail that doesn't have
# a sort rule applied goes. The trailing / is important.
DEFAULT=/home/jorge/Maildir/

# Set up logging. It's a good idea to let procmail log all its
# actions until you're sure it's working right.
LOGFILE=${MAILDIR}/procmail.log
LOG="--- Logging ${LOGFILE} for ${LOGNAME}, "

# Whatever recipes you'll use
# The order of the recipes is significant

# First, run everything through spamassassin. The "-P -a" tags are
# not needed with 3.0+, as they are now the defaults.
:0fw
| spamassassin

# Now that we've tagged spam, put it in its own folder

# Maildir delivery does not need a lockfile, so there is no trailing colon
:0
* ^X-Spam-Status: Yes
/home/jorge/Maildir/.spam/

Notice the DEFAULT directory: that is where my mail goes. Because the path ends in a trailing slash, procmail delivers into the new directory under my Maildir, which is how a message shows up as new mail. If it went into the cur directory instead, it would never show up as new; it would appear as already read, which we don't want. Once you read a message in the new directory, your mail client moves it over to the cur directory. That's how the system keeps track of which mails are new and which ones are the 'cur'rent ones in your inbox. That's also why each subfolder has its own new, cur, and tmp directories.

Procmail is smart enough to put new mail into the "new" subdirectory for whatever Maildir folder you want it to put mail in. If you are using an IMAP server like dovecot that builds indexes, you have to omit "new/" and just leave the trailing slash (as in the script), as otherwise procmail doesn't use standard Maildir filenames, and dovecot will constantly be rebuilding a corrupted index file.
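If the new/cur/tmp layout is still a bit abstract, here is a minimal sketch of it in shell. The helper names (make_maildir, count_new, count_cur) are made up for this example and are not part of procmail or any mail server; it only assumes that a Maildir folder is a directory holding tmp, new, and cur.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, for illustration.

# Create a Maildir-style folder with its three standard subdirectories.
make_maildir() {
    mkdir -p "$1/tmp" "$1/new" "$1/cur"
}

# Count messages that no client has looked at yet (still in new/).
count_new() {
    find "$1/new" -type f | wc -l | tr -d ' '
}

# Count messages that have already been seen (moved to cur/).
count_cur() {
    find "$1/cur" -type f | wc -l | tr -d ' '
}
```

For example, make_maildir ~/Maildir/.spam would prepare the spam folder used in the .procmailrc above, and count_new ~/Maildir/.spam would tell you how much unread spam is waiting for inspection.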

The big spamkiller is the :0fw | spamassassin recipe. Users of versions of spamassassin prior to 3.0 may want to add the -a switch, which enables autowhitelisting (enabled by default in 3.0+). When you reply to people, after a while, they get put in the whitelist, and then they won't be judged so harshly the next time Spamassassin checks mail from them. If you're like me, you have some friends who, even with an autowhitelist, will score 20 or higher on the spamassassin score.

The next recipe puts tagged mail in the new subdirectory of my spam directory. It will then show up as new spam in our mail reader, already filed for inspection. As you guessed, once you read the spam, it gets moved to the cur folder. I could send it directly to the .Trash folder, but I prefer to keep it separate; this way you can scan it every once in a while and catch the nifty HTML email your friends from hotmail send you. Confused? Just use the example (http://www.spamassassin.org/dist/procmailrc.example) from the Spamassassin website. You can direct the possible spam to /dev/null, but this is not recommended: if you get a false positive, you'll lose that mail. That is why we created the .spam folder.
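To see what the X-Spam-Status condition in the recipe keys on, here is a rough shell equivalent you can point at a saved message. The function name is_tagged_spam is invented for this sketch. It assumes the message is a plain file whose headers end at the first blank line, and it matches case-insensitively because procmail conditions do so by default.

```shell
#!/bin/sh
# Sketch only: is_tagged_spam is a hypothetical helper, not a procmail tool.

# Succeed if the headers of the message file $1 (everything up to the
# first blank line) carry an "X-Spam-Status: Yes" header.
is_tagged_spam() {
    sed '/^$/q' "$1" | grep -qi '^X-Spam-Status: Yes'
}
```

Running is_tagged_spam on a file from ~/Maildir/.spam/new should succeed, while a message that merely quotes the header in its body should not match.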

Next, we need to create a .forward file to get the email to go to procmail.

[user@mail ~]$ cat .forward
|/usr/bin/procmail

Notice that all the file does is send your incoming email to procmail. Procmail then reads your .procmailrc file and, following it, delivers your email to Maildir/ after running spamassassin. Without the .forward file, your email will be delivered directly to your Maildir/ and will bypass spamassassin entirely.

Spamassassin does a great job by default, but let's say you want to tailor it more. Create a .spamassassin directory in your home directory (mkdir .spamassassin). In there, create a user_prefs file using your favorite text editor. In time, the autowhitelister will put your whitelist in this directory too.

[user@mail ~/.spamassassin]$ cat user_prefs
# custom rules for spamassassin
score RAZOR_CHECK 4.0
score REMOVE_SUBJ 4.0
score SUBJ_REMOVE 4.0

score REPLY_REMOVE_SUBJECT 4.0
score REMOVE_IN_QUOTES 4.0
score HTML_WITH_BGCOLOR 4.0
score REALLY_UNSAFE_JAVASCRIPT 4.0
score CHARSET_FARAWAY_BODY 4.0
score NO_MX_FOR_FROM 4.0

score CTYPE_JUST_HTML 4.0
score WEB_BUGS 4.0
score SUBJ_ALL_CAPS 4.0
score LINES_OF_YELLING 4.0
score FOR_FREE 4.0

With this file, we are overriding the default scores in Spamassassin with our own custom scores. We did this because I thought the default values for these rules were too low; after all, who gets valid email with web bugs? So we raised the scores of these rules. Spamassassin still keeps the default threshold of 5 as the definition of spam, so none of these rules will declare a mail spam by itself. But if senders are using one of these techniques, they're probably using others as well, so their mail will score higher and be tagged as spam. Refer to the Spamassassin rules list (http://www.spamassassin.org/tests.html) to find the complete list of rules.
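The arithmetic behind that is simple enough to sketch. The helpers below (total_score and is_spam, both names made up here) add up the scores from a user_prefs-style file for a list of rules you pretend a message hit, then compare the total against the threshold of 5. This is only a toy model: a real message's score also includes every other rule that fired, not just the ones overridden in user_prefs.

```shell
#!/bin/sh
# Sketch only: total_score and is_spam are hypothetical helpers.

# total_score PREFS_FILE RULE...
# Sum the "score NAME VALUE" entries in PREFS_FILE for the named rules.
total_score() {
    prefs=$1; shift
    awk -v hits="$*" '
        BEGIN { n = split(hits, h, " "); for (i = 1; i <= n; i++) want[h[i]] = 1 }
        $1 == "score" && ($2 in want) { sum += $3 }
        END { printf "%.1f\n", sum }
    ' "$prefs"
}

# is_spam SCORE [THRESHOLD]
# Succeed if SCORE reaches THRESHOLD (5.0 unless given).
is_spam() {
    awk -v s="$1" -v t="${2:-5.0}" 'BEGIN { exit !(s >= t) }'
}
```

With the user_prefs above, a message that only trips WEB_BUGS totals 4.0 and stays under the threshold, while one that trips both WEB_BUGS and SUBJ_ALL_CAPS totals 8.0 and is tagged.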

Remember that these custom rules live in the user accounts, so if you don't want a certain account to use them, don't create a user_prefs for it. If a user doesn't want any spam filtering whatsoever, don't create a .procmailrc for them. Postfix will work just fine, because it was working before you even got to this step, right?
SpamAssassin: Automagic teaching

SpamAssassin works well as a spam filter, but it works much better when it uses Bayesian filtering rather than depending only on its ruleset, simply because of the adaptable nature of Bayesian filtering. You may be familiar with mail clients such as Thunderbird that use Bayesian filtering, which you have to 'teach' your email habits to. This requires marking lots of spam as such, and re-marking lots of legitimate email that was labeled as spam as ham. The same thing is required with SpamAssassin, so that it has enough data to be able to use its Bayesian component. Of course, who wants to sit at the command line, telling SpamAssassin when it was right and when it was wrong? So, we use a simple script, crontab, and some client side manipulation to automate the process.
One spamassassin autolearning setup

In my (glenn) setup, I have spamassassin learn from my Inbox and Junk mailboxes, as I keep them nicely cleaned up and properly sorted. I don't want to deal with an overflowing Junk mailbox, but I also don't want spam cleared so quickly that I miss false positives. So, I run sa-learn every Thursday and Sunday night, and clean out spam older than 31 days at that point. I use the following learning script:

glenn@vasp:~$ cat learn.sh
#!/bin/bash
/usr/bin/sa-learn --spam ~/Maildir/.Junk/cur
/usr/bin/sa-learn --ham ~/Maildir/cur

I omit /new as that's email I haven't had a chance to properly file in case of false positives/negatives. Then, I use the following script to clean up old spam:

glenn@vasp:~$ cat cleanup-junk.sh
#!/bin/bash

# Removes all files from ~/Maildir/.Junk/cur that are older than
# 31 days ago

find ~/Maildir/.Junk/cur -mtime +30 -exec rm -f {} \;

I simply placed these in my crontab:

glenn@vasp:~$ crontab -l
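# min hr dom month dow cmd
# Teach spamassassin about spam (04:00 on Thursdays and Sundays).
# The minute field must be 0; a * there would run the job every minute of the hour.
0 4 * * 4,7 /home/glenn/learn.sh

# Clean up spam directory (05:00 on Thursdays and Sundays)
0 5 * * 4,7 /home/glenn/cleanup-junk.sh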

This works nicely for me, keeping spamassassin smart and preventing me from missing important emails.
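Before trusting a cleanup job with rm, it can help to see what it would delete. Here is a small dry-run sketch; list_old_mail is a made-up name, and it assumes the same find -mtime test as the cleanup script, where +30 matches files last modified more than 30 full days ago.

```shell
#!/bin/sh
# Sketch only: list_old_mail is a hypothetical dry-run helper.

# list_old_mail FOLDER DAYS
# Print the files in FOLDER/cur that were last modified more than DAYS
# full days ago, without removing anything.
list_old_mail() {
    find "$1/cur" -type f -mtime +"$2" -print
}
```

For instance, list_old_mail ~/Maildir/.Junk 30 prints the same files that cleanup-junk.sh would remove on its next run.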
An Alternative Autolearning Setup

In my .procmailrc, I tell procmail to move mail marked as spam by Spamassassin to my $Maildir/.Trash/. This is important because, the way I am setting this up, anything that lands in the Junk folder is learned as spam and then deleted every 4 hours. For the same reason, you should probably stop your email client from moving spam around automatically. Or, set up two junk folders: junk_manual, which holds mail you've personally approved as spam, and junk_auto, which holds mail moved by your mail client's junk mail filter and which you should probably check over before moving it to junk_manual for reading by SpamAssassin and removal from your system.

It is important to tell SpamAssassin not only what is spam, but also what is ham, or legitimate email. I am running this script in my root's crontab, so I specify which home directory to go to. I don't really see why you couldn't run this as your normal user, but since SpamAssassin is run as root when called by procmail (at least, I think it is), I figure it is best to do the same here. One caveat: sa-learn updates the Bayes database of whichever user runs it (by default under that user's ~/.spamassassin), so if procmail is started from your own .forward and runs spamassassin as you, run the script from your own crontab instead. My learning script first tells spamassassin that any read or unread junk mail is spam. I then have it hit my inbox and a number of other oft-used folders on my system to tell it what is ham. It may be wise to leave out your inbox, as you may not have had a chance to move spam that got through, leading SA to believe that spam is ham, which is, of course, a no-no. This is the same reason I do not have SA read my Trash folder, which sometimes has legit email in it (though I do keep 99% of my email, so it's not a huge concern).

[user@mail ~]$ cat ~/bin/learn.sh
#!/bin/bash
/usr/bin/sa-learn --spam /home/user/Maildir/.Junk/new
/usr/bin/sa-learn --spam /home/user/Maildir/.Junk/cur
/usr/bin/sa-learn --ham /home/user/Maildir/new
/usr/bin/sa-learn --ham /home/user/Maildir/cur
# -f keeps rm quiet when a folder is already empty
rm -f /home/user/Maildir/.Junk/new/*
rm -f /home/user/Maildir/.Junk/cur/*

We then want to run that script every 4 hours, or however often you wish to have it run.

[user@mail ~]$ crontab -l
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=""
# m h dom mon dow command
0 */4 * * * /home/user/bin/learn.sh

And now, SpamAssassin can learn from its mistakes as you simply 'delete' (move to the Junk folder) your spam clientside.
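One thing a learn-then-delete script can get wrong is deleting junk mail that sa-learn never actually processed, for example when the command fails. A slightly more defensive variant might look like the sketch below. The function name learn_and_purge and the SA_LEARN variable are inventions of this example (the variable just lets you point at a different sa-learn path); only the sa-learn --spam call comes from the script above.

```shell
#!/bin/sh
# Sketch only: learn_and_purge and SA_LEARN are hypothetical names.
SA_LEARN=${SA_LEARN:-/usr/bin/sa-learn}

# learn_and_purge JUNK_FOLDER
# Feed new/ and cur/ of the folder to sa-learn as spam, and delete the
# messages only when sa-learn reported success for that directory.
learn_and_purge() {
    for sub in new cur; do
        dir="$1/$sub"
        [ -d "$dir" ] || continue
        if "$SA_LEARN" --spam "$dir"; then
            find "$dir" -type f -exec rm -f {} +
        else
            echo "sa-learn failed on $dir; leaving mail in place" >&2
            return 1
        fi
    done
}
```

Called as learn_and_purge /home/user/Maildir/.Junk from the cron job, it would behave like the script above on a good run and leave the Junk folder untouched on a bad one.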
Performance issues with training the bayesian filter

I (tm) used to use a script like those above on my linode (http://www.linode.com), but for whatever reason, the sa-learn process would suck up a lot of memory and essentially bring the linode to its knees without ever completing the learning process. So, instead of throwing whole directories at sa-learn, I give it each mail, one at a time.

So now, my learnspam script looks like:

#!/bin/sh
## Process high scoring spam in the .Spam folder and anything leftover in .Junk
#set -x
# feed to the bayesian learner, one message at a time
echo "Processing Junk maildir..."
spams=`find ~/Maildir/.Junk/cur ~/Maildir/.Junk/new/ -type f -mtime +7`
for spam in $spams
do
    sa-learn --spam --showdots --no-sync $spam
done
rm -f $spams
echo "Processing Spam maildir..."
spams=`find ~/Maildir/.Spam/cur ~/Maildir/.Spam/new/ -type f -mtime +3`
for spam in $spams
do
    sa-learn --spam --showdots --no-sync $spam
done
# --no-sync skips the slow database sync step, so run it once at the end
sa-learn --sync
sleep 1
rm -f $spams

And my learnham script is:

#!/bin/sh
## Process ham
if [ $# -lt 1 ]; then
    echo "Usage: $0 <days>"
    exit 3
fi
#set -x
# feed to the bayesian learner, one message at a time
for sub in cur new
do
    # cur/ or new/ directories changed in the last $1 days, skipping spam folders
    for dir in `find ~/Maildir -name $sub -type d -ctime -$1 | egrep -v '(Trash|Junk|Spam)'`
    do
        for mail in `find $dir -type f -ctime -$1`
        do
            echo $mail; sa-learn --ham --showdots --no-sync $mail
        done
    done
    sleep 1
done
# --no-sync skips the slow database sync step, so run it once at the end
sa-learn --sync

This may be helpful to anyone else who wants to run SA with bayesian learning on something with little memory like a linode.
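For what it's worth, the same one-message-at-a-time idea can also be written with find -exec, which avoids building a long list of filenames in a shell variable and copes with folder names that contain spaces. Treat this as a sketch under assumptions rather than a drop-in replacement: learn_one_by_one and the SA_LEARN override are names I made up, and it relies on sa-learn's --no-sync and --sync options behaving as described in its documentation (skip the sync after each message, then sync once).

```shell
#!/bin/sh
# Sketch only: learn_one_by_one and SA_LEARN are hypothetical names.
SA_LEARN=${SA_LEARN:-sa-learn}

# learn_one_by_one spam|ham DIR...
# Run sa-learn once per message file, then sync the database a single time.
learn_one_by_one() {
    kind=$1; shift
    # one sa-learn process per file keeps each run small
    find "$@" -type f -exec "$SA_LEARN" "--$kind" --no-sync {} \;
    # fold the deferred changes into the Bayes database
    "$SA_LEARN" --sync
}
```

Usage would be along the lines of learn_one_by_one spam ~/Maildir/.Junk/cur ~/Maildir/.Junk/new, with the age filtering and the rm step from the scripts above added back as needed.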
