Perl.com: Stopping Spam with SpamAssassin

Published on Perl.com
http://www.perl.com/pub/a/2002/03/06/spam.html
See this if you're having trouble printing code examples

Stopping Spam with SpamAssassin
By Simon Cozens

I receive a lot of spam; an absolute massive bucket load of spam. I received more than 100 pieces of spam in the first three days of this month. I receive so much spam that Hormel Foods sends trucks to take it away. And I'm convinced that things are getting worse. We're all being bombarded with junk mail more than ever these days.

Well, a couple of days ago, I reached my breaking point, and decided that the simple mail filtering I had in place up until now just wasn't up to the job. It was time to call in an assassin.

SpamAssassin

SpamAssassin is a rule-based spam identification tool. It's written in Perl, and there are several ways of using it: You can call a client program, spamassassin, and have it determine whether a given message is likely to be spam; you can do essentially the same thing but use a client/server approach so that your client isn't always loading and parsing the rules each time mail comes; or, finally, you can use a Perl module interface to filter spam from a Perl program.

SpamAssassin is extremely configurable; you can select which rules you want to use, change the way the rules contribute to a piece of mail's "spam score," and add your own rules. We'll look at some of these features later in the article. First, how do we get SpamAssassin installed and start using it?

If you're using Debian Linux or one of the BSDs, then this couldn't be easier: just install the appropriate package using apt or the ports tree respectively. (The BSD port is called p5-Mail-SpamAssassin)

Those less fortunate will have to download the latest version of SpamAssassin, and install it themselves.

Vipul's Razor

SpamAssassin uses a variety of ways for testing whether an e-mail is spam, ranging from simple textual checks on the headers or body and detecting missing or misleading headers to network-based checks and an interesting distributed system called Vipul's Razor.

Vipul's Razor takes advantage of the fact that spam is, by its nature, distributed in bulk. Hence, a lot of the spam that you see, I'm also going to see at some point. If there were a big clearing-house where you could report spam and I could see if my incoming mail matches what you've already reported, then I could have a guaranteed way of determining whether a given mail is spam. Vipul's Razor is that clearing-house.

Why is it a Razor? Because it's a collaborative system, its strength is directly derived from the quality of its database, which comes back to the way it's used by the likes of you and me. If end-users report lots of real spam, the Razor gets better; if the database gets "poisoned" by lots of false or misleading reports, then the efficiency of the whole system drops.

Just like any other spam detection mechanism, Razor isn't perfect. There are two points particularly worth noting. First, while it tries to completely avoid false positives (saying something's spam when it isn't) by requiring that spam be reported, it doesn't do anything about false negatives (saying something's not spam when it is) because it only knows about the mail in its database.

Second, spammers, like all other primitive organisms, are constantly evolving. Vipul's Razor only works for spam that is delivered in bulk without modification. Spam that is "personalized" by the addition of random spaces, letters or the name of the recipient, will produce a different signature that won't match similar spam messages in the Razor database.

Nevertheless, the Razor is an excellent addition to the spam fighter's arsenal, since when it marks something as spam, you can be almost positive it's correct. And just like SpamAssassin, it's all pure Perl. Mail::Audit has long supported a Razor plugin, but now we can move to calling Razor as part of a more comprehensive mail filtering system based on SpamAssasin and Mail::Audit

Installing Vipul's Razor is similar to installing SpamAssassin. Debian and BSD users have packages called "razor" and "razor-clients," respectively; and the rest of the world can download and install from the home page. SpamAssassin will detect whether Razor is available and, by default, use it if so.

Assassinating Spam With Mail::Audit : The Easy Way

So this is the part you've all been waiting for. How do we use these things to trap spam? For those of you who aren't familiar with Mail::Audit, the idea is simple: just like with procmail, you write recipes that determine what happens to your mail. However, in the case of Mail::Audit, you specify the recipe in Perl. For instance, here's a recipe to move all mail sent to perl5-porters@perl.org to another folder:


    use Mail::Audit;
    my $mail = Mail::Audit->new();
    if ($mail->from =~ /perl5-porters\@perl.org/) {
        $mail->accept("p5p");
    }
    $mail->accept();

For more details on how to construct mail filters with Mail::Audit, see my previous article.

Plugging SpamAssassin into your filters couldn't be simpler. First of all, you absolutely need the latest version of Mail::Audit, version 2.1 from CPAN. Nothing earlier will do! Now write a filter like this:


    use Mail::Audit;
    use Mail::SpamAssassin;
    my $mail = Mail::Audit->new();

    ... the rest of your rules here ...

    my $spamtest = Mail::SpamAssassin->new();
    my $status = $spamtest->check($mail);

    if ($status->is_spam ()) {
        $status->rewrite_mail() };
        $mail->accept("spam");
    }
    $mail->accept();

As you might be able to guess, the important thing here is the calls to check and is_spam. check produces a "status object" that we can query and use to manipulate the e-mail. is_spam tells us whether the mail has exceeded the number of "spam points" required to flag an e-mail as spam.

The rewrite_mail method adds some headers and rewrites the subject line to include the distinctive string "*****SPAM******". The additional headers explain why the e-mail was flagged as spam. For instance:


X-Spam-Status: Yes, hits=6.1 required=5.0 
tests=SUBJ_HAS_Q_MARK,REPLY_TO_EMPTY,SUBJ_ENDS_IN_Q_MARK version=2.1

This message had a question mark in the subject, an empty reply-to, and the subject ended in a question mark. The mail wasn't actually spam, but this goes to prove that the technique isn't perfect. Nevertheless, since installing the spam filter, I've only seen about 10 false positives, and zero false negatives. I'm happy enough with this solution.

One important point to remember, however, is where in the course of your filtering you should call SpamAssassin's checks. For instance, you want to do so after your mailing list filtering, because mail sent to mailing lists may have munged headers that might confuse SpamAssassin. However, this means that spam sent to mailing lists might slip through the net. Experiment, and find the best solution for your own e-mail patterns.

Assassinating Spam Without Mail::Audit

Of course, there are times when it might not be suitable to use Mail::Audit or you may not want to. Since SpamAssassin is provided as a command line tool as well as a set of Perl modules, it's easy enough to integrate it in whatever mail filtering solution you use.

For instance, here's a procmail recipe that calls out to spamassassin to filter out spam:


:0fw
| spamassassin -P

:0:
* ^X-Spam-Status: Yes
spambox

For the speed-conscious, you can run the spamd daemon and replace calls to spamassassin with spamc; be aware that this is a TCP/IP daemon that you may want to firewall from the rest of the world.

Another approach is to call spamassassin in your mail transport agent, meaning that spam is filtered out before it even attempts to be delivered to you. There's a Sendmail milter library available that allows you to use SpamAssassin, and similar tricks for Exim and other MTAs are available.

Assassinating Spam With Mail::Audit : More Complex Operations

The Mail::SpamAssassin module has many other methods you can use to manipulate e-mail. For instance, if you've identified something as definitely being spam, then you can use


    $spamtest->report_as_spam($mail);

to report it to Vipul's Razor. (Take note of this: As we've mentioned above, the efficiency of the Razor database comes from the fact that e-mails in it are confirmed as spam by a human. Adding false positives to the database would degrade its usefulness for everyone. Only submit mail that you've confirmed personally.)

If you're finding that mail checking is taking too long because SpamAssassin is having to contact the various network-based blacklists and databases, then you can instruct it to only perform "local" checking:


    $spamtest = Mail::SpamAssassin->new({local_tests_only => 1});

There is a wealth of other options available. See the Mail::SpamAssassin documentation for more details, and happy assassinating!