Hacking SpamAssassin

This is mostly intended to remind me what I've done, so that when I forget and install a new version I can recover. However, it might be useful to someone else. This applies to version 2.6.1 at present.

Rationale

Originally SpamAssassin's default message body ruleset and Bayesian filtering cleaned up more than 99 per cent of spam we received here at Bristol. But with the growth of one-line URL spams I wanted to start using its RBL facilities. Our mail server is on the other side of institutional mail servers, so we don't have the option of doing RBL lookups at delivery time. SpamAssassin, however, uses the Received: lines, which is ideal for our purposes.

My first discovery (via spamassassin -D --lint) was that no RBL checks were being run at all, because we didn't have the Perl DNS lookup modules (Net::DNS::Resolver) installed. These are easy to get through CPAN:

perl -MCPAN -e shell                    [as root]
o conf prerequisites_policy ask
install Net::DNS::Resolver::UNIX
quit
Then I modified the local.cf file as follows, to add some additional RBL services (gleaned from Google):
# This is the right place to customize your installation of SpamAssassin.
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
###########################################################################
#
#rewrite_subject 0
#report_header 1
#defang_mime 0
use_terse_report 1
use_bayes 1
bayes_path /usr/local/spamassassin/bayes
bayes_file_mode 0666
bayes_auto_learn 1
bayes_auto_learn_threshold_spam 10
bayes_min_spam_num 100

header RCVD_IN_BNBL eval:check_rbl('bl', 'bl.blueshore.net.')
describe RCVD_IN_BNBL Received via a relay listed by BNBL
tflags RCVD_IN_BNBL net
score RCVD_IN_BNBL 2.0

header RCVD_IN_RFC_PM eval:check_rbl('relay','postmaster.rfc-ignorant.org.')
describe RCVD_IN_RFC_PM Received via a relay in postmaster.rfc-ignorant.org
score RCVD_IN_RFC_PM 2.0

header X_CHINESE_RELAY eval:check_rbl('relay', 'cn.rbl.cluecentral.net.')
describe X_CHINESE_RELAY Received via a relay in China
score X_CHINESE_RELAY 1.5

header X_KOREAN_RELAY eval:check_rbl('relay','korea.services.net.')
describe X_KOREAN_RELAY Received via a relay in Korea
score X_KOREAN_RELAY 1.5

header X_SPAMHAUS eval:check_rbl('relay','spamhaus.relays.osirusoft.com.')
describe X_SPAMHAUS Received via relay in Spamhaus Blacklist
score X_SPAMHAUS 1.5

header RCVD_IN_NJABL   eval:check_rbl('relay', 'dnsbl.njabl.org.')
describe RCVD_IN_NJABL Received via a relay in NJABL
score RCVD_IN_NJABL    2.0
(Yes, we have a shared Bayes database, because we use spamd.)
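
For anyone unfamiliar with how these lists are consulted: the usual DNSBL convention is to reverse the octets of the relay's IPv4 address, append the list's zone, and look up the resulting name; an answer means the address is listed. The sketch below only builds that query name and does no DNS at all. It illustrates the general convention rather than SpamAssassin's own check_rbl code, rbl_query_name is a name made up for this example, and 192.0.2.99 is a documentation address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): build the DNS name that a
# DNSBL client would look up for a given IPv4 address and blocklist zone.
# Returns undef if the address is not a valid dotted quad.
sub rbl_query_name {
  my ($ip, $zone) = @_;

  # accept only four groups of one to three digits
  $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/ or return undef;
  my @octets = ($1, $2, $3, $4);
  return undef if grep { $_ > 255 } @octets;

  # the zone may be written with a trailing dot, as in local.cf
  $zone =~ s/\.$//;

  # reversed octets first, then the zone
  return join('.', reverse(@octets), $zone);
}

print rbl_query_name('192.0.2.99', 'dnsbl.njabl.org.'), "\n";
```

The printed name is what would be handed to the resolver; whether a given list answers with an A record, and what that record means, is up to the list.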

Problem

However, RBL lookups were only working in a small minority of cases, mostly involving forwards from a non-Bristol machine. This turned out to be because of the way SpamAssassin parses Received: lines. Received: lines from our Bristol mail servers are of the form
Received: from some.host.or.other by dirf.bris.ac.uk 
          with SMTP-SLOPPY; Tue, 6 Jan 2004 05:42:43 +0000
If the HELO string resolves to the originating IP address, only the HELO string is reported, as above. If they differ, the line takes this form instead:
Received: from bogus.org (actually host some.host.or.other) by dirg.bris.ac.uk
         with SMTP-SLOPPY with ESMTP; Tue, 6 Jan 2004 12:44:49 +0000
SpamAssassin cannot cope with the first form, because it doesn't know that it should trust the upstream mail server to get the name right. In fact, in Received.pm, this form is explicitly disallowed:
  # Received: from virtual-access.org by bolero.conactive.com ; Thu, 20 Feb 2003
  # 23:32:58 +0100
  if (/^from (\S+) by (\S+) *\;/) {
    return;     # can't trust this
  }
In practice, SpamAssassin doesn't know what to do with the second form, either.

Solution

The script that parses Received: headers needs to know that it can trust this behaviour if, and only if, the upstream mail server is responsible for it. This is achieved by some code at the start of the parse_received_line subroutine:
  if (/^from /) {

    # First try to parse out Bristol headers

    # HELO plus a literal IP address
    if (/^from (\S+) \(actually host (${IP_ADDRESS})\) by (dir\S+\.bris\.ac\.uk)/) {
      dbg("received_header: bristol w helo, ip $1 $2 $3");
      $helo = $1;
      $ip = $2;
      $by = $3;
      goto enough;
    }

    # HELO plus a hostname: look up the IP address ourselves
    if (/^from (\S+) \(actually host (\S+)\) by (dir\S+\.bris\.ac\.uk)/) {
      dbg("received_header: bristol w helo $1 $2 $3");
      $helo = $1;
      $rdns = $2;
      $ip = $self->lookup_a($rdns);
      $by = $3;
      goto enough;
    }

    # hostname only: it serves as both HELO and rDNS
    if (/^from (\S+) by (dir\S+\.bris\.ac\.uk)/) {
      dbg("received_header: bristol $1 $2");
      $rdns = $1;
      $helo = $1;
      $by = $2;
      $ip = $self->lookup_a($rdns);
      goto enough;
    }
First we deal with the form where the IP address is present; then we deal with the much more problematic forms where it isn't, by looking up the A record for the hostname written into the header by the upstream server. Of course, we need to implement lookup_a, which goes in Dns.pm and is based on lookup_ptr:
sub lookup_a {
  my ($self, $dom) = @_;

  return undef unless $self->load_resolver();
  if ($self->{main}->{local_tests_only}) {
    dbg ("local tests only, not looking up A");
    return undef;
  }

  dbg ("looking up A record for '$dom'");
  my $addr = '';

  eval {
    my $query = $self->{res}->search($dom);
    if ($query) {
      # take the first A record in the answer section
      foreach my $rr ($query->answer) {
        if ($rr->type eq "A") {
          $addr = $rr->address;
          last;
        }
      }
    }
  };
  if ($@) {
    dbg ("A lookup failed horribly, perhaps bad resolv.conf setting?");
    return undef;
  }

  dbg ("A for '$dom': '$addr'");

  # note: undef is only returned if DNS is unavailable or we are running
  # local tests only; a name with no A record yields ''.
  return $addr;
}
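
To see which of the three Bristol patterns a given header hits without running SpamAssassin, the matching can be pulled out into a small standalone script. This is only a sketch: classify_bristol_received is a made-up helper, $IP_ADDRESS here is a simplified dotted-quad pattern standing in for SpamAssassin's own IP_ADDRESS, 192.0.2.1 is a documentation address, and no A lookup is attempted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for SpamAssassin's IP_ADDRESS pattern (assumption).
my $IP_ADDRESS = qr/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/;
# Only lines written by the Bristol relays (dir*.bris.ac.uk) are trusted.
my $BRISTOL_BY = qr/dir\S+\.bris\.ac\.uk/;

# Hypothetical helper: returns a hashref describing which form matched,
# or undef if the line was not written by a Bristol relay.
sub classify_bristol_received {
  my ($line) = @_;
  $line =~ s/^Received:\s*//i;   # drop the header name if present
  $line =~ s/\s+/ /g;            # unfold continuation lines

  if ($line =~ /^from (\S+) \(actually host ($IP_ADDRESS)\) by ($BRISTOL_BY)/) {
    return { form => 'helo+ip', helo => $1, ip => $2, by => $3 };
  }
  if ($line =~ /^from (\S+) \(actually host (\S+)\) by ($BRISTOL_BY)/) {
    return { form => 'helo+rdns', helo => $1, rdns => $2, by => $3 };
  }
  if ($line =~ /^from (\S+) by ($BRISTOL_BY)/) {
    return { form => 'rdns-only', helo => $1, rdns => $1, by => $2 };
  }
  return undef;
}

# Sample headers: the two forms quoted above, a variant with a literal IP
# address, and the non-Bristol line from the Received.pm comment.
my @samples = (
  "Received: from some.host.or.other by dirf.bris.ac.uk\n"
    . "          with SMTP-SLOPPY; Tue, 6 Jan 2004 05:42:43 +0000",
  "Received: from bogus.org (actually host some.host.or.other) by dirg.bris.ac.uk\n"
    . "         with SMTP-SLOPPY with ESMTP; Tue, 6 Jan 2004 12:44:49 +0000",
  "Received: from bogus.org (actually host 192.0.2.1) by dirg.bris.ac.uk",
  "Received: from virtual-access.org by bolero.conactive.com ; Thu, 20 Feb 2003 23:32:58 +0100",
);

for my $line (@samples) {
  my $r = classify_bristol_received($line);
  print(($r ? $r->{form} : 'no match'), "\n");
}
```

The order matters in the same way as in the patch: the literal-IP pattern has to be tried before the hostname one, because the hostname pattern would otherwise swallow the IP address as if it were a name.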

Thoughts

It ought to be possible to do this in a more generic, and less hacky, way by privileging Received: lines from trusted machines and allowing them more latitude in the format in which they can supply the information. If I were a Proper Perl Coder, I would try harder.

This will probably be handled completely differently in 2.7.

Update 23/03/04

This page used to refer to monkeys.com's spam lists, but they have been discontinued. If you used the template local.cf from this page, you should make sure you're not using monkeys.com any more. I am unamused by the discrepancy between their shutdown announcement ('I have no plans whatsoever to implement any sorts of "positive response" wild cards in any of these zones') and their actions on March 14th (implementing a positive response wildcard), but they clearly have the right to behave however they choose.