How to train SpamAssassin

June 28, 2019 | System Administration


For training to be effective, you need a collection of good email (HAM) and a collection of bad email (SPAM). Each collection should contain 1000+ messages, and you should probably have more HAM than SPAM.
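Before training, it can help to check how many messages each collection holds. The following is a minimal sketch, assuming the folders are stored on disk in Maildir format (with cur and new subdirectories); the function names count_msgs and check_corpus are made up for illustration, and the 1000-message threshold is simply the guideline above.

```shell
#!/bin/sh
# Count the messages in the cur and new subdirectories of a Maildir folder.
# Assumes one file per message, as in Maildir.
count_msgs() {
    # find prints one line per regular file; wc -l counts the lines.
    find "$1/cur" "$1/new" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: report whether a folder reaches the 1000-message guideline.
check_corpus() {
    n=$(count_msgs "$1")
    if [ "$n" -lt 1000 ]; then
        echo "$1: only $n messages, keep collecting"
    else
        echo "$1: $n messages"
    fi
}
```

Call check_corpus once with the on-disk path of the HAM folder and once with the path of the SPAM folder.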

Keep updating these folders with new messages as time goes on. Spam practices change, and you don’t want to run the risk of SpamAssassin deciding that a specific year or month is spam.

I like to create a dedicated mailbox for this purpose and, through IMAP, create folders called HAM and SPAM to keep things organized.

If you are using Ubuntu Server, the default Bayes path for your SpamAssassin DB is /var/lib/amavis/.spamassassin, so that is where we will do our work. Otherwise, check your distro package for details.

$ sudo su -
$ cd /var/lib/amavis/.spamassassin

Next, let’s check the status of the Bayes DB.

$ sa-learn --dbpath . --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       1207          0  non-token data: nspam
0.000          0       3784          0  non-token data: nham
0.000          0     177278          0  non-token data: ntokens
0.000          0 1079041431          0  non-token data: oldest atime
0.000          0 1561554929          0  non-token data: newest atime
0.000          0 1561558640          0  non-token data: last journal sync atime
0.000          0 1561526651          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
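In the dump above, the third column holds the value for each "non-token data" line (1207 spam and 3784 ham messages learned, 177278 tokens). If you want to pull out a single figure, a small awk filter is one way to do it. This is only a sketch: the function name magic_value is made up, and it assumes the output layout shown above.

```shell
#!/bin/sh
# Print the value of one "non-token data" field (for example nspam, nham,
# or ntokens) from `sa-learn --dump magic` output read on stdin.
magic_value() {
    awk -v key="$1" -F'non-token data: ' '
        # $2 is the field name; $1 holds the four numeric columns.
        $2 == key { split($1, col, " "); print col[3] }
    '
}

# Example (run from the Bayes DB directory):
#   sa-learn --dbpath . --dump magic | magic_value nham
```

The atime values are Unix timestamps. On systems with GNU date, `date -d @1561554929` turns one into a readable date.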

Now we will train it, but not SYNC it. Syncing makes any new data live, and you may not want that until you’ve built a sufficiently detailed database.

$ sa-learn --no-sync --dbpath . --progress --ham /Path/To/Mailbox/{cur,new}
96% [=========================================  ]  25.00 msgs/sec 02m31s DONE
Learned tokens from 33 message(s) (3783 message(s) examined)
$ sa-learn --no-sync --dbpath . --progress --spam /Path/To/Mailbox/{cur,new}
98% [========================================== ]  26.23 msgs/sec 00m47s DONE
Learned tokens from 355 message(s) (1242 message(s) examined)

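The {cur,new} part is shell brace expansion: the shell expands it so that sa-learn receives both the cur and new subdirectories as arguments. Replace /Path/To/Mailbox with the on-disk path of the HAM folder in the first command and of the SPAM folder in the second.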
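To see exactly which paths end up on the sa-learn command line, you can let the shell print the expansion. This sketch calls bash explicitly, since brace expansion is a bash feature and is not available in every /bin/sh; the path is the same placeholder used above.

```shell
# Brace expansion is done by the shell before the command starts.
# printf prints each resulting argument on its own line.
bash -c 'printf "%s\n" /Path/To/Mailbox/{cur,new}'
# /Path/To/Mailbox/cur
# /Path/To/Mailbox/new
```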

Run the --dump magic command again and, if you are satisfied with the number of tokens, sync the database.

$ sa-learn --dbpath . --sync

That’s all. SpamAssassin is trained up and live.

About Author


Jack of all trades. I.T. edition. Programmer, Systems Administrator, DevOps and whatever else comes up.