[Maia-users] PDF spam solutions

Robert LeBlanc rjl at renaissoft.com
Mon Aug 13 14:41:19 PDT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Like the rest of you, I'm sure, I've been receiving a glut of PDF spam
lately, and I've been experimenting with various tactics for curbing the
onslaught.  Some tactics work better than others, naturally, so I
thought I'd share my results here.


(1) SpamAssassin core rules

To deal with PDF spam, the SpamAssassin developers added a new core rule
called TVD_PDF_FINGER01, which identifies emails that have empty bodies
but contain PDF attachments.  It works well, but its default score of
1.0 is too low to make it the only tool for the job.  Increasing the
score isn't really a good idea, though, since a lot of business users
regularly send PDF attachments with empty mail bodies, and this could
lead to false positives in a hurry.
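
For anyone who still wants to experiment with the weight of this rule, a
local override is a one-line change.  This is only a sketch: the value
1.5 below is an arbitrary illustration I've picked for the example, not
a tested recommendation, and the same false-positive caveat applies.

```
# local.cf
# Illustrative value only (assumption, not a recommendation); the
# default score mentioned above is 1.0.
score       TVD_PDF_FINGER01   1.5
```

Keeping the override in local.cf means sa-update can refresh the rule
itself without touching your chosen score.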

You can certainly get this new rule for any version of SpamAssassin
(newer than 3.1.1) using sa-update, but now that the 3.2.x series
appears to have stabilized I'd also recommend that you upgrade to 3.2.3
to take advantage of the latest rulesets.


(2) PDFInfo plugin

Available from <http://www.rulesemporium.com/plugins.htm>, this plugin
is a step better in that it tries to identify specific PDF spams by
their characteristics--image dimensions, number of images in the file,
image-to-text ratio, filename, and meta-information (e.g. author,
creator, creation/modified date, etc.), as well as fuzzy hashes of the
file itself.

The downside is that it's /too/ specific: every new signature is a new
rule, so you have to download a fresh copy of pdfinfo.cf whenever
signatures are added.  This makes the plugin very good at catching PDF
spam that's already circulating, but ineffective against new variants,
and keeping it up to date is awkward.
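
To illustrate why "one signature, one rule" gets unwieldy, here is a
rough sketch of what a pair of such rules might look like.  The rule
names, filename, and hash are hypothetical placeholders, and I'm
assuming the eval tests pdf_named and pdf_match_md5 are present under
those names in your copy of the plugin; check its documentation before
relying on them.

```
ifplugin Mail::SpamAssassin::Plugin::PDFInfo

# Hypothetical example: match one exact attachment filename
body        L_PDF_EXAMPLE_NAME eval:pdf_named('example-report.pdf')
score       L_PDF_EXAMPLE_NAME 1.0
describe    L_PDF_EXAMPLE_NAME PDF attachment with a known spam filename

# Hypothetical example: match the MD5 of one specific file
# (placeholder hash, not a real signature)
body        L_PDF_EXAMPLE_MD5  eval:pdf_match_md5('00000000000000000000000000000000')
score       L_PDF_EXAMPLE_MD5  1.0
describe    L_PDF_EXAMPLE_MD5  PDF attachment matching a known spam hash

endif
```

Each rule only fires on the one file or filename it names, so a spammer
who changes either slips past until a new rule is published.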


(3) PDFText plugin

The PDFText plugin uses the pdftotext and pdfinfo utilities from the xpdf
package to try to extract the text and meta-information from PDF files,
so that they can then be subjected to pattern-based tests for spammy
content.  Two versions are currently available:

For SpamAssassin 3.1.x:

<http://www.mail-archive.com/users@spamassassin.apache.org/msg45465.html>

For SpamAssassin 3.2.x:

<http://www.mail-archive.com/users@spamassassin.apache.org/msg45494.html>

Unfortunately this plugin is still a very early alpha--proof-of-concept,
really--and needs a considerable amount of polishing before it could be
recommended for production use.  It also relies on its own wordlist for
scoring, rather than making the extracted text available to the full
battery of SpamAssassin rules.  The author is apparently working on
that, along with experimental support for using GOCR to scan the images
in PDF files.


(4) FuzzyOCR plugin

There's been some discussion about FuzzyOCR's potential role in catching
PDF spam--at least the PDF spam that incorporates images.  The plugin's
author is reluctant at best: "actually, I will not try to scan PDFs, the
risk of false positives is too high and PDFs do not have a future for
spammers (in my opinion) as most clients do not display them directly.
Sending PDFs is only a desperate try of spammers to circumvent image
scanners, but I don't think this will be the new "trend", neither do I
think that this kind of spam has any future or success, like image spam
has."

That said, he seems to have relented under the pressure, and some basic
support for this was added recently to the svn version with a lot of
disclaimers ("highly experimental and disabled by default", "Enable this
at your own risk, this might lead to false positives and classify
important documents as spam. YOU HAVE BEEN WARNED.").

If you already run FuzzyOCR under SpamAssassin 3.2.x you have to use the
svn version anyway, so you may wish to experiment with the PDF-scanning
support; it won't cost you any resources you aren't already spending.
If you're /not/ using FuzzyOCR, though, I wouldn't advise installing it
just to solve the PDF spam problem.


(5) Custom rules

Eric A. Hall posted a custom ruleset recently to the SpamAssassin-Users
list that uses the AWL to determine whether the sender of a binary
attachment (major MIME-type of application, image, audio, video, or
model) has sent the recipient mail before.  If this is the first email
the recipient has ever received from this sender, and it contains such
an attachment, it gets penalized accordingly for coming from a stranger.

You need to have the MIMEHeader plugin installed, but this is included
by default in the newer SpamAssassin 3.2.x series.  The ruleset can be
added easily to your local.cf file:

ifplugin Mail::SpamAssassin::Plugin::MIMEHeader

mimeheader  __L_C_TYPE_APP     Content-Type =~ /^application/i
mimeheader  __L_C_TYPE_IMAGE   Content-Type =~ /^image/i
mimeheader  __L_C_TYPE_AUDIO   Content-Type =~ /^audio/i
mimeheader  __L_C_TYPE_VIDEO   Content-Type =~ /^video/i
mimeheader  __L_C_TYPE_MODEL   Content-Type =~ /^model/i

meta        L_STRANGER_APP     (!AWL && __L_C_TYPE_APP)
score       L_STRANGER_APP     1.0
tflags      L_STRANGER_APP     noautolearn
priority    L_STRANGER_APP     1001 # defer till after AWL
describe    L_STRANGER_APP     Application file sent by a stranger

meta        L_STRANGER_IMAGE   (!AWL && __L_C_TYPE_IMAGE)
score       L_STRANGER_IMAGE   1.0
tflags      L_STRANGER_IMAGE   noautolearn
priority    L_STRANGER_IMAGE   1001 # defer till after AWL
describe    L_STRANGER_IMAGE   Image file sent by a stranger

meta        L_STRANGER_AUDIO   (!AWL && __L_C_TYPE_AUDIO)
score       L_STRANGER_AUDIO   1.0
tflags      L_STRANGER_AUDIO   noautolearn
priority    L_STRANGER_AUDIO   1001 # defer till after AWL
describe    L_STRANGER_AUDIO   Audio file sent by a stranger

meta        L_STRANGER_VIDEO   (!AWL && __L_C_TYPE_VIDEO)
score       L_STRANGER_VIDEO   1.0
tflags      L_STRANGER_VIDEO   noautolearn
priority    L_STRANGER_VIDEO   1001 # defer till after AWL
describe    L_STRANGER_VIDEO   Video file sent by a stranger

meta        L_STRANGER_MODEL   (!AWL && __L_C_TYPE_MODEL)
score       L_STRANGER_MODEL   1.0
tflags      L_STRANGER_MODEL   noautolearn
priority    L_STRANGER_MODEL   1001 # defer till after AWL
describe    L_STRANGER_MODEL   Model file sent by a stranger

endif
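
If you'd like PDFs from strangers to carry a little extra weight, the
same pattern can be narrowed to a single MIME subtype.  The following is
only a sketch along the lines of Eric's rules, not part of his posted
ruleset; the rule names L_STRANGER_PDF and __L_C_TYPE_PDF are made up
for this example and the score is illustrative.

```
ifplugin Mail::SpamAssassin::Plugin::MIMEHeader

# Hypothetical rule names, not part of the posted ruleset.
# Only matches parts declared as application/pdf; a PDF sent as
# application/octet-stream will not trigger this rule.
mimeheader  __L_C_TYPE_PDF     Content-Type =~ /^application\/pdf/i

meta        L_STRANGER_PDF     (!AWL && __L_C_TYPE_PDF)
score       L_STRANGER_PDF     1.0
tflags      L_STRANGER_PDF     noautolearn
priority    L_STRANGER_PDF     1001 # defer till after AWL
describe    L_STRANGER_PDF     PDF file sent by a stranger

endif
```

Note that a message matching this rule also matches L_STRANGER_APP, so
the two scores add up; adjust one or the other if that's more of a
penalty than you want.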


(6) SaneSecurity signatures

If you use ClamAV (you do, don't you?), another option is to use the
phishing and scam signatures published by SaneSecurity
<http://www.sanesecurity.co.uk/clamav/>.  These signatures are updated
multiple times a day and cover a lot of PDF spam, making them perhaps
the most responsive solution available at the moment.

These phishing/scam emails get caught by ClamAV rather than
SpamAssassin, so they show up in Maia's "Viruses/Malware" quarantine
instead of the spam quarantine, which is a bit annoying, but that's
something I'll be working to address in future versions.

I can't argue with the effectiveness of SaneSecurity's signatures,
though--they are by far the most effective blockers of PDF spam that
I've found, and I would strongly recommend that you use them.


(7) Other plugins

While rules and plugins that target PDF spam specifically are very
useful, it's worth noting that the bulk of the PDF spam comes from
botnets, so adding the Botnet plugin
<http://people.ucsc.edu/~jrudd/spamassassin/> can catch a lot of these
things on its own, and it provides a nice score supplement to go along
with the PDF-specific rules.  The latest version is 0.8, and it just
needs one small patch (courtesy of Mark Martinec):

--- Botnet.pm.orig	Mon Aug  6 15:59:16 2007
+++ Botnet.pm	Mon Aug  6 16:02:43 2007
@@ -711,5 +711,14 @@
         (defined $max) &&
         ($max =~ /^-?\d+$/) ) {
-      $resolver = Net::DNS::Resolver->new();
+      $resolver = Net::DNS::Resolver->new(
+               udp_timeout => 5,
+               tcp_timeout => 5,
+               retrans => 0,
+               retry => 1,
+               persistent_tcp => 0,
+               persistent_udp => 0,
+               dnsrch => 0,
+               defnames => 0,
+       );
       if ($query = $resolver->search($name, $type)) {
          # found matches
@@ -834,5 +843,14 @@
    my ($ip) = @_;
    my ($query, @answer, $rr);
-   my $resolver = Net::DNS::Resolver->new();
+   my $resolver = Net::DNS::Resolver->new(
+       udp_timeout => 5,
+       tcp_timeout => 5,
+       retrans => 0,
+       retry => 1,
+       persistent_tcp => 0,
+       persistent_udp => 0,
+       dnsrch => 0,
+       defnames => 0,
+       );
    my $name = "";


-- 
Robert LeBlanc <rjl at renaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFGwM//GmqOER2NHewRAhqDAKCRY5U7T4hgl3yj928ajM8KuceI2wCfYESS
25zC3NMEDVmcUaEJw9En4A8=
=zjNR
-----END PGP SIGNATURE-----

