[Maia-users] Bayes_00 pain

Robert LeBlanc rjl at renaissoft.com
Wed Aug 23 05:33:16 PDT 2006


Stefan G. Weichinger wrote:

> Is this OCR-stuff ready for usage in "productive environments"?
> I haven't yet found/taken the time to take a real close look, as until
> now I wasn't affected by image-spam AFAIK (at least my own boxes/domains).
> 
> I was under the impression that all this was relatively new and
> experimental. What's your impression? Does it generate much load in
> terms of cpu/ram-usage?

While I would still call it "experimental" at this stage, that's mostly
because it's being developed very rapidly.  The version I'm using in
production is the one I describe in the wiki (2.1c), but there are
already beta versions in the 2.2 series, and alphas in the 2.3 series,
with new experimental releases becoming available at a rate of one or
two a day.  Clearly this is an area receiving a lot of attention at the
moment, and there's a mailing list called "Devel-Spam"
<http://lists.own-hero.net/mailman/listinfo/devel-spam> you can
subscribe to if you want to keep up with the bleeding edge of its
development.

The 2.1 series is quite stable and works well for most purposes.  The
extra load and resource usage are minor, because the OCR plugin only
gets invoked on mail that contains inline images.  For those particular
emails it adds 2-4 seconds of processing time, but since they represent
a very small fraction of the total mail volume, the average increase in
processing time works out to a few milliseconds per item, or a few
(i.e. < 10) extra processor-minutes per day.
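
To make that arithmetic concrete, here is a rough sketch in Python.  The
daily volume (100,000 messages) and the share of mail carrying inline
images (0.1%) are purely illustrative assumptions, not measurements from
any real server, and ocr_overhead is a hypothetical helper name; only
the 2-4 second per-message cost comes from the figures above (3 seconds
is used as a midpoint).

```python
def ocr_overhead(daily_messages, image_fraction, seconds_per_image_mail):
    """Estimate the extra load from OCR-scanning only image-bearing mail.

    Returns (average extra milliseconds per message,
             extra processor-minutes per day).
    """
    # Only the messages with inline images pay the OCR cost.
    image_mails = daily_messages * image_fraction
    extra_seconds = image_mails * seconds_per_image_mail

    # Spread that cost over the whole mail volume to get the average.
    avg_ms_per_message = extra_seconds / daily_messages * 1000.0
    cpu_minutes_per_day = extra_seconds / 60.0
    return avg_ms_per_message, cpu_minutes_per_day


# Assumed inputs: 100,000 messages/day, 0.1% with inline images,
# 3 seconds of OCR time per image-bearing message.
avg_ms, cpu_min = ocr_overhead(100_000, 0.001, 3.0)
print(f"average extra time per message: {avg_ms:.1f} ms")
print(f"extra processor time per day:   {cpu_min:.1f} minutes")
```

With those assumed inputs the estimate comes to roughly 3 ms per message
and about 5 processor-minutes per day, in line with the figures above.
Plug in your own volume and image-mail fraction to see where your server
would land.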

The decision to implement OCR in a production environment at this stage
is obviously your call, but with the 2.1 stable series I don't see the
harm in it, unless perhaps your server is very close to its resource
limits as it is.  You must also weigh this against the prevalence of
image-spam, of course; if you haven't been receiving much of it yet, you
probably won't feel much pressure to implement OCR.  Once you /do/ start
receiving it in larger volumes, however, the pressure may reach a
tipping point, and you may be willing to accept a bit more risk and a
bit more resource consumption in order to stem the tide of the image-spam.

As image-spam becomes more pervasive, though, we're eventually /all/
going to need to implement OCR or something equivalent.  When the spam
content is entirely within the images, and the text portion of the mail
contains just non-spammy words and phrases, there's really very little
else left for us to do but try to extract the spam content from the images.

-- 
Robert LeBlanc <rjl at renaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.maiamailguard.com/>


