[Maia-users] FuzzyOCR plugin 2.3b released

Tóth Csaba tsabi-maia at tsabi.hu
Mon Sep 18 03:16:53 PDT 2006


Hi!

I have updated my spamassassin-fuzzyocr package to version 3. It fixes
the installation problems caused by a change in the original package:
the files in the original tarball were moved into an inner
FuzzyOcr-2.3b/ directory, and the ebuild failed completely because of this.

You can remove the warning about the wrong digest, since the tarball
change it warned about has already happened.

The updated package is available at:
http://dev.davidnet.hu/gentoo-portage/fuzzyocr-gentoo-3.tar.bz2

Please feel free to mail me with any suggestions or advice.

tsabi



Tóth Csaba wrote:
> Hi!
> 
> I made ebuilds to make this easier to install:
> 
> app-text/gocr
>  * added the segfault patch
> perl-gcpan/String-Approx
>  * generated the ebuild with g-cpan
> media-libs/giflib
>  * added the segfault patch
> mail-filter/spamassassin-fuzzyocr
>  * new ebuild
>  * hashdb USE flag to enable hashdb support in config
>  * hashdb-fix is included
> 
> Download this package:
> http://dev.davidnet.hu/gentoo-portage/fuzzyocr-gentoo-2.tar.bz2
> 
> Unpack it into /usr/local, and enable the overlay in /etc/make.conf:
> PORTDIR_OVERLAY="/usr/local/portage"
> 
> Then you can install mail-filter/spamassassin-fuzzyocr.
> 
> If the install fails with a wrong digest (so the FuzzyOcr tarball
> cannot be downloaded), that means a new tarball was released without a
> version bump. Run these commands, and then you should be able to
> install without any errors:
> cd /usr/local/portage/mail-filter/spamassassin-fuzzyocr
> ebuild spamassassin-fuzzyocr-2.3b.ebuild digest
> 
> Have fun,
> tsabi
> 
> 
> Robert LeBlanc wrote:
>> The latest release of the FuzzyOCR plugin (2.3b) is out, and I've
>> updated the wiki accordingly with new installation and configuration
>> instructions: <https://secure.renaissoft.com/maia/wiki/FuzzyOCR23>.
>>
>> There are a number of key improvements in this version of the plugin
>> that make it worth the upgrade, including:
>>
>> * Handling of interlaced and animated GIFs
>>
>> As the anti-spam community has come to embrace OCR technologies,
>> spammers have been working on ways to confuse OCR engines, using
>> interlaced images and animated GIFs.  With interlaced images, the data
>> is ordered differently (all the odd-numbered pixel rows together, all
>> the even-numbered pixel rows together), so tools that aren't able to
>> detect interlaced images and reconstruct them properly would fail to
>> load them.  Animated GIFs are also becoming more common, since tools
>> that don't know how to handle them will only see the first frame of the
>> animation--so spammers simply include a blank first frame that lasts a
>> fraction of a second, followed by a long second frame that contains the
>> spam message.  This version of the FuzzyOCR plugin uses tools that can
>> properly detect and handle interlaced and animated GIFs, unpacking the
>> individual frames as necessary.
>>
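To make the interlacing point concrete, here is a minimal Python sketch of putting interlaced rows back in display order. The two-group description above is a simplification: GIF89a stores rows in four passes (every 8th row from row 0, every 8th from row 4, every 4th from row 2, every 2nd from row 1). The function name is hypothetical and this is not code from the plugin, which relies on external tools for this step.

```python
def deinterlace_gif_rows(stored_rows):
    """Reorder rows from GIF interlaced storage order to top-to-bottom order.

    `stored_rows` is a list of rows in the order they appear in the file.
    """
    height = len(stored_rows)
    # (first row, step) for each of the four GIF89a interlace passes.
    passes = [(0, 8), (4, 8), (2, 4), (1, 2)]
    display_rows = [None] * height
    stored_index = 0
    for start, step in passes:
        for y in range(start, height, step):
            # The next stored row belongs at display position y.
            display_rows[y] = stored_rows[stored_index]
            stored_index += 1
    return display_rows
```

A tool that skips this step and reads the stored rows top to bottom sees a scrambled image, which is exactly what the spammers are counting on.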
>>
>> * Word list in a separate file
>>
>> In previous versions of the plugin, the list of target words was stored
>> in the FuzzyOcr.cf file.  Now they're stored in a separate file
>> (FuzzyOcr.words) that won't be overwritten during plugin upgrades.
>>
>>
>> * Hashing database cache for previously-scanned images
>>
>> On the theory that if you see one instance of a given spam image, you're
>> likely to see multiple copies of it, this version of the plugin
>> maintains a local database to cache scan information about the images it
>> has seen.  It's not an MD5 hash, it's a collection of image property
>> data that aims to be an invariant "signature" for a given image, even if
>> other copies aren't exact pixel-perfect matches.  The image's score is
>> cached as well, so that if it is seen again in the future, the plugin
>> won't need to run the OCR engine on it again.
>>
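The caching idea can be sketched roughly as follows in Python. The fields in the signature (dimensions, colour count, mean colour) and the tolerances are assumptions chosen for illustration; the post does not say which image properties the plugin actually records, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ImageSignature:
    # Hypothetical property set; not the plugin's real signature format.
    width: int
    height: int
    color_count: int
    mean_rgb: Tuple[int, int, int]


def signatures_match(a, b, size_tol=2, color_tol=8):
    """Treat two signatures as the same image if every property is close."""
    return (
        abs(a.width - b.width) <= size_tol
        and abs(a.height - b.height) <= size_tol
        and abs(a.color_count - b.color_count) <= color_tol
        and all(abs(x - y) <= color_tol for x, y in zip(a.mean_rgb, b.mean_rgb))
    )


class ScoreCache:
    """Remember the score of each image seen, keyed by approximate signature."""

    def __init__(self):
        self._entries: List[Tuple[ImageSignature, float]] = []

    def lookup(self, sig: ImageSignature) -> Optional[float]:
        # Linear scan keeps the sketch short; a real store would index this.
        for known, score in self._entries:
            if signatures_match(known, sig):
                return score
        return None

    def store(self, sig: ImageSignature, score: float) -> None:
        self._entries.append((sig, score))
```

Because matching is tolerant rather than exact, a copy of the image with a few pixels changed still hits the cache and skips the OCR run, which an MD5 hash would not allow.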
>>
>> * Ability to use multiple, more configurable scan sets
>>
>> This is perhaps the most powerful addition, as it's a feature that lets
>> you configure the plugin to run the OCR scanner on the image multiple
>> times with different resolution and tolerance settings, in order to
>> catch a wider range of image spam (at the cost of extra processing time,
>> naturally; you can still configure the plugin to do just one pass, if
>> you prefer).
>>
>> The trouble with OCR is that a single pass over the image only tells you
>> what's visible at a single, fixed resolution, and if the image is
>> crafted with a very different resolution the OCR scan may see a bunch of
>> dots instead of letters, or misinterpret "noise" dots as parts of
>> letters.  If you're only going to do one pass over the image, you've got
>> to choose a "compromise" resolution--one that will read text in most
>> cases, but will fail for edge cases.
>>
>> By making a second pass over the image at a different resolution though,
>> you can get the best of both worlds by comparing the results from both
>> scans and choosing the one with the best result.  You could even make a
>> third pass with yet another set of scanner options if you wanted to test
>> for even more challenging conditions (e.g. white text on a dark
>> background, text in multiple colours, etc.).
>>
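The "scan several ways, keep the best result" idea might look like this in Python. This is a sketch under stated assumptions: each scan set is modelled as a plain function returning OCR text, the quality of a result is measured by counting fuzzy word-list hits, and `difflib` stands in for whatever approximate matching the plugin really uses. All names here are hypothetical.

```python
from difflib import SequenceMatcher


def count_fuzzy_hits(ocr_text, target_words, threshold=0.75):
    """Count target words that approximately match some token in the OCR text."""
    tokens = ocr_text.lower().split()
    hits = 0
    for word in target_words:
        if any(SequenceMatcher(None, word, tok).ratio() >= threshold
               for tok in tokens):
            hits += 1
    return hits


def best_scan(image, scan_sets, target_words):
    """Run every scan set on the image and return (name, hits) of the best one.

    `scan_sets` is a list of (name, function) pairs; each function takes the
    image and returns the recognised text.
    """
    best_name, best_hits = None, -1
    for name, scan in scan_sets:
        hits = count_fuzzy_hits(scan(image), target_words)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name, best_hits
```

Each extra scan set adds one more OCR run per image, so the list length is the knob that trades processing time for coverage.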
>> These "scan sets" are also highly configurable--you can construct your
>> own tool-chains by piping the image through a series of utilities, as
>> long as the image begins as a PNM and ends the chain as input to GOCR.
>> Thus you can do things like normalize, resize, greyscale, rotate, etc.
>> as you see fit, using the Netpbm, Libungif, or ImageMagick tools to
>> prepare the image in whatever way you want before it gets OCR'ed.  I
>> expect that various "recipes" will be shared eventually, as people
>> experiment with tool-chains and scanner settings that catch particular
>> image strains.
>>
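A tool-chain of this kind is just a list of commands joined by pipes. The Python sketch below only builds the command string and checks the one rule stated above (the chain must end in GOCR); it does not run anything. The helper name is hypothetical, and the example stages (`ppmtopgm`, `pnmnorm`, `gocr -i -`) are illustrative choices rather than a recipe taken from the plugin's configuration.

```python
import shlex


def build_pipeline(stages):
    """Join a list of argv-style stages into one shell pipeline string.

    Raises ValueError if the chain is empty or does not end with gocr.
    """
    if not stages:
        raise ValueError("a scan set needs at least one stage")
    if stages[-1][0] != "gocr":
        raise ValueError("the last stage must feed the image to gocr")
    # shlex.join quotes each argument safely for the shell.
    return " | ".join(shlex.join(stage) for stage in stages)


# Example chain (assumed tools): greyscale, normalise contrast, then OCR.
example_scan_set = [
    ["ppmtopgm"],
    ["pnmnorm"],
    ["gocr", "-i", "-"],
]
```

Keeping each stage as an argument list rather than a raw string makes it easy to swap one utility for another when experimenting with recipes.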
>>
>> Upgrade notes:
>>
>> (1) SpamAssassin 3.1.4 is preferred, due to some optimizations that make
>> handling animated GIFs somewhat easier.  The plugin will still work with
>> versions as early as 3.1.0 using some (less-efficient) internal
>> workarounds, but you should really be using the latest SpamAssassin in
>> any case, if only for the newer rules and bug fixes, so consider this
>> your excuse to upgrade :)
>>
>> (2) This version of the plugin requires the ImageMagick suite,
>> specifically for the "convert" and "identify" utilities, used to unpack
>> the animated GIFs.
>>
>> (3) There's a small patch for the libungif utility "giftext" which
>> hardens it against a segfault exploit.  This means you'll need to get
>> the libungif sources and patch them and build them, rather than just
>> using the binary packages from your favourite repository.  Sad, but
>> necessary.
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Maia-users mailing list
>> Maia-users at renaissoft.com
>> http://www.renaissoft.com/mailman/listinfo/maia-users
> 
> 
> 

