perl.perl5.porters | Postings from May 2003

Re: COW in CORE and SvFAKE in older Perl

From: Aaron Sherman
Date: May 6, 2003 08:51
Subject: Re: COW in CORE and SvFAKE in older Perl
Message ID: 1052236281.11759.69.camel@localhost.localdomain

On Tue, 2003-05-06 at 05:41, Nick Ing-Simmons wrote:
> Nicholas Clark <nick@unfortu.net> writes:
> >
> >Have you profiled spamassassin to death yet and addressed everything else
> >that can be addressed by conventional means?

> I haven't done that, but have established that (for me at least) the main
> snag with spamassassin is its inclination to run _all_ its tests even 
> when a mail has already racked up a massive "this is SPAM" score.
> The "body" tests in particular are expensive when (as with Klutz virus)
> the body is big. Spamassassin was also tickling Tk/UTF-8 interaction so 
> I have stopped using it for now. 

SA has to run all of its tests in the current version. I've proposed an
alternate plan where it might be able to short-circuit at a pre-defined
point, but even then body tests will be required. Mind you, *you* could
remove all body tests from the configs (the ".cf" files) and then re-run
the GA (the genetic algorithm that assigns the rule scores) on your own
corpus (you need a big one) of known spam and ham.
The result would not be as accurate as full SA, but would be better than
applying any one of the header-based tests (e.g. Razor2, blacklists,
forgery detection, subject analysis) alone.

The problem is that the genetic algorithm that scores the tests does so
based on ALL of the rules, so if you only apply half of them, and say
"well, the score is high, so it's spam," you introduce a slightly higher
rate of false positives: some of that mail would have been pulled back
under the threshold by negative-scoring rules that never got to run.

You could run the tests in decreasing order of score, and that way you
could get to a point where you say "all of the tests that are left score
too low to change the score back to non-spam", but even then you end up
with a final score that's "truncated" so users (like me) who handle
delivery based on multiple "tiers" of scoring will have problems. That
ordering of tests also has nothing to do with how expensive they are to
run :(
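
A minimal sketch of that ordering idea, in Python rather than SA's own
Perl, and not SA's actual code. The rule names, scores, tests and the 5.0
threshold are all made up for illustration. It sorts by absolute score so
that negative (ham) rules are ordered too, and stops as soon as the rules
still waiting cannot move the verdict across the threshold:

```python
from typing import Callable, List, Tuple

# (name, score, test). Hypothetical rules, not SpamAssassin's real ones.
Rule = Tuple[str, float, Callable[[str], bool]]

RULES: List[Rule] = [
    ("PILLS", 4.0, lambda m: "viagra" in m),
    ("FREE", 3.0, lambda m: "free" in m),
    ("MENTIONS_PERL", -1.0, lambda m: "perl" in m),
    ("GREETING", 0.5, lambda m: "hello" in m),
]


def classify_with_early_exit(message: str, rules: List[Rule],
                             threshold: float = 5.0):
    """Return (is_spam, score_so_far, tests_run)."""
    # Largest absolute score first, so the rules left over are the weak ones.
    ordered = sorted(rules, key=lambda r: abs(r[1]), reverse=True)
    # Most the rules not yet run could still add to, or take off, the score.
    remaining_pos = sum(s for _, s, _ in ordered if s > 0)
    remaining_neg = sum(s for _, s, _ in ordered if s < 0)
    score = 0.0
    tests_run = 0
    for _name, rule_score, test in ordered:
        # Even if every remaining negative rule hit, it would still be spam.
        if score + remaining_neg >= threshold:
            return True, score, tests_run
        # Even if every remaining positive rule hit, it would still be ham.
        if score + remaining_pos < threshold:
            return False, score, tests_run
        if rule_score > 0:
            remaining_pos -= rule_score
        else:
            remaining_neg -= rule_score
        tests_run += 1
        if test(message):
            score += rule_score
    return score >= threshold, score, tests_run


print(classify_with_early_exit("free viagra hello", RULES))
```

For that message the loop stops after two of the four tests and reports a
score of 7.0, where running everything would have given 7.5. That gap is
the "truncated" score problem for anyone who files mail by tier, and the
sort key still knows nothing about what each test costs to run.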

One thought that came up recently that I like was to sample only parts
of the body (or all for small messages), and that might work. You also
probably want to eliminate any non-viewable attachments from what body
tests look at by default (only tests that WANTED to see them should).
All stuff I'm looking into.
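
A rough sketch of what body sampling could look like, again in Python and
purely as an illustration. The function name, the 4096-character "small
message" cutoff and the 1024-character window size are assumptions, not
anything SA does today. It covers only the sampling half of the idea;
filtering out non-viewable attachments would be a separate step:

```python
def sample_body(body: str, small_limit: int = 4096, chunk: int = 1024) -> str:
    """Return the whole body if it is small, else head + middle + tail windows.

    small_limit and chunk are hypothetical tuning knobs.
    """
    if len(body) <= small_limit:
        return body  # small message: body tests see everything
    mid = (len(body) - chunk) // 2  # start of a window centred in the body
    parts = [body[:chunk], body[mid:mid + chunk], body[-chunk:]]
    return "\n".join(parts)


big = "a" * 5000 + "b" * 5000 + "c" * 5000
print(len(big), len(sample_body(big)))
```

Body tests would then run against the sample instead of the full text, so
their cost is bounded no matter how big the message is. The trade-off is
that a pattern sitting between the windows is missed, and the scores would
presumably need to be regenerated against sampled bodies.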

When SA is running in daemon mode, it handles my company's mail spool
quite well, but I want it to be able to handle HUGE sites just as
easily....

