Wednesday, November 20, 2002

Spam Filtering’s Last Stand?

Jerry Kindall is more optimistic about Bayesian spam filtering than Jeremy Bowers. Statistical spam filtering is just getting started (in terms of deployment; researchers have been looking at it for years now). The current version of SpamSieve only scratches the surface of what’s possible, yet it is more than 98% accurate on my spam. Jerry has ideas for improving it, as do Gary Robinson, the Python folk, myself, and probably many others. In the near term, things look good.

There is no doubt that as spam filters get better, spam becomes harder to detect. Out of the last hundred thousand spams I’ve received, about two were really tricky messages that I had to actually read and understand before I was sure they were spam (although it’s worth noting that, looking only at the headers, SpamSieve would have caught them). This is the future of spam. General spam detection is AI-complete. However, at the moment I am optimistic, like Paul Graham, because spammers have a disadvantage: they need to get you to respond. As Graham puts it:

All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (at best 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing.…Sending spam does cost the spammer something, though. So the lower we can get the response rate—whether by filtering, or by using filters to force spammers to dilute their pitches—the fewer businesses will find it worth their while to send spam.

Jerry also mentions that the new version of SpamSieve increases the spam probability of unknown words from 0.2 to 0.4. A few people have been asking, so I may as well mention some of the hidden tuning knobs in SpamSieve. Use them at your own risk.


defaults write com.c-command.SpamSieve SpamProbabilityOfUnknownWords 0.4

defaults write com.c-command.SpamSieve MaxInterestingWords 15

The values shown are the default settings. The number of interesting words is the number of words in the message that SpamSieve uses to determine whether a message is spam. Decreasing the number tends to make it flag more messages as spam, in my experience.
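
To make the two knobs concrete, here is a rough Python sketch of how a Graham-style filter might use them. It follows the combining rule from Paul Graham's "A Plan for Spam", not SpamSieve's actual code, which may well differ. The function name `spam_score`, the `word_probs` table, and the clamping bounds are all assumptions made up for illustration.

```python
from math import prod


def spam_score(tokens, word_probs, unknown_prob=0.4, max_interesting=15):
    """Combine per-word spam probabilities into one message score.

    Hypothetical sketch of a Graham-style rule; not SpamSieve's implementation.
    """
    # Deduplicate tokens while keeping their order; words never seen in
    # training get the unknown-word probability.
    probs = [word_probs.get(t, unknown_prob) for t in dict.fromkeys(tokens)]
    # Clamp so a single word can never force the score to exactly 0 or 1.
    probs = [min(0.99, max(0.01, p)) for p in probs]
    # "Interesting" words are those farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:max_interesting]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


# Toy probabilities, invented for the example.
word_probs = {"viagra": 0.99, "free": 0.90, "meeting": 0.01}
print(spam_score(["viagra", "free", "now"], word_probs))
print(spam_score(["meeting", "agenda"], word_probs))
```

In this sketch, a message made up entirely of unknown words scores below 0.5 with the 0.4 setting, and `max_interesting` controls how many of the less extreme words get a say in the final score.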
