Archive for November 20, 2002

Wednesday, November 20, 2002

Spam Filtering’s Last Stand?

Jerry Kindall is more optimistic about Bayesian spam filtering than Jeremy Bowers. Statistical spam filtering is just getting started (in terms of deployment; researchers have been looking at it for years now). The current version of SpamSieve only scratches the surface of what’s possible, yet it is more than 98% accurate on my spam. Jerry has ideas for improving it, as do Gary Robinson, the Python folk, myself, and probably many others. In the near term, things look good.

There is no doubt that as spam filters get better, spam becomes harder to detect. Out of the last hundred thousand spams I’ve received, about two were really tricky messages that I had to actually read and understand before I was sure they were spam (although it’s worth noting that SpamSieve would have caught them by looking only at the headers). This is the future of spam. General spam detection is AI-complete. However, at the moment I am optimistic, like Paul Graham, because spammers have a disadvantage in their need to get you to respond:

All along the spectrum, if you restrict the sales pitches spammers can make, you will inevitably tend to put them out of business. That word business is an important one to remember. The spammers are businessmen. They send spam because it works. It works because although the response rate is abominably low (at best 15 per million, vs 3000 per million for a catalog mailing), the cost, to them, is practically nothing.…Sending spam does cost the spammer something, though. So the lower we can get the response rate—whether by filtering, or by using filters to force spammers to dilute their pitches—the fewer businesses will find it worth their while to send spam.

Jerry also mentions that the new version of SpamSieve increases the spam probability of unknown words from 0.2 to 0.4. A few people have been asking, so I may as well mention some of the hidden tuning knobs in SpamSieve. Use them at your own risk.


defaults write com.c-command.SpamSieve SpamProbabilityOfUnknownWords 0.4

defaults write com.c-command.SpamSieve MaxInterestingWords 15

The values shown are the defaults. The number of interesting words is the number of words in the message that SpamSieve uses to determine whether the message is spam. In my experience, decreasing the number tends to make it flag more messages as spam.
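To make the two knobs concrete, here is a rough sketch of how a Graham-style Bayesian filter might use them. This is not SpamSieve's actual code: the function name spam_score, the word_probs table, and the clamping to [0.01, 0.99] are assumptions for illustration, and the combining rule is the naive Bayes formula from Paul Graham's "A Plan for Spam."

```python
from math import prod

def spam_score(words, word_probs, unknown_prob=0.4, max_interesting=15):
    """Hypothetical Graham-style score; not SpamSieve's real implementation."""
    # Look up each distinct word; words never seen in training get unknown_prob.
    # Clamp to [0.01, 0.99] so one word cannot force the score to exactly 0 or 1.
    probs = [min(0.99, max(0.01, word_probs.get(w, unknown_prob)))
             for w in sorted(set(words))]
    # "Interesting" means farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:max_interesting]
    # Naive Bayes combination of the selected word probabilities.
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

# Toy table of per-word spam probabilities (made-up numbers).
word_probs = {"viagra": 0.99, "meeting": 0.05}
print(spam_score(["viagra", "hello"], word_probs, max_interesting=1))
print(spam_score(["viagra", "hello"], word_probs, max_interesting=2))
```

In this toy case the unknown word "hello" (0.4, slightly hammy) is ignored when only one interesting word is allowed, so the score is higher than when both words count. That is at least consistent with a lower limit flagging more messages, though real behavior depends on the trained corpus.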

Nicholas Riley on the Finder Again

Nicholas Riley follows up with his list of the problems with the Finder’s error window, and how to reproduce it.

To people who say I should report these bugs: yes I should, but more importantly, Apple should fix this stuff internally before it ever reaches beta. There are systemic issues at Apple causing these kinds of UI disasters to happen, and identifying such problems one at a time is an inefficient way of dealing with them. Mac OS 8.5 was the last version of the OS to have any kind of consistent UI quality. The line can be drawn rather clearly: the “Index selection…” contextual menu item, one of the few visible additions in 8.6, managed to violate a bunch of guidelines (capitalization and wording) in just two words.

Right on. I would love for Apple to take usability more seriously. But the reality at this time is that Apple won’t fix this stuff without lots of outside feedback.

This will be my last “OS X sucks” rant for the year. Time to affect what I can really affect instead: my own software.

Yup, unfortunately, no one outside Apple has time to go through OS X and report all the problems. That’s why they need to bring back the HI group.