Thursday, December 18, 2003

Blocklist

There’s a new round of spams that can be hard for SpamSieve to catch if it hasn’t seen many of them before. They have subjects like:

Re: GSF, despite the promise
Re: IOWRGW, the magician disappeared

and, when the spam-generating software doesn’t work properly:

Re: %RND_UC_CHAR[2-8], rimsky suddenly reached

Generally, my approach with SpamSieve is to make it better at learning from the tactics used in spam messages, rather than trying to match transient characteristics with countless rules. Sometimes, however, rules work great, and if you add the ones below to your copy of SpamSieve 2.1, it should be easy to catch these new spams.

SpamSieve Blocklist

If you want to copy and paste, the rule text is:

(?:(?-i)^(Re: )?[A-Z]{2,8}, [ a-z]*$)
RND_UC_CHAR
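
To see how the two rules behave on the sample subjects above, here is a minimal Python sketch. It assumes the first rule is a regular expression matched against the subject and the second is a plain "contains" test; the post does not spell out the match types. Python's re module does not accept a bare (?-i) in the middle of a pattern, so the sketch drops it and relies on Python's default case-sensitive matching, which should behave the same way. The names SUBJECT_RULE, MARKER, and looks_like_spam are made up for illustration, as is the last sample subject.

```python
import re

# Rule 1 from the post, minus the (?-i) switch: Python matches
# case-sensitively unless told otherwise, so the switch is not needed here.
SUBJECT_RULE = re.compile(r"^(Re: )?[A-Z]{2,8}, [ a-z]*$")

# Rule 2, treated here as a plain substring test (an assumption).
MARKER = "RND_UC_CHAR"


def looks_like_spam(subject: str) -> bool:
    """Return True if either blocklist rule matches the subject."""
    return bool(SUBJECT_RULE.search(subject)) or MARKER in subject


samples = [
    "Re: GSF, despite the promise",                    # from the post
    "Re: IOWRGW, the magician disappeared",            # from the post
    "Re: %RND_UC_CHAR[2-8], rimsky suddenly reached",  # from the post
    "Re: Lunch on Friday?",                            # invented legit subject
]
for s in samples:
    print(looks_like_spam(s), s)
```

The first two samples are caught by the regular expression. The broken-template subject does not fit that shape, which is why the second rule is there to catch it.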

7 Comments

thanks michael! i was wondering how those have been slipping through.

Why isn't there a rule for MIME type or URL scheme?

Great idea; thanks, Robb!

I ran that first rule against the last thousand or so legit messages I'd received and came up with what would have been 76 false positives, including book release announcements, bug report replies, and complaints about the lack of coherence of the current season of 24.

Matching the expression from the beginning of the email subject reduced the number of false positives to three, and even caught a couple of spams that had snuck in from somewhere.

^(Re: )?[A-Z]{2,8}, [ a-z]*
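
A small Python sketch of what the leading ^ changes. The unanchored pattern here is only a guess at the shape of the earlier rule, the two legit-looking subjects are invented, and matching is case-sensitive, which may differ from how SpamSieve applies the rule.

```python
import re

# Hypothetical unanchored variant (a guess at the earlier rule's shape).
UNANCHORED = re.compile(r"(Re: )?[A-Z]{2,8}, [ a-z]*")
# Anchored pattern as given in the comment above.
ANCHORED = re.compile(r"^(Re: )?[A-Z]{2,8}, [ a-z]*")

subjects = [
    "Re: GSF, despite the promise",         # spam sample from the post
    "Fwd: notes from the IETF, see below",  # invented legit subject
    "FYI, Meeting moved to 3pm",            # invented legit subject
]
for s in subjects:
    print(bool(UNANCHORED.search(s)), bool(ANCHORED.search(s)), s)
```

The "Fwd:" subject is matched only by the unanchored pattern, because "IETF, see below" can be found in the middle of the line. The "FYI" subject is matched by both: with no $ at the end, nothing constrains what follows the comma, so any subject that merely starts with capitals and a comma still matches.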

But even that version matched 44 legit messages in my mailbox for the last year-to-date or so. I don't think that's safe enough.

The first time I saw "%RND_UC_CHAR" in a spam subject I laughed out loud, but I must have seen dozens of them since. (No knock on SpamSieve, since I see a lot of my mail at the shell before SpamSieve gets ahold of it.)

This bizarre glitch feels like confirmation of the worst fears about spam's business model. Anyone who can screw up so comprehensively and apparently not notice for weeks isn't going broke.

Thanks, Nat. The ^ should definitely be there. I've made a few other improvements to the pattern:

(?:(?-i)^(Re: )?[A-Z]{2,8}, [ a-z]*$)

and updated the entry and screenshot. I just used Mailsmith to test it on my last 10,000 good messages and got no false positives.
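
For anyone without Mailsmith, a similar false-positive check can be sketched in Python. This is not the procedure used above, just one way to approximate it. It assumes the good mail is available as an mbox file, it drops the (?-i) switch because Python matches case-sensitively by default, and the names matching_subjects and mbox_subjects are made up for illustration.

```python
import mailbox
import re
from typing import Iterable, Iterator, List

# The revised pattern, without (?-i) since Python is case-sensitive by default.
PATTERN = re.compile(r"^(Re: )?[A-Z]{2,8}, [ a-z]*$")


def matching_subjects(subjects: Iterable[str]) -> List[str]:
    """Return the subjects that the pattern would flag."""
    return [s for s in subjects if PATTERN.search(s)]


def mbox_subjects(path: str) -> Iterator[str]:
    """Yield the Subject header of each message in an mbox file."""
    # create=False so a mistyped path raises instead of creating a file.
    for message in mailbox.mbox(path, create=False):
        yield str(message.get("Subject", ""))
```

Calling matching_subjects(mbox_subjects(path)) on a mailbox of known-good mail returns the subjects that would have been false positives; an empty list corresponds to the "no false positives" result.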

I've seen some more characters after the comma, so perhaps:

(?:(?-i)^(Re: )?[A-Z]{2,8}, [ a-z0-9'?]*$)

That's much improved. I see only three false positives against my whole personal mail archive with the first revised pattern, and four with the second. They're all some variation of, "OK, declaratory statement" or, "ACRONYM, pithy comment", and look enough like the spam pattern to verge on mechanical indistinguishability.
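
The difference between the two revised patterns, and the lookalike problem, can be sketched in Python. The first subject is an invented spam-like example, the other two are the literal shapes quoted above, and (?-i) is dropped because Python matches case-sensitively by default.

```python
import re

FIRST = re.compile(r"^(Re: )?[A-Z]{2,8}, [ a-z]*$")
SECOND = re.compile(r"^(Re: )?[A-Z]{2,8}, [ a-z0-9'?]*$")

subjects = [
    "Re: QXZ, don't miss these 3 offers",  # invented spam-like subject
    "OK, declaratory statement",           # shape quoted in the comment
    "ACRONYM, pithy comment",              # shape quoted in the comment
]
for s in subjects:
    print(bool(FIRST.search(s)), bool(SECOND.search(s)), s)
```

Only the second pattern accepts the apostrophe and digit in the first subject. Both patterns match the two legit-looking shapes, which is the sense in which such subjects cannot be told apart from the spam by pattern alone.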
