SpamSieve 3.3
SpamSieve 3.3 is an update of my Mac e-mail spam filter that includes lots of changes to improve the filtering accuracy:
Much of this is automatic: SpamSieve is better at analyzing the message structure, HTML, and URLs within messages.
The other part is helping customers help themselves. If SpamSieve is continually letting spam messages through, the most common reason is that it’s been trained that messages like them are good. This can happen either due to accidental training or due to not training (e.g. just deleting) previous spam messages that it missed. It’s long been possible to fix uncorrected mistakes, but there can be so many messages to go through that this is overwhelming. Now, if you’re wondering why a certain spam message wasn’t caught, the expanded search features make it possible to find the specific previous messages that made SpamSieve think that one was good. For example, if the log shows that SpamSieve thought “v1agra” was a key indicator of the message not being spam, you can find the previous messages containing that word and make sure that they are trained as spam, not good.
This sort of search can take a while because it has to read and process all the old messages in the corpus or log. The message parsing was already pretty optimized, due to work for EagleFiler and the bulk importer added in SpamSieve 3.0. Spam-processing messages to find their words was less optimized because it hadn’t been a bottleneck—the mail client doesn’t send very many new messages to be filtered at once, and it’s happening in the background, anyway. But now we have potentially a huge amount of data to process, and the user is waiting for the results.
The first step was to make the spam engine fully threadsafe so that each core could run a separate instance of it.
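As a rough sketch of the one-instance-per-core idea (not SpamSieve's actual engine), the example below gives every worker its own processor object and its own slice of the output, so nothing is shared and nothing needs a lock. The names TokenCounter and countAllTokens are hypothetical, and word counting stands in for the real spam processing.

```swift
import Foundation
import Dispatch

// Hypothetical stand-in for a spam engine. All mutable state lives in the
// instance, so separate instances can run on separate threads without locks.
final class TokenCounter {
    private var scratch: [String] = []   // per-instance buffer, never shared

    func countTokens(in message: String) -> Int {
        scratch.removeAll(keepingCapacity: true)
        for word in message.split(separator: " ") {
            scratch.append(String(word))
        }
        return scratch.count
    }
}

// Gives each worker one contiguous chunk of the messages. Every worker
// creates its own TokenCounter and writes only to its own result slots.
func countAllTokens(in messages: [String],
                    workers: Int = ProcessInfo.processInfo.activeProcessorCount) -> [Int] {
    guard !messages.isEmpty else { return [] }
    let workerCount = max(1, min(workers, messages.count))
    let chunkSize = (messages.count + workerCount - 1) / workerCount
    var results = [Int](repeating: 0, count: messages.count)

    results.withUnsafeMutableBufferPointer { buffer in
        let slots = buffer   // copy of the pointer; its subscript setter is nonmutating
        DispatchQueue.concurrentPerform(iterations: workerCount) { worker in
            let start = worker * chunkSize
            let end = min(start + chunkSize, messages.count)
            guard start < end else { return }
            let engine = TokenCounter()   // one instance per worker
            for index in start..<end {
                slots[index] = engine.countTokens(in: messages[index])
            }
        }
    }
    return results
}
```

Because the chunks do not overlap, the workers never touch the same memory, which is what makes the lock-free write into the result buffer safe in this sketch.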
I initially planned to use multiple threads for reading individual messages from disk, but that turned out not to be necessary. SSDs are really fast.
Even with LZFSE, which is supposed to be highly optimized for decompression speed, decompressing still takes way longer than reading from disk and was sometimes on par with SpamSieve’s own message processing. So it did make sense to do this in a separate thread from the I/O. Thankfully, I was not using transformable Core Data attributes, so it was easy to separate the fetching from the decompression.
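One way to picture the split between fetching and decompression is a two-stage pipeline: the calling thread only reads, and each blob is handed to a concurrent queue for decompression. This is a minimal sketch under stated assumptions, not the app's real code. fetchBlob, decompress, and ResultStore are hypothetical placeholders. On Apple platforms the decompression step could plausibly be something like NSData's decompressed(using: .lzfse), but the post does not say which API is used.

```swift
import Foundation
import Dispatch

// Hypothetical stand-ins: fetchBlob represents the database or disk read,
// decompress represents the LZFSE step. Neither is SpamSieve's real code.
func fetchBlob(_ id: Int) -> [UInt8] {
    [UInt8](repeating: UInt8(id % 256), count: 4)
}

func decompress(_ blob: [UInt8]) -> [UInt8] {
    blob + blob   // placeholder work that just doubles the data
}

// Small lock-protected container so the background blocks can store results.
final class ResultStore: @unchecked Sendable {
    private let lock = NSLock()
    private var slots: [[UInt8]]

    init(count: Int) { slots = Array(repeating: [], count: count) }

    func set(_ value: [UInt8], at index: Int) {
        lock.lock()
        slots[index] = value
        lock.unlock()
    }

    var all: [[UInt8]] {
        lock.lock()
        defer { lock.unlock() }
        return slots
    }
}

func loadMessages(ids: [Int]) -> [[UInt8]] {
    let decompressQueue = DispatchQueue(label: "decompress", attributes: .concurrent)
    let group = DispatchGroup()
    let store = ResultStore(count: ids.count)

    // The calling thread only fetches. Each blob is handed off right away,
    // so the next read does not wait for the previous decompression.
    for (index, id) in ids.enumerated() {
        let blob = fetchBlob(id)
        decompressQueue.async(group: group) {
            store.set(decompress(blob), at: index)
        }
    }
    group.wait()   // block until every decompression has finished
    return store.all
}
```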
There are a bunch of regex objects that are used very frequently. These had been stored in an ivar dictionary, but that no longer made sense because I don’t want to recompile them for each message. My initial approach was to just put them in a threadsafe cache, which is also the approach that Swift Regex takes. But it turns out that with so many threads running at once there is significant overhead just from locking to read the cache. It works much better to use static variables, though that’s a lot more verbose.
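To illustrate the static-variable approach, here is a minimal sketch. The patterns and the names Patterns and countMatches are made up for this example, since the post does not show SpamSieve's real regexes. Swift initializes a static constant lazily and exactly once, and NSRegularExpression is documented as safe to use from multiple threads, so every thread can share the compiled object without going through a lock-protected cache.

```swift
import Foundation

// Hypothetical patterns for illustration only.
enum Patterns {
    // Each static constant is compiled once, on first use, and then shared.
    static let url = try! NSRegularExpression(pattern: #"https?://[^\s<>"]+"#)
    static let digitsInWord = try! NSRegularExpression(pattern: #"[A-Za-z]+\d+[A-Za-z]+"#)
}

// Counts matches over the whole string using a shared, precompiled regex.
func countMatches(of regex: NSRegularExpression, in text: String) -> Int {
    let range = NSRange(text.startIndex..., in: text)
    return regex.numberOfMatches(in: text, range: range)
}
```

The verbosity mentioned above shows up here too: every regex needs its own named declaration instead of a single dictionary lookup by key.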
Likewise, NSString uses a shared object to convert between different encodings, and there was significant locking overhead around that. As there’s no API to access the converter object directly, I ended up implementing my own lock-free solution for the specific encodings that SpamSieve cares about.
Swift string operations were another source of slowness. SpamSieve was calling the generalized contains() with a single ASCII character. That can be made much faster by using utf8.contains(). There are other cases where using unicodeScalars.contains() makes sense.
The HTML processor is still written in Objective-C, and it turned out that Swift bridging overhead was taking more time than the actual HTML processing. This was fixable through a combination of (a) adding specialized Objective-C methods with known return types instead of returning id and casting from Swift, and (b) using as NSDictionary to avoid eager conversions of whole dictionaries when often we only need to read one key.
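The string-search change described above (utf8.contains() in place of the generalized contains()) can be sketched as follows. The function names are hypothetical and "@" is just an example character. One caveat is worth stating: the two forms agree for ordinary text, but they can differ when the ASCII character is followed by a combining mark, because the Character-based search compares whole grapheme clusters.

```swift
// Generalized search: compares Characters (grapheme clusters), which
// requires Unicode segmentation work on every call.
func hasAtSignGeneral(_ text: String) -> Bool {
    text.contains("@" as Character)
}

// Byte-level search: an ASCII byte never occurs inside a multi-byte UTF-8
// sequence, so scanning the UTF-8 view finds every "@" code point.
func hasAtSignUTF8(_ text: String) -> Bool {
    text.utf8.contains(UInt8(ascii: "@"))
}

// Scalar-level search, for a non-ASCII code point such as U+00A0
// (no-break space), where a single byte comparison is not enough.
func hasNoBreakSpace(_ text: String) -> Bool {
    text.unicodeScalars.contains("\u{00A0}")
}
```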
I fixed a regression that started when SpamSieve 3.2 switched to using NSDockTile to draw numeric badges on the Dock icon instead of drawing them itself. This was necessary because doing it manually doesn’t work with Liquid Glass. When SpamSieve did the drawing itself, it used a cached image of the Dock icon. The system API apparently relies on the image file on disk, even just to update the badge, and so it would sometimes crash during a software update if that file got updated.
Some customers were seeing a new issue on macOS Tahoe, seemingly caused by App Nap. The app would be running in the background, with no windows visible, and get woken up by an Apple event (so far so good), but then macOS would stop giving it processor time while it was still generating the response to the Apple event. I’m not sure what’s going on here since most customers are not affected.
I continue to have problems with fake GitHub repos, but GitHub is once again helping to take them down.