Bloom Filters
“Ok, let me see if I have this right,” says Sarah. “You want to be able to quickly exclude posts that a user has read, so that you don’t suggest those posts again. And a Bloom filter is a good fit for this problem, because it will never let through a post that the user has read, and even though it might exclude a post that they haven’t read, that’s ok because they’ll never know what they don’t see? And it’s very fast?”
In this post I take an unusual approach to explaining Bloom filters. We won’t begin with a full-blown explanation. Instead, I’ll gradually build up to the full data structure in stages. My goal is to tell a plausible story explaining how one could invent Bloom filters from scratch, with each step along the way more or less “obvious”. Of course, hindsight is 20-20, and such a story shouldn’t be taken too literally. Rather, the benefit of developing Bloom filters in this way is that it will deepen our understanding of why Bloom filters work in just the way they do.
[…]
In actual applications of Bloom filters, we won’t know S in advance, nor |S|. So the way we usually specify a Bloom filter is to specify the maximum size n of set that we’d like to be able to represent, and the maximal probability of error, p, that we’re willing to tolerate.
[…]
Bloom filters have been used to solve many different problems. Here’s just a few examples to give the flavour of how they can be used. An early idea was Manber and Wu’s 1994 proposal to use Bloom filters to store lists of weak passwords. Google’s BigTable storage system uses Bloom filters to speed up queries, by avoiding disk accesses for rows or columns that don’t exist. Google Chrome uses Bloom filters to do safe web browsing – the opening example in this post was quite real!
[…]
Instead, the
delete
operation is implemented using an idea known as a counting Bloom filter. The basic idea is to take a standard Bloom filter, and replace each bit in the bit array by a bucket containing several bits (usually 3 or 4 bits).