Let's take a look at Bloom Filters and how they work! Based on a talk and an article by Scott Helme.
6 min read
By Kathrine Løfqvist
December 21, 2021
Ever since I heard a talk from Troy Hunt and Scott Helme at NDC 2.5 years ago, I have been fascinated by the two guys. Not necessarily because of all the great work they have done over the years (e.g. HaveIBeenPwned by Troy Hunt), but simply because of their presenting skills and how they can explain things in a way I think most people can understand. Of course, I speak only for myself, but they managed to fill 60 minutes without me being overwhelmed or left with a feeling of imposter syndrome. Therefore, when I found out that Scott Helme was on the list of speakers for NDC in Oslo this year, I had to attend! The talk was about how a Bloom Filter became the solution to a relatively simple but at the same time complex problem at his company, Report URI.
This article summarizes that talk, in combination with Helme's article, to give a brief introduction to Bloom Filters and how Report URI put them into practice. The full article is available here:
Report URI is a company that monitors security features in your application, e.g. watching over your Content Security Policy (CSP) reports. Report URI receives over 4,620,000,000 reports every week, and one of the most important features they have is filtering out the reports that are not usable. To filter out these reports, it can be very useful to know whether you have seen a particular CSP report before. Mainly because if you have not seen it before, you can assume a closer look will be necessary. So the question is simple: Have we seen this report before?
Report URI only holds on to the data for 90 days. Even so, they have around 3.27TB of data on disk and process around 14,000 inbound CSP reports per second! That means asking the database for a specific report is demanding and adds a lot of extra load.
Then to the simple but perhaps ingenious solution: Bloom Filters. Many people may remember this term from school. Personally, I remember having memorized the name and formula to be prepared for an exam, without really thinking about putting it into practice. A Bloom filter lets you check whether a set of items contains a specific item, without actually having to store the entire set. That makes the filter extremely space efficient! As Scott writes in his article, you can do many cool things with probabilistic data structures, IF you are willing to sacrifice a small amount of accuracy for time and space efficiency.
Bloom filters are really just an array of bits, where every bit has an initial value of 0. To add a new item to the filter, in this case a CSP report for Report URI, you hash the item to find out which bits to set to 1 in the array. In a Bloom Filter, K is the number of hash functions and M is the size of the filter. To add an item, you set bits at K locations, which means running all K hash functions on the item. This sounds simple enough, right? One of the disadvantages of using Bloom Filters, as mentioned earlier, is that you have to sacrifice a small amount of accuracy. This is because a Bloom filter can encounter hash collisions: there is no guarantee that other items have not already set the same bits, which can give a false positive result. It is therefore impossible to say:
The item is definitely part of the set
However, we always know when an item is not part of the set, because then at least one of its bits is not set in the filter. So we can say:
The item is most likely part of the set.
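To make the mechanics concrete, here is a minimal sketch of this idea in Python. The class name and the seeded-SHA-256 scheme for deriving the K bit positions are my own illustrative choices, not how Report URI implements it:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an array of M bits and K hash functions."""

    def __init__(self, m, k):
        self.m = m            # M: size of the bit array
        self.k = k            # K: number of hash functions
        self.bits = [0] * m   # every bit starts at 0

    def _positions(self, item):
        # Derive K bit positions by hashing the item with K different seeds.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Adding an item sets the bit at each of its K positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set".
        # True means "most likely in the set" (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added always comes back as `True` from `might_contain`, while an item that was never added almost always comes back as `False`; the rare exception is exactly the hash-collision false positive described above.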
It is also possible to control how likely collisions are in your Bloom filter by tuning three values:
M: the size of the bit array
N: the number of items in the filter
K: the number of hash functions used
The larger the filter, the less likely you are to encounter hash collisions, which reduces the chance of false positives. At the same time, it increases the memory requirements. Reducing the number of items, or increasing the number of hash functions (up to a point), will also reduce the probability of collisions. It is important to think these values through carefully, because it is also possible to overpopulate the array. As Helme writes in his article:
“Consider our array of size m = 10 and assume we had k = 10 hash functions. Once the first item was set in the filter, everything would be a false positive after that because all bits in the array are set.”
Additionally, you can configure the probability (p) of false positives. Let's say we want it to be 1%, which really is not that bad. This helps us create the most suitable filter for our use case!
The formulas for calculating these values are:
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
k = round((m / n) * log(2))
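The formulas above can be transcribed directly into runnable Python using the standard `math` module. The one-million-item, 1% scenario below is just a worked example of my own, not a figure from the talk:

```python
from math import ceil, exp, log, pow

def optimal_m(n, p):
    # m: bits needed to hold n items at false-positive probability p.
    return ceil((n * log(p)) / log(1 / pow(2, log(2))))

def optimal_k(m, n):
    # k: the hash-function count that minimises false positives for m and n.
    return round((m / n) * log(2))

def false_positive_rate(m, n, k):
    # p: the expected false-positive probability for a given m, n and k.
    return pow(1 - exp(-k / (m / n)), k)

# Worked example: one million items at a 1% false-positive rate.
m = optimal_m(1_000_000, 0.01)   # roughly 9.6 million bits, a little over 1 MB
k = optimal_k(m, 1_000_000)      # 7 hash functions
```

Plugging `m` and `k` back into `false_positive_rate` lands at roughly the 1% we asked for, which is a handy sanity check on any configuration you pick.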
Luckily, as Helme said during his presentation: there are some great Bloom Filter calculators available online!
After creating a Bloom Filter, you can query it for a specific item instead of querying the database, which saves a lot of time. An important thing to be aware of is that once the filter is created, it is not possible to change the values of M and K.
As Scott writes in his article, this seems like just the solution to the problem, and the use of a Bloom Filter at Report URI is being tested and considered for further use. He also mentions other examples where such a filter may be relevant, e.g. checking the quality of a password. Instead of checking an enormous dataset of weak passwords (or trusted web pages) on every submit, you can keep a local Bloom filter in the browser, saving a lot of time by preventing unnecessary queries.
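A toy sketch of that password idea, using the same seeded-hash scheme as before. The three weak passwords and the filter dimensions are made-up demo values, not from any real dataset:

```python
import hashlib

M, K = 10_000, 7  # demo-sized bit array and hash count

def positions(item):
    # K bit positions derived from K seeded SHA-256 hashes (illustrative scheme).
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M
            for i in range(K)]

# Build the filter from a (hypothetical) list of known-weak passwords.
bits = [0] * M
for pw in ["123456", "password", "qwerty"]:
    for pos in positions(pw):
        bits[pos] = 1

def is_weak(pw):
    # All bits set: "most likely weak". Any bit unset: "definitely not in the list".
    return all(bits[p] for p in positions(pw))
```

In a browser you would ship the pre-built bit array instead of the password list itself, so the check costs a few hash computations rather than a network round trip.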
Before I went to this talk, I knew what a Bloom Filter was, but found it difficult to imagine its use in practice. I hope you learned something from reading this article, or at the very least picked up a few names that might be worth following to learn new things!