Tuning

A Bloom filter has three parameters that control its behaviour:

  • m — the number of bits in the array
  • k — the number of hash functions
  • n — the expected number of items you will insert

Getting these right determines how accurate the filter is.

The false positive rate formula

After inserting n items into a filter with m bits and k hash functions, the probability of a false positive is approximately:

p ≈ (1 - e^(-k·n/m))^k

You do not need to memorise this formula. The key insight is what each variable does:

  • Larger m (more bits) → fewer collisions → lower false positive rate
  • Larger k (more hashes) → more bits checked → lower chance of all matching by accident, up to a point
  • Larger n (more items) → more bits flipped → higher false positive rate
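The formula is easy to evaluate directly if you want to check a configuration. A minimal sketch (the function name is illustrative, not from any library):

```python
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive rate after inserting n items
    into a Bloom filter with m bits and k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k

# ~9.6 million bits, 7 hashes, 1 million items -> roughly 1%
print(false_positive_rate(9_600_000, 7, 1_000_000))
```

Plugging in larger m or smaller n and re-running is a quick way to build intuition for the bullet points above.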

The optimal number of hash functions

For a given m and n, there is a value of k that minimises the false positive rate:

k_optimal = (m / n) * ln(2)  ≈  0.693 * (m / n)

With too few hashes, each lookup checks only a few bits, so the bits set by other items can easily cover them all by accident. With too many hashes, each insert flips many bits, filling the array quickly and increasing collisions. The optimum balances these two effects.
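In practice k must be a whole number, so implementations round the formula's result. A minimal sketch (the helper name is illustrative):

```python
import math

def optimal_k(m: int, n: int) -> int:
    """Number of hash functions that minimises the false positive
    rate for m bits and n expected items, rounded to an integer."""
    return max(1, round((m / n) * math.log(2)))

# ~9.6 bits per item -> 0.693 * 9.6 ~= 6.65, rounds to 7 hashes
print(optimal_k(9_600_000, 1_000_000))
```

Rounding up or down shifts the false positive rate only slightly, so either neighbour of the real-valued optimum is usually fine.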

Practical sizing

If you know your expected item count and can tolerate a target false positive rate p, you can work backwards to find the right m:

m = -n * ln(p) / (ln(2))^2

For example, to store 1 million items with a 1% false positive rate, you need roughly 9.6 million bits — about 1.2 MB. A plain hash set storing those million items as strings could easily use 50–100 MB or more.
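The worked example above can be reproduced in a few lines. A minimal sketch (function name is illustrative):

```python
import math

def bits_required(n: int, p: float) -> int:
    """Bits needed to hold n items at target false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

m = bits_required(1_000_000, 0.01)
print(m)                    # ~9.6 million bits
print(m / 8 / 1_000_000)    # ~1.2 MB
```

Combined with the k formula above, this is all you need to size a filter: pick n and p, compute m, then compute k from m and n.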

That is the core value of a Bloom filter: a well-tuned filter can give you a very low false positive rate at a tiny fraction of the memory cost.