Back Home

Intuition

After hashing, every input becomes a binary number that behaves like a sequence of random coin flips. Each bit is independent and equally likely to be 0 or 1.

This randomness unlocks something surprising: we can learn about the size of a stream without counting items directly.

Leading zeros and the first 1-bit

Consider a hashed value written in binary:

00010110 00000000 00000000 00000000
  • This number begins with three leading zeros.
  • The first 1 appears at position 4 (counting from the left, starting at 1).

HyperLogLog works with the position of this first 1-bit, though it’s often easier to think in terms of how many leading zeros come before it. Both viewpoints describe the same thing.

Rare patterns

Because each bit is random, long runs of leading zeros are rare. Think of it like flipping a coin. Getting heads once is common, but getting 10 heads in a row is extremely unlikely.

Let’s call P the probability (chance) of an event happening. If each bit is like a coin flip, the probability of seeing the first 1 at position k is:

P(first 1 at position k)=12kP(\text{first 1 at position } k) = \frac{1}{2^k}

Here’s what this means in practice:

  • First 1 at position 1 (like “1…”): probability = 1/2 = 50%
  • First 1 at position 2 (like “01…”): probability = 1/4 = 25%
  • First 1 at position 3 (like “001…”): probability = 1/8 = 12.5%
  • First 1 at position 4 (like “0001…”): probability = 1/16 ≈ 6%

Most hashed values will start with a 1 or 01. Seeing 0001 is uncommon. Seeing 0000001 is extremely unlikely — only about 1 in 128 chance!

The key insight

If, while scanning a stream, the rarest pattern you observe has k leading zeros, the stream must be large enough to produce an event with probability 1/2^k. In other words, the most extreme observation hints at how many items you must have seen.

Example: If you’ve seen a hash starting with “0000001” (first 1 at position 7), that event has only a 1/128 chance. You probably needed to see around 128 items before encountering something this rare.

This is the core idea behind HyperLogLog:

  • Hash inputs into random bits
  • Watch for extremely rare patterns
  • Use probability to turn rarity into a size estimate

Relying on a single rare observation is noisy and unreliable. But this intuition is the foundation. Next, we’ll turn it into the simplest possible algorithm and see why it needs improvement.