The single-counter approach has a problem: we rely on just one “most extreme” observation. If we get unlucky (or lucky), our estimate can be way off. Imagine estimating the average height of students by measuring only one person. You might pick the tallest or shortest by chance.
The fix is simple: run many experiments in parallel, then combine the results. This is the same idea behind taking multiple measurements and averaging them to get a more reliable answer.
Instead of running separate experiments, we cleverly reuse parts of the same hash. We split each hash into two pieces: the first p bits become a bucket index, and the remaining bits are the part we scan for the first 1-bit.
If we want m buckets, we need p index bits, where 2^p = m.
More buckets mean better accuracy, but also more memory.
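To make the split concrete, here is a tiny, hypothetical sketch (not part of the final class) using p = 6: the top 6 bits of a 32-bit hash pick one of 64 buckets, and the remaining 26 bits are where we look for the first 1-bit. This is exactly the bit trick the add() method below relies on.

// A hypothetical 32-bit hash, written in binary so the split is visible.
const p = 6;                                // 6 index bits -> 64 buckets
const hash = 0b10110100_01100000_00000000_10110101;

const index = hash >>> (32 - p);            // top 6 bits: 0b101101 = bucket 45
const rest = (hash << p) >>> 0;             // remaining 26 bits, shifted to the top
const rank = Math.clz32(rest) + 1;          // first 1-bit of the remaining bits is at position 4 here

console.log(index, rank);                   // 45 4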
Now the estimate doesn’t depend on one lucky hash. If a single hash produces an unusually long (or short) run of leading zeroes, it only affects 1 of the m buckets. When we combine all the buckets, these random spikes average out and the estimate becomes much more stable.
It’s like asking 64 different groups of people to count independently, then averaging their answers. One group might make a mistake, but the overall average is still reliable.
Here’s a simplified version of the code. Don’t worry if the bit operations look confusing; the key idea is to extract the bucket index from the leftmost bits, then find the position of the first 1-bit in the remaining bits.
class HyperLogLog {
  // p = number of index bits
  // m = number of buckets (always 2^p)
  readonly p: number;
  readonly m: number;
  readonly registers: Uint8Array;

  constructor(p = 6) {
    this.p = p;
    this.m = 1 << p;
    this.registers = new Uint8Array(this.m);
  }

  add(hash: number) {
    // Bucket index = the top p bits of the 32-bit hash
    const index = hash >>> (32 - this.p);
    // Shift left to remove the index bits, leaving only the measurement bits
    const w = (hash << this.p) >>> 0;
    // Rank = position of the first 1-bit (leading zeroes + 1);
    // each register keeps the largest rank it has seen.
    const k = Math.clz32(w) + 1;
    if (k > this.registers[index]) this.registers[index] = k;
  }

  // We will implement the real estimator next
  estimate(): number {
    throw new Error("Estimator not implemented yet");
  }
}
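And here is a quick usage sketch, assuming items are hashed to unsigned 32-bit integers before being added. The fnv1a32 helper below is just one possible (hypothetical) choice of hash for the example, not something the class prescribes.

// FNV-1a: a simple 32-bit string hash, used here purely for illustration.
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

const hll = new HyperLogLog();              // p = 6, so 64 buckets
for (const user of ["alice", "bob", "carol", "alice"]) {
  hll.add(fnv1a32(user));                   // duplicates hash identically, so they change nothing
}
// hll.registers now holds the largest rank seen per bucket;
// estimate() will turn those into a cardinality once we implement it.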