Hashing

HyperLogLog works best when it processes fixed-size bit strings where each bit has an equal chance of being 0 or 1. Real-world data, however, is rarely random. User IDs are often sequential, IP addresses share common prefixes, and email addresses contain repeated and predictable patterns. In many cases, some bits hardly change at all. Because of this, we first use a hash function.

A hash function converts any input (no matter its size) into a fixed-size number, and it is designed to spread the results evenly across all possible outputs. This makes the data suitable for HyperLogLog processing.

Hashing Function

You do not need to understand how hashing algorithms work internally for this section. In practice, you can simply use existing hashing algorithms that are already designed to distribute values evenly.

export function hash32Bits(input: string): number {
  return mix32(fnv1a32(input));
}
 
function mix32(x: number): number {
  x ^= x >>> 16;
  x = Math.imul(x, 0x7feb352d);
  x ^= x >>> 15;
  x = Math.imul(x, 0x846ca68b);
  x ^= x >>> 16;
  return x >>> 0;
}
 
function fnv1a32(str: string): number {
  const bytes = new TextEncoder().encode(str);
 
  let h = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    h ^= bytes[i];
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

Note: To keep this lesson simple, we use a basic hashing function that produces 32-bit hashes. In real-world systems, 64-bit hashes are more common because they reduce collisions and handle much larger datasets. For learning and visualization, however, 32 bits are more than enough.

Introduction Intuition