Count Min Sketch

Algorithms/Data Structures
Updated on: May 12, 2024

What is a Count Min Sketch?

A Count Min Sketch is a specialized probabilistic data structure optimized for approximating the frequency of events in data streams using limited memory.

It acts as a compact frequency table: each item is associated with a small set of counters that are incremented whenever the item appears. Hash functions map items across a fixed number of buckets, so distinct items can collide and share the same counters.

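Count Min Sketches do not depend on collision resistance; collisions are expected. Instead, they rely on several pairwise-independent hash functions, so that an item is unlikely to collide with frequent items under all of them. This yields compact frequency estimates with a provable error bound: an estimate never falls below the true count and, with high probability, exceeds it by only a small fraction of the total stream size. They are useful for queries like data pruning that filter data based on frequency thresholds.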

How does it work?

A Count Min Sketch consists of a 2D array of counters with w columns and d rows, along with d hash functions, one per row. For each new item, every row's hash function maps the item to one counter in that row, and each of those d counters is incremented. To estimate an item's frequency, read the d counters the item hashes to and return the minimum.

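Collisions can only inflate counters, so each counter an item maps to is an upper bound on its true count, and the minimum is the tightest of them. More columns shrink the size of the overcount, and more rows lower the chance of a bad estimate, both at the cost of space.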
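
The update and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name, the method names, and the use of salted BLAKE2b digests as the per-row hash functions are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Minimal Count Min Sketch: depth rows of width counters, one hash per row."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Assumption: salting BLAKE2b with the row number stands in for
        # a family of independent hash functions, one per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Increment exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Each mapped counter is >= the true count; the minimum is the tightest.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=1024, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.estimate("a"))  # never less than the true count of 3
```

With only three distinct items in 1024 columns, a collision in every row is very unlikely, so the estimate will almost always equal the true count here; on a large stream the overcount grows with the total number of items.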

Why use a Count Min Sketch?

Count Min Sketches provide memory-efficient approximate counting with provable error bounds, using far less space than storing exact counts. They are useful for tracking frequent items and running analytics on massive data streams using limited memory. They estimate per-item frequencies, not the number of distinct items, so they are not a tool for cardinality estimation.
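
The accuracy guarantee is commonly stated with two parameters: with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, an estimate exceeds the true count by more than ε times the total stream count with probability at most δ. The helper below is a small sketch of that sizing rule; the function name and the example values of ε and δ are chosen here for illustration.

```python
import math
from typing import Tuple


def sketch_dimensions(epsilon: float, delta: float) -> Tuple[int, int]:
    """Return (width, depth) for additive error epsilon * N with probability 1 - delta."""
    width = math.ceil(math.e / epsilon)       # columns control the size of the error
    depth = math.ceil(math.log(1.0 / delta))  # rows control the failure probability
    return width, depth


width, depth = sketch_dimensions(epsilon=0.001, delta=0.01)
print(width, depth, width * depth)  # 2719 5 13595
```

In this example roughly 13,600 counters suffice no matter how many distinct items the stream contains, which is where the space savings over exact counting come from.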

They avoid storing keys explicitly as a hash table does. Hash values are used only to locate counters and are not stored either, so the structure holds nothing but the counter array. This provides compact storage for frequency estimation.

FAQ

How is a Count Min Sketch different from a hash table?

Unlike a hash table, which stores keys explicitly, a Count Min Sketch uses hash values to probabilistically estimate item frequencies. This provides:

  • Fixed memory usage regardless of the number of items.
  • No storage of keys or hash values, only counters.
  • Frequency estimates with provable error bounds.
  • Much better space efficiency than hash tables.

What are the tradeoffs of using a Count Min Sketch?

Some tradeoffs versus storing exact counts:

  • Provides estimates, not exact counts.
  • Estimates can overcount because of collisions, but never undercount.
  • Bounded error that decreases with more space.
  • Supports only increment, lookup, and merge operations; the items it has seen cannot be listed.
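
The merge operation mentioned above amounts to adding two counter arrays cell by cell. The sketch below assumes both arrays have the same width and depth and were built with the same hash functions, since otherwise the merged counters are meaningless; the function name is hypothetical.

```python
from typing import List

Table = List[List[int]]


def merge_tables(a: Table, b: Table) -> Table:
    """Merge two Count Min Sketch counter arrays by element-wise addition."""
    same_shape = len(a) == len(b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(a, b)
    )
    if not same_shape:
        raise ValueError("sketches must share the same width and depth")
    # Assumption: both tables were filled using identical per-row hash functions.
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


merged = merge_tables([[1, 0], [0, 2]], [[3, 1], [1, 0]])
print(merged)  # [[4, 1], [1, 2]]
```

Because addition is associative and commutative, sketches built on separate partitions of a stream can be combined in any order and give the same result as one sketch over the whole stream.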

When is a Count Min Sketch suitable over exact counting?

Count Min Sketch shines for:

  • Approximate counts for massive data streams.
  • Settings where memory is highly constrained.
  • Frequency threshold queries on individual items.
  • Tracking most frequent items.
  • Analytics on high volume event streams.
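
As an illustration of tracking frequent items, the sketch below feeds a stream through a counter array and keeps a small side table of items whose estimate reaches a threshold. The side table is needed because the sketch alone cannot list the items it has seen. The sizes, function names, and salted BLAKE2b hashing are assumptions made for this example.

```python
import hashlib
from typing import Dict, Iterable, List

WIDTH, DEPTH = 2048, 4  # hypothetical sizes chosen for illustration


def _columns(item: str) -> List[int]:
    # One column index per row; salted BLAKE2b is an assumed stand-in hash family.
    return [
        int.from_bytes(
            hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=bytes([row])
            ).digest(),
            "little",
        )
        % WIDTH
        for row in range(DEPTH)
    ]


def frequent_items(stream: Iterable[str], threshold: int) -> Dict[str, int]:
    """Return items whose estimated count reached the threshold, with estimates."""
    table = [[0] * WIDTH for _ in range(DEPTH)]
    candidates: Dict[str, int] = {}
    for item in stream:
        cols = _columns(item)
        for row, col in enumerate(cols):
            table[row][col] += 1
        estimate = min(table[row][col] for row, col in enumerate(cols))
        if estimate >= threshold:
            candidates[item] = estimate  # only frequent keys are stored explicitly
    return candidates


events = ["x"] * 50 + ["y"] * 3 + ["z"]
print(frequent_items(events, threshold=10))
```

Since estimates never undercount, every item that truly reaches the threshold is reported; an infrequent item can appear only if it collides with heavier items in every row.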

Related Entries

Data Pruning

Data pruning refers to database techniques that eliminate irrelevant data during query processing to minimize resource usage and improve performance.

Hash Functions

Hash functions are algorithms that map data of arbitrary size to fixed-size values called hashes in a deterministic, one-way manner for purposes like data integrity and database lookup.

Collision Resistance

Collision resistance is the property of cryptographic hash functions to minimize chances of different inputs mapping to the same output hash, making it difficult to intentionally cause collisions.
