Problem
Count the number of times elements appear in a stream of data.
This can be solved easily by using a HashMap. However, data streaming is unbounded so we may end up out of memory when there are too many different elements.
Count-Min Sketch (CMS) is a probabilistic data structure that helps us solve this problem using a fixed memory at the cost of accuracy.
CMS
- we need d pairwise independent hash functions H1, H2, ..., Hd
- CMS is a matrix C of size d x w, initially
C[i][j] = 0
. C contain the approximated count of elements.

-
for each element x in the stream, we update C as follow:
∀i∈{1,…,d}: C[i][Hi(x)]++
-
Retrieve the count of element x:
min(C[i,Hi(a)] ∀i∈{1,…,d}
-
How accurate is CMS?
- let
sum(count)
= total of count of all the elements - with
d = ⌈ln(1/δ)⌉, w = ⌈e/ϵ⌉
, CMS can guarantee that the error is at mostϵ∗sum(count)
with the probability of at least1-δ
- let
Example: total count is 10^6, we choose 7 hash function, w = 10000: error <= 270
with probability >= 99.9%
. The size of CMS is 7x10000 integers <=> only 280KB.
Application
CMS can be used when we want to approximately count the frequency of elements in a data stream: count Facebook like, count Youtube video view.