Count the number of times elements appear in a stream of data.
This can be solved easily by using a HashMap. However, data streaming is unbounded so we may end up out of memory when there are too many different elements.
Count-Min Sketch (CMS) is a probabilistic data structure that helps us solve this problem using a fixed memory at the cost of accuracy.
- we need d pairwise independent hash functions H1, H2, ..., Hd
- CMS is a matrix C of size d x w, initially
C[i][j] = 0. C contain the approximated count of elements.
for each element x in the stream, we update C as follow:
Retrieve the count of element x:
How accurate is CMS?
sum(count)= total of count of all the elements
d = ⌈ln(1/δ)⌉, w = ⌈e/ϵ⌉, CMS can guarantee that the error is at most
ϵ∗sum(count)with the probability of at least
Example: total count is 10^6, we choose 7 hash function, w = 10000:
error <= 270 with probability >=
99.9%. The size of CMS is 7x10000 integers <=> only 280KB.
CMS can be used when we want to approximately count the frequency of elements in a data stream: count Facebook like, count Youtube video view.