Problem

Count the number of times elements appear in a stream of data.
This can be solved easily by using a HashMap. However, data streaming is unbounded so we may end up out of memory when there are too many different elements.

Count-Min Sketch (CMS) is a probabilistic data structure that helps us solve this problem using a fixed memory at the cost of accuracy.

CMS

  • we need d pairwise independent hash functions H1, H2, ..., Hd
  • CMS is a matrix C of size d x w, initially C[i][j] = 0. C contain the approximated count of elements.
  • for each element x in the stream, we update C as follow:
    ∀i∈{1,…,d}: C[i][Hi(x)]++

  • Retrieve the count of element x:
    min(C[i,Hi(a)] ∀i∈{1,…,d}

  • How accurate is CMS?

    • let sum(count) = total of count of all the elements
    • with d = ⌈ln(1/δ)⌉, w = ⌈e/ϵ⌉, CMS can guarantee that the error is at most ϵ∗sum(count) with the probability of at least 1-δ

Example: total count is 10^6, we choose 7 hash function, w = 10000: error <= 270 with probability >= 99.9%. The size of CMS is 7x10000 integers <=> only 280KB.

Application

CMS can be used when we want to approximately count the frequency of elements in a data stream: count Facebook like, count Youtube video view.