#### Problem

Count the number of times elements appear in a stream of data.
This can be solved easily by using a HashMap. However, data streaming is unbounded so we may end up out of memory when there are too many different elements.

Count-Min Sketch (CMS) is a probabilistic data structure that helps us solve this problem using a fixed memory at the cost of accuracy.

### CMS

• we need d pairwise independent hash functions H1, H2, ..., Hd
• CMS is a matrix C of size d x w, initially `C[i][j] = 0`. C contain the approximated count of elements.
• for each element x in the stream, we update C as follow:
`∀i∈{1,…,d}: C[i][Hi(x)]++`

• Retrieve the count of element x:
`min(C[i,Hi(a)] ∀i∈{1,…,d}`

• How accurate is CMS?

• let `sum(count)` = total of count of all the elements
• with `d = ⌈ln(1/δ)⌉, w = ⌈e/ϵ⌉`, CMS can guarantee that the error is at most `ϵ∗sum(count)` with the probability of at least `1-δ`

Example: total count is 10^6, we choose 7 hash function, w = 10000: `error <= 270` with probability >= `99.9%`. The size of CMS is 7x10000 integers <=> only 280KB.

##### Application

CMS can be used when we want to approximately count the frequency of elements in a data stream: count Facebook like, count Youtube video view.