Choose a simple random sample, without replacement, of k items from a population of size n.
Fisher-Yates shuffling for known n
The basic idea is to shuffle the original array of n number, and pick the first k numbers.
To shuffle, we use Fisher-Yates algorithm:
for i from n−1 downto 1 do j ← random integer s.t. 0 ≤ j ≤ i swap(a[j], a[i])
Imagine you have a magical hat and n balls. Put all n balls into the hat.
Now close your eyes and start taking balls randomly out of the hat and placing them in a line. If you can convince yourself that the line of ball you get is random, you can understand how Fisher-Yates works.
We have to prove that number
a[i] will have
1/n probability of ending up at index
This is equivalent to
a[i] is picked at
j trial, and not being picked at first
n-j trial. This probability is
(1/j) * (1 - 1/(j+1)) * (1 - 1/(j+2)) * ... * (1 - 1/n) = 1/j * (j/j+1) * (j+1)/(j+2) * (j+2)/(j+3) * ... * (n-1)/n = 1/n
Algorithm R for unknown n
When n is unknown in advance, for example when we receive a stream of events, the problem becomes harder. This is known as Reservior sampling problem. There are many algorithms for this problem, but I will write the simplest one:
ReservoirSample(S[1..n], R[1..k]) // fill the reservoir array for i := 1 to k R[i] := S[i] for i := k+1 to n j := randomInteger(1, i) if j <= k R[j] := S[i]
We need to prove that
Pr(S[i] ∈ R) = k/n.
This is equivalent to S[i] is picked into R and S[i+1], S[i+2], ..., S[n] do not replace S[i]. This probability is
k/i * (1 - 1/(i+1)) * (1 - 1/(i+2)) * ... * (1 - 1/n) = k/i * i/(i+1) * (i+1)/(i+2) * ... * (n-1)/n = k/n