I am learning Knuth shuffle and it says that the random number should be between 0 and i, not N-1 and I can not understand why.
I saw similar questions asked before but cannot find a clear explanation.
thanks for your attention.
CodePudding user response:
The "true" Fisher-Yates shuffle looks like this:
for i from 0 to n - 1:
pick index j uniformly from i to n - 1 # [i, n-1]
swap array[i] and array[j]
You're proposing this variation:
for i from 0 to n - 1:
pick index j uniformly from 0 to n - 1 # [0, n-1]
swap array[i] and array[j]
There are a couple of ways to see why this modification will not produce a uniformly random permutation of the elements.
The first is a clever mathematical argument. The loop in the incorrect version of Fisher-Yates has n iterations. On each iteration, we pick one of n indices to swap with. This means that there are n choices for the index on the first iteration, n choices for the index on the second, ..., and n choices for the index on the last. Overall, that means that the number of different ways the algorithm can execute is n × n × ... × n = nn.
However, the number of different permutations of n elements is n!. In order for it to be possible for the broken shuffle to produce each of the n! permutations with equal probability, each permutation must be reachable by the same number of paths through the broken implementation (say, k ways). That would mean that k · n! = nn, meaning that nn must be a multiple of n!. However, for any n > 2, this isn't the case, and so the distribution cannot be uniform.
You can see this empirically. Here's some Python code that uses the broken shuffle 1,000,000 times on a three-element array and looks at the resulting frequencies of the different permutations:
from random import randint
from collections import defaultdict
def incorrect_shuffle(arr):
for i in range(len(arr)):
j = randint(0, len(arr) - 1)
arr[i], arr[j] = arr[j], arr[i]
freq = defaultdict(int)
for i in range(1000000):
sequence = ['a', 'b', 'c']
incorrect_shuffle(sequence)
freq["".join(sequence)] = 1
print(freq)
When I ran this, I got the following frequencies:
abc
: 148455acb
: 185328bac
: 184830bca
: 185235cab
: 148017cba
: 148135
Notice that the permutations acb
, bac
, and bca
are all much more likely than the other three.
For more about the sorts of distributions that arise from the broken shuffle, check out this earlier Stack Overflow question.