In this thread we see this simple and beautiful algorithm to shuffle an array:
function shuffle<T>(array: T[]): T[] {
  return array.sort(() => Math.random() - 0.5);
}
And we can see comments that say this algorithm is biased. But I made a simple script to create an empirical probability distribution of the index at which the last element of the array ends up after the shuffle:
function shuffle(array) {
  return array.sort(() => Math.random() - 0.5);
}

function generateDistribution(iterations = 10_000, arrayLength = 10) {
  // Identical filler values plus one marked "target" element at the end.
  const testArray = Array(arrayLength - 1).fill("test");
  const testTarget = "target";
  testArray.push(testTarget);

  const results = {};
  for (let index = 0; index < iterations; index++) {
    countTargetPosition();
  }
  return generateResult();

  // Shuffle and record which index the target ended up at.
  function countTargetPosition() {
    const shuffled = shuffle(testArray);
    shuffled.forEach((value, index) => {
      if (value === testTarget) {
        results[index] = (results[index] || 0) + 1;
      }
    });
  }

  // Convert raw counts into empirical probabilities.
  function generateResult() {
    return Object.entries(results).map(([index, count]) => {
      return {
        [index]: count / iterations,
      };
    });
  }
}
const result = generateDistribution()
document.write(`<h1>Result</h1>`)
document.write(JSON.stringify(result))
We expect an unbiased algorithm to produce a uniform distribution, and the result is very close to that, even for arrays with 100 elements. Why is this algorithm biased, then?
CodePudding user response:
JavaScript doesn't prescribe a particular algorithm for sort, and this shuffling algorithm can give very biased results depending on the sorting algorithm in use. Below, I describe some simple, well-known sorting algorithms that give very biased results; I demonstrate that Firefox and Chrome both give very biased results for an array of length 4; and I give a general argument for why any sorting algorithm will give biased results (though not necessarily as biased as these explicit examples).
Example #1 — selection sort. In selection sort, we first find the least element and put it at index 0, then find the second-least element and put it at index 1, and so on. The important thing to note is that, with the comparison function () => Math.random() - 0.5, each argument to the comparison has an equal chance of being considered "less". So if you find the least element by iterating over the array and comparing each element to the previously-least element, then there's a 50% chance that you'll deem the last element to be the least, a 25% chance that you'll deem the second-to-last element to be the least, a 12.5% chance that you'll deem the third-to-last element to be the least, and so on, giving a biased distribution of which element goes first.
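To make that concrete, here's a minimal sketch of that "find the least element" pass when every comparison is a coin flip. It's purely illustrative (no engine's sort actually looks like this), and the helper name randomArgMin is made up for the demo:

// Hypothetical illustration: one "find the minimum" pass where each
// comparison result is a coin flip, as with () => Math.random() - 0.5.
function randomArgMin(length) {
  let best = 0;
  for (let i = 1; i < length; i++) {
    if (Math.random() < 0.5) best = i; // "a[i] is less than a[best]" half the time
  }
  return best;
}

const length = 5;
const wins = Array(length).fill(0);
const trials = 100_000;
for (let t = 0; t < trials; t++) wins[randomArgMin(length)]++;
console.log(wins.map(w => (w / trials).toFixed(3)));
// Roughly [0.0625, 0.0625, 0.125, 0.25, 0.5]: the later an element sits,
// the more likely it is to be "selected" to go first.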
Example #2 — insertion sort. In insertion sort, we build up a "sorted" section of the array by taking each element in turn and inserting it into the right place in that sorted section (shifting all greater elements by one to make room for it). This means that the last element has a 50% chance of being deemed the least, a 25% chance of being deemed the second-least, a 12.5% chance of being deemed the third-least, etc.
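Here's the matching sketch for insertion sort's key step: inserting an element into the already-"sorted" prefix by scanning from the front, with every comparison a coin flip. Again this is just an illustration with a made-up helper name, not any engine's real code:

// Hypothetical illustration: insert one element into a prefix of length
// prefixLength by scanning from the front, where each "is it less than the
// element here?" check is a coin flip.
function randomInsertPosition(prefixLength) {
  for (let pos = 0; pos < prefixLength; pos++) {
    if (Math.random() < 0.5) return pos; // deemed "less", so it goes here
  }
  return prefixLength; // lost every comparison, stays at the end
}

const prefixLength = 9; // inserting the 10th element into a 9-element prefix
const tally = Array(prefixLength + 1).fill(0);
for (let t = 0; t < 100_000; t++) tally[randomInsertPosition(prefixLength)]++;
console.log(tally.map(c => (c / 100_000).toFixed(3)));
// Roughly 0.5, 0.25, 0.125, ...: a 50% chance of being deemed the least,
// 25% of being second-least, and so on.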
Examples #3 and #4 — whatever Firefox and Chrome use for a four-element array.
Now, realistically, I wouldn't expect any implementation of sort to use exactly selection sort or insertion sort, since there are other algorithms that are much more efficient for large inputs. But sophisticated modern sorting algorithms, such as Timsort, incorporate multiple different sorting algorithms, adaptively choosing between them based on the size and characteristics of the input (or of parts of the input, since they can combine these algorithms in sophisticated ways). So, as an experiment, I tried out this shuffle algorithm on the array [1, 2, 3, 4], a short enough array that it seemed likely that the sort implementation might just use insertion sort for the whole array.
Here's the code I used:
const counts = {};
for (let i = 0; i < 1_000_000; i++) {
  const permutation = [1, 2, 3, 4].sort(() => Math.random() - 0.5).join('');
  counts[permutation] = (counts[permutation] || 0) + 1;
}
const result = [];
for (let permutation in counts) {
  result.push(permutation + ': ' + counts[permutation]);
}
result.join('\n')
I tried this in both Firefox and Chrome.
In Firefox, I got results like this:
1234: 125747
1243: 62365
1324: 62299
1342: 31003
1423: 31320
1432: 15635
2134: 125380
2143: 62216
2314: 62615
2341: 31255
2413: 31509
2431: 15608
3124: 62377
3142: 31166
3214: 62194
3241: 31293
3412: 15631
3421: 15782
4123: 31056
4132: 15672
4213: 31231
4231: 15319
4312: 15727
4321: 15600
which doesn't match what I'd expect from insertion sort, so it must be doing something different, but regardless, it shows very clear biases. Some permutations occur 1/64 of the time (15,625 times out of a million, plus/minus random noise), some occur 1/32 of the time (31,250), some occur 1/16 of the time (62,500), and some occur 1/8 of the time (125,000); so some permutations are eight times as common as others.
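If you want to recover those fractions yourself, a quick post-processing step (assuming the counts object from the snippet above is still in scope) is to round each count to the nearest sixty-fourth of a million:

// Express each count as an approximate multiple of 1/64
// (15625 -> 1/64, 31250 -> 2/64, 62500 -> 4/64, 125000 -> 8/64).
Object.entries(counts)
  .map(([permutation, count]) => permutation + ': ~' + Math.round(count * 64 / 1_000_000) + '/64')
  .join('\n')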
In Chrome, I got results like this:
1234: 187029
1243: 62380
1324: 15409
1342: 15679
1423: 62476
1432: 15368
2134: 31280
2143: 31291
2314: 15683
2341: 15482
2413: 31482
2431: 15732
3124: 15786
3142: 15692
3214: 47186
3241: 47092
3412: 15509
3421: 46600
4123: 62825
4132: 15595
4213: 31091
4231: 15763
4312: 15624
4321: 171946
which also doesn't match what I'd expect from insertion sort, and is a bit more complicated than the distribution in Firefox (I think I see some 3/16ths (187,500) and 3/64ths (46,875) in there?), but is actually even more heavily biased, with a factor of twelve difference between the most and least common permutations.
Example #5 — any deterministic sorting algorithm. I've given various examples above of fairly extreme biases; but really, any sorting algorithm would be expected to produce some bias, because if the algorithm does at most k comparisons on an array of length n, and each comparison has a 50–50 split, then the probability of any given permutation has to be a multiple of 1/2^k, whereas an unbiased shuffler has to give each permutation the probability 1/n!, which can't be a multiple of 1/2^k if n ≥ 3 (because then n! is a multiple of 3, and 2^k is not).
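As a concrete illustration of that counting argument, here's a small, purely hypothetical experiment: an insertion sort on three elements whose comparisons are answered not by Math.random() but by an explicit list of coin flips, so we can enumerate all 2^3 = 8 equally likely outcome sequences. Every permutation's probability comes out as a whole number of eighths, so none of them can be the unbiased 1/6:

// Insertion sort whose comparisons are answered by a preset list of coin
// flips (true = "greater than the previous element", so swap). Illustrative only.
function insertionSortWithFlips(input, flips) {
  const a = input.slice();
  let k = 0; // index of the next coin flip to consume
  for (let i = 1; i < a.length; i++) {
    let j = i;
    while (j > 0 && flips[k++]) {
      [a[j - 1], a[j]] = [a[j], a[j - 1]];
      j--;
    }
  }
  return a;
}

// Three elements need at most 3 comparisons, so 2^3 = 8 equally likely
// flip sequences cover every case (unused flips just repeat outcomes).
const tally = {};
for (let bits = 0; bits < 8; bits++) {
  const flips = [bits & 1, bits & 2, bits & 4].map(Boolean);
  const perm = insertionSortWithFlips([1, 2, 3], flips).join('');
  tally[perm] = (tally[perm] || 0) + 1;
}
console.log(tally);
// { '123': 2, '213': 2, '132': 1, '231': 1, '312': 1, '321': 1 }
// Every probability is a multiple of 1/8, so none of them can be exactly 1/6.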
That said, I should acknowledge that these biases might be small enough that they don't matter; after all, even 1.0 / 3.0 doesn't compute exactly 1/3, but rather rounds it to a binary approximation. Of more direct relevance, a typical implementation of Math.random() holds 64 or 128 bits of internal state, which means it doesn't even have 21! or 35! distinct internal states (2^64 ≈ 1.8 × 10^19 < 21! ≈ 5.1 × 10^19, and 2^128 ≈ 3.4 × 10^38 < 35! ≈ 1.0 × 10^40), so no algorithm that uses Math.random() to shuffle an array of 21 or 35 or more elements can possibly produce each permutation with nonzero probability. So I suppose that some bias is inevitable!
Even if you're using a sort implementation that gives results that you consider good enough, there's just no reason to do it this way, since a Fisher–Yates shuffle is simple to code and is faster than any comparison-based sorting algorithm.
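For reference, here's a sketch of a Fisher–Yates shuffle (my own illustration of the standard algorithm, not code from the thread):

// Fisher–Yates: walk the array from the end, swapping each element with a
// uniformly random element at or before it. Every permutation is equally
// likely, to the extent Math.random() is uniform, and it runs in O(n) time.
function fisherYatesShuffle(array) {
  const a = array.slice(); // copy so the input isn't mutated
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // 0 <= j <= i
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

console.log(fisherYatesShuffle([1, 2, 3, 4]));

Each position is filled by a uniformly random choice among the remaining elements, so all n! orderings are equally likely, with no reliance on the comparator-based sort at all.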
Regarding "But I made a simple script to create an empirical probability distribution of the index at which the last element of the array ends up after the shuffle: […]":
Note that there can potentially be subtler biases, such that not all permutations are equally likely even if the last element is equally likely to end up at any position. Even if the sort implementation is held fixed, you'd want to do a more thorough analysis (probably including a look at its source code) before relying on this shuffling algorithm to give unbiased results.
CodePudding user response:
The reason is that the comparison outcomes are inconsistent: the random comparator doesn't describe any stable ordering and can contradict itself from one call to the next. As ruakh detailed, this can bias some sorting algorithms.
A correct solution is to associate a random key with each element and sort on that key. But this takes O(n) extra space.
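Here's a minimal sketch of that random-key approach (assuming key collisions from Math.random() are negligible; if two keys did collide, those two elements' relative order would fall back to the sort's tie-handling):

// "Decorate, sort, undecorate": pair each element with an independent random
// key, sort by the key (a consistent comparator), then strip the keys.
// Uses O(n) extra space for the decorated pairs.
function shuffleByRandomKey(array) {
  return array
    .map(value => ({ value, key: Math.random() }))
    .sort((a, b) => a.key - b.key)
    .map(entry => entry.value);
}

console.log(shuffleByRandomKey([1, 2, 3, 4]));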