Home > Net >  An efficient algorithm to sample non-duplicate random elements from an array
An efficient algorithm to sample non-duplicate random elements from an array

Time:09-19

I'm looking for an algorithm to pick M random elements from a given array. The prerequisites are:

  • the sampled elements must be unique,
  • the array to sample from may contain duplicates,
  • the array to sample from is not necessarily sorted.

This is what I've managed to come up with. Here I'm also making an assumption that the amount of unique elements in the array is greater (or equal) than M.

#include <random>
#include <vector>
#include <algorithm>
#include <iostream>


const std::vector<int> sample(const std::vector<int>& input, size_t n) {
    std::random_device rd;
    std::mt19937 engine(rd());
    std::uniform_int_distribution<int> dist(0, input.size() - 1);
    
    std::vector<int> result;
    result.reserve(n);

    size_t id;
    do {
        id = dist(engine);
        if (std::find(result.begin(), result.end(), input[id]) == result.end())
            result.push_back(input[id]);
    } while (result.size() < n);

    return result;
}


int main() {
    std::vector<int> input{0, 0, 1, 1, 2, 2, 3, 3, 4, 4};

    std::vector<int> result = sample(input, 3);

    for (const auto& item : result)
        std::cout << item << ' ';
    std::cout << std::endl;
}

This algorithm does not seem to be the best. Is there a more efficient (with less time complexity) algorithm to solve this task? It would be good if this algorithm could also assert the amount of unique elements in the input array is not less than M (or pick as many unique elements as possible if this is not the case).

Possible solution

As MSalters suggested, I use std::unordered_set to remove duplicates and std::shuffle to shuffle elements in a vector constructed from the set. Then I resize the vector and return it.

const std::vector<int> sample(const std::vector<int>& input, size_t M) {
    std::unordered_set<int> rem_dups(input.begin(), input.end());
    if (rem_dups.size() < M) M = rem_dups.size();
    
    std::vector<int> result(rem_dups.begin(), rem_dups.end());
    std::mt19937 g(std::random_device{}());
    std::shuffle(result.begin(), result.end(), g);

    result.resize(M);
    return result;
}

CodePudding user response:

The comments already note the use of std::set. The additional request to check for M unique elements in the input make that a bit more complicated. Here's an alternative implementation:

  1. Put all inputs in a std::set or std::unordered_set. This removes duplicates.
  2. Copy all elements to the return vector
  3. If that has more than M elements, std::shuffle it and resize it to M elements.
  4. Return it.

CodePudding user response:

Use a set S to store the output, initially empty.

i = 0

while |S| < M  && i <= n-1
  swap the i'th element of the input with a random greater element
  add the newly swapped i'th element to your set if it isn't already there
  i  

This will end with S having M distinct elements from your input array (if there are M distinct elements). However, elements which are more common in the input array are more likely to be in S (unless you go through the additional work of eliminating duplicates from the input first).

  • Related