I am using the function numpy.random.choice for generating random samples at once. But I'd like all the samples to be different. Is somebody aware of a function doing this? Explicitly, I'd like to have this:
import numpy as np
a = np.random.choice(62, size=(1000000, 8))
assert( len(set([tuple(a[i]) for i in range(a.shape[0])])) == a.shape[0])
The values on the integers can be replaced. The only which is required is that all row entries to be different.
CodePudding user response:
First things first, if you have a numpy version >= 1.17 avoid using np.random.choice
for the recommended method:
rng = np.random.default_rng()
rng.choice
Ironically enough, doing what you did is the best way to go about it. Just generate all of the numbers and make a check that it satisfies your restrictions.
samples = 1000000
while True:
a = np.random.choice(62, size=(samples, 8))
if len(set(tuple(row) for row in a)) == samples:
break
The reason for that is each sample has 8 values, where each value can take up to 62 different values. So effectively you have 62**8 unique samples. This is such a huge number compared to the 1 million samples you want to draw and considering the birthday problem they will all be unique 99.8% of the time. And if they are not, a second draw virtually guarantees that. You won't find yourself caught in an infinite loop.
Normally the way you'd go about this is drawing each sample in a loop and check if it has been encountered before.
seen = set()
draws = []
while len(draws) < samples:
draw = tuple(np.random.choice(62, size=8))
if draw not in seen:
seen.add(draw)
draws.append(draw)
a = np.array(draws)
This turns out to be much slower because of the python loops and the numerous calls to np.random.choice
. On my machine this clocks 15 seconds compared to the method above which only takes 2 seconds. Now, if the first method creates duplicate samples so frequently that you'll be in that loop for more than 7-8 iterations the second method becomes more efficient. But this isn't your case for the reason explained above.
Edit
A hybrid approach would be to generate all the numbers like in the first method but then instead of creating a set of the samples, use a dict to track in which row each sample has been encountered. Then if there are any duplicates, you won't have to generate a whole new array, but just replace a few individual samples.
from collections import defaultdict
import numpy as np
value = 20
samples = 1000000
length = 8
a = np.random.choice(value, size=(samples, length))
d = defaultdict(list)
for i, row in enumerate(a):
d[tuple(row)].append(i)
if len(d) < samples:
print(f'Found {samples - len(d)} duplicates')
idx = []
for rows in d.values():
if len(rows) > 1:
idx.extend(rows[1:])
del rows[1:]
while idx:
draw = np.random.choice(value, size=length)
if t := tuple(draw) not in d:
d[t].append(idx[-1])
a[idx.pop()] = draw
print('Done')
Again, for value = 62
you will very likely be fine with one draw. But for value = 20
it generates on average 20 duplicates with near certainty. It is thus faster to replace those few samples with new unique ones instead of using the second method from above. By the time you increase the value to value = 30
, it's almost a 50-50 whether you'll get a duplicate or not. While this approach has a lot more code in it, it retains a lot of the speed advantages by just generating the whole array in one go.
In your case I would still use the top suggested method because it's so unlikely to generate any duplicates that the only reason you even spend a line for a sanity check is just for the unthinkable. No need to complicate matters more.