Generate random binary matrix with all theirs rows different using numpy-CodePudding

I need to generate a random binary matrix with dimensions m x n where all their rows are different among themselves. Using numpy I tried

import numpy as np
import random
n = 512
m = 1000

a = random.sample(range(0, 2**n), m)
a = np.array(a)
a = np.reshape(a, (m, 1))
np.unpackbits(a.view(np.uint8), axis=1)

But it is not suitable for my case because n > 128 and m > 1000. So, the code above generates only rows with at most 62 elements. Could you help me, please?

CodePudding user response：

You could generate a random array of 0's and 1's with numpy.random.choice and then make sure that the rows are different through numpy.unique:

import numpy as np

m = 1000
n = 512

while True:
    a = np.random.choice([0, 1], size=(m, n))
    if len(np.unique(a, axis=0)) == m:
        break

CodePudding user response：

I would try creating one row at a time and check if that row exists already via a set which has a membership testing runtime of O(1). If the row exists simply generate another 1, if not add it to the array and move to the next row until you are done. This principle can be made faster by:

Setting the unique counter to 0
generating m - counter rows, adding the unique rows to the solution
increasing counter the by unique rows added
if counter == m you are done, else return to 2

The implementation is as follows:

import numpy as np
n = 128
m = 1000
a = np.zeros((m,n))
rows = set()
counter = 0
while counter < m:
    temp = np.random.randint(0, 2, (m-counter, n))
    for row in temp:
        if tuple(row) not in rows:
            rows.add(tuple(row))
            a[counter] = row
            counter  = 1

Runtime comparison

By generating all the matrix at once and checking if all the rows are unique you are saving a lot of time, only if n >> log2(m).

Example 1 with the following:

n = 128
m = 1000

I ran my suggestion and the solution mentions in the other answer, resulting in:

# my suggestion
17.7 ms ± 328 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# generating all matrix at once and chacking if all rows are unique
4.62 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is because the probabily of generating m different rows is very high in this situation.

Example 2 When changing to:

n = 10
m = 1024

I ran my suggestion and the solution mentions in the other answer, resulting in:

# my suggestion
26.3 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The suggestion of generating all matrix at once and checking if all rows are unique did not finish running. This is because when math.log2(m) == n there are exactly m valid matrices and the probability of generating a valid matrix randomly approaches 0 as the shape of the matrix increases. More accurately, the probability of generating a valid matrix in that case is:

P(valid) = m / 2**(n*m)

For n=5 and m=2**5 = 32 the probability is approximately 2e-47

CodePudding user response：

You could create a matrix with unique rows and shuffle the rows:

n = 512
m = 1000

d = np.arange(m) # m unique numbers
d = ((d[:, None] & (1 << d[:n])) > 0).astype(np.uint8) # convert to binary array
i = np.random.randn(m).argsort() # indices used for shuffling rows
a = d[i] # output

all rows are unique:

assert len(np.unique(a, axis=0)) == m

Timings

n=128, m=1000:

271 µs ± 6.06 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

n=2**10, m=2**14:

50.9 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

This works best for n <= m, otherwise you need to swap d[:n] with np.arange(n), resulting in longer runtime.