How do I create a random mask matrix where we mask a contiguous length?

Time:11-25

How do I create a 10000 x 1000 mask matrix randomly such that each row has 3 contiguous masked entries of length 100? One naive way of doing this is as follows:

import numpy as np
mask = np.ones((10000, 1000))
idx = np.random.choice(mask.shape[1] - 100, 3 * mask.shape[0]).reshape([mask.shape[0], 3])
for i, id in enumerate(idx):
    for j in range(3):
        for k in range(100):
            mask[i][id[j] + k] = 0

However, this is extremely inefficient and takes a lot of time. What would be an efficient implementation? Also, it would be nice if the three blocks in a row are non-overlapping.

CodePudding user response:

You can create a list of indices for each row and apply it to the mask directly, instead of using the two inner for loops. For example:

mask = np.ones((10000, 1000))
for i in range(len(mask)):
    start_indices = np.random.choice(900, 3)
    indices = [idx for start_idx in start_indices for idx in range(start_idx, start_idx + 100)]
    mask[i][indices] = 0
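The per-row loop can also be removed entirely (overlaps still allowed) by broadcasting the start indices into full blocks of column indices and fancy-indexing all rows at once. A sketch with my own variable names:

```python
import numpy as np

n_rows, n_cols, block, n_blocks = 10000, 1000, 100, 3

rng = np.random.default_rng()
mask = np.ones((n_rows, n_cols))

# one (n_rows, n_blocks) array of start positions; blocks may overlap
starts = rng.integers(0, n_cols - block, size=(n_rows, n_blocks))

# expand every start into `block` consecutive column indices
cols = (starts[:, :, None] + np.arange(block)).reshape(n_rows, -1)

# zero out all blocks at once via a broadcast row index
mask[np.arange(n_rows)[:, None], cols] = 0
```

Since overlapping blocks overwrite each other, a row ends up with anywhere between 100 and 300 zeros.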

To make sure that the blocks are non-overlapping, add this as a condition for the indices as follows:

mask = np.ones((10000, 1000))
for i in range(len(mask)):
    cond = True
    while cond:
        start_indices = sorted(np.random.choice(900, 3))
        cond = any([True for idx1, idx2 in zip(start_indices, start_indices[1:]) if idx1 + 100 >= idx2])
    
    indices = [idx for start_idx in start_indices for idx in range(start_idx, start_idx + 100)]
    mask[i][indices] = 0
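If the exact sampling distribution is not critical, the rejection loop can be replaced by a vectorized "gap" construction: draw three sorted offsets into the 700 columns of free space, then shift the j-th block right by j*100 so blocks can touch but never overlap. A sketch with my own variable names (note that its distribution over placements differs slightly from the rejection-sampling version above):

```python
import numpy as np

n_rows, n_cols, block, n_blocks = 10000, 1000, 100, 3
free = n_cols - n_blocks * block  # 700 columns not covered by any block

rng = np.random.default_rng()

# sorted offsets into the free space; shifting block j by j*block
# guarantees starts are at least `block` apart
offsets = np.sort(rng.integers(0, free + 1, size=(n_rows, n_blocks)), axis=1)
starts = offsets + np.arange(n_blocks) * block

mask = np.ones((n_rows, n_cols))
cols = (starts[:, :, None] + np.arange(block)).reshape(n_rows, -1)
mask[np.arange(n_rows)[:, None], cols] = 0
```

Because the blocks are disjoint by construction, every row contains exactly 300 zeros.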

Timings:

# original
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# overlaps allowed
1.41 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# no overlaps
2.25 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CodePudding user response:

I got quite a good performance improvement (30-40x faster than the original).

I make sure the zeros do not overlap:

  • In each row there are 700 ones; I split 700 into 4 random integers (so they sum up to 700) -> these are the sizes of the runs of ones
  • I calculate the start indices of the zero blocks from those run sizes

def faster_than_original():
    zeros_size = 100
    n_zeros = 3
    mask = np.ones((10000, 1000))
    indices_weights = np.random.random((mask.shape[0], n_zeros + 1))

    number_of_ones = mask.shape[1] - zeros_size * n_zeros
    ones_sizes = np.round(indices_weights[:, :n_zeros].T
                          * (number_of_ones / np.sum(indices_weights, axis=-1))).T.astype(np.int32)
    ones_sizes[:, 1:] += zeros_size
    zeros_start_indices = np.cumsum(ones_sizes, axis=-1)
    for sample_idx in range(len(mask)):
        for zeros_idx in zeros_start_indices[sample_idx]:
            mask[sample_idx, zeros_idx: zeros_idx + zeros_size] = 0
    return mask
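Whichever construction you use, it is worth sanity-checking the result. The helper below (my own, hypothetical `check_mask`) finds the runs of zeros in each row and verifies there are exactly three disjoint runs of length 100; note that two blocks that merely touch merge into a single run and fail the check:

```python
import numpy as np

def check_mask(mask, n_blocks=3, block=100):
    """Return True if every row has exactly n_blocks separate runs
    of `block` consecutive zeros."""
    for row in mask:
        # pad with ones so runs touching the edges are detected too
        diff = np.diff(np.concatenate(([1], row, [1])))
        starts = np.flatnonzero(diff == -1)  # 1 -> 0 transitions
        ends = np.flatnonzero(diff == 1)     # 0 -> 1 transitions
        if len(starts) != n_blocks or not np.all(ends - starts == block):
            return False
    return True

# tiny example: 2 rows with hand-placed, non-touching blocks
good = np.ones((2, 1000))
for s in (0, 200, 500):
    good[:, s:s + 100] = 0
```

Here `check_mask(good)` returns True, while a row with a single 150-long run of zeros would fail.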

Profiling:

    42         1    8974014.0 8974014.0     76.2      mask = original()
    43         1     235235.0 235235.0      2.0      mask2 = faster_than_original()
    44         1    2565371.0 2565371.0     21.8      mask3 = shaido_method()