numpy shuffle a fraction of sub-arrays

I have one-hot encoded data of arbitrary shape in an array with ndim = 3, e.g.:

import numpy as np

arr = np.array([ # Axis 0
    [ # Axis 1
        [0, 1, 0], # Axis 2
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

What I want is to shuffle values for a known fraction of sub-arrays along axis=2.

If this fraction is 0.25, then the result could be:

arr = np.array([
    [
        [1, 0, 0], # Shuffling happened here
        [1, 0, 0],
    ],
    [
        [0, 0, 1],
        [0, 1, 0],
    ],
])

I know how to do that using iterative methods like:

for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        if np.random.choice([0, 1, 2, 3]) == 0:
            np.random.shuffle(arr[i][j])

But this is extremely inefficient.

Edit: as suggested in the comments, the random selection of a known fraction should follow a uniform law.
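
To make the distinction concrete: the loop above selects each sub-array independently with probability 0.25, so the number of shuffled sub-arrays varies from run to run. A minimal sketch of selecting exactly a given fraction, uniformly and without replacement, might look like the following. The helper name `pick_exact_fraction` is hypothetical, and rounding down with `int` is an assumption, not something the question specifies.

```python
import numpy as np

def pick_exact_fraction(shape2d, fraction, rng):
    """Boolean mask of shape `shape2d` with exactly int(total * fraction) True entries.

    Hypothetical helper for illustration; rounding down is an assumption.
    """
    total = shape2d[0] * shape2d[1]
    count = int(total * fraction)
    mask = np.zeros(total, dtype=bool)
    # choose `count` distinct flat positions, each equally likely
    mask[rng.choice(total, size=count, replace=False)] = True
    return mask.reshape(shape2d)

rng = np.random.default_rng(0)
mask = pick_exact_fraction((2, 2), 0.25, rng)
print(mask.sum())  # 1
```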

CodePudding user response：

One approach:

import numpy as np

np.random.seed(42)

fraction = 0.25
total = arr.shape[0] * arr.shape[1]

# pick arrays to be shuffled
indices = np.random.choice(np.arange(total), size=int(total * fraction), replace=False)

# convert each flat index to the corresponding multi-index
multi_indices = np.unravel_index(indices, arr.shape[:2])

# advanced indexing returns a copy (not a view) of the selected sub-arrays
selected = arr[multi_indices]

# shuffle each selected sub-array independently by applying argsort to random values of the same shape
shuffled = np.take_along_axis(selected, np.argsort(np.random.random(selected.shape), axis=1), axis=1)

# set the array to the new values
arr[multi_indices] = shuffled
print(arr)

Output (of a single run)

[[[0 1 0]
  [0 0 1]]

 [[0 0 1]
  [0 1 0]]]
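
The same idea can be sketched as a reusable function built on NumPy's Generator API, where `Generator.permuted(..., axis=1)` shuffles every row independently and replaces the argsort step. This is only a sketch under assumptions: the name `shuffle_fraction` and its signature are hypothetical, `permuted` requires NumPy 1.20 or later, and the function returns a new array rather than modifying its input.

```python
import numpy as np

def shuffle_fraction(arr, fraction, rng=None):
    """Shuffle exactly int(N * M * fraction) sub-arrays of a (N, M, K) array along the last axis.

    Hypothetical helper; returns (new_array, flat_indices_of_shuffled_sub_arrays).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m, k = arr.shape
    total = n * m
    count = int(total * fraction)
    # work on a flattened copy so the input stays untouched
    flat = arr.reshape(total, k).copy()
    # pick distinct sub-arrays uniformly at random
    picked = rng.choice(total, size=count, replace=False)
    # permuted(axis=1) shuffles each selected row independently of the others
    flat[picked] = rng.permuted(flat[picked], axis=1)
    return flat.reshape(arr.shape), picked

arr = np.array([[[0, 1, 0], [1, 0, 0]],
                [[0, 0, 1], [0, 1, 0]]])
result, picked = shuffle_fraction(arr, 0.25, np.random.default_rng(42))
print(result)
```

Returning the picked indices alongside the result makes it easy to check which sub-arrays were candidates for shuffling. A shuffled one-hot row can still land back on its original order.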

CodePudding user response：

Your iterative method is fine, and probably the best option in terms of the number of logical operations involved. To my knowledge, the only way to do better is to take advantage of NumPy's vectorisation. The following code is an example:

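import numpy as np

def permute_last_maybe(x):
    N, M, K = x.shape
    # move the last axis to the front, since np.random.permutation only reorders axis 0
    y = np.transpose(x, [2, 0, 1])
    # note: this draws ONE permutation of the K positions and applies it to every sub-array
    y = np.random.permutation(y)
    # restore the original axis order
    y = np.transpose(y, [1, 2, 0])
    # mask is nonzero (keep x) with probability 0.75, so each sub-array takes y with probability 0.25
    mask = (np.random.random((N, M, 1)) > 0.25) * np.ones([N, M, K])
    return np.where(mask, x, y)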

A %timeit run shows about 300 µs instead of 4.2 ms with an array of shape (40, 40, 30). Note that this code does NOT use NumPy's new random Generators (I tried, but the overhead of creating an instance of the class was significant).

I should probably also mention that this function does not mutate the given array x; it returns a modified copy.
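
One caveat with the function above: `np.random.permutation` reorders a multi-dimensional array only along its first axis, so every selected sub-array receives the same permutation. If independent shuffles are wanted, one possible variant draws a separate random order per sub-array with `argsort`. This is a sketch, not a benchmarked replacement: the name `permute_last_independent` and the `fraction` and `rng` parameters are hypothetical, and the Generator is passed in so that it can be created once and reused.

```python
import numpy as np

def permute_last_independent(x, fraction=0.25, rng=None):
    """Return a copy of x in which each sub-array along the last axis is
    shuffled with probability `fraction`, each with its own permutation.

    Hypothetical variant for illustration; timings were not measured.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, m, k = x.shape
    # argsort of i.i.d. random values gives an independent permutation per sub-array
    order = np.argsort(rng.random((n, m, k)), axis=2)
    y = np.take_along_axis(x, order, axis=2)
    # one Bernoulli draw per sub-array, broadcast over the last axis by np.where
    mask = rng.random((n, m, 1)) < fraction
    return np.where(mask, y, x)

rng = np.random.default_rng(0)
x = np.eye(30, dtype=int)[rng.integers(0, 30, size=(40, 40))]
out = permute_last_independent(x, 0.25, rng)
print(out.shape)  # (40, 40, 30)
```

Like the original, this selects each sub-array independently with the given probability, so the number of shuffled sub-arrays is not fixed.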