Random shuffle with weight in Python

I am currently trying to shuffle an array and am running into some problems.

What I have:

my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])

What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1,1 in the array) together. What I did first was convert every NaN into a unique negative number.

my_array=array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])

Afterward I split everything up with pandas:

df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
groups = [group.iloc[:, 0] for _, group in df.groupby('sampleID')]

If I now shuffle my dataset, every group has an equal probability of appearing at a given position, but this neglects the number of elements in each group. A group of several elements like [9,9,9,9,9,9] should have a higher chance of appearing earlier than a single NaN. Correct me on this one if I'm wrong.
One way around this problem is NumPy's choice method. For this I have to create a probability array:

probability_array = np.zeros(len(groups))

for index, item in enumerate(groups):
    # Divide by the total number of elements so the probabilities sum to 1,
    # which rng.choice requires.
    probability_array[index] = len(item) / len(df)

All of this to finally call:

groups=np.array(groups,dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.nan

All of this is quite cumbersome and not very fast. Besides the fact that you can certainly code it better, I feel like I am missing some very simple solution to my problem. Can somebody point me in the right direction?

CodePudding user response：

One approach:

import numpy as np
from itertools import groupby

# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])

# find groups
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])

# permute
keys, repetitions = zip(*np.random.permutation(groups))

# recreate new array
# groups is a float array, so cast the counts back to int for np.repeat
res = np.repeat(keys, np.asarray(repetitions, dtype=int))
print(res)

Output (single run)

[ 3.  3.  3. nan nan nan nan  2.  2.  2.  1.  1. nan nan nan  4.  4.]
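
The permutation above treats every group as equally likely, so it does not apply the size-based weighting the question asks about. As a minimal sketch, assuming the weight of a group should simply be proportional to its length, the same run-length idea can be combined with rng.choice. The helper name weighted_group_shuffle is hypothetical and not part of the answer; it also relies on each NaN ending up as its own run because NaN does not compare equal to itself.

```python
import numpy as np
from itertools import groupby


def weighted_group_shuffle(values, rng=None):
    """Shuffle runs of equal values, favouring longer runs for earlier slots.

    Hypothetical helper for illustration; not part of the answer above.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Run-length encode the input as (value, run length) pairs.
    runs = [(key, sum(1 for _ in group)) for key, group in groupby(values)]
    keys = np.array([key for key, _ in runs], dtype=float)
    lengths = np.array([length for _, length in runs], dtype=int)
    # Assumption: weight is proportional to run length; p must sum to 1.
    p = lengths / lengths.sum()
    # Draw every run exactly once, without replacement, using those weights.
    order = rng.choice(len(runs), size=len(runs), replace=False, p=p)
    return np.repeat(keys[order], lengths[order])


my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3])
shuffled = weighted_group_shuffle(my_array, rng=np.random.default_rng(0))
print(shuffled)
```

Passing a seeded generator, as in the last lines, makes a run reproducible; omit the rng argument for a fresh shuffle each time.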

CodePudding user response：

I have solved your problem under some restrictions:

Instead of NaN, I have used zeros as separators.
I assumed that an array of yours ALWAYS starts with a sequence of non-zero integers and ends with another sequence of non-zero integers.

With these provisions, I have essentially shuffled a representation of the sequences of integers, and later I have stitched everything in place again.

In [102]: import numpy as np
     ...: from itertools import groupby
     ...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
     ...: print(a)
     ...: n, z = [], []
     ...: for i,g in groupby(a):
     ...:     if i:
     ...:         n.append((i, sum(1 for _ in g)))
     ...:     else:
     ...:         z.append(sum(1 for _ in g))
     ...: np.random.shuffle(n)
     ...: nn = n[0]
     ...: b = [*[nn[0]]*nn[1]]
     ...: for zz, nn in zip(z, n[1:]):
     ...:     b += [*[0]*zz, *[nn[0]]*nn[1]]
     ...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]

Note

The lengths of the runs of separators in the shuffled array are exactly the same as in the original array, but shuffling the separators as well is easy. A more difficult problem would be to change the lengths arbitrarily while keeping the array length unchanged.
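
A rough sketch of the "shuffling also the separators" variant mentioned in the note, under the same assumptions as the answer (zeros as separators, and the array starts and ends with a non-zero run). The seeded generator and the variable names beyond a, n, z and b are choices made for this illustration only.

```python
import numpy as np
from itertools import groupby

rng = np.random.default_rng(0)  # seeded only to make the run reproducible
a = np.array([int(c) for c in "1110022220003044440005500000600777"])

# Run-length encode, keeping value runs and separator (zero) runs apart.
n, z = [], []
for key, group in groupby(a.tolist()):
    length = sum(1 for _ in group)
    if key:
        n.append((key, length))
    else:
        z.append(length)

# Shuffle the value runs and the separator lengths independently.
n = [n[i] for i in rng.permutation(len(n))]
z = [z[i] for i in rng.permutation(len(z))]

# Stitch back together: value run, separator run, value run, ...
b = [n[0][0]] * n[0][1]
for zeros, (value, length) in zip(z, n[1:]):
    b += [0] * zeros + [value] * length
b = np.array(b)
print(b)
```

Because only the order of the separator runs changes, the total length of the array stays the same as in the original.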