I am currently trying to shuffle an array and am running into some problems.
What I have:
my_array=array([nan, 1, 1, nan, nan, 2, nan, ..., nan, nan, nan])
What I want to do:
I want to shuffle the dataset while keeping the numbers (e.g. the 1,1
in the array) together.
What I did is first converting every nan
into an unique negative number.
my_array=array([-1, 1, 1, -2, -3, 2, -4, ..., -2158, -2159, -2160])
Afterward I split everything up with pandas:
df = pd.DataFrame(my_array)
df.rename(columns={0: 'sampleID'}, inplace=True)
groups = [df.iloc[:, 0] for _, df in df.groupby('sampleID')]
If I know shuffle my dataset I will have an equal probability for every group to appear at a given place, but this would neglect the number of elements in each group. If I have a group of several elements like [9,9,9,9,9,9]
it should have a higher chance at appearing earlier than some random nan
. Correct me on this one if I'm wrong.
One way to get around this problem is numpys choice method.
For this I have to create a probability array
probability_array = np.zeros(len(groups))
for index, item in enumerate(groups):
probability_array[index] = len(item) / len(groups)
All of this to finally call:
groups=np.array(groups,dtype=object)
rng = np.random.default_rng()
shuffled_indices = rng.choice(len(groups), len(groups), replace=False, p=probability_array)
shuffled_array = np.concatenate(groups[shuffled_indices]).ravel()
shuffled_array[shuffled_array < 1] = np.NaN
All of this is quite cumbersome and not very fast. Besides the fact that you can certainly code it better, I feel like I am missing some very simple solution to my problem. Can somebody point me in the right direction?
CodePudding user response:
One approach:
import numpy as np
from itertools import groupby
# toy data
my_array = np.array([np.nan, 1, 1, np.nan, np.nan, 2, 2, 2, np.nan, 3, 3, 3, np.nan, 4, 4, np.nan, np.nan])
# find groups
groups = np.array([[key, sum(1 for _ in group)] for key, group in groupby(my_array)])
# permute
keys, repetitions = zip(*np.random.permutation(groups))
# recreate new array
res = np.repeat(keys, repetitions)
print(res)
Output (single run)
[ 3. 3. 3. nan nan nan nan 2. 2. 2. 1. 1. nan nan nan 4. 4.]
CodePudding user response:
I have solved your problem under some restrictions
- Instead of NaN, I have used zeros as separators
- I assumed that an array of yours ALWAYS starts with a sequence of non-zero integers and ends with another sequence of non-zero integers.
With these provisions, I have essentially shuffled a representation of the sequences of integers, and later I have stitched everything in place again.
In [102]: import numpy as np
...: from itertools import groupby
...: a = np.array([int(_) for _ in '1110022220003044440005500000600777'])
...: print(a)
...: n, z = [], []
...: for i,g in groupby(a):
...: if i:
...: n.append((i, sum(1 for _ in g)))
...: else:
...: z.append(sum(1 for _ in g))
...: np.random.shuffle(n)
...: nn = n[0]
...: b = [*[nn[0]]*nn[1]]
...: for zz, nn in zip(z, n[1:]):
...: b = [*[0]*zz, *[nn[0]]*nn[1]]
...: print(np.array(b))
[1 1 1 0 0 2 2 2 2 0 0 0 3 0 4 4 4 4 0 0 0 5 5 0 0 0 0 0 6 0 0 7 7 7]
[7 7 7 0 0 1 1 1 0 0 0 4 4 4 4 0 6 0 0 0 5 5 0 0 0 0 0 2 2 2 2 0 0 3]
Note
The lengths of the runs of separators in the shuffled array is exactly the same as in the original array, but shuffling also the separators is easy. A more difficult problem would be to change arbitrarily the lengths, keepin' the array length unchanged.