How to shuffle according column (id) but keep descending True

Time:12-08

I have a dataframe which is structured as follows:

>>>df
   a  b  id  
0  1  4  3  
1  4  1  2  
2  7  5  1

3  2  9  3
4  4  11 2
5  2  7  1

6  3  4  2
7  9  2  1

I have added blank lines between the blocks for readability.

Now I want to shuffle the blocks defined by id while keeping the descending order of column id within each block intact. What is the best way? A possible output would look like the following:

>>>df
   a  b  id  
0  3  4  2
1  9  2  1

2  2  9  3
3  4  11 2
4  2  7  1

5  1  4  3  
6  4  1  2  
7  7  5  1

So in principle I just want the blocks to be shuffled, i.e. randomly rearranged relative to each other while each block stays unchanged internally.

CodePudding user response:

Use a categorical index to sort values by block:

import pandas as pd

out = df.assign(order=df['id'].ge(df['id'].shift()).cumsum()).sample(frac=1)
cat = pd.CategoricalDtype(out['order'].unique(), ordered=True)
out = out.astype({'order': cat}).sort_values(['order', 'id'], ascending=False)
print(out)

# Output:
   a   b  id order
0  1   4   3     0
1  4   1   2     0
2  7   5   1     0
6  3   4   2     2
7  9   2   1     2
3  2   9   3     1
4  4  11   2     1
5  2   7   1     1

Obviously, you can remove the order column by appending .drop(columns='order') after sort_values, but I keep it here for demonstration purposes.

The key here is to set ordered=True on the new categorical dtype.

>>> cat
CategoricalDtype(categories=[1, 2, 0], ordered=True)
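As a minimal standalone sketch (a toy series, not the question's dataframe), this is what ordered=True buys you: sort_values follows the declared category order, not the numeric order of the values.

```python
import pandas as pd

# With ordered=True, sorting follows the declared category order [1, 2, 0]
# rather than the usual numeric order of the values.
s = pd.Series([0, 1, 2, 0, 1, 2])
cat = pd.CategoricalDtype(categories=[1, 2, 0], ordered=True)
print(s.astype(cat).sort_values().tolist())  # [1, 1, 2, 2, 0, 0]
```

This is exactly the mechanism that lets the shuffled block labels dictate the final row order.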

CodePudding user response:

Create groups by the difference in id - each group starts where the difference is not -1 - then get the unique group ids, shuffle them, and change the ordering with DataFrame.loc:

import numpy as np
import pandas as pd

df['g'] = df['id'].diff().ne(-1).cumsum()
# if the difference is not always exactly -1, use this more general version:
df['g'] = df['id'].ge(df['id'].shift()).cumsum()
print (df)
   a   b  id  g
0  1   4   3  1
1  4   1   2  1
2  7   5   1  1
3  2   9   3  2
4  4  11   2  2
5  2   7   1  2
6  3   4   2  3
7  9   2   1  3

ids = df['g'].unique()
np.random.shuffle(ids)
df = df.set_index('g').loc[ids].reset_index(drop=True)
print (df)
   a   b  id
0  1   4   3
1  4   1   2
2  7   5   1
3  3   4   2
4  9   2   1
5  2   9   3
6  4  11   2
7  2   7   1

If you need to inspect the groups via the helper column, change the last reset_index(drop=True) to reset_index():

ids = df['g'].unique()
np.random.shuffle(ids)
df = df.set_index('g').loc[ids].reset_index()
print (df)
   g  a   b  id
0  2  3   4   2
1  2  9   2   1
2  1  2   9   3
3  1  4  11   2
4  1  2   7   1
5  0  1   4   3
6  0  4   1   2
7  0  7   5   1
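Putting this answer together as a self-contained sketch (using the question's data, with numpy's default_rng seeded for reproducibility - my addition, not part of the original answer), one can also verify afterwards that every block is still internally descending:

```python
import numpy as np
import pandas as pd

# The question's dataframe: three descending blocks (3-2-1, 3-2-1, 2-1).
df = pd.DataFrame({'a': [1, 4, 7, 2, 4, 2, 3, 9],
                   'b': [4, 1, 5, 9, 11, 7, 4, 2],
                   'id': [3, 2, 1, 3, 2, 1, 2, 1]})

rng = np.random.default_rng(0)  # seeded generator, assumed for reproducibility

# Label each block: a new block starts wherever id does not decrease.
df['g'] = df['id'].ge(df['id'].shift()).cumsum()
ids = df['g'].unique()
rng.shuffle(ids)
out = df.set_index('g').loc[ids].reset_index(drop=True)

# Sanity check: every block in the result is still internally descending.
blocks = out['id'].ge(out['id'].shift()).cumsum()
assert out.groupby(blocks)['id'].apply(lambda s: s.is_monotonic_decreasing).all()
print(len(out))  # 8 rows, same as the input
```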

Performance: On the sample data, I guess the repeated sorting is the reason for the slower performance of the other solution.

#4k rows
df = pd.concat([df] * 500, ignore_index=True)
print (df)

In [70]: %%timeit
    ...: out = df.assign(order=df['id'].ge(df['id'].shift()).cumsum()).sample(frac=1)
    ...: cat = pd.CategoricalDtype(out['order'].unique(), ordered=True)
    ...: out = out.astype({'order': cat}).sort_values(['order', 'id'], ascending=False)
    ...: 
6.13 ms ± 845 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
df['g'] = df['id'].diff().ne(-1).cumsum()
ids = df['g'].unique()
np.random.shuffle(ids)
df.set_index('g').loc[ids].reset_index(drop=True)


3.93 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
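Outside IPython, a plain timeit harness for the faster approach could look like the following sketch (absolute timings will of course differ by machine; `block_shuffle` is a helper name of my own):

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'a': [1, 4, 7, 2, 4, 2, 3, 9],
                     'b': [4, 1, 5, 9, 11, 7, 4, 2],
                     'id': [3, 2, 1, 3, 2, 1, 2, 1]})
df = pd.concat([base] * 500, ignore_index=True)  # 4k rows

def block_shuffle(frame):
    # label blocks, shuffle the labels, then reindex by them
    g = frame['id'].diff().ne(-1).cumsum()
    ids = g.unique()
    np.random.shuffle(ids)
    return frame.set_index(g).loc[ids].reset_index(drop=True)

print(timeit.timeit(lambda: block_shuffle(df), number=100))
```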