I have a DataFrame which is structured as follows:
>>> df
   a   b  id
0  1   4   3
1  4   1   2
2  7   5   1

3  2   9   3
4  4  11   2
5  2   7   1

6  3   4   2
7  9   2   1
I have added blank lines in the code for readability.
Now I want to shuffle the rows by their id blocks, but keep the initial descending order of column id within each block. What is the best way? A possible output would look like the following:
>>> df
   a   b  id
0  3   4   2
1  9   2   1
2  2   9   3
3  4  11   2
4  2   7   1
5  1   4   3
6  4   1   2
7  7   5   1
So in principle I just want the blocks to be mixed, i.e. randomly moved to another position as a whole.
CodePudding user response:
Use an ordered categorical column to sort the values by block:
import pandas as pd

out = df.assign(order=df['id'].ge(df['id'].shift()).cumsum()).sample(frac=1)
cat = pd.CategoricalDtype(out['order'].unique(), ordered=True)
out = out.astype({'order': cat}).sort_values(['order', 'id'], ascending=False)
print(out)
# Output:
   a   b  id order
0  1   4   3     0
1  4   1   2     0
2  7   5   1     0
6  3   4   2     2
7  9   2   1     2
3  2   9   3     1
4  4  11   2     1
5  2   7   1     1
Obviously, you can remove the order column by appending .drop(columns='order') after sort_values, but I keep it here for demonstration purposes.
The key here is to set ordered=True on your new categorical dtype.
>>> cat
CategoricalDtype(categories=[1, 2, 0], ordered=True)
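To see why ordered=True matters, here is a minimal sketch (a hypothetical toy series, not the data above) showing that an ordered categorical sorts by category position rather than by numeric value:

```python
import pandas as pd

# Categories listed as [1, 2, 0]: sorting follows this order, not numeric order.
cat = pd.CategoricalDtype([1, 2, 0], ordered=True)
s = pd.Series([0, 1, 2, 2, 0], dtype=cat)

# sort_values places 1 first, then the 2s, then the 0s.
print(s.sort_values().tolist())  # → [1, 2, 2, 0, 0]
```

Without ordered=True, sort_values on a categorical raises or falls back to lexical behavior, so the shuffled block order from unique() would not be respected.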
CodePudding user response:
Create groups by the difference in id (each group starts where the difference is not -1), then get the unique group ids, shuffle them, and change the ordering with DataFrame.loc:
df['g'] = df['id'].diff().ne(-1).cumsum()
# if the difference is not always exactly -1, use instead:
df['g'] = df['id'].ge(df['id'].shift()).cumsum()
print(df)
   a   b  id  g
0  1   4   3  1
1  4   1   2  1
2  7   5   1  1
3  2   9   3  2
4  4  11   2  2
5  2   7   1  2
6  3   4   2  3
7  9   2   1  3
import numpy as np

ids = df['g'].unique()
np.random.shuffle(ids)
df = df.set_index('g').loc[ids].reset_index(drop=True)
print(df)
   a   b  id
0  1   4   3
1  4   1   2
2  7   5   1
3  3   4   2
4  9   2   1
5  2   9   3
6  4  11   2
7  2   7   1
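The same block shuffle can also be sketched with groupby and np.random.permutation (a hypothetical variant, not the answer's code; it rebuilds the sample data from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7, 2, 4, 2, 3, 9],
                   'b': [4, 1, 5, 9, 11, 7, 4, 2],
                   'id': [3, 2, 1, 3, 2, 1, 2, 1]})

# A new block starts wherever id does not drop by exactly 1.
df['g'] = df['id'].diff().ne(-1).cumsum()

# Collect the blocks, shuffle their order, and concatenate them again.
blocks = [grp for _, grp in df.groupby('g', sort=False)]
rng = np.random.default_rng()
shuffled = pd.concat([blocks[i] for i in rng.permutation(len(blocks))],
                     ignore_index=True).drop(columns='g')
print(shuffled)
```

This keeps each block intact while randomizing the block order; the set_index/loc approach in the answer achieves the same thing without materializing a Python list of frames.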
If you need to inspect the groups via the helper column, change the last reset_index(drop=True) to reset_index():
ids = df['g'].unique()
np.random.shuffle(ids)
df = df.set_index('g').loc[ids].reset_index()
print(df)
   g  a   b  id
0  2  3   4   2
1  2  9   2   1
2  1  2   9   3
3  1  4  11   2
4  1  2   7   1
5  0  1   4   3
6  0  4   1   2
7  0  7   5   1
Performance: on the sample data, I guess the sorting is the reason for the slower performance of the other solution.
# 4k rows
df = pd.concat([df] * 500, ignore_index=True)
print(df)
In [70]: %%timeit
...: out = df.assign(order=df['id'].ge(df['id'].shift()).cumsum()).sample(frac=1)
...: cat = pd.CategoricalDtype(out['order'].unique(), ordered=True)
   ...: out = out.astype({'order': cat}).sort_values(['order', 'id'], ascending=False)
...:
6.13 ms ± 845 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['g'] = df['id'].diff().ne(-1).cumsum()
ids = df['g'].unique()
np.random.shuffle(ids)
df.set_index('g').loc[ids].reset_index(drop=True)
3.93 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)