I want to shuffle the rows of a DataFrame within each block of n consecutive rows (the window size), but I am not sure how to do it in a Pythonic way. I found answers for shuffling all rows, but not for shuffling within a given window size:
def permute(df: pd.DataFrame, window_size: int = 10) -> pd.DataFrame:
    df_permuted = df.copy()
    """How would you shuffle every window_size rows for the modifiable columns?"""
    df_permuted.loc[:, modifiable_columns]
    ...
    return df_permuted
CodePudding user response:
This code defines a function called permute that takes a pandas DataFrame and a window size (10 by default) and returns a new DataFrame whose rows are shuffled within each window.
The function first calculates the number of full windows by integer-dividing the length of the input DataFrame by the window size, so any leftover rows that do not fill a complete window are dropped from the result. It then iterates over the windows and shuffles the rows within each one using the DataFrame's sample method with frac=1, which returns all rows of the window in random order. Finally, it concatenates the shuffled windows into a single DataFrame with concat and returns it.
The code then tests the permute function by creating a small dataframe and printing it out, then calling the permute function on it with a window size of 3 and printing out the shuffled dataframe.
import pandas as pd

def permute(df: pd.DataFrame, window_size: int = 10) -> pd.DataFrame:
    # Number of full windows; leftover rows at the end are not included
    num_windows = len(df) // window_size
    compil = []
    for i in range(num_windows):
        start = i * window_size
        end = (i + 1) * window_size
        # sample(frac=1) returns every row of the window in random order
        compil.append(df.iloc[start:end].sample(frac=1))
    df = pd.concat(compil)
    return df

# Test the permute function
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "B": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})
print(df)
df_permuted = permute(df, window_size=3)
print(df_permuted)
Output:
    A   B
0   1  11
1   2  12
2   3  13
3   4  14
4   5  15
5   6  16
6   7  17
7   8  18
8   9  19
9  10  20
   A   B
2  3  13
0  1  11
1  2  12
5  6  16
3  4  14
4  5  15
6  7  17
8  9  19
7  8  18
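Note that the shuffled output above has only 9 rows: with 10 rows and a window size of 3 there are 10 // 3 = 3 full windows, so row 9 is left out. If you want to keep the leftover rows, one option is to step through the frame in strides of window_size so the shorter last window is also visited. This is only a sketch; the name permute_keep_tail and the seed parameter are made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd


def permute_keep_tail(df: pd.DataFrame, window_size: int = 10, seed=None) -> pd.DataFrame:
    """Shuffle rows within consecutive windows, keeping a shorter final window."""
    if window_size < 1:
        raise ValueError("window_size must be a positive integer")
    rng = np.random.default_rng(seed)  # seed is a hypothetical convenience parameter
    pieces = []
    # Stepping by window_size visits the trailing partial window as well
    for start in range(0, len(df), window_size):
        window = df.iloc[start:start + window_size]
        # Reorder the window's rows by a random permutation of their positions
        pieces.append(window.iloc[rng.permutation(len(window))])
    return pd.concat(pieces) if pieces else df.copy()


df = pd.DataFrame({"A": range(1, 11), "B": range(11, 21)})
shuffled = permute_keep_tail(df, window_size=3, seed=0)
print(shuffled)
```

With 10 rows and a window size of 3, the last window contains only row 9, so it stays in place but is no longer dropped.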
CodePudding user response:
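The accepted answer is not vectorized; it loops over the windows in Python. Using groupby.sample (available since pandas 1.1) is a better choice. Here N is the window size and numpy is imported as np: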
df.groupby(np.arange(len(df))//N).sample(frac=1)
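For completeness, here is a small self-contained run of that one-liner. The example frame, the value N = 3, and the fixed random_state are assumptions chosen for illustration, not something from the question.

```python
import numpy as np
import pandas as pd

N = 3  # window size (assumed value for this example)
df = pd.DataFrame({"A": range(1, 11), "B": range(11, 21)})

# Integer-dividing row positions by N labels each window: 0,0,0,1,1,1,2,2,2,3
window_id = np.arange(len(df)) // N

# Shuffle rows within each window; random_state is fixed only for repeatability
shuffled = df.groupby(window_id).sample(frac=1, random_state=0)
print(shuffled)
```

Because the grouping key is built from row positions, the trailing partial window (here just row 9) forms its own group and is kept in the result.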
CodePudding user response:
To cover the additional requirement that appears in your code's comment but not in your question, here is a version that also takes the modifiable columns into account.
In the example below, mod and mod2 are your modifiable columns, while the nomod column is not modifiable.
As far as I can tell, the one-line groupby approach does not handle this directly, because it shuffles whole rows rather than selected columns, so this version builds on the accepted answer's loop instead. Also, the accepted answer keeps a second full copy of the entire df in memory, while my version modifies df in place and only holds a subset as large as window_size at any time.
import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(0, 12)]*3).T
df.columns = ['mod', 'nomod', 'mod2']
df
    mod  nomod  mod2
0     0      0     0
1     1      1     1
2     2      2     2
3     3      3     3
4     4      4     4
5     5      5     5
6     6      6     6
7     7      7     7
8     8      8     8
9     9      9     9
10   10     10    10
11   11     11    11
def permute(df, window_size, modifiable_columns):
    # Assumes df has the default 0..n-1 integer index; modifies df in place
    num_chunks = len(df) // window_size
    for i in range(num_chunks):
        start_ind = i * window_size
        end_ind = start_ind + window_size
        # .loc slicing is label-based and inclusive, hence end_ind - 1
        df_row_subset = df.loc[start_ind:end_ind - 1, modifiable_columns].sample(frac=1, random_state=1)
        # Relabel the shuffled rows so the assignment below aligns them back onto this window
        df_row_subset.index = np.arange(start_ind, end_ind)
        df.loc[df_row_subset.index, modifiable_columns] = df_row_subset
    return df

permute(df, 4, ['mod', 'mod2'])
    mod  nomod  mod2
0     3      0     3
1     2      1     2
2     0      2     0
3     1      3     1
4     7      4     7
5     6      5     6
6     4      6     4
7     5      7     5
8    11      8    11
9    10      9    10
10    8     10     8
11    9     11     9
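In the output above, every window is reordered in the same pattern, because random_state=1 restarts the random generator on each iteration. If you would rather have an independent shuffle per window, leave the input untouched, and avoid the loop over windows, one possible sketch is to sort row positions by (window id, random key) with np.lexsort and reorder only the chosen columns. The name permute_columns and the seed parameter are hypothetical, and this has not been benchmarked against the loop version.

```python
import numpy as np
import pandas as pd


def permute_columns(df, window_size, modifiable_columns, seed=None):
    """Return a copy of df with only modifiable_columns shuffled within each window."""
    rng = np.random.default_rng(seed)  # seed is a hypothetical convenience parameter
    n = len(df)
    window_id = np.arange(n) // window_size
    # lexsort treats the LAST key as primary: rows stay grouped by window,
    # and the random key decides the order inside each window
    order = np.lexsort((rng.random(n), window_id))
    out = df.copy()
    for col in modifiable_columns:
        # Positional reordering; the same order is used for every column,
        # so the modifiable columns stay consistent with each other
        out[col] = df[col].to_numpy()[order]
    return out


df = pd.DataFrame({"mod": np.arange(12), "nomod": np.arange(12), "mod2": np.arange(12)})
result = permute_columns(df, 4, ["mod", "mod2"], seed=0)
print(result)
```

Since the reordering is positional, this does not depend on the frame having the default integer index, and a trailing partial window is shuffled among its own rows.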