What is the best way to shuffle/permute each n rows of a data frame in Python?

I want to shuffle the rows of a data frame within consecutive windows of n rows (n being the window size), but I am not sure how to do it in a Pythonic way. I found answers for shuffling all rows, but not for shuffling within a given window size:

def permute(df: pd.DataFrame, window_size: int = 10) -> pd.DataFrame:
    """How would you shuffle every window_size rows for the modifiable columns?"""
    df_permuted = df.copy()
    df_permuted.loc[:, modifiable_columns]  # modifiable_columns: the columns that may be shuffled
    ...
    return df_permuted

CodePudding user response：

This code defines a function called permute that takes a pandas dataframe and a window size (10 by default) and returns a new dataframe whose rows have been shuffled within each window.

The function first calculates the number of windows by dividing the length of the input dataframe by the window size. It then iterates over the windows and shuffles the rows within each one using the dataframe's sample method with frac=1, which returns all of the window's rows in random order. Finally, it concatenates the shuffled windows into a single dataframe using the concat method and returns it.

The code then tests the permute function by creating a small dataframe and printing it out, then calling the permute function on it with a window size of 3 and printing out the shuffled dataframe.

import pandas as pd

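def permute(df: pd.DataFrame, window_size: int = 10) -> pd.DataFrame:
    # Ceiling division, so a trailing partial window is kept instead of dropped
    num_windows = (len(df) + window_size - 1) // window_size

    compil = []
    for i in range(num_windows):
        start = i * window_size
        end = (i + 1) * window_size
        # iloc slicing stops at the end of the frame, so the last window may be shorter
        compil.append(df.iloc[start:end].sample(frac=1))

    df = pd.concat(compil)
    return df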

# Test the permute function
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "B": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})
print(df)

df_permuted = permute(df, window_size=3)
print(df_permuted)

CodePudding user response：

The accepted answer is not vectorized. Using groupby.sample is a better choice (here N is the window size and numpy is imported as np):

df.groupby(np.arange(len(df))//N).sample(frac=1)
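
To make the one-liner above easier to try, here is a minimal self-contained sketch. The sample frame, the value N = 3, and the random_state seed are assumptions chosen for illustration, and groupby.sample needs pandas 1.1 or newer. Integer-dividing the row positions by N gives every row a window id, and sample(frac=1) then shuffles the rows inside each group.

```python
import numpy as np
import pandas as pd

N = 3  # window size (assumed value for illustration)
df = pd.DataFrame({"A": range(1, 11), "B": range(11, 21)})

# Row positions 0..9 integer-divided by N give the ids 0,0,0,1,1,1,2,2,2,3
window_ids = np.arange(len(df)) // N

# frac=1 keeps every row of each group and returns them in random order;
# random_state is only set so that repeated runs give the same shuffle
shuffled = df.groupby(window_ids).sample(frac=1, random_state=0)
print(shuffled)
```

The trailing partial window (here the single last row) forms its own group, so no rows are dropped.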

CodePudding user response：

To add the additional requirement that is in your code's comment, but not in your question, here's a version that also takes into account modifiable columns.

In the example below, mod and mod2 are your modifiable columns, while the nomod column is not modifiable.

I believe the modifiable-columns requirement cannot be met with a vectorized approach, so this version builds on the accepted answer. Also, the accepted answer keeps another full copy of the entire df in memory, while my version only keeps a copy as large as window_size.

import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(0, 12)]*3).T
df.columns = ['mod', 'nomod', 'mod2']
df

    mod  nomod  mod2
0     0      0     0
1     1      1     1
2     2      2     2
3     3      3     3
4     4      4     4
5     5      5     5
6     6      6     6
7     7      7     7
8     8      8     8
9     9      9     9
10   10     10    10
11   11     11    11

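def permute(df, window_size, modifiable_columns):
    # Assumes df has a default RangeIndex, because .loc below slices by label.
    # Rows past the last full window are left in their original order.
    num_chunks = len(df) // window_size
    for i in range(num_chunks):
        start_ind = i * window_size
        end_ind = start_ind + window_size

        # Shuffle only the modifiable columns of this window (.loc slices include the end label).
        # Note: the fixed random_state=1 applies the same permutation to every window.
        df_row_subset = df.loc[start_ind:end_ind-1, modifiable_columns].sample(frac=1, random_state=1)
        # Relabel the shuffled rows with the window's own index so the assignment lines up by position
        df_row_subset.index = np.arange(start_ind, end_ind)

        # Write the shuffled values back; df is modified in place
        df.loc[df_row_subset.index, modifiable_columns] = df_row_subset

    return df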

permute(df, 4, ['mod', 'mod2'])

    mod  nomod  mod2
0     3      0     3
1     2      1     2
2     0      2     0
3     1      3     1
4     7      4     7
5     6      5     6
6     4      6     4
7     5      7     5
8    11      8    11
9    10      9    10
10    8     10     8
11    9     11     9
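
For comparison with the loop above, a loop-free variant may also be possible by combining groupby.sample with a positional write-back. This is only a sketch under assumptions: the function name permute_columns_by_window and its random_state parameter are made up for illustration, it works on a copy rather than in place, and it relies on groupby.sample returning its rows window by window so that positions line up.

```python
import numpy as np
import pandas as pd

def permute_columns_by_window(df, window_size, modifiable_columns, random_state=None):
    """Shuffle only `modifiable_columns` within each block of `window_size` rows."""
    out = df.copy()
    # Window id per row position: 0,0,...,1,1,... (works for any index)
    window_ids = np.arange(len(out)) // window_size
    # Shuffle the selected columns inside each window; rows come back grouped
    # by window, so row k of the result belongs at position k of the frame
    shuffled = out[modifiable_columns].groupby(window_ids).sample(
        frac=1, random_state=random_state
    )
    # Assign by position (plain array) so index alignment does not undo the shuffle
    out[modifiable_columns] = shuffled.to_numpy()
    return out

df = pd.DataFrame({"mod": range(12), "nomod": range(12), "mod2": range(12)})
result = permute_columns_by_window(df, 4, ["mod", "mod2"], random_state=0)
print(result)
```

The modifiable columns are shuffled together row-wise, so mod and mod2 stay paired, while nomod keeps its original order. Unlike the loop version, this sketch does hold a full copy of the frame in memory.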