How to create a loop that takes an existing df and creates a randomized new df

I am trying to build a tool that will essentially scramble a dataset while maintaining the same elements. For example, if I have the table below

1  2    3     4     5       6
0  ABC  1234  NL00  Paid    VISA
1  BCD  2345  NL01  Unpaid  AMEX
2  CDE  3456  NL02  Unpaid  VISA

I want it to go through each column, pick a random value, and paste that into a new df. An example output would be

1  2    3     4     5       6
2  BCD  2345  NL01  Unpaid  VISA
0  BCD  1234  NL02  Unpaid  VISA
0  CDE  3456  NL01  Paid    VISA

I have managed to make it work with the code below, but for 24 columns the code is quite repetitive. I know a loop should be able to do this much more concisely; I just have not been able to make it work.

import pandas as pd
import random

lst1 = df['1'].to_list()
lst2 = df['2'].to_list()
lst3 = df['3'].to_list()
lst4 = df['4'].to_list()
lst5 = df['5'].to_list()
lst6 = df['6'].to_list()

df_new = pd.DataFrame()

df_new['1'] = random.choices(lst1, k=2000)
df_new['2'] = random.choices(lst2, k=2000)
df_new['3'] = random.choices(lst3, k=2000)
df_new['4'] = random.choices(lst4, k=2000)
df_new['5'] = random.choices(lst5, k=2000)
df_new['6'] = random.choices(lst6, k=2000)

CodePudding user response：

Here's an easy solution:

df.apply(pd.Series.sample, replace=True, ignore_index=True, frac=1)

Output (potential):

   1    2     3     4     5     6
0  2  CDE  3456  NL00  Paid  VISA
1  2  BCD  3456  NL01  Paid  VISA
2  0  CDE  3456  NL01  Paid  VISA

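pd.DataFrame.apply applies the pd.Series.sample method to each column of the dataframe. replace=True samples with replacement, frac=1 returns as many rows as the original dataframe, and ignore_index=True resets each sampled column to a fresh 0..n-1 index so the columns line up in the result.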
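Since frac=1 only returns as many rows as the original, the 2000 rows from the question can be requested with n instead. This is a minimal sketch, assuming pandas 1.3 or later (where sample gained ignore_index) and a toy dataframe that mirrors the question's table:

```python
import pandas as pd

# Toy data mirroring the question's table (an assumption for illustration).
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

# Sample each column independently, with replacement, 2000 times.
# ignore_index=True gives every sampled column the same 0..1999 index,
# so apply() can reassemble them side by side.
df_new = df.apply(pd.Series.sample, n=2000, replace=True, ignore_index=True)

print(df_new.shape)
print(df_new.head())
```

Because every column is drawn separately, values that shared a row in the original no longer stay together, which is the scrambling the question asks for.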

CodePudding user response：

cols = list(df.columns)
df_new = pd.DataFrame()

for x in range(len(cols)):
    lst = df[cols[x]].to_list()
    colname = str(x + 1)  # new columns are named '1', '2', ...
    df_new[colname] = random.choices(lst, k=2000)

Here's a loop that iterates through the column names. Something like this should work.

CodePudding user response：

You can loop over the columns of the original dataframe and use sampling with replacement on each column to get the columns of the new dataframe.

df_new = pd.DataFrame()

for col_name in df.columns:
    df_new[col_name] = df[col_name].sample(n=2000, replace=True).tolist()

print(df_new)
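If the scrambled output needs to be reproducible, the same loop can be wrapped in a function that takes a seed. The sketch below is only one way to do it: the name scramble and its parameters are hypothetical, and the toy dataframe is an assumption based on the question's table. A single np.random.RandomState is shared across columns, because passing the same integer seed to every sample call would pick the same row positions in each column and leave the rows intact.

```python
import numpy as np
import pandas as pd


def scramble(df, n, seed=None):
    """Return a new frame of n rows, sampling each column independently."""
    # One shared generator: its state advances between columns, so each
    # column gets its own draws while the overall result stays reproducible.
    rng = np.random.RandomState(seed)
    return pd.DataFrame({
        col: df[col].sample(n=n, replace=True, random_state=rng).to_numpy()
        for col in df.columns
    })


# Toy data mirroring the question's table (an assumption for illustration).
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

df_new = scramble(df, n=2000, seed=42)
print(df_new.head())
```

Calling scramble twice with the same seed should give identical frames, while seed=None gives a fresh shuffle each time.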