How to create a loop that takes an existing df and creates a randomized new df

Time:12-07

I am trying to build a tool that will essentially scramble a dataset while keeping the same elements in each column. For example, given the table below:

1  2    3     4     5       6
0  ABC  1234  NL00  Paid    VISA
1  BCD  2345  NL01  Unpaid  AMEX
2  CDE  3456  NL02  Unpaid  VISA

I want it to go through each column, pick a random value, and paste that into a new df. An example output would be:

1  2    3     4     5       6
2  BCD  2345  NL01  Unpaid  VISA
0  BCD  1234  NL02  Unpaid  VISA
0  CDE  3456  NL01  Paid    VISA

I have managed to make it work with the code below, but with 24 columns it gets very repetitive. I know a loop should be able to do this much more concisely; I just have not been able to make it work.

import pandas as pd
import random

lst1 = df['1'].to_list()
lst2 = df['2'].to_list()
lst3 = df['3'].to_list()
lst4 = df['4'].to_list()
lst5 = df['5'].to_list()
lst6 = df['6'].to_list()

df_new = pd.DataFrame()

df_new['1'] = random.choices(lst1, k=2000)
df_new['2'] = random.choices(lst2, k=2000)
df_new['3'] = random.choices(lst3, k=2000)
df_new['4'] = random.choices(lst4, k=2000)
df_new['5'] = random.choices(lst5, k=2000)
df_new['6'] = random.choices(lst6, k=2000)

CodePudding user response:

Here's an easy solution:

df.apply(pd.Series.sample, replace=True, ignore_index=True, frac=1)

Output (potential):

   1    2     3     4     5     6
0  2  CDE  3456  NL00  Paid  VISA
1  2  BCD  3456  NL01  Paid  VISA
2  0  CDE  3456  NL01  Paid  VISA

pd.DataFrame.apply applies the pd.Series.sample method to each column of the dataframe. replace=True samples with replacement, frac=1 returns as many rows as the original dataframe, and ignore_index=True resets the index of the result to 0, 1, 2, ...
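To make the one-liner above easier to try out, here is a minimal sketch on toy data that mirrors the table in the question. The frame contents, the variable names `rng` and `scrambled`, and the seed value are assumptions added for illustration; passing a NumPy Generator as random_state is just one possible way to get repeatable results.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the table in the question (column labels are strings).
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

# One shared Generator: its state advances between columns, so each
# column gets its own draw while the whole run stays repeatable.
rng = np.random.default_rng(42)

# Extra keyword arguments are forwarded to pd.Series.sample for each column.
scrambled = df.apply(
    pd.Series.sample,
    frac=1,
    replace=True,
    ignore_index=True,
    random_state=rng,
)
print(scrambled)
```

One thing to watch: if an integer seed such as random_state=0 is passed instead, every column is sampled with an identical, freshly seeded generator, so the same row positions are picked in each column and the rows stay intact rather than being scrambled per column.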

CodePudding user response:

cols = list(df.columns)
df_new = pd.DataFrame()

for x in range(len(cols)):
    lst = df[cols[x]].to_list()
    colname = str(x + 1)  # new columns are named "1", "2", ...
    df_new[colname] = random.choices(lst, k=2000)

Here's a loop that iterates over the column names. Something like this should work.

CodePudding user response:

You can loop over the columns of the original dataframe and use sampling with replacement on each column to get the columns of the new dataframe.

df_new = pd.DataFrame()

for col_name in df.columns:
    df_new[col_name] = df[col_name].sample(n=2000, replace=True).tolist()

print(df_new)
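The same per-column sampling can also be wrapped in a small helper so the row count and seed become parameters. This is only a sketch: the function name `scramble`, its arguments, and the toy frame are hypothetical and not part of the answers above. It builds the new frame from a dict in one step instead of adding columns one at a time.

```python
import numpy as np
import pandas as pd


def scramble(df, n_rows, seed=None):
    """Return a new frame whose columns are each sampled independently
    (with replacement) from the matching column of df.

    Hypothetical helper for illustration only.
    """
    # A single Generator is shared across columns so every column gets
    # a different draw, while a fixed seed keeps the result repeatable.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        col: df[col].sample(n=n_rows, replace=True, random_state=rng).to_numpy()
        for col in df.columns
    })


# Toy data mirroring the table in the question.
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

df_new = scramble(df, n_rows=2000, seed=0)
print(df_new.head())
```

Converting each sample with to_numpy() drops the sampled index labels, so the columns line up by position and the result gets a fresh 0..n_rows-1 index.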