I am trying to build a tool that will essentially scramble a dataset while maintaining the same elements. For example, if I have the table below
1 2 3 4 5 6
0 ABC 1234 NL00 Paid VISA
1 BCD 2345 NL01 Unpaid AMEX
2 CDE 3456 NL02 Unpaid VISA
I want it to then go through each column, pick a random value, and paste that into a new df. An example output would be
1 2 3 4 5 6
2 BCD 2345 NL01 Unpaid VISA
0 BCD 1234 NL02 Unpaid VISA
0 CDE 3456 NL01 Paid VISA
I have managed to make it work with the code below, but for 24 columns the code is quite repetitive. I know a loop should be able to do this far more concisely; I just have not been able to make it work.
import pandas as pd
import random
lst1 = df['1'].to_list()
lst2 = df['2'].to_list()
lst3 = df['3'].to_list()
lst4 = df['4'].to_list()
lst5 = df['5'].to_list()
lst6 = df['6'].to_list()
df_new = pd.DataFrame()
df_new['1'] = random.choices(lst1, k=2000)
df_new['2'] = random.choices(lst2, k=2000)
df_new['3'] = random.choices(lst3, k=2000)
df_new['4'] = random.choices(lst4, k=2000)
df_new['5'] = random.choices(lst5, k=2000)
df_new['6'] = random.choices(lst6, k=2000)
CodePudding user response:
Here's an easy solution:
df.apply(pd.Series.sample, replace=True, ignore_index=True, frac=1)
Output (potential):
1 2 3 4 5 6
0 2 CDE 3456 NL00 Paid VISA
1 2 BCD 3456 NL01 Paid VISA
2 0 CDE 3456 NL01 Paid VISA
pd.DataFrame.apply applies the pd.Series.sample method to each column of the dataframe, sampling with replacement (replace=True) and returning a result the same size as the original dataframe (frac=1).
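As a minimal, self-contained sketch of that one-liner, the snippet below rebuilds the sample table from the question and scrambles it. The string column names '1' to '6' and the seed value are assumptions made for illustration. A NumPy Generator is passed as random_state so that runs are reproducible while each column is still drawn independently; an integer seed would select the same row positions in every column and leave the rows intact.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the table in the question (string column names are assumed).
df = pd.DataFrame({
    "1": [0, 1, 2],
    "2": ["ABC", "BCD", "CDE"],
    "3": [1234, 2345, 3456],
    "4": ["NL00", "NL01", "NL02"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

# One shared generator: its state advances between columns,
# so every column gets its own independent draw.
rng = np.random.default_rng(42)

scrambled = df.apply(
    pd.Series.sample, frac=1, replace=True, ignore_index=True, random_state=rng
)
print(scrambled)
```

Note that ignore_index requires pandas 1.3 or later; on older versions, resetting the index of each sampled column by hand should give the same effect.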
CodePudding user response:
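df_new = pd.DataFrame()
cols = list(df.columns)
for x in range(len(cols)):
    # Pull the column out as a plain list so random.choices can sample from it
    lst = df[cols[x]].to_list()
    # New columns are named '1', '2', ... to match the original layout
    colname = str(x + 1)
    df_new[colname] = random.choices(lst, k=2000)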
Here's a loop that iterates through the column names. Something like this should work.
CodePudding user response:
You can loop over the columns of the original dataframe and use sampling with replacement on each column to get the columns of the new dataframe.
df_new = pd.DataFrame()
for col_name in df.columns:
    df_new[col_name] = df[col_name].sample(n=2000, replace=True).tolist()
print(df_new)
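If the row count of the new dataframe needs to vary (the question uses 2000), the same per-column sampling can be wrapped in a small helper. This is only a sketch: the function name scramble_columns, its parameters, and the seed are hypothetical and not part of pandas, and the sample data is a cut-down version of the table in the question.

```python
import numpy as np
import pandas as pd


def scramble_columns(df, n_rows, seed=None):
    """Return a new dataframe whose columns are each sampled independently, with replacement."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        # to_numpy() drops the sampled index so columns line up by position
        col: df[col].sample(n=n_rows, replace=True, random_state=rng).to_numpy()
        for col in df.columns
    })


# Cut-down version of the question's table (column names assumed to be strings).
df = pd.DataFrame({
    "2": ["ABC", "BCD", "CDE"],
    "5": ["Paid", "Unpaid", "Unpaid"],
    "6": ["VISA", "AMEX", "VISA"],
})

df_new = scramble_columns(df, n_rows=2000, seed=0)
print(df_new.head())
```

Building all the columns in a dict and constructing the dataframe once keeps the original column names and avoids adding columns to an existing dataframe one at a time.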