If I have this data frame:
import pandas as pd

df = pd.DataFrame(
    {"A": [45, 67, 12, 78, 92, 65, 89, 12, 34, 78],
     "B": ["h", "b", "f", "d", "e", "t", "y", "p", "w", "q"],
     "C": [True, False, False, True, False, True, True, True, True, True]})
How can I select 50% of the rows, so that column "C" is True in 90% of the selected rows and False in 10% of them?
CodePudding user response:
- First, build a larger DataFrame (1,000 rows) by repeating the original 10-row frame 100 times:
import pandas as pd

df = pd.DataFrame(
    {"A": [45, 67, 12, 78, 92, 65, 89, 12, 34, 78],
     "B": ["h", "b", "f", "d", "e", "t", "y", "p", "w", "q"],
     "C": [True, False, False, True, False, True, True, True, True, True]})
df = pd.concat([df] * 100)  # 10 rows x 100 = 1,000 rows
print(df)
- Second, compute true_row_num and false_row_num; with 1,000 rows this works out to 450 True rows and 50 False rows:
row_num, _ = df.shape
true_row_num = int(row_num * 0.5 * 0.9)   # 90% of the 50% sample -> 450
false_row_num = int(row_num * 0.5 * 0.1)  # 10% of the 50% sample -> 50
print(true_row_num, false_row_num)
- Third, randomly sample the required number of rows from the True and False subsets, then combine and shuffle:
true_df = df[df["C"]].sample(true_row_num)
false_df = df[~df["C"]].sample(false_row_num)
new_df = pd.concat([true_df, false_df])
new_df = new_df.sample(frac=1.0).reset_index(drop=True) # shuffle
print(new_df["C"].value_counts())
CodePudding user response:
I think if you calculate the needed sizes up front and then perform random sampling per group, it should work. Note that DataFrame.append was removed in pandas 2.0, so use pd.concat to glue the two samples together; also, on the original 10-row frame int(0.5*len(df)*0.1) rounds down to 0, so this works best on a larger frame. Look at something like this:

new = pd.concat([
    df.query('C == True').sample(int(0.5 * len(df) * 0.9)),
    df.query('C == False').sample(int(0.5 * len(df) * 0.1)),
])
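A minimal, reusable sketch along the same lines (the helper name sample_with_ratio, the seed parameter, and the final shuffle are my own additions, not part of either answer):

import pandas as pd

def sample_with_ratio(df, frac=0.5, true_share=0.9, seed=None):
    # sample `frac` of the rows so that column "C" is True in
    # `true_share` of the result and False in the remainder
    n = int(len(df) * frac)
    n_true = int(n * true_share)
    n_false = n - n_true
    true_part = df[df["C"]].sample(n_true, random_state=seed)
    false_part = df[~df["C"]].sample(n_false, random_state=seed)
    out = pd.concat([true_part, false_part])
    return out.sample(frac=1.0, random_state=seed).reset_index(drop=True)  # shuffle

new_df = sample_with_ratio(pd.concat([df] * 100), seed=0)  # enlarge the 10-row frame first
print(new_df["C"].value_counts(normalize=True))            # True 0.9, False 0.1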