I have this dataframe:
record = {
'F1': ['x1', 'x2','x3', 'x4','x5','x6','x7'],
'F2': ['a1', 'a2','a3', 'a4','a5','a6','a7'],
'Sex': ['F', 'M','F', 'M','M','M','F'] }
# Creating a dataframe
df = pd.DataFrame(record)
I would like to create for example 2 samples of this dataframe while keeping a fixed ratio of 50-50 on the Sex column. I tried like this:
df_dict ={}
for i in range(2):
df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)
But the output I get does not seem to match my expectation:
df_dict["df0"]
# Output:
F1 F2 Sex
1 x2 a2 M
3 x4 a4 M
4 x5 a5 M
0 x1 a1 F
Any help ?
CodePudding user response:
Might not be the best idea, but I believe it might help you to solve your problem somehow:
n = 2
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
fDf.append(mDf)
Output
F1 F2 Sex
0 x1 a1 F
2 x3 a3 F
5 x6 a6 M
1 x2 a2 M
CodePudding user response:
This should also work
n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))
CodePudding user response:
Don't use frac
that will give your a fraction of each group, but n
that will give you a fixed value per group:
df.groupby('Sex').sample(n=2)
example output:
F1 F2 Sex
2 x3 a3 F
0 x1 a1 F
3 x4 a4 M
4 x5 a5 M
using a custom ratio
ratios = {'F':0.4, 'M':0.6} # sum should be 1
# total number desired
total = 4
# note that the exact number in the output depends
# on the rounding method to convert to int
# round should give the correct number but floor/ceil might
# under/over-sample
# see below for an example
s = pd.Series(ratios)*total
# convert to integer (chose your method, ceil/floor/round...)
s = np.ceil(s).astype(int)
df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)
example output:
F1 F2 Sex
0 x1 a1 F
6 x7 a7 F
4 x5 a5 M
3 x4 a4 M
1 x2 a2 M