Sampling with fixed column ratio in pandas

I have this dataframe:

import pandas as pd

record = {
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F'],
}

# Creating a dataframe
df = pd.DataFrame(record)

I would like to create, for example, 2 samples of this dataframe, each keeping a fixed 50-50 ratio on the Sex column. I tried this:

df_dict ={}
for i in range(2):
    df_dict['df{}'.format(i)] = df.sample(frac=0.50, random_state=123)

But the output does not match my expectation (3 M rows and 1 F row instead of 2 and 2):

df_dict["df0"]

# Output:
    F1  F2  Sex
1   x2  a2  M
3   x4  a4  M
4   x5  a5  M
0   x1  a1  F

Any help?

CodePudding user response:

This might not be the best approach, but it may help: sample each Sex group separately, keep n rows from each, and combine the results:

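n = 2
# frac=0.5 samples about half of each group; .iloc[:n] then caps each group at n rows
fDf = df[df["Sex"] == "F"].sample(frac=0.5, random_state=123).iloc[:n]
mDf = df[df["Sex"] == "M"].sample(frac=0.5, random_state=123).iloc[:n]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([fDf, mDf])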

Output

    F1  F2  Sex
0   x1  a1  F
2   x3  a3  F
5   x6  a6  M
1   x2  a2  M

CodePudding user response:

This should also work:

n = 2
df.groupby('Sex', group_keys=False).apply(lambda x: x.sample(n))

CodePudding user response:

Don't use frac, which gives you a fraction of each group; use n, which gives you a fixed number of rows per group:

df.groupby('Sex').sample(n=2)

example output:

   F1  F2 Sex
2  x3  a3   F
0  x1  a1   F
3  x4  a4   M
4  x5  a5   M
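
The question asked for several samples stored in a dictionary, so here is a minimal sketch of how the groupby approach could be combined with the original loop. The per-iteration seed (`random_state=i`) and the name `n_per_group` are assumptions made for illustration; the original attempt reused the same seed, which would return the same rows every time.

```python
import pandas as pd

record = {
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F'],
}
df = pd.DataFrame(record)

n_per_group = 2  # rows drawn from each Sex value (illustrative name)
df_dict = {}
for i in range(2):
    # use a different seed per iteration so each draw is made independently
    # instead of repeating the same one
    df_dict[f"df{i}"] = df.groupby("Sex").sample(n=n_per_group, random_state=i)

# check the F/M balance of every sample
for name, sample in df_dict.items():
    print(name, sample["Sex"].value_counts().to_dict())
```

Each sample then holds exactly n_per_group rows per Sex value, provided every group has at least that many rows (sampling is without replacement by default).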

Using a custom ratio:

import numpy as np

ratios = {'F': 0.4, 'M': 0.6}  # should sum to 1
# total number of rows desired
total = 4
# note that the exact number of rows in the output depends on
# how the per-group counts are rounded to integers:
# round usually gives the desired total, while floor/ceil can
# under/over-sample (see the example output below)

s = pd.Series(ratios) * total
# convert to integer (choose your method: ceil/floor/round...)
s = np.ceil(s).astype(int)

df.groupby('Sex').apply(lambda x: x.sample(n=s[x.name])).droplevel(0)

example output (5 rows instead of 4, since ceil rounds 1.6 up to 2 and 2.4 up to 3):

   F1  F2 Sex
0  x1  a1   F
6  x7  a7   F
4  x5  a5   M
3  x4  a4   M
1  x2  a2   M
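
If the sample must contain exactly the requested total, one option is largest-remainder rounding: floor every per-group count, then hand the leftover rows to the groups with the largest fractional parts. The sketch below is not part of the answer above; the helper name `allocate` and the fixed `random_state=0` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def allocate(ratios, total):
    """Split `total` into integer counts proportional to `ratios`
    using largest-remainder rounding (hypothetical helper)."""
    exact = pd.Series(ratios) * total
    counts = np.floor(exact).astype(int)
    shortfall = int(total - counts.sum())
    if shortfall > 0:
        # give one extra row to the groups with the largest fractional parts
        top = (exact - counts).sort_values(ascending=False).index[:shortfall]
        counts.loc[top] += 1
    return counts

df = pd.DataFrame({
    'F1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7'],
    'F2': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'],
    'Sex': ['F', 'M', 'F', 'M', 'M', 'M', 'F'],
})

counts = allocate({'F': 0.4, 'M': 0.6}, total=4)

# sample each group with its own count, then stitch the pieces together
parts = [
    grp.sample(n=int(counts[sex]), random_state=0)
    for sex, grp in df.groupby('Sex')
]
sample = pd.concat(parts)
print(counts.to_dict(), len(sample))
```

With ratios this close on such a small total, the result is 2 F rows and 2 M rows: 4 rows cannot be split 40/60 exactly, so the output only approximates the requested ratio.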