Random sampling from a dataframe with a different sample size for each group

Say I have a dataframe like this,

import numpy as np
import pandas as pd

df = pd.DataFrame({
        'Name': ['A1', 'A2', 'A3', 'A4', 'A5', 'B1','B2','B3','B4','C1','C2'],
        'Type': [1,1,1,1,1,3,3,3,3,5,5],
        'Point': [10,6,8,2,5,6,4,1,7,8,8],
     })

I can easily take a random sample from the "Name" column, using the normalized "Point" column as probabilities, like this:

np.random.choice(df["Name"],6 ,p=df["Point"] / df["Point"].sum(),replace=False)
# array(['C1', 'A1', 'C2', 'B1', 'A3', 'A2'], dtype=object)
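
For comparison, the same kind of weighted draw without replacement can probably be written with pandas alone. This is only a sketch: it assumes DataFrame.sample with its weights argument is acceptable here, and random_state=0 is an arbitrary seed added purely so the run is repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2'],
    'Type': [1, 1, 1, 1, 1, 3, 3, 3, 3, 5, 5],
    'Point': [10, 6, 8, 2, 5, 6, 4, 1, 7, 8, 8],
})

# weights='Point' uses that column as sampling weights; pandas rescales
# them to sum to 1, so no manual normalization is needed.
# Sampling is without replacement by default (replace=False).
# random_state=0 is an arbitrary seed (an assumption, not from the question).
picked = df.sample(n=6, weights='Point', random_state=0)
names = picked['Name'].tolist()
print(names)
```

Unlike np.random.choice, this returns whole rows, so the "Type" and "Point" values of the sampled names stay available.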

Next, I wanted to do the same thing within each group of the "Type" column, taking a different sample size for each group. I managed to do this with a loop:

sample_sizes = [4, 2, 1]  # one size per group, in sorted order of "Type"
output = []
for count, i in enumerate(np.unique(df['Type'])):
    data = df[df['Type'] == i]
    # normalize this group's points so they sum to 1
    result = np.random.choice(data["Name"], sample_sizes[count], p=data["Point"] / data["Point"].sum(), replace=False)
    output.append(result)

# [array(['A2', 'A3', 'A1', 'A5'], dtype=object),
# array(['B4', 'B1'], dtype=object),
# array(['C2'], dtype=object)]

My question is: how can I achieve this using pandas features like groupby, apply, etc.? There is a similar question, but I couldn't adapt it to my case.

Thank you in advance.

CodePudding user response:

You can use the groupby.ngroup method to number the groups (0, 1, 2, ... in sorted order of "Type"). Then use that number to look up the sample size for each group inside a custom function that you apply to each group.

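df['ngroup'] = df.groupby('Type').ngroup()

def f(data):
    # all rows of a group share one ngroup value; use it to index sample_sizes
    n = sample_sizes[data['ngroup'].iloc[0]]
    return np.random.choice(data["Name"], n, p=data["Point"] / data["Point"].sum(), replace=False)

# group again so that apply sees the new "ngroup" column
out = df.groupby('Type').apply(f)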

Output of two separate runs:

Type
1    [A3, A5, A4, A1]
3            [B1, B2]
5                [C2]
dtype: object

Type
1    [A1, A5, A2, A3]
3            [B2, B4]
5                [C2]
dtype: object
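
A variant that avoids the helper "ngroup" column might look like the sketch below. It assumes the sample sizes can be kept in a dict keyed by the "Type" value (the name sizes is made up for this example) and that DataFrame.sample with weights is an acceptable stand-in for np.random.choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2'],
    'Type': [1, 1, 1, 1, 1, 3, 3, 3, 3, 5, 5],
    'Point': [10, 6, 8, 2, 5, 6, 4, 1, 7, 8, 8],
})

# Hypothetical mapping from each "Type" value to its sample size
# (the same sizes as sample_sizes = [4, 2, 1] in the question).
sizes = {1: 4, 3: 2, 5: 1}

# Iterating over a groupby on a single column name yields
# (group key, sub-frame) pairs, so the key can index the dict directly.
parts = [
    group.sample(n=sizes[key], weights='Point')
    for key, group in df.groupby('Type')
]

# Stack the per-group samples back into one frame.
sampled = pd.concat(parts)
counts = sampled.groupby('Type').size()
print(sampled)
```

Keying the sizes by group value instead of by position means the result does not depend on the order in which the groups are numbered.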