Random sampling of groups after pandas groupby

Time:12-29

I have a large dataframe that looks like this:

Nationality Sex Response
American Female I have no need for this product.
German Male It looks great.
Finnish Female I would definitely buy one.

etc.

What I want to do is to randomly select a number of responses from each group so that I can analyse them further.

My groupby function has returned something like this:

Nationality Sex
American    Male    567
American    Female  342
German      Male    421
German      Female  234
Finnish     Male    149
Finnish     Female  67

etc.

I want a new dataframe with 20 random responses from each group. Is that possible using a lambda? I tried new_df = df.groupby('Nationality')['Sex'].apply(lambda x: x.sample(20)), but it doesn't return what I am looking for. Is there a way to do this?

CodePudding user response:

With pandas' iterrows you can iterate over the rows of the grouped result as (index, Series) pairs, filter the original DataFrame for each group, and sample from it:

new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()

for _, row in new_df.iterrows():
    print(df[(df.Nationality==row.Nationality)&(df.Sex==row.Sex)].sample(20))
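
The loop above prints each sample rather than building the new dataframe the question asks for. A minimal sketch of one way to collect the samples, assuming every group has at least 20 rows; the toy data and the names samples and new_df are made up for illustration:

```python
import pandas as pd

# Toy data (hypothetical): 30 responses for each Nationality/Sex pair
nationalities = ["American", "German", "Finnish"]
sexes = ["Male", "Female"]
rows = [(n, s, f"response {i}") for n in nationalities for s in sexes for i in range(30)]
df = pd.DataFrame(rows, columns=["Nationality", "Sex", "Response"])

# Iterating over a groupby object yields (key, sub-frame) pairs,
# so no boolean masks on the original frame are needed
samples = [group.sample(20) for _, group in df.groupby(["Nationality", "Sex"])]

# Stack the per-group samples into a single dataframe
new_df = pd.concat(samples)
print(new_df.groupby(["Nationality", "Sex"]).size())
```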

CodePudding user response:

Try:

df_sample = df.groupby(['Nationality', 'Sex']).sample(20)

MCVE:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Col1':np.random.choice([*'ABC'],100),
                   'Col3':np.arange(100), 
                   'Col2':np.random.randint(1000,5000,100)})

print(df.groupby('Col1').sample(5))

Output:

   Col1  Col3  Col2
83    A    83  1637
58    A    58  4090
17    A    17  4179
86    A    86  3848
74    A    74  2067
49    B    49  4369
50    B    50  4452
42    B    42  4205
7     B     7  2394
54    B    54  3541
40    C    40  3956
67    C    67  4018
9     C     9  4591
48    C    48  1536
26    C    26  2720
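
Two caveats may matter with real data: GroupBy.sample was added in pandas 1.1, and with the default replace=False it raises a ValueError when a group has fewer rows than the requested n. The sketch below, on made-up data where group B has only 3 rows, shows two possible workarounds; which one fits depends on whether repeated rows are acceptable:

```python
import pandas as pd

# Toy data (hypothetical): group A has 10 rows, group B only 3
df = pd.DataFrame({
    "Col1": ["A"] * 10 + ["B"] * 3,
    "Col3": range(13),
})
n = 5

# Option 1: sample with replacement, so small groups still yield n rows (with repeats)
with_repeats = df.groupby("Col1").sample(n=n, replace=True, random_state=42)

# Option 2: shuffle all rows, then keep at most n rows per group
capped = df.sample(frac=1, random_state=42).groupby("Col1").head(n)

print(with_repeats["Col1"].value_counts().to_dict())
print(capped["Col1"].value_counts().to_dict())
```

Passing random_state, as done here, makes the draw reproducible between runs.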

CodePudding user response:

Your groups seem to depend on nationality-sex pairings, so perhaps you're looking for:

out = df.groupby(['Nationality', 'Sex'])['Response'].apply(lambda x: x.sample(20))

This will select 20 responses from each nationality-sex group.
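
Note that this returns a Series indexed by Nationality, Sex and the original row label, not a dataframe. A minimal sketch of how it could be flattened, using made-up data and the hypothetical name flat:

```python
import pandas as pd

# Toy data (hypothetical): 30 responses for each Nationality/Sex pair
nationalities = ["American", "German", "Finnish"]
sexes = ["Male", "Female"]
rows = [(n, s, f"response {i}") for n in nationalities for s in sexes for i in range(30)]
df = pd.DataFrame(rows, columns=["Nationality", "Sex", "Response"])

# Series with a three-level index: Nationality, Sex, original row label
out = df.groupby(["Nationality", "Sex"])["Response"].apply(lambda x: x.sample(20))

# Move the two group levels back into columns; the original row label stays as the index
flat = out.reset_index(level=["Nationality", "Sex"])
print(flat.head())
```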
