I have a large dataframe that looks like this:
| Nationality | Sex | Response |
|---|---|---|
| American | Female | I have no need for this product. |
| German | Male | It looks great. |
| Finnish | Female | I would definitely buy one. |
etc.
What I want to do is to randomly select a number of responses from each group so that I can analyse them further.
My groupby function has returned something like this:
Nationality Sex
American Male 567
American Female 342
German Male 421
German Female 234
Finnish Male 149
Finnish Female 67
etc.
I want a new dataframe with 20 random responses from each group. Is that possible using a lambda? Because
new_df = df.groupby('Nationality')['Sex'].apply(lambda x: x.sample(20))
doesn't return what I am looking for. Is there a way to do this?
CodePudding user response:
Using iterrows from Pandas you can iterate over DataFrame rows as (index, Series) pairs, and get what you want:
new_df = df.groupby(['Nationality', 'Sex'], as_index=False).size()
for _, row in new_df.iterrows():
    print(df[(df.Nationality == row.Nationality) & (df.Sex == row.Sex)].sample(20))
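If the goal is a single dataframe rather than printed chunks, the per-group samples can be collected and concatenated instead. A minimal sketch on made-up data using the question's column names (the `response {i}` strings are placeholders):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the question's dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Nationality': rng.choice(['American', 'German', 'Finnish'], 600),
    'Sex': rng.choice(['Male', 'Female'], 600),
    'Response': [f'response {i}' for i in range(600)],
})

# Sample 20 rows from each (Nationality, Sex) group and stack the pieces
samples = []
for _, group in df.groupby(['Nationality', 'Sex']):
    samples.append(group.sample(20, random_state=0))
new_df = pd.concat(samples)

print(len(new_df))  # 6 groups x 20 rows each -> 120
```

This avoids re-filtering the full dataframe once per group, since groupby already hands you each group.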
CodePudding user response:
Try:
df_sample = df.groupby(['Nationality', 'Sex']).sample(20)
MVCE:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Col1': np.random.choice([*'ABC'], 100),
                   'Col3': np.arange(100),
                   'Col2': np.random.randint(1000, 5000, 100)})
print(df.groupby('Col1').sample(5))
Output:
Col1 Col3 Col2
83 A 83 1637
58 A 58 4090
17 A 17 4179
86 A 86 3848
74 A 74 2067
49 B 49 4369
50 B 50 4452
42 B 42 4205
7 B 7 2394
54 B 54 3541
40 C 40 3956
67 C 67 4018
9 C 9 4591
48 C 48 1536
26 C 26 2720
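One caveat worth noting: without replacement, groupby(...).sample(n) raises a ValueError if any group has fewer than n rows. Two possible workarounds, sketched on toy data (the column names and n=50 are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({'Col1': rng.choice([*'ABC'], 100),
                   'Col2': rng.integers(1000, 5000, 100)})

n = 50  # larger than any single group here (~33 rows each)

# Option 1: sample with replacement, so small groups can still yield n rows
with_replacement = df.groupby('Col1').sample(n, replace=True)

# Option 2: cap the sample size at each group's own length
capped = df.groupby('Col1', group_keys=False).apply(lambda g: g.sample(min(len(g), n)))

print(len(with_replacement), len(capped))  # 150 100
```

Which option fits depends on whether duplicate rows are acceptable in the downstream analysis.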
CodePudding user response:
Your groups seem to depend on nationality-sex pairings. So perhaps you're looking for:
out = df.groupby(['Nationality', 'Sex'])['Response'].apply(lambda x: x.sample(20))
This will select 20 responses from each nationality-sex group.
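Note that the result of this apply is a Series with a three-level index (the two group keys plus the original row label). If a flat dataframe is preferred, dropping the innermost level and resetting the index should do it — a sketch on made-up data with the question's column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    'Nationality': rng.choice(['American', 'German', 'Finnish'], 600),
    'Sex': rng.choice(['Male', 'Female'], 600),
    'Response': [f'response {i}' for i in range(600)],
})

out = df.groupby(['Nationality', 'Sex'])['Response'].apply(lambda x: x.sample(20))

# Drop the original row label and move the group keys back into columns
flat = out.droplevel(-1).reset_index()
print(flat.columns.tolist())  # ['Nationality', 'Sex', 'Response']
```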