I have the following dataset and Age_group <20, 20-30, 31-40, and >40.
import pandas as pd
df = pd.DataFrame({'Age':[25,30,35,60]})
df['Age_Group'] = ''
df.loc[(df['Age']<20) , 'Age_Group'] = "<20"
df.loc[ ((df['Age']>= 20) & (df['Age'] <= 30)), 'Age_Group'] = "20-30"
df.loc[ ((df['Age']>= 31) & (df['Age'] <= 40)), 'Age_Group'] = "31-40"
df.loc[(df['Age'] > 40) , 'Age_Group'] = ">40"
#I got the following
Age Age_Group
25 20-30
30 20-30
35 31-40
60 >40
Now I use
print(df.groupby(['Age_Group'])['Age'].mean().round(2))
I got
Age_Group
20-30 27.5
31-40 35.0
>40 60.0
Name: Age, dtype: float64
But I wish to also print the group that is not included in the data like this. So that when applied to new data, the average is NA for whatever the group that is missing:
Age_Group
<20 NA
20-30 27.5
31-40 35.0
>40 60.0
Name: Age, dtype: float64
Any help is appreicated.
CodePudding user response:
df['Age']
doesn't have a value less than 20, so it wouldn't show up on aggregate data. But once you add such a value, it will be calculated. Alternatively, you can add a null value to df
and do groupby on it:
df.loc[len(df)] = [np.nan, '<20']
out = df.groupby('Age_Group').mean().reindex(['<20','20-30','31-40','>40'])
Output:
Age
Age_Group
<20 NaN
20-30 27.5
31-40 35.0
>40 60.0