Home > Back-end >  Python dataframe how to add label that is not exist to the output
Python dataframe how to add label that is not exist to the output

Time:12-11

I have the following dataset and Age_group <20, 20-30, 31-40, and >40.

import pandas as pd
df = pd.DataFrame({'Age':[25,30,35,60]})

df['Age_Group'] = ''

df.loc[(df['Age']<20) , 'Age_Group'] = "<20" 
df.loc[ ((df['Age']>= 20) & (df['Age'] <= 30)), 'Age_Group'] = "20-30" 
df.loc[ ((df['Age']>= 31) & (df['Age'] <= 40)), 'Age_Group'] = "31-40"     
df.loc[(df['Age'] > 40) , 'Age_Group'] = ">40" 

#I got the following

Age Age_Group
25  20-30
30  20-30
35  31-40
60  >40

Now I use

print(df.groupby(['Age_Group'])['Age'].mean().round(2))

I got

Age_Group
20-30    27.5
31-40    35.0
>40      60.0
Name: Age, dtype: float64

But I wish to also print the group that is not included in the data like this. So that when applied to new data, the average is NA for whatever the group that is missing:

Age_Group
<20      NA
20-30    27.5
31-40    35.0
>40      60.0
Name: Age, dtype: float64

Any help is appreicated.

CodePudding user response:

df['Age'] doesn't have a value less than 20, so it wouldn't show up on aggregate data. But once you add such a value, it will be calculated. Alternatively, you can add a null value to df and do groupby on it:

df.loc[len(df)] = [np.nan, '<20']
out = df.groupby('Age_Group').mean().reindex(['<20','20-30','31-40','>40'])

Output:

            Age
Age_Group      
<20         NaN
20-30      27.5
31-40      35.0
>40        60.0
  • Related