Pandas group by aggregation for non numeric data-CodePudding

my sample df looks like this:

sid    score   cat_type
101      70      na
102      56      PNP
101      65      BAW
103      88      SAO  
103      50      na  
102      42      VVG  
105      79      SAE  
....

df_groupby = df.groupby(['sid']).agg(
    score_max = ('score','max'),
    cat_type_first_row = ('cat_type', '?')
)

with the group by, I want to get the first row value of cat_type that is not na and assign it to cat_type_first_row

my final df should look like this:

sid    score_max   cat_type_first_row
101       70              BAW
102       56              PNP
103       88              SAO  
105       79              SAE  
....

Could you please assist me in solving this problem?

CodePudding user response：

Try replace na to NaN and do first

df_groupby = df.replace('na',np.nan).groupby(['sid']).agg(
    score_max = ('score','max'),
    cat_type_first_row = ('cat_type', 'first')
)
df_groupby
     score_max cat_type_first_row
sid                              
101         70                BAW
102         56                PNP
103         88                SAO
105         79                SAE

CodePudding user response：

If you do not want to drop any row, you could use:

(df.merge((df.dropna(subset=['cat_type'])
             .groupby('sid')['cat_type']
             .first()
             .rename('cat_type_first_row')
          ), on='sid')
)

output:

   sid  score cat_type cat_type_first_row
0  101     70      NaN                BAW
1  101     65      BAW                BAW
2  102     56      PNP                PNP
3  102     42      VVG                PNP
4  103     88      SAO                SAO
5  103     50      NaN                SAO
6  105     79      SAE                SAE

CodePudding user response：

You can define a function that takes as input the grouped pandas series. I've tested this code and got your desired output (added rows for cases when all cat_type values are np.nan for a group):

df = {
    'sid': [101, 102, 101, 103, 103, 102, 105, 106, 106],
    'score': [70, 56, 65, 88, 50, 42, 79, 0, 0],
    'cat_type': [np.nan, 'PNP', 'BAW', 'SAO', np.nan, 'VVG', 'SAE', np.nan, np.nan]
}
df = pd.DataFrame(df)
display(df)

def get_cat_type_first_row(series):
    series_nona = series.dropna()
    if len(series_nona) == 0:
        return np.nan
    else:
        return series.dropna().iloc[0]

df_groupby = df.groupby(['sid']).agg(
    score_max = ('score','max'),
    cat_type_first_row = ('cat_type', get_cat_type_first_row)
)
df_groupby

Output: