Home > Enterprise >  Multiple column groupby with pandas to find maximum value for each group
Multiple column groupby with pandas to find maximum value for each group

Time:11-18

I have a dataframe like below:

Feature value frequency label
age_45_and_above No 2700 negative
age_45_and_above No 1707 positive
age_45_and_above No 83 other
age_45_and_above Yes 222 negative
age_45_and_above Yes 15 positive
age_45_and_above Yes 8 other
age_45_and_above [Null] 323 negative
age_45_and_above [Null] 8 other
age_45_and_above [Null] 5 positive
talk No 20 negative
talk No 170 positive
talk No 500 other
talk Yes 210 negative
talk Yes 1500 positive
talk Yes 809 other
talk [Null] 234 negative
talk [Null] 43 other
talk [Null] 85 positive

and so on.

for each feature group, I want to find the maximum frequency with all its related row data, like if the feature is age_45_and_above then by looking for NO group we have 3 rows with different frequency and label, I want to report the maximum one with it's related data.

I've tried groupby in different ways:

result.groupby(['Feature', 'Value'])['Frequency', 'Predict'].max()

or this one, with this one, I'm getting multi-Index dataframe which is not the desired results:

result.groupby(['Feature', 'Value', 'Predict'])['Frequency'].max()

and so many failed attempts with idxmax, transfrom and ... .

the intended output I'm looking for looks like this:

Feature value frequency label
age_45_and_above No 2700 negative
age_45_and_above Yes 222 negative
age_45_and_above [Null] 323 negative
talk No 500 other
talk Yes 1500 positive
talk [Null] 234 negative

Also, I wonder how to sum the frequencies for each <<Feature-value>> group except the max row as I don't know how to locate the max row, like in here for the first feature and value, <<age_45_and_above-No>> max is 2700, so the sum would be 1707 83.

Thanks for your time.

CodePudding user response:

using idxmax after the groupby in aloc.

print(df.loc[df.groupby(['Feature','value'])['frequency'].idxmax()])
             Feature   value  frequency     label
0   age_45_and_above      No       2700  negative
3   age_45_and_above     Yes        222  negative
6   age_45_and_above  [Null]        323  negative
11              talk      No        500     other
13              talk     Yes       1500  positive
15              talk  [Null]        234  negative

and for the sum without the max, then do the total sum per group and remove the frequency of the row, then select the max row

gr = df.groupby(['Feature','value'])['frequency']

res = (
    df.assign(total=gr.transform(sum)-df['frequency'])
      .loc[gr.idxmax()]
)
print(res)
             Feature   value  frequency     label  total
0   age_45_and_above      No       2700  negative   1790
3   age_45_and_above     Yes        222  negative     23
6   age_45_and_above  [Null]        323  negative     13
11              talk      No        500     other    190
13              talk     Yes       1500  positive   1019
15              talk  [Null]        234  negative    128

CodePudding user response:

I would do it by using merge on the grouped data.

Based on this data:

df = pd.DataFrame({'Feature':['age']*9 ['talk']*9,
                   'value':(['No']*3 ['Yes']*3 ['[Null]']*3)*2,
                   'frequency':[2700,1707,83,222,15,8,323,8,5,20,170,500,210,1500,809,234,43,85],
                   'label':['N','P','O']*6})

Using:

df.groupby(['Feature','value'],as_index=False)['frequency'].max().merge(df,on=['Feature','Value','frequency'])

Outputs:

  Feature   value  frequency label
0     age      No       2700     N
1     age     Yes        222     N
2     age  [Null]        323     N
3    talk      No        500     O
4    talk     Yes       1500     P
5    talk  [Null]        234     N

Adding the extra column can be done via a simple assignment:

df_1['sum_no_max'] = df.groupby(['Feature','value'])['frequency'].sum().values - df_1['frequency'].values

Finally outputting:

  Feature   value  frequency label  sum_no_max
0     age      No       2700     N        1790
1     age     Yes        222     N          23
2     age  [Null]        323     N          13
3    talk      No        500     O         190
4    talk     Yes       1500     P        1019
5    talk  [Null]        234     N         128

CodePudding user response:

try this:

df.groupby(['Feature', 'value'], dropna=False).frequency.max().reset_index()

>>>
      Feature           value        frequency
0     age_45_and_above  No           2700
1     age_45_and_above  Yes          222
2     age_45_and_above  NaN          323
3     talk              No           500
4     talk              Yes          1500
5     talk              NaN          234

  • Related