I have a pandas DataFrame with multiple groups in it (A, B, C). Each group has multiple counts associated with it, and I want to create a new column that is normalised to the max value within each group.
i.e.
index, group, year, count
0, A, 2015, 1
1, A, 2016, 2
2, A, 2017, 3
3, B, 2012, 10
4, B, 2013, 14
5, B, 2014, 18
6, C, 2014, 55
7, C, 2015, 59
8, C, 2016, 58
...becomes
index, group, year, count, normalised
0, A, 2015, 1, 0.333
1, A, 2016, 2, 0.667
2, A, 2017, 3, 1.000
3, B, 2012, 10, 0.556
4, B, 2013, 14, 0.778
5, B, 2014, 18, 1.000
6, C, 2014, 55, 0.932
7, C, 2015, 59, 1.000
8, C, 2016, 58, 0.983
If I try something like
df.assign(normalised=lambda x: x['count'] / df[df['group'] == x['group']]['count'].max())
then max returns 59 (the overall maximum) rather than the largest number within each group.
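For reference, a minimal sketch that builds the sample frame used throughout (column names taken from the table above):
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    'index': range(9),
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'year':  [2015, 2016, 2017, 2012, 2013, 2014, 2014, 2015, 2016],
    'count': [1, 2, 3, 10, 14, 18, 55, 59, 58],
})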
CodePudding user response:
You can use groupby with transform to calculate the ratio between the current value and the maximum value in each group:
df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
df
index group year count normalised
0 0 A 2015 1 0.333333
1 1 A 2016 2 0.666667
2 2 A 2017 3 1.000000
3 3 B 2012 10 0.555556
4 4 B 2013 14 0.777778
5 5 B 2014 18 1.000000
6 6 C 2014 55 0.932203
7 7 C 2015 59 1.000000
8 8 C 2016 58 0.983051
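As a side note, if you would rather not mutate df in place, the same transform fits into assign (a sketch; the column name normalised is just the one used above):
out = df.assign(
    normalised=df['count'].groupby(df.group).transform(lambda x: x / x.max())
)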
CodePudding user response:
Similar to Psidom's answer, but avoiding the lambda and therefore faster:
df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
Timings
>>> %timeit df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
1.16 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>
>>> %timeit df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
1.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
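Both one-liners produce identical values, which you can sanity-check with a quick sketch:
a = df['count'].groupby(df.group).transform(lambda x: x / x.max())
b = df['count'] / df.groupby('group')['count'].transform('max')
assert a.equals(b)  # same normalised values either way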