I have a pandas DataFrame with multiple groups in it (A, B, C). Each group has multiple counts associated with it, and I want to create a new column that is normalised to the max value within each group.
i.e.
index, group, year, count
0, A, 2015, 1
1, A, 2016, 2
2, A, 2017, 3
3, B, 2012, 10
4, B, 2013, 14
5, B, 2014, 18
6, C, 2014, 55
7, C, 2015, 59
8, C, 2016, 58
...becomes
index, group, year, count, normalised
0, A, 2015, 1, 0.333
1, A, 2016, 2, 0.667
2, A, 2017, 3, 1.000
3, B, 2012, 10, 0.556
4, B, 2013, 14, 0.778
5, B, 2014, 18, 1.000
6, C, 2014, 55, 0.932
7, C, 2015, 59, 1.000
8, C, 2016, 58, 0.983
If I try something like
df.assign(normalised=lambda x: x['count'] / df[df['group'] == x['group']]['count'].max())
then max returns 59 (the overall maximum) rather than the largest number within each group.
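For reference, a minimal sketch that builds the sample frame used throughout (column names taken from the table above):
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    'index': range(9),
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'year':  [2015, 2016, 2017, 2012, 2013, 2014, 2014, 2015, 2016],
    'count': [1, 2, 3, 10, 14, 18, 55, 59, 58],
})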
CodePudding user response:
You can use groupby with transform to calculate the ratio between the current value and the maximum value in each group:
df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
df
index group year count normalised
0 0 A 2015 1 0.333333
1 1 A 2016 2 0.666667
2 2 A 2017 3 1.000000
3 3 B 2012 10 0.555556
4 4 B 2013 14 0.777778
5 5 B 2014 18 1.000000
6 6 C 2014 55 0.932203
7 7 C 2015 59 1.000000
8 8 C 2016 58 0.983051
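As a side note, if you would rather not mutate df in place, the same transform fits into assign (a sketch; the column name normalised is just the one used above):
out = df.assign(
    normalised=df['count'].groupby(df.group).transform(lambda x: x / x.max())
)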
CodePudding user response:
Similar to Psidom's answer, but avoiding the lambda and therefore faster:
df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
Timings
>>> %timeit df['normalised'] = df['count']/df.groupby('group')['count'].transform('max')
1.16 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>>
>>> %timeit df['normalised'] = df['count'].groupby(df.group).transform(lambda x: x / x.max())
1.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
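Both one-liners produce identical values, which you can sanity-check with a quick sketch:
a = df['count'].groupby(df.group).transform(lambda x: x / x.max())
b = df['count'] / df.groupby('group')['count'].transform('max')
assert a.equals(b)  # same normalised values either way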