I'm an R/Tidyverse guy getting my feet wet in python/pandas and having trouble discerning if there is a way to do the following as elegantly in pandas as tidyverse:
(
dat
%>% group_by(grp)
%>% mutate(
value = value/max(value)
)
)
So, there's a grouped mutate that involves a non-reducing operation (division) that in turn involves the result of a reducing operation (max). I know the following is possible:
import pandas as pd
import numpy as np
df = pd.DataFrame({'grp': np.random.randint(0,5, 10), 'value': np.random.randn(10)}).sort_values('grp')
tmp = (
df
.groupby('grp')
.agg('max')
)
(
df
.merge(tmp,on='grp')
.assign(
value = lambda x: x.value_x / x.value_y
)
)
But I feel like there must be a way to avoid the creation of the temporary variable tmp
to achieve this in one expression like I can achieve in tidyverse. Am I wrong?
Update: I'm marking @PaulS's answer as correct as it indeed addresses the question as posed. On using it something other than my minimal example, I realized there was further implicit behaviour in tidyverse I hadn't accounted for; specifically, that columns not involved in the series of specified operations are kept in the tidyverse case and dropped in @PaulS's answer. So here instead is an example & solution that more closely emulates tidyverse:
df = (
pd.DataFrame({
'grp': np.random.randint(0,5, 10) #to be used for grouping
, 'time': np.random.normal(0,1,10) #extra column not involved in computation
, 'value': np.random.randn(10) #to be used for calculations
})
.sort_values(['grp','time'])
.reset_index()
)
#computing a grouped non-reduced-divided-by-reduced:
(
df
.groupby('grp', group_keys=False)
.apply(
lambda x: (
x.assign(
value = (
x.value
/ x.value.max()
)
)
)
)
.reset_index()
.drop(['index','level_0'],axis=1)
)
I also discovered that if I want to index into one column during the assignment, I have to tweak things a bit, for example:
#this time the reduced compute involves getting the value at the time closest to zero:
(
df
.groupby('grp', group_keys=False)
.apply(
lambda x: (
x.assign(
value = (
x.value
/ x.value.values[np.argmin(np.abs(x.time))] #note use of .values[]
)
)
)
)
.reset_index()
.drop(['index','level_0'],axis=1)
)
CodePudding user response:
A possible solution:
(df.groupby('grp')
.apply(lambda g: g['value'].div(g['value'].max()))
.droplevel(1)
.reset_index())
Output:
grp value
0 0 1.000000
1 1 1.000000
2 1 1.052922
3 2 1.000000
4 2 5.873499
5 3 10.009542
6 3 1.000000
7 4 1.000000
8 4 -0.842420
9 4 0.410153
CodePudding user response:
For this specific case, a transform is a better fit, and should be more performant than apply:
df.assign(value = df.value/df.groupby('grp').value.transform('max'))
grp value
1 0 1.000000
2 1 -0.290494
3 1 1.000000
4 1 0.214848
6 2 8.242604
7 2 1.000000
8 2 1.156246
0 3 0.655760
9 3 1.000000
5 4 1.000000