Assign involving both reducing & non-reducing operations in Pandas


I'm an R/tidyverse user getting my feet wet in Python/pandas, and I'm having trouble working out whether there is a way to do the following as elegantly in pandas as in the tidyverse:

(
    dat
    %>% group_by(grp)
    %>% mutate(
        value = value/max(value)
    )
)

So, there's a grouped mutate involving a non-reducing operation (division) that in turn uses the result of a reducing operation (max). I know the following is possible:

import pandas as pd
import numpy as np

df = pd.DataFrame({'grp': np.random.randint(0,5, 10), 'value': np.random.randn(10)}).sort_values('grp')

tmp = (
    df
    .groupby('grp')
    .agg('max')
)

(
    df
    .merge(tmp,on='grp')
    .assign(
        value = lambda x: x.value_x / x.value_y
    )
)

But I feel like there must be a way to avoid creating the temporary variable tmp and achieve this in a single expression, as I can in the tidyverse. Am I wrong?
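For the record, the temporary can at least be inlined by building the grouped max inside the same chain and dropping the helper column afterwards; a sketch (the vmax name is just a throwaway):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grp': np.random.randint(0, 5, 10),
                   'value': np.random.randn(10)}).sort_values('grp')

# Build the grouped max as a two-column frame and merge it within the
# chain itself, so no temporary variable is needed.
out = (
    df
    .merge(df.groupby('grp')['value'].max().rename('vmax').reset_index(),
           on='grp')
    .assign(value=lambda x: x['value'] / x['vmax'])
    .drop(columns='vmax')
)
```

This still pays for a merge, though, which the answers below avoid.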

Update: I'm marking @PaulS's answer as correct, as it indeed addresses the question as posed. On using it on something other than my minimal example, I realized there was further implicit behaviour in the tidyverse I hadn't accounted for; specifically, columns not involved in the specified operations are kept in the tidyverse case but dropped in @PaulS's answer. So here instead is an example and solution that more closely emulate the tidyverse:

df = (
    pd.DataFrame({
        'grp': np.random.randint(0,5, 10) #to be used for grouping
        , 'time': np.random.normal(0,1,10) #extra column not involved in computation
        , 'value': np.random.randn(10) #to be used for calculations
    })
    .sort_values(['grp','time'])
    .reset_index()
)

#computing a grouped non-reduced-divided-by-reduced:
(
    df
    .groupby('grp', group_keys=False)
    .apply(
        lambda x: (
            x.assign(
                value = (
                    x.value
                    / x.value.max()
                )
            )
        )
    )
    .reset_index()
    .drop(['index','level_0'],axis=1)
)

I also discovered that if I want to index into another column during the assignment, I have to tweak things a bit, for example:

#this time the reduced compute involves getting the value at the time closest to zero:
(
    df
    .groupby('grp', group_keys=False)
    .apply(
        lambda x: (
            x.assign(
                value = (
                    x.value
                    / x.value.values[np.argmin(np.abs(x.time))] #note use of .values[]
                )
            )
        )
    )
    .reset_index()
    .drop(['index','level_0'],axis=1)
)
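An alternative that avoids apply for the same "divide by the value at the time closest to zero" computation: pick each group's reference row up front and map the reference values back by group. A sketch (ref, out, and the seed are illustrative names/choices, not part of the original):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # seeded only for reproducibility
df = pd.DataFrame({
    'grp': rng.integers(0, 5, 10),
    'time': rng.normal(0, 1, 10),
    'value': rng.standard_normal(10),
}).sort_values(['grp', 'time'])

# Per group, find the row label whose 'time' is closest to zero, take
# that row's 'value' as the group's reference, then map it back by group.
ref = (
    df
    .loc[df['time'].abs().groupby(df['grp']).idxmin()]
    .set_index('grp')['value']
)
out = df.assign(value=df['value'] / df['grp'].map(ref))
```

Since assign works on the whole frame, all other columns (time here) are kept, and no index juggling is needed afterwards.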

CodePudding user response:

A possible solution:

(df.groupby('grp')
 .apply(lambda g: g['value'].div(g['value'].max()))
 .droplevel(1)
 .reset_index())

Output:

    grp      value
0    0   1.000000
1    1   1.000000
2    1   1.052922
3    2   1.000000
4    2   5.873499
5    3  10.009542
6    3   1.000000
7    4   1.000000
8    4  -0.842420
9    4   0.410153

CodePudding user response:

For this specific case, a transform is a better fit, and should be more performant than apply:

df.assign(value = df.value/df.groupby('grp').value.transform('max'))
   grp     value
1    0  1.000000
2    1 -0.290494
3    1  1.000000
4    1  0.214848
6    2  8.242604
7    2  1.000000
8    2  1.156246
0    3  0.655760
9    3  1.000000
5    4  1.000000
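To confirm the point about kept columns: transform returns a Series aligned on the original index, so assign leaves every other column untouched, with no apply and no index cleanup. A minimal sketch (out and the seed are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame({
    'grp': rng.integers(0, 5, 10),
    'time': rng.normal(0, 1, 10),   # extra column not used in the computation
    'value': rng.standard_normal(10),
}).sort_values(['grp', 'time'])

# transform('max') broadcasts each group's max back to every row of that
# group, aligned on df's index, so 'time' survives unchanged.
out = df.assign(value=df['value'] / df.groupby('grp')['value'].transform('max'))
```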