Home > database >  Python pandas equivalent to R's group_by, mutate, and ifelse
Python pandas equivalent to R's group_by, mutate, and ifelse

Time:12-15

Probably a duplicate, but I have spent too much time on this now googling without any luck. Assume I have a data frame:

import pandas as pd
data = {"letters": ["a", "a", "a", "b", "b", "b"],
        "boolean": [True, True, True, True, True, False],
        "numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df

I want to 1) group by letters, 2) take the mean of numbers if all values in boolean is True. In R I would write:

library(dplyr)
df %>% 
  group_by(letters) %>%
  mutate(
    condition = n_distinct(boolean) == 1,
    numbers = ifelse(condition, mean(numbers), numbers)
  ) %>% 
  select(-condition)

This would result in the following output:

# A tibble: 6 x 3
# Groups:   letters [2]
  letters boolean numbers
  <chr>   <lgl>     <dbl>
1 a       TRUE          2
2 a       TRUE          2
3 a       TRUE          2
4 b       TRUE          1
5 b       TRUE          2
6 b       FALSE         3

How would you do it using Python pandas?

CodePudding user response:

We can use lazy groupby and transform:

g = df.groupby('letters')

df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')

Output:

  letters  boolean  numbers
0       a     True        2
1       a     True        2
2       a     True        2
3       b     True        1
4       b     True        2
5       b    False        3

CodePudding user response:

Another way would be to use np.where. where a group has one unique value, find mean. Where it doesnt keep the numbers. Code below

df['numbers'] =np.where(df.groupby('letters')['boolean'].transform('nunique')==1,df.groupby('letters')['numbers'].transform('mean'), df['numbers'])



letters  boolean  numbers
0       a     True      2.0
1       a     True      2.0
2       a     True      2.0
3       b     True      1.0
4       b     True      2.0
5       b    False      3.0

Alternatively, mask where condition does not apply as you compute the mean.

m=df.groupby('letters')['boolean'].transform('nunique')==1

df.loc[m, 'numbers']=df[m].groupby('letters')['numbers'].transform('mean')

CodePudding user response:

Since you are comparing drectly to R, I would prefer to use siuba rather than pandas:

from siuba import mutate, if_else, _, select, group_by, ungroup

df1 = df >>\
    group_by(_.letters) >> \
    mutate( condition = _.boolean.unique().size == 1, 
            numbers = if_else(_.condition, _.numbers.mean(), _.numbers)
          ) >>\
    ungroup() >> select(-_.condition)

print(df1)
letters  boolean  numbers
0       a     True      2.0
1       a     True      2.0
2       a     True      2.0
3       b     True      1.0
4       b     True      2.0
5       b    False      3.0

Note that >> is the pipe. I added \ in order to jump to the next line. Also note that to refer to the variables you use _.variable

EDIT

It seems your R code has an issue, In R, you should rather use condition = all(boolean) instead of the code you have. That will translate to condition = boolean.all() in Python

  • Related