Probably a duplicate, but I have spent too much time on this now googling without any luck. Assume I have a data frame:
import pandas as pd
data = {"letters": ["a", "a", "a", "b", "b", "b"],
"boolean": [True, True, True, True, True, False],
"numbers": [1, 2, 3, 1, 2, 3]}
df = pd.DataFrame(data)
df
I want to 1) group by letters, 2) take the mean of numbers if all values in boolean is True. In R I would write:
library(dplyr)
df %>%
group_by(letters) %>%
mutate(
condition = n_distinct(boolean) == 1,
numbers = ifelse(condition, mean(numbers), numbers)
) %>%
select(-condition)
This would result in the following output:
# A tibble: 6 x 3
# Groups: letters [2]
letters boolean numbers
<chr> <lgl> <dbl>
1 a TRUE 2
2 a TRUE 2
3 a TRUE 2
4 b TRUE 1
5 b TRUE 2
6 b FALSE 3
How would you do it using Python pandas?
CodePudding user response:
We can use lazy groupby
and transform
:
g = df.groupby('letters')
df.loc[g['boolean'].transform('all'), 'numbers'] = g['numbers'].transform('mean')
Output:
letters boolean numbers
0 a True 2
1 a True 2
2 a True 2
3 b True 1
4 b True 2
5 b False 3
CodePudding user response:
Another way would be to use np.where. where a group has one unique value, find mean. Where it doesnt keep the numbers. Code below
df['numbers'] =np.where(df.groupby('letters')['boolean'].transform('nunique')==1,df.groupby('letters')['numbers'].transform('mean'), df['numbers'])
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Alternatively, mask where condition does not apply as you compute the mean.
m=df.groupby('letters')['boolean'].transform('nunique')==1
df.loc[m, 'numbers']=df[m].groupby('letters')['numbers'].transform('mean')
CodePudding user response:
Since you are comparing drectly to R, I would prefer to use siuba
rather than pandas
:
from siuba import mutate, if_else, _, select, group_by, ungroup
df1 = df >>\
group_by(_.letters) >> \
mutate( condition = _.boolean.unique().size == 1,
numbers = if_else(_.condition, _.numbers.mean(), _.numbers)
) >>\
ungroup() >> select(-_.condition)
print(df1)
letters boolean numbers
0 a True 2.0
1 a True 2.0
2 a True 2.0
3 b True 1.0
4 b True 2.0
5 b False 3.0
Note that >>
is the pipe. I added \
in order to jump to the next line. Also note that to refer to the variables you use _.variable
EDIT
It seems your R code has an issue, In R, you should rather use condition = all(boolean)
instead of the code you have. That will translate to condition = boolean.all()
in Python