import numpy as np
import pandas as pd
df = pd.DataFrame({
'user' : ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'step_1' : [True, True, True, True, True, True, True],
'step_2' : [True, False, False, True, False, True, True],
'step_3' : [False, False, False, False, False, True, True]
})
print(df)
user step_1 step_2 step_3
0 A True True False
1 A True False False
2 B True False False
3 B True True False
4 B True False False
5 C True True True
6 C True True True
I would like to run the calculation to see what fraction of users get to each step. I have multiple observations of some users, and the order cannot be counted on to simply do a df.drop_duplicates( subset = ['user'] )
.
In this case, the answer should be:
- Step 1 = 1.00 (because A, B, and C all have a True in Step 1)
- Step 2 = 1.00 (A, B, C)
- Step 3 = 0.33 (C)
(I do not need to worry about any edge case in which a user goes from False in one step to True in a subsequent step within the same row.)
CodePudding user response:
In your case you can do
df.groupby('user').any().mean()
Out[11]:
step_1 1.000000
step_2 1.000000
step_3 0.333333
dtype: float64