I can't make sense of the output I get when calculating a probability in a group-by use case.
For example, in the data frame below, I want the probability of a2 within each group of a1:
import pandas as pd
df = pd.DataFrame([[1,1,0],[0,1,1],[0,1,1],[1,1,0],[1,1,0],[1,0,0]],
columns=['a1','a2','a3'])
df[["a1","a2"]].groupby('a1').apply(lambda x: x[x>0].count()/len(x))
I get output as:
     a1    a2
a1
0   0.0  1.00
1   1.0  0.75
I expected the probability column to add up to 1, and I cannot understand why the values in column a2 total 1.75.
Second, how do I format Python output as a table in the way Stack Overflow expects?
The following answer uses the mean: https://stackoverflow.com/a/43015011/2740831 However, if I understand correctly, a probability should be based on the count of event occurrences.
CodePudding user response:
Your output shows 0.75, not 1.75. The solution can be simplified by taking the mean of a boolean DataFrame:
df1 = df["a2"].gt(0).groupby(df['a1']).mean().reset_index(name='prob')
print(df1)
   a1  prob
0   0  1.00
1   1  0.75
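The mean of a boolean Series is the fraction of True values, so the mean-based result is still a count of occurrences divided by the group size. A minimal sketch to check this, assuming the same df as in the question (the names by_mean and by_count are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1,1,0],[0,1,1],[0,1,1],[1,1,0],[1,1,0],[1,0,0]],
                  columns=['a1','a2','a3'])

# True where a2 > 0; the mean of booleans is the share of True values.
is_pos = df['a2'].gt(0)

# Per-group mean of the boolean mask.
by_mean = is_pos.groupby(df['a1']).mean()

# Same quantity written explicitly as (count of True) / (group size).
by_count = is_pos.groupby(df['a1']).sum() / df.groupby('a1').size()

print(by_mean)
print(by_count)
```

Both Series should hold the same values, which is why mean can stand in for the count()/len(x) expression from the question.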
df2 = df[["a1","a2"]].gt(0).groupby(df['a1']).mean()
print(df2)
     a1    a2
a1
0   0.0  1.00
1   1.0  0.75
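On the question of why the values do not add up to 1: 1.00 and 0.75 are the probabilities of a2 > 0 given a1 = 0 and given a1 = 1. They are conditioned on different groups, so there is no reason for them to sum to 1. What sums to 1 is the set of probabilities over all a2 outcomes within a single group. One way to see this, offered as a sketch rather than part of the original answer, is pd.crosstab with normalize='index' (the name cond is illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1,1,0],[0,1,1],[0,1,1],[1,1,0],[1,1,0],[1,0,0]],
                  columns=['a1','a2','a3'])

# Rows are a1 values, columns are a2 values.
# normalize='index' divides each row by its total, giving P(a2 | a1).
cond = pd.crosstab(df['a1'], df['a2'], normalize='index')
print(cond)

# Each row is a full conditional distribution, so each row total is 1.
print(cond.sum(axis=1))
```

Here the a2 = 1 column matches the 1.00 and 0.75 from the answer, and the a2 = 0 column supplies the remainder for each group.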