I'm trying to calculate a conditional response probabilities when aggregating my dataset. Take the following toy example:
import pandas as pd
gender = [0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
is_family = [0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]
treatment = [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1]
response = [1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1]
num_rows = [10,10,5,20,0,5,10,30,20,30,10,5,60,10,10,20]
df = pd.DataFrame(data={'gender': gender, 'is_family': is_family, 'treatment': treatment, 'response': response, 'num_rows': num_rows})
gender is_family treatment response num_rows
0 0 0 0 1 10
1 0 0 1 0 10
2 0 0 0 0 5
3 0 0 1 1 20
4 0 1 0 1 0
5 0 1 1 0 5
6 0 1 0 0 10
7 0 1 1 1 30
8 1 0 0 1 20
9 1 0 1 0 30
10 1 0 0 0 10
11 1 0 1 1 5
12 1 1 0 1 60
13 1 1 1 0 10
14 1 1 0 0 10
15 1 1 1 1 20
When grouping and aggregating by gender
, treatment
, and response
I want to (1) sum the number of rows for each group and (2) calculate the probability of response given treatment. The result should look like this
gender treatment response num_rows resp_prob
0 0 0 0 15 0.600000
1 0 0 1 10 0.400000
2 0 1 0 15 0.230769
3 0 1 1 50 0.769231
4 1 0 0 20 0.200000
5 1 0 1 80 0.800000
6 1 1 0 40 0.615385
7 1 1 1 25 0.384615
The first response probability is calculated as follows: 15 (response=0, treatment=0) / 25 (treatment=0) = 0.6. The third response probability is calculated as follows: 15 / 65 = 0.23. Etc.
I can sum up the number of samples for each group with:
df.groupby(by=['gender', 'treatment', 'response'])['num_rows'].sum().reset_index()
but what about the probabilities?
Any ideas?
CodePudding user response:
IIUC, use a double groupby:
(df.groupby(by=['gender', 'treatment', 'response'],
as_index=False)
['num_rows'].sum()
.assign(resp_prob=lambda d: d['num_rows'].div(
d.groupby(['gender', 'treatment'])
['num_rows'].transform('sum'))
)
)
output:
gender treatment response num_rows resp_prob
0 0 0 0 15 0.600000
1 0 0 1 10 0.400000
2 0 1 0 15 0.230769
3 0 1 1 50 0.769231
4 1 0 0 20 0.200000
5 1 0 1 80 0.800000
6 1 1 0 40 0.615385
7 1 1 1 25 0.384615
CodePudding user response:
You can do this:
df["resp_prob"] = df["num_rows"].div(
df.groupby(["gender", "treatment"])["num_rows"].transform("sum")
)
So all you need is total per gender, treatment and then you already know individual totals in the form of num_rows
, so you get probability for each as num_rows/total
output:
gender treatment response num_rows resp_prob
0 0 0 0 15 0.600000
1 0 0 1 10 0.400000
2 0 1 0 15 0.230769
3 0 1 1 50 0.769231
4 1 0 0 20 0.200000
5 1 0 1 80 0.800000
6 1 1 0 40 0.615385
7 1 1 1 25 0.384615