Here is my data:
df1 = pd.DataFrame()
df1['a1'] = ['ABC','ACC','BCC','ABC','ABC','ACC','BCC']
df1['b1'] = ['ACC','AAC','BAC','ACC','ACC','AAC','BAC']
df1['group'] = ['A1','A2','A1','A3','A2','A1','A1']
df1['names'] = ['n1','n2','n3','n4','n1','n3','n3']
df2 = pd.DataFrame()
df2['a2'] = ['ACC','BCC','ABC']
df2['b2'] = ['AAC','BAC','ACC']
df2['types'] = ['t1','t2','t3']
DF = pd.merge(df1, df2, left_on=['a1','b1'], right_on=['a2','b2'])
>>> DF.sort_values('group')
a1 b1 group names a2 b2 types
0 ABC ACC A1 n1 ABC ACC t3
4 ACC AAC A1 n3 ACC AAC t1
5 BCC BAC A1 n3 BCC BAC t2
6 BCC BAC A1 n3 BCC BAC t2
2 ABC ACC A2 n1 ABC ACC t3
3 ACC AAC A2 n2 ACC AAC t1
1 ABC ACC A3 n4 ABC ACC t3
I want to calculate the probability of each type occurring out of the total number of occurrences (the number of rows of DF) for each name, and then sum them for each group.
For example, for group A1:
for n1:
P_1 = P(t1_n1) + P(t2_n1) + P(t3_n1) = 0 + 0 + 1/7 = 1/7
for n2:
P_2 = P(t1_n2) + P(t2_n2) + P(t3_n2) = 0
for n3:
P_3 = P(t1_n3) + P(t2_n3) + P(t3_n3) = 1/7 + 2/7 + 0 = 3/7
for n4:
P_4 = P(t1_n4) + P(t2_n4) + P(t3_n4) = 0
P_total = P_1 + P_2 + P_3 + P_4 = 4/7
Expected output:
groups P_n1 P_n2 P_n3 P_n4 P_total
0 A1 1/7 0 3/7 0 4/7
1 A2 ....
2 A3 ....
How can I accomplish this without many loops, in an elegant way? Thanks
CodePudding user response:
You can use pd.crosstab with normalize=True:
pd.crosstab(DF['group'],DF['names'],normalize=True)
names n1 n2 n3 n4
group
A1 0.142857 0.000000 0.428571 0.000000
A2 0.142857 0.142857 0.000000 0.000000
A3 0.000000 0.000000 0.000000 0.142857
To also get the total per group:
pd.crosstab(DF['group'],DF['names'],normalize=True)\
.assign(total = lambda x : x.sum(axis=1)).reset_index()
names group n1 n2 n3 n4 total
0 A1 0.142857 0.000000 0.428571 0.000000 0.571429
1 A2 0.142857 0.142857 0.000000 0.000000 0.285714
2 A3 0.000000 0.000000 0.000000 0.142857 0.142857
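Putting it together as a runnable sketch (the P_n1/P_total column names are my addition via add_prefix, chosen to match the headers in your expected output):

```python
import pandas as pd

# rebuild the merged frame from the question
df1 = pd.DataFrame({'a1': ['ABC','ACC','BCC','ABC','ABC','ACC','BCC'],
                    'b1': ['ACC','AAC','BAC','ACC','ACC','AAC','BAC'],
                    'group': ['A1','A2','A1','A3','A2','A1','A1'],
                    'names': ['n1','n2','n3','n4','n1','n3','n3']})
df2 = pd.DataFrame({'a2': ['ACC','BCC','ABC'],
                    'b2': ['AAC','BAC','ACC'],
                    'types': ['t1','t2','t3']})
DF = pd.merge(df1, df2, left_on=['a1','b1'], right_on=['a2','b2'])

# crosstab normalized over all 7 rows, columns renamed to match
# the expected output, plus a per-group total
out = (pd.crosstab(DF['group'], DF['names'], normalize=True)
         .add_prefix('P_')                          # n1 -> P_n1, ...
         .assign(P_total=lambda x: x.sum(axis=1))
         .reset_index())
print(out)  # A1's P_total is 4/7 ≈ 0.571429, as in the worked example
```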
CodePudding user response:
The computation of the P_j's in your example is confusing, but based on the following description
I want to calculate the probability of each type happening on each name and sum them for each group.
it seems that you want
prob_types_by_name = (
DF.groupby('names')['types'] # compute the probabilities of each
.value_counts(normalize=True) # type happening on each name group
.rename('prob_types') # name the resulting Series as 'prob_types'
)
res = (
DF.merge(prob_types_by_name, # merge the result to
on=['names','types'], how='left') # the original DF
.groupby('group')['prob_types'] # sum the 'prob_types'
.sum() # per 'group'
)
Input
>>> DF
a1 b1 group names a2 b2 types
0 ABC ACC A1 n1 ABC ACC t3
4 ACC AAC A1 n3 ACC AAC t1
5 BCC BAC A1 n3 BCC BAC t2
6 BCC BAC A1 n3 BCC BAC t2
2 ABC ACC A2 n1 ABC ACC t3
3 ACC AAC A2 n2 ACC AAC t1
1 ABC ACC A3 n4 ABC ACC t3
Output
>>> res
group
A1 2.666667
A2 2.000000
A3 1.000000
Name: prob_types, dtype: float64
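Note that value_counts(normalize=True) normalizes within each name, so res does not reproduce the 4/7, 2/7, 1/7 totals from the worked example in the question. If what you want is each (group, names, types) combination counted once and divided by the total row count, a sketch (assuming DF is the merged frame from the question):

```python
import pandas as pd

# rebuild the merged frame from the question
df1 = pd.DataFrame({'a1': ['ABC','ACC','BCC','ABC','ABC','ACC','BCC'],
                    'b1': ['ACC','AAC','BAC','ACC','ACC','AAC','BAC'],
                    'group': ['A1','A2','A1','A3','A2','A1','A1'],
                    'names': ['n1','n2','n3','n4','n1','n3','n3']})
df2 = pd.DataFrame({'a2': ['ACC','BCC','ABC'],
                    'b2': ['AAC','BAC','ACC'],
                    'types': ['t1','t2','t3']})
DF = pd.merge(df1, df2, left_on=['a1','b1'], right_on=['a2','b2'])

# count each (group, names, types) combination, divide by the total
# number of rows, then sum the probabilities per group
res2 = (DF.groupby(['group', 'names', 'types'])
          .size()
          .div(len(DF))
          .groupby('group')
          .sum())
print(res2)  # A1 -> 4/7, A2 -> 2/7, A3 -> 1/7
```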